Monitoring & Debugging
What this page covers
Health checks, log interpretation, recall-latency diagnosis, and operational metrics for Stigmem node operators.
Audience: operators and SREs running Stigmem in production.
Health check
The /healthz endpoint is the canonical liveness probe:
curl -s https://your-node.example.com/healthz | jq .
{
  "status": "ok",
  "backend": "libsql",
  "sync_lag_ms": 12,
  "federation_enabled": true,
  "version": "1.0.0"
}
Key response fields:
status
sync_lag_ms
federation_enabled
Kubernetes / PaaS: point your liveness and readiness probes at /healthz.
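For scripts that need more than the HTTP status code, the payload can be evaluated directly. The following is a minimal Python sketch and not part of Stigmem: the function names and the 5000 ms lag threshold are illustrative assumptions, while the field names come from the sample response above.

```python
import json
import urllib.request


def evaluate_health(payload: dict, max_sync_lag_ms: int = 5000) -> list[str]:
    """Return the problems found in a /healthz payload (empty list = healthy).

    max_sync_lag_ms is an assumed example threshold, not a Stigmem default.
    """
    problems = []
    if payload.get("status") != "ok":
        problems.append(f"status is {payload.get('status')!r}, expected 'ok'")
    lag = payload.get("sync_lag_ms")
    if isinstance(lag, (int, float)) and lag > max_sync_lag_ms:
        problems.append(f"sync_lag_ms {lag} exceeds {max_sync_lag_ms}")
    return problems


def fetch_health(base_url: str, timeout_s: float = 5.0) -> dict:
    """Fetch and decode the /healthz response from a node."""
    with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout_s) as resp:
        return json.load(resp)


if __name__ == "__main__":
    sample = {"status": "ok", "backend": "libsql", "sync_lag_ms": 12,
              "federation_enabled": True, "version": "1.0.0"}
    print(evaluate_health(sample))  # prints []
```

Against a live node, `evaluate_health(fetch_health("https://your-node.example.com"))` returns the same kind of list.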
Logs
The node emits structured JSON logs to stdout. Use STIGMEM_LOG_LEVEL to control verbosity:
STIGMEM_LOG_LEVEL=debug # maximum detail; avoid in production at scale
STIGMEM_LOG_LEVEL=info # default
STIGMEM_LOG_LEVEL=warning # only warnings and errors
Key log events:
federation.pull.ok
federation.pull.signature_mismatch
federation.pull.peer_unreachable
storage.migration.applied
recall.embedding.provider_error
recall.latency_ms > 1000
snapshot.create.ok
snapshot.verify.fail
# Docker Compose
docker compose logs -f node
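Because the logs are JSON lines, they can be filtered by event name in a pipeline. The Python sketch below assumes each line is a JSON object that carries the event name under an `event` key. This page does not confirm that key, so adjust it to match your node's actual log schema. The watched event names are taken from the list above.

```python
import json
from typing import Iterable, Iterator

# Failure-type events from the "Key log events" list.
WATCHED = (
    "federation.pull.signature_mismatch",
    "federation.pull.peer_unreachable",
    "recall.embedding.provider_error",
    "snapshot.verify.fail",
)


def filter_events(lines: Iterable[str], watched=WATCHED,
                  key: str = "event") -> Iterator[dict]:
    """Yield parsed log records whose event name is in `watched`.

    `key` is an assumed field name; lines that are not JSON objects are skipped.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get(key) in watched:
            yield record


if __name__ == "__main__":
    # In-memory demo; pass sys.stdin instead to filter a live log stream.
    demo = ['{"event": "federation.pull.ok"}',
            '{"event": "snapshot.verify.fail"}']
    for rec in filter_events(demo):
        print(json.dumps(rec))
```

To filter a live stream, iterate over `sys.stdin` and pipe the output of `docker compose logs -f --no-log-prefix node` into the script.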
Metrics
curl -s https://your-node.example.com/metrics | grep stigmem_
Key metrics (see Observability for the full reference):
stigmem_fact_write_total
stigmem_fact_read_total
stigmem_request_latency_seconds
stigmem_recall_ranker_duration_seconds
stigmem_federation_ingress_total
stigmem_federation_egress_total
stigmem_replication_lag_seconds
stigmem_peer_hlc_anomaly_total
stigmem_audit_event_total
Experimental Grafana dashboard JSON is available under
experimental/deploy-grafana/dashboards/grafana/.
Grafana support is deferred per
ADR-002;
the dashboards remain available for self-import but are unsupported until they
pass the ADR-008 reintroduction gates. The canonical feature record is
features/deploy-grafana.
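For quick checks without a Prometheus server, the /metrics output can be read directly. The following Python sketch is a simplified parser for the Prometheus text format, offered as an illustration only: it sums sample values per metric name across label sets, which is meaningful for counters such as stigmem_fact_write_total but not for histogram buckets. The function name is hypothetical.

```python
import re

# Matches "name{labels} value"; labels are optional, trailing timestamps ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text: str, prefix: str = "stigmem_") -> dict[str, float]:
    """Sum sample values per metric name for metrics starting with `prefix`.

    Simplifications: comment lines (# HELP / # TYPE) are skipped and label
    values containing '}' are not supported.
    """
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        try:
            value = float(match.group(3))
        except ValueError:
            continue
        totals[match.group(1)] = totals.get(match.group(1), 0.0) + value
    return totals
```

Feeding it the body returned by the curl command above gives a dictionary keyed by metric name, which is convenient for diffing two scrapes taken a few seconds apart.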
Recall latency
Recall (semantic search) combines vector lookup with graph traversal. High latency usually has one of three causes.
1. Embedding provider slow or unavailable
# Check if the embedding provider is reachable
curl -s http://localhost:11434/api/tags # Ollama default
If using Ollama, ensure the model is pulled and running:
ollama serve &                 # start the server first; pull needs a running server
ollama pull nomic-embed-text
If a remote embedding provider is down, switch to the offline default (STIGMEM_EMBED_PROVIDER=ollama) for the duration of the outage to remove the dependency on an external service.
2. Vector index not built yet
After a large batch of fact imports, the vector index may not be fully built. Check:
curl -s https://your-node.example.com/v1/recall/status | jq .
If index_coverage is below 1.0, the index is still building. Recall continues to work but may miss recently added facts until indexing catches up.
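When an import job should block until recall is complete, the status endpoint can be polled. The Python sketch below is an illustration under stated assumptions: the helper names, the poll interval, and the retry limit are hypothetical, and the only field it relies on is index_coverage as described above.

```python
import json
import time
import urllib.request
from typing import Callable


def wait_for_index(fetch_status: Callable[[], dict], poll_s: float = 5.0,
                   max_polls: int = 60) -> bool:
    """Poll until index_coverage reaches 1.0; return False if it never does."""
    for attempt in range(max_polls):
        coverage = float(fetch_status().get("index_coverage", 0.0))
        if coverage >= 1.0:
            return True
        if attempt < max_polls - 1:
            time.sleep(poll_s)  # back off before the next poll
    return False


def http_status(base_url: str) -> dict:
    """Fetch /v1/recall/status from a node."""
    with urllib.request.urlopen(f"{base_url}/v1/recall/status", timeout=5) as resp:
        return json.load(resp)
```

A typical call would be `wait_for_index(lambda: http_status("https://your-node.example.com"))`. Passing the fetcher as a callable keeps the polling logic testable without a live node.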
3. Database query slow (high fact volume)
# Check storage backend latency from metrics
curl -s https://your-node.example.com/metrics \
| grep stigmem_storage_query_latency_seconds
# Enable query explain for debugging (SQLite and libSQL only; do not use in production)
STIGMEM_STORAGE_EXPLAIN_QUERIES=true stigmem serve
For Postgres at scale, ensure the pgvector index exists:
-- Check vector index
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'vec_facts';
-- Create IVFFlat index if missing (adjust lists based on fact count)
CREATE INDEX ON vec_facts USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
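To pick a starting value for lists, the pgvector README suggests rows / 1000 for tables of up to 1M rows and sqrt(rows) above that. The small Python helper below encodes that heuristic; the function name is hypothetical, and the result is a starting point to tune against your own recall and latency measurements rather than a Stigmem recommendation.

```python
import math


def suggested_ivfflat_lists(row_count: int) -> int:
    """Starting value for IVFFlat `lists` per the pgvector README heuristic:
    rows / 1000 up to 1M rows, sqrt(rows) above that. Never below 1."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, round(math.sqrt(row_count)))


if __name__ == "__main__":
    for rows in (50_000, 100_000, 4_000_000):
        print(rows, suggested_ivfflat_lists(rows))
```

By this heuristic, the lists = 100 value in the statement above corresponds to roughly 100,000 rows in vec_facts.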
Federation debugging
# Check peer status and last pull time
curl -s https://your-node.example.com/v1/federation/peers | jq '.peers[] | {id, last_pull, pull_lag_ms, error}'
# Check federation audit log
curl -s "https://your-node.example.com/v1/federation/audit?limit=20" | jq .
# Force an immediate pull cycle for a specific peer (skips the interval)
curl -X POST https://your-node.example.com/v1/federation/peers/<peer-id>/pull \
-H "Authorization: Bearer $STIGMEM_ADMIN_KEY"
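The peer listing can also be triaged programmatically. The Python sketch below flags peers that report an error or whose pull lag exceeds a limit. It assumes the response shape implied by the jq filter above (a `peers` array with `id`, `pull_lag_ms`, and `error`); the function name and the 60-second threshold are illustrative assumptions.

```python
def unhealthy_peers(payload: dict, max_pull_lag_ms: int = 60_000) -> list[dict]:
    """Return peers that report an error or whose pull lag exceeds the limit.

    max_pull_lag_ms is an assumed example threshold; choose one relative to
    your configured pull interval.
    """
    flagged = []
    for peer in payload.get("peers", []):
        lag = peer.get("pull_lag_ms") or 0  # treat missing/null lag as 0
        if peer.get("error") or lag > max_pull_lag_ms:
            flagged.append({"id": peer.get("id"),
                            "pull_lag_ms": lag,
                            "error": peer.get("error")})
    return flagged
```

Feed it the decoded JSON from /v1/federation/peers; an empty list means no peer crossed the threshold or reported an error.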
Common issues:
pull_lag_ms growing: related setting STIGMEM_FEDERATION_PULL_INTERVAL_S.
signature_mismatch
peer_unreachable: check STIGMEM_NODE_URL and the peer's firewall.
Related setting: STIGMEM_FEDERATION_ALLOW_TEAM, if team-scoped facts are expected.
Alerts
Recommended alert rules (Prometheus / Alertmanager notation):
groups:
  - name: stigmem
    rules:
      - alert: StigmemDown
        expr: up{job="stigmem"} == 0
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Stigmem node is down"
      # histogram_quantile needs per-second bucket rates aggregated by le
      - alert: StigmemHighRecallLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(stigmem_recall_ranker_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "99th-percentile recall ranker latency > 5s"
      - alert: StigmemFederationIngressErrors
        expr: sum by (peer_id) (increase(stigmem_federation_ingress_total{status!="accepted"}[5m])) > 3
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Federation ingress errors spiking"
      - alert: StigmemPeerHlcAnomaly
        expr: sum by (peer_id) (increase(stigmem_peer_hlc_anomaly_total[1h])) > 5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Peer HLC anomaly events above threshold"
      - alert: StigmemLibSQLHighSyncLag
        expr: stigmem_libsql_sync_lag_ms > 10000
        for: 2m
        labels: { severity: warning }
        annotations:
          summary: "libSQL sync lag > 10s (replica falling behind primary)"