Version: v0.9.0a2
Monitoring & Debugging

5 min read · SRE · Operator · Day-two operations

What this page covers

Health checks, log interpretation, recall-latency diagnosis, and operational metrics for Stigmem node operators.

Audience: operators and SREs running Stigmem in production.

Health check

The /healthz endpoint is the canonical liveness probe:

curl -s https://your-node.example.com/healthz | jq .
{
  "status": "ok",
  "backend": "libsql",
  "sync_lag_ms": 12,
  "federation_enabled": true,
  "version": "1.0.0"
}

| Field | Watch for | Notes |
| --- | --- | --- |
| status | must be "ok" | Any other value means the node is unhealthy. |
| sync_lag_ms | < 5000 ms | libSQL only. High values suggest a network issue between the replica and the Turso primary. |
| federation_enabled | true if enabled | Should match your configuration. |

Kubernetes / PaaS: point your liveness and readiness probes at /healthz.
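
The checks in the table above can be scripted for use outside a probe, for example in a smoke test. The following is a minimal sketch, not part of Stigmem: `evaluate_health` and its `expect_federation` parameter are hypothetical names, and the payload shape is assumed to match the sample response shown earlier.

```python
import json

SYNC_LAG_LIMIT_MS = 5000  # threshold taken from the health-check table


def evaluate_health(payload: dict, expect_federation: bool) -> list:
    """Return a list of problems found in a /healthz response body."""
    problems = []
    if payload.get("status") != "ok":
        problems.append(f"status is {payload.get('status')!r}, expected 'ok'")
    # sync_lag_ms is only meaningful for the libSQL backend.
    lag = payload.get("sync_lag_ms")
    if payload.get("backend") == "libsql" and lag is not None and lag >= SYNC_LAG_LIMIT_MS:
        problems.append(f"sync_lag_ms is {lag}, expected < {SYNC_LAG_LIMIT_MS}")
    if payload.get("federation_enabled") != expect_federation:
        problems.append("federation_enabled does not match configuration")
    return problems


# The sample body from this page; a real check would fetch /healthz instead.
sample = json.loads(
    '{"status": "ok", "backend": "libsql", "sync_lag_ms": 12,'
    ' "federation_enabled": true, "version": "1.0.0"}'
)
print(evaluate_health(sample, expect_federation=True))  # prints []
```

An empty list means every check passed; any entry is a human-readable reason to investigate.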

Logs

The node emits structured JSON logs to stdout. Use STIGMEM_LOG_LEVEL to control verbosity:

STIGMEM_LOG_LEVEL=debug # maximum detail; avoid in production at scale
STIGMEM_LOG_LEVEL=info # default
STIGMEM_LOG_LEVEL=warning # only warnings and errors

Key log events:

| Message fragment | Class | Meaning |
| --- | --- | --- |
| federation.pull.ok | success | Successful pull from a peer. |
| federation.pull.signature_mismatch | incident | Peer's key changed; update the pin. |
| federation.pull.peer_unreachable | network | Network or peer-side issue. |
| storage.migration.applied | startup | Schema migration ran on startup. |
| recall.embedding.provider_error | degraded | Embedding provider unavailable. |
| recall.latency_ms > 1000 | performance | Recall taking > 1 s; see Recall latency below. |
| snapshot.create.ok | backup | Backup snapshot created successfully. |
| snapshot.verify.fail | critical | Snapshot verification failed; do not restore from it. |

# Docker Compose
docker compose logs -f node
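
Because the logs are structured JSON, the message fragments listed above can be matched programmatically. The sketch below is illustrative only: the JSON key that holds the message (`message` here) is an assumption, since this page does not document the log schema, and `classify_line` is a hypothetical helper. The `recall.latency_ms > 1000` row is omitted because it describes a threshold rather than a literal fragment.

```python
import json
from typing import Optional

# Fragments and classes copied from the key log events listed on this page.
EVENT_CLASSES = {
    "federation.pull.ok": "success",
    "federation.pull.signature_mismatch": "incident",
    "federation.pull.peer_unreachable": "network",
    "storage.migration.applied": "startup",
    "recall.embedding.provider_error": "degraded",
    "snapshot.create.ok": "backup",
    "snapshot.verify.fail": "critical",
}


def classify_line(line: str, message_key: str = "message") -> Optional[str]:
    """Return the class of a JSON log line, or None if nothing matches."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # not a structured log line
    if not isinstance(record, dict):
        return None
    # message_key is an assumed field name; adjust it to the real schema.
    message = str(record.get(message_key, ""))
    for fragment, event_class in EVENT_CLASSES.items():
        if fragment in message:
            return event_class
    return None


print(classify_line('{"message": "snapshot.verify.fail id=42"}'))  # prints critical
```

Feeding each line of the log stream through `classify_line` makes it straightforward to count or page on `incident` and `critical` events.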

Metrics

curl -s https://your-node.example.com/metrics | grep stigmem_

Key metrics (see Observability for the full reference):

stigmem_fact_write_total

stigmem_fact_read_total

stigmem_request_latency_seconds

stigmem_recall_ranker_duration_seconds

stigmem_federation_ingress_total

stigmem_federation_egress_total

stigmem_replication_lag_seconds

stigmem_peer_hlc_anomaly_total

stigmem_audit_event_total
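
For ad hoc checks without a Prometheus server, the /metrics output can be parsed directly. This is a rough sketch that assumes the endpoint serves the Prometheus text exposition format without timestamps; `parse_stigmem_samples` is a hypothetical helper and the sample values are made up for illustration.

```python
def parse_stigmem_samples(text: str) -> dict:
    """Map each stigmem_ series (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if not line.startswith("stigmem_"):
            continue  # ignore metrics from other namespaces
        # Assumes no trailing timestamp, so the value is the last token.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative exposition text; the numbers are invented.
SAMPLE = """\
# TYPE stigmem_fact_write_total counter
stigmem_fact_write_total 1027
stigmem_replication_lag_seconds 0.25
process_cpu_seconds_total 12.5
"""

print(parse_stigmem_samples(SAMPLE))
```

For anything beyond a quick look, a Prometheus scrape with recording rules is the more robust route.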

Experimental Grafana dashboard JSON is available under experimental/deploy-grafana/dashboards/grafana/. Grafana support is deferred per ADR-002; the dashboards remain available for self-import but are unsupported until they pass the ADR-008 reintroduction gates. The canonical feature record is features/deploy-grafana.

Recall latency

Recall (semantic search) combines vector lookup with graph traversal. High latency usually has one of three causes.

1. Embedding provider slow or unavailable

# Check if the embedding provider is reachable
curl -s http://localhost:11434/api/tags # Ollama default

If using Ollama, ensure the model is pulled and running:

ollama pull nomic-embed-text
ollama serve &

If an external embedding provider is down, switch to the offline default (STIGMEM_EMBED_PROVIDER=ollama) for the duration of the outage to remove that dependency.

2. Vector index not built yet

After a large batch of fact imports, the vector index may not be fully built. Check:

curl -s https://your-node.example.com/v1/recall/status | jq .

If index_coverage is below 1.0, the index is still building. Recall continues to work but may miss recently added facts until indexing catches up.
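
After a bulk import it can be convenient to wait for indexing to finish before running recall benchmarks. The following sketch polls until `index_coverage` reaches 1.0. `wait_for_index` is a hypothetical helper, the timeout and poll interval are arbitrary defaults, and `fetch_status` is any callable that returns the parsed body of /v1/recall/status.

```python
import time
from typing import Callable


def wait_for_index(
    fetch_status: Callable[[], dict],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until index_coverage reaches 1.0; return False on timeout."""
    waited = 0.0
    while True:
        # A missing field is treated as "not indexed yet".
        coverage = float(fetch_status().get("index_coverage", 0.0))
        if coverage >= 1.0:
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


# Demo with canned responses instead of a live node.
canned = iter([{"index_coverage": 0.4}, {"index_coverage": 0.9}, {"index_coverage": 1.0}])
print(wait_for_index(lambda: next(canned), sleep=lambda _s: None))  # prints True
```

Injecting `fetch_status` and `sleep` keeps the loop testable; in practice `fetch_status` would wrap an HTTP GET against the status endpoint.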

3. Database query slow (high fact volume)

# Check storage backend latency from metrics
curl -s https://your-node.example.com/metrics \
| grep stigmem_storage_query_latency_seconds

# Enable query explain for debugging (SQLite and libSQL only; do not use in production)
STIGMEM_STORAGE_EXPLAIN_QUERIES=true stigmem serve

For Postgres at scale, make sure the pgvector index exists:

-- Check vector index
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'vec_facts';

-- Create IVFFlat index if missing (adjust lists based on fact count)
CREATE INDEX ON vec_facts USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
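
The "adjust lists based on fact count" note can be made concrete. The pgvector README suggests starting with rows / 1000 for up to 1M rows and sqrt(rows) above that. The sketch below encodes that starting point; `suggest_ivfflat_lists` is a hypothetical helper, and the result is a first guess to tune against real recall and latency measurements, not a Stigmem recommendation.

```python
import math


def suggest_ivfflat_lists(row_count: int) -> int:
    """Starting value for `lists`, following the pgvector README guidance."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, round(math.sqrt(row_count)))


for rows in (100_000, 1_000_000, 4_000_000):
    print(rows, suggest_ivfflat_lists(rows))
# prints:
# 100000 100
# 1000000 1000
# 4000000 2000
```

By this rule of thumb, the `lists = 100` value in the statement above corresponds to roughly 100,000 facts.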

Federation debugging

# Check peer status and last pull time
curl -s https://your-node.example.com/v1/federation/peers | jq '.peers[] | {id, last_pull, pull_lag_ms, error}'

# Check federation audit log
curl -s "https://your-node.example.com/v1/federation/audit?limit=20" | jq .

# Force an immediate pull cycle for a specific peer (skips the interval)
curl -X POST https://your-node.example.com/v1/federation/peers/<peer-id>/pull \
-H "Authorization: Bearer $STIGMEM_ADMIN_KEY"

Common issues:

| Symptom | Cause | Fix |
| --- | --- | --- |
| pull_lag_ms growing | network / overload | Increase STIGMEM_FEDERATION_PULL_INTERVAL_S. |
| signature_mismatch | key rotated | Update peer pin. |
| peer_unreachable | DNS / firewall | Check STIGMEM_NODE_URL and peer's firewall. |
| No facts despite healthy pull | scope filter | Check STIGMEM_FEDERATION_ALLOW_TEAM if team-scoped facts are expected. |
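
The common issues above can be turned into a first-pass triage over the peers listing. This is a sketch under stated assumptions: the entry fields (`id`, `pull_lag_ms`, `error`) are taken from the jq filter earlier in this section, the idea that `error` contains the strings `signature_mismatch` or `peer_unreachable` is a guess, and both `triage_peer` and its `lag_limit_ms` default are hypothetical.

```python
def triage_peer(peer: dict, lag_limit_ms: int = 60_000) -> str:
    """Suggest a next step for one entry of the federation peers listing."""
    # Assumes the error field, when set, contains the log-style error name.
    error = str(peer.get("error") or "")
    if "signature_mismatch" in error:
        return "key rotated: update the peer pin"
    if "peer_unreachable" in error:
        return "check STIGMEM_NODE_URL and the peer's firewall"
    lag = peer.get("pull_lag_ms") or 0
    if lag > lag_limit_ms:  # lag_limit_ms is an arbitrary illustrative threshold
        return "pull lag high: consider raising STIGMEM_FEDERATION_PULL_INTERVAL_S"
    return "no action suggested"


# Invented sample entries for illustration.
peers = [
    {"id": "peer-a", "pull_lag_ms": 120, "error": None},
    {"id": "peer-b", "pull_lag_ms": 90_000, "error": None},
    {"id": "peer-c", "pull_lag_ms": 0, "error": "signature_mismatch"},
]
for peer in peers:
    print(peer["id"], "->", triage_peer(peer))
```

The scope-filter case is not covered here because it cannot be detected from the peers listing alone.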

Alerts

Recommended alert rules (Prometheus / Alertmanager notation):

groups:
  - name: stigmem
    rules:
      - alert: StigmemDown
        expr: up{job="stigmem"} == 0
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Stigmem node is down"

      # histogram_quantile needs per-bucket rates, not raw cumulative counters
      - alert: StigmemHighRecallLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(stigmem_recall_ranker_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "99th-percentile recall ranker latency > 5s"

      - alert: StigmemFederationIngressErrors
        expr: sum by (peer_id) (increase(stigmem_federation_ingress_total{status!="accepted"}[5m])) > 3
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Federation ingress errors spiking"

      - alert: StigmemPeerHlcAnomaly
        expr: sum by (peer_id) (increase(stigmem_peer_hlc_anomaly_total[1h])) > 5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Peer HLC anomaly events above threshold"

      - alert: StigmemLibSQLHighSyncLag
        expr: stigmem_libsql_sync_lag_ms > 10000
        for: 2m
        labels: { severity: warning }
        annotations:
          summary: "libSQL sync lag > 10s; replica falling behind primary"