Version: v0.9.0a2

Monitoring & Debugging

5 min read · SRE · Operator · Day-two operations

What this page covers

Health checks, log interpretation, recall-latency diagnosis, and operational metrics for Stigmem node operators.

Audience: operators and SREs running Stigmem in production.

Health check

The /healthz endpoint is the canonical liveness probe:

curl -s https://your-node.example.com/healthz | jq .

{
  "status": "ok",
  "backend": "libsql",
  "sync_lag_ms": 12,
  "federation_enabled": true,
  "version": "1.0.0"
}

| Field | Watch for | Notes |
| --- | --- | --- |
| status | must be "ok" | Any other value means the node is unhealthy. |
| sync_lag_ms | < 5000 ms | libSQL only. High values suggest a network issue between the replica and the Turso primary. |
| federation_enabled | true if enabled | Should match your configuration. |

Kubernetes / PaaS: point your liveness and readiness probes at /healthz.
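
Where no orchestrator probe is available, the same checks can be scripted. The following is a minimal sketch, not part of Stigmem: the function name `evaluate_health` and the `expect_federation` parameter are hypothetical, and the 5000 ms limit is the threshold from the field table above.

```python
import json

SYNC_LAG_LIMIT_MS = 5000  # threshold taken from the field table above


def evaluate_health(payload, expect_federation):
    """Return a list of problems found in a /healthz payload (empty means healthy)."""
    problems = []
    if payload.get("status") != "ok":
        problems.append(f"status is {payload.get('status')!r}, expected 'ok'")
    # sync_lag_ms is reported for libSQL only, so skip the check when it is absent.
    lag = payload.get("sync_lag_ms")
    if lag is not None and lag >= SYNC_LAG_LIMIT_MS:
        problems.append(f"sync_lag_ms is {lag}, expected < {SYNC_LAG_LIMIT_MS}")
    if payload.get("federation_enabled") != expect_federation:
        problems.append("federation_enabled does not match configuration")
    return problems


# Example using the response body shown earlier in this section.
sample = json.loads(
    '{"status": "ok", "backend": "libsql", "sync_lag_ms": 12,'
    ' "federation_enabled": true, "version": "1.0.0"}'
)
print(evaluate_health(sample, expect_federation=True))
```

An empty list means every check passed; a wrapper script could exit non-zero when the list is non-empty.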

Logs

The node emits structured JSON logs to stdout. Use STIGMEM_LOG_LEVEL to control verbosity:

STIGMEM_LOG_LEVEL=debug   # maximum detail; avoid in production at scale
STIGMEM_LOG_LEVEL=info    # default
STIGMEM_LOG_LEVEL=warning # only warnings and errors

Key log events:

| Message fragment | Class | Meaning |
| --- | --- | --- |
| federation.pull.ok | success | Successful pull from a peer. |
| federation.pull.signature_mismatch | incident | Peer's key changed; update the pin. |
| federation.pull.peer_unreachable | network | Network or peer-side issue. |
| storage.migration.applied | startup | Schema migration ran on startup. |
| recall.embedding.provider_error | degraded | Embedding provider unavailable. |
| recall.latency_ms > 1000 | performance | Recall taking > 1 s; see Recall latency below. |
| snapshot.create.ok | backup | Backup snapshot created successfully. |
| snapshot.verify.fail | critical | Snapshot verification failed; do not restore. |

# Docker Compose
docker compose logs -f node
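
To pull only the actionable events out of a log stream, the fragments from the table above can be matched as plain substrings. This is a rough sketch rather than a Stigmem tool: it makes no assumption about the JSON field names in each log line, and the `ATTENTION` set and function names are hypothetical choices.

```python
# Message fragments and classes copied from the table above.
# The "recall.latency_ms > 1000" row is a numeric condition rather than a
# fragment, so it is left out of this sketch.
EVENT_CLASSES = {
    "federation.pull.ok": "success",
    "federation.pull.signature_mismatch": "incident",
    "federation.pull.peer_unreachable": "network",
    "storage.migration.applied": "startup",
    "recall.embedding.provider_error": "degraded",
    "snapshot.create.ok": "backup",
    "snapshot.verify.fail": "critical",
}

# Classes worth surfacing to an on-call operator (a hypothetical choice).
ATTENTION = {"incident", "degraded", "critical"}


def classify(line):
    """Return the class of the first known fragment found in a log line, else None."""
    for fragment, event_class in EVENT_CLASSES.items():
        if fragment in line:
            return event_class
    return None


def needs_attention(lines):
    """Yield (event_class, line) for log lines whose class is in ATTENTION."""
    for line in lines:
        event_class = classify(line)
        if event_class in ATTENTION:
            yield event_class, line.rstrip("\n")
```

A small script that iterates over `needs_attention(sys.stdin)` and prints each pair can then sit at the end of the `docker compose logs -f node` pipe.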

Metrics

curl -s https://your-node.example.com/metrics | grep stigmem_

Key metrics (see Observability for the full reference):

`stigmem_fact_write_total`

`stigmem_fact_read_total`

`stigmem_request_latency_seconds`

`stigmem_recall_ranker_duration_seconds`

`stigmem_federation_ingress_total`

`stigmem_federation_egress_total`

`stigmem_replication_lag_seconds`

`stigmem_peer_hlc_anomaly_total`

`stigmem_audit_event_total`

Experimental Grafana dashboard JSON is available under experimental/deploy-grafana/dashboards/grafana/. Grafana support is deferred per ADR-002; the dashboards remain available for self-import but are unsupported until they pass the ADR-008 reintroduction gates. The canonical feature record is features/deploy-grafana.
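
After an upgrade or configuration change, it can help to confirm that every key metric listed above still appears in the scrape. The sketch below assumes `/metrics` returns the Prometheus text exposition format (the alert rules later on this page suggest it does). The function names are hypothetical. Names are matched by prefix so that histogram series ending in `_bucket`, `_sum`, or `_count` still count as present.

```python
import re

# Key metrics copied from the list above.
KEY_METRICS = [
    "stigmem_fact_write_total",
    "stigmem_fact_read_total",
    "stigmem_request_latency_seconds",
    "stigmem_recall_ranker_duration_seconds",
    "stigmem_federation_ingress_total",
    "stigmem_federation_egress_total",
    "stigmem_replication_lag_seconds",
    "stigmem_peer_hlc_anomaly_total",
    "stigmem_audit_event_total",
]


def sample_names(exposition):
    """Collect the metric names of all samples in a text-format scrape."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The name ends at the first '{' (labels) or whitespace (value).
        names.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return names


def missing_metrics(exposition):
    """Return the key metrics that have no sample in the scrape."""
    names = sample_names(exposition)
    return [m for m in KEY_METRICS if not any(n.startswith(m) for n in names)]
```

Feeding it the body of the `curl` command above yields an empty list when all key metrics are exported.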

Recall latency

Recall (semantic search) combines vector lookup with graph traversal. High latency usually has one of three causes.

1. Embedding provider slow or unavailable

# Check if the embedding provider is reachable
curl -s http://localhost:11434/api/tags   # Ollama default

If using Ollama, ensure the server is running and the model is pulled:

# Start the server first; ollama pull needs a running server
ollama serve &
ollama pull nomic-embed-text

If a hosted embedding provider is down, switch to the offline default (STIGMEM_EMBED_PROVIDER=ollama) for the duration of the outage to remove the external dependency.

2. Vector index not built yet

After a large batch of fact imports, the vector index may not be fully built. Check:

curl -s https://your-node.example.com/v1/recall/status | jq .

If index_coverage is below 1.0, the index is still building. Recall continues to work but may miss recently added facts until indexing catches up.
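
A bulk-import job can wait for indexing to finish before it starts recall checks. The following is a minimal polling sketch under stated assumptions: only the `index_coverage` field comes from this page, while the function names, timeout, and interval are hypothetical.

```python
import json
import time
import urllib.request


def http_status(url):
    """Fetch and decode the JSON body of the recall status endpoint."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_index(fetch_status, timeout_s=600.0, interval_s=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until index_coverage reaches 1.0; return False if the timeout passes first."""
    deadline = clock() + timeout_s
    while True:
        coverage = fetch_status()["index_coverage"]
        if coverage >= 1.0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage (not executed here):
# wait_for_index(lambda: http_status("https://your-node.example.com/v1/recall/status"))
```

Passing the fetch function in as a parameter keeps the loop testable without a live node.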

3. Database query slow (high fact volume)

# Check storage backend latency from metrics
curl -s https://your-node.example.com/metrics \
  | grep stigmem_storage_query_latency_seconds

# Enable query explain for debugging (SQLite and libSQL only — do not use in production)
STIGMEM_STORAGE_EXPLAIN_QUERIES=true stigmem serve

For Postgres at scale, ensure the pgvector index exists:

-- Check vector index
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'vec_facts';

-- Create IVFFlat index if missing (adjust lists based on fact count)
CREATE INDEX ON vec_facts USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
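
For choosing `lists`, the pgvector README suggests starting at rows / 1000 for up to 1M rows and sqrt(rows) above that. The helper below is only a sketch of that starting-point rule; the function name is hypothetical, and the result should be tuned against your own recall and latency measurements.

```python
import math


def suggest_ivfflat_lists(row_count):
    """Starting value for IVFFlat `lists`, following the pgvector README guidance."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)  # rows / 1000, never below 1
    return math.isqrt(row_count)          # integer square root for larger tables


for rows in (100_000, 1_000_000, 4_000_000):
    print(rows, suggest_ivfflat_lists(rows))
```

By this rule, the `lists = 100` value in the statement above corresponds to a table of roughly 100,000 facts.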

Federation debugging

# Check peer status and last pull time
curl -s https://your-node.example.com/v1/federation/peers | jq '.peers[] | {id, last_pull, pull_lag_ms, error}'

# Check federation audit log
curl -s "https://your-node.example.com/v1/federation/audit?limit=20" | jq .

# Force an immediate pull cycle for a specific peer (skips the interval)
curl -X POST https://your-node.example.com/v1/federation/peers/<peer-id>/pull \
  -H "Authorization: Bearer $STIGMEM_ADMIN_KEY"

Common issues:

| Symptom | Cause | Fix |
| --- | --- | --- |
| pull_lag_ms growing | network / overload | Increase STIGMEM_FEDERATION_PULL_INTERVAL_S. |
| signature_mismatch | key rotated | Update peer pin. |
| peer_unreachable | DNS / firewall | Check STIGMEM_NODE_URL and peer's firewall. |
| No facts despite healthy pull | scope filter | Check STIGMEM_FEDERATION_ALLOW_TEAM if team-scoped facts are expected. |
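
The symptom table can be applied mechanically to the peer list. This sketch assumes each entry under `peers` carries the `id`, `pull_lag_ms`, and `error` fields used in the `jq` filter above, and that the `error` string contains the symptom name. The second assumption is unverified, and the function name is hypothetical.

```python
# Fixes copied from the symptom table above, keyed by the symptom text
# assumed to appear in the peer's error string.
ERROR_FIXES = {
    "signature_mismatch": "Update peer pin.",
    "peer_unreachable": "Check STIGMEM_NODE_URL and peer's firewall.",
}


def triage_peers(peers, previous_lag_ms=None):
    """Map peer id to suggested fixes; peers with nothing to report are omitted.

    previous_lag_ms maps peer id to the pull_lag_ms seen on an earlier poll,
    which is how "pull_lag_ms growing" is detected.
    """
    previous_lag_ms = previous_lag_ms or {}
    findings = {}
    for peer in peers:
        hints = []
        error = peer.get("error") or ""
        for symptom, fix in ERROR_FIXES.items():
            if symptom in error:
                hints.append(fix)
        before = previous_lag_ms.get(peer["id"])
        lag = peer.get("pull_lag_ms")
        if before is not None and lag is not None and lag > before:
            hints.append("Increase STIGMEM_FEDERATION_PULL_INTERVAL_S.")
        if hints:
            findings[peer["id"]] = hints
    return findings
```

Comparing two consecutive polls is the simplest way to spot growth; a longer history would give a steadier signal.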

Alerts

Recommended alert rules (Prometheus / Alertmanager notation):

groups:
  - name: stigmem
    rules:
      - alert: StigmemDown
        expr: up{job="stigmem"} == 0
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Stigmem node is down"

      - alert: StigmemHighRecallLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(stigmem_recall_ranker_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "99th-percentile recall ranker latency > 5s"

      - alert: StigmemFederationIngressErrors
        expr: sum by (peer_id) (increase(stigmem_federation_ingress_total{status!="accepted"}[5m])) > 3
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Federation ingress errors spiking"

      - alert: StigmemPeerHlcAnomaly
        expr: sum by (peer_id) (increase(stigmem_peer_hlc_anomaly_total[1h])) > 5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Peer HLC anomaly events above threshold"

      - alert: StigmemLibSQLHighSyncLag
        expr: stigmem_libsql_sync_lag_ms > 10000
        for: 2m
        labels: { severity: warning }
        annotations:
          summary: "libSQL sync lag > 10s — replica falling behind primary"
