Observability: Prometheus and OpenTelemetry
What this page covers
The observability surface that is implemented in the Stigmem
reference node: a Prometheus /metrics endpoint and an OpenTelemetry
SDK for distributed tracing. Grafana dashboards and packaged
observability compose recipes remain experimental repo assets until
they pass the ADR-008 reintroduction gates.
What is included
/metrics endpoint
prometheus-client is installed.
features/deploy-grafana
Quick start: local metrics
# Start a node, then scrape metrics from the node process.
curl -s http://localhost:8765/metrics | grep '^stigmem_'
Prometheus, Grafana, and Tempo deployment topology is operator-owned today.
Point your scrape target at /metrics, and import the experimental Grafana
dashboards from experimental/deploy-grafana/dashboards/grafana/ manually if
they fit your deployment.
Prometheus metrics reference
Install the optional extra to enable Prometheus exposition:
pip install "stigmem-node[observability]"
/metrics is always available.
It returns 200 OK with an empty comment if prometheus-client is
not installed, so healthchecks on /metrics will not break.
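Because the fallback response is also 200 OK, a status check alone cannot tell whether Prometheus exposition is actually enabled. The sketch below inspects the response body instead. It assumes the fallback body contains only blank lines and bare `#` comment lines, as described above, and that an enabled endpoint emits at least one `# HELP` or `# TYPE` line or one sample line. The function names are hypothetical and not part of the Stigmem API.

```python
from urllib.request import urlopen


def fetch_metrics(url: str = "http://localhost:8765/metrics") -> str:
    """Fetch the raw /metrics body as text (the URL matches the quick start)."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def exposition_enabled(body: str) -> bool:
    """Return True if the body looks like real Prometheus exposition.

    Assumption: the fallback body holds only blank lines and bare comments,
    so any metadata line or sample line means prometheus-client is active.
    """
    for raw in body.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Metadata lines written by a Prometheus client library.
        if line.startswith("# HELP ") or line.startswith("# TYPE "):
            return True
        # Any non-comment line is a sample.
        if not line.startswith("#"):
            return True
    return False
```

Calling `exposition_enabled(fetch_metrics())` against a running node should return False when only the fallback comment is served, which usually means the optional extra is not installed.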
Counters
- stigmem_fact_write_total
- stigmem_fact_read_total
- stigmem_contradiction_total
- stigmem_audit_event_total (Spec-09-Audit-Log).
- stigmem_quota_breach_total
- stigmem_federation_ingress_total
- stigmem_federation_egress_total
- stigmem_subscription_event_total
Histograms
- stigmem_request_latency_seconds
- stigmem_recall_ranker_duration_seconds
- stigmem_capability_verify_duration_seconds
Gauges
- stigmem_subscription_connections_active
- stigmem_replication_lag_seconds
OpenTelemetry tracing
Enable tracing by setting two environment variables:
STIGMEM_OTEL_ENABLED=true
STIGMEM_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
Tracing requires the stigmem-node[observability] extra:
pip install "stigmem-node[observability]"
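A misspelled variable or an endpoint without a scheme is easy to miss, so a small preflight check before starting the node can help. The sketch below validates the two variables shown above. It assumes the node expects the literal value `true` (the only value this page shows) and an `http` or `https` endpoint URL. The function `check_otel_env` is hypothetical and not part of Stigmem.

```python
import os
from urllib.parse import urlparse


def check_otel_env(env=None):
    """Return a list of problems with the tracing environment; empty means OK."""
    env = os.environ if env is None else env
    problems = []

    # Assumption: the node enables tracing only for the literal value "true".
    if env.get("STIGMEM_OTEL_ENABLED", "") != "true":
        problems.append("STIGMEM_OTEL_ENABLED is not set to 'true'")

    endpoint = env.get("STIGMEM_OTEL_EXPORTER_OTLP_ENDPOINT", "")
    parsed = urlparse(endpoint)
    # The endpoint must be an absolute URL such as http://localhost:4318.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        problems.append(
            "STIGMEM_OTEL_EXPORTER_OTLP_ENDPOINT is not an http(s) URL: %r" % endpoint
        )
    return problems
```

For example, `check_otel_env()` reads the current process environment, and printing any returned problems before launching the node gives an early warning. Port 4318 is the conventional OTLP/HTTP port, so an endpoint on a different port is worth checking against the collector's configuration.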
Instrumented operations
- stigmem.assert_fact: stigmem.tenant, stigmem.principal, stigmem.fact_id, stigmem.contradicted.
- stigmem.recall: stigmem.tenant, stigmem.principal, stigmem.scope, stigmem.recall_id, stigmem.total_scored, stigmem.tokens_used, stigmem.truncated.
The OTLP exporter sends traces to the configured endpoint via HTTP/protobuf. Any OpenTelemetry-compatible backend works, for example Grafana Tempo, Jaeger, Honeycomb, or Datadog.
Alerting rules
Experimental Prometheus alert seeds are available at experimental/deploy-grafana/dashboards/prometheus/alerts.yml. Review and adapt them before production use.
- StigmemHighContradictionRate
- StigmemReplicationLagHigh
- StigmemAuditEventsMissing
- StigmemQuotaBreachSustained
To load adapted rules into a standalone Prometheus:
# prometheus.yml
rule_files:
- /path/to/stigmem-alerts.yml
Smoke test
After the node is running, run a quick smoke workload to verify the metric set:
# 1. Write a fact
curl -s -X POST http://localhost:8765/v1/facts \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"entity":"stigmem://test/thing/1","relation":"test:label","value":{"type":"text","v":"hello"},"source":"stigmem://test/agent/1","scope":"local"}' \
| jq .id
# 2. Scrape metrics and verify counters are non-zero
curl -s http://localhost:8765/metrics | grep stigmem_fact_write_total
# 3. Run a recall to populate the ranker histogram
curl -s -X POST http://localhost:8765/v1/recall \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"query":"hello","scope":"local","token_budget":1000}' \
| jq .total_scored
curl -s http://localhost:8765/metrics | grep stigmem_recall_ranker_duration_seconds_count
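The `grep` checks above rely on reading the output by eye. For an automated smoke test, the same verification can be sketched as a small parser that sums every sample of a metric across its label sets. This is a minimal reading of the Prometheus text exposition format, not a full parser, and `metric_total` is a hypothetical helper. Which labels the Stigmem metrics carry is not documented on this page, so the helper makes no assumption about them.

```python
def metric_total(body: str, name: str) -> float:
    """Sum all samples of `name` across label sets in Prometheus text exposition.

    Returns 0.0 if the metric does not appear at all.
    """
    total = 0.0
    for raw in body.splitlines():
        line = raw.strip()
        # Skip blank lines and # HELP / # TYPE / comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample is `name value` or `name{labels} value`, optionally
        # followed by a timestamp. Matching on the delimiter keeps
        # `name_created` and similar suffixed series out of the sum.
        if line.startswith(name + " "):
            rest = line[len(name):]
        elif line.startswith(name + "{"):
            rest = line[line.rindex("}") + 1:]
        else:
            continue
        total += float(rest.split()[0])
    return total
```

After running the two requests above, a check such as `metric_total(body, "stigmem_fact_write_total") > 0` and `metric_total(body, "stigmem_recall_ranker_duration_seconds_count") > 0`, with `body` holding the text fetched from /metrics, mirrors steps 2 and 3.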
Importing dashboards manually
- Open Grafana at your instance URL.
- Go to Dashboards → Import.
- Upload experimental/deploy-grafana/dashboards/grafana/stigmem-overview.json.
- Select your Prometheus datasource when prompted.
- Repeat for stigmem-federation.json.