Version: v0.9.0a2

Observability: Prometheus and OpenTelemetry

4 min read · SRE · Operator · v0.9.0aN

What this page covers

The observability surface that is implemented in the Stigmem reference node: a Prometheus /metrics endpoint and an OpenTelemetry SDK for distributed tracing. Grafana dashboards and packaged observability compose recipes remain experimental repo assets until they pass the ADR-008 reintroduction gates.

What is included

| Component | Status | Purpose |
|---|---|---|
| /metrics endpoint | always on | Prometheus text exposition when prometheus-client is installed. |
| OpenTelemetry SDK | opt-in | Distributed traces for assert, recall, subscribe, and federation. |
| features/deploy-grafana | unsupported | Feature record for the experimental Grafana dashboards, Prometheus alerts, and Tempo config. |

Quick start: local metrics

# Start a node, then scrape metrics from the node process.
curl -s http://localhost:8765/metrics | grep '^stigmem_'
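The same check can be scripted. The following is a minimal Python sketch of the curl-and-grep step above, not part of the Stigmem codebase: fetch_metrics and stigmem_lines are hypothetical helper names, and the label values in the sample text are invented for illustration.

```python
import urllib.request


def fetch_metrics(base_url: str = "http://localhost:8765", timeout: float = 5.0) -> str:
    """Fetch the raw text exposition from the node's /metrics endpoint."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def stigmem_lines(body: str) -> list[str]:
    """Keep only lines that start with stigmem_, like grep '^stigmem_'."""
    return [line for line in body.splitlines() if line.startswith("stigmem_")]


# Illustrative exposition text; label values are made up.
sample = """\
# HELP stigmem_fact_write_total Successful fact assertions.
# TYPE stigmem_fact_write_total counter
stigmem_fact_write_total{principal="alice",tenant="default"} 3.0
python_gc_objects_collected_total{generation="0"} 120.0
"""
print(stigmem_lines(sample))
```

Against a running node, stigmem_lines(fetch_metrics()) would return the same lines the curl pipeline prints.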

Prometheus, Grafana, and Tempo deployment topology is operator-owned today. Point your scrape target at /metrics, and import the experimental Grafana dashboards from experimental/deploy-grafana/dashboards/grafana/ manually if they fit your deployment.

Prometheus metrics reference

Install the optional extra to enable Prometheus exposition:

pip install "stigmem-node[observability]"

/metrics is always available. If prometheus-client is not installed, it still returns 200 OK, with an empty comment as the body, so health checks against /metrics do not break.
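Because the endpoint answers 200 OK in both cases, a status check alone cannot tell you whether metrics are actually being exported. A small sketch of one way to distinguish the two cases follows; has_samples is a hypothetical helper, and the comment text used in the example is an assumption, since the exact body returned without prometheus-client is not specified here.

```python
def has_samples(body: str) -> bool:
    """Return True if the exposition text contains at least one sample line.

    Lines starting with '#' are comments (HELP/TYPE or free text) and blank
    lines carry no data, so a body made only of those has no metrics.
    """
    for line in body.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return True
    return False


# Assumed shape of the body when prometheus-client is missing.
print(has_samples("# metrics unavailable\n"))
# A body with one real sample line.
print(has_samples('stigmem_fact_write_total{tenant="default"} 1.0\n'))
```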

Counters

| Metric | Labels | Description |
|---|---|---|
| stigmem_fact_write_total | principal, tenant | Successful fact assertions. |
| stigmem_fact_read_total | principal, tenant | Fact queries and recall requests. |
| stigmem_contradiction_total | tenant | Facts that triggered a contradiction on write. |
| stigmem_audit_event_total | event_type, tenant | Audit events written (Spec-09-Audit-Log). |
| stigmem_quota_breach_total | principal, tenant, dimension | Rate-limit 429 responses. |
| stigmem_federation_ingress_total | peer_id, status | Facts received via federation pull. |
| stigmem_federation_egress_total | peer_id, status | Facts served via federation pull endpoint. |
| stigmem_peer_hlc_anomaly_total | peer_id, direction | Inbound federation HLC skew rejections. |
| stigmem_subscription_event_total | delivery_type, status | Subscription delivery events. |
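Counters are exported once per label combination, so a node-wide figure means summing across label sets. The sketch below shows one way to do that on raw exposition text and derive contradictions per successful write. The helper names are hypothetical, the sample values are invented, and the parser assumes sample lines carry no trailing timestamp.

```python
def counter_total(body: str, name: str) -> float:
    """Sum every sample of the counter `name` across all label sets."""
    total = 0.0
    for line in body.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        # A sample line is `name{labels} value` or `name value`;
        # the value is the last space-separated token (no timestamp assumed).
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total


def contradictions_per_write(body: str) -> float:
    """Ratio of contradiction events to successful writes (0.0 if no writes)."""
    writes = counter_total(body, "stigmem_fact_write_total")
    if writes == 0:
        return 0.0
    return counter_total(body, "stigmem_contradiction_total") / writes


# Illustrative scrape; principals, tenants, and values are made up.
sample = """\
stigmem_fact_write_total{principal="a",tenant="t1"} 8.0
stigmem_fact_write_total{principal="b",tenant="t1"} 2.0
stigmem_contradiction_total{tenant="t1"} 1.0
"""
print(counter_total(sample, "stigmem_fact_write_total"))
print(contradictions_per_write(sample))
```

In a real deployment the same aggregation is usually done in PromQL with sum() and rate(); the Python version is only meant to make the label model concrete.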

Histograms

| Metric | Buckets | Description |
|---|---|---|
| stigmem_request_latency_seconds | 5 ms to 2.5 s | End-to-end HTTP request latency (route, method, status_code). |
| stigmem_recall_ranker_duration_seconds | 10 ms to 2.5 s | Time spent in the hybrid recall ranker (tenant). |
| stigmem_capability_verify_duration_seconds | 1 ms to 100 ms | Capability token verification latency (result). |
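Prometheus histograms expose cumulative bucket counts, and quantiles are estimated from them by linear interpolation inside the bucket that contains the target rank (the approach PromQL's histogram_quantile takes). The following sketch illustrates that calculation. The function name is hypothetical and the bucket bounds and counts are illustrative only; this page gives the bucket ranges but not the individual boundaries.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    The lowest bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Cannot interpolate into +Inf or an empty bucket: use its lower edge.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative cumulative counts for a latency histogram (bounds in seconds).
buckets = [(0.005, 0), (0.01, 2), (0.025, 6), (0.05, 9), (0.1, 10), (math.inf, 10)]
print(estimate_quantile(0.5, buckets))
print(estimate_quantile(0.9, buckets))
```

The estimate is only as fine as the bucket layout, which is why a histogram covering 1 ms to 100 ms suits token verification better than one covering 5 ms to 2.5 s would.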

Gauges

| Metric | Labels | Description |
|---|---|---|
| stigmem_subscription_connections_active | tenant | Active (non-circuit-open) subscriptions. |
| stigmem_replication_lag_seconds | peer_id | Estimated lag to each federation peer. |

OpenTelemetry tracing

Enable tracing by setting two environment variables:

STIGMEM_OTEL_ENABLED=true
STIGMEM_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

Requires stigmem-node[observability]:

pip install "stigmem-node[observability]"

Instrumented operations

| Span name | Type | Attributes |
|---|---|---|
| stigmem.assert_fact | write | stigmem.tenant, stigmem.principal, stigmem.fact_id, stigmem.contradicted. |
| stigmem.recall | read | stigmem.tenant, stigmem.principal, stigmem.scope, stigmem.recall_id, stigmem.total_scored, stigmem.tokens_used, stigmem.truncated. |

The OTLP exporter sends traces to the configured endpoint via HTTP/protobuf. Any OpenTelemetry-compatible backend works, for example Grafana Tempo, Jaeger, Honeycomb, or Datadog.
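Since the exporter speaks OTLP over HTTP, a common misconfiguration is pointing the endpoint at a collector's gRPC port. The sketch below is a hypothetical preflight check, not something the node ships. It relies only on the OpenTelemetry convention that OTLP/HTTP listens on 4318 and OTLP/gRPC on 4317; a collector can be configured differently.

```python
from urllib.parse import urlparse


def check_otlp_http_endpoint(endpoint: str) -> list[str]:
    """Return warnings for an OTLP/HTTP endpoint URL (empty list means none)."""
    warnings = []
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https"):
        warnings.append(f"scheme {parsed.scheme!r} is not http or https")
    if not parsed.hostname:
        warnings.append("no hostname in endpoint")
    if parsed.port == 4317:
        warnings.append(
            "port 4317 is the conventional OTLP/gRPC port; "
            "OTLP/HTTP usually listens on 4318"
        )
    return warnings


print(check_otlp_http_endpoint("http://localhost:4318"))
print(check_otlp_http_endpoint("http://localhost:4317"))
```

You could run it against the value of STIGMEM_OTEL_EXPORTER_OTLP_ENDPOINT before starting the node.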

Alerting rules

Experimental Prometheus alert seeds are available at experimental/deploy-grafana/dashboards/prometheus/alerts.yml. Review and adapt them before production use.

| Alert | Severity | Condition |
|---|---|---|
| StigmemHighContradictionRate | warning | > 0.1 contradictions/s for 5 m. |
| StigmemReplicationLagHigh | critical | Replication lag > 5 min for 5 m. |
| StigmemAuditEventsMissing | critical | Writes but no audit events for 2 m. |
| StigmemQuotaBreachSustained | warning | > 0.05 quota breaches/s for 10 m. |

To load adapted rules into a standalone Prometheus:

# prometheus.yml
rule_files:
- /path/to/stigmem-alerts.yml
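The rate-based conditions listed above fire only when the threshold stays breached for the whole hold period. The sketch below illustrates that idea on raw counter samples. It is a simplification for reasoning about thresholds, not the rule expressions in alerts.yml (which are not reproduced on this page), and breached_for is a hypothetical name. Counter resets are not handled.

```python
def breached_for(
    samples: list[tuple[float, float]], threshold: float, hold_seconds: float
) -> bool:
    """Check whether a counter's per-second rate stayed above `threshold`.

    `samples` are (unix_time, counter_value) pairs sorted by time. Returns True
    if every interval up to the last sample exceeded the threshold for at
    least `hold_seconds` without interruption.
    """
    if len(samples) < 2:
        return False
    held_since = None
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rate = (v1 - v0) / (t1 - t0)
        if rate > threshold:
            if held_since is None:
                held_since = t0  # breach started at the beginning of this interval
        else:
            held_since = None  # any quiet interval restarts the clock
    return held_since is not None and samples[-1][0] - held_since >= hold_seconds


# 12 contradictions per minute (0.2/s) sustained for 5 minutes.
steady = [(0, 0), (60, 12), (120, 24), (180, 36), (240, 48), (300, 60)]
# Same load with one quiet minute in the middle.
interrupted = [(0, 0), (60, 12), (120, 12), (180, 24), (240, 36), (300, 48)]
print(breached_for(steady, threshold=0.1, hold_seconds=300))
print(breached_for(interrupted, threshold=0.1, hold_seconds=300))
```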

Smoke test

After the node is running, run a quick smoke workload to verify the metric set:

# 1. Write a fact
curl -s -X POST http://localhost:8765/v1/facts \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"entity":"stigmem://test/thing/1","relation":"test:label","value":{"type":"text","v":"hello"},"source":"stigmem://test/agent/1","scope":"local"}' \
| jq .id

# 2. Scrape metrics and verify counters are non-zero
curl -s http://localhost:8765/metrics | grep stigmem_fact_write_total

# 3. Run a recall to populate the ranker histogram
curl -s -X POST http://localhost:8765/v1/recall \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"query":"hello","scope":"local","token_budget":1000}' \
| jq .total_scored

# 4. Scrape again and verify the ranker histogram recorded the recall
curl -s http://localhost:8765/metrics | grep stigmem_recall_ranker_duration_seconds_count

Importing dashboards manually

  1. Open Grafana at your instance URL.
  2. Go to Dashboards → Import.
  3. Upload experimental/deploy-grafana/dashboards/grafana/stigmem-overview.json.
  4. Select your Prometheus datasource when prompted.
  5. Repeat for stigmem-federation.json.