Skip to main content
Version: v0.9.0a2
Integrator

N-node Relay Backpressure

3 min readNode operator · Multi-hop topologySpec-05-Federation-Trust

What this page is

How relay nodes signal replication lag to downstream peers in multi-hop federation chains. Lets D distinguish "C is up to date" from "C is 8 minutes behind A" — even when C is serving D at full rate.

Modular spec

N-node backpressure patterns are covered by Spec-05-Federation-Trust relay-backpressure guidance. These behaviors are SHOULD (recommended), not MUST — conformant nodes that omit them will interoperate but downstream nodes may receive stale data without warning.

The problem · silent stale data in relay chains

In a two-node federation (A ← B), replication lag is directly observable: B can see its own pull latency against A.

In an N-node chain (A ← B ← C ← D), node C can fall behind B's replication while continuing to serve D at full rate — meaning D receives facts that are stale by both B's lag and C's additional lag, with no signal that anything is wrong.

Backpressure signal · X-Stigmem-Replication-Lag

When a node's inbound replication lag exceeds STIGMEM_FEDERATION_RELAY_LAG_WARNING_MS (default 60 seconds), it includes a warning header on pull responses:

X-Stigmem-Replication-Lag: 87450

The value is the maximum lag in milliseconds across all inbound peers. Downstream nodes receiving this header SHOULD:

Log and alert

Log the lag and alert an operator if it persists.

Treat synthesized facts as stale

Treat synthesized facts from this node as potentially stale.

Optionally back off pull interval

To reduce load on the lagging node.

Hard throttle · HTTP 503

When lag exceeds STIGMEM_FEDERATION_RELAY_LAG_HARD_MS (default 5 minutes), the node returns HTTP 503 on pull requests:

HTTP/1.1 503 Service Unavailable
Retry-After: 120

{
"error": "relay_lag_exceeded",
"lag_ms": 342000,
"retry_after_s": 120
}

Downstream nodes MUST respect the retry_after_s value and not retry before that window.

Discovery via /.well-known/stigmem

Nodes expose their current relay lag in the well-known endpoint so operators can observe it without triggering a pull:

curl $STIGMEM_URL/.well-known/stigmem | jq .replication_lag_ms
# 87450

The replication_lag_ms field is:

Maximum lag

Across all inbound peers.

Omitted if leaf node

No inbound federation, or lag is within warning bounds.

Environment variables

Variable
Default
Description
STIGMEM_FEDERATION_RELAY_LAG_WARNING_MS
60000
Lag threshold for X-Stigmem-Replication-Lag warning header.
STIGMEM_FEDERATION_RELAY_LAG_HARD_MS
300000
Lag threshold for HTTP 503 throttle.
STIGMEM_FEDERATION_RELAY_ENABLED
true
Set false on leaf nodes to disable relay behavior.

Set STIGMEM_FEDERATION_RELAY_ENABLED=false on leaf nodes (nodes with no downstream peers) to suppress relay headers and avoid unnecessary lag computation.

Example · 4-node topology

A (origin) ← B (relay) ← C (relay) ← D (leaf)
Node
Role
Configuration
A
origin / source of truth
No inbound peers; sets RELAY_ENABLED=false.
B
relay
Pulls from A; exposes replication_lag_ms in well-known; serves C.
C
relay
Pulls from B; propagates lag signal if B falls behind; serves D.
D
leaf
Pulls from C; reads X-Stigmem-Replication-Lag on responses from C; RELAY_ENABLED=false.

D's agent should check X-Stigmem-Replication-Lag on every pull response from C and surface it if lag is significant:

# Check for lag header on pull response
response=$(curl -si $NODE_C_URL/v1/federation/pull \
-H "Authorization: Bearer $NODE_C_PEER_TOKEN")

lag=$(echo "$response" | grep -i 'x-stigmem-replication-lag' | awk '{print $2}' | tr -d '\r')
if [ -n "$lag" ] && [ "$lag" -gt 30000 ]; then
echo "Warning: C is ${lag}ms behind A. Facts may be stale."
fi

Operator checklist

  1. Set STIGMEM_FEDERATION_RELAY_ENABLED=false on all leaf nodes.
  2. Configure RELAY_LAG_WARNING_MS and RELAY_LAG_HARD_MS for your topology's acceptable freshness window.
  3. Monitor /.well-known/stigmem#replication_lag_ms on relay nodes.
  4. Alert if any relay node exceeds warning threshold for more than 5 minutes.
  5. Downstream agents: check X-Stigmem-Replication-Lag at context-injection time and log when stale.

See also