4-Node Federation Topology
What this page is
A full-mesh 4-node Docker Compose topology for soak testing, failure-mode verification, and topology experimentation. Validated in a 72-hour soak run with partition injection, network delay, and node restart scenarios.
Overview
Four nodes (node-a through node-d) form a full-mesh pull graph:
every node pulls from every other, giving maximum replication
coverage.
[Diagram: full mesh of node-a (8765), node-b (8766), node-c (8767), and node-d (8768); each node is linked to the other three.]
Starting the 4-node cluster
cd stigmem/infra
docker compose -f docker-compose.soak.yml up --build -d
Wait for all four nodes to report healthy:
docker compose -f docker-compose.soak.yml ps
All four services should show healthy:
NAME STATUS PORTS
soak-node-a-1 healthy 0.0.0.0:8765->8765/tcp
soak-node-b-1 healthy 0.0.0.0:8766->8765/tcp
soak-node-c-1 healthy 0.0.0.0:8767->8765/tcp
soak-node-d-1 healthy 0.0.0.0:8768->8765/tcp
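The health check above can be scripted when the cluster is started from CI. The following is a minimal sketch, not part of the stigmem repo: it parses `docker compose ps --format json`, which depending on the Compose version prints either a JSON array or one JSON object per line. The `Service` and `Health` field names follow Compose v2 output and should be checked against your installed version; the function names and timeouts are illustrative assumptions.

```python
import json
import subprocess
import time

COMPOSE_FILE = "docker-compose.soak.yml"
EXPECTED = {"node-a", "node-b", "node-c", "node-d"}


def parse_ps(output):
    """Map service name -> health status from `docker compose ps --format json`."""
    text = output.strip()
    if not text:
        return {}
    if text.startswith("["):
        rows = json.loads(text)  # some Compose versions emit one JSON array
    else:
        # others emit one JSON object per line
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    return {row.get("Service", ""): row.get("Health", "") for row in rows}


def all_healthy(output, expected=EXPECTED):
    """True only when every expected service reports 'healthy'."""
    health = parse_ps(output)
    return all(health.get(name) == "healthy" for name in expected)


def wait_until_healthy(timeout_s=120.0, poll_s=2.0):
    """Poll `docker compose ps` until all nodes are healthy or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["docker", "compose", "-f", COMPOSE_FILE, "ps", "--format", "json"],
            capture_output=True, text=True, check=True,
        )
        if all_healthy(result.stdout):
            return True
        time.sleep(poll_s)
    return False
```

Run `wait_until_healthy()` from stigmem/infra after `up --build -d`; it returns False rather than raising if the nodes never become healthy within the timeout.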
Wiring the peer mesh
The infra/soak/setup_peers.py script registers all 12 directed peer
links (each of the 4 nodes registers with each of the 3 others) and
creates federate API keys.
docker compose -f docker-compose.soak.yml exec node-a python /soak/setup_peers.py
The script:
- Generates Ed25519 keypairs if not already present (via infra/soak/keys.py).
- For each directed pair, POSTs a signed PeerDeclaration to the remote node's /v1/federation/peers.
- Verifies that each registration returns "status": "active".
After setup, each node shows 3 peers:
curl -s http://localhost:8765/v1/federation/peers | jq '[.peers[] | {node_id, status}]'
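To see where the count of 12 directed links comes from, the mesh can be enumerated as every ordered pair of distinct nodes. This is an illustrative sketch only; it does not reproduce setup_peers.py, and the helper names are assumptions.

```python
from itertools import permutations

# Host ports taken from the compose port mappings shown earlier on this page.
NODES = {"node-a": 8765, "node-b": 8766, "node-c": 8767, "node-d": 8768}


def directed_links(nodes):
    """Every ordered (registrant, remote) pair of distinct nodes."""
    return list(permutations(sorted(nodes), 2))


def peers_of(node, links):
    """Remote nodes that `node` registers with."""
    return sorted(remote for src, remote in links if src == node)


links = directed_links(NODES)
print(len(links))                 # 12
print(peers_of("node-a", links))  # ['node-b', 'node-c', 'node-d']
```

With n nodes a full mesh has n × (n − 1) directed links, so 4 nodes give 12 and each node ends up with 3 peers.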
Seed workload
To validate replication under realistic load, use the included seed script:
docker compose -f docker-compose.soak.yml exec node-a python /soak/seed.py
The seed script continuously emits:
- Probe facts: public, no expiry; 1 per node per 30s for replication latency measurement.
- Steady-state churn: rotating entity states with mixed TTLs.
- Deliberate contradictions: paired assertions from two nodes every 60s.
- Local-scope facts: scope=local; verify these never cross node boundaries.
- Conflict storms: 50-fact burst every 10 minutes.
Failure injection
The run_soak.sh script orchestrates a 72-hour run with scheduled
failure injection.
Run it:
bash infra/soak/run_soak.sh
Results are written to infra/soak/metrics/:
- replication_latency.csv: p50/p90/p99 per probe fact.
- conflict_counts.csv: contradiction detection and convergence.
- resources.csv: CPU/memory per node.
- local_isolation.csv: invariant violation detector (target: 0).
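For reference, p50/p90/p99 figures like those in replication_latency.csv can be computed from raw latency samples as sketched below. The column layout of the CSV is not documented on this page, so the function takes a plain list of latencies in seconds; `latency_percentiles` is a hypothetical helper, not part of run_soak.sh.

```python
from statistics import quantiles


def latency_percentiles(samples_s):
    """Return (p50, p90, p99) for a list of latency samples in seconds."""
    if len(samples_s) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(samples_s, n=100)
    return cuts[49], cuts[89], cuts[98]
```

Percentiles from small sample counts are noisy, so a p99 is only meaningful once a probe has accumulated a few hundred observations.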
Failure modes · observed behaviors
FM-1 · Node partition (network isolation)
A partitioned node backs off its pull loop exponentially (1s → 2s → … → 300s max). Facts asserted during partition accumulate locally.
On reconnect, the pull loop resumes from the last committed HLC cursor — no facts are skipped.
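The backoff schedule described above can be sketched as a capped exponential sequence. The doubling factor is inferred from the 1s → 2s progression; whether the real pull loop adds jitter is not stated here, so none is modeled.

```python
from itertools import islice


def backoff_schedule(base_s=1.0, cap_s=300.0, factor=2.0):
    """Yield successive retry delays: base, base*factor, ..., capped at cap_s."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)


first_ten = list(islice(backoff_schedule(), 10))
```

Under these assumptions `first_ten` is 1, 2, 4, 8, 16, 32, 64, 128, 256, 300 seconds, so a partitioned node reaches the 300s ceiling on its tenth consecutive failed pull.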
Cross-partition contradictions are detected at ingest and stored as
first-class ConflictRecords. No data is lost; no
silent overwrites.
Recovery time ≈ O(facts_accumulated / pull_batch_size × pull_interval_s): the number of catch-up pull cycles multiplied by the interval between cycles.
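A back-of-the-envelope version of this estimate, assuming each pull cycle transfers at most pull_batch_size facts and cycles run pull_interval_s apart. The numbers in the example call are illustrative, not measured values from the soak run.

```python
import math


def estimate_recovery_s(facts_accumulated, pull_batch_size, pull_interval_s):
    """Approximate catch-up time: number of pull cycles times the cycle interval."""
    if pull_batch_size <= 0:
        raise ValueError("pull_batch_size must be positive")
    cycles = math.ceil(facts_accumulated / pull_batch_size)
    return cycles * pull_interval_s


# Hypothetical inputs: 10,000 backlog facts, 500 facts per pull, 5s between pulls.
example_s = estimate_recovery_s(10_000, 500, 5.0)
```

With these inputs the backlog drains in 20 cycles, or about 100 seconds. The estimate ignores any remaining backoff delay before the first successful pull.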
FM-2 · Slow peer (high RTT)
At 500ms RTT, pulls succeed but at reduced throughput. At RTT > 30s, the pull request times out and is retried on the next cycle with the cursor unchanged. No facts are lost.
For multi-hop topologies, the lagging node will emit
X-Stigmem-Replication-Lag headers on its pull responses once lag
exceeds 60s — see the relay backpressure guide.
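The "cursor unchanged on timeout" behavior can be sketched as follows. `fetch` is a hypothetical callable standing in for the HTTP pull; it is assumed to return `(facts, new_cursor)` or raise `TimeoutError`. The 30s default mirrors the timeout threshold mentioned above.

```python
def pull_once(cursor, fetch, timeout_s=30.0):
    """Run one pull cycle and return (cursor_for_next_cycle, facts).

    The cursor only advances when the fetch succeeds, so a timed-out
    pull is retried from exactly the same position.
    """
    try:
        facts, new_cursor = fetch(cursor, timeout_s)
    except TimeoutError:
        return cursor, []  # nothing committed; retry from the same cursor
    return new_cursor, facts
```

Because the cursor is the only replication state that moves, a slow peer costs throughput but cannot cause a gap in the fact stream.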
FM-3 · Node restart
Verified in TestCursorResume::test_node_restart_resumes_without_gaps.
- Cursor read on startup: the node reads replication_cursors from SQLite on startup (persisted via WAL).
- Pull loop resumes: from the last committed cursor per peer.
- Restart-to-healthy: < 5s.
- Idempotent ingest: re-delivered facts (same fact ID) are silently discarded.
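Idempotent ingest of this kind is commonly implemented by making the fact ID a primary key and ignoring duplicate inserts. The sketch below shows that pattern with Python's sqlite3 module; the `facts` table and its columns are hypothetical and do not reflect the actual stigmem schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store with a hypothetical facts table keyed by fact ID."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS facts (fact_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return db


def ingest(db, fact_id, body):
    """Insert a fact; return True if it was new, False if it was a re-delivery."""
    cur = db.execute(
        "INSERT OR IGNORE INTO facts (fact_id, body) VALUES (?, ?)",
        (fact_id, body),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows affected means the ID already existed
```

With this shape, a restart that replays the last partially applied batch is harmless: repeated IDs fall through without touching existing rows.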
If the DB is lost, see the
cursor-reset recovery guide
for the stigmem federation cursor-export / cursor-import runbook.
FM-4 · Contradiction storm
Under burst write conditions, each ingested contradiction generates
two system facts (stigmem:conflict:between,
stigmem:conflict:status). These use the stigmem: prefix and are
not re-replicated. At 50 contradictions/s, the HLC counter
increments rapidly but remains monotonically correct per
Spec-12-HLC-Bounded-Skew.
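To illustrate why the counter stays monotonic under bursts, here is a minimal sketch of the generic hybrid logical clock update rules. It is not the Spec-12 implementation: the class and method names are assumptions, and the bounded-skew check that the spec name implies is not modeled.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Timestamp:
    wall_ms: int
    counter: int


class HybridLogicalClock:
    def __init__(self, now_ms=None):
        # now_ms is injectable so the clock can be tested deterministically.
        self._now_ms = now_ms or (lambda: int(time.time() * 1000))
        self._last = Timestamp(0, 0)

    def tick(self):
        """Timestamp a local event."""
        wall = max(self._last.wall_ms, self._now_ms())
        counter = self._last.counter + 1 if wall == self._last.wall_ms else 0
        self._last = Timestamp(wall, counter)
        return self._last

    def observe(self, remote):
        """Merge a timestamp received from a peer, then timestamp the receipt."""
        wall = max(self._last.wall_ms, remote.wall_ms, self._now_ms())
        if wall == self._last.wall_ms and wall == remote.wall_ms:
            counter = max(self._last.counter, remote.counter) + 1
        elif wall == self._last.wall_ms:
            counter = self._last.counter + 1
        elif wall == remote.wall_ms:
            counter = remote.counter + 1
        else:
            counter = 0
        self._last = Timestamp(wall, counter)
        return self._last
```

When many events land in the same millisecond, the wall component stays fixed and the counter increments, so every timestamp still sorts strictly after the previous one.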
Current limitation: the conflicts table has no TTL or eviction policy, so sustained storms will grow it without bound. A conflict archival policy is planned for a future spec version.
FM-5 · Malformed or expired peer token
The pull endpoint returns HTTP 401 for expired, invalid-signature,
or replayed-nonce tokens. An event_type="rejected_token" or
"replay_attempt" entry is written to the federation audit log. The
caller retains its cursor and retries next cycle.
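The rejection paths above can be sketched as a small checker. This is a hypothetical model, not the stigmem token format: signature verification is reduced to a boolean input, and the mapping of expired or badly signed tokens to "rejected_token" and of reused nonces to "replay_attempt" is an assumption based on the event names.

```python
from dataclasses import dataclass, field


@dataclass
class TokenChecker:
    seen_nonces: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

    def check(self, nonce, expires_at, signature_ok, now):
        """Return an HTTP status: 200 if the token is accepted, 401 otherwise."""
        if not signature_ok or now >= expires_at:
            self.audit_log.append({"event_type": "rejected_token", "nonce": nonce})
            return 401
        if nonce in self.seen_nonces:
            self.audit_log.append({"event_type": "replay_attempt", "nonce": nonce})
            return 401
        self.seen_nonces.add(nonce)  # only valid tokens consume a nonce
        return 200
```

A real deployment would also need to bound the nonce set, for example by discarding nonces once their tokens have expired.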
FM-6 · Scope boundary violation
Peers can only pull facts for scopes declared in their
PeerDeclaration, regardless of token claims. local-scope facts
never leave origin (Spec-05-Federation-Trust scope enforcement,
verified: TestScopeIsolation). Violations are rejected with HTTP
403 and logged as event_type="scope_violation".
See the scope propagation guide for
company-scoped re-federation restrictions.
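The scope rule in this section can be sketched as a set check against the peer's declared scopes. The function name, the scope names in the usage note, and the audit entry shape are illustrative assumptions. Token claims are deliberately not an input, reflecting that they cannot widen what the PeerDeclaration allows.

```python
def authorize_pull(requested_scopes, declared_scopes, audit_log):
    """Return (http_status, allowed_scopes) for a pull request."""
    requested = set(requested_scopes)
    # local-scope facts never leave the origin node, even if declared.
    permitted = set(declared_scopes) - {"local"}
    if not requested <= permitted:
        audit_log.append({
            "event_type": "scope_violation",
            "scopes": sorted(requested - permitted),
        })
        return 403, set()
    return 200, requested
```

For example, a peer that declared only "public" and then requests "public" and "company" gets a 403 and one scope_violation entry naming "company".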
Teardown
docker compose -f docker-compose.soak.yml down -v
The -v flag removes data volumes so the next run starts from a
clean state.