Version: v0.9.0a2

Cursor-Reset-on-DB-Loss Recovery


What this runbook covers

When a node loses its replication_cursors table (through DB corruption, an accidental DROP TABLE, or a bare-DB restore from backup), the next federation pull resets every peer's cursor to NULL. The pull loop then re-fetches every fact from every peer from scratch.

Audience: Node operators managing Stigmem federation in production. Spec reference: Spec-05-Federation-Trust (federation pull loop, cursor semantics). Track: F2 (pre-GA hardening).

Problem

Cursor reset is safe (ingestion is idempotent on fact ID) but expensive.

re-pull cost ≈ total_facts_per_peer × (pull_interval / page_size)
= e.g. 500 000 facts × (10 s / 100) = 50 000 s ≈ 13.9 hours per peer
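
The estimate above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the pull loop fetches exactly one page per pull interval; the function name `estimate_repull_seconds` is hypothetical and is not part of the stigmem CLI.

```python
def estimate_repull_seconds(total_facts: int, pull_interval_s: float, page_size: int) -> float:
    """Rough re-pull time for one peer, assuming one page is fetched per interval."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    pages = -(-total_facts // page_size)  # ceiling division: a partial page still costs a pull
    return pages * pull_interval_s


seconds = estimate_repull_seconds(500_000, 10, 100)
print(f"{seconds:.0f} s = {seconds / 3600:.1f} h")  # 50000 s = 13.9 h
```

Multiply the result by the number of peers only if peers are pulled one after another; if they are pulled concurrently, the per-peer figure is the closer estimate.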

On a busy multi-peer mesh the compounding effect can delay convergence for many hours and produce unusually high I/O and network load. The cursor-export / cursor-import commands bound this cost.

Cursor-Checkpoint Workflow

Export (before any DB operation)

Run this before planned DB maintenance (backup restore, migration, schema change):

# Timestamped checkpoint alongside the DB backup
stigmem federation cursor-export \
--out /var/lib/stigmem/cursors-$(date +%Y%m%dT%H%M%S).json

# Or pipe to stdout
stigmem federation cursor-export | tee /backup/stigmem-cursors-latest.json

Checkpoint format:

{
"checkpoint_timestamp": "2026-05-02T18:30:00Z",
"db_path": "/var/lib/stigmem/stigmem.db",
"cursors": [
{
"peer_id": "a1b2c3d4-...",
"peer_node_id": "stigmem://node-b",
"peer_url": "http://node-b:8765",
"peer_status": "active",
"direction": "inbound",
"cursor": "1725349500000.005",
"updated_at": "2026-05-02T18:25:00Z"
}
]
}
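
Before relying on a checkpoint file during an incident, it can help to confirm that it parses and has the expected shape. The sketch below checks only for the field names shown in the example above. Treating every one of them as required is an assumption, and `validate_checkpoint` is a hypothetical helper, not a stigmem command.

```python
import json

# Field names copied from the example checkpoint; treating all of them as
# required is an assumption of this sketch.
TOP_LEVEL_FIELDS = {"checkpoint_timestamp", "db_path", "cursors"}
CURSOR_FIELDS = {
    "peer_id", "peer_node_id", "peer_url", "peer_status",
    "direction", "cursor", "updated_at",
}


def validate_checkpoint(text: str) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top level is not a JSON object"]

    problems = [f"missing top-level field: {key}"
                for key in sorted(TOP_LEVEL_FIELDS - set(doc))]
    cursors = doc.get("cursors", [])
    if not isinstance(cursors, list):
        return problems + ["'cursors' is not a list"]
    for i, entry in enumerate(cursors):
        if not isinstance(entry, dict):
            problems.append(f"cursors[{i}] is not an object")
            continue
        for key in sorted(CURSOR_FIELDS - set(entry)):
            problems.append(f"cursors[{i}] missing field: {key}")
    return problems
```

Calling `validate_checkpoint(open(path).read())` on a freshly exported file and alerting on a non-empty result is one way to catch a truncated or empty export early.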

Recommended cron setup (schedule cursor-export alongside your SQLite backup):

# every 15 minutes, write a cursor checkpoint next to the DB backup
# (keep the entry on one line: crontab does not support backslash continuation)
*/15 * * * * stigmem federation cursor-export --out /backup/stigmem-cursors-latest.json 2>>/var/log/stigmem/cursor-export.log

This bounds the re-pull window to at most 15 minutes of incremental facts after any DB loss event.
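
The 15-minute bound only holds while the cron job keeps succeeding. A small freshness check on `checkpoint_timestamp` can flag a silently failing export. The following is a sketch: the helper names and the 15-minute default are illustrative choices, not stigmem features.

```python
from datetime import datetime, timedelta, timezone


def checkpoint_age(checkpoint_timestamp: str, now: datetime) -> timedelta:
    """Age of a checkpoint, given its ISO 8601 UTC timestamp (e.g. '...Z')."""
    # datetime.fromisoformat only accepts a trailing 'Z' from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    taken_at = datetime.fromisoformat(checkpoint_timestamp.replace("Z", "+00:00"))
    return now - taken_at


def is_stale(checkpoint_timestamp: str, now: datetime,
             max_age: timedelta = timedelta(minutes=15)) -> bool:
    """True if the checkpoint is older than the expected export cadence."""
    return checkpoint_age(checkpoint_timestamp, now) > max_age


now = datetime(2026, 5, 2, 18, 50, tzinfo=timezone.utc)
print(is_stale("2026-05-02T18:30:00Z", now))  # True: the checkpoint is 20 minutes old
```

In practice you would pass `datetime.now(timezone.utc)` as `now` and probably allow some slack above the cron interval before alerting.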

Recovery procedure (after DB loss)

Each step is idempotent, so you can re-run it safely.

Step 1: Stop the node

systemctl stop stigmem-node # or: kill $(cat /run/stigmem.pid)

The pull loop must not be running while you restore the DB or import cursors.

Step 2: Restore the DB (or start fresh)

# If you have a DB backup:
cp /backup/stigmem-YYYYMMDD.db /var/lib/stigmem/stigmem.db

# If the DB is wholly lost, apply migrations before importing:
stigmem migrate normalize-entities --dry-run

Step 3: Re-register peers (if peer table is lost)

The FK constraint on replication_cursors requires each peer row to exist. If the peer table is gone (fresh DB or pre-migration 002), re-register peers first:

stigmem federation register-peer \
--remote-url http://node-b:8765 \
--local-url http://this-node:8765 \
--scopes company,public
# repeat for each peer

Peers absent from the peers table are skipped with a warning during import. Re-register them, then re-run cursor-import.

Step 4: Import the checkpoint

stigmem federation cursor-import /backup/stigmem-cursors-latest.json

Expected output:

cursor import complete: 3 restored, 0 skipped (peer not found), 0 skipped (already set)

By default, import skips cursors that already have non-null values. If the restored DB already holds non-null cursors from the backup, they are left untouched so that a newer cursor is not regressed. Use --force to override.
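
The skip rules can be summarized as a small decision function. This is an illustrative model of the behavior described in this runbook, not the actual cursor-import implementation. Keying cursors on `peer_id` alone is an assumption, and `plan_import` and `summarize` are hypothetical names.

```python
from typing import Optional


def plan_import(checkpoint_cursors: list[dict], known_peer_ids: set[str],
                existing: dict[str, Optional[str]], force: bool = False) -> dict[str, list[str]]:
    """Classify each checkpoint entry the way this runbook describes the import.

    existing maps peer_id -> current cursor in the DB (None means NULL).
    """
    plan: dict[str, list[str]] = {"restored": [], "peer_not_found": [], "already_set": []}
    for entry in checkpoint_cursors:
        peer_id = entry["peer_id"]
        if peer_id not in known_peer_ids:
            plan["peer_not_found"].append(peer_id)   # FK would fail: peer row is missing
        elif existing.get(peer_id) is not None and not force:
            plan["already_set"].append(peer_id)      # never regress a cursor by default
        else:
            plan["restored"].append(peer_id)
    return plan


def summarize(plan: dict[str, list[str]]) -> str:
    """Render the plan in the same shape as the expected output shown above."""
    return (f"cursor import complete: {len(plan['restored'])} restored, "
            f"{len(plan['peer_not_found'])} skipped (peer not found), "
            f"{len(plan['already_set'])} skipped (already set)")
```

With `force=True` only the "already set" rule is bypassed; entries for unknown peers are still skipped, which is why re-registering peers comes before the import.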

Step 5: Start the node

systemctl start stigmem-node

The pull loop reads restored cursors and resumes from the checkpointed positions rather than from NULL (the beginning of replication time).

Step 6: Verify

Watch logs for resume vs. full-re-pull patterns:

# Healthy cursor resume โ€” expected after import
INFO pull from stigmem://node-b: cursor=1725349500000.005, got 12 facts, has_more=false

# Full re-pull โ€” import was skipped or peer not found
INFO pull from stigmem://node-b: cursor=None, got 100 facts, has_more=true

If you see cursor=None for a peer after import, that peer was not found in the peers table. Re-register it and re-run cursor-import --force.
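
On a mesh with many peers, scanning the log by eye gets tedious. The sketch below classifies pull lines as resumed or full re-pull. It assumes the log lines keep exactly the shape of the two examples above, which may not hold across versions, and `classify_pull_line` is a hypothetical helper.

```python
import re
from typing import Optional

# Pattern derived from the two example log lines above; the real log format
# may differ between versions, so treat this as an assumption.
PULL_LINE = re.compile(
    r"pull from (?P<peer>\S+): cursor=(?P<cursor>[^,]+), "
    r"got (?P<count>\d+) facts, has_more=(?P<more>true|false)"
)


def classify_pull_line(line: str) -> Optional[tuple[str, str]]:
    """Return (peer, 'resumed' | 'full-re-pull'), or None for unrelated lines."""
    match = PULL_LINE.search(line)
    if match is None:
        return None
    status = "full-re-pull" if match["cursor"] == "None" else "resumed"
    return match["peer"], status


print(classify_pull_line(
    "INFO pull from stigmem://node-b: cursor=None, got 100 facts, has_more=true"
))  # ('stigmem://node-b', 'full-re-pull')
```

Feeding the first pull line per peer after restart through this function gives a quick list of peers that still need re-registration.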

Recovery without a Checkpoint

If no checkpoint is available, the node performs a full re-pull. This is safe. To minimize impact:

  1. Start the node during off-peak hours.
  2. Temporarily reduce STIGMEM_FEDERATION_PULL_INTERVAL_S (e.g., to 5) to complete the re-pull faster, then restore the original value.
  3. Monitor the facts table row count; re-pull is complete when it stabilizes.
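
Step 3 above can be automated with a simple poll. This is a sketch under two assumptions: the DB is SQLite with a table named `facts`, as the text suggests, and "stabilized" means the last few samples are identical. The function names are hypothetical.

```python
import sqlite3
import time
from typing import Callable


def count_facts(con: sqlite3.Connection) -> int:
    """Current number of rows in the facts table."""
    return con.execute("SELECT COUNT(*) FROM facts").fetchone()[0]


def has_stabilized(samples: list[int], window: int = 3) -> bool:
    """True once the last `window` samples are all equal."""
    return len(samples) >= window and len(set(samples[-window:])) == 1


def wait_until_stable(sample: Callable[[], int], interval_s: float = 60.0,
                      window: int = 3) -> int:
    """Poll `sample` until the count stops changing, then return the final count."""
    samples: list[int] = []
    while True:
        samples.append(sample())
        if has_stabilized(samples, window):
            return samples[-1]
        time.sleep(interval_s)
```

A typical call opens the DB read-only, for example `con = sqlite3.connect("file:/var/lib/stigmem/stigmem.db?mode=ro", uri=True)`, and then runs `wait_until_stable(lambda: count_facts(con))`. A flat row count can also mean that peers are unreachable or idle, so confirm against the pull logs before treating the re-pull as finished.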

PACELC contract: during re-pull the node is available for local writes and reads.

Facts from peers not yet re-fetched are temporarily absent. This is the system's existing PA/EL contract; idempotent ingestion ensures no duplicates.

Operational Recommendations

| Practice | Cadence | Rationale |
| --- | --- | --- |
| Schedule cursor-export every 15 min | 15 min | Bounds re-pull to ≤15 min of missed incremental facts. |
| Validate replication_cursors in DB backups | per backup | Detects silent table loss before an incident. |
| Store checkpoint file outside DB directory | always | Survives disk corruption that takes the DB file. |
| Run cursor-export before planned DB maintenance | pre-change | Zero-cost re-pull resume even for planned operations. |
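
The "validate replication_cursors in DB backups" practice can be a short post-backup check. The sketch below assumes the backup is a plain SQLite file and looks the table up in `sqlite_master`. The helper names are hypothetical and not part of stigmem.

```python
import sqlite3
from typing import Optional


def open_readonly(db_path: str) -> sqlite3.Connection:
    """Open a SQLite file read-only so the check cannot modify the backup."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def cursor_table_rows(con: sqlite3.Connection,
                      table: str = "replication_cursors") -> Optional[int]:
    """Return the table's row count, or None if the table does not exist."""
    exists = con.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if exists is None:
        return None
    # The table name cannot be bound as a parameter; it is quoted here and is
    # expected to come from trusted configuration, not user input.
    return con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
```

A result of `None` means the backup lacks the table entirely. A result of `0` on a node that has registered peers is also worth investigating, since restoring that backup would trigger a full re-pull.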

CLI Reference

stigmem federation cursor-export

usage: stigmem federation cursor-export [--out FILE] [--db PATH]

--out FILE Output path. Use "-" or omit for stdout. (default: stdout)
--db PATH Path to stigmem.db. (default: STIGMEM_DB_PATH or settings default)

Exits 0 on success. Checkpoint JSON written to FILE or stdout.

stigmem federation cursor-import

usage: stigmem federation cursor-import FILE [--force] [--db PATH]

FILE Checkpoint JSON produced by cursor-export (required).
--force Overwrite cursors that are already non-null. (default: skip)
--db PATH Path to stigmem.db. (default: STIGMEM_DB_PATH or settings default)

Entries whose peer_id is absent from the peers table are skipped with a warning. Re-register the peer first and re-run.

Invariants Preserved

Idempotent ingest

Re-pulled facts already in the DB are silently discarded (no duplicates).

Lower-bound cursor

If the checkpoint is older than the actual last-seen HLC, the node re-fetches a small window of already-seen facts (harmless) rather than missing any.

No regression without --force

cursor-import only fills NULL slots by default, so it cannot cause data loss on a partially-intact DB.
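
The first two invariants work together: resuming from a stale cursor re-fetches some facts, and ingestion keyed on fact ID discards the ones already present. The toy simulation below illustrates this. Representing an HLC as a `(milliseconds, counter)` tuple is an assumption made for the sketch, as are the function names; the real pull protocol is defined in the federation spec.

```python
from typing import Optional

Hlc = tuple[int, int]  # (wall-clock ms, logical counter): an assumed representation


def pull_since(peer_facts: list[dict], cursor: Optional[Hlc]) -> list[dict]:
    """Facts strictly newer than the cursor; None means 'from the beginning'."""
    return [f for f in peer_facts if cursor is None or f["hlc"] > cursor]


def ingest(store: dict[str, dict], facts: list[dict]) -> int:
    """Insert facts keyed on fact ID, skipping known IDs; return how many were new."""
    new = 0
    for fact in facts:
        if fact["id"] not in store:
            store[fact["id"]] = fact
            new += 1
    return new


peer_facts = [{"id": f"fact-{n}", "hlc": (1725349500000, n)} for n in range(1, 6)]
store = {f["id"]: f for f in peer_facts[:4]}   # node already holds facts 1..4
stale_cursor = (1725349500000, 2)              # checkpoint was taken after fact 2

batch = pull_since(peer_facts, stale_cursor)   # re-fetches facts 3 and 4, plus new fact 5
new = ingest(store, batch)
print(len(batch), new, len(store))             # 3 1 5
```

Three facts cross the wire, one is actually new, and the store ends with five facts and no duplicates. A cursor that is too new would be the dangerous direction, which is why the default import never overwrites an existing cursor.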

See also

Federation guide

Peer registration, pull loop, and scope enforcement.

4-node soak results

Cursor-resume behavior verified under failure injection.