Version: v0.9.0a2

R-KEY-EXPIRY

3 min read | On-call operator | Runbook

When to use

Use this runbook when production traffic is blocked because a key expired before rotation completed. Trigger alerts: key_expired_blocked, or repeated authentication failures for a known production caller. Supporting signal: /v1/auth/keys/expiring-soon listed the key inside the operator's alert window, but nobody acted on it.
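The supporting signal can be checked by hand. The sketch below is illustrative only: it assumes /v1/auth/keys/expiring-soon accepts the same bearer token as the audit endpoint used later in this runbook, and it makes no assumption about the response fields. The helper names, the ALERT_WINDOW_DAYS variable, and its default of 14 days are hypothetical.

```shell
# Hypothetical alert window; set this to whatever your alerting actually uses.
ALERT_WINDOW_DAYS="${ALERT_WINDOW_DAYS:-14}"

# fetch_expiring_soon: list keys the node reports as expiring soon.
# Assumption: the endpoint takes the same Authorization header as /v1/audit/events.
fetch_expiring_soon() {
  curl -s "https://your-node.example.com/v1/auth/keys/expiring-soon" \
    -H "Authorization: Bearer $STIGMEM_ADMIN_KEY" | jq .
}

# expiry_status EXPIRES_EPOCH NOW_EPOCH
# Prints "expired", "expiring-soon", or "ok" for a key expiry time
# given in epoch seconds.
expiry_status() {
  if [ "$1" -le "$2" ]; then
    echo "expired"
  elif [ $(( $1 - $2 )) -le $(( ALERT_WINDOW_DAYS * 86400 )) ]; then
    echo "expiring-soon"
  else
    echo "ok"
  fi
}
```

For example, `expiry_status "$expires_epoch" "$(date +%s)"` classifies one key once you have read its expiry time from the response and converted it to epoch seconds.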

Identify

Find which key class is affected:

| Class | Symptom | Notes |
| --- | --- | --- |
| API key | auth failures | Callers receive authentication failures. |
| Capability issuer key | issuance/verification fail | Federation/capability tokens cannot be issued or verified. |
| Node federation key | manifest rejected | Peers reject your manifest or pull responses. |
| Encryption passphrase | db won't open | Node cannot open the database after a secrets change. |
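For triage scripts or paging notes, the table above can be expressed as a small lookup. This is a sketch only: the symptom tokens (auth-failures, manifest-rejected, and so on) are hypothetical labels chosen for illustration, not identifiers emitted by the node.

```shell
# classify_key_class SYMPTOM
# Maps a short symptom label (hypothetical tokens) to the key class
# from the table. Prints "unknown" and returns 1 for anything else.
classify_key_class() {
  case "$1" in
    auth-failures)                       echo "API key" ;;
    issuance-failed|verification-failed) echo "Capability issuer key" ;;
    manifest-rejected)                   echo "Node federation key" ;;
    db-wont-open)                        echo "Encryption passphrase" ;;
    *)                                   echo "unknown"; return 1 ;;
  esac
}
```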

Capture recent auth and admin audit events:

curl -s "https://your-node.example.com/v1/audit/events?limit=200" \
-H "Authorization: Bearer $STIGMEM_ADMIN_KEY" | jq .
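To keep the captured events as incident evidence, the output of the command above can be written to a timestamped file instead of only being printed. A minimal sketch: the helper name save_audit_snapshot and the ./incident-evidence directory are hypothetical and not part of the product.

```shell
# save_audit_snapshot DIR
# Reads JSON from stdin, writes it to DIR/audit-<UTC timestamp>.json,
# and prints the path. The body runs in a subshell so the umask and
# variables do not leak into the calling shell.
save_audit_snapshot() (
  dir="${1:-.}"
  path="$dir/audit-$(date -u +%Y%m%dT%H%M%SZ).json"
  mkdir -p "$dir" || exit 1
  umask 077                 # new file is readable by the current user only
  cat > "$path" || exit 1
  printf '%s\n' "$path"
)

# Usage (same request as above, piped into the helper):
#   curl -s "https://your-node.example.com/v1/audit/events?limit=200" \
#     -H "Authorization: Bearer $STIGMEM_ADMIN_KEY" \
#     | save_audit_snapshot ./incident-evidence
```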

Contain

  1. Do not extend an expired key by editing the database by hand.
  2. Keep the failed key material for audit, but stop issuing new tokens with it.
  3. If admin access is still available, create a replacement key immediately.
  4. If admin access is unavailable, use your documented break-glass procedure.

Investigate

Determine why the rotation was missed:

  1. Alert configured: was a key_expiring_soon alert configured?
  2. Backed by query: was the alert backed by /v1/auth/keys/expiring-soon or an equivalent database/SIEM query?
  3. Right owner: did the alert route to the right owner?
  4. Missing owner: did the key lack an owner or rotation date?
  5. Peer coordination: was the rotation procedure blocked by peer coordination?

Recover

For API keys:

  1. Create a new key with the least required permissions.
  2. Redeploy the caller with the new secret.
  3. Revoke the expired key if it remains in storage.
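After the caller is redeployed, it helps to confirm that the new secret is accepted. The sketch below checks only the HTTP status of one authenticated request. The helper names are hypothetical, the URL is whatever endpoint the caller normally uses, and treating 401/403 as the rejection codes is an assumption about the node's behaviour rather than something this runbook documents.

```shell
# http_status URL KEY
# Prints the HTTP status code of an authenticated GET (curl prints 000
# when the connection itself fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $2" "$1"
}

# auth_result STATUS
# Interprets a status code. Assumption: rejected keys yield 401 or 403.
auth_result() {
  case "$1" in
    2??)     echo "ok" ;;
    401|403) echo "auth-failure" ;;
    *)       echo "other" ;;
  esac
}

# Usage, with NEW_KEY holding the replacement secret:
#   auth_result "$(http_status "https://your-node.example.com/<caller-endpoint>" "$NEW_KEY")"
```

A result of "other" (for example 000 or a 5xx) points at connectivity or node health rather than the key itself.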

For federation or issuer keys:

  1. Follow the Key Rotation procedure.
  2. Notify peer operators of the new public key or manifest.
  3. Ask peers to re-pin if automatic refresh is unavailable.
  4. Confirm federation pulls resume.

For encryption passphrases:

  1. Restore the last known-good secret from your secrets manager.
  2. Bring the node healthy.
  3. Schedule a controlled rekey rather than improvising during the outage.

Communicate

After recovery, add or fix the rotation reminder that should have prevented the outage.

Tell affected callers or peers which key expired, when replacement credentials will be available, and whether any data integrity risk exists.
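A fixed template keeps the notice consistent across incidents. The sketch below prints the three items listed above. The function name, argument order, and wording are illustrative assumptions, not a format defined by the product.

```shell
# render_notice KEY_CLASS KEY_ID REPLACEMENT_ETA INTEGRITY_RISK
# Prints a three-line notice for affected callers or peers.
render_notice() {
  printf 'Expired key: %s (%s)\n' "$2" "$1"
  printf 'Replacement credentials available: %s\n' "$3"
  printf 'Data integrity risk: %s\n' "$4"
}

# Usage:
#   render_notice "API key" "key-123" "2025-01-01T12:00Z" "none identified"
```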