Version: v0.9.0a2

Spec-21-Content-Addressed-IDs

5 min read · Spec contributor · Node operator · Draft · v0.9.0aN

What this spec defines

The content-addressed fact identifier (CID): deterministic, tamper-evident identifiers for the canonical body of a fact.

Extraction status

This component spec extracts the content-addressed fact ID material that previously lived in the monolithic stigmem-spec-v0.9.0a1.md lineage.

CIDs are core Stigmem behavior per ADR-011.

They are not an experimental plugin feature, and a conforming default node MUST compute CIDs for new facts.

Purpose

A CID is a deterministic, tamper-evident identifier for the canonical body of a fact. Unlike a UUID-style fact id, a CID can be recomputed independently from the fact payload.

CIDs provide:

Content integrity checks for stored facts.
Deduplication for identical assertions.
Stable provenance references across nodes and transports.
Defense-in-depth: a core layer for storage immutability.

CID format

A CID MUST use the sha256: prefix followed by 64 lowercase hexadecimal characters:

sha256:<64 lowercase hex chars>

The CID MUST be computed as:

CID = "sha256:" + hex_lowercase(SHA-256(canonical_fact_body_bytes))

Only lowercase hexadecimal output is valid for sha256: CIDs.

Strings beginning with sha256: that do not match this format MUST be treated as malformed CIDs.
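
The format rule above can be checked with a single anchored pattern. The following is a minimal sketch, not the reference node's code; the names `CID_PATTERN`, `is_well_formed_cid`, and `is_malformed_cid` are hypothetical and chosen only for illustration.

```python
import re

# "sha256:" followed by exactly 64 lowercase hexadecimal characters.
# fullmatch() is used so that trailing newlines or extra characters are rejected.
CID_PATTERN = re.compile(r"sha256:[0-9a-f]{64}")


def is_well_formed_cid(value: str) -> bool:
    """Return True only when the whole string matches the sha256: CID format."""
    return CID_PATTERN.fullmatch(value) is not None


def is_malformed_cid(value: str) -> bool:
    """Return True for strings that carry the sha256: prefix but fail the format."""
    return value.startswith("sha256:") and not is_well_formed_cid(value)
```

Under this sketch, an uppercase digest such as `"sha256:" + "A" * 64` is reported as malformed, while a UUID-style fact id is neither well-formed nor malformed because it does not claim the prefix.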

Canonical fact body

The canonical fact body for v0.9.0aN CID computation contains exactly these fields:

{
  "confidence": 1.0,
  "entity": "stigmem://example/entity",
  "relation": "memory:prefers",
  "scope": "local",
  "source": "agent:example",
  "value_type": "string",
  "value_v": "dark mode"
}

The canonical body MUST be serialized as compact UTF-8 JSON with deterministic lexicographic key ordering and no insignificant whitespace. The reference node serializes with sorted keys, compact separators (no whitespace after `,` or `:`), and ensure_ascii=false, so non-ASCII characters are written as UTF-8 rather than as \uXXXX escapes.

All seven canonical fields are CID-sensitive. Changing any of them MUST produce a different CID.
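
As an illustration of the serialization and hashing rules above, here is a minimal Python sketch. It assumes Python's standard `json` module is an acceptable stand-in for the reference node's serializer, including how it renders numbers such as `1.0`; the function names `canonical_bytes` and `compute_cid` are hypothetical.

```python
import hashlib
import json

# The seven CID-sensitive fields listed in the canonical fact body.
CANONICAL_FIELDS = (
    "confidence", "entity", "relation", "scope",
    "source", "value_type", "value_v",
)


def canonical_bytes(body: dict) -> bytes:
    """Serialize a canonical body as compact, key-sorted UTF-8 JSON."""
    if set(body) != set(CANONICAL_FIELDS):
        raise ValueError("canonical body must contain exactly the seven canonical fields")
    text = json.dumps(
        body,
        sort_keys=True,          # deterministic lexicographic key order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    )
    return text.encode("utf-8")


def compute_cid(body: dict) -> str:
    """CID = "sha256:" + lowercase hex SHA-256 of the canonical bytes."""
    return "sha256:" + hashlib.sha256(canonical_bytes(body)).hexdigest()
```

Because keys are sorted before hashing, the insertion order of the input dictionary does not affect the result, while changing any one of the seven values does.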

Excluded fields

The following fields MUST NOT participate in CID computation:

| Field | Reason | Notes |
| --- | --- | --- |
| id / fact_id | self-reference | Cannot be part of its own CID. |
| cid | self-reference | Cannot be self-referential. |
| timestamp / created_at | write-time metadata | Write time is not part of the assertion. |
| hlc | node-local | Logical clock metadata. |
| valid_until | policy | Expiry policy, not the assertion body. |
| derived_from | provenance | References can create circularity. |
| attestation_chain / signature | transport | Transport or attestation metadata. |
| source_trust | local | Locally derived trust score. |
| reason | audit context | Operator or audit context, not the assertion body. |

A matching CID is not proof that excluded metadata is trustworthy.

Excluded fields may still be security-relevant. Implementations MUST validate those fields through their owning specs.
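
One way to keep excluded fields out of the hash is to project a full fact record onto the canonical fields by allow-list rather than removing known metadata by deny-list. The sketch below assumes facts are held as plain dictionaries; `canonical_body` is a hypothetical helper name.

```python
# Allow-list of the seven canonical fields; everything else is ignored for CID purposes.
CANONICAL_FIELDS = frozenset({
    "confidence", "entity", "relation", "scope",
    "source", "value_type", "value_v",
})


def canonical_body(fact: dict) -> dict:
    """Return only the CID-sensitive fields of a fact record."""
    missing = CANONICAL_FIELDS - set(fact)
    if missing:
        raise ValueError(f"fact is missing canonical fields: {sorted(missing)}")
    # Fields such as id, cid, timestamp, hlc, or signature never reach the result.
    return {name: fact[name] for name in CANONICAL_FIELDS}
```

An allow-list means that a metadata field introduced later is excluded by default, so it cannot silently change existing CIDs.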

Storage contract

The fact storage model MUST support:

Nullable `cid` column: nullable only for legacy rows pending backfill.
`fact_cid_aliases` table: maps stored fact ids to CIDs.
Unique CID index: for efficient CID lookup.
Fact id index: for alias maintenance.

Every new fact write MUST persist the computed CID on the fact row and insert the corresponding alias row in the same transaction.

Write path and deduplication

On local assertion, a node MUST:

Normalize the assertion fields according to their owning specs.
Compute the CID before writing the fact row.
Persist the CID with all other fact fields.
Insert the CID alias row in the same transaction.

If the computed CID already exists for the same tenant, the node SHOULD return the existing record instead of creating a duplicate fact. If an implementation detects the same CID for a different canonical body, it MUST treat that as a CID collision (cid_collision_detected). A CID collision MUST NOT overwrite the existing record.
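
The write path and deduplication rules can be sketched with SQLite from the Python standard library. This is an illustrative layout only: the `tenant` and `body` columns, the index names, the per-tenant scope of the unique index, and the `assert_fact` and `CidCollision` names are all assumptions, and normalization by the owning specs is assumed to have happened before the call.

```python
import hashlib
import json
import sqlite3
import uuid

FIELDS = ("confidence", "entity", "relation", "scope", "source", "value_type", "value_v")

# Hypothetical schema: cid is nullable only so that legacy rows can exist.
SCHEMA = """
CREATE TABLE facts (
    id     TEXT PRIMARY KEY,
    tenant TEXT NOT NULL,
    cid    TEXT,
    body   TEXT NOT NULL
);
CREATE TABLE fact_cid_aliases (
    fact_id TEXT NOT NULL REFERENCES facts(id),
    tenant  TEXT NOT NULL,
    cid     TEXT NOT NULL
);
CREATE UNIQUE INDEX fact_cid_aliases_cid ON fact_cid_aliases (tenant, cid);
CREATE INDEX fact_cid_aliases_fact_id ON fact_cid_aliases (fact_id);
"""


class CidCollision(Exception):
    """Same CID observed for a different canonical body."""


def assert_fact(conn: sqlite3.Connection, tenant: str, body: dict) -> str:
    """Write a fact and its alias row atomically; return the stored fact id."""
    canonical = json.dumps({k: body[k] for k in FIELDS}, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    # The CID is computed before any row is written.
    cid = "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    with conn:  # commits on success, rolls back if an exception escapes
        row = conn.execute(
            "SELECT f.id, f.body FROM facts f "
            "JOIN fact_cid_aliases a ON a.fact_id = f.id "
            "WHERE a.tenant = ? AND a.cid = ?", (tenant, cid)).fetchone()
        if row is not None:
            if row[1] != canonical:
                raise CidCollision(cid)  # never overwrite the existing record
            return row[0]                # deduplicate: reuse the existing fact
        fact_id = str(uuid.uuid4())
        conn.execute("INSERT INTO facts (id, tenant, cid, body) VALUES (?, ?, ?, ?)",
                     (fact_id, tenant, cid, canonical))
        conn.execute("INSERT INTO fact_cid_aliases (fact_id, tenant, cid) VALUES (?, ?, ?)",
                     (fact_id, tenant, cid))
    return fact_id
```

Asserting the same body twice for one tenant returns the same fact id, while the same body under a second tenant creates a separate row in this sketch.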

Dual addressing

The single-fact read route MUST accept either a UUID-style fact id or a CID:

GET /v1/facts/{cid_or_fact_id}

When the path value starts with sha256:, the node MUST validate CID syntax and resolve the fact through the CID alias index. Malformed CIDs MUST return a validation error. Well-formed but unknown CIDs MUST return not found.

Fact responses SHOULD include the stored cid field. Legacy facts that have not yet been backfilled MAY return cid: null.
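
The routing decision for the single-fact read route can be sketched as a small resolver. In-memory dictionaries stand in for the CID alias index and the fact table here, and the exception names and `resolve_fact_id` are hypothetical; how errors map onto HTTP responses is left to the implementation.

```python
import re

CID_PATTERN = re.compile(r"sha256:[0-9a-f]{64}")


class MalformedCid(ValueError):
    """Path value starts with sha256: but is not a valid CID (validation error)."""


class FactNotFound(LookupError):
    """Fact id or well-formed CID does not resolve to a fact (not found)."""


def resolve_fact_id(path_value: str, cid_aliases: dict, fact_ids: set) -> str:
    """Resolve a {cid_or_fact_id} path value to a stored fact id."""
    if path_value.startswith("sha256:"):
        # CID branch: validate syntax first, then look up through the alias index.
        if CID_PATTERN.fullmatch(path_value) is None:
            raise MalformedCid(path_value)
        if path_value not in cid_aliases:
            raise FactNotFound(path_value)
        return cid_aliases[path_value]
    # Otherwise treat the value as a UUID-style fact id.
    if path_value not in fact_ids:
        raise FactNotFound(path_value)
    return path_value
```

Checking the prefix before anything else keeps a mistyped CID from being silently treated as an unknown fact id.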

CID verification

Nodes MUST expose an integrity check that recomputes the CID from the stored fact body and compares it with the stored CID:

POST /v1/facts/{fact_id}/verify-cid

The response MUST include:

| Field | Required | Meaning |
| --- | --- | --- |
| cid_valid | yes | Whether the stored CID matches the recomputed CID. |
| computed_cid | yes | CID computed from the stored canonical body. |
| stored_cid | nullable | Stored CID, or null for legacy rows pending backfill. |
| mismatch_reason | conditional | Human-readable reason when cid_valid is false. |

A false result SHOULD trigger operator investigation. It may indicate data corruption, storage tampering, a legacy row pending backfill, or a canonicalization bug.
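
A verification handler could assemble its response along the following lines. This is a sketch that assumes the stored fact is available as a dictionary and that a null stored CID is reported as `cid_valid: false`; the wording of `mismatch_reason` and the function names are illustrative, not taken from the reference node.

```python
import hashlib
import json

FIELDS = ("confidence", "entity", "relation", "scope", "source", "value_type", "value_v")


def compute_cid(fact: dict) -> str:
    """Recompute the CID from the seven canonical fields of a stored fact."""
    canonical = json.dumps({k: fact[k] for k in FIELDS}, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_cid(fact: dict) -> dict:
    """Build a verify-cid style response for one stored fact."""
    computed = compute_cid(fact)
    stored = fact.get("cid")  # None for legacy rows pending backfill
    response = {
        "cid_valid": stored == computed,
        "computed_cid": computed,
        "stored_cid": stored,
    }
    if not response["cid_valid"]:
        # mismatch_reason is only present when the check fails.
        response["mismatch_reason"] = (
            "stored cid is null (legacy row pending backfill)"
            if stored is None
            else "stored cid differs from recomputed cid"
        )
    return response
```

Distinguishing the null case in `mismatch_reason` lets an operator separate expected migration state from possible corruption or tampering.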

Backfill

Nodes MUST provide a backfill path for legacy rows whose cid is null. The backfill process MUST:

Iterate over facts with cid IS NULL.
Recompute each CID from the canonical fact body.
Update the fact row and insert the alias row.
Be idempotent.
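
The backfill steps above can be sketched against SQLite as follows. The schema, the `body` column holding the fact as JSON text, and the `backfill_cids` function name are assumptions for illustration; the sketch keys the alias table on `fact_id` and does not address legacy rows that share a canonical body.

```python
import hashlib
import json
import sqlite3

FIELDS = ("confidence", "entity", "relation", "scope", "source", "value_type", "value_v")

# Hypothetical minimal schema for the sketch.
SCHEMA = """
CREATE TABLE facts (id TEXT PRIMARY KEY, cid TEXT, body TEXT NOT NULL);
CREATE TABLE fact_cid_aliases (fact_id TEXT PRIMARY KEY, cid TEXT NOT NULL);
"""


def compute_cid(fact: dict) -> str:
    canonical = json.dumps({k: fact[k] for k in FIELDS}, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def backfill_cids(conn: sqlite3.Connection) -> int:
    """Fill in missing CIDs; return how many legacy rows were processed."""
    rows = conn.execute("SELECT id, body FROM facts WHERE cid IS NULL").fetchall()
    with conn:  # one transaction for the batch
        for fact_id, body_json in rows:
            cid = compute_cid(json.loads(body_json))
            # The "AND cid IS NULL" guard keeps a concurrent or repeated run harmless.
            conn.execute("UPDATE facts SET cid = ? WHERE id = ? AND cid IS NULL",
                         (cid, fact_id))
            # INSERT OR IGNORE makes the alias insert safe to repeat.
            conn.execute("INSERT OR IGNORE INTO fact_cid_aliases (fact_id, cid) VALUES (?, ?)",
                         (fact_id, cid))
    return len(rows)
```

Running the function a second time selects no rows and changes nothing, which is the idempotence property the list requires.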

The reference node exposes a backfill-cids CLI command and this status route:

GET /v1/admin/cid-backfill/status

The status response MUST include:

| Field | Type | Meaning |
| --- | --- | --- |
| total_facts | integer | Total facts visible to the status query. |
| backfilled_facts | integer | Facts with non-null cid. |
| pending_facts | integer | Facts still missing cid. |
| backfill_complete | boolean | Whether pending_facts is zero. |
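
The four status fields can be derived from a single aggregate query. The sketch below assumes a SQLite `facts` table with a nullable `cid` column; `backfill_status` is a hypothetical function name.

```python
import sqlite3


def backfill_status(conn: sqlite3.Connection) -> dict:
    """Compute the backfill status fields from the facts table."""
    # COUNT(*) counts every row; COUNT(cid) counts only rows whose cid is not NULL.
    total, backfilled = conn.execute(
        "SELECT COUNT(*), COUNT(cid) FROM facts"
    ).fetchone()
    pending = total - backfilled
    return {
        "total_facts": total,
        "backfilled_facts": backfilled,
        "pending_facts": pending,
        "backfill_complete": pending == 0,
    }
```

Deriving `pending_facts` from the other two counts keeps the three integers consistent with each other within one response.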

Federation use

Receiving nodes SHOULD recompute the CID from the inbound canonical body and reject payloads whose declared CID does not match.

Federation payloads SHOULD carry CIDs when fact records cross node boundaries. Legacy CID-null rows may exist during migration and backfill windows. Federation policy for accepting or rejecting CID-null inbound facts is owned by Spec-05-Federation-Trust; this spec defines the CID format and computation needed to perform that validation.
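
A receiving node's check might look like the sketch below. The `allow_missing_cid` flag is a hypothetical stand-in for whatever CID-null policy Spec-05-Federation-Trust defines, and the payload shape (a flat dictionary with an optional `cid` key) and the exception names are assumptions.

```python
import hashlib
import json

FIELDS = ("confidence", "entity", "relation", "scope", "source", "value_type", "value_v")


class CidMismatch(ValueError):
    """Declared CID does not match the CID recomputed from the inbound body."""


class MissingCid(ValueError):
    """Inbound fact carries no CID and local policy does not allow that."""


def compute_cid(fact: dict) -> str:
    canonical = json.dumps({k: fact[k] for k in FIELDS}, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_inbound(payload: dict, allow_missing_cid: bool = False) -> str:
    """Recompute the CID of an inbound fact and compare it with the declared one."""
    computed = compute_cid(payload)
    declared = payload.get("cid")
    if declared is None:
        # Policy decision owned by the federation trust spec, modeled here as a flag.
        if not allow_missing_cid:
            raise MissingCid("inbound fact has no cid")
        return computed
    if declared != computed:
        raise CidMismatch(f"declared {declared} but computed {computed}")
    return computed
```

The function always returns the locally computed CID, so the caller never has to trust the sender's value when persisting the fact.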

Error conditions

Nodes SHOULD use these stable error meanings:

| Error | Condition | Notes |
| --- | --- | --- |
| cid_malformed | syntax | A sha256: path value is not followed by 64 lowercase hex characters. |
| fact_not_found | lookup | Fact id or CID does not resolve to a readable fact. |
| cid_mismatch | integrity | A recomputed or inbound CID does not match the declared/stored CID. |
| cid_collision_detected | integrity | Two different canonical fact bodies produce the same CID. |

Out of scope

This spec does not define:

Federation CID-null policy: full trust policy for legacy facts.
Hash algorithm rotation beyond the sha256: prefix shape.
Provenance graph semantics for derived_from.
Tombstone, time-travel, or source-attestation behavior.

Storage-engine DDL syntax
