Version: v0.9.0a2

Recall

5 min read · Agent developer · Spec-07-Recall-Pipeline + Spec-X11-Recall-Graph

What this page is

The recall endpoint answers: "What do I most need to know right now?" It scans the fact store with a hybrid three-stage pipeline (lexical + dense + graph), scores candidates by relevance, and packs as many facts as possible within a token budget — returning a coherent, size-bounded context slice rather than an unbounded list.

`recall` vs `query_facts`

| Aspect | POST /v1/recall | GET /v1/facts |
| --- | --- | --- |
| Use when | The question is open-ended or semantic: "What do I know about project X?" | You know the exact entity, relation, or predicate. |
| Input | Natural-language query. | Structured filters (entity, relation, scope, date range). |
| Output | Scored, MMR-packed slice. | Paginated full list matching filters. |
| Graph expansion | Yes, depth 1 or 2. | No. |
| Memory card | Included with entity param. | Not included. |
| Score signal | BM25 + cosine + graph proximity. | N/A. |
| Token control | token_budget | limit / pagination. |

Use query_facts when you need a complete list for a specific predicate, for example when fetching all memory:tag values for a project before running a report. Use recall when you need the most relevant slice of memory to include in an agent's context window.

How the pipeline works

Stage 1 · Lexical (BM25/FTS5)

The query string is matched against a full-text index of the entity, relation, and value fields using SQLite FTS5 with BM25 scoring. This stage is fast and catches exact-keyword or near-exact matches. Weight: weights.lexical (default 0.30).

Stage 2 · Dense vector (ANN)

The query string is embedded with the same model used at write time (default: nomic-embed-text-v1.5) and compared to the vec_facts index via approximate nearest-neighbour search (sqlite-vec). This stage catches semantic matches that lexical search misses. Weight: weights.vector (default 0.50). Requires STIGMEM_EMBED_ENABLED=true.

Stage 3 · Graph expansion

Seed facts from stages 1–2 are expanded by traversing the entity_edges adjacency index. Related entities are pulled in even if they didn't score on the query directly. Controlled by depth (1 or 2) and weights.graph (default 0.20). Set depth=0 to skip graph expansion entirely.
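The graph stage can be pictured as a bounded breadth-first walk from the seed entities. The sketch below is illustrative only: the plain dict stands in for the entity_edges index, and expand_seeds and the entity names are hypothetical, not the node's actual code.

```python
from collections import deque


def expand_seeds(seeds, edges, depth=1):
    """Return {entity: hops} for every entity within `depth` hops of a seed.

    `edges` is a plain dict standing in for the entity_edges adjacency index.
    """
    hops = {entity: 0 for entity in seeds}
    queue = deque(seeds)
    while queue:
        entity = queue.popleft()
        if hops[entity] >= depth:
            continue  # with depth=0 only the seeds themselves are returned
        for neighbour in edges.get(entity, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[entity] + 1
                queue.append(neighbour)
    return hops


# Hypothetical adjacency data for illustration.
edges = {
    "project/api-service": ["team/platform", "incident/2026-04-15"],
    "team/platform": ["person/alice"],
}
reached = expand_seeds(["project/api-service"], edges, depth=2)
print(reached)
```

With depth=1 only the two direct neighbours are added; depth=2 also reaches person/alice through team/platform. How hop distance translates into a graph proximity score is not specified on this page.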

MMR packing

After scoring, candidates are packed into the response using Maximal Marginal Relevance (MMR). MMR balances relevance against diversity: each slot picks the next candidate that is both highly relevant and dissimilar to what is already packed. This prevents five near-duplicate facts about the same entity from consuming the whole budget.

| lambda_mmr | Tradeoff | Behavior |
| --- | --- | --- |
| 1.0 | pure relevance | Highest scores first, no diversity penalty. |
| 0.7 (default) | relevance-biased | Moderate diversity. |
| 0.5 | balanced | Equal weight on relevance and diversity. |
| 0.0 | pure diversity | Avoids all similarity to already-packed items, regardless of score. |
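To make the effect of lambda_mmr concrete, here is a minimal sketch of greedy MMR ordering. It uses the textbook form, lambda × relevance − (1 − lambda) × (similarity to the closest already-picked item); the node's exact formula is not given on this page, and mmr_order, the fact ids, and the toy similarity function are hypothetical.

```python
def mmr_order(candidates, similarity, lambda_mmr=0.7):
    """Greedy MMR ordering.

    candidates: dict of fact id -> relevance score in [0, 1]
    similarity: function (id_a, id_b) -> similarity in [0, 1]
    """
    remaining = dict(candidates)
    picked = []
    while remaining:
        def mmr_score(fact_id):
            # Penalty is the similarity to the closest already-picked fact.
            penalty = max((similarity(fact_id, p) for p in picked), default=0.0)
            return lambda_mmr * remaining[fact_id] - (1 - lambda_mmr) * penalty

        best = max(remaining, key=mmr_score)
        picked.append(best)
        del remaining[best]
    return picked


def same_entity(a, b):
    # Toy similarity: ids sharing a first letter are near-duplicates.
    return 0.95 if a[0] == b[0] else 0.1


relevance = {"a1": 0.90, "a2": 0.85, "b1": 0.60}
print(mmr_order(relevance, same_entity, lambda_mmr=1.0))
print(mmr_order(relevance, same_entity, lambda_mmr=0.7))
```

At lambda_mmr=1.0 the near-duplicate a2 takes the second slot purely on score. At the default 0.7 its similarity penalty lets the less relevant but distinct b1 move ahead of it.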

Token budget

The token_budget parameter limits the total size of the packed response. The node counts estimated tokens across entity, relation, and value fields for each candidate fact. Field labels and metadata (id, score, source_trust, etc.) are excluded from the count.

When the budget is exhausted the response includes "truncated": true. The count of tokens actually used is returned as token_budget_used. Facts are always added in MMR order — the highest-priority items appear first in the results array even when the response is truncated.
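The budget accounting can be sketched as a single pass over the MMR-ordered candidates. Two assumptions are made here and are not confirmed by this page: a whitespace word count stands in for the node's token estimator, and packing stops at the first fact that does not fit. The pack helper and the sample facts are hypothetical.

```python
def estimate_tokens(fact):
    # Assumption: a crude word count stands in for the node's token estimator.
    # Only entity, relation, and value are counted; metadata such as score is not.
    return sum(len(fact[field].split()) for field in ("entity", "relation", "value"))


def pack(facts_in_mmr_order, token_budget):
    results, used, truncated = [], 0, False
    for fact in facts_in_mmr_order:
        cost = estimate_tokens(fact)
        if used + cost > token_budget:
            truncated = True  # assumption: stop at the first fact that does not fit
            break
        results.append(fact)
        used += cost
    return {"results": results, "token_budget_used": used, "truncated": truncated}


facts = [
    {"entity": "project/api-service", "relation": "status", "value": "blocked on review", "score": 0.91},
    {"entity": "project/api-service", "relation": "owner", "value": "platform team", "score": 0.74},
    {"entity": "incident/42", "relation": "summary", "value": "database failover took nine minutes", "score": 0.58},
]
small = pack(facts, token_budget=10)
print(small["token_budget_used"], small["truncated"], len(small["results"]))
```

Because the input is already in MMR order, a truncated response still contains the highest-priority facts; only the tail is dropped.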

Setting a budget

A typical share of an LLM context window set aside for agent tool responses is 4,000–8,000 tokens. Leave headroom for the agent's instructions and the conversation history. A starting value of 2,000–4,000 works for most agent tasks.

Check truncated

If your agent is missing relevant facts, check truncated first. If it is true, either raise token_budget or narrow the query to a specific entity.
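One way to act on the truncated flag is to retry with a larger budget up to a ceiling. The helper below is a sketch under assumptions: recall_until_complete is a hypothetical name, recall_fn is any callable that accepts query and token_budget and returns a dict with a truncated key, and fake_recall is a stand-in so the example runs without a node.

```python
def recall_until_complete(recall_fn, query, token_budget, max_budget):
    """Double token_budget until the response is not truncated or max_budget is hit."""
    while True:
        response = recall_fn(query=query, token_budget=token_budget)
        if not response["truncated"] or token_budget >= max_budget:
            return response, token_budget
        token_budget = min(token_budget * 2, max_budget)


def fake_recall(query, token_budget):
    # Stand-in for a real client call: pretends the full answer needs 3,000 tokens.
    return {"truncated": token_budget < 3000, "token_budget_used": min(token_budget, 3000)}


response, final_budget = recall_until_complete(fake_recall, "project status", 1000, 8000)
print(final_budget, response["truncated"])
```

Keep max_budget below the headroom left in the agent's context window. If the response is still truncated at the ceiling, narrowing the query to a specific entity is the better fix.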

Parameters

| Parameter | Type · Default | Description |
| --- | --- | --- |
| query | string · required | Natural-language or structured query. |
| token_budget | integer · required | Max response tokens (field labels excluded). |
| depth | integer · 1 | Graph expansion hops; 0 disables graph stage. |
| weights | object · {lexical:0.30, vector:0.50, graph:0.20} | Stage weights; must sum to 1.0 ±0.001. |
| entity | string | Entity URI; triggers entity-centric (card-first) recall. |
| relation | string | Relation filter; skips memory card lookup. |
| scope | string · global | Garden or global scope. |
| lambda_mmr | float · 0.7 | MMR diversity tradeoff (0–1). |
| min_confidence | float · 0.1 | Minimum effective confidence for inclusion. |
| force_refresh | bool · false | Block on synchronous memory card refresh. |
| include_contradicted | bool · false | Include facts with unresolved contradictions. |
| legacy_format | query bool · false | Temporary one-minor-version compatibility switch. Omits content and instructions while preserving the legacy facts array. |
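Checking the documented constraints on the client side gives an earlier and clearer error than a rejected request. The sketch below assembles a request body and checks the weight sum and depth range listed above; build_recall_body is a hypothetical helper, not part of the SDK, and how the server reports violations is not covered on this page.

```python
DEFAULT_WEIGHTS = {"lexical": 0.30, "vector": 0.50, "graph": 0.20}


def build_recall_body(query, token_budget, weights=None, depth=1, **optional):
    """Assemble a JSON body for POST /v1/recall, checking documented constraints."""
    weights = dict(DEFAULT_WEIGHTS) if weights is None else dict(weights)
    total = sum(weights.values())
    if abs(total - 1.0) > 0.001:
        raise ValueError(f"weights sum to {total:.3f}, expected 1.0 within 0.001")
    if depth not in (0, 1, 2):
        raise ValueError("depth must be 0 (graph stage off), 1, or 2")
    # Optional keys (entity, relation, scope, lambda_mmr, ...) pass through unchanged.
    return {"query": query, "token_budget": token_budget,
            "weights": weights, "depth": depth, **optional}


body = build_recall_body(
    "production incident 2026-04-15",
    token_budget=3000,
    weights={"lexical": 0.60, "vector": 0.20, "graph": 0.20},
    lambda_mmr=0.7,
)
print(body)
```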

Response channels

By default, recall returns the legacy facts array plus channel-separated content and instructions arrays. New adapters should consume content and instructions separately and treat recalled content as untrusted data. Older clients can call POST /v1/recall?legacy_format=true during the compatibility window to receive the pre-channel response shape.

Examples

Open-ended semantic query

```bash
curl -s -X POST http://localhost:8765/v1/recall \
  -H 'Authorization: Bearer <api-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "what is the current project status?",
    "token_budget": 2000
  }'
```

Entity-centric query (memory card first)

When entity is set, the response includes a memory_card block — a pre-synthesized entity summary — before the ranked fact list. Use this when the agent is reasoning about a specific entity.

```bash
curl -s -X POST http://localhost:8765/v1/recall \
  -H 'Authorization: Bearer <api-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "deployment history",
    "entity": "stigmem://company.example/project/api-service",
    "token_budget": 3000,
    "depth": 2
  }'
```

Python SDK

```python
from stigmem import StigmemClient

client = StigmemClient(base_url="http://localhost:8765", api_key="<api-key>")

result = client.recall(
    query="what are the current blockers on phase 9?",
    token_budget=2000,
    depth=1,
)

for fact in result.results:
    print(f"{fact.entity}  {fact.relation}  {fact.value}  (score={fact.score:.3f})")

if result.truncated:
    print(f"Warning: response truncated at {result.token_budget_used} tokens")
```

Weight tuning

The three stage weights control how much each signal contributes to a fact's final score. Adjust when the defaults produce poor results.

| Scenario | Recommended weights | Notes |
| --- | --- | --- |
| Exact-keyword queries | {lexical: 0.60, vector: 0.20, graph: 0.20} | "What is task X?" |
| Semantic / conceptual queries | {lexical: 0.10, vector: 0.70, graph: 0.20} | "What do I know about auth?" |
| Graph-heavy: exploring relationships | {lexical: 0.20, vector: 0.30, graph: 0.50} | Entity-centric exploration. |
| Embeddings disabled | {lexical: 0.70, vector: 0.00, graph: 0.30} | STIGMEM_EMBED_ENABLED=false. |

When embeddings are disabled, the vector stage is skipped and its weight is redistributed proportionally at recall time. Passing explicit weights that sum to 1.0 with a zero vector component is cleaner.

```bash
curl -s -X POST http://localhost:8765/v1/recall \
  -H 'Authorization: Bearer <api-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "production incident 2026-04-15",
    "token_budget": 3000,
    "weights": {"lexical": 0.60, "vector": 0.20, "graph": 0.20}
  }'
```
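For the embeddings-disabled case, a worked example shows what proportional redistribution would do to the default weights. This assumes "proportionally" means rescaling the remaining weights by their sum; the page does not spell out the rule, and redistribute_without_vector is a hypothetical helper.

```python
def redistribute_without_vector(weights):
    """Zero the vector weight and rescale the others so they still sum to 1.0."""
    remaining = {name: w for name, w in weights.items() if name != "vector"}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no non-vector weight left to redistribute to")
    rescaled = {name: w / total for name, w in remaining.items()}
    return {"vector": 0.0, **rescaled}


defaults = {"lexical": 0.30, "vector": 0.50, "graph": 0.20}
effective = redistribute_without_vector(defaults)
print(effective)
```

Under this assumption the defaults become roughly lexical 0.60 and graph 0.40, which differs from the 0.70 / 0.30 split recommended in the table. Passing explicit weights removes that ambiguity.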

Memory cards and recall fast-path

Memory cards are per-entity, per-scope pre-aggregated summaries stored in the memory_cards table. They accelerate recall by short-circuiting raw-fact re-ranking for entities that have a fresh, reliable card.

Stale-on-write

Every POST /v1/facts call marks the affected entity's card stale immediately after the fact is persisted. The marking is a non-blocking background operation that never delays the write path.

Refresh-on-read (fast-path)

During recall, the node calls get_fresh_card for each candidate entity. When a card passes all three conditions — is_stale = false, has_contradictions = false, and avg_confidence ≥ 0.5 — the entity's raw facts are replaced by a single synthetic ScoredFact carrying the card summary. This fact appears with from_card: true and the relation stigmem:card:summary.

Divergence policy

When any condition fails (including a transient refresh error), the entity falls through to full raw-fact re-ranking. The fallback is transparent to callers: the only signal is the absence of from_card: true on those facts.
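The fast-path decision and its fallback can be summarized in a few lines. This is a sketch of the three conditions as described above, not the node's implementation: the Card dataclass, card_is_usable, and facts_for_entity are hypothetical names, and a missing card (None) is treated the same as a failed condition.

```python
from dataclasses import dataclass


@dataclass
class Card:
    is_stale: bool
    has_contradictions: bool
    avg_confidence: float
    summary: str


def card_is_usable(card, min_avg_confidence=0.5):
    # All three documented conditions must hold; a missing card never qualifies.
    return (
        card is not None
        and not card.is_stale
        and not card.has_contradictions
        and card.avg_confidence >= min_avg_confidence
    )


def facts_for_entity(entity, card, raw_facts):
    if card_is_usable(card):
        # Fast path: one synthetic fact replaces the entity's raw facts.
        return [{"entity": entity, "relation": "stigmem:card:summary",
                 "value": card.summary, "from_card": True}]
    return raw_facts  # fallback: full raw-fact re-ranking


raw = [{"entity": "project/api-service", "relation": "status", "value": "blocked"}]
fresh = Card(is_stale=False, has_contradictions=False, avg_confidence=0.8, summary="API service summary")
stale = Card(is_stale=True, has_contradictions=False, avg_confidence=0.8, summary="API service summary")
print(facts_for_entity("project/api-service", fresh, raw))
print(facts_for_entity("project/api-service", stale, raw))
```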

Fetching or forcing a card refresh

Use GET /v1/cards/{entity_uri} to inspect a card directly or force a server-side refresh with ?refresh=true. See the Memory Cards guide for the full lifecycle.

Security and access control

Garden ACL filtering

All recall results are filtered by the caller's garden ACL (Spec-02) at query time — callers never see facts from gardens they don't have read access to.

Content sanitizer

The recall-time sanitizer (ADR-003) strips known prompt-injection sentinels and bidirectional-override characters from value fields.

Source-trust multiplication

source_trust on each result reflects the identity strength of the writing agent. Effective confidence = confidence × source_trust. Facts below min_confidence are excluded.
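A short worked example of the effective-confidence rule: the functions and sample facts below are hypothetical, and facts exactly at the threshold are kept, following "facts below min_confidence are excluded".

```python
def effective_confidence(fact):
    # Effective confidence = confidence x source_trust, as stated above.
    return fact["confidence"] * fact["source_trust"]


def filter_by_confidence(facts, min_confidence=0.1):
    return [f for f in facts if effective_confidence(f) >= min_confidence]


facts = [
    {"value": "deploy succeeded", "confidence": 0.9, "source_trust": 0.8},   # 0.72, kept
    {"value": "deploy rolled back", "confidence": 0.5, "source_trust": 0.1},  # 0.05, dropped
]
kept = filter_by_confidence(facts)
print([f["value"] for f in kept])
```

A fact with high stated confidence can still be excluded when it comes from a low-trust writer, which is the point of multiplying the two values.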

Concepts · Querying facts: Structured, predicate-based fact queries.

Experimental · Embeddings: Model selection, reindexing, and mixed-model safety.

Experimental · Memory Gardens: Garden ACL and recall scoping.

Experimental · Subscriptions: Push notifications when watched facts change.
