Version: v0.9.0a2
Integrator

Recall

5 min read · Agent developer · Spec-07-Recall-Pipeline + Spec-X11-Recall-Graph

What this page is

The recall endpoint answers: "What do I most need to know right now?" It scans the fact store with a hybrid three-stage pipeline (lexical + dense + graph), scores candidates by relevance, and packs as many facts as possible within a token budget — returning a coherent, size-bounded context slice rather than an unbounded list.

recall vs query_facts

| Aspect | POST /v1/recall | GET /v1/facts |
| --- | --- | --- |
| Use when | The question is open-ended or semantic ("What do I know about project X?"). | You know the exact entity, relation, or predicate. |
| Input | Natural-language query. | Structured filters (entity, relation, scope, date range). |
| Output | Scored, MMR-packed slice. | Paginated full list matching filters. |
| Graph expansion | Yes, depth 1 or 2. | No. |
| Memory card | Included with the entity param. | Not included. |
| Score signal | BM25 + cosine + graph proximity. | N/A. |
| Token control | token_budget | limit / pagination |

Use query_facts (GET /v1/facts) when you need a complete list for a specific predicate, for example fetching all memory:tag values for a project before running a report.

Use recall when you need the most relevant slice of memory to include in an agent's context window.

How the pipeline works

Stage 1 · Lexical (BM25/FTS5)

The query string is matched against a full-text index of the entity, relation, and value fields using SQLite FTS5 with BM25 scoring. This stage is fast and catches exact-keyword or near-exact matches. Weight: weights.lexical (default 0.30).

Stage 2 · Dense vector (ANN)

The query string is embedded with the same model used at write time (default: nomic-embed-text-v1.5) and compared to the vec_facts index via approximate nearest-neighbour search (sqlite-vec). This stage catches semantic matches that lexical search misses. Weight: weights.vector (default 0.50). Requires STIGMEM_EMBED_ENABLED=true.

Stage 3 · Graph expansion

Seed facts from stages 1–2 are expanded by traversing the entity_edges adjacency index. Related entities are pulled in even if they didn't score on the query directly. Controlled by depth (1 or 2) and weights.graph (default 0.20). Set depth=0 to skip graph expansion entirely.
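One way to picture how the three stage signals feed a single score is a weighted sum. The sketch below assumes each stage yields a score normalized to [0, 1] and that the final score is a linear combination of them; the function name `combined_score` and the normalization are illustrative assumptions, not the node's confirmed implementation.

```python
DEFAULT_WEIGHTS = {"lexical": 0.30, "vector": 0.50, "graph": 0.20}

def combined_score(stage_scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-stage scores (assumed normalized to [0, 1])."""
    # A stage that did not surface the fact contributes 0 (assumption).
    return sum(weights[name] * stage_scores.get(name, 0.0) for name in weights)

# A fact reached only through graph expansion still gets a non-zero score.
graph_only = combined_score({"graph": 0.8})
keyword_and_semantic = combined_score({"lexical": 0.9, "vector": 0.7})
print(round(graph_only, 3), round(keyword_and_semantic, 3))
```

Under this assumption, a fact pulled in only by graph expansion can never outscore one that matched strongly in both earlier stages unless weights.graph is raised.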

MMR packing

After scoring, candidates are packed into the response using Maximal Marginal Relevance (MMR). MMR trades off relevance against diversity: each slot picks the remaining candidate with the best combination of a high relevance score and low similarity to what is already packed. This prevents five near-duplicate facts about the same entity from consuming the whole budget.

| lambda_mmr | Tradeoff | Behavior |
| --- | --- | --- |
| 1.0 | Pure relevance | Highest scores first, no diversity penalty. |
| 0.7 (default) | Relevance-biased | Moderate diversity. |
| 0.5 | Balanced | Equal weight on relevance and diversity. |
| 0.0 | Pure diversity | Avoids all similarity to already-packed items, regardless of score. |
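The greedy selection can be sketched with the textbook MMR formula, lambda × relevance − (1 − lambda) × (maximum similarity to anything already selected). This is a minimal illustration under assumptions: `mmr_select`, the `same_entity` similarity function, and the sample scores are hypothetical, and the node's actual similarity measure is not described on this page.

```python
def mmr_select(candidates, similarity, lambda_mmr=0.7, k=None):
    """Greedy MMR over a list of (item, relevance) pairs."""
    remaining = list(candidates)
    selected = []
    limit = len(remaining) if k is None else min(k, len(remaining))
    while len(selected) < limit:
        def mmr_value(pair):
            item, relevance = pair
            # Redundancy: similarity to the closest already-selected item.
            redundancy = max((similarity(item, s) for s, _ in selected), default=0.0)
            return lambda_mmr * relevance - (1.0 - lambda_mmr) * redundancy
        best = max(remaining, key=mmr_value)
        remaining.remove(best)
        selected.append(best)
    return [item for item, _ in selected]

def same_entity(a, b):
    # Toy similarity: 1.0 when two "entity:relation" keys share an entity.
    return 1.0 if a.split(":")[0] == b.split(":")[0] else 0.0

scored = [("api-service:status", 0.90), ("api-service:owner", 0.85), ("billing:status", 0.60)]
balanced = mmr_select(scored, same_entity, lambda_mmr=0.7)
relevance_only = mmr_select(scored, same_entity, lambda_mmr=1.0)
print(balanced)
print(relevance_only)
```

With the default 0.7, the lower-scored billing fact is picked second because the second api-service fact is penalized for overlapping with the first; with 1.0 the order is purely by score.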

Token budget

The token_budget parameter limits the total size of the packed response. The node counts estimated tokens across entity, relation, and value fields for each candidate fact. Field labels and metadata (id, score, source_trust, etc.) are excluded from the count.

When the budget is exhausted the response includes "truncated": true. The count of tokens actually used is returned as token_budget_used. Facts are always added in MMR order — the highest-priority items appear first in the results array even when the response is truncated.
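The budget accounting described above can be approximated as follows. This is a sketch under assumptions: the 4-characters-per-token estimate and the stop-at-first-overflow behavior are guesses for illustration, and `estimate_tokens` and `pack_within_budget` are hypothetical names; only the counted fields and the truncated / token_budget_used outputs come from this page.

```python
def estimate_tokens(fact):
    """Rough estimate (~4 chars per token) over entity, relation, value only."""
    text = " ".join((fact["entity"], fact["relation"], fact["value"]))
    return max(1, len(text) // 4)

def pack_within_budget(ranked_facts, token_budget):
    """Add facts in ranked (MMR) order until the next one would not fit."""
    results, used, truncated = [], 0, False
    for fact in ranked_facts:
        cost = estimate_tokens(fact)
        if used + cost > token_budget:
            truncated = True
            break
        results.append(fact)
        used += cost
    return {"results": results, "truncated": truncated, "token_budget_used": used}

# Metadata such as id or score is ignored by the estimate.
facts = [{"id": i, "entity": "a", "relation": "b", "value": "c" * 40} for i in range(3)]
packed = pack_within_budget(facts, token_budget=25)
print(len(packed["results"]), packed["truncated"], packed["token_budget_used"])
```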

Setting a budget

A typical LLM context window for agent tool responses is 4,000–8,000 tokens. Leave headroom for the agent's instructions and the conversation history. A starting value of 2,000–4,000 works for most agent tasks.

Check truncated

If your agent is missing relevant facts, check truncated first. If it is true, either raise token_budget or narrow the query to a specific entity.
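A client can automate the "raise token_budget" half of that advice. The helper below is a hypothetical sketch: it accepts any callable with the `recall(query=..., token_budget=...)` shape used by the Python SDK example on this page, and the doubling strategy and 8000-token ceiling are illustrative choices. A stand-in function replaces the real client so the snippet runs on its own.

```python
from types import SimpleNamespace

def recall_with_growing_budget(recall_fn, query, start_budget=2000, max_budget=8000):
    """Retry recall with a doubled budget while the result is truncated."""
    budget = start_budget
    while True:
        result = recall_fn(query=query, token_budget=budget)
        if not result.truncated or budget >= max_budget:
            return result, budget
        budget = min(budget * 2, max_budget)

def fake_recall(query, token_budget):
    # Stand-in for client.recall: pretends the full answer needs 3500 tokens.
    return SimpleNamespace(
        truncated=token_budget < 3500,
        token_budget_used=min(token_budget, 3500),
    )

result, final_budget = recall_with_growing_budget(fake_recall, "current blockers")
print(final_budget, result.truncated)
```

In real use, pass `client.recall` in place of `fake_recall`. If the result is still truncated at the ceiling, narrowing the query to a specific entity is the remaining option.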

Parameters

| Parameter | Type · Default | Description |
| --- | --- | --- |
| query | string · required | Natural-language or structured query. |
| token_budget | integer · required | Max response tokens (field labels excluded). |
| depth | integer · 1 | Graph expansion hops; 0 disables graph stage. |
| weights | object · {lexical:0.30, vector:0.50, graph:0.20} | Stage weights; must sum to 1.0 ±0.001. |
| entity | string | Entity URI; triggers entity-centric (card-first) recall. |
| relation | string | Relation filter; skips memory card lookup. |
| scope | string · global | Garden or global scope. |
| lambda_mmr | float · 0.7 | MMR diversity tradeoff (0–1). |
| min_confidence | float · 0.1 | Minimum effective confidence for inclusion. |
| force_refresh | bool · false | Block on synchronous memory card refresh. |
| include_contradicted | bool · false | Include facts with unresolved contradictions. |
| legacy_format | query bool · false | Temporary one-minor-version compatibility switch. Omits content and instructions while preserving the legacy facts array. |
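Catching invalid parameters before the request leaves the client can save a round trip. The builder below is a hypothetical helper, not part of the SDK: it applies the constraints listed in the table (weights summing to 1.0 ±0.001, depth of 0 to 2, lambda_mmr in 0–1) and assumes, beyond what the table states, that token_budget must be positive.

```python
import json

def build_recall_payload(query, token_budget, *, depth=1, weights=None,
                         lambda_mmr=0.7, **extra):
    """Build a JSON-ready body for POST /v1/recall with basic checks."""
    if not query:
        raise ValueError("query is required")
    if token_budget <= 0:  # assumption: a non-positive budget is meaningless
        raise ValueError("token_budget must be positive")
    if depth not in (0, 1, 2):
        raise ValueError("depth must be 0, 1, or 2")
    if not 0.0 <= lambda_mmr <= 1.0:
        raise ValueError("lambda_mmr must be between 0 and 1")
    payload = {"query": query, "token_budget": token_budget,
               "depth": depth, "lambda_mmr": lambda_mmr}
    if weights is not None:
        if abs(sum(weights.values()) - 1.0) > 0.001:
            raise ValueError("weights must sum to 1.0 within 0.001")
        payload["weights"] = weights
    payload.update(extra)  # e.g. entity, relation, scope, min_confidence
    return payload

payload = build_recall_payload(
    "production incident 2026-04-15", 3000,
    weights={"lexical": 0.60, "vector": 0.20, "graph": 0.20},
)
print(json.dumps(payload, indent=2))
```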

Response channels

By default, recall returns the legacy facts array plus channel-separated content and instructions arrays. New adapters should consume content and instructions separately and treat recalled content as untrusted data. Older clients can call POST /v1/recall?legacy_format=true during the compatibility window to receive the pre-channel response shape.
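An adapter that has to handle both response shapes might separate the channels as sketched below. The top-level keys facts, content, and instructions come from this page; the helper name `split_channels`, the output key names, and the treatment of array items as opaque values are assumptions made for illustration.

```python
def split_channels(response):
    """Split a decoded recall response (a dict) into its channels."""
    has_channels = "content" in response or "instructions" in response
    return {
        "legacy": not has_channels,
        "facts": response.get("facts", []),
        # Recalled content is untrusted data: hand it to the model as quoted
        # material, never as part of the system or developer prompt.
        "untrusted_content": response.get("content", []),
        "instructions": response.get("instructions", []),
    }

default_shape = split_channels({"facts": [], "content": ["c1"], "instructions": ["i1"]})
legacy_shape = split_channels({"facts": ["f1"]})
print(default_shape["legacy"], legacy_shape["legacy"])
```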

Examples

Open-ended semantic query

curl -s -X POST http://localhost:8765/v1/recall \
  -H 'Authorization: Bearer <api-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "what is the current project status?",
    "token_budget": 2000
  }'

Entity-centric query (memory card first)

When entity is set, the response includes a memory_card block — a pre-synthesized entity summary — before the ranked fact list. Use this when the agent is reasoning about a specific entity.

curl -s -X POST http://localhost:8765/v1/recall \
  -H 'Authorization: Bearer <api-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "deployment history",
    "entity": "stigmem://company.example/project/api-service",
    "token_budget": 3000,
    "depth": 2
  }'

Python SDK

from stigmem import StigmemClient

client = StigmemClient(base_url="http://localhost:8765", api_key="<api-key>")

# Ask an open-ended question with a 2000-token budget and one graph hop.
result = client.recall(
    query="what are the current blockers on phase 9?",
    token_budget=2000,
    depth=1,
)

# Results arrive in MMR order, highest priority first.
for fact in result.results:
    print(f"{fact.entity} {fact.relation} {fact.value} (score={fact.score:.3f})")

# A truncated response means relevant facts may have been left out.
if result.truncated:
    print(f"Warning: response truncated at {result.token_budget_used} tokens")

Weight tuning

The three stage weights control how much each signal contributes to a fact's final score. Adjust when the defaults produce poor results.

| Scenario | Recommended weights | Notes |
| --- | --- | --- |
| Exact-keyword queries | {lexical: 0.60, vector: 0.20, graph: 0.20} | "What is task X?" |
| Semantic / conceptual queries | {lexical: 0.10, vector: 0.70, graph: 0.20} | "What do I know about auth?" |
| Graph-heavy: exploring relationships | {lexical: 0.20, vector: 0.30, graph: 0.50} | Entity-centric exploration. |
| Embeddings disabled | {lexical: 0.70, vector: 0.00, graph: 0.30} | STIGMEM_EMBED_ENABLED=false. |

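When embeddings are disabled, the vector stage is skipped and its weight is redistributed proportionally at recall time. Passing explicit weights that sum to 1.0 with vector set to 0.00 is cleaner, because the request then states the intended balance rather than relying on the automatic redistribution.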

curl -s -X POST http://localhost:8765/v1/recall \
  -H 'Authorization: Bearer <api-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "production incident 2026-04-15",
    "token_budget": 3000,
    "weights": {"lexical": 0.60, "vector": 0.20, "graph": 0.20}
  }'
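To see what proportional redistribution of the vector weight would produce, the sketch below rescales the remaining weights so they sum to 1.0 again. The function name `redistribute_without_vector` is hypothetical, and the rescaling rule is an assumption about what "proportionally" means here.

```python
def redistribute_without_vector(weights):
    """Drop the vector weight and rescale the others to sum to 1.0."""
    remaining = {k: v for k, v in weights.items() if k != "vector"}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("at least one non-vector weight must be positive")
    scaled = {k: v / total for k, v in remaining.items()}
    scaled["vector"] = 0.0
    return scaled

defaults = {"lexical": 0.30, "vector": 0.50, "graph": 0.20}
rescaled = redistribute_without_vector(defaults)
print({k: round(v, 2) for k, v in rescaled.items()})
```

Under this reading, the defaults become lexical 0.60 and graph 0.40, which differs from the {lexical: 0.70, vector: 0.00, graph: 0.30} recommended in the table. That gap is one reason to pass explicit weights.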

Memory cards and recall fast-path

Memory cards are per-entity, per-scope pre-aggregated summaries stored in the memory_cards table. They accelerate recall by short-circuiting raw-fact re-ranking for entities that have a fresh, reliable card.

Stale-on-write

Every POST /v1/facts call marks the affected entity's card stale immediately after the fact is persisted. The marking runs as a non-blocking background operation that never delays the write path.

Refresh-on-read (fast-path)

During recall, the node calls get_fresh_card for each candidate entity. When a card passes all three conditions — is_stale = false, has_contradictions = false, and avg_confidence ≥ 0.5 — the entity's raw facts are replaced by a single synthetic ScoredFact carrying the card summary. This fact appears with from_card: true and the relation stigmem:card:summary.

Divergence policy

When any condition is false (including a transient refresh error), the entity falls through to full raw-fact re-ranking. The fallback is transparent to callers: the only signal is the absence of from_card: true on those facts.
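The three-condition check and the fallback can be sketched as a single decision. The field names is_stale, has_contradictions, avg_confidence, from_card, and the stigmem:card:summary relation come from this page; the `Card` dataclass, the helper names, and the dict shape of the synthetic fact are hypothetical stand-ins for the node's internal types.

```python
from dataclasses import dataclass

@dataclass
class Card:
    summary: str
    is_stale: bool
    has_contradictions: bool
    avg_confidence: float

def card_is_usable(card, min_avg_confidence=0.5):
    """True only when all three fast-path conditions hold."""
    return (
        card is not None  # a failed refresh is modelled here as None
        and not card.is_stale
        and not card.has_contradictions
        and card.avg_confidence >= min_avg_confidence
    )

def facts_for_entity(entity, card, raw_facts):
    """Return one synthetic card fact, or fall through to the raw facts."""
    if card_is_usable(card):
        return [{"entity": entity, "relation": "stigmem:card:summary",
                 "value": card.summary, "from_card": True}]
    return raw_facts

raw = [{"entity": "e", "relation": "status", "value": "green"}]
fresh = Card("Service is healthy.", False, False, 0.8)
stale = Card("Service is healthy.", True, False, 0.8)
print(facts_for_entity("e", fresh, raw)[0]["relation"])
print(facts_for_entity("e", stale, raw) is raw)
```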

Fetching or forcing a card refresh. Use GET /v1/cards/{entity_uri} to inspect a card directly or force a server-side refresh with ?refresh=true. See the Memory Cards guide for the full lifecycle.
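Because an entity URI contains ":" and "/" characters, a client will likely need to percent-encode it before placing it in the path. The sketch below assumes the node expects a fully percent-encoded URI segment, which this page does not confirm; `card_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

def card_url(base_url, entity_uri, refresh=False):
    """Build the GET /v1/cards/{entity_uri} URL, optionally forcing a refresh."""
    # Assumption: the entity URI is percent-encoded as a single path segment.
    path = "/v1/cards/" + quote(entity_uri, safe="")
    query = "?" + urlencode({"refresh": "true"}) if refresh else ""
    return base_url.rstrip("/") + path + query

url = card_url(
    "http://localhost:8765",
    "stigmem://company.example/project/api-service",
    refresh=True,
)
print(url)
```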

Security and access control

Garden ACL filtering

All recall results are filtered by the caller's garden ACL (Spec-02) at query time — callers never see facts from gardens they don't have read access to.

Content sanitizer

The recall-time sanitizer (ADR-003) strips known prompt-injection sentinels and bidirectional-override characters from value fields.
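Clients that post-process values themselves can apply a similar defence for the bidirectional-control part. The sketch below removes the Unicode bidirectional embedding, override, and isolate controls; it is an illustration only, since the exact character set and the sentinel list used by the ADR-003 sanitizer are not given on this page.

```python
import re

# Bidirectional embedding/override controls (U+202A to U+202E) and
# isolate controls (U+2066 to U+2069).
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")

def strip_bidi_controls(value):
    """Remove bidirectional control characters from a fact value."""
    return BIDI_CONTROLS.sub("", value)

# U+202E (right-to-left override) can visually reorder the text after it.
dirty = "status: ok\u202e;drop"
clean = strip_bidi_controls(dirty)
print(repr(clean))
```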

Source-trust multiplication

source_trust on each result reflects the identity strength of the writing agent. Effective confidence = confidence × source_trust. Facts whose effective confidence falls below min_confidence are excluded.
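The filter amounts to a one-line product and comparison. The sketch below applies the formula stated above with the default min_confidence of 0.1; the dict-based fact shape and the helper names are assumptions for illustration.

```python
def effective_confidence(fact):
    """Effective confidence = confidence x source_trust."""
    return fact["confidence"] * fact["source_trust"]

def filter_by_confidence(facts, min_confidence=0.1):
    """Keep facts whose effective confidence is at least min_confidence."""
    return [f for f in facts if effective_confidence(f) >= min_confidence]

facts = [
    {"value": "deployed v2", "confidence": 0.9, "source_trust": 0.5},   # 0.45
    {"value": "maybe broken", "confidence": 0.3, "source_trust": 0.2},  # 0.06
]
kept = filter_by_confidence(facts)
print([f["value"] for f in kept])
```

A highly confident claim from a weakly identified writer can therefore drop out, while a moderate claim from a trusted writer stays in.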