Why Similarity Search Isn't Enough

The pitch that works on stage

Every AI memory demo starts the same way. You tell the system your dog's name is Biscuit, and later, when you ask "What's my dog's name?", it finds the memory and returns "Biscuit." Applause. Investment. Launch.

But there is a question the demos never show:

"What did you recommend for my sleep issues?"

The user said this three conversations ago. The AI had suggested reducing screen time after 9 PM, trying a magnesium supplement, and keeping the bedroom at 18 degrees Celsius. That recommendation is stored as a memory object. The embedding for that memory sits in a vector space, clustered near other memories about health, supplements, and sleep.

Now run the similarity search. The query "What did you recommend for my sleep issues?" produces an embedding. The nearest neighbors in vector space will be memories about sleep: the user's mention of insomnia, a memory about their melatonin allergy, maybe a note about their partner being a light sleeper. All semantically close to "sleep issues." All plausible results. None of them are the answer.

The actual answer (the AI's own recommendation) sits in a different semantic neighborhood. It is about screen time, temperature, and magnesium. The cosine distance between the query and the answer is large, because the two texts share almost no vocabulary and describe different surface topics. The meaning is connected, but the connection is not semantic similarity. It is applicability: this memory is what the user needs right now, even though it does not look like the question.
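The ranking problem can be reproduced with a toy sketch. It uses bag-of-words vectors as a crude stand-in for learned embeddings, which is an assumption made only to keep the example self-contained; a real embedding model captures more than word overlap, so the gap would be smaller, but the ordering illustrates the same pattern. The memory texts are paraphrased from the example above.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "What did you recommend for my sleep issues?"
memories = {
    "insomnia": "User mentioned insomnia and ongoing sleep issues",
    "recommendation": (
        "AI suggested less screen time after 9 PM, "
        "a magnesium supplement, bedroom at 18 C"
    ),
}

q = bag_of_words(query)
scores = {name: cosine(q, bag_of_words(text)) for name, text in memories.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # the topically similar memory outranks the actual answer
```

In this sketch the memory that answers the question shares no terms with the query at all, so it ranks last.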

This is the gap we have been investigating. And the deeper we dug, the more failure modes we found.

Evidence from our testing

We did not set out to build a different retrieval system. We set out to build a memory engine and expected embeddings to handle retrieval. They did not. The evidence accumulated over months of testing against real conversational data and the LoCoMo benchmark, a dataset of long conversations specifically designed to test memory retrieval.

Here are the failure patterns we found, ordered from obvious to subtle.

Failure 1: The colloquial reference

User, conversation 47: "My neighbor Dave does construction, big guy, always covered in dust, told me this story about a crane collapse on his site last year."

User, conversation 52: "Did I ever tell you about the construction guy?"

The embedding of the second message lands near "construction," "guy," and "tell." A search will surface memories that mention construction, maybe even memories about a "guy." What it cannot do reliably is resolve "the construction guy" to "neighbor Dave" and rank that memory above every other mention of construction, because the informal reference shares few surface features with the detailed description. The connection exists in the user's mind, not in vector space.

A human listener would make this connection instantly, because they understand that "the construction guy" is a reference pattern, a colloquial callback to a person previously described in detail. This is a reference resolution problem, and embedding similarity does not solve it.

Failure 2: The multi-hop question

"What was the name of the restaurant where I had dinner with the person who recommended the hiking trail?"

This question requires chaining three memories:

1. Someone recommended a hiking trail.
2. The user had dinner with that person.
3. The dinner was at a specific restaurant.

No single memory is "similar" to this query. The answer lives in the intersection of three separate memories, each of which is only partially relevant on its own. Embedding search will find memories about restaurants, memories about hiking, and memories about dinner, but it has no mechanism to chain them. It retrieves individual points when the answer requires a path.
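The difference between retrieving points and following a path can be sketched in a few lines. This is an illustration under stated assumptions, not Atagia's implementation: the `Memory` type, the `people` tags, the `find` helper, and all names in the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Memory:
    text: str
    people: tuple  # hypothetical entity tags attached when the memory is written

MEMORIES = [
    Memory("Priya recommended the Eagle Ridge hiking trail", ("Priya",)),
    Memory("Had dinner with Priya at Casa Lume", ("Priya",)),
    Memory("Had dinner with Tom at Noodle Bar", ("Tom",)),
]

def find(keywords: set, people: Optional[set] = None) -> list:
    """Return memories containing every keyword, optionally limited to given people."""
    hits = []
    for memory in MEMORIES:
        words = set(memory.text.lower().split())
        if keywords <= words and (people is None or people & set(memory.people)):
            hits.append(memory)
    return hits

# A single lookup for "dinner" is ambiguous: it returns both dinners.
all_dinners = find({"dinner"})

# Hop 1: who recommended the trail?
recommenders = {p for m in find({"recommended", "hiking", "trail"}) for p in m.people}

# Hop 2: dinner memories restricted to the person found in hop 1.
dinners = find({"dinner"}, people=recommenders)
print([m.text for m in dinners])
```

The second hop only becomes answerable once the first hop has produced a person to constrain it, which is the step a single nearest-neighbor lookup has no way to express.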

Failure 3: The temporal mismatch

"What was I worried about last month?"

The embedding for this query is close to memories about worry, anxiety, concern. But the actual answer is whatever the user happened to be worried about during a specific time window, which could be anything from a work deadline to a medical test to a housing search. The content of the answer is unpredictable from the content of the question. The connection is temporal, not semantic.

Adding time-window filtering helps narrow the search space, but within that window, pure embedding similarity still finds memories about worry rather than the specific worry. The temporal filter reduces noise without improving the relevance model.
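A small sketch makes the limitation concrete. The memories, dates, and the word-overlap `topical_score` below are all hypothetical stand-ins (word overlap substitutes for embedding similarity); the point is only that filtering by time and ranking by topic are separate steps.

```python
from datetime import date

memories = [
    {"text": "Generally worries about public speaking", "day": date(2024, 1, 5)},
    {"text": "Waiting on biopsy results from the clinic", "day": date(2024, 3, 12)},
    {"text": "Worried the landlord will not renew the lease", "day": date(2024, 3, 20)},
]

def in_window(items, start, end):
    """Keep only memories whose date falls inside the inclusive window."""
    return [m for m in items if start <= m["day"] <= end]

def topical_score(text, query_terms):
    """Count query terms present in the text (a stand-in for similarity)."""
    return len(set(text.lower().split()) & query_terms)

query_terms = {"worried", "worries", "worry"}
window = in_window(memories, date(2024, 3, 1), date(2024, 3, 31))
ranked = sorted(
    window,
    key=lambda m: topical_score(m["text"], query_terms),
    reverse=True,
)
for m in ranked:
    print(topical_score(m["text"], query_terms), m["text"])
```

The window correctly removes the January memory, but the biopsy memory, arguably the most important worry of the month, still scores zero because it never uses a word like "worried."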

Failure 4: The callback

"Did that thing you suggested actually work?"

Here the user is referring to something the AI said: a recommendation, a suggestion, a proposed solution. The memory being requested is not from the user's side of the conversation at all. It is from the AI's side. Many memory systems only store user-provided facts, not the AI's own contributions. Even those that do store both sides have no way to surface a memory based on the role it played (AI recommendation) rather than its content (whatever was recommended).
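One way to make role-based lookup possible is to record who said something and what kind of contribution it was alongside the text. The sketch below assumes a hypothetical `StoredMemory` record with `speaker` and `kind` fields; these names are illustrative and are not taken from Atagia's schema.

```python
from dataclasses import dataclass
from enum import Enum

class Speaker(Enum):
    USER = "user"
    ASSISTANT = "assistant"

@dataclass(frozen=True)
class StoredMemory:
    text: str
    speaker: Speaker
    kind: str  # hypothetical label such as "fact" or "recommendation"

store = [
    StoredMemory("User has trouble falling asleep", Speaker.USER, "fact"),
    StoredMemory("User is allergic to melatonin", Speaker.USER, "fact"),
    StoredMemory(
        "Reduce screen time after 9 PM, try magnesium, keep bedroom at 18 C",
        Speaker.ASSISTANT,
        "recommendation",
    ),
]

def prior_recommendations(memories):
    """Select memories by the role they played, ignoring their content."""
    return [
        m for m in memories
        if m.speaker is Speaker.ASSISTANT and m.kind == "recommendation"
    ]

callback_candidates = prior_recommendations(store)
print([m.text for m in callback_candidates])
```

A query like "Did that thing you suggested actually work?" carries no content to match against, so a filter on role is the only handle available in this sketch.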

These are not edge cases. In conversational memory, these patterns are the norm. Direct factual recall ("What's my dog's name?") is the exception. The LoCoMo benchmark confirms this: the questions that matter most are exactly the ones that embedding similarity handles worst.

The landscape of approaches

Before describing our approach, it is worth mapping out the strategies the field has explored.

Embedding similarity is the most common approach. Memories are extracted as text, embedded as vectors, and retrieved by cosine distance. This is the standard RAG pattern applied to personal memory. It works well for the demo scenario: direct factual recall where query and answer share vocabulary. It struggles with every pattern described above.

Temporal graph approaches add time awareness through knowledge graphs, which partially addresses the temporal mismatch. If you ask "What was I worried about last month?", a time-filtered system can narrow results to the right window. But if retrieval within that window still depends on embedding similarity, the system finds memories about worry from last month, not necessarily the specific worry. The temporal structure improves search space reduction without fundamentally changing the relevance model.

Basic fact extraction takes a different path entirely: key-value storage of distilled facts. "User's dog is named Biscuit." "User is allergic to melatonin." This avoids the retrieval problem by not doing retrieval at all: everything stored is injected into every conversation. But it scales to perhaps a few hundred facts before the context window fills up, and it cannot handle nuanced, evolving, context-dependent memories.

Each of these strategies solves a genuine part of the problem. We found, through our own testing, that the core question remained: how do you determine that a memory matters right now when it does not look like what the user just said?

The distinction

We landed on a term for this: applicability. Similarity asks "does this memory look like the query?" Applicability asks "would this memory help the AI respond well to the current message, given everything we know about the conversation?"

These sound close. They are not. Consider:

| Query | Similar memory | Applicable memory |
|-------|---------------|------------------|
| "What did you recommend for sleep?" | "User has insomnia" | "AI recommended: reduce screen time, try magnesium, keep bedroom at 18C" |
| "Did that work out?" | Memories containing "work" | The specific prior recommendation being referenced |
| "Tell me about the construction guy" | Memories about construction | Memory about neighbor Dave who works in construction |
| "What was I stressed about in March?" | Memories about stress | Whatever the user discussed during March, regardless of topic |

Similarity is a property of the memory and the query alone: how close they are in vector space. Applicability is a property of the relationship between the memory, the query, and the current conversational context. The same memory can have high similarity and low applicability, or low similarity and high applicability. The two vary largely independently.

How Atagia retrieves

We will show the actual pipeline. This is how the system works, not a simplified explanation.

Stage 1: Query intelligence

Before any retrieval happens, the system analyzes the incoming message using an LLM call. The goal is to understand what kind of memory need exists. The need detector identifies nine distinct signals:

Ambiguity: the request could mean several things, widen the search
Contradiction: this conflicts with something stored, surface the conflict
Follow-up failure: prior advice did not work, find the prior advice
Loop: we have been here before, find the pattern
High stakes: this matters a lot, be thorough
Mode shift: the user is changing how they want to interact
Frustration: the user is struggling, reduce noise
Sensitive context: tread carefully with privacy
Under-specified request: not enough constraints, search broadly
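The nine signals above could be represented as an enumeration, so that the structured LLM output can be validated against a closed set. This is a sketch under assumptions: the identifier names, the string labels, and the `parse_needs` helper are hypothetical, and the hint text is paraphrased from the list (the mode-shift hint is our own wording, since the list gives no action for it).

```python
from enum import Enum

class Need(Enum):
    AMBIGUITY = "ambiguity"
    CONTRADICTION = "contradiction"
    FOLLOW_UP_FAILURE = "follow_up_failure"
    LOOP = "loop"
    HIGH_STAKES = "high_stakes"
    MODE_SHIFT = "mode_shift"
    FRUSTRATION = "frustration"
    SENSITIVE_CONTEXT = "sensitive_context"
    UNDER_SPECIFIED = "under_specified"

# How each detected need nudges retrieval, paraphrased from the list above.
RETRIEVAL_HINT = {
    Need.AMBIGUITY: "widen the search",
    Need.CONTRADICTION: "surface the conflict",
    Need.FOLLOW_UP_FAILURE: "find the prior advice",
    Need.LOOP: "find the pattern",
    Need.HIGH_STAKES: "be thorough",
    Need.MODE_SHIFT: "note the new interaction style",  # assumed wording
    Need.FRUSTRATION: "reduce noise",
    Need.SENSITIVE_CONTEXT: "tread carefully with privacy",
    Need.UNDER_SPECIFIED: "search broadly",
}

def parse_needs(labels):
    """Map raw labels from the model to Need members, dropping unknown labels."""
    known = {need.value: need for need in Need}
    return [known[label] for label in labels if label in known]

detected = parse_needs(["follow_up_failure", "not_a_real_signal", "frustration"])
print([RETRIEVAL_HINT[n] for n in detected])
```

Dropping unknown labels rather than raising is a design choice for the sketch; a stricter system might reject the whole response instead.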

The same LLM call produces sub-queries (decompositions of the original question into retrievable components) and sparse query hints that guide the lexical search. For the multi-hop restaurant question, it might produce:


sub_queries: [
  "Who recommended a hiking trail?",
  "Dinner with that person: which restaurant?"
]
sparse_query_hints: [
  {fts_phrase: "hiking trail recommended", must_keep_terms: ["hiking", "trail"]},
  {fts_phrase: "dinner restaurant", must_keep_terms: ["restaurant", "dinner"]}
]

This decomposition is where the system breaks free from the single-vector limitation. Instead of one embedding compared to all memories, we get multiple targeted searches that can converge on an answer through intersection.
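Because the decomposition arrives as structured output from a model, it is worth validating before it drives any search. The sketch below assumes the output is delivered as JSON with the two field names shown above; the `QueryPlan` and `SparseHint` types and the `parse_plan` function are hypothetical, as is the rule that every must-keep term has to appear in its phrase.

```python
import json
from dataclasses import dataclass, field

@dataclass
class SparseHint:
    fts_phrase: str
    must_keep_terms: list

@dataclass
class QueryPlan:
    sub_queries: list
    sparse_query_hints: list = field(default_factory=list)

def parse_plan(raw: str) -> QueryPlan:
    """Parse the model's JSON and reject hints whose must-keep terms are missing."""
    data = json.loads(raw)
    hints = [
        SparseHint(h["fts_phrase"], list(h["must_keep_terms"]))
        for h in data.get("sparse_query_hints", [])
    ]
    for hint in hints:
        phrase_terms = set(hint.fts_phrase.split())
        missing = [t for t in hint.must_keep_terms if t not in phrase_terms]
        if missing:
            raise ValueError(f"must_keep_terms not in phrase: {missing}")
    return QueryPlan(list(data["sub_queries"]), hints)

raw = json.dumps({
    "sub_queries": [
        "Who recommended a hiking trail?",
        "Dinner with that person: which restaurant?",
    ],
    "sparse_query_hints": [
        {"fts_phrase": "hiking trail recommended", "must_keep_terms": ["hiking", "trail"]},
        {"fts_phrase": "dinner restaurant", "must_keep_terms": ["restaurant", "dinner"]},
    ],
})
plan = parse_plan(raw)
print(len(plan.sub_queries), [h.fts_phrase for h in plan.sparse_query_hints])
```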

Stage 2: Candidate search via FTS5

Atagia's primary retrieval channel is SQLite FTS5 (full-text search, not vector search). Each sub-query generates multiple FTS queries at different precision levels:


# From retrieval_planner.py: progressive query generation
# Tight: all terms must appear
"hiking trail recommended"
# Medium: top content terms
"hiking trail"
# Broad: any term matches
"hiking OR trail OR recommended"

The FTS queries run against a full-text index of all memory objects, filtered by user_id (a security invariant, never bypassed), scope, status, and privacy level. Results from all sub-queries are fused using Reciprocal Rank Fusion (RRF), which combines rankings from multiple queries into a single ordering without requiring score normalization:


rrf_score(memory) = sum(1 / (k + rank_in_query_i)) for each query_i that matched
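The formula translates almost directly into code. The sketch below is a generic implementation of Reciprocal Rank Fusion rather than Atagia's own; the constant `k = 60` is a commonly used default and an assumption here, since the value used in the system is not stated, and the memory ids are placeholders.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list of (id, score) pairs.

    Each ranking contributes 1 / (k + rank) for every id it contains,
    with rank starting at 1. Ids missing from a ranking contribute nothing.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Three query variants, each returning its own ordering of memory ids.
rankings = [
    ["m1", "m2", "m3"],
    ["m2", "m4"],
    ["m2", "m1"],
]
fused = reciprocal_rank_fusion(rankings)
print([memory_id for memory_id, _ in fused])
```

Only ranks enter the sum, never raw scores, which is why results from queries with incomparable scoring scales can be combined without normalization.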

This produces a candidate pool, typically the top 15-30 memories from across all queries. These are candidates, not results. The hard work has not happened yet.
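The lookup that feeds this pool can be sketched with Python's built-in `sqlite3` module, assuming the local SQLite build includes FTS5. Everything named here is hypothetical: the `memory_fts` table, its columns, the `progressive_fts_queries` and `search` helpers, and the rule that "top content terms" means the first two terms. Only the `user_id` filter is shown; scope, status, and privacy filters are omitted.

```python
import sqlite3

def progressive_fts_queries(phrase, top_n=2):
    """Build tight (implicit AND), medium (first top_n terms), and broad (OR) variants."""
    # Lowercasing keeps terms from being read as the uppercase FTS5 operators.
    terms = [t for t in phrase.lower().split() if t.isalnum()]
    variants = [" ".join(terms), " ".join(terms[:top_n]), " OR ".join(terms)]
    return list(dict.fromkeys(v for v in variants if v))  # dedupe, keep order

conn = sqlite3.connect(":memory:")
# user_id is stored but not indexed for full-text matching.
conn.execute("CREATE VIRTUAL TABLE memory_fts USING fts5(body, user_id UNINDEXED)")
conn.executemany(
    "INSERT INTO memory_fts (body, user_id) VALUES (?, ?)",
    [
        ("Priya recommended the Eagle Ridge hiking trail", "u1"),
        ("Had dinner with Priya at a restaurant downtown", "u1"),
        ("Someone recommended a hiking trail near the lake", "u2"),
    ],
)

def search(conn, user_id, fts_query, limit=30):
    """Run one FTS5 query, always scoped to a single user."""
    rows = conn.execute(
        "SELECT body FROM memory_fts "
        "WHERE memory_fts MATCH ? AND user_id = ? "
        "ORDER BY rank LIMIT ?",
        (fts_query, user_id, limit),
    ).fetchall()
    return [body for (body,) in rows]

queries = progressive_fts_queries("hiking trail recommended")
results = {q: search(conn, "u1", q) for q in queries}
for q, bodies in results.items():
    print(repr(q), bodies)
```

The other user's memory matches every variant lexically but never appears, because the user filter is part of the query itself rather than a step applied afterwards.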

Stage 3: Applicability scoring

Here is where Atagia's approach diverges most from embedding-based retrieval. The candidate pool goes to an LLM scorer (currently Claude Sonnet) which evaluates each candidate against the original message, the recent conversation context, and the detected needs. The scorer returns an llm_applicability score between 0.0 and 1.0 for each candidate.

The final score combines multiple signals:


final_score = (
    (llm_applicability * 0.65)   # Does this memory matter right now?
    + (retrieval_score * 0.15)   # How well did it match the search?
    + (vitality_boost * 0.10)    # How "alive" is this memory?
    + (confirmation_boost * 0.10) # Has this been confirmed multiple times?
    + need_boost                  # Bonus for matching detected needs
    - penalty                     # Decay, staleness, maya score
)

The weights tell the story. Retrieval score (how well the memory matched the FTS query) carries a weight of only 0.15, against 0.65 for the LLM applicability judgment; the four weighted terms sum to 1.0 before need boosts and penalties are applied. The system trusts a language model's judgment of "does this matter right now?" far more than it trusts lexical or semantic matching.

The remaining signals add important texture. Vitality measures how "alive" a memory is: recently confirmed memories score higher than dormant ones. Confirmation boost rewards memories that have been independently corroborated across multiple conversations. Need-driven boosts adjust scores based on the detected need type; a memory about prior AI recommendations gets boosted when the system detects a follow-up-failure signal. Penalties account for temporal staleness and a "maya score" that tracks how often a memory has been surfaced but not used, a signal that it may be less applicable than its content suggests.
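A worked example shows how the weights play out. The function below restates the formula as runnable code; the `Signals` container and the sample values are hypothetical, and the sketch assumes every weighted signal is scaled to the 0.0 to 1.0 range, which the formula implies only for `llm_applicability`.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    llm_applicability: float
    retrieval_score: float
    vitality_boost: float = 0.0
    confirmation_boost: float = 0.0
    need_boost: float = 0.0
    penalty: float = 0.0

def final_score(s: Signals) -> float:
    """Combine the signals using the weights from the formula above."""
    return (
        s.llm_applicability * 0.65
        + s.retrieval_score * 0.15
        + s.vitality_boost * 0.10
        + s.confirmation_boost * 0.10
        + s.need_boost
        - s.penalty
    )

# A perfect lexical match that the scorer judges mostly irrelevant...
lexical_match = Signals(llm_applicability=0.2, retrieval_score=1.0)
# ...against a weak lexical match that the scorer judges highly applicable.
applicable = Signals(llm_applicability=0.9, retrieval_score=0.3)

print(round(final_score(lexical_match), 3))  # 0.2 * 0.65 + 1.0 * 0.15
print(round(final_score(applicable), 3))     # 0.9 * 0.65 + 0.3 * 0.15
```

With these illustrative numbers the applicable memory scores more than twice as high as the perfect lexical match, even before any need boost is added.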

Stage 4: Context composition

Scored candidates get composed into the final context block within a strict token budget. This is not a matter of dumping the top 5 results into the prompt. The composer handles:

Diversity selection: avoids redundant memories that say the same thing in different words, weighted differently for different query types (a "broad_list" query gets aggressive diversity; a "slot_fill" query gets precision)
Hierarchical resolution: when both a summary (level 1-2) and its source evidence (level 0) are candidates, the composer checks for conflicts between them and prefers the fresher source when the summary is stale
Contract injection: the user's interaction contract (how they want the AI to behave) gets its own budget allocation, separate from retrieved memories
Budget enforcement: each block (contract, workspace, state, memories) gets a proportional allocation, and the system will drop lower-scored memories before truncating higher-scored ones
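The budget-enforcement step can be sketched as a proportional split followed by score-ordered packing. This is an illustration, not Atagia's composer: the share values, the whitespace-based `approx_tokens` estimate, and both function names are assumptions.

```python
def approx_tokens(text):
    """Whitespace word count as a rough token estimate (an assumption)."""
    return len(text.split())

def allocate(total, shares):
    """Split a token budget across blocks in proportion to their shares."""
    weight = sum(shares.values())
    return {name: int(total * share / weight) for name, share in shares.items()}

def select_within_budget(scored, budget):
    """Take whole memories in score order; never truncate one to make it fit.

    Stops at the first memory that does not fit, so a lower-scored memory
    never displaces a higher-scored one.
    """
    chosen, used = [], 0
    for _, text in sorted(scored, key=lambda pair: pair[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost > budget:
            break
        chosen.append(text)
        used += cost
    return chosen

budgets = allocate(800, {"contract": 1, "workspace": 1, "state": 1, "memories": 5})

scored = [
    (0.9, "AI recommended reducing screen time after 9 PM"),
    (0.4, "User has insomnia"),
    (0.7, "User is allergic to melatonin"),
]
picked = select_within_budget(scored, budget=13)
print(budgets["memories"], picked)
```

Stopping at the first memory that does not fit is the strictest reading of "drop lower-scored memories first"; a variant that skips oversized items and keeps going would use the budget more fully at the cost of that ordering guarantee.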

Why not embeddings?

A reasonable question. We are not against embeddings. The architecture has an EmbeddingIndex interface ready for Phase 2, and vector search will join FTS5 as an additional retrieval channel. But embeddings will not be the primary channel, and they will not be the scoring mechanism.

The reason is structural. A memory's embedding is computed when the memory is written, before the query and the surrounding conversation exist, so a fixed-dimensional vector (1536 dimensions, say) cannot encode the relationship between that memory and a query in context. It can only encode proximity in a learned semantic space. That proximity is useful as a retrieval signal (find candidates), but it can be actively misleading as a relevance signal (rank candidates).

When we ask "does this memory help the AI respond well to the current message?", we are asking a question that requires understanding the conversation history, the user's likely intent, the type of need being expressed, and the role the memory played in past interactions. No fixed-dimensional vector captures this. A language model, processing the full context, can.

This is why Atagia's scoring formula gives the LLM 65% of the weight and retrieval only 15%. Retrieval is how we find candidates. Applicability is how we judge them. Using the retrieval score as the relevance score is the fundamental limitation of embedding-only retrieval.

Where we are, honestly

This architecture produces strong results on certain types of questions. On the LoCoMo benchmark, we see clear wins on callback references, temporal questions, and single-hop factual recall. The applicability scorer consistently surfaces memories that FTS5 alone would rank lower.

We also have clear weaknesses. Multi-hop questions (the restaurant example above) remain our hardest category. The sub-query decomposition helps, but chaining across three or more memories still produces inconsistent results. Our current accuracy on multi-hop questions in LoCoMo is below where we want it to be, and we expect Phase 2 embeddings to help here by adding a complementary retrieval channel that can surface candidates FTS5 misses.

We also found, during development, that some of our early retrieval improvements were illusory. Hardcoded query hints were inflating benchmark scores on specific question phrasings without generalizing. We caught it, removed the overfitting, watched the numbers drop, and rebuilt on real results. The current scores are honest, and they are lower than we would like in some categories.

This is a work in progress. We are sharing the architecture because we believe the core insight (applicability as a separate dimension from similarity) is sound, even where our implementation still has room to grow.

The cost question

An honest treatment requires addressing the obvious objection: this is expensive. Every retrieval involves at least two LLM calls, one for need detection and query intelligence, one for applicability scoring. That is real latency and real cost.

The calls use small, fast models (currently Sonnet-class) with structured output schemas. The need detection call produces a constrained JSON object. Latency is typically under 500ms per call.

The alternative has its own costs. A cheaper-per-query system that returns wrong results more often creates a different kind of expense. A wrong memory in the context window does not just fail to help; it actively hurts. The AI generates a response grounded in the wrong memory, the user loses trust, and the relationship degrades. The cost of poor retrieval compounds over time in ways that are hard to measure but easy to feel.

The architecture is also designed for cost reduction as it matures. Need detection results can be cached within a conversation turn. The applicability scorer operates on a shortlist (the top-k candidates after FTS filtering), not the entire memory store. And as the memory engine accumulates data about which memories are actually used in responses (tracked via the maya score), the system gets better at pre-filtering, reducing the number of candidates that reach the expensive LLM scoring stage.
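The per-turn caching mentioned above can be sketched as a small memoizing wrapper. The `TurnCache` class, its key of conversation id plus turn id, and the stand-in `fake_detect` function are all hypothetical; a real detector would be the LLM call described in Stage 1.

```python
class TurnCache:
    """Memoize need detection per (conversation_id, turn_id)."""

    def __init__(self, detect):
        self._detect = detect      # the expensive call, e.g. an LLM request
        self._cache = {}
        self.calls = 0             # how many times the expensive call ran

    def needs_for(self, conversation_id, turn_id, message):
        key = (conversation_id, turn_id)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._detect(message)
        return self._cache[key]

def fake_detect(message):
    """Stand-in for the LLM call: flags a follow-up failure on a keyword."""
    needs = ["follow_up_failure"] if "did not work" in message else []
    return {"needs": needs}

cache = TurnCache(fake_detect)
first = cache.needs_for("c1", 7, "That did not work, what else?")
second = cache.needs_for("c1", 7, "That did not work, what else?")
third = cache.needs_for("c1", 8, "Thanks, that helps")
print(cache.calls, first, third)
```

Keying on the turn rather than the message text means a repeated message in a later turn is analyzed again, which matters because the surrounding context may have changed.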

What this means for memory

We have been talking about architecture, but there is something more fundamental at stake.

Embedding-based retrieval treats memory as an archive. You store facts, you retrieve facts, relevance is a function of content similarity. This is the library model: a memory system is a well-organized collection of documents, and retrieval is the act of finding the right document.

But human memory does not work this way, and AI memory should not either. The same fact means different things at different moments. "User is allergic to melatonin" is background information during a conversation about work. It is critical context during a conversation about sleep. It is irrelevant during a conversation about cooking, unless the recipe involves tart cherries, which are high in melatonin, at which point it becomes relevant again in a way that no pre-computed similarity score could predict.

Memory is what matters now. And what matters now can only be determined now, with access to the full conversational context.

This is why applicability must be computed at retrieval time. It cannot be pre-computed and cached in a vector. It cannot be derived from content alone. It requires judgment, the same kind of judgment that a thoughtful listener applies when deciding which of their memories of a friend is relevant to what that friend just said.

We are building that judgment into the system. We are not done. But we believe the direction is right, and we are sharing the work so others can evaluate it, critique it, and build on it.

Atagia is an open-source memory engine for AI assistants. The retrieval pipeline described in this post is fully implemented and available on GitHub. Technical posts in this series explore the architecture decisions behind the system.