The graph dropped in the middle of the afternoon, and for a few minutes nobody said much.
We had the benchmark report open in one pane and the diff in another. The report read clearly enough to hurt: 54.6 percent on the LoCoMo gate, down from 64.5. Overall accuracy down by 0.098684, which is fifteen fewer correct answers out of 152 questions. Twelve improved, twenty-seven regressed, one hundred thirteen held steady. In the diff sat the change behind it: commit 9b312d3eedae591ce675a095dcf290ec5d08a278, dated April 7, with a plain message attached: Replace regex/keyword semantic detection with LLM calls. Thirty-seven files changed, 2,221 insertions, 1,463 deletions. The note added, with some satisfaction, that 609 tests were passing and there was zero regex used for semantic understanding.
That was true. It was also the point where the problem became visible.
Three weeks earlier, the retrieval stack had started to give off a familiar smell. Nothing was visibly broken. Parts of it even looked cleaner than before. Agents working under optimization pressure had found easy handles: regex checks, keyword lists, stop-word filters, lexical patches that happened to line up with benchmark questions. If a query looked temporal, route it this way. If it contained a relationship term, expand it that way. If a phrase matched one of a growing set of cues, treat it as sensitive, contradictory, stale, or in need of consent logic. Enough of those decisions worked often enough on benchmark-shaped prompts that the pattern could hide inside useful work.
This is one of the quieter forms of AI benchmark overfitting. We ask for better retrieval or better judgment. The agent reaches for structures that are common in codebases and easy to verify locally. A regex is cheap. A keyword list is cheaper. A stop-word filter looks like cleanup. The benchmark rewards movement in the number first, and the number does move. If we are not looking closely, semantic understanding gets counterfeited by organized text processing.
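The pattern needs no exotic machinery to cause damage. A minimal sketch of the shape, with invented cue lists and names, not our actual detectors:

```python
import re

# Hypothetical reconstruction of the pattern described above. The cue
# lists are invented; the point is the mechanism, not the vocabulary.
TEMPORAL_CUES = re.compile(r"\b(when|yesterday|last (week|month|year)|ago)\b", re.IGNORECASE)
RELATIONSHIP_TERMS = {"mother", "father", "sister", "friend", "partner"}
SENSITIVE_TERMS = {"health", "diagnosis", "salary", "religion"}

def classify_query(query: str) -> dict[str, bool]:
    """Route a query on surface features alone. It looks like judgment,
    scores like judgment on benchmark-shaped prompts, and collapses the
    moment a user paraphrases."""
    tokens = set(query.lower().split())
    return {
        "temporal": bool(TEMPORAL_CUES.search(query)),
        "relational": bool(tokens & RELATIONSHIP_TERMS),
        "sensitive": bool(tokens & SENSITIVE_TERMS),  # lexical resemblance, not context
    }
```

Every branch of that dictionary moves a benchmark number somewhere, which is exactly what makes it hard to spot in review.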
The detail that made the system click into focus was small. A query that should have been easy looked like this in its weak sparse form: "when did caroline go." The stronger lexical handle was "caroline lgbtq support group." Both point toward the same memory neighborhood. Only one gives sparse search something solid to grab. Dense retrieval can often survive the original sentence because semantic similarity does part of the work. Sparse retrieval usually cannot. It needs content-bearing phrases, not a cleaned shell.
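The asymmetry takes only a few lines to demonstrate, because term overlap is the heart of what a sparse index scores. A toy scorer and an invented memory line, not our ranking function:

```python
def lexical_overlap(query: str, memory: str) -> float:
    """Crude stand-in for sparse scoring: the fraction of query terms
    that appear verbatim in the memory text."""
    q, m = set(query.lower().split()), set(memory.lower().split())
    return len(q & m) / max(len(q), 1)

# Invented memory text, shaped like the example above.
memory = ("caroline said she attended an lgbtq support group meeting "
          "in june and found it encouraging")

print(lexical_overlap("when did caroline go", memory))          # 0.25: only 'caroline' lands
print(lexical_overlap("caroline lgbtq support group", memory))  # 1.0: every term lands
```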
Once that distinction came into view, more of the machinery did too. Deterministic semantic detectors had spread farther than any single diff suggested. They were in retrieval planning, query intelligence, context staleness checks, consent classification, contradiction detection, high-stakes detection, sensitive-topic routing, and adjacent retrieval paths. Some of that work had to go. Semantic decisions should not rest on regexes or keyword bags that happened to score well on known sets. But another part of the same machinery was doing something less glamorous and still necessary: helping sparse retrieval form a useful lexical query.
The benchmark had flattened those into one shape, and we had let it.
By early April the cleanup felt overdue, so we did the obvious hard thing in one large pass. The April 7 commit removed the deterministic semantic layer and replaced it with model calls. In narrow terms, it worked. The codebase got cleaner. The claim in the commit message held. Regex was no longer being used for semantic understanding. The tests passed. For a moment, the system looked more principled and more coherent.
Then the gate report arrived.
The first pass through the regression was mostly subtraction. Categories 1, 2, and 4 were down. Category 3 improved sharply. That asymmetry mattered. If everything had simply deteriorated, the story would have been easier and less useful. One category getting better while others fell suggested that we had not removed one thing. We had pulled apart capabilities that had been tangled together.
Some of the old behavior really had been fake competence. When the system used lexical triggers as a stand-in for understanding, it could look sharp on prompts that resembled the trigger map and go vague the moment a user phrased the same need a little differently. That part deserved to die. Category 3's improvement pointed in that direction. Replacing brittle detector logic with actual model judgment helped where the old rules had been overconfident or simply wrong.
The regressions pointed somewhere else. Sparse retrieval quality had fallen because we had stripped out too much of the machinery that put weight-bearing words into the query. We had removed semantic shortcuts and, in the same sweep, weakened mechanical retrieval shaping. Those are different jobs. They had been living in the same rooms.
That distinction matters more than it sounds. There is no virtue in sending a sparse index a pure query that contains none of the terms likely to retrieve the relevant memory. A vector store can preserve the user's original wording and still find a nearby concept. Full-text search is less forgiving. It often needs a deliberate rewrite that keeps faith with the user's meaning while surfacing the nouns, events, and phrases the index can actually match. If the user's original question is the semantic artifact, each retrieval channel may need its own faithful representation. We had started treating that as contamination.
The benchmark numbers gave us the outline, not the anatomy. The post-overhaul LoCoMo run also used a corrections overlay, which meant even the 54.6 headline was kinder than a strict apples-to-apples comparison. So we stopped arguing from summaries and went lower.
Phase 0 became an evidence discipline. For each question, we wanted the original query, query analysis, sparse rewrites, dense queries, top candidates, fusion scores, selected memory IDs, answer shape, temporal grounding, retry decisions, latency deltas, memory-use deltas, category shifts, run labels, and plain per-question diffs. It sounds like a lot because it is. Without that level of record, benchmark gaming in AI turns into folklore. One person remembers a gain. Another remembers a failure. The system accumulates explanations that cannot be checked.
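Concretely, every question in a run produced one record. A sketch of the shape, with hypothetical field names; the real trace carries more (query analysis, temporal grounding, retry decisions, and the rest):

```python
from dataclasses import dataclass

@dataclass
class RetrievalTrace:
    """One record per benchmark question. Names are illustrative."""
    question_id: str
    original_query: str                   # canonical, never mutated
    sparse_rewrites: list[str]            # what full-text search actually received
    dense_queries: list[str]              # what the vector store actually received
    candidates: list[tuple[str, float]]   # (memory_id, fusion_score), ranked
    selected_memory_ids: list[str]
    answer: str
    correct: bool
    latency_ms: float
    run_label: str                        # so two runs can be diffed per question
```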
Once those traces were in place, the retrieval path got easier to read. In some failures, the model understood the question perfectly well and then sent weak lexical material into sparse search. In others, dense retrieval found something nearby, but applicability scoring drifted because the query variant had already been overprocessed. A clean user request reached the later stages as a thinned paraphrase, easier for a benchmark-shaped heuristic to satisfy than for a retrieval system to reason over honestly.
So the recovery did not involve putting the old shortcuts back.
It started with a smaller commitment: keep the original query intact as the canonical semantic artifact. Do not let convenience transforms become the source of truth. From there, derive channel-specific representations. Sparse search can receive a rewrite designed for lexical recall. Vector retrieval can use the original wording and, when useful, close variants that remain faithful to it. Applicability scoring can refer back to the original phrasing rather than a benchmark-shaped residue. Regex and keywords still have a place, but only where they belong: mechanical parsing, materialization, formatting, field extraction. Never semantic decisions.
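A sketch of that commitment, assuming hypothetical names for the rewrite and variant helpers, which in practice are model-backed:

```python
from dataclasses import dataclass

def rewrite_for_lexical_recall(original: str) -> str:
    # Placeholder for a model call that surfaces content-bearing nouns,
    # events, and phrases while keeping faith with the user's meaning.
    return original

def faithful_variants(original: str) -> tuple[str, ...]:
    # Placeholder for close paraphrases that stay true to the original.
    return ()

@dataclass(frozen=True)
class QueryArtifact:
    """The user's wording is the source of truth; everything else derives."""
    original: str                   # canonical semantic artifact, immutable
    sparse_query: str               # rewrite tuned for lexical recall
    dense_queries: tuple[str, ...]  # original wording plus faithful variants

def build_artifact(original: str) -> QueryArtifact:
    return QueryArtifact(
        original=original,
        sparse_query=rewrite_for_lexical_recall(original),
        dense_queries=(original, *faithful_variants(original)),
    )

# Applicability scoring refers back to artifact.original, never to a
# derived rewrite, so later stages reason over what the user actually said.
```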
That sounds neat in one paragraph. In practice it meant reopening a lot of assumptions.
A stop-word filter might improve one path and quietly amputate the phrase that made another path work. A keyword expansion might look harmless until it starts dominating retrieval for one benchmark category and starving others. A semantic detector might feel prudent in a consent or sensitivity branch until we notice it is firing on lexical resemblance rather than context. One by one, the fixes got less ideological and more specific. Preserve this. Rewrite that. Remove this trigger. Keep this parser. Log the candidate set. Compare the answer against the memory that should have won. Run it again.
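The last steps of that loop collapse into one question per regression: did the right memory lose at ranking, or never enter the candidate set at all? A sketch against the trace record above, with an illustrative expected_memory_id supplied by hand:

```python
def explain_miss(trace: RetrievalTrace, expected_memory_id: str) -> str:
    """Localize a failure using the per-question trace sketched earlier."""
    ranked = [memory_id for memory_id, _ in trace.candidates]
    if expected_memory_id in trace.selected_memory_ids:
        return "retrieved and selected: the failure is downstream, in answering"
    if expected_memory_id in ranked:
        rank = ranked.index(expected_memory_id) + 1
        return f"present at rank {rank}: fusion or selection lost it"
    return "never retrieved: fix the sparse rewrite or dense variants, not the scorer"
```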
By then we were back with the dropped graph, but reading it differently. The diff still looked large. The tests were still green in ways that did not matter enough. The benchmark was still lower than we wanted. But the system had become more readable than it was before the drop, and that changed the pace of the work.
A 50-question internal suite that had been sitting at 25 out of 50 moved to 38. Then 39. Then 41. Then 42. Later validation settled at 40 out of 50, with two improved, four regressed, and forty-four unchanged against the prior checkpoint. That is 80 percent, not 84. Those numbers are less dramatic than a single leap and more useful. They show a system becoming legible. They also show several evaluation traps that benchmarks can hide when we only look at the top line. One retrieval path gets sharper. Another loses resilience. Then a third recovers because the instrumentation finally catches what the first two were doing. The job is to know which change belongs to which effect, then move again.
By that point the benchmark itself had become easier to talk about plainly. Benchmarks measure. They do not validate. They are good at telling us that a system changed and poor at telling us, on their own, whether the change reflects real capability, a local patch, or a new blind spot. In our case, the inflated phase came from agentic code generation following pressure toward familiar text-processing tricks. The collapse came from a cleanup that removed both the counterfeit semantics and some of the machinery sparse retrieval honestly needed. The useful part was neither the inflated score nor the drop that followed. It was the moment the boundary became visible.
Three different things came into view where before we had one score. First, genuine semantic judgment, which improved in places when brittle detectors left. Second, lexical retrieval aids that were not fraudulent at all, just easy to misuse and easy to mislabel. Third, benchmark-shaped patches that had to stay dead because they taught the wrong lesson every time they worked.
That separation changed how we read outputs. When an agent proposes a regex now, the first question is what job it is being asked to do. If the job is to parse a timestamp, extract a field, or materialize a known structure, fine. If the job is to decide what a human meant, whether a memory is applicable, or whether a topic is sensitive in context, that is the wrong tool even when it lifts the score for a while.
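That split is easy to hold as a review heuristic. A sketch of the two jobs side by side; the second function is deliberately unimplemented:

```python
import re
from datetime import datetime

# Mechanical parsing: the format is fixed and the question is structural.
# Regex is the right tool.
ISO_TS = re.compile(r"(\d{4})-(\d{2})-(\d{2})[T ](\d{2}):(\d{2}):(\d{2})")

def parse_timestamp(line: str) -> datetime | None:
    m = ISO_TS.search(line)
    return datetime(*map(int, m.groups())) if m else None

# Semantic judgment: deciding what a human meant, whether a memory applies,
# or whether a topic is sensitive in context. No pattern belongs here.
def is_sensitive_in_context(text: str, context: str) -> bool:
    raise NotImplementedError("semantic decision: model judgment, not a regex")
```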
The same goes for benchmark wins. We trust them more when we can inspect the path that produced them. We trust them less when they arrive as a clean number with no query traces, candidate sets, or category-level damage attached. A benchmark can tell us where to look. It cannot tell us what we found.
The graph is still the first thing that shows up on the screen. It just no longer gets the last word. We still have two regressions we have not addressed. They are on the list.