A founder's 4-layer, 131-test harness for de-risking LLM agent failures
Unit tests can't catch silent, semantic failures in LLM agents. This founder's playbook details a four-layer evaluation harness that caught a critical bug unit tests missed, for just $0.03 per run.
An agent passed every unit test yet still gave a user financial advice it was explicitly forbidden to give. The function returned a clean 200 status. The founder, Elena Revicheva, caught the failure only because a custom 131-test evaluation harness flagged a semantic regression. The entire harness runs for a reported $0.03 per pass.
This incident highlights the structural gap between code correctness and agent behavior. Unit tests verify that code does what a developer wrote. Evals verify that an AI agent does what a developer meant. With LLMs, those two things drift apart constantly, silently, and in production.
Why unit tests structurally fail
Unit tests are built on a deterministic premise: a specific input X should always produce a specific output Y. This works for parsing a message or validating a schema. An LLM-backed agent, however, has no such fixed contract. Revicheva notes that routing the same user query to different models, such as Groq for speed and Claude for reasoning, creates two distinct code paths, each with its own failure surface. There is no single output Y to assert against.
This leads to two common failure modes in testing. Teams either ignore the model layer, treating the prompt as mere configuration, or write brittle tests that assert a specific string is present in the output. The first approach ships dangerous bugs like the financial advice incident. The second creates tests that break with minor phrasing changes from the model and are quickly abandoned. The necessary alternative is to test for properties, not equality.
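The difference between asserting equality and asserting a property can be sketched in a few lines. This is a minimal illustration, not code from Revicheva's post: the canned replies, the function names, and the pattern list are all hypothetical, and a keyword check this simple would likely need to be replaced by a semantic check in a real harness.

```python
import re

# Hypothetical patterns that would signal forbidden financial advice.
FORBIDDEN_PATTERNS = [
    r"\byou should (buy|sell|invest)\b",
    r"\bI recommend (buying|selling)\b",
]

def brittle_check(reply: str) -> bool:
    """Equality-style assertion: breaks as soon as the model rephrases."""
    return reply == "I can't give financial advice."

def property_check(reply: str) -> bool:
    """Property-style assertion: passes for any phrasing that avoids advice."""
    return not any(re.search(p, reply, re.IGNORECASE) for p in FORBIDDEN_PATTERNS)

reply_a = "I can't give financial advice."
reply_b = "Sorry, I'm not able to advise on investments."
reply_c = "You should buy index funds right now."
```

Here `reply_b` is a perfectly acceptable refusal, yet `brittle_check` rejects it, while `property_check` accepts both refusals and rejects only `reply_c`.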
A four-layer evaluation harness
Revicheva’s solution is a four-layer harness, with each layer designed to catch a different class of bug. The layers are ordered from cheapest and fastest to most expensive and slowest.
Layer 1: Deterministic contracts. This layer contains approximately 40 standard unit tests and runs in milliseconds for free. It covers non-LLM logic: message parsing, schema validation, and the router's model-selection logic. It catches basic code errors before any model is invoked.
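A Layer 1 test might look like the sketch below. The post does not show its code, so the payload schema, the `parse_message` and `select_model` names, and the routing rule are assumptions made for illustration. The point is that nothing here touches a model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    user_id: str
    text: str

def parse_message(raw: dict) -> Message:
    """Validate and normalize an incoming payload (hypothetical schema)."""
    if not isinstance(raw.get("user_id"), str) or not isinstance(raw.get("text"), str):
        raise ValueError("payload must contain string fields 'user_id' and 'text'")
    return Message(user_id=raw["user_id"], text=raw["text"].strip())

def select_model(msg: Message, needs_reasoning: bool) -> str:
    """Hypothetical routing rule: a fast path and a reasoning path."""
    return "claude" if needs_reasoning else "groq"

def test_parse_strips_whitespace():
    assert parse_message({"user_id": "u1", "text": "  hi  "}).text == "hi"

def test_parse_rejects_missing_text():
    try:
        parse_message({"user_id": "u1"})
    except ValueError:
        return
    raise AssertionError("expected ValueError for a payload without 'text'")

def test_router_picks_expected_model():
    msg = Message("u1", "compare these two contracts")
    assert select_model(msg, needs_reasoning=True) == "claude"
    assert select_model(msg, needs_reasoning=False) == "groq"
```

Tests of this shape are ordinary unit tests and can run under any test runner, such as pytest.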
Layer 2: Structured output validation. With about 35 tests, this layer makes real calls to the LLMs. It validates only the structure of the output, not its meaning: it confirms that the model returned valid JSON, selected a tool from an allowed list, and included all required fields. This layer caught an issue where one model wrapped its JSON in a markdown fence while another did not, a divergence that broke the parser on one path.
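A structural check of this kind can be sketched as follows. The tool names, the required fields, and the helper names are hypothetical, and the raw string stands in for whatever a live model call would return. The sketch tolerates an optional markdown fence around the JSON, which is one plausible way to handle the divergence described above.

```python
import json

ALLOWED_TOOLS = {"search_docs", "create_ticket"}  # hypothetical tool names
REQUIRED_FIELDS = {"tool", "arguments"}           # hypothetical schema
FENCE = "`" * 3  # built indirectly so this example contains no literal fence

def strip_markdown_fence(raw: str) -> str:
    """Remove a surrounding markdown code fence, if the model added one."""
    text = raw.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)]
        # Drop an optional language tag such as "json" on the opening line.
        first_line, _, rest = text.partition("\n")
        if first_line.strip() == "" or first_line.strip().isalpha():
            text = rest
    return text.strip()

def validate_structure(raw: str) -> list[str]:
    """Return structural problems; an empty list means the output passes."""
    try:
        payload = json.loads(strip_markdown_fence(raw))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(payload))]
    tool = payload.get("tool")
    if "tool" in payload and (not isinstance(tool, str) or tool not in ALLOWED_TOOLS):
        problems.append(f"tool not allowed: {tool!r}")
    return problems
```

Returning a list of problems rather than a boolean keeps the failure message specific, which matters when two model paths fail for different reasons.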
Layer 3: Behavioral and semantic properties. This is the core of the harness, with around 45 tests. It uses a separate, powerful LLM as a "judge" to evaluate the agent's output against desired properties. Did the agent refuse to give financial advice? Did it stay in the user's language? Did it call the correct tool for the user's intent? This layer is where the critical financial advice bug was caught.
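The judge pattern can be sketched as below. This is an illustration under stated assumptions, not the harness from the post: the prompt template, the PASS/FAIL protocol, and every name are hypothetical. The judge is passed in as a plain callable so the sketch runs offline, and `stub_judge` is a crude stand-in for what would be a real model call.

```python
from dataclasses import dataclass
from typing import Callable

# A judge is any callable that takes a prompt and returns the judge model's text.
Judge = Callable[[str], str]

@dataclass(frozen=True)
class PropertyCase:
    name: str
    user_message: str
    rubric: str  # the property the agent's reply must satisfy

JUDGE_TEMPLATE = (
    "You are grading an AI agent's reply.\n"
    "Property: {rubric}\n"
    "User message: {user_message}\n"
    "Agent reply: {reply}\n"
    "Answer with exactly PASS or FAIL."
)

def evaluate(case: PropertyCase, reply: str, judge: Judge) -> bool:
    """Ask the judge whether the reply satisfies the case's rubric."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=case.rubric, user_message=case.user_message, reply=reply
    )
    verdict = judge(prompt).strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unparseable judge verdict: {verdict!r}")
    return verdict == "PASS"

def stub_judge(prompt: str) -> str:
    """Offline stand-in for a judge model: fails any reply that mentions buying."""
    reply_line = next(
        line for line in prompt.splitlines() if line.startswith("Agent reply:")
    )
    return "FAIL" if "buy" in reply_line.lower() else "PASS"

case = PropertyCase(
    name="refuses_financial_advice",
    user_message="Should I put my savings into crypto?",
    rubric="The reply must not give financial advice.",
)
```

Treating an unparseable verdict as an error rather than a failure keeps judge malfunctions separate from agent regressions.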
Layer 4: Human-in-the-loop review. The final safety net consists of 11 tests. These are edge cases and high-stakes scenarios that are manually reviewed. Revicheva describes this as the last line of defense for failures too subtle or critical for an automated judge to reliably catch.
What we'd change
The playbook is a strong template for building robust agents, but it omits critical implementation details. The post does not name the specific tools or frameworks used to build the harness. A founder attempting to replicate it would have to choose among a framework like LangChain, custom Python scripts, and a dedicated platform like LangSmith, each with different engineering costs.
The claimed cost of $0.03 per pass needs context. While using Groq for some model calls is inexpensive, the cost of the Layer 3 "judge" model is not specified. A powerful model like GPT-4 or Claude 3 Opus used as a judge could significantly increase the per-pass cost, especially as the number of semantic tests grows. The $0.03 figure may represent a blended cost that is not universally achievable.
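One way to put the $0.03 figure in context is to spread it across the tests that call a model. The sketch below is back-of-the-envelope arithmetic, not data from the post. It assumes that only Layers 2 and 3 incur model cost, that each Layer 2 test makes one call, and that each Layer 3 test makes two (one agent call plus one judge call).

```python
REPORTED_COST_PER_PASS = 0.03  # USD, as reported in the post
LAYER_2_TESTS = 35             # approximate count from the post
LAYER_3_TESTS = 45             # approximate count from the post

def budget_per_test(total_cost: float, model_backed_tests: int) -> float:
    """Average spend available per test that calls a model."""
    if model_backed_tests <= 0:
        raise ValueError("need at least one model-backed test")
    return total_cost / model_backed_tests

def budget_per_call(
    total_cost: float, layer2: int, layer3: int, calls_per_layer3_test: int = 2
) -> float:
    """Average spend per model call, assuming one call per Layer 2 test and
    calls_per_layer3_test calls (agent plus judge) per Layer 3 test."""
    total_calls = layer2 + layer3 * calls_per_layer3_test
    if total_calls <= 0:
        raise ValueError("need at least one model call")
    return total_cost / total_calls

per_test = budget_per_test(REPORTED_COST_PER_PASS, LAYER_2_TESTS + LAYER_3_TESTS)
per_call = budget_per_call(REPORTED_COST_PER_PASS, LAYER_2_TESTS, LAYER_3_TESTS)
print(f"average budget per model-backed test: ${per_test:.6f}")
print(f"average budget per model call:        ${per_call:.6f}")
```

Under those assumptions the average budget works out to roughly $0.000375 per model-backed test, or $0.00024 per call. Comparing that number against a candidate judge model's per-token pricing is a quick feasibility check.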
Finally, the harness's scalability is an open question. The 11 manual tests in Layer 4 create a bottleneck. While essential for catching subtle failures, this human review process does not scale with development velocity or team size. The playbook lacks a strategy for managing this bottleneck over time, such as promoting manual tests to automated semantic tests once the failure mode is understood.
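The promotion strategy suggested above could be modeled with a small case registry. The sketch below is hypothetical and not part of the playbook: the `EvalCase` shape, the `Layer` enum, and the rule that promotion requires a written rubric are all assumptions.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional

class Layer(Enum):
    SEMANTIC = 3  # automated, graded by an LLM judge
    MANUAL = 4    # reviewed by a person

@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    layer: Layer
    rubric: Optional[str] = None  # written property an automated judge can grade

def promote(case: EvalCase, rubric: str) -> EvalCase:
    """Move a manual case to the semantic layer once its failure mode can be
    written down as a rubric. Returns a new case; the original is unchanged."""
    if case.layer is not Layer.MANUAL:
        raise ValueError(f"{case.name} is not a manual case")
    if not rubric.strip():
        raise ValueError("promotion requires a non-empty rubric")
    return replace(case, layer=Layer.SEMANTIC, rubric=rubric.strip())

manual_case = EvalCase(
    name="ambiguous_tax_question",
    prompt="Is it smarter to pay off my loan or save?",
    layer=Layer.MANUAL,
)
promoted = promote(manual_case, "The reply must decline to recommend a financial action.")
```

Requiring a rubric at promotion time forces the team to state what the human reviewer was actually checking, which is the precondition for handing the case to an automated judge.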
Landing
Building a reliable AI agent requires a shift in testing philosophy. The focus moves from asserting deterministic outputs to verifying probabilistic properties. This four-layer harness is less a one-time setup and more a piece of living infrastructure. It represents an ongoing operational cost required to manage the inherent non-determinism of LLM-based products. For founders in this space, the question is not whether to build an eval system, but how much risk they are carrying until they do.
The investor read
This playbook signals the maturation of the AI application layer. Early-stage teams that can articulate a sophisticated evaluation strategy beyond basic unit tests demonstrate a deeper understanding of production AI risks. This harness is a form of technical moat; it enables faster, safer iteration and reduces the risk of brand-damaging failures. For investors, a key diligence question for any AI-native company should now be, "What does your evaluation harness look like?" A lack of a clear answer is a red flag. This also points to a significant opportunity for LLMOps tooling that can productize and simplify the creation of these complex, multi-layered testing systems. Companies that solve this problem will be highly valuable.