Tactics·Jul 2, 2026

How FamNest built an LLM judge to test its AI coach

Evaluating non-deterministic AI output is a scaling challenge. FamNest's founder built a custom harness, using mechanical checks to mitigate common LLM biases and ensure product quality.

Virginia Mwega, founder of FamNest, faced a problem common to AI-native products: testing. Her "coach agent" generates empathetic responses for parents, but there is no assertEqual for empathy. She could either manually review every output change, which does not scale, or build an automated evaluation system and risk it giving false confidence. She built the system.

An LLM to judge an LLM

The solution was an "LLM-as-judge" harness. It automates the process of grading the coach agent's output against a predefined rubric. The harness loops through a set of test cases, generates a response from the coach agent for each, and then uses a separate "judge" LLM to score that response. This converts a subjective quality assessment into a scalable, quantitative process. The founder's key insight was that the judge itself is a fallible model that requires its own system of checks and balances.
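The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not FamNest's code: the names `EvalCase`, `Judgment`, and `run_harness`, the 1-to-5 rubric scale, and the stand-in `fake_coach` and `fake_judge` functions are all assumptions made for the example. In a real harness the two callables would wrap LLM API calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class EvalCase:
    case_id: str
    prompt: str


@dataclass
class Judgment:
    case_id: str
    score: int  # assumed rubric scale: 1 (poor) to 5 (excellent)


def run_harness(
    cases: List[EvalCase],
    coach: Callable[[str], str],
    judge: Callable[[str, str, str], int],
    rubric: str,
) -> List[Judgment]:
    """Generate a coach response per case, then have the judge score it."""
    results = []
    for case in cases:
        response = coach(case.prompt)
        score = judge(rubric, case.prompt, response)
        results.append(Judgment(case.case_id, score))
    return results


# Hypothetical stand-ins so the sketch runs without any API access.
def fake_coach(prompt: str) -> str:
    return f"That sounds hard. Let's take it one step at a time: {prompt}"


def fake_judge(rubric: str, prompt: str, response: str) -> int:
    return 4 if "sounds hard" in response else 2


cases = [
    EvalCase("bedtime", "My toddler refuses to sleep."),
    EvalCase("homework", "My son hides his homework."),
]
results = run_harness(cases, fake_coach, fake_judge, rubric="Score empathy from 1 to 5.")
print("mean score:", mean(j.score for j in results))
```

Passing the coach and judge in as plain callables keeps the loop itself testable with fakes, which matters because the judge is the part that needs its own checks.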

Four ways the judge can lie

The founder's post details four specific failure modes discovered in the harness. First was position bias, where the judge favors the first or second response in a side-by-side comparison regardless of quality. Second was verbosity bias, where the judge rates longer responses as better even when they are less helpful. Third was self-preference, where a judge model favors responses generated by models from its own provider. The final, and perhaps most subtle, was model drift. The underlying judge model can be updated by its provider without notice, silently changing its evaluation criteria and invalidating historical scores.
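Verbosity bias is the easiest of these to probe mechanically. One possible diagnostic, which is our assumption and not something the post describes, is to correlate response length with the judge's score across a batch of judged responses. The helper names `pearson` and `length_score_correlation` and the sample data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def length_score_correlation(judged: List[Tuple[str, int]]) -> float:
    """Correlate word count with judge score over (response, score) pairs."""
    lengths = [len(response.split()) for response, _ in judged]
    scores = [score for _, score in judged]
    return pearson(lengths, scores)


# Made-up sample: scores rise with length.
judged = [
    ("short reply", 2),
    ("a somewhat longer reply here", 3),
    ("an even longer reply with many more words in it", 5),
]
r = length_score_correlation(judged)
print(f"length/score correlation: {r:.2f}")
```

A strong positive correlation is a prompt to look closer, not proof of bias, since longer answers are sometimes genuinely better. Comparing the same correlation on human-labeled scores gives a baseline.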

Mechanical fixes for model biases

Mwega reports that the solutions were not complex prompting techniques but simple, mechanical procedures. To counter position bias, the harness shuffles the order of responses in every pairwise comparison. To guard against model drift, she pins the judge LLM to a specific model version, so scoring stays consistent over time. Finally, she created a small, human-labeled "anchor set" of test cases. The judge's scores are periodically checked against this ground truth, and any deviation in its scoring behavior is flagged.
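Two of these procedures, order shuffling and the anchor-set check, can be sketched as follows. This is an illustration under stated assumptions, not the harness from the post: `compare_shuffled`, `anchor_agreement`, the "first"/"second" verdict format, and the `tolerance` parameter are all hypothetical.

```python
import random
from typing import Callable, List, Tuple


def compare_shuffled(
    judge_pair: Callable[[str, str, str], str],
    prompt: str,
    response_a: str,
    response_b: str,
    rng: random.Random,
) -> str:
    """Show the pair in random order; return the winner as "a" or "b"."""
    swapped = rng.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    verdict = judge_pair(prompt, first, second)  # assumed to return "first" or "second"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"


def anchor_agreement(
    anchor_set: List[Tuple[str, int]],
    judge_score: Callable[[str], int],
    tolerance: int = 0,
) -> float:
    """Fraction of human-labeled anchors the judge scores within tolerance."""
    if not anchor_set:
        raise ValueError("anchor set is empty")
    hits = sum(
        1 for response, human in anchor_set
        if abs(judge_score(response) - human) <= tolerance
    )
    return hits / len(anchor_set)


# A judge that always prefers whatever is shown first (pure position bias).
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"


rng = random.Random(42)
wins = [compare_shuffled(always_first, "p", "x", "y", rng) for _ in range(200)]
print("share of wins for a:", wins.count("a") / len(wins))
```

With shuffling, a purely position-biased judge splits its wins roughly evenly between the two responses, so the bias shows up as noise instead of a systematic advantage. A drop in `anchor_agreement` between runs is the signal to investigate the judge before trusting new scores.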

What we'd change

The post is an excellent tactical guide to building a judge harness, but it omits two critical operational factors: cost and test set quality. Running a large evaluation suite can be expensive, because each judgment is a separate LLM call. The post does not address the cost-benefit analysis of running the full suite versus a subset on each code change. The quality of the evaluation is also entirely dependent on the quality of the test cases. A poor or unrepresentative set of tests will produce a green dashboard that means nothing, even with an unbiased judge. Beyond those two gaps, the tooling landscape has changed. While building a custom harness was once a necessity, managed platforms like LangSmith and open-source frameworks now offer similar capabilities. For a founder today, the first decision is whether to build or buy, not just how to build.
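To make the full-suite-versus-subset trade-off concrete, here is a rough sketch of one way to approach it. Nothing here comes from the post: the function names, the hash-based selection scheme, and every number (suite size, calls per case, price per call) are placeholder assumptions.

```python
import hashlib


def in_smoke_subset(case_id: str, fraction: float) -> bool:
    """Deterministically assign a case to the per-commit subset.

    Hashing the case ID keeps the subset stable across runs, so scores
    from one commit to the next cover the same cases.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(fraction * 2**64)


def estimated_cost(n_cases: int, calls_per_case: int, cost_per_call: float) -> float:
    """Back-of-envelope cost of one evaluation run."""
    return n_cases * calls_per_case * cost_per_call


# Placeholder numbers: 500 cases, one coach call plus one judge call each,
# at an assumed 0.01 currency units per call.
case_ids = [f"case-{i}" for i in range(500)]
subset = [c for c in case_ids if in_smoke_subset(c, 0.2)]

full_cost = estimated_cost(len(case_ids), calls_per_case=2, cost_per_call=0.01)
subset_cost = estimated_cost(len(subset), calls_per_case=2, cost_per_call=0.01)
print(f"full suite: {full_cost:.2f}, per-commit subset: {subset_cost:.2f}")
```

One plausible split is to run the stable subset on every change and the full suite on a schedule or before releases. Whether a 20 percent subset is representative is exactly the test-set-quality question raised above.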

Landing

The work described is not just about building a testing tool. It represents the operational rigor required to move AI products from impressive demos to reliable services. If you don't design around the judge's known biases, you build a green dashboard that means nothing. The most effective guardrails are often not sophisticated prompts, but disciplined, mechanical processes that account for the model's inherent fallibility.

The investor read

This post signals a founder focused on product quality and reliability in a category prone to hype. Building a process-driven moat like an evaluation harness, rather than just chasing model performance, is a sign of operational maturity. For an early-stage, bootstrapped product like FamNest, this demonstrates a capital-efficient approach to de-risking the core technology. While the product itself remains unverified, this focus on repeatable, automated quality assurance is precisely what investors look for in teams building durable AI applications. It suggests a founder building for the long term, not just a demo.

Sources · how we verified
  1. Evaluating Agents With an LLM-as-Judge Harness (Without Kidding Yourself About It)

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.