A 6-phase pipeline for AI agents that separates fact from inference
To prevent AI research agents from presenting inferences as facts, use a deterministic pipeline where the LLM only extracts claims. Rule-based code must handle all scoring and labeling.
An AI research agent that mixes retrieved facts with its own conclusions is a liability. The model might report a market size of 1.2 trillion won (a retrieved data point) and then infer the market is "growing fast" (a conclusion), presenting both with equal confidence. For any decision with stakes, this ambiguity is unacceptable.
The solution is not a better prompt. It is a structural division of labor. The LLM should never be allowed to decide what constitutes a fact. That judgment must be handled by deterministic, rule-based code that is auditable and reproducible.
The LLM extracts, code judges
The core of the architecture is a hard separation of duties. The LLM is used for its strength in parsing unstructured text into structured claims. All subsequent steps, including scoring, cross-checking, and labeling, are executed by deterministic code.
| The LLM does | Deterministic code does |
|---|---|
| Extract claims from a fetched page; summarize a passage | Score, cross-check, sort, deduplicate, label FACT/INFERENCE, decide freshness |
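One way to enforce the left-hand column mechanically is to give the LLM's output a schema that has no place to put a label. The sketch below is a minimal illustration, not the author's implementation. It assumes the model is asked to return a JSON array of objects with hypothetical `text` and `source_url` fields. Any extra key, such as a self-assigned `label`, is rejected instead of being silently dropped.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: the only fields the LLM may return for a claim.
ALLOWED_KEYS = {"text", "source_url"}


@dataclass(frozen=True)
class ExtractedClaim:
    """A claim as extracted by the LLM. Note that there is no label field."""
    text: str
    source_url: str


def parse_llm_output(raw_json: str) -> list[ExtractedClaim]:
    """Parse the model's JSON array and refuse anything outside the schema."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of claims")
    claims = []
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each claim must be a JSON object")
        extra = set(item) - ALLOWED_KEYS
        missing = ALLOWED_KEYS - set(item)
        if extra or missing:
            # A self-assigned "label" or "confidence" ends up here and fails loudly.
            raise ValueError(f"schema violation: extra={sorted(extra)}, missing={sorted(missing)}")
        claims.append(ExtractedClaim(text=item["text"], source_url=item["source_url"]))
    return claims
```

Failing loudly, instead of stripping the field, makes a prompt that drifts toward labeling its own output visible in logs.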
This split provides two critical guarantees. First, reproducibility: given the same extracted claims, the pipeline assigns the same FACT and INFERENCE labels on every run. Second, no laundering: the model cannot promote its own guess to the status of a fact, because it never controls the labeling process.
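Reproducibility is straightforward to test once labeling is a pure function. Here is a minimal sketch. It assumes claims arrive as dicts with hypothetical `text` and `sources` keys and uses a placeholder two-source rule. The code merges duplicates, sorts everything, serializes with a stable key order, and hashes the result. Two runs over the same claims should produce the same digest, even when fetches complete in a different order.

```python
import hashlib
import json


def label_for(sources: set[str]) -> str:
    # Placeholder rule for illustration: FACT needs two or more distinct sources.
    return "FACT" if len(sources) >= 2 else "INFERENCE"


def build_report(claims: list[dict]) -> str:
    """Merge, label, and serialize claims in an order-independent way."""
    merged: dict[str, set[str]] = {}
    for claim in claims:
        merged.setdefault(claim["text"], set()).update(claim["sources"])
    rows = [
        {"text": text, "sources": sorted(sources), "label": label_for(sources)}
        for text, sources in sorted(merged.items())
    ]
    # sort_keys makes the JSON byte-for-byte stable across runs.
    return json.dumps(rows, sort_keys=True)


def fingerprint(report: str) -> str:
    """A digest that can be compared across runs or stored alongside the output."""
    return hashlib.sha256(report.encode("utf-8")).hexdigest()
```

If the digest changes between runs on identical input, something non-deterministic has leaked into the scoring path.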
A six-phase pipeline for provenance
The author proposes an explicit six-phase pipeline to enforce this separation. Each phase is a distinct, testable component.
- PLAN: The initial user query is broken down into specific sub-queries and a list of sources to consult.
- HARVEST: The system fetches data from the planned sources. This stage is purely data collection and does not involve an LLM.
- NORMALIZE: This is the only phase where the LLM operates. It reads the raw, harvested content and extracts structured claims from each source.
- CORROBORATE: Claims are grouped, and the system counts the number of independent sources backing each one.
- SCORE: Rules are applied to assign labels. A claim might be labeled FACT only if it meets a strict criterion, such as corroboration from two or more independent sources.
- RENDER: The final output presents the labeled FACTs and INFERENCEs, along with an explicit list of information gaps.
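The six phases compose naturally as plain functions. The sketch below is a toy that assumes two hypothetical callables: `fetch` for HARVEST and `extract`, which stands in for the LLM in NORMALIZE. Both are stubbed with canned data so the example runs offline. Corroboration here matches claims by exact text, which is deliberately naive.

```python
from collections import defaultdict
from typing import Callable

Page = tuple[str, str]  # (source_url, page_text)


def plan(query: str) -> list[str]:
    # PLAN: hypothetical fixed decomposition into sub-queries.
    return [f"{query} market size", f"{query} growth rate"]


def harvest(sub_queries: list[str], fetch: Callable[[str], list[Page]]) -> dict[str, list[Page]]:
    # HARVEST: data collection only, no LLM involved.
    return {q: fetch(q) for q in sub_queries}


def normalize(pages: list[Page], extract: Callable[[str], list[str]]) -> list[tuple[str, str]]:
    # NORMALIZE: the only step that calls the model (here, the `extract` stub).
    return [(claim, url) for url, text in pages for claim in extract(text)]


def corroborate(claims: list[tuple[str, str]]) -> dict[str, set[str]]:
    # CORROBORATE: collect the distinct sources behind each claim.
    support: dict[str, set[str]] = defaultdict(set)
    for claim, url in claims:
        support[claim].add(url)
    return support


def score(support: dict[str, set[str]], min_sources: int = 2) -> dict[str, str]:
    # SCORE: a fixed rule assigns every label; sorted for stable output.
    return {c: "FACT" if len(u) >= min_sources else "INFERENCE" for c, u in sorted(support.items())}


def render(labels: dict[str, str], gaps: list[str]) -> str:
    # RENDER: labeled claims plus an explicit list of information gaps.
    lines = [f"[{label}] {claim}" for claim, label in labels.items()]
    lines += [f"[GAP] no sources found for: {q}" for q in gaps]
    return "\n".join(lines)


def run(query: str, fetch: Callable[[str], list[Page]], extract: Callable[[str], list[str]]) -> str:
    sub_queries = plan(query)
    harvested = harvest(sub_queries, fetch)
    gaps = [q for q in sub_queries if not harvested[q]]
    pages = [page for q in sub_queries for page in harvested[q]]
    return render(score(corroborate(normalize(pages, extract))), gaps)


# Hypothetical offline stubs, for illustration only.
CANNED = {
    "widgets market size": [
        ("https://a.example/report", "Market size is 1.2 trillion won"),
        ("https://b.example/news", "Market size is 1.2 trillion won"),
    ],
}


def fake_fetch(q: str) -> list[Page]:
    return CANNED.get(q, [])


def fake_extract(text: str) -> list[str]:
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(run("widgets", fake_fetch, fake_extract))
```

Running it prints one FACT line for the market-size claim, which two sources back, and one GAP line for the growth-rate sub-query, which returned nothing. Because each phase is a separate function, each one can be unit-tested in isolation.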
Earning the FACT label
Under this model, FACT is not a default state. It is a status that a claim must earn by satisfying a predefined, programmatic rule. The system is designed so a claim is an INFERENCE unless it passes a specific gate, like being verified by an official API or appearing in multiple, distinct sources. If your research agent gives different confidence on the same question across runs, an LLM is scoring somewhere in the pipeline. This architecture is designed to eliminate that variability.
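A gate like this fits in a few lines. This is a minimal sketch that assumes two hypothetical inputs computed upstream: a count of independent sources, and a flag recording whether HARVEST confirmed the value against an official API. Any path that does not explicitly promote the claim falls through to INFERENCE. Each label also carries a human-readable reason for the audit trail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    independent_sources: int        # computed in CORROBORATE
    verified_by_official_api: bool  # hypothetical flag set during HARVEST


def assign_label(ev: Evidence, min_sources: int = 2) -> tuple[str, str]:
    """Return (label, reason). INFERENCE is the default; FACT must be earned."""
    if ev.verified_by_official_api:
        return "FACT", "confirmed by an official API"
    if ev.independent_sources >= min_sources:
        return "FACT", f"{ev.independent_sources} independent sources (threshold {min_sources})"
    return "INFERENCE", f"{ev.independent_sources} independent source(s), below threshold {min_sources}"
```

Returning the reason with the label means a reviewer can see why a claim was promoted without re-running the pipeline.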
What we'd change
This playbook offers a robust path to auditable AI outputs, but its implementation introduces significant trade-offs. The architecture is more complex and expensive than a standard Retrieval-Augmented Generation (RAG) pipeline. The HARVEST and CORROBORATE stages require fetching from and processing multiple sources for a single query, increasing both latency and operational cost.
The system's integrity also depends entirely on the quality of its source material. The CORROBORATE phase assumes that multiple independent, accurate sources exist. If the top-ranked sources for a query all repeat the same incorrect figure, and the pipeline cannot tell that they share a common origin, it will confidently label a falsehood as FACT. The architecture does not solve the garbage-in, garbage-out problem; it only makes the provenance of the garbage transparent.
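Counting by origin rather than by URL narrows this gap but does not close it. The following minimal sketch assumes a hypothetical `cites` map from each page URL to the upstream URL it attributes the figure to (for example, parsed from an explicit citation during NORMALIZE). The code follows citations to the root, reduces each root to its host name, and counts distinct hosts. Pages that copy a figure without citing it still count separately, and host-level grouping is crude. This is a mitigation, not a fix.

```python
from urllib.parse import urlparse


def origin_of(url: str, cites: dict[str, str]) -> str:
    """Follow the (hypothetical) citation chain to its root and return that page's host."""
    seen: set[str] = set()
    while url in cites and url not in seen:  # `seen` guards against citation cycles
        seen.add(url)
        url = cites[url]
    return urlparse(url).netloc.lower().removeprefix("www.")


def independent_count(urls: list[str], cites: dict[str, str]) -> int:
    """Number of distinct origins behind a claim, not the number of URLs."""
    return len({origin_of(u, cites) for u in urls})


# Illustration: two outlets repeat a figure from the same upstream report.
cites = {
    "https://news-a.example/story": "https://stats.example.org/report",
    "https://news-b.example/item": "https://stats.example.org/report",
}
urls = [
    "https://news-a.example/story",
    "https://news-b.example/item",
    "https://www.news-c.example/analysis",
]
```

Here three URLs collapse to two origins, `stats.example.org` and `news-c.example`. A two-source threshold is therefore met by two origins, not by three.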
Finally, the process of corroborating claims is non-trivial. Simple token matching is brittle. Determining if two differently worded statements from separate sources make the same semantic claim is a difficult computer science problem in itself. A naive implementation risks misclassifying nuanced or complex information.
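For numeric claims, one common mitigation is to compare canonicalized values instead of tokens. This technique is introduced here for illustration and is not taken from the source. The sketch handles only amounts written as "N [million|billion|trillion] won." A real matcher would also need to confirm that both claims refer to the same entity, metric, and period, which this sketch does not do.

```python
import math
import re
from typing import Optional

SCALE = {"million": 1e6, "billion": 1e9, "trillion": 1e12}
# Matches e.g. "1.2 trillion won" or "1,200 billion won" (illustrative, not exhaustive).
AMOUNT = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(million|billion|trillion)?\s*won\b", re.IGNORECASE)


def parse_won(text: str) -> Optional[float]:
    """Return the first 'N [scale] won' amount in the text as a number of won."""
    m = AMOUNT.search(text)
    if m is None:
        return None
    value = float(m.group(1).replace(",", ""))
    return value * SCALE.get((m.group(2) or "").lower(), 1.0)


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity of whitespace tokens: the brittle baseline."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def same_amount(a: str, b: str, rel_tol: float = 0.01) -> bool:
    """True if both claims state the same won amount within a relative tolerance."""
    va, vb = parse_won(a), parse_won(b)
    return va is not None and vb is not None and math.isclose(va, vb, rel_tol=rel_tol)


a = "The domestic market reached 1.2 trillion won in 2023"
b = "Sales totaled 1,200 billion won last year"
```

The two sentences share only the token "won," yet `same_amount(a, b)` treats them as the same figure. The sketch matches them even though one sentence describes market size and the other describes sales. That is exactly the over-merging risk the paragraph above warns about.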
Landing
Building a deterministic pipeline is a deliberate choice to prioritize trust and reproducibility over speed and simplicity. It moves an AI agent from a probabilistic tool to an auditable system of record. For founders building products in high-stakes domains like finance, law, or scientific research, where the cost of a hallucination is catastrophic, this architectural discipline is not optional. It is the foundation of an enterprise-ready product.
The investor read
This playbook signals a maturation of the AI/RAG market, moving from 'magic' to auditable reliability. An architecture that separates LLM-based extraction from rule-based verification is a defensive moat against simple API wrappers. It indicates a product built for high-stakes enterprise use cases (finance, legal, medical) where hallucinations create significant liability. While more complex and costly to build and run, this approach creates a stickier, more valuable product. For investors, this is a blueprint for an enterprise-grade AI tool, not a bootstrapped side project. It's a deliberate trade of short-term velocity for long-term defensibility and trust.
Every claim ties to a primary source.