Tactics·Jul 3, 2026

How a free local AI model increased agent costs by up to 5.3x

By Maya · Tactics desk · Human-reviewed · Verified Jul 3, 2026 · 3 sources

An experiment across 40 trials found that using a 'free' local executor model inflated the orchestrator's token usage, making the hybrid system more expensive than using a single powerful model.

Pairing a powerful orchestrator AI with a 'free' local executor model is a common strategy for reducing agent operating costs. An experiment across 40 trials, documented in a public repository, found this approach did the opposite. The configuration using a local Qwen 3.5-9B model as an executor increased the orchestrator's token consumption by 1.4x to 5.3x, making it the most expensive architecture tested.

The research, published on the open research repository Zenodo, directly challenges the intuition that offloading tasks to a zero-cost local model guarantees savings. The hidden cost is not in the executor's tokens, but in the orchestrator's expanding context window.

The four-arm experiment

The test measured the cost and success rate of four distinct agent architectures tasked with identical code-repair jobs. The evaluation was deterministic, using mypy, ruff, and pytest exit codes to judge success without a subjective LLM-as-judge. The goal was to find the most cost-effective setup.
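A deterministic gate of this kind can be sketched in a few lines. This is a minimal illustration, not the experiment's actual harness: the command-line flags in `DEFAULT_CHECKS` and the helper names `run_checks` and `trial_passed` are assumptions, since the write-up names the tools but not how they were invoked.

```python
import subprocess
import sys

# Assumed invocations: the source names the tools, not their flags.
DEFAULT_CHECKS = {
    "mypy": ["mypy", "."],
    "ruff": ["ruff", "check", "."],
    "pytest": ["pytest", "-q"],
}


def run_checks(checks: dict[str, list[str]], cwd: str = ".") -> dict[str, bool]:
    """Run each command and record whether it exited with status 0."""
    results: dict[str, bool] = {}
    for name, cmd in checks.items():
        try:
            proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
            results[name] = proc.returncode == 0
        except FileNotFoundError:
            # A missing tool counts as a failed check, not a crash.
            results[name] = False
    return results


def trial_passed(results: dict[str, bool]) -> bool:
    """A trial succeeds only if every check passed."""
    return all(results.values())


# Harmless demo with stand-in commands instead of the real tools.
demo = run_checks({
    "always_ok": [sys.executable, "-c", "raise SystemExit(0)"],
    "always_bad": [sys.executable, "-c", "raise SystemExit(1)"],
})
print(demo, trial_passed(demo))
```

Passing `DEFAULT_CHECKS` to `run_checks` inside a repaired repository would give the same pass/fail verdict on every run, which is the property that removes the need for an LLM judge.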

The four configurations were:

Arm	Orchestrator	Executor	Role Split
A	Opus 4.7	(solo)	One model does everything
B	Opus 4.7	Qwen 3.5-9B (local)	Opus plans & verifies, Qwen edits
C	Opus 4.7	Haiku 4.5 (cloud)	Opus plans & verifies, Haiku edits
D	Haiku 4.5	(solo)	One cheap model does everything

Arm B represents the common cost-saving strategy. The local Qwen model runs via Ollama, incurring zero direct token costs. Arms A, C, and D provide benchmarks for performance and cost against all-cloud alternatives.

The orchestrator re-read tax

The supposedly 'free' configuration came out as the most expensive cloud arm on all three of the code-repair tasks. The cause was token inflation in the orchestrator. On every turn, the orchestrator model (Opus) had to re-read the entire history, including the summaries and code returned by the executor model (Qwen).

This feedback loop meant that while Qwen's tokens were free, Opus's input prompt grew significantly with each iteration. The total tokens processed by Opus in the hybrid setup were 1.4 to 5.3 times greater than when Opus ran solo. This additional processing cost for the expensive model overwhelmed the savings from the free one.
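The arithmetic behind the re-read tax is easy to reproduce. The sketch below assumes the simplest possible loop, in which the full history is re-sent as input on every turn and no prompt caching or summarization is applied. The function name and every token count are illustrative, not figures from the experiment.

```python
def cumulative_input_tokens(base_prompt: int, added_per_turn: list[int]) -> int:
    """Total input tokens billed when the whole history is re-sent each turn."""
    total = 0
    context = base_prompt
    for added in added_per_turn:
        total += context   # the orchestrator reads everything accumulated so far
        context += added   # this turn's output and any executor reply join the history
    return total


# Illustrative numbers only: four turns, 2,000-token starting prompt.
terse = cumulative_input_tokens(2_000, [500] * 4)      # short replies per turn
verbose = cumulative_input_tokens(2_000, [1_500] * 4)  # long executor summaries

print(terse, verbose, round(verbose / terse, 2))
```

Because each reply is re-read on every later turn, its cost is paid once per remaining iteration. Longer executor output and more iterations compound each other, which is one way a multiple in the range the experiment reports can arise.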

The most balanced cloud option

Among the cloud-only configurations, a two-tiered approach using Opus as the orchestrator and the cheaper Haiku model as the executor proved the most balanced. It offered a middle ground on cost without the significant failure rate of using Haiku alone. The Haiku-solo arm was 5.5 times cheaper than the Opus-solo arm on the largest task, but it failed on 25% of the trials within the iteration cap.
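One way to weigh the Haiku-solo trade-off is cost per solved task rather than cost per trial. The sketch below is a rough reading under stated assumptions: a failed trial costs as much as a successful one, the 5.5x price gap and the 25% failure rate are combined even though the article reports them at different scopes, and Opus-solo cost per trial is normalized to 1.0. The helper `cost_per_success` is hypothetical.

```python
def cost_per_success(cost_per_trial: float, success_rate: float) -> float:
    """Expected spend per solved task, assuming failed trials are wasted."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_trial / success_rate


# Normalized units: Opus-solo cost per trial = 1.0 (assumption for illustration).
haiku_trial_cost = 1.0 / 5.5   # "5.5 times cheaper" on the largest task
haiku_success_rate = 0.75      # 25% of trials failed within the iteration cap

print(round(cost_per_success(haiku_trial_cost, haiku_success_rate), 3))
```

Under these assumptions Haiku alone remains cheaper per solved task in token terms. What the figure leaves out is the cost of handling the failures, such as a human fixing the code or a fallback to a stronger model, and that is where the tiered arm can earn its premium.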

This suggests that for cloud-native agent architectures, a tiered model strategy can be effective. The critical factor is ensuring the orchestration logic does not create an expensive feedback loop that negates the savings from the cheaper model.

What we'd change

The experiment's design is rigorous for its stated goal, but its findings are tied to a specific agent architecture. The core issue is an orchestration pattern where the lead model re-ingests the full output of the sub-agent at every step. This is a common but not universal design.

A more sophisticated agent could use techniques to compress the state or pass only essential information back to the orchestrator. For example, instead of returning a verbose summary, the executor could return a structured object or a diff. This would reduce the token load on the orchestrator, potentially making the hybrid-local model architecture cost-effective again. The experiment's conclusion holds for naive implementations, but may not apply to agents with more advanced context management.
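As a rough illustration of the diff idea, an executor could hand back a small structured result instead of prose plus the full file. The field names and the `compact_result` helper are hypothetical. Only `difflib.unified_diff` from the Python standard library is a real API here.

```python
import difflib


def compact_result(path: str, before: str, after: str, checks_passed: bool) -> dict:
    """Package an edit as a unified diff plus a pass/fail flag."""
    diff = "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
            n=1,  # one line of context keeps the payload small
        )
    )
    return {"path": path, "checks_passed": checks_passed, "diff": diff}


# Hypothetical 50-line file with a single changed line.
before = "".join(f"x{i} = {i}\n" for i in range(50))
after = before.replace("x25 = 25", "x25 = 26")

result = compact_result("settings.py", before, after, checks_passed=True)
print(result["diff"])
print(len(result["diff"]), "characters returned instead of", len(after))
```

The orchestrator then re-reads only the changed lines on later turns. Whether that is enough to make the local-executor arm cheaper than Opus alone is an open question that the experiment did not test.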

Furthermore, the tasks were specific to code repair. The token dynamics might differ for other tasks like web research or content generation, where the executor's output could be more or less verbose. The principle of accounting for the orchestrator's full context cost remains, but the specific inflation multiple (1.4–5.3x) is likely task-dependent.

Landing

The primary takeaway is not that local models are inefficient, but that agent costs are a function of the entire system's token economics. Focusing only on the executor's price per token is a critical error. The most expensive resource in a sophisticated agent is often the context window of its most powerful model. Any architectural choice that inflates that context, even with 'free' inputs, risks creating a system that is more expensive than a simpler, single-model design. The cost of an agent is the sum of its thoughts, and the orchestrator does most of the thinking.
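A whole-system cost model makes the point concrete. Everything in this sketch is a placeholder: the prices are not real vendor rates, the token counts are invented, and the `ModelUsage` class is a hypothetical structure for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelUsage:
    name: str
    input_tokens: int
    output_tokens: int
    usd_per_m_input: float   # placeholder price per million input tokens
    usd_per_m_output: float  # placeholder price per million output tokens

    @property
    def cost(self) -> float:
        billed = (self.input_tokens * self.usd_per_m_input
                  + self.output_tokens * self.usd_per_m_output)
        return billed / 1_000_000


def system_cost(usages: list[ModelUsage]) -> float:
    """Sum the bill across every model in the agent, free ones included."""
    return sum(u.cost for u in usages)


# Invented numbers: the hybrid orchestrator reads 3x the input of the solo run.
solo = [ModelUsage("orchestrator", 100_000, 10_000, 10.0, 50.0)]
hybrid = [
    ModelUsage("orchestrator", 300_000, 6_000, 10.0, 50.0),
    ModelUsage("local-executor", 200_000, 20_000, 0.0, 0.0),
]

print(system_cost(solo), system_cost(hybrid))
```

In this toy case the executor contributes nothing to the bill, yet the hybrid total is more than double the solo total because the orchestrator's input grew. Comparing architectures on `system_cost` rather than on the executor's price is the accounting the article argues for.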

The investor read

This experiment is a signal of increasing maturity in the AI agent development space. The initial 'move fast and prompt things' phase is giving way to rigorous, cost-aware engineering. For investors, the key insight is that sustainable moats in agent-based products will likely come from sophisticated orchestration and token-economic modeling, not simply from access to a powerful foundation model. Companies that demonstrate a deep understanding of total system cost, including the hidden 'tax' of orchestrator context, are better positioned for scalable, profitable operations. This research, though from an independent developer, provides a clear benchmark for due diligence on any startup building complex, multi-model AI systems. A naive cost model is a significant red flag.

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place. How we work · About · File a correction.
