Tactics·Jun 22, 2026

How to find the critical bugs hiding in AI-generated code

An optimization project reduced ATM stockouts from 42% to 25%. A deeper look at its AI-generated code reveals a new playbook for adversarial review when linters and standard checks fail.

A project to optimize cash delivery for armored trucks reduced ATM stockouts from 42% to 25%. The core model, which blended classical operations research with machine learning, cut costs by a claimed 51%. But a subsequent experiment using AI to generate a quantum computing alternative produced code with three critical bugs, none of which were caught by standard linters or code review.

The incident, detailed by the project's author, provides a clear playbook for verifying the output of code-generating AI. The core lesson is that semantic and logical bugs, not syntactic errors, represent the new frontier of technical debt. Standard software validation is insufficient for code that is probabilistically generated.

The baseline: a 42% stockout rate

The project addressed a vehicle routing problem for a cash-in-transit company. A classical approach that planned around average daily demand left 42% of ATMs without cash, a significant operational cost. The team first built a decision-focused machine learning model (SPO+) that accounted for demand uncertainty. This new model was effective, dropping the stockout rate to 25% and cutting overall costs by roughly half (the claimed 51%).

The numbers established the team's competence in the domain. This context is critical. The bugs found later were not the result of inexperience, but of the novel failure modes introduced by AI code generation.

An AI-generated quantum experiment

With a strong baseline established, the team experimented with QAOA (the Quantum Approximate Optimization Algorithm), a hybrid quantum-classical method, to see if it could offer further improvements. They used an AI to generate the implementation. The initial output looked complete and plausible, producing a comparison table showing the new QAOA model's performance against the existing one.

However, the results contained a mathematical inconsistency. The AI-generated QAOA model returned exactly the same optimal energy score (-2.5599) as the classical model, even while reporting that its solution violated the problem's constraints. In a penalty-based formulation, a solution that violates constraints should carry a penalty and therefore score worse than the feasible optimum. This logical contradiction was the only signal that something was deeply wrong.

Finding the invisible penalty bug

The investigation uncovered three bugs. The first was a miscalibrated penalty weight in the cost function. The AI had set the penalty parameter LAMBDA_C to 0.5. For a vehicle that was over capacity by 80,000 units, this produced a penalty of just 0.051. The route costs themselves were in the range of 1.2 to 3.5 units.

The penalty was so small relative to the primary cost function that it was effectively invisible to the optimizer. The QAOA model was free to violate the capacity constraint because it was barely penalized for doing so. The fix was to increase LAMBDA_C to 40.0, a value larger than the maximum possible route cost, making the penalty impossible to ignore. The source notes two other bugs were found, including one where a separate model contradicted its own constraints, but does not detail them further.
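The scale mismatch can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, not the author's code: it assumes the penalty is linear in LAMBDA_C (the source does not give the exact formula), so the violation term is back-derived from the quoted figures as 0.051 / 0.5. The helper names `penalty` and `penalty_is_binding` are hypothetical.

```python
# Figures quoted in the article.
OLD_LAMBDA_C = 0.5
NEW_LAMBDA_C = 40.0
OBSERVED_PENALTY = 0.051   # penalty reported for the 80,000-unit overload
MAX_ROUTE_COST = 3.5       # top of the quoted 1.2 to 3.5 route-cost range

# ASSUMPTION: penalty = LAMBDA_C * violation_term (linear in LAMBDA_C).
# The violation term is back-derived from the article's numbers.
violation_term = OBSERVED_PENALTY / OLD_LAMBDA_C


def penalty(lambda_c: float) -> float:
    """Penalty for the same overload under a given weight (assumed linear)."""
    return lambda_c * violation_term


def penalty_is_binding(lambda_c: float) -> bool:
    """True when the penalty outweighs the most expensive route."""
    return penalty(lambda_c) > MAX_ROUTE_COST


for lam in (OLD_LAMBDA_C, NEW_LAMBDA_C):
    print(f"LAMBDA_C={lam:>5}: penalty={penalty(lam):.3f}, "
          f"binding={penalty_is_binding(lam)}")
```

Under that linearity assumption, the original weight leaves the penalty far below even the cheapest route, while the corrected weight pushes it above the most expensive one. A smaller overload would produce a smaller penalty, so a check like this is most useful when run against the smallest violation that still matters.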

What We'd Change

The author’s post-mortem is a valuable artifact. To make it a repeatable playbook, the implicit process of discovery should be turned into an explicit verification framework. AI-generated code requires a new layer of review focused on logical and semantic integrity, not just style or syntax.

First, define the logical invariants of the system before writing or generating code. In this case, a key invariant is that a solution that violates constraints cannot score as well as the best feasible one once penalties are applied. The AI's output violated this. By listing these fundamental truths upfront, a reviewer has a checklist of non-negotiable conditions to test for, turning a bug hunt into a systematic audit.
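An invariant of this kind can be written down as an executable check before any solver output is trusted. The following is a minimal sketch under stated assumptions: `SolverResult` and `check_invariants` are hypothetical names, lower energy is taken to be better, and the baseline is assumed to be the best feasible solution. Only the -2.5599 energy comes from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SolverResult:
    """Hypothetical container for what a solver reports."""
    name: str
    energy: float    # lower is better
    feasible: bool   # True when every constraint is satisfied


def check_invariants(baseline: SolverResult, candidate: SolverResult,
                     tol: float = 1e-9) -> List[str]:
    """Return the list of violated invariants (empty means all hold)."""
    problems = []
    if not baseline.feasible:
        problems.append(f"{baseline.name}: baseline must be feasible")
    # Invariant: an infeasible solution must be penalized, so it cannot
    # match or beat the best feasible energy.
    if not candidate.feasible and candidate.energy <= baseline.energy + tol:
        problems.append(
            f"{candidate.name}: infeasible but energy {candidate.energy} "
            f"is not worse than feasible optimum {baseline.energy}"
        )
    return problems


# The situation described in the article: same energy, constraints violated.
classical = SolverResult("classical", energy=-2.5599, feasible=True)
qaoa = SolverResult("qaoa", energy=-2.5599, feasible=False)

for problem in check_invariants(classical, qaoa):
    print("INVARIANT VIOLATED:", problem)
```

Run against the reported numbers, the check flags the QAOA result immediately, which is the same contradiction the author spotted by eye.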

Second, test at the extremes. The penalty bug came from a default that looked reasonable but was wrong. An effective adversarial review runs the code at boundary conditions. Set the penalty weight to zero and confirm the constraint is ignored. Set it to an extremely high number and confirm the constraint is always met. This forces the model to reveal how sensitive it is to key parameters.
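The boundary test can be sketched on a toy problem small enough to solve by exhaustive search. Everything below is an illustrative assumption rather than the project's code: the loads, rewards, capacity, and the linear overload penalty are made up, and brute-force enumeration stands in for the QAOA optimizer.

```python
from itertools import product

# Toy instance (hypothetical numbers, not from the article).
LOADS = [60, 50, 40]        # cash units each stop needs
REWARDS = [3.0, 2.5, 2.0]   # value of serving each stop
CAPACITY = 100              # vehicle capacity


def overload(plan):
    """Units by which the plan exceeds vehicle capacity (0 if it fits)."""
    load = sum(l for l, served in zip(LOADS, plan) if served)
    return max(0, load - CAPACITY)


def energy(plan, lambda_c):
    """Negative reward plus a penalty proportional to relative overload."""
    value = sum(r for r, served in zip(REWARDS, plan) if served)
    return -value + lambda_c * overload(plan) / CAPACITY


def best_plan(lambda_c):
    """Exhaustive search: a transparent stand-in for the real optimizer."""
    plans = product((0, 1), repeat=len(LOADS))
    return min(plans, key=lambda p: energy(p, lambda_c))


# Boundary checks: zero weight, a "reasonable" default, and a very large weight.
for lam in (0.0, 0.5, 1000.0):
    plan = best_plan(lam)
    print(f"lambda_c={lam}: plan={plan}, overload={overload(plan)}")
```

In this toy setup the 0.5 weight behaves exactly like a weight of zero: the optimizer still overloads the vehicle. Only the large weight forces a feasible plan, which is the kind of sensitivity the boundary test is meant to expose.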

Finally, use multiple models for cross-validation. The bug was only apparent because the QAOA results were compared against the established classical model, a mixed-integer linear program (MILP). When dealing with complex, AI-generated systems, having a simpler, more transparent model to serve as a benchmark is not a nice-to-have. It is a necessary tool for verification.
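One way to sketch this cross-check is to recompute cost and feasibility directly from each solver's returned plan instead of trusting the solver's own report. The example below is hypothetical throughout: the instance data is invented, `brute_force_reference` stands in for a transparent baseline such as the MILP model, and `suspect_solver` imitates a generated implementation that ignores capacity.

```python
from itertools import product
from typing import List, Sequence, Tuple

Plan = Tuple[int, ...]

# Toy instance (hypothetical numbers). Negative cost means serving pays off.
COSTS = [-3.0, -2.5, -2.0]
LOADS = [60, 50, 40]
CAPACITY = 100


def plan_cost(plan: Sequence[int]) -> float:
    return sum(c * x for c, x in zip(COSTS, plan))


def plan_load(plan: Sequence[int]) -> int:
    return sum(l * x for l, x in zip(LOADS, plan))


def brute_force_reference() -> Plan:
    """Transparent baseline: enumerate every plan, keep the cheapest feasible."""
    feasible = [p for p in product((0, 1), repeat=len(COSTS))
                if plan_load(p) <= CAPACITY]
    return min(feasible, key=plan_cost)


def suspect_solver() -> Plan:
    """Stand-in for a generated solver that silently ignores capacity."""
    return tuple(1 if c < 0 else 0 for c in COSTS)


def cross_validate(reference: Plan, candidate: Plan,
                   tol: float = 1e-9) -> List[str]:
    """Recompute everything from the raw plans and list any red flags."""
    findings = []
    if plan_load(candidate) > CAPACITY:
        findings.append("candidate plan exceeds capacity")
    if plan_cost(candidate) < plan_cost(reference) - tol:
        findings.append("candidate beats the exhaustive feasible optimum")
    return findings


print(cross_validate(brute_force_reference(), suspect_solver()))
```

Exhaustive search only scales to tiny instances, but that is enough for a benchmark: a candidate that disagrees with it on a three-stop problem is unlikely to be right on a real one.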

Landing

Code generation is becoming a commodity. The durable skill is not prompting, but verification. Senior engineers are evolving into adversarial auditors, tasked with creating systems that can prove AI-generated code is not just syntactically correct, but logically sound. The most valuable work is no longer writing the implementation, but designing the interrogation.

The investor read

This project, while not a commercial entity, signals a significant emerging market: AI code auditing and verification. As AI code generation (e.g., GitHub Copilot, Devin) becomes standard, the primary bottleneck and source of risk shifts from implementation speed to logical correctness. The bugs detailed here were semantic, not syntactic, and evaded all standard tooling. This creates a clear opportunity for a new class of developer tools that perform 'logic-based linting' or 'semantic testing.' An investment thesis could focus on startups building tools that automate the adversarial review process described: defining and testing logical invariants, running boundary condition tests, and performing automated cross-model validation. These systems would form a new, critical layer of the MLOps and DevOps stack, acting as the essential auditor for an AI-powered engineering workforce.

Sources · how we verified
  1. AI Writes the Code. But Who Checks It?

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.