Tactics·May 20, 2026

AI Code Review: When to Trust It, When to Test Manually

By Maya · Tactics desk · Human-reviewed · Verified May 20, 2026 · 4 min read · 1 source

A founder's six-month experiment with AI for code review on a production ERP reveals specific strengths in static analysis and critical weaknesses in dynamic scenarios, alongside a key prompt engineering tactic.

After six months of deploying AI for code review on a production ERP system, one founder identified specific failure points and a critical 'chunking' strategy for prompt engineering. This process, applied to modules like batch processing tables and MRP allocation logic, revealed AI's reliability for detecting duplicate functions and isolated calculation bugs. It also exposed AI's inability to detect race conditions or web worker memory leaks in complex, dynamic environments. The founder's experience provides a tactical framework for integrating AI into a development workflow, delineating its effective boundaries.

Reliable AI Code Review Use Cases

The founder's initial findings indicate AI excels at identifying specific, localized code issues. In a codebase touched by multiple developers, AI consistently flagged duplicate functions, such as one mirroring formatCurrency in utils/formatters.ts. This type of redundancy often evades human review, as it is assumed to have been checked previously. AI provides a reliable first pass for such inefficiencies.

For self-contained utilities, AI demonstrated consistent accuracy in catching calculation bugs. These included off-by-one errors, incorrect operator precedence, and faulty unit conversions. The source notes that in ERP systems, errors in landed cost formulas or tax calculations can severely damage client trust. AI's ability to prevent these specific, isolated calculation errors proved valuable. It saved the founder from multiple potential client-facing issues.
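To make the operator-precedence case concrete, here is a minimal sketch of the kind of isolated calculation bug described above. The landed cost formula, the field names, and the assumption that duty applies to goods plus freight are all hypothetical; the source does not publish its formulas.

```typescript
// Hypothetical landed cost input; names and formula are illustrative assumptions.
interface LandedCostInput {
  unitPrice: number; // price per unit
  quantity: number;  // units in the shipment
  freight: number;   // total freight for the shipment
  dutyRate: number;  // e.g. 0.05 for 5%
}

// Buggy: `*` binds tighter than `+`, so duty is applied to freight only.
function landedCostBuggy(i: LandedCostInput): number {
  return i.unitPrice * i.quantity + i.freight * (1 + i.dutyRate);
}

// Intended under this sketch's assumption: duty applies to goods plus freight.
function landedCost(i: LandedCostInput): number {
  return (i.unitPrice * i.quantity + i.freight) * (1 + i.dutyRate);
}
```

Both versions type-check and look plausible in a diff, which is why a self-contained utility like this is a good fit for a static first pass.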

AI also reliably identified direct React state mutations in simple components. It consistently flagged pushing directly to an array and mutating nested objects without proper spreading. While not a groundbreaking discovery, this capability serves as a useful early check that basic React immutability principles are followed.
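The mutation pattern is easy to show without React itself. The sketch below uses plain functions over a hypothetical `TableState` shape (not taken from the source) to contrast a direct push with a spread-based update. React skips a re-render when a state setter receives a value that is identical by reference to the previous one, which is why the mutating version tends to fail silently.

```typescript
// Hypothetical state shape for illustration only.
interface BatchItem { id: number; qty: number; }
interface TableState { items: BatchItem[]; }

// Mutating update: pushes into the existing array and returns the same object,
// so a reference comparison sees no change.
function addItemMutating(state: TableState, item: BatchItem): TableState {
  state.items.push(item);
  return state;
}

// Immutable update: a new object and a new array; the previous state is untouched.
function addItem(state: TableState, item: BatchItem): TableState {
  return { ...state, items: [...state.items, item] };
}

// Immutable nested update: copy only the item that changes.
function setQty(state: TableState, id: number, qty: number): TableState {
  return {
    ...state,
    items: state.items.map((it) => (it.id === id ? { ...it, qty } : it)),
  };
}
```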

Where AI Completely Failed

Despite its strengths, AI demonstrated significant limitations in complex, dynamic scenarios. The most painful production bug encountered was a race condition in a batch item table. This component allowed rapid concurrent mutations, leading to overlapping async state updates and silent overrides. The founder ran the component through multiple AI tools, which returned zero flags, not even a suggestion to check for such issues. The bug only surfaced when a client triggered a specific sequence of rapid interactions. AI reviews code statically; it cannot simulate erratic user interaction timing, so race conditions of this kind stayed invisible to its review.
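The source does not publish the offending component, so the following is only a sketch of the general failure shape: two async updates read the same value, wait, then write, and one write silently overrides the other. The store, the delays, and the function names are assumptions for illustration.

```typescript
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// A stand-in for component state; this is not React, just a closure.
function createQuantityStore() {
  let quantity = 0;
  return {
    get: (): number => quantity,
    // Racy: captures the value before awaiting, then writes from the stale copy.
    async incrementRacy(ms: number): Promise<void> {
      const snapshot = quantity;
      await delay(ms);
      quantity = snapshot + 1;
    },
    // Safer: reads the latest value at the moment of writing.
    async incrementSafe(ms: number): Promise<void> {
      await delay(ms);
      quantity = quantity + 1;
    },
  };
}

async function demoRace(): Promise<{ racy: number; safe: number }> {
  const racyStore = createQuantityStore();
  await Promise.all([racyStore.incrementRacy(20), racyStore.incrementRacy(5)]);
  const safeStore = createQuantityStore();
  await Promise.all([safeStore.incrementSafe(20), safeStore.incrementSafe(5)]);
  // Two increments each: the racy store ends at 1, the safe store at 2.
  return { racy: racyStore.get(), safe: safeStore.get() };
}
```

Each function reads correctly on its own; the lost update only appears when two calls overlap in time. In React, the rough analogue of the safer version is passing an updater function to the state setter instead of a value computed from a captured snapshot.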

Web Worker memory leaks also proved undetectable by AI. The founder implemented workers for heavy client-side calculations and tasked the AI stack with auditing cleanup patterns, specifically for potential leaks from rapid event-driven spawning. Every AI tool confidently passed the code. However, manual browser profiling revealed that workers were not reliably terminating under specific runtime exceptions, leading to ghost processes. The AI verified that cleanup code existed, but it could not verify that cleanup actually ran across every execution path, especially under runtime exceptions.
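The gap between "cleanup code exists" and "cleanup code runs on every path" can be sketched as follows. The `WorkerLike` interface and both helper names are hypothetical; the only real API assumed is that a worker exposes `terminate()`, as the browser `Worker` does. This is not the founder's code.

```typescript
// Minimal shape shared by a browser Worker and a test double.
interface WorkerLike {
  terminate(): void;
}

// Leaky: cleanup exists, but an exception in `task` skips it.
async function runLeaky<W extends WorkerLike, T>(
  create: () => W,
  task: (worker: W) => Promise<T>,
): Promise<T> {
  const worker = create();
  const result = await task(worker); // a throw here never reaches terminate()
  worker.terminate();
  return result;
}

// Guarded: `finally` runs on success and on thrown errors alike.
async function withWorker<W extends WorkerLike, T>(
  create: () => W,
  task: (worker: W) => Promise<T>,
): Promise<T> {
  const worker = create();
  try {
    return await task(worker);
  } finally {
    worker.terminate();
  }
}
```

A reviewer scanning `runLeaky` sees a `terminate()` call and may pass it. Whether workers actually stop under real runtime exceptions still needs the kind of browser profiling the founder describes.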

Custom CSS layout bugs presented another area of complete failure. When building a proprietary data grid with strict UX requirements, the founder encountered padding misalignments and layout collapse under certain data payloads. AI proved almost useless. Describing the visual bug to AI and applying its suggested fixes did not resolve the rendered output. Without a real rendering environment, AI's attempts to fix cascade behavior were guesses.

The Operational Discovery: Chunking Complex Prompts

The most significant operational discovery was a method for structuring AI prompts for complex logic. Initially, when building an MRP allocation table with cascading quantities, supply constraints, and fulfillment priorities, the founder fed the full specification into AI in a single pass. Every tool failed, producing confidently wrong logic with broken dependencies and silently failing state updates on edge cases.

The solution involved splitting the task into four distinct prompts:

One prompt for the core allocation math only.
One prompt for data validation constraints only.
One prompt for the immutable React state update pattern.
A final prompt to audit each module against the others.

Each piece returned clean, and the assembled system worked perfectly in production. The founder's takeaway is that AI degrades under compound logical dependencies, not under token length. An ERP module often has overlapping validation paths and complex tax calculations, and asking AI to reason about all of them at once overloads it. Breaking the problem down let AI process each logical dependency in isolation, then verify the integration in a final pass.
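As a rough illustration of the four-prompt split, the sketch below builds the stages from a single specification object. The prompt wording, the `ModuleSpec` fields, and the stage names are invented for this example; the source describes the split but does not publish its prompts.

```typescript
// Hypothetical specification for one ERP module.
interface ModuleSpec {
  allocationRules: string;
  validationRules: string;
  stateShape: string;
}

interface PromptStage {
  name: string;
  instruction: string;
}

// Produces three single-concern prompts plus a final cross-audit prompt.
function buildChunkedPrompts(spec: ModuleSpec): PromptStage[] {
  const stages: PromptStage[] = [
    {
      name: "allocation-math",
      instruction: `Implement only the core allocation math. Rules: ${spec.allocationRules}`,
    },
    {
      name: "validation",
      instruction: `Implement only the data validation constraints. Rules: ${spec.validationRules}`,
    },
    {
      name: "state-update",
      instruction: `Implement only the immutable React state update pattern. State shape: ${spec.stateShape}`,
    },
  ];
  const audit: PromptStage = {
    name: "cross-audit",
    instruction:
      "Audit each module against the others for mismatched assumptions: " +
      stages.map((s) => s.name).join(", "),
  };
  return [...stages, audit];
}
```

Keeping each stage to a single concern means no one prompt carries the cascading dependencies that broke the single-pass attempt; the audit stage is the only place the pieces meet.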

What We'd Change

The founder's experience highlights the current limitations of AI in code review, particularly its reliance on static analysis. For issues like race conditions and memory leaks, AI currently serves as a supplementary tool, not a replacement for comprehensive dynamic testing. Future implementations should integrate AI's static analysis capabilities with robust unit, integration, and end-to-end testing frameworks. This ensures that while AI catches low-hanging fruit, critical runtime behaviors are validated through execution.

Sources · how we verified

I Used AI for Code Review on a Production ERP for 6 Months. Here's Where It Actually Failed Me.

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
