Tools·May 31, 2026

Claude Sonnet and Gemini Flash Miss Identical Security Hardening Steps

By Riley · Tools desk · Human-reviewed · ✓ Verified May 31, 2026 · 6 min read · 1 source

This review benchmarks Claude Sonnet 4.6 and Gemini 2.5 Flash against custom ESLint security plugins across four domains. It highlights consistent, critical hardening omissions in AI-generated code.

TL;DR

Best for: Developers who use AI models to generate code in security-sensitive domains and need to know which hardening steps those models commonly skip.

Skip if: You expect AI models to generate production-ready, fully hardened code without rigorous human review and specialized tooling.

Bottom line: Given feature-only prompts for common application security domains, both Claude Sonnet and Gemini Flash consistently miss critical hardening steps, so a robust security review process remains necessary.

Methodology

This v0 review draws on the author Ofri Peretz's published claims at https://dev.to/ofri-peretz/claude-vs-gemini-across-4-security-domains-a-dead-heat-and-the-hardening-63-of-ai-code-skips-mpp; independent benchmarks are pending. Update cadence: re-tested when claims diverge from observed behavior.

The review covers a benchmark comparing Claude Sonnet 4.6 (via Claude CLI) and Gemini 2.5 Flash (via Gemini CLI) for generating secure code. The tests, conducted on 2026-05-31, involved providing "feature-only" prompts (without explicit security instructions) for four distinct security domains. The generated code was then evaluated using custom ESLint security plugins developed by the author: nestjs-security, jwt, mongodb-security, and secure-coding, each mapped to specific CWEs. The comparison focused on the default CLI tooling and system prompts for each model.

This review does not cover independent performance benchmarks, long-term workflow integration, edge cases beyond the specific prompts used, or the higher-tier Pro and Opus models. The author notes that each domain involved a single generation (n=1), although the JWT round was re-run with consistent results. The issue counts are therefore best read as evidence of stable failure modes, not as precise, generalizable numbers.

What It Does

Benchmarking AI code security

The core of this analysis is a direct comparison of two prominent AI models, Claude Sonnet 4.6 and Gemini 2.5 Flash, in their ability to generate secure code. The author used specific prompts for common application components, such as NestJS services, JWT authentication middleware, MongoDB data layers, and general API functions susceptible to injection. The goal was to assess how these models perform under realistic developer usage scenarios, where security hardening is often an implicit rather than explicit requirement.

Custom ESLint plugin suite

To objectively evaluate the generated code, the author employed a suite of four custom ESLint security plugins: nestjs-security, jwt, mongodb-security, and secure-coding. These plugins are designed to identify specific security vulnerabilities and hardening omissions relevant to their respective domains, with each rule mapped to a Common Weakness Enumeration (CWE). This approach moves beyond subjective code review, providing a quantifiable measure of security posture based on predefined rules.
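To make the idea of a lint rule mapped to a CWE concrete, here is a minimal sketch of how such a rule could be shaped. It is not one of the author's rules: the rule name, the simplified AST types, the `lintCalls` helper, and the choice of CWE-347 are all assumptions made for illustration. It follows the general `meta` plus `create(context)` layout that ESLint rules use and flags `jwt.verify(...)` calls that pass no `algorithms` option.

```typescript
// Hand-rolled subset of ESTree node shapes (assumption: a real rule would use
// the node types that ship with ESLint tooling instead of these).
interface Identifier { type: "Identifier"; name: string }
interface Property { type: "Property"; key: Identifier }
interface ObjectExpression { type: "ObjectExpression"; properties: Property[] }
interface MemberExpression { type: "MemberExpression"; object: Identifier; property: Identifier }
interface CallExpression {
  type: "CallExpression";
  callee: MemberExpression | Identifier;
  arguments: Array<Identifier | ObjectExpression>;
}
interface RuleReport { node: CallExpression; messageId: string }
interface RuleContext { report(descriptor: RuleReport): void }

// Hypothetical rule: flag jwt.verify(token, key) calls without a pinned algorithm list.
const requireJwtAlgorithms = {
  meta: {
    type: "problem",
    // The CWE mapping here is an illustrative assumption, not the author's.
    docs: { description: "Require an explicit algorithms option on jwt.verify", cwe: "CWE-347" },
    messages: { missingAlgorithms: "jwt.verify is called without an algorithms allow-list." },
    schema: [],
  },
  create(context: RuleContext) {
    return {
      CallExpression(node: CallExpression): void {
        const callee = node.callee;
        const isJwtVerify =
          callee.type === "MemberExpression" &&
          callee.object.name === "jwt" &&
          callee.property.name === "verify";
        if (!isJwtVerify) return;

        // The options object, if present, is the third argument.
        const options = node.arguments[2];
        const pinsAlgorithms =
          options !== undefined &&
          options.type === "ObjectExpression" &&
          options.properties.some((p) => p.key.name === "algorithms");
        if (!pinsAlgorithms) context.report({ node, messageId: "missingAlgorithms" });
      },
    };
  },
};

// Tiny stand-in for a lint run: feed call nodes to the rule and collect findings.
function lintCalls(calls: CallExpression[]): string[] {
  const findings: string[] = [];
  const visitor = requireJwtAlgorithms.create({
    report: ({ messageId }) => {
      findings.push(`${requireJwtAlgorithms.meta.docs.cwe}: ${messageId}`);
    },
  });
  calls.forEach((call) => visitor.CallExpression(call));
  return findings;
}

// Helpers that build the AST for jwt.verify(token, secret[, { ...optionKeys }]).
const ident = (name: string): Identifier => ({ type: "Identifier", name });
function verifyCall(optionKeys?: string[]): CallExpression {
  const args: Array<Identifier | ObjectExpression> = [ident("token"), ident("secret")];
  if (optionKeys) {
    args.push({
      type: "ObjectExpression",
      properties: optionKeys.map((k): Property => ({ type: "Property", key: ident(k) })),
    });
  }
  return {
    type: "CallExpression",
    callee: { type: "MemberExpression", object: ident("jwt"), property: ident("verify") },
    arguments: args,
  };
}
```

Counting the strings returned by `lintCalls` gives an issue count per file, which is roughly the kind of number a scorecard like the author's would be built from.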

Specific vulnerability findings

The author claims that, across 700 AI-generated functions, 63% shipped a vulnerability. The scorecard for this benchmark shows what the author calls a "statistical dead heat" between Claude and Gemini across the four domains. In the NestJS service domain, Gemini's code had 2 issues and Claude's had 6. In the JWT authentication domain, each model produced 5 issues, and in the MongoDB data layer domain, each produced 8, indicating identical misses. In the General API (injection) domain, Gemini had 9 issues and Claude had 13*. The critical finding is that both models frequently missed the same hardening steps, particularly those outlined in RFC 8725 for JWT.
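The review does not list which RFC 8725 steps were skipped, so the following is only a sketch of the kind of checks that RFC recommends: an explicit algorithm allow-list plus issuer and audience validation. The function name, the policy shape, and the decision to treat a missing `exp` claim as a gap are assumptions for illustration. Signature verification is deliberately out of scope here and would have to happen before any of these checks are trusted.

```typescript
// Decoded token parts; decoding and signature verification are assumed to
// happen elsewhere.
interface JwtHeader { alg: string; typ?: string }
interface JwtClaims { iss?: string; aud?: string | string[]; exp?: number }

// Hypothetical policy object describing what this service accepts.
interface VerificationPolicy {
  allowedAlgorithms: string[];
  issuer: string;
  audience: string;
}

// Returns a list of hardening gaps; an empty list means every check passed.
function findHardeningGaps(
  header: JwtHeader,
  claims: JwtClaims,
  policy: VerificationPolicy,
  nowSeconds: number,
): string[] {
  const gaps: string[] = [];

  // Never trust the token's own "alg" value; compare it to an allow-list.
  if (policy.allowedAlgorithms.indexOf(header.alg) === -1) {
    gaps.push(`algorithm "${header.alg}" is not on the allow-list`);
  }

  // The token must come from the expected issuer.
  if (claims.iss !== policy.issuer) gaps.push("issuer mismatch");

  // "aud" may be a single string or an array of strings.
  const audiences: string[] =
    claims.aud === undefined ? [] : typeof claims.aud === "string" ? [claims.aud] : claims.aud;
  if (audiences.indexOf(policy.audience) === -1) gaps.push("audience mismatch");

  // Assumption: this policy requires an expiry and rejects expired tokens.
  if (claims.exp === undefined || claims.exp <= nowSeconds) {
    gaps.push("token is expired or has no exp claim");
  }
  return gaps;
}

// Example: a token that declares the "none" algorithm but is otherwise valid.
const policy: VerificationPolicy = {
  allowedAlgorithms: ["RS256"],
  issuer: "https://auth.example.com",
  audience: "orders-api",
};
const gaps = findHardeningGaps(
  { alg: "none" },
  { iss: "https://auth.example.com", aud: "orders-api", exp: 2000 },
  policy,
  1000,
);
// gaps holds a single entry, about the "none" algorithm
```

Passing `nowSeconds` in as a parameter keeps the check deterministic and easy to test; production code would typically derive it from the system clock.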

Idiomatic framework advantage

A notable observation came from the NestJS service generation. Gemini's CLI, when prompted for a users service, defaulted to idiomatic NestJS patterns. This included using class-level @UseGuards, @Exclude() on password fields, and class-validator on DTOs. Claude, conversely, produced functionally similar code but without these framework-specific security scaffolding elements. This suggests that models with a stronger grasp of framework idioms may inherently generate more secure code by leveraging built-in security features.
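As a rough, framework-free illustration of what those three idioms protect against, the sketch below mimics their effects in plain TypeScript. It is not NestJS code and not either model's output: in a real NestJS service these roles would be played by `@UseGuards`, `@Exclude()`, and class-validator decorators, and every name, type, and threshold below is a hypothetical stand-in.

```typescript
// Hypothetical shapes for illustration only.
interface UserRecord { id: string; email: string; passwordHash: string }
interface PublicUser { id: string; email: string }
interface CreateUserInput { email: string; password: string }
interface RequestLike { authenticatedUserId?: string; body: unknown }

// Roughly what a class-level guard enforces: no handler runs for an
// unauthenticated caller.
function withAuthGuard<T>(handler: (req: RequestLike) => T): (req: RequestLike) => T {
  return (req) => {
    if (!req.authenticatedUserId) throw new Error("Unauthorized");
    return handler(req);
  };
}

// Roughly what DTO validation enforces: untrusted input is checked before use.
// The email pattern and the 12-character minimum are illustrative choices.
function parseCreateUser(body: unknown): CreateUserInput {
  if (typeof body !== "object" || body === null) throw new Error("Body must be an object");
  const { email, password } = body as Record<string, unknown>;
  if (typeof email !== "string" || !/^[^@\s]+@[^@\s]+$/.test(email)) {
    throw new Error("Invalid email");
  }
  if (typeof password !== "string" || password.length < 12) {
    throw new Error("Password too short");
  }
  return { email, password };
}

// Roughly what excluding the password field enforces: the hash never leaves
// the service in a response.
function toPublicUser(user: UserRecord): PublicUser {
  return { id: user.id, email: user.email };
}

// Usage: a guarded handler that only ever returns the public view of a user.
const getUser = withAuthGuard(
  (): PublicUser => toPublicUser({ id: "u1", email: "a@example.com", passwordHash: "stored-hash" }),
);
```

The point of the framework idioms is that these protections are declared once and applied consistently, whereas hand-written equivalents like the ones above are easy to forget on individual routes.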

What's Interesting / What's Not

What's interesting here is the consistency of the security gaps across both models. The primary takeaway is not which model "won" a specific round, but rather that both Claude Sonnet and Gemini Flash, despite their advanced capabilities, consistently fail to implement fundamental security hardening steps. This finding is crucial because it directly challenges the assumption that AI-generated code can be trusted without specialized security review. The use of custom ESLint plugins tied to CWEs provides a concrete, reproducible methodology for identifying these gaps, moving beyond anecdotal evidence. The specific detail that 63% of AI-generated functions shipped a vulnerability is a stark, actionable metric for engineering teams. Gemini's ability to leverage idiomatic framework security features in NestJS is also a significant insight, suggesting that model training on framework-specific best practices could be a path to improved security.

What's less interesting is the simple "leaderboard" aspect. The author explicitly states that the "frontier security gap is smaller than the discourse suggests" and that the count is "the least interesting number here." Focusing solely on whether Gemini had 2 issues versus Claude's 6 in one round, or their 5-5 tie in another, misses the broader point about systemic security omissions. The general notion that AI-generated code might contain vulnerabilities is not new; the value here lies in the specifics of what's missed and the consistency of those misses across different advanced models. The comparison of Flash vs. Sonnet, while relevant for pricing tiers, is secondary to the shared failure modes.

Pricing

The review compares Gemini 2.5 Flash and Claude Sonnet 4.6, which the author describes as "comparable price/latency tier" models. The author also notes that "Pro and Opus are a separate bracket." The source provides no specific dollar amounts or free-tier limits. This pricing snapshot is accurate as of 2026-05-31.

Verdict

For developers relying on AI models for code generation, particularly in security-sensitive contexts, both Claude Sonnet and Gemini Flash present significant and consistent security hardening gaps. The data indicates that a substantial portion of AI-generated code, 63% in the author's tests, ships with vulnerabilities when evaluated against specific security rules. While Gemini showed an advantage in opinionated frameworks by defaulting to secure idioms, neither model can be trusted to produce production-ready, secure code without rigorous human review and the application of specialized security linting tools. Teams must integrate robust security practices, including custom linters like those used in this benchmark, into their AI-assisted development workflows.

What We'd Test Next

Our next steps would involve expanding this benchmark to include the higher-tier models, Claude Opus and Gemini Pro, to assess if their advanced capabilities translate into fewer security vulnerabilities. We would also investigate the impact of explicit security prompting, such as "make it secure" or "follow OWASP best practices," on the generated code's security posture. Further testing would explore cross-language comparisons, moving beyond JavaScript/TypeScript to evaluate AI code security in languages like Python or Go, which have different security ecosystems. A crucial area for future research is comparing AI-generated code against human-written code for similar prompts, establishing a baseline for security performance. Finally, we would examine the effectiveness of integrating AI security guardrails, such as Retrieval Augmented Generation (RAG) with curated security best practices, to mitigate these identified hardening omissions.

Sources · how we verified

Claude vs Gemini Across 4 Security Domains: A Dead Heat — and the Hardening 63% of AI Code Skips ↗

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
