LLMs for Security Testing: $1,500 Experiment Benchmarks Hacking Capabilities
A founder spent $1,500 testing GPT-4, Claude 3 Opus, and Gemini 1.5 Pro against a custom-built vulnerable app, providing a repeatable methodology and specific benchmarks for LLM-assisted security.
Kasra, operating under the handle jc4p, invested $1,500 to assess the hacking capabilities of large language models against a deliberately vulnerable application. His detailed experiment, published in a recent blog post, compared GPT-4, Claude 3 Opus, and Gemini 1.5 Pro across six common vulnerability types. The findings offer specific benchmarks and a repeatable playbook for founders exploring LLM integration into their security testing workflows.
Building the Vulnerable Application
The core of the experiment was a custom-built Node.js application using the Express framework and a MongoDB database, intentionally seeded with vulnerabilities to serve as a controlled testing environment. The founder integrated six vulnerability types: SQL Injection (SQLi), Cross-Site Scripting (XSS), Remote Code Execution (RCE), Server-Side Request Forgery (SSRF), Server-Side Template Injection (SSTI), and Local File Inclusion (LFI). Because MongoDB is not a SQL database, the SQLi category presumably refers to query injection against MongoDB (commonly called NoSQL injection); how it was implemented is not specified here. Each vulnerability was implemented to mimic a common real-world flaw, giving the LLMs clear targets.
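To make one of these classes concrete, here is a minimal sketch of the Local File Inclusion pattern. It is written in Python for illustration only and is not the author's Node.js code; the `BASE_DIR` path and both function names are hypothetical. The vulnerable version trusts the user-supplied path, while the safe version rejects any path that resolves outside the document root.

```python
import posixpath

# Hypothetical document root, chosen only for this illustration.
BASE_DIR = "/srv/app/public"


def resolve_vulnerable(user_path: str) -> str:
    """Join user input onto the root with no containment check (the flaw)."""
    return posixpath.normpath(posixpath.join(BASE_DIR, user_path))


def resolve_safe(user_path: str) -> str:
    """Resolve the path, then refuse anything that escapes the root."""
    candidate = posixpath.normpath(posixpath.join(BASE_DIR, user_path))
    # commonpath returns the longest shared ancestor of both paths;
    # if that is not the root itself, the candidate lies outside it.
    if posixpath.commonpath([BASE_DIR, candidate]) != BASE_DIR:
        raise ValueError("path escapes the document root")
    return candidate
```

A request for `../../../etc/passwd` walks out of the root in the vulnerable version and raises `ValueError` in the safe one. A planted flaw of this shape is the kind of target a model would need to spot from source code alone.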
Prompt Engineering and LLM Selection
Kasra tested three prominent LLMs: OpenAI's GPT-4, Anthropic's Claude 3 Opus, and Google's Gemini 1.5 Pro. The methodology centered on a zero-shot prompting approach, where the LLMs received only the application's source code and a high-level instruction to identify vulnerabilities. This initial approach was designed to test the models' baseline understanding and analytical capabilities without extensive human guidance. The founder claims that this zero-shot method was crucial for evaluating the raw security intelligence of each model, minimizing bias from specific attack vectors provided in prompts.
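A zero-shot setup of this kind can be sketched as a single prompt that combines a generic instruction with the source files and nothing else. The instruction wording, the `build_zero_shot_prompt` helper, and the file-delimiter format below are all assumptions made for illustration; the prompts actually used in the experiment are not reproduced here.

```python
from typing import Dict

# Hypothetical instruction text. It deliberately names no vulnerability
# class, so the model gets no hint about which attack vectors to look for.
INSTRUCTION = (
    "You are reviewing the source code of a web application. "
    "Identify any security vulnerabilities and explain how each could be exploited."
)


def build_zero_shot_prompt(files: Dict[str, str]) -> str:
    """Assemble one prompt: the instruction, then each file under a header."""
    parts = [INSTRUCTION, ""]
    for name in sorted(files):  # sorted so the prompt is reproducible
        parts.append(f"--- {name} ---")
        parts.append(files[name])
        parts.append("")
    return "\n".join(parts)
```

Sending the same assembled prompt to every model keeps the comparison fair, since any difference in findings then comes from the model and not from the prompt.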
Reported Hacking Success Rates
The experiment reports success rates that vary by model and by vulnerability type. GPT-4 reportedly achieved a 50% success rate overall and was strongest at identifying SQLi and XSS vulnerabilities. Claude 3 Opus reportedly scored lower overall than GPT-4 and had particular difficulty with RCE. Gemini 1.5 Pro reportedly struggled the most, especially with more complex vulnerabilities such as SSTI and LFI. The founder states that the total cost of running these tests, including API calls and infrastructure, came to $1,500, with GPT-4 accounting for the largest share because of its higher token costs and heavier use in the experiment.
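One simple way to turn per-vulnerability outcomes into an overall rate is the fraction of the six types a model identified. Whether the experiment scored results this way is an assumption. The outcomes below are hypothetical placeholders for an unnamed model, not data from the experiment.

```python
from typing import Dict

# The six vulnerability classes planted in the test application.
VULN_TYPES = ["SQLi", "XSS", "RCE", "SSRF", "SSTI", "LFI"]


def success_rate(outcomes: Dict[str, bool]) -> float:
    """Return the fraction of vulnerability types marked as found."""
    missing = [v for v in VULN_TYPES if v not in outcomes]
    if missing:
        raise ValueError(f"no outcome recorded for: {missing}")
    return sum(outcomes[v] for v in VULN_TYPES) / len(VULN_TYPES)


# Hypothetical placeholder outcomes, NOT results from the experiment.
model_x = {
    "SQLi": True, "XSS": True, "RCE": True,
    "SSRF": True, "SSTI": False, "LFI": False,
}
rate = success_rate(model_x)  # four of six types found
```

Under this scoring, a 50% overall rate would correspond to three of the six types found. Requiring an outcome for every type guards against a rate that is inflated by an untested category.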
What We'd Change
The experiment provides a useful initial benchmark, but its generalizability is limited. The custom-built application, while intentionally vulnerable, does not replicate the complexity, scale, and varied tech stacks of production systems. Real-world applications often combine multiple layers of security, diverse frameworks, and legacy code, any of which could alter LLM performance. The LLM landscape also evolves rapidly; benchmarks established today may be outdated within months as models improve or new ones emerge. A single zero-shot prompting strategy is useful for a baseline, but it may not represent the best use of LLMs in practice, where iterative prompting or specialized fine-tuning could yield higher success rates. For founders, the reported $1,500 cost, though likely below the price of a full human penetration test, might still be a barrier for early-stage bootstrapped projects, especially if testing has to be repeated.
Landing
This experiment suggests that LLMs can identify software vulnerabilities at a measurable success rate, which makes them a potential tool for early-stage security assessments. The detailed methodology provides a starting point for founders to integrate LLM-assisted security into their development cycles. While not a replacement for human security expertise, these models can act as a force multiplier, automating initial scans and flagging common issues before deeper audits. The reported benchmarks show both current capabilities and limitations, which can guide founders on where these tools fit in their security posture. As LLMs continue to evolve, these capabilities are likely to expand, so technical founders have good reason to keep monitoring this area.
The investor read
This experiment signals a growing trend: LLMs moving from content generation to specialized technical tasks such as security testing. For investors, this highlights a potential market for AI-powered security tooling, particularly for startups or SMBs that lack dedicated security teams or the budget for traditional penetration testing. The reported $1,500 total for testing three models against one application offers a rough cost reference point against human-led security audits, though it covers a small purpose-built target and not a production system. As LLM capabilities advance, the market for automated, AI-assisted security solutions could expand, attracting capital to companies building specialized agents or platforms that operationalize these models for continuous vulnerability assessment. This also underscores the increasing importance of robust, AI-native security for any product that relies on LLMs.