Semgrep's benchmarks show GLM 5.2 topping Claude for cyber tasks
Semgrep's internal benchmarks report that Zhipu AI's GLM 5.2 model outperforms Anthropic's Claude 3 on specific cybersecurity tasks, signaling a potential shift towards specialized, domain-specific AI models.
The Answer Up Front
This analysis is for engineering leaders and founders building AI-native security tools. Semgrep's findings suggest that for highly specialized domains like code security, the best-performing model may not come from the handful of well-known US-based labs. Teams building general-purpose applications can likely skip this deep-dive for now. The bottom line is that Semgrep's internal, non-replicated benchmarks show a non-mainstream model winning on quality for specific security tasks. This points toward a future where tooling providers maintain a portfolio of models and route tasks to the best one for the job, rather than committing to a single generalist provider.
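The routing idea above can be sketched as a lookup table built from per-task benchmark scores. This is a minimal illustration, not Semgrep's implementation: the model names, task names, and scores below are hypothetical placeholders, and the rule "pick the highest score per task" is an assumption about how such a router might work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkScore:
    model: str
    task: str
    score: float  # higher is better, e.g. a win rate in [0, 1]


def build_routing_table(scores):
    """Map each task to the model with the highest score on that task."""
    best = {}
    for s in scores:
        current = best.get(s.task)
        if current is None or s.score > current.score:
            best[s.task] = s
    return {task: s.model for task, s in best.items()}


# Hypothetical numbers for illustration only; these are not Semgrep's results.
SCORES = [
    BenchmarkScore("model-a", "vulnerability_detection", 0.61),
    BenchmarkScore("model-b", "vulnerability_detection", 0.55),
    BenchmarkScore("model-a", "exploit_generation", 0.48),
    BenchmarkScore("model-b", "exploit_generation", 0.52),
]

ROUTES = build_routing_table(SCORES)
print(ROUTES)
```

A real router would also need to weigh cost, latency, and availability, which Semgrep's post does not cover.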
Methodology
This v0 review is based on the claims published by Semgrep in a blog post on June 28, 2026. We are analyzing the results of their internal benchmarks as presented; independent verification is pending. The models evaluated are Zhipu AI's GLM 5.2, Anthropic's Claude 3 Opus, and Claude 3 Sonnet. The evaluation was performed on three custom benchmarks developed by Semgrep: "Mythos-at-home" for vulnerability detection, "Cyber-Arena" for exploit generation, and a proprietary benchmark for generating Semgrep rules from natural language.
This review covers the methodology and results as described by Semgrep. It does not include our own independent performance testing, a cost-per-inference analysis, or an evaluation of the models' performance on non-security tasks. The benchmarks themselves are not yet publicly available for replication. We will re-evaluate these claims if the benchmarks are open-sourced or when independent results become available.
What the Benchmarks Measure
Semgrep's engineering team created a suite of benchmarks tailored to their specific product needs, arguing that general-purpose coding benchmarks like SWE-Bench are insufficient for evaluating security-specific AI capabilities.
Vulnerability detection and exploit generation
The first two benchmarks test core security analysis functions. "Mythos-at-home" is a benchmark designed to test a model's ability to identify vulnerabilities in provided code snippets. According to the post, this is a critical skill for automated code scanning tools. The second, "Cyber-Arena," evaluates the complementary, offensive capability: generating a working exploit for a known vulnerability. This tests a model's understanding of how vulnerabilities are exploited in practice.
Generating Semgrep rules
The third benchmark is highly specific to Semgrep's own product. It measures how well a model can translate a natural language description of a vulnerability (e.g., "find all instances of SQL injection using string formatting") into a precise, syntactically correct Semgrep rule. This is a direct test of the model's utility within the Semgrep platform's AI features.
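To make "syntactically correct" concrete, here is a rough sketch of a structural pre-check a harness might run on a model-generated rule before handing it to Semgrep itself. Semgrep's scoring method is not published, so this is an assumption about one possible first filter. The rule is represented as an already-parsed dictionary, the field lists are simplified (taint-mode rules, for example, use different keys), and the candidate rule and its pattern are untested illustrations rather than output from any benchmark.

```python
# Fields a basic Semgrep rule is expected to carry (simplified for this sketch).
REQUIRED_FIELDS = ("id", "message", "severity", "languages")
# At least one pattern operator must be present in a search-mode rule.
PATTERN_KEYS = ("pattern", "patterns", "pattern-either", "pattern-regex")


def structural_problems(rule):
    """Return a list of problems found in a rule dict (empty list means none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in rule]
    if not any(key in rule for key in PATTERN_KEYS):
        problems.append("missing pattern operator")
    return problems


# Hypothetical model output for "find SQL injection using string formatting".
CANDIDATE_RULE = {
    "id": "python-sqli-string-format",
    "message": "SQL query built with string formatting",
    "severity": "ERROR",
    "languages": ["python"],
    "pattern": "$CURSOR.execute($QUERY % $ARGS)",
}

print(structural_problems(CANDIDATE_RULE))
print(structural_problems({"id": "incomplete-rule"}))
```

Passing this check says nothing about whether the rule finds the right code; a full evaluation would still need to run the rule against known-vulnerable and known-safe samples.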
What's Interesting / What's Not
The most significant finding is that a model from outside the dominant OpenAI/Anthropic/Google ecosystem is reported to beat a leading frontier model on a commercially valuable, high-stakes task. Semgrep reports that GLM 5.2, from the Chinese AI company Zhipu AI, had a higher win rate than Claude 3 Opus on both vulnerability detection and Semgrep rule generation. While Opus held a slight edge in exploit generation, GLM 5.2's performance in the other two categories was strong enough for Semgrep to declare it the overall winner for their use case.
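Semgrep's post does not define how it computes win rate, so the following is only one common convention, shown to make the metric concrete: each test case yields a head-to-head outcome, and a tie counts as half a win for each side. The outcome labels and the sample data are hypothetical.

```python
from collections import Counter


def win_rate(outcomes, model):
    """Share of head-to-head comparisons won by `model`, counting a tie as half a win."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons to score")
    return (counts[model] + 0.5 * counts["tie"]) / total


# Hypothetical per-case outcomes: which model's answer a grader preferred.
OUTCOMES = ["a", "a", "b", "tie", "a", "b", "a", "tie"]

print(win_rate(OUTCOMES, "a"))
print(win_rate(OUTCOMES, "b"))
```

Under this convention the two models' win rates always sum to 1, which makes a "slight edge" easy to read but also easy to overstate when the number of test cases is small.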
This is interesting because it suggests the AI model market may fragment into specialized domains. A single foundation model may not remain the best at everything. Companies with deep domain expertise, like Semgrep in security, are building the specific evaluation suites required to even notice these performance differences. The benchmark itself becomes a competitive advantage.
What's less clear is the full picture. The analysis is based entirely on Semgrep's internal, unpublished benchmarks. While Semgrep is a credible source, these results are not yet independently verifiable. The blog post also focuses exclusively on the quality of the model's output, omitting crucial production metrics like inference speed, rate limits, and cost per million tokens. A model could be marginally better but 10x more expensive or slower, making it a poor choice for a real-time product.
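The quality-versus-cost tradeoff described above is simple arithmetic once token prices are known. The sketch below uses entirely hypothetical prices, token counts, and accuracy figures (none come from Semgrep or any vendor) to show how a small quality gain can be swamped by a tenfold price difference when cost is measured per successful task.

```python
def cost_per_task(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one task, given prices quoted per million tokens."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


def cost_per_success(task_cost, success_rate):
    """Expected dollars spent per successful task."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return task_cost / success_rate


# Hypothetical workload: 4,000 input tokens and 1,000 output tokens per scan.
CHEAP_COST = cost_per_task(4_000, 1_000, price_in_per_m=1.0, price_out_per_m=4.0)
PRICEY_COST = cost_per_task(4_000, 1_000, price_in_per_m=10.0, price_out_per_m=40.0)

# Hypothetical quality: the pricier model is only marginally better.
CHEAP_PER_SUCCESS = cost_per_success(CHEAP_COST, success_rate=0.60)
PRICEY_PER_SUCCESS = cost_per_success(PRICEY_COST, success_rate=0.63)

print(CHEAP_COST, PRICEY_COST)
print(CHEAP_PER_SUCCESS, PRICEY_PER_SUCCESS)
```

With these made-up inputs, the pricier model costs roughly ten times as much per task and remains far more expensive per successful result, which is why quality scores alone are not enough to pick a production model.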
Pricing
This benchmark evaluates models Semgrep is considering for its products. The pricing for Semgrep's platform is as follows (snapshot from June 2026):
- Free: For individuals and small teams. Includes the open-source engine, CI/CD scanning, and community rules.
- Pro: $20 per developer per month. Adds Pro rules for vulnerability detection, dependency scanning, and secrets scanning.
- Enterprise: Custom pricing. Adds features like custom rules, policy enforcement, and dedicated support.
The choice of an underlying model like GLM 5.2 or Claude would affect the cost structure and performance of features within these tiers, but is not a direct pricing component for customers.
Verdict
For founders building in the security space, Semgrep's findings are a clear signal: do not assume a mainstream model is the best for your specific domain. The effort to build or adopt domain-specific benchmarks is critical, as it can uncover higher-performing, potentially more efficient models from a global market of providers. While these specific claims about GLM 5.2 require independent verification, the strategic implication is that the best AI-powered tools will likely rely on a diverse set of specialized models, not a single general-purpose one. This is a playbook for building a defensible, best-in-class product in a specialized vertical.
What We'd Test Next
A v2 of this review would require access to the benchmark suite itself for independent replication. First, we would run the same models to verify Semgrep's reported scores. Next, we would expand the test to include other prominent models like OpenAI's GPT-4o and Google's Gemini family. A crucial addition would be a performance and cost analysis, measuring not just the quality of outputs but also the latency and cost-per-task for each model. Finally, we would compare the results against established, public security benchmarks to see if the performance advantage holds on non-proprietary tests.
The investor read
Semgrep's benchmark report is a signal of market fragmentation in the AI model layer. The era of a single 'best' foundation model is likely over, replaced by a 'best tool for the job' environment where specialized models outperform generalists in high-value verticals like cybersecurity. The durable asset here is not just the AI-powered product, but the domain-specific benchmark suite used to evaluate models. Companies that can reliably measure performance in a niche are building a moat; they become arbiters of quality and can optimize their AI supply chain for performance and cost. This result from Zhipu AI also underscores the global nature of the AI race, which introduces both new opportunities and new supply chain complexities. One investable thesis is a company that provides 'benchmarking-as-a-service' for specific enterprise verticals, abstracting away the complexity of model selection.