Tools·Jul 3, 2026

Cursor's new benchmark claims a win over VS Code with GPT-4o

By Riley · Tools desk·Human-reviewed·✓ Verified Jul 3, 2026·5 min read·1 source

Cursor has published CursorBench 3.1, a detailed evaluation of its AI code editor, claiming superior performance on real-world coding tasks against a baseline of VS Code and GPT-4o.

The Answer Up Front

For developers committed to an AI-native workflow, Cursor presents a compelling, data-backed argument for its integrated editor. The gains it claims on complex edits are large: a 54% pass rate on the "Harder Applied" suite, against a claimed 38% for VS Code with GPT-4o. Developers who use AI tools occasionally, or who are satisfied with their current VS Code and Copilot setup, should probably wait for independent verification of these benchmarks. The bottom line is that Cursor makes a strong case that deep integration and model fine-tuning can outperform a general-purpose model bolted onto a traditional editor, but these are, for now, the company's own numbers.

Methodology

This v0 review analyzes the claims and data in CursorBench 3.1, published by the Cursor team on its website, using the public data available as of July 2, 2026. The source signal is the benchmark page itself: https://cursor.com/evals. Our analysis covers the methodology, test suites, and reported results comparing Cursor's AI capabilities against competitors, primarily VS Code paired with OpenAI's GPT-4o via the Copilot extension. We have not independently replicated these benchmarks or verified the pass rates. This review does not cover long-term usability, the editor's performance on tasks outside the benchmark suites, or the qualitative developer experience. The analysis rests on Cursor's own published claims; independent benchmarks are pending.

What It Does

Cursor is an AI-first code editor built as a fork of VS Code. Its central premise is that building an editor around AI from the ground up provides a superior experience to adding AI features via extensions. CursorBench 3.1 is the company's attempt to quantify that advantage.

The benchmark setup

The evaluation uses two main test suites: "Applied" and "Harder Applied." According to the documentation, these are not synthetic coding challenges but are derived from real-world development tasks, such as implementing features or fixing bugs in open-source repositories. The benchmark measures the "pass rate," or the percentage of tasks the AI successfully completes without human intervention. This approach is designed to reflect practical utility over abstract coding ability.
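As a minimal sketch of the metric described above, the pass rate is the share of tasks marked as passed, expressed as a percentage. The record type, task names, and results below are hypothetical illustrations and are not taken from CursorBench 3.1 or its actual data format.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record for one benchmark task; not CursorBench's real schema.
    task_id: str
    passed: bool  # True only if the task completed without human intervention


def pass_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks that passed."""
    if not results:
        raise ValueError("pass rate is undefined for an empty suite")
    passed = sum(1 for r in results if r.passed)
    return 100.0 * passed / len(results)


# Illustrative data only: 2 of 4 made-up tasks pass.
example = [
    TaskResult("fix-bug-001", True),
    TaskResult("add-feature-002", False),
    TaskResult("fix-bug-003", True),
    TaskResult("refactor-004", False),
]
print(pass_rate(example))  # prints 50.0
```

Note that a single boolean per task is what makes the metric easy to compare across tools, and also why it says nothing about partial solutions.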

The core claim

Cursor's primary claim is that its combination of a fine-tuned model and deep editor integration solves these real-world tasks more reliably than general models. The benchmark page shows results where Cursor consistently scores a higher pass rate than VS Code using GPT-4o. For example, on the "Harder Applied" suite, Cursor reports a pass rate of 54%, compared with a claimed 38% for VS Code + GPT-4o. If the numbers hold, that 16-percentage-point gap suggests that context-awareness and specialized training give it an edge.
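The two "Harder Applied" pass rates reported above can be read either as an absolute gap in percentage points or as a relative improvement over the baseline, and the two readings differ a lot. A short sketch of that arithmetic, using only the 54% and 38% figures from the benchmark page (the function name is our own):

```python
def compare_pass_rates(candidate: float, baseline: float) -> tuple[float, float]:
    """Return (gap in percentage points, relative improvement in percent)."""
    if baseline <= 0:
        raise ValueError("baseline pass rate must be positive")
    point_gap = candidate - baseline
    relative = 100.0 * point_gap / baseline
    return point_gap, relative


# Figures as reported by Cursor; not independently verified.
gap, rel = compare_pass_rates(54.0, 38.0)
print(f"{gap:.0f} points, {rel:.1f}% relative")  # prints "16 points, 42.1% relative"
```

Both framings describe the same self-reported result; which one a reader sees quoted can change how large the advantage appears.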

What's Interesting / What's Not

The most interesting aspect of this announcement is the benchmark itself. By publishing a detailed methodology and making the evaluation suite public, Cursor is inviting scrutiny and setting a standard for transparency in the AI dev tool space. This is a confident move that contrasts sharply with the vague marketing claims common in the industry. The focus on applied, real-world tasks is also a significant step up from benchmarks like HumanEval, which test algorithmic generation but not the messy reality of editing existing codebases.

What's less compelling is that this is, by necessity, a self-published benchmark. The results, while impressive, are designed to showcase the product in its best possible light. The tasks in the suite may be unintentionally (or intentionally) biased toward the strengths of Cursor's architecture. The benchmark also defines performance narrowly, as task completion rate. It doesn't measure other factors that matter to developers, such as generation latency, the quality of partial solutions, or the cognitive overhead of using the tool. The numbers are promising, but they tell only part of the story.

Pricing

Free: 50 slow GPT-4o uses per month, 200 fast GPT-3.5 uses per month.
Pro: $20/month for unlimited slow GPT-4o uses, 500 fast model uses per month, and features like "Bring your own key."
Business: $40/user/month, adding features like admin controls and self-hosted models.

(Pricing snapshot from July 2, 2026)

Verdict

Cursor's benchmark provides a strong, if unverified, argument for the AI-native editor. For developers whose workflow already relies heavily on AI code generation and editing, the claimed 16-percentage-point lift on difficult tasks could translate into meaningful productivity gains, justifying the switch from a standard VS Code setup. However, for the majority of developers who use AI as an occasional assistant, the friction of changing editors and the maturity of the VS Code extension ecosystem make it a harder sell. The decision depends on whether you believe a specialized, integrated tool will compound its advantages over time, which is ultimately a bet on a specific workflow philosophy.

What We'd Test Next

For a v2 review, we would need to move beyond Cursor's self-reported data. First, we would attempt an independent replication of the CursorBench 3.1 results to verify the core claims. Second, we would design a qualitative test, assigning a small team of developers to use Cursor and a competing setup for a one-week sprint to measure subjective productivity and workflow friction. Finally, we would benchmark factors not covered in the current evaluation, including end-to-end latency for common edit requests and performance on proprietary, closed-source codebases, which often have unique constraints.
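The latency test mentioned above could start from something as simple as the following sketch. It times repeated calls to an edit request and reports the median and 95th-percentile latency. The `request_edit` callable and the `fake_edit_request` stand-in are hypothetical placeholders: a real test would have to drive each editor through whatever automation it exposes, which this sketch does not attempt.

```python
import statistics
import time
from typing import Callable


def measure_latency(request_edit: Callable[[], None], runs: int = 20) -> dict[str, float]:
    """Time repeated calls and return median and p95 latency in seconds."""
    if runs < 2:
        raise ValueError("need at least two runs to compute quantiles")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        request_edit()
        samples.append(time.perf_counter() - start)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(samples, n=20)[18]
    return {"median": statistics.median(samples), "p95": p95}


def fake_edit_request() -> None:
    # Hypothetical stand-in: sleeps briefly instead of calling a real editor.
    time.sleep(0.01)


stats = measure_latency(fake_edit_request, runs=10)
print(stats)
```

Reporting a tail percentile alongside the median matters here because occasional slow responses tend to shape how responsive a tool feels.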

The investor read

CursorBench is a strategic asset. It's not just a benchmark; it's a piece of marketing that frames the entire market debate as 'integrated AI editor vs. general IDE + plugin.' By publishing a quantitative, reproducible evaluation, Cursor forces competitors like Microsoft (VS Code/Copilot) to either engage on their terms or appear less transparent. This signals a maturing market where performance claims require evidence. For investors, the key question is whether the performance delta claimed by integrated solutions like Cursor is large enough to overcome the massive inertia of VS Code. If the gains are real and sustainable, Cursor is well-positioned. If they are marginal or easily replicated by incumbents, it remains a niche product. The next step is watching for independent verification of these results.

Sources · how we verified

CursorBench 3.1: https://cursor.com/evals

Reported by the Riley desk on Founderr Pulse’s Tools beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
