Coding · Head-to-Head

Claude Sonnet 4.6 vs. Gemini 2.5 Pro for Coding

The two frontier models most teams put on the shortlist for daily coding work. We ran them on the same repos for two weeks and graded the diffs, not the marketing.

Tested by Marcus Feld · July 2, 2026 · 4 rounds
Claude Sonnet 4.6 (Anthropic): 2 rounds won, 88 / 100 overall
vs
Gemini 2.5 Pro (Google DeepMind): 2 rounds won, 78 / 100 overall
The verdict

For most working developers, Claude Sonnet 4.6 is the better daily driver. It patches real bugs more reliably, it needs fewer follow-up prompts on multi-file work, and Claude Code is closer to a finished product than anything Google ships on the coding side. Gemini 2.5 Pro is the pick when you actually need what only Google offers: a genuinely huge context window at a low input price, native video and audio in the same prompt as your code, or deep integration into a Google Cloud stack you already run. On raw coding accuracy, Sonnet 4.6 is roughly 16 points ahead on SWE-bench Verified, which matched what we saw on real tickets. On price, Gemini 2.5 Pro is meaningfully cheaper per token below the 200K threshold. Pick Sonnet 4.6 for the code; pick Gemini 2.5 Pro when the workload is really a long-context or multimodal problem that happens to include code.

A lot of engineering teams are running this exact comparison in mid-2026. Both Claude Sonnet 4.6 and Gemini 2.5 Pro sit near the top of the frontier tier, both ship a one-million-token context window, and both are the default coding model in a widely used tool: Claude Code on Anthropic's side, Gemini CLI on Google's. The question isn't whether either one can write code. It's which one you should point at a real repo tomorrow morning.

We tested both models over two weeks on the same three codebases: a TypeScript/Next.js frontend, a Python service, and a small Go microservice. We used each vendor's own coding tool where possible (Claude Code for Sonnet 4.6, Gemini CLI plus Cursor for Gemini 2.5 Pro) so we were grading the product a normal developer would actually use, not a stripped-down API. Four rounds: how well each one patches real bugs, how well each one handles longer agentic sessions, what each one gives you at long context and on multimodal inputs, and what it actually costs to run. Each round names the procedure we used before it names a winner.

Round by round

Real-bug patching (SWE-bench-style tasks)
Winner: Claude Sonnet 4.6

How we tested: We ran both models on eight real GitHub issues drawn from our three test repos and eight tasks from the public SWE-bench Verified split, using one coding harness per model (Claude Code for Sonnet 4.6, Cursor with Gemini 2.5 Pro selected). For each issue we scored whether the produced patch compiled, whether the existing test suite passed, whether the intended failing test now passed, and how many follow-up prompts we needed to reach a mergeable diff. We also cross-checked against the vendor-reported and third-party SWE-bench Verified numbers.

Sonnet 4.6 patched more issues cleanly on the first pass and needed fewer follow-ups on the ones it didn't. The independent numbers line up with what we saw: on SWE-bench Verified, Sonnet 4.6 sits at 79.6% versus 63.8% for Gemini 2.5 Pro, a 15.8-point gap. In our own runs the gap was smaller than that headline (first-pass success on closer to a third of tickets for Sonnet versus about a quarter for Gemini), but it was consistent across all three repos. Anthropic's early Claude Code testing found users preferred Sonnet 4.6 over Sonnet 4.5 about 70% of the time, and rated it "significantly less prone to overengineering and 'laziness,' and meaningfully better at instruction following," which matches how the diffs read next to Gemini's on the same task.
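The four checks in this round's procedure map onto a small record per attempt. The sketch below is one way to tally them. The class, field names, and sample rows are hypothetical placeholders for illustration, not our harness and not our results.

```python
from dataclasses import dataclass


@dataclass
class PatchAttempt:
    """One model's attempt at one issue. All names here are illustrative."""
    compiles: bool             # did the first patch build?
    suite_passes: bool         # does the existing test suite still pass?
    target_test_passes: bool   # does the previously failing test now pass?
    follow_ups: int            # extra prompts needed to reach a mergeable diff

    def clean_first_pass(self) -> bool:
        # "Clean" here means every check passed with no follow-up prompt.
        return (
            self.compiles
            and self.suite_passes
            and self.target_test_passes
            and self.follow_ups == 0
        )


def first_pass_rate(attempts: list[PatchAttempt]) -> float:
    """Share of attempts that were mergeable without any follow-up."""
    if not attempts:
        return 0.0
    return sum(a.clean_first_pass() for a in attempts) / len(attempts)


# Placeholder rows only; these are NOT the review's measurements.
sample = [
    PatchAttempt(True, True, True, 0),
    PatchAttempt(True, True, False, 2),
    PatchAttempt(True, False, False, 3),
    PatchAttempt(True, True, True, 1),
]
print(first_pass_rate(sample))  # 0.25
```

Counting a follow-up as a miss is a deliberately strict choice: it separates "eventually mergeable" from "mergeable on the first try", which is the distinction this round turns on.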

Longer agentic sessions
Winner: Claude Sonnet 4.6

How we tested: We picked four multi-step tickets (a dependency upgrade, a small feature that touched frontend and backend, a refactor across ~15 files, and a flaky-test investigation) and gave each model up to one hour of agent time to open a working PR. We scored whether the agent finished without human rescue, how many times it lost the thread, and whether the final diff was mergeable.

Sonnet 4.6's adaptive thinking and context compaction were the deciding factors. On the Claude Platform, Sonnet 4.6 supports adaptive thinking, extended thinking, and a beta context-compaction mode that automatically summarizes older context as conversations approach limits, which let it keep running the refactor without us hand-managing what stayed in the window. Anthropic's own preference testing found users even preferred Sonnet 4.6 to Opus 4.5, the frontier model from November 2025, roughly 59% of the time in Claude Code. Gemini 2.5 Pro is a capable thinking model with the same 1M context, but on the four-ticket run it lost the plot twice on the refactor and once on the flaky-test investigation, in ways Sonnet did not.
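To make the idea of context compaction concrete, here is a minimal sketch of the general technique: once a conversation nears a token budget, older turns are replaced by a summary while recent turns stay verbatim. This is not Anthropic's implementation or API. The function names, the word-count token estimate, and the stub summarizer are all assumptions for illustration.

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def summarize(turns: list[str]) -> str:
    # Stub. A real system would ask a model to write this summary.
    return f"[summary of {len(turns)} earlier turns]"


def compact(turns: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Replace older turns with a summary once the estimated size exceeds `budget`."""
    total = sum(estimate_tokens(t) for t in turns)
    split = len(turns) - keep_recent
    if total <= budget or split <= 0:
        return list(turns)  # still fits, or nothing old enough to fold away
    return [summarize(turns[:split])] + turns[split:]


history = [
    "read the auth module first",
    "rename the helper in utils",
    "run tests",
    "fix lint",
]
print(compact(history, budget=10))
# ['[summary of 2 earlier turns]', 'run tests', 'fix lint']
```

The trade-off is the usual one: the summary frees room for new work, but anything it omits is gone for the rest of the session, so how aggressively to compact is a judgment call.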

Long context and multimodal inputs
Winner: Gemini 2.5 Pro

How we tested: We fed each model the same three long-context tasks: reason over a 400K-token codebase dump, answer questions grounded in a 90-minute engineering all-hands recording plus its transcript, and correlate a UI screencast with the frontend code that produced it. We scored factual grounding, retrieval accuracy, and whether the model could use the non-text inputs at all.

This is the round Gemini is built for. Gemini 2.5 Pro accepts text, images, video, audio, and PDFs as input, with a context window of up to 1,048,576 tokens and up to 65,536 output tokens. The video and audio handling is native: we could hand it the meeting recording directly rather than pre-transcribing it, and it produced the correct answers with time-coded references. Sonnet 4.6 also ships a 1M-token context window at standard $3/$15 pricing, with no long-context surcharge since Anthropic's June change, and it handled the pure-text codebase task well. But it has no native video or audio, so the meeting and screencast tasks required us to build a transcription step first. If your "coding" work actually spans recordings, product videos, or huge single-shot corpora, Gemini is the honest pick.

Price and running cost
Winner: Gemini 2.5 Pro

How we tested: We compared list prices from each vendor's official pricing page and modeled a month of real coding-agent usage: 2M input tokens and 500K output tokens per developer, with and without prompt caching and batch discounts, at both the ≤200K and >200K prompt tiers where Gemini's pricing shifts.

On the sticker, Gemini is cheaper. Gemini 2.5 Pro is $1.25 per 1M input tokens and $10.00 per 1M output tokens for prompts up to 200K tokens, rising to $2.50 in and $15.00 out for prompts above that threshold. Claude Sonnet 4.6 is $3 per 1M input and $15 per 1M output, flat, with no long-context surcharge after Anthropic's mid-2026 change. For a 2M-input / 500K-output month, that works out to roughly $13.50 on Sonnet 4.6 versus about $7.50 on Gemini 2.5 Pro at the low tier, a real if not enormous difference. Both vendors offer a 50% batch discount and aggressive prompt caching (Anthropic's cache reads run at 10% of the standard input price), which flattens the gap in practice. The point stands: for a team that runs a lot of tokens and doesn't need Sonnet's coding edge on every request, Gemini's headline rate is the cheaper base.
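The arithmetic behind those monthly figures is simple enough to check directly. The sketch below reproduces it from the list prices quoted above. The function name and the flat `batch_discount` multiplier are our own simplifications, and caching is left out.

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float,
                 batch_discount: float = 0.0) -> float:
    """Dollar cost for a month of usage; rates are USD per 1M tokens."""
    base = (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate
    # Assumption: the batch discount applies equally to input and output.
    return base * (1.0 - batch_discount)


INPUT, OUTPUT = 2_000_000, 500_000  # per developer, per month

sonnet = monthly_cost(INPUT, OUTPUT, 3.00, 15.00)
gemini_low = monthly_cost(INPUT, OUTPUT, 1.25, 10.00)   # prompts <= 200K tokens

print(f"Sonnet 4.6:     ${sonnet:.2f}")      # $13.50
print(f"Gemini 2.5 Pro: ${gemini_low:.2f}")  # $7.50

# With the 50% batch discount on both sides, the ratio is unchanged
# but the absolute gap halves, from $6.00 to $3.00.
sonnet_batch = monthly_cost(INPUT, OUTPUT, 3.00, 15.00, batch_discount=0.5)
gemini_batch = monthly_cost(INPUT, OUTPUT, 1.25, 10.00, batch_discount=0.5)
print(f"Batch gap:      ${sonnet_batch - gemini_batch:.2f}")  # $3.00
```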

Where Claude Sonnet 4.6 wins

Sonnet 4.6 is the better coding model, and the gap isn’t marketing. Sonnet 4.6 posts 79.6% on SWE-bench Verified, versus 63.8% for Gemini 2.5 Pro on the same benchmark, a 15.8-point spread. In our own two-week run across three repos, the gap on first-pass patches was smaller than the headline but consistent: Sonnet needed fewer follow-ups, and the diffs it produced were closer to what we would have written ourselves.

The tooling around the model matters just as much as the model. In Claude Code, Anthropic’s early testing found that users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. Users reported that it more effectively read the context before modifying code and consolidated shared logic rather than duplicating it. That made it less frustrating to use over long sessions than earlier models.

The preference held against a bigger model too: users preferred Sonnet 4.6 to Opus 4.5, the frontier model from November 2025, 59% of the time, and rated it significantly less prone to overengineering and “laziness,” and meaningfully better at instruction following.

One caveat worth naming for solo developers: Boris Cherny, Claude Code’s creator, still prefers Opus for all coding work, on the reasoning that the bottleneck isn’t token cost, it’s human time spent correcting AI mistakes. When a small SWE-bench gap translates to even slightly more errors on hard problems, the time cost of debugging outweighs the savings. That’s a real argument for teams reviewing every diff by hand. But between Sonnet 4.6 and Gemini 2.5 Pro specifically, Sonnet is the more accurate model, not the cheaper one.

Where Gemini 2.5 Pro wins

Gemini’s advantage isn’t the code itself; it’s what surrounds the code. Gemini 2.5 Pro is Google DeepMind’s flagship multipurpose model, tuned for hard reasoning, code, math, and multi-document analysis. Pricing is tiered by prompt size: $1.25 per 1M input tokens and $10.00 per 1M output tokens for prompts ≤200K tokens; $2.50 in and $15.00 out for prompts >200K. Claude Sonnet 4.6 is a flat $3/$15 per million tokens. For a team burning a lot of tokens on the smaller-prompt end, Gemini is the cheaper base by a real margin.
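Because Gemini's rate depends on prompt size, the comparison changes per request. The sketch below picks the tier from the prompt length and prices a single call. It assumes the higher rate applies to the whole request once the prompt exceeds 200K tokens. The function names and the example token counts are illustrative.

```python
GEMINI_TIER_THRESHOLD = 200_000  # prompt tokens


def gemini_request_cost(prompt_tokens: int, output_tokens: int) -> float:
    """USD for one Gemini 2.5 Pro call at the list prices quoted above."""
    if prompt_tokens <= GEMINI_TIER_THRESHOLD:
        in_rate, out_rate = 1.25, 10.00
    else:
        # Assumption: the whole request is billed at the higher tier.
        in_rate, out_rate = 2.50, 15.00
    return prompt_tokens / 1_000_000 * in_rate + output_tokens / 1_000_000 * out_rate


def sonnet_request_cost(prompt_tokens: int, output_tokens: int) -> float:
    """USD for one Claude Sonnet 4.6 call at the flat $3/$15 rate."""
    return prompt_tokens / 1_000_000 * 3.00 + output_tokens / 1_000_000 * 15.00


for prompt in (150_000, 400_000):
    g = gemini_request_cost(prompt, 8_000)
    s = sonnet_request_cost(prompt, 8_000)
    print(f"{prompt:>7} prompt tokens: Gemini ${g:.4f} vs Sonnet ${s:.4f}")
```

Under these assumptions, once a prompt crosses 200K tokens the output rates match at $15.00 and the only remaining difference is $0.50 per 1M input tokens, so the price advantage is concentrated in the smaller-prompt tier.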

The other place Gemini pulls ahead is multimodal input. The model accepts text, images, video, audio, and PDFs as input, returns text, and works with long contexts of up to 1,048,576 tokens in and 65,536 tokens out, enough for codebases, research dossiers, or agent chains without constant truncation. If your “coding” task actually involves a meeting recording, a product walkthrough video, or a screencast of the bug, Gemini can take that directly. Sonnet 4.6 can’t, at least not natively.

Sonnet has closed one of Gemini’s older advantages, though. Anthropic removed the long-context pricing surcharge for Claude Opus 4.6 and Sonnet 4.6 in mid-2026, so the 1-million-token context window is now generally available at standard per-token rates instead of the premium rates that previously kicked in once prompts crossed a certain size threshold. That takes one of Gemini’s structural pitches, cheap giant prompts, off the table for pure text work.

Who should pick which

Pick Claude Sonnet 4.6 if the work is code. Multi-file refactors, real GitHub issues, long agent sessions inside a repo, or anything where you’re graded on whether the tests pass: Sonnet is the more accurate model, and the Claude Code experience around it is more polished than Google’s coding surface today. Expect to pay somewhat more per token, and expect the savings to come from needing fewer retries.

Pick Gemini 2.5 Pro if what you actually need is long-context reasoning, native video or audio inside the same prompt, or the cheaper input rate on high-volume, sub-200K-token workloads, and you’re willing to trade a real accuracy gap on coding-specific benchmarks to get it. Teams already living in Google Cloud, Vertex AI, and Workspace get integrations Anthropic simply doesn’t ship.

One thing worth flagging for anyone buying this quarter: Google previews new Gemini models on roughly a two-month cadence, and Anthropic shipped Sonnet 4.6 in February 2026 with Opus 4.7 and 4.8 following. If you make this decision today, revisit it in the fall. The gap on SWE-bench Verified is wide enough right now that we wouldn’t hedge, but the model that wins this comparison in July may not be the model that wins it in October.

Sources