Coding · Buying Guide

The Best Autonomous AI Coding Agents

We ran five async coding agents on the same backlog for six weeks: bug tickets, refactors, and feature stubs across real repos. One pick stood out, but the right one depends on where your team already lives.

Tested by Marcus Feld · June 5, 2026 · 5 tools ranked

The verdict

For most engineering teams, Claude Code is the autonomous coding agent we recommend. It produced the cleanest patches against our real backlog, handled multi-file refactors without losing the plot, and pairs sensibly with a subscription (reliable access now starts at Max 5x, $100/month) or pay-as-you-go API billing. If your team already lives in ChatGPT, OpenAI's Codex cloud agent is the easier path and runs tasks in parallel, isolated sandboxes. GitHub Copilot's coding agent is the right answer if "assign the ticket, come back to a PR" is the workflow you want. Devin is the most autonomous of the bunch, and the most expensive. We don't think anyone needs more than one of these, and we don't think anyone should pick Devin first.

This guide covers a specific kind of tool: autonomous coding agents you delegate a task to and review later, not pair-programmer IDEs like Cursor or inline-completion tools like the original Copilot. The category split into two architectures over the last year: IDE-embedded assistants that suggest as you type, and agent-first tools that plan, edit, run tests, and open a PR while you work on something else. We're ranking the second kind.

We tested five agents over six weeks on the same backlog: a mix of bug tickets, multi-file refactors, dependency upgrades, and small feature stubs across two real Python and TypeScript repositories. Same tasks, same acceptance criteria, same reviewer. We graded against a human-written reference PR for each task and tracked cost, completion rate, and how often we had to step in. The agents that won most often weren't always the most autonomous, which turned out to be the most useful finding of the test.

How we tested

We ran the same 40-ticket backlog through each agent over six weeks, then graded each pull request against a reference PR written by a senior engineer. We weighted task completion and code quality most heavily, then cost per task, autonomy (how often we had to intervene), codebase context handling, and integration with the tools teams already use. Scores are out of 100.
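
As a rough sketch of how a weighted total out of 100 can be computed from the six category scores: the weights below are hypothetical. The text above gives only the priority order, not the actual numbers, so this will not necessarily reproduce the published totals.

```python
# Hypothetical weights: completion and quality weighted most heavily,
# per the methodology text. The real weights are not published here.
WEIGHTS = {
    "task_completion": 0.25,
    "code_quality": 0.25,
    "cost_per_task": 0.15,
    "autonomy": 0.15,
    "codebase_context": 0.10,
    "workflow_integration": 0.10,
}


def overall_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of category scores, each on a 0-100 scale."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    total_weight = sum(weights.values())
    # Dividing by the total keeps the result on a 0-100 scale
    # even if the weights do not sum exactly to 1.
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Category scores for Claude Code as listed in "The picks".
claude_code = {
    "task_completion": 90,
    "code_quality": 94,
    "cost_per_task": 84,
    "autonomy": 88,
    "codebase_context": 95,
    "workflow_integration": 90,
}

print(round(overall_score(claude_code), 1))
```

Changing the weights shifts the ranking only where two agents trade off strengths, for example autonomy against cost.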

Task completion

Across 40 real tickets (22 bug fixes, 10 multi-file refactors, 6 dependency upgrades, 2 small feature stubs) we counted what share each agent shipped as a mergeable PR on the first attempt, with the project's existing tests green and no obvious regressions. A second engineer reviewed every PR blind to which agent produced it.

Code quality

For every shipped PR, two reviewers scored the diff blind on a 10-point rubric covering correctness, fit with existing patterns, test coverage, and how much editing was needed before it could merge. We averaged the two scores per task and per agent.
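
The two-step averaging described above (mean of the two reviewers per task, then mean across tasks per agent) can be sketched as follows. The record layout and sample values are made up for illustration, and how the 10-point rubric maps onto the 0-100 category score is not stated in the guide, so that step is left out.

```python
from collections import defaultdict
from statistics import mean


def quality_by_agent(reviews):
    """reviews: iterable of (agent, ticket_id, reviewer_a, reviewer_b) on a 10-point rubric."""
    per_task_means = defaultdict(list)
    for agent, _ticket_id, reviewer_a, reviewer_b in reviews:
        # Step 1: average the two blind reviewers for this task.
        per_task_means[agent].append((reviewer_a + reviewer_b) / 2)
    # Step 2: average the per-task means for each agent.
    return {agent: mean(values) for agent, values in per_task_means.items()}


# Illustrative data only, not results from the test.
sample_reviews = [
    ("agent-x", "T-1", 8, 9),
    ("agent-x", "T-2", 7, 7),
    ("agent-y", "T-1", 6, 8),
]

print(quality_by_agent(sample_reviews))
```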

Cost per task

We logged the actual billed cost for every task on each agent's standard paid plan (Claude Pro/Max subscription, ChatGPT Plus, Copilot Pro, Devin Core with ACUs, Replit Core with credits) and divided by the number of mergeable PRs. Failed runs are included in the numerator, since you pay for those too.
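
In code, the metric is total billed cost across every run, failed or not, divided by the count of mergeable PRs. A minimal sketch, with a hypothetical `Run` record and made-up dollar amounts:

```python
from dataclasses import dataclass


@dataclass
class Run:
    billed_usd: float  # what the run actually cost
    mergeable: bool    # did it produce a mergeable PR?


def cost_per_mergeable_pr(runs):
    """Total spend (including failed runs) divided by mergeable PRs."""
    merged = sum(1 for run in runs if run.mergeable)
    if merged == 0:
        return None  # undefined: money was spent but nothing shipped
    return sum(run.billed_usd for run in runs) / merged


# Illustrative numbers only: two successes and one failed run.
runs = [Run(1.20, True), Run(0.80, False), Run(2.00, True)]
print(cost_per_mergeable_pr(runs))
```

Keeping failed runs in the numerator means an agent that is cheap per attempt but fails often can still come out expensive per shipped PR.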

Autonomy

For every ticket we counted how many times the agent needed a human nudge to keep moving: a follow-up prompt, a manual file fix, or a re-run with corrected context. A score of 100 means the ticket went from assignment to PR with zero interventions; a score of 0 means we ended up writing it ourselves.

Codebase context

We seeded each repo with the agent's recommended config file (CLAUDE.md, AGENTS.md, .cursor/rules, etc.) and ran a fixed set of 12 'where does this live' questions that required understanding patterns across more than one file. We scored each answer for accuracy against the actual code.

Workflow integration

We ran each agent through the same delegation flow a real team uses: a Linear ticket, a GitHub issue, a Slack thread, and an IDE chat. We scored whether the agent could be triggered from each surface, whether the resulting PR landed in the right repo with the right labels, and whether it could be iterated on with a comment.

The picks

Our pick · Claude Code · Anthropic
91 / 100

The cleanest patches in testing, and the most predictable bill if you stay on a subscription.

Best for: Engineering teams that want a terminal-native agent on a real codebase, with the option to grow into API billing as usage scales

What we liked

  • Produced the highest-rated diffs in our blind code review, with the fewest regressions on multi-file refactors
  • Sonnet 4.6 and Opus 4.6/4.7 carry a 1M-token context window at standard rates, which kept large refactors in a single session
  • Token economy is honest: independent tests have shown Claude Code uses materially fewer tokens per task than several competitors

What to know

  • Claude Code is no longer included on the $20 Pro plan in some experiments; reliable access starts at Max 5x ($100/month) or Team Premium ($100/seat with a 5-seat minimum)
  • Heavy users routinely report $100-$250 per developer per month on API billing once Agent Teams and Opus enter the mix

How it scored

Task completion 90
Code quality 94
Cost per task 84
Autonomy 88
Codebase context 95
Workflow integration 90

Runner-up · Codex · OpenAI
87 / 100

The easiest path for teams already paying for ChatGPT, with strong parallel cloud sandboxes.

Best for: Teams already on ChatGPT Plus, Pro, or Business who want async cloud tasks alongside a CLI

What we liked

  • Each task runs in its own isolated sandbox with full filesystem access, which makes it safe to run several agents in parallel
  • Codex is included in ChatGPT Plus ($20/mo), Pro ($200/mo), Business ($25-$30/user/mo), Edu, and Enterprise, so most teams already have access without a new contract
  • GPT-5.3-Codex leads Terminal-Bench 2.0 at 77.3% and SWE-bench Pro at 56.8%, the harder, less contaminated benchmark

What to know

  • OpenAI moved Plus, Pro, Business and Enterprise to token-based credit billing in April 2026, and heavy users now report roughly $100-$200/developer/month
  • Image generation, fast mode, and added MCP servers all draw from the same credit pool, so real cost is harder to forecast than the sticker price

How it scored

Task completion 88
Code quality 86
Cost per task 82
Autonomy 86
Codebase context 88
Workflow integration 90

Also great · Copilot coding agent · GitHub
83 / 100

The pick for teams whose backlog already lives in GitHub Issues.

Best for: Teams on GitHub Business or Enterprise who want to assign tickets to an agent and review draft PRs

What we liked

  • Generally available to all paid Copilot subscribers since March 2026; you assign a GitHub issue to @copilot and it works asynchronously in a GitHub Actions environment until a draft PR is ready
  • The unified agents panel can also delegate to Claude or Codex, so the GitHub workflow doesn't lock you to one model
  • Agentic code review can hand findings directly to the coding agent and produce a fix PR, which closed the loop on several of our smaller bug tickets

What to know

  • The coding agent consumes premium requests and GitHub Actions minutes, and Business/Enterprise admins must enable it from the Policies page before anyone on the team can use it
  • A March 30, 2026 incident in which Copilot inserted promotional 'tips' into thousands of PRs is a reminder that AI-generated diffs need the same review as a junior's

How it scored

Task completion 82
Code quality 82
Cost per task 86
Autonomy 84
Codebase context 80
Workflow integration 96

Also great · Devin · Cognition
80 / 100

The most autonomous agent in the test, and the one most likely to blow up your budget.

Best for: Engineering teams with a steady backlog of well-defined migration, refactor, or boilerplate work to delegate

What we liked

  • Devin runs each task in its own sandboxed VM with a browser, terminal, and editor, and is the only mainstream agent that genuinely operates in the 'assign a Linear ticket, get a reviewed PR back' mode
  • Integrates with GitHub, GitLab, Linear, Jira, Slack, Microsoft Teams, and 20+ other tools, with documented enterprise deployments at Goldman Sachs, Citi, Mercedes-Benz, Dell, and NASA
  • Devin 2.0 dropped the floor price from $500 to a $20 Core plan with pay-as-you-go ACUs at $2.25 each

What to know

  • ACU billing escalates quickly: a moderately complex refactor can consume 5-20 ACUs ($11-$45), and 50 such tasks a month can run roughly $560-$2,250 on top of the base fee
  • Devin 2.0 scores 45.8% on SWE-Bench Verified in its standard unassisted evaluation, well below several agents built on stronger base models with better-optimized scaffolds

How it scored

Task completion 78
Code quality 76
Cost per task 68
Autonomy 96
Codebase context 82
Workflow integration 88

Budget pick · Replit Agent · Replit
74 / 100

The fastest path from a prompt to a deployed prototype, and the wrong pick for a production codebase.

Best for: Solo builders and prototype-stage teams who want to ship a small app without leaving the browser

What we liked

  • Agent 3 scaffolds full-stack apps with frontend, backend, database, and deploy from a single prompt, and writes and runs its own tests in a reflection loop
  • Core at $20/month (annual) includes $25 in monthly credits and full Agent access, and the new Pro plan at $95/month annual covers up to 15 builders
  • The entire stack runs in the browser with no local setup, which is genuinely useful for short prototype work and for non-developers

What to know

  • Effort-based credit billing means a single 'fix this bug' request can spawn six to eight billable subagent operations, and users report monthly bills 3-4x their subscription once Agent runs heavily
  • It is not a fit for an existing production codebase; we measured the lowest code-quality scores here and the highest rate of unrequested changes to files we didn't ask the agent to touch

How it scored

Task completion 74
Code quality 68
Cost per task 72
Autonomy 80
Codebase context 70
Workflow integration 78

At a glance

Tool | Our take | Best for | Score
Claude Code (Our pick) | The cleanest patches in testing, and the most predictable bill if you stay on a subscription. | Engineering teams that want a terminal-native agent on a real codebase, with the option to grow into API billing as usage scales | 91
Codex (Runner-up) | The easiest path for teams already paying for ChatGPT, with strong parallel cloud sandboxes. | Teams already on ChatGPT Plus, Pro, or Business who want async cloud tasks alongside a CLI | 87
Copilot coding agent (Also great) | The pick for teams whose backlog already lives in GitHub Issues. | Teams on GitHub Business or Enterprise who want to assign tickets to an agent and review draft PRs | 83
Devin (Also great) | The most autonomous agent in the test, and the one most likely to blow up your budget. | Engineering teams with a steady backlog of well-defined migration, refactor, or boilerplate work to delegate | 80
Replit Agent (Budget pick) | The fastest path from a prompt to a deployed prototype, and the wrong pick for a production codebase. | Solo builders and prototype-stage teams who want to ship a small app without leaving the browser | 74

If your team doesn’t have a backlog of well-defined work to delegate, you probably don’t need any of these. The reason to use an autonomous coding agent is sustained, repeatable engineering work: bug tickets with clear reproduction steps, dependency upgrades, migrations, boilerplate features, security patches, the kind of work that would otherwise sit in your queue. We tested for that, not for greenfield product development.

Who this is for

This guide is for engineering teams of three or more with at least 20-30% of their ticket queue in shape to hand to an agent: clear acceptance criteria, decent test coverage, and a code review process that catches what an agent gets wrong. If you’re a solo developer building a prototype, Replit Agent is the easier place to start and you can skip the rest. If you’re working in a codebase nobody else has touched and the tests are aspirational, none of these tools are the answer.

Our pick: Claude Code

Claude Code won on the part of the test that matters most: the diff. Two reviewers scored every PR blind, and the patches that came out of Claude Code were the closest to what a senior engineer would have written. Multi-file refactors stayed coherent. Tests it added matched the conventions of the tests already in the repo. When it broke something, it was usually because the ticket itself was ambiguous, not because the model went off the rails.

The mechanical reason is Claude Opus 4.6/4.7 and Sonnet 4.6, which back the agent. Sonnet 4.6 scores 79.6% on SWE-bench Verified, and Opus 4.5/4.6 sit at 80.9% and 80.8% respectively at the top of the published Verified leaderboard. Independent testing also shows Claude Code uses materially fewer tokens per task than several alternatives, which is the difference between a $150 month and a $300 month on the same workload.

The trade-offs are real. In April 2026 Anthropic experimented with removing Claude Code from the $20 Pro plan; access on Anthropic’s own consumer plans now most reliably starts at Max 5x ($100/month) or Max 20x ($200/month), and team access requires Team Premium ($100/seat with a 5-seat minimum) or Enterprise. CloudZero reports a development team running Claude Code across 10 engineers typically lands at $150-$250 per developer per month once Agent Teams and Opus are in the mix. The 5-hour rolling session window and the weekly active-compute cap also caught us off guard in week two of testing; we ended up keeping API credentials on file for overflow.

Runner-up: OpenAI Codex

Codex is the easier pick for teams that already pay for ChatGPT, and the harder pick to argue against. Each task runs in its own sandbox with full filesystem access, internet, and no cross-contamination between sessions, and the Codex macOS app lets you manage multiple agents across projects in parallel cloud environments. On the benchmark that matters most for an autonomous agent, Terminal-Bench 2.0, which measures terminal workflows, GPT-5.3-Codex leads at 77.3% vs Claude’s 65.4%. On SWE-bench Pro, the harder, less contaminated benchmark, Codex edges Claude at 56.8% to 55.4%.

The catch is OpenAI’s billing model. As of April 2, 2026, Codex switched to token-based credit billing for Plus, Pro, and Business plans, and extended that to existing Enterprise plans on April 23. Credits are still the unit you buy, but actual consumption depends on input, cached input, and output tokens per task, plus model choice (GPT-5.5 vs GPT-5.4-Mini vs the Codex models). OpenAI’s own help center says Codex averages roughly $100-$200 per developer per month with large variance. The honest version: if you’re a Plus subscriber doing a handful of focused sessions a week, Plus is plenty. If you’re delegating multiple cloud tasks a day, plan for Pro at $200/month or expect to top up credits.

If your backlog is in GitHub: Copilot coding agent

GitHub’s coding agent is the right answer if “I assigned the ticket and a draft PR appeared an hour later” is the workflow you want, and your backlog already lives in GitHub Issues. It became generally available to all paid Copilot subscribers in March 2026. You assign an issue to @copilot, the agent works asynchronously in a GitHub Actions environment, and a draft PR appears when it’s done. It handles branch creation, commit messages, and pushes automatically, and you iterate on the PR by leaving comments mentioning @copilot.

Two things to know. First, coding-agent runs consume premium requests and GitHub Actions minutes on top of your Copilot plan, and on Copilot Business or Enterprise an administrator has to enable the agent from the Policies page before anyone on the team can use it. Second, there is the March 30, 2026 incident in which Copilot inserted promotional messages into pull requests; over 11,400 PRs contained the same promotional text before GitHub reversed course the same day. The lesson is the obvious one: review every PR, including the ones the agent told you were straightforward.

The most autonomous: Devin

Devin is the only mainstream agent in the test that genuinely operates in the “go to bed, wake up to a reviewed PR” mode. It runs each task in a sandboxed VM with its own browser, terminal, and editor; it indexes your codebase to learn your conventions through Playbooks and a Devin Wiki; and it integrates with GitHub, GitLab, Linear, Jira, Slack, Microsoft Teams, and roughly 20 other tools. Enterprise deployments include Goldman Sachs, Citi, Mercedes-Benz, Dell, Santander, Palantir, NASA, and units of the US Army and Navy. Mercedes-Benz has publicly reported compressing an eight-month legacy modernization project to eight days using Devin.

That’s real. The reasons we ranked it fourth instead of first are also real. On the standard unassisted SWE-Bench Verified evaluation, Devin 2.0 scores 45.8%, well below agents built on stronger base models with more-optimized scaffolds. ACU billing makes total cost hard to forecast: the Core plan is $20/month with ACUs at $2.25 each (one ACU is roughly 15 minutes of active compute), Team is $500/month with 250 ACUs included at $2.00 each, and Enterprise is custom. A moderately complex refactor consumes 5-20 ACUs ($11-$45); 50 of those a month is roughly $560-$2,250 on top of the base fee. The most common mistake we see teams make in this category is starting with Devin because it sounds the most impressive, then burning $500 a month on tasks Claude Code or Codex would have handled in real time.
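
For budgeting, the ACU arithmetic is easy to work through. The sketch below uses only the Core-plan figures quoted above ($20 base, $2.25 per ACU, 5-20 ACUs per moderately complex refactor); the function name and the 50-task volume are illustrative. At exactly $2.25 per ACU, the low end for 50 tasks works out to $562.50.

```python
CORE_BASE_USD = 20.00  # Core plan monthly fee, as quoted in this guide
CORE_ACU_USD = 2.25    # pay-as-you-go price per ACU on Core


def monthly_acu_cost(tasks: int, acus_per_task: float, acu_price: float = CORE_ACU_USD) -> float:
    """ACU spend for a month of similar tasks, excluding the base fee."""
    return tasks * acus_per_task * acu_price


low = monthly_acu_cost(50, 5)    # 50 refactors at the cheap end (5 ACUs each)
high = monthly_acu_cost(50, 20)  # 50 refactors at the expensive end (20 ACUs each)

print(f"ACU spend: ${low:,.2f} to ${high:,.2f}")
print(f"With base fee: ${low + CORE_BASE_USD:,.2f} to ${high + CORE_BASE_USD:,.2f}")
```

The spread between the two ends is a factor of four on identical task counts, which is why per-task ACU consumption matters more than the plan price.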

The budget/prototype pick: Replit Agent

Replit Agent 3 belongs in this guide for one reason: if your goal is to go from a prompt to a deployed prototype in an afternoon, nothing else is faster. Agent 3 scaffolds a full stack with frontend, backend, database, auth, and deploy, then writes and runs its own tests in a reflection loop. Replit’s own materials describe a proprietary testing system that’s “3x faster and 10x more cost effective than Computer Use Models” for the in-browser test runner. Replit Core is $20/month annual ($25 monthly) and includes $25 in monthly credits; the new Pro plan, which replaced Teams in February 2026, is $95/month annual for up to 15 builders.

Replit is also where we measured the most painful pricing surprises. Agent 3 uses effort-based credit pricing rather than the old flat-checkpoint model, and its habit of spawning specialized subagents means a single “fix this bug” can produce six to eight billable operations. Documented user bills show 3-4x their subscription cost in overages on heavy weeks, and InfoWorld coverage flagged complaints about Agent 3 “forcefully applying changes not requested or desired.” None of that disqualifies Replit for prototype work. It does mean you shouldn’t point it at an existing production codebase, and that you should set hard spending caps before the first run and treat its output as a starting point rather than a deliverable.

How to choose between them

The decision tree is short. If your team writes serious code and you want the best diffs, pick Claude Code, accept the subscription cost, and plan for API overflow. If you already pay for ChatGPT and want async cloud tasks without onboarding a new vendor, pick Codex. If your backlog lives in GitHub Issues and you mostly want to assign tickets to an agent and review draft PRs, pick GitHub’s coding agent. If you have a steady stream of migration or refactor work and engineering leadership is willing to defend the bill, Devin’s ceiling is genuinely higher than anything else in the test. If you’re prototyping in a browser, Replit Agent. We wouldn’t pay for more than one of these at a time, and we wouldn’t start with Devin.
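
The same decision tree, written as a short function. The field names are invented for this sketch, and checking the conditions in the order the paragraph lists them is an assumption; a team that matches several branches should weigh them rather than take the first hit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Team:
    # Hypothetical flags mirroring the conditions in the paragraph above.
    wants_best_diffs: bool = False
    pays_for_chatgpt: bool = False
    backlog_in_github_issues: bool = False
    steady_migration_backlog: bool = False
    leadership_accepts_bill: bool = False
    prototyping_in_browser: bool = False


def recommend(team: Team) -> Optional[str]:
    """Return the first matching pick, in the order the guide lists them."""
    if team.wants_best_diffs:
        return "Claude Code"
    if team.pays_for_chatgpt:
        return "Codex"
    if team.backlog_in_github_issues:
        return "Copilot coding agent"
    if team.steady_migration_backlog and team.leadership_accepts_bill:
        return "Devin"
    if team.prototyping_in_browser:
        return "Replit Agent"
    return None  # no clear fit: the team may not need an agent yet


print(recommend(Team(pays_for_chatgpt=True, backlog_in_github_issues=True)))
```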

Frequently asked questions

What is an autonomous coding agent, and how is it different from Cursor or Copilot autocomplete?

An autonomous coding agent accepts a task, plans the steps, edits files, runs tests, and produces a pull request with minimal human guidance. Tools like Cursor and the original GitHub Copilot are pair programmers: they suggest code as you type and wait for your next instruction. Claude Code, Codex, GitHub's coding agent, Devin, and Replit Agent are in the agent category. You delegate work, you review the result. Most teams will end up using one of each, not two of the same kind.

Which agent is the best for most teams?

Claude Code, in our six weeks of testing. It produced the cleanest diffs against real tickets, handled large multi-file refactors better than any other tool, and is the most predictable on cost if you stay on a subscription. Codex is the right pick if you already pay for ChatGPT and want async cloud tasks without a new vendor. GitHub's coding agent is the right pick if your backlog lives in GitHub Issues and the team is on Business or Enterprise.

Is Devin worth $500 a month?

Only for teams with a real backlog of well-defined, repeatable work to delegate: migrations, dependency upgrades, repetitive refactors, security patches. Devin's Team plan is $500/month and includes 250 ACUs at $2.00 each; extra ACUs are $2.25 on the Core plan. Cognition has published large enterprise wins, including Mercedes-Benz reporting that an eight-month legacy modernization was compressed to eight days. For a small team with mixed work, the ACU bill on complex tasks is the biggest risk: a moderately complex refactor can run $11-$45 in compute alone.

How often do you re-test these rankings?

We re-run the rubric when a tool changes its model, its pricing, or its sandbox architecture, and we date every verdict so you can see how current it is. This category moves quickly. Anthropic split Claude usage into human-in-the-loop and autonomous workflow buckets in May 2026, OpenAI moved Codex to token-based credit billing in April 2026, and Replit retired its Teams plan and launched a new Pro tier in February 2026. Each of those moved at least one score, and we updated the guide accordingly.