What is Agenticness, anyway?

Nine dimensions for judging AI agents, including the one my first pass missed

May 26, 2026

“Agentic” is doing a lot of work this year. Every coding tool says it’s agentic. Every model release says the new version is more agentic than the last. Investors put it in deck headers. I run a directory built on the word, and I’ve been asked - more than once - what I think it actually means.

Honest answer: I’m still figuring it out. But the figuring has shape, and the shape is what I want to write down.

Working theory: a tool is agentic to the degree it does work you’d otherwise have to drive yourself. That sounds banal until you try to score it. Does ChatGPT count? It returns answers without you typing each word - but you’re still the loop. Does a Bash script count? It runs without you - but it doesn’t decide anything. The word lives in the gap between “things that respond to you” and “things that act on your behalf without checking back.” Different tools sit at different points in that gap. The useful question is how to put numbers on the position.

So I tried.

Nine dimensions, one number out of 36

The version I’m running today scores every listed tool on nine axes, 0–4 each, total possible 36. Briefly:

Action capability - can the tool reach beyond text? Write files, run commands, hit APIs, drive a browser, edit code in place.
Autonomy - how long can it go without a human checking back in? Single turn, short loop, multi-step plan, multi-day mission.
Planning - does it generate a plan before acting, hold a TODO list, decompose problems into subtasks?
Adaptation - when something fails, does it detect the failure and retry with a meaningfully different strategy? Or does it just try the same thing again?
State continuity - does it remember what happened across turns, across sessions, across users? Or does every interaction start from cold?
Reliability - is the underlying infrastructure stable, is the output consistent across runs, are there benchmarks?
Interoperability - does it plug into the rest of your stack, support standard protocols (MCP, OpenAI-compatible APIs, IDE extensions)?
Safety - does it ask before destructive operations, sandbox dangerous tools, log what it did?
Operator Sovereignty - how much of the tool is yours to control? Open source or closed? Bring your own model key or locked-in? Self-hostable or SaaS-only? Pricing metered or opaque?

The full per-axis scoring criteria are on the directory. The point of the list here is the shape of the answer: nine axes that, between them, capture most of what people seem to mean when they call something agentic.

It is almost certainly incomplete.

How I know it’s incomplete

For this pass I scored 18 coding agents. Every one got a number. In most categories I cover, the scores roughly track what the audience clicks on - the tools I think are good get found. Working as intended.

But in coding agents, the #1 tool by clicks - 332 of them over the 30 days ending 2026-05-22, top of the category by a comfortable margin - was scoring fourth on my rubric. The tool is Cline. Apache 2.0 VS Code extension, bring your own model key across about 30 providers. By my score it sat behind Claude Code, Cursor, and Windsurf. By the click data, the audience was treating it like the obvious answer.

I sat on that disagreement for a couple of months. It bothered me the way an experiment that almost works bothers you. The data wasn’t noise. The rubric wasn’t lazy. They just disagreed.

So last week I did the thing I should have done sooner: I hired deep research to figure out what I was missing.

The brief was simple. Here are 18 coding agents. Here’s my rubric. The market is preferring one tool to a degree my rubric doesn’t justify. Find me a dimension the rubric isn’t measuring that the audience clearly values.

The answer came back fast. Operator Sovereignty. How much of the tool is yours to control. I’d measured action capability and autonomy and planning and adaptation and continuity and reliability and interop and safety - and I’d quietly assumed deployment shape would wash out in the rest. It does not wash out. It is its own axis. And the audience picking Cline was telling me, with their clicks, that it’s a load-bearing axis.

Cline scores a 3 out of 4 on it. Cursor scores a 1. Claude Code scores a 2. Devin scores a 0 or 1 depending on how you read the enterprise pricing. The axis is real, and it separates tools that look interchangeable on the others.

So the rubric got a 9th dimension. Two other tightenings landed in the same pass - the Adaptation bar moved up (a tool now has to autonomously detect failure and retry with a meaningfully different strategy to score a 3, not just expose a retry hook), and the Reliability dimension split into two lenses (infrastructure SLAs vs output consistency, because penalizing local-execution CLIs for not publishing uptime they have no surface to publish was wrong, and now isn’t).

The reshuffled top of the table

After re-running all 18 listings against the updated dimensions:

Cline - 19/36, tied for #1. The paradox didn’t just resolve. It inverted. The market wasn’t preferring a mid-rubric tool. The rubric was failing to measure what the market valued.
Windsurf - 19/36. Codeium’s IDE. Quietly strongest on state continuity. Public status page running 99.95% over the last 90 days.
Tied at 18/36, five-way: OpenHands (open-source, 74.3K GitHub stars, 68.4% on SWE-bench Verified - the strongest open-stack number I’ve seen published), Claude Code, GitHub Copilot, Cursor, Factory AI’s Droid.

Underneath: Augment Code and Zencoder at 17, Kilo Code at 15, Devin at 14, Replit Agent at 13, Bolt and Kiro at 11, Codex CLI at 10, Manus at 9, Aider at 7, v0 at 4.

A couple of honest notes. The Cline number could plausibly be 20 with reproducible-build evidence; the evaluator wanted that before pushing to 4 on Sovereignty. The Devin number is wrong in an instructive way - the automated crawl pulled Cognition’s marketing pages rather than the Devin 2.0 product docs, which have specific primary-source evidence for autonomous retry behavior they call “Self-Healing Code.” Devin should probably be at 17 or 18. The failure mode of automated evaluation against vendor SEO. (The fix is letting vendors claim their listings; I’ll get to that.)

Aider at 7 is going to upset some people. Aider is one of the most beloved CLI tools in this whole space. Its low score is honest about a specific thing: by default it does not autonomously test-and-retry. The --auto-test flag exists; it’s off. That’s the gap. Open-source plus BYOK is a feature, not a substitute for agent-loop maturity. The Cline / Kilo Code / OpenHands top tier proves you can have both.

What fell out that I wasn’t looking for

The research pass surfaced a second thing. Most coding agents in the matrix do not autonomously retry when work breaks.

Five tools clear the new Adaptation bar - autonomous detection of failure plus meaningfully different retry, without the user kicking the loop:

Cursor - Bugbot Autofix spawns cloud agents in virtual machines. Resolution rate moved from 52% to 76% after the February release.
GitHub Copilot - the agent mode runs a documented plan-execute-verify-iterate loop. From Microsoft Learn: “If it detects problems - a failing test, a compilation error, or a linter warning - it moves to the next stage … This loop continues until the task passes or it determines it cannot resolve the issue.”
Devin - from Cognition’s docs: “Self-Healing Code: When Devin writes code that fails compilation or tests, it reads the error logs, iterates on the code, and fixes the issue autonomously.”
Factory’s Droid - Mission architecture with checkpoint-based self-correction.
Claude Code - subagent failure isolation plus a StopFailure hook that fires on rate limits, auth errors, tool failures and triggers automatic retry-and-fallback.

Thirteen tools don’t clear that bar. That’s most of the field. The receipts for the ones that do are linkable, primary-source, recent. The receipts for the ones that don’t are mostly silence - nobody documents the absence of a feature. So the rubric scores conservatively against documented evidence, and where it hedges, it names what would shift the score.

What I’d actually pay for

Three tools, three different reasons:

Cline. Apache 2.0 means I can read what it’s doing. BYOK means I can pin my own provider and pay metered. The agent loop is mature. If it broke tomorrow and the maintainer ghosted, I could fork it and keep going.

Claude Code. For the work where I want the tightest loop with Anthropic’s models specifically, and where I trust the closed binary in exchange for integration depth. The MCP support and subagent spawning earn the 2 on Sovereignty.

Cursor. I’d subscribe specifically for the Bugbot Autofix behavior. The autonomous retry is the dimension I’d be paying for, and Cursor is the one that’s actually shipped it at a measurable rate.

Three tools, three deployment shapes, three different reasons. Without the Sovereignty dimension all three would look roughly interchangeable in the rankings. With it, they don’t - and the reason I’d pick each one shows up explicitly in the score breakdown.

So, what is agenticness?

Nine dimensions, today. The version sitting on the directory right now is my current best working answer. It’s specific enough to be wrong in legible ways - which is the only kind of being-wrong that updates - and the Cline episode is the case study for what that updating looks like in practice. I had the data the whole time. I had the click counts. I trusted the rubric enough to mostly ignore the months it was disagreeing with the data in one category. The world was not confused. The world was telling me about a dimension I hadn’t scored.

I expect there’ll be a tenth axis at some point. Maybe an eleventh. The point of writing the dimensions down isn’t to be right. The point is to have a thing that’s specific enough to be wrong out loud, when the world disagrees with it.

If you’re a builder and your tool is on the directory and the score looks wrong, claim the listing (there’s a button on every tool page) and document the proof. The next re-eval picks it up. A rubric that can’t be challenged is just an opinion in a table.

If you’re a user and the dimensions don’t match what you think “agentic” should mean - tell me. The working theory wants the disagreement. That’s how it moves.

Uncaged Minds

Discussion about this post

Ready for more?