Evaluating Coding Agents: Building a Rubric That Scales
Every team I talk to wants to know which coding agent is "best" for their codebase. The honest answer is that nobody knows until they measure, and most teams do not measure because measurement is harder than it sounds. A rubric that scales past a handful of cherry-picked tasks takes real work to build, but once you have one, every future model release, every new agent, every workflow change becomes something you can evaluate instead of argue about.
The Trap of Cherry-Picked Demos
The default way teams compare agents is by running the same three tasks through each one and declaring a winner. This tells you almost nothing. The tasks are usually chosen to showcase what the preferred agent handles well. The task distribution does not match production. The evaluator is the same person who chose the tasks, so confirmation bias is baked in.
A serious rubric starts with the distribution. What kinds of tasks does the agent actually get in your repository? Bug fixes on small surfaces. Refactors across several files. Net-new features with a short spec. Tests that exercise existing behavior. Infrastructure changes that require reading YAML and shell scripts. Those categories are not equal, they do not require the same skills, and no agent is strong across all of them simultaneously.
What to Measure
The rubric I use has five axes, and each one gets scored independently. Conflating them into a single number hides the actual signal.
Correctness is whether the change does what was asked. It is binary per task and easy to score if the task was well specified.
Minimality is whether the change avoided unrelated edits. Agents often over-reach, rewriting nearby code that looked stale. A minimal patch is easier to review and safer to ship.
Adherence is whether the change respected repository conventions. Using the right hooks, the right logger, the right folder structure. This one is invisible until you review enough patches to see the pattern.
Verification is whether the agent actually proved the change worked, not just that the build passed. Did it run the right tests? Did it exercise the behavior it changed?
Cost is how much compute and time the change consumed. The cheapest correct patch wins ties, and sometimes breaks them.
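Keeping the axes separate is easy to enforce in code. A minimal sketch of a per-task score record, with hypothetical field names and a 1-to-5 scale for the judgment axes (the scale and fields are my illustration, not a standard):

```python
from dataclasses import dataclass

# Hypothetical sketch: one record per (agent, task) pair.
# Axes stay separate fields; nothing collapses them into a single number.
@dataclass
class TaskScore:
    agent: str
    task_id: str
    correctness: bool      # binary: did the change do what was asked?
    minimality: int        # 1-5: fewer unrelated edits scores higher
    adherence: int         # 1-5: repository conventions respected
    verification: int      # 1-5: did the agent prove the change worked?
    cost_usd: float        # compute spend for the run
    wall_minutes: float    # elapsed time

score = TaskScore("agent-a", "bugfix-03", True, 4, 5, 3, 0.42, 11.5)
```

Any aggregation happens later, at analysis time, so the raw signal per axis is never lost.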
Sampling the Task Distribution
Once the axes are set, the next question is which tasks to run. The answer is not "all of them". It is a stratified sample that represents production.
I pull tasks from recent merged PRs, sort them into the categories above, and pick a handful from each. Small bug fix, medium refactor, net-new feature, test-only change, infra-adjacent change. Five categories, five tasks each, twenty-five total. That is enough to see patterns without being so large that rerunning the rubric becomes painful.
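The sampling step can be sketched in a few lines. This assumes you have already labeled each merged PR with one of the five categories; the function name and shape are illustrative:

```python
import random
from collections import defaultdict

# Hypothetical sketch: stratified sampling of labeled PRs.
# `prs` is a list of (pr_number, category) pairs you labeled by hand.
def stratified_sample(prs, per_category=5, seed=0):
    by_cat = defaultdict(list)
    for pr, cat in prs:
        by_cat[cat].append(pr)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {cat: rng.sample(items, min(per_category, len(items)))
            for cat, items in by_cat.items()}
```

The fixed seed matters: the whole point is that the same twenty-five tasks come back every time you rerun the rubric.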
Each task gets a canonical spec: what to do, what not to do, and how verification should work. Without the spec, scoring drifts because different reviewers interpret "correctness" differently.
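A canonical spec does not need to be fancy. One sketch, with field names that are my invention rather than a standard schema:

```python
# Hypothetical sketch of a canonical task spec; fields are illustrative.
spec = {
    "task_id": "refactor-02",
    "category": "medium refactor",
    "instructions": "Extract the retry logic in the HTTP client into a shared helper.",
    "out_of_scope": [
        "Do not change public function signatures.",
        "Do not touch unrelated modules.",
    ],
    "verification": "pytest tests/http/ must pass; the new helper needs its own test.",
}
```

The `out_of_scope` list is what keeps minimality scorable, and the `verification` field is what keeps correctness from drifting between reviewers.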
Running the Evaluation
The agents all get the same spec. No hand-holding, no coaxing. If an agent needs clarification to make progress, that counts against it on the adherence axis. If an agent produces a confident patch that misses the point, that counts against correctness.
I score blind whenever I can. Renaming the branches before review strips the agent identity from the review surface. It sounds excessive, but halo effects are real, and stripping them makes the scores match intuition better after enough rounds.
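The anonymization itself is trivial. A sketch of the mapping step, assuming one branch per agent per task; the `candidate-N` naming is just my convention:

```python
import random

# Hypothetical sketch: map agent branches to neutral names before review.
# Keep the returned mapping aside so scores can be de-anonymized afterwards.
def anonymize(branches, seed=None):
    shuffled = list(branches)
    random.Random(seed).shuffle(shuffled)
    return {f"candidate-{i}": branch for i, branch in enumerate(shuffled, 1)}
```

The reviewer only ever sees `candidate-1`, `candidate-2`, and so on; the mapping is consulted after scores are locked in.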
The Boring Part That Changes Everything
The rubric only works if it is repeatable. That means the tasks are version-controlled, the specs are stable, the verification scripts are automated, and the scoring is documented enough that a new reviewer gets similar numbers. None of this is glamorous. All of it is what separates a rubric that ages well from a benchmark that dies after one model release.
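Automated verification mostly means a dumb runner that records pass or fail and nothing else. A sketch, assuming each task's spec points at a verification command (the layout is hypothetical):

```python
import subprocess

# Hypothetical sketch: run a task's verification command and record only
# pass/fail, so scoring stays mechanical and reviewer-independent.
def run_verification(command):
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0
```

Keeping the runner this dumb is deliberate: any judgment that lives in the runner instead of the spec is judgment a new reviewer cannot reproduce.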
When the rubric is stable, every future comparison is cheap. A new model comes out on Monday, I run the same 25 tasks that evening, and by Tuesday I know whether the upgrade is real for my workloads or whether it is marketing.
What the Rubric Has Actually Told Me
The most consistent finding is that agents differ more by workload category than by overall capability. One agent wins on small bug fixes, another on multi-file refactors, a third on test-only changes. The "which is best" framing is the wrong question. "Which is best for which task" is the right one.
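Answering "which is best for which task" is a per-category argmax, not a single leaderboard. A sketch, assuming you have already aggregated each agent's mean correctness rate per category:

```python
# Hypothetical sketch: pick a winner per category instead of one overall winner.
# `results[agent][category]` is assumed to be a mean correctness rate in [0, 1].
def winners_by_category(results):
    categories = {cat for scores in results.values() for cat in scores}
    return {cat: max(results, key=lambda agent: results[agent].get(cat, 0.0))
            for cat in categories}
```

In practice this table, not a single ranking, is what routes tasks: bug fixes go to one agent, multi-file refactors to another.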
The second finding is that cost matters more than most teams realize. A model that is 10% more accurate but 3x more expensive is usually the wrong choice for bulk work, even if the demo looked impressive.
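The arithmetic is worth making explicit: the number to compare is cost per correct patch, not raw accuracy. With illustrative numbers (a 70%-accurate baseline versus a model with a 10% relative accuracy gain at 3x the price):

```python
# Worked example; the numbers are illustrative, not measured.
def cost_per_correct(accuracy, cost_per_task):
    # Expected spend to obtain one correct patch.
    return cost_per_task / accuracy

base = cost_per_correct(0.70, 1.00)   # ≈ $1.43 per correct patch
fancy = cost_per_correct(0.77, 3.00)  # ≈ $3.90 per correct patch
# The "better" model costs roughly 2.7x as much per correct patch.
```

For bulk work, that ratio dominates; the accuracy gain would have to be far larger to justify the price.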
Start Small, Then Resist the Urge to Grow It
A rubric with 25 tasks across 5 categories is enough for most teams. The temptation is to grow it to 100 tasks, then 500, then build a whole platform around it. Resist that urge. The value is in stability and repeatability, not coverage. A small rubric run monthly is worth more than a large rubric that never gets rerun.
The agents will keep changing. The point of the rubric is that your decisions do not have to depend on the noise.