Who writes the agent's tests: three ways to define what's correct

Who writes an agent's tests defines what counts as correct. Three levels of input: spec and tests by hand, examples the agent turns into tests, or the repo's suite as the evaluator.

[Figure: three levels of input for an agent: hand-written tests, examples the agent turns into tests, and the repo's existing suite as the evaluator]

When you build an agent, the question that really matters isn’t how to describe the solution—it’s who decides whether a solution is correct. In the write-test-fix loop that decision was made by tests you wrote by hand: you defined, up front, what counted as done. But writing the tests by hand is only one of the ways to give an agent that criterion, and it’s the one that scales worst. This is the conceptual leap of the series: an agent’s real input isn’t the description of the task, but the definition of how a correct solution is recognized—what I called the evaluator in the minimal-loop post. This post walks through the three levels of that input, from tests by hand to the repo’s suite, and the principle that ties them together.

TL;DR
  • An agent's real input isn't describing the solution, but defining how a correct one is recognized. That piece—the tests, a check, another model—is what the series calls the evaluator, and an agent doesn't optimize for doing the right thing: it optimizes for passing it.
  • There are three levels depending on who writes the evaluator: you by hand (spec + tests), the agent from examples you give it, or the repo's existing suite. The higher the level, the less work per task and the less control over the criterion.
  • The higher the level, the further up your review has to go. At level 2 you review the tests the agent wrote itself, not just its code: an agent that defines its own criterion can set itself a lax one and go green without being right.

The evaluator is the agent’s real input

In the previous post I called the piece that decides whether the model’s output is correct the evaluator. In the write-test-fix loop that evaluator was the tests: a process that returns green or red. It’s easy to read it as an implementation detail—“you need some tests to stop the loop”—and move on. But the evaluator isn’t a detail: it’s what you’re really giving the agent.

Look at it from the loop. The model proposes an output, the evaluator judges it, and depending on that judgment the cycle stops or tries again. The model supplies the judgment about what to do; the evaluator supplies the judgment about whether it came out right. Remove the evaluator and you don’t have an agent: you have a model that writes once and nobody checks whether it got it right.

That piece is exactly what a chat doesn’t have, and putting them side by side is the fastest way to see why the evaluator is the input that matters:

CHAT                         AGENT

prompt                       prompt  (goal)
   │                            │
   ▼                            ▼
 model                        model ──► output
   │                            │
   ▼                            ▼
response                     evaluator ──► correct?

                             no ──┘──► another turn

                            yes ──► deliver

In a chat, the model’s output is the final answer. In an agent, that same output passes through the evaluator first, and it’s that verdict—not how many times you call the model—that turns the cycle into an agent. Everything the series built up to here—the loop, the sandbox that turns the output into a reliable observation—exists to run an evaluator you supplied.

From there comes the consequence worth keeping in mind from the start, because it explains a lot of agent behavior that looks strange until you have it in your head:

An agent doesn’t optimize for doing the right thing. It optimizes for passing the evaluator you gave it. If the evaluator is lax, “correct” comes to mean “whatever the evaluator lets through.”

And from there the principle that orders the post: your job when building an agent isn’t to describe the solution, but to define how a correct solution is recognized. That definition is the evaluator, and it’s the input that really matters. It sounds abstract until you ground it in a concrete question: who writes that evaluator? In the minimal-loop post you wrote it yourself, by hand, as three tests. But that’s not the only option, and depending on who writes it, how much work it costs you to define the criterion—and how much control you have over it—changes. That’s what separates the three levels.
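The cycle described in this section can be sketched in a few lines. This is a minimal illustration, not the loop from the previous post: `run_agent`, `Verdict`, `toy_model`, and `toy_evaluator` are hypothetical names, and the toy model is a stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    feedback: str  # what the evaluator reports back, e.g. failing test output


def run_agent(
    propose: Callable[[str], str],       # hypothetical stand-in for the model call
    evaluate: Callable[[str], Verdict],  # the evaluator you supply
    goal: str,
    max_turns: int = 5,
) -> Optional[str]:
    context = goal
    for _ in range(max_turns):
        output = propose(context)
        verdict = evaluate(output)
        if verdict.passed:
            return output  # the evaluator said yes: deliver
        # The evaluator said no: the failure goes back in as context.
        context = f"{goal}\n\nPrevious attempt failed:\n{verdict.feedback}"
    return None  # ran out of turns without a passing output


# Toy stand-ins so the sketch runs without a model.
def toy_model(context: str) -> str:
    return "42" if "failed" in context else "41"


def toy_evaluator(output: str) -> Verdict:
    return Verdict(output == "42", f"expected 42, got {output}")


result = run_agent(toy_model, toy_evaluator, "answer the question")
```

Nothing in `run_agent` knows what “correct” means. That lives entirely in whatever function is passed as `evaluate`, which is why swapping the evaluator changes what the loop converges on.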

The three levels of input

The three levels answer the same question—where does the evaluator come from?—with three different answers. In the first you write it entirely. In the second you give examples and the agent derives it. In the third you neither write it nor derive it: it was already in the repository.

Level 1   spec + tests by hand     →  you write the evaluator
Level 2   spec + examples          →  the agent derives the evaluator
Level 3   a change in the repo     →  the repo IS the evaluator already

          less work per task       ───────────►  less control over the criterion

The bottom arrow is the tension that runs through the whole post. Moving up a level reduces the work of defining the evaluator: you stop writing tests for each task. It’s worth reading precisely, because it’s not “less work” plain and simple—in a large repo, moving up to level 3 doesn’t reduce total work, it reduces one specific kind of work, writing the criterion. And that saving has a price: you lose direct control over what counts as correct. Since the agent optimizes for passing whatever evaluator is in front of it, whatever its quality, losing control of the criterion is losing control of the result.

This table is the map for the rest of the post. Worth having it whole before going into each level:

Level             What you give the agent  Who writes the evaluator    What you review                        Where it fails
1. Tests by hand  spec + tests             You                         The final code                         Doesn’t scale: every task needs you
2. Examples       spec + examples          The agent (you approve it)  The derived tests                      The agent writes itself a lax criterion
3. Repo’s suite   a change to make         Already existed             That nothing broke + the new behavior  Only covers behavior that was already there

Notice the “what you review” column: it’s the one that moves upward as you climb. At level 1 you review the code. At level 2 you review the criterion the agent judges its own code by. At level 3 you review that the change didn’t break what already worked and that the new behavior is covered. The less of the evaluator you write yourself, the higher up you have to look.

Level 1: spec and tests by hand

This is the one from the previous post, and it’s the baseline. You write the specification and the tests; the model writes the code; the loop runs until the tests pass. You’re the author of the evaluator, start to finish.

Implement `analyze(readings)` in TypeScript.
Return [peak, count, total].

# You write the tests, complete:
analyze([3, 7, 2, 9, 4])  ->  [9, 5, 25]
analyze([])               ->  [null, 0, 0]
analyze([-1, -5, -2])     ->  [-1, 3, -8]

What this level buys you is full control over the criterion. You decided every case that counts as correct: the normal case, the empty-list edge, and the all-negatives case that keeps a max initialized at zero from slipping through. There’s no ambiguity about what “done” means, because “done” is exactly what you wrote, no more and no less. That’s why it’s the level where you trust green the most: the evaluator is as good as your tests, and you know your tests because you wrote them.
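As a rough sketch, here are the three hand-written cases turned into an executable check. It is written in Python rather than the TypeScript the spec asks for. The helper `run_hand_written_tests` and the reference `analyze` are illustrative assumptions, not code from the series.

```python
from typing import Callable, List


def analyze(readings: List[float]) -> list:
    """Return [peak, count, total]; peak is None for an empty input."""
    if not readings:
        return [None, 0, 0]
    return [max(readings), len(readings), sum(readings)]


# The three cases from the spec above, kept as data so they are easy to review.
HAND_WRITTEN_CASES = [
    ([3, 7, 2, 9, 4], [9, 5, 25]),   # normal case
    ([], [None, 0, 0]),              # empty-list edge
    ([-1, -5, -2], [-1, 3, -8]),     # all negatives: catches a max that starts at 0
]


def run_hand_written_tests(fn: Callable[[list], list]) -> List[str]:
    """Return one message per failing case; an empty list means green."""
    failures = []
    for readings, expected in HAND_WRITTEN_CASES:
        actual = fn(readings)
        if actual != expected:
            failures.append(f"analyze({readings}) -> {actual}, expected {expected}")
    return failures
```

Every line of `HAND_WRITTEN_CASES` is a decision you made, which is the whole point of this level.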

What it costs you is that it doesn’t scale. Every new task needs you to sit down and write its evaluator. For a function with three cases that’s trivial; for an agent that has to touch twenty different functions in one task, hand-writing the evaluator for each is more work than making the change yourself. Level 1 is perfect for understanding the pattern and for narrow, critical tasks where you want to set the criterion with your own hands. It stops serving you the moment task volume grows, and that’s where the second level begins.

Level 2: you give examples and the agent writes the tests

The second level moves who writes the evaluator from the human to the agent. You no longer hand over the complete test suite; you hand over the spec and a few examples of the expected behavior, and the agent synthesizes a first version of the evaluator from them: a test file. “Synthesizes,” not “invents”—it starts from your examples and fills in the rest—and that nuance matters for what comes next. Then, with those tests, it runs the same write-test-fix loop from the previous post, unchanged.

The input you give shrinks: from a complete suite to a handful of examples.

Implement `analyze(readings)` in TypeScript.
Return [peak, count, total].

Examples of expected behavior:
analyze([3, 7, 2, 9, 4])  ->  [9, 5, 25]
analyze([])               ->  [null, 0, 0]

# The agent derives the rest of the cases from these.

The new piece is a model call whose job isn’t to write the code, but to write the tests. It’s the same kind of call as generateCode in the previous post, with a different prompt:

// derive-tests.ts — the agent writes the tests from examples. You review them.
import { generateCode } from "./model.js"; // the same LLM call from the previous post

// `examples` are input → output pairs you give by hand. It's not the full suite:
// it's the seed the model derives the cases from.
export async function deriveTests(spec: string, examples: string): Promise<string> {
  const messages = [
    {
      role: "system",
      content:
        "You are a QA engineer. Return only a test file, with no explanations or fences. " +
        "Include the given examples and add the missing edge cases.",
    },
    { role: "user", content: `${spec}\n\nExamples of expected behavior:\n\n${examples}` },
  ];
  // You no longer write the evaluator: the model does. Review it before trusting it.
  return generateCode(messages);
}

Python version:
# derive_tests.py — the agent writes the tests from examples. You review them.
from model import generate_code  # the same LLM call from the previous post

# `examples` are input → output pairs you give by hand. It's not the full suite:
# it's the seed the model derives the cases from.
def derive_tests(spec: str, examples: str) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "You are a QA engineer. Return only a test file, with no explanations or fences. "
                "Include the given examples and add the missing edge cases."
            ),
        },
        {"role": "user", "content": f"{spec}\n\nExamples of expected behavior:\n\n{examples}"},
    ]
    # You no longer write the evaluator: the model does. Review it before trusting it.
    return generate_code(messages)

PHP version:
<?php
// derive_tests.php — the agent writes the tests from examples. You review them.
require "model.php"; // generate_code(): the same LLM call from the previous post

// $examples are input → output pairs you give by hand. It's not the full suite:
// it's the seed the model derives the cases from.
function derive_tests(string $spec, string $examples): string {
    $messages = [
        [
            "role" => "system",
            "content" =>
                "You are a QA engineer. Return only a test file, with no explanations or fences. " .
                "Include the given examples and add the missing edge cases.",
        ],
        ["role" => "user", "content" => "$spec\n\nExamples of expected behavior:\n\n$examples"],
    ];
    // You no longer write the evaluator: the model does. Review it before trusting it.
    return generate_code($messages);
}

With those tests in hand—reviewed, and I’ll come back to that in a moment—the rest is the previous post’s loop without touching a line: the model writes analyze, the runner runs the synthesized tests, the failure comes back as context. The only thing that moved is the origin of the evaluator.

A variant of this level—a level 2.5—is to give not examples but properties: instead of asserting analyze([3,7,2,9,4]) → [9,5,25], you assert something that must always hold (that count equals the length of the input, that total equals the sum) and a property-based testing tool generates hundreds of automatic cases against that property. It covers far more edges than a handful of examples, in exchange for good properties being harder to express than examples. I won’t develop it here; it’s enough to know it exists, between level 2 and level 3.
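To make the idea concrete without developing it, here is a hand-rolled stand-in for a property-based tool, using only the standard library. A real tool would also shrink counterexamples, which this sketch does not. The count and total properties are the two named above. The property about `peak` is an added assumption, and `holds` and `find_counterexamples` are hypothetical names.

```python
import random
from typing import Callable, List


def analyze(readings: List[int]) -> list:
    if not readings:
        return [None, 0, 0]
    return [max(readings), len(readings), sum(readings)]


def analyze_shortcut(readings: List[int]) -> list:
    # The max-initialized-at-zero shortcut, for contrast.
    if not readings:
        return [None, 0, 0]
    return [max(0, *readings), len(readings), sum(readings)]


def holds(readings: List[int], result: list) -> bool:
    """Properties that must be true for any input, not just the examples."""
    peak, count, total = result
    if count != len(readings) or total != sum(readings):
        return False
    if not readings:
        return peak is None
    # Added assumption: peak is an element of the input and no element exceeds it.
    return peak in readings and all(peak >= x for x in readings)


def find_counterexamples(fn: Callable[[list], list], runs: int = 200, seed: int = 0) -> List[list]:
    rng = random.Random(seed)  # seeded so a failure is reproducible
    found = []
    for _ in range(runs):
        readings = [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]
        if not holds(readings, fn(readings)):
            found.append(readings)
    return found
```

Running `find_counterexamples(analyze_shortcut)` surfaces all-negative lists without anyone having written a negatives example, which is the kind of edge a handful of examples can miss.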

And there’s the new risk, the one that defines this level. If the agent synthesizes both the tests and the code, it’s setting its own success criterion—and it can set itself a lax one. Better to see it than explain it. Suppose you give these two examples, with no negative among them:

Examples you gave              Tests the agent synthesizes
analyze([3,7,2,9,4]) → [9,5,25]    ✓ normal readings
analyze([]) → [null,0,0]           ✓ empty list
                                   (no case with negatives)

        │  the model writes the code against those tests

   peak = Math.max(0, ...readings)   // initializes the max at 0

        │  runs the synthesized tests

   ok  normal readings · ok  empty list        → all green

        │  but the real behavior:

   analyze([-1, -5, -2]) → [0, 3, -8]   // peak should be -1, not 0

All green, and the function is wrong: for a list of pure negatives it returns a max of 0, a value that isn’t even in the input. That Math.max(0, …) is exactly the bug the negatives test caught at level 1—the one we added “to keep a max initialized at zero from slipping through.” But that test you wrote by hand; the agent, synthesizing the evaluator only from your examples, didn’t include it, and without that case the shortcut passes. The loop ended green not because the code was correct, but because the evaluator wasn’t looking where it failed. It’s the earlier thesis turned into a bug: the agent optimized for passing the evaluator, not for doing the right thing. This has happened to me exactly like that—I let the agent synthesize the tests, everything passed on the first try, and green only measured that it met the examples I gave it.
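The trace above can be reproduced in a few lines of Python. `analyze_shortcut` mirrors the `Math.max(0, …)` shortcut, and `failures` is a hypothetical helper that returns the cases a function gets wrong.

```python
def analyze_shortcut(readings):
    """The shortcut from the trace above: the max effectively starts at 0."""
    if not readings:
        return [None, 0, 0]
    return [max(0, *readings), len(readings), sum(readings)]


# What the agent synthesized from the two examples: no case with negatives.
SYNTHESIZED_CASES = [
    ([3, 7, 2, 9, 4], [9, 5, 25]),
    ([], [None, 0, 0]),
]

# The case the hand-written suite had at level 1.
MISSING_CASE = ([-1, -5, -2], [-1, 3, -8])


def failures(fn, cases):
    """Return (input, actual, expected) for every case that does not match."""
    return [(r, fn(r), e) for r, e in cases if fn(r) != e]


print(failures(analyze_shortcut, SYNTHESIZED_CASES))
# [] : all green
print(failures(analyze_shortcut, [MISSING_CASE]))
# [([-1, -5, -2], [0, 3, -8], [-1, 3, -8])]
```

The same function is green or red depending only on which list of cases judges it.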

That’s why, at level 2, what you review climbs a step. At level 1 you reviewed the code because the criterion was yours and trustworthy. Here the criterion was written by the agent, so the first thing you review is the criterion: do the synthesized tests cover the edges, or just repeat your examples? The examples you give still matter a lot—they’re the seed everything grows from—so it’s worth including among them some case a shortcut couldn’t guess, just as you would with hand-written tests. You give less, but what you give has to be more representative.

Level 3: the repo’s suite as the evaluator

The third level is the one that code agents working on real repositories (Claude Code, Cursor, and the like) use under the hood, and it’s the most different of the three, because the interesting part isn’t anything you write: it’s what was already written. Here you give neither tests nor examples. The task is different (“fix this bug,” “add this field,” “rename this function across the repo”) and the evaluator already exists, written over months by the whole team: it’s the project’s test suite. The agent makes a change, runs the suite, reads what broke, fixes it. The loop is the same as before; the only thing that changed is that the “correct” criterion wasn’t supplied by you for this task. You inherited it.

That’s why this level performs so well: the work of defining “correct” is already done, and done by many people over a long time. A mature suite encodes hundreds of decisions about which behavior is the good one, and the agent inherits them all. A code agent performs much better in a repo with good tests than in one without them: it’s not that the model is smarter in one than the other, it’s that in one it has an evaluator and in the other it doesn’t.

And from that, directly, the condition that defines the level: it works as well as the repo’s coverage. The suite only protects what someone decided to test. If the repo has 12% coverage, “level 3” looks more like having no evaluator than having one: the agent can break the remaining 88% without a single test turning red. In a well-tested repo you inherit a powerful evaluator; in one without tests you inherit the illusion of one.

There’s also a crack that shows up even in well-tested repos, and it’s the mirror image of the advantage: the suite covers the behavior that already existed, not the new behavior the task asks for. If you add a field, the old tests stay green because they don’t know that field was supposed to exist. Passing the suite proves you didn’t break what came before—that there’s no regression—not that you did the new thing right. To close that gap, someone has to extend the evaluator with a test for the new behavior: either you write it, or you ask the agent to write it, and there you’re back to the level 2 risk, the agent judging itself. The repo’s suite is a powerful evaluator for not regressing, and an incomplete one for moving forward.
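A toy version of that gap, with every name made up for illustration: the task was “add an email field,” the change forgot it, and the inherited suite cannot notice because it predates the field.

```python
def to_record(user_id: int, name: str, email: str) -> dict:
    # The task asked for an "email" field, but this version forgot to add it.
    return {"id": user_id, "name": name}


def existing_suite(fn) -> bool:
    """Stand-in for the repo's old tests: they predate the email field."""
    record = fn(1, "Ada", "ada@example.com")
    return record["id"] == 1 and record["name"] == "Ada"


def new_behavior_test(fn) -> bool:
    """The test someone has to add for the task itself."""
    record = fn(1, "Ada", "ada@example.com")
    return record.get("email") == "ada@example.com"


print(existing_suite(to_record))     # True: no regression
print(new_behavior_test(to_record))  # False: the new behavior is missing
```

Only the second check says anything about the task. The first one only says nothing old broke.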

Worth remembering, finally, that the suite isn’t the only evaluator already living in a repo. The compiler, the type checker (tsc --noEmit, mypy), and the linter also return green or red on the agent’s code, and they too bound what counts as correct. A change that passes the tests but doesn’t compile isn’t done. In practice, the evaluator of a code agent on a real repo is the sum of all of those: tests, build, types, and lint, each cutting off a different class of error.

In code, the change from level 1 is almost trivial—and that’s precisely the sign that the work went elsewhere: the runner stops writing a test file and instead invokes the command the repo already had.

// run-suite.ts — the evaluator already exists: it's the repo's test command.
import { execSync } from "node:child_process";

export function runSuite(): { passed: boolean; output: string } {
  try {
    // We don't write the tests: we run the ones the team already had.
    const output = execSync("npm test", { encoding: "utf8" });
    return { passed: true, output };
  } catch (err: any) {
    // Exit code ≠ 0: some repo test broke with the change.
    return { passed: false, output: (err.stdout ?? "") + (err.stderr ?? "") };
  }
}

Python version:
# run_suite.py — the evaluator already exists: it's the repo's test command.
import subprocess

def run_suite() -> dict:
    # We don't write the tests: we run the ones the team already had.
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    # returncode ≠ 0: some repo test broke with the change.
    return {"passed": proc.returncode == 0, "output": proc.stdout + proc.stderr}

PHP version:
<?php
// run_suite.php — the evaluator already exists: it's the repo's test command.
function run_suite(): array {
    $output = [];
    $exitCode = 0;
    // We don't write the tests: we run the ones the team already had.
    exec("./vendor/bin/phpunit 2>&1", $output, $exitCode);
    // Exit code ≠ 0: some repo test broke with the change.
    return ["passed" => $exitCode === 0, "output" => implode("\n", $output)];
}

It’s the same structure as runTests from the minimal-loop post, pointed at a different command. What was interesting about this level was never the execSync; it was that the evaluator already existed.
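The same runner extends naturally to the other evaluators mentioned earlier in this section (build, types, lint). Here is a minimal sketch, assuming each check is a command that signals failure through its exit code. `run_checks` and `REPO_CHECKS` are hypothetical names, and the commands listed are examples that depend on what the repo actually uses.

```python
import subprocess
import sys
from typing import Dict, List


def run_checks(checks: Dict[str, List[str]]) -> dict:
    """Run every check; the change is green only if all of them are."""
    results = {}
    for name, command in checks.items():
        try:
            proc = subprocess.run(command, capture_output=True, text=True)
            passed, output = proc.returncode == 0, proc.stdout + proc.stderr
        except FileNotFoundError as err:
            # A missing tool counts as a failed check, not as a silent pass.
            passed, output = False, str(err)
        results[name] = {"passed": passed, "output": output}
    return {"passed": all(r["passed"] for r in results.values()), "checks": results}


# Example wiring for a Python repo (assumed commands; a linter would go in the same way).
REPO_CHECKS = {
    "tests": ["pytest", "-q"],
    "types": ["mypy", "."],
}

# Self-contained demo with two trivial commands instead of real tools.
demo = run_checks({
    "ok": [sys.executable, "-c", "print('fine')"],
    "bad": [sys.executable, "-c", "import sys; sys.exit(1)"],
})
```

Keeping the per-check output separate lets the loop feed back only the failing check’s output as context, instead of one undifferentiated log.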

The principle that connects the three levels

Seen together, the three levels say the same thing three ways. At level 1 you write the evaluator; at level 2 the agent derives it from your examples; at level 3 you inherit it from the repo. What changes isn’t the loop—it’s identical in all three—but where the “correct” criterion comes from and how much control you have over it. And in all three, what you really gave the agent was never the solution. It was the way to recognize it.

That’s the principle worth taking from the whole post: building an agent is, above all, supplying an evaluator you can trust at the lowest possible cost. The model brings the judgment of what to try; you bring the judgment of what counts as correct. Moving up a level makes that contribution of yours cheaper, but never exempts you from it: it only changes whether you write it, synthesize it, or inherit it.

From there come two practical consequences. The first: when an agent “doesn’t work,” the cause is usually in the evaluator rather than the model. An agent that turns bad results green almost always has an evaluator that is too lax, not a model that is too dumb. Before switching models or tuning the prompt, look at what your green is measuring.

The second: what you have to review climbs with the level. You write the evaluator by hand and review the code. The agent derives it and you review the criterion. You inherit it from the repo and you review that you broke nothing and covered the new behavior. The task doesn’t disappear as you move up; it shifts upward, from the solution to the criterion. Your attention has to follow that shift, or you end up trusting a green that doesn’t measure what you think.

When to move up or down a level

The levels aren’t a progression where the third is “the good one” and the first something you outgrow. They’re three points in a trade-off between control and scale, and which suits you depends on the task. The rule I settled on is parallel to the sandbox one, where isolation rose with distrust: here, the level drops with how much you need to control the criterion.

  • Drop to level 1 when the criterion is critical or subtle and you want to set it with your own hands: a function with dangerous edges, billing logic, anything where a lax test is expensive. The cost of writing the tests is justified because you don’t want anyone else—not even the agent—to define what’s correct.
  • Stay at level 2 when you have many similar, well-bounded tasks and writing each suite by hand is the bottleneck. You give good examples, let the agent derive, and pay for control with a review of the derived tests instead of writing from scratch.
  • Move up to level 3 when you work on a repo that already has a suite you trust. There the most valuable evaluator is already written; your job is to lean on it and extend it only for the new behavior.

What doesn’t work is moving up a level to save yourself work without taking on the review that level demands. Letting the agent derive the tests and not looking at them, or trusting the repo’s suite without adding a test for the new behavior, is raising the level of the input while dropping your guard exactly where the level demands it highest. The trade-off is real: less work per task in exchange for reviewing higher up. If you take only the saving and skip the review, what you’re lowering isn’t the cost, it’s the quality.

Frequently asked questions

Does this mean I no longer write tests?

No. It means that who writes them changes and, above all, that what you review changes. Someone always defines what’s correct: at level 1 it’s you writing the suite, at level 2 it’s the agent deriving it from your examples, at level 3 it’s the team that wrote it months ago. What changes as you move up a level isn’t whether an evaluator exists, but what it costs you to supply it. The definition of “correct” never evaporates; it only changes author.

Why review the tests the agent derives, if it’s going to write the code anyway?

Because if the agent writes the evaluator and the solution at once, it’s grading its own exam. A lax evaluator—tests that repeat your examples without covering edges—lets incomplete code through, and the loop ends green without being right. Reviewing the criterion before the solution is what avoids that empty green. If the tests are wrong, the code meeting them proves nothing.

Is the repo’s suite enough as an evaluator?

For not regressing, yes: if the existing tests stay green, you didn’t break what already worked. For moving forward, no: the suite covers the behavior that was already there, not the new behavior the task asks for. A change that adds a function can pass the whole suite without a single test checking that function. That’s why it’s worth complementing it with a test for the new behavior, and adding the compiler, the type checker, and the linter, which also bound what counts as correct.

Does this have a formal name?

Yes. In testing theory it’s known as the oracle problem: given a program and an input, how do you know the expected output? An agent’s evaluator is exactly that, an oracle, and the three levels are three ways to provide it: write it, derive it from examples, or reuse one that already exists. The term comes from the software-testing literature; what matters here isn’t the name but the idea, which is the same.

What if the examples I give at level 2 are few or biased?

The derived evaluator inherits that bias. The agent writes the tests from your examples, so if you only give happy cases, the tests will cover only happy cases and the code will optimize for them. Give examples that include edges and at least one case a shortcut can’t guess, just as you would with hand-written tests. At this level you give less quantity, but what you give has to be more representative, not less.

Does this only apply to code generation?

The principle is general; what changes is what the evaluator is made of. With code you have the luck of an objective, free evaluator: the tests return green or red. For tasks without that verification—writing, designing, deciding—the evaluator can be a rubric, a human who reviews, or another model that scores the response. The trade-off across the three levels is still there, but with one more layer of distrust, because a subjective evaluator is easier to fool than a test that compares two values.

Conclusion

The conceptual leap of the series is to stop thinking you describe the solution to an agent and start thinking you define how a correct one is recognized. That definition is the evaluator, and it’s the input that really drives the loop. There are three levels depending on who writes it: you by hand with spec and tests, the agent synthesizing it from your examples, or the repository that already had it written. Moving up a level reduces the work of defining the evaluator and, with it, your control over the criterion; that’s why what you review climbs with the level: from the code, to the criterion, to regression.

If you’re going to build it, choose the level by how much you need to control the criterion, not by how much work you want to save: drop to level 1 when “correct” is critical, stay at level 2 when you have many similar tasks, move up to level 3 when you trust the repo’s suite. And whatever happens, review the evaluator before the solution: an agent is never better than the definition of “correct” you gave it. In the coming posts the series returns to the mechanics of the loop—how state is managed when the context fills up, how stop conditions are tuned—but all of that still runs around the same piece we saw here.

And if you take one idea from the whole post, let it be the one that connects the entire series:

The difference between a script with an LLM in a loop and an agent isn’t how many times it calls the model, but that there’s an evaluator deciding when to stop. Change the evaluator and you change the agent.
