← All posts

The simplest loop that works: a write-test-fix agent, step by step

The simplest agent that works: spec and tests by hand, the model writes the code, it runs, and the failure feeds back as context. A complete write-test-fix loop in code that runs.

Write-test-fix loop illustration: one node writes code, another tests it, and the failure feeds back as context until the tests pass

In the previous post we saw what an agent loop is: the pattern in the abstract. Now we’re going to build it, in the smallest agent that actually does something useful. It’s the write-test-fix loop: you write a specification and some tests by hand, the model writes the code, the program runs the tests, and the error feeds back to the model until everything passes. By the end of the post you have the whole cycle in code that runs.

TL;DR
  • The write-test-fix loop is the minimal agent: spec and tests written by hand, the model generates the code, it runs, and the test failure feeds back as context for the next attempt.
  • The stopping condition comes from the tests: green and it stops, red and it retries, up to an iteration cap. The model doesn't need to "decide" it's done.
  • Feedback is everything. A test that only says "failed" fixes nothing; one that says "expected [9,5,25] and got [null,0,0]" gives the model exactly what it needs to fix it.

In this article:

From theory to code: the simplest possible agent

The previous post left the loop in pseudocode: a while with three components—state, action, and stopping condition—around a model that decides the next action over and over. Useful for understanding the pattern, but abstract. Here we’ll instantiate it in the smallest case that’s still a real agent, and at the end you have code that runs.

The mapping is direct. Each piece of the loop has a concrete equivalent in this example:

Loop componentIn the write-test-fix loop
StateThe spec, the tests, and the history of attempts and errors
ActionThe model writes (or fixes) the code
ObservationThe test runner’s output
Stopping conditionTests green, or the iteration cap is reached

If you understood that table, you understood the post. The rest is watching it work.

Of those four pieces, the one most worth keeping in your head is the state, because it’s the one that moves. It doesn’t reset on each turn: it starts with the spec and the tests, and grows with every attempt by the model and every error from the runner. Here’s how that accumulation looks across iterations:

Iteration 0          Iteration 1               Iteration 2
                     (+ what 1 produced)       (+ what 2 produced)

spec                 spec                      spec
tests                tests                     tests
                     code (attempt 1)          code (attempt 1)
                     runner errors             runner errors
                                               code (attempt 2)
                                               runner errors

That accumulation is exactly what lets the model fix instead of repeating the same error: on iteration 2 it sees what it wrote before and why it failed. In this example the state grows by appending every previous attempt, nothing more. Later in the series we’ll see how it’s managed when the context no longer fits and you have to summarize or drop.

Why write-test-fix is the minimal loop that works

Of all the tasks you can give an agent, writing code that passes some tests is the one that teaches the pattern best, for one reason: the observation is objective and free. The program doesn’t have to ask another model to judge whether the answer is good; it runs the tests and gets a yes or a no, with the detail of what failed. That clean signal is exactly what a loop needs to feed back on itself.

The piece that decides whether the output is correct we’ll call the evaluator, and it’s a term that will come back in the series. In this example the evaluator is the tests: a process that returns green or red. Later we’ll see more complex evaluators—another model that scores the answer, a check against real data—but their function is always the same: turn the model’s output into a signal the loop can trust.

Compare it with an agent that drafts an email. How does the loop know whether the email is “good”? There’s no test that returns green. You’d have to bring in a human or a second model to evaluate, and there the stopping condition gets fuzzy and debatable. With code and tests, no: the stopping condition is a binary truth the program itself checks.

That’s why this is the minimal loop that actually works. The hard work—saying what counts as “correct”—you do once, up front, when you write the tests. From there the cycle runs on its own:

spec + tests (by hand)


┌──► the model writes the code
│           │
│           ▼
│     run the tests
│           │
│      all green?
│           │
│     no ───┘   (the error feeds back as context)

└── yes ──► deliver the code

The work by hand: the spec and the tests

The part that makes everything else possible isn’t written by the model, you write it. It’s two things: a specification of what function you want, and tests that define unambiguously what it means for it to be right. The tests are the executable spec and, as we’ll see, also the stopping condition.

The example is deliberately small so it fits whole in your head. An analyze function that takes numeric readings and returns three pieces of data:

Implement `analyze(readings)` in TypeScript.

`readings` is an array of numbers. Return an array of three values:
[peak, count, total]

- peak:  the maximum value in the array, or null if it's empty.
- count: how many readings there are.
- total: the sum of all the readings.

And the tests, written by hand. Notice that each case prints what it expected and what it got: that’s what will later help the model fix it.

// analyze.test.ts — written by hand. It's the executable spec and the stopping condition.
import { analyze } from "./analyze.js";

function check(name: string, got: unknown, want: unknown): void {
  const g = JSON.stringify(got);
  const w = JSON.stringify(want);
  if (g !== w) {
    console.error(`FAIL ${name}: expected ${w}, got ${g}`);
    process.exitCode = 1; // exiting with a non-zero code marks the failure for the runner
  } else {
    console.log(`ok   ${name}`);
  }
}

check("normal readings", analyze([3, 7, 2, 9, 4]), [9, 5, 25]);
check("empty list",      analyze([]),              [null, 0, 0]);
check("negatives",       analyze([-1, -5, -2]),    [-1, 3, -8]);
# analyze_test.py — written by hand. It's the executable spec and the stopping condition.
import json
import sys
from analyze import analyze

failed = False

def check(name, got, want):
    global failed
    g = json.dumps(got)
    w = json.dumps(want)
    if g != w:
        print(f"FAIL {name}: expected {w}, got {g}", file=sys.stderr)
        failed = True  # mark the failure without cutting off the other cases
    else:
        print(f"ok   {name}")

check("normal readings", analyze([3, 7, 2, 9, 4]), [9, 5, 25])
check("empty list",      analyze([]),              [None, 0, 0])
check("negatives",       analyze([-1, -5, -2]),    [-1, 3, -8])

sys.exit(1 if failed else 0)  # a non-zero code marks the failure for the runner
<?php
// analyze_test.php — written by hand. It's the executable spec and the stopping condition.
require "analyze.php";

$failed = false;

function check(string $name, mixed $got, mixed $want): void {
    global $failed;
    $g = json_encode($got);
    $w = json_encode($want);
    if ($g !== $w) {
        fwrite(STDERR, "FAIL $name: expected $w, got $g\n");
        $failed = true; // mark the failure without cutting off the other cases
    } else {
        echo "ok   $name\n";
    }
}

check("normal readings", analyze([3, 7, 2, 9, 4]), [9, 5, 25]);
check("empty list",      analyze([]),              [null, 0, 0]);
check("negatives",       analyze([-1, -5, -2]),    [-1, 3, -8]);

exit($failed ? 1 : 0); // a non-zero code marks the failure for the runner

Three cases: the normal one, the edge (empty list), and one with negatives so a maximum initialized to zero doesn’t sneak through. No more is needed for this example. The quality of the loop depends entirely on the quality of these tests, and I come back to that at the end.

The loop in code: generate, run, feed back

With the spec and tests fixed, the loop is short. It’s the same while from the previous post, but now each line does something real. I split it in two: the test runner and the cycle’s driver.

The runner writes the code the model produced to a file, runs the tests in a separate process, and captures the output. If the process exits with a non-zero code, there was a failure:

// run-tests.ts — runs the model's code against the tests and captures the result.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

export function runTests(code: string): { passed: boolean; output: string } {
  writeFileSync("analyze.ts", code); // the code the model wrote
  try {
    const output = execSync("node analyze.test.ts", { encoding: "utf8" });
    return { passed: true, output };
  } catch (err: any) {
    // execSync throws if the process exits with a non-zero code: that's the failure and its detail.
    return { passed: false, output: (err.stdout ?? "") + (err.stderr ?? "") };
  }
}
# run_tests.py — runs the model's code against the tests and captures the result.
import subprocess

def run_tests(code: str) -> dict:
    with open("analyze.py", "w") as f:
        f.write(code)  # the code the model wrote
    proc = subprocess.run(
        ["python", "analyze_test.py"],
        capture_output=True, text=True,
    )
    # a non-zero returncode means a test failed: that's the detail.
    return {"passed": proc.returncode == 0, "output": proc.stdout + proc.stderr}
<?php
// run_tests.php — runs the model's code against the tests and captures the result.
function run_tests(string $code): array {
    file_put_contents("analyze.php", $code); // the code the model wrote
    $output = [];
    $exitCode = 0;
    // 2>&1 merges stdout and stderr; a non-zero exit code marks the failure.
    exec("php analyze_test.php 2>&1", $output, $exitCode);
    return ["passed" => $exitCode === 0, "output" => implode("\n", $output)];
}

The driver is the loop itself. The state is the messages array: it starts with the spec and the tests, and grows with every attempt by the model and every error from the runner.

// run-loop.ts — the write-test-fix agent, complete.
import { readFileSync, writeFileSync } from "node:fs";
import { generateCode } from "./model.js"; // one LLM call that returns only code
import { runTests } from "./run-tests.js";

const MAX_ITER = 5;
const spec = readFileSync("spec.md", "utf8");
const tests = readFileSync("analyze.test.ts", "utf8");

// The initial state: the goal (spec + tests) is the first entry.
const messages = [
  { role: "system", content: "You are a programmer. Return only the TS code for the requested function, with no explanations or fences." },
  { role: "user", content: `${spec}\n\nThe function must pass these tests:\n\n${tests}` },
];

for (let i = 1; i <= MAX_ITER; i++) {       // ← hard cap: never an endless loop
  const code = await generateCode(messages); // ← the model CHOOSES what code to write
  const result = runTests(code);             // ← the program RUNS the tests

  if (result.passed) {                        // ← stopping condition: all green
    console.log(`✅ Tests green on iteration ${i}`);
    writeFileSync("analyze.ts", code);
    break;
  }

  // The failure feeds back into the state as new context. THIS is what makes it a loop.
  messages.push({ role: "assistant", content: code });
  messages.push({ role: "user", content: `The tests failed:\n\n${result.output}\n\nFix the function.` });
  console.log(`❌ Iteration ${i} failed; retrying with the error as context`);
}
# run_loop.py — the write-test-fix agent, complete.
from model import generate_code  # one LLM call that returns only code
from run_tests import run_tests

MAX_ITER = 5
spec = open("spec.md").read()
tests = open("analyze_test.py").read()

# The initial state: the goal (spec + tests) is the first entry.
messages = [
    {"role": "system", "content": "You are a programmer. Return only the Python code for the requested function, with no explanations or fences."},
    {"role": "user", "content": f"{spec}\n\nThe function must pass these tests:\n\n{tests}"},
]

for i in range(1, MAX_ITER + 1):           # ← hard cap: never an endless loop
    code = generate_code(messages)         # ← the model CHOOSES what code to write
    result = run_tests(code)               # ← the program RUNS the tests

    if result["passed"]:                   # ← stopping condition: all green
        print(f"✅ Tests green on iteration {i}")
        with open("analyze.py", "w") as f:
            f.write(code)
        break

    # The failure feeds back into the state as new context. THIS is what makes it a loop.
    messages.append({"role": "assistant", "content": code})
    messages.append({"role": "user", "content": f"The tests failed:\n\n{result['output']}\n\nFix the function."})
    print(f"❌ Iteration {i} failed; retrying with the error as context")
<?php
// run_loop.php — the write-test-fix agent, complete.
require "model.php";      // generate_code(): one LLM call that returns only code
require "run_tests.php";

const MAX_ITER = 5;
$spec = file_get_contents("spec.md");
$tests = file_get_contents("analyze_test.php");

// The initial state: the goal (spec + tests) is the first entry.
$messages = [
    ["role" => "system", "content" => "You are a programmer. Return only the PHP code for the requested function, with no explanations or fences."],
    ["role" => "user", "content" => "$spec\n\nThe function must pass these tests:\n\n$tests"],
];

for ($i = 1; $i <= MAX_ITER; $i++) {        // ← hard cap: never an endless loop
    $code = generate_code($messages);       // ← the model CHOOSES what code to write
    $result = run_tests($code);             // ← the program RUNS the tests

    if ($result["passed"]) {                 // ← stopping condition: all green
        echo "✅ Tests green on iteration $i\n";
        file_put_contents("analyze.php", $code);
        break;
    }

    // The failure feeds back into the state as new context. THIS is what makes it a loop.
    $messages[] = ["role" => "assistant", "content" => $code];
    $messages[] = ["role" => "user", "content" => "The tests failed:\n\n{$result['output']}\n\nFix the function."];
    echo "❌ Iteration $i failed; retrying with the error as context\n";
}

The four lines that matter are marked: the model decides the code, the program runs it, checks the stopping condition, and if it fails, returns the error to the state. Identical to the skeleton from the previous post; the only new thing is that run now executes tests and the observation is concrete failure text. The generateCode call is an ordinary request to the model—the client pattern is the same one I used in model routing—what matters here isn’t the provider, it’s the cycle around it.

Notice the two lines that update messages when something fails: one saves the code the model wrote (role: "assistant") and the other the runner’s error (role: "user"). Both are needed. The error alone isn’t enough: the model also needs to see exactly what it wrote in order to know what to fix. Without its own attempt in front of it, on the next turn it would be guessing from scratch again instead of fixing.

That division of labor between the two lines is the same as the whole loop’s, and it’s worth being clear about because it’s one of the ideas the series keeps repeating: the model provides the judgment, the program provides the execution and the control.

The modelThe program
Proposes the codeRuns the tests
Decides the next attemptSaves the files
Reads the error feedbackControls the loop and when to stop

Iteration 1: the [null, 0, 0] that starts the engine

First turn. The model receives the spec and the tests, and writes this:

// Iteration 1: what the model wrote on the first attempt.
export function analyze(readings: number[]): [number | null, number, number] {
  // The model added a guard for the empty array... with the wrong comparison.
  if (readings.length >= 0) return [null, 0, 0];

  let peak: number | null = null;
  let count = 0;
  let total = 0;
  for (const r of readings) {
    peak = peak === null ? r : Math.max(peak, r);
    count += 1;
    total += r;
  }
  return [peak, count, total];
}
# Iteration 1: what the model wrote on the first attempt.
def analyze(readings):
    # The model added a guard for the empty list... with the wrong comparison.
    if len(readings) >= 0:
        return [None, 0, 0]

    peak = None
    count = 0
    total = 0
    for r in readings:
        peak = r if peak is None else max(peak, r)
        count += 1
        total += r
    return [peak, count, total]
<?php
// Iteration 1: what the model wrote on the first attempt.
function analyze(array $readings): array {
    // The model added a guard for the empty array... with the wrong comparison.
    if (count($readings) >= 0) return [null, 0, 0];

    $peak = null;
    $count = 0;
    $total = 0;
    foreach ($readings as $r) {
        $peak = $peak === null ? $r : max($peak, $r);
        $count += 1;
        $total += $r;
    }
    return [$peak, $count, $total];
}

The loop logic is correct. The problem is the guard on the first line: the model meant to cover the empty-list case, but it wrote >= 0 instead of === 0. And since any array has a length greater than or equal to zero, the function always goes in there and returns the empty-case result. The loop never runs. The program runs the tests and observes this:

$ node analyze.test.ts
FAIL normal readings: expected [9,5,25], got [null,0,0]
ok   empty list
FAIL negatives: expected [-1,3,-8], got [null,0,0]

That [null, 0, 0] is the heart of the post. At first glance it looks like just a failure, but it’s the most useful observation in the loop: it’s the three untouched accumulators—peak stayed null, count and total at 0—the exact fingerprint of a loop that never ran. The empty-list test passes by coincidence, because the bug returns precisely the empty-case answer for everything. The other two fail and, in failing, print the contrast between expected and got. That text isn’t thrown away: it’s what feeds back to the model on the next turn.

Iteration 2: the error turned into feedback

The driver takes the runner’s output and puts it into a new message. The state the model sees on the second turn is no longer just the spec: it includes its own code and the error it produced.

The tests failed:

FAIL normal readings: expected [9,5,25], got [null,0,0]
ok   empty list
FAIL negatives: expected [-1,3,-8], got [null,0,0]

Fix the function.

With that context, the model is no longer guessing blind. It sees that two cases return [null, 0, 0]—the empty-case result—for inputs that aren’t empty, and that points straight at the guard. It fixes the comparison:

// Iteration 2: the model saw "got [null,0,0]" and fixed the guard.
export function analyze(readings: number[]): [number | null, number, number] {
  if (readings.length === 0) return [null, 0, 0]; // === instead of >=
  let peak: number | null = null;
  let count = 0;
  let total = 0;
  for (const r of readings) {
    peak = peak === null ? r : Math.max(peak, r);
    count += 1;
    total += r;
  }
  return [peak, count, total];
}
# Iteration 2: the model saw "got [None,0,0]" and fixed the guard.
def analyze(readings):
    if len(readings) == 0:
        return [None, 0, 0]  # == instead of >=
    peak = None
    count = 0
    total = 0
    for r in readings:
        peak = r if peak is None else max(peak, r)
        count += 1
        total += r
    return [peak, count, total]
<?php
// Iteration 2: the model saw "got [null,0,0]" and fixed the guard.
function analyze(array $readings): array {
    if (count($readings) === 0) return [null, 0, 0]; // === instead of >=
    $peak = null;
    $count = 0;
    $total = 0;
    foreach ($readings as $r) {
        $peak = $peak === null ? $r : max($peak, $r);
        $count += 1;
        $total += $r;
    }
    return [$peak, $count, $total];
}

A one-character difference. The runner runs again and this time:

$ node analyze.test.ts
ok   normal readings
ok   empty list
ok   negatives

Green. The stopping condition is met, the loop writes analyze.ts, and it ends.

The jump between iteration 1 and 2 didn’t come from a better model: it came from the second prompt including the first one’s error. The same model, with the failure in front of it, fixes what it couldn’t blind. That feedback—not the model’s intelligence—is what makes it an agent.

It converged in two turns here because the bug was simple: a mis-written comparison the error pointed almost straight at. Don’t take it as the norm. In production it’s common for an agent to need several iterations—it fixes one test and breaks another, or the error takes a while to become clear—and sometimes it doesn’t converge at all. That’s why the loop’s iteration cap isn’t decorative, and why later I devote a whole section to where this pattern breaks.

Why the tests are the stopping condition

In the previous post I said the stopping condition usually combines the model’s decision with a hard cap. This loop is an especially clean case because the first half of that condition isn’t set by the model: it’s set by the tests. The agent doesn’t stop because it “thinks” it’s done; it stops because an objective check went green.

That eliminates a whole class of problems. An agent that decides on its own when it’s done can get it wrong: declare itself finished with the task half-done, or never recognize that it’s already done. Here there’s no room for that mistake, because the criterion lives outside the model, in a process that returns zero or non-zero.

The iteration cap (MAX_ITER) is still there, and it’s still essential. If the model gets into a cycle where each fix breaks the previous one, or hits a test it doesn’t know how to satisfy, the counter cuts the loop off no matter what. The two halves play their role from the previous post: the tests are the autonomy (the agent knows when it met the goal) and the cap is the control (no matter what, this ends).

Where this loop breaks

If you take one section away from this post, make it this one. The example converged cleanly in two turns, and it’s easy to finish reading with the idea that this always works. It doesn’t. This loop works as well as its tests, and it has several ways to fail that show up as soon as you leave the toy case:

  • Weak tests, false sense of green. The agent optimizes for passing the tests, not for solving the problem. If a test is lax, the model can find a shortcut that satisfies it without doing the right thing. The agent’s quality never exceeds that of your tests.
  • The model can “cheat.” With visible tests, nothing stops the model from writing code that returns the expected values through special cases instead of implementing the logic. It’s worth reviewing the final code, not just the green.
  • Tasks without objective verification. All of this depends on a binary test existing. For generating code it fits perfectly; for writing, designing, or deciding, there’s no node analyze.test.ts that returns the truth, and this pattern doesn’t apply as-is.
  • Correction loops that don’t converge. Sometimes the model fixes one test and breaks another, turn after turn. Without the iteration cap, that’s tokens burned with no progress. With it, at least it stops and tells you.
  • The error has to be readable. The whole trick depends on the observation being useful. If the runner only said “1 test failed,” the model would have much less to work with. That’s why the tests print what they expected and what they got: the [null, 0, 0] teaches more than a failure counter.

None of these invalidates the pattern; they bound it. Write-test-fix is the foundation that real code agents are built on—the ones that use tools, edit several files, and handle whole repos—but all of them, underneath, keep running this same cycle of generating, running, and feeding the error back.

Frequently asked questions

Is this the same as TDD?

It shares the idea of writing the tests before the code, but the actor is different. In TDD the developer writes the test and then writes the code that passes it. In the write-test-fix loop you write the tests and the model writes the code, in an automatic cycle that feeds back on the failure. Put another way: the loop uses the discipline of TDD as an agent’s stopping condition.

Can the model cheat to pass the tests?

Yes, and it’s a real risk. If the tests are visible and weak, the model can write code that returns exactly the expected values with special cases, without implementing the general logic. The defenses are the usual ones: tests that cover varied cases (not just a happy one), including inputs a shortcut couldn’t guess, and reviewing the final code instead of trusting only the green.

How many iterations does it usually need?

For small, well-specified tasks, few: often one or two, as in the example. The more edge cases the tests have and the more ambiguous the spec, the more turns it takes. What matters isn’t the exact number but having a cap: without MAX_ITER, a task the model can’t solve leaves the process spinning.

Is the whole history sent to the model on every iteration?

In this example, yes: each turn sends the spec, the tests, and all the previous attempts and errors. It’s the simplest thing and it works while the loop is short. The problem shows up with many iterations: the history grows until it doesn’t fit in the context window, or it gets expensive. There real systems trim—they summarize old attempts, keep only the last error, or store part of the state outside the prompt. That’s the topic of a later post in the series; for now it’s enough to know that this example’s state just grows.

Is this what Claude Code, Cursor, and similar agents do?

It’s the core, yes, but heavily reduced. A real code agent isn’t limited to one function: it uses tools to read and edit several files, run commands, search the repo, and read the build output. The “tools” layer is exactly what the next post opens. But underneath all that machinery this cycle is still beating: propose a change, run it, read the result, and fix.

What if my tests are wrong?

The agent inherits the error. If a test demands something incorrect, the model will write incorrect code that satisfies it, and the loop will end green, happy as can be. The agent doesn’t validate your spec; it satisfies it. That’s why the hand-written tests are the part that deserves the most care: they are, literally, the definition of “done.”

Does it work for any language?

Yes. Nothing about the loop is specific to TypeScript: you just need a command that runs the tests and returns success or failure with detail. Swap node analyze.test.ts for pytest, go test, or cargo test and the driver is identical. The stopping condition is the process’s exit code, not the language.

Conclusion

The write-test-fix loop is the smallest way to see a real agent working: you write the spec and the tests by hand, the model writes the code, the program runs it, and the failure feeds back as context until everything is green. The lesson it leaves is the one from the first [null, 0, 0]: the second attempt doesn’t get it right because the model is more capable, but because the first one’s error entered its context. That feedback is the agent loop from the previous post, no longer in pseudocode but in code that runs. If you want to build it, start with the tests—they’re the spec and the stopping condition at once—leave an iteration cap for safety, and review the final code instead of trusting only the green.

In the first post we saw the pattern in the abstract; in this one we just saw it work. And what comes next doesn’t change it: everything we add to an agent—tools, memory, planning—will keep running exactly this same cycle of decide, act, and observe. The next post opens the “action” component: how the tools that turn this minimal loop into an agent that can touch files, commands, and APIs are designed.

Keep reading