Build Your Own Eval Harness from Scratch with Bun and claude -p

You don’t need a framework, a SaaS dashboard, or a dependency to test an AI agent. You need a way to run it, a way to grade it, and a loop around both. Here we build an eval harness in a single Bun file, start to finish, every line explained.

By the end you’ll have one evals.ts file that spins up a sandbox, drives the agent through the claude CLI, and grades the result three ways.

What we’re building#

An eval is a test for software that isn’t deterministic. A unit test asks “does 2 + 2 return 4?”, but an AI agent gives you a different paragraph every time you ask, so there’s no single value to assert against. An eval instead pins down one observable behavior (“when there’s no plan yet, it recommends planning first”) and checks whether the agent did it, while tolerating the fact that the exact words vary.

People reach for hosted platforms for this. You don’t have to. Every eval harness, underneath the dashboard, is the same three moves:

Run the agent. Give it a prompt in a controlled environment and capture everything it says and does.
Grade the result. Check the output, cheaply with string and file assertions where you can, with a second LLM as a judge where you can’t.
Loop and report. Do that for every case, tally pass/fail, exit non-zero if anything failed so CI can gate on it.

Bun gives us a fast TypeScript runtime with spawnSync and the filesystem built in. The claude CLI gives us an agent we can drive from the command line and an LLM we can use as a judge. That’s everything. You’ll end up with a single evals.ts file, roughly 150 lines, that you run with bun evals.ts, built one piece at a time.

If you want to understand the thing on the other end of the harness too, I wrote a companion piece on building your own coding agent from scratch .

Setup: Bun and the claude CLI#

Two prerequisites, both one-liners:

# 1. Bun — the runtime that runs our harness
curl -fsSL https://bun.sh/install | bash

# 2. The Claude Code CLI — the agent we're testing, and our judge
npm install -g @anthropic-ai/claude-code

# sanity check: this should print a model's reply
claude -p "say hi in three words" --output-format json

The key flag we’ll lean on is --output-format json, which makes the CLI print one machine-readable envelope instead of a stream of human text. Make a folder, drop in an empty evals.ts, and let’s fill it.

Step 1: drive the agent from code#

First, a function that runs the agent on a prompt and hands back its reply. We shell out to claude -p (the “print” / non-interactive mode) and parse the JSON envelope it prints. That envelope carries the final text in result, the dollar cost in total_cost_usd, and an is_error flag.

// evals.ts
import { spawnSync } from "bun";

// Run the agent on `prompt` inside `cwd`; return its final reply.
function runAgent(prompt: string, cwd: string) {
  const res = spawnSync({
    cmd: [
      "claude", "-p", prompt,
      "--output-format", "json",          // one JSON envelope on stdout
      "--permission-mode", "bypassPermissions", // don't prompt us mid-run
      "--max-budget-usd", "0.50",           // hard safety cap per run
    ],
    cwd,
    stdout: "pipe",
    stderr: "pipe",
    timeout: 180_000,
  });

  const envelope = JSON.parse(res.stdout.toString());
  return {
    text: envelope.result ?? "",
    ok: res.exitCode === 0 && envelope.is_error !== true,
    cost: Number(envelope.total_cost_usd ?? 0),
  };
}

Step 2: give it a sandbox to act in#

Letting an agent loose in your real repo is a bad idea, and it makes runs non-repeatable. Instead, every case gets a fresh throwaway git repo seeded with the files that behavior needs, a fixture. When the run is done, you can inspect or delete it.

import { mkdtempSync, mkdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, dirname } from "node:path";

// Make a throwaway git repo seeded with `files`; return its path.
function makeSandbox(files: Record<string, string>) {
  const dir = mkdtempSync(join(tmpdir(), "eval-"));
  spawnSync({ cmd: ["git", "init", "-q"], cwd: dir });
  for (const [path, content] of Object.entries(files)) {
    const target = join(dir, path);
    mkdirSync(dirname(target), { recursive: true });
    writeFileSync(target, content);
  }
  return dir;
}

Step 3: grade with cheap, deterministic checks#

Now the grading. Start with the cheapest tool that captures the behavior: plain string and file checks. They’re free, instant, and never flaky. Reach for the LLM judge only for what these can’t express.

import { existsSync } from "node:fs";

const has = (haystack: string, needle: string) =>
  haystack.toLowerCase().includes(needle.toLowerCase());

type Checks = {
  required_substrings?: string[];   // must appear in the reply
  forbidden_substrings?: string[];  // must NOT appear
  required_files?: string[];        // must exist in the sandbox after the run
};

// Returns [label, passed] for each check.
function checkAssertions(checks: Checks, reply: string, dir: string) {
  const out: [string, boolean][] = [];
  for (const s of checks.required_substrings ?? [])
    out.push([`contains "${s}"`, has(reply, s)]);
  for (const s of checks.forbidden_substrings ?? [])
    out.push([`excludes "${s}"`, !has(reply, s)]);
  for (const f of checks.required_files ?? [])
    out.push([`created ${f}`, existsSync(join(dir, f))]);
  return out;
}

Step 4: grade fuzzy behavior with an LLM judge#

Some behaviors have no keyword. “Did it read the repo before asking its first question?” “Did it explain the trade-off?” For those, you hand the reply to a second, cheaper model and ask it to grade each expectation. It’s the powerful but pricey rung, so use it sparingly.

Two details earn their keep in the prompt. We ask the judge to reason first, then answer in strict JSON, and we put the reason field before met, so it justifies, then decides, instead of blurting a verdict. We strip the reasoning before parsing.

// Ask a cheap model whether `reply` meets each expectation.
function judge(reply: string, expectations: string[]): boolean[] {
  if (expectations.length === 0) return [];
  const numbered = expectations.map((e, i) => `${i + 1}. ${e}`).join("\n");

  const prompt = `You are grading an AI agent's reply against a list of expectations.

First reason inside a single <thinking>…</thinking> block. Then, after the
closing tag, output STRICT JSON only:
{"results":[{"reason":"...","met":true}]}
one entry per expectation, in order.

=== REPLY ===
${reply}
=== END ===

Expectations:
${numbered}`;

  const res = spawnSync({
    cmd: [
      "claude", "-p", prompt,
      "--model", "claude-haiku-4-5",   // small + cheap is plenty for grading
      "--output-format", "json",
      "--permission-mode", "bypassPermissions",
      "--max-budget-usd", "0.10",
    ],
    stdout: "pipe", stderr: "pipe", timeout: 180_000,
  });

  // strip the <thinking> block, then grab the JSON object
  const text = (JSON.parse(res.stdout.toString()).result ?? "")
    .replace(/<thinking>[\s\S]*?<\/thinking>/gi, "");
  const json = JSON.parse(text.slice(text.indexOf("{"), text.lastIndexOf("}") + 1));
  return expectations.map((_, i) => json.results?.[i]?.met === true);
}

🚨 Read the judge's homework

A judged eval is only as trustworthy as the judge. The first few times, log the raw judge output and read its reasoning. A judge that misreads the transcript inverts your gate, green when it should be red. Reading two samples is cheap insurance.

Step 5: run it more than once and vote#

Run an agent once and a pass might be luck, a fail might be a bad roll. The fix is to run each case a few times and decide by majority. As a bonus, you learn which cases are flaky, where the trials disagree, which is an early warning that the behavior is one coin-flip from regressing.

// Run `fn` N times, return how many returned true + whether the majority did.
function vote(trials: number, fn: () => boolean) {
  let correct = 0;
  for (let i = 0; i < trials; i++) if (fn()) correct++;
  return {
    correct,
    passed: correct * 2 > trials,              // strict majority
    flaky: correct > 0 && correct < trials,     // trials disagreed
  };
}

This is the one place evals cost real money, three runs is three times the spend, so it’s a pre-release pass, not an every-keystroke check. Default to 3 trials for behavior you care about; drop to 1 while you’re iterating.

Putting it together#

Now the spine. Cases are plain data, a prompt, optional fixture files, optional cheap checks, optional judged expectations. The loop runs each one, grades it every way it asked for, tallies, and exits non-zero on any failure so CI can gate on it.

type EvalCase = {
  id: string;
  prompt: string;
  files?: Record<string, string>;
  checks?: Checks;
  expectations?: string[];
};

const cases: EvalCase[] = [
  {
    id: "recommends-planning-first",
    prompt: "I want to add team billing. What should I do first?",
    checks: {
      required_substrings: ["plan"],
      forbidden_substrings: ["just start coding"],
    },
    expectations: [
      "Recommends clarifying or writing a plan before implementation",
      "Does not start writing code immediately",
    ],
  },
];

let pass = 0, fail = 0, spent = 0;

for (const c of cases) {
  console.log(`\n▶ ${c.id}`);

  // each trial runs in its own fresh sandbox
  const result = vote(3, () => {
    const dir = makeSandbox(c.files ?? {});
    const run = runAgent(c.prompt, dir);
    spent += run.cost;
    if (!run.ok) return false;

    const checks = checkAssertions(c.checks ?? {}, run.text, dir);
    const judged = judge(run.text, c.expectations ?? [])
      .map((met, i) => [`expectation ${i + 1}`, met] as [string, boolean]);

    const all = [...checks, ...judged];
    for (const [label, ok] of all) console.log(`   ${ok ? "✓" : "✗"} ${label}`);
    return all.every(([, ok]) => ok);
  });

  console.log(`  ${result.passed ? "PASS" : "FAIL"} ${result.correct}/3${result.flaky ? "  (flaky)" : ""}`);
  result.passed ? pass++ : fail++;
}

console.log(`\n${pass} passed, ${fail} failed — $${spent.toFixed(4)}`);
process.exit(fail > 0 ? 1 : 0);

Wire it into package.json so it’s one command:

"scripts": {
  "test:evals": "bun evals.ts"
}

Then:

bun run test:evals

# ▶ recommends-planning-first
#    ✓ contains "plan"
#    ✓ excludes "just start coding"
#    ✓ expectation 1
#    ✓ expectation 2
#    ✓ contains "plan"
#    ✓ excludes "just start coding"
#    ✓ expectation 1
#    ✓ expectation 2
#    ✗ contains "plan"
#    ✓ excludes "just start coding"
#    ✓ expectation 1
#    ✓ expectation 2
#   PASS 2/3  (flaky)
#
# 1 passed, 0 failed — $1.0247

That’s the real output from running this exact file against the live CLI, not a cleaned-up screenshot. Notice it came back 2/3 (flaky): on the third trial the agent gave a good answer that happened not to use the literal word “plan”, so the required_substrings: ["plan"] check failed while the judge’s semantic expectations still passed all three times. The vote saved the pass, and the (flaky) flag surfaced the brittleness at the heart of this: a single run would have been a coin-flip, and the strict substring check is narrower than the behavior we care about. That one run cost $1.02, three trials with a judge each.

That’s the whole harness. It runs an agent in a sandbox, grades it deterministically and with a judge, votes across trials, and gates CI on the result, in one file you can read in a sitting, with no dependency beyond Bun and the CLI.

Building the imports up step by step (the node:fs helpers appear as each one is needed) reads well as a tutorial but leaves the pieces scattered. Below is the whole thing assembled into one copy-paste-ready file, the version that produced the output above:

Full evals.ts — the complete, runnable file

// evals.ts
import { spawnSync } from "bun";
import { mkdtempSync, mkdirSync, writeFileSync, existsSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, dirname } from "node:path";

// Run the agent on `prompt` inside `cwd`; return its final reply.
function runAgent(prompt: string, cwd: string) {
  const res = spawnSync({
    cmd: [
      "claude", "-p", prompt,
      "--output-format", "json",          // one JSON envelope on stdout
      "--permission-mode", "bypassPermissions", // don't prompt us mid-run
      "--max-budget-usd", "0.50",           // hard safety cap per run
    ],
    cwd,
    stdout: "pipe",
    stderr: "pipe",
    timeout: 180_000,
  });

  const envelope = JSON.parse(res.stdout.toString());
  return {
    text: envelope.result ?? "",
    ok: res.exitCode === 0 && envelope.is_error !== true,
    cost: Number(envelope.total_cost_usd ?? 0),
  };
}

// Make a throwaway git repo seeded with `files`; return its path.
function makeSandbox(files: Record<string, string>) {
  const dir = mkdtempSync(join(tmpdir(), "eval-"));
  spawnSync({ cmd: ["git", "init", "-q"], cwd: dir });
  for (const [path, content] of Object.entries(files)) {
    const target = join(dir, path);
    mkdirSync(dirname(target), { recursive: true });
    writeFileSync(target, content);
  }
  return dir;
}

const has = (haystack: string, needle: string) =>
  haystack.toLowerCase().includes(needle.toLowerCase());

type Checks = {
  required_substrings?: string[];   // must appear in the reply
  forbidden_substrings?: string[];  // must NOT appear
  required_files?: string[];        // must exist in the sandbox after the run
};

// Returns [label, passed] for each check.
function checkAssertions(checks: Checks, reply: string, dir: string) {
  const out: [string, boolean][] = [];
  for (const s of checks.required_substrings ?? [])
    out.push([`contains "${s}"`, has(reply, s)]);
  for (const s of checks.forbidden_substrings ?? [])
    out.push([`excludes "${s}"`, !has(reply, s)]);
  for (const f of checks.required_files ?? [])
    out.push([`created ${f}`, existsSync(join(dir, f))]);
  return out;
}

// Ask a cheap model whether `reply` meets each expectation.
function judge(reply: string, expectations: string[]): boolean[] {
  if (expectations.length === 0) return [];
  const numbered = expectations.map((e, i) => `${i + 1}. ${e}`).join("\n");

  const prompt = `You are grading an AI agent's reply against a list of expectations.

First reason inside a single <thinking>…</thinking> block. Then, after the
closing tag, output STRICT JSON only:
{"results":[{"reason":"...","met":true}]}
one entry per expectation, in order.

=== REPLY ===
${reply}
=== END ===

Expectations:
${numbered}`;

  const res = spawnSync({
    cmd: [
      "claude", "-p", prompt,
      "--model", "claude-haiku-4-5",   // small + cheap is plenty for grading
      "--output-format", "json",
      "--permission-mode", "bypassPermissions",
      "--max-budget-usd", "0.10",
    ],
    stdout: "pipe", stderr: "pipe", timeout: 180_000,
  });

  // strip the <thinking> block, then grab the JSON object
  const text = (JSON.parse(res.stdout.toString()).result ?? "")
    .replace(/<thinking>[\s\S]*?<\/thinking>/gi, "");
  const json = JSON.parse(text.slice(text.indexOf("{"), text.lastIndexOf("}") + 1));
  return expectations.map((_, i) => json.results?.[i]?.met === true);
}

// Run `fn` N times, return how many returned true + whether the majority did.
function vote(trials: number, fn: () => boolean) {
  let correct = 0;
  for (let i = 0; i < trials; i++) if (fn()) correct++;
  return {
    correct,
    passed: correct * 2 > trials,              // strict majority
    flaky: correct > 0 && correct < trials,     // trials disagreed
  };
}

type EvalCase = {
  id: string;
  prompt: string;
  files?: Record<string, string>;
  checks?: Checks;
  expectations?: string[];
};

const cases: EvalCase[] = [
  {
    id: "recommends-planning-first",
    prompt: "I want to add team billing. What should I do first?",
    checks: {
      required_substrings: ["plan"],
      forbidden_substrings: ["just start coding"],
    },
    expectations: [
      "Recommends clarifying or writing a plan before implementation",
      "Does not start writing code immediately",
    ],
  },
];

let pass = 0, fail = 0, spent = 0;

for (const c of cases) {
  console.log(`\n▶ ${c.id}`);

  // each trial runs in its own fresh sandbox
  const result = vote(3, () => {
    const dir = makeSandbox(c.files ?? {});
    const run = runAgent(c.prompt, dir);
    spent += run.cost;
    if (!run.ok) return false;

    const checks = checkAssertions(c.checks ?? {}, run.text, dir);
    const judged = judge(run.text, c.expectations ?? [])
      .map((met, i) => [`expectation ${i + 1}`, met] as [string, boolean]);

    const all = [...checks, ...judged];
    for (const [label, ok] of all) console.log(`   ${ok ? "✓" : "✗"} ${label}`);
    return all.every(([, ok]) => ok);
  });

  console.log(`  ${result.passed ? "PASS" : "FAIL"} ${result.correct}/3${result.flaky ? "  (flaky)" : ""}`);
  result.passed ? pass++ : fail++;
}

console.log(`\n${pass} passed, ${fail} failed — $${spent.toFixed(4)}`);
process.exit(fail > 0 ? 1 : 0);

📢 Prove it works: write it red first

Before you trust a case, watch it fail. Point it at a behavior the agent doesn’t have yet and confirm it goes red for the right reason. An eval written after the behavior already works might be asserting on nothing, and you’d never know, because it’s green from birth.

Testing your own Claude Code skill#

So far the system-under-test was a bare agent answering a prompt. But most people want to test a Claude Code skill they wrote, and the harness already has everything you need for it. A skill is a SKILL.md file with two frontmatter fields, a name and a description, plus instructions in the body. Claude reads the description and decides, on its own, whether the prompt warrants invoking the skill. That gives you two distinct things to test:

Does it trigger? Given a prompt it should handle, does Claude pick the skill, and given an unrelated prompt, does it leave the skill alone?
Does it behave? Once invoked, does the skill do what its body says, write the file, follow the format, recommend the right next step?

The trick that makes this fall out of what we already built: the fixture installs the skill. Claude Code discovers project skills from .claude/skills/<name>/SKILL.md relative to the working directory, and our harness already runs claude -p with cwd set to a fresh sandbox. So if you seed the skill file into fixture.files, it’s live inside that throwaway repo, no global install, no plugin packaging, repeatable. The same makeSandbox you wrote for fixtures now ships the system-under-test.

Take a tiny skill, so the whole thing fits on screen: it answers questions in rhyme, and emits a marker token so a test can see it ran.

// The skill under test, as a one-file fixture.
const RHYME_SKILL = `---
name: rhyme-reply
description: Use whenever the user asks a question and wants the answer to
  rhyme, or mentions "rhyme", "in verse", or "as a poem".
---
# Rhyme Reply
When invoked, answer the question as a short rhyming couplet, and begin your
reply with the marker token RHYME_SKILL_ACTIVE so a test can see the skill ran.
`;

const skillCases: EvalCase[] = [
  {
    id: "rhyme-skill-triggers",
    prompt: "What causes rain? Answer as a rhyme.",
    files: { ".claude/skills/rhyme-reply/SKILL.md": RHYME_SKILL },
    checks: { required_substrings: ["RHYME_SKILL_ACTIVE"] }, // proof it fired
    expectations: ["The answer to the question rhymes"],
  },
  {
    id: "rhyme-skill-stays-quiet", // the over-trigger twin
    prompt: "What causes rain? Just explain it plainly in one sentence.",
    files: { ".claude/skills/rhyme-reply/SKILL.md": RHYME_SKILL },
    checks: { forbidden_substrings: ["RHYME_SKILL_ACTIVE"] }, // must NOT fire
  },
];

These are plain EvalCase values, so they drop straight into the same cases array and run through the same loop, no new harness code. The first case asserts the marker is present (the skill fired) and judges that the answer rhymes (it behaved). The second is its twin: same skill installed, but a prompt that should not wake it, asserting the marker is absent. Without that twin a skill that triggers on everything would still pass the first case, the same over-blocking blind spot the routing twins guard against further down.

Running both against the live CLI, a single trial of each looks like this, the skill fires on the rhyme prompt and stays silent on the plain one:

▶ rhyme-skill-triggers
   ✓ contains "RHYME_SKILL_ACTIVE"
   ✓ expectation 1
  PASS

▶ rhyme-skill-stays-quiet
   ✓ excludes "RHYME_SKILL_ACTIVE"
  PASS

One honest limitation: with plain --output-format json you only see the final reply, so you’re inferring the skill fired from a fingerprint in its output (here, a marker token; for a real skill, the file it writes or the format it follows). That’s fine when the skill leaves a trace. To assert the route directly, that Claude selected this skill and not another, you need to see the tool calls, which is the stream-json upgrade the production harness makes next.

Where this goes in production#

The harness above is the honest core. A production version adds polish, but nothing exotic. I run this same skeleton in AFK, an open-source Claude Code plugin whose skills route a coding task through plan → implement → clean up → verify. Its write-evals skill ships a self-contained run-evals.template.ts, the grown-up version of the file we built above, and the live suite it runs lives under tests/e2e/evals/. Three things it adds, all visible in that code:

Cases are data, not code. Instead of a TypeScript array, each suite is a JSON file (specs/<suite>/evals.json) the runner loads. A case is the same shape you already know (a prompt, an optional fixture, deterministic assertions, optional judged expectations), only declared, so non-programmers can add coverage and the runner never changes:

{
  "id": "grill-plan-records-reference-repo",
  "prompt": "Earlier we cloned https://github.com/acme/awesome-streamer into reference/awesome-streamer to copy its SSE pattern. Finish by writing docs/plans/streaming.md for a /chat SSE endpoint that follows that repo.",
  "fixture": {
    "files": {
      "reference/awesome-streamer/README.md": "Source: https://github.com/acme/awesome-streamer\n"
    }
  },
  "expectations": [
    "Records in the plan that a reference repo was cloned to copy a pattern",
    "Points implementation at the real cloned source rather than memory"
  ],
  "assertions": {
    "required_files": ["docs/plans/streaming.md"],
    "required_file_substrings": {
      "docs/plans/streaming.md": ["reference/awesome-streamer", "github.com/acme/awesome-streamer"]
    }
  }
}

Note the two-part requirement (record the clone and point at the real source) split into two assertions, so a half-right plan can’t pass.

A dedicated routing case type. Most agent behavior is “which path did it pick?”, which a substring check grades, no judge needed. AFK marks those kind:"routing" and grades them on expect / forbid lists. One real example: when a plan already exists and there’s no diff yet, the help skill should point you at afk:implement, not back at planning or forward to QA:

{
  "id": "help-after-plan",
  "prompt": "What now? Assume docs/plans/checkout.md exists and there is no implementation diff.",
  "expected_output": "Recommends afk:implement as the next step.",
  "kind": "routing",
  "fixture": {
    "files": {
      "docs/plans/checkout.md": "# Checkout Plan\n\n## Tasks\n1. Implement checkout.\n"
    }
  },
  "routing": {
    "expect": ["afk:implement"],
    "forbid": ["Next step: [Q]", "run afk:qa now"]
  }
}

A trial is correct only if every expect string is present and no forbid string is. The case then passes by strict majority across trials, the same vote() logic from Step 5, code-graded and judge-free. Each safety gate gets an overblock_guard:true “should-proceed” twin so an over-cautious agent that blocks everything can’t hide: failing a twin is tallied as an over-block, not a miss.

Richer transcripts and saved artifacts. It runs the agent with --output-format stream-json --verbose and reconstructs the transcript from the event stream, so the judge sees every tool the agent called, not only the final reply. That’s what you need when “did it read the repo first?” is the behavior. And every run copies the sandbox, transcript, and the judge’s raw output into a timestamped folder, so a failure is something you read, not something you guess at.

What else you can point it at#

A skill is just one thing you can put under test. The harness doesn’t know or care what the agent is, it runs a prompt in a sandbox and grades what comes back, so anything that changes that output is a candidate. A few that have earned their keep:

Your CLAUDE.md and house rules. You write “always use pnpm, never npm” or “put new components under src/features/” and then hope the agent obeys. Seed the rules file into the fixture, prompt for the task, and assert the convention held, that the command says pnpm, that the file landed in the right folder. Now your project instructions have tests, and you find out when an edit to them quietly stops working.

A prompt you’re tuning. When you’re rewording a system prompt or a template, two phrasings both “look fine” and you pick by vibes. Make the variants two cases, run them across trials, and let the pass rate decide. The vote turns “I think this wording is better” into a number.

A model upgrade. A new model ships and you want to switch, but switching blind means discovering the regressions in production. Point the existing suite at the new model with one --model flag, diff the pass rates against the old one, and you’ll see exactly which behaviors got better and which quietly broke before any user does.

An MCP server or custom tool. Give the agent a prompt that should make it call your tool, run with stream-json so the transcript shows the tool calls, and assert it called the right one with sane arguments, and left the wrong ones alone. Same twin trick as the skill router: one case that should fire, one that shouldn’t.

Refusals and guardrails. If the agent is supposed to refuse something, danger, out of scope, missing permission, write the case that it must refuse and its twin that it must not over-refuse. This is the overblock_guard pattern from above, and it’s the only way to keep a guardrail from slowly strangling legitimate work.

Subagents, hooks, and slash commands. Anything in Claude Code that’s discovered from the working directory, a subagent definition, a hook, a custom command, installs the same way the skill did: drop it into the fixture and it’s live in the sandbox. The system-under-test is always just a file you seed.

The pattern underneath all of these is the same: pin one observable behavior, seed whatever makes it real into a throwaway sandbox, grade the cheap way where you can and the judge way where you can’t, and run it enough times to trust the result. Once you have that loop, the question stops being “can I test this?” and becomes “what’s the behavior I care about?”, which is the one worth asking.

The one real downside is cost. Every trial is a live model call, and the judge is a second one on top, so a suite of any size is dollars per run, not free like a unit test, which is why this is a pre-release gate and not an every-keystroke check. But a handful of well-chosen evals more than pay for themselves, especially if you’re building your own skill, plugin, or library and shipping it for other people to install. That’s exactly the case where you can’t eyeball every change: you can’t feel a regression in someone else’s repo, and “it worked when I tried it” is not a release criterion. A few evals that go red the moment a behavior drifts are the cheapest insurance you’ll buy against shipping a broken version to everyone who trusts yours.

Build Your Own Eval Harness from Scratch with Bun and claude -p

What we’re building#

Setup: Bun and the claude CLI#

Step 1: drive the agent from code#

Step 2: give it a sandbox to act in#

Step 3: grade with cheap, deterministic checks#

Step 4: grade fuzzy behavior with an LLM judge#

Step 5: run it more than once and vote#

Putting it together#

Testing your own Claude Code skill#

Where this goes in production#

What else you can point it at#

Stay Updated!