Cloudflare published a post on orchestrating AI code review at scale, and I could not stop thinking about it. They run up to seven specialized AI reviewers on every merge request — security, performance, code quality, docs — managed by a coordinator that deduplicates the findings and posts a single verdict. They have run it across 131,246 reviews in a month, the median review takes 3 minutes 39 seconds, and the average costs $1.19. The number that stuck with me was 1.2 findings per review. Most AI review tools drown you. Theirs is quiet on purpose.
Their whole architecture is a plugin system around OpenCode, wired into GitLab CI, with Cloudflare Workers for the control plane. I have none of that. I have a laptop and Claude Code. But the shape of what they built (fan out specialists, verify, then have a coordinator judge) is exactly the shape Claude Code’s workflow feature was built for. So I rebuilt the idea as one ~250-line JavaScript file that runs against any repo on my machine. This post explains how, and covers the one place I disagreed with Cloudflare.
✨ TLDR
- → Cloudflare's lesson worth stealing: specialized reviewers beat one big prompt, and telling the model what NOT to flag is where the signal comes from
- → I rebuilt it as a single Claude Code workflow — no CI, no OpenCode — with four phases: Scope → Review → Verify → Coordinate
- → A Scope phase (a 'cartographer' agent) maps the repo first, so the review is language-agnostic and runs over a whole codebase, not just a diff
- → The one thing I changed: a dedicated Verify phase that refutes each finding in isolation with a stronger model, refute-by-default
- → My first version 'verified' 43 of 43 findings and dropped nothing — that's verification theater, and fixing it is the real story
- → The whole thing reuses Claude Code's existing specialist subagents (security-reviewer, architecture-reviewer, …) instead of new prompts
Table of Contents
- What Cloudflare Actually Built
- What I Had Instead: The Workflow Primitive
- Phase 0 — Scope: Map the Repo First
- Phase 1 — Review: The Specialist Panel
- Phase 2 — Verify: Where I Disagreed With Cloudflare
- Phase 3 — Coordinate: Dedup and Judge
- What I Changed From Cloudflare, and Why
- Limitations I’m Honest About
- Conclusion
What Cloudflare Actually Built
Before stealing an idea you should be honest about which part you are stealing. Cloudflare’s post is long and most of it is plumbing I will never need: a ReviewPlugin interface, JSONL streaming, circuit breakers, failback chains, a Workers KV control plane to swap models when a provider goes down at 8am UTC. That is the cost of running in the critical path of thousands of engineers. It is real engineering and none of it is my problem.
The part worth stealing is the review philosophy, and it comes down to three decisions:
- Specialized agents, not one big prompt. Instead of one model with a generic “find bugs” prompt, they launch domain experts. A security reviewer that only knows security. A performance reviewer that only knows performance. Each has a tightly scoped prompt.
- Tell the model what not to flag. This is the line that made the post for me: “telling an LLM what not to do is where the actual prompt engineering value resides.” Their security reviewer is explicitly forbidden from flagging theoretical risks, defense-in-depth piling, or “consider using library X” suggestions. That single discipline is why they get 1.2 findings per review instead of 10 useless ones.
- A coordinator judges at the end. One strong model reads everyone’s output, deduplicates, recategorizes misfiled findings, drops the speculative ones, and renders a single review against an approval rubric biased toward approval.
That is the whole machine, minus the plumbing. Fan out specialists → coordinator judges. And it maps almost perfectly onto a primitive I already had.
What I Had Instead: The Workflow Primitive
I wrote a whole post on Claude Code workflows recently, so I will keep this short. A workflow is a plain JavaScript file that orchestrates subagents deterministically. You own the control flow (the loops, the fan-out, the intermediate results), and each individual step is delegated to a fresh subagent. Three primitives do most of the work:
- agent(prompt, opts) runs one subagent. Pass a schema and it returns validated JSON instead of free text.
- parallel(thunks) runs tasks concurrently and waits for all of them, which makes it a barrier.
- pipeline(items, stage1, stage2, …) streams each item through every stage with no barrier between stages. Item A can be in stage 3 while item B is still in stage 1.
Cloudflare wrote a custom spawn_reviewers tool, a polling loop, inactivity detection, and a scheduler to manage concurrent sessions. In a workflow, parallel and pipeline are that scheduler. The concurrency cap, the result collection, the failure-to-null handling — all built in. So the question stopped being “how do I build an orchestrator” and became “what do I want the orchestrator to do.”
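To make the barrier-versus-stream distinction concrete, here is a minimal sketch using plain Promises. The parallel and pipeline functions below are hypothetical stand-ins written for illustration, not Claude Code's implementation, and the delays are made up. Only the semantics come from the description above: wait for all, failure becomes null, no barrier between stages.

```javascript
// Hypothetical stand-ins for illustration; NOT Claude Code's real primitives.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Barrier: resolves only when every thunk has settled. A failed task
// becomes null instead of rejecting the whole batch.
function parallel(thunks) {
  return Promise.all(
    thunks.map((thunk) => Promise.resolve().then(thunk).catch(() => null)),
  );
}

// Stream: each item walks through all stages on its own, so a fast item
// never waits for a slow sibling between stages.
function pipeline(items, ...stages) {
  return Promise.all(
    items.map(async (item) => {
      let value = item;
      for (const stage of stages) value = await stage(value);
      return value;
    }),
  );
}

async function demo() {
  const log = [];
  const stage1Delay = { A: 5, B: 60 }; // made-up durations in ms
  await pipeline(
    ["A", "B"],
    async (id) => { await sleep(stage1Delay[id]); log.push(`${id}:stage1`); return id; },
    async (id) => { await sleep(5); log.push(`${id}:stage2`); return id; },
  );
  const settled = await parallel([
    async () => "ok",
    async () => { throw new Error("provider down"); },
  ]);
  return { log, settled };
}

demo().then(({ log, settled }) => console.log(log, settled));
```

In the log, item A finishes stage 2 before item B has left stage 1, and the failed task in the parallel batch shows up as null next to the successful one.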
I named the workflow ultracode-codebase-review. It has four phases: Scope → Review → Verify → Coordinate.
Phase 0 — Scope: Map the Repo First
Here is my first real departure from Cloudflare. Their system reviews a diff — a merge request. It knows exactly which files changed, runs a risk assessment on the diff size, and filters out lock files and minified assets. The diff is the input.
I wanted to point this at a whole repository, in any language, without hardcoding a thing. So the workflow opens with a single cartographer agent whose entire job is to explore the repo and decide what is worth reviewing. It reads the project’s own docs — README, AGENTS.md, CONTRIBUTING, lint config — to extract the real conventions and architectural boundaries instead of inventing them. Then, for each candidate review lens, it decides whether that lens is even relevant and which files it should read.
const scope = await agent(
`You are the REPOSITORY CARTOGRAPHER for an automated code review. ` +
`Map THIS repository efficiently — directory listings, file counts, ` +
`targeted greps; SAMPLE large files, do not read everything. Read the ` +
`project's own docs to extract real conventions and boundaries — do ` +
`not invent rules.\n\n` +
`Then decide, for each candidate lens, whether it is worth running on ` +
`this repo and exactly which files it should read. Mark a lens ` +
`irrelevant if the codebase has nothing for it (e.g. "tests" when there ` +
`is no test suite).`,
{ label: "scope:cartographer", agentType: "Explore", schema: SCOPE_SCHEMA },
);
The cartographer returns a structured map: languages, frameworks, a module map, the project’s own conventions, the architectural boundaries to enforce, the hotspots worth deep review, and a per-lens relevance flag with focus paths. Everything downstream is driven by what this one agent produces, so nothing about the target repo is hardcoded. Point it at a Rust CLI and it skips the test-coupling lens if there are no tests. Point it at a TypeScript app and it lights up the type-safety lens.
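For a sense of what such a map might look like, here is a hypothetical sketch. The real SCOPE_SCHEMA lives in the gist and is not reproduced in this post, so every field name below is an illustrative assumption that follows the list above. The sketch also assumes the schema option accepts JSON-Schema-style objects. lensesToRun is a made-up helper name.

```javascript
// Illustrative only: field names are assumptions, not the gist's actual schema.
const stringList = { type: "array", items: { type: "string" } };

const SCOPE_SCHEMA = {
  type: "object",
  required: ["languages", "lenses"],
  properties: {
    languages: stringList,
    frameworks: stringList,
    moduleMap: stringList,
    conventions: stringList,
    boundaries: stringList,
    hotspots: stringList,
    lenses: {
      type: "array",
      items: {
        type: "object",
        required: ["key", "relevant", "focusPaths"],
        properties: {
          key: { type: "string" },
          relevant: { type: "boolean" },
          focusPaths: stringList,
        },
      },
    },
  },
};

// Keep only the lenses the cartographer marked relevant and gave files to read.
function lensesToRun(scope) {
  return scope.lenses.filter((l) => l.relevant && l.focusPaths.length > 0);
}

// Made-up map for a Rust CLI that has no test suite.
const exampleScope = {
  languages: ["rust"],
  lenses: [
    { key: "security", relevant: true, focusPaths: ["src/net.rs"] },
    { key: "tests", relevant: false, focusPaths: [] },
  ],
};

console.log(lensesToRun(exampleScope).map((l) => l.key)); // [ 'security' ]
```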
This is the trade Cloudflare made differently. They had the diff, so they did not need discovery. I wanted “review any repo,” so I paid for one mapping agent up front to earn that generality.
Phase 1 — Review: The Specialist Panel
Now the part I stole almost verbatim. The relevant lenses each get a specialist reviewer. The nice surprise: Claude Code already ships specialist subagents, so I did not have to write seven prompts from scratch. I just mapped each lens to the agent that already exists for it.
function agentTypeForLens(key) {
  switch (key) {
    case "security": return "security-reviewer";
    case "architecture": return "architecture-reviewer";
    case "performance": return "performance-reviewer";
    case "quality": return isTsJs ? "typescript-reviewer-v2"
      : "fowler-refactoring-reviewer";
    case "tests": return isTsJs ? "kcd-test-reviewer" : undefined;
    default: return undefined; // correctness, docs → general agent
  }
}
Each reviewer runs on Sonnet — the workhorse tier, exactly like Cloudflare’s Standard tier for their heavy sub-reviewers. And every reviewer gets the same two things: a shared context block built entirely from the cartographer’s map (so I am not duplicating repo metadata across seven prompts, which is Cloudflare’s shared-context optimization), and a SIGNAL_RULES block that is my port of their “What NOT to Flag” list:
WHAT NOT TO FLAG (bias hard for signal — noise gets you ignored):
- Theoretical risks needing unlikely preconditions.
- Defense-in-depth piling when the primary handling is already adequate.
- Style/naming nitpicks with no behavioral impact.
- "Consider library X" suggestions.
- Anything you have not confirmed by actually reading the relevant code.
This one block does most of the work. Without it you get the firehose. With it, an empty findings array becomes an acceptable answer for a clean file, and the reviewers do return empty arrays.
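Put together, each reviewer prompt is roughly the shared context, the lens-specific focus, and the signal rules. A minimal sketch of that assembly follows. buildSharedContext and buildReviewerPrompt are hypothetical helper names, and the scope fields are assumed shapes, since the post does not show the real ones.

```javascript
// The "what not to flag" rules, shared by every reviewer.
const SIGNAL_RULES = [
  "WHAT NOT TO FLAG (bias hard for signal):",
  "- Theoretical risks needing unlikely preconditions.",
  "- Defense-in-depth piling when the primary handling is already adequate.",
  "- Style/naming nitpicks with no behavioral impact.",
  '- "Consider library X" suggestions.',
  "- Anything you have not confirmed by actually reading the relevant code.",
].join("\n");

// Hypothetical helper: one context block, built once from the cartographer's map.
function buildSharedContext(scope) {
  return [
    `Languages: ${scope.languages.join(", ")}`,
    `Conventions: ${scope.conventions.join("; ")}`,
    `Boundaries: ${scope.boundaries.join("; ")}`,
  ].join("\n");
}

// Hypothetical helper: same context and rules for every lens, only the focus differs.
function buildReviewerPrompt(sharedContext, lens) {
  return [
    sharedContext,
    `LENS: ${lens.key}`,
    `FILES TO READ:\n${lens.focusPaths.map((p) => `- ${p}`).join("\n")}`,
    SIGNAL_RULES,
    "An empty findings array is a valid answer for clean code.",
  ].join("\n\n");
}

// Made-up scope data for the example.
const exampleScope = {
  languages: ["typescript"],
  conventions: ["no default exports"],
  boundaries: ["ui must not import from db"],
};

const examplePrompt = buildReviewerPrompt(buildSharedContext(exampleScope), {
  key: "security",
  focusPaths: ["src/auth/session.ts"],
});
console.log(examplePrompt);
```

Building the context once and passing it into every prompt keeps the seven reviewers consistent: if the cartographer's map changes, every lens sees the change.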
Phase 2 — Verify: Where I Disagreed With Cloudflare
It started as a failure.
In Cloudflare’s design, the coordinator does the judging. It reads all the findings and applies a “reasonableness filter” — drop the speculative ones, and if unsure, read the source. One model, one pass, looking at everything at once.
My first version copied that. I had the coordinator look at all the findings and decide. And when I ran it, it kept 43 of 43 findings and dropped nothing. Every single finding survived. That is not verification — that is a model looking at a batch of plausible-sounding claims and anchoring on “well, they all look reasonable.” I had built verification theater. The check existed, it ran, it cost tokens, and it verified nothing.
The fix was three changes, and each one attacks a specific failure mode:
- Independent, not batched. Each finding is judged by its own agent that sees only that one finding. No batch, so it cannot anchor on “these all look plausible together.”
- Cross-model. Reviewers run on Sonnet; verifiers run on Opus — a different, stronger model. The check is not a model grading its own homework.
- Refute by default. The verifier’s job is to disprove the finding from source. Its default stance is that the finding is wrong. It keeps the finding only if it can confirm it by reading the actual code.
const v = await agent(
`You are an ADVERSARIAL VERIFIER on an independent model. Your default ` +
`stance is that the finding below is WRONG. Open the cited file at the ` +
`cited location and try to REFUTE it. Set confirmed=false if it is a ` +
`false positive, depends on unlikely preconditions, is contradicted by ` +
`surrounding code, is pure style, or you cannot reproduce it from the ` +
`source. Keep it ONLY if the source genuinely proves it. You see exactly ` +
`one finding — judge it on its own merits.\n\n` +
`FINDING:\n${JSON.stringify(f, null, 2)}`,
{ label: `verify:${lens.key}`, model: "opus", schema: VERDICT_SCHEMA },
);
There is a cost discipline here too. Only critical and warning findings get the per-finding Opus refute, because that is where a false positive is expensive. Suggestions pass through unverified and are labelled verified: false — logged, never silently dropped. Spending Opus tokens to adversarially verify a “consider renaming this variable” suggestion is not worth it.
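The severity gate and the verdict handling can be sketched as two small pure functions. VERDICT_SCHEMA is not shown in this post, so the shape below is inferred from the fields the workflow skeleton reads (confirmed and adjustedSeverity, with a "dropped" value). The reason field and the helper names splitBySeverity and applyVerdict are assumptions.

```javascript
// Inferred shape; the real VERDICT_SCHEMA is in the gist and may differ.
const VERDICT_SCHEMA = {
  type: "object",
  required: ["confirmed", "adjustedSeverity"],
  properties: {
    confirmed: { type: "boolean" },
    adjustedSeverity: { enum: ["critical", "warning", "suggestion", "dropped"] },
    reason: { type: "string" }, // assumed field, not confirmed by the post
  },
};

// Cost gate: only critical and warning findings get the per-finding refute.
function splitBySeverity(findings) {
  const toRefute = findings.filter((f) => f.severity !== "suggestion");
  const passthrough = findings
    .filter((f) => f.severity === "suggestion")
    .map((f) => ({ ...f, verified: false })); // kept and labelled, not dropped
  return { toRefute, passthrough };
}

// Refute by default: a missing, unconfirmed, or "dropped" verdict removes the finding.
function applyVerdict(finding, verdict) {
  if (!verdict || !verdict.confirmed || verdict.adjustedSeverity === "dropped") {
    return null;
  }
  return { ...finding, severity: verdict.adjustedSeverity, verified: true };
}

// Made-up findings for the example.
const findings = [
  { id: 1, severity: "critical" },
  { id: 2, severity: "suggestion" },
];
const { toRefute, passthrough } = splitBySeverity(findings);
console.log(toRefute.length, passthrough[0].verified); // 1 false
console.log(applyVerdict(findings[0], { confirmed: true, adjustedSeverity: "warning" }));
console.log(applyVerdict(findings[0], null)); // null
```

Note that a verifier failure (a null verdict) is treated the same as a refutation here, which matches the refute-by-default stance but means an outage can hide a real finding.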
The Review and Verify phases run as one pipeline, not two parallel barriers. That matters: the moment the security reviewer finishes, its findings start getting verified, while the architecture reviewer is still reading files. No reviewer sits idle waiting for the slowest one before verification can begin.
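A back-of-the-envelope calculation shows why. With two barriers, total time is the slowest review plus the slowest verification. With a pipeline, it is the slowest lens end to end. The durations below are invented for illustration, not measurements, and the sketch assumes enough concurrency that no agent waits for a slot.

```javascript
// Invented per-lens durations in seconds, for illustration only.
const lenses = [
  { key: "security", review: 60, verify: 90 },
  { key: "architecture", review: 180, verify: 30 },
];

// Two barriers: nobody verifies until the slowest reviewer is done.
function barrierTotal(items) {
  const slowestReview = Math.max(...items.map((l) => l.review));
  const slowestVerify = Math.max(...items.map((l) => l.verify));
  return slowestReview + slowestVerify;
}

// Pipeline: each lens starts verifying as soon as its own review finishes.
function pipelineTotal(items) {
  return Math.max(...items.map((l) => l.review + l.verify));
}

console.log(barrierTotal(lenses)); // 270
console.log(pipelineTotal(lenses)); // 210
```

In this made-up case the security findings are fully verified at 150 seconds, before the architecture reviewer has even finished reading.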
Phase 3 — Coordinate: Dedup and Judge
The last phase is pure Cloudflare, and I kept it close to their design because it was already right. One Opus coordinator (the counterpart of their Top tier, chosen for the same reason: it has the hardest job) takes every confirmed finding and does three things: deduplicate (the same issue from two lenses gets kept once, in the best-fit lens), recategorize (a perf issue filed under “quality” moves to “performance”), and apply a final reasonableness filter.
Then it renders one consolidated review and makes an approval decision against a rubric lifted almost word-for-word from their post, bias toward approval included:
all clean / only trivial suggestions → approved
only suggestions, or warnings, no prod risk → approved_with_comments
multiple warnings forming a risk pattern → minor_issues
any critical / real safety or security hole → significant_concerns
The coordinator gets one extra instruction that my version needs and theirs does not: the high-severity findings already survived independent verification, but the suggestions did not, so scrutinize those yourself. The verify phase changes what the coordinator can trust, so it changes what the coordinator has to do.
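To see how the rubric's bias toward approval plays out, here is a deterministic approximation of it. In the workflow the coordinator model makes this call from the prose rubric. The decide function, the trivial flag, and the riskPattern input are all assumptions introduced for this sketch, and "multiple warnings" is read as two or more.

```javascript
// Deterministic approximation of the approval rubric. `trivial` and
// `riskPattern` are assumed inputs; the real judgment is made by the model.
function decide(findings, { riskPattern = false } = {}) {
  const count = (severity) => findings.filter((f) => f.severity === severity).length;

  // Any critical finding overrides everything else.
  if (count("critical") > 0) return "significant_concerns";

  // Several warnings only escalate when they form a risk pattern.
  if (count("warning") >= 2 && riskPattern) return "minor_issues";

  // Nothing at all, or nothing but trivial suggestions: approve outright.
  const onlyTrivial = findings.every((f) => f.severity === "suggestion" && f.trivial);
  if (onlyTrivial) return "approved";

  // Everything else is approved, with the comments attached.
  return "approved_with_comments";
}

console.log(decide([])); // approved
console.log(decide([{ severity: "warning" }])); // approved_with_comments
console.log(decide([{ severity: "warning" }, { severity: "warning" }], { riskPattern: true })); // minor_issues
console.log(decide([{ severity: "critical" }])); // significant_concerns
```

Three of the four outcomes are some form of approval or near-approval, which is the point: only a critical finding blocks.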
The full workflow script (ultra.js)
The complete script is on this GitHub gist. The skeleton:
export const meta = {
  name: "ultracode-codebase-review",
  description:
    "Cloudflare-style orchestrated AI review of ANY codebase: scope " +
    "discovery, specialist panel, independent cross-model adversarial " +
    "verify, coordinator judgment",
  phases: [
    { title: "Scope", detail: "cartographer maps the repo, picks files + lenses" },
    { title: "Review", detail: "specialist reviewers fan out over the scope" },
    { title: "Verify", detail: "each finding refuted in isolation by a stronger model" },
    { title: "Coordinate", detail: "dedup, re-categorize, render single review + decision" },
  ],
};
// PHASE 0: Scope. The cartographer maps the repo (language-agnostic).
phase("Scope");
const scope = await agent(/* … */, { agentType: "Explore", schema: SCOPE_SCHEMA });
const SHARED_CONTEXT = /* built entirely from scope */;
// PHASE 1+2: Review then Verify, pipelined per lens.
phase("Review");
const perLens = await pipeline(
  reviewers,
  // Stage 1: specialist review on Sonnet
  (r) =>
    agent(/* … */, { model: "sonnet", agentType: r.agentType, schema: FINDINGS_SCHEMA })
      .then((res) => ({ lens: r, review: res })),
  // Stage 2: per-finding independent adversarial verify on Opus
  async (prev) => {
    const toRefute = prev.review.findings.filter((f) => f.severity !== "suggestion");
    const verdicts = await parallel(
      toRefute.map((f) => async () => {
        const v = await agent(/* refute-by-default */, { model: "opus", schema: VERDICT_SCHEMA });
        return v?.confirmed && v.adjustedSeverity !== "dropped"
          ? { ...f, severity: v.adjustedSeverity, verified: true }
          : null;
      }),
    );
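    // Suggestions skip the Opus refute but still reach the coordinator, labelled verified: false.
    const unverified = prev.review.findings
      .filter((f) => f.severity === "suggestion")
      .map((f) => ({ ...f, verified: false }));
    return { lens: prev.lens.key, confirmed: verdicts.filter(Boolean), unverified };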
  },
);
// PHASE 3: Coordinate. Dedup, judge, render.
phase("Coordinate");
const coord = await agent(/* dedup + rubric */, { schema: COORD_SCHEMA });
return { decision: coord.decision, reviewBody: coord.reviewBody, /* … */ };
What I Changed From Cloudflare, and Why
Side by side, the differences are smaller than they look, and each one comes from a different constraint.
| | Cloudflare | My workflow |
|---|---|---|
| Input | A merge request diff | A whole repo, any language |
| Discovery | Diff is the scope | A cartographer maps the scope |
| Orchestrator | Custom spawn_reviewers tool | Built-in pipeline / parallel |
| Reviewer prompts | Hand-written markdown files | Claude Code’s existing subagents |
| Verification | Coordinator judges the batch | Independent, cross-model, refute-by-default |
| Runs in | GitLab CI, Cloudflare Workers | My laptop, one script |
The one I would defend hardest is verification. Folding the check into the coordinator is fine when the coordinator is a frontier model with a good “what not to flag” prompt and you have tens of thousands of reviews of tuning behind it. For a workflow I wrote in an evening, a batch judge rubber-stamps. Pulling verification into its own phase — one finding, one fresh stronger model, told to refute — is what turned the output from “43 plausible claims” into a list I actually trust.
Limitations I’m Honest About
This is not Cloudflare’s system and it is not pretending to be. A few things I know it does not do:
- It is not in the critical path. No CI gate, no circuit breakers, no failback when a provider is down. If Opus is overloaded mid-run, a verifier returns null and the finding quietly drops. Cloudflare spent real effort on resilience because engineers were blocked on it. I am not blocked on it.
- It costs more per run than a diff review. Mapping and reviewing a whole repo takes more tokens than reviewing a 40-line change. Cloudflare’s risk tiers (don’t send seven Opus agents to review a typo) are the obvious next thing to steal, and I have not yet.
- It inherits the model’s blind spots. Same honesty Cloudflare ended on: subtle concurrency bugs, cross-system impact, and “is this architecture moving in the right direction” are still beyond a static read of the code. This catches real bugs. It does not replace a human who knows why the system was built this way.
Conclusion
The lesson I keep taking from posts like Cloudflare’s is that the impressive part is rarely the part you need. I did not need their plugin architecture or their Workers control plane. I needed three ideas — specialists over one big prompt, tell the model what not to flag, and a coordinator that judges — and Claude Code’s workflow primitive handed me the orchestration for free.
The piece I am proudest of is the piece where I disagreed. Copying their coordinator-judges design gave me a check that verified nothing. Pulling verification into its own phase, with an independent stronger model told to refute by default, is what made the output trustworthy. If you take one thing from this post, take that: a verification step that never drops anything is not a verification step, it is a comfort blanket. Make your checker try to prove you wrong, on a different model, one claim at a time. The whole script is in the gist — point it at one of your repos and see what it refuses to flag.