TL;DR — I spent a long session composing five AI-agent tools into a single lane that runs from raw idea to reviewed build. One of them is a generator (it takes a half-formed idea and produces a validated opportunity); the other four are judges (they grade an artifact that already exists). The judges turned out to want the same plumbing, so I pulled it into one shared Runtime. The lesson wasn’t any individual tool. It was the shape: one generator, a mesh of judges, a shared substrate underneath, and clean handoffs between them. This post is the architecture, an end-to-end walkthrough on a throwaway example, and the honest list of what it can’t do.
The boring parts of shipping are usually ad hoc
Between “I have an idea” and “we shipped it” sit a handful of judgments everyone makes and almost nobody systematizes: Is this worth building, and for whom? Is the build plan actually sound, or just a feature list? Is the thing we built healthy, or quietly leaking? Which model should run this, and should it run on-device or in the cloud?
Most teams answer those by gut and a Slack thread. But each one is a repeatable judgment against a standard — exactly the kind of thing you can hand to an agent if you can write the rubric down. So I did. Five tools, built on a multi-agent workflow runtime, each owning one judgment. The point of this post is not the tools individually; it’s how they compose.
The shape: one generator, a mesh of judges, one substrate
The cleanest way to see it is three layers.
The pipeline (the temporal spine). A raw idea flows left to right: validate → hand off → build → run. There is exactly one generator on this line. It’s the only component that produces a new artifact; everything downstream consumes what already exists.
The evaluator mesh (cross-cutting). Clamped onto that spine is a set of judges. None of them is a pipeline stage — they’re quality stations that attach where they’re needed. Critically, almost everything except the generator is a judge. That recut is the whole design: once you notice that four of your five tools are doing the same kind of work — scoring an artifact against a rubric — you stop building four bespoke things.
The substrate (below the line). Underneath sits the shared machinery the judges all draw on: the call to a judge model, the scoring math, the report format, a small reference store of vendor facts. The judges are thin; the substrate is where the reuse lives.
Here’s the map.
IDEA ──▶ [ Validator ] ──handoff──▶ BUILD ──▶ RUN         (pipeline · one generator)
                          ▲           ▲        ▲
                          │           │        │
                    [ Reviewer ] [ Auditor ] [ Bench ]    (evaluator mesh · the judges)
                  └──────── one shared Runtime ────────┘  (substrate · the plumbing)
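The same shape can be written down as types. This is a minimal sketch, not the actual Runtime: `Artifact`, `Finding`, `Generator`, `Judge`, `PlanJudge`, and `run_mesh` are hypothetical names invented for illustration. It shows a judge attaching only where it applies instead of sitting on the pipeline as a stage.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Artifact:
    kind: str   # hypothetical labels: "idea", "build_plan", "code"
    body: str


@dataclass
class Finding:
    rule: str
    detail: str


class Generator(Protocol):
    """Produces a new artifact from an existing one."""
    def generate(self, source: Artifact) -> Artifact: ...


class Judge(Protocol):
    """Grades an artifact that already exists; never creates one."""
    def applies_to(self, artifact: Artifact) -> bool: ...
    def judge(self, artifact: Artifact) -> list[Finding]: ...


class PlanJudge:
    """Toy judge (assumption): attaches only to build plans."""
    def applies_to(self, artifact: Artifact) -> bool:
        return artifact.kind == "build_plan"

    def judge(self, artifact: Artifact) -> list[Finding]:
        if "error-cost" in artifact.body.lower():
            return []
        return [Finding("error_cost_model", "no error-cost model described")]


def run_mesh(artifact: Artifact, judges: list[Judge]) -> dict[str, list[Finding]]:
    """Run only the judges that attach to this artifact; skip the rest."""
    return {
        type(j).__name__: j.judge(artifact)
        for j in judges
        if j.applies_to(artifact)
    }


plan_report = run_mesh(
    Artifact("build_plan", "Features: scan, categorize, export"), [PlanJudge()]
)
code_report = run_mesh(Artifact("code", "def main(): ..."), [PlanJudge()])
```

Here `plan_report` carries one finding from `PlanJudge`, while `code_report` is empty because no judge in the list attaches to code.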
The five parts
The Validator (the generator). It takes a raw idea and runs a deep-research fan-out — roughly thirty agents across market sizing, competitors, existing solutions, trends, regulation, and tech feasibility, every fact carrying a citation. It then scores the opportunity on an asymmetric rubric and returns a verdict: pass, park, or kill. (That thirty-agent fan-out is a story of its own; here it’s just the Validator’s first stage.) This is the only tool that creates something; the rest react to it.
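The scoring step at the end of that stage might look roughly like the sketch below. The post does not spell out the rubric, so the dimension names, weights, and both thresholds are illustrative assumptions. Unequal weights are only one possible reading of "asymmetric".

```python
# Hypothetical rubric: dimension names, weights, and thresholds are
# illustrative assumptions, not the Validator's real values.
WEIGHTS = {
    "market": 0.3,
    "willingness_to_pay": 0.3,
    "competition": 0.2,
    "feasibility": 0.2,
}
PASS_THRESHOLD = 70.0   # at or above: pass
KILL_THRESHOLD = 40.0   # below: kill; anything in between: park


def composite(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def verdict(score: float) -> str:
    """Map a composite score to pass, park, or kill."""
    if score >= PASS_THRESHOLD:
        return "pass"
    if score < KILL_THRESHOLD:
        return "kill"
    return "park"


# Made-up scores for a crowded market with thin willingness-to-pay.
example_scores = {
    "market": 70,
    "willingness_to_pay": 30,
    "competition": 40,
    "feasibility": 90,
}
example_verdict = verdict(composite(example_scores))
```

With these made-up numbers the composite lands between the two thresholds, so `example_verdict` is `"park"`.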
The Reviewer (a judge). When an idea graduates into a build plan, the Reviewer grades that plan against a body of product doctrine for machine-learning features: is there an error-cost model? a locked evaluation set? a release gate? a plan for the data? It scores the presence of discipline, not specific numbers — because the numbers in any doctrine go stale, but “did you define a cost-of-error contract at all?” doesn’t.
The Auditor (a judge). Once code exists, the Auditor scans it for AI-specific health: sensitive data flowing to a cloud model unredacted, untrusted input concatenated into a prompt, missing evaluation coverage, no drift monitoring after deploy. It’s deliberately the complement of an ordinary line-level code reviewer — it does not hunt for off-by-one bugs; it asks whether the system’s posture is sound. Keeping it at that altitude is a constant discipline: the moment an architecture auditor starts flagging dead code, it has become a worse version of a tool you already have.
The Bench (a conditional judge). When — and only when — there’s a real run-it-locally vs call-the-cloud decision for a model, the Bench bakes the options off against a held-out set and picks by data instead of vibes. (Its lineage is an evaluation framework I built earlier.) Most ideas don’t have that fork, and the Bench correctly stays asleep for them.
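A minimal sketch of that conditional bake-off, under stated assumptions: `bake_off` and both stub classifiers are hypothetical, and accuracy on the held-out set is the only criterion. A real decision would also weigh privacy, latency, and cost.

```python
from typing import Callable, Optional

Example = tuple[str, str]  # (input text, expected label)


def bake_off(
    candidates: dict[str, Callable[[str], str]],
    held_out: list[Example],
) -> Optional[dict]:
    """Compare candidates on a held-out set; return None when there is no fork."""
    if len(candidates) < 2 or not held_out:
        return None  # nothing to decide, so the Bench stays asleep
    accuracy = {
        name: sum(predict(text) == label for text, label in held_out) / len(held_out)
        for name, predict in candidates.items()
    }
    winner = max(accuracy, key=accuracy.get)
    return {"winner": winner, "accuracy": accuracy}


# Stand-in models (assumptions): a crude keyword rule and a lookup table.
def local_stub(text: str) -> str:
    if "cafe" in text:
        return "meals"
    if "paper" in text:
        return "office supplies"
    return "other"


CLOUD_ANSWERS = {
    "latte at cafe": "meals",
    "printer paper": "office supplies",
    "uber to airport": "travel",
    "sandwich": "meals",
}


def cloud_stub(text: str) -> str:
    return CLOUD_ANSWERS.get(text, "other")


held_out = list(CLOUD_ANSWERS.items())
result = bake_off({"local": local_stub, "cloud": cloud_stub}, held_out)
no_fork = bake_off({"cloud": cloud_stub}, held_out)
```

With a single candidate, `no_fork` is `None`: there is no fork to settle, which is the "stays asleep" behavior described above.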
The Runtime (the substrate). Three of the four judges wanted the same plumbing: call a judge model, turn findings into a score, render a digest, write a report. So I extracted it once. The judges import it; the rubric stays bespoke to each. One detail worth flagging — the judges split into two scoring philosophies, and the Runtime supports both: additive (build a score up from the presence of good practice) and subtractive (start at 100 and deduct for each detected problem). A doctrine review is additive; a code audit is subtractive.
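The two scoring philosophies are small enough to sketch side by side. The function names, point values, and penalties below are hypothetical, and flooring the subtractive score at zero is an assumption rather than something the post states.

```python
def additive_score(checks: dict[str, bool], points: dict[str, int]) -> int:
    """Build the score up: earn points for each practice that is present."""
    return sum(points[name] for name, present in checks.items() if present)


def subtractive_score(
    findings: list[str], penalties: dict[str, int], start: int = 100
) -> int:
    """Start at `start` and deduct for each detected problem (floored at 0)."""
    return max(0, start - sum(penalties[name] for name in findings))


# Illustrative doctrine-review rubric (point values are made up).
REVIEW_POINTS = {
    "error_cost_model": 30,
    "locked_eval_set": 30,
    "release_gate": 20,
    "data_plan": 20,
}
plan_checks = {
    "error_cost_model": False,
    "locked_eval_set": False,
    "release_gate": True,
    "data_plan": False,
}
reviewer_score = additive_score(plan_checks, REVIEW_POINTS)

# Illustrative code-audit rubric (penalties are made up).
AUDIT_PENALTIES = {
    "unredacted_sensitive_data": 40,
    "prompt_injection_surface": 30,
    "no_eval_coverage": 15,
    "no_drift_monitoring": 15,
}
audit_findings = [
    "unredacted_sensitive_data",
    "prompt_injection_surface",
    "no_eval_coverage",
]
auditor_score = subtractive_score(audit_findings, AUDIT_PENALTIES)
```

In this toy run the plan earns 20 of 100 for having only a release gate, and the audit drops from 100 to 15 after three findings. The rubric tables differ per judge; the two functions are the part worth sharing.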
The move that made it a system, not five scripts
For a while these were five separate things that happened to run on the same workflow engine. Two changes turned them into one system.
The first was naming the asymmetry out loud: one generator, the rest judges. That immediately told me what was shareable. Generators and judges have little in common; judges have almost everything in common.
The second was the rule of three. Once a third judge wanted the same call-a-model-and-score loop, copy-paste stopped being acceptable and extraction started paying. The shared Runtime is small on purpose — it holds the plumbing every judge needs and nothing about any particular rubric. The rubrics are the moat; the plumbing is a commodity I only want to write once.
A walkthrough: one idea, end to end
To test the whole lane I pushed a single throwaway idea through it: an app that photographs receipts and auto-categorizes expenses for freelancers. (Neutral on purpose — the pipeline doesn’t care what the idea is.)
Validator. The thirty-agent research run came back with a real but crowded market and thin willingness-to-pay against free incumbents. The composite score landed under the venture threshold, so the verdict was park, and that is the pipeline working as intended. The whole point of a cheap validation stage is to kill or shelve weak ideas before they cost build time. A “park” on day one is a win, not a failure.
Reviewer. For the test I drafted a V0 build plan anyway and handed it over. The Reviewer was blunt: the plan was a feature list, not a behavioral contract. It flagged the missing error-cost model (mislabeling a coffee as “office supplies” and leaking a full bank statement are not the same magnitude of error, and the plan treated them as if they were), the absent held-out evaluation set, and the hand-wave around the user’s financial data. Verdict: not ready to build.
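To make the error-cost point concrete, here is a small worked sketch. The error kinds, relative costs, and error rates are invented for illustration; the Reviewer asks only that some such contract exists, not that it uses these numbers.

```python
# Hypothetical relative costs: a leak is assumed to be far worse than a mislabel.
ERROR_COSTS = {
    "mislabeled_category": 1.0,
    "leaked_bank_statement": 1000.0,
}


def expected_cost(error_rates: dict[str, float], costs: dict[str, float]) -> float:
    """Expected cost per receipt: sum of (error rate x cost of that error)."""
    missing = set(error_rates) - set(costs)
    if missing:
        raise ValueError(f"no cost defined for: {sorted(missing)}")
    return sum(rate * costs[kind] for kind, rate in error_rates.items())


# Made-up rates: mislabels are common, leaks are rare.
rates = {"mislabeled_category": 0.05, "leaked_bank_statement": 0.001}

weighted = expected_cost(rates, ERROR_COSTS)
# The plan's implicit model: every error costs the same.
flat = expected_cost(rates, {kind: 1.0 for kind in rates})
```

Under the flat model the rare leak barely registers. Under the weighted one it accounts for nearly all of the expected cost, and a feature list alone cannot show that difference.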
Build + Auditor. I scaffolded a deliberately sloppy version — receipt text and user fields passed straight to a cloud model, in a loop, no redaction. The first audit pass found nothing, which taught me something useful: the sensitive-data detector was tuned for one vocabulary of fields and didn’t recognize this one. I widened it, re-ran, and the Auditor then correctly caught all of it — financial data going to the cloud unredacted, a prompt built out of untrusted receipt text (an injection surface), no evaluation coverage, and no drift monitoring. A detector that finds nothing is not the same as clean code.
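The empty first pass is easy to reproduce in miniature. In the sketch below every field name and vocabulary is hypothetical. Reporting unrecognized fields is one possible mitigation, not necessarily what the Auditor does.

```python
def audit_fields(
    fields: list[str], sensitive: set[str], known_safe: set[str]
) -> dict[str, list[str]]:
    """Flag sensitive fields, and list fields the detector has no opinion on."""
    normalized = [name.lower() for name in fields]
    hits = sorted(name for name in normalized if name in sensitive)
    unrecognized = sorted(
        name for name in normalized
        if name not in sensitive and name not in known_safe
    )
    return {"hits": hits, "unrecognized": unrecognized}


# Vocabulary tuned for one kind of app (assumption).
SENSITIVE_V1 = {"ssn", "password", "date_of_birth"}
# Widened vocabulary after the miss (assumption).
SENSITIVE_V2 = SENSITIVE_V1 | {"receipt_text", "account_number", "card_last4"}
KNOWN_SAFE = {"request_id", "locale"}

# Fields the sloppy build sends to the cloud model (made up).
payload = ["request_id", "receipt_text", "account_number", "locale"]

first_pass = audit_fields(payload, SENSITIVE_V1, KNOWN_SAFE)
second_pass = audit_fields(payload, SENSITIVE_V2, KNOWN_SAFE)
```

`first_pass["hits"]` is empty, yet two fields show up as unrecognized. That is the signal that "found nothing" means "outside the vocabulary" here rather than "clean".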
Bench. And here the conditional judge finally had a reason to wake up: the audit finding created a local-vs-cloud decision. If receipts carry financial PII, maybe the categorizer should run on-device so the images never leave the phone — at some quality cost. That trade is exactly what the Bench exists to settle with data rather than argument.
What it can’t do
A pipeline that only advertises its wins is lying to you.
- The judges limit bad judgment; they don’t cure it. A clean score is no guarantee that the thing is good. It only shows the absence of the specific failures the rubric knows to look for. Unknown unknowns sail straight through.
- The generator is the expensive part. A real research run means dozens of agents, real wall-clock time, and real money. The judges are cheap; the generator is not. Don’t re-run it to prove a one-line fix; reuse the cached result.
- Heuristics miss, silently. The empty first audit is the cautionary tale of the whole project. Every detector has a vocabulary, and anything outside it is invisible. Budget for “found nothing” being a false negative.
- Conditional tools should stay asleep. The Bench is only correct when a genuine model fork exists. Wiring it into every build “for completeness” would be pure waste — the opposite of the point.
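Reusing the generator's cached result can be as simple as keying the research output on a hash of the idea text. This is a minimal sketch under stated assumptions: `cached_research`, the on-disk JSON layout, and the fake research function are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_research(
    idea: str, run_research: Callable[[str], dict], cache_dir: Path
) -> dict:
    """Return the cached result for this idea, running research only on a miss."""
    key = hashlib.sha256(idea.strip().lower().encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = run_research(idea)  # the expensive call
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


calls: list[str] = []


def fake_research(idea: str) -> dict:
    """Stand-in for the expensive fan-out; records each real invocation."""
    calls.append(idea)
    return {"idea": idea, "verdict": "park"}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "research"
    first = cached_research("Receipt scanner for freelancers", fake_research, cache)
    second = cached_research("receipt scanner for freelancers ", fake_research, cache)
```

The second call differs only in case and trailing whitespace, so it hits the cache and `fake_research` runs once. Normalizing the key this way is a design choice; a stricter key would treat any edit to the idea as a new run.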
What I’d tell someone building this
- Separate the generator from the judges. One component makes new artifacts; everything else grades them. That single distinction tells you what to share.
- Give the judges one runtime. The rubric is the value and stays bespoke. The call-and-score plumbing is a commodity — write it once.
- Score the presence of discipline, not magic numbers. “Is there an error-cost model?” ages well. “Is latency under 20 ms?” is stale by next quarter.
- Keep every judge at its own altitude. An architecture auditor that drifts into line-level nitpicks has become a worse copy of a tool you already own. Hand that work back.
- Let the pipeline kill cheaply. The earliest, cheapest park or kill is the most valuable thing the whole machine does.
Closing
The hype frame is “AI agents that build products.” The honest frame is quieter: one generator, a mesh of judges that share a runtime, and clean baton-passes between them. The agents are nearly interchangeable parts. The judgment lives in the rubrics you write down — and the leverage lives in noticing that four of your five tools were the same tool all along.