The limit is your validation criteria, not the agent

Eno Reyes gave a short talk recently that deserves more attention than it got. He is co-founder and CTO of Factory — a company building autonomous software engineering agents, founded in 2023, whose main product is a coding agent called Droid. His argument is deceptively simple: the frontier of what AI agents can do is not a function of model capability. It is a function of how verifiable your environment is.

The argument opens with Karpathy on software 2.0 and a point Eno attributes to Jason about the asymmetry of verification. The intuition is essentially P vs NP made practical: there are many tasks that are far easier to check than to produce. The interesting cases — the ones that yield to agents — have five properties: they have an objective truth, they are quick to validate, they scale (you can check many in parallel), they are low-noise, and they produce continuous signals rather than a binary pass/fail.
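To make the asymmetry concrete, here is a minimal sketch using subset-sum as a stand-in problem. The problem and the function names `verify` and `solve` are my own illustration, not something from the talk. Checking a proposed answer takes one pass over it, while finding one by brute force means walking through subsets.

```python
from itertools import combinations

def verify(numbers, target, candidate):
    """Cheap check: every element must come from `numbers`, and the sum must match."""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:
            return False
        pool.remove(x)  # each input number may be used only once
    return sum(candidate) == target

def solve(numbers, target):
    """Expensive search: try subsets of increasing size until one hits the target."""
    for size in range(1, len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None

numbers = [3, 34, 4, 12, 5, 2]
answer = solve(numbers, 9)          # exponential in the worst case
ok = verify(numbers, 9, answer)     # linear in the size of the answer
```

An agent plays the role of `solve`; your environment plays the role of `verify`. The cheaper and more trustworthy the check, the more attempts the agent can afford.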

Software development scores highly on all five. That is why software development agents are currently the most advanced agents in the world. The 20-to-30-year investment in automated testing (unit tests, end-to-end tests, linters, type systems, QA pipelines) has built exactly the verification infrastructure that agents need to self-correct.

The problem is that most codebases have not built it well enough.

Your company probably runs at 50–60% test coverage. Someone secretly hates the flaky build that fails every third run, but no one says anything. The linter exists, but it is not opinionated enough to catch AI-generated mediocrity — it catches style, not quality. These gaps are tolerable when humans fill them with judgment. They become critical failures when you introduce agents, because agents cannot substitute judgment for missing signals. They will produce code that passes every check you have, while violating every implicit standard you have not encoded.
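As an illustration of what encoding an implicit standard might look like, here is a small sketch of a custom check built on Python's standard `ast` module. The rule itself (no exception handler whose body is only `pass`) and the name `find_swallowed_exceptions` are assumptions chosen for the example. Your own unwritten rules will be different.

```python
import ast

def find_swallowed_exceptions(source: str) -> list[int]:
    """Return the line numbers of `except` handlers whose body is only `pass`."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                hits.append(node.lineno)
    return sorted(hits)

# Hypothetical code under review: the first handler silently drops the error.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except OSError:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError:
        return None
'''

violations = find_swallowed_exceptions(SAMPLE)
```

A style linter would accept both functions. This check flags only the first, which is the kind of judgment a reviewer usually applies without writing it down.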

The reframing of the development loop is the useful part of the talk. Traditional development: understand → design → code → test. Agent-assisted development with good validation: specify constraints → generate → verify (automated and human) → iterate. The shift is from writing software to curating the environment in which software is written. The engineer’s job becomes encoding opinions — which patterns are acceptable, which invariants must hold, which tests would catch specifically AI-generated slop. Eno calls this “specification-driven development” and notes that most of the better coding tools have started building around it: plan mode, spec mode, AGENTS.md files.
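The generate, verify, iterate part of that loop can be sketched in a few lines. This is a toy under stated assumptions: `run_loop`, `fake_generate`, and `check_add` are hypothetical names, the "agent" is a stub that fixes its output once it sees feedback, and the human review step is left out.

```python
def run_loop(generate, checks, max_rounds=3):
    """Ask for a candidate, run every check, and feed failures back until all pass."""
    feedback = []
    for round_number in range(1, max_rounds + 1):
        candidate = generate(feedback)
        # Each check returns an error message, or None when it passes.
        feedback = [msg for check in checks if (msg := check(candidate)) is not None]
        if not feedback:
            return candidate, round_number
    return None, max_rounds

def fake_generate(feedback):
    # Stand-in for an agent: wrong on the first try, corrected after feedback.
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

def check_add(candidate):
    # The encoded constraint: add(2, 3) must equal 5.
    namespace = {}
    exec(candidate, namespace)
    result = namespace["add"](2, 3)
    return None if result == 5 else f"add(2, 3) returned {result}, expected 5"

accepted, rounds = run_loop(fake_generate, [check_add])
```

The loop is only as good as the list of checks passed to it. With an empty list, the first candidate is accepted no matter what it does.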

His point about junior versus senior developers is sharp. If your senior engineers use agents successfully and your juniors do not, the instinct is to blame skill or prompting technique. The real answer is usually that junior engineers do not know which niche practices your codebase requires, and those practices are not encoded anywhere an agent can find them. Fix the validation, and the gap closes. That is a meaningfully different diagnosis from “junior engineers are bad at prompting.”

The Google/Meta analogy grounds this nicely. A new hire with zero context can safely round YouTube’s border radius and be confident it will not take down a billion-user product. That confidence does not come from the hire’s competence. It comes from the validation infrastructure that must be satisfied before the change ships. The claim is that we can now build that infrastructure at smaller scale — and that coding agents can help identify where the gaps are. You can ask a coding agent to find where your linters are under-opinionated. You can ask it to generate tests.

One quote from Factory engineer Alvin should survive the talk: “A slop test is better than no test.” Controversial. Also correct. Even a crude test, as long as it passes when your code is correct and fails when it is wrong, teaches agents to write more tests. The pattern propagates. Other agents notice it, follow it, and the environment becomes more opinionated over time.
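For a sense of how low the bar is, here is a sketch of a slop test. The `slugify` function and both implementations are hypothetical, invented for this example. The test has one hard-coded input and no edge cases, yet it still separates a working implementation from a broken one.

```python
def slugify(title):
    """Working version: lowercase the title and join the words with hyphens."""
    return "-".join(title.lower().split())

def broken_slugify(title):
    """Broken version: lowercases but forgets to replace the whitespace."""
    return title.lower()

def crude_test(fn):
    # A slop test: a single example, nothing about punctuation or empty strings.
    return fn("Hello World") == "hello-world"

passes_on_correct = crude_test(slugify)
passes_on_broken = crude_test(broken_slugify)
```

It will miss plenty of bugs. It still gives an agent a signal it did not have before, and an existing test to imitate.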

Eno is explicit that none of this is Factory-specific. The checklist — linters, tests, OpenAPI docs, type systems, AGENTS.md files — applies to any coding agent you are currently using. Spending 45 days comparing tools to find one that scores 10% better on SWE-bench is not the highest-leverage move. Investing in the validation infrastructure that makes every coding tool work better — that is where the 5–7x return comes from. Not 1.5x. Not 2x.

“The limiter is not the capability of the coding agent. The limit is your organization’s validation criteria.”

Worth writing somewhere permanent.