Test-Driven AI Development: A New Contract Between Human and Machine
A note for readers who don’t know me: this was first written for an internal work blog, where my colleagues could read it in the spirit in which it was intended. I’m publishing it here unchanged.
The manifesto form is deliberate. It is overstated on purpose — pitched to drag the Overton window on what “coding with AI” should mean, not offered as a finished doctrine. I still stand by every claim; I just want to flag that the volume is turned up for rhetorical effect.
Read it as a provocation, not a sermon.
I. The Problem
We have been handed a new kind of colleague: tireless, fast, and utterly untrustworthy.
Large Language Models can write code in seconds that would take humans hours. They can refactor entire codebases, implement complex algorithms, and generate thousands of lines of working software. But they hallucinate. They drift. They misunderstand. They forget context. They are, in the words of our peers, “eager puppies” or “goldfish with PhDs.”
The response has been to treat them like junior developers: prompt carefully, review meticulously, hope for the best. We write specifications in English, that most ambiguous of languages, and wonder why the AI misunderstands. We try to teach them “clean code” and “best practices” - concepts we invented to help humans maintain codebases.
This is backwards.
II. The Core Insight
The AI can do anything you want, but you cannot trust it. If you choose to use it anyway, you must verify.
The only verification that matters is: does it do what we specified?
Not “is the code elegant?” Not “does it follow SOLID principles?” Not “would this pass code review?”
Does it pass the tests?
Tests are not documentation. Tests are not an afterthought. Tests are not “coverage metrics.”
Tests are the specification.
III. The Principles
1. Tests are the only valid form of specification
If you cannot encode a requirement as a test, you do not have a requirement - you have a vibe.
Not: “The system should be fast”
But:
```python
assert response_time < 0.1  # 90th percentile < 100ms
```
Not: “The code should be maintainable”
But:
```python
assert cyclomatic_complexity < 10
```
Not: “The UI should feel responsive”
But:
```python
assert frame_time_p95 < 16.67  # 60fps
```
If you cannot measure it, you cannot build it. If you cannot test it, you do not know if you have it.
2. Implementation is disposable, contracts are permanent
The code can be rewritten in any paradigm, any style, any architecture. The tests remain.
You can demand the AI refactor from object-oriented to functional, from synchronous to async, from monolith to microservices. As long as the tests pass, the system is correct.
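The point can be made concrete. Below, a hypothetical stack contract is satisfied both by an object-oriented class and by a closure over a plain list; the test neither knows nor cares which it gets. All names here are invented for this sketch:

```python
from types import SimpleNamespace

class ClassStack:
    """Object-oriented implementation."""
    def __init__(self):
        self._items = []
    def push(self, x):
        self._items.append(x)
    def pop(self):
        return self._items.pop()

def closure_stack():
    """Functional implementation: a closure over a plain list."""
    items = []
    return SimpleNamespace(push=items.append, pop=items.pop)

def check_contract(make_stack):
    # The contract: last in, first out. Nothing about how it is stored.
    s = make_stack()
    s.push(1)
    s.push(2)
    assert s.pop() == 2
    assert s.pop() == 1

# Both implementations pass the same tests; either can be thrown away.
check_contract(ClassStack)
check_contract(closure_stack)
```

Swap the implementation, rerun the suite: green means correct.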
3. The test suite is the interface, the codebase is a black box
You do not care what is inside the box. You care that when you invoke the interface, you get the specified behavior.
The AI can implement sort() as quicksort, mergesort, or a neural network. You don’t care. You specified the contract:
```python
from hypothesis import given
from hypothesis.strategies import integers, lists

@given(lists(integers()))
def test_sort_is_sorted(input_list):
    result = sort(input_list)
    assert result == sorted(input_list)    # Correct output
    assert len(result) == len(input_list)  # No elements lost
    assert is_stable(input_list, result)   # Stability property
```
4. Narrow the scope, constrain the solution space
Every test is a constraint. The more tests you write, the smaller the space of valid implementations.
This prevents the AI from “being helpful” in ways you didn’t ask for. It cannot add features, cannot make assumptions, cannot wander off task.
The tests are guard rails.
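Here is what a guard rail looks like in practice. An exact-match assertion forbids the implementation from adding anything you did not specify; a malformed-input assertion forbids silent guessing. `parse_config` and its behavior are invented for this sketch:

```python
def parse_config(text):
    # Hypothetical stand-in implementation so the sketch runs.
    if "=" not in text:
        raise ValueError(f"malformed line: {text!r}")
    key, _, value = text.partition("=")
    return {key.strip(): value.strip()}

# Exact-match constraint: the implementation cannot sneak in defaults,
# timestamps, or other "helpful" extras without failing this assertion.
assert parse_config("host = localhost") == {"host": "localhost"}

# Malformed input must raise, not be silently guessed at.
try:
    parse_config("host localhost")
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```

Every such assertion removes a region of the solution space where an over-eager implementation could wander.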
5. Types and interfaces are part of the specification
You specify whether something is a class or a function. You specify whether data is mutable or immutable. You specify the type signatures.
```python
def test_number_type_immutability():
    a = Number(4)
    b = a.add(3)
    assert a.value == 4  # Original unchanged
    assert b.value == 7  # New value returned
```
This IS the specification. The AI now knows: Number is a class, add() is a method, and the type is immutable.
IV. What We Reject
We reject “clean code” as a primary virtue.
Clean code was invented to help humans read and maintain software. If the AI maintains the code, and humans only maintain the tests, then clean code is optimization for the wrong metric.
We reject DRY as sacred.
Don’t Repeat Yourself was a human ergonomic concern. The AI is happy to update code in five places. If the tests pass, repetition is irrelevant.
We reject architectural purity for its own sake.
SOLID, design patterns, layered architectures - these were cognitive tools for humans managing complexity. If the complexity is managed by tests and implemented by AI, the architecture is an implementation detail.
We reject code review as the primary quality gate.
Code review catches what humans miss. Tests catch what anyone misses. A code review happens once. Tests run forever.
We reject natural language specifications.
English is ambiguous. “Fast” means nothing. “User-friendly” means nothing. “Robust” means nothing.
Tests are unambiguous.
V. What We Embrace
We embrace tests as the engineering discipline.
Writing good tests is now the craft. Knowing what to test, how to test it, what properties matter - this is where the rigor lives.
We embrace comprehensive contracts.
- Unit tests for correctness
- Property tests for invariants
- Performance tests for speed
- Fuzz tests for security
- Integration tests for composition
- Regression tests for stability
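To make one of these categories concrete: a regression test pins a fixed bug so that no future rewrite, human or AI, can reintroduce it. The `slugify` function and the bug scenario below are invented for this sketch:

```python
def slugify(title):
    # Stand-in implementation so the sketch runs; in TDAID the AI owns this.
    return "-".join(title.lower().split())

def test_regression_unicode_whitespace():
    """Once fixed: non-breaking spaces must not produce empty slug segments."""
    assert slugify("Hello\u00a0World") == "hello-world"

test_regression_unicode_whitespace()
```

The docstring records why the test exists; the assertion makes the fix permanent.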
We embrace measurable requirements.
Every requirement must be operationalized. Every quality must be quantified. Every expectation must be encoded.
We embrace fearless refactoring.
The AI can rewrite the entire codebase overnight. If the tests pass, you ship it.
We embrace the unknown implementation.
You don’t need to understand how the code works. You need to understand what it does. The tests tell you what it does.
VI. The Practices
1. Write the tests first
Before the AI writes a single line of implementation, write the comprehensive test suite:
- What are the happy paths?
- What are the edge cases?
- What are the error conditions?
- What are the performance requirements?
- What are the security properties?
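Turning that checklist into code might look like the following, assuming a hypothetical `parse_amount` function the AI has not yet written (a minimal reference implementation is included only so the sketch runs):

```python
from decimal import Decimal

def parse_amount(text):
    # Minimal stand-in; in practice the AI supplies this after the tests exist.
    cleaned = text.strip().lstrip("$").replace(",", "")
    value = Decimal(cleaned)
    if value < 0:
        raise ValueError("amounts must be non-negative")
    return value

# Happy path
assert parse_amount("$1,234.50") == Decimal("1234.50")

# Edge case: zero is allowed
assert parse_amount("0") == Decimal("0")

# Error condition: negative amounts are rejected
try:
    parse_amount("-5")
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```

Each question from the checklist becomes at least one concrete, executable answer before any implementation exists.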
2. Make tests self-documenting
```python
import time

def test_authentication_rate_limiting():
    """
    Security requirement: After 5 failed login attempts,
    the account must be locked for 15 minutes to prevent
    brute force attacks.
    """
    for _ in range(5):
        assert login("user", "wrong") is False
    assert login("user", "correct") == "rate_limited"
    time.sleep(900)  # 15 minutes
    assert login("user", "correct") is True
```
The AI reads this and knows exactly what to implement.
3. Use property-based testing
Don’t just test examples, test properties:
```python
from hypothesis import given
from hypothesis.strategies import integers

@given(integers(), integers())
def test_addition_commutative(a, b):
    assert add(a, b) == add(b, a)

@given(integers(), integers(), integers())
def test_addition_associative(a, b, c):
    assert add(add(a, b), c) == add(a, add(b, c))
```
Let the test framework generate thousands of cases.
4. Test performance as a first-class property
```python
import math

def test_query_scales_logarithmically():
    for size in [100, 1000, 10000, 100000]:
        data = generate_dataset(size)
        elapsed = time_query(data)
        # Allow O(log n) growth plus constant overhead
        assert elapsed < 0.001 * math.log2(size) + 0.01
```
5. Test security with fuzzing
```python
def test_sql_injection_resistance():
    malicious_payloads = [
        "'; DROP TABLE users--",
        "1' OR '1'='1",
        # ... hundreds more
    ]
    for payload in malicious_payloads:
        result = query_user(payload)
        assert not database_was_modified()
        assert not sensitive_data_leaked(result)
```
6. Demand refactors freely
“AI, rewrite this using async/await”
“AI, convert this to use a state machine”
“AI, optimize this for memory instead of speed”
Run the tests. If they pass, you’re done.
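As an illustration of the async/await demand, the same contract can be checked against a synchronous implementation and an AI-requested async rewrite. `fetch_total` and both implementations are invented for this sketch:

```python
import asyncio

def fetch_total_sync(values):
    """Original synchronous implementation."""
    return sum(values)

async def fetch_total_async(values):
    """AI-rewritten async implementation."""
    await asyncio.sleep(0)  # stands in for real async I/O
    return sum(values)

def run(impl, values):
    # Normalize: await coroutines, call plain functions directly.
    result = impl(values)
    return asyncio.run(result) if asyncio.iscoroutine(result) else result

# One contract, two paradigms. If this passes, the rewrite is accepted.
for impl in (fetch_total_sync, fetch_total_async):
    assert run(impl, [1, 2, 3]) == 6
```

The paradigm changed; the contract, and therefore the verdict, did not.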
VII. The Skills That Matter Now
The human’s job is no longer writing implementations. It is:
1. Understanding the problem domain
What are the real requirements? What are the edge cases? What properties must hold?
2. Designing the contract
What is the interface? What are the types? What are the invariants?
3. Writing comprehensive tests
This is the craft. This is where experience matters.
4. Maintaining the specification
When requirements change, update the tests. The AI will update the implementation.
5. Eating the sin
This is why all the other skills matter.
The machine has no skin in the game. It cannot be held accountable when something goes catastrophically wrong.
You must eat the sin.
When you type git commit and git push, you are saying: “I accept responsibility for what this does.”
Not “the AI did it.” Not “the tests passed.” Not “the system validated it.”
You did it. You shipped it. You own it.
You judge what tests cannot capture:
- Is this the right problem to solve?
- Will this harm users in ways we didn’t anticipate?
- Are there ethical implications beyond correctness?
You audit the specification:
- What haven’t we tested?
- What assumptions are baked into these tests?
- What could go wrong that we didn’t anticipate?
You accept that failures are inevitable:
- No test suite is complete
- No specification is perfect
- Production will find edge cases you never imagined
When it breaks, you don’t say “the AI was wrong” or “the tests were insufficient.”
You say: “I shipped it. I own the failure. I will fix it.”
TDAID does not eliminate responsibility. It focuses it.
You cannot hide behind “I was just following best practices” or “the code looked good in review.”
You specified it. You tested it. You shipped it. You own it.
The machine has no conscience. The tests have no judgment. The system has no mercy.
Only the human can eat the sin.
Embrace the Triangle of Trust:
- The machine implements
- The tests verify
- The human owns
Remove any leg and the system falls.
You cannot outsource responsibility to the machine. You can only outsource the implementation.
VIII. The Future
In this future:
- Codebases are ephemeral, test suites are permanent
- Developers write tests, AI writes implementations
- Code review focuses on test quality, not implementation quality
- “Technical debt” means poorly tested code, not “messy” code
- Refactoring is instant and fearless
- Languages and frameworks become implementation details
The test suite becomes the codebase. Everything else is just a proof object.
IX. The Call to Action
Start today:
- On your next feature, write comprehensive tests first
- Let the AI implement it
- Verify only that tests pass, not that the code is “good”
- Refactor ruthlessly
- Observe how much faster you move when implementation quality doesn’t matter
Challenge yourself:
- Can you specify your entire system as a test suite?
- Can you operationalize every requirement?
- Can you trust the machine if you verify comprehensively?
Teach others:
- Tests are specifications, not validation
- If you can’t test it, you can’t build it
- Implementation quality is an AI problem, contract quality is a human problem
X. The Motto
“I don’t care what the code is. I only care that it does what I say it does.”
Test-Driven AI Development is not a methodology. It is a recognition that the nature of software development has fundamentally changed. The machine implements. The human specifies. The tests are the contract. Write tests. Trust nothing. Verify everything.
TDAID: Because the only code you can trust is code you’ve tested.
Write tests. Trust nothing. Verify everything. Sign your name.