Test-Driven AI Development: A New Contract Between Human and Machine

A note for readers who don’t know me: this was first written for an internal work blog, where my colleagues could read it in the spirit in which it was intended. I’m publishing it here unchanged.

The manifesto form is deliberate. It is overstated on purpose — pitched to drag the Overton window on what “coding with AI” should mean, not offered as a finished doctrine. I still stand by every claim; I just want to flag that the volume is turned up for rhetorical effect.

Read it as a provocation, not a sermon.

I. The Problem

We have been handed a new kind of colleague: tireless, fast, and utterly untrustworthy.

Large Language Models can write code in seconds that would take humans hours. They can refactor entire codebases, implement complex algorithms, and generate thousands of lines of working software. But they hallucinate. They drift. They misunderstand. They forget context. They are, in the words of our peers, “eager puppies” or “goldfish with PhDs.”

The response has been to treat them like junior developers: prompt carefully, review meticulously, hope for the best. We write specifications in English, that most ambiguous of languages, and wonder why the AI misunderstands. We try to teach them “clean code” and “best practices” - concepts we invented to help humans maintain codebases.

This is backwards.

II. The Core Insight

The AI can do anything you ask, but you cannot trust it. And whatever you cannot trust, you must verify.

The only verification that matters is: does it do what we specified?

Not “is the code elegant?” Not “does it follow SOLID principles?” Not “would this pass code review?”

Does it pass the tests?

Tests are not documentation. Tests are not an afterthought. Tests are not “coverage metrics.”

Tests are the specification.

III. The Principles

1. Tests are the only valid form of specification

If you cannot encode a requirement as a test, you do not have a requirement - you have a vibe.

Not: “The system should be fast”
But: assert p90_response_time < 0.1 # 90th percentile under 100ms

Not: “The code should be maintainable”
But: assert cyclomatic_complexity < 10

Not: “The UI should feel responsive”
But: assert frame_time_p95 < 16.67 # 60fps

If you cannot measure it, you cannot build it. If you cannot test it, you do not know if you have it.
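The point generalizes: any quality you can measure, you can assert. A minimal runnable sketch of the latency example, with a trivial `handler` standing in for the real system under test:

```python
import time

def handler():
    # Stand-in for the code under test.
    return sum(range(1_000))

def p90_latency(fn, runs=200):
    """Return the 90th-percentile latency of fn, in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[int(0.9 * len(samples))]

def test_p90_latency_under_100ms():
    assert p90_latency(handler) < 0.1

test_p90_latency_under_100ms()
```

In practice you would sample real request traces rather than a synthetic loop, but the shape of the assertion is the same.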

2. Implementation is disposable, contracts are permanent

The code can be rewritten in any paradigm, any style, any architecture. The tests remain.

You can demand the AI refactor from object-oriented to functional, from synchronous to async, from monolith to microservices. As long as the tests pass, the system is correct.

3. The test suite is the interface, the codebase is a black box

You do not care what is inside the box. You care that when you invoke the interface, you get the specified behavior.

The AI can implement sort() as quicksort, mergesort, or a neural network. You don’t care. You specified the contract:

@given(lists(tuples(integers(), integers())))
def test_sort_contract(pairs):
    # Sort by first element only, so duplicate keys make stability observable
    result = sort(pairs, key=lambda p: p[0])
    assert [k for k, _ in result] == sorted(k for k, _ in pairs)  # Correct order
    assert sorted(result) == sorted(pairs)  # No elements lost or invented
    assert is_stable(pairs, result, key=lambda p: p[0])  # Equal keys keep input order
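The `is_stable` helper is assumed above; a minimal sketch, leaning on the guarantee that Python’s built-in `sorted` (Timsort) is itself stable:

```python
def is_stable(original, result, key=lambda x: x):
    # Python's built-in sorted() is guaranteed stable, so a stable
    # sort must agree with it element-for-element.
    return result == sorted(original, key=key)

# Equal keys (both 1) must keep their original relative order.
pairs = [(1, "b"), (0, "x"), (1, "a")]
assert is_stable(pairs, [(0, "x"), (1, "b"), (1, "a")], key=lambda p: p[0])
assert not is_stable(pairs, [(0, "x"), (1, "a"), (1, "b")], key=lambda p: p[0])
```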

4. Narrow the scope, constrain the solution space

Every test is a constraint. The more tests you write, the smaller the space of valid implementations.

This prevents the AI from “being helpful” in ways you didn’t ask for. It cannot add features, cannot make assumptions, cannot wander off task.

The tests are guard rails.
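To see the narrowing in action, consider a hypothetical `slugify` function (names and cases invented for illustration). Each new test rules out implementations the earlier tests still allowed:

```python
def slugify(text: str) -> str:
    # One implementation that survives all three constraints below.
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return "-".join(cleaned.split())

# Constraint 1: basic behavior -- many implementations pass.
assert slugify("Hello World") == "hello-world"

# Constraint 2: punctuation handling -- fewer implementations pass.
assert slugify("Hello, World!") == "hello-world"

# Constraint 3: no leading or trailing separators -- fewer still.
assert slugify("  spaced  out  ") == "spaced-out"
```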

5. Types and interfaces are part of the specification

You specify whether something is a class or a function. You specify whether data is mutable or immutable. You specify the type signatures.

def test_number_type_immutability():
    a = Number(4)
    b = a.add(3)
    assert a.value == 4  # Original unchanged
    assert b.value == 7  # New value returned

This IS the specification. The AI now knows: Number is a class, add() is a method, and the type is immutable.
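One implementation that satisfies this contract, sketched as a frozen dataclass; it is a sketch, not the only valid answer, since any implementation passing the test is equally correct:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Number:
    value: int

    def add(self, other: int) -> "Number":
        # frozen=True makes assignment to self.value raise,
        # so add() must return a new instance.
        return Number(self.value + other)

a = Number(4)
b = a.add(3)
assert (a.value, b.value) == (4, 7)
```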

IV. What We Reject

We reject “clean code” as a primary virtue.

Clean code was invented to help humans read and maintain software. If the AI maintains the code, and humans only maintain the tests, then clean code is optimization for the wrong metric.

We reject DRY as sacred.

Don’t Repeat Yourself was a human ergonomic concern. The AI is happy to update code in five places. If the tests pass, repetition is irrelevant.

We reject architectural purity for its own sake.

SOLID, design patterns, layered architectures - these were cognitive tools for humans managing complexity. If the complexity is managed by tests and implemented by AI, the architecture is an implementation detail.

We reject code review as the primary quality gate.

Code review catches what humans miss. Tests catch what anyone misses. A code review happens once. Tests run forever.

We reject natural language specifications.

English is ambiguous. “Fast” means nothing. “User-friendly” means nothing. “Robust” means nothing.

Tests are unambiguous.

V. What We Embrace

We embrace tests as the engineering discipline.

Writing good tests is now the craft. Knowing what to test, how to test it, what properties matter - this is where the rigor lives.

We embrace comprehensive contracts.

  • Unit tests for correctness
  • Property tests for invariants
  • Performance tests for speed
  • Fuzz tests for security
  • Integration tests for composition
  • Regression tests for stability
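As a sketch of how several of these layers stack on one function; the function, and the regression scenario, are invented for illustration:

```python
def clamp(x, lo, hi):
    # Hypothetical function under contract.
    return max(lo, min(hi, x))

# Unit tests: correctness on known cases.
assert clamp(5, 0, 10) == 5
assert clamp(-1, 0, 10) == 0
assert clamp(99, 0, 10) == 10

# Property test (a plain loop standing in for a hypothesis @given):
# the invariant lo <= result <= hi holds across the input range.
for x in range(-50, 50):
    assert 0 <= clamp(x, 0, 10) <= 10

# Regression test: pin the behavior on inverted bounds so it cannot
# silently change (here, lo wins -- an invented policy for illustration).
assert clamp(5, 10, 0) == 10
```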

We embrace measurable requirements.

Every requirement must be operationalized. Every quality must be quantified. Every expectation must be encoded.

We embrace fearless refactoring.

The AI can rewrite the entire codebase overnight. If the tests pass, you ship it.

We embrace the unknown implementation.

You don’t need to understand how the code works. You need to understand what it does. The tests tell you what it does.

VI. The Practices

1. Write the tests first

Before the AI writes a single line of implementation, write the comprehensive test suite:

  • What are the happy paths?
  • What are the edge cases?
  • What are the error conditions?
  • What are the performance requirements?
  • What are the security properties?
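Worked through for a deliberately tiny, hypothetical `divide` API, that checklist might yield:

```python
def divide(a: float, b: float) -> float:
    # Hypothetical implementation, written only after the tests below existed.
    if b == 0:
        raise ZeroDivisionError("b must be nonzero")
    return a / b

# Happy path
assert divide(10, 4) == 2.5

# Edge cases
assert divide(0, 5) == 0
assert divide(-9, 3) == -3

# Error conditions
try:
    divide(1, 0)
except ZeroDivisionError:
    pass
else:
    raise AssertionError("expected ZeroDivisionError for b == 0")
```

Performance and security properties would follow the same pattern (see practices 4 and 5).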

2. Make tests self-documenting

def test_authentication_rate_limiting():
    """
    Security requirement: after 5 failed login attempts,
    the account must be locked for 15 minutes to prevent
    brute-force attacks.
    """
    for _ in range(5):
        assert login("user", "wrong") == "failure"

    assert login("user", "correct") == "rate_limited"

    advance_clock(900)  # 15 minutes -- use a fake clock, never a real sleep
    assert login("user", "correct") == "success"

The AI reads this and knows exactly what to implement.

3. Use property-based testing

Don’t just test examples, test properties:

@given(integers(), integers())
def test_addition_commutative(a, b):
    assert add(a, b) == add(b, a)

@given(integers(), integers(), integers())  
def test_addition_associative(a, b, c):
    assert add(add(a, b), c) == add(a, add(b, c))

Let the test framework generate thousands of cases.

4. Test performance as a first-class property

def test_query_scales_logarithmically():
    for size in [100, 1000, 10000, 100000]:
        data = generate_dataset(size)
        elapsed = time_query(data)
        # Allow O(log n) growth + constant overhead
        assert elapsed < 0.001 * math.log2(size) + 0.01

5. Test security with fuzzing

def test_sql_injection_resistance():
    malicious_payloads = [
        "'; DROP TABLE users--",
        "1' OR '1'='1",
        # ... hundreds more
    ]
    for payload in malicious_payloads:
        result = query_user(payload)
        assert not database_was_modified()
        assert not sensitive_data_leaked(result)
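A fixed payload list only catches attacks you already know about; a fuzzer also generates inputs you didn’t think of. A minimal stdlib sketch, with a hypothetical `sanitize` function standing in for the real defense (in practice, parameterized queries):

```python
import random
import string

def sanitize(value: str) -> str:
    # Hypothetical defense under test: a stand-in for parameterized
    # queries that strips the characters SQL injection relies on.
    return "".join(c for c in value if c not in "'\";-")

def fuzz_inputs(n=500, seed=42):
    # Deterministic random payloads, heavy on SQL metacharacters.
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + "'\";-=() "
    return (
        "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 40)))
        for _ in range(n)
    )

for payload in fuzz_inputs():
    cleaned = sanitize(payload)
    # Property: no quote or comment tokens survive sanitization.
    assert "'" not in cleaned and "--" not in cleaned and ";" not in cleaned
```

A property-based framework like hypothesis does the same job with smarter input generation and automatic shrinking of failing cases.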

6. Demand refactors freely

“AI, rewrite this using async/await”
“AI, convert this to use a state machine”
“AI, optimize this for memory instead of speed”

Run the tests. If they pass, you’re done.

VII. The Skills That Matter Now

The human’s job is no longer writing implementations. It is:

1. Understanding the problem domain

What are the real requirements? What are the edge cases? What properties must hold?

2. Designing the contract

What is the interface? What are the types? What are the invariants?

3. Writing comprehensive tests

This is the craft. This is where experience matters.

4. Maintaining the specification

When requirements change, update the tests. The AI will update the implementation.

5. Eating the sin

This is why all the other skills matter.

The machine has no skin in the game. It cannot be held accountable when something goes catastrophically wrong.

You must eat the sin.

When you type git commit and git push, you are saying: “I accept responsibility for what this does.”

Not “the AI did it.” Not “the tests passed.” Not “the system validated it.”

You did it. You shipped it. You own it.

You judge what tests cannot capture:

  • Is this the right problem to solve?
  • Will this harm users in ways we didn’t anticipate?
  • Are there ethical implications beyond correctness?

You audit the specification:

  • What haven’t we tested?
  • What assumptions are baked into these tests?
  • What could go wrong that we didn’t anticipate?

You accept that failures are inevitable:

  • No test suite is complete
  • No specification is perfect
  • Production will find edge cases you never imagined

When it breaks, you don’t say “the AI was wrong” or “the tests were insufficient.”

You say: “I shipped it. I own the failure. I will fix it.”

TDAID does not eliminate responsibility. It focuses it.

You cannot hide behind “I was just following best practices” or “the code looked good in review.”

You specified it. You tested it. You shipped it. You own it.

The machine has no conscience. The tests have no judgment. The system has no mercy.

Only the human can eat the sin.

Embrace the Triangle of Trust:

  • The machine implements
  • The tests verify
  • The human owns

Remove any leg and the system falls.

You cannot outsource responsibility to the machine. You can only outsource the implementation.

VIII. The Future

In this future:

  • Codebases are ephemeral, test suites are permanent
  • Developers write tests, AI writes implementations
  • Code review focuses on test quality, not implementation quality
  • “Technical debt” means poorly tested code, not “messy” code
  • Refactoring is instant and fearless
  • Languages and frameworks become implementation details

The test suite becomes the codebase. Everything else is just a proof object.

IX. The Call to Action

Start today:

  1. On your next feature, write comprehensive tests first
  2. Let the AI implement it
  3. Verify only that tests pass, not that the code is “good”
  4. Refactor ruthlessly
  5. Observe how much faster you move when implementation quality doesn’t matter

Challenge yourself:

  • Can you specify your entire system as a test suite?
  • Can you operationalize every requirement?
  • Can you trust the machine if you verify comprehensively?

Teach others:

  • Tests are specifications, not validation
  • If you can’t test it, you can’t build it
  • Implementation quality is an AI problem, contract quality is a human problem

X. The Motto

“I don’t care what the code is. I only care that it does what I say it does.”

Test-Driven AI Development is not a methodology. It is a recognition that the nature of software development has fundamentally changed. The machine implements. The human specifies. The tests are the contract. Write tests. Trust nothing. Verify everything.

TDAID: Because the only code you can trust is code you’ve tested.

Write tests. Trust nothing. Verify everything. Sign your name.