BDD, ADR, PRD, WTF: Capturing Decisions for Humans and AI Alike — Michal Cichra, Safe Intelligence

Slide 1 — 0:08 (watch)

Hello, I'm Michal.

Slide 2 — 0:18 (watch)

Welcome to "Capturing Decisions for Humans and AI Alike."

Slide 3 — 0:28 (watch)

Yesterday, my team at Safe Intelligence released PACT27, a new product for testing agents. Prior to this, I worked at Microsoft and Red Hat, spending 10 years focused on a single product.

Slide 4 — 0:40 (watch)

The consistency problems we encounter with AI and the process of capturing decisions are evident in every product I have observed. These notes are a summary of that experience, and you can find me at the booth.

Slide 5 — 0:54 (watch)

BDD, PRD, and ADR are important acronyms in our field. Let's unpack their significance.

Slide 6 — 1:02 (watch)

You probably know this story.

Slide 7 — 1:16 (watch)

I hope this isn't an urban legend, but scientists placed five monkeys in a cage with bananas on a ladder. Whenever a monkey tried to reach the bananas, the monkeys got a cold shower, so the others started beating up any monkey that tried to climb. The scientists then replaced the monkeys one by one until none of the originals remained. Even so, every new monkey that tried to climb the ladder was beaten, and none of the monkeys doing the beating knew why.

Slide 8 — 1:44 (watch)

Humans and LLMs share a common trait: limited context. People forget, and LLMs compact context. Humans may leave, and LLMs lack memory. After some time working on a product, the team begins to ask questions such as, "Why do we have this flow? What is the goal of this feature? Why is this code structured this way? Where does this belong?" Often, the founding engineer may not be available to provide answers. These issues can arise in any organization, and with AI, they may surface even sooner than before.

Slide 9 — 2:10 (watch)

An Architecture Decision Record (ADR) documents the rationale behind architectural choices and outlines how those decisions are enforced. It can include examples through reference documents and code snippets.

Slide 10 — 2:32 (watch)

We split code into layers to prevent N+1 queries. We enforce this separation by linting imports in modules. Additionally, we require code that reads from the database to return plain shapes instead of ORM objects, which helps prevent both these queries and duplication. About 50 other ADRs define the rest of the product's architecture.
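A minimal sketch of what linting imports per layer could look like, using only Python's standard `ast` module. The layer names and the `FORBIDDEN_IMPORTS` table are assumptions made for illustration; the talk does not say which tool or rule set the team actually uses.

```python
import ast

# Hypothetical layering rules: layer name -> top-level packages it may not import.
FORBIDDEN_IMPORTS = {
    "views": {"orm"},
    "templates": {"orm", "db"},
}


def imported_modules(source: str) -> set[str]:
    """Return the top-level package names imported by a piece of source code."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Only absolute imports are checked; relative imports stay in-package.
            found.add(node.module.split(".")[0])
    return found


def check_layer(layer: str, source: str) -> list[str]:
    """List the imports in `source` that the given layer is not allowed to use."""
    banned = FORBIDDEN_IMPORTS.get(layer, set())
    return sorted(imported_modules(source) & banned)
```

Calling `check_layer("views", source)` on a file's contents returns the banned packages it imports, which a hook can turn into a failure message.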

Slide 11 — 3:02 (watch)

There is no single required format; an ADR is merely a concept. Because it is plain text, it cannot enforce itself, so you need a tool to enforce the rules. When a check fails, the tool names the rule, explains why it exists, and says how to fix the violation. The agent then looks up the document that gives the rationale for the rule and more detail on how to resolve it.

Slide 12 — 3:22 (watch)

You can specify which files are relevant, such as Python files or specific folders, and detail how to enforce the rules.
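One way to picture an ADR that declares its own scope is a small record holding glob patterns and the name of the check that enforces it. The `Adr` fields, the two sample entries, and the check names below are hypothetical; they only illustrate the idea of mapping files to the decisions that govern them.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Adr:
    """A hypothetical, minimal ADR record: what applies where, and how it is checked."""
    number: int
    title: str
    applies_to: tuple[str, ...]  # glob patterns for the files the decision covers
    enforced_by: str             # name of the check that enforces the decision


# Illustrative entries only; not the speaker's real ADRs.
ADRS = [
    Adr(1, "Layered modules", ("src/*.py",), "import-lint"),
    Adr(2, "No inline styles", ("templates/*.html",), "template-lint"),
]


def adrs_for(path: str) -> list[Adr]:
    """Return the ADRs whose glob patterns match the given file path."""
    return [adr for adr in ADRS if any(fnmatch(path, pat) for pat in adr.applies_to)]
```

Note that `fnmatch` lets `*` match path separators too, so `src/*.py` also covers files in nested folders.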

Slide 13 — 3:34 (watch)

A Product Requirements Document (PRD) is a lighter document used when building a feature. It describes the purpose of the feature, the problems it addresses, and the user's journey through the application.

Slide 14 — 3:56 (watch)

It can be very concise; it doesn't need to be lengthy or exhaustive like a massive document. You can simply capture the why, the problem, the goal, and the journey that connects them. This approach benefits not only the agents but also you, six weeks from now, when you might forget the reasoning behind your decisions.

Slide 15 — 4:12 (watch)

Behavior-Driven Development, or BDD, focuses on the expected behavior of software. You may have come across specification-driven development recently, and if you have practiced it, you may have run into the same challenges I have.

Slide 16 — 4:36 (watch)

To validate that the product adheres to the specifications, you need to ensure that the markdown document accurately describes how it is supposed to work. However, determining if it actually functions as intended can be challenging. Reading AI code is difficult, and reading AI tests is even harder. What if there were an intermediate layer that describes the product's behavior in human language? Behavior-driven development (BDD) is not a new concept, but it can be both executable and readable.

Slide 17 — 5:06 (watch)

Enter Cucumber. It may have been almost forgotten, but it's suddenly useful again. Cucumber is definitely easier to review than your average tests. You can connect scenarios directly to your PRDs and critical user journeys. It is readable and executable, effectively closing the loop that spec-driven development leaves open.

Slide 18 — 5:38 (watch)

These specifications are parsed into steps and executed as code. You can write, read, and review them, making them understandable. The language used is flexible and does not need to be enforced. There are various ways to write these features, but they all describe how to navigate the application, the purpose of each component, and its functionality. Additionally, they can reference all relevant documentation explaining the rationale behind these elements.
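To make "parsed into steps and executed as code" concrete, here is a toy step runner written from scratch. It is not Cucumber's implementation or API; the `step` decorator, the cart steps, and the scenario text are all invented for illustration.

```python
import re

STEPS = []  # list of (compiled pattern, function) pairs


def step(pattern: str):
    """Register a function as the implementation of steps matching `pattern`."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


@step(r"the cart is empty")
def cart_is_empty(ctx):
    ctx["cart"] = []


@step(r'I add "(.+)" to the cart')
def add_item(ctx, name):
    ctx["cart"].append(name)


@step(r"the cart contains (\d+) items?")
def cart_contains(ctx, count):
    assert len(ctx["cart"]) == int(count)


def run_scenario(text: str) -> dict:
    """Execute each Given/When/Then/And line against the registered steps."""
    ctx: dict = {}
    for line in text.strip().splitlines():
        keyword, _, body = line.strip().partition(" ")
        if keyword not in {"Given", "When", "Then", "And"}:
            continue  # skip scenario titles and blank lines
        for pattern, func in STEPS:
            match = pattern.fullmatch(body)
            if match:
                func(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step matches: {body}")
    return ctx


SCENARIO = """
Scenario: Adding an item
  Given the cart is empty
  When I add "banana" to the cart
  Then the cart contains 1 item
"""
```

Running `run_scenario(SCENARIO)` executes the three steps in order. The scenario text is what a human reviews, while the step functions stay small and reusable.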

Slide 19 — 6:08 (watch)

Creating consistent UIs with agents presents an additional challenge. A design system and pattern library are essential for building these consistent UIs. This approach was effective before AI and remains effective today.

Slide 20 — 6:48 (watch)

You document your design language by specifying details such as the characteristics of a primary button. For example, you define it as blue, with a specific shape and size. You establish rules, such as having only one primary button visible on a site or page at any given time, and then you enforce these rules. You also define components and patterns. If you have multiple colors and states for buttons, you create components and previews to demonstrate their functionality, allowing both you and the agents to visualize them. You can then review whether these elements adhere to your established principles and visuals, and reuse them accordingly. Similar to coding, you build these components from small pieces into larger ones, composing and reusing them to avoid chaos.
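A rule such as "only one primary button per page" can be checked mechanically. The sketch below uses the standard-library `html.parser`; the `btn-primary` class name is an assumption, since the talk does not say how primary buttons are marked up.

```python
from html.parser import HTMLParser


class PrimaryButtonCounter(HTMLParser):
    """Count <button> elements that carry the (assumed) `btn-primary` class."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None for valueless attributes, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "button" and "btn-primary" in classes:
            self.count += 1


def primary_button_violations(page_html: str) -> list[str]:
    """Return a violation message if the page shows more than one primary button."""
    counter = PrimaryButtonCounter()
    counter.feed(page_html)
    if counter.count > 1:
        return [f"found {counter.count} primary buttons; the design rule allows one"]
    return []
```

A check like this can run against rendered component previews, so both reviewers and agents get the same verdict.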

Slide 21 — 7:26 (watch)

These are great ideas, but how do I enforce them so my team and agents adhere to them? How do I maintain consistency? The answer lies in the loop.

Slide 22 — 7:38 (watch)

You have probably heard about closing the loop, the reinforcement loop, and the harness, which help remind agents of their rules and how to follow them.

Slide 23 — 7:46 (watch)

Our loop is straightforward.

Slide 24 — 7:56 (watch)

We use Git hooks to run the predefined tasks that a pull request must pass before it can be delivered. The loop is made up of these hook checks, skills, and continuous integration (CI).

Slide 25 — 8:26 (watch)

These tasks run in CI and are the same tasks that run as hooks, so if an agent skips or neglects them, CI catches it. We include linting, formatting, type checking, code duplication checks, architecture checks, and document linting: essentially everything we can automate. Code reviews used to get stuck on style issues like tabs versus spaces, but that is no longer acceptable. These things are not up for discussion; they are rules, enforced automatically. Review now focuses on high-level concepts. What cannot be identified cannot be enforced.
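Keeping hooks and CI in sync is easiest when both call one shared entry point. The following is a sketch under that assumption: the `TASKS` list holds placeholder commands, and the names `run_tasks` and `main` are invented here rather than taken from the talk.

```python
import subprocess
import sys

# Hypothetical task list; each entry is (name, command). Swap the placeholder
# commands for real tools (linter, formatter, type checker, and so on).
TASKS = [
    ("lint", [sys.executable, "-c", "print('lint ok')"]),
    ("types", [sys.executable, "-c", "print('types ok')"]),
]


def run_tasks(tasks=TASKS, runner=subprocess.run) -> list[str]:
    """Run every task and return the names of the ones that failed."""
    failed = []
    for name, command in tasks:
        result = runner(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(name)
    return failed


def main() -> int:
    """Entry point shared by the Git hook and the CI job."""
    failed = run_tasks()
    for name in failed:
        print(f"task failed: {name}", file=sys.stderr)
    return 1 if failed else 0
```

Both the hook script and the CI job can then end with `sys.exit(main())`, so a task skipped locally still runs, and fails, in CI.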

Slide 26 — 9:02 (watch)

We enforce the architecture of the product and the code by separating modules and restricting their imports to control what can be accessed from where. For instance, our end-to-end BDD test suite is prohibited from accessing the database. We block it from importing any module that could connect to the database, so the models iterate without touching the database and rely solely on what the application exposes in the browser.

Slide 27 — 9:32 (watch)

In the product, we enforce a rule that forbids database access while rendering templates, which rules out N+1 queries there. Instead of hunting for these issues again and again, we define ways to prevent them entirely. When an agent tries to commit and push changes, it receives feedback on the commit. If the commit is rejected, the feedback links to the relevant document so the agent can read it, fix the problem, and iterate.
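The idea of forbidding queries during rendering can be sketched with a guard around the render step. Everything here is a stand-in: `FakeDb` replaces a real database handle, `UserList` is an example of a plain shape, and the ADR path in the error message is hypothetical.

```python
from contextlib import contextmanager
from dataclasses import dataclass


class QueryDuringRender(RuntimeError):
    """Raised when something queries the database while a template renders."""


class FakeDb:
    """Stand-in for a real database handle; `rows` is illustrative data."""

    def __init__(self, rows):
        self.rows = rows
        self.locked = False

    def fetch_names(self) -> list[str]:
        if self.locked:
            # The path below is a hypothetical ADR location, not a real file.
            raise QueryDuringRender("no queries in templates; see docs/adr/0007.md")
        return list(self.rows)

    @contextmanager
    def forbid_queries(self):
        """Reject every query issued inside the `with` block."""
        self.locked = True
        try:
            yield
        finally:
            self.locked = False


@dataclass(frozen=True)
class UserList:
    """Plain shape handed to the template: data only, no lazy loading."""
    names: tuple[str, ...]


def render_page(db: FakeDb) -> str:
    users = UserList(tuple(db.fetch_names()))  # all reads happen before rendering
    with db.forbid_queries():
        return ", ".join(users.names)
```

Because the error message names the governing document, an agent that trips the guard is pointed straight at the rationale and the fix.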

Slide 28 — 10:02 (watch)

There are some considerations to address. This loop is generic; it involves doing work, pushing it, receiving feedback, and iterating. However, the focus of the loop can vary. Sometimes you might be working on a product feature, other times on a user interface, or even on back-end tasks. While the loop remains the same, the specific focus shifts depending on the context.

Slide 29 — 10:40 (watch)

We have different skills for various aspects of development. For Architecture Decision Records (ADRs), the agent looks them up to understand how to work with them and to identify the affected code. The same applies to Product Requirements Documents (PRDs). In the UI loop, we skip several checks to enable rapid iteration in a browser. The testing skill identifies which tests to run based on code coverage and file changes, so we run only the relevant part of the test suite rather than all of it. Additionally, we implement goal execution to retain the decisions made by the model for later review. All these elements provide focus within the loop, but the fundamental structure of the loop remains unchanged.
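Selecting tests from coverage and file changes can be as simple as intersecting two sets. The coverage map below is made-up data, and `select_tests` is a hypothetical helper; a real setup would build the map from a coverage tool's output.

```python
# Hypothetical coverage map: test file -> source files it executed on its last run.
COVERAGE = {
    "tests/test_cart.py": {"src/cart.py", "src/pricing.py"},
    "tests/test_login.py": {"src/auth.py"},
    "tests/test_checkout.py": {"src/cart.py", "src/payment.py"},
}


def select_tests(changed_files, coverage=COVERAGE) -> list[str]:
    """Pick tests that were changed themselves or that cover a changed file."""
    changed = set(changed_files)
    selected = {
        test for test, covered in coverage.items()
        if test in changed or covered & changed
    }
    return sorted(selected)
```

Given the changed paths from version control, the function returns only the test files worth running for that change.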

Slide 30 — 11:06 (watch)

There are drawbacks; it is very context-heavy.

Slide 31 — 11:16 (watch)

You can use up half of the context just on the initial research, but I have no fear of context compacts. This approach has been effective for the last six months.

Slide 32 — 11:32 (watch)

In my sessions, there are 20 to 50 context compacts, and that's acceptable. The important information survives, and the agent will always reference it again. The goal is to have multi-hour sessions with a clear objective, allowing the agent to operate autonomously within the defined rules.

Slide 33 — 11:44 (watch)

That is the goal.

Slide 34 — 11:52 (watch)

There are decisions that you can record.

Slide 35 — 12:02 (watch)

You can describe the reasons for the existence of certain parts of the product. Cucumber and BDD provide executable specifications that are readable, reviewable, and understandable.

Slide 36 — 12:20 (watch)

Design systems help create a consistent UI from components and enforce rules, such as prohibiting inline styles. A harness ties everything together. Thank you.
