110 slides extracted.


Slide 1 — 0:04 (watch)

Slide 1I'm Ben Eggers from OpenAI. To start, could I have a volunteer from the front row?

Slide 2 — 0:10 (watch)

Slide 2What's your name? I tend to get nervous when I speak, so if I'm talking too fast, please let me know. Thank you.

Slide 3 — 0:20 (watch)

Slide 3I'm Ben Eggers, and I work at OpenAI. I'm here to explain that nothing has changed in software development.

Slide 4 — 0:28 (watch)

Slide 4Software development remains unchanged.

Slide 5 — 0:38 (watch)

Slide 5Nothing has changed. First, let's conduct a poll. Please raise your hand if you have used a large language model (LLM) to develop software.

Slide 6 — 0:52 (watch)

Slide 6How many people here believe that we experienced a significant leap in model intelligence a few months ago? How many would trust an LLM to develop all of their software? Let's discuss.

Slide 7 — 1:04 (watch)

Slide 7I have a personal question. How many of you are using MacBook Pros that can't connect to Wi-Fi after putting them to sleep and then reopening them? I've been struggling with that issue. Great, I'm Ben.

Slide 8 — 1:14 (watch)

Slide 8This slide features a photo of me speaking at Bug Bash last year, alongside a childhood photo of myself that I shared during the event.

Slide 9 — 1:26 (watch)

Slide 9I work on infrastructure at OpenAI. Previously, I was the eighth highest seven-day trailing token user at OpenAI, with a spike large enough to trigger rate limiting. I also have several hobbies, so if you're interested in long-distance hiking, jazz, living abroad, or therapy, let's talk.

Slide 10 — 1:44 (watch)

Slide 10Before I begin this talk, I want to thank these individuals. If any of them apply to work with you, you should hire them immediately. They have significantly contributed to the ideas presented in this talk and have made me a better engineer. These are some of my engineering heroes, most of whom work at OpenAI. I wouldn't feel right giving this talk without acknowledging them and expressing my appreciation for their greatness.

Slide 11 — 2:14 (watch)

Slide 11Someone once told me that the key to giving a good talk is to outline what you will discuss, present the content, and then summarize what you covered. So, let me outline my main point: all the challenging aspects of building software remain difficult and unchanged. We will address this in two parts. First, I assert that writing code is indeed the hard part of software development, but not for the reasons typically associated with writing code.

Slide 12 — 2:26 (watch)

Slide 12In part two, we will discuss how agents facilitate the work and influence your thinking about it, but they do not eliminate the need for the work itself.

Slide 13 — 2:32 (watch)

Slide 13All the challenging aspects remain unchanged.

Slide 14 — 2:44 (watch)

Slide 14After Will's insightful keynote this morning, I want to emphasize that the process is not just about writing code and testing. I believe it is more accurately represented as 40% thinking, 10% writing code, and 50% testing. This perspective highlights the importance of each phase in the development process.

Slide 15 — 2:54 (watch)

Slide 15I would like to provide a disclaimer.

Slide 16 — 2:58 (watch)

Slide 16I am referring to deep and narrow systems, which have a limited surface area in terms of the APIs they offer but possess significant implementation complexity. This includes databases, file systems, and infrastructure.

Slide 17 — 3:14 (watch)

Slide 17We are not discussing web portals and applications, which typically have a broader feature set. In these cases, each feature does not significantly impact the architecture of the rest of the system.

Slide 18 — 3:22 (watch)

Slide 18I cannot definitively say that this does not apply to broad, high surface area systems. My experience with them is limited, so I do not feel qualified to comment extensively.

Slide 19 — 3:30 (watch)

Slide 19I believe I am somewhat qualified to discuss the systems on the left, so that will be our focus.

Slide 20 — 3:36 (watch)

Slide 20Many of you who attended Bug Bash last year might recall that I enjoy games. Last year, we played Guess the Impact. This year, the game is Guess What the Agent Gave Me.

Slide 21 — 3:48 (watch)

Slide 21I will describe a system and share the prompt I used with the agent. Then, you will guess the outcome.

Slide 22 — 3:56 (watch)

Slide 22Let's practice.

Slide 23 — 4:02 (watch)

Slide 23I aimed to create a Rust drop-in replacement for Python's RE module. Regular expressions present significant implementation complexity, but there is a clear correctness oracle in the Python RE module. It seems feasible to achieve better performance.

Slide 24 — 4:18 (watch)

Slide 24Python's RE module is primarily written in Python. This presents an interesting problem to explore in the context of autonomous software engineering.

Slide 25 — 4:26 (watch)

Slide 25I built a harness with an essentially empty repository that operated in a while loop.

Slide 26 — 4:38 (watch)

Slide 26Every four or five loops, the system would instantiate a Codex instance, instructing it to pick something up and move closer to it. Additionally, every fifth or sixth loop, an orchestrator agent could modify the agent harness. The while loop itself was straightforward; it simply called the Python file that the orchestrator agent could manipulate.

Slide 27 — 4:52 (watch)

Slide 27What do you think happened here? Any guesses? A Rust wrapper for Python would be even better than what I have. Do we have any other suggestions?

Slide 28 — 5:12 (watch)

Slide 28The codebase consists of one million lines, which is low but morally correct. It contains numerous markdown documents, and that is accurate.

Slide 29 — 5:26 (watch)

Slide 29After a few weeks of review, I noticed that it had generated manifests containing hundreds of thousands of lines of JSON. These manifests functioned as test cases, each highly specific, detailing conditions such as "this should match this" and tagged with numerous keywords and features. The result was overwhelming and poorly structured.

Slide 30 — 5:50 (watch)

Slide 30I SSH'd in and decided to experiment with autonomous software engineering instead of directly instructing the agent to modify the code. I attempted to modify the harness to avoid using JSON. As a result, it changed all the file types to PY and created a manifest assignment. This repository is available on GitHub for reference. Additionally, it generated everything within a single 24,000-line Rust crate.

Slide 31 — 6:18 (watch)

Slide 31You might think that 24,000 lines is a lot, but when you look at the file percentages, it makes sense. The process generated reports that were initially in JSON format but were converted to PY. One of these reports is 6.47 megabytes, indicating that we have large JSON files containing numerous test cases. It appears that the system tested these cases and produced reports detailing the results of each test.

Slide 32 — 6:36 (watch)

Slide 32This isn't exactly what I envisioned. I still believe this idea has potential, but I haven't had the time to pursue it further. That's our game, and we have a couple more similar projects in the pipeline.

Slide 33 — 6:44 (watch)

Slide 33Part one.

Slide 34 — 6:46 (watch)

Slide 34Writing code has always revealed the most challenging aspects of programming.

Slide 35 — 6:52 (watch)

Slide 35My claim is that the bottleneck has never been typing, which I believe is an uncontroversial statement. If you're in this room, you likely type at least 60, and probably 80 to 100 words per minute. When you calculate the number of lines of code produced per day, it's clear that typing is not the bottleneck.

Slide 36 — 7:10 (watch)

Slide 36Code is relatively inexpensive to produce, even at high typing speeds. In fact, code has historically been considered cheap to create.

Slide 37 — 7:26 (watch)

Slide 37My rule of thumb for a productive day or week was shipping a couple thousand lines of code, which translates to about 400 or 500 lines per day during six hours of deep focus time. However, the time spent typing those lines does not account for the necessary design discovery, integration, and correctness processes. I assert that these aspects remain the challenging part of coding.

Slide 38 — 7:50 (watch)

Slide 38This image resembles a sci-fi book cover featuring someone writing code. In the past, writing code required deep thought about the task at hand. However, in today's environment, where we can simply issue prompts like "make it better" or "make no mistakes," we tend to think less critically about our objectives. The process of coding used to compel us to uncover the true nature of the problem.

Slide 39 — 8:14 (watch)

Slide 39We've all experienced moments when we're struggling with a piece of code, and suddenly realize that a component should be placed differently or that our database indices are incorrect. If you're in this room, you've likely written code with LLMs and experienced that "aha" moment, where you recognize that the shape of the problem differs from your initial perception.

Slide 40 — 8:28 (watch)

Slide 40It also forced us to turn those boundaries into contracts. In an era where you can easily rewrite your entire API contract, that was not always the case. Perhaps most importantly, you notice the unusual issues before anyone else does. Will mentioned in his talk this morning that humans have always been unreliable, and AI agents are not necessarily less reliable. While that is true and AI is improving, it seems to me that there was something qualitatively significant about the fact that a human carefully considered every line of code. Now, we are in a situation where, because a human is not thoroughly analyzing each line, we have lost a critical aspect of quality assurance that was previously inherent in our software.

Slide 41 — 9:12 (watch)

Slide 41A human would assess whether something was correct, consider the shape of the problem, and review the API contracts. Filling in the code was often the least interesting part of the process.

Slide 42 — 9:26 (watch)

Slide 42When writing an algorithm, such as an inverted index, the process of coding itself is not necessarily boring. However, if you frame coding in these terms, filling in the code can be seen as the least interesting part. It's akin to setting up all the boxes and then simply coloring them in.

Slide 43 — 9:38 (watch)

Slide 43The slowness was a critical factor. I asked our new image model to provide an illustration of slowness as a load-bearing concept, and this is the result I received.

Slide 44 — 9:50 (watch)

Slide 44When using these interfaces regularly, the deficiencies become apparent. As you wire paths end-to-end, you notice the missing cases. You might realize that two cases are actually the same, or that you should consider an error case. This requires careful thought about the various scenarios, whereas a large language model is more likely to make mistakes.

Slide 45 — 10:04 (watch)

Slide 45When writing your queries, it's crucial to carefully consider how you manage your indices and the efficiency of your scans, especially for those of you interested in databases.

Slide 46 — 10:22 (watch)

Slide 46It's important to think critically about your approach when writing queries and conducting tests. Common wisdom suggests focusing on testing your interfaces rather than your implementation, considering potential edge cases. This method remains effective, but when a human is involved, you can be very intentional about your testing strategy.

Slide 47 — 11:00 (watch)

Slide 47I assert that there was once a natural equilibrium in software development, where the speed of writing code matched our capacity to reason about it. Generally, people created code they understood, and if they didn't, it was much more limited than it is today.

Slide 48 — 11:14 (watch)

Slide 48I truly believe that during the golden era of human software engineering, code production was slow enough that we thought more critically about it, resulting in better quality work. Today, we often hear the term "slop" used to describe the current state of software development. This reflects the reality that while humans produce less slop overall, it still exists. I understand the skepticism surrounding this issue, especially the claim that software engineering has become stochastic due to large language models.

Slide 49 — 11:58 (watch)

Slide 49Software engineering has always been stochastic. Anyone who has managed a project with interns or new graduates understands this. You might receive anywhere from 100 to 2,000 lines of code, and the quality can vary significantly. This isn't a critique of inexperienced software engineers; rather, it highlights that outsourcing work rarely yields the same level of quality as your original vision. You cannot expect to receive exactly what you had in mind when delegating tasks.

Slide 50 — 12:26 (watch)

Slide 50As a tech lead, you define the constraints and overall architecture of the project. You may not need to detail every aspect of the implementation, as long as the interfaces are correctly implemented, you have confidence in your tests, and the system functions as intended. While code reviews are essential, it's not necessary to understand every line of code or service, provided you have team members who do. Your focus can be on data storage, service communication, and ensuring the correctness of interfaces and types. However, this approach may vary when dealing with particularly complex systems.

Slide 51 — 13:14 (watch)

Slide 51Many of us have encountered frustrating database bugs, which reflect a stylized version of the role of a tech lead during the golden era of software engineering. The primary responsibility was to narrow the distribution of possible implementations. You begin with your intent, evaluate various implementations, and select one. To constrain the design, you concentrate on schemas, APIs, tests, and reviews rather than the code within individual modules.

Slide 52 — 13:22 (watch)

Slide 52In summary, the old coding loop compelled design thinking.

Slide 53 — 13:34 (watch)

Slide 53Slowness served as a critical rate limiter, and we've essentially always engaged in stochastic software engineering. While not every individual may have practiced this approach, software engineering at a large scale has fundamentally been a stochastic process. Now, let's play another game.

Slide 54 — 13:58 (watch)

Slide 54We have a comprehensive test suite and a legacy storage system that we find unsatisfactory due to its poor performance. We decided to build a better version using modern primitives, specifically object storage and a key-value store provided by OpenAI. We set up the comprehensive test suite and ran it against the legacy storage system, which passed. Following that, we implemented an agent hill climb approach.

Slide 55 — 14:16 (watch)

Slide 55The design of the harness is not particularly noteworthy. We had an agent hill climb to get everything functioning, and after a few weeks, it was successful. My question is, what do you think we achieved? Did it interact with the old storage system?

Slide 56 — 14:50 (watch)

Slide 56No, it was a better sandbox than that, although a very large S3 bucket would have been amusing. We actually had one key and one value. It worked fine on my machine. Did it perform better? Locally, it performed adequately, but upon deployment, we encountered unexpected issues. This is particularly amusing because this type of bug is something a human would almost never create.

Slide 57 — 15:08 (watch)

Slide 57Humans can make unpredictable mistakes, but this particular bug is difficult to envision as something a human would create. When you want to insert multiple keys and values into a key-value store, this is not the typical implementation approach one would choose.

Slide 58 — 15:16 (watch)

Slide 58I haven't had time to review the logs from the harness runs, but I suspect that a performance hack was implemented due to the small data scale. Future iterations likely overlooked this issue, assuming it was intentional and that the human wanted it that way.

Slide 59 — 15:38 (watch)

Slide 59I won't address it because I've seen this happen frequently in my experience. Moving on to part two: agents facilitate the work, but they do not eliminate it. You may have noticed that models can surpass the threshold of usefulness.

Slide 60 — 15:44 (watch)

Slide 60In comparing GPT 4.1 to GPT 4.6, it's clear that 4.1 performs significantly worse on all benchmarks.

Slide 61 — 15:54 (watch)

Slide 61There is also a qualitative difference between GPT 3.7 and GPT 5.4, which is another unfair comparison. Previous generations of models struggled with RKGI2, while current generations are suddenly able to handle it effectively.

Slide 62 — 16:12 (watch)

Slide 62We have model generations that were trained on the training set, but there is also a sense that these models have crossed a qualitative threshold of usefulness. Many of us feel that we can trust them more now, and they perform better in various tasks. This shift has led to discussions about agentic engineering and the concept of an intelligence explosion.

Slide 63 — 16:22 (watch)

Slide 63I feel that a model is often better at my job than I am, and this change occurred about three months ago.

Slide 64 — 16:42 (watch)

Slide 64I maintain that models can write better code when you provide the initial guidance. You must define what success looks like and supply a comprehensive test suite. It’s essential to outline the structure of the solution; simply instructing the model to "build this thing" or "make it better" is insufficient. Additionally, you need to establish in advance how you will evaluate whether the change was effective; guessing is not an option.

Slide 65 — 16:54 (watch)

Slide 65These are the same principles that humans need to build good software, which supports my claim that software engineering has not changed. I consider these aspects when mentoring an intern, writing notes for myself, or when I sit down to do some work.

Slide 66 — 17:10 (watch)

Slide 66I need to consider what the desired outcome looks like and how to achieve it effectively. Therefore, it's essential to make decisions first.

Slide 67 — 17:14 (watch)

Slide 67You define the desired behavior changes, identify what must continue functioning, and specify the trade-offs you are willing to accept.

Slide 68 — 17:26 (watch)

Slide 68For example, avoid placing all keys and values in a singleton within the key-value store. Instead, utilize the key-value store effectively. There is potential for a separate, interesting discussion on this topic. Prompts can be likened to mathematical expressions. Ultimately, you must be extremely specific and detailed in your requests, particularly in distinguishing between terms like "for any," "for all," and "choose one such that."

Slide 69 — 17:50 (watch)

Slide 69These details are significant for implementation. As we increasingly interact with code and computers through prose, prompts are likely to become more mathematical across the industry. While I believe this topic warrants a separate discussion, it is not the focus of this presentation.

Slide 70 — 18:02 (watch)

Slide 70Design is possibly the most important slide in this entire presentation, as it reflects what I have learned about developing with agents.

Slide 71 — 18:10 (watch)

Slide 71I write all of my schemas by hand. For very simple schemas, I might use an agent, but I have spent a week struggling with a 300-line Prisma schema because it needs to be precise. Therefore, I do not let the agent handle it; I write my APIs and interfaces by hand for the same reasons.

Slide 72 — 18:20 (watch)

Slide 72I have coworkers who write all of their tests by hand, which I find excessive. I believe that's the most challenging aspect of software engineering.

Slide 73 — 18:32 (watch)

Slide 73However, I know people who strongly advocate for it.

Slide 74 — 18:36 (watch)

Slide 74There is an interesting perspective that unit testing may be becoming obsolete. We've observed instances where AI-generated unit tests assert unusual values and provide little meaningful information.

Slide 75 — 18:46 (watch)

Slide 75If you are writing unit tests yourself, they can be useful because you, as the human, ensure their correctness. However, simply instructing AI to add more tests is generally not effective. This topic deserves its own discussion, but it is not the focus of this talk.

Slide 76 — 19:02 (watch)

Slide 76Correctness requires providing clear guidelines. You should specify the system diagram you want, without concern for the color of the boxes or the programming language used. The internal abstractions and implementation details are generally less important, as long as they are correct.

Slide 77 — 19:16 (watch)

Slide 77If you have a provable correctness harness, the aesthetics of the code become irrelevant. You specify which data models are immutable and provide a comprehensive test harness.

Slide 78 — 19:28 (watch)

Slide 78Always implement tests in a different context. This is an important strategy. When you ask an AI to write code and then request unit tests for that code, the AI retains the context of the code it just generated. The most effective workflow I have found is to separate these tasks. First, write your interfaces and a set of tests, then instruct the AI to focus on making the code work without modifying the tests.

Slide 79 — 19:46 (watch)

Slide 79This is the only effective approach to AI-driven, test-driven development, particularly concerning unit tests.

Slide 80 — 19:50 (watch)

Slide 80In summary, here is a checklist for managing your AI agents.

Slide 81 — 20:02 (watch)

Slide 81Clearly define the desired behavior change in prose and mathematics. Implement your schemas and interfaces manually or with an agent, which are typically not large relative to the overall system. Create tests for the change, and then allow an agent to fill in the gaps. This process is similar to managing your less motivated self on a Monday at 8 AM.

Slide 82 — 20:20 (watch)

Slide 82That brings us to our final game, Contrafact. I developed a practice journal to track jazz practice.

Slide 83 — 20:28 (watch)

Slide 83The details of the app are not highly specific, but it is important to note that it is not a complex system. Instead, it is a relatively shallow system with many screens.

Slide 84 — 20:44 (watch)

Slide 84I developed my storage models, feature views, and domain model patterns from scratch. I established the core codebase and patterns to ensure they were complete and comprehensive. Then, I would identify a feature I wanted, implement it, and verify that it worked. After each successful implementation, I would repeat the process for the next feature. Eventually, I found myself with 30,000 lines of code and wondered what all that code was doing. Fortunately, there was no markdown involved this time. Did you say time management?

Slide 85 — 21:40 (watch)

Slide 85That's an interesting guess. There was some problematic clock functionality, but it wasn't a significant part. Duplicated date formatting is the closest guess, so we'll give you that. I requested a practice session with etudes and tunes, which relates to the domain model layer. This layer is not the storage layer; it consists of in-memory abstractions for how we manage and manipulate data. However, what I received was a poorly designed UUID soft look-up layer. In this setup, every entity had a UUID, and each one stored multiple UUIDs. The application performed UUID look-ups in the repository layer, and if it retrieved the wrong type, the entire application would crash.

Slide 86 — 21:56 (watch)

Slide 86During a late-night coding session, I discovered that a change had inadvertently added UUIDs to my data models. While they were correctly linked at the data model layer, they also stored these UUIDs. If those UUIDs became out of sync, the consequences were unclear.

Slide 87 — 22:12 (watch)

Slide 87I completely unraveled the situation because it was problematic. It's crucial to pay attention to your data models.

Slide 88 — 22:20 (watch)

Slide 88I outlined what I intended to convey, and now I've shared that information with you.

Slide 89 — 22:24 (watch)

Slide 89Writing code used to require significant design thinking that often occurred implicitly.

Slide 90 — 22:32 (watch)

Slide 90Agentic coding requires us to be more explicit in our approach. It does not reduce the need for critical thinking or effort. Now, everyone must take on the responsibilities of a tech lead, which means you are now functioning as a tech lead, while the overall process remains unchanged.

Slide 91 — 22:44 (watch)

Slide 91Software engineering is fundamentally unchanged. We prioritize correctness and how we demonstrate that correctness is a fundamentally stochastic process, though it has become faster in some respects.

Slide 92 — 22:52 (watch)

Slide 92Code has become inexpensive, but ensuring its correctness remains a challenge.

Slide 93 — 22:54 (watch)

Slide 93Thank you.

Slide 94 — 23:24 (watch)

Slide 94I ran a bit over time, but I believe I still have a few minutes for any thoughts, feelings, or questions.

Slide 95 — 23:38 (watch)

Slide 95I have not observed any programmatic enforcement regarding restrictions on modifying schemas or other elements, and I have also seen this approach fail at the prompt level.

Slide 96 — 23:46 (watch)

Slide 96I would like to see a product, open-source library, or VS Code extension that allows you to specify which files the model can or cannot access. This type of guardrail is likely to emerge soon.

Slide 97 — 24:04 (watch)

Slide 97As an inexperienced engineer, you might wonder how to transition to a tech lead role, where the responsibilities differ significantly from those of junior engineers. This is an important question.

Slide 98 — 24:16 (watch)

Slide 98That is one of the key questions in the industry right now.

Slide 99 — 24:24 (watch)

Slide 99Tech lead work is qualitatively different from implementation work, which involves writing code and developing a service once a good API contract is established. I believe you can begin practicing tech lead responsibilities directly.

Slide 100 — 24:34 (watch)

Slide 100You can begin by defining what you want to build.

Slide 101 — 24:40 (watch)

Slide 101Consider the structure of your database and identify the core entities that are important to you.

Slide 102 — 24:46 (watch)

Slide 102Understanding the entire system can be a challenging process initially, as it requires a comprehensive grasp of each component. However, I believe this skill can be developed without the need to spend years meticulously building each part on others' projects before gaining the ability to do it independently.

Slide 103 — 25:08 (watch)

Slide 103People can learn to do this directly. Thank you. I have a quick question. Even if a pattern is abstracted, do you quantify how familiar it is to a model before asking it a question?

Slide 104 — 25:16 (watch)

Slide 104For a novel concept, you would develop more architecture around it, whereas for something that is relatively standard, you would do less.

Slide 105 — 25:20 (watch)

Slide 105That's a good question.

Slide 106 — 25:24 (watch)

Slide 106There are many aspects of the software world that I am not familiar with.

Slide 107 — 25:30 (watch)

Slide 107Many patterns and practices exist that I am unaware of in places I have never visited.

Slide 108 — 25:44 (watch)

Slide 108In my experience, the models I have encountered tend to produce core patterns that are about 80% reasonable and effective. However, there is often around 20% of their output that may not make much sense.

Slide 109 — 25:52 (watch)

Slide 109When I build a website or engage in tasks where I lack extensive experience, I often encounter a "gel man amnesia" effect. The models provide suggestions that seem reasonable at first, but as I explore further, I realize that certain patterns significantly hinder my development and complicate my work.

Slide 110 — 26:10 (watch)

Slide 110I make a concerted effort to establish all the core patterns myself and understand their underlying reasons, even when a model is generating the content. Thank you.