110 slides extracted.
Slide 1 — 0:04 (watch)
![]() | I'm Ben Eggers, and I work at OpenAI. First, can I have a volunteer from the front row? |
Slide 2 — 0:10 (watch)
![]() | What's your name? I get nervous when I speak, so if I'm talking too fast, please guide me. Thank you. |
Slide 3 — 0:20 (watch)
![]() | I'm Ben Eggers, and I work at OpenAI. Today, I will explain how nothing has changed about software development. |
Slide 4 — 0:28 (watch)
![]() | It's the same as it has always been. |
Slide 5 — 0:38 (watch)
![]() | Nothing has changed. First, let's take a poll. Please raise your hand if you have used an LLM to develop software. |
Slide 6 — 0:52 (watch)
![]() | How many people here feel that, a few months ago, we experienced a qualitative leap in model intelligence? How many would trust an LLM to develop all of their software? Let's discuss. |
Slide 7 — 1:04 (watch)
Slide 8 — 1:14 (watch)
![]() | This is a picture of me speaking at Bug Bash last year. Behind me is a childhood photo that I used in that presentation. |
Slide 9 — 1:26 (watch)
Slide 10 — 1:44 (watch)
Slide 11 — 2:14 (watch)
Slide 12 — 2:26 (watch)
![]() | In part two, we will discuss how agents shift the work and influence your thinking about it, but they do not eliminate the need for that work. |
Slide 13 — 2:32 (watch)
![]() | All the challenging aspects of software engineering remain unchanged. |
Slide 14 — 2:44 (watch)
Slide 15 — 2:54 (watch)
![]() | I also want to provide a disclaimer. |
Slide 16 — 2:58 (watch)
![]() | I am referring to deep and narrow systems, which have a reasonably limited API surface area but significant implementation complexity. This includes databases, file systems, and infrastructure. |
Slide 17 — 3:14 (watch)
Slide 18 — 3:22 (watch)
![]() | I wouldn't say this is untrue for broad, high surface area systems. However, I have much less experience working on them, so I don't feel qualified to comment extensively. |
Slide 19 — 3:30 (watch)
![]() | I believe I am somewhat qualified to discuss the type of system on the left, so that will be our focus today. |
Slide 20 — 3:36 (watch)
![]() | Many of you who attended Bug Bash last year may recall that I really enjoy games. Last year, the game was Guess the Impact. This year, the game is Guess What the Agent Gave Me. |
Slide 21 — 3:48 (watch)
![]() | I will describe a system and share the prompt I used for the agent. Then, you will guess the outcome. |
Slide 22 — 3:56 (watch)
![]() | Let's practice. |
Slide 23 — 4:02 (watch)
Slide 24 — 4:18 (watch)
![]() | Python's `re` module is primarily written in Python. This presents an interesting problem to explore in the context of autonomous software engineering. |
Slide 25 — 4:26 (watch)
![]() | I built a harness that started from an essentially empty repository and ran in a while loop. |
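A minimal sketch of what a while-loop harness of this kind might look like. The talk does not show the actual harness, so `run_harness`, `run_agent`, `is_done`, and `max_iterations` are hypothetical names, and the agent here is a stub standing in for a real model call.

```python
from typing import Callable


def run_harness(
    run_agent: Callable[[str, int], str],
    is_done: Callable[[str], bool],
    prompt: str,
    max_iterations: int = 10,
) -> list[str]:
    """Call the agent repeatedly until is_done accepts its output or the budget runs out."""
    outputs: list[str] = []
    iteration = 0
    while iteration < max_iterations:
        output = run_agent(prompt, iteration)
        outputs.append(output)
        if is_done(output):
            break
        iteration += 1
    return outputs


# Stub agent for illustration only: it "finishes" on its third call.
def fake_agent(prompt: str, iteration: int) -> str:
    return "DONE" if iteration == 2 else f"working on: {prompt}"


history = run_harness(fake_agent, lambda out: out == "DONE", "reimplement the regex module")
```

The iteration cap is a design choice in this sketch: without some budget or stopping condition, a loop like this runs until someone interrupts it.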
Slide 26 — 4:38 (watch)
Slide 27 — 4:52 (watch)
![]() | So my question is, what do you think happened here? A Rust wrapper for Python would be even better than what I have. Do we have any other guesses? |
Slide 28 — 5:12 (watch)
![]() | One million lines. While that number is low, it is morally correct. There are many markdown documents, and that is correct as well. |
Slide 29 — 5:26 (watch)
Slide 30 — 5:50 (watch)
Slide 31 — 6:18 (watch)
Slide 32 — 6:36 (watch)
Slide 33 — 6:44 (watch)
![]() | Part one. |
Slide 34 — 6:46 (watch)
![]() | Writing code has always been where the most challenging aspects arise. |
Slide 35 — 6:52 (watch)
Slide 36 — 7:10 (watch)
![]() | Code is relatively inexpensive, even at typing speed. I assert that code has always been relatively cheap. |
Slide 37 — 7:26 (watch)
Slide 38 — 7:50 (watch)
Slide 39 — 8:14 (watch)
Slide 40 — 8:28 (watch)
Slide 41 — 9:12 (watch)
Slide 42 — 9:26 (watch)
Slide 43 — 9:38 (watch)
![]() | The slowness was indeed load-bearing. I asked our new image model to provide an illustration of slowness being load-bearing, and this is the result I received. |
Slide 44 — 9:50 (watch)
Slide 45 — 10:04 (watch)
![]() | When writing your queries, it's crucial to consider how you structure your indices and how your scans operate, especially for those interested in databases. |
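To make the point about indices and scans concrete, here is a small sketch using Python's built-in `sqlite3` module. The `users` table, its columns, and the index name `idx_users_email` are invented for illustration and do not come from the talk. The exact wording of the query plan varies between SQLite versions, so the sketch only inspects it loosely.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

sql = "SELECT * FROM users WHERE email = ?"
params = ("user42@example.com",)

# Without an index on email, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, params)

# With an index on email, the same query can use an index search instead.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, sql, params)

print(plan_before)
print(plan_after)
```

The query text never changes between the two plans. Only the schema decision does, which is the kind of choice that is easy to skip when the code is generated for you.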
Slide 46 — 10:22 (watch)
Slide 47 — 11:00 (watch)
Slide 48 — 11:14 (watch)
Slide 49 — 11:58 (watch)
Slide 50 — 12:26 (watch)
Slide 51 — 13:14 (watch)
Slide 52 — 13:22 (watch)
![]() | In summary, the old coding loop enforced design thinking. |
Slide 53 — 13:34 (watch)
Slide 54 — 13:58 (watch)
Slide 55 — 14:16 (watch)
Slide 56 — 14:50 (watch)
Slide 57 — 15:08 (watch)
Slide 58 — 15:16 (watch)
Slide 59 — 15:38 (watch)
Slide 60 — 15:44 (watch)
![]() | In comparing GPT-4.1 to GPT-4.6, GPT-4.1 performs significantly worse across all benchmarks. |
Slide 61 — 15:54 (watch)
Slide 62 — 16:12 (watch)
Slide 63 — 16:22 (watch)
![]() | I feel that a model is often better at my job than I am, and this change occurred about three months ago. |
Slide 64 — 16:42 (watch)
Slide 65 — 16:54 (watch)
Slide 66 — 17:10 (watch)
![]() | I need to consider what the desired outcome looks like and how to achieve it effectively. The first step is to make decisions with intent. |
Slide 67 — 17:14 (watch)
![]() | You define the desired behavior changes, specify what must remain functional, and identify the trade-offs you are willing to accept. |
Slide 68 — 17:26 (watch)
Slide 69 — 17:50 (watch)
Slide 70 — 18:02 (watch)
![]() | Design is perhaps the most important slide in this entire deck, as it reflects what I have learned about developing with agents. |
Slide 71 — 18:10 (watch)
Slide 72 — 18:20 (watch)
![]() | I also have coworkers who write all of their tests by hand, which I find surprising. That aspect of software engineering can be quite challenging. |
Slide 73 — 18:32 (watch)
![]() | However, I know people who swear by writing tests by hand. |
Slide 74 — 18:36 (watch)
![]() | There is an interesting argument to be made that unit testing is becoming obsolete. We've observed AI-generated unit tests that assert unusual values and provide little meaningful information. |
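As an illustration of the difference, the sketch below contrasts a test that pins an arbitrary value with one that checks properties the function is supposed to have. The function `normalize_whitespace` and both tests are hypothetical examples, not code from the talk, and the property test uses only the standard library rather than a dedicated property-testing tool.

```python
import random


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


def test_low_information() -> None:
    # Asserts a number that happens to be true for one input.
    # It says nothing about what the function is for.
    assert len(normalize_whitespace("x \t y\n z")) == 5


def test_properties(trials: int = 200, seed: int = 0) -> None:
    # Checks statements that should hold for every input.
    rng = random.Random(seed)
    alphabet = "ab \t\n"
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
        out = normalize_whitespace(text)
        assert "  " not in out                      # no doubled spaces
        assert out == out.strip()                   # no leading or trailing space
        assert normalize_whitespace(out) == out     # idempotent
        assert out.replace(" ", "") == "".join(text.split())  # content preserved


test_low_information()
test_properties()
```

Both tests pass, but only the second would catch most plausible regressions, because it encodes what the function is meant to do.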
Slide 75 — 18:46 (watch)
Slide 76 — 19:02 (watch)
Slide 77 — 19:16 (watch)
![]() | If you have a provable correctness harness, the aesthetics of the code become less important. You specify which data models are immutable and provide a comprehensive test harness. |
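A rough sketch of this idea, under stated assumptions: the data model is an immutable (frozen) dataclass, and the harness compares a candidate implementation against a trusted reference on many random inputs. Here the reference is Python's `re.fullmatch` and the candidate is a tiny stand-in matcher that supports only literals, `.`, and `*`. The names `MatchCase`, `candidate_fullmatch`, `random_case`, and `run_differential` are hypothetical. This is randomized differential testing, which is weaker than a proof of correctness.

```python
import random
import re
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class MatchCase:
    """Immutable test case: neither field can be reassigned after creation."""
    pattern: str
    text: str


def candidate_fullmatch(pattern: str, text: str) -> bool:
    """Stand-in implementation under test: supports literals, '.', and '*'."""
    if not pattern:
        return not text
    first = bool(text) and pattern[0] in (text[0], ".")
    if len(pattern) >= 2 and pattern[1] == "*":
        return candidate_fullmatch(pattern[2:], text) or (
            first and candidate_fullmatch(pattern, text[1:])
        )
    return first and candidate_fullmatch(pattern[1:], text[1:])


def random_case(rng: random.Random) -> MatchCase:
    """Build a small, always-valid pattern plus a short text over 'a' and 'b'."""
    atoms = []
    for _ in range(rng.randint(0, 4)):
        atom = rng.choice("ab.")
        if rng.random() < 0.4:
            atom += "*"
        atoms.append(atom)
    text = "".join(rng.choice("ab") for _ in range(rng.randint(0, 6)))
    return MatchCase("".join(atoms), text)


def run_differential(trials: int = 500, seed: int = 0) -> list[MatchCase]:
    """Return every case where the candidate disagrees with the reference."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        case = random_case(rng)
        expected = re.fullmatch(case.pattern, case.text) is not None
        if candidate_fullmatch(case.pattern, case.text) != expected:
            failures.append(case)
    return failures


failures = run_differential()

# The frozen dataclass rejects mutation at runtime.
mutation_blocked = False
try:
    MatchCase("a*b", "aab").pattern = "b"
except FrozenInstanceError:
    mutation_blocked = True
```

With a harness like this in place, the candidate's internals can be rewritten freely, as long as `failures` stays empty.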
Slide 78 — 19:28 (watch)
Slide 79 — 19:46 (watch)
Slide 80 — 19:50 (watch)
![]() | In summary, here is a checklist for managing your army of AI agents. |
Slide 81 — 20:02 (watch)
Slide 82 — 20:20 (watch)
![]() | That brings us to our final game, Contrafact. I built a practice journal for tracking jazz practice. |
Slide 83 — 20:28 (watch)
![]() | The details of the app are not particularly important, but the key point is that it is an application rather than a deep system. This is a relatively shallow system with many screens. |
Slide 84 — 20:44 (watch)
Slide 85 — 21:40 (watch)
Slide 86 — 21:56 (watch)
Slide 87 — 22:12 (watch)
![]() | I completely unraveled the situation because it was severely flawed. It's crucial to pay attention to your data models. |
Slide 88 — 22:20 (watch)
![]() | I outlined my main points and then elaborated on them. |
Slide 89 — 22:24 (watch)
![]() | Writing code used to require significant design thinking that occurred implicitly. |
Slide 90 — 22:32 (watch)
Slide 91 — 22:44 (watch)
Slide 92 — 22:52 (watch)
![]() | Code has become inexpensive, but ensuring correctness remains a challenge. |
Slide 93 — 22:54 (watch)
![]() | Thank you. |
Slide 94 — 23:24 (watch)
![]() | I ran a bit over time, but I believe I still have a few minutes for any thoughts, feelings, or questions. |
Slide 95 — 23:38 (watch)
![]() | I have not seen restrictions on modifying schemas or other components enforced programmatically, and I have seen prompt-level restrictions fail. |
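For illustration only, here is one way a programmatic check of this kind could be sketched. It is not something the speaker describes using. The idea is to fingerprint the protected files before the agent runs and reject the change if any fingerprint differs afterwards. The function names, the `schema.sql` file, and its contents are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(paths) -> dict:
    """Map each protected file path to the SHA-256 of its contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_files(baseline: dict, paths) -> list:
    """Return the protected paths whose contents differ from the baseline."""
    current = fingerprint(paths)
    return sorted(p for p in baseline if current.get(p) != baseline[p])


with tempfile.TemporaryDirectory() as tmp:
    schema = Path(tmp) / "schema.sql"  # hypothetical protected file
    schema.write_text("CREATE TABLE entries (id INTEGER PRIMARY KEY);\n")
    baseline = fingerprint([schema])

    # Nothing has touched the schema yet, so no violations are reported.
    untouched = changed_files(baseline, [schema])

    # Simulate an agent editing the protected schema.
    schema.write_text("CREATE TABLE entries (id TEXT);\n")
    violations = changed_files(baseline, [schema])
```

A check like this could run as a gate after each agent iteration, so that a violation fails the run instead of relying on the prompt to be obeyed.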
Slide 96 — 23:46 (watch)
Slide 97 — 24:04 (watch)
![]() | As an inexperienced engineer, how can I progress to a tech lead role where I handle significantly fewer tasks typically assigned to junior engineers? That's a good question. |
Slide 98 — 24:16 (watch)
![]() | That is one of the key questions in the industry right now. |
Slide 99 — 24:24 (watch)
Slide 100 — 24:34 (watch)
![]() | You can start by defining what you want to build. |
Slide 101 — 24:40 (watch)
![]() | What does my database look like, and what are the core entities I care about? |
Slide 102 — 24:46 (watch)
Slide 103 — 25:08 (watch)
![]() | I believe people can learn how to do this directly. I have a quick question: even if a pattern is abstracted, do you quantify how familiar it is to a model before asking it a question? |
Slide 104 — 25:16 (watch)
![]() | For example, when dealing with something novel, you focus more on the architecture surrounding it, whereas with something relatively standard, you do less. |
Slide 105 — 25:20 (watch)
![]() | That's a good question. |
Slide 106 — 25:24 (watch)
![]() | There are many things I don't know about in the software world. |
Slide 107 — 25:30 (watch)
![]() | Many patterns and practices exist in areas I have not explored. |
Slide 108 — 25:44 (watch)
![]() | In my experience, when I ask today's models to outline core patterns based on what I know, about 80% of their responses are reasonable and sound, while roughly 20% may not make much sense. |
Slide 109 — 25:52 (watch)
Slide 110 — 26:10 (watch)
![]() | I generally make a strong effort to establish all of the core patterns myself and understand their rationale, even if a model is generating the content. Thank you. |