YouTube on Random Domain

WWDC26: Optimize custom machine learning operations with Metal tensors | Apple

Mon, 08 Jun 2026 00:00:00 +0000

64 slides extracted.

Slide 1 — 0:04 (watch)

Hello, my name is Xiao, and I am a GPU software engineer.

Slide 2 — 0:16 (watch)

Today, I will guide you through an exploration of Metal tensors and demonstrate how to write optimized custom ML kernels using tensor operations.

Slide 3 — 0:34 (watch)

Apple platforms offer robust support for running machine learning models across the software stack. High-level frameworks such as Core AI and MLX simplify model deployment with minimal coding. In contrast, lower-level APIs like Metal Performance Shaders grant access to high-performance Metal kernels. All these layers leverage the low-level acceleration provided by Metal performance primitives and the tensor ops library.

Slide 4 — 0:52 (watch)

There are several reasons to work at the Metal level.

Slide 5 — 1:00 (watch)

ML research progresses rapidly, so implementing custom operations that integrate with high-level frameworks like Core AI can be beneficial. Additionally, writing Metal kernels may be necessary if you are contributing to an ML framework such as MLX or llama.cpp.

Slide 6 — 1:14 (watch)

If you are working on a Metal-based application, the easiest way to get started is by using the tensor ops library.

Slide 7 — 1:22 (watch)

Tensor ops is a Metal shading language API that accelerates tensor operations on the GPU, such as matrix multiplication and convolution.

Slide 8 — 1:38 (watch)

It automatically utilizes any available hardware acceleration across all Apple Silicon GPU generations, eliminating concerns about differences between hardware generations. Specifically, it fully leverages the Neural Accelerator in the M5 chip family.

Slide 9 — 1:56 (watch)

The Neural Accelerator is a new hardware block in the M5 chip, located directly within each shader core. It operates alongside the other GPU pipelines and is specifically designed to accelerate dense compute-bound tasks, such as the pre-fill stage of a large language model (LLM).

Slide 10 — 2:06 (watch)

You can refer to the related sessions to learn the fundamentals of tensor operations.

Slide 11 — 2:12 (watch)

In this session, I will build on those basics by discussing best practices for working with quantizer data.

Slide 12 — 2:22 (watch)

I will demonstrate how to build advanced customer operations, including flash attention.

Slide 13 — 2:26 (watch)

Let's dive into the first topic: working with quantizer data.

Slide 14 — 2:32 (watch)

State-of-the-art machine learning models are increasingly large.

Slide 15 — 2:42 (watch)

The inference stage is typically limited by memory bandwidth, making it necessary to compress the weights. This compression helps fit models into memory more efficiently and reduces memory bandwidth usage.

Slide 16 — 2:52 (watch)

The standard approach for compressing weights is quantization. This involves taking higher precision weights and reducing them to lower precision data types.

Slide 17 — 3:08 (watch)

For example, 16-bit half-precision weights can be compressed to just 4 bits. These quantized weights are accompanied by scale factors, allowing us to scale the quantized values back to the original range during computation. In addition to 16- and 32-bit floating-point types, tensor operations now natively support quantized data types.

Slide 18 — 3:34 (watch)

We added support for 4- and 8-bit integer types in the update to macOS and iOS 26, and we are extending support to additional data types in macOS and iOS 27. This includes 4- and 8-bit floating-point types, as well as 2-bit integer types.

Slide 19 — 3:48 (watch)

You can create and pass your IAFS quantizer tensors to tensor operations, which will automatically utilize any available hardware acceleration.

Slide 20 — 4:04 (watch)

Creating a tensor with a quantized data type is similar to creating a regular tensor. You fill in the descriptor's properties as you would for any other tensor, but specify a quantized data type. Then, create the tensor by calling newTensorWithDescriptor on your Metal device.

Slide 21 — 4:18 (watch)

You can store your quantized element data using this method. Next, we will discuss scale factors.

Slide 22 — 4:36 (watch)

In macOS and iOS 27, a single MTL tensor object can represent scales alongside the tensor's quantized data as an additional scales plan. This plan supports the FP8 EAM0 block-wide scale factor format. Each element of the scales plan corresponds to a block for each element in the data plan. Declaring the scales plan is similar to declaring a tensor.

Slide 23 — 5:04 (watch)

First, create a descriptor object for the scales plan. Next, fill in the data type and block factors. Then, create an auxiliary plan map to specify that this plan is for scales. Attach the auxiliary plan map to your original tensor descriptor. The quantized data, scales, and metadata will all be packed into a single tensor object. Now, let's put this into practice by extending a basic matrix multiplication kernel to support quantization.

Slide 24 — 5:26 (watch)

Matrix multiplication is a fundamental operation in machine learning workloads.

Slide 25 — 5:34 (watch)

LMS performs millions of matrix multiplications during inference.

Slide 26 — 5:48 (watch)

We discussed the fundamentals of writing a high-performance matrix multiplication kernel using TensorOps in the M5 machine learning talk. The approach involves slicing the input matrices into smaller tiles and performing tile-wise matrix multiplications with TensorOps. This method maximizes parallelism and ensures that data remains in the cache.

Slide 27 — 6:02 (watch)

We can use quantization to reduce memory traffic and accommodate larger models in memory.

Slide 28 — 6:24 (watch)

In the kernel, it is beneficial to define type aliases before binding the tensors. We declare a scale factor plan using the FP8 EAM0 data type and a block size of 32 by 1. This configuration means that every 32 elements in the data plan share a single element in the scales plan. Next, we declare a full tensor type, specifying the FP8 data type along with the scales plan. These tensors can then be easily bound to buffer binding points, allowing the kernel to access the tensors allocated on the host side.

Slide 29 — 6:46 (watch)

Alternatively, if you prefer not to create a full mtel tensor on the host, you can create a temporary tensor directly on the shader stack.

Slide 30 — 6:54 (watch)

The syntax is nearly identical; simply replace the tag tensor handle with tensor inline.

Slide 31 — 7:02 (watch)

Pass your buffer pointers and other metadata to the tensor constructor to create a tensor on the stack.

Slide 32 — 7:12 (watch)

We will divide the problem among multiple thread groups to enhance parallelism. First, we will slice out the tile for each thread group, and then we will perform the multiplication using tensor operations.

Slide 33 — 7:26 (watch)

To achieve this, call the slice function on your input and output tensors using the thread group ID. Both the data and the scales plan will be sliced simultaneously based on the block size.

Slide 34 — 7:46 (watch)

Setting up matrix multiplication with a quantizer tensor is the same as with normal tensors. First, configure the matmul2d descriptor by specifying the tile sizes and other parameters. Next, create a matmul2d operation, indicating the number of command groups in the thread group. Finally, pass in your quantizer tensor, and the tensor operations will manage the dequantization automatically.

Slide 35 — 8:06 (watch)

In most cases, you should input your quantizer data directly into tensor operations to automatically leverage any available hardware acceleration.

Slide 36 — 8:12 (watch)

If you need to dequantize a custom format, tensor operations can still accommodate this requirement.

Slide 37 — 8:28 (watch)

The simplest approach involves each thread loading a chunk of quantizer data from device memory and dequantizing it to F16 values in the thread group memory. This data can then be passed as an inline thread group tensor to tensor operations. However, this method necessitates additional load and store operations through thread group memory. Ideally, we should keep all this data in thread registers instead.

Slide 38 — 8:44 (watch)

You can achieve this by dequantizing the data into a cooperative tensor, which can then be used as the input for the matmul2d operation.

Slide 39 — 8:54 (watch)

Cooperative tensors distribute their storage across the thread-private memory of the threads participating in the matmul operation. If you cannot use the quantizer tensor directly, you can still avoid the round trip through thread group memory.

Slide 40 — 9:10 (watch)

To recap, Metal tensors natively support a wide range of quantizer data types, including the new MX scaling formats and the EM0 scale factors introduced in iOS and macOS 27.

Slide 41 — 9:22 (watch)

These new data types have additional alignment requirements compared to larger data types. Be sure to check the Metal documentation for details.

Slide 42 — 9:34 (watch)

Now, let's advance to creating a more complex custom operation using tensor operations.

Slide 43 — 9:54 (watch)

Attention is central to every transformer network, including language models. To compute attention, you first multiply two matrices, Q and K. Then, you calculate the softmax by applying reductions on the rows of the resulting intermediate matrix. Finally, you multiply by a third matrix. The widely used flash attention algorithm combines all these operations into a single kernel.

Slide 44 — 10:18 (watch)

To implement this with tensor operations, first set up a custom CMD group mapping, ensuring that each CMD group owns complete rows of the intermediate matrix. This configuration allows you to compute the softmax without the need to exchange data between CMD groups.

Slide 45 — 10:34 (watch)

You can achieve this by utilizing the execution CMD group operation scope. Each CMD group will perform independent matrix multiplications in parallel, and you can use the CMD group ID to slice your input tiles accordingly.

Slide 46 — 10:50 (watch)

We will use a cooperative tensor to store the intermediate matrix, allowing us to use it as input for the next step without writing it to memory. We will compute the softmax on the result, which requires performing a couple of reductions on the cooperative tensor.

Slide 47 — 11:02 (watch)

Tensor operations include a reduce rows function to facilitate this process.

Slide 48 — 11:12 (watch)

Threads will exchange data to calculate the maximum for each row. The result will be returned in another cooperative tensor. Let's set it up.

Slide 49 — 11:26 (watch)

First, create a cooperative tensor to store the reduction output. Next, pass the source and destination to the reduce rows function. We will use the max reduction operation with an initial value of negative infinity.

Slide 50 — 11:46 (watch)

The two cooperative tensors have different shapes, so to facilitate mapping between them, tensor operations include a map iterator function. This function takes an iterator pointing to an element in the 2D tensor and returns an iterator pointing to the corresponding element in the reduction destination.

Slide 51 — 12:06 (watch)

First, set up a loop over the 2D cooperative tensor using iterators. Then, call the map iterator function to map each element to its corresponding row maximum. Finally, dereference these iterators to compute the software maximum and store the result back into the cooperative tensor.

Slide 52 — 12:20 (watch)

Now we are ready to multiply the cooperative tensor by V.

Slide 53 — 12:28 (watch)

In macOS 26, you must first store the cooperative tensor in thread group memory. However, it is now possible to use cooperative tensors directly as inputs for manual operations.

Slide 54 — 12:50 (watch)

To use cooperative tensors as inputs to manual operations, call the getLeftInputCooperativeTensor method and pass the source cooperative tensor as an argument. You can then use the result as an input for the second manual operation. However, not every cooperative tensor can be reused as an input due to potential differences in layouts based on data types and other factors. Before proceeding, call the isCompatibleAsLeft or RightInput method to check compatibility. If it returns true, you can continue.

Slide 55 — 13:14 (watch)

If the compatibility check returns false, you must store and reload the data through thread group memory to convert it to the correct layout. Regardless of the method used, the call to op.run remains the same.

Slide 56 — 13:22 (watch)

These are the key TensorOps features necessary for building an advanced operation like FlashAttention using TensorOps.

Slide 57 — 13:30 (watch)

Now that we've discussed how to build this operation, let's examine its performance in a real model using Core AI.

Slide 58 — 13:46 (watch)

Core AI offers tools for Python developers to convert PyTorch models into Core AI models, including support for custom meta kernels. For detailed integration of a meta kernel into a Core AI model, refer to the Deep Dive into Core AI Model Authoring and Organization session.

Slide 59 — 14:24 (watch)

I followed the steps from that session to integrate our custom FlashAttention kernel into a SAM3 image segmentation model. We define the body of our custom attention kernel as a string in Python and register the TorchMetalKernel object. Next, we replace the default HuggingFaceAttention implementation with one that calls our kernel. Finally, we load the model from HuggingFace and export it from PyTorch as an optimized Core AI asset. The export will take a moment to finish. Now we're ready for inference. SAM3 performs promptable concept segmentation, so we provide the model with an image and text, and it responds with a segmentation mask indicating where objects are located in the image.

Slide 60 — 14:58 (watch)

I am prompting the model to label all pixels in the image that contain a car.

Slide 61 — 15:06 (watch)

Now I will run the segmentation.

Slide 62 — 15:18 (watch)

The final result shows that the model correctly segmented the image. The car is highlighted in blue, indicating that our attention kernel is fully integrated into the model as expected.

Slide 63 — 15:32 (watch)

Today, I covered the tools available for building optimized custom ML kernels on Apple Silicon. These include quantized data types, advanced TensorOps features such as cooperative tensors and reductions, and integration with Core AI.

Slide 64 — 15:54 (watch)

To go further, explore the Metal Performance Primitive documentation for the complete API reference and programming guide, which provide additional performance optimization guidelines. You can also download the TensorOps sample code to review details that I couldn't cover here. Additionally, be sure to check out the related sessions to learn more about Core AI and Metal. Thank you.

Building confidence in an always-in-motion distributed streaming system | Frank McSherry | Bug Bash

Thu, 14 May 2026 00:00:00 +0000

151 slides extracted.

Slide 1 — 0:16 (watch)

Can everyone hear me? Great. Today, I’m Frank McSherry from Materialize, and the title of my presentation is "Bug Batch." When I initially typed the title, I wasn't thinking too deeply about it, but I’ve come to appreciate it. It effectively conveys the topic, particularly the importance of using the right verb tenses, such as the gerund. This presentation will focus on building confidence—what it means and my perspective on it. For me, and I hope for all of you, building confidence is a process rather than a finished product. You are never done building confidence in something.

Slide 2 — 1:54 (watch)

Building confidence is an ongoing process, not a finished product. It involves various elements that help you and those around you understand and evolve the work you are doing. I want to emphasize that what I share today is largely based on personal opinions, although it does include some technology discussions. You are encouraged to evaluate and reflect on these ideas and determine if they resonate with you. Each presentation today has been a personal journey, and I invite you to consider these insights thoughtfully rather than dismiss them outright.

I also want to mention that I had a conversation with Gary yesterday, where I mentioned having some strong opinions on certain topics. While I decided not to include those hot takes in my slides to avoid controversy, I will share that my organization has seen significant benefits from neurosymbolic AI. I believe that the processes I have found effective for building confidence align well with the recent advancements in AI tools. These tools may not directly assist in building confidence but can facilitate effective implementation once confidence is established.

Nothing has changed about software engineering | Ben Eggers | Bug Bash 2026

Thu, 14 May 2026 00:00:00 +0000

110 slides extracted.

Slide 1 — 0:04 (watch)

I'm Ben Eggers from OpenAI. To start, could I have a volunteer from the front row?

Slide 2 — 0:10 (watch)

What's your name? I tend to get nervous when I speak, so if I'm talking too fast, please let me know. Thank you.

Slide 3 — 0:20 (watch)

I'm Ben Eggers, and I work at OpenAI. I'm here to explain that nothing has changed in software development.

Slide 4 — 0:28 (watch)

Software development remains unchanged.

Slide 5 — 0:38 (watch)

Nothing has changed. First, let's conduct a poll. Please raise your hand if you have used a large language model (LLM) to develop software.

Slide 6 — 0:52 (watch)

How many people here believe that we experienced a significant leap in model intelligence a few months ago? How many would trust an LLM to develop all of their software? Let's discuss.

Slide 7 — 1:04 (watch)

I have a personal question. How many of you are using MacBook Pros that can't connect to Wi-Fi after putting them to sleep and then reopening them? I've been struggling with that issue. Great, I'm Ben.

Slide 8 — 1:14 (watch)

This slide features a photo of me speaking at Bug Bash last year, alongside a childhood photo of myself that I shared during the event.

Slide 9 — 1:26 (watch)

I work on infrastructure at OpenAI. Previously, I was the eighth highest seven-day trailing token user at OpenAI, with a spike large enough to trigger rate limiting. I also have several hobbies, so if you're interested in long-distance hiking, jazz, living abroad, or therapy, let's talk.

Slide 10 — 1:44 (watch)

Before I begin this talk, I want to thank these individuals. If any of them apply to work with you, you should hire them immediately. They have significantly contributed to the ideas presented in this talk and have made me a better engineer. These are some of my engineering heroes, most of whom work at OpenAI. I wouldn't feel right giving this talk without acknowledging them and expressing my appreciation for their greatness.

Slide 11 — 2:14 (watch)

Someone once told me that the key to giving a good talk is to outline what you will discuss, present the content, and then summarize what you covered. So, let me outline my main point: all the challenging aspects of building software remain difficult and unchanged. We will address this in two parts. First, I assert that writing code is indeed the hard part of software development, but not for the reasons typically associated with writing code.

Slide 12 — 2:26 (watch)

In part two, we will discuss how agents facilitate the work and influence your thinking about it, but they do not eliminate the need for the work itself.

Slide 13 — 2:32 (watch)

All the challenging aspects remain unchanged.

Slide 14 — 2:44 (watch)

After Will's insightful keynote this morning, I want to emphasize that the process is not just about writing code and testing. I believe it is more accurately represented as 40% thinking, 10% writing code, and 50% testing. This perspective highlights the importance of each phase in the development process.

Slide 15 — 2:54 (watch)

I would like to provide a disclaimer.

Slide 16 — 2:58 (watch)

I am referring to deep and narrow systems, which have a limited surface area in terms of the APIs they offer but possess significant implementation complexity. This includes databases, file systems, and infrastructure.

Slide 17 — 3:14 (watch)

We are not discussing web portals and applications, which typically have a broader feature set. In these cases, each feature does not significantly impact the architecture of the rest of the system.

Slide 18 — 3:22 (watch)

I cannot definitively say that this does not apply to broad, high surface area systems. My experience with them is limited, so I do not feel qualified to comment extensively.

Slide 19 — 3:30 (watch)

I believe I am somewhat qualified to discuss the systems on the left, so that will be our focus.

Slide 20 — 3:36 (watch)

Many of you who attended Bug Bash last year might recall that I enjoy games. Last year, we played Guess the Impact. This year, the game is Guess What the Agent Gave Me.

Slide 21 — 3:48 (watch)

I will describe a system and share the prompt I used with the agent. Then, you will guess the outcome.

Slide 22 — 3:56 (watch)

Let's practice.

Slide 23 — 4:02 (watch)

I aimed to create a Rust drop-in replacement for Python's RE module. Regular expressions present significant implementation complexity, but there is a clear correctness oracle in the Python RE module. It seems feasible to achieve better performance.

Slide 24 — 4:18 (watch)

Python's RE module is primarily written in Python. This presents an interesting problem to explore in the context of autonomous software engineering.

Slide 25 — 4:26 (watch)

I built a harness with an essentially empty repository that operated in a while loop.

Slide 26 — 4:38 (watch)

Every four or five loops, the system would instantiate a Codex instance, instructing it to pick something up and move closer to it. Additionally, every fifth or sixth loop, an orchestrator agent could modify the agent harness. The while loop itself was straightforward; it simply called the Python file that the orchestrator agent could manipulate.

Slide 27 — 4:52 (watch)

What do you think happened here? Any guesses? A Rust wrapper for Python would be even better than what I have. Do we have any other suggestions?

Slide 28 — 5:12 (watch)

The codebase consists of one million lines, which is low but morally correct. It contains numerous markdown documents, and that is accurate.

Slide 29 — 5:26 (watch)

After a few weeks of review, I noticed that it had generated manifests containing hundreds of thousands of lines of JSON. These manifests functioned as test cases, each highly specific, detailing conditions such as "this should match this" and tagged with numerous keywords and features. The result was overwhelming and poorly structured.

Slide 30 — 5:50 (watch)

I SSH'd in and decided to experiment with autonomous software engineering instead of directly instructing the agent to modify the code. I attempted to modify the harness to avoid using JSON. As a result, it changed all the file types to PY and created a manifest assignment. This repository is available on GitHub for reference. Additionally, it generated everything within a single 24,000-line Rust crate.

Slide 31 — 6:18 (watch)

You might think that 24,000 lines is a lot, but when you look at the file percentages, it makes sense. The process generated reports that were initially in JSON format but were converted to PY. One of these reports is 6.47 megabytes, indicating that we have large JSON files containing numerous test cases. It appears that the system tested these cases and produced reports detailing the results of each test.

Slide 32 — 6:36 (watch)

This isn't exactly what I envisioned. I still believe this idea has potential, but I haven't had the time to pursue it further. That's our game, and we have a couple more similar projects in the pipeline.

Slide 33 — 6:44 (watch)

Part one.

Slide 34 — 6:46 (watch)

Writing code has always revealed the most challenging aspects of programming.

Slide 35 — 6:52 (watch)

My claim is that the bottleneck has never been typing, which I believe is an uncontroversial statement. If you're in this room, you likely type at least 60, and probably 80 to 100 words per minute. When you calculate the number of lines of code produced per day, it's clear that typing is not the bottleneck.

Slide 36 — 7:10 (watch)

Code is relatively inexpensive to produce, even at high typing speeds. In fact, code has historically been considered cheap to create.

Slide 37 — 7:26 (watch)

My rule of thumb for a productive day or week was shipping a couple thousand lines of code, which translates to about 400 or 500 lines per day during six hours of deep focus time. However, the time spent typing those lines does not account for the necessary design discovery, integration, and correctness processes. I assert that these aspects remain the challenging part of coding.

Slide 38 — 7:50 (watch)

This image resembles a sci-fi book cover featuring someone writing code. In the past, writing code required deep thought about the task at hand. However, in today's environment, where we can simply issue prompts like "make it better" or "make no mistakes," we tend to think less critically about our objectives. The process of coding used to compel us to uncover the true nature of the problem.

Slide 39 — 8:14 (watch)

We've all experienced moments when we're struggling with a piece of code, and suddenly realize that a component should be placed differently or that our database indices are incorrect. If you're in this room, you've likely written code with LLMs and experienced that "aha" moment, where you recognize that the shape of the problem differs from your initial perception.

Slide 40 — 8:28 (watch)

It also forced us to turn those boundaries into contracts. In an era where you can easily rewrite your entire API contract, that was not always the case. Perhaps most importantly, you notice the unusual issues before anyone else does. Will mentioned in his talk this morning that humans have always been unreliable, and AI agents are not necessarily less reliable. While that is true and AI is improving, it seems to me that there was something qualitatively significant about the fact that a human carefully considered every line of code. Now, we are in a situation where, because a human is not thoroughly analyzing each line, we have lost a critical aspect of quality assurance that was previously inherent in our software.

Slide 41 — 9:12 (watch)

A human would assess whether something was correct, consider the shape of the problem, and review the API contracts. Filling in the code was often the least interesting part of the process.

Slide 42 — 9:26 (watch)

When writing an algorithm, such as an inverted index, the process of coding itself is not necessarily boring. However, if you frame coding in these terms, filling in the code can be seen as the least interesting part. It's akin to setting up all the boxes and then simply coloring them in.

Slide 43 — 9:38 (watch)

The slowness was a critical factor. I asked our new image model to provide an illustration of slowness as a load-bearing concept, and this is the result I received.

Slide 44 — 9:50 (watch)

When using these interfaces regularly, the deficiencies become apparent. As you wire paths end-to-end, you notice the missing cases. You might realize that two cases are actually the same, or that you should consider an error case. This requires careful thought about the various scenarios, whereas a large language model is more likely to make mistakes.

Slide 45 — 10:04 (watch)

When writing your queries, it's crucial to carefully consider how you manage your indices and the efficiency of your scans, especially for those of you interested in databases.

Slide 46 — 10:22 (watch)

It's important to think critically about your approach when writing queries and conducting tests. Common wisdom suggests focusing on testing your interfaces rather than your implementation, considering potential edge cases. This method remains effective, but when a human is involved, you can be very intentional about your testing strategy.

Slide 47 — 11:00 (watch)

I assert that there was once a natural equilibrium in software development, where the speed of writing code matched our capacity to reason about it. Generally, people created code they understood, and if they didn't, it was much more limited than it is today.

Slide 48 — 11:14 (watch)

I truly believe that during the golden era of human software engineering, code production was slow enough that we thought more critically about it, resulting in better quality work. Today, we often hear the term "slop" used to describe the current state of software development. This reflects the reality that while humans produce less slop overall, it still exists. I understand the skepticism surrounding this issue, especially the claim that software engineering has become stochastic due to large language models.

Slide 49 — 11:58 (watch)

Software engineering has always been stochastic. Anyone who has managed a project with interns or new graduates understands this. You might receive anywhere from 100 to 2,000 lines of code, and the quality can vary significantly. This isn't a critique of inexperienced software engineers; rather, it highlights that outsourcing work rarely yields the same level of quality as your original vision. You cannot expect to receive exactly what you had in mind when delegating tasks.

Slide 50 — 12:26 (watch)

As a tech lead, you define the constraints and overall architecture of the project. You may not need to detail every aspect of the implementation, as long as the interfaces are correctly implemented, you have confidence in your tests, and the system functions as intended. While code reviews are essential, it's not necessary to understand every line of code or service, provided you have team members who do. Your focus can be on data storage, service communication, and ensuring the correctness of interfaces and types. However, this approach may vary when dealing with particularly complex systems.

Slide 51 — 13:14 (watch)

Many of us have encountered frustrating database bugs, which reflect a stylized version of the role of a tech lead during the golden era of software engineering. The primary responsibility was to narrow the distribution of possible implementations. You begin with your intent, evaluate various implementations, and select one. To constrain the design, you concentrate on schemas, APIs, tests, and reviews rather than the code within individual modules.

Slide 52 — 13:22 (watch)

In summary, the old coding loop compelled design thinking.

Slide 53 — 13:34 (watch)

Slowness served as a critical rate limiter, and we've essentially always engaged in stochastic software engineering. While not every individual may have practiced this approach, software engineering at a large scale has fundamentally been a stochastic process. Now, let's play another game.

Slide 54 — 13:58 (watch)

We have a comprehensive test suite and a legacy storage system that we find unsatisfactory due to its poor performance. We decided to build a better version using modern primitives, specifically object storage and a key-value store provided by OpenAI. We set up the comprehensive test suite and ran it against the legacy storage system, which passed. Following that, we implemented an agent hill climb approach.

Slide 55 — 14:16 (watch)

The design of the harness is not particularly noteworthy. We had an agent hill climb to get everything functioning, and after a few weeks, it was successful. My question is, what do you think we achieved? Did it interact with the old storage system?

Slide 56 — 14:50 (watch)

No, it was a better sandbox than that, although a very large S3 bucket would have been amusing. We actually had one key and one value. It worked fine on my machine. Did it perform better? Locally, it performed adequately, but upon deployment, we encountered unexpected issues. This is particularly amusing because this type of bug is something a human would almost never create.

Slide 57 — 15:08 (watch)

Humans can make unpredictable mistakes, but this particular bug is difficult to envision as something a human would create. When you want to insert multiple keys and values into a key-value store, this is not the typical implementation approach one would choose.

Slide 58 — 15:16 (watch)

I haven't had time to review the logs from the harness runs, but I suspect that a performance hack was implemented due to the small data scale. Future iterations likely overlooked this issue, assuming it was intentional and that the human wanted it that way.

Slide 59 — 15:38 (watch)

I won't address it because I've seen this happen frequently in my experience. Moving on to part two: agents facilitate the work, but they do not eliminate it. You may have noticed that models can surpass the threshold of usefulness.

Slide 60 — 15:44 (watch)

In comparing GPT 4.1 to GPT 4.6, it's clear that 4.1 performs significantly worse on all benchmarks.

Slide 61 — 15:54 (watch)

There is also a qualitative difference between GPT 3.7 and GPT 5.4, which is another unfair comparison. Previous generations of models struggled with RKGI2, while current generations are suddenly able to handle it effectively.

Slide 62 — 16:12 (watch)

We have model generations that were trained on the training set, but there is also a sense that these models have crossed a qualitative threshold of usefulness. Many of us feel that we can trust them more now, and they perform better in various tasks. This shift has led to discussions about agentic engineering and the concept of an intelligence explosion.

Slide 63 — 16:22 (watch)

I feel that a model is often better at my job than I am, and this change occurred about three months ago.

Slide 64 — 16:42 (watch)

I maintain that models can write better code when you provide the initial guidance. You must define what success looks like and supply a comprehensive test suite. It’s essential to outline the structure of the solution; simply instructing the model to "build this thing" or "make it better" is insufficient. Additionally, you need to establish in advance how you will evaluate whether the change was effective; guessing is not an option.

Slide 65 — 16:54 (watch)

These are the same principles that humans need to build good software, which supports my claim that software engineering has not changed. I consider these aspects when mentoring an intern, writing notes for myself, or when I sit down to do some work.

Slide 66 — 17:10 (watch)

I need to consider what the desired outcome looks like and how to achieve it effectively. Therefore, it's essential to make decisions first.

Slide 67 — 17:14 (watch)

You define the desired behavior changes, identify what must continue functioning, and specify the trade-offs you are willing to accept.

Slide 68 — 17:26 (watch)

For example, avoid placing all keys and values in a singleton within the key-value store. Instead, utilize the key-value store effectively. There is potential for a separate, interesting discussion on this topic. Prompts can be likened to mathematical expressions. Ultimately, you must be extremely specific and detailed in your requests, particularly in distinguishing between terms like "for any," "for all," and "choose one such that."

Slide 69 — 17:50 (watch)

These details are significant for implementation. As we increasingly interact with code and computers through prose, prompts are likely to become more mathematical across the industry. While I believe this topic warrants a separate discussion, it is not the focus of this presentation.

Slide 70 — 18:02 (watch)

Design is possibly the most important slide in this entire presentation, as it reflects what I have learned about developing with agents.

Slide 71 — 18:10 (watch)

I write all of my schemas by hand. For very simple schemas, I might use an agent, but I have spent a week struggling with a 300-line Prisma schema because it needs to be precise. Therefore, I do not let the agent handle it; I write my APIs and interfaces by hand for the same reasons.

Slide 72 — 18:20 (watch)

I have coworkers who write all of their tests by hand, which I find excessive. I believe that's the most challenging aspect of software engineering.

Slide 73 — 18:32 (watch)

However, I know people who strongly advocate for it.

Slide 74 — 18:36 (watch)

There is an interesting perspective that unit testing may be becoming obsolete. We've observed instances where AI-generated unit tests assert unusual values and provide little meaningful information.

Slide 75 — 18:46 (watch)

If you are writing unit tests yourself, they can be useful because you, as the human, ensure their correctness. However, simply instructing AI to add more tests is generally not effective. This topic deserves its own discussion, but it is not the focus of this talk.

Slide 76 — 19:02 (watch)

Correctness requires providing clear guidelines. You should specify the system diagram you want, without concern for the color of the boxes or the programming language used. The internal abstractions and implementation details are generally less important, as long as they are correct.

Slide 77 — 19:16 (watch)

If you have a provable correctness harness, the aesthetics of the code become irrelevant. You specify which data models are immutable and provide a comprehensive test harness.

Slide 78 — 19:28 (watch)

Always implement tests in a different context. This is an important strategy. When you ask an AI to write code and then request unit tests for that code, the AI retains the context of the code it just generated. The most effective workflow I have found is to separate these tasks. First, write your interfaces and a set of tests, then instruct the AI to focus on making the code work without modifying the tests.

Slide 79 — 19:46 (watch)

This is the only effective approach to AI-driven, test-driven development, particularly concerning unit tests.

Slide 80 — 19:50 (watch)

In summary, here is a checklist for managing your AI agents.

Slide 81 — 20:02 (watch)

Clearly define the desired behavior change in prose and mathematics. Implement your schemas and interfaces manually or with an agent, which are typically not large relative to the overall system. Create tests for the change, and then allow an agent to fill in the gaps. This process is similar to managing your less motivated self on a Monday at 8 AM.

Slide 82 — 20:20 (watch)

That brings us to our final game, Contrafact. I developed a practice journal to track jazz practice.

Slide 83 — 20:28 (watch)

The details of the app are not highly specific, but it is important to note that it is not a complex system. Instead, it is a relatively shallow system with many screens.

Slide 84 — 20:44 (watch)

I developed my storage models, feature views, and domain model patterns from scratch. I established the core codebase and patterns to ensure they were complete and comprehensive. Then, I would identify a feature I wanted, implement it, and verify that it worked. After each successful implementation, I would repeat the process for the next feature. Eventually, I found myself with 30,000 lines of code and wondered what all that code was doing. Fortunately, there was no markdown involved this time. Did you say time management?

Slide 85 — 21:40 (watch)

That's an interesting guess. There was some problematic clock functionality, but it wasn't a significant part. Duplicated date formatting is the closest guess, so we'll give you that. I requested a practice session with etudes and tunes, which relates to the domain model layer. This layer is not the storage layer; it consists of in-memory abstractions for how we manage and manipulate data. However, what I received was a poorly designed UUID soft look-up layer. In this setup, every entity had a UUID, and each one stored multiple UUIDs. The application performed UUID look-ups in the repository layer, and if it retrieved the wrong type, the entire application would crash.

Slide 86 — 21:56 (watch)

During a late-night coding session, I discovered that a change had inadvertently added UUIDs to my data models. While they were correctly linked at the data model layer, they also stored these UUIDs. If those UUIDs became out of sync, the consequences were unclear.

Slide 87 — 22:12 (watch)

I completely unraveled the situation because it was problematic. It's crucial to pay attention to your data models.

Slide 88 — 22:20 (watch)

I outlined what I intended to convey, and now I've shared that information with you.

Slide 89 — 22:24 (watch)

Writing code used to require significant design thinking that often occurred implicitly.

Slide 90 — 22:32 (watch)

Agentic coding requires us to be more explicit in our approach. It does not reduce the need for critical thinking or effort. Now, everyone must take on the responsibilities of a tech lead, which means you are now functioning as a tech lead, while the overall process remains unchanged.

Slide 91 — 22:44 (watch)

Software engineering is fundamentally unchanged. We prioritize correctness and how we demonstrate that correctness is a fundamentally stochastic process, though it has become faster in some respects.

Slide 92 — 22:52 (watch)

Code has become inexpensive, but ensuring its correctness remains a challenge.

Slide 93 — 22:54 (watch)

Thank you.

Slide 94 — 23:24 (watch)

I ran a bit over time, but I believe I still have a few minutes for any thoughts, feelings, or questions.

Slide 95 — 23:38 (watch)

I have not observed any programmatic enforcement regarding restrictions on modifying schemas or other elements, and I have also seen this approach fail at the prompt level.

Slide 96 — 23:46 (watch)

I would like to see a product, open-source library, or VS Code extension that allows you to specify which files the model can or cannot access. This type of guardrail is likely to emerge soon.

Slide 97 — 24:04 (watch)

As an inexperienced engineer, you might wonder how to transition to a tech lead role, where the responsibilities differ significantly from those of junior engineers. This is an important question.

Slide 98 — 24:16 (watch)

That is one of the key questions in the industry right now.

Slide 99 — 24:24 (watch)

Tech lead work is qualitatively different from implementation work, which involves writing code and developing a service once a good API contract is established. I believe you can begin practicing tech lead responsibilities directly.

Slide 100 — 24:34 (watch)

You can begin by defining what you want to build.

Slide 101 — 24:40 (watch)

Consider the structure of your database and identify the core entities that are important to you.

Slide 102 — 24:46 (watch)

Understanding the entire system can be a challenging process initially, as it requires a comprehensive grasp of each component. However, I believe this skill can be developed without the need to spend years meticulously building each part on others' projects before gaining the ability to do it independently.

Slide 103 — 25:08 (watch)

People can learn to do this directly. Thank you. I have a quick question. Even if a pattern is abstracted, do you quantify how familiar it is to a model before asking it a question?

Slide 104 — 25:16 (watch)

For a novel concept, you would develop more architecture around it, whereas for something that is relatively standard, you would do less.

Slide 105 — 25:20 (watch)

That's a good question.

Slide 106 — 25:24 (watch)

There are many aspects of the software world that I am not familiar with.

Slide 107 — 25:30 (watch)

Many patterns and practices exist that I am unaware of in places I have never visited.

Slide 108 — 25:44 (watch)

In my experience, the models I have encountered tend to produce core patterns that are about 80% reasonable and effective. However, there is often around 20% of their output that may not make much sense.

Slide 109 — 25:52 (watch)

When I build a website or engage in tasks where I lack extensive experience, I often encounter a "gel man amnesia" effect. The models provide suggestions that seem reasonable at first, but as I explore further, I realize that certain patterns significantly hinder my development and complicate my work.

Slide 110 — 26:10 (watch)

I make a concerted effort to establish all the core patterns myself and understand their underlying reasons, even when a model is generating the content. Thank you.