93 slides extracted.
Slide 1 — 0:08 (watch)
![]() | It's nice to meet you all. |
Slide 2 — 0:20 (watch)
![]() | I'm Ash, and this is Andrew. We both work as engineers in the Applied AI team at Anthropic. |
Slide 3 — 0:36 (watch)
![]() | This session is inspired by a blog post we published a couple of weeks ago about building agents that can run for extended periods, specifically five to six hours or more. |
Slide 4 — 0:52 (watch)
Slide 5 — 1:02 (watch)
![]() | In the first half, my colleague Andrew will discuss how we arrived at this point, the primitives we've implemented in our code, and our current status. |
Slide 6 — 1:14 (watch)
![]() | I will return to discuss some of the more experimental aspects we are exploring, including the harnesses and a few examples of our findings. Thank you, Ash. |
Slide 7 — 1:38 (watch)
Slide 8 — 2:04 (watch)
Slide 9 — 2:22 (watch)
Slide 10 — 3:10 (watch)
Slide 11 — 4:02 (watch)
Slide 12 — 4:34 (watch)
Slide 13 — 5:16 (watch)
Slide 14 — 5:54 (watch)
Slide 15 — 6:18 (watch)
Slide 16 — 6:46 (watch)
Slide 17 — 7:14 (watch)
Slide 18 — 7:44 (watch)
Slide 19 — 8:20 (watch)
Slide 20 — 9:12 (watch)
Slide 21 — 9:38 (watch)
![]() | While it may not be considered a true Ralph loop, you would set a maximum number of iterations and a safe word. A stop hook would then intercept at the point where Claude would normally stop. |
Slide 22 — 9:46 (watch)
![]() | If it's not finished, it will continue until it meets one of those exit criteria. |
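As a rough illustration of the loop described above, here is a minimal sketch with the two exit criteria mentioned in the talk: a maximum iteration count and a safe word. The names `run_until_done`, `step`, and `fake_agent` are hypothetical, and `step` merely stands in for one agent turn; this is not how the actual stop hook is implemented.

```python
from typing import Callable, Tuple


def run_until_done(
    step: Callable[[int], str],
    max_iterations: int,
    safe_word: str,
) -> Tuple[int, str]:
    """Re-invoke `step` until its output contains the safe word or the cap is hit.

    `step` is a hypothetical stand-in for one agent turn; a real stop hook
    would intercept the agent's stop event instead of calling a function.
    """
    for iteration in range(1, max_iterations + 1):
        output = step(iteration)
        if safe_word in output:
            # Exit criterion 1: the agent signalled completion.
            return iteration, "safe_word"
    # Exit criterion 2: the iteration budget ran out.
    return max_iterations, "max_iterations"


def fake_agent(iteration: int) -> str:
    """Toy agent that reports completion on its third turn."""
    return "DONE" if iteration == 3 else "still working"
```

Calling `run_until_done(fake_agent, 10, "DONE")` stops on the third turn because of the safe word, while a budget of 2 stops on the iteration cap instead.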
Slide 23 — 10:14 (watch)
Slide 24 — 10:42 (watch)
![]() | At this point, we can run for about 30 hours with Claude Sonnet 4.5. |
Slide 25 — 11:16 (watch)
Slide 26 — 11:54 (watch)
Slide 27 — 12:38 (watch)
Slide 28 — 13:12 (watch)
![]() | From there, it would enter a harness loop consisting of multiple steps. |
Slide 29 — 13:26 (watch)
Slide 30 — 13:52 (watch)
Slide 31 — 14:16 (watch)
![]() | This is the first iteration of the long-running harnesses. Continuing with the history tour, we have Opus 4.6 and Sonnet 4.6. |
Slide 32 — 14:58 (watch)
Slide 33 — 15:36 (watch)
![]() | We also introduced server-side compaction, which allows these models to run indefinitely while compaction occurs on the server side. This comes alongside a context window of 1 million tokens. |
Slide 34 — 15:50 (watch)
Slide 35 — 16:00 (watch)
![]() | This slide provides an overview of the various releases displayed in the table. |
Slide 36 — 16:16 (watch)
Slide 37 — 16:36 (watch)
![]() | You can build a fully featured application that runs out of the box. What's interesting is that the harness does not disappear as the models improve. |
Slide 38 — 16:46 (watch)
![]() | The harness evolves alongside the models over time. It's fascinating to identify the gaps in the model and address them using the harness. |
Slide 39 — 16:58 (watch)
![]() | You train the model using that aspect of the harness, and at some point, you may remove it entirely. This iterative loop continues over time with more and more co-releases. |
Slide 40 — 17:08 (watch)
![]() | Hopefully, that overview of Claude's evolution was an interesting perspective on how it relates to long-running agents. |
Slide 41 — 17:22 (watch)
![]() | I'll hand over to Ash to continue with the current state of the art. Quick question: how many of you have agents running in the background while you're here? |
Slide 42 — 17:56 (watch)
Slide 43 — 18:44 (watch)
Slide 44 — 19:12 (watch)
![]() | The generator creates content, while the evaluator assesses it. The concept involves completely separating the context windows, system prompts, and tasks. |
Slide 45 — 19:24 (watch)
Slide 46 — 19:40 (watch)
Slide 47 — 20:02 (watch)
Slide 48 — 20:44 (watch)
Slide 49 — 21:32 (watch)
Slide 50 — 22:10 (watch)
Slide 51 — 22:36 (watch)
Slide 52 — 22:50 (watch)
![]() | All of these examples consist solely of HTML and CSS, each iterated on for about four to five hours across 15 rounds. |
Slide 53 — 23:06 (watch)
Slide 54 — 23:22 (watch)
![]() | If the generator struggles and consistently scores low on originality, the system will discard the entire attempt and start over from scratch. |
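The discard-and-restart behaviour can be sketched roughly as follows. Everything here is an assumption made for illustration: the names `Attempt` and `run_rounds`, the numeric originality score, and the `low` and `patience` thresholds are hypothetical, since the talk does not say how "consistently low" is decided.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """History of one attempt: the drafts so far and their originality scores."""
    drafts: List[str] = field(default_factory=list)
    scores: List[int] = field(default_factory=list)


def run_rounds(
    generate: Callable[[List[str]], str],
    evaluate: Callable[[str], int],
    rounds: int,
    low: int = 3,       # assumed threshold: scores below this count as "low"
    patience: int = 3,  # assumed: this many low scores in a row triggers a restart
) -> Tuple[Attempt, int]:
    """Alternate generator and evaluator; discard the attempt if scores stay low."""
    attempt = Attempt()
    restarts = 0
    for _ in range(rounds):
        draft = generate(attempt.drafts)  # the generator sees only its own drafts
        score = evaluate(draft)           # the evaluator scores in isolation
        attempt.drafts.append(draft)
        attempt.scores.append(score)
        recent = attempt.scores[-patience:]
        if len(recent) == patience and all(s < low for s in recent):
            # Consistently low originality: throw everything away and start over.
            attempt = Attempt()
            restarts += 1
    return attempt, restarts
```

The generator and evaluator are passed in as plain callables, so each can be backed by its own context window and system prompt without the loop needing to know about either.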
Slide 55 — 23:26 (watch)
Slide 56 — 23:34 (watch)
![]() | This ability to course correct over long time horizons comes from breaking the work down into the different roles involved in building a project. |
Slide 57 — 23:54 (watch)
Slide 58 — 24:26 (watch)
Slide 59 — 24:52 (watch)
![]() | This slide presents a simple organizational structure involving product management, individual contributors, and quality assurance. |
Slide 60 — 25:02 (watch)
![]() | We didn't invent this concept; we simply provided each role with its own context window. An interesting aspect to discuss is the connection between the generator and the evaluator in this setup. |
Slide 61 — 25:38 (watch)
Slide 62 — 26:02 (watch)
![]() | The evaluator grades against the contract that the two agents have established, rather than the original specification that the planner created at the beginning. |
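One way to picture grading against a contract is as a list of named checks that the generator and evaluator agree on up front, with the evaluator running only those checks against the build. This is a sketch under that assumption; `Criterion`, `grade`, and the example criteria are hypothetical and do not come from the harness itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    """One line item in the contract agreed between generator and evaluator."""
    name: str
    check: Callable[[dict], bool]


def grade(contract: List[Criterion], build: dict) -> Tuple[Dict[str, bool], bool]:
    """Grade a build against the contract only, not the planner's original spec."""
    results = {criterion.name: criterion.check(build) for criterion in contract}
    return results, all(results.values())


# Hypothetical contract for illustration; the build is modelled as a plain dict.
example_contract = [
    Criterion("has_sprite_editor", lambda b: "sprite_editor" in b["features"]),
    Criterion("level_is_playable", lambda b: b["playable"]),
]
```

A build that satisfies every agreed criterion passes, and a failing build returns per-criterion results that the evaluator can hand back to the generator as feedback.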
Slide 63 — 26:34 (watch)
Slide 64 — 27:16 (watch)
![]() | The prompt was to build a retro game maker. I won't argue that this is the most cost-effective or efficient way to develop an app. Currently, it takes an excessively long time and is very expensive. |
Slide 65 — 27:36 (watch)
![]() | A lot of the functionality only starts working with this harness; it did not function properly in a solo loop. This is what the opening screen looked like when we did not have the harness. |
Slide 66 — 27:50 (watch)
![]() | The opening screen is quite simplistic and a bit boring, but it still looks nice. If this were the entire app, you might consider shipping it, but it is more of a bait, so to speak. |
Slide 67 — 28:06 (watch)
Slide 68 — 28:28 (watch)
Slide 69 — 28:48 (watch)
![]() | The agent lacked the ability to test itself and understand what it meant to play a game successfully. Although it appeared complete at first glance, it ultimately failed when pushed to its limits. |
Slide 70 — 29:08 (watch)
Slide 71 — 29:24 (watch)
![]() | The planner made all the product decisions, while the two other agents determined how to test the product. |
Slide 72 — 29:40 (watch)
![]() | In the sprite editor, we have a complete 54-color palette, featuring the 8-bit preset from the project dialog. The sprite is displayed at the actual game scale. |
Slide 73 — 29:58 (watch)
Slide 74 — 30:20 (watch)
Slide 75 — 30:34 (watch)
![]() | Finally, the actual results were applied. |
Slide 76 — 30:54 (watch)
Slide 77 — 31:58 (watch)
Slide 78 — 33:16 (watch)
Slide 79 — 34:58 (watch)
Slide 80 — 35:46 (watch)
Slide 81 — 37:14 (watch)
Slide 82 — 39:06 (watch)
Slide 83 — 39:28 (watch)
Slide 84 — 40:04 (watch)
Slide 85 — 40:40 (watch)
Slide 86 — 41:10 (watch)
![]() | This is an example of AI slop. |
Slide 87 — 43:24 (watch)
Slide 88 — 53:32 (watch)
Slide 89 — 1:04:36 (watch)
Slide 90 — 1:10:46 (watch)
Slide 91 — 1:14:14 (watch)
Slide 92 — 1:15:04 (watch)
Slide 93 — 1:15:32 (watch)