223 slides extracted.
Slide 1 — 0:08 (watch)
![]() | Hello, everyone. |
Slide 2 — 0:24 (watch)
Slide 3 — 0:34 (watch)
![]() | I primarily focus on how to test AI systems and how to create AI systems that function effectively. |
Slide 4 — 0:44 (watch)
Slide 5 — 1:00 (watch)
![]() | Next, we will set up tracing, which captures the raw data necessary to run evaluations. We will also run a simple AI agent using the Claude Agent SDK and examine the traces it produces. |
Slide 6 — 1:28 (watch)
Slide 7 — 1:52 (watch)
![]() | Next, we will use LLM evals, where an LLM assesses the semantic content of the output and determines success or failure in a more flexible, non-deterministic manner. |
Slide 8 — 2:02 (watch)
![]() | We will use the built-in evals and create a custom eval from scratch. Additionally, we will test the accuracy of our judges in a process known as meta-evaluation. |
Slide 9 — 2:14 (watch)
Slide 10 — 2:30 (watch)
![]() | Practical frameworks you can use after leaving this room include the impact hierarchy, the data flywheel, and techniques such as pairwise evaluation and reliability scoring. |
Slide 11 — 2:56 (watch)
Slide 12 — 3:52 (watch)
Slide 13 — 4:40 (watch)
Slide 14 — 5:04 (watch)
Slide 15 — 5:26 (watch)
![]() | With that information covered, let's discuss the basics. What is an eval? Please raise your hand if you feel you really know what an eval is. |
Slide 16 — 5:46 (watch)
Slide 17 — 6:10 (watch)
Slide 18 — 6:36 (watch)
Slide 19 — 6:56 (watch)
![]() | That is your mental model today. You are writing tests for a new world of applications that are challenging to test. |
Slide 20 — 7:08 (watch)
![]() | It's not rocket science. We need AI evaluations because of the vibes problem. Many people build an AI feature and test it by running a few queries, simply asking themselves if it looks right. |
Slide 21 — 7:54 (watch)
Slide 22 — 8:56 (watch)
Slide 23 — 9:22 (watch)
![]() | You want to avoid fixing one aspect while inadvertently breaking another. Evaluations provide a way to ensure that everything functions as intended, including all previous capabilities. |
Slide 24 — 9:40 (watch)
Slide 25 — 9:58 (watch)
![]() | This is not theoretical. |
Slide 26 — 10:14 (watch)
Slide 27 — 10:28 (watch)
![]() | Today, we will follow that arc. As I mentioned, there are two types of evaluations. |
Slide 28 — 10:48 (watch)
Slide 29 — 11:18 (watch)
Slide 30 — 11:56 (watch)
Slide 31 — 12:26 (watch)
![]() | There is no unit test that can effectively assess tone, but an LLM excels at this task. The strength of LLM judges lies in their ability to understand meaning rather than just basic strings. |
Slide 32 — 12:40 (watch)
Slide 33 — 13:32 (watch)
Slide 34 — 14:20 (watch)
![]() | So the question becomes: when do you use each approach? |
Slide 35 — 14:38 (watch)
Slide 36 — 14:58 (watch)
![]() | You must verify that your evaluations agree with what humans would judge to be accurate, because LLM judges can make mistakes. |
Slide 37 — 15:22 (watch)
Slide 38 — 15:54 (watch)
Slide 39 — 16:18 (watch)
Slide 40 — 16:32 (watch)
![]() | All of this builds up and cascades. |
Slide 41 — 16:46 (watch)
Slide 42 — 17:02 (watch)
![]() | That is the cascading failure we aim to avoid. |
Slide 43 — 17:24 (watch)
Slide 44 — 17:50 (watch)
Slide 45 — 18:14 (watch)
Slide 46 — 18:36 (watch)
Slide 47 — 19:04 (watch)
Slide 48 — 19:30 (watch)
Slide 49 — 19:54 (watch)
Slide 50 — 20:36 (watch)
Slide 51 — 21:08 (watch)
Slide 52 — 21:32 (watch)
Slide 53 — 21:46 (watch)
![]() | Who is not yet set up with Phoenix? |
Slide 54 — 21:58 (watch)
![]() | What do you need? Where are you stuck? Or are you not? Great. It seems everyone is set up with Phoenix, or they may not feel comfortable admitting they aren't. |
Slide 55 — 22:22 (watch)
Slide 56 — 22:46 (watch)
![]() | You can run it locally on your laptop if you prefer, but we are using Phoenix Cloud today to avoid the need for installing any software. |
Slide 57 — 22:54 (watch)
![]() | Let's look at our actual notebook. |
Slide 58 — 23:52 (watch)
Slide 59 — 25:06 (watch)
Slide 60 — 25:36 (watch)
Slide 61 — 26:00 (watch)
![]() | You can use the same two lines of code to activate your agent, regardless of the framework you used to build it. Simply run that call. |
Slide 62 — 26:18 (watch)
![]() | Go to Spaces and launch the space; that's what you want. |
Slide 63 — 26:30 (watch)
![]() | Great. Let's move on. |
Slide 64 — 27:12 (watch)
Slide 65 — 28:06 (watch)
Slide 66 — 28:26 (watch)
![]() | Next, set your API keys. As mentioned, you have already completed this step. |
Slide 67 — 28:40 (watch)
Slide 68 — 29:16 (watch)
Slide 69 — 29:52 (watch)
Slide 70 — 30:08 (watch)
![]() | I have set the permission mode to accept edits. This prevents it from prompting me to change files, which turned out to be a mistake that we will discuss later. |
Slide 71 — 30:36 (watch)
Slide 72 — 31:08 (watch)
Slide 73 — 31:22 (watch)
![]() | You can input your own stock ticker and focus area if desired. For example, I used Tesla. |
Slide 74 — 31:54 (watch)
Slide 75 — 32:38 (watch)
Slide 76 — 32:56 (watch)
![]() | Q4 '25 highlights, the 2026 financial outlook, key growth drivers, and risk assessment. |
Slide 77 — 33:10 (watch)
Slide 78 — 33:36 (watch)
Slide 79 — 33:56 (watch)
Slide 80 — 34:26 (watch)
Slide 81 — 34:50 (watch)
![]() | Let's switch back to the slides. |
Slide 82 — 34:54 (watch)
![]() | A span contains a variety of information. By clicking into a span, you can access its exact details, including annotations and attributes. |
Slide 83 — 35:10 (watch)
Slide 84 — 35:24 (watch)
![]() | Phoenix enhances readability of the output, making it easier to interpret than the default format. However, a single span from one agent run is insufficient for comprehensive analysis. |
Slide 85 — 35:36 (watch)
![]() | We want to have multiple spans to gather extensive data and understand the situation better. In the notebook, there are 12 test queries available for you to run. |
Slide 86 — 35:58 (watch)
Slide 87 — 36:34 (watch)
Slide 88 — 36:56 (watch)
![]() | It is crucial to address the edge cases. |
Slide 89 — 37:22 (watch)
Slide 90 — 38:08 (watch)
Slide 91 — 38:54 (watch)
Slide 92 — 39:32 (watch)
Slide 93 — 40:08 (watch)
Slide 94 — 40:38 (watch)
![]() | This is not just a technical exercise; it is essential to involve your stakeholders, domain experts, product managers, and actual users. |
Slide 95 — 40:58 (watch)
Slide 96 — 41:26 (watch)
Slide 97 — 42:00 (watch)
Slide 98 — 42:28 (watch)
Slide 99 — 42:44 (watch)
Slide 100 — 43:00 (watch)
![]() | If you've been running it in the background, you should have finished generating those traces by now. Let's examine what Apple did. |
Slide 101 — 43:22 (watch)
Slide 102 — 43:48 (watch)
Slide 103 — 44:08 (watch)
Slide 104 — 44:26 (watch)
![]() | The NVIDIA report was thorough but not actionable; it did not indicate whether I should buy the stock. Let's examine the NVIDIA report because it is interesting. Where did I put it? There it is. |
Slide 105 — 44:40 (watch)
![]() | It conducted four web searches, covering the competitive landscape and specific competitors. |
Slide 106 — 45:20 (watch)
Slide 107 — 45:56 (watch)
![]() | This is the last example I will discuss. |
Slide 108 — 46:06 (watch)
Slide 109 — 46:54 (watch)
Slide 110 — 47:38 (watch)
![]() | I reviewed my trace categories and identified what was good and what was bad. |
Slide 111 — 47:46 (watch)
![]() | I printed out the results and generated a table. The root cause frequency appears to be mostly satisfactory. |
Slide 112 — 48:18 (watch)
Slide 113 — 49:02 (watch)
Slide 114 — 49:20 (watch)
![]() | Stacking your evaluation layers functions similarly. The code evaluation identifies basic issues initially, while the LLM as a judge detects reasoning gaps but may overlook subtle hallucinations. |
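The layering idea can be sketched as a short cascade: cheap deterministic checks run first, and the LLM judge is consulted only when they pass. This is a minimal sketch, not the notebook's code. The names `layered_eval`, `Check`, `non_empty`, and `fake_judge` are hypothetical, and the judge is passed in as a plain callable so that any real LLM call can be substituted.

```python
from typing import Callable

# A check takes the agent output and returns (passed, explanation).
Check = Callable[[str], tuple[bool, str]]


def layered_eval(output: str, code_checks: list[Check], judge: Check) -> dict:
    """Run deterministic checks first; call the (costlier) judge only if they pass."""
    for check in code_checks:
        ok, reason = check(output)
        if not ok:
            # Fail fast: no need to spend a judge call on a basic failure.
            return {"label": "fail", "layer": "code", "explanation": reason}
    ok, reason = judge(output)
    return {
        "label": "pass" if ok else "fail",
        "layer": "llm_judge",
        "explanation": reason,
    }


# Hypothetical stand-ins for illustration only.
def non_empty(output: str) -> tuple[bool, str]:
    return bool(output.strip()), "checked that the output is not empty"


def fake_judge(output: str) -> tuple[bool, str]:
    # Placeholder for an LLM call; a real judge would reason about meaning.
    return "buy" in output.lower(), "checked for a recommendation"
```

The `layer` field records which layer produced the verdict, which makes it easier to see later how many failures each layer is catching.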
Slide 115 — 49:34 (watch)
Slide 116 — 49:46 (watch)
![]() | Let's discuss the actual evaluations. |
Slide 117 — 49:50 (watch)
![]() | Let's write some real evaluations, starting with the simplest type: a code evaluation. |
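A code evaluation is just a deterministic function over the agent's output. The sketch below assumes the output is a plain-text financial report. The required section names and the word-count threshold are hypothetical choices for illustration, not the checks used in the workshop notebook.

```python
import re

# Hypothetical requirements; adjust to whatever your report must contain.
REQUIRED_SECTIONS = ("financial outlook", "growth drivers", "risk")


def code_eval(report: str, min_words: int = 50) -> dict:
    """Deterministic checks: minimum length and presence of required sections."""
    text = report.lower()
    missing = [section for section in REQUIRED_SECTIONS if section not in text]
    word_count = len(re.findall(r"\w+", report))
    passed = not missing and word_count >= min_words
    return {
        "label": "pass" if passed else "fail",
        "score": 1.0 if passed else 0.0,
        "explanation": f"words={word_count}, missing={missing}",
    }
```

Returning a label, a score, and an explanation keeps the result shape consistent with what an LLM judge would produce, so both kinds of evaluation can be logged and filtered the same way.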
Slide 118 — 51:08 (watch)
Slide 119 — 52:32 (watch)
Slide 120 — 52:54 (watch)
Slide 121 — 54:14 (watch)
Slide 122 — 55:38 (watch)
Slide 123 — 56:36 (watch)
Slide 124 — 57:50 (watch)
Slide 125 — 58:50 (watch)
Slide 126 — 59:32 (watch)
Slide 127 — 59:46 (watch)
![]() | We need to determine the reason for this issue. We should examine the explanations closely. |
Slide 128 — 1:00:12 (watch)
Slide 129 — 1:00:46 (watch)
Slide 130 — 1:01:12 (watch)
Slide 131 — 1:01:34 (watch)
![]() | Did it write a report based solely on this research? Did it adhere to the source material? |
Slide 132 — 1:02:20 (watch)
Slide 133 — 1:03:12 (watch)
Slide 134 — 1:03:46 (watch)
Slide 135 — 1:04:04 (watch)
![]() | I'm going to use my correctness scores. Let's focus on the incorrect ones. |
Slide 136 — 1:04:28 (watch)
Slide 137 — 1:05:00 (watch)
Slide 138 — 1:05:40 (watch)
Slide 139 — 1:06:14 (watch)
Slide 140 — 1:06:48 (watch)
Slide 141 — 1:07:12 (watch)
Slide 142 — 1:07:28 (watch)
Slide 143 — 1:07:44 (watch)
Slide 144 — 1:07:54 (watch)
![]() | Part four involves adding labeled examples, which is often overlooked but is the most beneficial addition you can make. |
Slide 145 — 1:08:12 (watch)
Slide 146 — 1:08:34 (watch)
Slide 147 — 1:08:50 (watch)
![]() | NVIDIA is a significant player in the semiconductor industry, but this information does not indicate whether we should buy the stock. It serves merely as a description rather than a demonstration. |
Slide 148 — 1:09:32 (watch)
Slide 149 — 1:10:10 (watch)
![]() | Another practical tip when writing an evaluation is to have the judge articulate its reasoning before giving a verdict. Encouraging a chain of thought for judges significantly enhances their performance. |
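Both tips, labeled examples and reasoning before the verdict, can be combined in the judge prompt itself. The following is a minimal sketch with no LLM call. The prompt wording, the `EXPLANATION:`/`LABEL:` markers, and the helper names are assumptions made for illustration, not the template used in the talk.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    output: str       # the text that was graded
    explanation: str  # why it received that label
    label: str        # "pass" or "fail"


def build_judge_prompt(criterion: str, examples: list[LabeledExample], output: str) -> str:
    """Build a few-shot prompt that asks for reasoning first, then the label."""
    parts = [
        f"You are grading a report for {criterion}.",
        "First write EXPLANATION: with your reasoning, "
        "then LABEL: with exactly one of pass or fail.",
        "",
    ]
    for ex in examples:
        parts += [
            f"REPORT: {ex.output}",
            f"EXPLANATION: {ex.explanation}",
            f"LABEL: {ex.label}",
            "",
        ]
    # End on EXPLANATION: so the model reasons before committing to a label.
    parts += [f"REPORT: {output}", "EXPLANATION:"]
    return "\n".join(parts)


def parse_judge_reply(reply: str) -> tuple[str, str]:
    """Split a judge reply into (explanation, label); reject unknown labels."""
    explanation, _, label = reply.rpartition("LABEL:")
    label = label.strip().lower()
    if label not in ("pass", "fail"):
        raise ValueError(f"unexpected judge label: {label!r}")
    return explanation.replace("EXPLANATION:", "").strip(), label
```

A labeled example in the spirit of the NVIDIA case would be a report that describes the company accurately but gives no recommendation, labeled `fail` for actionability.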
Slide 150 — 1:10:28 (watch)
Slide 151 — 1:11:10 (watch)
Slide 152 — 1:11:52 (watch)
Slide 153 — 1:12:22 (watch)
Slide 154 — 1:12:48 (watch)
Slide 155 — 1:13:06 (watch)
![]() | This indicates that we can instruct the agent to improve, and it has the potential for significant enhancement. |
Slide 156 — 1:13:20 (watch)
![]() | We log the annotations back to Phoenix, allowing us to view the actionability scores within the platform. We can apply the same filtering methods as before. |
Slide 157 — 1:13:40 (watch)
Slide 158 — 1:14:32 (watch)
Slide 159 — 1:15:08 (watch)
![]() | Great. That is the outcome we aimed for. |
Slide 160 — 1:15:32 (watch)
Slide 161 — 1:16:30 (watch)
Slide 162 — 1:17:04 (watch)
Slide 163 — 1:17:20 (watch)
![]() | You need to identify which of your evaluations are critical ship blockers and which are merely informative. |
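One way to make the distinction concrete is a release gate that only blocking evaluations can fail. This is a sketch under assumptions: the evaluation names and thresholds are hypothetical, and `results` is assumed to map each evaluation name to its pass rate on the current run.

```python
def release_gate(
    results: dict[str, float],
    blocking: dict[str, float],
) -> tuple[bool, list[str]]:
    """Return (ok_to_ship, failed_blockers).

    results:  eval name -> observed pass rate (0.0 to 1.0)
    blocking: eval name -> minimum pass rate required to ship
    Evals that appear only in `results` are informative and never block.
    A blocking eval with no result counts as a failure.
    """
    failed = [
        name
        for name, threshold in blocking.items()
        if results.get(name, 0.0) < threshold
    ]
    return (not failed, failed)
```

For example, with `{"faithfulness": 0.95}` as the only blocker, a low actionability score would be reported but would not stop a release.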
Slide 164 — 1:18:00 (watch)
Slide 165 — 1:18:26 (watch)
Slide 166 — 1:19:38 (watch)
Slide 167 — 1:21:02 (watch)
Slide 168 — 1:23:36 (watch)
Slide 169 — 1:25:46 (watch)
Slide 170 — 1:27:06 (watch)
Slide 171 — 1:27:28 (watch)
Slide 172 — 1:28:06 (watch)
Slide 173 — 1:28:50 (watch)
Slide 174 — 1:29:40 (watch)
![]() | You must track judge accuracy across various input categories. If the judge consistently approves long responses but fails short ones, it indicates a length bias problem. |
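Tracking judge accuracy per category needs little more than a grouped comparison of judge labels against human labels. The sketch below assumes each row is a `(category, human_label, judge_label)` tuple; the category names in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_category(rows: Iterable[tuple[str, str, str]]) -> dict[str, float]:
    """Fraction of rows in each category where the judge matches the human label."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for category, human_label, judge_label in rows:
        totals[category] += 1
        correct[category] += int(human_label == judge_label)
    return {category: correct[category] / totals[category] for category in totals}
```

Bucketing responses into categories such as `"long"` and `"short"` and comparing the two accuracies is a simple way to surface the bias described above: a judge that is accurate on one bucket and noticeably worse on the other is reacting to length rather than content.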
Slide 175 — 1:30:06 (watch)
Slide 176 — 1:30:26 (watch)
![]() | If your LLM judge achieves a consistency score of 0.4 or higher, it is performing exceptionally well. |
Slide 177 — 1:30:34 (watch)
![]() | The judge disagreeing with you is not a reason to discard your evaluation. The key factor is whether the judge disagrees with you more frequently than a human would. |
Slide 178 — 1:30:54 (watch)
Slide 179 — 1:31:16 (watch)
![]() | The evaluation was checking for an answer of 96.12, but Claude provided 96.124991. As a result, the eval incorrectly marked it as wrong because it did not match the expected value. |
Slide 180 — 1:31:34 (watch)
![]() | After fixing the eval, Opus' score increased to 95%. Your evals can incorrectly mark answers as wrong if they are too strict or if they assess something other than what you intended to evaluate. |
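The 96.12 versus 96.124991 case is what an exact string comparison produces. The talk does not say how the eval was fixed; one common approach, sketched here as an assumption, is to extract the numbers from the answer and compare them within a tolerance. The function name and the default tolerance are hypothetical.

```python
import math
import re


def numeric_match(answer: str, expected: float, rel_tol: float = 1e-3) -> bool:
    """True if any number in `answer` is within `rel_tol` of `expected`."""
    # Drop thousands separators, then pull out integers and decimals.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return any(
        math.isclose(float(n), expected, rel_tol=rel_tol) for n in numbers
    )
```

With this check, `"96.124991"` matches an expected value of `96.12`, while `"95.0"` does not. The tolerance is a judgment call: too loose and the eval passes wrong answers, too tight and it repeats the original problem.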
Slide 181 — 1:31:52 (watch)
Slide 182 — 1:32:04 (watch)
![]() | The seventh and final step involves data sets and experiments. |
Slide 183 — 1:32:14 (watch)
![]() | This process transitions from merely identifying issues to actively enhancing your agent. |
Slide 184 — 1:32:32 (watch)
Slide 185 — 1:32:52 (watch)
![]() | For this, we navigate to a different part of the Phoenix UI. |
Slide 186 — 1:33:02 (watch)
![]() | We go to our experiments evaluation. |
Slide 187 — 1:33:16 (watch)
![]() | To produce your data set, access your traces and select a group of failing traces. Then, click "Add to Data Set." |
Slide 188 — 1:33:24 (watch)
![]() | You can create a new data set using the plus icon, or you can add to an existing data set. In this case, this allows you to capture AI agent financial failures. |
Slide 189 — 1:33:44 (watch)
Slide 190 — 1:34:14 (watch)
Slide 191 — 1:34:46 (watch)
Slide 192 — 1:35:26 (watch)
Slide 193 — 1:36:00 (watch)
Slide 194 — 1:36:14 (watch)
![]() | We have created a new classification evaluator with the label "Actionability." |
Slide 195 — 1:36:44 (watch)
Slide 196 — 1:37:28 (watch)
Slide 197 — 1:38:20 (watch)
Slide 198 — 1:39:06 (watch)
Slide 199 — 1:39:22 (watch)
![]() | Ideally, you should run each of these multiple times to account for the non-determinism of your output. |
Slide 200 — 1:39:30 (watch)
![]() | The pass-at-k concept will be discussed later, but for now, a single run provides a sufficient signal to determine whether the outcomes were correct or incorrect. |
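The transcript does not define pass-at-k at this point, so the sketch below assumes the common definition: the probability that at least one of k sampled runs passes, estimated from n recorded runs of which c passed. The function name is a hypothetical helper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n runs with c passes.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a
    random draw of k runs (without replacement) contains only failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k runs include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k=1` this reduces to the plain pass rate `c / n`, which is the quantity a single run per query approximates.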
Slide 201 — 1:39:44 (watch)
![]() | The evaluation-iteration cycle is where the true value of evaluation lies. You obtain your results, enhance them, and continue to improve your application gradually. |
Slide 202 — 1:40:06 (watch)
![]() | At this point, you might ask why a human is involved in this process at all. What if we took the output of the evaluation and provided it to Claude Code? |
Slide 203 — 1:40:54 (watch)
Slide 204 — 1:42:08 (watch)
Slide 205 — 1:43:48 (watch)
Slide 206 — 1:45:00 (watch)
Slide 207 — 1:45:48 (watch)
Slide 208 — 1:46:32 (watch)
Slide 209 — 1:46:54 (watch)
Slide 210 — 1:48:18 (watch)
Slide 211 — 1:49:50 (watch)
Slide 212 — 1:50:08 (watch)
![]() | This concludes our coverage for today. The process is a loop: you instrument, trace, evaluate, human annotate, analyze those annotations, improve your agent, and then repeat the cycle. |
Slide 213 — 1:50:20 (watch)
![]() | Some final tips: start small. You don't have to implement everything at once. Begin by reading your traces. |
Slide 214 — 1:50:32 (watch)
Slide 215 — 1:50:48 (watch)
Slide 216 — 1:51:08 (watch)
Slide 217 — 1:51:24 (watch)
![]() | Now is the time to implement this in practice. |
Slide 218 — 1:51:32 (watch)
![]() | You already have a Phoenix Cloud account linked to the Phoenix documentation. Phoenix is open source, so contributions to the project are welcome. |
Slide 219 — 1:52:08 (watch)
Slide 220 — 1:52:46 (watch)
![]() | This represents a significant upgrade in terms of capabilities. Thank you all for staying until the end. I now have time for any questions you may have. |