64 slides extracted.
Slide 1 — 0:04 (watch)
![]() | Hello, my name is Xiao, and I am a GPU software engineer. |
Slide 2 — 0:16 (watch)
![]() | Today, I will guide you through an exploration of Metal tensors and demonstrate how to write optimized custom ML kernels using tensor operations. |
Slide 3 — 0:34 (watch)
Slide 4 — 0:52 (watch)
![]() | There are several reasons to work at the Metal level. |
Slide 5 — 1:00 (watch)
Slide 6 — 1:14 (watch)
![]() | If you are working on a Metal-based application, the easiest way to get started is by using the tensor ops library. |
Slide 7 — 1:22 (watch)
![]() | Tensor ops is a Metal shading language API that accelerates tensor operations on the GPU, such as matrix multiplication and convolution. |
Slide 8 — 1:38 (watch)
Slide 9 — 1:56 (watch)
Slide 10 — 2:06 (watch)
![]() | You can refer to the related sessions to learn the fundamentals of tensor operations. |
Slide 11 — 2:12 (watch)
![]() | In this session, I will build on those basics by discussing best practices for working with quantized data. |
Slide 12 — 2:22 (watch)
![]() | I will demonstrate how to build advanced custom operations, including flash attention. |
Slide 13 — 2:26 (watch)
![]() | Let's dive into the first topic: working with quantized data. |
Slide 14 — 2:32 (watch)
![]() | State-of-the-art machine learning models are increasingly large. |
Slide 15 — 2:42 (watch)
Slide 16 — 2:52 (watch)
![]() | The standard approach for compressing weights is quantization. This involves taking higher precision weights and reducing them to lower precision data types. |
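To make the idea concrete, here is a minimal sketch of symmetric quantization in plain Python. It is an illustration only, not Metal code: the function names `quantize_int8` and `dequantize`, the choice of int8, and the single per-tensor scale are all assumptions made for this example.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to int8 codes plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Assumption for this sketch: one scale for the whole list, range [-127, 127].
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step, which is the precision traded away for the smaller storage size.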
Slide 17 — 3:08 (watch)
Slide 18 — 3:34 (watch)
Slide 19 — 3:48 (watch)
![]() | You can create IAFS quantized tensors and pass them to tensor operations, which automatically use any available hardware acceleration. |
Slide 20 — 4:04 (watch)
Slide 21 — 4:18 (watch)
![]() | You can store your quantized element data using this method. Next, we will discuss scale factors. |
Slide 22 — 4:36 (watch)
Slide 23 — 5:04 (watch)
Slide 24 — 5:26 (watch)
![]() | Matrix multiplication is a fundamental operation in machine learning workloads. |
Slide 25 — 5:34 (watch)
![]() | LLMs perform millions of matrix multiplications during inference. |
Slide 26 — 5:48 (watch)
Slide 27 — 6:02 (watch)
![]() | We can use quantization to reduce memory traffic and accommodate larger models in memory. |
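A back-of-the-envelope calculation shows the scale of the saving. The sketch below is plain Python with hypothetical numbers chosen for illustration (an 8-billion-parameter model, 4-bit weights, and one 8-bit scale factor per block of 32 weights); none of these figures come from the session.

```python
def weight_bytes(num_params, bits_per_weight, block_size=None, scale_bits=0):
    """Bytes needed for the weights, plus optional per-block scale factors."""
    data_bits = num_params * bits_per_weight
    scale_total_bits = (num_params // block_size) * scale_bits if block_size else 0
    return (data_bits + scale_total_bits) // 8


params = 8_000_000_000  # hypothetical model size
fp16_bytes = weight_bytes(params, 16)
q4_bytes = weight_bytes(params, 4, block_size=32, scale_bits=8)
ratio = fp16_bytes / q4_bytes
```

Under these assumptions the 16-bit weights take 16 GB while the 4-bit weights with scales take 4.25 GB, so each pass over the weights moves a bit less than a quarter of the data.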
Slide 28 — 6:24 (watch)
Slide 29 — 6:46 (watch)
![]() | Alternatively, if you prefer not to create a full MTLTensor on the host, you can create a temporary tensor directly on the shader stack. |
Slide 30 — 6:54 (watch)
![]() | The syntax is nearly identical; simply replace the tensor_handle tag with tensor_inline. |
Slide 31 — 7:02 (watch)
![]() | Pass your buffer pointers and other metadata to the tensor constructor to create a tensor on the stack. |
Slide 32 — 7:12 (watch)
Slide 33 — 7:26 (watch)
![]() | To achieve this, call the slice function on your input and output tensors using the threadgroup ID. Both the data and the scales plane are sliced together, based on the block size. |
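The way a slice keeps element data and scale factors in step can be sketched in plain Python. This is a conceptual model only, not the Metal slice API: the class `QuantizedMatrix`, the method `slice_cols`, and the layout (one scale per `block_size` consecutive elements in a row) are hypothetical names and assumptions for this example.

```python
class QuantizedMatrix:
    """Toy model: integer codes plus one scale per block of columns."""

    def __init__(self, data, scales, block_size):
        self.data = data              # rows of quantized codes
        self.scales = scales          # rows of per-block scale factors
        self.block_size = block_size  # elements covered by each scale

    def slice_cols(self, start, count):
        """Slice the data and the matching scales in one step."""
        if start % self.block_size or count % self.block_size:
            raise ValueError("slice must align to the block size")
        first_block = start // self.block_size
        num_blocks = count // self.block_size
        return QuantizedMatrix(
            [row[start:start + count] for row in self.data],
            [row[first_block:first_block + num_blocks] for row in self.scales],
            self.block_size,
        )


matrix = QuantizedMatrix(
    data=[[0, 1, 2, 3, 4, 5, 6, 7], [10, 11, 12, 13, 14, 15, 16, 17]],
    scales=[[0.5, 2.0], [1.0, 4.0]],
    block_size=4,
)
tile_id, tile_width = 1, 4  # stand-in for a threadgroup ID and its tile width
tile = matrix.slice_cols(tile_id * tile_width, tile_width)
```

Because the slice origin is a multiple of the block size, each tile carries exactly the scale factors that belong to its elements.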
Slide 34 — 7:46 (watch)
Slide 35 — 8:06 (watch)
![]() | In most cases, you should pass your quantized data directly to tensor operations to automatically leverage any available hardware acceleration. |
Slide 36 — 8:12 (watch)
![]() | If you need to dequantize a custom format, tensor operations can still accommodate this requirement. |
Slide 37 — 8:28 (watch)
Slide 38 — 8:44 (watch)
![]() | You can achieve this by dequantizing the data into a cooperative tensor, which can then be used as the input for the matmul2d operation. |
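The flow of dequantizing first and multiplying second can be sketched in plain Python. Everything here is an assumption made for illustration: the 2-bit `CODEBOOK` format, the helper names, and the nested-list tile that stands in for a cooperative tensor. It does not use the Metal API.

```python
# Hypothetical custom format: 2-bit codes that index a small lookup table.
CODEBOOK = [-1.0, -0.5, 0.0, 0.5]


def dequantize_tile(codes, scale):
    """Expand the codes into floats; this tile stands in for a cooperative tensor."""
    return [[CODEBOOK[c] * scale for c in row] for row in codes]


def matmul(a, b):
    """Plain matrix multiply of a (M x K) by b (K x N)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


activations = [[1.0, 2.0]]
codes = [[3, 0], [1, 2]]
weights = dequantize_tile(codes, scale=2.0)
out = matmul(activations, weights)
```

The dequantized tile never needs to exist as a full matrix in memory: only the tile currently being multiplied is expanded.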
Slide 39 — 8:54 (watch)
Slide 40 — 9:10 (watch)
![]() | To recap, Metal tensors natively support a wide range of quantized data types, including the new MX scaling formats and the E8M0 scale factors introduced in iOS and macOS 27. |
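Assuming the scale factors mentioned here are the E8M0 type defined in the OCP Microscaling (MX) specification, each scale is a single byte that holds only an exponent: the value is 2 raised to (byte minus 127), and 0xFF is reserved for NaN. A minimal Python sketch of decoding one and applying it to a block follows; the function names are hypothetical.

```python
import math


def decode_e8m0(byte):
    """Decode an E8M0 scale: 8 exponent bits, no sign, no mantissa, bias 127."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("E8M0 is a single byte")
    if byte == 0xFF:
        return math.nan  # reserved NaN encoding
    return 2.0 ** (byte - 127)


def apply_block_scale(elements, scale_byte):
    """Multiply every element in a block by its shared scale."""
    scale = decode_e8m0(scale_byte)
    return [e * scale for e in elements]


scaled = apply_block_scale([1, -2, 3], 128)
```

Because the scale is always a power of two, applying it only shifts the exponent of each element and introduces no rounding of its own within the representable range.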
Slide 41 — 9:22 (watch)
![]() | These new data types have additional alignment requirements compared to larger data types. Be sure to check the Metal documentation for details. |
Slide 42 — 9:34 (watch)
![]() | Now, let's advance to creating a more complex custom operation using tensor operations. |
Slide 43 — 9:54 (watch)
Slide 44 — 10:18 (watch)
Slide 45 — 10:34 (watch)
Slide 46 — 10:50 (watch)
Slide 47 — 11:02 (watch)
![]() | Tensor operations include a reduce_rows function to facilitate this process. |
Slide 48 — 11:12 (watch)
![]() | Threads will exchange data to calculate the maximum for each row. The result will be returned in another cooperative tensor. Let's set it up. |
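The reduction can be modeled in plain Python as two steps: each thread finds the maximum of the elements it holds, then the partial results are combined per row. This sketch is conceptual only. The function name `reduce_rows_max`, the `num_threads` parameter, and the way a row is split across threads are assumptions, not the Metal API or its actual data layout.

```python
def reduce_rows_max(tile, num_threads=2):
    """Row-wise maximum, modeled as per-thread partial maxima that are then combined."""
    row_max = []
    for row in tile:
        chunk = -(-len(row) // num_threads)  # ceiling division; rows assumed non-empty
        partials = [max(row[i:i + chunk]) for i in range(0, len(row), chunk)]
        row_max.append(max(partials))        # the "exchange" step between threads
    return row_max


scores = [[1.0, 5.0, 2.0, 0.0], [-3.0, -1.0, -2.0, -8.0]]
maxima = reduce_rows_max(scores)
```

The output has one value per row, which matches the description of the result being returned in a separate, smaller tensor.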
Slide 49 — 11:26 (watch)
Slide 50 — 11:46 (watch)
Slide 51 — 12:06 (watch)
Slide 52 — 12:20 (watch)
![]() | Now we are ready to multiply the cooperative tensor by V. |
Slide 53 — 12:28 (watch)
![]() | In macOS 26, you had to first store the cooperative tensor in threadgroup memory. Now, however, you can use cooperative tensors directly as inputs to matmul operations. |
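For reference, the arithmetic behind a flash-attention-style kernel can be sketched in plain Python for a single query row. The sketch processes keys and values in blocks, keeps a running row maximum and running sum, and rescales the accumulator before each multiplication by the V block. It is not the session's kernel: the function names and the block size are assumptions, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / total
            for d in range(len(values[0]))]


def blocked_attention(q, keys, values, block=2):
    """Online softmax: visit K/V one block at a time with running statistics."""
    m, l = -math.inf, 0.0              # running row max and running sum
    acc = [0.0] * len(values[0])       # unnormalized output accumulator
    for start in range(0, len(keys), block):
        kb, vb = keys[start:start + block], values[start:start + block]
        scores = [dot(q, k) for k in kb]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # rescales earlier blocks; 0.0 on the first
        p = [math.exp(s - m_new) for s in scores]
        l = l * correction + sum(p)
        acc = [a * correction + sum(pi * v[d] for pi, v in zip(p, vb))
               for d, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
expected = naive_attention(q, keys, values)
result = blocked_attention(q, keys, values)
```

Both functions produce the same output up to floating-point rounding, but the blocked version never holds the full row of scores at once.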
Slide 54 — 12:50 (watch)
Slide 55 — 13:14 (watch)
Slide 56 — 13:22 (watch)
![]() | These are the key tensor ops features needed to build an advanced operation like flash attention. |
Slide 57 — 13:30 (watch)
![]() | Now that we've discussed how to build this operation, let's examine its performance in a real model using Core AI. |
Slide 58 — 13:46 (watch)
Slide 59 — 14:24 (watch)
Slide 60 — 14:58 (watch)
![]() | I am prompting the model to label all pixels in the image that contain a car. |
Slide 61 — 15:06 (watch)
![]() | Now I will run the segmentation. |
Slide 62 — 15:18 (watch)
![]() | The final result shows that the model correctly segmented the image. The car is highlighted in blue, indicating that our attention kernel is fully integrated into the model as expected. |