75 slides extracted.
Slide 1 — 0:04 (watch)
![]() | Hello, I'm Tatiana, a Research Scientist on the MLX team. |
Slide 2 — 0:20 (watch)
Slide 3 — 0:38 (watch)
![]() | In our video, "Run Local Agentic AI on the Mac using MLX," we demonstrate how to run AI agents locally. |
Slide 4 — 0:52 (watch)
Slide 5 — 1:08 (watch)
Slide 6 — 1:20 (watch)
![]() | First, we will examine the complete hardware and software stacks that enable distributed workloads on Apple Silicon. |
Slide 7 — 1:30 (watch)
Slide 8 — 1:46 (watch)
Slide 9 — 2:02 (watch)
![]() | Most examples will be presented in the command-line interface. At the end, we will show that distributed communication is also accessible through the Python, Swift, and C++ APIs. |
Slide 10 — 2:12 (watch)
![]() | Let's begin by examining distributed communication for Apple Silicon. |
Slide 11 — 2:16 (watch)
![]() | To send and receive data quickly, machines must be connected through a physical link and interconnect. |
Slide 12 — 2:32 (watch)
Slide 13 — 2:56 (watch)
![]() | RDMA over Thunderbolt provides the high-bandwidth, low-latency communication necessary for distributed workloads. However, it only handles raw data movement between two machines. |
Slide 14 — 3:12 (watch)
Slide 15 — 3:34 (watch)
Slide 16 — 3:52 (watch)
![]() | The final piece of the stack is a machine learning framework that utilizes the communication backend for distributed inference and training: MLX. |
Slide 17 — 4:10 (watch)
Slide 18 — 4:26 (watch)
![]() | Now that we understand the full stack, let's put it all together and build a cluster, which is a group of machines working together on the same task. We will use four M3 Ultras. |
Slide 19 — 4:40 (watch)
Slide 20 — 4:52 (watch)
![]() | Next, we will examine how to connect the machines, the topologies supported by JACCL, and the trade-offs associated with each topology. |
Slide 21 — 5:00 (watch)
![]() | Then, we will demonstrate how to enable RDMA on the machines for fast communication. Finally, we will launch distributed jobs on the cluster using MLX. |
Slide 22 — 5:28 (watch)
Slide 23 — 6:18 (watch)
Slide 24 — 6:56 (watch)
Slide 25 — 7:08 (watch)
![]() | Now that we have connected all M3 Ultras together, we need to enable RDMA on all machines. |
Slide 26 — 7:20 (watch)
![]() | Open Settings on each machine, search for RDMA, click on "Enable RDMA over Thunderbolt," enable RDMA, and then reboot the machine. |
Slide 27 — 7:32 (watch)
![]() | Great, the Macs are connected with Thunderbolt 5 cables, and RDMA is enabled. Now we need a way to launch distributed programs. |
Slide 28 — 7:40 (watch)
![]() | One way to launch distributed programs is over the local network, using Wi-Fi or Ethernet. |
Slide 29 — 7:48 (watch)
Slide 30 — 8:02 (watch)
![]() | MLX provides a launch helper that automates this process for you. |
Slide 31 — 8:08 (watch)
![]() | You run mlx.launch on your MacBook, and it orchestrates the cluster for you. |
Slide 32 — 8:12 (watch)
![]() | You provide the executable you want to run and a JSON host file that describes your cluster. |
Slide 33 — 8:18 (watch)
![]() | From there, it SSHs into each node using the host names from the provided host file and starts the executable on every machine. |
Slide 34 — 8:28 (watch)
![]() | The host file that describes the cluster is structured as a JSON array, with one entry for each node. |
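
As a rough illustration of the host file described above, the following Python sketch writes a JSON array with one entry per node. The host names (`m3-ultra-1` and so on) and the `"ssh"` and `"ips"` keys are assumptions made for illustration; the transcript does not list the exact fields, so check the MLX documentation or the generated file for what your backend expects.

```python
import json

# Hypothetical host names for a four-machine cluster; replace with your own.
HOSTNAMES = ["m3-ultra-1", "m3-ultra-2", "m3-ultra-3", "m3-ultra-4"]


def build_hostfile(hostnames):
    """Return a list with one dictionary per node.

    The "ssh" and "ips" keys are assumed field names for illustration only.
    """
    return [{"ssh": name, "ips": []} for name in hostnames]


def hostfile_json(hostnames):
    """Serialize the host list as a JSON array string."""
    return json.dumps(build_hostfile(hostnames), indent=2)


if __name__ == "__main__":
    # Print the JSON array; redirect to a file to save it.
    print(hostfile_json(HOSTNAMES))
```

The point of the sketch is only the shape: a top-level array whose length equals the number of machines, with each entry carrying the host name the launcher connects to over SSH.
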
Slide 35 — 8:42 (watch)
Slide 36 — 8:58 (watch)
![]() | You can write the configuration manually, but MLX also provides a helper script, mlx.distributed_config, that generates it for you. |
Slide 37 — 9:26 (watch)
Slide 38 — 9:50 (watch)
![]() | Let's run this command to generate the host file for our cluster. |
Slide 39 — 10:02 (watch)
Slide 40 — 10:22 (watch)
Slide 41 — 10:38 (watch)
![]() | The cluster is now ready. We will move on to the exciting part: distributed language model inference and fine-tuning. The easiest way to get started is through the command-line interface and MLX LM. |
Slide 42 — 10:58 (watch)
Slide 43 — 11:40 (watch)
Slide 44 — 12:22 (watch)
Slide 45 — 12:42 (watch)
Slide 46 — 12:54 (watch)
![]() | The cluster generates tokens at nearly three times the rate of a single machine for the Qwen3.6 model, which is quite an impressive speedup. |
Slide 47 — 13:22 (watch)
Slide 48 — 13:58 (watch)
Slide 49 — 14:24 (watch)
Slide 50 — 14:42 (watch)
![]() | Low latency is important, which is why the mesh topology is crucial in this scenario. Every machine can reach every other machine in a single hop. |
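
To make the single-hop property concrete, here is a small Python sketch that counts the cables and ports a full mesh needs. The ring figures are included only for contrast and are an assumption on my part; the surviving transcript does not name the other topologies it compares against.

```python
def mesh_cables(n):
    """Cables needed to connect every pair of n machines directly."""
    return n * (n - 1) // 2


def mesh_ports_per_machine(n):
    """Ports each machine uses in a full mesh (one per peer)."""
    return n - 1


def ring_max_hops(n):
    """Worst-case hops between two machines on a bidirectional ring."""
    return n // 2


if __name__ == "__main__":
    n = 4  # four machines, as in the cluster built in this session
    print("mesh cables:", mesh_cables(n))
    print("mesh ports per machine:", mesh_ports_per_machine(n))
    print("mesh worst-case hops:", 1)
    print("ring worst-case hops:", ring_max_hops(n))
```

For four machines the mesh costs more cables and ports than a ring would, which is the trade-off for keeping every transfer to one hop.
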
Slide 51 — 14:54 (watch)
Slide 52 — 15:06 (watch)
![]() | Now, let's shard the one-trillion-parameter Kimi2.6 model across our cluster. |
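
A back-of-the-envelope calculation shows why a model of this size is split across machines. The sketch below counts weight memory only and assumes an even split; the 4-bit and 16-bit precisions are hypothetical values chosen for illustration, since the transcript does not state the precision used.

```python
def weight_bytes(num_params, bits_per_weight):
    """Total bytes needed to store the weights alone."""
    return num_params * bits_per_weight // 8


def per_machine_gb(num_params, bits_per_weight, num_machines):
    """Weight memory per machine in GB (10**9 bytes), assuming an even split."""
    return weight_bytes(num_params, bits_per_weight) / num_machines / 1e9


if __name__ == "__main__":
    params = 10**12  # one trillion parameters
    for bits in (4, 16):  # hypothetical precisions
        gb = per_machine_gb(params, bits, 4)
        print(f"{bits}-bit weights on 4 machines: {gb} GB each")
```

Activations, caches, and framework overhead come on top of these figures, so treat them as a lower bound.
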
Slide 53 — 15:20 (watch)
Slide 54 — 15:36 (watch)
Slide 55 — 16:02 (watch)
Slide 56 — 16:22 (watch)
Slide 57 — 16:36 (watch)
![]() | The faster we process the training data, the sooner fine-tuning is completed. To achieve this, we can utilize multiple machines to accelerate the process. |
Slide 58 — 16:54 (watch)
Slide 59 — 17:10 (watch)
![]() | With n machines, we can process data up to n times faster. This is a significant advantage. |
Slide 60 — 17:14 (watch)
![]() | We can use data parallelism with MLX LM. |
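
The idea behind data parallelism can be sketched in plain Python, without MLX: each machine computes a gradient on its own shard of the batch, and the shard gradients are averaged. This is a conceptual illustration under simplifying assumptions (a one-parameter linear model, equal shard sizes, a Python loop standing in for the machines), not how MLX LM implements it. In a real run the averaging would be a collective operation over the interconnect.

```python
def gradient(w, batch):
    """Mean gradient of the squared error (w*x - y)**2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def split(batch, n):
    """Split a batch into n equal shards; len(batch) must be divisible by n."""
    size = len(batch) // n
    return [batch[i * size:(i + 1) * size] for i in range(n)]


def data_parallel_gradient(w, batch, n):
    """Compute one gradient per shard, then average the shard gradients."""
    shard_grads = [gradient(w, shard) for shard in split(batch, n)]
    return sum(shard_grads) / n


if __name__ == "__main__":
    # Eight samples drawn from y = 3x, split across four simulated machines.
    data = [(float(x), 3.0 * x) for x in range(1, 9)]
    print("single machine:", gradient(0.0, data))
    print("four machines: ", data_parallel_gradient(0.0, data, 4))
```

With equal shards, the averaged gradient matches the full-batch gradient, so each machine handles only a fraction of the data per step while the model update stays the same.
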
Slide 61 — 17:30 (watch)
Slide 62 — 17:48 (watch)
![]() | We will fine-tune Qwen3.5, which has 9 billion parameters, on both a single machine and a cluster. We will compare the number of tokens processed by the model per second in each case. |
Slide 63 — 18:00 (watch)
Slide 64 — 18:16 (watch)
Slide 65 — 18:30 (watch)
![]() | With MLX, you can transform your devices into a local training cluster for efficient fine-tuning without relying on the cloud. |
Slide 66 — 18:46 (watch)
Slide 67 — 19:14 (watch)
Slide 68 — 19:36 (watch)
Slide 69 — 19:52 (watch)
Slide 70 — 20:16 (watch)
Slide 71 — 20:38 (watch)
![]() | You now understand both the high-level and low-level APIs for distributed inference and training with MLX and JACCL. You are prepared to build advanced distributed workloads using MLX. |
Slide 72 — 21:04 (watch)
Slide 73 — 21:26 (watch)
![]() | With a distributed cluster, you can now run local AI agents powered entirely by MLX, quickly and privately, on your own hardware. |
Slide 74 — 21:42 (watch)
Slide 75 — 21:56 (watch)
![]() | We look forward to seeing what you create with MLX on Apple Silicon. |