49 slides extracted.
Slide 1 — 0:02 (watch)
![]() | Thank you. |
Slide 2 — 0:28 (watch)
![]() | I'm Patrick, a member of the technical staff at Google DeepMind, where I work on the Gemini API and AI Studio. Today, I will discuss any-to-any and the process of building native multimodal agents. |
Slide 3 — 0:50 (watch)
Slide 4 — 1:06 (watch)
![]() | Any-to-any refers to the capabilities of the Gemini API. It supports a wide range of use cases because Gemini understands more than just text. |
Slide 5 — 1:20 (watch)
![]() | Gemini is natively multimodal, allowing you to input code, images, audio, video, URLs, and even Google Search. It can generate not only text but also images, speech, videos, function calls, and code. |
Slide 6 — 1:34 (watch)
![]() | This enables a wide range of exciting capabilities. |
Slide 7 — 1:44 (watch)
![]() | This slide may give the wrong impression; there are still different models in use. |
Slide 8 — 2:12 (watch)
Slide 9 — 2:48 (watch)
![]() | I want to focus on four key areas: multimodal understanding with Gemini, native image generation, native speech generation, and, if time permits, a brief discussion about the Live API. |
Slide 10 — 3:08 (watch)
Slide 11 — 3:32 (watch)
Slide 12 — 3:44 (watch)
![]() | It is connected through tool calls or function calls, which then invoke the other specialized models. |
Slide 13 — 4:04 (watch)
Slide 14 — 4:30 (watch)
Slide 15 — 4:46 (watch)
![]() | Ideally, we want cross-modal understanding, allowing the model to draw information from various sources and make connections. |
Slide 16 — 5:16 (watch)
Slide 17 — 5:54 (watch)
Slide 18 — 6:24 (watch)
Slide 19 — 7:06 (watch)
Slide 20 — 7:38 (watch)
![]() | This is multimodal understanding in a nutshell. |
Slide 21 — 7:44 (watch)
![]() | We can now use Gemini to understand various resources and generate a summary. |
Slide 22 — 7:50 (watch)
![]() | The timer is not working, so I'm unsure how much time I have left. However, I believe we are still in a good position. |
Slide 23 — 8:02 (watch)
![]() | The next phase is multimodal generation, which utilizes the agentic loop with Gemini as the core component. We integrate this with function calling. |
Slide 24 — 8:18 (watch)
Slide 25 — 8:32 (watch)
Slide 26 — 8:44 (watch)
![]() | The model we are using is NanoBanana2, which is well-known. We instruct it to create a picture. |
Slide 27 — 8:56 (watch)
![]() | In this case, it excels at creating infographics. You simply include it in your prompt, and it generates an infographic. It produces visually appealing graphics for us. |
Slide 28 — 9:12 (watch)
![]() | We also have a text-to-speech model based on Gemini 2.5. This model allows for various configurations, including the option to create a two-speaker audio file, which is suitable for a podcast style. |
Slide 29 — 9:28 (watch)
![]() | Here is a nice example, if the sound works. Does it work? |
Slide 30 — 9:48 (watch)
Slide 31 — 10:18 (watch)
Slide 32 — 10:58 (watch)
Slide 33 — 11:24 (watch)
![]() | This covers everything you need to set up the agentic function calling for the multimodal generation component. |
Slide 34 — 11:32 (watch)
![]() | I want to briefly discuss why native generation is important. We refer to these as native image generation models because they are based on Gemiini. |
Slide 35 — 12:08 (watch)
Slide 36 — 13:06 (watch)
Slide 37 — 13:34 (watch)
![]() | This is a quick checkpoint. |
Slide 38 — 13:44 (watch)
![]() | You now know how to perform both the understanding and generation components. This essentially covers the Notebook LM clone. I would like to briefly mention... |
Slide 39 — 14:06 (watch)
Slide 40 — 14:26 (watch)
![]() | I don't have time for a live demo, but you can try it at ai.studio live. |
Slide 41 — 14:32 (watch)
![]() | Here is a quick video from our colleague, Thor. |
Slide 42 — 14:46 (watch)
![]() | Hello, Gemini. How are you today? I'm doing well, thanks for asking. I'm just enjoying the chat. How are things with you? Can you see me? Yes, I can see you clearly. |
Slide 43 — 15:02 (watch)
![]() | I see you with your short hair and beard, wearing a dark brown jacket over a blue shirt. You can try it out for yourself at ai.studio live. That’s about it. |
Slide 44 — 15:14 (watch)
![]() | This slide demonstrates how to implement it in the code. Additionally, we have a skill available that you can configure. |
Slide 45 — 15:18 (watch)
![]() | Now we have covered all three checkpoints. |
Slide 46 — 15:24 (watch)
![]() | The pattern is transferable to every other field. |
Slide 47 — 15:34 (watch)
Slide 48 — 15:46 (watch)
![]() | This enables applications such as multimodal search. You can also use Gemma4 locally to achieve multimodal understanding, including voiceover for images and videos with native audio. |
Slide 49 — 16:00 (watch)
![]() | Thank you, and have fun building multimodal agents. |
















































