Google is building all the components of the AI technology stack, from custom chips to data centers to frontier models. As a result, our new Gemini 2.0 models are more capable, faster and more efficient than previous versions. These models are natively multimodal: they can process text, images, audio and video, and they can also generate images and steerable text-to-speech audio. With long context windows of up to 2 million tokens, Gemini can power advanced applications that require deep understanding and memory.
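As a minimal sketch of what a multimodal request looks like, the snippet below sends text and an image in a single call. It assumes the google-genai Python SDK (`pip install google-genai`); the model name, file path, and API-key handling are illustrative, not prescriptive:

```python
from google import genai
from google.genai import types

# Assumes the API key is set in the GOOGLE_API_KEY environment variable.
client = genai.Client()

# Load an example image; the path is a placeholder.
with open("chart.png", "rb") as f:
    image_bytes = f.read()

# A single request can mix text and image parts, since the model
# accepts multimodal input natively.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        "Summarize the trend shown in this chart.",
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ],
)
print(response.text)
```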
Additionally, our Thinking model can show its reasoning as it works through complex problems, which is especially useful in math and science. Gemini can also natively use tools like Google Search to access real-time information, and DeepMind’s Project Mariner has demonstrated that agents built with the Gemini model can complete tasks using a web browser. Conversational experiences can now be built with the Gemini Multimodal Live API, which accepts streaming audio and video input; a sketch of the simpler built-in tool use follows below. The combination of these capabilities enables a new class of agentic experiences, and we’re excited to see what startups build with Gemini in 2025.
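The request below enables Google Search as a tool so the model can ground its answer in real-time results when it chooses to. It again assumes the google-genai Python SDK; the model name and prompt are illustrative:

```python
from google import genai
from google.genai import types

client = genai.Client()  # API key from the GOOGLE_API_KEY environment variable

# With Google Search enabled as a tool, the model can decide to
# issue searches and ground its response in current information.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What are the top AI headlines this week?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```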