Google's Gemini 1.5 Pro can now not only see, but also hear

Google has updated its powerful Gemini 1.5 Pro artificial intelligence model to include, for the first time, the ability to listen to the content of audio and video files.

The update was announced at Google Cloud Next, where the search giant confirmed that the model can listen to uploaded clips and provide information about them without requiring a transcription.

This means you can upload a documentary or video presentation and ask questions about both the audio and the video in the clip.

This is part of Google's broader effort to create multimodal models that can understand a variety of input types, not just text. The move is made possible by the Gemini family of models being trained on speech, video, text, and code simultaneously.

Google introduced Gemini 1.5 Pro in February with a context window of up to 1 million tokens. Combined with its multimodal training, that capacity is what allows entire videos to be processed in a single prompt.

The technology giant is now adding audio to the model's input options. This means you can feed it a podcast and ask it to find key moments or specific mentions. The same can be done with the audio attached to a video file while the model analyzes the visual content.

The update applies to the middle tier of the Gemini family, which comes in three sizes: the smaller Nano for on-device use; Pro, which powers the free version of the Gemini chatbot; and Ultra, which powers Gemini Advanced.

For some reason, Google released the 1.5 update only for Gemini Pro, not Ultra. It is not clear whether a Gemini 1.5 Ultra will be released, and if so, when it will be accessible.

The large context window, which starts at 128,000 tokens (compared with Claude 3 Opus's 200,000) and extends to 1 million for certain approved users, means there is often no need to fine-tune the model on specific data. Simply load that data at the start of the chat and ask a question.
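As a rough illustration of that long-context workflow, a minimal sketch using Google's Vertex AI Python SDK (covered below) might look like the following. The project ID, model name, and file path here are placeholder assumptions, not details from the announcement.

```python
# A minimal sketch of the long-context workflow: instead of fine-tuning,
# the whole document is loaded into the prompt at the start of the chat.
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project, location, and model name -- substitute your own.
vertexai.init(project="my-gcp-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-preview-0409")

# Read a large reference document (hypothetical path) into the context.
with open("annual_report.txt", "r", encoding="utf-8") as f:
    document = f.read()

chat = model.start_chat()
response = chat.send_message(
    [document, "Using only the document above, list the three biggest risks mentioned."]
)
print(response.text)
```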

This update also means that Gemini can now generate transcripts of video clips.

So when will regular users get to try it? Probably not until after the Google I/O developer conference next month. At this time, it is only available through Vertex AI, the developer dashboard on Google Cloud.

Vertex AI is a powerful tool for interacting with various models, building AI applications, and testing what is possible, but it is not widely accessible and is primarily aimed at developers, businesses, and researchers rather than consumers.

Vertex AI allows users to upload visual or audio media, such as a short film or a recording of a person giving a speech, and add text prompts. These could be "Give me 5 bullet points that summarize this speech" or "How many times was the word Gemini said?"
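A hedged sketch of that flow with the Vertex AI Python SDK is below; the Cloud Storage path and preview model name are placeholder assumptions.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholder project
model = GenerativeModel("gemini-1.5-pro-preview-0409")  # preview name, may change

# Point the model at an audio file in Cloud Storage (hypothetical bucket/path).
speech = Part.from_uri("gs://my-bucket/keynote_speech.mp3", mime_type="audio/mpeg")

# Combine the media with a text prompt in a single request.
response = model.generate_content(
    [speech, "Give me 5 bullet points that summarize this speech."]
)
print(response.text)

# The same pattern works for video: pass a video Part (e.g. mime_type="video/mp4")
# and ask for a transcript or a count of how many times a word is said.
```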

The primary users of Gemini 1.5 Pro are businesses, and Google already has partnerships with TBS, Replit, and others, which use the model for tasks such as tagging metadata and generating code.

Google has also begun using Gemini 1.5 Pro in its own products, including Code Assist, a generative AI coding assistant that can track changes across large codebases.

The change to Gemini 1.5 Pro was announced at Google Cloud Next alongside a major update to the DeepMind AI image model Imagen 2 that enhances Gemini's image generation capabilities.

Imagen 2 gains inpainting and outpainting, which let users remove elements from, or add elements to, a generated image. This is similar to recent updates OpenAI made to its DALL-E model.

Google is also working to ground AI responses in Gemini and other platforms with Google Search, to ensure that they always contain up-to-date information.
