Google researchers have been working overtime of late, releasing new models and ideas left and right, from an AI agent that learns to play games to the latest: a way to turn a static image into a controllable avatar.
VLOGGER is not currently available for trial, but from the demo, it looks amazingly realistic, with avatars that can be created and controlled by voice.
Tools like Pika Labs' Lip Sync, HeyGen's video translation service, and Synthesia can already do something somewhat similar, but this seems like a simpler, lower-bandwidth option.
Currently, VLOGGER is just a research project with some fun demo videos, but if commercialized it could be a new way to communicate on Teams or Slack.
This is an AI model that can create animated avatars from still images and maintain the photorealistic appearance of the person in the photo in every frame of the final video.
The model also takes an audio file of a person speaking and generates body and lip movements that reflect how the person would naturally move as he or she speaks.
This includes not only head movements, facial expressions, gaze, and blinking, but also hand gestures and upper-body movements, all generated from nothing more than the image and the audio.
The model is built on the diffusion architecture that powers text-to-image, video, and even 3D generation models such as Midjourney and Runway, but with additional control mechanisms added.
VLOGGER takes multiple steps to produce a generated avatar. First, it takes audio and an image as input, passes them through a 3D motion generation process, then uses a "temporal diffusion" model to determine timing and motion, and finally upscales the result into the final output.
Basically, a neural network is trained to predict how the face and body move over time, including pose, gaze, and facial expressions, using the still image as the starting frame and the audio as a guide.
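To make that flow concrete, here is a minimal sketch of how such a two-stage pipeline could be wired together. VLOGGER's code has not been released, so every function name and shape below is an illustrative assumption, with trivial placeholders standing in for the real networks.

```python
# Illustrative sketch of the image + audio -> motion -> video -> upscale pipeline
# described above. All functions are hypothetical placeholders, not Google's code.
import numpy as np

def predict_motion(reference_image: np.ndarray, audio: np.ndarray, fps: int = 25) -> np.ndarray:
    """Stage 1 (stand-in): map the speech audio to per-frame 3D motion parameters
    (head pose, gaze, expression, hand and body gestures)."""
    num_frames = int(len(audio) / 16000 * fps)   # assumes 16 kHz audio
    return np.zeros((num_frames, 128))           # placeholder motion codes

def render_frames(reference_image: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stage 2 (stand-in): a temporal diffusion model would render one photorealistic
    frame per motion code, conditioned on the reference image."""
    return np.repeat(reference_image[None], len(motion), axis=0)  # placeholder frames

def upscale(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """Final step (stand-in): super-resolve the generated frames."""
    return frames.repeat(factor, axis=1).repeat(factor, axis=2)   # nearest-neighbour

def generate_avatar_video(reference_image: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """Image + audio -> motion sequence -> diffusion-rendered video -> upscaled video."""
    motion = predict_motion(reference_image, audio)
    frames = render_frames(reference_image, motion)
    return upscale(frames)

# Example: a 256x256 RGB photo and one second of 16 kHz audio.
video = generate_avatar_video(np.zeros((256, 256, 3), np.uint8), np.zeros(16000))
print(video.shape)  # (25, 512, 512, 3)
```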
Training the model required a large multimedia dataset called MENTOR. This dataset contains 800,000 videos of different people talking, with faces and body parts labeled at every moment.
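The exact structure of MENTOR has not been published, but conceptually each labeled frame would pair a video frame and its speech audio with per-frame annotations along these lines; every field name below is a hypothetical example, not the real schema.

```python
# Hypothetical example of what a single MENTOR-style annotation record might hold.
frame_annotation = {
    "video_id": "clip_000123",                       # which source video the frame comes from
    "frame_index": 417,                              # position of the frame in that video
    "audio_offset_s": 16.68,                         # timestamp aligning the frame to the speech
    "face_landmarks": [(0.42, 0.31), (0.44, 0.30)],  # normalized 2D facial keypoints (truncated)
    "body_keypoints": [(0.50, 0.62), (0.48, 0.75)],  # upper-body joint positions (truncated)
    "hand_keypoints": [(0.61, 0.70)],                # hand landmarks (truncated)
}
```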
This is a research preview, not an actual product, and while it can generate realistic motion, the videos do not always move the way a real person would. At its core it is still a diffusion model, which is prone to anomalous behavior.
The team notes that it also struggles with particularly large movements and diverse environments, and that it can only handle relatively short videos.
According to the Google researchers, one of the main use cases is video translation: taking an existing video in one language and editing the lips and face to match newly translated audio.
Other potential use cases include animated avatars for virtual assistants and chatbots, and virtual game characters that look and move realistically.
Tools that do something similar already exist, including Synthesia, which allows users to walk into the company's office and have their own virtual avatar created to deliver presentations, but this new model seems to make the process much easier.
One potential use is low-bandwidth video communication. Future versions of this model may allow voice-driven video chat by animating a static avatar. This could prove particularly useful in VR environments on headsets such as the Meta Quest or Apple Vision Pro, as it works independently of each platform's own avatar models.