Google Makes It Easy to Turn Photos into Short Videos - Meet VDIM

Google has unveiled a new AI model that can take two images and fill in the gaps between them, creating a seamless animation that closely resembles live action.

VDIM (Video Interpolation With Diffusion Models) was created by Google's research division, DeepMind. It uses one image as the first frame and the other as the last frame; all of the frames in between are then generated by AI to create the video.

This could be great for bringing to life a photo of a child playing in the park, or for recovering a moment from an event where you forgot to capture the action.

While currently only a research preview, the underlying technology may one day become an everyday part of smartphone photography.

VDIM converts still images into video by generating the missing frames with a diffusion model, similar to those behind Midjourney, DALL-E, and Google's own Imagen 2.
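To make the diffusion idea concrete, here is a toy sketch of the basic loop those models share: start from pure noise and repeatedly refine it toward an image. The `denoise_step` function is an invented stand-in (a real model learns this step from training data); nothing here is VDIM's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.ones((8, 8))  # stand-in for "the image the model should produce"

def denoise_step(x, strength):
    # Fake denoiser: nudge the current sample toward the target.
    # A real diffusion model predicts this correction with a neural network.
    return x + strength * (target - x)

x = rng.normal(size=(8, 8))            # start from pure noise
for _ in range(50):                    # iterative refinement, coarse to fine
    x = denoise_step(x, strength=0.1)

print(float(np.abs(x - target).mean()))  # near zero after refinement
```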

It begins by creating a low-resolution version of the complete final video. This is done by running a cascade of diffusion models in sequence, continuously refining the output. This first step allows VDIM to capture the motion and dynamics of the final video.

This draft is then passed to a higher-resolution stage, where it is upscaled and refined to match the input images more closely and to make the motion more fluid.
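As a rough illustration of that two-stage cascade, the sketch below generates a low-resolution draft of the whole clip first, then upscales it while pinning the endpoints to the original photos. Everything here, including the `base_model` and `upsampler` names and the linear-blend placeholders, is hypothetical; the real system would use trained diffusion networks at each stage.

```python
import numpy as np

def base_model(first, last, n_frames):
    """Stage 1 stand-in: a low-resolution draft of the whole clip.

    A real cascade would run diffusion models here; we linearly blend
    4x-downsampled keyframes so the sketch is runnable.
    """
    lo_first, lo_last = first[::4, ::4], last[::4, ::4]
    ts = np.linspace(0.0, 1.0, n_frames)
    return np.stack([(1 - t) * lo_first + t * lo_last for t in ts])

def upsampler(draft, first, last):
    """Stage 2 stand-in: upscale the draft, conditioned on the sharp inputs."""
    up = np.stack([np.kron(f, np.ones((4, 4))) for f in draft])  # 4x nearest-neighbor
    up[0], up[-1] = first, last   # pin endpoints to the original photos
    return up

first = np.random.rand(64, 64)    # toy grayscale "photos"
last = np.random.rand(64, 64)
video = upsampler(base_model(first, last, n_frames=9), first, last)
print(video.shape)                # (9, 64, 64)
```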

One potential use case for VDIM that the team examined in their research paper is video restoration. AI has already been used to improve old images, and the same idea could be applied to cleaning up old family films or restoring footage with damaged frames.

Older films may have burned-out frames in the middle of a sequence, making them difficult to watch, or several frames marred by scratches.

VDIM is given a clean frame at the beginning and end of the damaged stretch and recreates the motion between those two points.
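A hedged sketch of how that restoration workflow might look in code: scan for runs of damaged frames, then regenerate each run from the clean frames on either side. The `interpolate` function is an invented placeholder for a VDIM-style model, and the loop assumes the first and last frames of the clip are clean.

```python
import numpy as np

def interpolate(first, last, n):
    """Placeholder for a VDIM-style model: return n in-between frames.

    We linearly blend here; the real model would synthesize actual motion.
    """
    return [(1 - t) * first + t * last for t in np.linspace(0, 1, n + 2)[1:-1]]

def restore(frames, damaged):
    """Replace each run of damaged frames using the clean frames around it.

    Assumes the first and last frames of the clip are clean.
    """
    out = list(frames)
    i = 0
    while i < len(frames):
        if i in damaged:
            j = i
            while j in damaged:   # find the end of the damaged run
                j += 1
            out[i:j] = interpolate(frames[i - 1], frames[j], j - i)
            i = j
        else:
            i += 1
    return out

clip = [np.full((8, 8), float(k)) for k in range(10)]   # toy frames 0..9
fixed = restore(clip, damaged={3, 4, 5})
print([f.mean() for f in fixed[2:7]])   # smooth values across the repaired run
```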

Since VDIM is a research project, no one outside the Google DeepMind team has actually used it yet, but the example clips are a promising start for a new type of AI video.

Examples shared by Google DeepMind include the start of a box-cart race generated from only two still images.

Another video transformed two still images of a woman on a swing into fluid swinging motion.

Personally, I think this is one research project that Google should pursue and find a way to ship in real software, especially for video restoration, and especially if it can be extended beyond a few seconds or a few dozen frames.
