Google’s DeepMind AI Can Now Generate Music for Video — And Create Full-Blown Soundtracks

Google DeepMind AI audio from video
Photo Credit: Google

Google has shared an update on its DeepMind AI and its ability to generate music that accompanies video, creating full-fledged soundtracks.

The video-to-audio (V2A) process combines video pixels with natural language text prompts to generate a soundscape for the video. Google pairs its V2A technology with video generation models like Veo to create shots that include a dramatic score, realistic sound effects, or dialogue that matches the characters and tone of a video. The model can also generate soundtracks for traditional footage, including archival material, silent films, and more.

Google says the new process will give audio engineers enhanced creative control because it can generate an unlimited number of soundtracks from any video input. Engineers can use positive and negative prompting to change the feel of the music. Positive prompting guides the model toward desired sound outcomes, while negative prompting guides it away from undesirable sounds.
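Google has not published V2A's internals, but in diffusion models this kind of steering is commonly implemented as guidance arithmetic over the model's noise estimates. The sketch below is a rough, hypothetical illustration of that idea; the function name, the vector-based stand-ins for score estimates, and the single `scale` parameter are all assumptions, not Google's API.

```python
def guided_noise_estimate(unconditional, positive, negative, scale=3.0):
    """Hypothetical guidance step: nudge the model's output toward the
    positive prompt's estimate and away from the negative prompt's.
    Each argument is a list of floats standing in for a score estimate."""
    return [
        u + scale * (p - u) - scale * (n - u)
        for u, p, n in zip(unconditional, positive, negative)
    ]

# With scale=1.0, a positive pull of +1 and a negative push away from -1
# combine into a +2 shift on the single component.
print(guided_noise_estimate([0.0], [1.0], [-1.0], scale=1.0))
```

Larger `scale` values make the prompts dominate more strongly, which matches the article's framing: positive prompts attract the output, negative prompts repel it.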

How Does DeepMind AI’s Video-to-Audio Technology Work?

Google says it experimented with autoregressive and diffusion approaches to discover the most scalable AI architecture. The diffusion-based approach for audio generation gave the most realistic and compelling results for synchronizing video and audio information. This V2A system starts by encoding video input into a compressed representation. Then, Google’s diffusion model iteratively refines the audio from random noise. The process is guided by visual input from the video and natural language prompts created by the engineer.
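The pipeline Google describes — encode the video into a compressed representation, then iteratively refine audio starting from random noise, guided by the video features and the text prompt — can be sketched as a toy loop. Everything below is illustrative: the encoder, the denoising rule, and the `prompt_bias` knob are stand-ins, not the real diffusion model.

```python
import random

def encode_video(frames):
    # Stand-in for the compressed video representation: here just a
    # per-frame mean brightness value.
    return [sum(f) / len(f) for f in frames]

def denoise_step(audio, video_code, prompt_bias, t):
    # Toy refinement rule: pull the noisy audio a fraction of the way
    # toward a target derived from the video features and prompt bias.
    target = [v + prompt_bias for v in video_code]
    return [a + (tgt - a) / t for a, tgt in zip(audio, target)]

def video_to_audio(frames, prompt_bias=0.0, steps=50, seed=0):
    rng = random.Random(seed)
    code = encode_video(frames)                # 1. encode video input
    audio = [rng.gauss(0, 1) for _ in code]    # 2. start from random noise
    for t in range(steps, 0, -1):              # 3. iteratively refine
        audio = denoise_step(audio, code, prompt_bias, t)
    return audio

print(video_to_audio([[0.2, 0.4], [0.6, 0.8]], prompt_bias=0.1))
```

The structure mirrors the description in the article: noise in, video conditioning and prompt guidance at every step, synchronized output at the end.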

The result is synchronized, realistic audio that closely aligns with the prompt instructions and the video content. “To generate higher quality audio and add the ability to guide the model towards generating specific sounds, we added more information to the training process, including AI-generated annotations with detailed descriptions of sound and transcripts of spoken dialogue,” Google says.

Training the model on video, audio, and additional annotations means the technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts. Think a swelling score that reaches its crescendo as the video peaks over a mountaintop—evoking a certain feeling of majesty.
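A training example in this setup pairs four things: the video, the target audio, an AI-generated sound annotation, and a dialogue transcript. The record layout below is purely illustrative; Google has not published its training data format, and every field name here is an assumption.

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    """Hypothetical record pairing the modalities the article describes."""
    video_frames: list        # pixel data for each frame
    audio_waveform: list      # target audio samples
    sound_annotation: str     # AI-generated description of the sound
    dialogue_transcript: str  # transcript of any spoken dialogue

example = TrainingExample(
    video_frames=[[0.1, 0.2]],
    audio_waveform=[0.0, 0.5, -0.5],
    sound_annotation="orchestral score swelling to a crescendo",
    dialogue_transcript="",
)
print(example.sound_annotation)
```

Training on all four fields together is what lets the model associate audio events with visual scenes while still responding to annotations and transcripts.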

Google says the model depends heavily on high-quality video footage to produce high-quality audio; artifacts or distortions in the input video can cause a noticeable drop in audio quality. The company is also working on lip-syncing technology for videos with characters, but the model can produce mismatches that result in uncanny lip-syncing, such as audible dialogue while a character's lips aren't moving.