To keep up with Meta’s AI video generator Make-A-Video, Google has detailed its work on Imagen Video, an AI system that can also generate video clips from a text prompt (e.g., “teddy bear washing dishes”). While the results aren’t perfect (the looping clips the system generates tend to have artifacts and noise), Google says Imagen Video is a step toward a system with a “high degree of control” and world knowledge, including the ability to generate footage in a variety of artistic styles.
Text-to-video systems are nothing new. Earlier this year, a group of researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence released CogVideo, a program that can convert text into fairly high-quality short clips. But Imagen Video appears to be a significant step beyond that earlier work, demonstrating an aptitude for animating captions.
“It’s definitely an improvement,” said Matthew Guzdial, an assistant professor at the University of Alberta studying AI and machine learning. “As you can see from the video examples, even though the comms team is selecting the best outputs there’s still weird blurriness and artificing. So this definitely is not going to be used directly in animation or TV anytime soon. But it, or something like it, could definitely be embedded in tools to help speed some things up.”
Imagen Video is based on Google’s Imagen, an image generation system comparable to OpenAI’s DALL-E 2 and to Stable Diffusion. Imagen is a so-called “diffusion” model that generates new data (such as video) by learning to destroy and then rebuild many existing data samples. As the model is fed more samples, it gets better at recovering the data it previously destroyed, and it uses that skill to create new works.
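To make the “destroy” half of that description concrete, here is a minimal sketch of the noising step used in standard diffusion models. It is not Google’s code: the function names, the toy four-value “sample” and the noise schedule are all assumptions chosen for illustration, and the learned “rebuild” half (a trained denoising network) is omitted entirely.

```python
import math
import random


def forward_diffuse(sample, betas, rng):
    """Return the sample plus each progressively noisier version of it.

    Each step keeps sqrt(1 - beta) of the current values and mixes in
    Gaussian noise scaled by sqrt(beta), as in standard diffusion models.
    """
    trajectory = [list(sample)]
    current = list(sample)
    for beta in betas:
        current = [
            math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for x in current
        ]
        trajectory.append(current)
    return trajectory


def signal_fraction(betas):
    """Fraction of the original signal amplitude left after every step."""
    keep = 1.0
    for beta in betas:
        keep *= 1.0 - beta
    return math.sqrt(keep)


if __name__ == "__main__":
    # Hypothetical schedule: 1,000 steps with noise rising from 0.0001 to 0.02.
    steps = 1000
    betas = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]
    toy_sample = [1.0, -1.0, 0.5, 0.0]  # stand-in for pixel values
    noisy = forward_diffuse(toy_sample, betas, random.Random(0))[-1]
    print("surviving signal fraction:", signal_fraction(betas))
    print("fully noised sample:", noisy)
```

With a schedule like this, almost none of the original signal survives the final step. A diffusion model is trained to undo the process one step at a time, which is what lets it start from pure noise and arrive at a new sample.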
As the Google research team behind Imagen Video explains, the system takes a text description and generates a 16-frame video at three frames per second with a resolution of 24 by 48 pixels. The system then upscales the clip and “predicts” additional frames, creating a final 128-frame video at 24 frames per second at 720p (1280×768).
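The figures above imply that the upscaling stages add detail in both time and space without making the clip any longer. The short calculation below works that out from the numbers quoted in this article; the variable and function names are purely illustrative.

```python
from fractions import Fraction

# Figures as quoted in the article.
BASE_FRAMES, BASE_FPS = 16, 3
FINAL_FRAMES, FINAL_FPS = 128, 24
BASE_PIXELS = 24 * 48       # pixels per frame in the initial clip
FINAL_PIXELS = 1280 * 768   # pixels per frame in the final clip


def clip_seconds(frames, fps):
    """Playback duration of a clip, kept as an exact fraction."""
    return Fraction(frames, fps)


base_duration = clip_seconds(BASE_FRAMES, BASE_FPS)
final_duration = clip_seconds(FINAL_FRAMES, FINAL_FPS)
frame_multiplier = FINAL_FRAMES // BASE_FRAMES
pixel_multiplier = Fraction(FINAL_PIXELS, BASE_PIXELS)

if __name__ == "__main__":
    print(f"initial clip: {float(base_duration):.2f} s")
    print(f"final clip:   {float(final_duration):.2f} s")
    print(f"frames: x{frame_multiplier}")
    print(f"pixels per frame: about x{float(pixel_multiplier):.0f}")
```

Both clips run for about 5.3 seconds. The final video has eight times as many frames and roughly 850 times as many pixels per frame, all of which the system has to predict rather than copy from the initial clip.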
Unlike many of today’s image-generation systems, Imagen Video can also render text properly.
But this does not mean that Imagen Video is free of limitations. As with Make-A-Video, even the hand-picked clips from Imagen Video are jittery and distorted in places.
To improve on this, the Imagen Video team plans to join forces with the researchers behind Phenaki, another text-to-video system from Google that debuted today and can turn long, detailed prompts into two-minute videos, albeit at lower quality.
The researchers also note that the data used to train Imagen Video contained problematic content, which could lead the system to produce graphically violent or sexually explicit clips. Google says it won’t release Imagen Video’s model or source code “until these issues are resolved” and, unlike Meta, won’t offer a sign-up form for interested parties.
However, given how quickly text-to-video technology is developing, an open-source model may soon emerge that both stimulates human creativity and creates intractable problems around deepfakes, copyright, and misinformation.