Artificial intelligence-based image generators have been the talk of the town lately, but Meta researchers have gone one step further and publicly unveiled a new text-to-video generator, reports The Verge.
Meta’s machine learning engineering team has introduced a new system called Make-A-Video. This AI model allows users to enter a rough description of the scene, and it generates a short video that matches their text. The videos are clearly artificial, with blurry objects and distorted animations, but are still a significant advance in the field of AI content generation.
In a Facebook post, Meta CEO Mark Zuckerberg described the work as “amazing progress,” adding:
“It’s much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they’ll change over time.”
The clips last no more than five seconds and contain no sound, but cover a huge range of prompts. While it is clear that the videos are computer generated, the quality of such AI models will improve rapidly in the near future. In just a few years, AI image generators have gone from creating almost incomprehensible pictures to photorealistic content. And while progress in video may be slower given the almost limitless complexity of the subject area, the rewards of seamless video generation will motivate many organizations and companies to invest significant resources in the effort.
In its blog post announcing Make-A-Video, Meta notes that video generation tools can be invaluable “for creators and artists.” But there are also troubling prospects: the output of these tools could be used for disinformation, propaganda and, most likely, the creation of nonconsensual pornography.
Meta says it wants to be “thoughtful about how we build new generative AI systems like this,” and for now it is publishing only a paper on the Make-A-Video model. The company says it plans to release a demo version of the system, but has not said when or how it will be rolled out.
In a paper describing the model, the Meta researchers note that Make-A-Video is trained on pairs of images and captions, as well as on unlabeled video footage. The training content was drawn from two datasets, WebVid-10M and HD-VILA-100M, which together contain millions of videos spanning hundreds of thousands of hours of footage. This includes stock footage created by sites like Shutterstock and scraped from the web.
The researchers note that the model has many technical limitations beyond blurry frames and disjointed animation. Currently, Make-A-Video outputs 16 frames of video at a resolution of 64×64 pixels, which are then upscaled by a separate AI model to 768×768 pixels.
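To put those numbers in perspective, the short Python sketch below works out how much of the final clip the upscaling model has to fill in. It uses only the figures quoted above (16 frames, 64×64 and 768×768). The variable and function names are illustrative and are not taken from Meta's paper, and the sketch assumes the frame count stays at 16 after upscaling, which the article does not confirm.

```python
# Figures quoted in the article; all names here are illustrative only.
FRAMES = 16        # frames generated per clip
BASE_SIDE = 64     # side length of each generated frame, in pixels
OUTPUT_SIDE = 768  # side length after the separate upscaling model


def pixels_per_clip(frames: int, side: int) -> int:
    """Total pixel count for a clip made of square frames."""
    return frames * side * side


# How much larger each frame becomes, per side and in total pixels.
side_scale = OUTPUT_SIDE // BASE_SIDE
pixel_scale = side_scale ** 2

# Assumption: the upscaler keeps the same 16 frames.
base_pixels = pixels_per_clip(FRAMES, BASE_SIDE)
output_pixels = pixels_per_clip(FRAMES, OUTPUT_SIDE)

print(f"Upscaling factor per side: {side_scale}x")
print(f"Upscaling factor in pixels: {pixel_scale}x")
print(f"Pixels generated by the base model: {base_pixels:,}")
print(f"Pixels in the upscaled clip: {output_pixels:,}")
```

In other words, each frame grows 12 times along each side, so under these assumptions the base model produces fewer than 1 percent of the pixels in the finished clip and the upscaler supplies the rest.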