On Monday, Reddit user chaindrop shared a 19-second video on the r/StableDiffusion subreddit. The AI-generated video quickly went viral on social media and drew mixed reactions, Ars Technica reports.
The video consists of 10 two-second segments that were generated independently by artificial intelligence and then stitched together. Each segment shows a modeled Will Smith devouring spaghetti from a different angle, and in some frames two generated Will Smiths appear at once. The entire video is computer generated with an open source artificial intelligence tool called ModelScope. The generator was developed by DAMO Vision Intelligence Lab, a research division of Alibaba, and was released a few weeks ago.
ModelScope is a text2video (text-to-video) diffusion model trained to create new videos by analyzing millions of images and thousands of videos from datasets such as LAION-5B, ImageNet, and WebVid. These datasets include material from Shutterstock, which explains the watermark that appears in the output. An online demonstration of ModelScope is now hosted on the AI community site Hugging Face, but running it requires an account and paying for compute time.
According to chaindrop, the video creation workflow was simple. The user gave ModelScope the prompt “Will Smith is eating spaghetti” and the model generated the clips at 24 frames per second (FPS). The interpolation tool Flowframes was then used to increase the frame rate from 24 to 48 FPS, and the footage was slowed down to half speed, resulting in a smoother video.
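The frame-rate arithmetic behind that last step can be sketched in a few lines of Python. This is only an illustration: the function name, its parameters, and the one-second example clip are assumptions made here, not part of chaindrop's workflow or of Flowframes, and the frame count after interpolation is approximated as an exact doubling.

```python
from fractions import Fraction


def interpolate_and_slow(frame_count, source_fps, factor=2, slowdown=2):
    """Estimate the result of frame interpolation followed by a slowdown.

    Hypothetical helper for illustration only. Returns a tuple of
    (frames after interpolation, playback FPS, duration in seconds).
    """
    # Interpolating from 24 to 48 FPS roughly doubles the number of frames.
    # Real tools may produce slightly fewer (for example 2n - 1); this
    # sketch assumes an exact multiple.
    new_frames = frame_count * factor
    interpolated_fps = source_fps * factor

    # Playing the interpolated footage at half speed divides the frame rate.
    playback_fps = Fraction(interpolated_fps, slowdown)

    # Duration is the number of frames divided by frames shown per second.
    duration = Fraction(new_frames) / playback_fps
    return new_frames, playback_fps, duration


# Hypothetical one-second clip generated at 24 FPS.
frames, fps, seconds = interpolate_and_slow(frame_count=24, source_fps=24)
print(frames, fps, seconds)
```

Under these assumptions, the one-second clip ends up with 48 frames played back at 24 FPS, so it lasts two seconds: the motion is half as fast, but each second of playback still shows 24 distinct frames, which is why the result looks smoother than simply slowing the original footage.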
While ModelScope isn’t the only text2video tool available, it has drawn a lot of attention since the video of Will Smith eating spaghetti went viral. Other text2video tools include Runway’s Gen-2 and early text2video research projects from Meta and Google. Similar videos, including ones showing Scarlett Johansson and Joe Biden eating spaghetti, have also appeared on the Internet. In one particularly horrifying video, Will Smith eats meatballs. Creepy as it is, this video has become perfect material for future memes.
As the video shows, AI is still not very good at turning text into video, unlike popular free AI tools for generating images from text descriptions, whose output can even pass for real photographs. However, work in this direction continues. If images easily generated by artificial intelligence already raise concerns about the spread of misinformation, the situation will most likely become even worse with video.