Stability AI has entered the video generation space with the launch of Stable Video Diffusion (SVD), releasing two AI models, SVD and SVD-XT, that generate short video clips from still images.
For now, both models are available for research purposes only. According to the company, SVD and SVD-XT produce high-fidelity output that rivals or even exceeds that of other existing AI video generators.
Stability AI has open-sourced both image-to-video models as part of a research preview and plans to use user feedback to fine-tune them, with the eventual goal of commercial application.
A company blog post explains that SVD and SVD-XT are latent diffusion models that generate 576 x 1024 videos from a single still image, which serves as a conditioning frame. The output videos are brief, maxing out at four seconds, and can be generated at frame rates ranging from three to 30 frames per second. The SVD model is trained to generate 14 frames from a still image, while SVD-XT can generate up to 25.
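The relationship between frame count, frame rate, and clip length follows directly from those figures: duration is simply frames divided by frames per second. A minimal sketch, using the frame counts Stability AI cites for SVD (14) and SVD-XT (25) and illustrative frame rates within the stated 3-30 fps range:

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Return the playback duration of a clip in seconds (frames / fps)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps

# SVD's 14 frames played back at 7 fps yield a 2-second clip;
# SVD-XT's 25 frames at 25 fps last exactly 1 second.
print(clip_duration_seconds(14, 7))   # 2.0
print(clip_duration_seconds(25, 25))  # 1.0
```

The same frame budget therefore spans a wide range of clip lengths depending on the chosen playback rate.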
To build SVD, Stability AI curated a large video dataset of roughly 600 million samples. The company used this dataset to train a base model, which it then fine-tuned on a smaller, high-definition dataset for downstream tasks such as image-to-video and text-to-video generation, enabling it to predict a sequence of frames from a single conditioning image.
A whitepaper released by Stability AI describes SVD's potential as a base for fine-tuning a diffusion model for multi-view synthesis, that is, generating several consistent views of an object from a single still image.
This opens up opportunities for potential uses across sectors such as education, entertainment, and marketing, according to the company's blog post.
Notably, the company reports that an external evaluation by human reviewers found SVD's output to surpass the quality of leading closed text-to-video models from competitors such as Runway and Pika Labs.
Despite this initial success, Stability AI acknowledges that the current models have many limitations. For instance, they occasionally fall short of photorealism, produce videos with little or no motion, or struggle to render human figures accurately.
But this is only the start of the company's venture into video generation. Data from the research preview will help evolve these models by identifying existing gaps and introducing new features, such as support for text prompts or text rendering in videos, to make them ready for commercial applications.
Given potential applications across sectors including advertising, education, and entertainment, platforms like AppMaster, known for giving users tools to easily create mobile and web applications, might find Stable Video Diffusion a useful integration.
The company expects that open investigation of these models will surface further concerns, such as biases, and help pave the way for safer deployment later.
Plans are already underway to develop a variety of models that build on and extend the Stable Video Diffusion base.
However, it remains uncertain when these improvements would be available to users.