Zeroscope is derived from Modelscope, a multi-stage text-to-video diffusion model with 1.7 billion parameters that generates video from textual descriptions. Zeroscope refines this concept with higher resolution, no Shutterstock watermark, and an aspect ratio closer to 16:9.
Zeroscope comes in two components: zeroscope_v2 576w, designed for rapid content creation at 576x320 pixels for exploring video concepts, and zeroscope_v2 XL, which upscales promising videos to a "high definition" 1024x576. The music in the following demo video was added in post-production.
For video generation, the model requires 7.9 GB of VRAM to render 30 frames at 576x320 pixels, and 15.3 GB of VRAM to render 30 frames at 1024x576 pixels. The smaller model should therefore run on many consumer graphics cards.
Zeroscope was trained with offset noise on 9,923 clips and 29,769 tagged frames, each clip comprising 24 frames. Rather than literally shifting objects or frame timings, offset noise perturbs the random noise used during diffusion training, typically by adding a small offset that is shared across an entire frame or channel instead of being purely per-pixel.
Introducing this noise during training broadens the data distribution the model learns. As a result, the model can generate a more diverse range of realistic videos and interpret variations in text descriptions more effectively.
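For the concrete version, here is a minimal sketch of offset noise as it is commonly implemented for diffusion training, assuming the same recipe carries over to 5-D video latents; the latent shape and the 0.1 scale are illustrative, not Zeroscope's actual training settings.

```python
# Minimal sketch of offset noise, assuming the common image-diffusion recipe
# carries over to video latents shaped [batch, channels, frames, height, width].
# The 0.1 scale and the latent shape below are illustrative, not Zeroscope's
# actual training settings.
import torch

def offset_noise(latents: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    # Standard zero-mean Gaussian noise, one value per latent element.
    noise = torch.randn_like(latents)
    # A per-sample, per-channel offset shared across every frame and pixel,
    # which nudges each noisy sample toward a brighter or darker mean.
    offset = torch.randn(
        latents.shape[0], latents.shape[1], 1, 1, 1,
        device=latents.device, dtype=latents.dtype,
    )
    return noise + scale * offset

# Example: noise a batch of two video latents (4 channels, 24 frames, 40x72).
latents = torch.randn(2, 4, 24, 40, 72)
noisy_target = offset_noise(latents)
```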
According to Zeroscope developer "Cerspense", who has experience with Modelscope, fine-tuning the model with 24 GB of VRAM is not "super hard". He removed the Modelscope watermarks during fine-tuning.
Text-to-video is still in its infancy. AI-generated clips are typically only a few seconds long and have many visual flaws. Image AI models faced similar issues early on but reached photorealism within months. Unlike image generation, however, video generation is far more resource-intensive, both for training and inference.
Google has already unveiled Phenaki and Imagen Video, two text-to-video models capable of generating longer, high-resolution, logically coherent clips, though neither has been released. Meta's text-to-video model, Make-a-Video, also remains unreleased.
Making AI videos is an iterative process. The videos that you create with Zeroscope are amazing, but also weird, so it helps to make a lot of videos and then keep the best ones. I made a Python script that makes this easy. The script generates however many videos (and music tracks) you ask for and saves the output locally. It all runs in the background. Then you can open up Finder in the directory where the clips are saved and drag the ones you like into iMovie.
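Here's a minimal sketch of what such a script might look like, using the Replicate Python client. The model slug, input names, and file naming below are placeholders rather than the exact ones I used, and the music-generation step is omitted.

```python
# generate.py -- illustrative sketch only; the model slug and input names are
# assumptions and may need to be swapped for the current Zeroscope version on
# Replicate. Requires `pip install replicate` and a REPLICATE_API_TOKEN env var.
import logging
import sys
import urllib.request

import replicate

# Log to studio.log so progress can be followed with `tail -F studio.log`.
logging.basicConfig(
    filename="studio.log",
    level=logging.INFO,
    format="%(asctime)s %(message)s",
)

def make_videos(prompt: str, count: int) -> None:
    """Generate `count` clips for `prompt` and save them in the current directory."""
    for i in range(count):
        logging.info("starting clip %d/%d: %s", i + 1, count, prompt)
        output = replicate.run(
            "anotherjesse/zeroscope-v2-xl",  # hypothetical slug; pin a version hash in practice
            input={"prompt": prompt},
        )
        # The client typically returns a URL (or a list of URLs) for the rendered clip.
        url = output[0] if isinstance(output, (list, tuple)) else output
        path = f"clip_{i:03d}.mp4"
        urllib.request.urlretrieve(str(url), path)
        logging.info("saved %s", path)

if __name__ == "__main__":
    # Usage: python generate.py "<prompt>" <number_of_clips>
    make_videos(sys.argv[1], int(sys.argv[2]))
```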
I always add an & to the end of the command so that it runs in the background. This way you can send off a bunch of different requests to Replicate without having to open up new terminals or wait for the predictions to finish. To see progress, you can run tail -F studio.log to tail the logs (quit with ctrl+c).
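With the sketch above saved as generate.py (a hypothetical name), that looks like: python generate.py "a robot dancing in the rain" 5 &, and then tail -F studio.log to watch each clip get logged as it finishes.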
This is a decent starting point, but I'd encourage you to experiment with the prompts and styles. My friend fofrAI got me started with a solid set of parameters for zeroscope, but feel free to play with those too.
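As a rough illustration of what a zeroscope parameter set looks like (the input names and values below are assumptions based on typical Replicate deployments, not necessarily that exact set, so check the model page before copying them):

```python
# Illustrative zeroscope inputs -- names and values are assumptions, not a
# confirmed recipe. Adjust against the model's documented inputs on Replicate.
zeroscope_input = {
    "prompt": "a cinematic drone shot over a foggy forest at sunrise",
    "negative_prompt": "blurry, low quality, watermark",
    "width": 576,               # base model was trained around 576x320
    "height": 320,
    "num_frames": 24,           # clip length in frames
    "num_inference_steps": 50,  # more steps is slower, sometimes cleaner
    "guidance_scale": 12.5,     # higher values follow the prompt more literally
}
```

A dict like this would be passed as the input= argument in the script above.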
Now that we have a nice selection of narration, video clips, and music, it's a matter of dragging everything into iMovie and iterating from there. The process involves a lot of dragging, dropping, and shuffling, then creating new videos and repeating.