Diving Deep into Image-to-Video AI Magic: Runway, Pika, SVD, Moonvalley, I2VGEN_XL


March 26, 2024

I decided to compare several image-to-video platforms: Runway ML, Pika, Stable Video Diffusion (SVD), Moonvalley, and I2VGEN_XL by Alibaba.

Runway and Pika videos are generated on their websites, while Moonvalley videos are generated via a Discord server. SVD is available on the Leonardo website, but it doesn't let you upload your own images; images have to be created on the platform before they can be turned into videos. That's why I experimented with SVD in ComfyUI, which is relatively complicated for first-time users. I2VGEN_XL publishes its code on GitHub but lacks a user-friendly interface, so I used Hugging Face Spaces to create videos from images.

My starting point for creating images was Katalist.ai. The experience of creating images there was extremely comfortable and easy, as the website utilizes ChatGPT for generating prompts. Users can input simple prompts, which GPT then transforms into complex ones that are well understood by Stable Diffusion, the backbone of image creation. 

My favorite part about Katalist.ai is that it keeps scenes and characters consistent. Because of that, I am able to create full stories instead of just one image, which is incredibly powerful when combined with image-to-video platforms. Also, if I change my mind and want to switch the character, I can do it on all generated images at once, which saves me a lot of time.

If you want to try Katalist.ai for yourself, you can get free early access here. The results I am getting are unreal.

I categorized the images into five categories: Landscape, Portrait, Dynamic Pose, Product, and Particles (Smoke, Fire, and Water) to observe how different image-to-video models handle these varied elements.

Next, I tested these images on the platforms mentioned earlier. Runway and Pika provide camera control features, including zooming, panning, tilting, and rolling. I applied the same settings on both platforms for consistency. In contrast, Moonvalley offers camera control through prompts, but the results are less predictable, and prompts are often not followed accurately. SVD and I2VGEN_XL offer no explicit control at all: camera motion is inferred from the image's dynamics and composition. For instance, if an image suggests forward movement, the AI model will likely move the camera accordingly.
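For readers who want to repeat the comparison, the setup described above can be written down as a simple grid. This is only a bookkeeping sketch: the labels "ui", "prompt", and "inferred" and the function name `build_grid` are my own shorthand, not anything the platforms expose.

```python
from itertools import product

# How each platform exposes camera motion, as described in the text.
# The labels are shorthand invented for this sketch.
CAMERA_CONTROL = {
    "Runway": "ui",          # zoom, pan, tilt, roll set in the web UI
    "Pika": "ui",            # same settings applied as on Runway
    "Moonvalley": "prompt",  # camera motion requested via the prompt
    "SVD": "inferred",       # motion inferred from the image itself
    "I2VGEN_XL": "inferred",
}

# Five categories, with Particles split into its three sub-types.
IMAGE_TYPES = [
    "Landscape",
    "Portrait",
    "Dynamic Pose",
    "Product",
    "Particles - Smoke",
    "Particles - Fire",
    "Particles - Water",
]

def build_grid():
    """Return one entry per (platform, image type) pair to generate."""
    return [
        {"platform": platform, "image": image, "camera": CAMERA_CONTROL[platform]}
        for platform, image in product(CAMERA_CONTROL, IMAGE_TYPES)
    ]

grid = build_grid()
print(len(grid), "clips to generate")
```

Keeping the camera-control mode next to each entry makes it easier to remember, when reading the findings, which results came from explicit settings and which from the model's own guess.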

Here are my findings:

Landscape:

  • Runway is a clear winner here. It's obvious their model was trained on numerous nature videos, including aerial and drone footage. The camera motion is smooth, and the video output offers a gorgeous parallax effect.
  • Pika is a bit disappointing, as I didn’t get the motion I set in the settings. The image becomes quite distorted.
  • The camera motion in SVD is somewhat exaggerated, but this could be easily fixed in post-production by slowing it down and adding frame interpolation. The parallax effect is beautiful, and the image distortion isn't too harsh.
  • Moonvalley's footage is comparable to Pika's: there is no parallax effect, and a better result could be achieved with basic video editing software.
  • The I2VGEN_XL video is quite a surprise: the camera motion is very subtle, but the parallax effect is clear. The added lens flare was a really pleasant touch.
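The "slow it down and add frame interpolation" fix mentioned for SVD can be illustrated with a minimal sketch. It uses a naive linear cross-fade between neighbouring frames; real editing tools typically use motion-compensated interpolation, which looks much better on moving cameras. Frames are simplified here to flat lists of pixel intensities, and the names `blend` and `slow_down` are made up for this example.

```python
from typing import List

Frame = List[float]  # one frame, simplified to a flat list of pixel intensities

def blend(a: Frame, b: Frame, t: float) -> Frame:
    """Linear cross-fade between two frames; t=0 gives a, t=1 gives b."""
    return [(1.0 - t) * pa + t * pb for pa, pb in zip(a, b)]

def slow_down(frames: List[Frame], factor: int) -> List[Frame]:
    """Stretch a clip by `factor`, filling the gaps with blended frames."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not frames:
        return []
    out: List[Frame] = []
    # For each pair of neighbours, emit the first frame plus factor-1 in-betweens.
    for a, b in zip(frames, frames[1:]):
        for step in range(factor):
            out.append(blend(a, b, step / factor))
    out.append(list(frames[-1]))  # keep the final frame unchanged
    return out

clip = [[0.0, 0.0], [1.0, 2.0]]
print(slow_down(clip, 2))  # the original two frames with one blended frame between them
```

Played back at the original frame rate, the stretched clip runs roughly `factor` times longer, so the exaggerated camera move appears correspondingly slower.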

Portrait:

  • Runway's camera motion is smooth; the main character is not overly distorted, and the background elements maintain their structural integrity. I was pleasantly surprised by some elements added due to the parallax effect. However, the image quality deteriorates in the last second, a common issue with Runway.
  • Pika seems eager to add movement to characters, even when the image composition suggests the character should be static. This often leads to significant distortion of faces and messy additional elements.
  • SVD's camera motion is quite impressive. However, the choice to pull focus was not ideal, as deblurring the background is rarely done well in post-production. The character's face, costume, and additional elements were slightly distorted, but not excessively so.
  • Moonvalley appeared unsure of how to handle the static nature of the image and simply altered the character's face.
  • I was impressed with what I2VGEN_XL did with the character. It not only subtly rotated her but also added reflections to her costume. It didn’t perform as I expected, but the model's strength clearly lies in its unpredictability.

Dynamic pose:

  • The source image suggested a dynamic movement, and I struggled with this one. Runway recognized where the image was heading but opted to play it safe by adding only slow motion to the shot.
  • With Pika, after repeatedly hitting the generate button, I finally achieved the expected result, but only for the first second. After that, both the main character and the background became excessively distorted, making the footage unusable.
  • SVD understood the assignment and performed as expected, capturing the dynamism effectively.
  • However, Moonvalley seemed perplexed by this image. It merely introduced a bit of noise movement throughout the image and shifted the focus from the main character to the dust in the background.
  • I2VGEN_XL started out promising but ultimately failed. It overemphasized the character and… stumbled, neglecting to make any significant changes to the camera motion or the background.

Product:

  • As expected, this image was the easiest to work with. Runway performed as anticipated but didn't execute the camera movement I wanted. Instead of panning up and zooming out, it opted for a classic pull-out with a bit of rotation. The image was slightly distorted, but it wasn't a major issue.
  • Pika moved the camera exactly as I desired, but the effect was overly intense. I wish Pika offered a greater range of intensity levels. However, I did appreciate the new elements introduced as the camera pulled out.
  • SVD excelled in this task, delivering superb image coherency and a beautiful parallax effect in camera rotation, even adding depth to the product.
  • And yet again, Moonvalley just stared blankly into the image and went to its happy place.
  • I2VGEN_XL started off well but then lost its way. Rather than completing its intended task, it decided the plants in the background were superfluous and focused all its efforts on removing them, unfortunately to poor effect.

Particles - Smoke:

  • This scenario showcases the strengths of video models. Runway followed the camera movement instructions but failed to grasp the depth of the image, leading to a 'played in reverse' effect with the smoke moving alongside the camera. The movement of the smoke was quite good, although I have seen better from Runway.
  • Pika, instead of panning, opted to tilt the camera and didn't focus on the depth of the image, which surprisingly resulted in a better outcome than Runway. The smoke moved very realistically, and the fire beneath the smoke was more dynamic, just as naturally occurs.
  • SVD once again delivered the best results. The smoke moved more rapidly compared to the other videos, but its motion was the most realistic among them.
  • Moonvalley, however, seemed to add nonexistent elements, negatively impacting the overall experience with its camera movement.
  • I2VGEN_XL handled the fire particles well, but our focus was on the smoke. While the smoke animation had a reverse effect, the motion itself was very nicely executed, aside from that aspect.

Particles - Fire:

  • Honestly, I had higher expectations for Runway. I found myself pressing the 'generate' button more often than I would have liked, and the results were not as impressive as I've seen before. Given Runway's reputation for handling fire effects well, this was quite disappointing.
  • Pika performed slightly better. The video generation was faster than Runway's, but the quality still didn't match the best I've seen from Pika. Additionally, it caused distortion in the character and reduced the overall sharpness of the image.
  • SVD's handling of fire was astonishing. It excelled not only in fire representation but also in camera movement and image coherency. The way the model added additional lighting to the character's face was particularly impressive.
  • Moonvalley managed the foreground fire decently, but it fell short in animating the background fire particles. I'll refrain from commenting further.
  • I2VGEN_XL took an interesting approach by reducing the amount of fire in the video. Besides that, it executed smooth camera movements, and the image distortion was minimal.

Particles - Water:

  • Runway performed adequately. It followed the camera instructions well, but I had expected better animation of the water. The splashes were well done, but the wave motion left something to be desired.
  • Pika fared slightly better than Runway in this respect, though it caused some distortion and blurring of image details. The camera instructions weren't as closely followed in this instance, resulting in an overall mediocre outcome.
  • Once again, SVD was a pleasant surprise. The wave animations were realistic, and the overall coherence of the video was the highest among all the platforms.
  • Moonvalley, however, was another disappointment. While it did manage to create movement in the image, that was the extent of its achievement.
  • I2VGEN_XL performed impressively with the motion of the waves but overlooked the lower portion of the image. It also unexpectedly added an extra limb to a surfer, but setting that aside, it did a commendable job overall.

Conclusion:

Stable Video Diffusion surprisingly took the top spot in this experiment, overturning my initial expectations favoring Runway and Pika. Both Runway and Pika delivered decent performances, yet they required multiple attempts and toggling between 'image' and 'image with description' to nail the desired effect. In stark contrast, SVD streamlined the process, achieving impressive results within just two or three generations. A noteworthy point, though: SVD demands a robust GPU (for instance, my Nvidia RTX 4080 needed around 120 seconds to generate a 2-second video at 1344x768 px resolution), while Runway and Pika simply rely on an internet connection and a subscription. Moonvalley fell short of being practical for use, and I2VGEN_XL, despite showing potential, is in dire need of a user-friendly web interface and some fine-tuning.
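To put the GPU figure in perspective, here is the arithmetic as a small sketch. The only inputs are the numbers quoted above; the helper names are made up, and the estimate for longer footage assumes generation time scales linearly with clip length, which I have not measured.

```python
def compute_ratio(generation_seconds: float, clip_seconds: float) -> float:
    """Seconds of GPU time spent per second of finished video."""
    return generation_seconds / clip_seconds

def pixels_per_frame(width: int, height: int) -> int:
    """Number of pixels the model has to produce for each frame."""
    return width * height

ratio = compute_ratio(120, 2)            # figures from the RTX 4080 run above
print(ratio)                             # 60.0 seconds of compute per second of video
print(pixels_per_frame(1344, 768))       # 1032192, roughly one megapixel per frame

# Rough estimate only: assumes linear scaling with clip length.
print(ratio * 10 / 60, "minutes for 10 seconds of footage")
```

In other words, every second of SVD footage cost about a minute of local GPU time in this setup, and that is before counting the two or three generations needed per shot.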