Wan vs. Ours
Comparing the Wan baseline with our method across test scenes
In the Wan video, the car switches from a convertible to a coupe in the initial frames (see top). Toward the end, a sign appears and disappears, and the scene becomes heavily distorted. In contrast, our method preserves the car and the background. Additionally, we can observe a very similar identity, camera motion, and general scene structure between the two videos, indicating that our method effectively improves consistency.
The Wan-generated video suffers from strong inconsistencies in the main subject. In contrast, our method preserves the main subject and the background. Additionally, we can observe a very similar identity, camera motion, and general scene structure between the two videos, indicating that our method effectively improves consistency.
The Wan-generated video produces strong artifacts when the surfer moves the surfboard. In contrast, our method renders the surfboard without issues. Additionally, we can observe a similar identity and general scene structure between the two videos.
The Wan-generated video distorts the man's face when he turns his head. In contrast, our method preserves the face. The overall identities are similar, but the camera motion is not preserved as well as in the previous examples.
The Wan-generated video suffers from strong artifacts when the gears rotate. In contrast, our method preserves the gears and the background. This example shows that our method converged to a different identity while maintaining similar motion and scene structure.
REPA vs. Ours
Comparing the REPA baseline with our method on the test set. Videos are cropped and trimmed to align for better comparison.
In the REPA-generated video, the dog disappears when it runs behind the tree. In contrast, our method preserves the dog.
In the REPA-generated video, the railing structure intersects with the birds, leading to severe artifacts.
The REPA video introduces strong deformations of the contents of the plate. Neither method produces a video that follows the prompt exactly.
We can observe strong distortions in the REPA video, while ours remains quite consistent.
Extended Wan vs. Ours
Additional comparisons across more prompts, in particular many out-of-distribution prompts
Both examples look extremely similar; only the white edges of the neon road are more consistent with our method. In the Wan-generated video, they appear suddenly and morph toward the end of the video. No similar video was used in training.
The overall scene structure and movement are identical; only the helmet is significantly different. In our video, the sword does not penetrate the hand.
Both videos generate an extremely similar tank driving through a similar scene. The background in our generation is significantly more consistent.
Both scenes show similar cars and a similar background. However, our method correctly optimizes away the unrealistic car movement while keeping the good pedestrian movement.
The baseline fails to keep the background consistent: two cupcakes in the background disappear. While the layout is kept, the appearance of the cupcakes is not.
At the beginning, the baseline generates strong distortions of the subject and the object being cut. Our method keeps the appearance of the objects and the subject consistent. The overall scene is extremely similar, with identities and scene structure intact.
In the baseline video, there are strong temporal artifacts in the background. Ours does not exhibit this failure. However, our optimization changes the overall scene, and people vanishing in the large crowd is not fully eliminated either.
The fireworks flicker in the baseline video. Our method keeps the fireworks consistent and natural. No video similar to this was seen during training.
In the baseline video, the transition between poses deforms the character beyond recognition. Our method keeps the character consistent while preserving the overall scene structure and movement. There are no similar videos in the training data.
In the baseline video, another microphone appears in the bear's hand, and the eyes of the small yellow animal flicker and morph. Our method eliminates the additional microphone and reduces the issues with the eyes. No animated data was seen during training.
In the baseline, fish disappear into the background. Our method keeps the fish in the foreground and renders them consistently. The identity, movement, and structure are largely preserved, and no underwater scenes were seen during training.
The car in the baseline video morphs strongly during the parking process, while in our generated video it drives in smoothly. The camera movement and scene structure are largely preserved.
The baseline generates a shaky video in which foreground and background change substantially. Ours stays consistent. The identity of the car and some parts of the scene layout are kept.
In the baseline, the car morphs when the perspective changes. This does not happen in our generated video. However, both struggle with the hand intersection. The overall identity and movement are preserved.
In the Wan video, there are strong inconsistencies in the foreground during the initial frames. Later, the background is also unstable. The identity, movement, and structure are preserved in our generated video.
The scene background morphs in the Wan video. In our video, the background is stable, and the identity and movement are preserved. The camera movement is slightly simplified, excluding the most problematic part.
In the baseline video, the lights at the top morph as the camera moves forward. This is not the case in our video. The general scene layout is kept.
In the baseline, one can observe how the front wall is slowly compressed horizontally and the top of the right tower drifts to the left. This is not the case in our video. This sample shows that even small problems are optimized away.
We can observe that the windshield wipers are not consistent in the Wan video. In our video, they are completely removed and only their reflection is visible on the glass. The overall scene is very similar, showing that the method optimizes even small inconsistencies.
Failure Cases
Examples where our method's improvements might be unwanted or ineffective
In the Wan video, the tent moves. In general, this is plausible motion for a cloth-like object. However, in this specific video it is inconsistent with the absence of wind, as shown by the grass and trees.
We can observe that the sky in the Wan-generated video moves. This is plausible in general, as the earth rotates. However, the pace is more consistent with a time-lapse video than with a normal video. Our method penalizes this.
We can observe that the wall in the Wan video morphs and moves strongly; this also happens, at a smaller scale, in ours. The issue is that the error occurs only at a single point in time, so it is hard to penalize because we simply average over the video length.
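The dilution effect behind this failure case can be illustrated with a minimal numerical sketch. The frame count and per-frame error values below are purely illustrative assumptions, not the actual training objective:

```python
# Illustrative sketch: a consistency error that spikes in a single frame
# is diluted when the penalty is averaged over the video length.
# All numbers here are hypothetical, not taken from the actual method.

num_frames = 81
per_frame_error = [0.0] * num_frames
per_frame_error[40] = 1.0  # severe morphing confined to one frame

# Averaging over all frames shrinks the spike by a factor of num_frames.
averaged_penalty = sum(per_frame_error) / num_frames
print(averaged_penalty)  # ~0.0123, barely visible in the averaged loss
```

A brief, severe artifact therefore contributes almost nothing to a time-averaged penalty, which is why localized single-frame errors are hard for such an objective to suppress.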
In the Wan video, two balloons appear, while in our video only one appears out of thin air. The overall scene is extremely similar. While this shows that our method explicitly optimizes consistency, it also shows that the result is not always fully consistent.