Wan vs. Ours
Comparing the Wan baseline with our method across test scenes
In the Wan video, the car switches from a convertible to a coupe in the initial frames (see top). Toward the end, a sign appears and disappears, and the scene becomes heavily distorted. In contrast, our method preserves the car and the background. Additionally, we can observe a very similar identity, camera motion, and general scene structure between the two videos, indicating that our method effectively improves consistency.
The Wan-generated video suffers from strong inconsistencies in the main subject. In contrast, our method preserves the main subject and the background. Additionally, we can observe a very similar identity, camera motion, and general scene structure between the two videos, indicating that our method effectively improves consistency.
The Wan-generated video produces strong artifacts when the surfer moves the surfboard. In contrast, our method renders the surfboard without issues. Additionally, we can observe a similar identity and general scene structure between the two videos.
The Wan-generated video distorts the man's face when he turns his head. In contrast, our method preserves the face. The overall identities are similar, but the camera motion is not preserved as well as in the previous examples.
The Wan-generated video suffers from strong artifacts when the gears rotate. In contrast, our method preserves the gears and the background. This example shows that our method converged to a different identity while maintaining similar motion and scene structure.
REPA vs. Ours
Comparing the REPA baseline with our method on the test set. Videos are cropped and trimmed to align for better comparison.
In the REPA-generated video, the dog disappears when it runs behind the tree. In contrast, our method preserves the dog.
In the REPA-generated video, the railing structure intersects with the birds, leading to severe artifacts.
The REPA video introduces strong deformations of the contents of the plate. Neither method produces a video that follows the prompt exactly.
We can observe strong distortions in the REPA video, while ours remains quite consistent.
Extended Wan vs. Ours
Additional comparisons across more prompts, in particular many out-of-distribution prompts
Both examples look extremely similar; only the white edges of the neon road are more consistent with our method. In the Wan-generated video, they appear suddenly and morph toward the end of the video. No similar video was used in training.
The overall scene structure and movement are identical; only the helmet is significantly different. In our video, the sword does not penetrate the hand.
Both videos generate an extremely similar tank driving through a similar scene. The background in our generation is significantly more consistent.
Both scenes show similar cars and a similar background. However, our method correctly optimizes away the unrealistic car movement while keeping the good pedestrian movement.
The baseline fails to keep the background consistent: two cupcakes in the background disappear. While the layout is kept, the appearance of the cupcakes is not.
At the beginning, the baseline generates strong distortions of the subject and the object being cut. Our method keeps the appearance of the objects and the subject consistent. The overall scene is extremely similar, with identities and scene structure intact.
In the baseline video, there are strong temporal artifacts in the background. Ours does not exhibit this failure. However, our optimization changes the overall scene, and people vanishing in the large crowd is not fully eliminated either.
The fireworks flicker in the baseline video. Our method keeps the fireworks consistent and natural. No video similar to this was seen during training.
In the baseline video, the transition between poses deforms the character beyond recognition. Our method keeps the character consistent while preserving the overall scene structure and movement. There are no similar videos in the training data.
In the baseline video, another microphone appears in the bear's hand, and the eyes of the small yellow animal flicker and morph. Our method eliminates the additional microphone and reduces the issues with the eyes. No animated data was seen during training.
In the baseline, fish disappear into the background. Our method keeps the fish in the foreground and renders them consistently. The identity, movement, and structure are largely preserved, and no underwater scenes were seen during training.
The car in the baseline video morphs strongly during the parking process, while in our generated video it drives in smoothly. The camera movement and scene structure are largely preserved.
The baseline generates a shaky video in which foreground and background change substantially. Ours stays consistent. The identity of the car and some parts of the scene layout are kept.
In the baseline, the car morphs when the perspective changes. This does not happen in our generated video. However, both struggle with the hand intersection. The overall identity and movement are preserved.
In the Wan video, there are strong inconsistencies in the foreground during the initial frames. Later, the background is also unstable. The identity, movement, and structure are preserved in our generated video.
The scene background morphs in the Wan video. In our video, the background is stable, and the identity and movement are preserved. The camera movement is slightly simplified, excluding the most problematic part.
In the baseline video, the lights at the top morph as the camera moves forward. This is not the case in our video. The general scene layout is kept.
In the baseline, one can observe how the front wall is slowly compressed horizontally and the top of the right tower drifts to the left. This is not the case in our video. This sample shows that even small problems are optimized away.
We can observe that the windshield wipers are not consistent in the Wan video. In our video, they are completely removed and only their reflection is visible on the glass. The overall scene is very similar, showing that the method optimizes even small inconsistencies.
Failure Cases
Examples where our method's improvements might be unwanted or ineffective
In the Wan video, the tent moves. In general, this is plausible motion for a cloth-like object. However, in this specific video it is inconsistent with the absence of wind, as shown by the grass and trees.
We can observe that the sky in the Wan-generated video moves. This is plausible in general, as the earth rotates. However, the pace is more consistent with a time-lapse video than with a normal video. Our method penalizes this.
We can observe that the wall in the Wan video morphs and moves strongly; this also happens, at a smaller scale, in ours. The issue is that the error occurs only at a single point in time, so it is hard to penalize because we simply average over the video length.
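The dilution effect behind this failure case can be illustrated with a minimal numerical sketch. The frame count and per-frame error values below are purely illustrative assumptions, not the actual training objective:

```python
# Illustrative sketch: a consistency error that spikes in a single frame
# is diluted when the penalty is averaged over the video length.
# All numbers here are hypothetical, not taken from the actual method.

num_frames = 81
per_frame_error = [0.0] * num_frames
per_frame_error[40] = 1.0  # severe morphing confined to one frame

# Averaging over all frames shrinks the spike by a factor of num_frames.
averaged_penalty = sum(per_frame_error) / num_frames
print(averaged_penalty)  # ~0.0123, barely visible in the averaged loss
```

A brief, severe artifact therefore contributes almost nothing to a time-averaged penalty, which is why localized single-frame errors are hard for such an objective to suppress.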
In the Wan video, two balloons appear, while in our video only one appears out of thin air. The overall scene is extremely similar. While this shows that our method explicitly optimizes consistency, it also shows that the result is not always fully consistent.