Stylizing individual video frames with diffusion models and ControlNets is a powerful technique, offering a wide variety of styles and high fidelity to content and scene layout. However, naively applying it to every frame of a video sequence is computationally expensive and slow, and the results are prone to jitter, flicker, and a general lack of temporal stability.
Quickly and robustly stylizing videos while harnessing the generative power of diffusion models remains an open challenge for the computer vision community, and a comprehensive solution has yet to emerge.
We have therefore developed an in-house solution that combines several strategies. First, we employ smart frame selection, identifying keyframes for stylization so that the expensive generative step runs only where it is needed. Next, we selectively stylize areas of those keyframes, ensuring compatibility and consistency between frames. Finally, to achieve temporal stability and minimize artifacts, we use a fast texture propagation method driven by optical flow and other guides to stylize the intermediate frames, which holds up even in dynamic videos with scene changes and camera movement. The sketch below illustrates the general flavor of these ideas.
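To make the keyframe-plus-propagation idea concrete, here is a minimal sketch, not our production pipeline, of two of the building blocks described above: picking keyframes where accumulated motion suggests a fresh stylization is needed, and warping an already-stylized keyframe toward a later frame using dense optical flow. It uses OpenCV's Farneback flow purely as a simple stand-in for whichever flow estimator is used in practice; the diffusion/ControlNet stylization step, occlusion handling, and the other guides mentioned above are assumed to exist elsewhere and are out of scope here.

```python
import cv2
import numpy as np

def dense_flow(from_gray, to_gray):
    """Dense optical flow mapping pixels in `from_gray` to locations in `to_gray`
    (Farneback used here only as a simple stand-in for a stronger estimator)."""
    return cv2.calcOpticalFlowFarneback(
        from_gray, to_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

def select_keyframes(frames, motion_threshold=12.0):
    """Mark a frame as a keyframe once the motion accumulated since the last
    keyframe exceeds a threshold -- a crude proxy for 'the scene has changed
    enough that propagated texture would no longer look right'."""
    keyframes = [0]
    accumulated = 0.0
    prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for i in range(1, len(frames)):
        gray = cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY)
        flow = dense_flow(prev_gray, gray)
        accumulated += float(np.mean(np.linalg.norm(flow, axis=2)))
        if accumulated > motion_threshold:
            keyframes.append(i)
            accumulated = 0.0
        prev_gray = gray
    return keyframes

def propagate_stylized(stylized_key, key_gray, cur_gray):
    """Warp a stylized keyframe toward the current frame.
    For each pixel of the current frame we look up where it came from in the
    keyframe (backward flow, current -> keyframe) and sample the stylized
    keyframe there with cv2.remap."""
    flow_back = dense_flow(cur_gray, key_gray)
    h, w = flow_back.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_back[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_back[..., 1]).astype(np.float32)
    return cv2.remap(stylized_key, map_x, map_y,
                     interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)
```

In this toy version, `stylize_keyframe` (the diffusion/ControlNet step) would be called only on the indices returned by `select_keyframes`, and every intermediate frame would receive its texture from `propagate_stylized`; a real pipeline additionally has to blend between neighboring keyframes, detect occlusions and disocclusions, and fall back to fresh stylization where the flow-based warp breaks down.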
This pipeline reflects our expertise in generative image editing, low-level video editing, optical flow estimation, and texture propagation. By integrating these components, we have introduced industry-leading video stylization features in our video editor, VideoLeap. The implementation produces visually appealing results while keeping processing times fast, offering a seamless and efficient video stylization experience.