VidSketch: Hand-drawn Sketch-Driven Video Generation with Diffusion Control

1State Key Lab of CAD&CG, Zhejiang University, China
2College of Software Technology, Zhejiang University, China

Video animations generated by our VidSketch. Our method generates video animations from a hand-drawn sketch sequence (the corresponding sketches are placed in the top-left corner of the respective frames; the examples from top to bottom are guided by 1, 2, 4, and 6 sketches) and simple text prompts. This enables the creation of high-quality, spatiotemporally consistent video animations, breaking down barriers to the art profession.
Our VidSketch method empowers users of all skill levels to effortlessly create stunning, high-quality video animations using concise text prompts and intuitive hand-drawn sketches.


Abstract

With the advancement of generative artificial intelligence, previous studies have achieved the generation of aesthetic images from hand-drawn sketches, meeting the public's demand for drawing. However, these methods are limited to static images and cannot control video animation generation with hand-drawn sketches. To address this gap, we propose VidSketch, the first method capable of generating high-quality video animations directly from any number of hand-drawn sketches and simple text prompts, bridging the divide between ordinary users and professional artists. Specifically, our method introduces a Level-Based Sketch Control Strategy to automatically adjust the guidance strength of sketches during the generation process, accommodating users with varying drawing skills. Furthermore, a TempSpatial Attention mechanism is designed to enhance the spatiotemporal consistency of generated video animations, significantly improving coherence across frames.

Hand-drawn Sketches for different categories

(Gallery of hand-drawn input sketches for different categories, e.g., cup and bell.)

🎞   VidSketch in different styles  🎞

(Example video animations in four styles: Surrealistic, Magical, Fantasy, and Realistic.)

How does it work?

Hand-drawn Sketch-Driven Video Generation

Pipeline of our VidSketch. During training, we use high-quality, small-scale video datasets categorized by type to train the Enhanced SparseCausal-Attention (SC-Attention) and Temporal Attention blocks, improving spatiotemporal consistency in video animations. During inference, users simply input a prompt and a sketch sequence to generate tailored high-quality animations. Specifically, the first frame is generated using T2I-Adapter, while the entire sketch sequence is processed by the Inflated T2I-Adapter to extract features, which are injected into the VDM's upsampling layers to guide video generation.
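To make the caption's inference flow concrete, here is a minimal, hypothetical Python sketch. Every interface used below (t2i_adapter, inflated_adapter, vdm and its attributes) is an illustrative stand-in, not the actual VidSketch API:

import torch

def generate_animation(prompt, sketches, t2i_adapter, inflated_adapter, vdm,
                       num_steps=50):
    """sketches: list of F sketch tensors; returns F decoded video frames."""
    # 1. Adapter features guiding the first frame (via T2I-Adapter).
    first_cond = t2i_adapter(sketches[0])

    # 2. The Inflated T2I-Adapter encodes the whole sketch sequence at once,
    #    yielding per-frame conditioning features.
    seq_cond = inflated_adapter(torch.stack(sketches))

    # 3. Standard denoising loop; adapter features are injected into the
    #    VDM's up-sampling layers at every step.
    latents = torch.randn(1, vdm.latent_channels, len(sketches),
                          vdm.latent_h, vdm.latent_w)
    for t in vdm.timesteps(num_steps):
        eps = vdm.unet(latents, t, prompt,
                       adapter_features=seq_cond, first_frame=first_cond)
        latents = vdm.step(eps, t, latents)
    return vdm.decode(latents)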


Our training approach adheres to the traditional VDM framework. First, we conducted an extensive search across the internet to collect high-quality training videos for each action category, gathering 8–12 videos per category. Subsequently, we trained the SparseCausal-Attention and Temporal Attention modules separately for each action category. This strategy effectively mitigates the scarcity of high-quality video data while enhancing the spatiotemporal consistency and quality of the generated videos.
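As a rough illustration of this per-category fine-tuning, the loop below assumes a diffusers-style video UNet and tags the SparseCausal- and Temporal-Attention blocks with the made-up module names "attn_sc" and "attn_temp"; only those blocks receive gradients while the rest of the network stays frozen:

import torch

def finetune_category(unet, scheduler, text_encoder, dataloader,
                      lr=1e-5, steps=500):
    # Freeze the whole UNet, then unfreeze only the two attention blocks.
    unet.requires_grad_(False)
    trainable = []
    for name, module in unet.named_modules():
        if "attn_sc" in name or "attn_temp" in name:
            module.requires_grad_(True)
            trainable += list(module.parameters())
    optimizer = torch.optim.AdamW(trainable, lr=lr)

    for _, (latents, prompt_ids) in zip(range(steps), dataloader):
        noise = torch.randn_like(latents)                    # (B, C, F, H, W)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noisy = scheduler.add_noise(latents, noise, t)
        cond = text_encoder(prompt_ids)[0]                   # text embeddings
        pred = unet(noisy, t, encoder_hidden_states=cond).sample
        loss = torch.nn.functional.mse_loss(pred, noise)     # ε-prediction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()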

Abstraction-Level Sketch Control Strategy

To accommodate the significant variation in users' drawing skills, we conduct a detailed quantitative analysis of the continuity, connectivity, and texture detail of sketch sequences to comprehensively evaluate their abstraction level. This enables us to dynamically adjust the control strength during video generation. The specific implementation details of the Abstraction-Level Sketch Control Strategy are illustrated in the picture below.


We perform a quantitative analysis of the connectivity, continuity, and texture details of sketches to automatically evaluate the abstraction level of hand-drawn sketch sequences. Sketches with varying levels of abstraction correspond to different generation control strengths, ensuring that VidSketch adapts to users of all drawing skill levels and thereby improving the method's generalization.
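For illustration only, the snippet below shows one plausible way to score connectivity, continuity, and texture detail with OpenCV and map the result to a guidance weight. The individual metrics, the equal-weight combination, and the w_min/w_max range are our own placeholder assumptions, not the paper's exact formulas:

import cv2
import numpy as np

def abstraction_level(sketch_gray: np.ndarray) -> float:
    """Score in [0, 1]; higher means a more abstract (rougher) sketch."""
    ink = (sketch_gray < 128).astype(np.uint8)               # binarize strokes

    # Connectivity: many disjoint stroke components -> fragmented sketch.
    n_labels, _ = cv2.connectedComponents(ink)
    connectivity = 1.0 / max(n_labels - 1, 1)                # 1.0 = one stroke

    # Continuity: morphological closing fills small gaps; if closing adds
    # many pixels, the strokes were broken.
    closed = cv2.morphologyEx(ink, cv2.MORPH_CLOSE,
                              np.ones((5, 5), np.uint8))
    continuity = ink.sum() / max(closed.sum(), 1)

    # Texture detail: fraction of the canvas covered by strokes
    # (x10 is a crude normalization, since sketches are mostly blank).
    detail = min(ink.mean() * 10.0, 1.0)

    # Low connectivity/continuity/detail -> high abstraction.
    return float(1.0 - (connectivity + continuity + detail) / 3.0)

def control_strength(sketch_gray, w_min=0.4, w_max=1.0):
    # Monotone mapping: the rougher the sketch, the weaker its guidance.
    return w_max - abstraction_level(sketch_gray) * (w_max - w_min)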

Enhanced SparseCausal-Attention mechanism

The primary distinction between video animation generation and image generation lies in the requirement to maintain spatiotemporal consistency across video frames. To address this inherent challenge, we propose an Enhanced SparseCausal-Attention mechanism. For each frame i in the video sequence, key/value (K/V) representations are extracted from both the initial frame and the preceding frame (i-1), while the query (Q) representation comes from the current frame i; attention is then computed between them.
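A minimal PyTorch sketch of this attention pattern (single head, simplified linear projections; not the official implementation) could look like:

import torch

class SparseCausalAttention(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        # x: (batch, frames, tokens, dim) latent features per frame.
        b, f, n, d = x.shape
        out = []
        for i in range(f):
            q = self.to_q(x[:, i])                           # query: frame i
            # Keys/values come from the first frame and frame i-1
            # (frame 0 simply attends to itself twice).
            kv = torch.cat([x[:, 0], x[:, max(i - 1, 0)]], dim=1)
            k, v = self.to_k(kv), self.to_v(kv)
            attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
            out.append(attn @ v)                             # (b, n, d)
        return torch.stack(out, dim=1)                       # (b, f, n, d)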


This mechanism effectively maintains inter-frame consistency under identical conditions, significantly enhancing the quality of the generated video animations and better satisfying the demand for high-quality video animation production.

BibTeX

@misc{jiang2025vidsketchhanddrawnsketchdrivenvideo,
      title={VidSketch: Hand-drawn Sketch-Driven Video Generation with Diffusion Control},
      author={Lifan Jiang and Shuang Chen and Boxi Wu and Xiaotong Guan and Jiahui Zhang},
      year={2025},
      eprint={2502.01101},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.01101},
}