VidSketch: Hand-drawn Sketch-Driven Video Generation with Diffusion Control

1State Key Lab of CAD&CG, Zhejiang University, China
2College of Software Technology, Zhejiang University, China

Video animations generated by our VidSketch. Our method generates video animations from a hand-drawn sketch sequence (the corresponding sketches are placed in the top-left corner of the respective frames; the examples from top to bottom are guided by 1, 2, 4, and 6 sketches) and simple text prompts. This enables the creation of high-quality, spatiotemporally consistent video animations, breaking down barriers to the art profession.
Our VidSketch method empowers users of all skill levels to effortlessly create stunning, high-quality video animations using concise text prompts and intuitive hand-drawn sketches.


Abstract

With the advancement of generative artificial intelligence, previous studies have achieved the generation of aesthetic images from hand-drawn sketches, meeting the public's demand for drawing. However, these methods are limited to static images and cannot control video animation generation with hand-drawn sketches. To address this gap, we propose VidSketch, the first method capable of generating high-quality video animations directly from any number of hand-drawn sketches and simple text prompts, bridging the divide between ordinary users and professional artists. Specifically, our method introduces a Level-Based Sketch Control Strategy to automatically adjust the guidance strength of sketches during the generation process, accommodating users with varying drawing skills. Furthermore, a TempSpatial Attention mechanism is designed to enhance the spatiotemporal consistency of generated video animations, significantly improving coherence across frames.

Hand-drawn Sketches for different categories

(Gallery of hand-drawn input sketches for different categories, e.g., cup and bell.)

🎞   VidSketch in different styles  🎞

(Example video animations in four styles: Surrealistic, Magical, Fantasy, and Realistic.)

How does it work?

Hand-drawn Sketch-Driven Video Generation

Pipeline of our VidSketch. During training, we use high-quality, small-scale video datasets categorized by type to train the Enhanced SparseCausal-Attention (SC-Attention) and Temporal Attention blocks, improving spatiotemporal consistency in video animations. During inference, users simply input a prompt and a sketch sequence to generate tailored high-quality animations. Specifically, the first frame is generated using T2I-Adapter, while the entire sketch sequence is processed by the Inflated T2I-Adapter to extract features, which are injected into the VDM's upsampling layers to guide video generation.
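To make the caption's inference flow concrete, here is a minimal, hypothetical Python sketch. Every interface used below (t2i_adapter, inflated_adapter, vdm and its attributes) is an illustrative stand-in, not the actual VidSketch API:

import torch

def generate_animation(prompt, sketches, t2i_adapter, inflated_adapter, vdm,
                       num_steps=50):
    """sketches: list of F sketch tensors; returns F decoded video frames."""
    # 1. Adapter features guiding the first frame (via T2I-Adapter).
    first_cond = t2i_adapter(sketches[0])

    # 2. The Inflated T2I-Adapter encodes the whole sketch sequence at once,
    #    yielding per-frame conditioning features.
    seq_cond = inflated_adapter(torch.stack(sketches))

    # 3. Standard denoising loop; adapter features are injected into the
    #    VDM's up-sampling layers at every step.
    latents = torch.randn(1, vdm.latent_channels, len(sketches),
                          vdm.latent_h, vdm.latent_w)
    for t in vdm.timesteps(num_steps):
        eps = vdm.unet(latents, t, prompt,
                       adapter_features=seq_cond, first_frame=first_cond)
        latents = vdm.step(eps, t, latents)
    return vdm.decode(latents)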


Our training approach adheres to the traditional VDM framework. First, we conducted an extensive search across the internet to collect high-quality training videos for each action category, gathering 8–12 videos per category. Subsequently, we trained the SparseCausal-Attention and Temporal Attention modules separately for each action category. This strategy effectively mitigates the scarcity of high-quality video data while enhancing the spatiotemporal consistency and quality of the generated videos.
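As a rough illustration of this per-category fine-tuning, the loop below assumes a diffusers-style video UNet and tags the SparseCausal- and Temporal-Attention blocks with the made-up module names "attn_sc" and "attn_temp"; only those blocks receive gradients while the rest of the network stays frozen:

import torch

def finetune_category(unet, scheduler, text_encoder, dataloader,
                      lr=1e-5, steps=500):
    # Freeze the whole UNet, then unfreeze only the two attention blocks.
    unet.requires_grad_(False)
    trainable = []
    for name, module in unet.named_modules():
        if "attn_sc" in name or "attn_temp" in name:
            module.requires_grad_(True)
            trainable += list(module.parameters())
    optimizer = torch.optim.AdamW(trainable, lr=lr)

    for _, (latents, prompt_ids) in zip(range(steps), dataloader):
        noise = torch.randn_like(latents)                    # (B, C, F, H, W)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noisy = scheduler.add_noise(latents, noise, t)
        cond = text_encoder(prompt_ids)[0]                   # text embeddings
        pred = unet(noisy, t, encoder_hidden_states=cond).sample
        loss = torch.nn.functional.mse_loss(pred, noise)     # ε-prediction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()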

Abstraction-Level Sketch Control Strategy

To accommodate the significant variation in users' drawing skills, we conduct a detailed quantitative analysis of the continuity, connectivity, and texture detail of sketch sequences to comprehensively evaluate their abstraction level. This enables us to dynamically adjust the control strength during video generation. The specific implementation details of the Abstraction-Level Sketch Control Strategy are illustrated in the picture below.


We perform a quantitative analysis of the connectivity, continuity, and texture details of sketches to automatically evaluate the abstraction level of hand-drawn sketch sequences. Sketches with varying levels of abstraction correspond to different generation control strengths, ensuring that VidSketch adapts to users of all drawing skill levels and thereby improving the method's generalization.
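For illustration only, the snippet below shows one plausible way to score connectivity, continuity, and texture detail with OpenCV and map the result to a guidance weight. The individual metrics, the equal-weight combination, and the w_min/w_max range are our own placeholder assumptions, not the paper's exact formulas:

import cv2
import numpy as np

def abstraction_level(sketch_gray: np.ndarray) -> float:
    """Score in [0, 1]; higher means a more abstract (rougher) sketch."""
    ink = (sketch_gray < 128).astype(np.uint8)               # binarize strokes

    # Connectivity: many disjoint stroke components -> fragmented sketch.
    n_labels, _ = cv2.connectedComponents(ink)
    connectivity = 1.0 / max(n_labels - 1, 1)                # 1.0 = one stroke

    # Continuity: morphological closing fills small gaps; if closing adds
    # many pixels, the strokes were broken.
    closed = cv2.morphologyEx(ink, cv2.MORPH_CLOSE,
                              np.ones((5, 5), np.uint8))
    continuity = ink.sum() / max(closed.sum(), 1)

    # Texture detail: fraction of the canvas covered by strokes
    # (x10 is a crude normalization, since sketches are mostly blank).
    detail = min(ink.mean() * 10.0, 1.0)

    # Low connectivity/continuity/detail -> high abstraction.
    return float(1.0 - (connectivity + continuity + detail) / 3.0)

def control_strength(sketch_gray, w_min=0.4, w_max=1.0):
    # Monotone mapping: the rougher the sketch, the weaker its guidance.
    return w_max - abstraction_level(sketch_gray) * (w_max - w_min)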

Enhanced SparseCausal-Attention mechanism

The primary distinction between video animation generation and image generation lies in the requirement to maintain spatiotemporal consistency across video frames. To address this inherent challenge, we propose an Enhanced SparseCausal-Attention mechanism. For each frame i in the video sequence, key/value (K/V) representations are extracted from both the initial frame and the preceding frame (i-1), while the query (Q) representation comes from the current frame i; attention is then computed between them.
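A minimal PyTorch sketch of this attention pattern (single head, simplified linear projections; not the official implementation) could look like:

import torch

class SparseCausalAttention(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        # x: (batch, frames, tokens, dim) latent features per frame.
        b, f, n, d = x.shape
        out = []
        for i in range(f):
            q = self.to_q(x[:, i])                           # query: frame i
            # Keys/values come from the first frame and frame i-1
            # (frame 0 simply attends to itself twice).
            kv = torch.cat([x[:, 0], x[:, max(i - 1, 0)]], dim=1)
            k, v = self.to_k(kv), self.to_v(kv)
            attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
            out.append(attn @ v)                             # (b, n, d)
        return torch.stack(out, dim=1)                       # (b, f, n, d)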


This mechanism effectively maintains inter-frame consistency under identical conditions, significantly enhancing the quality of the generated video animations and better satisfying the demand for high-quality video animation production.

BibTeX

@misc{jiang2025vidsketchhanddrawnsketchdrivenvideo,
      title={VidSketch: Hand-drawn Sketch-Driven Video Generation with Diffusion Control},
      author={Lifan Jiang and Shuang Chen and Boxi Wu and Xiaotong Guan and Jiahui Zhang},
      year={2025},
      eprint={2502.01101},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.01101},
}