Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet, their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally push them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input–output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.
In recent years, diffusion models have redefined the frontier of generative modeling, demonstrating exceptional performance in high-dimensional data domains such as images, audio, and video. Among these, Video Diffusion Models (VDMs) have emerged as particularly powerful, capable of producing temporally coherent, high-fidelity video sequences. Yet, their potential remains largely constrained to the domain of generation. This paper explores a broader hypothesis: that the training objectives of VDMs, rooted in modeling complex spatiotemporal dynamics, naturally endow them with internal representations that are useful far beyond synthesis.
We investigate the capacity of pre-trained VDMs to act as general-purpose visual learners when exposed to minimal supervision. Our key insight is that fine-tuning these models on short visual transitions can repurpose them for downstream tasks without disrupting their generative interface. To this end, we introduce a lightweight few-shot fine-tuning framework that applies LoRA adapters to a frozen VDM. This setup allows us to cast a wide variety of tasks as visual transformations, enabling the model to generalize from very limited task-specific data.
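To make the setup concrete, the sketch below shows one way LoRA adapters can be attached to a frozen video diffusion backbone in PyTorch. It is a minimal illustration under simplifying assumptions, not the authors' implementation: the LoRALinear module, the add_lora_adapters helper, the rank and alpha values, and the rule of adapting only attention projections are hypothetical choices.

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)


def add_lora_adapters(model: nn.Module, rank: int = 16) -> list[nn.Parameter]:
    """Freeze `model` and wrap its attention nn.Linear layers with LoRA adapters."""
    for p in model.parameters():
        p.requires_grad = False
    # Collect targets first so we do not traverse modules created during replacement.
    targets = [
        (name, child_name, child)
        for name, module in model.named_modules()
        for child_name, child in module.named_children()
        if isinstance(child, nn.Linear) and "attn" in f"{name}.{child_name}"  # hypothetical selection rule
    ]
    trainable = []
    for name, child_name, child in targets:
        parent = model.get_submodule(name) if name else model
        wrapped = LoRALinear(child, rank=rank)
        setattr(parent, child_name, wrapped)
        trainable += [wrapped.lora_A, wrapped.lora_B]
    return trainable

Only the returned LoRA parameters would be handed to the optimizer (for example, torch.optim.AdamW(add_lora_adapters(vdm), lr=1e-4)), so the pretrained backbone and its generative interface are left untouched.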
Proposed framework. Given a task encoded as input–target image pairs (dashed gray box), a transition video is constructed to transform the input into the target. We fine-tune LoRA adapters with the core model frozen. At inference, the model outputs a transition video from a new input, with the final frame used as the prediction.
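The sketch below illustrates the task-as-transition recipe from the caption, again as a hypothetical reading rather than the paper's exact procedure: make_transition_clip uses a simple linear cross-fade between input and target (the actual transition construction may differ), and generate_video stands in for whatever sampling interface the underlying VDM exposes.

import torch


def make_transition_clip(inp: torch.Tensor, tgt: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """Build a (T, C, H, W) clip that morphs the input image into the target image."""
    weights = torch.linspace(0.0, 1.0, num_frames).view(-1, 1, 1, 1)
    # Simple linear cross-fade; the paper's actual transition construction may differ.
    return (1.0 - weights) * inp.unsqueeze(0) + weights * tgt.unsqueeze(0)


@torch.no_grad()
def predict(vdm, new_input: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """At inference, sample a transition video from the new input and return its final frame."""
    video = vdm.generate_video(first_frame=new_input, num_frames=num_frames)  # hypothetical sampling API
    return video[-1]  # the last frame is taken as the task prediction

During fine-tuning, each few-shot input–target pair would be converted into such a clip and treated as an ordinary video-diffusion training sample, with gradients flowing only into the LoRA parameters from the previous sketch.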
We present qualitative results as videos (full generated transitions) and images (final generated frames).
Examples for Colorization after training with n=10 samples.
Each problem displays training examples (top row) and validation inputs with model predictions (bottom row).
Each concept contains several problems; each problem displays training examples (top row) and validation inputs with model predictions (bottom row).
Examples for Binary Semantic Segmentation after training with n=30 samples.
Examples for Pose Estimation after training with n=30 samples.
Examples for Mask2Image & Image2Mask after training with a single example (n=1).
Miscellaneous tasks learned from a single sample (n=1): Zoom, Vertical Flipping, Horizontal Flipping, and Edges.
If you find our work helpful, please use the following citation:
@misc{acuaviva2025gen2gen,
  title={From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models},
  author={Pablo Acuaviva and Aram Davtyan and Mariam Hassan and Sebastian Stapf and Ahmad Rahimi and Alexandre Alahi and Paolo Favaro},
  year={2025},
  eprint={2506.07280},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.07280},
}