Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet, their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally push them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input–output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.
In recent years, diffusion models have redefined the frontier of generative modeling, demonstrating exceptional performance in high-dimensional data domains such as images, audio, and video. Among these, Video Diffusion Models (VDMs) have emerged as particularly powerful, capable of producing temporally coherent, high-fidelity video sequences. Yet, their potential remains largely constrained to the domain of generation. This paper explores a broader hypothesis: that the training objectives of VDMs, rooted in modeling complex spatiotemporal dynamics, naturally endow them with internal representations that are useful far beyond synthesis.
We investigate the capacity of pre-trained VDMs to act as general-purpose visual learners when exposed to minimal supervision. Our key insight is that fine-tuning these models on short visual transitions can repurpose them for downstream tasks without disrupting their generative interface. To this end, we introduce a lightweight few-shot fine-tuning framework that applies LoRA adapters to a frozen VDM. This setup allows us to cast a wide variety of tasks as visual transformations, enabling the model to generalize from very limited task-specific data.
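To make the setup concrete, the sketch below shows one way LoRA adapters can be attached to a frozen video diffusion backbone in PyTorch. It is a minimal illustration under simplifying assumptions, not the authors' implementation: the LoRALinear module, the add_lora_adapters helper, the rank and alpha values, and the rule of adapting only attention projections are hypothetical choices.

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)


def add_lora_adapters(model: nn.Module, rank: int = 16) -> list[nn.Parameter]:
    """Freeze `model` and wrap its attention nn.Linear layers with LoRA adapters."""
    for p in model.parameters():
        p.requires_grad = False
    # Collect targets first so we do not traverse modules created during replacement.
    targets = [
        (name, child_name, child)
        for name, module in model.named_modules()
        for child_name, child in module.named_children()
        if isinstance(child, nn.Linear) and "attn" in f"{name}.{child_name}"  # hypothetical selection rule
    ]
    trainable = []
    for name, child_name, child in targets:
        parent = model.get_submodule(name) if name else model
        wrapped = LoRALinear(child, rank=rank)
        setattr(parent, child_name, wrapped)
        trainable += [wrapped.lora_A, wrapped.lora_B]
    return trainable

Only the returned LoRA parameters would be handed to the optimizer (for example, torch.optim.AdamW(add_lora_adapters(vdm), lr=1e-4)), so the pretrained backbone and its generative interface are left untouched.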
Proposed framework. Given a task encoded as input–target image pairs (dashed gray box), a transition video is constructed to transform the input into the target. We fine-tune LoRA adapters with the core model frozen. At inference, the model outputs a transition video from a new input, with the final frame used as the prediction.
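The sketch below illustrates the task-as-transition recipe from the caption, again as a hypothetical reading rather than the paper's exact procedure: make_transition_clip uses a simple linear cross-fade between input and target (the actual transition construction may differ), and generate_video stands in for whatever sampling interface the underlying VDM exposes.

import torch


def make_transition_clip(inp: torch.Tensor, tgt: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """Build a (T, C, H, W) clip that morphs the input image into the target image."""
    weights = torch.linspace(0.0, 1.0, num_frames).view(-1, 1, 1, 1)
    # Simple linear cross-fade; the paper's actual transition construction may differ.
    return (1.0 - weights) * inp.unsqueeze(0) + weights * tgt.unsqueeze(0)


@torch.no_grad()
def predict(vdm, new_input: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """At inference, sample a transition video from the new input and return its final frame."""
    video = vdm.generate_video(first_frame=new_input, num_frames=num_frames)  # hypothetical sampling API
    return video[-1]  # the last frame is taken as the task prediction

During fine-tuning, each few-shot input–target pair would be converted into such a clip and treated as an ordinary video-diffusion training sample, with gradients flowing only into the LoRA parameters from the previous sketch.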
We present qualitative results as videos (full generated transitions) and images (final generated frames).
Examples for Colorization after training with n=10 samples.
Each problem displays training examples (top row) and validation inputs with model predictions (bottom row).
Each concept contains several problems; each problem displays training examples (top row) and validation inputs with model predictions (bottom row).
Examples for Binary Semantic Segmentation after training with n=30 samples.
Examples for Pose Estimation after training with n=30 samples.
Examples for Mask2Image & Image2Mask after training with a single example (n=1).
Miscellaneous tasks learned from a single sample (n=1): Zoom, Vertical Flipping, Horizontal Flipping, and Edges.
If you find our work helpful, please use the following citation:
@misc{acuaviva2025gen2gen,
  title={From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models},
  author={Pablo Acuaviva and Aram Davtyan and Mariam Hassan and Sebastian Stapf and Ahmad Rahimi and Alexandre Alahi and Paolo Favaro},
  year={2025},
  eprint={2506.07280},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.07280},
}