ICML 2026

Rethinking Visual Intelligence: Insights from Video Pretraining

We compare pretrained video diffusion models and large language models under a matched LoRA adaptation protocol for structured visual tasks.

Abstract

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.

Method

A matched adaptation protocol

Each task is an input-output grid pair. We use the same examples, train/test split, LoRA-only adaptation, frozen pretrained backbone, and exact-match evaluation for both model families. The VDM receives the pair as colored images in the visual domain, while the LLM receives the same grids as serialized arrays.

Explore Results

Main paper results

ARC-AGI & ConceptARC

ConceptARC results across visual concepts and model families.
ARC-AGI solved-task overlap between the two matched baselines.

Games

Accuracy as a function of training set size across visual game tasks.

Cellular automata

Training examples needed to reach threshold accuracy on Life-like rules.
Accuracy on Langton's Ant with prediction horizons from 2 to 10 steps.
Training examples needed to reach threshold accuracy on 1D ECA rules.

Citation

BibTeX

@inproceedings{visual_intelligence_video_pretraining_2026,
  title     = {Rethinking Visual Intelligence: Insights from Video Pretraining},
  booktitle = {International Conference on Machine Learning},
  year      = {2026}
}