How DINOv3 Reveals Brainlike Visual Representations in Space and Time
Study overview
Researchers compared internal activations of DINOv3, a self-supervised vision transformer trained on billions of natural images, with human brain responses to the same images. They combined high-resolution fMRI maps and fast MEG recordings to capture where and when the brain represents visual information. The goal was to test whether and how AI models recapitulate human visual processing.
Methods and experimental setup
The team trained multiple DINOv3 variants that differed in three controlled factors: model size, amount of training data, and the type of images used during training. They then measured alignment between model activations and human brain signals. fMRI provided spatial precision across early visual and higher-order cortical regions, while MEG supplied millisecond-scale timing information about the emergence of representations.
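Alignment of this kind is typically quantified with a linear encoding model, fitting a regularized regression from model activations to brain responses and scoring it on held-out images. The sketch below illustrates that logic with ridge regression; the arrays are synthetic stand-ins, and the study's actual preprocessing, cross-validation scheme, and regularization choices are not reproduced here.

```python
# Minimal sketch of a linear encoding analysis, assuming (as is common in this
# literature) ridge regression from model activations to voxel responses.
# All arrays are synthetic stand-ins for DINOv3 features and fMRI data.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_images, n_features, n_voxels = 1000, 768, 500
X = rng.standard_normal((n_images, n_features))              # model activations per image
W_true = 0.05 * rng.standard_normal((n_features, n_voxels))
Y = X @ W_true + rng.standard_normal((n_images, n_voxels))   # simulated voxel responses

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# One ridge weight vector per voxel, fit jointly (RidgeCV handles multi-output targets).
encoder = RidgeCV(alphas=np.logspace(-2, 4, 7))
encoder.fit(X_tr, Y_tr)
Y_pred = encoder.predict(X_te)

def pearson_per_column(a, b):
    # Pearson correlation between matching columns of two matrices.
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

# Encoding score: held-out correlation between predicted and measured responses per voxel.
voxel_r = pearson_per_column(Y_pred, Y_te)
print(f"median voxel r = {np.median(voxel_r):.3f}, max = {voxel_r.max():.3f}")
```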
Main findings on brain-model similarity
DINOv3 activations predicted fMRI responses across both early visual areas and higher-order cortex, with peak voxel correlations around R = 0.45. MEG analyses showed alignment beginning as early as 70 milliseconds after image onset and persisting up to three seconds. Early transformer layers aligned with primary visual areas such as V1 and V2, whereas deeper layers matched activity in higher-order and prefrontal regions.
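The layer-to-region correspondence described above can be summarized by asking, for each cortical region, which transformer block explains its responses best. The sketch below assumes a matrix of per-layer, per-region encoding scores has already been computed as in the previous example; the scores here are fabricated purely to illustrate the diagnostic.

```python
# Illustrative layer-to-region correspondence, using synthetic encoding scores.
import numpy as np

layers = [f"block_{i}" for i in range(12)]          # hypothetical 12-block ViT
regions = ["V1", "V2", "V4", "IT", "prefrontal"]

# scores[l, r]: held-out encoding accuracy of layer l for region r (synthetic:
# early regions peak at shallow depths, associative regions at deep ones).
rng = np.random.default_rng(1)
depth = np.linspace(0, 1, len(layers))[:, None]
preferred = np.linspace(0.1, 0.9, len(regions))[None, :]
scores = np.exp(-((depth - preferred) ** 2) / 0.05)
scores += 0.02 * rng.standard_normal(scores.shape)

# The summary plotted in this kind of study: the depth that best explains each region.
best_layer = scores.argmax(axis=0)
for region, idx in zip(regions, best_layer):
    print(f"{region:>11s}: best explained by {layers[idx]}")
```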
Training trajectories and developmental parallels
Tracking alignment through training revealed a clear developmental trajectory. Low-level visual correspondences emerged very early, after a small fraction of training. In contrast, higher-level alignments required exposure to billions of images. Temporal alignment appeared fastest, spatial alignment slower, and encoding similarity fell in between. These dynamics mirror human cortical development, where sensory areas mature earlier than associative cortices.
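One way to make such a trajectory concrete is to re-estimate alignment at a series of training checkpoints and record how many images each region needs before its alignment approaches its final value. The sketch below uses synthetic saturating curves; the checkpoint counts and saturation speeds are invented, and only the summary statistic follows the logic described in the text.

```python
# Sketch of summarizing a "developmental trajectory" across training checkpoints.
import numpy as np

images_seen = np.logspace(6, 10, 30)                       # 1e6 .. 1e10 images (hypothetical)
half_points = {"V1": 3e6, "V4": 1e8, "prefrontal": 3e9}    # synthetic saturation speeds

def alignment_curve(n, half):
    # Simple saturating curve standing in for alignment re-measured at each checkpoint.
    return n / (n + half)

for region, half in half_points.items():
    curve = alignment_curve(images_seen, half)
    # First checkpoint at which the region reaches 90% of its final alignment.
    reach_90 = images_seen[np.searchsorted(curve, 0.9 * curve[-1])]
    print(f"{region:>10s}: 90% of final alignment after ~{reach_90:.1e} images")
```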
Influence of model size, data quantity, and image ecology
Larger models yielded higher similarity scores overall, particularly in higher-order cortical regions. Longer training improved alignment across the board, with high-level representations benefitting most from extended exposure. The nature of training images mattered strongly: models trained on human-centered, ecologically relevant images showed the strongest convergence with brain activity. Models trained on satellite or cellular imagery drove partial alignment in early visual cortex but much weaker similarity in associative regions.
Links to cortical structure and function
The point in training at which model representations came to align with a given region correlated with that region's cortical properties. Regions that show greater developmental expansion, thicker cortex, or slower intrinsic timescales tended to align later in model training. Highly myelinated regions aligned earlier, consistent with their role in fast information processing. These correlations suggest that training dynamics in AI models can hint at biological organizing principles of the cortex.
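Analyses of this kind typically reduce to a rank correlation between a per-region emergence time and a per-region structural map. The sketch below shows that step with scipy's Spearman correlation; all values are synthetic, and the study's actual cortical maps (expansion, thickness, myelination, intrinsic timescales) are not reproduced here.

```python
# Minimal sketch: relate per-region emergence times to cortical properties.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
n_regions = 180
# Hypothetical emergence time per region (e.g., images seen to reach criterion alignment).
emergence_time = rng.lognormal(mean=0.0, sigma=1.0, size=n_regions)

# Synthetic structural maps, constructed so that more "associative" regions
# (larger expansion, lower myelination) emerge later, as described in the text.
expansion = emergence_time * (1 + 0.3 * rng.standard_normal(n_regions))
myelination = 1.0 / (emergence_time + 0.1) + 0.3 * rng.standard_normal(n_regions)

for name, values in [("developmental expansion", expansion), ("myelination", myelination)]:
    rho, p = spearmanr(emergence_time, values)
    print(f"emergence time vs {name:<24s}: rho = {rho:+.2f}, p = {p:.1e}")
```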
Conceptual implications
The results illuminate an interplay between built-in architectural priors and experience. DINOv3’s hierarchical design predisposes it to process visual features in stages, but full brainlike alignment only emerges with prolonged training on ecologically valid data. This balance echoes debates between nativist and empiricist views in cognitive science. Beyond the classical visual pathway, alignment in prefrontal and multimodal areas raises intriguing questions about whether self-supervised vision models capture higher-order features relevant to reasoning and decision making.
What this means for neuroscience and AI
DINOv3 and similar large self-supervised vision models serve as computational analogues for aspects of cortical development and organization. By manipulating model size, dataset, and training duration, researchers can test hypotheses about how experience and architecture interact to produce the brain’s layered representations of the visual world. The study suggests that ecologically relevant data and extended learning are essential for models to approximate the full richness of human visual processing.
For full technical details, see https://arxiv.org/pdf/2508.18226