MuSViT: A Foundation Vision Model for Sheet Music Representation

Abstract

Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score Vision Transformer): the first foundation vision model for sheet music representation — a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the International Music Score Library Project (IMSLP). To handle the complexity of real-world scores, we adopt a two-stage curriculum: a synthetic warm-up on typeset scores followed by large-scale training on the full IMSLP corpus. We evaluate MuSViT on four downstream tasks — full-page and staff-level music score recognition, music symbol detection, and score difficulty classification — under two scenarios: linear probing (frozen encoder) and fine-tuning. Under linear probing, MuSViT consistently outperforms modern vision encoders, revealing that general-purpose representations, regardless of scale, fall systematically short on the structured symbolic properties of musical notation. Under fine-tuning, MuSViT generally improves upon task-specific state-of-the-art methods. An additional embedding-transcription consistency analysis reveals that MuSViT encodes symbolic musical structure directly in its representation space — unlike other encoders, whose embeddings do not correlate with music notation content. These results establish MuSViT as a foundation backbone for sheet music understanding.

Overview

MuSViT is the first foundation vision model for sheet music. Its representations encode symbolic musical structure directly, unlike any general-purpose vision encoder tested.

Training pages (IMSLP): 9.7M
Distinct musical works: 400K
Parameters (encoder): 85M
Downstream tasks: 4

Architecture

How MuSViT works

MuSViT uses a ViT architecture with 12 Transformer layers. Input images are split into 16×16-pixel patches, each projected to a 768-dimensional embedding. 2D sinusoidal positional encodings explicitly capture vertical position, which is critical for reading staff lines and pitch.
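As a rough illustration of the positional encoding described above, the sketch below builds a fixed 2D sinusoidal table in plain Python. It assumes the common layout in which half of the 768 channels encode the patch row and the other half the patch column. The function names, the 10000 frequency base, and the example image size are assumptions for illustration and are not taken from the released MuSViT code.

```python
import math


def sincos_1d(pos, dim):
    """Sinusoidal encoding of one scalar position into `dim` values."""
    assert dim % 2 == 0
    half = dim // 2
    # Geometric frequency ladder (assumed base of 10000).
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    return [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]


def sincos_2d(grid_h, grid_w, dim=768):
    """One `dim`-sized vector per patch, in row-major order.

    The first half of each vector depends only on the patch row
    (vertical position); the second half depends only on the column.
    """
    assert dim % 4 == 0
    table = []
    for row in range(grid_h):
        for col in range(grid_w):
            table.append(sincos_1d(row, dim // 2) + sincos_1d(col, dim // 2))
    return table


def num_patches(height, width, patch=16):
    """Number of non-overlapping patch x patch tiles in an image."""
    return (height // patch) * (width // patch)


if __name__ == "__main__":
    # Hypothetical 224x224 input; the actual input resolution is not stated here.
    side = 224 // 16
    table = sincos_2d(side, side)
    print(len(table), len(table[0]))
```

Because the row half of the vector is shared by every patch on the same horizontal band, two patches at the same height on the page get identical vertical codes, which is the property that matters for staff lines.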

**Fig. Overview of MuSViT.** MuSViT is pre-trained on diverse sheet music pages using Masked Autoencoders: patches are randomly masked and the model learns to reconstruct the missing content from the remaining visible context. The encoder is then evaluated across four downstream tasks: full-page and staff-level music score recognition, music symbol detection, and score difficulty classification.
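The masking step of Masked Autoencoder pre-training can be sketched in a few lines. The example below is a minimal illustration, not the authors' training code: the 0.75 mask ratio is the default from the original MAE work and is only an assumption here, and `random_masking` and `masked_mse` are hypothetical helper names.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into a visible set and a masked set.

    mask_ratio=0.75 is an assumed value; this page does not state
    the ratio used for MuSViT.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_keep])  # fed to the encoder
    masked = sorted(order[num_keep:])   # reconstructed by the decoder
    return visible, masked


def masked_mse(pred, target, masked):
    """Mean squared error computed on masked patches only."""
    total, count = 0.0, 0
    for idx in masked:
        for p, t in zip(pred[idx], target[idx]):
            total += (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    visible, masked = random_masking(196, seed=0)
    print(len(visible), len(masked))
```

Only the visible patches pass through the encoder, so the encoder has to infer notation structure from partial context in order to make reconstruction possible.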

Evaluation

Four downstream tasks

Full-Page Music Score Recognition

Transcribes an entire score page into a symbol-level sequence in correct reading order. Tests fine-grained symbol detail and global spatial organization simultaneously.

Datasets: Mozarteum, Polish Digital Scores
Metrics: SER ↓
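SER is read here as Symbol Error Rate, the usual recognition metric in this literature: the edit distance between predicted and reference symbol sequences, normalized by the total reference length. The sketch below shows one way to compute it. The token names are made up for the example, and the exact tokenization and normalization used in the paper may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def symbol_error_rate(refs, hyps):
    """Total edits over total reference symbols, as a percentage."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_len


if __name__ == "__main__":
    # Hypothetical token vocabulary, for illustration only.
    ref = ["clef-G2", "note-C4_quarter", "note-D4_quarter", "barline"]
    hyp = ["clef-G2", "note-C4_quarter", "note-E4_quarter", "barline"]
    print(symbol_error_rate([ref], [hyp]))  # one substitution out of four symbols
```

Lower is better, which is what the ↓ marker next to the metric indicates.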

Staff-Level Music Score Recognition

Transcribes individually segmented staff images. Evaluated on 5 corpora spanning historical and modern notation, printed and handwritten.

Datasets: Capitan, Guatemala, Il Lauro Secco, AMDC, FMT
Metrics: SER ↓

Music Symbol Detection

Localizes and classifies all individual notation elements (noteheads, stems, accidentals, rests, clefs…) producing bounding boxes with class labels.

Dataset: DeepScoresV2
Metrics: mAP, w-mAP ↑
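Detection metrics such as mAP rest on matching predicted boxes to ground-truth boxes by intersection over union (IoU). The sketch below shows that matching step only, under assumed conventions: boxes are `(x1, y1, x2, y2)` tuples, the IoU threshold is 0.5, and each ground-truth box can be matched once. It does not reproduce the DeepScoresV2 evaluation protocol or the class weighting behind w-mAP.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, thr=0.5):
    """Flag each prediction as true positive (True) or false positive (False).

    preds: list of (box, label, score); gts: list of (box, label).
    Predictions are visited in descending score order, and each
    ground-truth box can be claimed by at most one prediction.
    """
    used = set()
    flags = []
    for box, label, _score in sorted(preds, key=lambda p: -p[2]):
        best, best_iou = None, thr
        for k, (gbox, glabel) in enumerate(gts):
            if k in used or glabel != label:
                continue
            overlap = iou(box, gbox)
            if overlap >= best_iou:
                best, best_iou = k, overlap
        if best is not None:
            used.add(best)
        flags.append(best is not None)
    return flags
```

Average precision for a class is then derived from these flags by sweeping the score threshold, and mAP averages the result over classes.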

Score Difficulty Classification

Estimates performance difficulty directly from score images, bypassing symbolic transcription. Requires holistic page-level representations.

Datasets: FreeScores, Can I Play It?, PianoStreet
Metrics: Acc0, Acc1 ↑
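Acc0 and Acc1 are not defined on this page. A common reading in the difficulty-classification literature, assumed here, is that Acc0 is exact-match accuracy and Acc1 counts a prediction as correct when it falls within one difficulty level of the label. Under that assumption both can be computed with one helper (the name `accuracy_within` is hypothetical):

```python
def accuracy_within(y_true, y_pred, tolerance=0):
    """Fraction of predictions within `tolerance` ordinal levels of the label.

    tolerance=0 gives exact accuracy (assumed to correspond to Acc0);
    tolerance=1 also accepts off-by-one predictions (assumed Acc1).
    """
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


if __name__ == "__main__":
    labels = [0, 1, 2, 3]
    preds = [0, 2, 2, 0]
    print(accuracy_within(labels, preds, 0), accuracy_within(labels, preds, 1))
```

The tolerant variant is useful because difficulty levels are ordinal: predicting level 2 for a level-1 piece is a smaller mistake than predicting level 8.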

Contributions

Summary of contributions

MuSViT, the first vision foundation model for sheet music, pre-trained via MAE on 9.7M IMSLP pages through a two-stage synthetic-to-real curriculum. Model weights, pre-training code, and evaluation scripts are publicly released.
Comprehensive evaluation across four representative downstream tasks under linear probing and fine-tuning. MuSViT outperforms all general-purpose encoders under linear probing, and surpasses task-specific SoTA on three of four tasks under fine-tuning.
Embedding-transcription consistency analysis providing direct evidence that MuSViT representations correlate with symbolic musical content (ρ > 0.6), while all general-purpose encoders yield anti-correlated embeddings.
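One plausible way to run an embedding-transcription consistency check is sketched below. It is an assumption about the general idea, not the paper's protocol: for every pair of samples it compares the cosine distance between embeddings with the edit distance between transcriptions, then reports the Spearman rank correlation ρ between the two lists. All function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


def consistency(embeddings, transcriptions):
    """Rank correlation between pairwise embedding and symbol distances."""
    emb_d, sym_d = [], []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            emb_d.append(cosine_distance(embeddings[i], embeddings[j]))
            sym_d.append(edit_distance(transcriptions[i], transcriptions[j]))
    return spearman(emb_d, sym_d)
```

Under this reading, a positive ρ means that images with similar notation sit close together in embedding space, while a negative ρ means the embedding geometry runs against symbolic similarity.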

Citation

If you use MuSViT, please cite the ECCV 2026 paper.

@inproceedings{penarrubia2026musvit,
  title     = {MuSViT: A Foundation Vision Model for Sheet Music Representation},
  author    = {Penarrubia, Carlos and Rios-Vila, Antonio and Fuentes-Martinez, Eliseo 
              and Martinez-Sevilla, Juan C. and Castellanos, Francisco J. and 
              Alfaro-Contreras, Maria and Calvo-Zaragoza, Jorge},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

Acknowledgements

The authors gratefully acknowledge Edward Guo, on behalf of IMSLP/Petrucci Music Library, for providing access to the data used to train the models.

This publication is part of the LEMUR project PID2023-148259NB-I00, funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU. The first author is supported by the University of Alicante through the FPU Program (UAFPU22-19). The third author is supported by a predoctoral contract associated with the LEMUR project. The fourth author is supported by a predoctoral contract from grant CISEJI/2023/9 "Programa para el apoyo a personas investigadoras con talento (Plan GenT) de la Generalitat Valenciana".

License

MuSViT: Foundation Model for Sheet Music Representation © 2026 by Carlos Penarrubia, Antonio Rios-Vila, Eliseo Fuentes-Martinez, Juan C. Martinez-Sevilla, Francisco J. Castellanos, María Alfaro-Contreras, Jorge Calvo-Zaragoza is licensed under CC BY-NC-SA 4.0.