# Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription

Antonio Ríos-Vila, Jorge Calvo-Zaragoza, Thierry Paquet

Pattern Recognition and Artificial Intelligence Group, University of Alicante, Spain
LITIS Laboratory - EA 4108, Rouen University, France

ICDAR 2024
The implementation of the model is available on GitHub.

Two datasets of polyphonic music scores were used in our experiments.
The GrandStaff dataset is a corpus of 53,882 printed images of single-line (or single-system) pianoform scores, along with their digital score encodings. It combines original works by six authors from the Humdrum repository with synthetic augmentations of the music encodings, which provide a greater variety of musical sequences and patterns. The dataset comes with an official partition in which the 7,000 original scores form the test set and the 46,882 samples generated through augmentation form the training set. An alternative version of the dataset introduces image distortions that make the scores resemble low-quality photocopies; from here on, this version is referred to as Camera GrandStaff.
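As an illustration of how the official partition might be consumed, the sketch below pairs each listed sample with its rendered image and **kern encoding. The partition file names (`train.txt`, `test.txt`), the file extensions, and the pairing logic are assumptions about the released layout, not the authors' actual loading code.

```python
# Minimal sketch of loading the GrandStaff official partition.
# File names and extensions below are assumptions about the released layout.
from pathlib import Path

def load_partition(root: Path, split: str) -> list[tuple[Path, Path]]:
    """Return (image, kern) path pairs listed in a partition file."""
    pairs = []
    with open(root / "partitions_grandstaff" / f"{split}.txt") as f:
        for line in f:
            sample = line.strip()
            if not sample:
                continue
            base = root / "grandstaff_dataset" / sample
            # Each sample is assumed to be stored as a rendered image
            # plus its **kern score encoding.
            pairs.append((base.with_suffix(".png"), base.with_suffix(".krn")))
    return pairs

train_pairs = load_partition(Path("data/GrandStaff"), "train")  # 46,882 samples
test_pairs = load_partition(Path("data/GrandStaff"), "test")    # 7,000 samples
```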
In this paper, we also introduce the Quartets dataset. Quartets is a well-known collection in the Audio-to-Score field. Since it provides Humdrum **kern transcriptions of the music excerpts, we produced a single-system transcription version of it. The pieces were randomly split into portions of approximately seven seconds, resulting in a total of 38,051 excerpts. These excerpts were rendered into printed music images with the Verovio tool. Once the images had been generated, we distorted them using the same operations as in Camera GrandStaff, plus an additional distortion that simulates old printed ink, with its bleeding and erasing errors. Each distorted image was finally fused with a texture drawn at random from a set of old-paper images. We followed the partitions provided in the original dataset: 18,162 samples for Haydn, 7,435 for Mozart, and 12,454 for Beethoven. Each corpus is divided into three piece-level splits, train (70%), validation (15%), and test (15%), which are combined to obtain the final partitions of the corpus.
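For intuition, here is a rough sketch of this kind of distortion pipeline using Pillow and NumPy. The specific operations, thresholds, and blending weights are illustrative assumptions, not the exact transformations used to build Camera GrandStaff or the distorted Quartets images.

```python
# Rough sketch of a score-degradation pipeline: photocopy-like distortion,
# simulated ink bleeding/erasing, and fusion with an old-paper texture.
# All parameters here are illustrative assumptions.
import random
import numpy as np
from PIL import Image, ImageFilter

def distort(score: Image.Image, textures: list[Image.Image]) -> Image.Image:
    img = score.convert("L")
    # Blur followed by a hard threshold roughly imitates ink bleeding
    # and erasing errors in old prints.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 1.5)))
    arr = np.asarray(img, dtype=np.float32)
    arr = np.where(arr < random.uniform(120, 180), 0.0, 255.0)
    # Fuse the degraded score with a randomly chosen old-paper texture.
    texture = random.choice(textures).convert("L").resize(score.size)
    tex = np.asarray(texture, dtype=np.float32)
    fused = 0.75 * arr + 0.25 * tex
    return Image.fromarray(fused.clip(0, 255).astype(np.uint8))
```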
You can download parts of the dataset or the entire dataset using the links below:
Once downloaded, place the data in the following directory structure:

```
├── data
│   ├── GrandStaff
│   │   ├── grandstaff_dataset
│   │   ├── partitions_grandstaff
│   ├── Quartets
│   │   ├── quartets_dataset
│   │   ├── partitions_quartets
```
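As a quick sanity check after downloading, a small script along these lines can verify the layout (it assumes you run it from the repository root):

```python
# Verify that the dataset directories match the tree above.
from pathlib import Path

expected = [
    "data/GrandStaff/grandstaff_dataset",
    "data/GrandStaff/partitions_grandstaff",
    "data/Quartets/quartets_dataset",
    "data/Quartets/partitions_quartets",
]
missing = [p for p in expected if not Path(p).is_dir()]
if missing:
    raise FileNotFoundError(f"Missing dataset directories: {missing}")
```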
If you have any questions or suggestions, please reach out to us at arios@dlsi.ua.es.