I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition
This repository is the official implementation of 'I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition', accepted to ISMIR 2024.
In this paper, we try to pinpoint specific problems that arise in music-text multimodal systems and to estimate whether those problems can be attributed mainly to the audio branch or to the text branch.
Two-tower systems use a separate encoder for each modality to obtain vector representations (embeddings). For music-text multimodal systems, this means a music encoder and a text encoder (e.g. HTS-AT and RoBERTa for LAION-CLAP). The embeddings from the two modalities (one for the chunk of audio and one for the sentence) are not directly comparable, as they are of different dimensionality. To enable comparisons between them, we need to either:
- Map the audio embedding to the text embedding space (change audio embedding dimensionality to text embedding dimensionality)
- Map the text embedding to the audio embedding space (change text embedding dimensionality to audio embedding dimensionality)
- Map both of them to a separate, shared space (our case). These systems are usually referred to as joint audio-text models.
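The third option can be illustrated with a minimal sketch. It assumes a single linear projection per modality, and toy dimensions and random weights chosen purely for illustration; the names `audio_head` and `text_head` are hypothetical and do not come from this repository, and real models may use deeper projection heads.

```python
import math
import random


def linear_project(vec, weights):
    """Apply a (d_out x d_in) weight matrix to a d_in-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
AUDIO_DIM, TEXT_DIM, JOINT_DIM = 6, 4, 3  # toy sizes, assumed for the example


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical projection heads, one per modality.
audio_head = random_matrix(JOINT_DIM, AUDIO_DIM)
text_head = random_matrix(JOINT_DIM, TEXT_DIM)

# Stand-ins for encoder outputs; they differ in size and cannot be compared directly.
audio_embedding = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]
text_embedding = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]

# After projection both vectors live in the same JOINT_DIM-dimensional space.
audio_joint = linear_project(audio_embedding, audio_head)
text_joint = linear_project(text_embedding, text_head)
similarity = cosine_similarity(audio_joint, text_joint)
```

With untrained random weights the similarity value carries no meaning; the point is only that the comparison becomes well defined once both embeddings share a space.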
To do this, we need pairs of sentences and audio, and we then force the embeddings of each pair to be mapped close to each other. This is done via contrastive learning, which pulls the embeddings of these pairs together while pushing apart the embeddings of every other combination of audio and caption in the batch. These other combinations are not part of the dataset and are referred to as negative pairs.
We perform several tests on the embeddings, both as obtained straight from the encoders and after they have been mapped to the joint space.
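As a rough illustration of this objective, the sketch below computes a symmetric cross-entropy (InfoNCE-style) loss over a batch of paired embeddings. The temperature of 0.07 and the function name `contrastive_loss` are assumptions made for the example; the exact loss and hyper-parameters used by MusCALL or LAION-CLAP are not specified here.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]


def contrastive_loss(audio, text, temperature=0.07):
    """Symmetric contrastive loss.

    audio, text: lists of L2-normalised vectors of equal length;
    audio[i] and text[i] form a positive pair, every other
    combination in the batch is treated as a negative pair.
    """
    n = len(audio)
    logits = [
        [sum(a * t for a, t in zip(audio[i], text[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: each row should peak on its diagonal entry.
    audio_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same idea applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_audio = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (audio_to_text + text_to_audio)


# Toy batch: perfectly aligned pairs versus pairs matched the wrong way round.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(aligned, aligned)
loss_swapped = contrastive_loss(aligned, swapped)
```

The loss is small when each audio embedding is most similar to its own caption and large when it is more similar to another caption in the batch.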
We trained MusCALL using the repository (ADD ILLARIAS REPOSITORY) with the default hyper-parameters. As a dataset, we used LPMusicCaps-MTT (ADD CITATION AND LINK). If you need the model, please contact the author (see email below).
In addition, we downloaded and used the LAION-CLAP models provided in (ADD LINK TO LAION CLAP).
We used Doktorski similarity to obtain triplets (anchor, positive, negative) of musical instrument terminology and checked whether our models could successfully determine that the anchor is closer to the positive than to the negative.
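The triplet check can be sketched as follows. This is an illustrative version that assumes cosine similarity as the distance measure; the embeddings and triplets are toy values invented for the example (they are not taken from the Doktorski ontology or from any model), and `triplet_accuracy` is a hypothetical helper rather than a function from this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_accuracy(triplets, embed):
    """Fraction of triplets where the anchor is closer to the positive.

    triplets: iterable of (anchor, positive, negative) terms.
    embed: callable mapping a term to its embedding vector.
    """
    triplets = list(triplets)
    correct = 0
    for anchor, positive, negative in triplets:
        a, p, n = embed(anchor), embed(positive), embed(negative)
        if cosine_similarity(a, p) > cosine_similarity(a, n):
            correct += 1
    return correct / len(triplets)


# Toy 2-D embeddings standing in for text-encoder outputs.
toy_embeddings = {
    "violin": [1.0, 0.1],
    "viola": [0.9, 0.2],
    "trumpet": [0.1, 1.0],
    "trombone": [0.2, 0.9],
}

toy_triplets = [
    ("violin", "viola", "trumpet"),
    ("trumpet", "trombone", "violin"),
    # Deliberately mismatched triplet, to show what a failure looks like.
    ("violin", "trumpet", "viola"),
]

accuracy = triplet_accuracy(toy_triplets, toy_embeddings.__getitem__)
```

In this toy setup two of the three triplets are satisfied, so the accuracy is 2/3.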
conda create --name <env> --file requirements.txt
Download the following checkpoints:
- music_audioset_epoch_15_esc_90.14.pt
- music_speech_audioset_epoch_15_esc_89.98.pt
- MusCALL trained on LPMusicCaps-MTT
and move them into the checkpoints folder.
python evaluation_set_generation.py --evaluation_type <type>
Set <type> to all to use the full Doktorski ontology, or to tinysol to use only the TinySOL terms within the Doktorski similarity.
Download the TinySOL dataset and move every audio file (not the folders) into the TinySol folder. Then generate all the embeddings and save them as .npy files using:
python embeddings_generations.py
Evaluate the embeddings obtained from the models using model_evaluation.py.
There are 4 choices:
- Pre-joint Doktorski similarity - pre_joint_doktorski_similarity
- Joint Doktorski similarity - joint_doktorski_similarity
- Zeroshot with different prompts - zeroshot_baseline
- Misc. Negative sensitivity checking (wasn't included in the paper) - positive_negative_pronmpts
Pass the identifier at the end of the chosen list item as <flag> in:
python model_evaluation.py --experiment <flag>
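For readers unfamiliar with zero-shot classification in a joint space, here is a minimal sketch of the general idea. It assumes the label is chosen by highest cosine similarity between the audio embedding and the embedding of a text prompt per class; the prompt template, the toy embeddings, and the helper `zero_shot_predict` are all hypothetical and are not taken from model_evaluation.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(audio_embedding, label_embeddings):
    """Return the label whose prompt embedding is most similar to the audio."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(audio_embedding, label_embeddings[label]),
    )


# Hypothetical prompt template; different templates give different text embeddings.
prompt_template = "This is a recording of a {}"
labels = ["flute", "cello"]
prompts = {label: prompt_template.format(label) for label in labels}

# Toy joint-space embeddings standing in for the encoded prompts and an audio clip.
toy_text_embeddings = {
    "flute": [0.9, 0.1, 0.0],
    "cello": [0.0, 0.2, 0.9],
}
toy_audio_embedding = [0.1, 0.1, 0.8]

prediction = zero_shot_predict(toy_audio_embedding, toy_text_embeddings)
```

Because only the text side changes when the prompt wording changes, comparing predictions across templates is one way to probe how sensitive the text branch is.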
For any questions, or to request access to a trained MusCALL model, send an email to [email protected]