unilight/sheet

🗣️ SHEET / MOS-Bench 🎧

Manipulate MOS-Bench with SHEET

MOS-Bench is a benchmark for evaluating the generalization abilities of subjective speech quality assessment (SSQA) models. SHEET stands for the Speech Human Evaluation Estimation Toolkit; it was designed for conducting research experiments with MOS-Bench.

Key Features

  • MOS-Bench is the first large-scale collection of training and testing datasets for SSQA, covering a wide range of domains, including synthetic speech from text-to-speech (TTS), voice conversion (VC), and singing voice synthesis (SVS) systems, and distorted speech with artificial and real noise, clipping, transmission, reverb, etc. Researchers can use the test sets to benchmark their SSQA models.
  • This repository aims to provide training recipes. While there are many off-the-shelf speech quality evaluators like DNSMOS, SpeechMOS and speechmetrics, most of them do not provide training recipes and are therefore not research-oriented. Newcomers may use this repo as a starting point for SSQA research.

MOS-Bench Overview

MOS-Bench currently contains 7 training sets and 12 test sets. Below is a screenshot of a summary table from our paper. For more details, please see our paper or egs/README.md.

[Figure: summary table of the MOS-Bench training and test sets, from the paper]

Supported models and features

Models
  • LDNet
  • SSL-MOS
  • UTMOS (Strong learner)
    • Original repo link: https://github.com/sarulab-speech/UTMOS22/tree/master/strong
    • Paper link: [arXiv]
    • Example config: egs/bvcc/conf/utmos-strong.yaml
    • Notes: After discussion with Takaaki, the first author of UTMOS, we concluded that UTMOS = SSL-MOS + listener modeling + contrastive loss + several model architecture and training differences. Takaaki also felt that using phoneme and reference inputs is not really helpful for UTMOS strong alone. Therefore, we did not implement every component of UTMOS strong. For instance, we did not use domain ID or data augmentation.
  • Modified AlignNet
Features
  • Modeling
    • Listener modeling
    • Self-supervised learning (SSL) based encoder, supported by S3PRL
      • Find the complete list of supported SSL models here.
  • Training
    • Automatic best-n model saving and early stopping based on a given validation criterion
    • Visualization, including the predicted score distribution and scatter plots of utterance-level and system-level scores
    • Model averaging
    • Model ensembling by stacking
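
To make the "best-n model saving and early stopping" feature above more concrete, here is a minimal, framework-free sketch of the general idea. It is not SHEET's actual implementation: the class name `BestNTracker`, the `patience` parameter, and the toy validation scores are all assumptions made for illustration.

```python
class BestNTracker:
    """Hypothetical helper: remember the n best checkpoints and signal early stopping."""

    def __init__(self, n_best=3, patience=5, higher_is_better=True):
        self.n_best = n_best
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.best = []  # (score, step) pairs, best first
        self.evals_since_improvement = 0

    def update(self, step, score):
        """Record one validation result; return True if training should stop."""
        if not self.best:
            improved = True
        elif self.higher_is_better:
            improved = score > self.best[0][0]
        else:
            improved = score < self.best[0][0]

        self.best.append((score, step))
        self.best.sort(key=lambda item: item[0], reverse=self.higher_is_better)
        del self.best[self.n_best:]  # forget checkpoints outside the top n

        # Count how many validations passed without a new best score.
        self.evals_since_improvement = 0 if improved else self.evals_since_improvement + 1
        return self.evals_since_improvement >= self.patience


# Toy validation curve (assumed values): (training step, validation score).
history = [(1000, 0.70), (2000, 0.75), (3000, 0.74), (4000, 0.73), (5000, 0.72)]
tracker = BestNTracker(n_best=2, patience=2)
stopped_at = None
for step, score in history:
    if tracker.update(step, score):
        stopped_at = step
        break
print(stopped_at, tracker.best)
```

In a real training loop, the steps kept in `tracker.best` would correspond to the checkpoint files to retain, and these are also natural candidates for model averaging.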

Usage

I am new to MOS prediction research. I want to train models!

You are in the right place! This is the main purpose of SHEET.

We provide complete experiment recipes, i.e., sets of scripts to download and process the datasets and to train and evaluate models. This structure originated from Kaldi and is also used in many speech processing repositories (ESPnet, ParallelWaveGAN, etc.).

Please follow the installation instructions first, then see egs/README.md for how to start.

I already have my MOS predictor. I just want to do benchmarking!

We provide scripts to collect the test sets conveniently. These scripts can be run on Linux-like platforms with only basic Python requirements, so you do not need to install heavy packages like PyTorch.

Please see the related section in egs/README.md for detailed instructions.
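
Once you have your predictor's scores for a test set, benchmarking usually comes down to comparing them with the ground-truth MOS at the utterance level and at the system level (where scores are first averaged per system). The sketch below illustrates this with metrics commonly reported in SSQA work (MSE, linear correlation, Spearman rank correlation), using only the standard library. The function names and toy data are assumptions for illustration, not the interface of the scripts in egs/.

```python
from collections import defaultdict
from math import sqrt


def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def pearson(xs, ys):
    """Linear correlation coefficient (LCC)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(ranks(xs), ranks(ys))


def system_means(system_ids, scores):
    """Average utterance scores per system."""
    buckets = defaultdict(list)
    for sys_id, score in zip(system_ids, scores):
        buckets[sys_id].append(score)
    return {sys_id: sum(v) / len(v) for sys_id, v in buckets.items()}


def evaluate(system_ids, true_scores, pred_scores):
    true_sys = system_means(system_ids, true_scores)
    pred_sys = system_means(system_ids, pred_scores)
    systems = sorted(true_sys)  # fixed order so both lists stay aligned
    t = [true_sys[s] for s in systems]
    p = [pred_sys[s] for s in systems]

    def summary(a, b):
        return {"MSE": mse(a, b), "LCC": pearson(a, b), "SRCC": spearman(a, b)}

    return {"utt": summary(true_scores, pred_scores), "sys": summary(t, p)}


# Toy data (assumed): three systems with two utterances each.
system_ids = ["A", "A", "B", "B", "C", "C"]
true_scores = [2.0, 3.0, 3.5, 4.5, 1.0, 2.0]
pred_scores = [2.5, 2.5, 4.0, 4.0, 1.5, 1.5]
report = evaluate(system_ids, true_scores, pred_scores)
print(report)
```

In this toy example the predictor is off by 0.5 on every utterance yet matches every system mean exactly, which shows why utterance-level and system-level results are reported separately.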

I just want to use your trained MOS predictor!

We utilize torch.hub to provide a convenient way to load pre-trained SSQA models and predict scores of wav files or torch tensors.

# load pre-trained model
>>> import torch
>>> predictor = torch.hub.load("unilight/sheet:v0.1.0", "default", trust_repo=True, force_reload=True)
# if you want to use cuda
>>> predictor.model.cuda()

# you can either provide a path to your wav file
>>> predictor.predict(wav_path="/path/to/wav/file.wav")
3.6066928

# or provide a torch tensor with shape [num_samples]
>>> predictor.predict(wav=torch.rand(16000))
1.5806346
# if you put the model on cuda...
>>> predictor.predict(wav=torch.rand(16000).cuda())
1.5806346
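
To score a whole folder of recordings, the `predict(wav_path=...)` call shown above can be wrapped in a small loop. The helper below is a hypothetical convenience function, not part of SHEET. It accepts any callable with a `wav_path` keyword argument, so it can be tried without downloading the model, and it assumes the returned score can be converted with `float()`.

```python
from pathlib import Path


def score_directory(predict_fn, wav_dir, pattern="*.wav"):
    """Return {file name: predicted score} for every matching file in wav_dir.

    predict_fn is any callable accepting a wav_path keyword argument,
    e.g. the predict method of the predictor loaded via torch.hub.
    """
    scores = {}
    for wav_path in sorted(Path(wav_dir).glob(pattern)):
        scores[wav_path.name] = float(predict_fn(wav_path=str(wav_path)))
    return scores


# Intended use with the real predictor (needs torch and network access):
#   import torch
#   predictor = torch.hub.load("unilight/sheet:v0.1.0", "default", trust_repo=True)
#   scores = score_directory(predictor.predict, "/path/to/wav/dir")
```

Passing the bound method `predictor.predict` keeps the helper independent of how the model was loaded or which device it is on.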

Or you can try out our HuggingFace Spaces Demo!

Installation

Editable installation with virtualenv

You don't need to prepare an environment (using conda, etc.) first. The following commands will automatically construct a virtual environment in tools/. When you run the recipes, the scripts will automatically activate the virtual environment.

git clone https://github.com/unilight/sheet.git
cd sheet/tools
make

Information

Citation

If you use the training scripts, benchmarking scripts or pre-trained models from this project, please consider citing the following paper.

@article{huang2024,
      title={MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models}, 
      author={Wen-Chin Huang and Erica Cooper and Tomoki Toda},
      year={2024},
      eprint={2411.03715},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2411.03715}, 
}

Acknowledgements

This repo is greatly inspired by the following repos; in fact, many code snippets are taken directly from them.

Author

Wen-Chin Huang
Toda Laboratory, Nagoya University
E-mail: [email protected]
