
Sagalee: Automatic Speech Recognition Dataset for Oromo language

The Sagalee dataset is released under the CC BY-NC 4.0 International license; a summary of the license can be found here, and the full license can be found here.
The paper is now available on arXiv: Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language
The dataset is available at this link.

News

Training ASR on Sagalee Dataset

Clone this Repo

git clone https://github.com/turinaf/sagalee.git
cd sagalee
git submodule update --init --no-fetch

Create an environment and install dependencies

conda create -n wenet python=3.10
conda activate wenet
conda install conda-forge::sox
pip install torch==2.2.2+cu121 torchaudio==2.2.2+cu121 -f https://download.pytorch.org/whl/torch_stable.html
cd wenet
pip install -r requirements.txt

Training recipes

1 Prepare the data.

Running the script prepare_wenet_data.py prepares the data in the required format inside wenet/examples/sagalee/s0/data/. It organizes the wav files and transcripts into two files: wav.scp, containing two tab-separated columns (wav_id and wav_path), and text, containing two tab-separated columns (wav_id and text_label).

wav.scp file:

sagalee_SPKR232_122     sagalee/train/SPKR232/sagalee_SPKR232_122.wav
sagalee_SPKR232_002     sagalee/train/SPKR232/sagalee_SPKR232_002.wav

text file:

sagalee_SPKR232_082     HOJJATAA JIRA JECHUUN KOMATE
sagalee_SPKR232_093     SAMMUU KEE KEESSA HIN KAAYANI
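The preparation step can be sketched as follows. This is an illustrative version only, assuming each .wav file under sagalee/&lt;split&gt;/&lt;speaker&gt;/ has a matching .txt transcript next to it; the actual logic lives in prepare_wenet_data.py and may differ in detail.

```python
import os

def prepare_split(split_dir, out_dir):
    """Write wav.scp (wav_id\twav_path) and text (wav_id\ttext_label)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as scp, \
         open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as txt:
        for speaker in sorted(os.listdir(split_dir)):
            spk_dir = os.path.join(split_dir, speaker)
            if not os.path.isdir(spk_dir):
                continue
            for name in sorted(os.listdir(spk_dir)):
                if not name.endswith(".wav"):
                    continue
                wav_id = name[:-4]                 # e.g. sagalee_SPKR232_122
                wav_path = os.path.join(spk_dir, name)
                # Assumed transcript location: same basename with .txt extension.
                with open(wav_path[:-4] + ".txt", encoding="utf-8") as f:
                    label = f.read().strip().upper()
                scp.write(f"{wav_id}\t{wav_path}\n")
                txt.write(f"{wav_id}\t{label}\n")
```

Running this once per split (train/dev/test) would reproduce the two-column layout shown above.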

2 Run the training

After preparing the data, navigate to the directory containing run.sh and run the stages in order, starting from stage 1.

cd wenet/examples/sagalee/s0
bash run.sh --stage 1 --stop_stage 1
bash run.sh --stage 2 --stop_stage 2
bash run.sh --stage 3 --stop_stage 3
bash run.sh --stage 4 --stop_stage 4
bash run.sh --stage 5 --stop_stage 5
  • Stage 1: extracts global CMVN (cepstral mean and variance normalization) statistics. These statistics are used to normalize the acoustic features.
  • Stage 2: generates the label token dictionary.
  • Stage 3: generates the WeNet-required data.list file in JSON format.
  • Stage 4: training.
  • Stage 5: testing the trained model.
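To illustrate what the global CMVN statistics computed in stage 1 represent, here is a minimal sketch: a per-dimension mean and standard deviation accumulated over all training features, later applied to normalize each frame. WeNet's own implementation differs in detail (it works on the extracted fbank features and stores the statistics as JSON).

```python
import numpy as np

def compute_global_cmvn(feature_mats):
    """feature_mats: iterable of (num_frames, feat_dim) arrays."""
    total, total_sq, count = 0.0, 0.0, 0
    for feats in feature_mats:
        total = total + feats.sum(axis=0)
        total_sq = total_sq + (feats ** 2).sum(axis=0)
        count += feats.shape[0]
    mean = total / count
    var = total_sq / count - mean ** 2
    return mean, np.sqrt(np.maximum(var, 1e-8))

def apply_cmvn(feats, mean, std):
    # Normalize each frame to zero mean and unit variance per dimension.
    return (feats - mean) / std
```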

Finetuning Whisper model

  • finetune_whisper.py fine-tunes Whisper large-v3 (you can change the model size) on the Sagalee dataset by freezing the bottom layers of the encoder. Simply run:
python finetune_whisper.py
  • For full-parameter fine-tuning, follow the steps in the WeNet script.
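The layer-freezing idea can be sketched as below, assuming the Hugging Face transformers Whisper implementation; the repo's finetune_whisper.py may organize this differently. The demo instantiates a tiny randomly initialized model so the sketch runs offline — in practice you would load a pretrained checkpoint such as "openai/whisper-large-v3".

```python
from transformers import WhisperConfig, WhisperForConditionalGeneration

def freeze_bottom_encoder_layers(model, num_frozen):
    """Freeze the conv front-end and the first `num_frozen` encoder blocks."""
    encoder = model.model.encoder
    for module in (encoder.conv1, encoder.conv2, encoder.embed_positions):
        for p in module.parameters():
            p.requires_grad = False
    for layer in encoder.layers[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad = False

# Tiny config for offline demonstration; for real fine-tuning use
# WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3").
config = WhisperConfig(encoder_layers=4, decoder_layers=2, d_model=64,
                       encoder_attention_heads=2, decoder_attention_heads=2,
                       encoder_ffn_dim=128, decoder_ffn_dim=128)
model = WhisperForConditionalGeneration(config)
freeze_bottom_encoder_layers(model, num_frozen=2)
```

Freezing the bottom of the encoder keeps the generic acoustic representations intact while adapting the upper layers and decoder to Oromo speech, which is helpful when the fine-tuning data is relatively small.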

Citation

@misc{turi2025sagalee,
      title={Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language}, 
      author={Turi Abu and Ying Shi and Thomas Fang Zheng and Dong Wang},
      year={2025},
      eprint={2502.00421},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.00421}, 
}

Acknowledgement

The training code is adapted from WeNet and used to train models on our Sagalee dataset.
