The Sagalee dataset is released under the CC BY-NC 4.0 International license; a summary of the license can be found here, and the full license can be found here.
The paper is now available on arXiv: Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language.
The dataset is available at this link.
- 🎉 [2024-12-20] Sagalee paper accepted to the ICASSP 2025 Conference
- ✨ [2024-11-28] Sagalee dataset released under CC BY-NC 4.0 International license.
```bash
git clone https://github.com/turinaf/sagalee.git
cd sagalee
git submodule update --init --no-fetch
```
```bash
conda create -n wenet python=3.10
conda activate wenet
conda install conda-forge::sox
pip install torch==2.2.2+cu121 torchaudio==2.2.2+cu121 -f https://download.pytorch.org/whl/torch_stable.html
cd wenet
pip install -r requirements.txt
```
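Before preparing data, it can help to confirm the environment is complete. The sketch below is a hypothetical helper (not part of the repo) that checks whether the required Python modules and command-line tools are importable/on `PATH`:

```python
import importlib.util
import shutil

def check_environment(modules=("torch", "torchaudio"), tools=("sox", "git")):
    """Report which required Python modules and CLI tools are available.

    Returns a dict mapping each name to True (found) or False (missing).
    """
    status = {m: importlib.util.find_spec(m) is not None for m in modules}
    status.update({t: shutil.which(t) is not None for t in tools})
    return status

if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Anything reported `MISSING` points back to the corresponding install step above.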
Running the script `prepare_wenet_data.py` will prepare the data in the required format inside `wenet/examples/sagalee/s0/data/`. It organizes the wav files and text files into two files: `wav.scp`, containing two tab-separated columns with `wav_id` and `wav_path`, and `text`, containing two tab-separated columns with `wav_id` and `text_label`.
Example `wav.scp` file:

```
sagalee_SPKR232_122 sagalee/train/SPKR232/sagalee_SPKR232_122.wav
sagalee_SPKR232_002 sagalee/train/SPKR232/sagalee_SPKR232_002.wav
```
Example `text` file:

```
sagalee_SPKR232_082 HOJJATAA JIRA JECHUUN KOMATE
sagalee_SPKR232_093 SAMMUU KEE KEESSA HIN KAAYANI
```
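As a minimal sketch of the two-column layout above (`load_two_column` is a hypothetical helper, not part of the repo), both files can be parsed into dicts keyed by utterance ID:

```python
from pathlib import Path

def load_two_column(path):
    """Parse a WeNet-style two-column file (wav.scp or text).

    Each line is `<utt_id><whitespace><value>`; the value may itself
    contain spaces (e.g. a transcript), so we split only on the first
    whitespace run, which also tolerates tab or space separators.
    """
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        utt_id, value = line.split(maxsplit=1)
        table[utt_id] = value
    return table
```

Pairing the two dicts on their shared keys yields (audio path, transcript) pairs per utterance.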
After preparing the data, navigate to the directory containing `run.sh` and run the stages starting from stage 1.
```bash
cd wenet/examples/sagalee/s0
bash run.sh --stage 1 --stop_stage 1
bash run.sh --stage 2 --stop_stage 2
bash run.sh --stage 3 --stop_stage 3
bash run.sh --stage 4 --stop_stage 4
bash run.sh --stage 5 --stop_stage 5
```
- Stage 1: extracts global CMVN (cepstral mean and variance normalization) statistics. These statistics are used to normalize the acoustic features.
- Stage 2: generates the label token dictionary.
- Stage 3: generates the WeNet-required `data.list` file in JSON format.
- Stage 4: training.
- Stage 5: testing the trained model.
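To illustrate what stage 1 computes, here is a minimal pure-Python sketch of global CMVN. The real implementation in WeNet operates on extracted acoustic features over the whole training set; the two-dimensional "feature" values below are placeholders:

```python
import math

def global_cmvn(frames):
    """Compute per-dimension global mean and standard deviation over all
    frames, and return a function that normalizes one frame to zero mean
    and unit variance."""
    dim = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(v) if v > 0 else 1.0 for v in variances]
    return lambda frame: [(x - m) / s for x, m, s in zip(frame, means, stds)]

# Placeholder "features": 3 frames of 2-dimensional values.
normalize = global_cmvn([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

Because the statistics are global, every utterance is normalized with the same mean and variance, which stabilizes training across speakers and recording conditions.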
`finetune_whisper.py` is used to fine-tune Whisper large-v3 (you can change the model size) on the Sagalee dataset by freezing the bottom layers of the encoder. You can simply run this Python script to fine-tune:

```bash
python finetune_whisper.py
```
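The encoder-freezing step can be sketched as a name-based filter over model parameters. The helper below is hypothetical (the actual logic lives in `finetune_whisper.py`) and only relies on parameter names of the form `...encoder.layers.<i>...`, as produced by torch's `model.named_parameters()`:

```python
import re

def freeze_bottom_encoder_layers(named_params, n_frozen):
    """Set requires_grad=False on parameters of the bottom `n_frozen`
    encoder layers; everything else stays trainable.

    `named_params` is an iterable of (name, param) pairs; only the names
    are inspected, so this works with any object exposing requires_grad.
    """
    layer_re = re.compile(r"encoder\.layers\.(\d+)\.")
    for name, param in named_params:
        match = layer_re.search(name)
        if match and int(match.group(1)) < n_frozen:
            param.requires_grad = False
```

Freezing the lower encoder layers keeps the low-level acoustic representations learned during Whisper's pretraining while letting the upper layers and the decoder adapt to Oromo speech.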
- For full-parameter fine-tuning, follow these steps in the WeNet script.
```bibtex
@misc{turi2025sagalee,
  title={Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language},
  author={Turi Abu and Ying Shi and Thomas Fang Zheng and Dong Wang},
  year={2025},
  eprint={2502.00421},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.00421},
}
```
The training code is adapted from WeNet and is used to train models on our custom Sagalee dataset.