
Sagalee: Automatic Speech Recognition Dataset for Oromo language

The Sagalee dataset is released under the CC BY-NC 4.0 International license; a summary of the license can be found here, and the full license can be found here.
The paper is now available on arXiv: Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language
The dataset is available at this link.

News

Training ASR on Sagalee Dataset

Clone this Repo

git clone https://github.com/turinaf/sagalee.git
cd sagalee
git submodule update --init --no-fetch

Create an environment and install dependencies

conda create -n wenet python=3.10
conda activate wenet
conda install conda-forge::sox
pip install torch==2.2.2+cu121 torchaudio==2.2.2+cu121 -f https://download.pytorch.org/whl/torch_stable.html
cd wenet
pip install -r requirements.txt

Training recipes

1 Prepare the data.

Running the script prepare_wenet_data.py prepares the data in the required format inside wenet/examples/sagalee/s0/data/. It organizes the wav files and transcripts into two files: wav.scp, containing two tab-separated columns (wav_id and wav_path), and text, containing two tab-separated columns (wav_id and text_label).

wav.scp file:

sagalee_SPKR232_122     sagalee/train/SPKR232/sagalee_SPKR232_122.wav
sagalee_SPKR232_002     sagalee/train/SPKR232/sagalee_SPKR232_002.wav

text file:

sagalee_SPKR232_082     HOJJATAA JIRA JECHUUN KOMATE
sagalee_SPKR232_093     SAMMUU KEE KEESSA HIN KAAYANI
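The preparation step can be sketched as follows. This is an illustrative version only, assuming each .wav file under sagalee/&lt;split&gt;/&lt;speaker&gt;/ has a matching .txt transcript next to it; the actual logic lives in prepare_wenet_data.py and may differ in detail.

```python
import os

def prepare_split(split_dir, out_dir):
    """Write wav.scp (wav_id\twav_path) and text (wav_id\ttext_label)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as scp, \
         open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as txt:
        for speaker in sorted(os.listdir(split_dir)):
            spk_dir = os.path.join(split_dir, speaker)
            if not os.path.isdir(spk_dir):
                continue
            for name in sorted(os.listdir(spk_dir)):
                if not name.endswith(".wav"):
                    continue
                wav_id = name[:-4]                 # e.g. sagalee_SPKR232_122
                wav_path = os.path.join(spk_dir, name)
                # Assumed transcript location: same basename with .txt extension.
                with open(wav_path[:-4] + ".txt", encoding="utf-8") as f:
                    label = f.read().strip().upper()
                scp.write(f"{wav_id}\t{wav_path}\n")
                txt.write(f"{wav_id}\t{label}\n")
```

Running this once per split (train/dev/test) would reproduce the two-column layout shown above.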

2 Run the training

After preparing the data, navigate to the directory containing run.sh and run the stages in order, starting from stage 1.

cd wenet/examples/sagalee/s0
bash run.sh --stage 1 --stop_stage 1
bash run.sh --stage 2 --stop_stage 2
bash run.sh --stage 3 --stop_stage 3
bash run.sh --stage 4 --stop_stage 4
bash run.sh --stage 5 --stop_stage 5
  • Stage 1: extracts global CMVN (cepstral mean and variance normalization) statistics. These statistics are used to normalize the acoustic features.
  • Stage 2: generates the label token dictionary.
  • Stage 3: generates the WeNet-required data.list file in JSON format.
  • Stage 4: training.
  • Stage 5: testing the trained model.
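To illustrate what the global CMVN statistics computed in stage 1 represent, here is a minimal sketch: a per-dimension mean and standard deviation accumulated over all training features, later applied to normalize each frame. WeNet's own implementation differs in detail (it works on the extracted fbank features and stores the statistics as JSON).

```python
import numpy as np

def compute_global_cmvn(feature_mats):
    """feature_mats: iterable of (num_frames, feat_dim) arrays."""
    total, total_sq, count = 0.0, 0.0, 0
    for feats in feature_mats:
        total = total + feats.sum(axis=0)
        total_sq = total_sq + (feats ** 2).sum(axis=0)
        count += feats.shape[0]
    mean = total / count
    var = total_sq / count - mean ** 2
    return mean, np.sqrt(np.maximum(var, 1e-8))

def apply_cmvn(feats, mean, std):
    # Normalize each frame to zero mean and unit variance per dimension.
    return (feats - mean) / std
```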

Finetuning Whisper model

  • finetune_whisper.py fine-tunes Whisper large-v3 (you can change the model size) on the Sagalee dataset by freezing the bottom layers of the encoder. Simply run:
python finetune_whisper.py
  • For full-parameter fine-tuning, follow the steps in the WeNet script.
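The layer-freezing idea can be sketched as below, assuming the Hugging Face transformers Whisper implementation; the repo's finetune_whisper.py may organize this differently. The demo instantiates a tiny randomly initialized model so the sketch runs offline — in practice you would load a pretrained checkpoint such as "openai/whisper-large-v3".

```python
from transformers import WhisperConfig, WhisperForConditionalGeneration

def freeze_bottom_encoder_layers(model, num_frozen):
    """Freeze the conv front-end and the first `num_frozen` encoder blocks."""
    encoder = model.model.encoder
    for module in (encoder.conv1, encoder.conv2, encoder.embed_positions):
        for p in module.parameters():
            p.requires_grad = False
    for layer in encoder.layers[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad = False

# Tiny config for offline demonstration; for real fine-tuning use
# WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3").
config = WhisperConfig(encoder_layers=4, decoder_layers=2, d_model=64,
                       encoder_attention_heads=2, decoder_attention_heads=2,
                       encoder_ffn_dim=128, decoder_ffn_dim=128)
model = WhisperForConditionalGeneration(config)
freeze_bottom_encoder_layers(model, num_frozen=2)
```

Freezing the bottom of the encoder keeps the generic acoustic representations intact while adapting the upper layers and decoder to Oromo speech, which is helpful when the fine-tuning data is relatively small.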

Citation

@misc{turi2025sagalee,
      title={Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language}, 
      author={Turi Abu and Ying Shi and Thomas Fang Zheng and Dong Wang},
      year={2025},
      eprint={2502.00421},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.00421}, 
}

Acknowledgement

The training code is adapted from WeNet and used to train models on our Sagalee dataset.
