This repository contains the code for our ICASSP 2022 paper: "Self-supervised speaker recognition with loss-gated learning". We propose to filter out the unreliable pseudo labels in Stage II, so that the system is trained with reliable pseudo labels only, which boosts performance.
System | Stage 1 | Stage 2
---|---|---
EER (%) | 7.36 | 1.66
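To illustrate the loss-gating idea used in Stage II: compute a per-sample loss against the pseudo labels and back-propagate only the samples whose loss falls below a gate threshold. This is a minimal sketch; the function name, tensor shapes, and `gate_threshold` value are illustrative, not the repository's exact API:

```python
import torch
import torch.nn.functional as F

def loss_gated_step(embeddings, pseudo_labels, classifier, gate_threshold=2.0):
    """One training step that keeps only reliable pseudo-labelled samples.

    `classifier` and `gate_threshold` are placeholders; the exact gating
    rule in the paper/repo may differ in detail.
    """
    logits = classifier(embeddings)                        # (B, num_clusters)
    per_sample_loss = F.cross_entropy(logits, pseudo_labels, reduction="none")
    reliable = per_sample_loss < gate_threshold           # small loss == reliable label
    if reliable.any():
        loss = per_sample_loss[reliable].mean()           # train on reliable samples only
    else:
        loss = per_sample_loss.sum() * 0.0                # keep the graph alive if all gated out
    return loss, reliable.float().mean()                  # loss and fraction of samples kept
```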
- In our paper, we extend the channel size of the speaker encoder to 1024 in iteration 5. In this code we remove that setting to simplify the code; you can apply it in the last iteration to get a better result.
- In our paper, we manually determined the end of each iteration, which is not user-friendly. In this code, we end an iteration if the EER does not improve for N = 4 consecutive epochs. You can increase N to improve the performance.
- I do not have time to run the entire code again. I have checked Stage 1 and got EER = 7.36; I believe an EER smaller than 2.00 can easily be obtained in Stage 2 with this code.
Note: these versions are based on my device; you can modify the torch and torchaudio versions to match yours.
```bash
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
conda install -c pytorch faiss-gpu
pip install -r utils/requirements.txt
```
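Optionally, a quick sanity check before training (a minimal snippet, assuming the GPU builds installed correctly):

```python
# Verify that the GPU builds of torch and faiss are importable and see the GPU.
import torch, torchaudio, faiss

print(torch.__version__, torchaudio.__version__)  # expect 1.7.1+cu110 / 0.7.2
print("CUDA available:", torch.cuda.is_available())
print("faiss GPUs:", faiss.get_num_gpus())        # requires the faiss-gpu build
```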
Please follow the official code to prepare your VoxCeleb2 dataset from the 'Data preparation' part in this repository.
Dataset for training usage:

- VoxCeleb2 training set
- MUSAN dataset
- RIR dataset
Dataset for evaluation:

- I have added the test_list (Vox1_O) in `utils`.

Download `train_list.txt` from here and put it in `utils`. This train_list contains the length of each utterance. `train_mini.txt` is a subset of VoxCeleb2; it contains 100k utterances from 4082 speakers.
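For illustration, a hypothetical reader for such a list; it assumes each line holds a wav path followed by its length, which may differ from the actual file layout:

```python
def read_train_list(path="utils/train_mini.txt"):
    """Parse a training list, assuming lines of the form '<wav_path> <length>'."""
    utterances = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            wav_path, length = parts[0], float(parts[1])
            utterances.append((wav_path, length))
    return utterances
```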
First, you need to train a basic speaker encoder in a contrastive learning manner. Change the path to the folder `Stage1` and use:

```bash
bash run.sh
```
Every `test_step` epochs, the system will be evaluated on the Vox1_O set and the EER will be printed. The result will be saved in `Stage1/exps/exp1/score.txt`, and the model will be saved in `Stage1/exps/exp1/model`. I also provide the model with EER=7.36.

In my case, I trained for 50 epochs on one 3090 GPU. Each epoch takes 40 minutes, so the total training time is about 35 hours.
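For reference, the EER reported here is the standard equal error rate. A minimal way to compute it from trial scores with scikit-learn (not the repository's exact implementation):

```python
import numpy
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    """Equal error rate from trial scores (label 1 = same speaker, 0 = different)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = numpy.nanargmin(numpy.abs(fnr - fpr))   # operating point where FAR == FRR
    return (fpr[idx] + fnr[idx]) / 2.0 * 100.0    # in percent
```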
For the baseline approach in Stage II, change the path to the folder `Stage2` and use:

```bash
bash run_baseline.sh
```

Please modify the path for `init_model` in `run_baseline.sh`; `init_model` is the path to the best model from Stage I.
This is the end-to-end code. The system will:

1. Do clustering;
2. Train the speaker encoder for classification;
3. Repeat 1) and 2) when the EER in 2) does not improve for 4 consecutive epochs.

Here we do 5 iterations. Each epoch takes 20 minutes and clustering takes 18 minutes. A sketch of the clustering step follows.
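The clustering step assigns each utterance a pseudo speaker label. A minimal sketch with faiss k-means; the cluster count and options below are placeholders, not the values configured in `run_baseline.sh`:

```python
import faiss
import numpy

def cluster_pseudo_labels(embeddings, num_clusters=6000):
    """Assign a pseudo speaker label to each utterance embedding via k-means.

    `num_clusters=6000` is a placeholder; match it to the script's setting.
    """
    embeddings = numpy.ascontiguousarray(embeddings.astype("float32"))
    kmeans = faiss.Kmeans(embeddings.shape[1], num_clusters, niter=25, gpu=True)
    kmeans.train(embeddings)
    _, labels = kmeans.index.search(embeddings, 1)  # nearest centroid per utterance
    return labels.reshape(-1)
```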
For our LGL approach in Stage II, change the path to the folder `Stage2` and use:

```bash
bash run_LGL.sh
```

This is also end-to-end code. The system will:

1. Do clustering;
2. Train the speaker encoder for classification;
3. Train the speaker encoder for classification with LGL when the EER in 2) does not improve for 4 consecutive epochs;
4. Repeat 1), 2) and 3) when the EER in 3) does not improve for 4 consecutive epochs.

The stopping rule used in 3) and 4) is sketched below.
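The "no improvement in 4 consecutive epochs" rule is a simple patience counter. A minimal sketch of the control flow, with `train_one_epoch` and `evaluate_eer` as placeholders for the repository's training and Vox1_O evaluation routines:

```python
def train_until_plateau(train_one_epoch, evaluate_eer, patience=4):
    """Run epochs until the EER has not improved for `patience` epochs in a row."""
    best_eer, epochs_without_gain = float("inf"), 0
    while epochs_without_gain < patience:
        train_one_epoch()
        eer = evaluate_eer()
        if eer < best_eer:
            best_eer, epochs_without_gain = eer, 0   # new best: reset the counter
        else:
            epochs_without_gain += 1                 # plateau: count toward stopping
    return best_eer
```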
I have already added annotations to make the code as clear as possible; please read them carefully. If you have questions, please post them in the issues section.
```bibtex
@inproceedings{tao2022self,
  title={Self-supervised speaker recognition with loss-gated learning},
  author={Tao, Ruijie and Lee, Kong Aik and Das, Rohan Kumar and Hautam{\"a}ki, Ville and Li, Haizhou},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={6142--6146},
  year={2022},
  organization={IEEE}
}
```
We studied many useful projects during our coding process, including: joonson/voxceleb_unsupervised. Thanks to these authors for open-sourcing their code!
If you are interested in working on this topic and have ideas to implement, I am glad to collaborate and contribute my experience & knowledge in this area. Please contact me at [email protected].