Character-level speech recognizer using CTC loss with deep RNNs in TensorFlow.
This is an ongoing project working towards an implementation of the character-level incremental speech recognition (ISR) system described in the paper by Kyuyeon Hwang and Wonyong Sung. It works at the character level, using one deep RNN trained with CTC loss as the acoustic model and another deep RNN trained as a character-level language model. The acoustic model can read either mel-frequency cepstral coefficient (MFCC) features or mel filterbank features with delta and double-delta coefficients (40- or 120-dimensional inputs respectively).
The audio signal processing is done using librosa.
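For reference, the sketch below shows roughly how such features can be computed with librosa; the window, hop, and normalization settings the repository actually uses may differ.

import librosa
import numpy as np

# Illustrative feature extraction, not necessarily the repository's exact settings.
signal, rate = librosa.load("utterance.wav", sr=16000)

# Option 1: 40-dimensional MFCC features, shape (40, frames)
mfcc = librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=40)

# Option 2: 40-bin log mel filterbank stacked with delta and double-delta, shape (120, frames)
mel = librosa.power_to_db(librosa.feature.melspectrogram(y=signal, sr=rate, n_mels=40))
features = np.vstack([mel, librosa.feature.delta(mel), librosa.feature.delta(mel, order=2)])
# Transpose to (frames, 120) before feeding the frames to the acoustic model.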
Currently only the acoustic model has been completed. A pre-trained example is available here and can be tried on any file (your own recorded voice, for example).
Results on LibriSpeech's test-clean evaluation set for the pre-trained model are:
- CER: 15.2 %
- WER: 42.4 %
It still lacks the character-level language model, which is in the works.
The datasets currently supported are:
- LibriSpeech by Vassil Panayotov
- Shtooka
- Vystadial 2013
- TED-LIUM
The data is fed through two pipelines: one for testing and the other for training.
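As an illustration only (the project's actual input code may be organized differently), two such pipelines could be built with TensorFlow's tf.data API along these lines; the generator names and batch size are placeholders.

import tensorflow as tf

def make_pipeline(example_generator, batch_size, training):
    # example_generator is a hypothetical callable yielding
    # (features [frames, 120] float32, label ids [chars] int32) pairs.
    dataset = tf.data.Dataset.from_generator(
        example_generator,
        output_types=(tf.float32, tf.int32),
        output_shapes=(tf.TensorShape([None, 120]), tf.TensorShape([None])))
    if training:
        dataset = dataset.shuffle(buffer_size=1000).repeat()
    # Pad variable-length utterances to the longest one in each batch.
    dataset = dataset.padded_batch(
        batch_size,
        padded_shapes=(tf.TensorShape([None, 120]), tf.TensorShape([None])))
    return dataset.make_one_shot_iterator().get_next()

train_batch = make_pipeline(train_examples, batch_size=32, training=True)   # training pipeline
test_batch = make_pipeline(test_examples, batch_size=32, training=False)    # testing pipeline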
If you intend to use a pre-trained model, you should clone the repository with the Git LFS extension:
$ git lfs clone https://github.com/inikdom/rnn-speech.git
If you have already cloned the repository without LFS, you can download the missing files with:
$ git lfs pull
The required Python dependencies are:
- TensorFlow (>= 1.4)
- librosa
- mutagen
Install the required dependencies by running:
$ pip3 install -r requirements.txt
GPU support is not mandatory but strongly recommended if you intend to train the RNN. Replace tensorflow with tensorflow-gpu in requirements.txt to install the GPU version of TensorFlow.
Optional dependencies:
- sox (for live transcription only; install with sudo apt-get install sox or brew install sox --with-flac)
- libcupti (for the timeline option only; install with sudo apt-get install libcupti-dev)
- pyaudio (for live transcription only; install with sudo apt-get install python3-pyaudio)
I've prepared a bash script to download LibriSpeech (~700 MB) and extract the data to the right place:
$ chmod +x prepare_data.sh
$ ./prepare_data.sh
It will remove the tar files after downloading and extracting them.
All hyperparameters for the network are defined in config.ini. A different config file can be fed to the training program using something like:
$ python stt.py --config_file="different_config_file.ini"
You should ensure it follows the same format as the one provided.
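As a purely illustrative sketch of the INI layout, the snippet below shows the general shape; apart from test_dataset_dirs, which is referenced later in this README, the section name, parameter names, and paths are placeholders, so check the provided config.ini for the real ones.

[general]
# placeholder values for illustration only
training_dataset_dirs = data/LibriSpeech/train-clean-100
test_dataset_dirs = data/LibriSpeech/test-clean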
Once your dependencies are set up and the data is downloaded and extracted into the appropriate location, the optimizer can be started by running:
$ python stt.py --train
Dynamic RNNs are used because memory consumption on the fully unrolled network was massive and the model took about 30 minutes to build. Unfortunately this comes at a cost in speed, but I think the tradeoff is worth it in this case (the model can now fit on a single GPU).
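For a rough picture of what this looks like in TensorFlow 1.x, here is a minimal dynamic-RNN + CTC sketch; the layer count, cell size, vocabulary size, and optimizer settings are illustrative placeholders, not the values from config.ini.

import tensorflow as tf

num_features = 120   # filterbank + delta features (40 for plain MFCC)
num_classes = 29     # e.g. 26 letters + space + apostrophe + CTC blank (illustrative)

inputs = tf.placeholder(tf.float32, [None, None, num_features])  # [batch, time, features]
seq_lengths = tf.placeholder(tf.int32, [None])                   # frames per utterance
labels = tf.sparse_placeholder(tf.int32)                         # target character ids

# A small stack of GRU cells, unrolled dynamically over the input frames.
cell = tf.nn.rnn_cell.MultiRNNCell([tf.nn.rnn_cell.GRUCell(512) for _ in range(3)])
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_lengths, dtype=tf.float32)

logits = tf.layers.dense(outputs, num_classes)   # per-frame character scores
logits = tf.transpose(logits, [1, 0, 2])         # tf.nn.ctc_loss expects time-major input

loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_lengths))
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)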
You can also use a trained network to process a wav file:
$ python stt.py --file "path_to_file.wav"
The result will be printed on standard output.
You can evaluate a trained network on an evaluation test set (the config.ini file's test_dataset_dirs parameter):
$ python stt.py --evaluate
The resulting CER (character error rate) and WER (word error rate) will be printed on standard output.
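For reference, CER and WER are both edit-distance rates: the Levenshtein distance between hypothesis and reference divided by the reference length, counted in characters or in words. A minimal stand-alone version (not the repository's evaluation code) looks like:

def edit_distance(ref, hyp):
    # Levenshtein distance by dynamic programming; dp[j] = distance(ref[:i], hyp[:j]).
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,           # deletion
                        dp[j - 1] + 1,       # insertion
                        prev + (r != h))     # substitution (free if symbols match)
            prev = cur
    return dp[-1]

def cer(ref, hyp):
    # character error rate
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    # word error rate
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())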
You can add the --timeline option in order to produce a timeline file and see how everything is going. The resulting file will be overwritten at each step. It can be opened with Chrome by navigating to chrome://tracing/ and loading the file.
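For context, TensorFlow timelines are produced from run metadata and converted to the Chrome trace format; a minimal sketch of the mechanism (not necessarily how stt.py wires it up) is:

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    # train_op stands in for whatever training step is being profiled.
    sess.run(train_op, options=run_options, run_metadata=run_metadata)
    # Convert the collected step stats into a chrome://tracing JSON file.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("timeline.json", "w") as trace_file:
        trace_file.write(trace.generate_chrome_trace_format())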
With verification and testing performed at every step:
- Build character-level RNN code
- Add CTC beam search
- Wrap acoustic model and language model into a general 'Speech Recognizer'
- Add the ability for a human to sample and test
Ultimately I'd like to work towards bridging this with my other project neural-chatbot to make an open-source natural conversational engine.
MIT
"LibriSpeech: an ASR corpus based on public domain audio books", Vassil Panayotov, Guoguo Chen, Daniel Povey andSanjeev Khudanpur, ICASSP 2015
http://shtooka.net
Korvas, Matěj; Plátek, Ondřej; Dušek, Ondřej; et al., 2014, Vystadial 2013 – English data, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11858/00-097C-0000-0023-4671-4.
A. Rousseau, P. Deléglise, and Y. Estève, "Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks",
in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), May 2014.