Dataset creation #44

Open

StuartIanNaylor opened this issue Dec 24, 2024 · 1 comment

StuartIanNaylor commented Dec 24, 2024

Because a classification model is just a graph, it's not just words you need for !KW but a balance of the pronunciations that create the spectra.
https://github.com/StuartIanNaylor/ProjectEars/tree/main/dataset/ml-commons
I was using the ml-commons dataset, but it's so full of dross that it's a shame it's the only spoken-word dataset we have.
I was importing the words into SQLite and using

# Natural Language Toolkit: Tokenizers
#
# Copyright (C) 2001-2022 NLTK Project
# Author: Christopher Hench <[email protected]>
from nltk.tokenize import SyllableTokenizer  # Sonority Sequencing Principle syllable tokenizer
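
For illustration, a minimal sketch of that approach (not the exact script from the repo above; the table and column names are assumptions): load words into SQLite and store a syllable count per word via NLTK's SyllableTokenizer, so the set can later be balanced on it.

import sqlite3
from nltk.tokenize import SyllableTokenizer

ssp = SyllableTokenizer()  # splits words using the Sonority Sequencing Principle

conn = sqlite3.connect("words.db")
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, syllables INTEGER)")

for word in ["heating", "lights", "kitchen"]:  # stand-in for the real word list
    n_syllables = len(ssp.tokenize(word))
    conn.execute("INSERT OR REPLACE INTO words VALUES (?, ?)", (word, n_syllables))

conn.commit()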

My hack is here https://github.com/StuartIanNaylor/ProjectEars/tree/main/dataset/ml-commons

The idea is to balance phonemes and syllables, because in a classification you get a seesaw effect: if you over-bias towards a particular phoneme (by count of submitted samples), a lesser-represented phoneme will rank lower.
Phonemes grouped by syllable count were the nearest thing I could think of that maps directly to the resultant spectra.
It's a rough gist of how the big players likely do it, but they will have professional linguists to advise on how to create balanced classifications.
https://en.wikipedia.org/wiki/Sonority_hierarchy
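
As an illustration of the balancing step (again an assumption, not code from the repo above), you could then draw an equal number of words from each syllable-count bucket, so no bucket dominates the classifier's training distribution:

import sqlite3

conn = sqlite3.connect("words.db")
balanced = []
for n_syllables in (1, 2, 3):
    rows = conn.execute(
        "SELECT word FROM words WHERE syllables = ? ORDER BY RANDOM() LIMIT 100",
        (n_syllables,),
    ).fetchall()
    balanced.extend(word for (word,) in rows)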

Your biggest problem is being forced to use synthetic data, and there are likely newer SotA TTS models than Piper aimed at embedded use.
Real, balanced, large datasets of on-device capture will give the best results.

PS I have just been having a listen to the 1,000 KW samples this makes:

python3 piper-sample-generator/generate_samples.py "{target_word}" \
--max-samples 1000 \
--batch-size 100 \
--output-dir generated_samples

It's not even two voices: just the KW repeated with slight changes in pitch that map to nothing like the variance human intonation produces. Just aplay * in the generated_samples folder and it's the same synthetic voice with very little variance.

Then the script goes into the realm of voodoo and exports raw room RIRs as WAVs... !?! is all I can say.
A Room Impulse Response describes how a 3D space reflects sound; it is not a sound file in itself.
You need to place the point sources of mic and speaker, take a sample, and apply a specific room-type RIR to it, to recreate how the mic would capture that sample in that room (see the convolution sketch below).
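
A minimal sketch of what applying a RIR actually means (file names here are placeholders): convolve the dry sample with the impulse response for a given room and mic/speaker placement.

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("generated_samples/0.wav")  # dry TTS sample
rir, sr_rir = sf.read("room_rir.wav")         # RIR for one room and mic/speaker placement
assert sr == sr_rir, "resample the RIR to the sample rate first"

wet = fftconvolve(dry, rir)[: len(dry)]       # reverberant signal as the mic would capture it
wet /= max(1e-9, np.abs(wet).max())           # normalise to avoid clipping
sf.write("augmented_sample.wav", wet, sr)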

StuartIanNaylor (Author) commented:

PS, just to add while I came across it: someone is arguing that multitaper mel spectrograms provide more accuracy for keyword spotting: https://arxiv.org/pdf/2407.04662
Dunno. A rough sketch of the multitaper idea is below.
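
The gist of the multitaper approach, as a hedged sketch (not the paper's code): instead of a single window per frame, average the periodograms of several orthogonal DPSS (Slepian) tapers, which lowers the variance of the spectral estimate before the mel filterbank is applied.

import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(frame, n_tapers=4, nw=2.5):
    """Multitaper power spectrum of one analysis frame."""
    tapers = dpss(len(frame), NW=nw, Kmax=n_tapers)   # shape (n_tapers, frame_len)
    specs = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return specs.mean(axis=0)                         # average over tapers

# A mel filterbank is then applied per frame, exactly as for a conventional mel spectrogram.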
