Because a classification model is just a graph, it's not just the words you need for !KW but a balance of the pronunciations that create the spectra. https://github.com/StuartIanNaylor/ProjectEars/tree/main/dataset/ml-commons
I was using the ml-commons dataset, but it's so full of dross; it's a shame it's the only spoken-word dataset we have.
I was importing the words into SQLite and using the NLTK tokenizer:
# Natural Language Toolkit: Tokenizers
#
# Copyright (C) 2001-2022 NLTK Project
# Author: Christopher Hench <[email protected]>
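For reference, a rough sketch of that step, assuming the tokenizer in question is NLTK's sonority-sequencing SyllableTokenizer (the module that header comment comes from) and a throwaway SQLite schema; the table, column names and word list are placeholders:

```python
# Sketch: count syllables per word with NLTK's sonority-sequencing tokenizer
# and store them in SQLite. Table/column names and words are placeholders.
import sqlite3
from nltk.tokenize import SyllableTokenizer

ssp = SyllableTokenizer()

con = sqlite3.connect("words.db")
con.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, syllables INTEGER)")

for word in ["marvin", "backward", "seven", "visual"]:      # swap in the ml-commons word list
    syls = ssp.tokenize(word)                               # e.g. ['mar', 'vin']
    con.execute("INSERT OR REPLACE INTO words VALUES (?, ?)", (word, len(syls)))

con.commit()
print(con.execute("SELECT word, syllables FROM words ORDER BY syllables").fetchall())
```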
The idea is to balance phonemes and syllables, as in a classification you get this seesaw effect: if you over-bias towards a particular phoneme by count of submitted samples, then a lesser-represented phoneme will rank lower.
Phonemes by syllable count were the nearest thing I could think of that directly maps to the resultant spectra.
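As an illustration of that balance check (not the exact ProjectEars code), phoneme counts can be tallied with the CMU Pronouncing Dictionary via NLTK; the word list and the use of cmudict/ARPAbet are assumptions for the sketch:

```python
# Sketch: tally ARPAbet phoneme counts across a word list to spot over-biased
# phonemes. Needs nltk.download('cmudict'); the word list is a placeholder.
from collections import Counter
from nltk.corpus import cmudict

prondict = cmudict.dict()
counts = Counter()

for word in ["seven", "marvin", "backward", "happy"]:
    for phone in prondict.get(word, [[]])[0]:     # first pronunciation, if present
        counts[phone.rstrip("012")] += 1          # drop vowel stress markers

total = sum(counts.values())
for phone, n in counts.most_common():
    print(f"{phone:>3} {n:4d} {100 * n / total:5.1f}%")  # eyeball the seesaw
```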
It's a gist of how the big guys likely do it, but they will have professional linguists to advise on how to create balanced classifications. https://en.wikipedia.org/wiki/Sonority_hierarchy
Your biggest problem is being forced to use synthetic data, and there are likely newer SotA TTS models than Piper aimed at embedded.
Large, balanced datasets of real device capture will give the best results.
PS: I have just been having a listen to the 1000 KW samples this makes.
It's not even 2 voices, just the same KW repeated with slight changes in pitch that map to nothing like the variance human intonation produces; just run aplay * in the generated_samples folder and it's the same synthetic voice with very little variance.
Then the script goes into the realm of voodoo and exports raw room RIRs as WAVs... !?! is all I can say.
A Room Impulse Response describes how the 3D space of a room reflects sound; it is not a sound file in itself.
You need to place the point sources of the mic and the speaker playing a sample, then apply a specific room-type RIR to that sample, to recreate how the mic would capture it in that room.
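As a rough sketch of the second half of that (the convolution step only, not positioning the sources; a full room simulator such as pyroomacoustics handles the mic/speaker geometry), assuming the RIR and the dry sample are mono WAVs at the same rate; the file names are placeholders:

```python
# Sketch: apply a room RIR to a dry TTS sample by convolution to approximate
# how a mic in that room would capture it. File names are placeholders.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("generated_samples/kw_0001.wav")
rir, sr_rir = sf.read("rirs/living_room.wav")
assert sr == sr_rir, "resample one of them first if the rates differ"

wet = fftconvolve(dry, rir, mode="full")[: len(dry)]
wet *= np.max(np.abs(dry)) / (np.max(np.abs(wet)) + 1e-9)  # keep level comparable, avoid clipping

sf.write("kw_0001_living_room.wav", wet, sr)
```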
PS: just to add whilst I came across it, someone is arguing that multitaper mel spectrograms for keyword spotting provide more accuracy: https://arxiv.org/pdf/2407.04662
Dunno
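For anyone curious, a rough sketch of the general idea (an eigenvalue-weighted average of DPSS-tapered power spectra, then a mel filterbank), not the paper's exact recipe; the frame size, taper count and file name are placeholders. The claimed benefit is that averaging several orthogonal tapers lowers the variance of the spectral estimate compared with a single Hann window.

```python
# Sketch: multitaper mel spectrogram using DPSS (Slepian) tapers.
# Parameters and the input file are illustrative placeholders.
import numpy as np
import librosa
from scipy.signal.windows import dpss

def multitaper_mel(y, sr, n_fft=400, hop=160, n_tapers=4, n_mels=40):
    tapers, ratios = dpss(n_fft, NW=2.5, Kmax=n_tapers, return_ratios=True)
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)  # (n_fft, n_frames)
    spec = np.zeros((1 + n_fft // 2, frames.shape[1]))
    for taper, weight in zip(tapers, ratios):
        spec += weight * np.abs(np.fft.rfft(frames * taper[:, None], axis=0)) ** 2
    spec /= ratios.sum()                                   # eigenvalue-weighted average
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ spec + 1e-10)

y, sr = librosa.load("generated_samples/kw_0001.wav", sr=16000)
print(multitaper_mel(y, sr).shape)  # (n_mels, n_frames)
```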