Because a classification model is just a graph, it's not just the words you need for !KW but a balance of the pronunciations that create the spectra. https://github.com/StuartIanNaylor/ProjectEars/tree/main/dataset/ml-commons
I was using the ml-commons dataset, but it's so full of dross; it's a shame it's the only spoken-word dataset we have.
I was importing the words into SQLite and using the NLTK tokenizer:
# Natural Language Toolkit: Tokenizers
#
# Copyright (C) 2001-2022 NLTK Project
# Author: Christopher Hench <[email protected]>
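For reference, a rough sketch of that step, assuming the tokenizer in question is NLTK's sonority-sequencing SyllableTokenizer (the module that header comment comes from) and a throwaway SQLite schema; the table, column names and word list are placeholders:

```python
# Sketch: count syllables per word with NLTK's sonority-sequencing tokenizer
# and store them in SQLite. Table/column names and words are placeholders.
import sqlite3
from nltk.tokenize import SyllableTokenizer

ssp = SyllableTokenizer()

con = sqlite3.connect("words.db")
con.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, syllables INTEGER)")

for word in ["marvin", "backward", "seven", "visual"]:      # swap in the ml-commons word list
    syls = ssp.tokenize(word)                               # e.g. ['mar', 'vin']
    con.execute("INSERT OR REPLACE INTO words VALUES (?, ?)", (word, len(syls)))

con.commit()
print(con.execute("SELECT word, syllables FROM words ORDER BY syllables").fetchall())
```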
The idea is to balance phonemes and syllables, as in a classification you get this seesaw effect: if you over-bias towards a particular phoneme by count of submitted samples, then a lesser-represented phoneme will rank lower.
Phonemes by syllable count were the nearest thing I could think of that directly maps to the resultant spectra.
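As an illustration of that balance check (not the exact ProjectEars code), phoneme counts can be tallied with the CMU Pronouncing Dictionary via NLTK; the word list and the use of cmudict/ARPAbet are assumptions for the sketch:

```python
# Sketch: tally ARPAbet phoneme counts across a word list to spot over-biased
# phonemes. Needs nltk.download('cmudict'); the word list is a placeholder.
from collections import Counter
from nltk.corpus import cmudict

prondict = cmudict.dict()
counts = Counter()

for word in ["seven", "marvin", "backward", "happy"]:
    for phone in prondict.get(word, [[]])[0]:     # first pronunciation, if present
        counts[phone.rstrip("012")] += 1          # drop vowel stress markers

total = sum(counts.values())
for phone, n in counts.most_common():
    print(f"{phone:>3} {n:4d} {100 * n / total:5.1f}%")  # eyeball the seesaw
```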
It's a gist of how the big guys likely do it, but they will have professional linguists to advise on how to create balanced classifications. https://en.wikipedia.org/wiki/Sonority_hierarchy
Your biggest problem is being forced to use synthetic data, and there are likely newer SotA TTS models than Piper aimed at embedded.
Large, balanced datasets of real device capture will give the best results.
PS: I have just been having a listen to the 1000 KW samples this makes.
It's not even 2 voices, just the same KW repeated with slight changes in pitch that map to nothing like the variance human intonation produces; just run aplay * in the generated_samples folder and it's the same synthetic voice with very little variance.
Then the script goes into the realm of voodoo and exports raw room RIRs as WAVs... !?! is all I can say.
A Room Impulse Response describes how the 3D space of a room reflects sound; it is not a sound file in itself.
You need to place the point sources of the mic and the speaker playing a sample, then apply a specific room-type RIR to that sample, to recreate how the mic would capture it in that room.
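As a rough sketch of the second half of that (the convolution step only, not positioning the sources; a full room simulator such as pyroomacoustics handles the mic/speaker geometry), assuming the RIR and the dry sample are mono WAVs at the same rate; the file names are placeholders:

```python
# Sketch: apply a room RIR to a dry TTS sample by convolution to approximate
# how a mic in that room would capture it. File names are placeholders.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("generated_samples/kw_0001.wav")
rir, sr_rir = sf.read("rirs/living_room.wav")
assert sr == sr_rir, "resample one of them first if the rates differ"

wet = fftconvolve(dry, rir, mode="full")[: len(dry)]
wet *= np.max(np.abs(dry)) / (np.max(np.abs(wet)) + 1e-9)  # keep level comparable, avoid clipping

sf.write("kw_0001_living_room.wav", wet, sr)
```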
PS: just to add whilst I came across it, someone is arguing that multitaper mel spectrograms for keyword spotting provide more accuracy: https://arxiv.org/pdf/2407.04662
Dunno
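For anyone curious, a rough sketch of the general idea (an eigenvalue-weighted average of DPSS-tapered power spectra, then a mel filterbank), not the paper's exact recipe; the frame size, taper count and file name are placeholders. The claimed benefit is that averaging several orthogonal tapers lowers the variance of the spectral estimate compared with a single Hann window.

```python
# Sketch: multitaper mel spectrogram using DPSS (Slepian) tapers.
# Parameters and the input file are illustrative placeholders.
import numpy as np
import librosa
from scipy.signal.windows import dpss

def multitaper_mel(y, sr, n_fft=400, hop=160, n_tapers=4, n_mels=40):
    tapers, ratios = dpss(n_fft, NW=2.5, Kmax=n_tapers, return_ratios=True)
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)  # (n_fft, n_frames)
    spec = np.zeros((1 + n_fft // 2, frames.shape[1]))
    for taper, weight in zip(tapers, ratios):
        spec += weight * np.abs(np.fft.rfft(frames * taper[:, None], axis=0)) ** 2
    spec /= ratios.sum()                                   # eigenvalue-weighted average
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ spec + 1e-10)

y, sr = librosa.load("generated_samples/kw_0001.wav", sr=16000)
print(multitaper_mel(y, sr).shape)  # (n_mels, n_frames)
```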