Is this repository more accurate than openWakeWord? #28

I want to understand what the difference is between openWakeWord and this repository. Does accuracy differ?
This is a smaller model that can run on low-power devices such as the ESP32-S3.
openWakeWord uses a pre-trained model from Google to increase accuracy. This works great, but the Google model is pretty complex. microWakeWord is smaller and trained from scratch, so it can run on ESP devices.
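As a concrete illustration of "smaller model": microWakeWord exports quantized TFLite streaming models that ESPHome runs with TensorFlow Lite Micro. Below is a minimal sketch for inspecting one of those models on a desktop machine; the file name is a placeholder and the exact tensor shapes depend on the model you load.

```python
# Hedged sketch: inspect a microWakeWord .tflite model's size and I/O on a PC.
# MODEL_PATH is a placeholder; point it at whichever model file you downloaded.
import os
import numpy as np
import tensorflow as tf

MODEL_PATH = "okay_nabu.tflite"  # placeholder file name

interpreter = tf.lite.Interpreter(model_path=MODEL_PATH)
interpreter.allocate_tensors()

print(f"model file: {os.path.getsize(MODEL_PATH) / 1024:.1f} KiB")
for d in interpreter.get_input_details():
    print("input :", d["shape"], d["dtype"])
for d in interpreter.get_output_details():
    print("output:", d["shape"], d["dtype"])

# Push one dummy feature window through to confirm the model runs end to end.
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
out = interpreter.get_output_details()[0]
print("raw output tensor:", interpreter.get_tensor(out["index"]))
```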
@synesthesiam I was trying to train a wake word with microWakeWord for "Hey Pixa", but I found it a bit difficult compared to openWakeWord. Have you trained any wake word here before? Is that something you could help with?
It is much easier to train a good model with openWakeWord than with microWakeWord. While the actual training of a mWW model only takes a couple of hours on my hardware, it usually takes me a couple of weeks of tweaking the TTS samples to get a usable model. That is a big part of the challenge in writing an all-in-one training script for microWakeWord! It is hard to compare accuracy between wake word models, as there isn't a standardized test that describes all situations well. For my specific tests, I have found the new V2 models are slightly better than openWakeWord's, but it is unclear whether this is reflected in real-world use.
Thanks for the information, and for the work 🚀
@kahrendt I have been able to use the pre-trained wake words efficiently, but funnily enough, when my wife says "Alexa" or "Hey Jarvis" from the same distance and at the same volume as me, she does not get recognized. I tried using a higher pitch and microWakeWord stopped detecting. Do you think we might be training the models only with male samples? What about always having a male and a female option? That way it would be more inclusive, given that we can use more than one wake word at a time on the same device :) Amazing work and thanks again!
Interesting result! The samples are all generated using models trained on the LibriTTS dataset. The dataset is roughly balanced between genders as a whole, so in theory, the generated samples should also be balanced. I did, however, restrict the number of voices when generating samples for several of the V2 models, and it is possible that resulted in an unbalanced set. I'll investigate this, thanks for pointing it out! Out of curiosity, have you tried the "Hey Mycroft" wake word? I'd be interested to hear whether you experience the same behavior, as I used many real recordings in addition to TTS-generated samples.
Hey Kevin! Interestingly enough, "Hey Mycroft" works with her :)
I had another thought that could improve things. The ESPHome 2024.7.0 release changed how loud the microphone audio was: it was basically 4 times quieter than in the previous version. The 2024.7.1 release reverted that change, so it should match the same behavior as before. If you were on the 2024.7.0 release, I'd appreciate it if you updated ESPHome and tested again!
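For readers unfamiliar with audio levels, here is a quick back-of-the-envelope check (my own illustration, not from the ESPHome changelog) of what a 4x amplitude drop means in decibels and what it does to int16 microphone samples:

```python
# If the 2024.7.0 change scaled sample amplitude by 1/4, that is roughly a
# 12 dB reduction, easily enough to push a quiet or distant speaker below a
# wake word model's detection threshold.
import math
import numpy as np

print(f"20*log10(1/4) = {20 * math.log10(1 / 4):.1f} dB")  # about -12.0 dB

samples = np.array([8000, -12000, 16000], dtype=np.int16)
quieter = (samples.astype(np.int32) // 4).astype(np.int16)
print(samples, "->", quieter)
```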
Hey Kevin, everything was on 2024.7.1 already ;) "Hey Mycroft" works, but none of the other V2 models do for her: not "Okay Nabu", "Hey Jarvis", nor "Alexa".
@kahrendt Hi, I experience the same with "Alexa" and "Hey Jarvis": my wife is not picked up at all unless she really lowers the pitch of her voice, which has resulted in her telling me to remove it from the living room :) Currently running 2024.8.0. Tested on both the Atom Echo and the ESP32-S3-BOX-3.
Thanks for the feedback. I'm brainstorming ideas on how to properly address this, as the bias only gets worse when I additionally train on collected real samples. One thought I have is to oversample the TTS samples that are based on female voices to see if that improves things. I'm also suspicious that the samples generated with slerp (where two trained voices are mixed together) are making this worse.
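A minimal sketch of that oversampling idea, assuming the TTS sample list can be tagged with speaker gender (the metadata layout here is an assumption, not the actual microWakeWord pipeline):

```python
# Duplicate the under-represented gender's samples (with replacement) until the
# two groups are roughly balanced before training.
import random

def balance_by_gender(samples):
    """samples: list of dicts like {"path": "hey_jarvis_0001.wav", "gender": "F"}."""
    female = [s for s in samples if s["gender"] == "F"]
    male = [s for s in samples if s["gender"] == "M"]
    if not female or not male:
        return samples  # nothing to balance against
    minority, majority = (female, male) if len(female) < len(male) else (male, female)
    extra = random.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

dataset = [
    {"path": "hey_jarvis_0001.wav", "gender": "F"},
    {"path": "hey_jarvis_0002.wav", "gender": "M"},
    {"path": "hey_jarvis_0003.wav", "gender": "M"},
]
print(len(balance_by_gender(dataset)))  # 4 samples: 2 male + 1 female + 1 duplicate
```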
I'm happy to test and use my wife as the female tester :)
Can it be run on an ESP32-C3 by any chance?
This is very cool. I successfully got an ESP32-S3 responding to "Hey Jarvis". Unfortunately, I wasted a lot of time training a new "Hey Eddie" openWakeWord model, but I now realize that microWakeWord is a separate thing. On-device wake words are so much better: no continuous audio streaming to Home Assistant, and a snappier response. Are you in any way soliciting new wake words? My plan was to make my Assist act like Eddie the Shipboard Computer from The Hitchhiker's Guide to the Galaxy. The AI prompt is working well, the assistant is endlessly cheerful, and I even have a reasonably okay MaryTTS voice effect with reverb.
I'm not actively working on new wake words at this time. It would be better to get a decent all-in-one training script prepared than for me to attempt more wake words. The biggest problem is that it takes a lot of tweaking and experimentation to get a usable model, especially with TTS samples. Lately I've been focused on helping get Nabu Casa's voice satellite firmware ready to go, so getting a script together has been a lower priority. As I get the firmware components merged into ESPHome, I'll spend time preparing a script to make it easier for everyone to experiment with.
I understand. Thank you for all you've already done on this, it's absolutely amazing!
Can you explain why it takes a lot of time to tweak the TTS samples? Is it because the TTS generation is not really successful and there isn't enough variation?
There are also newer state-of-the-art TTS models that can clone voices from short samples, such as https://github.com/fishaudio/fish-speech and https://github.com/Plachtaa/VALL-E-X.
By the way, I tested this approach and got a pretty good result (I had previously trained an openWakeWord model with Piper samples, and compared to that, this is much better). I generated about 10,000 samples with the GPT-4o mini API and verified them with the Gemini API.
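A rough sketch of that generate-then-verify loop. The commenter used a GPT-4o-mini-based TTS endpoint and the Gemini API for verification; purely for illustration, this version substitutes OpenAI's `tts-1` speech endpoint and a local Whisper model, so treat the model names, voices, and filtering rule as assumptions:

```python
# Generate wake phrase samples with a cloud TTS API, then keep only the ones an
# ASR model transcribes back to the wake phrase.
from pathlib import Path

import whisper             # pip install openai-whisper
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY

client = OpenAI()
asr = whisper.load_model("base")
WAKE_PHRASE = "hey jarvis"
out_dir = Path("generated")
out_dir.mkdir(exist_ok=True)

kept = 0
for i, voice in enumerate(["alloy", "echo", "fable", "nova", "onyx", "shimmer"]):
    path = out_dir / f"hey_jarvis_{i:04d}.mp3"
    speech = client.audio.speech.create(model="tts-1", voice=voice, input=WAKE_PHRASE)
    speech.stream_to_file(str(path))

    # Verification step: transcribe and check that the wake phrase round-trips.
    text = asr.transcribe(str(path))["text"].lower()
    if WAKE_PHRASE in text.replace(",", "").replace(".", ""):
        kept += 1
    else:
        path.unlink()  # discard samples the verifier cannot recognize

print(f"kept {kept} verified samples")
```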
Correct, the variation does not seem to be enough. When I mention tweaking generation settings, I mean playing with the various noise settings and trying slightly different phonetic pronunciations to increase the variety. Other TTS models could be very useful, but it's possible that samples generated with commercial TTS engines are not licensed for training a new model. That's probably not a large concern if you are only using the model yourself, but it could potentially be an issue if you want to share it with others.
In an ideal world, collected wake word samples would be recorded with proper audio gear in a studio, so we would start from a high-quality sample. We could then augment them freely with various RIRs to simulate many different rooms, and easily add a noise source coming from a different spot in the same simulated room. However, collecting samples this way is expensive and challenging. Most people aren't equipped to record samples like this, so it would be difficult to get the variety necessary to train a robust model. Collecting as many samples as possible, which capture real room impulse responses, is a simpler alternative that can still train a decent-performing model.
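A minimal sketch of that augmentation step, assuming you already have a clean 16 kHz mono recording, a measured room impulse response, and a noise clip on disk (the file names are placeholders):

```python
# Simulate a room by convolving a dry sample with an RIR, then mix in noise at
# a chosen signal-to-noise ratio.
import numpy as np
import soundfile as sf                 # pip install soundfile
from scipy.signal import fftconvolve   # pip install scipy

clean, sr = sf.read("hey_jarvis_clean.wav")
rir, _ = sf.read("room_impulse_response.wav")
noise, _ = sf.read("kitchen_noise.wav")

# 1. Apply the room: convolve the dry sample with the impulse response.
reverberant = fftconvolve(clean, rir)[: len(clean)]

# 2. Mix in a noise clip at 10 dB SNR (to place the noise at a different spot
#    in the room, you would convolve it with its own RIR first).
snr_db = 10.0
noise = np.resize(noise, len(reverberant))
sig_power = np.mean(reverberant ** 2)
noise_power = np.mean(noise ** 2) + 1e-12
scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
augmented = reverberant + scale * noise

# 3. Normalize to avoid clipping and save the augmented training sample.
augmented /= max(1.0, float(np.max(np.abs(augmented))))
sf.write("hey_jarvis_augmented.wav", augmented, sr)
```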
The best collected wake word samples come from the device in actual use. It would be great if you could load firmware, or have a state where the device becomes a plain microphone, so it can be used for recordings. The whole point of not just capturing far-field audio with a single microphone is that the room impulse response (RIR) makes the audio hit and reflect off surfaces at different distances before returning and mixing at the microphone. If you could capture room impulse responses from a simple voice sample, the whole industry would need no microphone tech or algorithms, just the capture datasets and the corresponding models.

If you can convert the device into a microphone recording device, then create a simple dataset using the 'reader' approach above: a few phonetic pangram readings mixed with the chosen wake word. That is what the big data companies did: they recorded voice on-device, where tech such as beamforming had already attenuated the reverberation caused by the RIR. https://github.com/hshi-speech/Research-and-Analysis-of-Speech-Enhancement-or-Dereverberation Removing the reverberation of the RIR is the hard part, and that is why there is so much research and tech around it.
I had a go in ESPHome to see if any gurus would enable the USB audio class drivers on the XMOS so I could plug it in and use it as a microphone. The samples from Piper are far too similar and without enough variance, which is an easy fix with a newer TTS running on something with a GPU; even my Xeon workstation with an RTX 3050 can run a reasonable TTS.
This seems like an easy way to generate wake word samples. It will leave you with 1700+ "Hey Jarvis" samples with a bit more variation, and it is a simple script if you battle pip for the missing modules. Installing from the repo didn't work for me, so I did some quick hacks with https://github.com/StuartIanNaylor/syn-dataset.git; the English word dataset https://github.com/dwyl/english-words gets whittled down to the words contained in cmudict. https://drive.google.com/file/d/1QyzbPWyls253lLBb3Wn-vlg1T_qMHguJ/view?usp=drive_link
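Not the syn-dataset script itself, just a sketch of the same idea: drive Piper from Python while sweeping its synthesis parameters to get more varied clips of the wake phrase. The flag names and voice file are from memory and may differ between Piper versions, so check `piper --help` before relying on them:

```python
# Sweep Piper's noise/length parameters to generate varied "Hey Jarvis" clips.
import itertools
import subprocess

VOICE = "en_US-lessac-medium.onnx"  # placeholder voice model path
PHRASE = "Hey Jarvis"

noise_scales = [0.3, 0.667, 1.0]
length_scales = [0.8, 1.0, 1.2]
noise_ws = [0.4, 0.8]

for i, (ns, ls, nw) in enumerate(itertools.product(noise_scales, length_scales, noise_ws)):
    subprocess.run(
        ["piper", "--model", VOICE,
         "--noise_scale", str(ns),       # flag names assumed; verify with --help
         "--length_scale", str(ls),
         "--noise_w", str(nw),
         "--output_file", f"hey_jarvis_{i:04d}.wav"],
        input=PHRASE.encode(),
        check=True,
    )
```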
Oh well, it's good to know it's not just my imagination :) I've noticed people in the Voice PE threads also can't get an accurate and stable enough response from the device. Among those who are not native speakers, some get 70-80% and some even less, even though there is an XMOS chip that is supposed to handle echo somehow. I have also noticed that the Piper generator for the dataset is extremely static: one accent and only slightly different intonation. About the openWakeWord notebook, I immediately wondered why other TTS engines aren't used, or several of them, and why custom data isn't strongly suggested. Making a device with a button that records sound when pressed and sends it to HA is 10 minutes of development, just wire the button to a GPIO: pressing it records, releasing it sends (with variations), and Python trims all the pauses from the beginning and end of the file. I could be wrong about some or all of this, but for me, as a non-native English speaker, it's quite agonizing to talk to the assistant (to wake it up). As for recognizing commands, I use the big VOSK model and it works fine for me. So these are just thoughts out loud.
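The "trim the pauses from the beginning and end" step mentioned above is close to a one-liner with an energy-based trim; librosa is my choice of tool here, not necessarily the commenter's, and the file names are placeholders:

```python
# Trim leading/trailing silence from a button-triggered recording before
# sending it off or adding it to a dataset.
import librosa          # pip install librosa
import soundfile as sf  # pip install soundfile

audio, sr = librosa.load("button_recording.wav", sr=16000, mono=True)

# Drop everything quieter than 30 dB below the peak at the start and end.
trimmed, (start, end) = librosa.effects.trim(audio, top_db=30)
print(f"kept samples {start}..{end} of {len(audio)}")

sf.write("button_recording_trimmed.wav", trimmed, sr)
```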
If you augment 100-200 recorded words into a dataset of, say, 2000-4000 samples per label, you can get a very accurate KWS, but unfortunately it will only work for yourself. Even 2000-4000 samples per label is a tiny fraction of what the aptly named Big Data of Google would use.
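A sketch of how those 100-200 own-voice recordings might be inflated to a few thousand per label; audiomentations is just one convenient option, and the specific transforms and parameters here are illustrative:

```python
# Produce ~20 randomized variants per recording: with ~150 source recordings,
# that already lands in the 2000-4000 samples-per-label range mentioned above.
from pathlib import Path

import numpy as np
import soundfile as sf
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch  # pip install audiomentations

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
    TimeStretch(min_rate=0.85, max_rate=1.15, p=0.5),
    PitchShift(min_semitones=-3, max_semitones=3, p=0.5),
])

Path("augmented").mkdir(exist_ok=True)
audio, sr = sf.read("own_voice_hey_jarvis_001.wav")  # placeholder recording
audio = audio.astype(np.float32)

for i in range(20):
    variant = augment(samples=audio, sample_rate=sr)
    sf.write(f"augmented/hey_jarvis_001_{i:02d}.wav", variant, sr)
```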
It's Whisper that is SotA, but the WER figures it publishes are for a few languages on the large model with a 30-second context. Both TensorFlow and Torch have on-device training methods, but simply collating samples and retraining can also work if you have the patience, as the idle time of a smart speaker is usually very large.

I did a CLI "Word Boutique" to collect word samples of spoken KW and !KW via phonetic pangrams, which are nonsense sentences containing all the main phonemes of a language. They can be created for any language, so a few sentences can be the basis for adding "own voice" to any dataset.

Synthetic-wise, https://github.com/coqui-ai/STT, ⓍTTSv2 https://github.com/coqui-ai/TTS, and https://github.com/netease-youdao/EmotiVoice are very good, and it is questionable why existing tools need to wait until they are refactored and rebranded as HA before being used, as there is an ever-changing selection of great open source available, Vosk being one. Even for a KWS, accuracy could likely be improved by tailoring to a language family, such as https://github.com/AI4Bharat/Indic-TTS, since there are clear language branches that likely do not need a language-specific KWS: https://en.wikipedia.org/wiki/Indo-European_languages#/media/File:Indo-European_Language_Family_Branches_in_Eurasia.png

If you check my repos, I have been having another go at synthetic datasets; with simple models, accuracy is the dataset, and that can be quite a bit of work. Providing datasets for regional branches is likely best done by native speakers of those branches, perhaps by hacking some of the scripts and methods I have used. I do think it's a shame that neither the USB Class 1/2 audio drivers nor the I2S drivers on the Pi have been provided.
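A bare-bones version of the phonetic-pangram "own voice" collection idea: prompt each sentence, record a few seconds from the default microphone, and save it. The sentence list, durations, and file names are just examples:

```python
# Record yourself reading phonetic pangrams plus the wake phrase in context.
import sounddevice as sd  # pip install sounddevice
import soundfile as sf    # pip install soundfile

SAMPLE_RATE = 16000
SECONDS = 6
PROMPTS = [
    "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
    "The beige hue on the waters of the loch impressed all, including the French queen.",
    "Hey Jarvis, turn on the kitchen lights.",  # the wake phrase itself, in context
]

for i, sentence in enumerate(PROMPTS):
    input(f'\nPress Enter, then read aloud: "{sentence}"')
    recording = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()  # block until the recording finishes
    sf.write(f"own_voice_{i:02d}.wav", recording, SAMPLE_RATE)
    print(f"saved own_voice_{i:02d}.wav")
```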