
Is this repository more accurate than OpenWakeWord? #28

Open
dilerbatu opened this issue Jul 16, 2024 · 27 comments

@dilerbatu

I want to learn what the difference is between openWakeWord and this repository. Does the accuracy differ?

@s-dome17

This is a smaller model that can run on low-power devices such as the ESP32-S3.

@synesthesiam
Collaborator

openWakeWord uses a pre-trained model from Google to increase accuracy. This works great, but the Google model is pretty complex.

microWakeWord is smaller and trained from scratch so it can run on ESP devices.

@s-dome17

@synesthesiam I was trying to train a wake word using microWakeWord for Hey Pixa, but was finding it a bit difficult compared to openWakeWord. Have you trained any wake word here before? Is that something you can help with?

@kahrendt
Owner

It is much easier to train a good model with openWakeWord than with microWakeWord. While the actual training of a mWW model only takes a couple of hours on my hardware, it usually takes me a couple of weeks of tweaking the TTS samples to get a usable model. That is a big part of the challenge of writing an all-in-one training script for microWakeWord!

It is hard to compare accuracy for wake word models, as there isn't a standardized test that describes all situations well. In my specific tests, I have found the new V2 models are slightly better than openWakeWord's, but it is unclear whether this is reflected in real-world use.

@dilerbatu
Author

Thanks for the information, and for the work 🚀

@gustvao

gustvao commented Jul 22, 2024

@kahrendt I have been able to use the pre-trained wake words efficiently, but funnily enough, when my wife says alexa or hey jarvis at the same distance/volume as me, it does not get recognized.

I tried using a higher pitch and microWakeWord stopped detecting.

Do you think we might be training the model only with male samples? What about always having both a male and a female option? That way it would be more inclusive, given that we can use more than one wake word at a time on the same device :)

Amazing work, and thanks again.

@kahrendt
Owner

Interesting result! The samples are all generated using models trained on the LibriTTS dataset. The dataset is roughly balanced between genders as a whole, so in theory the generated samples should also be balanced. I did, however, restrict the number of voices when generating samples for several of the V2 models, and it is possible that resulted in an unbalanced set. I'll investigate this, thanks for pointing it out!

Out of curiosity, have you tried the "Hey Mycroft" wake word? I'd be interested to hear whether you experience the same behavior, as I used many real recordings in addition to TTS-generated samples.

@gustvao

gustvao commented Jul 22, 2024 via email

@kahrendt
Owner

I had another thought as well that could improve it. The ESPHome 2024.7.0 release changed how loud the microphone audio was; basically, it was 4 times quieter than in the previous version. The 2024.7.1 release reverted that change, so it should match the same behavior as before. If you were on the 2024.7.0 release, I'd appreciate it if you updated ESPHome and tested again!

@gustvao

gustvao commented Jul 22, 2024 via email

@TheStigh

@kahrendt Hi, I experience the same with Alexa and Hey Jarvis, my wife is not picked up at all unless she really lowers the pitch of her voice, resulting in her telling me to remove it from the living room :) Currently running 2024.8.0. Tested on both the Atom Echo and the ESP32-S3-BOX-3.

@kahrendt
Owner

> @kahrendt Hi, I experience the same with Alexa and Hey Jarvis, my wife is not picked up at all unless she really lowers the pitch of her voice, resulting in her telling me to remove it from the living room :) Currently running 2024.8.0. Tested on both the Atom Echo and the ESP32-S3-BOX-3.

Thanks for the feedback. I'm brainstorming ideas on how to properly address this, as the bias only gets worse when I additionally train on collected real samples. One thought is to oversample the TTS samples that are based on female voices to see if that improves things. I'm also suspicious that the samples generated with slerp (where two trained voices are mixed together) are making this worse.
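For illustration only (this is not microWakeWord's actual pipeline), a minimal sketch of that oversampling idea, assuming each generated clip carries speaker-gender metadata in a hypothetical samples.csv:

# Hypothetical sketch of oversampling female-voice TTS clips to balance a training set.
# The samples.csv file and its "speaker_gender" column are assumptions for illustration.
import csv
import random

with open("samples.csv") as f:  # assumed columns: path, speaker_gender
    rows = list(csv.DictReader(f))

female = [r["path"] for r in rows if r["speaker_gender"] == "F"]
male = [r["path"] for r in rows if r["speaker_gender"] == "M"]

# Duplicate randomly chosen female clips until both groups are the same size.
extra = [random.choice(female) for _ in range(max(0, len(male) - len(female)))]
training_paths = male + female + extra
random.shuffle(training_paths)
print(f"{len(male)} male clips, {len(female) + len(extra)} female clips after oversampling")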

@TheStigh

TheStigh commented Aug 28, 2024

> @kahrendt Hi, I experience the same with Alexa and Hey Jarvis, my wife is not picked up at all unless she really lowers the pitch of her voice, resulting in her telling me to remove it from the living room :) Currently running 2024.8.0. Tested on both the Atom Echo and the ESP32-S3-BOX-3.

> Thanks for the feedback. I'm brainstorming ideas on how to properly address this, as the bias only gets worse when I additionally train on collected real samples. One thought is to oversample the TTS samples that are based on female voices to see if that improves things. I'm also suspicious that the samples generated with slerp (where two trained voices are mixed together) are making this worse.

I'm happy to test and use my wife as the female tester :)
Or, whatever I can do to assist.

@thirstyone

Can it be run on an ESP32-C3 by any chance?

@Digital-Ark

This is very cool. I successfully got an ESP32-S3 responding to "Hey Jarvis".

Unfortunately, I wasted a lot of time training a new "Hey Eddie" openWakeWord model, but I now realize that microWakeWord is a separate thing.

On-device wake words are so much better: no continuous bandwidth to Home Assistant, and a snappier response.

Are you in any way soliciting new wake words? My plan was to make my Assist act like Eddie the Shipboard Computer from The Hitchhiker's Guide to the Galaxy. The AI prompt is working well, the assistant is endlessly cheerful, and I even have a reasonably okay MaryTTS voice effect with reverb.

@kahrendt
Owner

I'm not actively working on new wake words at this time. It would be better to get a decent all-in-one training script prepared than for me to attempt more wake words. The biggest problem is that it takes a lot of tweaking and experimentation to get a usable model, especially with TTS samples. Lately I've been focused on helping get Nabu Casa's voice satellite firmware ready to go, so getting a script together has been a lower priority. As I get the firmware components merged into ESPHome, I'll spend time preparing a script to make it easier for everyone to experiment with.

@Digital-Ark

I understand, thank you for all you've already done on this, it's absolutely amazing!

@erkamkavak

> I'm not actively working on new wake words at this time. It would be better to get a decent all-in-one training script prepared than for me to attempt more wake words. The biggest problem is that it takes a lot of tweaking and experimentation to get a usable model, especially with TTS samples. Lately I've been focused on helping get Nabu Casa's voice satellite firmware ready to go, so getting a script together has been a lower priority. As I get the firmware components merged into ESPHome, I'll spend time preparing a script to make it easier for everyone to experiment with.

Can you explain why it takes a lot of time to tweak TTS samples? Is this because TTS generation is not really successful and the variation is not enough?
I wonder if new models like OpenAI's GPT-4o mini audio or Google's Gemini 2.0 Flash could be used to generate TTS samples, since you can customize the tone or pronunciation in the output audio. If the biggest problem for microWakeWord is dataset generation, maybe these models can solve it.

@StuartIanNaylor

StuartIanNaylor commented Dec 28, 2024

> I wonder if new models like OpenAI's GPT-4o mini audio or Google's Gemini 2.0 Flash could be used to generate TTS samples, since you can customize the tone or pronunciation in the output audio. If the biggest problem for microWakeWord is dataset generation, maybe these models can solve it.

There are also newer SotA TTS models that can clone voices from short samples: https://github.com/fishaudio/fish-speech and https://github.com/Plachtaa/VALL-E-X
You are right, though: I listened through the 1000 TTS wav samples Piper makes, and it's two very similar voices, one slightly more male than the female variation, with a little pitch variation.
It's a million miles away from the intonation variance that different human voices from various regions produce...
It's also synthetic-sounding, and that is likely embedded in the spectra; per the old saying 'garbage in, garbage out', that dataset makes quite bad models.
It gets even stranger later, where 3D room RIRs are converted into wavs without using something like https://github.com/LCAV/pyroomacoustics or https://github.com/DavidDiazGuerra/gpuRIR to place the mic and speaker as point sources and accurately reproduce how that room's RIR would record the sample played...
I guess they don't understand what RIRs are, as the same thing, from a different angle, is happening here.
OHF-Voice/wake-word-collective#11
You don't want RIRs in a dataset with no metadata on device distance, and they also use Common Voice, which is a true testament to how badly Mozilla handled that project and how badly the subsequent forced alignment by MLCommons works.
There is a huge amount of bad data in MLCommons, especially for shorter words, and short words are the majority of the capture.
Also, in Common Voice, some universities in India or somewhere made a huge push to add voices, and there is a real problem when non-native speakers have strong accents very different from native speakers, especially when there is no metadata to filter one or the other.
I have given up on using Common Voice and MLCommons because, with the quantity of bad data, they are not good datasets, even if presented otherwise, and that push submitted approximately 50% of the English corpus.
Better could likely be done from sources other than Common Voice, but for me, being English, the English dataset is broken. The goals of Common Voice were great, but sadly the implementation...

@erkamkavak

> I wonder if new models like OpenAI's GPT-4o mini audio or Google's Gemini 2.0 Flash could be used to generate TTS samples, since you can customize the tone or pronunciation in the output audio. If the biggest problem for microWakeWord is dataset generation, maybe these models can solve it.

By the way, I tested this approach and got a pretty good result (I have previously trained an openWakeWord model with Piper samples, and compared to it this is much better). I generated about 10,000 samples with the GPT-4o mini API and verified them with the Gemini API.
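For context, a minimal sketch of that kind of generation call, assuming the OpenAI Python SDK's audio-output chat completions (the model name, voice, and prompt wording are assumptions, not the exact script used here):

# Rough sketch of generating one wake word sample with an audio-output chat model.
# Model name, voice, and prompt are assumptions; check the current API docs.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.chat.completions.create(
    model="gpt-4o-mini-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Say exactly, casually and quickly: Hey Jarvis"}],
)
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("hey_jarvis_000.wav", "wb") as f:
    f.write(wav_bytes)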

@kahrendt
Owner

> Can you explain why it takes a lot of time to tweak TTS samples? Is this because TTS generation is not really successful and the variation is not enough? I wonder if new models like OpenAI's GPT-4o mini audio or Google's Gemini 2.0 Flash could be used to generate TTS samples, since you can customize the tone or pronunciation in the output audio. If the biggest problem for microWakeWord is dataset generation, maybe these models can solve it.

Correct, the variation does not seem to be enough. When I mention tweaking generation settings, I mean playing with the various noise settings and trying slightly different phonetic pronunciations to increase the variety.

Other TTS models could be very useful, but it's possible that samples generated with commercial TTS engines are not licensed for training a new model. That's probably not a large concern if you are only using it yourself, but it could potentially be an issue if you want to share the model with others.

@kahrendt
Owner

> It gets even stranger later, where 3D room RIRs are converted into wavs without using something like https://github.com/LCAV/pyroomacoustics or https://github.com/DavidDiazGuerra/gpuRIR to place the mic and speaker as point sources and accurately reproduce how that room's RIR would record the sample played... I guess they don't understand what RIRs are, as the same thing, from a different angle, is happening here. OHF-Voice/wake-word-collective#11

In an ideal world, collected wake word samples would be recorded with proper audio gear in a studio, so we get just the high-quality sample. We could then augment them freely with various RIRs to simulate many different rooms. We could also easily add a different source of noise to the sample, coming from a different spot in the same simulated room. However, collecting samples this way is expensive and challenging. Most people aren't equipped to record samples like that, so it would be difficult to get the variety necessary to train a robust model. Collecting as many samples as possible, which capture real room impulse responses, is a simpler alternative that can still train a decent-performing model.
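As a rough illustration of that augmentation idea, a minimal sketch using pyroomacoustics (the file names, room dimensions, positions, and absorption are made-up values, not a recipe from this project):

# Minimal sketch: simulate a room and "re-record" a clean sample through its RIR.
# File names, room size, positions, and absorption below are illustrative only.
import numpy as np
import soundfile as sf
import pyroomacoustics as pra

clean, fs = sf.read("hey_jarvis_clean.wav")        # assumes a mono, studio-like sample
room = pra.ShoeBox([5.0, 4.0, 2.7], fs=fs,         # 5 m x 4 m x 2.7 m shoebox room
                   materials=pra.Material(0.35),   # fairly absorbent surfaces
                   max_order=12)                   # reflection order to simulate
room.add_source([1.0, 2.0, 1.5], signal=clean)     # talker position
mic = pra.MicrophoneArray(np.array([[3.5], [2.0], [1.2]]), room.fs)
room.add_microphone_array(mic)                     # satellite microphone position
room.simulate()                                    # convolves the source with the room RIR

wet = room.mic_array.signals[0]
sf.write("hey_jarvis_roomsim.wav", wet / np.max(np.abs(wet)), fs)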

@StuartIanNaylor

StuartIanNaylor commented Dec 29, 2024

The best collected wake word samples come from the device in actual use.
It is possible to create a rolling window to capture the KW on a detection and send it over WebSockets to an intermediate server to collate a local dataset.
It would be great if we had that function, plus an opt-in to upload to an open-source dataset with metadata filled in by the user.
Same with ASR: there is pretty clear logic for whether a command was correct, and those on-device sentences would also be great.

It would also be great if you could load firmware, or have a state machine, where the device becomes a microphone so that you can use it for recordings.
https://github.com/StuartIanNaylor/ProjectEars/tree/main/dataset/reader
is just a CLI recording 'boutique' that prompts the user to say keywords and words from phonetic pangrams, sentences containing all the phonemes of a language. Likely because they are relatively nonsensical sentences covering every phoneme, this creates a balanced set for the !KW classification.

The whole point of not just having a single microphone to capture far field is that the RIR causes the audio to hit and reflect off surfaces at different distances, then return and mix at the microphone.
Depending on the mic-to-source distance, this causes huge spectral distortions, as differing room surfaces reflect with time delays that mix at the microphone, creating harmonics.
That is why nearly all far-field tech tries to focus on and remove the effects of RIRs before submitting audio to KWS and ASR. RIRs and their effects are why we have all these various multi-microphone technologies in voice tech, all with differing algorithms.
You don't capture RIRs in a wav, just the distortion the mix of reverberation has caused in that room, which causes a big increase in the entropy of the captured samples.
Capturing samples of people on any device, in any room, at any distance, with zero metadata will create a relatively useless dataset and a pretty pointless alternative.
Even the device you are using processes out RIR effects by extracting voice via two microphones and cleaning the sample before the KWS.
So why would you train that KWS on samples containing far-field reverberation?

If you could just record room impulse responses via a simple voice sample, the whole industry would need no microphone tech or algorithms, just the captured datasets and the corresponding models.
Good luck with the models you will create.

If you can convert the device into a microphone recording device, then create a simple dataset using the above 'reader': a few phonetic pangram readings mixed with the chosen KW.
Augment up to 2000-4000 records and use a better noise dataset, as the one currently in use is pretty flawed. Do not add further reverberation via the augmentation; leave the samples as-is, since they are real device captures (a sketch of that mixing step follows this paragraph).
Create a rough-and-ready model from your own-voice KW and !KW sets, plus a noise dataset that contains zero voice or music, and test.
The more you record with different emotions and intonations the better; create separate datasets for near field (<0.3 m), close field (0.3-1 m) and far field (1-3 m), and test them as different models.
It depends on the tech of your microphone array, as it should attenuate reverberation caused by the RIR.
You don't really record reverberation at <0.3 m, normal broadcast style.
Close field will likely be OK, but the effect on the spectra will increase dataset entropy.
Far field is just too much, as the resulting recorded spectra can be vastly different from the target voice without speech enhancement.
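A minimal sketch of that noise-mixing augmentation, adding noise at a chosen SNR without introducing any extra reverberation (the file names and SNR values are placeholders):

# Sketch: augment an on-device wake word capture with noise at target SNRs,
# deliberately adding no reverberation. File names and SNR choices are placeholders.
import numpy as np
import soundfile as sf

speech, fs = sf.read("kw_on_device_0001.wav")
noise, fs_n = sf.read("noise_clip.wav")
assert fs == fs_n, "resample the noise to the capture's sample rate first"

# Loop or trim the noise to match the speech length.
reps = int(np.ceil(len(speech) / len(noise)))
noise = np.tile(noise, reps)[: len(speech)]

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so the speech-to-noise power ratio equals snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = speech + gain * noise
    return mixed / max(1.0, np.max(np.abs(mixed)))  # normalize only if it would clip

for snr in (20, 10, 5):
    sf.write(f"kw_aug_snr{snr}.wav", mix_at_snr(speech, noise, snr), fs)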

That is what big data did: they recorded voice on-device, where various tech such as beamforming had already attenuated the reverberation caused by the RIR.
They had the metadata of user profiles, and from 3-mic triangles up to more complex arrays you can pinpoint distance and location.
They didn't sit in pro recording studios, but likely sat in an anechoic chamber recording through the device to create a starter dataset; the gold datasets they have, though, are of actual users, and they are big, filled with metadata, and likely validated.

https://github.com/hshi-speech/Research-and-Analysis-of-Speech-Enhancement-or-Dereverberation

Removing the reverberation of the RIR is the hard part, and why we have so much research and tech.
Adding RIR reverberation accurately to clean samples is relatively easy, which is why clean samples are the ones of interest for KWS and ASR.
If you are going to create a speech enhancement model, then collect far-field RIR reverberation on voice commands.

@StuartIanNaylor

I had a go in the ESPHome repo to see if any gurus would enable the USB audio class drivers on the XMOS so I can plug it in and use it as a microphone.
I haven't really bothered having a go at creating a dataset, as I really need an XMOS device, one where I can play with Python for recording, have a listen, and try to analyse how good the far field is.
I am sort of blind without that, and the project seems very closed to ESP32-S3 only, likely to its detriment.

The samples from Piper are far too similar, without enough variance, which is an easy fix with newer TTS running on something with a GPU; even my Xeon workstation with an RTX 3050 can run reasonable TTS.
The recorded RIRs are just total bunkum, as they are all recorded at 1.5 m, but only a few are typical room sizes, whilst the dataset contains many huge halls, churches and even forests; why that dataset or that method was chosen is a complete mystery.
When you have speech enhancement for far field, your spectra are going to be different from a single mic, but without Audacity, some spectrograms, some analysis and trial and error, I just don't know.
There's not much you can do about MLCommons, as there is a huge amount of bad data, but analysing it in a database and grabbing an even spread of phonetics by syllable will help.
Putting any voice in the noise dataset increases cross-entropy, as a classification model is not AI, just a pure graph, and it will lower the overall KW hit probability. The 'cocktail party' problem is on par with dereverberation, and the idea that you can just add it to a classification model is a basic 101 misunderstanding of the huge amount of research and papers published on how to tackle it.
Music can be very similar: any vocals are so close to voice that the same is true, and often certain instruments can give very vocal-looking spectra.
You can check this by using your dataset as the input to a trained model and checking the probability of each label.
Often I would do that to remove the worst 5-10% of a dataset and retrain (a sketch of that filtering pass is below).
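A rough sketch of that filtering pass, assuming a TFLite classifier and precomputed feature arrays (the paths, label index, and input format are assumptions about your own pipeline):

# Sketch: score every training clip's features with an existing TFLite KWS model
# and list the lowest-probability 5% as candidates for removal before retraining.
# Paths are placeholders; the .npy feature files are assumed to match the model input.
import glob
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="kws_model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
KW_INDEX = 1  # assumed index of the wake word label in the model output

scores = []
for path in glob.glob("features/kw/*.npy"):
    features = np.load(path).astype(np.float32)[np.newaxis, ...]
    interpreter.set_tensor(inp["index"], features)
    interpreter.invoke()
    prob = float(interpreter.get_tensor(out["index"])[0][KW_INDEX])
    scores.append((prob, path))

scores.sort()                              # lowest-confidence clips first
for prob, path in scores[: int(0.05 * len(scores))]:
    print(f"{prob:.3f}  {path}")           # review or drop these before retraining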

@StuartIanNaylor

StuartIanNaylor commented Jan 13, 2025

This seems like an easy way to get KW samples:
pip install coqui-tts

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

languages = tts.languages
speakers = tts.speakers
texts = ["Hey! Jarvis!", "Hey Jarvis", "hey jarvis"]  # punctuation/case variants nudge the prosody

for speaker in speakers:
    for lang in languages:
        # Skip languages that don't suit the phrase; otherwise generate the phrase
        # with each of the model's built-in speakers using default settings.
        if lang not in ("hi", "ko"):
            for n, text in enumerate(texts, start=1):
                tts.tts_to_file(text=text,
                                file_path=f"{speaker}-{n}-{lang}.wav",
                                speaker=speaker,
                                language=lang,
                                split_sentences=False)

That will end you up with 1700+ 'Hey Jarvis' samples with a bit more variation, and it is a simple script once you battle pip for the missing modules. Installing from the repo with pip install -e .[all,dev,notebooks] didn't work for me; just pip install coqui-tts.
Having a think about !KW: really, languages should be grouped. Being English, I am Indo-European and share much with Germanic (North and West); there is a nice language map at https://en.wikipedia.org/wiki/Indo-European_languages
Probably you would exclude Portugal, Spain, France, Italy and Romania, and instead create two KW models: one that suits Germanic languages and one that suits Italic (Romance) languages.
I don't think you need a KW model for each nation; the words may differ, but the phonetic pronunciation is often similar enough that you can group them.

I did some quick hacks with https://github.com/StuartIanNaylor/syn-dataset.git; the English word dataset https://github.com/dwyl/english-words gets whittled down to the words contained in cmudict.
There may be better sources, and I probably didn't need to limit by cmudict since I used the soundex table (you can view the 'view' in the database, as it's syllable-based), but by nature it probably gives more likely words.
MLCommons, due to the quantity of bad data, is a bit of a stinker, as opposed to synthetic creation, which also has a habit of creating bad samples.
Forcing other languages onto the text language does exacerbate this.
Excluding words whose syllable soundex matches Hey Jar Vis on the first three syllables was a hack at getting some form of distribution, as it uses group clauses to get a unique record set, also grouped by syllable count.
It's in a database; the downside is that it is synthetic, the upside is that it doesn't have word shortages, as whatever distribution you choose, you can create...
I hacked together some scripts to augment and create datasets: https://github.com/StuartIanNaylor/create-dataset
It is a test dataset for Hey Jarvis using far more variance and language accents; I haven't tested whether a narrower language dataset is more accurate, as I am still short of KW voices, but I will train these in a KWS on Arm and just see how it goes.
It is ready-augmented, but psounds is no longer around, so the noise dataset is a bit limited.

https://drive.google.com/file/d/1QyzbPWyls253lLBb3Wn-vlg1T_qMHguJ/view?usp=drive_link
Trained model here https://github.com/StuartIanNaylor/tf-kws
There are still a few mistakes, as the dataset input volume is likely too loud, with not enough headroom to dodge clipping.
There are also still not enough voices, and the dataset isn't big enough, but it will do as an example that it's the dataset that matters.
Your dataset should match your input, and the best option is to use on-device capture...
TFLite runtime with Flex ops here: https://drive.google.com/file/d/1eHcoMFNgbFUlro3T28jSipAIaAsI0965/view?usp=drive_link

@indevor

indevor commented Feb 8, 2025

Oh well, it's good that it's not just my imagination :) I've noticed people in the Voice PE threads also can't get an accurate and stable enough response from the device. Among those who are not native speakers, some get 70-80% and some even less, although there is an XMOS chip and it should fight echo somehow.
My experience with a homemade two-microphone device without XMOS, running the Voice PE code:
Small room with furniture - 1.5-2 meters - tolerable, 85% success.
Corridor without furniture or soft objects - 1.5-2 meters - 65-75%; 3-5 meters - 45-55%. This is already painful: echoing and re-reflection of sound.
(I am not a native speaker.) alexa - you need to pronounce an unvoiced “L”, almost with one intonation, clearly, for the result to be better; the worst model.
jarvis - slightly better.
okay nabu - most likely trained on a user dataset; the best model.

I too have noticed that the Piper generator for the dataset is extremely static: one accent and only slightly different intonation. Regarding the openWakeWord notebook, I immediately thought: why aren't other TTS engines used, or several? Why isn't custom data strongly suggested?
Also, a big question: why can't custom datasets be recorded from the devices themselves?

Making a device with a button that records audio when pressed and sends it to HA is 10 minutes of development; just add the button to a GPIO. Pressing it records audio, releasing it sends it (with variations), and Python trims the pauses from the beginning and end of the file.
I am not an expert in AI and speech recognition at all, but it seems to me that if you record at least 100-200 words from a real device, spending time in different rooms of your house, the result will be better than what I hear was obtained from a “robotic” Piper.

I could be wrong about something or everything, but for me, as a non-native English speaker, it's quite agonizing to talk to the assistant (to wake it up). As for recognizing commands, I use the big VOSK model and it works fine for me. So these are just thoughts out loud.

@StuartIanNaylor

StuartIanNaylor commented Feb 9, 2025

> I am not an expert in AI and speech recognition at all, but it seems to me that if you record at least 100-200 words from a real device, spending time in different rooms of your house, the result will be better than what I hear was obtained from a “robotic” Piper.

If you augment the 100-200 words into a dataset of, say, 2000-4000 samples in each label, you can get a very accurate KWS, but unfortunately it will only work for you.
You can also do on-device training, but the 'device' would likely need to be a middleware N100 or Casa Cloud, where you ship out a model OTA to the microcontroller.
I am sort of thinking that a Pi Zero 2 or equivalent has so much more compute than an ESP32 that it's a shame they have been omitted.

2000-40000 samples in each label is really a tiny fraction of what the correctly named Big Data of Google would use.
https://developers.google.com/machine-learning/crash-course/overfitting/data-characteristics gives a rough guide to dataset sizes, with KWS parameter counts often being around 400k.

> I could be wrong about something or everything, but for me, as a non-native English speaker, it's quite agonizing to talk to the assistant (to wake it up). As for recognizing commands, I use the big VOSK model and it works fine for me. So these are just thoughts out loud.

With Whisper, the SotA WERs it publishes are for a few languages on the large model with a 30-second context.
https://www.speechly.com/blog/analyzing-open-ais-whisper-asr-models-word-error-rates-across-languages
Use certain languages, with certain model sizes and short context, and the WER is likely far worse, to the point where it's debatable whether your language is supported.

Both TensorFlow and Torch have on-device training methods, but simply collating and retraining can also be done if you have the patience, as the idle time of a smart speaker is usually very large.

I did a CLI word boutique to collect word samples of spoken KW and !KW via phonetic pangrams, which are nonsense sentences containing all the main phonemes of a language. They can be created for any language, so a few sentences can be the basis of adding 'own voice' to any dataset (a minimal recorder sketch follows below).
Some rough hacks are in https://github.com/StuartIanNaylor/Dataset-builder or https://github.com/StuartIanNaylor/ProjectEars/tree/main/dataset/reader
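A minimal sketch of that kind of prompt-and-record loop (the prompts, durations, and file names are placeholders, not the linked tools):

# Sketch of a prompt-and-record loop for collecting own-voice KW and pangram samples.
# Prompts, durations, and file names are placeholders; use real phonetic pangrams
# for your language in place of the example strings below.
import sounddevice as sd
import soundfile as sf

FS = 16000        # 16 kHz mono, typical for KWS pipelines
SECONDS = 3
PROMPTS = [
    "hey jarvis",
    "example pangram sentence one",   # replace with a phonetic pangram
    "example pangram sentence two",   # replace with a phonetic pangram
]

for i, prompt in enumerate(PROMPTS):
    input(f"Press Enter, then say: '{prompt}'")
    audio = sd.rec(int(SECONDS * FS), samplerate=FS, channels=1)
    sd.wait()  # block until the recording finishes
    sf.write(f"sample_{i:03d}.wav", audio, FS)
    print(f"saved sample_{i:03d}.wav")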

Synthetic-wise, https://github.com/coqui-ai/STT, ⓍTTSv2 https://github.com/coqui-ai/TTS and https://github.com/netease-youdao/EmotiVoice are very good, and why existing tools need to wait until they are refactored and rebranded as HA before being used is questionable, as there is an ever-changing selection of great open source available, Vosk being one.

Even for a KWS, accuracy could likely be improved by tailoring to a language family, such as https://github.com/AI4Bharat/Indic-TTS, as there are likely clear language branches that do not need a language-specific KWS: https://en.wikipedia.org/wiki/Indo-European_languages#/media/File:Indo-European_Language_Family_Branches_in_Eurasia.png

If you check my repos, I have been having another go with synthetic datasets; with simple models, accuracy is the dataset, and it can be quite a bit of work. Providing datasets for regional branches is likely best done by native speakers of that branch, maybe hacking some of the scripts and methods I have used.
https://github.com/StuartIanNaylor/create-dataset
https://github.com/StuartIanNaylor/syn-dataset
You can test with a KWS script of your choice; I use https://github.com/google-research/google-research/tree/master/kws_streaming but have to branch at a commit from some years ago.
For now, use https://github.com/StuartIanNaylor/tf-kws to test.

I do think it's a shame that neither the USB Class 1/2 audio drivers nor the I2S drivers on the Pi have been provided.
Likely https://github.com/tRackIT-Systems/snd-i2s_rpi, the dumb-slave I2S microphone drivers for the Pi that Adafruit document, would work. I've always wondered, though, whether the 24/16-bit data is being presented as 32-bit PCM, which is why they always seem so quiet.
