Added mute/unmute functionality #62
base: master
Conversation
@rhurey I am using continuous speech recognition along with speech synthesis (as many do, I am sure). It is rather useful to be able to pause recognition while synthesized utterances are played by the app; otherwise it hears itself. Yet regular stop/start actions are prohibitively expensive (it takes quite a bit of resources and time to establish a WebSocket connection, for example). Hence this mute/unmute solution.
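For context, a minimal sketch of the flow being described, assuming hypothetical `mute()`/`unmute()` methods on the recognizer (the names and placement are assumptions about this PR, not the shipped SDK API), with the SDK's synthesizer standing in for whatever playback path the app uses:

```typescript
import {
  AudioConfig,
  SpeechConfig,
  SpeechRecognizer,
  SpeechSynthesizer,
} from "microsoft-cognitiveservices-speech-sdk";

const speechConfig = SpeechConfig.fromSubscription("<key>", "<region>");
const recognizer = new SpeechRecognizer(
  speechConfig,
  AudioConfig.fromDefaultMicrophoneInput()
);
const synthesizer = new SpeechSynthesizer(speechConfig);

recognizer.startContinuousRecognitionAsync();

// Pause recognition while the app speaks so the recognizer does not
// transcribe its own synthesized output, then resume afterwards.
function speak(text: string): void {
  (recognizer as any).mute(); // proposed by this PR; method name is assumed
  synthesizer.speakTextAsync(
    text,
    () => (recognizer as any).unmute(),
    () => (recognizer as any).unmute()
  );
}
```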
That's a good reason. I've been pretty sad myself over the lack / quality of echo cancellation that the microphone often provides. For what little it's worth, a good hardware canceling solution produces an amazing translation experience.

My concerns with simply muting the audio are around the effects that occur at the network / service layer. The approach you've got here turns off the audio coming out of the microphone, but it doesn't stop the WebRTC layer from shoveling audio frames at the Speech SDK. So we wind up transmitting frames to the service that are full of silence. The frames also now come as fast as possible, causing throttling to kick in and slow the data rate to 2x real-time. We wind up sending twice the data to the service muted as unmuted. (Assuming I cherry-picked the one commit into master successfully.) If the client simply ceased transmitting audio when muted, it would leave a number of service-side resources tied up waiting for more audio to arrive, as the service can't easily differentiate between audio being turned off and a glitchy network.

Start / Stop doesn't tear down the WebSocket connection, but sends a "done with audio" packet to the speech service indicating that there will be no further audio until a new recognition is started.

I did some measuring, and the initial recognition is taking ~800ms. I suspect there's some parallelism we can gain back in ServiceRecognizerBase::recognize by fetching the connection and the audio device in parallel. Maybe shave 300ms if we're lucky. The 2nd recognition is ~500ms from start to session running. Most of that time is spent getting the microphone again. The cost of sending the service configuration messages over an open WebSocket looks to be trivial.

We're going too far in tearing down the audio stack on stop when the AudioConfig object isn't closed. To test this, I changed ServiceRecognizerBase::stopRecognizing() to call audioSource.detach(...) instead of turnOff(), and the start() to session-running time dropped to ~4ms. (Plus another bug fix in the audio stream around an infinite recursion. Oops.) The downside is that the microphone icon in the browser stays on after stop is called, until close() is called on the recognizer. We'd need to test the file and stream implementations to see what the behavior side effects are there.

Sound reasonable?
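A minimal sketch of that detach-instead-of-turnOff experiment, under assumed interface names (the real ServiceRecognizerBase internals differ; only the idea of detaching on stop is taken from the comment above):

```typescript
// Assumed, simplified interface: the real SDK's audio source API differs.
interface IAudioSource {
  turnOff(): Promise<void>;          // releases the microphone entirely
  detach(audioNodeId: string): void; // removes one listener, keeps the mic warm
}

class ServiceRecognizerSketch {
  constructor(
    private audioSource: IAudioSource,
    private audioNodeId: string
  ) {}

  public async stopRecognizing(): Promise<void> {
    // Previous behavior: tear down the whole input stack.
    // await this.audioSource.turnOff();

    // Experimental behavior: detach only this recognizer, leaving the
    // microphone acquired so the next start reaches "session running"
    // far faster (~4ms in the measurement quoted above).
    this.audioSource.detach(this.audioNodeId);
  }
}
```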
rhurey/perf.improvements
@rhurey Thanks, Ryan. I am in the process of testing your solution in our app (on top of the promise+prepare branches). Seems to work great so far.
One big difference noted is that the microphone now stays turned on after a call to stopContinuousRecognitionAsync(). I need to call recognizer.close() to disconnect. Not sure if I like it.
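With that behavior, fully releasing the microphone becomes a two-step affair; a small sketch using the public SDK calls named above (reusing `recognizer` from the earlier sketch):

```typescript
recognizer.stopContinuousRecognitionAsync(
  () => {
    // Recognition has stopped, but the browser's microphone indicator can
    // stay lit until the recognizer itself is closed.
    recognizer.close();
  },
  (err: string) => console.error(err)
);
```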
Updated the branch to disable the audio track also. |
Disabling the track turns off the "red" microphone indicator (Firefox) for the tab; it doesn't seem to change the UI in Chrome. It's no worse than the proposed mute(). You can close() the recognizer completely and the microphone will be dropped in the same manner Stop() did before. With the change to parallelize connection creation, the cost of closing the recognizer vs. stopping the microphone (on Firefox) became nearly the same. The bigger downside to keeping the microphone attached is that the browser keeps sending audio frames to it that it then drops.
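For reference, disabling (rather than stopping) the audio track uses the standard MediaStream API; a short sketch, with illustrative names rather than the SDK's internal code:

```typescript
function setMicrophoneEnabled(stream: MediaStream, enabled: boolean): void {
  for (const track of stream.getAudioTracks()) {
    // enabled = false makes the track emit silence; in Firefox this also
    // turns off the tab's microphone indicator. The device stays acquired
    // (unlike track.stop()), so there is no reacquisition cost on resume.
    track.enabled = enabled;
  }
}
```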
I've tried it with the new changes. The use case is like this: we have a speech synthesis service and a speech recognition service, and a toggle button to turn them both on/off. Start/stop correspond well to the on/off actions (start is also triggered by an auto restart upon a failure). Close is called upon service destruction. Even when the toggle is on, we need to pause recognition while speech synthesis is active (which might happen every few seconds).
The extra audio frames don't surprise me. My observations below are from debugging on Firefox.

The inputs have a buffering multiplexer between them and the recognizer. Its design role is to allow an input to be shared across multiple recognizers. As each recognizer attaches to or detaches from the multiplexer, it starts getting the audio the input sees from the moment of attach forward, and the multiplexer buffers audio for each attached recognizer since they read it at different times. (This buffer becomes important later in the story.)

What happens when the audio track is disabled is that the incoming AudioProcessingEvent rate increases greatly, to much faster than real-time. At that point a second buffer begins to accumulate the audio so we can replay it in the event of a service disconnection. That buffer's output is throttled at 2x real-time, as that's what the service accepts. So we now have audio coming in much faster than real-time and a throttling buffer letting it out at a reduced (relative) rate. When unmuted, the incoming rate drops back to real-time while transmission continues at 2x real-time, and eventually the buffer drains. I haven't measured how much faster the incoming frame rate from the audio source is when the track is disabled, because I want to eliminate it totally.

So, on to what the change I sent does. Previously, when stop was called we didn't detach the recognizer from the audio input; we tore the entire input stack down, completely releasing the microphone. (Yes, this broke the multiple-recognizers-one-input scenario. Bug, need to add test cases.) Now we detach from the input's multiplexer and, in the case of the microphone, disable the track if there are no other streams. This still results in the AudioProcessingEvent rate increasing and the handler being called, but with no attached listeners the data goes to the moral equivalent of /dev/null. (That's not great, and before I PR the change I want to look at how to detach from the event notifications completely, as long as most of the time penalty of acquiring the microphone isn't in the script attach.) With the recognizer detached from incoming audio, stopping sends the "done with audio" signal to the service, and the WebSocket channel sits empty until it's used again, the service closes it, or the recognizer is closed.

As for mute/unmute vs. stop/start, I understand the case you're trying to solve, and start/stop is not conceptually perfect for it. But there are also some conceptual sharp edges around mute/unmute. First, mute/unmute is easily explained for a microphone, but not for any of the other input types, all of which AudioConfig represents. I'm not sure how we'd explain what mute/unmute means on a stream or a file. Second, as you've noticed, stopping the audio flow from the microphone is only the beginning of the solution. Suppose mute were also to stop sending data to the service; it would then also need to send the "audio done" packet, or some strange behavior would result. Imagine a user is talking and you want to play audio, so you mute the microphone and stop sending audio. If the user was silent when this happened and there was no recognition in flight, you'd get a good experience. If there was a recognition mid-phrase, the behavior would be odd: the service would pause recognition waiting for more audio, and resume recognizing when it restarted, not expecting a gap in the audio.

Because of the (small) latency between the microphone receiving audio and the recognition being returned, you can wind up with a recognition spanning the pre-mute and post-unmute audio coming back after the unmute. You don't see this today because silence is still being transmitted while muted. At this point, to get the functionality correct, mute essentially does what Stop should do.
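For readers following along, a minimal sketch of the buffering multiplexer as described above; the class and member names are illustrative, not the SDK's internals:

```typescript
type AudioChunk = ArrayBuffer;

class AudioMultiplexerSketch {
  // One queue per attached listener, so several recognizers can share a
  // single input and read the buffered audio at their own pace.
  private listeners = new Map<string, AudioChunk[]>();

  public attach(id: string): void {
    // From this point forward, incoming audio is buffered for this listener.
    this.listeners.set(id, []);
  }

  public detach(id: string): void {
    // Stop buffering for this listener; if no listeners remain, the
    // underlying track could be disabled here, as the branch does.
    this.listeners.delete(id);
  }

  public onAudio(chunk: AudioChunk): void {
    // With no attached listeners, the chunk is simply dropped
    // ("the moral equivalent of /dev/null").
    for (const queue of this.listeners.values()) {
      queue.push(chunk);
    }
  }

  public read(id: string): AudioChunk | undefined {
    return this.listeners.get(id)?.shift();
  }
}
```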
@rhurey Ryan, thank you for the detailed explanation. It's much appreciated.
Sorry about the delay in getting back to you. To answer your question directly: yes, it is possible to change the code so the packets stop when muted. The rhurey/perf.improvements branch has all the changes needed to make Stop() detach the audio source instead of turning it off. Mute / UnMute would need the same Detach/Attach semantics. To remove the risk of having some in-progress audio be recognized after the unmute, you'd also need to end the recognition with the service; I could argue that's not needed, depending on the use case. As for what direction to go with the SDK in general, after debate, I'm going to do the following:
I understand this would give the stop (mute) / start (unmute) semantics for quickly starting and stopping recognition to block speech capture during audio output, but at the cost of having to close() the recognizer to completely release the microphone. From my testing, the cost of re-establishing the WebSocket connection to create a new recognizer and start recognition again is overshadowed by the microphone capture costs. Can you help me understand what other costs I'm missing that make construction of a new recognizer suboptimal in this scenario?
@rhurey Ryan, please have a look at the latest commit. Hopefully the use of GainNode to mute/unmute would help with your concerns. |
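For reference, a minimal sketch of muting through a GainNode using the standard Web Audio API; how the resulting stream would be wired into the SDK's audio pipeline is omitted, and the names are illustrative:

```typescript
const audioContext = new AudioContext();

async function createMutableMicSource() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioContext.createMediaStreamSource(stream);
  const gainNode = audioContext.createGain();
  const destination = audioContext.createMediaStreamDestination();

  // mic -> gain -> destination stream that the recognizer would consume.
  source.connect(gainNode);
  gainNode.connect(destination);

  return {
    stream: destination.stream,
    // Gain of 0 silences the samples; gain of 1 restores them.
    mute: () => gainNode.gain.setValueAtTime(0, audioContext.currentTime),
    unmute: () => gainNode.gain.setValueAtTime(1, audioContext.currentTime),
  };
}
```

Note that, per the concerns raised earlier in the thread, this zeroes the samples but the frames keep flowing downstream, so the service still receives (silent) audio while muted.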
@rhurey Ryan, any updates on this PR? |
The core issue for us is that we want to make Start/Stop work well in all scenarios rather than add a mute / unmute. The current plan is to take rhurey/perf.improvements, fix it up for production (there's a bunch of tracing, etc., in it), and add a property to control how the microphone is released on Stop()... But the timeline is fuzzy due to other work items and tasks... (I'd gladly accept help in that endeavor, since I wish it were done already.) As for the subscription limit, you can ask me. If you've got something specific and less public, send me an email @microsoft.com.
I just went to (finally, I know) implement a soft release, and am getting ~30ms to grab the microphone in Chrome and Firefox, where it was 10x that when we first started this discussion.
Fully implemented in MicAudioSource only, stubbed elsewhere.