Added mute/unmute functionality #62

Open · wants to merge 2 commits into master

Conversation

Domratchev
Contributor

Fully implemented in MicAudioSource only, stubbed elsewhere.

@rhurey
Member

rhurey commented Apr 29, 2019

What's the use case for mute / unmute that's different than calling Start / Stop recognition?


Refers to: src/sdk/Audio/AudioConfig.ts:252 in 17326ef.

@Domratchev
Contributor Author

@rhurey I am using continuous speech recognition along with speech synthesis (as many do, I am sure). It is rather useful to be able to pause recognition while synthesized utterances are played by the app, otherwise it hears itself. Yet regular stop/start actions are prohibitively expensive (it takes quite a bit of resources and time to establish a WebSocket connection, for example). Hence this mute/unmute solution.
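For illustration, a minimal sketch of that usage pattern, assuming the mute()/unmute() methods this PR proposes on AudioConfig (speakText is a stand-in for the app's own synthesis playback):

```typescript
// Hypothetical usage of the mute()/unmute() methods this PR proposes on AudioConfig.
// speakText() stands in for the app's synthesis playback; the casts acknowledge that
// the methods are not part of the published typings.
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

declare function speakText(text: string): Promise<void>;

const speechConfig = sdk.SpeechConfig.fromSubscription("<key>", "<region>");
const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();
const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

recognizer.startContinuousRecognitionAsync();

// Pause capture while synthesized speech plays so the recognizer doesn't hear itself.
async function speakWithoutSelfHearing(text: string): Promise<void> {
    (audioConfig as any).mute();
    try {
        await speakText(text);
    } finally {
        (audioConfig as any).unmute();
    }
}
```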

@Domratchev
Contributor Author

Sorry about the baseline. The only relevant commit here is a05aa44 (or 87cee46).

@rhurey
Member

rhurey commented May 1, 2019

That's a good reason. I've been pretty sad myself over the lack (or poor quality) of echo cancellation that the microphone often provides. For what little it's worth, a good hardware echo-canceling solution produces an amazing translation experience.

My concerns with simply muting the audio are around the effects that occur at the network / service layer.

The approach you've got here turns off the audio coming out of the microphone, but it doesn't stop the WebRTC layer from shoveling audio frames at the Speech SDK. So we wind up transmitting frames to the service that are full of silence. The frames also now arrive as fast as possible, causing throttling to kick in and cap the data rate at 2x real-time. We wind up sending twice as much data to the service muted as unmuted. (Assuming I cherry-picked the one commit into master successfully.)

If the client simply ceased transmitting audio when muted, it would leave a number of service-side resources tied up waiting for more audio to arrive, since the service can't easily differentiate between the audio being turned off and a glitchy network.

Start / Stop doesn't tear down the websocket connection, but sends a "done with audio" packet to the speech service indicating that there will be no further audio until a new recognition is started.

I did some measuring, and the initial recognition is taking ~800ms. I suspect there's some parallelism we can gain back in ServiceRecognizerBase::recognize by fetching the connection and the audio device in parallel. Maybe shave 300ms if we're lucky.
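A minimal sketch of that parallelism idea; the helper names are stand-ins, not the actual ServiceRecognizerBase internals:

```typescript
// Stand-ins for the SDK's internal steps; only the ordering matters for this sketch.
declare function fetchConnection(): Promise<unknown>;
declare function attachAudioSource(): Promise<unknown>;
declare function sendSpeechConfig(connection: unknown): Promise<void>;
declare function pumpAudio(audioNode: unknown, connection: unknown): void;

async function recognize(): Promise<void> {
    // Kick off both operations before awaiting either, so the WebSocket handshake
    // and the microphone acquisition overlap instead of running back to back.
    const connectionPromise = fetchConnection();
    const audioNodePromise = attachAudioSource();

    const [connection, audioNode] = await Promise.all([connectionPromise, audioNodePromise]);

    await sendSpeechConfig(connection);
    pumpAudio(audioNode, connection);
}
```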

The 2nd recognition is ~500ms from start to session running. Most of that time is spent getting the microphone again. The cost of sending the service configuration messages over an already-open WebSocket looks to be trivial.

We're going too far in tearing down the audio stack on stop when the AudioConfig object isn't closed.

To test this, I changed ServiceRecognizerBase::stopRecognizing() to call audioSource.detach(...) instead of turnOff(), and the start()-to-session-running time dropped to ~4ms. (Plus another bug fix in the audio stream around an infinite recursion. Oops.)
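Roughly, the experiment looks like this (assumed method shapes, not the real classes):

```typescript
// Assumed shapes; the real ServiceRecognizerBase and audio source classes differ.
interface IAudioNodeHandle { detach(): Promise<void>; }
interface IAudioSource { turnOff(): Promise<void>; }

// Before: stopping tore the whole input stack down and released the microphone,
// so the next start() had to reacquire it (~500ms).
async function stopRecognizingOld(source: IAudioSource): Promise<void> {
    await source.turnOff();
}

// Experiment: stopping only detaches this recognizer's node; the microphone stays
// warm, so the next start() reaches "session running" in a few milliseconds.
async function stopRecognizingNew(node: IAudioNodeHandle): Promise<void> {
    await node.detach();
}
```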

The downside is the microphone icon in the browser stays on after stop is called until close() is called on the recognizer. We'd need to test the file and stream implementations to see what the behavior side effects are there.

Sound reasonable?

@rhurey
Member

rhurey commented May 1, 2019

rhurey/perf.improvements

@Domratchev
Contributor Author

@rhurey Thanks, Ryan. I am in the process of testing your solution in our app (on top of the promise+prepare branches). Seems to work great so far.

@Domratchev
Contributor Author

One big difference noted is that the microphone now stays turned on after a call to stopContinuousRecognitionAsync(). I need to call recognizer.close() to disconnect. Not sure if I like it.

@rhurey
Member

rhurey commented May 2, 2019

Updated the branch to disable the audio track also.

@rhurey
Member

rhurey commented May 2, 2019

Disabling the track turns off the "red" microphone indicator (Firefox) for the tab, but doesn't seem to change the UI in Chrome.

It's no worse than the proposed mute(). You can close() the recognizer completely and the microphone will be dropped in the same manner Stop() did before. With the change to parallelize connection creation, the cost of closing the recognizer became nearly the same as just stopping the microphone (on Firefox).

The bigger downside to keeping the microphone attached is that the browser keeps delivering audio frames that the SDK then drops.

@Domratchev
Contributor Author

I've tried it with the new changes.
Strange results, actually: it appears that audio frames are cached after mute() is called and are sent for processing immediately after the unmute() call.
I do prefer mute/unmute being independent of stop/start, and I don't mind the browser's red icon during the mute. Ideally, of course, no blank frames would be sent to the server at all.

The use case is like this: we have speech synthesis and speech recognition services and a toggle button to turn them both on/off. Start/stop correspond well to the on/off actions (start is also triggered by an auto-restart upon failure). Close is called upon service destruction. Even when the toggle is on, we need to pause recognition while speech synthesis is active (which might happen every few seconds).

@rhurey
Member

rhurey commented May 3, 2019

The extra audio frames don't surprise me. My observations below are from debugging on Firefox.

The inputs have a buffering multiplexer between them and the recognizer. Its design role is to allow an input to be shared across multiple recognizers. As each recognizer attaches to or detaches from the multiplexer, it starts getting the audio the input sees from the moment of attach forward, and the multiplexer buffers audio for each attached recognizer since they read it at different times. (This buffer becomes important later in the story.)
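A simplified model of that buffering multiplexer, with assumed names (not the SDK's actual implementation):

```typescript
// One input, several attached readers, each with its own read position into a shared
// buffer of audio chunks. Trimming of fully-consumed chunks is omitted for brevity.
class AudioMultiplexer {
    private chunks: ArrayBuffer[] = [];
    private readPositions = new Map<string, number>();

    // A reader starts seeing audio from the moment it attaches, never earlier.
    public attach(readerId: string): void {
        this.readPositions.set(readerId, this.chunks.length);
    }

    public detach(readerId: string): void {
        this.readPositions.delete(readerId);
    }

    // Incoming audio is buffered so readers consuming at different speeds don't lose data.
    public write(chunk: ArrayBuffer): void {
        if (this.readPositions.size === 0) {
            return; // no listeners attached: the audio is simply dropped
        }
        this.chunks.push(chunk);
    }

    public read(readerId: string): ArrayBuffer | undefined {
        const pos = this.readPositions.get(readerId);
        if (pos === undefined || pos >= this.chunks.length) {
            return undefined; // nothing new for this reader yet
        }
        this.readPositions.set(readerId, pos + 1);
        return this.chunks[pos];
    }
}
```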

What happens when the audio track is disabled is that the incoming AudioProcessingEvent rate increases greatly, to much faster than real-time. At that point a second buffer begins to accumulate the audio so we can replay it in the event of a service disconnection. That buffer's output is throttled at 2x real-time, as that's what the service accepts.

So we now have audio coming in much faster than real-time and a throttling buffer letting it out at a reduced (relative) rate. When unmuted, the incoming rate will drop back to real-time while transmission continues at 2x real-time, and eventually the buffer drains. I haven't measured how much faster the incoming frame rate from the audio source is when the track is disabled, because I want to eliminate it entirely.
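For reference, a sketch of that 2x real-time throttle (illustrative only, not SDK code):

```typescript
// Frames may arrive much faster than real-time, but are only released to the service
// at up to twice the rate at which the audio they contain was captured.
class ReplayThrottle {
    private sentMs = 0;
    private readonly startedAt = Date.now();

    // Returns true when a frame covering `frameMs` of audio may be sent now;
    // otherwise the caller should leave the frame buffered and retry later.
    public canSend(frameMs: number): boolean {
        const budgetMs = (Date.now() - this.startedAt) * 2; // 2x real-time budget
        if (this.sentMs + frameMs > budgetMs) {
            return false;
        }
        this.sentMs += frameMs;
        return true;
    }
}
```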

So, on to what the change I sent does. Previously, when stop was called we didn't detach the recognizer from the audio input; we tore the entire input stack down, completely releasing the microphone. (Yes, this broke the multiple-recognizers-on-one-input scenario. Bug; I need to add test cases.) Now we detach from the input's multiplexer and, in the case of the microphone, disable the track if there are no other streams. This still results in the AudioProcessingEvent rate increasing and the handler being called, but with no attached listeners the data goes to the moral equivalent of /dev/null. (That's not great, and before I PR the change I want to look at how to detach from the event notifications completely, as long as most of the time penalty of acquiring the microphone isn't in the script attach.)

With the recognizer detached from incoming audio, stopping sends the "done with audio" signal to the service, and the WebSocket channel sits empty until it's used again, the service closes it, or the recognizer is closed.
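A sketch of the soft-release step using the standard MediaStreamTrack API; the helper name and the listener bookkeeping are assumptions:

```typescript
// Detach has already removed the recognizer from the multiplexer; if no other stream
// is still attached, disable the microphone track so the browser stops capturing.
function softReleaseMicrophone(mediaStream: MediaStream, remainingListeners: number): void {
    if (remainingListeners === 0) {
        for (const track of mediaStream.getAudioTracks()) {
            track.enabled = false; // silences capture without releasing the device
        }
    }
}

// A hard release would call track.stop() instead, fully releasing the microphone
// and turning off the browser's recording indicator.
```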

As for mute/unmute vs stop/start, I understand the case you're trying to solve, and start/stop is not conceptually perfect for it. But there are also some conceptual sharp edges around using mute/unmute.

First, mute/unmute is easily explained for a microphone, but not for any of the other input types, all of which AudioConfig represents. I'm not sure how we'd explain what mute/unmute means on a stream or a file.

Second, as you've noticed, stopping the audio flow from the microphone is only the beginning of the solution. Suppose mute were also to stop sending data to the service: it would then need to send the "audio done" packet, or some strange behavior would result. Imagine a user is talking and you want to play audio, so you mute the microphone and stop sending audio. If the user was silent when this happened and there was no recognition in flight, you'd get a good experience. If there was a recognition mid-phrase, the behavior would be odd: the service would pause recognition waiting for more audio, then resume recognizing when the audio restarted, not expecting a gap in it. Because of the (small) latency between the microphone receiving audio and the recognition being returned, you can wind up with a recognition spanning the pre-mute and post-unmute audio coming back after the unmute. You don't see this today because silence is still being transmitted while muted.

At this point, to get the functionality correct, mute essentially does what Stop should do.
I can argue both sides of whether Stop or Close should release the microphone object. The argument for Stop doing it is that it informs the end user the microphone is no longer in use. The semantic argument for Close() doing it is that that's where other external resources (like the connection) are released.
The performance win from keeping the microphone and dropping the audio from it is decently compelling, especially if the microphone cost overshadows the WebSocket connect cost.

@Domratchev
Contributor Author

Domratchev commented May 6, 2019

@rhurey Ryan, thank you for the detailed explanation. It's much appreciated.
Yet here is my struggle: I need stop() to turn the mike off, so that the icon disappears. I need start() to be capable of full recovery after stop() (it won't work after close()). And I need to be able to mute/unmute quickly and efficiently.
Unfortunately, the changes you are proposing do not meet my needs. The mike icon stays on after stop() (in Chrome), and buffered speech gets processed after a stop/start cycle. You can see the changes in my "stop-mute" branch.
To proceed with my business feature I am forced to go back to my "mute" branch. Do you think it is possible to improve my code so that it (a) stops sending blank packets of data to the server while muted and (b) does not buffer anything for future processing?
Thanks again!

@rhurey
Member

rhurey commented May 29, 2019

Sorry about the delay in getting back to you.

To answer your question directly: yes, it is possible to change the code so the packets stop when muted. The branch rhurey/perf.improvements has all the changes needed to make Stop() detach from, rather than turn off, the audio source. Mute/UnMute would need to use the same Detach/Attach semantics. To remove the risk of having some in-progress audio be recognized after the unmute, you'd also need to end the recognition with the service, though I could argue that's not needed depending on the use case.
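A hypothetical mute()/unmute() built on those Detach/Attach semantics plus the optional turn end:

```typescript
// All names here are assumptions sketching the idea, not the SDK's API.
interface IAudioNode { detach(): Promise<void>; }
interface IAudioSource { attach(recognizerId: string): Promise<IAudioNode>; }
interface IServiceConnection { sendAudioDone(): Promise<void>; }

class MutableRecognizer {
    private node: IAudioNode | undefined;

    constructor(private source: IAudioSource,
                private connection: IServiceConnection,
                private id: string) {}

    public async mute(endTurn: boolean = true): Promise<void> {
        if (this.node !== undefined) {
            await this.node.detach();         // stop feeding frames to the service
            this.node = undefined;
        }
        if (endTurn) {
            // Ending the turn avoids a pre-mute phrase being recognized after unmute.
            await this.connection.sendAudioDone();
        }
    }

    public async unmute(): Promise<void> {
        if (this.node === undefined) {
            this.node = await this.source.attach(this.id); // audio resumes from now on
        }
    }
}
```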

As for what direction to go with the SDK in general. After debate, I'm going to do the following:

  1. Add a property to the AudioConfig object to allow the microphone to be soft- or hard-released (a hypothetical sketch follows this list).
    Hard release will be the existing behavior.
    Soft release will implement the Stop() behavior (minus the console telemetry) that's in rhurey/perf.improvements.
    This will be settable before recognizer construction, but unchangeable once the recognizer has been constructed, in keeping with the current goal of having the AudioConfig be an item whose properties are copied at construction.
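For illustration only, a hypothetical way such a property might be set before constructing the recognizer; the property name and setter are assumptions, not a committed API:

```typescript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();

// Hypothetical property name and setter: "soft" keeps the microphone open after Stop(),
// "hard" keeps today's full teardown. Must be set before the recognizer is constructed,
// because AudioConfig properties are copied into the recognizer at construction time.
(audioConfig as any).setProperty("MicrophoneReleaseMode", "soft");

const recognizer = new sdk.SpeechRecognizer(
    sdk.SpeechConfig.fromSubscription("<key>", "<region>"),
    audioConfig);
```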

I understand this would give the stop (mute) / start (unmute) semantics for quickly starting and stopping recognition to block speech capture during audio output, but at the cost of having to close() the recognizer to completely release the microphone.

From my testing, the cost of re-establishing the WebSocket connection to create a new recognizer and start recognition again is overshadowed by the microphone capture costs. Can you help me understand what other costs I'm missing that make constructing a new recognizer suboptimal in this scenario?

@Domratchev
Contributor Author

@rhurey Ryan, please have a look at the latest commit. Hopefully the use of a GainNode to mute/unmute will help address your concerns.
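A minimal sketch of the GainNode approach (illustrative; the wiring into MicAudioSource is assumed, not the PR's exact code):

```typescript
// Capture the microphone, route it through a GainNode, and toggle the gain between
// 1 and 0 to unmute/mute. Note that downstream nodes still receive (silent) frames.
const audioContext = new AudioContext();

async function createMicGraph(): Promise<GainNode> {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const source = audioContext.createMediaStreamSource(stream);
    const gain = audioContext.createGain();
    source.connect(gain);
    // ...connect `gain` to whatever node feeds the recognizer...
    return gain;
}

function setMuted(gain: GainNode, muted: boolean): void {
    gain.gain.setValueAtTime(muted ? 0 : 1, audioContext.currentTime);
}
```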

@Domratchev
Contributor Author

@rhurey Ryan, any updates on this PR?
Also, where/whom do I ask a question about subscription technical limitations in Prod, e.g. what does 20 max concurrent recognitions mean?

@rhurey
Member

rhurey commented Jul 18, 2019

The core issue for us is that we want to make Start/Stop work well in all scenarios and not add a mute / unmute.

The current plan we have is to take rhurey/perf.improvements and fix it up for production (there's a bunch of tracing, etc in it) and add a property to control how the microphone is released on Stop()... But the timeline is fuzzy due to other work items and tasks... (I'd gladly accept help in that endeavor since I wish it was done already)

As for the subscription limit you can ask me.
The short version is you can have ~20 concurrent recognitions in flight at a time by default. We can adjust this number on a per-instance level with some communication through our support channels.

If you've got something specific and less public, send me an email @microsoft.com

@rhurey
Member

rhurey commented Feb 13, 2020

I just went to (finally, I know) implement a soft release, and I'm getting ~30ms times grabbing the microphone in Chrome and Firefox, where it was 10x that when we first started this discussion.
