[Rust] Worker panics when running with multiple workers #1320
@jmillan is it possible that some non-thread-local statics were introduced recently?
Seems related to flatbuffers. I'll be checking this in a few hours. @PaulOlteanu, I'd suggest using a single worker if possible while we troubleshoot this.
For now I don't see any static, non-thread-local variables. Definitely Rust cannot properly read the incoming buffer because it is malformed. In the following case the payload is a log message (LogRef), but Rust is treating it as a notification (NotificationRef). EDIT: The following message contains a Notification (DataconsumerBufferedAmountLow) and also the log:
@PaulOlteanu, can you confirm this only happens when using multiple workers?
Yeah, I've been using mediasoup 0.15 since January 8 and never saw this issue in a single-worker setup. As soon as I tried multiple workers I saw many of these kinds of errors within 24 hours. I did experiment with multiple workers for a week a few months ago and never saw this issue either (I think I was using mediasoup 0.12 at the time, which was before the flatbuffers change).
@PaulOlteanu, I assume you're back to using a single worker. Please let us know if you get the crash in this scenario. So far we've been unable to reproduce the crash with multiple workers enabled.
I work with @PaulOlteanu. The majority of our traffic comes from iOS and Android. Could that be a factor in reproducing this crash? What other info do you need to help debug it?
Ideally we'd have a reliable reproduction. If you can take one of the example apps (videoroom is good), modify it (if necessary) to reproduce the issue, and share the code with us, that'd be ideal.
One thing that comes to mind is that we're running on aarch64, which I believe has different memory model guarantees than x86. So if the bug is an incorrect atomic usage somewhere, it might not reproduce on x86 because of that (though I have no idea why this issue would be related to atomics). I've also only seen this issue happen in production, where we have 1000-2000 users per server at the moment, so that would also make it difficult to reproduce.
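(For illustration only: a hypothetical, self-contained Rust snippet, unrelated to mediasoup's actual code, showing the kind of relaxed-ordering bug that can stay hidden under x86's stronger memory model but surface on aarch64.)

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

// Shared state: a "payload" and a "ready" flag, both accessed with Relaxed ordering.
static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Should be Ordering::Release so DATA is published before READY.
        READY.store(true, Ordering::Relaxed);
    });

    let reader = thread::spawn(|| {
        // Should be Ordering::Acquire to pair with a Release store above.
        while !READY.load(Ordering::Relaxed) {}
        // With Relaxed on both sides, this load is allowed to observe 0.
        // On aarch64 the hardware can actually reorder this; x86 (TSO) usually hides it.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });

    writer.join().unwrap();
    reader.join().unwrap();
}
```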
We tried running with multiple workers in prod again, with mediasoup logs set to debug level in the Rust filter and this in the worker settings:

```rust
let mut worker_settings = WorkerSettings::default();
worker_settings.log_level = WorkerLogLevel::Debug;
worker_settings.log_tags = vec![
    WorkerLogTag::Info,
    WorkerLogTag::Ice,
    WorkerLogTag::Dtls,
    WorkerLogTag::Rtp,
    WorkerLogTag::Srtp,
    WorkerLogTag::Rtcp,
    WorkerLogTag::Rtx,
    WorkerLogTag::Bwe,
    WorkerLogTag::Score,
    WorkerLogTag::Simulcast,
    WorkerLogTag::Svc,
    WorkerLogTag::Sctp,
];
```

We didn't see the same issue previously mentioned, but instead saw a segfault happen twice. Will experiment more tomorrow and see if I can get a core dump. In one of the logs for the segfault, this is the last thing:
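For reference, a minimal sketch of spawning multiple workers with settings like the above, assuming the mediasoup Rust crate's `WorkerManager` API and a Tokio runtime (exact names and signatures may differ slightly between versions):

```rust
use mediasoup::prelude::*;

#[tokio::main]
async fn main() {
    let worker_manager = WorkerManager::new();
    let mut workers = Vec::new();

    // One worker per core (two in the setup described in this issue).
    for _ in 0..2 {
        let mut worker_settings = WorkerSettings::default();
        worker_settings.log_level = WorkerLogLevel::Debug;
        worker_settings.log_tags = vec![WorkerLogTag::Info, WorkerLogTag::Sctp];

        let worker = worker_manager
            .create_worker(worker_settings)
            .await
            .expect("failed to create worker");
        workers.push(worker);
    }

    // ... create routers, transports, producers/consumers on each worker as usual ...
}
```

In the Rust flavor each worker runs as a thread inside the same process, which is the configuration this issue is about.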
@PaulOlteanu https://mediasoup.discourse.group/t/rarely-crashes-on-sendsctpdata-when-enablesctp-is-enabled/5340/5 Since I stopped using DataProducer, the priority was lowered and no further investigation was done. I'm sorry.
Yes, we are using data producers & consumers but are phasing them out (less than 2% usage at this point). If the issue is with SCTP, I would be surprised that we didn't run into it the first time we tried multiple workers (pre-0.15). Since then we've stopped pinning mediasoup worker threads to specific CPU cores, so unless the pinning was somehow masking that issue (I don't really see how that's possible) I'm not sure it's the same thing.

Edit: I may have misunderstood. If you're referring to the segfault, then I guess that's possible; I haven't really looked into that issue (or the original issue), but I think I'll have time to focus on this again now.
When I investigated, it appeared that SCTP was running on the thread of the first Worker that executed, and the other Workers were sharing it. So I think it is possible that a thread data race could corrupt the data in the flatbuffers to be sent. In the case of JSON (pre-0.15), the transmission method may have been less likely to cause conflicts.
Is this something that was not properly done when the Rust layer was created? In theory DepUsrSCTP.cpp should handle this. UPDATE: Looking at the code, it looks ok.
@nazar-pc perhaps you see something wrong in https://github.com/versatica/mediasoup/blob/v3/worker/src/DepUsrSCTP.cpp when running multiple mediasoup Workers (threads, in Rust)? The problem (not sure) is that we start the Checker singleton when the first SctpAssociation is registered and stop it when the last one is deregistered:

```cpp
void DepUsrSCTP::DeregisterSctpAssociation(RTC::SctpAssociation* sctpAssociation)
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);

	MS_ASSERT(DepUsrSCTP::checker != nullptr, "Checker not created");

	auto found = DepUsrSCTP::mapIdSctpAssociation.erase(sctpAssociation->id);

	MS_ASSERT(found > 0, "SctpAssociation not found");
	MS_ASSERT(DepUsrSCTP::numSctpAssociations > 0u, "numSctpAssociations was not higher than 0");

	if (--DepUsrSCTP::numSctpAssociations == 0u)
	{
		DepUsrSCTP::checker->Stop();
	}
}
```

Then when a Worker is closed, `DepUsrSCTP::ClassDestroy()` runs and only finishes usrsctp once the last instance goes away:

```cpp
void DepUsrSCTP::ClassDestroy()
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);

	--GlobalInstances;

	if (GlobalInstances == 0)
	{
		usrsctp_finish();

		numSctpAssociations   = 0u;
		nextSctpAssociationId = 0u;

		DepUsrSCTP::mapIdSctpAssociation.clear();
	}
}
```

@satoren do you see something here that may produce the crash when running multiple Workers?
mediasoup/worker/src/DepUsrSCTP.cpp, line 76 at af7fb8c:
Although I cannot be sure because it has been some time since our investigation, it appeared that this callback (onSendSctpData) was being executed on the thread of the Worker where it was first created.
UPDATE: However,
It makes sense. IMHO the problem is not even related to usrsctp being multithreaded, since we use the single-thread API, so that shouldn't be the problem. The problem is that usrsctp uses global state (such as the send callback passed to the usrsctp_start_no_threads() function; it may have another name, sorry). We set that callback within a Worker. When that Worker is closed (in this case a thread of a multi-worker app in Rust) we don't explicitly finish usrsctp, since there are more Workers running (see the counter in DepUsrSCTP.cpp). However the thread is closed anyway and usrsctp probably keeps calling the callback. In this scenario I have no idea what is going on, but it doesn't look good.

We use the single-thread version of usrsctp in mediasoup because originally (in Node apps) there is indeed a single thread, and if the app creates N Workers those run in separate processes, each with its own, separate usrsctp stack. However, in Rust apps we run N threads when we launch N Workers.

Given that, I wonder if we should instead use the multi-thread API of usrsctp when in Rust, by creating a specific thread for usrsctp (that is never closed) and passing the send callback within it. Or we keep using the single-thread API of usrsctp but create a separate thread (that we never close while there are running Workers) and pass the send callback to usrsctp within it.

@nazar-pc does it make sense?
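A rough sketch of that second option in Rust terms (purely illustrative, not mediasoup code): a single long-lived thread owns the single-threaded library state, and every Worker funnels its sends through a channel, so the callback never outlives the thread it was registered on.

```rust
use std::sync::mpsc;
use std::thread;

// Commands that Worker threads may send to the dedicated "SCTP" thread.
enum SctpCommand {
    Send(Vec<u8>),
    Shutdown,
}

fn main() {
    let (tx, rx) = mpsc::channel::<SctpCommand>();

    // One long-lived thread owns the single-threaded library state; it is the
    // only thread that ever touches it.
    let sctp_thread = thread::spawn(move || {
        // Initialization of the single-threaded library would go here.
        for command in rx {
            match command {
                SctpCommand::Send(data) => {
                    // Hand `data` to the single-threaded library on its own thread.
                    println!("sending {} bytes", data.len());
                }
                SctpCommand::Shutdown => break,
            }
        }
        // Cleanup/finish of the library would go here, after all senders are done.
    });

    // Any number of Worker threads can clone the sender and submit work.
    let worker_tx = tx.clone();
    let worker = thread::spawn(move || {
        worker_tx.send(SctpCommand::Send(vec![0u8; 128])).unwrap();
    });

    worker.join().unwrap();
    tx.send(SctpCommand::Shutdown).unwrap();
    sctp_thread.join().unwrap();
}
```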
If there is a multi-thread API in usrsctp, we probably should use it.
Bug Report
mediasoup version: 0.15
I'm trying to run mediasoup with 2 (eventually more) workers per server (on a 2-core server, so one worker per core). After switching to 2 workers, I've been getting fairly frequent worker panics:
and
The last one is weird because it seems unrelated to the others.
I've never seen any of these panics before switching to multiple workers, so I'm assuming it's related to that. I also only see this in prod, so it might be difficult to reproduce.