-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mediasoup worker crash (libwebrtc uint64_t/int64_t conversion related) #357
Comments
So I understand you have both Chrome and Firefox sending and receiving video with simulcast, right? |
Yes, this is the case. We also support Safari >12.1 |
We are too busy and cannot fix the issue yet. As a terrible workaround you can modify this file:
Before line 146 add: if (current_timestamp_group_.complete_time_ms < 0)
return false; |
So zero feedback about this. Closing the issue since we have never seen it. Reopen if you get more information. |
I also observed this issue. Your environmentOperating system: Ubuntu 18.04 LTS Issue descriptionIn my producion server, we observed under logs. worker died and client can't access this worker.
It occures very rarely but it happens more often recently. |
I observed this crash last week for the first time in a few months, and yesterday it occurred again. |
I also get random crashes. It happend in every version since 3.5.3. So right now I'm running 3.5.2 which seems stable. |
Which crash? Please, specify. Paste the crash log/message and if possible a coredump. If you don't specify it we cannot do anything. |
What version is it? mediasoup or mediasoup-client? |
Mediasoup (server) version 3.5.6 09:05:49 Error: worker got "died" event Error: [pid:27100, code:null, signal:SIGABRT] |
@klausdk unfortunately your trace does not show the worker crash log, so we don't know why it happens. All the lines are just JS error trace (due to C++ worker being closed abruptly), they add no value to the report. @geongun20, mediasoup-client is a client side. It can never be the responsible of a crash in C++ server side. I'll check the differences between 3.5.2 and 3.5.3. Also, guys, please enable coredumps in your servers as explained here. This is critical for us to understand and fix the issue. |
@ibc --- worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/remote_bitrate_estimator_abs_send_time.cc (revision 660a5be937bb558bd1c72d45adbffbb0e0714be3)
+++ worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/remote_bitrate_estimator_abs_send_time.cc (date 1579272095114)
@@ -309,6 +309,9 @@
if (ProcessClusters(now_ms) == ProbeResult::kBitrateUpdated)
update_estimate = true;
}
+ if (arrival_time_ms < 0 || now_ms < 0) {
+ MS_ERROR("inter_arrival_->ComputeDeltas: %u %jd %jd", timestamp, arrival_time_ms, now_ms);
+ }
if (inter_arrival_->ComputeDeltas(timestamp, arrival_time_ms, now_ms,
payload_size, &ts_delta, &t_delta,
&size_delta)) { --- worker/deps/libwebrtc/libwebrtc/modules/congestion_controller/goog_cc/delay_based_bwe.cc (revision 660a5be937bb558bd1c72d45adbffbb0e0714be3)
+++ worker/deps/libwebrtc/libwebrtc/modules/congestion_controller/goog_cc/delay_based_bwe.cc (date 1579272101281)
@@ -145,6 +145,9 @@
uint32_t ts_delta = 0;
int64_t t_delta = 0;
int size_delta = 0;
+ if (packet_feedback.receive_time.ms() < 0 || at_time.ms() < 0) {
+ MS_ERROR("inter_arrival_->ComputeDeltas: %u %jd %jd", timestamp, packet_feedback.receive_time.ms(), at_time.ms());
+ }
bool calculated_deltas = inter_arrival_->ComputeDeltas(
timestamp, packet_feedback.receive_time.ms(), at_time.ms(),
packet_feedback.sent_packet.size.bytes(), &ts_delta, &t_delta, And on several occasions logs with negative |
Super useful, @Ivaka. Thanks, will check. BTW: have you been able to also add some |
We have the following patches in place to avoid the crash diff --git a/worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/inter_arrival.cc b/worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/inter_arrival.cc
index e27fdc5b..69cc7f29 100644
--- a/worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/inter_arrival.cc
+++ b/worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/inter_arrival.cc
@@ -50,7 +50,7 @@ bool InterArrival::ComputeDeltas(uint32_t timestamp,
current_timestamp_group_.timestamp = timestamp;
current_timestamp_group_.first_timestamp = timestamp;
current_timestamp_group_.first_arrival_ms = arrival_time_ms;
- } else if (!PacketInOrder(timestamp)) {
+ } else if (!PacketInOrder(timestamp, arrival_time_ms)) {
return false;
} else if (NewTimestampGroup(arrival_time_ms, timestamp)) {
// First packet of a later frame, the previous frame sample is ready.
@@ -110,16 +110,31 @@ bool InterArrival::ComputeDeltas(uint32_t timestamp,
return calculated_deltas;
}
-bool InterArrival::PacketInOrder(uint32_t timestamp) {
+bool InterArrival::PacketInOrder(uint32_t timestamp, int64_t arrival_time_ms) {
if (current_timestamp_group_.IsFirstPacket()) {
return true;
+ } else if (arrival_time_ms < 0) {
+ // Sometimes we do get negative arrival time, which causes BelongsToBurst()
+ // to fail, which may cause anything that uses InterArrival to crash.
+ // This is an attempt to prevent:
+ // https://github.com/versatica/mediasoup/issues/357
+ // The suggested fix there has other undesired side effects, so we try a
+ // different approach here.
+ return false;
} else {
+ const uint32_t int_middle = 0x80000000;
+
// Assume that a diff which is bigger than half the timestamp interval
// (32 bits) must be due to reordering. This code is almost identical to
// that in IsNewerTimestamp() in module_common_types.h.
uint32_t timestamp_diff =
timestamp - current_timestamp_group_.first_timestamp;
- return timestamp_diff < 0x80000000;
+
+ if (timestamp_diff == int_middle) {
+ return timestamp > current_timestamp_group_.first_timestamp;
+ } else {
+ return timestamp_diff < int_middle;
+ }
}
}
diff --git a/worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/inter_arrival.h b/worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/inter_arrival.h
index 8ce1c145..dafb6582 100644
--- a/worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/inter_arrival.h
+++ b/worker/deps/libwebrtc/libwebrtc/modules/remote_bitrate_estimator/inter_arrival.h
@@ -71,7 +71,7 @@ class InterArrival {
};
// Returns true if the packet with timestamp |timestamp| arrived in order.
- bool PacketInOrder(uint32_t timestamp);
+ bool PacketInOrder(uint32_t timestamp, int64_t arrival_time_ms);
// Returns true if the last packet was the end of the current batch and the
// packet with |timestamp| is the first of a new batch. Credit for this goes to @sspanak. If you find this useful we can prepare a PR |
Sure, I can do that. However, as I have never been part of |
Thanks. Even if that patch avoids the crash, we need to know its rationale and implications. For instance, what is the rationale behind these changes? InterArrival::PacketInOrder(uint32_t timestamp, int64_t arrival_time_ms)
const uint32_t int_middle = 0x80000000; |
BTW working in branch issue-357 to handle this bug. |
I observed this crash again today. But I can't get any files in the /tmp/cores folder. |
Here some diffs: libwebrtcAt the left the mediasoup version, at the right libwebrtc master branch version:
|
It's just impossible that 3.5.3 has a bug (related to this issue) and 3.5.3 does not: Are you using plain transports? or just WebRTC transports? |
Well, in short, the patch does what @ibc advised. It does "return whatever". As a matter of fact, I have even tried to guess what should be the correct return value. And In more details, it is a bit longer. So, the only thing I new when I started was, that So after thinking for a while, I got to the conclusion that either something goes terribly wrong and there are indeed negative timestamps or their order is incorrect. Since our users have plethora of devices, browsers and operating systems it could have been anything - faulty drivers, bugs in any piece of software, etc. Or it could have been The idea is not entirely mine. As you can see I have commented in the code, that the same approach with the middle of the integer values, is also used in
The rest of the changes are just to pass along the variables for this Of course, you are not entitled to my opinion. As I said, I am not well familiar with |
Thanks @sspanak, very useful info. Will work on this these days and notify here. |
As for the version, if I remember correcly, we saw the same with all 3.4.x version that we tried, right? It is definitely not a regression in 3.5.3. @Ivaka , correct me if I am wrong. |
https://mediasoup.org/support/#crashes-in-mediasoup-get-a-core-dump |
Yeah! that's exactly what we need. |
This to look:
for (const auto& packet_feedback : packet_feedback_vector) {
delayed_feedback = false;
IncomingPacketFeedback(packet_feedback, msg.feedback_time);
|
This to look:
transport_feedback_adapter_.ProcessTransportFeedback(
feedback, Timestamp::ms(DepLibUV::GetTimeMsInt64()))
bool SendTimeHistory::GetFeedback(PacketFeedback* packet_feedback,
bool remove) {
// RTC_DCHECK(packet_feedback);
int64_t unwrapped_seq_num =
seq_num_unwrapper_.Unwrap(packet_feedback->sequence_number);
UpdateAckedSeqNum(unwrapped_seq_num);
// RTC_DCHECK_GE(*last_ack_seq_num_, 0);
auto it = history_.find(unwrapped_seq_num);
if (it == history_.end())
return false;
// Save arrival_time not to overwrite it.
int64_t arrival_time_ms = packet_feedback->arrival_time_ms;
*packet_feedback = it->second;
packet_feedback->arrival_time_ms = arrival_time_ms;
if (remove)
history_.erase(it);
return true;
}
for (const auto& packet_feedback : packet_feedback_vector) {
delayed_feedback = false;
IncomingPacketFeedback(packet_feedback, msg.feedback_time);
|
This to look:
transport_feedback_adapter_.ProcessTransportFeedback(
feedback, Timestamp::ms(DepLibUV::GetTimeMsInt64()))
PacketFeedback packet_feedback(timestamp_ms, packet.sequence_number());
if (!send_time_history_.GetFeedback(&packet_feedback, true))
offset_us += packet.delta_us();
timestamp_ms = current_offset_ms_ + (offset_us / 1000);
if (last_timestamp_us_ == kNoTimestamp) {
current_offset_ms_ = feedback_time.ms();
} else {
current_offset_ms_ +=
mediasoup_helpers::FeedbackRtpTransport::GetBaseDeltaUs(&feedback, last_timestamp_us_) / 1000;
}
void RtpTransportControllerSend::OnTransportFeedback(
const RTC::RTCP::FeedbackRtpTransportPacket& feedback) {
absl::optional<TransportPacketsFeedback> feedback_msg =
transport_feedback_adapter_.ProcessTransportFeedback(
feedback, Timestamp::ms(DepLibUV::GetTimeMsInt64()));
bool SendTimeHistory::GetFeedback(PacketFeedback* packet_feedback,
bool remove) {
// RTC_DCHECK(packet_feedback);
int64_t unwrapped_seq_num =
seq_num_unwrapper_.Unwrap(packet_feedback->sequence_number);
UpdateAckedSeqNum(unwrapped_seq_num);
// RTC_DCHECK_GE(*last_ack_seq_num_, 0);
auto it = history_.find(unwrapped_seq_num);
if (it == history_.end())
return false;
// Save arrival_time not to overwrite it.
int64_t arrival_time_ms = packet_feedback->arrival_time_ms;
*packet_feedback = it->second;
packet_feedback->arrival_time_ms = arrival_time_ms;
if (remove)
history_.erase(it);
return true;
}
for (const auto& packet_feedback : packet_feedback_vector) {
delayed_feedback = false;
IncomingPacketFeedback(packet_feedback, msg.feedback_time);
|
Here are 4 [ISSUE 357] logs. crash 1
crash 2
crash 3
crash 4
|
Thanks, working on this. |
…d issue: #357 - PortManager.cpp: Do not use `UV_UDP_RECVMMSG` in Windows due to a bug in libuv 1.37.0
Yes, it looks good. Sorry I couldn't check it out earlier. |
Let's close this issue. Please people, comment if it happens again. And thanks to all. |
@ibc : The fix here only tosses out the cases with negative arrival_time_ms when current_timestamp_group_.IsFirstPacket() is false. Simply adding
at the top of InterArrival::ComputeDeltas seems like a better fix. |
Hello. I'm using recent version of
Does providing core-dumps can help investigating issue somehow? |
@ibc the issue is still reproducable.
|
This issue happened again with Mediasoup 3.11.3 Stack trace:
|
I took a look and couldn't find the root cause but I submitted this tentative fix for it: https://github.com/versatica/mediasoup/pull/985/files Feel free to close if it doesn't help. Also fyc maybe we could consider building in Release mode with MS_ASSERTS just logging the error. I think that's what happens in the original libwebrtc code with RTC_DCHECK. |
What? |
Yep, |
Guys, two things here:
|
Thx |
We use
What do you mean with "I agree it is not in the modern libwebrtc version"? Just to clarify, I mean that such a check doesn't even exist in modern libwebrtc code. Wondering how it's possible. But anyway, I'm not against changes in PR #985. |
Fixed in #985 |
Bug Report
Mediasoup worker dies unexpectedly
Your environment
Issue description
Occasionally we observe a worker dying on our production servers. This is what we see in the logs:
The browser version we support are:
Currently we do not have core dumps enabled on our production servers. We will enabled them, but this doesn't happen very often so it might take some time before I can provide more info.
The text was updated successfully, but these errors were encountered: