Worker Crash SIGABRT #1176
Comments
I've also seen it with this error.
@GEverding, there is basically no description of the issue. Please elaborate on the scenario in which the problem happens, whether it always reproduces, etc.
My apologies, I was dealing with another issue. I'm seeing this in normal use of mediasoup in production. It's a fairly rare issue. My guess is it happens when a client disconnects halfway through setup, because I see this more on startup of a new instance. I know you really need core dumps for this and I'm working on that.
The error refers to the UnixSocket between Node and the C++ worker. It's not related to client connections.
Do you want a core dump to debug this, or do you need something else? I did see this on v20 but I'm seeing it more on v21-nightly (there's a Node SSL crash in v20–v18).
Core dump, please. If you figure out reliable steps to reproduce, that would be great, but I think that will be hard.
Still working on the core dump, but I saw this in the log and it might be the issue:
2023-10-16T19:42:22.374Z mediasoup:ERROR:Worker (stderr) (ABORT) RTC::IceServer::HandleTuple() | failed assertion `this->tuples.empty()': state is 'disconnected' but there are 1 tuples
Interesting. I will check tomorrow. If you can get the core dump it would be better, but maybe I can figure out the issue without it. Thanks.
Is this enough of a core dump? I have not been able to repro it in testing.
Yes, it's enough. Thanks. Will finally work on it tomorrow. BTW, do you have some special scenario, such as clients connecting to mediasoup through TCP instead of UDP? Are you using a separate WebRtcServer to make all WebRtcTransports listen on the same port(s), or not?
Yes, we have some clients using TCP because of network restrictions. We run UDP and TCP on different ports. We have 1 worker and it gets a WebRtcServer. Reading over my code, there could be a race condition where 2 WebRtcServers get created at startup. In that situation only 1 would be used for creating all the transports.
Do you literally mean that you are creating WebRtcServers in different Workers and assigning them the same listening ports? This is not possible.
We have 1 worker with 1 WebRtcServer with 4 ports: 40001+40002 for internal UDP+TCP, 40003+40004 for publicly announced IPs. Nothing is shared. I found this out yesterday: we were running a lot (95%) of traffic through TCP because of a config error. Since fixing the config error we see fewer errors.
So the thing is that…
So the bug is here, in `void IceServer::HandleTuple()`:

```cpp
void IceServer::HandleTuple(
  RTC::TransportTuple* tuple, bool hasUseCandidate, bool hasNomination, uint32_t nomination)
{
  MS_TRACE();

  switch (this->state)
  {
    case IceState::NEW:
    {
      // There should be no tuples.
      MS_ASSERT(
        this->tuples.empty(), "state is 'new' but there are %zu tuples", this->tuples.size());

      // There shouldn't be a selected tuple.
      MS_ASSERT(!this->selectedTuple, "state is 'new' but there is selected tuple");

      if (!hasUseCandidate && !hasNomination)
      {
        MS_DEBUG_TAG(
          ice,
          "transition from state 'new' to 'connected' [hasUseCandidate:%s, hasNomination:%s, nomination:%" PRIu32 "]",
          hasUseCandidate ? "true" : "false",
          hasNomination ? "true" : "false",
          nomination);

        // Store the tuple.
        auto* storedTuple = AddTuple(tuple);

        // Mark it as selected tuple.
        SetSelectedTuple(storedTuple);

        // Update state.
        this->state = IceState::CONNECTED;

        // Notify the listener.
        this->listener->OnIceServerConnected(this);
      }
      else
      {
        // Store the tuple.
        auto* storedTuple = AddTuple(tuple);

        if ((hasNomination && nomination > this->remoteNomination) || !hasNomination)
        {
          MS_DEBUG_TAG(
            ice,
            "transition from state 'new' to 'completed' [hasUseCandidate:%s, hasNomination:%s, nomination:%" PRIu32 "]",
            hasUseCandidate ? "true" : "false",
            hasNomination ? "true" : "false",
            nomination);

          // Mark it as selected tuple.
          SetSelectedTuple(storedTuple);

          // Update state.
          this->state = IceState::COMPLETED;

          // Update nomination.
          if (hasNomination && nomination > this->remoteNomination)
            this->remoteNomination = nomination;

          // Notify the listener.
          this->listener->OnIceServerCompleted(this);
        }
      }

      break;
    }
    // ...
```

Notice that there are cases in which we store the tuple but we do NOT change the state (nor set a selected tuple).
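To make that concrete, below is a minimal standalone model of just the 'new'-state branch above; the names (`IceServerModel`, `HandleTupleNew`) are hypothetical, not the real mediasoup classes. Fed a contrived binding request whose nomination does not beat `remoteNomination`, it ends up with a stored tuple, no selected tuple and the state still 'new', which is exactly the situation the `tuples.empty()` assertions in other branches of `HandleTuple()` do not expect (compare the logged "state is 'disconnected' but there are 1 tuples" abort).

```cpp
// Standalone sketch: a simplified model of the 'new'-state branch shown above.
// All names here are hypothetical stand-ins for illustration only.
#include <cstdint>
#include <cstdio>
#include <vector>

enum class IceState { NEW, CONNECTED, COMPLETED, DISCONNECTED };

struct IceServerModel
{
  IceState state{ IceState::NEW };
  std::vector<int> tuples;          // stand-in for the stored RTC::TransportTuple list
  bool hasSelectedTuple{ false };
  uint32_t remoteNomination{ 0u };

  // Mirrors only the IceState::NEW case of HandleTuple().
  void HandleTupleNew(bool hasUseCandidate, bool hasNomination, uint32_t nomination)
  {
    if (!hasUseCandidate && !hasNomination)
    {
      tuples.push_back(1);
      hasSelectedTuple = true;
      state = IceState::CONNECTED;
    }
    else
    {
      // The tuple is stored unconditionally...
      tuples.push_back(1);

      // ...but state and selected tuple only change if the nomination "wins".
      if ((hasNomination && nomination > remoteNomination) || !hasNomination)
      {
        hasSelectedTuple = true;
        state = IceState::COMPLETED;

        if (hasNomination && nomination > remoteNomination)
          remoteNomination = nomination;
      }
    }
  }
};

int main()
{
  IceServerModel ice;

  // Contrived input: nomination 0 does not beat remoteNomination 0, so the
  // inner condition is false and no state transition happens.
  ice.HandleTupleNew(/*hasUseCandidate*/ true, /*hasNomination*/ true, /*nomination*/ 0u);

  // Invariant broken: one tuple stored, nothing selected, state still 'new'.
  std::printf("tuples: %zu, selected: %s, still 'new': %s\n",
              ice.tuples.size(),
              ice.hasSelectedTuple ? "yes" : "no",
              ice.state == IceState::NEW ? "yes" : "no");

  // Any later code path that asserts `tuples.empty()` for states other than
  // 'connected'/'completed' would now abort, matching the logged crash.
  return 0;
}
```

The model deliberately ignores everything else (the other switch cases, tuple removal, timers); it only illustrates why "a tuple exists" no longer implies "connected or completed" once nominations are in play.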
So basically after this PR #756 we made it possible for clients to use ICE nomination. Hi @penguinol, we have problems here.
Fixes #1176

Details
- The problem was in PR #756, which added support for ICE nomination.
- Before that PR, we always assumed that if there is a tuple then we are in 'connected' or 'completed' state, so there is a selected ICE tuple.
- That's no longer true when the client uses ICE nomination.
PR done here: #1182
@GEverding do you know which devices connect to your server? They obviously implement ICE renomination, which is the reason for the crash (see the PR), and AFAIK Chrome and libwebrtc don't support it. Well, they do, but I'm not aware that it's enabled by default.
We have Chrome (Electron), iOS, and Android all connecting to our servers.
Well, we only use UDP, so we have never hit this crash. I'll check the PR.
@GEverding it should be fixed in #1182, but I'm extremely busy working on N things at the same time. Is it possible for you to try the branch in that PR in your setup with TCP enabled and see if it works as expected? I've done all the tests I could, but I've not been able to reproduce the bug using the v3 branch.
Yeah, I can build it and put it out in production. I'll likely know by tomorrow if I don't see any crashes.
Thanks a lot. I'm 95% sure it will work fine.
@GEverding I'm merging PR #1182 since there was an obvious bug in the code and that PR fixes the vulnerability. 3.12.16 released. But still, I'd appreciate it if you could validate whether it fixes your scenario. Thanks a lot.
Hey, it's been out for 48 hours now and it's looking good 👍
Great!
Bug Report
IMPORTANT: We primarily use GitHub as an issue tracker. Just open an issue here if you have encountered a bug in mediasoup.
If you have questions or doubts about mediasoup or need support, please use the mediasoup Discourse Group instead:
https://mediasoup.discourse.group
If you got a crash in mediasoup, please try to provide a core dump in the issue report:
https://mediasoup.org/support/#crashes-in-mediasoup-get-a-core-dump
I don't have core dumps yet.
Your environment
Issue description
2023-10-15T20:33:22.685Z mediasoup:ERROR:Channel Producer Channel error: Error: write EPIPE
2023-10-15T20:33:22.731Z mediasoup:ERROR:PayloadChannel Producer PayloadChannel error: Error: write EPIPE
2023-10-15T20:33:22.751Z mediasoup:ERROR:Worker worker process died unexpectedly [pid:19, code:null, signal:SIGABRT]