Worker Crash SIGABRT #1176

Closed
GEverding opened this issue Oct 15, 2023 · 27 comments · Fixed by #1182

@GEverding

Bug Report

IMPORTANT: We primarily use GitHub as an issue tracker. Just open an issue here if you have encountered a bug in mediasoup.

If you have questions or doubts about mediasoup or need support, please use the mediasoup Discourse Group instead:

https://mediasoup.discourse.group

If you got a crash in mediasoup, please try to provide a core dump into the issue report:

https://mediasoup.org/support/#crashes-in-mediasoup-get-a-core-dump

I don't have core dumps yet.

Your environment

  • Operating system: Linux
  • Node version: Node v20 + v21-nightly
  • npm version:
  • gcc/clang version:
  • mediasoup version: 3.12.12
  • mediasoup-client version:

Issue description

2023-10-15T20:33:22.685Z mediasoup:ERROR:Channel Producer Channel error: Error: write EPIPE
2023-10-15T20:33:22.731Z mediasoup:ERROR:PayloadChannel Producer PayloadChannel error: Error: write EPIPE
2023-10-15T20:33:22.751Z mediasoup:ERROR:Worker worker process died unexpectedly [pid:19, code:null, signal:SIGABRT]

@GEverding GEverding added the bug label Oct 15, 2023
@GEverding
Author

2023-10-15T20:43:50.557Z mediasoup:ERROR:PayloadChannel Producer PayloadChannel error: Error: read ECONNRESET
2023-10-15T20:43:50.558Z mediasoup:ERROR:Channel Producer Channel error: Error: read ECONNRESET
2023-10-15T20:43:50.559Z mediasoup:ERROR:Worker worker process died unexpectedly [pid:20, code:null, signal:SIGABRT]

I've also seen it with this error.

@jmillan
Member

jmillan commented Oct 16, 2023

@GEverding, there is basically no description of the issue. Please elaborate on the scenario in which the problem happens, whether it always reproduces, etc.

@GEverding
Author

My apologies, I was dealing with another issue. I'm seeing this in normal use of mediasoup in production. It's a fairly rare issue. My guess is that it happens when a client disconnects halfway through setup, because I see this more on startup of a new instance. I know you really need core dumps for this, and I'm working on that.

@jmillan
Member

jmillan commented Oct 16, 2023

The error refers to the UnixSocket between Node and the C++ worker. It's not related to client connections.

@GEverding
Author

Do you want a core dump to debug this, or do you need something else? I did see this on v20, but I'm seeing it more on v21-nightly (there's a Node SSL crash in v18-v20).

@ibc
Member

ibc commented Oct 16, 2023

Core dump, please. If you figure out reliable steps to reproduce, that would be great, but I think that will be hard.

@GEverding
Author

Still working on the core dump, but I saw this in the log and it might be the issue:

2023-10-16T19:42:22.374Z mediasoup:ERROR:Worker (stderr) (ABORT) RTC::IceServer::HandleTuple() | failed assertion `this->tuples.empty()': state is 'disconnected' but there are 1 tuples

@ibc
Member

ibc commented Oct 16, 2023

Interesting. I will check tomorrow. If you can get the core dump, that would be better, but maybe I can figure out the issue without it. Thanks.

@GEverding
Author

GEverding commented Oct 17, 2023

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {0, 167607312, 0, 5, 0, 281470681743361, 281473119789568, 281473931202416, 281473119637420, 187650778675576, 281473114525712, 281473114527504, 187651764764656,
            281473931202384, 281473115126340, 1}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x0000ffff910a7aa0 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {281473116505920, 2676586395008836901, 2676586395008836901, 2334397743342383717,
              7813874495109036581, 684901, 2317476195752624424, 0, 0, 18446744073692774400, 0, 0, 1, 4616194021469978624, 0, 4616194021471028225}}, sa_flags = 1074791425,
          sa_restorer = 0xaaaada013978 <[email protected]>}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x0000aaaad9a4b00c in RTC::IceServer::HandleTuple(RTC::TransportTuple*, bool, bool, unsigned int) ()
No symbol table info available.
#3  0x0000aaaad9af7b5c in non-virtual thunk to RTC::WebRtcServer::OnTcpConnectionPacketReceived(RTC::TcpConnection*, unsigned char const*, unsigned long) ()
No symbol table info available.
#4  0x0000aaaad9ad3948 in RTC::TcpConnection::UserOnTcpConnectionRead() ()
No symbol table info available.
#5  0x0000aaaad9d91794 in uv.read ()
No symbol table info available.
#6  0x0000aaaad9d920f4 in uv.stream_io ()
No symbol table info available.
#7  0x0000aaaad9d98d54 in uv.io_poll ()
No symbol table info available.
#8  0x0000aaaad9d8c734 in uv_run ()
No symbol table info available.
#9  0x0000aaaad99d8784 in DepLibUV::RunLoop() ()
No symbol table info available.
#10 0x0000aaaad99e2060 in Worker::Worker(Channel::ChannelSocket*, PayloadChannel::PayloadChannelSocket*) ()
No symbol table info available.
#11 0x0000aaaad99d72a0 in mediasoup_worker_run ()
No symbol table info available.
#12 0x0000aaaad99d5ea4 in main ()
No symbol table info available.

Is this enough of a core dump? I have not been able to repro it in testing.

@ibc
Member

ibc commented Oct 17, 2023

Yes, it's enough, thanks. I will finally work on it tomorrow. BTW, do you have some special scenario, such as clients connecting to mediasoup through TCP instead of UDP? Are you using a separate WebRtcServer to make all WebRtcTransports listen on the same port(s), or not?

@GEverding
Author

Yes, we have some clients using TCP because of network restrictions. We run UDP and TCP on different ports.

We have 1 worker and it gets a WebRtcServer. Reading over my code, there could be a race condition where 2 WebRtcServers get created at startup. In that situation only 1 would be used for creating all the transports.

@ibc
Member

ibc commented Oct 18, 2023

Reading over my code, there could be a race condition where 2 WebRtcServers get created at startup. In that situation only 1 would be used for creating all the transports.

Do you literally mean that you are creating WebRtcServers in different Workers and assigning them the same listening ports?

@ibc
Member

ibc commented Oct 18, 2023

Reading over my code, there could be a race condition where 2 WebRtcServers get created at startup. In that situation only 1 would be used for creating all the transports.

Do you literally mean that you are creating WebRtcServers in different Workers and assigning them the same listening ports?

This is not possible. worker.createWebRtcServer() will throw if the given UDP or TCP port is already in use by another process:

mediasoup:WARN:Channel request failed [method:WORKER_CREATE_WEBRTCSERVER, id:1]: uv_listen() failed [transport:tcp, ip:'0.0.0.0', port:44444]: address already in use [method:worker.createWebRtcServer]
mediasoup:WARN:Channel request failed [method:WORKER_CREATE_WEBRTCSERVER, id:1]: uv_udp_bind() failed [transport:udp, ip:'0.0.0.0', port:44444]: address already in use [method:worker.createWebRtcServer]

@GEverding
Author

We have 1 worker with 1 WebRtcServer with 4 ports: 40001+40002 for internal UDP+TCP, 40003+40004 for the public announced IPs. Nothing is shared.

I found this out yesterday: we were running a lot (95%) of traffic through TCP because of a config error. Since fixing the config error we see fewer errors.

@ibc
Member

ibc commented Oct 18, 2023

So the thing is that the iceServer->HandleTuple() private method was called while in the 'disconnected' state.

  1. IceServer entered the 'disconnected' state earlier because iceServer->RemoveTuple() was called and it removed the last/only tuple, so IceServer changed its state to 'disconnected'. That's the only way to switch to the 'disconnected' state.
  2. But somehow iceServer->HandleTuple() was called later and it added a tuple, yet it didn't leave the 'disconnected' state. That doesn't make sense.
  3. So later, while in the 'disconnected' state and with a tuple in the list, iceServer->HandleTuple() was called again, and here the assertion fails ("state is 'disconnected' but there are 1 tuples"). The toy model below walks through this exact sequence.
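
Here is a minimal standalone toy model of that sequence (a sketch for illustration only, not mediasoup code; ToyIceServer and its members are invented). It aborts on the second HandleTuple() call, just like the worker does:

#include <cassert>
#include <vector>

enum class IceState { NEW, CONNECTED, COMPLETED, DISCONNECTED };

struct ToyIceServer
{
	IceState state{ IceState::CONNECTED };
	std::vector<int> tuples{ 1 }; // stand-in for the RTC::TransportTuple* list.

	void RemoveTuple()
	{
		tuples.pop_back();
		// Removing the last/only tuple is the only way to enter 'disconnected'.
		if (tuples.empty())
			state = IceState::DISCONNECTED;
	}

	void HandleTuple(int tuple, bool nominationConditionMet)
	{
		if (state == IceState::DISCONNECTED)
		{
			// Mirrors the MS_ASSERT in the real HandleTuple().
			assert(tuples.empty() && "state is 'disconnected' but there are tuples");

			tuples.push_back(tuple);

			if (nominationConditionMet)
				state = IceState::CONNECTED; // invariant holds again.
			// else: the tuple is stored but the state stays 'disconnected' (bug path).
		}
	}
};

int main()
{
	ToyIceServer ice;

	ice.RemoveTuple();         // 1. last tuple removed -> state 'disconnected'.
	ice.HandleTuple(2, false); // 2. tuple stored, state left as 'disconnected'.
	ice.HandleTuple(3, false); // 3. the assertion fails here -> SIGABRT.
}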

@ibc
Member

ibc commented Oct 18, 2023

So the bug is here, in the HandleTuple() method (for both the 'NEW' and 'DISCONNECTED' states, maybe more):

	void IceServer::HandleTuple(
	  RTC::TransportTuple* tuple, bool hasUseCandidate, bool hasNomination, uint32_t nomination)
	{
		MS_TRACE();

		switch (this->state)
		{
			case IceState::NEW:
			{
				// There should be no tuples.
				MS_ASSERT(
				  this->tuples.empty(), "state is 'new' but there are %zu tuples", this->tuples.size());

				// There shouldn't be a selected tuple.
				MS_ASSERT(!this->selectedTuple, "state is 'new' but there is selected tuple");

				if (!hasUseCandidate && !hasNomination)
				{
					MS_DEBUG_TAG(
					  ice,
					  "transition from state 'new' to 'connected' [hasUseCandidate:%s, hasNomination:%s, nomination:%" PRIu32
					  "]",
					  hasUseCandidate ? "true" : "false",
					  hasNomination ? "true" : "false",
					  nomination);

					// Store the tuple.
					auto* storedTuple = AddTuple(tuple);

					// Mark it as selected tuple.
					SetSelectedTuple(storedTuple);
					// Update state.
					this->state = IceState::CONNECTED;
					// Notify the listener.
					this->listener->OnIceServerConnected(this);
				}
				else
				{
					// Store the tuple.
					auto* storedTuple = AddTuple(tuple);

					if ((hasNomination && nomination > this->remoteNomination) || !hasNomination)
					{
						MS_DEBUG_TAG(
						  ice,
						  "transition from state 'new' to 'completed' [hasUseCandidate:%s, hasNomination:%s, nomination:%" PRIu32
						  "]",
						  hasUseCandidate ? "true" : "false",
						  hasNomination ? "true" : "false",
						  nomination);

						// Mark it as selected tuple.
						SetSelectedTuple(storedTuple);
						// Update state.
						this->state = IceState::COMPLETED;
						// Update nomination.
						if (hasNomination && nomination > this->remoteNomination)
							this->remoteNomination = nomination;
						// Notify the listener.
						this->listener->OnIceServerCompleted(this);
					}
				}

				break;
			}

Notice that there are cases in which we store the tuple but do NOT change the state, so we end up with tuples in the list while still in the 'NEW' or 'DISCONNECTED' state:

(screenshot: CleanShot-2023-10-18-at-19 02 36@2x)
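
One way to preserve the invariant would be to store the tuple only in the branches that actually leave 'new'/'disconnected'. This is a sketch of that idea against the 'new' case quoted above, not the actual diff in PR #1182:

	// Sketch only: add the tuple only when we actually transition, so 'new'
	// (and likewise 'disconnected') never ends up holding tuples.
	else
	{
		if ((hasNomination && nomination > this->remoteNomination) || !hasNomination)
		{
			// Store the tuple.
			auto* storedTuple = AddTuple(tuple);

			// Mark it as selected tuple.
			SetSelectedTuple(storedTuple);
			// Update state.
			this->state = IceState::COMPLETED;
			// Update nomination.
			if (hasNomination && nomination > this->remoteNomination)
				this->remoteNomination = nomination;
			// Notify the listener.
			this->listener->OnIceServerCompleted(this);
		}
		// else: do not store the tuple at all, so the state and the tuple list
		// stay consistent.
	}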

@ibc
Member

ibc commented Oct 18, 2023

So basically, after PR #756 we made it possible for IceServer to have tuples in the list while none of them is selected and the state remains 'new' or 'disconnected'.
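
In other words, the intended invariant could be written as an assertion (a sketch only, reusing the MS_ASSERT macro from the quoted code; this exact check is not in the repository):

	// Tuples (and a selected tuple) may only exist while the IceServer is in
	// 'connected' or 'completed' state.
	MS_ASSERT(
	  this->state == IceState::CONNECTED || this->state == IceState::COMPLETED ||
	    (this->tuples.empty() && !this->selectedTuple),
	  "tuples or selected tuple present while state is 'new' or 'disconnected'");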

Hi @penguinol, problems here.

ibc added a commit that referenced this issue Oct 18, 2023
Fixes #1176

### Details

- The problem was in PR #756, which added support for ICE renomination.
- Before that PR, we always assumed that if there is a tuple then we are in 'connected' or 'completed' state, so there is a selected ICE tuple.
- That's no longer true when the client uses ICE renomination.
@ibc
Member

ibc commented Oct 18, 2023

PR done here: #1182
Will merge and release a new version tomorrow after getting some eyes on it.

@ibc
Member

ibc commented Oct 18, 2023

@GEverding, do you know which devices connect to your server? They obviously implement ICE renomination, which is the reason for the crash (see the PR), and AFAIK Chrome and libwebrtc don't support it. Well, they do, but I'm not aware that it's enabled by default.

@GEverding
Author

We have Chrome (Electron), iOS and Android clients all connecting to our servers.

@penguinol
Contributor

Well, we only use UDP, so we have never hit this crash. I'll check the PR.

@ibc
Member

ibc commented Oct 19, 2023

@GEverding, it should be fixed in #1182, but I'm extremely busy working on N things at the same time. Is it possible for you to try the branch in that PR in your setup with TCP enabled and see if it works as expected? I've done all the tests I could, but I've not been able to reproduce the bug using the v3 branch.

@GEverding
Author

Yeah, I can build it and put it out in production. I'll likely know by tomorrow if I don't see any crashes.

@ibc
Member

ibc commented Oct 19, 2023

Thanks a lot. I'm 95% sure it will work fine.

@ibc ibc closed this as completed in #1182 Oct 23, 2023
@ibc
Member

ibc commented Oct 23, 2023

@GEverding, I'm merging PR #1182 since there was an obvious bug in the code and that PR fixes the vulnerability. 3.12.16 is released. But I'd still appreciate it if you could validate whether it fixes your scenario. Thanks a lot.

@GEverding
Author

Hey, it's been out for 48 hrs now and it's looking good 👍

@ibc
Member

ibc commented Oct 25, 2023

Great!
