Flaky scheduler port forwarding errors #926

slevang · 2025-01-03T03:13:44Z

Describe the issue:

When launching and running computations on a KubeCluster on GKE, specifically when connecting to the cluster from a local machine or GCE VM, I regularly hit connection errors. From a few tests, this seems to be a non-issue when I create a dask client on the cluster itself (i.e. authenticating via a service account rather than a .kube/config).

I've seen a few variants, but this is the most common one:

Unhandled exception in client_connected_cb
transport: <_SelectorSocketTransport closed fd=53>
Traceback (most recent call last):
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 737, in __aexit__
    cb_suppress = await cb(*exc_details)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 642, in _exit_wrapper
    await callback(*args, **kwds)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 966, in close
    await self.stream.write(data)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_async/http11.py", line 372, in write
    await self._stream.write(buffer, timeout)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 52, in write
    await self._stream.send(item=buffer)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 211, in send
    await self._call_sslobject_method(self._ssl_object.write, item)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 177, in _call_sslobject_method
    await self.transport_stream.send(self._write_bio.read())
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1256, in send
    await AsyncIOBackend.checkpoint()
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2300, in checkpoint
    await sleep(0)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/asyncio/tasks.py", line 656, in sleep
    await __sleep0()
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/asyncio/tasks.py", line 650, in __sleep0
    yield
asyncio.exceptions.CancelledError: Cancelled by cancel scope 76c7c470acc0

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/kr8s/_portforward.py", line 226, in _sync_sockets
  |     async with self._connect_websocket() as ws:
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
  |     await self.gen.athrow(value)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/kr8s/_portforward.py", line 204, in _connect_websocket
  |     async with self.pod.api.open_websocket(
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
  |     await self.gen.athrow(value)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/kr8s/_api.py", line 231, in open_websocket
  |     async with httpx_ws.aconnect_ws(
  |                ^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
  |     await self.gen.athrow(value)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 1308, in aconnect_ws
  |     async with _aconnect_ws(
  |                ^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
  |     await self.gen.athrow(value)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 1211, in _aconnect_ws
  |     async with AsyncWebSocketSession(
  |                ^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 641, in __aexit__
  |     await self._exit_stack.aclose()
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 696, in aclose
  |     await self.__aexit__(None, None, None)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 754, in __aexit__
  |     raise exc_details[1]
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 737, in __aexit__
  |     cb_suppress = await cb(*exc_details)
  |                   ^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 763, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 1029, in _background_keepalive_ping
    |     pong_callback = await self.ping()
    |                     ^^^^^^^^^^^^^^^^^
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 665, in ping
    |     await self.send(event)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 692, in send
    |     await self.stream.write(data)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_async/http11.py", line 372, in write
    |     await self._stream.write(buffer, timeout)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 52, in write
    |     await self._stream.send(item=buffer)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 211, in send
    |     await self._call_sslobject_method(self._ssl_object.write, item)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 177, in _call_sslobject_method
    |     await self.transport_stream.send(self._write_bio.read())
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1255, in send
    |     with self._send_guard:
    |          ^^^^^^^^^^^^^^^^
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_core/_synchronization.py", line 713, in __enter__
    |     raise BusyResourceError(self.action)
    | anyio.BusyResourceError: Another task is already writing to this resource
    +------------------------------------

Sometimes this just takes the dashboard down, which a browser refresh fixes. Other times I seem to lose the whole connection to the scheduler, and the client's work gets cancelled. I haven't figured out all the internals, but I'm guessing these are two separate port forwards and this can affect either?

Minimal Complete Verifiable Example:

Not sure how to create an MCVE here but I can certainly try to further isolate things if there aren't any immediate ideas.

Anything else we need to know?:

My .kube/config auth method, if that matters, is:

exec:
  command: gke-gcloud-auth-plugin

Seems to be worse when using a large cluster (hundreds of workers), but I haven't rigorously tested this.

I think this is really a kr8s.portforward issue so I'm happy to move there if you prefer.

Environment:

Dask version: 2024.11.0
Python version: 3.12.7
Operating System: Ubuntu
Install method (conda, pip, source): pip


jacobtomlinson · 2025-01-06T16:48:03Z

It looks like kr8s needs to handle the anyio.BusyResourceError when writing to the websocket. Would you mind opening an issue over there?
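
For readers unfamiliar with this error, here is a minimal sketch of the failure mode using only asyncio. It assumes the traceback above comes from two tasks (for example a keepalive ping and another write) sending on the same stream at once. `BusyResourceError`, `GuardedStream`, and `LockedStream` below are hypothetical stand-ins written for illustration, not the anyio, httpx-ws, or kr8s implementations, and serializing writers with a lock is only one common remedy; this thread does not say which approach the upstream fix takes.

```python
import asyncio


class BusyResourceError(Exception):
    """Hypothetical stand-in for anyio.BusyResourceError."""


class GuardedStream:
    """Toy stream that rejects overlapping writes (illustration only)."""

    def __init__(self):
        self._writing = False
        self.sent = []

    async def write(self, data):
        if self._writing:
            raise BusyResourceError(
                "Another task is already writing to this resource"
            )
        self._writing = True
        try:
            await asyncio.sleep(0)  # yield to the event loop mid-write
            self.sent.append(data)
        finally:
            self._writing = False


class LockedStream:
    """Wrapper that makes writers take turns via an asyncio.Lock."""

    def __init__(self, stream):
        self._stream = stream
        self._lock = asyncio.Lock()

    async def write(self, data):
        async with self._lock:
            await self._stream.write(data)


async def race(use_lock):
    # Build the stream inside the running loop, then start two writers at once.
    stream = GuardedStream()
    if use_lock:
        stream = LockedStream(stream)
    return await asyncio.gather(
        stream.write(b"ping"), stream.write(b"data"), return_exceptions=True
    )


def demo():
    unguarded = asyncio.run(race(use_lock=False))
    locked = asyncio.run(race(use_lock=True))
    return unguarded, locked


if __name__ == "__main__":
    unguarded, locked = demo()
    print("without lock:", unguarded)
    print("with lock:   ", locked)
```

Without the lock, the second writer hits the guard while the first is suspended mid-write; with the lock, the second writer simply waits its turn.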

jacobtomlinson · 2025-01-07T10:50:39Z

This has been fixed upstream in frankie567/httpx-ws#89, and kr8s-org/kr8s#546 sets the fixed release as the minimum version. Upgrading to the latest versions will resolve this, so I'm going to close this out.
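
To confirm what an environment is actually running before and after upgrading, a small sketch like the following may help. It only reports installed versions via the standard library; the default package list is an assumption based on the libraries named in this thread, and `installed_versions` is a hypothetical helper. The minimum fixed versions are not stated here, so check the linked PRs for those.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("dask-kubernetes", "kr8s", "httpx-ws", "anyio")):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```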

jacobtomlinson added the bug label Jan 6, 2025

slevang mentioned this issue Jan 6, 2025

anyio.BusyResourceError on port forward kr8s-org/kr8s#543

Closed

jacobtomlinson closed this as completed Jan 7, 2025
