Flaky scheduler port forwarding errors #926

Closed
slevang opened this issue Jan 3, 2025 · 2 comments

slevang commented Jan 3, 2025

Describe the issue:

When launching and running computations on a KubeCluster on GKE, I regularly hit connection errors, specifically when connecting to the cluster from a local machine or a GCE VM. From a few tests, this seems to be a non-issue when I create a Dask client on the cluster itself (i.e. authenticating via a service account rather than a .kube/config).

I've seen a few variants, but this is the most common one:

Unhandled exception in client_connected_cb
transport: <_SelectorSocketTransport closed fd=53>
Traceback (most recent call last):
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 737, in __aexit__
    cb_suppress = await cb(*exc_details)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 642, in _exit_wrapper
    await callback(*args, **kwds)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 966, in close
    await self.stream.write(data)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_async/http11.py", line 372, in write
    await self._stream.write(buffer, timeout)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 52, in write
    await self._stream.send(item=buffer)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 211, in send
    await self._call_sslobject_method(self._ssl_object.write, item)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 177, in _call_sslobject_method
    await self.transport_stream.send(self._write_bio.read())
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1256, in send
    await AsyncIOBackend.checkpoint()
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2300, in checkpoint
    await sleep(0)
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/asyncio/tasks.py", line 656, in sleep
    await __sleep0()
  File "/home/slevang/miniconda3/envs/salient/lib/python3.12/asyncio/tasks.py", line 650, in __sleep0
    yield
asyncio.exceptions.CancelledError: Cancelled by cancel scope 76c7c470acc0

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/kr8s/_portforward.py", line 226, in _sync_sockets
  |     async with self._connect_websocket() as ws:
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
  |     await self.gen.athrow(value)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/kr8s/_portforward.py", line 204, in _connect_websocket
  |     async with self.pod.api.open_websocket(
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
  |     await self.gen.athrow(value)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/kr8s/_api.py", line 231, in open_websocket
  |     async with httpx_ws.aconnect_ws(
  |                ^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
  |     await self.gen.athrow(value)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 1308, in aconnect_ws
  |     async with _aconnect_ws(
  |                ^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
  |     await self.gen.athrow(value)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 1211, in _aconnect_ws
  |     async with AsyncWebSocketSession(
  |                ^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 641, in __aexit__
  |     await self._exit_stack.aclose()
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 696, in aclose
  |     await self.__aexit__(None, None, None)
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 754, in __aexit__
  |     raise exc_details[1]
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 737, in __aexit__
  |     cb_suppress = await cb(*exc_details)
  |                   ^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 763, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 1029, in _background_keepalive_ping
    |     pong_callback = await self.ping()
    |                     ^^^^^^^^^^^^^^^^^
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 665, in ping
    |     await self.send(event)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 692, in send
    |     await self.stream.write(data)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_async/http11.py", line 372, in write
    |     await self._stream.write(buffer, timeout)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 52, in write
    |     await self._stream.send(item=buffer)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 211, in send
    |     await self._call_sslobject_method(self._ssl_object.write, item)
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 177, in _call_sslobject_method
    |     await self.transport_stream.send(self._write_bio.read())
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1255, in send
    |     with self._send_guard:
    |          ^^^^^^^^^^^^^^^^
    |   File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_core/_synchronization.py", line 713, in __enter__
    |     raise BusyResourceError(self.action)
    | anyio.BusyResourceError: Another task is already writing to this resource
    +------------------------------------

Sometimes this just causes the dashboard to drop out, which is fixed by a browser refresh. Other times I seem to lose the whole connection to the scheduler, and then the client's work gets cancelled. I haven't figured out all the internals, but I'm guessing the dashboard and the scheduler use two separate port forwards and this can affect either?

Minimal Complete Verifiable Example:

Not sure how to create an MCVE here but I can certainly try to further isolate things if there aren't any immediate ideas.

Anything else we need to know?:

My .kube/config auth method, if that matters, is:

exec:
  command: gke-gcloud-auth-plugin

Seems to be worse when using a large cluster (hundreds of workers), but I haven't rigorously tested this.

I think this is really a kr8s.portforward issue, so I'm happy to move it there if you prefer.

Environment:

  • Dask version: 2024.11.0
  • Python version: 3.12.7
  • Operating System: Ubuntu
  • Install method (conda, pip, source): pip
@jacobtomlinson
Member

It looks like kr8s needs to handle the anyio.BusyResourceError when writing to the websocket. Would you mind opening an issue over there?
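
For readers following along, the failure mode in the traceback (a keepalive ping and another write hitting the same stream at once) can be reproduced in miniature. The sketch below uses only the standard library and is not kr8s or httpx-ws code: `BusyResourceError` is a local stand-in for `anyio.BusyResourceError`, and `GuardedStream` and `LockedWriter` are hypothetical names made up for illustration. It shows one common way to avoid the error, serializing writes behind a lock, which may or may not match how the upstream fix works.

```python
import asyncio


class BusyResourceError(Exception):
    """Local stand-in for anyio.BusyResourceError (illustration only)."""


class GuardedStream:
    """Toy stream that rejects overlapping writes, like the guard in the traceback."""

    def __init__(self):
        self._writing = False
        self.sent = []

    async def send(self, data: bytes) -> None:
        if self._writing:
            raise BusyResourceError("Another task is already writing to this resource")
        self._writing = True
        try:
            await asyncio.sleep(0)  # yield to the event loop in the middle of the write
            self.sent.append(data)
        finally:
            self._writing = False


class LockedWriter:
    """Hypothetical wrapper that lets only one task write at a time."""

    def __init__(self, stream: GuardedStream):
        self._stream = stream
        self._lock = asyncio.Lock()

    async def send(self, data: bytes) -> None:
        async with self._lock:
            await self._stream.send(data)


async def demo():
    # Unserialized: a "ping" and a "data" write overlap, so one of them fails.
    raw = GuardedStream()
    results = await asyncio.gather(
        raw.send(b"ping"), raw.send(b"data"), return_exceptions=True
    )
    unsafe_errors = [r for r in results if isinstance(r, BusyResourceError)]

    # Serialized: the lock queues the second write until the first finishes.
    safe_stream = GuardedStream()
    writer = LockedWriter(safe_stream)
    await asyncio.gather(writer.send(b"ping"), writer.send(b"data"))
    return len(unsafe_errors), safe_stream.sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The other option hinted at above is to catch the error at the call site and retry or skip the write; the lock is shown here only because it is the simplest to demonstrate.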

@jacobtomlinson
Member

This has been fixed upstream in frankie567/httpx-ws#89, and kr8s-org/kr8s#546 raises the minimum required httpx-ws version in kr8s to pick up that fix. Upgrading to the latest versions will resolve this, so I'm going to close this out.
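
To check which versions are installed in a given environment before and after upgrading, a small standard-library sketch like the following may help. The distribution names `httpx-ws`, `kr8s`, and `dask-kubernetes` are assumed to be the names the packages are installed under, and `installed_versions` is a hypothetical helper, not part of any of these libraries. It deliberately does not compare against a specific version number, since the thread does not state which release contains the fix.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("httpx-ws", "kr8s", "dask-kubernetes")):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```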
