When launching and running computations on a KubeCluster on GKE, specifically when connecting to the cluster from a local machine or a GCE VM, I regularly hit connection errors. From a few tests, this seems to be a non-issue when I create the Dask client on the cluster itself (i.e. authenticating via a service account rather than a .kube/config).
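For context, the setup is roughly the sketch below. The cluster name, image, and worker count are placeholders, not my exact configuration; the point is just that the client runs outside the cluster and reaches the scheduler through port forwards.

from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster

# Connects to the GKE cluster using the local .kube/config; the scheduler
# and dashboard are then reached through kr8s-managed port forwards.
cluster = KubeCluster(name="example", image="ghcr.io/dask/dask:latest", n_workers=10)
client = Client(cluster)
result = client.submit(sum, range(100)).result()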
I've seen a few variants, but this is the most common one:
Unhandled exception in client_connected_cb
transport: <_SelectorSocketTransport closed fd=53>
Traceback (most recent call last):
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 737, in __aexit__
cb_suppress = await cb(*exc_details)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 642, in _exit_wrapper
await callback(*args, **kwds)
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 966, in close
await self.stream.write(data)
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_async/http11.py", line 372, in write
await self._stream.write(buffer, timeout)
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 52, in write
await self._stream.send(item=buffer)
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 211, in send
await self._call_sslobject_method(self._ssl_object.write, item)
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 177, in _call_sslobject_method
await self.transport_stream.send(self._write_bio.read())
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1256, in send
await AsyncIOBackend.checkpoint()
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2300, in checkpoint
await sleep(0)
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/asyncio/tasks.py", line 656, in sleep
await __sleep0()
File "/home/slevang/miniconda3/envs/salient/lib/python3.12/asyncio/tasks.py", line 650, in __sleep0
yield
asyncio.exceptions.CancelledError: Cancelled by cancel scope 76c7c470acc0
During handling of the above exception, another exception occurred:
+ Exception Group Traceback (most recent call last):
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/kr8s/_portforward.py", line 226, in _sync_sockets
| async with self._connect_websocket() as ws:
| ^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
| await self.gen.athrow(value)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/kr8s/_portforward.py", line 204, in _connect_websocket
| async with self.pod.api.open_websocket(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
| await self.gen.athrow(value)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/kr8s/_api.py", line 231, in open_websocket
| async with httpx_ws.aconnect_ws(
| ^^^^^^^^^^^^^^^^^^^^^
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
| await self.gen.athrow(value)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 1308, in aconnect_ws
| async with _aconnect_ws(
| ^^^^^^^^^^^^^
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 231, in __aexit__
| await self.gen.athrow(value)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 1211, in _aconnect_ws
| async with AsyncWebSocketSession(
| ^^^^^^^^^^^^^^^^^^^^^^
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 641, in __aexit__
| await self._exit_stack.aclose()
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 696, in aclose
| await self.__aexit__(None, None, None)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 754, in __aexit__
| raise exc_details[1]
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/contextlib.py", line 737, in __aexit__
| cb_suppress = await cb(*exc_details)
| ^^^^^^^^^^^^^^^^^^^^^^
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 763, in __aexit__
| raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 1029, in _background_keepalive_ping
| pong_callback = await self.ping()
| ^^^^^^^^^^^^^^^^^
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 665, in ping
| await self.send(event)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpx_ws/_api.py", line 692, in send
| await self.stream.write(data)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_async/http11.py", line 372, in write
| await self._stream.write(buffer, timeout)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 52, in write
| await self._stream.send(item=buffer)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 211, in send
| await self._call_sslobject_method(self._ssl_object.write, item)
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/streams/tls.py", line 177, in _call_sslobject_method
| await self.transport_stream.send(self._write_bio.read())
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1255, in send
| with self._send_guard:
| ^^^^^^^^^^^^^^^^
| File "/home/slevang/miniconda3/envs/salient/lib/python3.12/site-packages/anyio/_core/_synchronization.py", line 713, in __enter__
| raise BusyResourceError(self.action)
| anyio.BusyResourceError: Another task is already writing to this resource
+------------------------------------
Sometimes this just takes out the dashboard, which a browser refresh fixes. Other times I lose the whole connection to the scheduler and the client's work gets cancelled. I haven't dug through all the internals, but I'm guessing the scheduler comms and the dashboard run over two separate port forwards and either one can be affected? (A sketch of what I mean is below.)
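My mental model is something like the following: two independent kr8s port forwards from the scheduler pod, one for scheduler comms and one for the dashboard. The pod name is made up, this uses the kr8s sync port-forward API as I understand it, and it is only an illustration, not how dask_kubernetes actually wires things internally.

from kr8s.objects import Pod

# Hypothetical scheduler pod name; in practice dask_kubernetes manages
# these forwards itself when the client runs outside the cluster.
pod = Pod.get("example-scheduler")

# One forward for scheduler comms (tcp://localhost:8786) ...
scheduler_pf = pod.portforward(remote_port=8786, local_port=8786)
# ... and a separate one for the dashboard (http://localhost:8787).
dashboard_pf = pod.portforward(remote_port=8787, local_port=8787)

scheduler_pf.start()
dashboard_pf.start()
# Either forward dropping would produce the symptoms above: a dead
# dashboard, or a lost scheduler connection that cancels in-flight work.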
Minimal Complete Verifiable Example:
I'm not sure how to create an MCVE here, but I can certainly try to isolate things further if there aren't any immediate ideas.
Anything else we need to know?:
My .kube/config auth method, if that matters, is:
exec:
  command: gke-gcloud-auth-plugin
It seems to be worse with a large cluster (hundreds of workers), but I haven't tested this rigorously.
I think this is really a kr8s.portforward issue, so I'm happy to move it there if you prefer.
Environment:
Dask version: 2024.11.0
Python version: 3.12.7
Operating System: Ubuntu
Install method (conda, pip, source): pip
This has been fixed upstream in frankie567/httpx-ws#89, and kr8s-org/kr8s#546 bumps the minimum required version. Upgrading to the latest versions will resolve this, so I'm going to close this out.
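For anyone hitting this, a quick way to confirm what you have installed after upgrading (the specific versions that carry the fix are in the linked PRs; I'm not restating them here):

from importlib.metadata import version

# Print the installed versions of the packages involved in this traceback.
for pkg in ("httpx-ws", "kr8s", "dask-kubernetes"):
    print(pkg, version(pkg))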