jina does not pass the right GPU in to clipseg #135

Open
mchaker opened this issue Nov 5, 2022 · 21 comments

Comments

@mchaker

mchaker commented Nov 5, 2022

Describe the bug

Does not work:

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

Works:

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "6"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

Describe how you solve it

I use the numeric GPU ID (sad)


Environment

- jina 3.8.3
- docarray 0.16.2
- jcloud 0.0.35
- jina-hubble-sdk 0.18.0
- jina-proto 0.1.13
- protobuf 3.20.1
- proto-backend cpp
- grpcio 1.47.0
- pyyaml 6.0
- python 3.8.10
- platform Linux
- platform-release 5.15.0-52-generic
- platform-version #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
- architecture x86_64
- processor x86_64
- uid 2485377892357
- session-id fcbedcc8-5d43-11ed-9251-0242ac110005
- uptime 2022-11-05T19:56:49.977485
- ci-vendor (unset)
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_EARLY_STOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_OPTOUT_TELEMETRY (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)

Screenshots

N/A

@JoanFM
Member

JoanFM commented Nov 7, 2022

Hey @mchaker ,

What backend are you using, and what does clipseg do? It seems that the deep learning backend does not understand the UUID.

@JoanFM
Member

JoanFM commented Nov 7, 2022

Hey @mchaker ,

Are you sure your CUDA version supports MIG access?

https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#cuda-baremetal

This documentation lists the driver versions that support the feature, along with the syntax to use.

@JoanFM
Member

JoanFM commented Nov 7, 2022

Can you try changing your YAML to:

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

or

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

?

@mchaker
Author

mchaker commented Nov 7, 2022

My NVIDIA driver version is 515, so it supports MIG.
However, I do not use MIG on my cards. I just use the main card UUID from nvidia-smi -L.

I'll try the MIG prefix and report back.

clipseg is an executor set up for Jina. I use the UUID method of specifying the GPU with other executors, and Jina passes the right GPU to them. For some reason it does not pass the right GPU to the clipseg executor. :(
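
For readers comparing the two YAML variants, here is a minimal sketch of how a UUID from nvidia-smi -L could be mapped to the numeric index it is listed under. The helper name uuid_to_index and the sample listing are hypothetical and are not taken from the reporter's machine; the sketch only assumes the usual "GPU N: name (UUID: GPU-...)" line format.

```python
import re

# Matches lines such as:
#   GPU 0: Example GPU A (UUID: GPU-00000000-0000-0000-0000-000000000000)
_GPU_LINE = re.compile(r"^GPU (\d+): .*\(UUID: (GPU-[0-9a-fA-F-]+)\)\s*$")


def uuid_to_index(nvidia_smi_listing: str) -> dict:
    """Map each full-GPU UUID in `nvidia-smi -L` output to its listed index."""
    mapping = {}
    for line in nvidia_smi_listing.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:  # lines that are not full-GPU entries are skipped
            mapping[match.group(2)] = int(match.group(1))
    return mapping


# Hypothetical listing, for illustration only.
SAMPLE = """\
GPU 0: Example GPU A (UUID: GPU-00000000-0000-0000-0000-000000000000)
GPU 1: Example GPU B (UUID: GPU-11111111-1111-1111-1111-111111111111)
"""

print(uuid_to_index(SAMPLE))
```

One caveat: nvidia-smi lists devices in PCI bus order, while CUDA enumerates fastest-first by default, so the numeric index only lines up reliably when CUDA_DEVICE_ORDER=PCI_BUS_ID is also set.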

@JoanFM
Member

JoanFM commented Nov 7, 2022

This is weird. Do you have the source code of clipseg? Can you check the value inside the Executor when you do:

os.environ['CUDA_VISIBLE_DEVICES']?

Jina simply sets the env vars for each Executor process, so whether or not they are respected is up to the Executor or an upstream library.
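
A minimal sketch of that check, assuming it is pasted near the top of the Executor's __init__. The helper describe_visible_devices is hypothetical (it is not part of Jina or torch); it only reads the environment variable and labels each entry by its form.

```python
import os


def _classify(entry: str) -> str:
    """Label one CUDA_VISIBLE_DEVICES entry by its form."""
    if entry.isdigit():
        return "index"
    if entry.startswith("GPU-"):
        return "gpu-uuid"
    if entry.startswith("MIG-"):
        return "mig-uuid"
    return "unknown"


def describe_visible_devices(environ=None):
    """Return (entry, kind) pairs for CUDA_VISIBLE_DEVICES; [] if unset or empty."""
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return []
    return [(e.strip(), _classify(e.strip())) for e in raw.split(",") if e.strip()]


# What the process actually received, as seen from inside the Executor.
print(describe_visible_devices())
```

Printing torch.cuda.device_count() right after this would show whether torch can see the device that the variable names.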

@mchaker
Author

mchaker commented Nov 7, 2022

I see - will check the os.environ value and report back.

@JoanFM
Member

JoanFM commented Nov 11, 2022

Hey @mchaker , any news about it?

@mchaker
Author

mchaker commented Nov 11, 2022

@JoanFM yes - CUDA_VISIBLE_DEVICES is GPU-87d2c7e5-c3eb-1181-1857-368f4c2bbbbb in the container (proper GPU ID)

However Jina crashes with:

⠋ Waiting stablemulti clipseg upscalerp40 realesrgan... ━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/6 0:00:18
CRITI… clipseg/rep-0@61 can not load the executor from executors/clipseg/config.yml                          [11/11/22 14:54:57]
ERROR  clipseg/rep-0@61 RuntimeError('Attempting to deserialize object on CUDA device 0 but                  [11/11/22 14:54:57]
       torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an
       existing device.') during <class 'jina.serve.runtimes.worker.WorkerRuntime'> initialization
        add "--quiet-error" to suppress the exception details
       Traceback (most recent call last):
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/orchestrate/pods/__init__.py", line
       74, in run
           runtime = runtime_cls(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py",
       line 36, in __init__
           super().__init__(args, **kwargs)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/asyncio.py", line 80,
       in __init__
           self._loop.run_until_complete(self.async_setup())
         File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
           return future.result()
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py",
       line 101, in async_setup
           self._data_request_handler = DataRequestHandler(
         File
       "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…
       line 49, in __init__
           self._load_executor(
         File
       "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…
       line 139, in _load_executor
           self._executor: BaseExecutor = BaseExecutor.load_config(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 760, in
       load_config
           obj = JAML.load(tag_yml, substitute=False, runtime_args=runtime_args)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 174, in load
           r = yaml.load(stream, Loader=get_jina_loader_with_runtime(runtime_args))
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/__init__.py", line 81, in load
           return loader.get_single_data()
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 51, in
       get_single_data
           return self.construct_document(node)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 55, in
       construct_document
           data = self.construct_object(node)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 100, in
       construct_object
           data = constructor(self, node)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 582, in
       _from_yaml
           return get_parser(cls, version=data.get('version', None)).parse(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/parsers/executor/legacy.py",
       line 45, in parse
           obj = cls(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/executors/decorators.py", line
       63, in arg_wrapper
           f = func(self, *args, **kwargs)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/helper.py", line 71, in
       arg_wrapper
           f = func(self, *args, **kwargs)
         File "/dalle/dalle-flow/executors/clipseg/executor.py", line 71, in __init__
           torch.load(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 789, in load
           return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1131, in
       _load
           result = unpickler.load()
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1101, in
       persistent_load
           load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1083, in
       load_tensor
           wrap_storage=restore_location(storage, location),
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1055, in
       restore_location
           return default_restore_location(storage, str(map_location))
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 215, in
       default_restore_location
           result = fn(storage, location)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 182, in
       _cuda_deserialize
           device = validate_cuda_device(location)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 173, in
       validate_cuda_device
           raise RuntimeError('Attempting to deserialize object on CUDA device '
       RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
       Please use torch.load with map_location to map your storages to an existing device.
DEBUG  clipseg/rep-0@61 process terminated

@JoanFM
Member

JoanFM commented Nov 11, 2022

Hey @mchaker ,

This problem lies in the Executor and how it loads the model onto the GPU. Where are you getting the Executor from? Maybe we can open an issue on that repo and fix it there.

@mchaker
Author

mchaker commented Nov 11, 2022

I see - let me check with the developer and see where they are getting the executor from. Maybe it is custom.

@JoanFM
Member

JoanFM commented Nov 11, 2022

I believe the issue may come from how the model was stored, or something similar. In this case Jina has made sure that your CUDA_VISIBLE_DEVICES env var is passed correctly to the Executor.

@mchaker
Author

mchaker commented Nov 11, 2022

I see -- I'll follow up with the executor authors and dig into the executor source. Thanks for your help!

@mchaker
Author

mchaker commented Nov 11, 2022

@JoanFM actually it looks like the executor is from Jina:
https://github.com/jina-ai/dalle-flow/blob/main/executors/clipseg/executor.py

@AmericanPresidentJimmyCarter
Contributor

AmericanPresidentJimmyCarter commented Nov 11, 2022

The device for the model is simply mapped with:

        model.load_state_dict(
            torch.load(
                f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
                map_location=torch.device('cuda'),
            ),
            strict=False,
        )

In this case it appears that torch is unable to map the location. @mchaker, before these lines in executors/clipseg/executor.py you can add print(os.environ.get('CUDA_VISIBLE_DEVICES')) to see what the environment actually is.
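
As a diagnostic aid, here is a minimal sketch of choosing map_location from the number of devices torch reports, so that the executor can still start when none are visible. The helper choose_map_location is hypothetical and is not part of the executor. Falling back to the CPU does not fix GPU selection; it only helps confirm that device visibility, rather than the checkpoint itself, is what fails.

```python
def choose_map_location(visible_cuda_devices: int) -> str:
    """Return 'cuda' when at least one CUDA device is visible, else 'cpu'."""
    return "cuda" if visible_cuda_devices > 0 else "cpu"


# Assumed usage inside the executor (requires torch, so it is left commented):
#
#   import torch
#   location = choose_map_location(torch.cuda.device_count())
#   state = torch.load(weights_path, map_location=torch.device(location))
#
# `weights_path` stands in for the checkpoint path built in the snippet above.

print(choose_map_location(0))
print(choose_map_location(1))
```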

@JoanFM
Member

JoanFM commented Nov 11, 2022

Hey @AmericanPresidentJimmyCarter, do you know why the model cannot be loaded with that CUDA_VISIBLE_DEVICES setting?

@AmericanPresidentJimmyCarter
Contributor

@JoanFM No, I will try to get you debug output from the environment. This appears to be a strange one.

JoanFM transferred this issue from jina-ai/serve Nov 11, 2022
@JoanFM
Member

JoanFM commented Nov 11, 2022

I transferred the issue to DALLE-FLOW because it is specific to the Executor in this project.

@mchaker
Author

mchaker commented Nov 18, 2022

@AmericanPresidentJimmyCarter what do you need from the env?

@JoanFM
Member

JoanFM commented Nov 30, 2022

Hey @mchaker , @AmericanPresidentJimmyCarter , any progress on this ?

@AmericanPresidentJimmyCarter
Contributor

I still do not know why it happens -- only this one specific executor has the problem. We can update to the latest jina and see if it persists.

@mchaker
Author

mchaker commented Nov 30, 2022

I updated jina using pip install -U jina and the error still happens:

RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
