Running on GPU: CUDA_ERROR_NO_DEVICE #44

Open
cguegi opened this issue Oct 20, 2020 · 6 comments

Comments

@cguegi

cguegi commented Oct 20, 2020

Hello

I have a properly configured GPU node with Nvidia / Cuda drivers as well as the Cuda toolkit.
nvidia-smi as well as Cuda samples such as deviceQuery and bandwidthTest run successfully.

Tensorflow, executed locally, detects the GPU device with:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

As described here, the Yarn node label “gpu” exists and is associated with the above node.

For test purposes I modified keras_example.py as follows:

task_specs={
    "chief": TaskSpec(memory="2 GiB", vcores=4),
    "worker": TaskSpec(memory="2 GiB", vcores=4, instances=1, label=NodeLabel.GPU),
    "ps": TaskSpec(memory="2 GiB", vcores=4, instances=2),
    "evaluator": TaskSpec(memory="2 GiB", vcores=1)
},
queue="ml-gpu"

The worker.log shows that no GPU has been detected:

+ ./venv.pex -m tf_yarn.tasks._independent_workers_task
INFO:tf_yarn._task_commons: Python 3.6.8 (default, Sep 26 2019, 11:57:09)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
INFO:tf_yarn._task_commons: Skein 0.8.0
INFO:tf_yarn._task_commons: TensorFlow v2.2.0-rc4-8-g2b96f3662b 2.2.0
I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: <hostname>
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: <hostname-removed>
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.56.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.56.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 440.56.0
I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2095074999 Hz

Neither nvidia-smi nor the YARN RM UI shows any processes on the GPU, so the CPU is being used for processing.
Any ideas or hints how to further debug and solve this issue?

Many thanks in advance!

@fhoering
Contributor

fhoering commented Oct 23, 2020

Does it use the GPU when you run your training code locally?

You can also try to set CUDA_VISIBLE_DEVICES to see if that changes anything.

run_on_yarn(
  env={"CUDA_VISIBLE_DEVICES": "0"}
)

I would also call list_physical_devices somewhere in your experiment function (or run it directly using interactive mode: https://github.com/criteo/cluster-pack/tree/master/examples/interactive-mode):

print(tf.config.list_physical_devices('GPU'))
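
Alongside that call, it can help to check whether the process can see the GPU at the operating-system level at all. The following is a minimal sketch, not part of tf-yarn or cluster-pack: `gpu_visibility_report` is a hypothetical helper, and it assumes a Linux node where the NVIDIA driver exposes device nodes such as /dev/nvidia0 and /dev/nvidiactl. It could be called from inside the experiment function and its output compared with a run directly on the node.

```python
import glob
import os


def gpu_visibility_report(device_glob="/dev/nvidia*"):
    """Collect basic facts about GPU visibility from inside the current process.

    Hypothetical helper for debugging; the default glob assumes the usual
    NVIDIA device node naming on Linux.
    """
    devices = sorted(glob.glob(device_glob))
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        # Device nodes that exist from this process's point of view.
        "device_nodes": devices,
        # Subset of those nodes this process may open for reading and writing.
        "accessible": [d for d in devices if os.access(d, os.R_OK | os.W_OK)],
    }


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If the device nodes show up on the node itself but are missing or inaccessible inside the YARN container, that would point to the container setup rather than to TensorFlow.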

@akimboyko

akimboyko commented Nov 2, 2020

Hello,

Could it be a mismatch between CUDA and TensorFlow versions? For example, CUDA 9.0 is installed but the TensorFlow version in use requires CUDA 10.0.

# some CUDA configured
Successfully opened dynamic library libcuda.so.1 
# however, no compatible device found
E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

@cguegi
Author

cguegi commented Nov 4, 2020

@fhoering as suggested I've used the interactive mode, unfortunately without success.
Below is the code I executed on a Hadoop cluster with Tensorflow 2.3.1.

Software installed on GPU Datanode:

  • Cuda version: 10.2.89
  • Nvidia driver version: 440.56
  • cuDNN version: 8.0.3.33
  • Python: 3.6.8
  • GCC version: 4.8.5

import tensorflow as tf
import os

def compute_intersection():
    print("TF version: " + tf.__version__)
    lib_dir = os.environ['LD_LIBRARY_PATH']
    print(f'lib directory: {lib_dir}')
    print('Cuda device: ', os.environ['CUDA_VISIBLE_DEVICES'])
    print("GPU: ", tf.config.list_physical_devices('GPU'))

import cluster_pack
package_path, _ = cluster_pack.upload_env()

from cluster_pack.skein import skein_config_builder
skein_config = skein_config_builder.build_with_func(
    func=compute_intersection,
    package_path=package_path
)

import skein
with skein.Client(log_level="DEBUG") as client:
    service = skein.Service(
        resources=skein.Resources("1 GiB", 1),
        files=skein_config.files,
        script=skein_config.script,
        env={
            "LD_LIBRARY_PATH": "/usr/local/cuda-10.2/lib64/:/usr/local/cuda-10.2/extras/CUPTI/lib64:/usr/local/cuda-10.2/targets/x86_64-linux/lib:/usr/lib64",
            "TF_CPP_MIN_LOG_LEVEL": "0",
            "CUDA_VISIBLE_DEVICES": "0",
            "PATH": "/usr/local/cuda-10.2/bin:$PATH"
        }
    )
    master = skein.Master(log_level="DEBUG")
    spec = skein.ApplicationSpec(
        services={"service": service},
        queue="ml-gpu",
        node_label="gpu",
        name="cuda-detection",
        master=master
    )
    app_id = client.submit(spec)

Yarn container log:

Container: container_e54_1603984700251_0019_01_000002 on <hostname>

LogAggregationType: AGGREGATED

============================================================================

LogType:service.log

LogLastModifiedTime:Wed Nov 04 15:26:19 +0100 2020

LogLength:1392

LogContents:

running ./venv.pex -m cluster_pack.skein._execute_fun function_7790ceef-e14d-4f33-a0ce-9fc46b3a5f08.dat INFO ..
2020-11-04 15:26:15.567171: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-04 15:26:19.601619: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-04 15:26:19.603672: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-11-04 15:26:19.603722: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: <hostname>
2020-11-04 15:26:19.603733: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: <hostname>
2020-11-04 15:26:19.603814: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.56.0
2020-11-04 15:26:19.603855: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.56.0
2020-11-04 15:26:19.603867: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 440.56.0

TF version: 2.3.1
lib directory: /usr/local/cuda-10.2/lib64/:/usr/local/cuda-10.2/extras/CUPTI/lib64:/usr/local/cuda-10.2/targets/x86_64-linux/lib:/usr/lib64
Cuda device:  0
GPU:  []

End of LogType:service.log
***************************************************************************

@cguegi
Author

cguegi commented Nov 4, 2020

Hi @akimboyko,
I don't know if this is a compatibility issue between Cuda and Tensorflow.
Cuda 10.2 is not mentioned in https://www.tensorflow.org/install/source#gpu. However, I've read that Cuda 10.2 is compatible with 10.1.
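
One way to see which CUDA library versions TensorFlow actually loads is to read them from the dso_loader lines in the container log. The sketch below is only an illustration: `loaded_cuda_libraries` and `library_version` are hypothetical helper names, and the pattern assumes the "Successfully opened dynamic library" wording seen in the logs in this thread.

```python
import re

# Matches the dso_loader message format seen in the TensorFlow logs above.
DSO_PATTERN = re.compile(r"Successfully opened dynamic library (\S+)")


def loaded_cuda_libraries(log_text):
    """Return the shared-library names TensorFlow reports having opened."""
    return DSO_PATTERN.findall(log_text)


def library_version(soname):
    """Return the version suffix after '.so.', or None if there is none."""
    _, sep, version = soname.partition(".so.")
    return version if sep else None


# Two lines copied from the YARN container log in this thread.
sample_log = """\
2020-11-04 15:26:15.567171: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-04 15:26:19.601619: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
"""

for name in loaded_cuda_libraries(sample_log):
    print(name, library_version(name))
```

In the container log, libcudart.so.10.1 is opened even though Cuda 10.2.89 is listed as installed, so a 10.1 runtime library is being found somewhere. Whether that is related to the cuInit failure is not established by the log alone.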

@cguegi
Author

cguegi commented Nov 4, 2020

I downloaded the pex file from HDFS and executed it on the Datanode.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64/
/tmp/venv.pex -c "import tensorflow as tf;tf.config.list_physical_devices('GPU')"

It works; the output is shown below.
Why doesn't it work with skein or with tf-yarn?

2020-11-04 16:47:19.215890: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-04 16:47:19.220831: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-04 16:47:19.221319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:02:01.0 name: GRID T4-8C computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 298.08GiB/s
2020-11-04 16:47:19.221559: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-04 16:47:19.223706: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-04 16:47:19.225975: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-04 16:47:19.226300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-04 16:47:19.228811: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-04 16:47:19.230039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-04 16:47:19.230248: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-04 16:47:19.230348: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-04 16:47:19.230853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-04 16:47:19.231250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
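
Since the same pex file finds the GPU when run directly on the Datanode but not inside the YARN container, one way to narrow things down is to capture the relevant environment in both places and compare. This is a rough sketch under assumptions: `env_snapshot` and `diff_snapshots` are hypothetical helpers, and the list of variables is a guess at what matters here, not an exhaustive set.

```python
import os

# Variables that plausibly influence CUDA discovery (an assumption, not a complete list).
KEYS = ("PATH", "LD_LIBRARY_PATH", "CUDA_VISIBLE_DEVICES")


def env_snapshot(keys=KEYS, environ=None):
    """Return the selected environment variables (None when a variable is unset)."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key) for key in keys}


def diff_snapshots(host, container):
    """Return {key: (host_value, container_value)} for every key that differs."""
    keys = sorted(set(host) | set(container))
    return {
        key: (host.get(key), container.get(key))
        for key in keys
        if host.get(key) != container.get(key)
    }


if __name__ == "__main__":
    # Print this on the node and inside the container, then compare the two outputs.
    print(env_snapshot())
```

If the two snapshots match and the GPU is still only visible on the host, the difference probably lies outside the environment variables, for example in how the container is launched.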

@fhoering
Contributor

fhoering commented Nov 23, 2020

@cguegi
Was this the issue you had, or is the issue here still different?
jcrist/skein#224
(It should only apply to a hadoop 3 cluster)


3 participants