Training error: CUDA_ERROR_OUT_OF_MEMORY #25

Open
Getsatrt11 opened this issue Apr 29, 2019 · 12 comments

@Getsatrt11

I am training with my own dataset, without 5-fold cross validation.
Even though I set PATCH_SIZE = [24, 24, 128] and BATCH_SIZE = 1, training always reports the errors below. Could you please help me figure this out?
My GPU is a Tesla P100.

2019-04-29 11:34:02.983765: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 128.00M (134217728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.984872: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 115.20M (120796160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.985950: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 103.68M (108716544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.987029: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 93.31M (97844992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.988006: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 83.98M (88060672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.988884: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 75.58M (79254784 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.989733: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 68.02M (71329536 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.991272: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 128.00M (134217728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.991653: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 61.22M (64196608 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.993080: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 115.20M (120796160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.993504: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 55.10M (57777152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.994662: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 103.68M (108716544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.995871: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 93.31M (97844992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.996989: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 83.98M (88060672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.998042: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 75.58M (79254784 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.999098: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 68.02M (71329536 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.000601: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 61.22M (64196608 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.358495: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.358600: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 76.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-29 11:34:03.398668: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.398760: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 76.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-29 11:34:03.445785: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.445889: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 148.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

@kylinJo

kylinJo commented Apr 29, 2019 via email

@Getsatrt11
Author

tensorflow_gpu==1.8.0, cuda==9.1, cudnn==7, python==3.6.

@kylinJo

kylinJo commented May 1, 2019 via email

@kylinJo

kylinJo commented May 6, 2019 via email

@Getsatrt11
Author

Sorry, I haven't logged in to GitHub recently. My data is in nrrd format, so I modified some of the data-processing functions and then used train.py directly for training. The out-of-memory error was probably caused by other students using the same GPU (by the way, my GPU is a Tesla P100) while my data was loading (my data is really big). Now I set PATCH_SIZE = [128, 128, 128] and BATCH_SIZE = 1 during training, while my original data dimensions are [500, 500, 200]. What bothers me now is that the loss stays almost constant during training.
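
For reference, a minimal sketch of that kind of preprocessing, assuming the .nrrd volume is read with SimpleITK and a random [128, 128, 128] patch is cropped with NumPy; the file name, array layout, and crop logic are illustrative and may differ from the actual code in this repository:

import numpy as np
import SimpleITK as sitk

PATCH_SIZE = [128, 128, 128]  # as described above; trained with BATCH_SIZE = 1

# Hypothetical file name; the repo's own loader may differ.
image = sitk.ReadImage("volume.nrrd")
volume = sitk.GetArrayFromImage(image)  # e.g. shape (200, 500, 500) as (z, y, x)

def random_patch(vol, patch_size):
    # Pad if the volume is smaller than the patch, then crop a random patch.
    pad = [max(p - s, 0) for s, p in zip(vol.shape, patch_size)]
    vol = np.pad(vol, [(0, p) for p in pad], mode="constant")
    start = [np.random.randint(0, s - p + 1) for s, p in zip(vol.shape, patch_size)]
    slices = tuple(slice(st, st + p) for st, p in zip(start, patch_size))
    return vol[slices]

patch = random_patch(volume, PATCH_SIZE)  # one patch per training step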

@kylinJo

kylinJo commented May 6, 2019 via email

@Getsatrt11
Author

You can modify the following parameters:
ModelSaver(max_to_keep=10, keep_checkpoint_every_n_hours=0.5),
every_k_epochs=20)
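
That fragment looks like part of the callbacks list of a tensorpack TrainConfig (the TrainConfig and ModelSaver names suggest tensorpack). A minimal sketch of how it might fit together, assuming tensorpack's ModelSaver and PeriodicTrigger callbacks; model and dataflow are placeholders for the repo's own objects:

from tensorpack import TrainConfig
from tensorpack.callbacks import ModelSaver, PeriodicTrigger

def make_train_config(model, dataflow):
    # max_to_keep and keep_checkpoint_every_n_hours are passed through to the
    # underlying tf.train.Saver; every_k_epochs controls how often the wrapped
    # callback is actually triggered.
    return TrainConfig(
        model=model,
        dataflow=dataflow,
        callbacks=[
            PeriodicTrigger(
                ModelSaver(max_to_keep=10, keep_checkpoint_every_n_hours=0.5),
                every_k_epochs=20,
            ),
        ],
    )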

@kylinJo

kylinJo commented May 6, 2019 via email

@kylinJo

kylinJo commented May 9, 2019 via email

@wasd120d

I have the same issue. My GPU memory is only 16 GB, and the author mentioned his GPU is 60 GB (if I'm not wrong).

@gusleo

gusleo commented Sep 18, 2019

> I have the same issue. My GPU memory is only 16 GB, and the author mentioned his GPU is 60 GB (if I'm not wrong).

Have you resolved this issue? I only have an Nvidia Tesla K8 with 16 GB of memory.

@675492062

Maybe you can change the code as below:

import tensorflow as tf

# Reserve at most 70% of the GPU memory for this process,
# and let ops fall back to another device if needed.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
gpu_config = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)

cfg = TrainConfig(
    ...
    session_config=gpu_config,
    ...
)
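
A related TF 1.x option, if a fixed fraction is still too restrictive, is to let the allocator grow GPU memory on demand instead; this is a generic TensorFlow setting, not something taken from this repo:

import tensorflow as tf

# Alternative: grow GPU memory usage on demand instead of reserving a fixed fraction.
gpu_options = tf.GPUOptions(allow_growth=True)
gpu_config = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)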
