Training error: CUDA_ERROR_OUT_OF_MEMORY #25

Open
Getsatrt11 opened this issue Apr 29, 2019 · 12 comments

@Getsatrt11

I am training with my own dataset, without 5-fold cross validation.
Even though I set PATCH_SIZE = [24, 24, 128] and BATCH_SIZE = 1, training always reports the errors below. Could you please help me figure this out?
My GPU is a Tesla P100.

2019-04-29 11:34:02.983765: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 128.00M (134217728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.984872: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 115.20M (120796160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.985950: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 103.68M (108716544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.987029: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 93.31M (97844992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.988006: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 83.98M (88060672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.988884: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 75.58M (79254784 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.989733: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 68.02M (71329536 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.991272: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 128.00M (134217728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.991653: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 61.22M (64196608 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.993080: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 115.20M (120796160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.993504: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 55.10M (57777152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.994662: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 103.68M (108716544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.995871: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 93.31M (97844992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.996989: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 83.98M (88060672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.998042: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 75.58M (79254784 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.999098: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 68.02M (71329536 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.000601: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 61.22M (64196608 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.358495: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.358600: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 76.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-29 11:34:03.398668: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.398760: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 76.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-29 11:34:03.445785: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.445889: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 148.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

@kylinJo

kylinJo commented Apr 29, 2019 via email

@Getsatrt11
Author

tensorflow_gpu==1.8.0, cuda==9.1, cudnn==7, python==3.6.

@kylinJo

kylinJo commented May 1, 2019 via email

@kylinJo

kylinJo commented May 6, 2019 via email

@Getsatrt11
Author

Sorry, I haven't logged in to GitHub recently. My data is in nrrd format, so I modified some of the data-processing functions and then used train.py directly for training. The out-of-memory error was probably caused by other students using the same GPU (by the way, my GPU is a Tesla P100) while my data was loading (my data is really big). Now I set PATCH_SIZE = [128, 128, 128] and BATCH_SIZE = 1 during training, while my original data dimensions are [500, 500, 200]. What bothers me now is that the loss stays almost constant during training.
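
For reference, a minimal sketch of that kind of preprocessing, assuming the .nrrd volume is read with SimpleITK and a random [128, 128, 128] patch is cropped with NumPy; the file name, array layout, and crop logic are illustrative and may differ from the actual code in this repository:

import numpy as np
import SimpleITK as sitk

PATCH_SIZE = [128, 128, 128]  # as described above; trained with BATCH_SIZE = 1

# Hypothetical file name; the repo's own loader may differ.
image = sitk.ReadImage("volume.nrrd")
volume = sitk.GetArrayFromImage(image)  # e.g. shape (200, 500, 500) as (z, y, x)

def random_patch(vol, patch_size):
    # Pad if the volume is smaller than the patch, then crop a random patch.
    pad = [max(p - s, 0) for s, p in zip(vol.shape, patch_size)]
    vol = np.pad(vol, [(0, p) for p in pad], mode="constant")
    start = [np.random.randint(0, s - p + 1) for s, p in zip(vol.shape, patch_size)]
    slices = tuple(slice(st, st + p) for st, p in zip(start, patch_size))
    return vol[slices]

patch = random_patch(volume, PATCH_SIZE)  # one patch per training step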

@kylinJo

kylinJo commented May 6, 2019 via email

@Getsatrt11
Author

You can modify the following parameters:
ModelSaver(max_to_keep=10, keep_checkpoint_every_n_hours=0.5),
every_k_epochs=20)
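
That fragment looks like part of the callbacks list of a tensorpack TrainConfig (the TrainConfig and ModelSaver names suggest tensorpack). A minimal sketch of how it might fit together, assuming tensorpack's ModelSaver and PeriodicTrigger callbacks; model and dataflow are placeholders for the repo's own objects:

from tensorpack import TrainConfig
from tensorpack.callbacks import ModelSaver, PeriodicTrigger

def make_train_config(model, dataflow):
    # max_to_keep and keep_checkpoint_every_n_hours are passed through to the
    # underlying tf.train.Saver; every_k_epochs controls how often the wrapped
    # callback is actually triggered.
    return TrainConfig(
        model=model,
        dataflow=dataflow,
        callbacks=[
            PeriodicTrigger(
                ModelSaver(max_to_keep=10, keep_checkpoint_every_n_hours=0.5),
                every_k_epochs=20,
            ),
        ],
    )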

@kylinJo

kylinJo commented May 6, 2019 via email

@kylinJo

kylinJo commented May 9, 2019 via email

@wasd120d

I have the same issue. My GPU memory is only 16 GB, and the author mentioned his GPU is 60 GB (if I'm not wrong).

@gusleo

gusleo commented Sep 18, 2019

> I have the same issue. My GPU memory is only 16 GB, and the author mentioned his GPU is 60 GB (if I'm not wrong).

Have you resolved this issue? I only have an Nvidia Tesla K8 with 16 GB of memory.

@675492062

Maybe you can change the code as below:

import tensorflow as tf

# Reserve at most 70% of the GPU memory for this process,
# and let ops fall back to another device if needed.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
gpu_config = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)

cfg = TrainConfig(
    ...
    session_config=gpu_config,
    ...
)
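
A related TF 1.x option, if a fixed fraction is still too restrictive, is to let the allocator grow GPU memory on demand instead; this is a generic TensorFlow setting, not something taken from this repo:

import tensorflow as tf

# Alternative: grow GPU memory usage on demand instead of reserving a fixed fraction.
gpu_options = tf.GPUOptions(allow_growth=True)
gpu_config = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)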
