Training error: CUDA_ERROR_OUT_OF_MEMORY #25
I train with my own data set and don't use 5-fold cross-validation.
Although I set PATCH_SIZE = [24, 24, 128] and BATCH_SIZE = 1, it always reports the following errors. Could you please help me figure it out?
My GPU is a Tesla P100.
2019-04-29 11:34:02.983765: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 128.00M (134217728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.984872: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 115.20M (120796160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.985950: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 103.68M (108716544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.987029: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 93.31M (97844992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.988006: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 83.98M (88060672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.988884: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 75.58M (79254784 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.989733: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 68.02M (71329536 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.991272: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 128.00M (134217728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.991653: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 61.22M (64196608 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.993080: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 115.20M (120796160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.993504: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 55.10M (57777152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.994662: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 103.68M (108716544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.995871: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 93.31M (97844992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.996989: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 83.98M (88060672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.998042: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 75.58M (79254784 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.999098: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 68.02M (71329536 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.000601: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 61.22M (64196608 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.358495: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.358600: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 76.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-29 11:34:03.398668: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.398760: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 76.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-29 11:34:03.445785: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.445889: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 148.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Comments
What is your version of CUDA, cuDNN, and tensorflow-gpu?
tensorflow_gpu==1.8.0, cuda==9.1, cudnn==7, python==3.6.
What is your GPU? Perhaps it doesn't have enough memory to handle this. And what steps do you follow to use the files in 3DUnet-Tensorflow-Brats18-master? Do you run train.py directly?
Please give me a reply!
Sorry, I haven't logged in to GitHub recently. My data is in nrrd format, so I modified some of the data-processing functions, and then used train.py directly for training. The out-of-memory error was probably caused by other students using the same GPU (a Tesla P100, by the way) while my data was loading (my data is quite large). Now I set PATCH_SIZE = [128, 128, 128] and BATCH_SIZE = 1 during training, while my original data dimensions are [500, 500, 200]. What troubles me now is that the loss stays almost constant during training.
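For reference, patch-based training like this crops a fixed-size sub-volume out of each scan before it reaches the network, which is why PATCH_SIZE rather than the full [500, 500, 200] volume determines GPU memory use. Below is a minimal NumPy sketch of such a random crop; the random_patch helper is purely illustrative and is not the repository's actual patch sampler.

import numpy as np

def random_patch(volume, patch_size=(128, 128, 128)):
    # Illustrative helper: crop a random patch of shape patch_size
    # from a 3D volume stored as a NumPy array.
    assert all(v >= p for v, p in zip(volume.shape, patch_size)), \
        "volume is smaller than the requested patch size"
    starts = [np.random.randint(0, v - p + 1)
              for v, p in zip(volume.shape, patch_size)]
    return volume[tuple(slice(s, s + p) for s, p in zip(starts, patch_size))]

# Example with the dimensions mentioned above: a [500, 500, 200] scan
# cropped to a [128, 128, 128] training patch.
volume = np.zeros((500, 500, 200), dtype=np.float32)
patch = random_patch(volume)
print(patch.shape)  # (128, 128, 128)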
Do you have a model_30000 file? When I run the author's command, it doesn't produce model_30000.
You can modify the following parameters:
ModelSaver(max_to_keep=10, keep_checkpoint_every_n_hours=0.5),
every_k_epochs=20)
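For context, checkpoints only appear once the saver actually runs, and TensorFlow's own tf.train.Saver exposes arguments with exactly these names. The sketch below is an illustration of what max_to_keep and keep_checkpoint_every_n_hours control at the plain TF 1.x level; it is not the repository's tensorpack-based train.py, which goes through the ModelSaver callback instead.

import os
import tensorflow as tf

step = tf.Variable(0, name="global_step", trainable=False)
saver = tf.train.Saver(
    max_to_keep=10,                     # keep at most the 10 newest checkpoints
    keep_checkpoint_every_n_hours=0.5,  # additionally keep one every 30 minutes
)

os.makedirs("train_log", exist_ok=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Writes train_log/model-30000.index, .meta and .data-* files.
    # If training stops before the saving step is ever reached, no such
    # checkpoint files are produced, which would explain a missing model_30000.
    saver.save(sess, os.path.join("train_log", "model"), global_step=30000)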
I still have these three files; I don't have a 'model' file.
Did you succeed in running it?
I have the same issue. My GPU memory is only 16 GB, and the author mentioned his GPU has 60 GB (if I'm not mistaken).
Has this issue been resolved? I only have an Nvidia Tesla K8 with 16 GB of memory.
Maybe you can change the code, like below:
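As a hypothetical illustration of the kind of change usually meant here (not the commenter's actual snippet): in TensorFlow 1.x the common levers are to let the session allocate GPU memory on demand rather than reserving the whole card up front, and to shrink PATCH_SIZE and BATCH_SIZE in the project's configuration.

import tensorflow as tf

# Hypothetical sketch only: session options that often help with
# CUDA_ERROR_OUT_OF_MEMORY on a shared or smaller GPU.
gpu_options = tf.GPUOptions(
    allow_growth=True,                      # grow the allocation as needed
    # per_process_gpu_memory_fraction=0.9,  # or cap the fraction of GPU memory used
)
sess_config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=sess_config) as sess:
    pass  # build and run the training graph with this session config

# The other lever discussed in this thread is the training configuration
# itself, e.g. a smaller PATCH_SIZE such as [64, 64, 64] with BATCH_SIZE = 1,
# which directly reduces activation memory.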