out of memory after several online eval iterations #8
Comments
I haven't met this problem before. 60 GB of memory usage sounds impossible to me. Can you try offline evaluation to see if the problem still exists? E.g.: python3 train.py --load /path/to/ckpt/ --evaluate ...
I changed to
Something went wrong; the log shows: [1119 19:40:30 @sessinit.py:117] Restoring checkpoint from train_log/unet3d/model-5 ...
The warning is expected; the variables global_step:0 and learning_rate:0 are only used in training mode.
I see. Thank you. Finally, I found a solution to avoid OOM:
Nice! It would be great if you submitted a pull request! Maybe other people are facing the same problem.
Hi @tkuanlun350, I'm now adapting your code to the LiTS challenge (for liver segmentation). The 3D CT volumes are much larger than BRATS, i.e. 512x512xn (n = 100~1000). In this case, I can't do online prediction with My major revision to your code
Below is my revised code:
@huangmozhilv Hi, I'm working on online prediction too. Did you solve this problem? Could you please give me some advice on it? Thanks.
@mini-Shark Yes. I found the reason resides in
@huangmozhilv Sad... Are there any methods to avoid this situation? Maybe this question is stupid, but I don't have time to rewrite the whole pipeline :(
I have no idea how to do it with tensorpack.
@huangmozhilv Anyway, thanks for your reply.
@mini-Shark Is your problem that you cannot do evaluation and training at the same time because of a memory bottleneck? You can try changing config.NO_CACHE = True to load data online. If that doesn't solve the problem, I think we can open a new issue for better discussion.
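For illustration, here is a minimal sketch of how a `NO_CACHE`-style switch usually changes a tensorpack dataflow; the `load_case` helper and the dummy shapes are hypothetical placeholders, not code from this repo:

```python
import numpy as np
from tensorpack.dataflow import DataFromGenerator, DataFromList

def load_case(case_id):
    # Hypothetical stand-in for the real loading/preprocessing of one case.
    image = np.zeros((4, 128, 128, 128), np.float32)
    label = np.zeros((128, 128, 128), np.uint8)
    return [image, label]

def get_train_dataflow(case_ids, no_cache=True):
    if no_cache:
        # Load each case on demand: only a handful of samples sit in RAM at once.
        ds = DataFromGenerator(lambda: (load_case(c) for c in case_ids))
    else:
        # Cache everything up front: faster epochs, but the whole preprocessed
        # dataset stays resident for the lifetime of training.
        ds = DataFromList([load_case(c) for c in case_ids], shuffle=True)
    return ds
```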
@tkuanlun350 It's a different problem. I think @mini-Shark should also set 'NO_CACHE = True'. The problem is that if the online evaluation takes a long time (e.g. we use half of the BRATS dataset for online evaluation), the training queue will get full, and preprocessed data from
@huangmozhilv Thanks! I will try to investigate the tensorpack source code to figure out a workaround.
@tkuanlun350 Thank you.
@huangmozhilv @tkuanlun350 Thanks for your help. I may have found a trade-off solution now: add an additional parameter to 'PrefetchDataZMQ' when defining 'get_train_dataflow()'. 'PrefetchDataZMQ(ds, nr_proc=1, hwm=50)' has a default 'hwm=50' parameter, which controls the queue size of the dataflow; I changed it to 'hwm=2'. I also modified 'get_eval_dataflow()' so that it doesn't load all the validation data at once. I'm not sure this will work properly, but it doesn't raise OOM anymore (I have 64GB of memory).
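For reference, a minimal sketch of the tweak described above, assuming the training dataflow is wrapped with tensorpack's `PrefetchDataZMQ`; the `FakeData` dataflow here is just a self-contained stand-in for whatever `get_train_dataflow()` actually builds:

```python
from tensorpack.dataflow import FakeData, PrefetchDataZMQ

def prefetch_with_small_queue(ds, hwm=2):
    # hwm ("high-water mark") bounds how many datapoints the prefetch process may
    # buffer; the default of 50 can hold tens of GB of preprocessed 3D volumes,
    # so a small value keeps the queue from ballooning while eval is running.
    return PrefetchDataZMQ(ds, nr_proc=1, hwm=hwm)

# Example with a dummy dataflow standing in for the real training dataflow:
ds = FakeData([[128, 128, 128, 4], [128, 128, 128]], size=100)
ds = prefetch_with_small_queue(ds, hwm=2)
```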
The training stage runs well, consuming about 10 GB of CPU memory. However, memory usage increases quickly once online evaluation (called by EvalCallback) starts, and reaches 60 GB after several eval iterations. Did others observe the same problem? How did you solve it?
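For anyone trying to pin down where the growth happens, a small diagnostic sketch (not part of this repo) that logs the training process's resident memory around each online-eval pass, using psutil:

```python
import psutil

def log_rss(tag=""):
    # Resident set size of the current process, in GB.
    rss_gb = psutil.Process().memory_info().rss / 1024 ** 3
    print("[mem] {}: {:.1f} GB resident".format(tag, rss_gb))

# Hypothetical usage around the eval loop driven by EvalCallback:
# log_rss("before online eval")
# ...run evaluation...
# log_rss("after online eval")
```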