Create pytorch-notebook docker image #315
Conversation
Another GPU-enabled docker image for running deep learning!
Thanks for this @weiji14! If you add new packages (torchgeo) you have to relock the environment. The easiest way to do this is to add a comment with `/condalock`.
Ah ok, I've actually run `/condalock`.
```yaml
channels:
- conda-forge
dependencies:
- cudatoolkit=11
```
In the ml-notebook image we pin `cudatoolkit=10`:

```yaml
- cudatoolkit=10
```
I'm guessing that if both of these images are offered on Pangeo JupyterHubs we want the CUDA versions to be the same. @yuvipanda, any recommendations for 10 vs 11 when it comes to node configuration?

Fine leaving it as 11 here and bumping the other image in a separate PR.
Yes, it comes down to what CUDA drivers the server is using (find out using `nvidia-smi`). I prefer `cudatoolkit=11` because it has forward and backward compatibility within a minor version (see https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title), but that's of no use if the CUDA drivers on the server don't support it.
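To make that compatibility check concrete, the driver version reported by `nvidia-smi` can be compared against the minimum driver each toolkit needs. A minimal sketch; the helper name is hypothetical, and the threshold values are taken from NVIDIA's compatibility table and should be double-checked for your exact toolkit build:

```python
# Hypothetical helper: decide whether a node's CUDA driver can run a given
# cudatoolkit. Minimum driver versions here cover only the two toolkits
# discussed in this thread (values from NVIDIA's compatibility table).
MIN_DRIVER = {"10.2": (440, 33), "11.0": (450, 36)}

def driver_supports(toolkit: str, driver_version: str) -> bool:
    """driver_version as reported by `nvidia-smi`, e.g. '450.80.02'."""
    major, minor, *_ = (int(part) for part in driver_version.split("."))
    return (major, minor) >= MIN_DRIVER[toolkit]

print(driver_supports("11.0", "450.80.02"))  # True
print(driver_supports("11.0", "440.33.01"))  # False: driver too old for CUDA 11
```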
I don't actually know enough to have an opinion here, but happy to stick to whatever upstream recommends, as long as we are consistent across the images :D If we need to upgrade the driver version on the clusters when we bump this, we can do that.
Ok. Just to confirm, I'm assuming that the clusters don't pull the latest pangeo-docker image, i.e. that they're pinned to a specific pangeo-docker-images version? It probably doesn't matter for this `pytorch-notebook` image since nobody will be using it yet, but I'd recommend updating the CUDA driver version before bumping the `cudatoolkit` version on ml/tensorflow-notebook, just speaking from experience working on an on-prem HPC 🙂
> clusters don't pull the latest pangeo-docker image?

Actually they do currently pull the latest :) https://github.com/2i2c-org/infrastructure/search?q=ml-notebook

But I say we go ahead and merge this; we can use explicit pins on the hub if need be going forward.
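For anyone wondering what a hub-side pin looks like: in a Zero to JupyterHub deployment the user image is set under `singleuser.image`, roughly like the sketch below (the image name and tag are placeholders for illustration, not actual releases):

```yaml
# Hypothetical z2jh values fragment: pin to a release tag instead of latest
singleuser:
  image:
    name: pangeo/pytorch-notebook
    tag: "2022.05.01"  # placeholder tag
```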
Wow, you all like to live dangerously 😆 So looking at 2i2c-org/infrastructure#1244, it seems like the GPUs are K80s, and according to https://forums.developer.nvidia.com/t/nvidia-tesla-k80-cuda-version-support/67676/6, CUDA 11 does work but support is deprecated (i.e. there might be lots of warnings). So let's test it out then!
We are not currently using ml-notebook in production in 2i2c. All production clusters use pinned tags.
Those clusters where this image turns up are not really launched yet.
One other point on `cudatoolkit` 10 vs 11 is that 11 brings along a few more gigabytes. I don't mind big images, but this one is currently approaching 10GB total uncompressed...
```shell
$ mamba create -n test --dry-run pytorch-gpu cudatoolkit=10
# -> Total download: 1 GB
$ mamba create -n test --dry-run pytorch-gpu cudatoolkit=11
# -> Total download: 3 GB
```
Yes, `cudatoolkit` is huge unfortunately. There are ways to trim down the docker image size as you mentioned in #188 (comment), but it's tough getting around a big binary... I think CUDA 11 will need to be used eventually though, so we'll need to figure out a solution 🙃
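One generic trick (not necessarily what pangeo-docker-images does today) is to clean the package cache in the same `RUN` layer that installs the environment, so the downloaded tarballs never get committed into a layer. A hypothetical Dockerfile fragment; the environment name and file path are illustrative:

```Dockerfile
# Install and clean in a single layer so the package cache is never committed
RUN mamba env update --name notebook --file environment.yml \
 && mamba clean --all --yes
```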
Co-authored-by: Wei Ji <[email protected]>
This is great @weiji14 - just one small nit, then LGTM.
tests/test_pytorch-notebook.py
```python
# cupy import fails unless on GPU-enabled node:
# 'cupy',  # libcuda.so.1: cannot open shared object file: No such file or directory
```
Let's remove this comment. It's just residue from old stuff in ml-image.
Could I add a suggestion as well? Because of the `mkl` and `nomkl` conflicts, this is sort of worrying... Try a test for the underlying BLAS to make sure things are set up correctly. Usually it should be MKL (it's pulling MKL from conda-forge):
```python
import torch

config = torch.__config__.show()  # expect BLAS_INFO=mkl in the build info
assert config[config.find("BLAS_INFO"):].startswith("BLAS_INFO=mkl")
```
Ok, done both in 3acef32. Let's see if the CI passes.
Yep, works like a charm: `tests/test_pytorch-notebook.py::test_torch_uses_mkl PASSED [100%]`
https://github.com/pangeo-data/pangeo-docker-images/runs/6202609085?check_suite_focus=true#step:7:41
List of packages to include (let me know of any other relevant ones):
Resolves #312