Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create pytorch-notebook docker image #315

Merged
merged 8 commits into from
Apr 27, 2022
Merged

Create pytorch-notebook docker image #315

merged 8 commits into from
Apr 27, 2022

Conversation

weiji14
Copy link
Member

@weiji14 weiji14 commented Apr 27, 2022

Another GPU-enabled docker image for running deep learning!

List of packages to include (let me know of any other relevant ones);

  • pytorch
  • torchgeo
  • pytorch-lightning
  • etc

Resolves #312

Another GPU-enabled docker image for running deep learning!
@github-actions
Copy link
Contributor

Binder 👈 Try on Mybinder.org!
Binder 👈 Try on Pangeo GCP Binder!
Binder 👈 Try on Pangeo AWS Binder!

@pangeo-bot
Copy link
Collaborator

/condalock
Automatically locking new conda environment, building, and testing images...

@weiji14 weiji14 mentioned this pull request Apr 27, 2022
@weiji14 weiji14 changed the title Create pytorch-notebook Create pytorch-notebook docker image Apr 27, 2022
@scottyhq
Copy link
Member

Thanks for this @weiji14 ! If you add new packages (torchgeo) you have to relock the environment. The easiest way to do this is add a comment with /condalock on the first line.

@rabernat
Copy link
Member

/condalock

@weiji14
Copy link
Member Author

weiji14 commented Apr 27, 2022

Thanks for this @weiji14 ! If you add new packages (torchgeo) you have to relock the environment. The easiest way to do this is add a comment with /condalock on the first line.

Ah ok, I've actually ran conda-lock locally already so there might not be stuff to update. But there might be a few other packages we might add so will do that later 😄

channels:
- conda-forge
dependencies:
- cudatoolkit=11
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the ml-notebook image we pin cudatoolkit=10,

I'm guessing if both these images are offered on pangeo jupyterhubs we want cuda versions to be the same, @yuvipanda any recommendations for 10 vs 11 when it comes to node configuration?

Fine leaving it as 11 here and bumping the other image in a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it comes down to what CUDA drivers the server is using (find out using nvidia-smi). I prefer cudatoolkit=11 because it has foward and backward compatibility within a minor version (see https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title), but that's of no use if the cuda drivers on the server don't support it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually know enough to have an opinion here, but happy to stick to whatever upstream recommends and as long as we are consistent across the images :D If we need to upgrade the driver version on the clusters when we bump this, can do that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Just to confirm, I'm assuming that the clusters don't pull the latest pangeo-docker image? I.e. that they're pinned to a specific pangeo-docker-image version? Probably doesn't matter for this pytorch-notebook image since nobody will be using it yet, but I'd recommend updating the CUDA driver version before bumping the cudatoolkit version on ml/tensorflow-notebook, just speaking from experience working on an on-prem HPC 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusters don't pull the latest pangeo-docker image?

Actually they do currently pull the latest :) https://github.com/2i2c-org/infrastructure/search?q=ml-notebook
But I say we go ahead and merge this, and can use explicit pins on the hub if need be going forward

Copy link
Member Author

@weiji14 weiji14 Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, you all like to live dangerously 😆 So looking at 2i2c-org/infrastructure#1244, it seems like the GPUs are K80s, and according to https://forums.developer.nvidia.com/t/nvidia-tesla-k80-cuda-version-support/67676/6, CUDA 11 does work but support is deprecated (i.e. there might be lots of warnings). So let's test it out then!

Copy link
Member

@rabernat rabernat Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not currently using ml-notebook in production in 2i2c. All production clusters use pinned tags.

Those clusters where this image turns up are not really launched yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other point on cudatoolkit 10 vs 11 is that 11 brings along a few more gigabytes. I don't mind big images, but this one is currently approaching 10GB total uncompressed...

mamba create -n test --dry-run pytorch-gpu cudatoolkit=10 -> Total download: 1 GB
mamba create -n test --dry-run pytorch-gpu cudatoolkit=11 -> Total download: 3 GB

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes cudatoolkit is huge unfortunately. There are ways to trim down the docker image size as you mentioned in #188 (comment), but it's tough getting around a big binary... I think CUDA 11 will need to be used eventually though, so will need to figure out a solution 🙃

README.md Outdated Show resolved Hide resolved
Co-authored-by: Wei Ji <[email protected]>
@weiji14 weiji14 marked this pull request as ready for review April 27, 2022 22:22
Copy link
Member

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @weiji14 - just one small nit, then LGTM.

Comment on lines 13 to 14
# cupy import fails unless on GPU-enabled node:
#'cupy', #libcuda.so.1: cannot open shared object file: No such file or directory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# cupy import fails unless on GPU-enabled node:
#'cupy', #libcuda.so.1: cannot open shared object file: No such file or directory

Let's remove this comment. It's just residue from old stuff in ml-image.

Copy link
Contributor

@ngam ngam Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I add a suggestion as well? Because of the mkl and nomkl conflicts, this is sort of worrying...

Try a test for the underlying blas to make sure things are set up correctly. Usually it should be MKL (it's pulling MKL from conda-forge):

import torch
assert torch.__config__.show()[torch.__config__.show().find('BLAS_INFO')+10:torch.__config__.show().find('BLAS_INFO')+13] == "mkl"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done both in 3acef32. Let's see if the CI passes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, works like a charm: tests/test_pytorch-notebook.py::test_torch_uses_mkl PASSED [100%] https://github.com/pangeo-data/pangeo-docker-images/runs/6202609085?check_suite_focus=true#step:7:41

@scottyhq
Copy link
Member

Thanks @weiji14 for putting this together and @ngam @rabernat for the reviews, going to merge so that it'll be easier for people to test drive

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add pytorch to ML image (or create separate image)
6 participants