Create pytorch-notebook docker image #315
Conversation
Another GPU-enabled docker image for running deep learning!
Thanks for this @weiji14! If you add new packages (torchgeo) you have to relock the environment. The easiest way to do this is to add a comment with `/condalock`.
Ah ok, I've actually run `/condalock`.
```yaml
channels:
- conda-forge
dependencies:
- cudatoolkit=11
```
In the ml-notebook image we pin `cudatoolkit=10`:

```yaml
- cudatoolkit=10
```
I'm guessing that if both of these images are offered on Pangeo JupyterHubs we want the CUDA versions to be the same. @yuvipanda, any recommendations for 10 vs 11 when it comes to node configuration?

Fine leaving it as 11 here and bumping the other image in a separate PR.
Yes, it comes down to what CUDA drivers the server is using (find out using `nvidia-smi`). I prefer `cudatoolkit=11` because it has forward and backward compatibility within a minor version (see https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title), but that's of no use if the CUDA drivers on the server don't support it.
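To make that compatibility check concrete, the driver version reported by `nvidia-smi` can be compared against the minimum driver each toolkit needs. A minimal sketch; the helper name is hypothetical, and the threshold values are taken from NVIDIA's compatibility table and should be double-checked for your exact toolkit build:

```python
# Hypothetical helper: decide whether a node's CUDA driver can run a given
# cudatoolkit. Minimum driver versions here cover only the two toolkits
# discussed in this thread (values from NVIDIA's compatibility table).
MIN_DRIVER = {"10.2": (440, 33), "11.0": (450, 36)}

def driver_supports(toolkit: str, driver_version: str) -> bool:
    """driver_version as reported by `nvidia-smi`, e.g. '450.80.02'."""
    major, minor, *_ = (int(part) for part in driver_version.split("."))
    return (major, minor) >= MIN_DRIVER[toolkit]

print(driver_supports("11.0", "450.80.02"))  # True
print(driver_supports("11.0", "440.33.01"))  # False: driver too old for CUDA 11
```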
I don't actually know enough to have an opinion here, but happy to stick to whatever upstream recommends, as long as we are consistent across the images :D If we need to upgrade the driver version on the clusters when we bump this, we can do that.
Ok. Just to confirm, I'm assuming that the clusters don't pull the latest pangeo-docker image, i.e. that they're pinned to a specific pangeo-docker-images version? It probably doesn't matter for this `pytorch-notebook` image since nobody will be using it yet, but I'd recommend updating the CUDA driver version before bumping the `cudatoolkit` version on ml/tensorflow-notebook, just speaking from experience working on an on-prem HPC 🙂
> clusters don't pull the latest pangeo-docker image?

Actually they do currently pull the latest :) https://github.com/2i2c-org/infrastructure/search?q=ml-notebook

But I say we go ahead and merge this; we can use explicit pins on the hub if need be going forward.
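For anyone wondering what a hub-side pin looks like: in a Zero to JupyterHub deployment the user image is set under `singleuser.image`, roughly like the sketch below (the image name and tag are placeholders for illustration, not actual releases):

```yaml
# Hypothetical z2jh values fragment: pin to a release tag instead of latest
singleuser:
  image:
    name: pangeo/pytorch-notebook
    tag: "2022.05.01"  # placeholder tag
```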
Wow, you all like to live dangerously 😆 So looking at 2i2c-org/infrastructure#1244, it seems like the GPUs are K80s, and according to https://forums.developer.nvidia.com/t/nvidia-tesla-k80-cuda-version-support/67676/6, CUDA 11 does work but support is deprecated (i.e. there might be lots of warnings). So let's test it out then!
We are not currently using ml-notebook in production in 2i2c. All production clusters use pinned tags.
Those clusters where this image turns up are not really launched yet.
One other point on `cudatoolkit` 10 vs 11 is that 11 brings along a few more gigabytes. I don't mind big images, but this one is currently approaching 10GB total uncompressed...
```shell
$ mamba create -n test --dry-run pytorch-gpu cudatoolkit=10
# -> Total download: 1 GB
$ mamba create -n test --dry-run pytorch-gpu cudatoolkit=11
# -> Total download: 3 GB
```
Yes, `cudatoolkit` is huge unfortunately. There are ways to trim down the docker image size as you mentioned in #188 (comment), but it's tough getting around a big binary... I think CUDA 11 will need to be used eventually though, so we'll need to figure out a solution 🙃
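One generic trick (not necessarily what pangeo-docker-images does today) is to clean the package cache in the same `RUN` layer that installs the environment, so the downloaded tarballs never get committed into a layer. A hypothetical Dockerfile fragment; the environment name and file path are illustrative:

```Dockerfile
# Install and clean in a single layer so the package cache is never committed
RUN mamba env update --name notebook --file environment.yml \
 && mamba clean --all --yes
```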
Co-authored-by: Wei Ji <[email protected]>
This is great @weiji14 - just one small nit, then LGTM.
tests/test_pytorch-notebook.py
```python
# cupy import fails unless on GPU-enabled node:
# 'cupy',  # libcuda.so.1: cannot open shared object file: No such file or directory
```
Let's remove this comment. It's just residue from old stuff in ml-image.
Could I add a suggestion as well? Because of the `mkl` and `nomkl` conflicts, this is sort of worrying... Try a test for the underlying BLAS to make sure things are set up correctly. Usually it should be MKL (it's pulling MKL from conda-forge):
```python
import torch

config = torch.__config__.show()  # expect BLAS_INFO=mkl in the build info
assert config[config.find("BLAS_INFO"):].startswith("BLAS_INFO=mkl")
```
Ok, done both in 3acef32. Let's see if the CI passes.
Yep, works like a charm: `tests/test_pytorch-notebook.py::test_torch_uses_mkl PASSED [100%]`
https://github.com/pangeo-data/pangeo-docker-images/runs/6202609085?check_suite_focus=true#step:7:41
List of packages to include (let me know of any other relevant ones):
Resolves #312