ML image update #188

Closed
dhruvbalwada opened this issue Feb 12, 2021 · 10 comments

Comments

@dhruvbalwada
Member

dhruvbalwada commented Feb 12, 2021

I was talking to @scottyhq about using the ML image over here and having pytorch preloaded. I know @rabernat has asked about this before (#179).

We were wondering who is using the ML image, and what requirements they might have? @nbren12 @jhamman
It seems like the usage for the ML image is low based on the pulls here: https://github.com/pangeo-data/pangeo-docker-images.

Since pytorch and tensorflow are two of the big candidates (and are maybe usually used independently), @scottyhq suggested having a pangeo-pytorch and a pangeo-tensorflow image.

Any other thoughts that people have?

@rabernat
Member

Correct, we are not using them much right now. However, there are several projects spinning up now that will require ML Pangeo images, so it's a good time to think about this.

IMO, before creating more images, we need to make a plan to address how to maintain these images sustainably going forward. Within a month or so we should have a dedicated, full-time Pangeo engineer at 2i2c, and that person should be able to help out with this.

@nbren12

nbren12 commented Feb 12, 2021

I don’t use these images.

My $0.02: the many-images problem is a symptom of Docker not being a package manager. Dockerfiles are a linear sequence of commands, while packages form a dependency graph. It will always be hard to map docker images onto the packages people want.

Maintaining multiple images is painful. Honestly, for scientific workflows with GB/TB scale datasets, “light” containers don’t seem worth the trouble. If you can get away with it, I suggest 1 mega docker image (you need to pin all package versions or it will constantly break), or leveraging a tool like repo2docker if you need multiple images. You can also, e.g., have packages installed when a user starts a container, like the dask image does.
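
To make the pinning point concrete, here is a minimal Python sketch that flags dependency strings which are not locked to one exact version. The helper name `unpinned_specs` and the example specs are hypothetical and not taken from this repo. It assumes conda-style match specs, where `==` is an exact pin and a single `=` is a fuzzy match.

```python
import re

# Assumption: an exact pin looks like "name==version" or "name==version=build".
# A single "=" (e.g. "numpy=1.19") is treated as fuzzy, so it is NOT a pin here.
_EXACT_PIN = re.compile(
    r"^[A-Za-z0-9_.\-]+==[0-9][A-Za-z0-9_.!+]*(=[A-Za-z0-9_.]+)?$"
)


def unpinned_specs(specs):
    """Return the dependency strings that are not pinned to one exact version."""
    return [s for s in specs if not _EXACT_PIN.match(s.strip())]


# Hypothetical environment listing, for illustration only.
example_specs = ["xarray==0.16.2", "dask>=2021.1", "pytorch", "numpy=1.19"]
loose = unpinned_specs(example_specs)
print(loose)
```

A check like this could run in CI so that a loose spec is caught before an image is rebuilt, although a lock file generated by a solver covers transitive dependencies as well, which this sketch does not.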

@nbren12

nbren12 commented Feb 12, 2021

It looks like this repo already uses repo2docker, so maybe the tooling is good enough to support many images 🤷. Maybe pin the “FROM image” statements as well, to keep things more reproducible.
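
One way to read the suggestion about pinning base images is to reference them by digest rather than by a movable tag. The following is a rough Python sketch that lists `FROM` references without an `@sha256:` digest. The function name `unpinned_from_lines`, the `example/pinned` image, and the all-zero digest are hypothetical; `pangeo/base-image:master` is the tag used in this thread's build commands, and whether a Dockerfile here references it in exactly this form is an assumption.

```python
def unpinned_from_lines(dockerfile_text):
    """Return FROM image references that lack an immutable @sha256 digest."""
    unpinned = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "FROM":
            # Skip optional flags such as --platform=linux/amd64.
            image = next((p for p in parts[1:] if not p.startswith("--")), None)
            # "scratch" is the empty base image and has nothing to pin.
            if image and image != "scratch" and "@sha256:" not in image:
                unpinned.append(image)
    return unpinned


# Hypothetical Dockerfile content; the digest is a placeholder, not a real image.
example_dockerfile = "\n".join(
    [
        "FROM pangeo/base-image:master",
        "FROM example/pinned@sha256:" + "0" * 64,
    ]
)
print(unpinned_from_lines(example_dockerfile))
```

This sketch does not distinguish references to earlier build stages (`FROM builder`) from real images, so a multi-stage Dockerfile would need an extra check.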

@scottyhq
Member

scottyhq commented Feb 16, 2021

@nbren12 thanks for the comments. This repo is a bit confusing to understand. Despite the tags, images are in theory reproducible, because we use conda-lock to pre-solve the environment that is added to the docker image. So, for example, to recreate an image from the past:

git clone https://github.com/pangeo-data/pangeo-docker-images.git
cd pangeo-docker-images
git checkout 2020.09.30
docker build -t pangeo/base-image:master base-image
docker build -t pangeo/ml-notebook:2020.09.30 ml-notebook

GPU-enabled ML packages are hard to cram into the same conda environment though in our attempts so far, which is why perhaps it's best to pick either tensorflow or pytorch. Preferably we have someone actively using the image responsible for curating the packages. Not sure who that would be these days?

> It will always be hard to map docker images onto the packages people want.

Couldn't agree more. Although we've gotten a lot of mileage out of people using a common environment on pangeo hubs. For long term sustainability though, someone will need to tackle allowing users to customize their environment: #148

@nbren12

nbren12 commented Feb 16, 2021

Ah yes. I see the lock files now.

> GPU-enabled ML packages are hard to cram into the same conda environment though in our attempts so far

Interesting. What's the main barrier? Package versions not resolving?

@scottyhq
Member

> Interesting. What's the main barrier? Package versions not resolving?

Yeah. For example, trying to add pytorch-gpu and jax in #179: https://github.com/pangeo-data/pangeo-docker-images/runs/1712185623?check_suite_focus=true

It seems like the general guidance is not to mix conda channels (ideally everything comes from conda-forge with the 'strict' channel priority setting). But to get the GPU-enabled packages we've had to relax that setting (https://github.com/pangeo-data/pangeo-docker-images/blob/master/ml-notebook/condarc.yml) and point to packages on specific channels:

#rapidsai-nightly/linux-64::cuspatial
#rapidsai-nightly/linux-64::cudf
- conda-forge/linux-64::cupy
- pkgs/main/linux-64::tensorflow-gpu>=2

@nbren12

nbren12 commented Feb 17, 2021

Good to know. This topic provokes so much in me---I've spent a lot of time maintaining developer environments. I've been interested in a package manager called nix which is basically a more composable docker. I hope it picks up steam in the next few years.

@rabernat
Member

For some context, I will share the amazing blog post Noah recently published on this topic! https://www.noahbrenowitz.com/post/2021-version-pinning/

It's a hard problem, but one we should keep plugging away at. We don't have a perfect solution yet, but we have made good progress!

@scottyhq
Member

scottyhq commented Mar 2, 2021

Love the post, @nbren12! This one is also worth checking out for tips on reducing image size: https://uwekorn.com/2021/03/01/deploying-conda-environments-in-docker-how-to-do-it-right.html

@weiji14
Member

weiji14 commented Sep 26, 2023

Closing this as we've added a pytorch-notebook image in #315. See also discussion at #457 on optimizing the ml-notebook (tensorflow) and pytorch-notebook images further for GPU-accelerated workflows.

@weiji14 weiji14 closed this as completed Sep 26, 2023