Help mamba decide on CUDA variant #162

Open
wolfv opened this issue Nov 4, 2021 · 27 comments

Comments

wolfv (Member) commented Nov 4, 2021

Mamba currently can't properly decide which tensorflow CUDA package to prefer, since the tensorflow package doesn't directly depend on cudatoolkit.

Only the tensorflow-base package has a cudatoolkit dependency.

However, tensorflow-base is connected to tensorflow via an exact dependency.

If we also added cudatoolkit to the tensorflow package, mamba would be able to select the best version.

Alternatively, we could remove the tensorflow variants completely and just have a single metapackage that depends on tensorflow-base {{ VERSION }} (but not exactly pinned). Then mamba could decide the tensorflow-base version more freely.
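A minimal sketch of what that metapackage option could look like (hypothetical `outputs` section; names, version jinja, and pins are illustrative, not this feedstock's actual recipe):

```yaml
# Hypothetical sketch: `tensorflow` pins tensorflow-base by version only,
# not by exact build string, so the solver is free to pick whichever
# cudatoolkit variant of tensorflow-base it prefers.
outputs:
  - name: tensorflow-base
    requirements:
      host:
        - cudatoolkit {{ cudatoolkit }}   # variant-pinned, e.g. 11.2
  - name: tensorflow
    requirements:
      run:
        - tensorflow-base {{ version }}   # loose pin: any build of this version
```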

wolfv (Member Author) commented Nov 4, 2021

What currently happens is that mamba "randomly" picks CUDA 11.1 (even though CUDA 11.2 would probably be preferred).

183amir commented Nov 4, 2021

I don't think the issue is here. Mamba should be able to handle this instead.

h-vetinari (Member) commented:

I don't think the issue is here. Mamba should be able to handle this instead.

As long as we're not compromising on the quality of the packaging (i.e. everything works as it should), I don't see why we shouldn't accommodate mamba here. Yes, it would be nicer if it were fixed in mamba, but mamba is used more and more (not least in conda-forge CI), so knowingly keeping things suboptimal needs a commensurate benefit to justify it (especially as the workaround is really not painful), IMO.

wolfv (Member Author) commented Nov 5, 2021

@183amir the problem in this case is that we have a variant package (tensorflow) that restricts the cuda version of downstream packages.
But we have no way of choosing which tensorflow package is the best without doing a global optimization.

We could change this in the following ways:

  • have only a tensorflow pyXX meta-package (without CUDA, or even without Python!); the tensorflow package would then depend on tensorflow-base 2.6 (pyXX) and variant selection would work fine
  • add a direct dependency on the same cudatoolkit as in the build string; then variant selection would also work fine :)

Global optimization, with the number of variant packages that exist in conda-forge, takes a looong time (because many branches have to be explored). I think it's good to make it as simple as possible for the solver to find the best solution.
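The second bullet could be sketched roughly like this (assuming the usual conda-forge variant setup, where `cudatoolkit` comes from `conda_build_config.yaml`; the section below is illustrative, not this feedstock's actual recipe):

```yaml
# Hypothetical: repeat the variant's cudatoolkit pin directly on the
# `tensorflow` output, so the solver can rank CUDA variants by looking
# at first-order dependencies alone.
outputs:
  - name: tensorflow
    requirements:
      run:
        - {{ pin_subpackage("tensorflow-base", exact=True) }}
        - cudatoolkit {{ cudatoolkit }}   # same version as in the build string
```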

wolfv (Member Author) commented Nov 5, 2021

I tried to write down how the sorting and solving works here: https://mamba.readthedocs.io/en/latest/advanced_usage/package_resolution.html

183amir commented Nov 5, 2021

As long as we're not compromising on the quality of the packaging (i.e. everything works as it should), I don't see why we shouldn't accommodate mamba here.

I agree.

have only tensorflow pyXX meta-package (without CUDA, or even without Python!), then the tensorflow packages depends on tensorflow-base 2.6 (pyXX) and variant selection would work fine

I don't think this is a good solution because the exact dependency on tensorflow-base is there to make sure the split packages are all installed from one build. That is, tensorflow, tensorflow-estimator, etc. are installed from one build.

Add a direct dependency to the same cudatoolkit from the build string, then variant selection would also work fine :)

I am not sure what this means. Could you please explain more? Maybe you could show what the recipe would look like.

izahn commented Nov 5, 2021

This might be a separate topic, but do we actually need both tensorflow and tensorflow-base packages? Currently the only difference between mamba create -d -n test "tensorflow=*=cuda112*" and mamba create -d -n test "tensorflow-base=*=cuda112*" is that the first one also installs tensorflow and tensorflow-estimator. The same is true for the other variants. Unless there is a common need to install tensorflow without tensorflow-estimator, that doesn't make much sense to me.

wolfv (Member Author) commented Nov 5, 2021

So right now we have the following situation:

tensorflow-py39-cuda112
  depends on:
  - python 3.9
  - tensorflow-base py39-cuda112 (heavily constrained; can only use a single tensorflow-base!)
tensorflow-py38-cuda112
  depends on:
  - python 3.8
  - tensorflow-base py38-cuda112
tensorflow-py39-cuda111
  depends on:
  - python 3.9
  - tensorflow-base py39-cuda111

So here, mamba can correctly decide for tensorflow-py39..., but it cannot select the CUDA variant, because it cannot judge by inspecting the first-order dependencies.

If we add cudatoolkit at this level, it would work fine. Otherwise, we could modify it like this:

tensorflow-py39
  depends on:
  - python 3.9
  - tensorflow-base py39*
    can be resolved to
    - tensorflow-base py39-cuda102
    - tensorflow-base py39-cuda110
    - tensorflow-base py39-cuda111
    - tensorflow-base py39-cuda112

Then we could again resolve properly, because we can decide for tensorflow-base and select the build that uses the highest cudatoolkit.

I don't know if this makes sense? It's kinda hard to explain.

Basically with the current setup we would need to look at all tensorflow packages, and then at all tensorflow-base packages to figure out which one is best.

What I am proposing is to

  • either make it possible to decide in tensorflow (by adding cudatoolkit as a dependency)
  • or leave the decision to the choice of tensorflow-base (by removing the exact pin)

Eliminating tensorflow-base would also fix this :)

hmaarrfk (Contributor) commented Nov 8, 2021

I kinda feel like we should add cudatoolkit to tensorflow. I'm not too sure why tensorflow-base still exists today. I think it was historically there to help with circular dependencies with estimator.

xhochy (Member) commented Nov 8, 2021

I think it was historically there to help with circular dependencies with estimator

Yes!

hmaarrfk (Contributor) commented Nov 8, 2021

I'm sorry, do we want to add:

  • cudatoolkit
    or
  • cudnn
  • nccl

to the host section?

wolfv (Member Author) commented Nov 8, 2021

cudatoolkit is the one whose version is in the variant, right? So that should be enough to help mamba decide.

hmaarrfk (Contributor) commented Nov 8, 2021

I guess I don't see cudatoolkit anywhere in this recipe, or in the pytorch recipe. I feel like we should make this recipe a model for other recipes to follow.

wolfv (Member Author) commented Nov 8, 2021

I think it's exported from compiler('cuda')?
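For context, this is roughly how that export works in the conda-forge pattern (a sketch under the assumption that the CUDA compiler package carries a strong `run_exports`; the exact jinja and version spelling below are illustrative, not copied from the nvcc feedstock):

```yaml
# In a recipe, the CUDA compiler is requested via the jinja helper:
requirements:
  build:
    - {{ compiler('cuda') }}   # resolves to an nvcc package for the variant

# The nvcc package itself is assumed to carry a run_exports section, roughly:
#   build:
#     run_exports:
#       strong:
#         - cudatoolkit {{ cuda_version }}.*
# so every output compiled with it gets a cudatoolkit run requirement
# injected automatically, without cudatoolkit appearing in this recipe.
```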

hmaarrfk (Contributor) commented Nov 8, 2021

Great, makes sense. Let's add that.

It seems like we should be adding it to libtensorflow_cc as well, right?

hmaarrfk (Contributor) commented Nov 8, 2021

Ok, I've folded it into the recent migration.

h-vetinari (Member) commented:

IIUC, we could get rid of tensorflow-base now that the estimator is folded into this feedstock?

How about we do this for 2.7.0?

hmaarrfk (Contributor) commented Nov 9, 2021

We could do that... if downstream packages are not depending on it.

There seem to be quite a few things on GitHub that are depending on it already, so we kinda have to carry it forward.

h-vetinari (Member) commented:

If it's still needed, we could still keep it as a compatibility output that just 1:1 installs tensorflow, ideally together with an activation script that warns to change the dependence away from tensorflow-base.

hmaarrfk (Contributor) commented Nov 9, 2021

We had tried to keep tensorflow as simply installing tensorflow-base exactly pinned, but we needed to add more constraints to it.

I'm not sure there is an easy answer to this.

h-vetinari (Member) commented:

If we don't need tensorflow-base anymore to break circular dependencies with tf-estimator, then we can just fully absorb it into tensorflow in the recipe, and then add a new output tensorflow-base as a thin wrapper around tensorflow. I can't see how that wouldn't work...

hmaarrfk (Contributor) commented Nov 9, 2021

In your sentence, swap tensorflow-base and tensorflow and that is what we had 2 weeks ago.

xhochy (Member) commented Nov 9, 2021

I introduced tensorflow-base only to get tensorflow-estimator building inside this feedstock. Will we still need that, or do we have a workaround for the cyclic dependency?

h-vetinari (Member) commented:

In your sentence, swap tensorflow-base and tensorflow and that is what we had 2 weeks ago.

You said above:

We had tried to keep tensorflow as simply installing tensorflow-base exactly pinned, but we needed to add more constraints to it.

Why not still add those constraints and then call it tensorflow / tensorflow-base (whichever way the wrapper goes)?

hmaarrfk (Contributor) commented Nov 9, 2021

Maybe I'm just confused. It might be easier if you just show me in a PR.

Can we agree on an order of priorities? I suggest:

  1. Get the migration moving forward: Migrate libprotobuf, grpcio, icu #144
  2. Build Windows (CPU only)
  3. Clean up the recipe (with what you are suggesting).

h-vetinari (Member) commented:

Can we agree on an order of priorities?

I agree that #144 is priority number 1, and as I said, I'm fine to clean up the tf-base situation together with 2.7.0. I don't think we should make these improvements dependent on Windows builds, which have a large number of known & unknown unknowns.

IOW, from your list, I suggest 1. -> 3. -> 2.

hmaarrfk (Contributor) commented:

Has this been resolved?


6 participants