Apple Silicon GPU compatibility for Tensorflow #2184

Merged
merged 3 commits into from
Oct 21, 2024

Member

@comane comane commented Oct 20, 2024

This pull request includes updates to the doc/sphinx/source/n3fit/runcard_detailed.rst file to clarify instructions for running parallel models and using GPUs on M1/M2 Macs.

Updates to parallel model instructions:

  • Added a note that savepseudodata must be set to false in the fitting section of the runcard to run with parallel models. (doc/sphinx/source/n3fit/runcard_detailed.rst)

Updates for GPU usage on M1/M2 Macs:

  • Added instructions to install specific packages (tensorflow-deps, tensorflow-macos, tensorflow-metal, and wandb) to run replicas in parallel using GPUs on M1/M2 Macs. (doc/sphinx/source/n3fit/runcard_detailed.rst)
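A minimal sketch for checking that TensorFlow can see the GPU after installing the packages above. It assumes that, with `tensorflow-metal` installed, the Apple GPU shows up under the `"GPU"` device type; the helper name `visible_gpus` is hypothetical and not part of n3fit.

```python
def visible_gpus():
    """Return the GPU devices TensorFlow reports, or [] if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not available in this environment.
        return []
    # Standard TensorFlow call; returns a (possibly empty) list of PhysicalDevice objects.
    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print(f"TensorFlow sees {len(gpus)} GPU device(s): {gpus}")
    else:
        print("No GPU visible to TensorFlow; fits would run on CPU.")
```

An empty list means that either TensorFlow is missing or no GPU device was registered, in which case the fit falls back to CPU.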

@comane comane added the documentation Issues and PRs related to documentation label Oct 20, 2024
Member

@Radonirinaunimi Radonirinaunimi left a comment

Thanks @comane for this! Good to see that this is indeed what's needed to make it run.

Out of curiosity, how is the performance (how many replicas could you run, etc.)?

.. code-block:: bash

    conda install -c apple tensorflow-deps
    pip install tensorflow-macos==2.13.0 tensorflow-metal wandb==0.15.9
Member

Why is it necessary to really pin this version?

Member Author

It is related to this issue here: wandb/wandb#5935

I was not able to make it run on Mac M2 GPUs with other versions.

Member

Could you also add a reference to that issue in the docs?

@comane
Member Author

comane commented Oct 21, 2024

Out of curiosity, how is the performance (how many replicas could you run, etc.)?

The performance is not really good; at least on my laptop, it takes longer than using CPUs. But this might be different for someone else with a more powerful Mac.

@scarlehoff
Member

Out of curiosity, it doesn't work on M3 at all or you only had M1 and M2 to test?

@comane
Member Author

comane commented Oct 21, 2024

Out of curiosity, it doesn't work on M3 at all or you only had M1 and M2 to test?

I only tested on M2, but the above-mentioned issue is for M1.
Maybe @ecole41, if time allows, can test it with her M3?

But I assume it works for M3 as well.

@Radonirinaunimi
Member

Out of curiosity, how is the performance (how many replicas could you run, etc.)?

The performance is not really good; at least on my laptop, it takes longer than using CPUs. But this might be different for someone else with a more powerful Mac.

When you say longer, how much is it? With how many replicas? (Maybe you are hitting a memory bottleneck?)

But in any case, if it is not 4/5 times slower I'd say that's still good because you get all the replicas at the same time.

@scarlehoff
Member

If @ecole41 can test it that would be great.

I'd suggest anyway changing from M1/M2 to something along the lines of "Apple Silicon".

But in any case, if it is not 4/5 times slower I'd say that's still good because you get all the replicas at the same time.

I would say even 4/5 is still good. In my case, I can run entire fits in ~3 hours on my desktop's GPU, while a single replica takes about 40 minutes. It's about 5 times more, but when the cluster is busy it is the difference between having the fits ready the same morning or one day later.

@Radonirinaunimi
Member

I would say even 4/5 is still good. In my case, I can run entire fits in ~3 hours on my desktop's GPU, while a single replica takes about 40 minutes. It's about 5 times more, but when the cluster is busy it is the difference between having the fits ready the same morning or one day later.

That's absolutely true! My threshold was really pessimistic 😅

@comane comane changed the title from "M1-2 GPU compatibility for Tensorflow" to "Apple Silicon GPU compatibility for Tensorflow" Oct 21, 2024
@comane
Member Author

comane commented Oct 21, 2024

When you say longer, how much is it? With how many replicas? (Maybe you are hitting a memory bottleneck?)

Running with 10 replicas only on GPUs takes 15 minutes to get to epoch 4400 / 17000. If I run the same thing on CPU (still on my laptop) it takes 2 min 45 sec.
So, I think that at least on my computer it's more convenient to run things on CPU.
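A rough sketch of what these timings imply, using only the numbers quoted above. The extrapolation to all 17000 epochs assumes the time per epoch stays constant and that the fit is not stopped early, neither of which has been verified here; the variable names are illustrative only.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds pair into seconds."""
    return 60 * minutes + seconds


gpu_s = to_seconds(15)     # GPU: 15 min to reach epoch 4400
cpu_s = to_seconds(2, 45)  # CPU: 2 min 45 s to reach epoch 4400

# How many times slower the GPU run is over the same 4400 epochs.
slowdown = gpu_s / cpu_s

# Linear extrapolation to the full 17000 epochs (assumption: constant time per epoch).
epochs_done, epochs_total = 4400, 17000
scale = epochs_total / epochs_done
gpu_full_min = gpu_s * scale / 60
cpu_full_min = cpu_s * scale / 60

print(f"GPU is {slowdown:.1f}x slower than CPU")
print(f"Extrapolated full run: GPU {gpu_full_min:.0f} min, CPU {cpu_full_min:.1f} min")
```

Under these assumptions the GPU run is roughly 5.5 times slower, which corresponds to close to an hour on GPU versus about 11 minutes on CPU for the full 17000 epochs.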

I would say even 4/5 is still good. In my case, I can run entire fits in ~3 hours on my desktop's GPU, while a single replica takes about 40 minutes. It's about 5 times more, but when the cluster is busy it is the difference between having the fits ready the same morning or one day later.

@scarlehoff when you say on your desktop, do you mean a macOS machine?

An interesting warning that I am getting is the following:
[screenshot: Pasted Graphic]

@scarlehoff
Member

@scarlehoff when you say on your desktop, do you mean a macOS machine?

Nope, a Linux desktop with an NVIDIA GPU (at some point I tried it as well with an AMD one and it worked, fwiw)

Member

@scarlehoff scarlehoff left a comment

Thanks for testing this (and adding it to the docs!!!)

@comane comane merged commit 5f06deb into master Oct 21, 2024
6 checks passed
@comane comane deleted the m1_2_gpu_tf_compatibility branch October 21, 2024 14:13
@ecole41
Collaborator

ecole41 commented Oct 22, 2024

Hello @scarlehoff @comane @Radonirinaunimi, I tested this on M3 GPUs and it worked, but the performance on GPUs was much slower than on CPUs on my Mac.
For 200 epochs and 100 replicas for the nnpdf40-like runcard:

  • GPUs: 3800 s
  • CPUs: 2000 s
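As a quick sketch of the arithmetic behind these two numbers: the code below computes the GPU/CPU ratio and the average wall-clock time per replica per epoch. It assumes both timings are total wall-clock times for the same 200 epochs and 100 replicas, and they may include startup overhead; the names are illustrative only.

```python
EPOCHS = 200
REPLICAS = 100

gpu_total_s = 3800  # reported GPU wall-clock time
cpu_total_s = 2000  # reported CPU wall-clock time

# Relative slowdown of the GPU run.
ratio = gpu_total_s / cpu_total_s

# Average cost of one epoch for one replica (upper bound if overhead is included).
gpu_per_replica_epoch = gpu_total_s / (EPOCHS * REPLICAS)
cpu_per_replica_epoch = cpu_total_s / (EPOCHS * REPLICAS)

print(f"GPU/CPU time ratio: {ratio:.2f}")
print(f"Per replica-epoch: GPU {gpu_per_replica_epoch:.2f} s, CPU {cpu_per_replica_epoch:.2f} s")
```

With these figures the GPU run takes 1.9 times as long as the CPU run, i.e. about 0.19 s versus 0.10 s per replica per epoch.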

@scarlehoff
Member

On CPU, did you also run 100 replicas, or is this 1 replica on CPU vs 100 on GPU?

@ecole41
Collaborator

ecole41 commented Oct 22, 2024

Both GPU and CPU for 100 replicas

@RoyStegeman
Member

RoyStegeman commented Oct 22, 2024

Is that the timing only for the 200 epochs or does it include overhead?

@ecole41
Collaborator

ecole41 commented Oct 22, 2024

I'm not sure how to check this, let me know if this helps:

This is the time I get for the GPU:
[screenshot: GPU timing]

This is for the CPU:
[screenshot: CPU timing]

@scarlehoff
Member

I think it includes overhead. But in any case, it seems that running a fit on a Mac is not really going to be doable just yet :(

Maybe there's some low-hanging fruit to improve it, but I'm not sure the effort is worth it.
