Profiling #45

Closed
wants to merge 36 commits into from

Conversation

cmvcordova
Collaborator

Basic notebook integrated with mkdocs-jupyter. Pushing to test changes on CI.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.64%. Comparing base (32ae062) to head (912bda0).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...oject/datamodules/image_classification/imagenet.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #45      +/-   ##
==========================================
- Coverage   70.72%   70.64%   -0.08%     
==========================================
  Files          57       57              
  Lines        3593     3591       -2     
==========================================
- Hits         2541     2537       -4     
- Misses       1052     1054       +2     


@lebrice lebrice linked an issue Sep 18, 2024 that may be closed by this pull request
4 tasks
@lebrice
Collaborator

lebrice commented Sep 18, 2024

Here is some feedback on the notebook structure and the contents of the wandb report.
(The numbers in the list correspond to the numbered points in this comment: #11 (comment))

  1. Looks good!
  2. First panel: Show metrics for a normal training run (with an RTX 8000 GPU, ~1-4 CPUs): steps per second, CPU utilization, GPU utilization, RAM, VRAM. It could also be nice to display the number of CPUs, the CPU type, the GPU type, the number of GPUs, etc.
  3. Second panel: Within the same job, show a comparison of the metrics above with / without training. (algorithm=no_op vs algorithm=example)
    3.1: Assuming that the dataloading is the bottleneck, make a plot that compares the throughput with different num_workers for the same number of CPUs (all run within the same job, the current interactive job.)
    3.2: Make a plot that compares the throughput of runs with the no-op algorithm across different numbers of CPUs, with either a fixed num_workers or a fixed ratio of num_workers per CPU.
    3.3: Given this better configuration for num_workers and n_cpus, show similar panels as in step 2, showing a comparison between A) previous parameters (lower bound), B) New, optimized parameters, and C) Optimized parameters without training (upper bound).
  4. Comparing GPU vs CPU training:
    • The current panel looks good. The content will have to be updated to use the optimized # of cpus / num_workers from step 3.
    • It would be nice to add another panel showing the same comparison for training a small fcnet network on MNIST. There, I suspect that the difference between GPU and CPU throughput shouldn't be that large.
  5. Comparing different types of GPUs
    • The panels look good! The content will have to be updated to use the optimized # of cpus / num_workers from step 3.
  6. GPU Utilization
    • Also show the GPU utilization %, mem usage % in addition to the metrics in 2.
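
The num_workers sweep described in point 3.1 could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the notebook's implementation: `measure_throughput` and `sweep_num_workers` are hypothetical helper names, and `make_loader` is assumed to be any callable that returns an iterable of batches for a given worker count. The stand-in `fake_loader` exists only so the sketch runs without a dataset.

```python
import time
from typing import Callable, Dict, Iterable, Sequence


def measure_throughput(loader: Iterable, max_batches: int = 100) -> float:
    """Return batches per second when iterating `loader` with no training step."""
    n_batches = 0
    start = time.perf_counter()
    for _ in loader:
        n_batches += 1
        if n_batches >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return n_batches / elapsed if elapsed > 0 else float("inf")


def sweep_num_workers(
    make_loader: Callable[[int], Iterable],
    worker_counts: Sequence[int],
    max_batches: int = 100,
) -> Dict[int, float]:
    """Measure data-loading throughput once per candidate num_workers value."""
    return {
        n: measure_throughput(make_loader(n), max_batches)
        for n in worker_counts
    }


if __name__ == "__main__":
    # Hypothetical stand-in for a real dataloader: each "batch" takes less
    # time to produce as the (pretend) number of workers grows.
    def fake_loader(num_workers: int):
        delay = 0.002 / max(num_workers, 1)
        for i in range(50):
            time.sleep(delay)
            yield i

    results = sweep_num_workers(fake_loader, [1, 2, 4], max_batches=20)
    for n, rate in results.items():
        print(f"num_workers={n}: {rate:.1f} batches/s")
```

With PyTorch, `make_loader` could be something like `lambda n: torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=n)`, where `dataset` and the batch size are placeholders. Running every setting inside the same job, as the comment suggests, keeps the CPU allocation constant across measurements.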

docs/examples/profiling.ipynb Outdated
docs/examples/profiling.ipynb Outdated
docs/examples/profiling.ipynb Outdated
docs/install.md Outdated
Collaborator

remove

mkdocs.yml Outdated

nav:
- Home: index.md
- Profiling your code: docs/examples/profiling.ipynb
Collaborator

Shouldn't this be added in the index.md file? I imagine that this would add the profiling notebook at the top level of the docs.

Collaborator

Edit: SUMMARY.md (which is the navigation bar), not index.md

Collaborator Author

fixed in 7d5c1d2

Collaborator

remove

Collaborator

remove

Collaborator

This was probably already added in the master branch

project/configs/resources/one_gpu.yaml Outdated
project/configs/trainer/default.yaml Outdated
project/main.py Outdated
project/main.py
project/main.py Outdated
docs/install.md Outdated
Collaborator

remove

Collaborator

Isn't this already in master?

@@ -39,7 +36,8 @@
 )
 def main(dict_config: DictConfig) -> dict:
     """Main entry point for training a model."""
-    print_config(dict_config, resolve=False)
+    # print_config(dict_config, resolve=False)
Collaborator

I'd turn this off based on a config value instead of outright removing it
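
One way to do that, as a minimal sketch: read a boolean flag from the config and only print when it is set. The `print_config_on_start` key and the `maybe_print_config` helper are hypothetical names, not part of the project. The sketch uses a plain dict so it runs standalone, but it relies only on `.get`, which `DictConfig` also provides.

```python
from typing import Any, Callable, Mapping


def maybe_print_config(
    config: Mapping[str, Any],
    printer: Callable[[Mapping[str, Any]], None] = print,
    flag: str = "print_config_on_start",  # hypothetical config key
) -> bool:
    """Call `printer(config)` unless the flag is set to a falsy value.

    The flag defaults to True when absent, so existing configs keep
    printing as before. Returns whether the config was printed.
    """
    if config.get(flag, True):
        printer(config)
        return True
    return False


if __name__ == "__main__":
    # Flag explicitly disabled: nothing is printed.
    maybe_print_config({"print_config_on_start": False, "seed": 42})
    # Flag absent: falls back to printing the config.
    maybe_print_config({"seed": 42})
```

In `main`, the commented-out call could then become something like `maybe_print_config(dict_config, printer=lambda c: print_config(c, resolve=False))`, so a profiling run can silence the output from its config instead of editing the code.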

@lebrice lebrice closed this Oct 11, 2024
Development

Successfully merging this pull request may close these issues.

Add a Benchmarking / profiling example
3 participants