Profiling #45

cmvcordova · 2024-09-03T22:01:54Z

Basic notebook integrated with mkdocs-jupyter. Pushing to test changes on CI.


codecov-commenter · 2024-09-04T17:44:05Z

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.64%. Comparing base (32ae062) to head (912bda0).

Files with missing lines	Patch %	Lines
...oject/datamodules/image_classification/imagenet.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #45      +/-   ##
==========================================
- Coverage   70.72%   70.64%   -0.08%     
==========================================
  Files          57       57              
  Lines        3593     3591       -2     
==========================================
- Hits         2541     2537       -4     
- Misses       1052     1054       +2
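The percentages in this diff are consistent with coverage computed as hits / lines and then truncated (rounded down) to two decimals, which I believe is Codecov's default `round: down` behaviour. A minimal sketch that reproduces the reported numbers under that assumption (`coverage_pct` is a hypothetical helper, not part of Codecov):

```python
from decimal import Decimal, ROUND_DOWN


def coverage_pct(hits: int, lines: int) -> Decimal:
    """Line coverage in percent, truncated to two decimals.

    Assumes Codecov-style rounding (round down, precision 2).
    """
    pct = Decimal(hits) * 100 / Decimal(lines)
    return pct.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


base = coverage_pct(2541, 3593)  # master
head = coverage_pct(2537, 3591)  # this PR (head)
print(base, head, head - base)
```

With plain rounding the head value would read 70.65% instead (253700 / 3591 ≈ 70.6488), which is why the rounding mode matters when comparing runs.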


lebrice · 2024-09-18T16:21:45Z

Here is some feedback with respect to the notebook structure and the contents of the wandb report.
(the numbers in the list correspond to the numbered points in #11 (comment))

Looks good!
First panel: Show metrics for a normal training run (with an RTX8000 GPU and ~1-4 CPUs): steps per second, CPU utilization, GPU utilization, RAM, VRAM. It could also be nice to display the # of CPUs, the CPU type, the GPU type, the # of GPUs, etc.
Second panel: Within the same job, show a comparison of the metrics above with and without training (algorithm=no_op vs. algorithm=example).
3.1: Assuming that data loading is the bottleneck, make a plot that compares the throughput with different num_workers for the same number of CPUs (all run within the same job, i.e. the current interactive job).
3.2: Make a plot that compares the throughput of runs with the no_op algorithm across different numbers of CPUs, with either a fixed num_workers or a fixed num_workers-per-CPU ratio.
3.3: Given this better configuration for num_workers and n_cpus, show similar panels as in step 2, showing a comparison between A) previous parameters (lower bound), B) New, optimized parameters, and C) Optimized parameters without training (upper bound).
Comparing GPU vs CPU training:
- The current panel looks good. The content will have to be updated to use the optimized # of cpus / num_workers from step 3.
- It would be nice to add another panel showing the same comparison for training a small fcnet network on MNIST. I suspect the difference between GPU and CPU throughput won't be as large there.
Comparing different types of GPUs:
- The panels look good! The content will have to be updated to use the optimized # of CPUs / num_workers from step 3.
GPU utilization:
- Also show the GPU utilization %, mem usage % in addition to the metrics in 2.
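Points 3.1 to 3.3 boil down to timing a fixed number of batches for each candidate configuration and comparing against the no_op run as an upper bound. A minimal, framework-agnostic sketch: `make_loader` is a hypothetical factory (in the notebook it would presumably build a PyTorch `DataLoader` with `num_workers=n`), and the number of usable CPUs is read from the process's CPU affinity, which on Linux typically reflects the CPUs allocated to the job rather than the whole node. All helper names here are illustrative, not part of the project.

```python
import os
import time
from collections.abc import Callable, Iterable, Sequence
from typing import Any


def available_cpus() -> int:
    """Number of CPUs this process may run on (e.g. those allocated to a job)."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def workers_for_ratio(n_cpus: int, workers_per_cpu: float) -> int:
    """num_workers for a fixed workers-per-CPU ratio (point 3.2), floored at 0."""
    return max(0, int(n_cpus * workers_per_cpu))


def measure_throughput(loader: Iterable[Any], max_steps: int) -> float:
    """Iterate over up to `max_steps` batches and return batches per second."""
    steps = 0
    start = time.perf_counter()
    for _batch in loader:
        steps += 1
        if steps >= max_steps:
            break
    elapsed = time.perf_counter() - start
    return steps / elapsed if elapsed > 0 else float("inf")


def sweep_num_workers(
    make_loader: Callable[[int], Iterable[Any]],
    candidates: Sequence[int],
    max_steps: int = 100,
) -> dict[int, float]:
    """Throughput (batches/s) for each candidate num_workers (point 3.1)."""
    return {n: measure_throughput(make_loader(n), max_steps) for n in candidates}


def fraction_of_upper_bound(throughput: float, no_op_throughput: float) -> float:
    """How close a training run gets to the no_op (data-loading-only) run (point 3.3)."""
    return throughput / no_op_throughput
```

For example, `best = max(results, key=results.get)` on the output of `sweep_num_workers` picks the fastest num_workers; the first few batches include worker start-up cost, so a short warm-up before timing would likely give more stable numbers.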

docs/examples/profiling.ipynb

lebrice · 2024-09-25T15:45:30Z

docs/install.md

lebrice · 2024-09-25T15:46:36Z

mkdocs.yml

-
+nav:
+  - Home: index.md
+  - Profiling your code: docs/examples/profiling.ipynb


Shouldn't this be added in the index.md file? I imagine that this would add the profiling notebook at the top level of the docs.

Edit: SUMMARY.md (which is the navigation bar), not index.md

fixed in 7d5c1d2

lebrice · 2024-09-25T15:46:47Z

project/algorithms/example.py

lebrice · 2024-09-25T15:46:58Z

project/configs/algorithm/example.yaml

lebrice · 2024-09-25T15:47:32Z

project/configs/algorithm/no_op.yaml

This was probably already added in the master branch

project/configs/resources/one_gpu.yaml

project/configs/trainer/default.yaml

project/main.py

Co-authored-by: Fabrice Normandin <[email protected]>

lebrice · 2024-09-25T22:15:36Z

docs/install.md

lebrice · 2024-09-25T22:16:13Z

project/algorithms/example.py

Isn't this already in master?

lebrice · 2024-09-25T22:18:18Z

project/main.py

@@ -39,7 +36,8 @@
 )
 def main(dict_config: DictConfig) -> dict:
    """Main entry point for training a model."""
-    print_config(dict_config, resolve=False)
+    # print_config(dict_config, resolve=False)


I'd turn this off based on a config value instead of outright removing it
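A minimal sketch of that suggestion, assuming a hypothetical top-level boolean key `print_config` (not present in the current configs) that defaults to `True`, so the current behaviour is unchanged. OmegaConf's `DictConfig` implements `MutableMapping`, so `.get` works on it; the printing function is passed in to keep the sketch independent of the project's own `print_config` helper.

```python
from collections.abc import Callable, Mapping
from typing import Any


def maybe_print_config(
    dict_config: Mapping[str, Any],
    print_fn: Callable[..., None],
) -> bool:
    """Print the config unless the (hypothetical) `print_config` flag is false.

    Returns True if the config was printed.
    """
    # A missing key defaults to printing, which preserves the current behaviour.
    if dict_config.get("print_config", True):
        print_fn(dict_config, resolve=False)
        return True
    return False
```

In `main`, the commented-out line could then become `maybe_print_config(dict_config, print_config)`, and a run could disable the output with a Hydra override such as `print_config=false` (or `+print_config=false` if the key is not declared in the config).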

cmvcordova added 12 commits August 8, 2024 11:08

add profiling notebook, hotfix a few classes

b7aec3e

add profiling notebook, hotfix a few classes

8e0dad2

merging

46b0ec7

removed pyrootutils, fixed typos, nbstripout check

7f2032f

nbstripout compliance

dae2e14

add profiling notebook, hotfix a few classes

24c8406

removed pyrootutils, fixed typos, nbstripout check

56b84b7

nbstripout compliance

bf51494

Merge branch 'profiling' of https://github.com/mila-iqia/ResearchTemplate into profiling

7b08cea

attempt at merge, lockfiles still hanging

bff3d1e

pre-commit check

dde4112

lockfile regen, config update, misc changes to make profiling notebook work

912bda0

cmvcordova added 6 commits September 10, 2024 13:25

precommit exclusions, more WIP text

303e6d9

profiling nb progress, new WIP configs for pending throughput multirun

787ddf0

wandb logging working, notebook progress, CPU GPU throughput comparisons

f189cf4

Cleaned notebook up, support for config parameters in wandb overview

e577f95

Additional nb cleanup

4dc7bbc

latest nb changes

064c117

lebrice linked an issue Sep 18, 2024 that may be closed by this pull request

Add a Benchmarking / profiling example #11

Open

4 tasks

cmvcordova added 8 commits September 19, 2024 13:14

post feedback nb restructure

0370aa9

Added ImageNet training example

142b963

nb restructuring

38b42f0

pre-run placeholders

26c50be

added placeholder for optimized training run, point 3.3

d7f5931

added cpu constraint

19b8de8

added all mnist runs

f5adb25

fixed text on dataloading differences

2140022

lebrice requested changes Sep 25, 2024

docs/examples/profiling.ipynb (outdated, resolved)

docs/examples/profiling.ipynb (outdated, resolved)

docs/examples/profiling.ipynb (outdated, resolved)

lebrice requested changes Sep 25, 2024


cmvcordova and others added 9 commits September 25, 2024 11:59

Update docs/examples/profiling.ipynb

b1c227b

Co-authored-by: Fabrice Normandin <[email protected]>

Update docs/examples/profiling.ipynb

035a4b7

Co-authored-by: Fabrice Normandin <[email protected]>

Update docs/examples/profiling.ipynb

c8c40ac

Co-authored-by: Fabrice Normandin <[email protected]>

Update project/configs/resources/one_gpu.yaml

2a95fdd

Co-authored-by: Fabrice Normandin <[email protected]>

Update project/configs/trainer/default.yaml

4f4505a

Co-authored-by: Fabrice Normandin <[email protected]>

Update project/main.py

768443a

Co-authored-by: Fabrice Normandin <[email protected]>

fixed trailing callback_metrics error

2b138f6

changed profiling nb nav from mkdocs.yml to SUMMARY.md

7d5c1d2

grammar

290b0d3

lebrice reviewed Sep 25, 2024


added profiling config

32f4e41

lebrice closed this Oct 11, 2024


