Updates for runTheMatrix.py: input checks, GPUs repartition, input recycling #47377

Open
wants to merge 7 commits into
base: master

AdrianoDee
Contributor

@AdrianoDee AdrianoDee commented Feb 17, 2025

PR description:

This PR proposes a few modifications to runTheMatrix.py and related packages. It adds the ability to:

  1. check whether the default samples for the requested workflows are actually defined. This is done via the -c/--checkInputs flag. This should solve #46910 ("[RFC] Minimal test of Configuration/PyReleaseValidation/python/relval_steps.py validity") if the routine PR tests run runTheMatrix.py -n -c;

  2. start a workflow from a specific step (GEN, SIM, DIGI, ...) with the option --startFrom STEP. This removes all the steps before the one whose cmsDriver.py command has -s STEP, [...];

  3. use a different file as input with --recycle. This is intended to be used either together with --startFrom or on workflows whose first step uses a pre-existing input;

  4. run duplicate workflows in the input list with the option -l WF, WF, WF [...]. Each workflow would run in a different job (if specified) and _jobX is appended to the work area name so that the copies do not share the same folder;
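The step trimming described in item 2 can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: the helper names `steps_of` and `start_from` are hypothetical, each workflow step is assumed to be available as a plain cmsDriver.py command string, and matching any entry of the `-s` list (rather than only the first one) is a guess.

```python
import shlex


def steps_of(command):
    """Return the sequence names given to -s/--step in a cmsDriver.py command.

    Hypothetical helper: "DIGI:pdigi_valid,L1" is reduced to ["DIGI", "L1"].
    """
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens[:-1]):
        if tok in ("-s", "--step"):
            return [s.split(":")[0] for s in tokens[i + 1].split(",")]
    return []


def start_from(commands, step):
    """Drop every command that comes before the first one running `step`.

    Assumption: a command "runs" a step if the step name appears anywhere
    in its -s list. Raises ValueError if no command runs the step.
    """
    for i, cmd in enumerate(commands):
        if step in steps_of(cmd):
            return commands[i:]
    raise ValueError(f"no step with -s {step} found in this workflow")


# Illustrative command lines only; real workflows carry many more options.
workflow = [
    "cmsDriver.py TTbar_13TeV_cfi -s GEN,SIM -n 10",
    "cmsDriver.py step2 -s DIGI:pdigi_valid,L1,DIGI2RAW -n 10",
    "cmsDriver.py step3 -s RAW2DIGI,RECO -n 10",
]

for cmd in start_from(workflow, "DIGI"):
    print(cmd)
```

With a trimmed workflow the new first step no longer produces its own input, which is where an option such as --recycle (item 3) would come in to supply the file.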

In addition, when running with the -gpu option and multiple jobs (-j N), each job is now assigned to a different GPU. The available GPUs may also be selected by compute capability (NVIDIA only) with the existing --cuda-capabilities option, or by name with the existing --force-gpu-name option. If more jobs than available GPUs are requested, the job-to-GPU assignment wraps around to the first available GPU and continues until all jobs are assigned. For example, with 8 jobs and 3 GPUs:

  • GPU 0 -> jobs [0, 3, 6]
  • GPU 1 -> jobs [1, 4, 7]
  • GPU 2 -> jobs [2, 5]

This should solve #47337
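The job-to-GPU assignment described above is a round-robin over the available GPUs. A minimal sketch, assuming jobs are numbered from 0 and GPUs are identified by integer index (the function name `assign_jobs_to_gpus` is hypothetical and is not taken from this PR):

```python
from collections import defaultdict


def assign_jobs_to_gpus(n_jobs, gpu_ids):
    """Round-robin assignment: job i goes to gpu_ids[i % len(gpu_ids)]."""
    if not gpu_ids:
        raise ValueError("no GPU available")
    assignment = defaultdict(list)
    for job in range(n_jobs):
        assignment[gpu_ids[job % len(gpu_ids)]].append(job)
    return dict(assignment)


# Reproduces the example from the description: 8 jobs on 3 GPUs.
for gpu, jobs in assign_jobs_to_gpus(8, [0, 1, 2]).items():
    print(f"GPU {gpu} -> jobs {jobs}")
# GPU 0 -> jobs [0, 3, 6]
# GPU 1 -> jobs [1, 4, 7]
# GPU 2 -> jobs [2, 5]
```

How each job is then pinned to its GPU is not stated in the description; restricting the devices visible to the job's process is one common approach, but that is an assumption here.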

@AdrianoDee AdrianoDee changed the title from "Add SimpleTrackValidation Analyzer" to "Updates for runTheMatrix.py" Feb 17, 2025
@cmsbuild cmsbuild added this to the CMSSW_15_1_X milestone Feb 17, 2025
@cmsbuild
Contributor

cmsbuild commented Feb 17, 2025

cms-bot internal usage

@AdrianoDee AdrianoDee marked this pull request as draft February 17, 2025 16:28
@AdrianoDee AdrianoDee changed the title from "Updates for runTheMatrix.py" to "Updates for runTheMatrix.py: input checks, GPUs repartition, input recycling" Feb 17, 2025
@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47377/43734

@AdrianoDee AdrianoDee marked this pull request as ready for review February 17, 2025 16:29
@cmsbuild
Contributor

A new Pull Request was created by @AdrianoDee for master.

It involves the following packages:

  • Configuration/PyReleaseValidation (upgrade, pdmv)

@AdrianoDee, @Moanwar, @cmsbuild, @DickyChant, @miquork, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @fabiocos, @makortel, @missirol, @slomeo this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@AdrianoDee
Contributor Author

test parameters:

  • relval_opts = -c

@AdrianoDee
Contributor Author

enable gpu

@AdrianoDee
Contributor Author

please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals RelVals-INPUT
Size: This PR adds an extra 108KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6d8ed7/44443/summary.html
COMMIT: 37dbec2
CMSSW: CMSSW_15_1_X_2025-02-17-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47377/44443/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals

ERROR Running runTheMatrix for '-s -l 9.0,101.0,1306.0,10224.0,25202.0,250202.181'

RelVals-INPUT

ERROR Running runTheMatrix for '-l 4.17,4.22,4.23,4.24,4.25,4.26,4.27,4.28,4.29,4.34,4.36,4.37,4.4,4.41,4.42,4.43,4.44,4.45,4.51,4.52,4.53,4.54,4.55,4.57,4.58,4.6,4.61,4.62,4.63,4.64,4.65,4.67,4.68,4.71,4.72,4.73,4.74,4.75,4.76,4.77,4.78,134.701,134.702,134.703,134.704,134.705,134.706,134.707,134.708,134.709,134.71,134.801,134.802,134.803,134.804,134.805,134.806,134.807,134.808,134.809,134.81,134.811,134.812,134.813,134.901,134.902,134.903,134.904,134.905,134.906,134.907,134.908,134.909,134.91,134.911,134.912,136.721,136.722,136.723,136.724,136.725,136.726,136.727,136.728,136.729,136.73,136.731,136.732,136.733,136.734,136.735,136.736,136.737,136.738,136.739,136.74,136.741,136.742,136.743,136.744,136.745,136.746,136.747,136.748,136.749,136.75,136.751,136.752,136.753,136.754,136.755,136.756,136.757,136.758,136.759,136.76,136.761,136.762,136.763,136.764,136.765,136.766,136.767,136.768,136.769,136.77,136.771,136.772,136.773,136.774,136.775,136.776,136.777,136.778,136.779,136.78,136.7801,136.7802,136.7803,136.781,136.782,136.783,136.784,136.785,136.786,136.787,136.788,136.789,136.79,136.791,136.792,136.793,136.794,136.795,136.796,136.797,136.798,136.799,136.8,136.801,136.802,136.803,136.804,136.805,136.806,136.807,136.808,136.809,136.81,136.811,136.812,136.813,136.814,136.815,136.816,136.817,136.818,136.819,136.82,136.821,136.822,136.823,136.824,136.825,136.826,136.827,136.828,136.829,136.83,136.831,136.832,136.833,136.834,136.835,136.836,136.837,136.838,136.839,136.8391,136.84,136.841,136.842,136.843,136.844,136.845,136.846,136.847,136.848,136.849,136.85,136.8501,136.851,136.852,136.853,136.854,136.855,136.856,136.8561,136.8562,136.857,136.858,136.859,136.86,136.861,136.862,136.863,136.864,136.8642,136.865,136.866,136.867,136.868,136.869,136.87,136.871,136.872,136.873,136.874,136.875,136.876,136.877,136.878,136.879,136.88,136.881,136.882,136.883,136.884,136.885,136.8855,136.886,136.8861,136.8862,136.887,136.888,136.8885,136.889,136.89,136.891,136.892,136.893,136.894,136.895,136.896,136.897,136.898,136.899,136.901,136.902,136.903,136.904,137.8,138.1,138.2,138.3,138.4,138.5,139.001,139.002,139.003,139.004,139.005,140.001,140.002,140.003,140.004,140.005,140.006,140.007,140.008,140.009,140.01,140.011,140.021,140.022,140.023,140.024,140.025,140.026,140.027,140.028,140.029,140.03,140.031,140.042,140.043,140.044,140.045,140.046,140.047,140.048,140.049,140.05,140.051,140.062,140.063,140.064,140.065,140.066,140.067,140.068,140.069,140.071,140.072,140.073,140.074,140.075,140.076,140.077,140.078,140.101,140.102,140.103,140.104,140.105,140.106,140.107,140.108,140.109,140.11,140.111,140.112,140.113,140.56,140.5611,140.57,140.58,140.6,140.61,141.001,141.002,141.003,141.004,141.005,141.006,141.007,141.008,141.008405,141.008411,141.008421,141.009,141.01,141.011,141.012,141.013,141.031,141.032,141.033,141.034,141.035,141.036,141.037,141.038,141.039,141.041,141.042,141.043,141.044,141.045,141.046,141.047,141.048,141.049,141.101,141.102,141.103,141.104,141.105,141.106,141.107,141.108,141.109,141.11,141.111,141.112,141.113,141.114,141.901,141.902,142.0,142.901,142.902,142.903,143.901,143.902,143.911,145.0,145.001,145.002,145.003,145.004,145.005,145.006,145.007,145.008,145.009,145.01,145.011,145.012,145.013,145.014,145.1,145.101,145.102,145.103,145.104,145.105,145.106,145.107,145.108,145.109,145.11,145.111,145.112,145.113,145.114,145.2,145.201,145.202,145.203,145.204,145.205,145.206,145.207,145.208,145.209,145.21,145.211,145.212,145.213,145.214,145.3,145.301,145.302,145.303,145.304,145.305,145.306,145.307,145.308,145.309,145.31,145.311,145.312,145.313,145.314,145.4,145.401,145.402,145.403,145.404,145.405,145.406,145.407,145.408,145.409,145.41,145.411,145.412,145.413,145.414,145.5,145.501,145.502,145.503,145.504,145.505,145.506,145.507,145.508,145.509,145.51,145.511,145.512,145.513,145.514,145.6,145.601,145.602,145.603,145.604,145.605,145.606,145.607,145.608,145.609,145.61,145.611,145.612,145.613,145.614,145.7,145.701,145.702,145.703,145.704,145.705,145.706,145.707,145.708,145.709,145.71,145.711,145.712,145.713,145.714,159.01,134.0,134.99601,134.99602,134.99603,134.99901,144.6,11024.2,1000.0,1001.0,1001.2,1001.3,1001.4,1002.0,1002.3,1002.4,1002.5,1003.0,1005.0,1010.0,1020.0,1030.0,1040.0,1040.1,1041.0,1042.0,1046.0,1047.0,1048.0,1049.0,1052.0,1052.1,2500.001,2500.002,2500.003,2500.011,2500.012,2500.013,2500.021,2500.022,2500.023,2500.024,2500.031,2500.032,2500.033,2500.034,2500.101,2500.111,2500.112,2500.131,2500.201,2500.211,2500.212,2500.221,2500.222,2500.223,2500.224,2500.225,2500.226,2500.227,2500.228,2500.231,2500.232,2500.233,2500.234,2500.235,2500.236,2500.237,2500.238,2500.241,2500.242,2500.243,2500.244,2500.245,2500.251,2500.301,2500.311,2500.901,2500.902'

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 867
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52204
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@AdrianoDee
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Mar 4, 2025

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47377/43953

@cmsbuild
Contributor

cmsbuild commented Mar 4, 2025

Pull request #47377 was updated. @AdrianoDee, @Moanwar, @DickyChant, @miquork, @srimanob, @subirsarkar can you please check and sign again.

@cmsbuild
Contributor

cmsbuild commented Mar 4, 2025

-1

Failed Tests: UnitTests RelVals
Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6d8ed7/44800/summary.html
COMMIT: 5234e30
CMSSW: CMSSW_15_1_X_2025-03-04-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47377/44800/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test test-runTheMatrix-interactive had ERRORS

RelVals

  • 25202.0: 25202.0_TTbar_13/step1_TTbar_13.log
  • 1306.0: 1306.0_SingleMuPt1_UP15/step1_SingleMuPt1_UP15.log
  • 9.0: 9.0_Higgs200ChargedTaus/step1_Higgs200ChargedTaus.log
Expand to see more relval errors ...

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 858
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52213
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@AdrianoDee AdrianoDee force-pushed the recycle_and_checks_runthematrix branch from 5234e30 to 60d90f0 March 9, 2025 21:00
@AdrianoDee
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Mar 9, 2025

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47377/44006

@cmsbuild
Contributor

cmsbuild commented Mar 9, 2025

Pull request #47377 was updated. @AdrianoDee, @Moanwar, @DickyChant, @miquork, @srimanob, @subirsarkar can you please check and sign again.

@cmsbuild
Contributor

cmsbuild commented Mar 9, 2025

-1

Failed Tests: RelVals
Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6d8ed7/44877/summary.html
COMMIT: 60d90f0
CMSSW: CMSSW_15_1_X_2025-03-09-0000/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47377/44877/install.sh to create a dev area with all the needed externals and cmssw changes.

  • DAS Queries: The DAS query tests failed, see the summary page for details.

RelVals

  • 25202.0: 25202.0_TTbar_13/step1_TTbar_13.log
  • 1306.0: 1306.0_SingleMuPt1_UP15/step1_SingleMuPt1_UP15.log
  • 9.0: 9.0_Higgs200ChargedTaus/step1_Higgs200ChargedTaus.log
Expand to see more relval errors ...

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 858
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52213
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found
