subset selection default algorithm is taking forever #71

vlimant · 2025-02-12T17:42:56Z

As part of looking into a pilot (https://gitlab.cern.ch/cms-ppd/dataset-management/simulation-production/-/issues/9#note_9061271) with limited statistics from an existing dataset (a functionality that already exists), and after fixing a formatting issue (#70), it became clear that the subset algorithm is not suited to large inputs.

something like

import sys
sys.path.append('/afs/cern.ch/cms/PPD/PdmV/tools/wmcontrol/')
from modules import helper
import pprint
dataset='/InclusiveDileptonMinBias_TuneCP5Plus_13p6TeV_pythia8/GenericNoSmearGEN-124X_mcRun3_2022_realistic_v12-v2/GEN'
espl = helper.SubsetByLumi(dataset,0.05)
split, details = espl.run( 10000 , True, False)

takes a handful of minutes, while

split, details = espl.run( 10000 , True, False)

has been running for the last 3h ...
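
To put numbers on the two runs, a small timing wrapper could be used. This is only a sketch: `timed` and `fake_run` are hypothetical names that are not part of wmcontrol, and `fake_run` is a stand-in so the snippet runs without the AFS path. In practice `espl.run` would be passed in its place.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-in for espl.run, so the sketch is self-contained.
# It only mimics the (split, details) return shape used above.
def fake_run(n_events, flag_a, flag_b):
    return list(range(n_events)), {"n": n_events}


(split, details), seconds = timed(fake_run, 10000, True, False)
print(f"run took {seconds:.3f} s and returned {len(split)} entries")
```

With the real object, the call would read `timed(espl.run, 10000, True, False)`.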

vlimant mentioned this issue Feb 12, 2025

use brute force on large set #72

Open