
Introduce concept of datasets #279

Open
alintulu opened this issue Mar 24, 2020 · 3 comments

Comments

@alintulu
Member

Some analyses have hundreds of input files. A dataset field in the workflow would allow the user to specify the path to the dataset (for example a CMS DAS path like /QCD_Pt-15to7000_TuneCP5_Flat2018_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15_ext1-v1/AODSIM).

REANA could pick up the names of the files (in the above example, ~3500 ROOT files) and either:

  • divide the files into batches automatically, choosing an appropriate number of files per batch based on the total dataset size, or
  • divide the files according to a second field such as batch_size, where the user gets to specify how many files go into one job.
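The two batching options above could be sketched roughly as follows (all names here are hypothetical; batch_size is the field proposed above):

```python
import math

def split_into_batches(files, batch_size):
    """Divide a flat file list into job-sized batches (second option above)."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

def auto_batch(files, target_jobs=100):
    """Derive the batch size from the total dataset size (first option above)."""
    batch_size = max(1, math.ceil(len(files) / target_jobs))
    return split_into_batches(files, batch_size)

# e.g. the ~3500 ROOT files above with batch_size=500 would yield 7 jobs
files = [f"file_{n}.root" for n in range(3500)]
batches = split_into_batches(files, batch_size=500)
```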

@clelange

@tiborsimko
Copy link
Member

(1) This requires introducing, alongside files and directories, a new concept called dataset that people could set in their reana.yaml, for example.

The value would be a string; however, a string alone would not be enough, since CMS datasets are handled in one way (and read from DAS), while LHCb datasets are handled in another (and read from Bookkeeping). So we'd need some options (and then some getters) that would differ per experiment.

For example:

```yaml
dataset:
  type: cms
  name: /QCD_Pt-15to7000_TuneCP5_Flat2018_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15_ext1-v1/AODSIM
```

```yaml
dataset:
  type: lhcb
  name: /LHCb/Collision12//RealData/Reco14/Stripping20r0p1//BHADRON.MDST
```
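One way such a typed dataset entry could be dispatched to experiment-specific getters is sketched below (the getter functions and the registry are hypothetical illustrations, not existing REANA code):

```python
# Hypothetical registry mapping the dataset "type" field to a file-listing getter.

def get_cms_files(name):
    # Would shell out to dasgoclient to list the files of a DAS dataset.
    raise NotImplementedError("CMS getter not implemented yet")

def get_lhcb_files(name):
    # Would query the LHCb Bookkeeping service.
    raise NotImplementedError("LHCb getter not implemented yet")

GETTERS = {"cms": get_cms_files, "lhcb": get_lhcb_files}

def resolve_dataset(dataset):
    """Look up and call the getter for a dataset spec like the YAML above."""
    try:
        getter = GETTERS[dataset["type"]]
    except KeyError:
        raise ValueError(f"unsupported dataset type: {dataset.get('type')!r}")
    return getter(dataset["name"])
```

Keeping the type-to-getter mapping in one table would make adding further experiment backends a local change.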

(2) The getters would also differ, answering the question "how do I get all files for a dataset, or the 17th file, or files from the 23rd to the 347th position?" E.g. for CMS it will be dasgoclient, but for CMS Open Data it can be cernopendata-client, which can already do things like:

```console
$ cernopendata-client get-file-locations --recid 10
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0029E804-C77C-E011-BA94-00215E22239A.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/00A398F9-CA7C-E011-8841-00215E221782.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0215E914-8C77-E011-8E1B-00215E2217E2.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02186E3C-D277-E011-8A05-00215E21D516.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02BEF921-A777-E011-AC08-00215E93D738.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02C9E2AD-A477-E011-BD4F-00215E222022.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02E8E884-2078-E011-A3DC-00215E93C4A8.root
...

$ cernopendata-client get-file-locations --recid 10 --protocol root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0029E804-C77C-E011-BA94-00215E22239A.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/00A398F9-CA7C-E011-8841-00215E221782.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0215E914-8C77-E011-8E1B-00215E2217E2.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02186E3C-D277-E011-8A05-00215E21D516.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02BEF921-A777-E011-AC08-00215E93D738.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02C9E2AD-A477-E011-BD4F-00215E222022.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02E8E884-2078-E011-A3DC-00215E93C4A8.root
...
```

We'd just enrich the client to accept not only a dataset's recid and doi but also its name.
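The positional access mentioned in (2) (the 17th file, or files from the 23rd to the 347th position) reduces to simple 1-based slicing once a getter has returned the full file list; a minimal sketch (the helper name is hypothetical):

```python
def select_files(files, start=1, end=None):
    """Return files from 1-based position start..end inclusive (whole list by default)."""
    end = len(files) if end is None else end
    return files[start - 1:end]

# 17th file only:            select_files(files, 17, 17)
# files 23 through 347:      select_files(files, 23, 347)
```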

(3) Finally, the scatter/gather around files, processing them in batches of say ~10 at a time, can be handled by the workflow engine. Note that the workflow engine may need to interact closely with REANA for the concept of "files belonging to a dataset".

@tiborsimko
Member

CC @lukasheinrich, who has also been introducing a dataset concept to Yadage workflows...

@clelange

@tiborsimko Thanks for breaking this down into smaller pieces. I think it would be good to see how we can make things work rather quickly while at the same time avoiding design errors.

re (1): defining a dataset type is definitely needed. For now, however, I would only implement cms and leave everything else unimplemented.

re (2): I'm not sure we need to take into account subsets of a single dataset. In general, one always wants to process a full dataset, and if that is not desired, users can fall back to providing individual file names instead. This step would therefore only have to return a list of file names. For simulation, one would also need to know the total number of events (or events per file; the former can be obtained from DAS in the CMS case), and for data the integrated luminosity, but this is something one needs to set by hand at the moment. And usually analyses perform internal (ac)counting by writing the number of events processed directly into the output file, so I think it's really just about returning the full file list here.

re (3): We can probably take some inspiration from twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab, but initially offering a "number of files per job" option should be good enough.

Do you think this would become too dependent on the workflow engine? Should we initially implement the logic described above by hand (e.g. using some shell/python scripts)?
