
Introduce concept of datasets #279

Open
alintulu opened this issue Mar 24, 2020 · 3 comments

Comments

@alintulu
Member

Some analyses have hundreds of input files. A dataset field in the workflow would allow the user to specify the path to the dataset (for example a CMS DAS path like /QCD_Pt-15to7000_TuneCP5_Flat2018_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15_ext1-v1/AODSIM).

REANA could pick up the names of the files (in the above example, ~3500 ROOT files) and either:

  • divide the files into batches automatically, choosing an appropriate number of files per batch based on the total dataset size, or
  • divide the files according to a second field such as batch_size, where the user gets to specify how many files go into one job.
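The two batching options above could be sketched roughly as follows (all names here are hypothetical; batch_size is the field proposed above):

```python
import math

def split_into_batches(files, batch_size):
    """Divide a flat file list into job-sized batches (second option above)."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

def auto_batch(files, target_jobs=100):
    """Derive the batch size from the total dataset size (first option above)."""
    batch_size = max(1, math.ceil(len(files) / target_jobs))
    return split_into_batches(files, batch_size)

# e.g. the ~3500 ROOT files above with batch_size=500 would yield 7 jobs
files = [f"file_{n}.root" for n in range(3500)]
batches = split_into_batches(files, batch_size=500)
```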

@clelange

@tiborsimko
Copy link
Member

(1) This requires introducing, alongside files and directories, a new concept called dataset that people could set in their reana.yaml, for example.

The value would be a string; however, a string alone would not be enough, since CMS datasets are handled in one way (and read from DAS), while LHCb datasets are handled in another (and read from Bookkeeping). So we'd need some options (and then some getters) that would differ per experiment.

For example:

```yaml
dataset:
  type: cms
  name: /QCD_Pt-15to7000_TuneCP5_Flat2018_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15_ext1-v1/AODSIM
```

```yaml
dataset:
  type: lhcb
  name: /LHCb/Collision12//RealData/Reco14/Stripping20r0p1//BHADRON.MDST
```
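One way such a typed dataset entry could be dispatched to experiment-specific getters is sketched below (the getter functions and the registry are hypothetical illustrations, not existing REANA code):

```python
# Hypothetical registry mapping the dataset "type" field to a file-listing getter.

def get_cms_files(name):
    # Would shell out to dasgoclient to list the files of a DAS dataset.
    raise NotImplementedError("CMS getter not implemented yet")

def get_lhcb_files(name):
    # Would query the LHCb Bookkeeping service.
    raise NotImplementedError("LHCb getter not implemented yet")

GETTERS = {"cms": get_cms_files, "lhcb": get_lhcb_files}

def resolve_dataset(dataset):
    """Look up and call the getter for a dataset spec like the YAML above."""
    try:
        getter = GETTERS[dataset["type"]]
    except KeyError:
        raise ValueError(f"unsupported dataset type: {dataset.get('type')!r}")
    return getter(dataset["name"])
```

Keeping the type-to-getter mapping in one table would make adding further experiment backends a local change.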

(2) The getters would also differ, answering the question "how do I get all files for a dataset, or the 17th file, or files from the 23rd to the 347th position?" E.g. for CMS it will be dasgoclient, but for CMS Open Data it can be cernopendata-client, which can already do things like:

```console
$ cernopendata-client get-file-locations --recid 10
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0029E804-C77C-E011-BA94-00215E22239A.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/00A398F9-CA7C-E011-8841-00215E221782.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0215E914-8C77-E011-8E1B-00215E2217E2.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02186E3C-D277-E011-8A05-00215E21D516.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02BEF921-A777-E011-AC08-00215E93D738.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02C9E2AD-A477-E011-BD4F-00215E222022.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02E8E884-2078-E011-A3DC-00215E93C4A8.root
...

$ cernopendata-client get-file-locations --recid 10 --protocol root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0029E804-C77C-E011-BA94-00215E22239A.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/00A398F9-CA7C-E011-8841-00215E221782.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0215E914-8C77-E011-8E1B-00215E2217E2.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02186E3C-D277-E011-8A05-00215E21D516.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02BEF921-A777-E011-AC08-00215E93D738.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02C9E2AD-A477-E011-BD4F-00215E222022.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02E8E884-2078-E011-A3DC-00215E93C4A8.root
...
```

We'd just enrich the client to accept not only a dataset's recid and doi but also its name.
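The positional access mentioned in (2) (the 17th file, or files from the 23rd to the 347th position) reduces to simple 1-based slicing once a getter has returned the full file list; a minimal sketch (the helper name is hypothetical):

```python
def select_files(files, start=1, end=None):
    """Return files from 1-based position start..end inclusive (whole list by default)."""
    end = len(files) if end is None else end
    return files[start - 1:end]

# 17th file only:            select_files(files, 17, 17)
# files 23 through 347:      select_files(files, 23, 347)
```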

(3) Finally, the scatter/gather around files, processing them in batches of say ~10 at a time, can be handled by the workflow engine. Note that the workflow engine may need to interact closely with REANA for the concept of "files belonging to a dataset".

@tiborsimko
Member

CC @lukasheinrich, who has also been introducing a dataset concept to Yadage workflows...

@clelange

@tiborsimko Thanks for breaking this down into smaller pieces. I think it would be good to see how we can make things work rather quickly while at the same time avoiding design errors.

re (1): defining a dataset type is definitely needed. For now, however, I would only implement cms and leave everything else unimplemented.

re (2): I'm not sure we need to take into account subsets of a single dataset. In general, one always wants to process a full dataset, and if that is not desired, users can fall back to providing individual file names instead. This step would therefore only have to return a list of file names. For simulation, one would also need to know the total number of events (or events per file; the former can be obtained from DAS in the CMS case), and for data the integrated luminosity, but this is something one needs to set by hand at the moment. And usually analyses perform internal (ac)counting by writing the number of events processed directly into the output file, so I think it's really just about returning the full file list here.

re (3): We can probably take some inspiration from twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab, but initially offering a "number of files per job" option should be good enough.

Do you think this would become too dependent on the workflow engine? Should we initially implement the logic described above by hand (e.g. using some shell/python scripts)?
