Introduce concept of datasets #279
(1) This requires introducing, alongside ... The value would be a string; however, that would not be enough, since CMS datasets are handled in one way (and read from DAS), while LHCb datasets are handled in another way (and read from Bookkeeping). So we'd need some options (and then some getters) that would differ per experiment. For example:
(2) The getters would also differ, answering the question "how do I get all files for a dataset, or the 17th file, or the files from the 23rd to the 347th position?" E.g. for CMS it would be:
$ cernopendata-client get-file-locations --recid 10
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0029E804-C77C-E011-BA94-00215E22239A.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/00A398F9-CA7C-E011-8841-00215E221782.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0215E914-8C77-E011-8E1B-00215E2217E2.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02186E3C-D277-E011-8A05-00215E21D516.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02BEF921-A777-E011-AC08-00215E93D738.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02C9E2AD-A477-E011-BD4F-00215E222022.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02E8E884-2078-E011-A3DC-00215E93C4A8.root
...
$ cernopendata-client get-file-locations --recid 10 --protocol root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0029E804-C77C-E011-BA94-00215E22239A.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/00A398F9-CA7C-E011-8841-00215E221782.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/0215E914-8C77-E011-8E1B-00215E2217E2.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02186E3C-D277-E011-8A05-00215E21D516.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02BEF921-A777-E011-AC08-00215E93D738.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02C9E2AD-A477-E011-BD4F-00215E222022.root
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/MuOnia/AOD/Apr21ReReco-v1/0000/02E8E884-2078-E011-A3DC-00215E93C4A8.root
...
We'd just enrich the client to accept not only dataset's ...

(3) Finally, scatter/gather around files, processing them in batches of, say, ~10 at a time, can be handled by the workflow engine. Note that the workflow engine may need to interact closely with REANA for the concept of "files belonging to the dataset".
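To make the getter question in (2) concrete, here is a minimal Python sketch of position-based selection over a file list. Everything in it is an assumption for illustration: `select_files` is a hypothetical helper, not part of cernopendata-client or REANA, and the file names are placeholders standing in for whatever a DAS or Bookkeeping lookup would return. Positions are 1-based and inclusive, matching the "17th file" and "23rd to 347th" wording above.

```python
from typing import List, Optional


def select_files(
    files: List[str], first: Optional[int] = None, last: Optional[int] = None
) -> List[str]:
    """Return the files at 1-based positions first..last (inclusive).

    Hypothetical helper: with no bounds given, the full list is returned.
    """
    start = 1 if first is None else first
    stop = len(files) if last is None else last
    if start < 1 or stop < start:
        raise ValueError(f"invalid range: {start}..{stop}")
    # Convert the 1-based inclusive range into a Python slice.
    return files[start - 1:stop]


# Placeholder file list; a real getter would query DAS (CMS) or Bookkeeping (LHCb).
files = [f"root://example.org//data/file_{i:04d}.root" for i in range(1, 501)]

all_files = select_files(files)          # the whole dataset
seventeenth = select_files(files, 17, 17)  # only the 17th file
middle = select_files(files, 23, 347)    # files 23 to 347
```

The experiment-specific part would then be limited to producing `files`; the selection logic itself could stay common.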
CC @lukasheinrich, who has also been introducing a dataset concept to Yadage workflows...
@tiborsimko Thanks for breaking this down into smaller pieces. I think it would be good to see how we can make things work rather quickly while at the same time avoiding design errors.

re (1): defining a data set

re (2): I'm not sure we need to take subsets of a single data set into account. In general, one always wants to process a full data set, and if this is not desired, users can fall back to providing individual file names instead. This step would therefore only have to return a list of file names. For simulation, one would need to know the total number of events (or events per file; the former can be obtained from DAS in the CMS case), and for data the integrated luminosity, but this is something one needs to set by hand at the moment. And analyses usually perform internal (ac)counting by writing the number of events processed directly into the output file, so I think it's really just about returning the full file list here.

re (3): We can probably take some inspiration from twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab, but initially offering an option "number of files per job" should be good enough. Do you think this would become too dependent on the workflow engine? Should we initially implement the logic described above by hand (e.g. using some shell/python scripts)?
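As a rough idea of what the "by hand" variant of re (3) might look like, the following Python sketch splits a full file list into jobs given a "number of files per job" option. The function name `batch_files` and the file names are hypothetical; nothing here is an existing REANA or CRAB interface.

```python
from typing import List


def batch_files(files: List[str], files_per_job: int) -> List[List[str]]:
    """Split the full file list into consecutive batches, one batch per job.

    Hypothetical helper: the last batch may be shorter than files_per_job.
    """
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    return [
        files[start:start + files_per_job]
        for start in range(0, len(files), files_per_job)
    ]


# Placeholder input: 25 files split into jobs of 10 files each.
files = [f"file_{i}.root" for i in range(25)]
jobs = batch_files(files, 10)

for index, job_files in enumerate(jobs):
    print(f"job {index}: {len(job_files)} files")
```

A script like this stays independent of the workflow engine: it only produces per-job file lists, which could later be handed to a scatter step.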
Some analyses have hundreds of input files. A `dataset` field in the workflow would allow the user to specify the path to the dataset (for example a CMS DAS path like `/QCD_Pt-15to7000_TuneCP5_Flat2018_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15_ext1-v1/AODSIM`). REANA could pick up the names of the files (in the above example ~3500 ROOT files) and ... `batch_size`, where the users get to specify themselves how many files go into one job.

@clelange
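For illustration only, here is a small Python sketch of how a `dataset` value and a `batch_size` might be interpreted. It assumes the usual three-part CMS DAS path layout (primary dataset, processed dataset, data tier); the names `DatasetPath`, `parse_das_path` and `number_of_jobs` are hypothetical, the batch size of 10 is an arbitrary example, and the file count of 3500 is taken from the rough figure quoted above.

```python
import math
from typing import NamedTuple


class DatasetPath(NamedTuple):
    primary: str    # primary dataset name
    processed: str  # processed dataset name
    tier: str       # data tier, e.g. AODSIM


def parse_das_path(path: str) -> DatasetPath:
    """Split a DAS-style path '/primary/processed/tier' into its parts."""
    parts = path.strip().split("/")
    # A valid path starts with '/' and has exactly three non-empty components.
    if len(parts) != 4 or parts[0] != "" or not all(parts[1:]):
        raise ValueError(f"not a DAS-style dataset path: {path!r}")
    return DatasetPath(*parts[1:])


def number_of_jobs(n_files: int, batch_size: int) -> int:
    """Number of jobs needed when each job processes up to batch_size files."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(n_files / batch_size)


dataset = parse_das_path(
    "/QCD_Pt-15to7000_TuneCP5_Flat2018_13TeV_pythia8/"
    "RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15_ext1-v1/AODSIM"
)
n_jobs = number_of_jobs(3500, 10)  # ~3500 files, 10 files per job (example value)
```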