Importing information about job input files #56
Comments
I agree that we should split the job input files the way you proposed.
This should uniquely identify jobs in the use cases that I can currently think of, and it contains the information we would need to look up the job's input-file-related ClassAd, e.g. in a script that creates the input file |
Sounds like a good idea. The only thing we should still think about is whether this is actually good in terms of reusability. Do we want specifically tailored input file information per job, or should it be possible to reuse the same input file configuration for different exports? |
I'm not sure what you mean by
We could use the |
I am thinking at a more abstract level. Try to imagine that you want to compare different types of jobs, e.g. HTC and HPC, that have the same input files. So I could imagine having the same input file configuration for jobs but two different exports of jobs that I use. So you would do two different simulations: 1) with the HTC import and the input file configuration and 2) with the HPC import and the very same input file configuration. If the input file configuration would utilise the Just thinking and questioning aloud :) |
Ok, in this case you are right and I don't think that there is an identifier that we could export from HTCondor that is reusable and we should do the identification you proposed or something similar. |
So the question is whether to use a relative identifier or to have an additional mapping. @maxfischer2781: any opinion from your side? |
Have I understood correctly that you would put the job's relative identifier into the additional input file? |
Yes! So the exported files from HTCondor etc. should / must stay untouched. |
I agree.
To me this sounds more like the config file that defines lapis mechanisms like which caching algorithm should be used or something similar and not like a config file that contains additional information of what is passed to lapis. Or did you plan to include "technical" information into this file? |
I meant in the discussion here to make clear what we are talking about :) How it looks in reality will probably evolve over time while we decide on what to actually put in there. So maybe it is going to be a caching-related config/information file. |
Sorry for the late reply. Some comments: Jobs having a list of input files is totally fine for HTCondor. As with everything in HTC, they are optional and of variable format, however. While we could make the parser ultra-configurable -- I don't see how that would be practical. So I'm in favour of having some plugin/annotation mechanism. ✅ Being able to annotate multiple data sets (e.g. HPC + HTC) seems interesting, but very complicated and we will probably not use it much. So I'd be in favour of focusing on a less-powerful but easier to use/implement plugin mechanism for the time being. |
@maxfischer2781, could you please review the description of this ticket and check whether I remember the decisions we made last week correctly? Thanks in advance :) |
Rough summary of what we talked about offline: Every job has a job identifier. This MAY be externally specified, otherwise the importer MUST derive an identifier by enumerating all jobs starting at 0. The
Information on input file usage belongs to separate annotation files that SHALL use the job identifier to extend respective jobs. The job identifier MUST NOT be optional in these files, and is used to identify jobs from the main
|
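A minimal Python sketch of the identifier rule summarised above. The function names `assign_job_ids` and `annotate_jobs` and the `"job_id"` and `"input_files"` keys are hypothetical and not part of lapis; the sketch only assumes that imported jobs and annotations are available as dictionaries. Silently ignoring annotations that reference unknown jobs is likewise an assumption, not something decided in this thread.

```python
from typing import Any, Dict, Iterable, List


def assign_job_ids(
    jobs: Iterable[Dict[str, Any]], id_key: str = "job_id"
) -> List[Dict[str, Any]]:
    """Return copies of the jobs in which every job carries an identifier.

    An externally specified identifier is kept. Otherwise the position of
    the job in the import, counted from 0, is used as its identifier.
    """
    identified = []
    for index, job in enumerate(jobs):
        job = dict(job)  # leave the imported record untouched
        if job.get(id_key) is None:
            job[id_key] = index
        identified.append(job)
    return identified


def annotate_jobs(
    jobs: List[Dict[str, Any]],
    annotations: Iterable[Dict[str, Any]],
    id_key: str = "job_id",
) -> List[Dict[str, Any]]:
    """Attach input file annotations to the jobs they reference.

    The job identifier is mandatory in annotations; a missing one is an error.
    Annotations for unknown jobs are ignored (assumption of this sketch).
    """
    by_id = {job[id_key]: job for job in jobs}
    for annotation in annotations:
        if annotation.get(id_key) is None:
            raise ValueError("annotation without job identifier")
        job = by_id.get(annotation[id_key])
        if job is not None:
            job.setdefault("input_files", []).append(annotation)
    return jobs
```

Note that mixing external and derived identifiers in one import could produce collisions; the sketch does not guard against that.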
The description and the last comment contain what we talked about offline and describe well what we should do. The only thing I'm unsure about is the choice of job identifier. I remember that we agreed that the HTCondor |
I think the summary in comment #56 (comment) suits well. Iff no |
Introduction
To simulate caching, we need information about the input files a job requires, plus further metadata such as the size of each input file.
We currently have a working solution, which is great!
The information about input files for jobs is attached to a modified HTCondor job export in JSON format.
For the future, I would like to change the approach a bit.
Issue
I do see three issues here:
Proposed solution
I would like to propose separating the information about the assignment of required input files by putting it into a separate file. This file should be in csv format (originally proposed: json) and should follow the specification that @tfesenbecker introduced, see #51 (comment). However, job-specific information like the requested number of cores or memory should be omitted from this file. Only input file-specific information should be included. Further, we should introduce another field to reference a specific job. This would then be a kind of lapis configuration file that could even include more information.
The exports from HTCondor currently don't have information about a name or identifier for jobs.
Question to the experts (@maxfischer2781, @tfesenbecker): is it possible to also export an identifier? Therefore, the job input file should also contain the id of jobs to make proper references. Otherwise I would take the line count to create a job specifier. The job on the first line gets id 1, the job on the next line gets id 2, and so on. As several files can be imported, we could even extend this to the format <input-nr>-<job-nr>, so we can even mix different formats, e.g. SWF and HTCondor.
When a job is skipped because of wrong parameters, I would still count it, so that we have a defined id for each of the following jobs.
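A small Python sketch of the line-count based identifier described above. The function `enumerate_jobs` and its `is_valid` callback are hypothetical names for illustration; the sketch assumes each import yields one record per line and that the input number is supplied by the caller.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple


def enumerate_jobs(
    records: Iterable[Dict[str, Any]],
    input_nr: int,
    is_valid: Callable[[Dict[str, Any]], bool] = lambda record: True,
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (identifier, record) pairs for the jobs of one imported file.

    Identifiers follow "<input-nr>-<job-nr>", with job numbers starting at 1.
    Invalid records are skipped but still consume a job number, so the ids
    of all following jobs stay stable.
    """
    for job_nr, record in enumerate(records, start=1):
        if not is_valid(record):
            continue
        yield f"{input_nr}-{job_nr}", record
```

Because the job number is tied to the line position rather than to the count of accepted jobs, tightening or relaxing the validity check later does not shift the identifiers that an annotation file refers to.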
The header fields for the additional input file should be named JobID, URL, size, and used.
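As an illustration, such a file could be read with Python's standard `csv` module. This is only a sketch: the function name `read_input_files`, the example rows, and the assumption that `size` and `used` are plain numbers (their units are not specified here) are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_input_files(stream):
    """Group the rows of an input file annotation CSV by their JobID."""
    files_by_job = defaultdict(list)
    for row in csv.DictReader(stream):
        files_by_job[row["JobID"]].append(
            {
                "URL": row["URL"],
                # assumption: size and used are numeric, units unspecified
                "size": float(row["size"]),
                "used": float(row["used"]),
            }
        )
    return dict(files_by_job)


# Hypothetical example content using the proposed header fields.
example = io.StringIO(
    "JobID,URL,size,used\n"
    "1-1,/store/a.root,10.0,2.5\n"
    "1-1,/store/b.root,4.0,4.0\n"
    "1-2,/store/a.root,10.0,10.0\n"
)
print(read_input_files(example))
```

Keeping one row per job and file means the same file can appear for several jobs, which matches the idea of reusing one input file configuration across different job exports.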
See #56 (comment) for a complete writeup.
Call for discussion
Did I miss something obvious in the proposal described above? Does anyone have a better idea? Does this fit with other configuration options we need to import, e.g. information about storage elements (see #53)?
Any feedback is welcome!