
RFC #13

Author: A.Casajús

Last Modified: 2013-05-06

Motivation

Up until now there have been two ways to split jobs in DIRAC:

  1. Client-side splitting: the client generates n jobs and submits them to DIRAC.
  • Pro: the client can do whatever splitting it needs.
  • Con: it takes a lot of time.
  • Con: it requires n separate job submissions.
  • Con: results cannot be cached.
  2. JobManager splitting: as soon as DIRAC's JobManager receives the job, it divides the job as required and returns a list of job ids.
  • Pro: the client doesn't have to do the splitting.
  • Pro: the client receives all the job ids at submission time.
  • Con: it takes a lot of time and slows down other submissions.
  • Con: it is very restrictive in what can be done.
  • Con: the functionality is difficult to extend.

We'd like to get the best of both worlds: fast job submission, extensible splitting mechanisms, and the ability to take advantage of knowledge that DIRAC already has to speed up the splitting.

Goal of the proposal

Users should be able to send a job, define how it has to be divided and let DIRAC do the splitting. The job manifest should include all the necessary information for DIRAC to know how to split it. But instead of the JobManager dividing the job, it should be stored in the JobDB and divided asynchronously.

DIRAC will not know a priori how many jobs will be generated. That means that users will not know all the job ids at submission time, but they will be able to request that information once the job has been divided.

The job splitting will be done by specific modules, each of which divides the job in a different way. Users will define which module they want DIRAC to use when dividing their job. For instance, one module can handle parametric jobs just like the JobManager does today, while another can divide the Input Data based on where it is stored... Any DIRAC extension can provide its own modules if needed.

Proposed implementation

When a job is received by the JobManager, it will do minimal checks, store it into the JobDB and return the resulting job id. It will not do any splitting and will always return one job id if the submission has been successful.

Jobs will define which module has to be used when splitting them by defining a Splitter option in the manifest. The value of this option is the name of the module to use. Splitter=Parametric will use the ParametricSplitter module to divide it.
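As an illustration, a manifest requesting parametric splitting could look like the following JDL-style fragment. All fields except Splitter are hypothetical; the actual options each module understands are defined by that module:

```
[
  Executable = "myScript.sh";
  Splitter = "Parametric";
  Parameters = { "1", "2", "3" };
]
```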

Since each module can require the job to be in a different step of the optimization chain, they have to define after which optimizer they can work. For instance, the ParametricSplitter can divide the job right after the JobPath since it does not require any extra information. On the other hand a splitter module that requires Input Data information will have to run after the InputDataResolution optimizer.
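A minimal sketch of how splitter modules could declare their position in the chain, assuming a hypothetical base class with an AFTER_OPTIMIZER attribute. None of these names are actual DIRAC API:

```python
# Sketch: each splitter module declares after which optimizer it can run.
# Class and attribute names are illustrative, not actual DIRAC code.

class BaseSplitter:
    # Name of the optimizer after which this splitter can work
    AFTER_OPTIMIZER = "JobPath"

    def splitManifest(self, manifest):
        """Return a list of manifests derived from the original one."""
        raise NotImplementedError

class ParametricSplitter(BaseSplitter):
    # Needs no extra information, so it can run right after JobPath
    AFTER_OPTIMIZER = "JobPath"

    def splitManifest(self, manifest):
        # One manifest per entry in the (hypothetical) Parameters option
        return [dict(manifest, Parameter=p) for p in manifest.get("Parameters", [])]

class InputDataSplitter(BaseSplitter):
    # Needs replica information, so it must run after InputDataResolution
    AFTER_OPTIMIZER = "InputDataResolution"
```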

Once the JobPath optimizer gets a job, it will check whether the Splitter option is defined in the manifest. If it is not there, it will proceed as usual. If it is defined, the JobPath will load the requested module (ParametricSplitter in the example) and check after which optimizer it can work. At that position, the JobPath will insert the new Splitter optimizer. As an example:

For Jobs with Splitter=Parametric, since ParametricSplitter can work after the JobPath optimizer:

Before:
OptimizationChain: JobPath -> JobSanity -> JobScheduling

After:
OptimizationChain: JobPath -> Splitter -> JobSanity -> JobScheduling
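The chain manipulation above can be sketched as follows; insert_splitter is a hypothetical helper, not DIRAC code:

```python
# Sketch: insert the Splitter optimizer into the chain right after the
# optimizer required by the splitter module.

def insert_splitter(chain, after_optimizer):
    """Return a new chain with 'Splitter' inserted after `after_optimizer`."""
    pos = chain.index(after_optimizer) + 1
    return chain[:pos] + ["Splitter"] + chain[pos:]

chain = ["JobPath", "JobSanity", "JobScheduling"]
# ParametricSplitter declares it can work after JobPath:
new_chain = insert_splitter(chain, "JobPath")
# -> ["JobPath", "Splitter", "JobSanity", "JobScheduling"]
```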

Once the job reaches the Splitter optimizer, the requested splitting module will be loaded and used to split the job manifest into as many manifests as needed. Then all the manifests will be sent to the Optimization Mind that will generate all the required jobs and start them into the optimization chain.
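A minimal sketch of this flow, with a hypothetical module registry and a plain callback standing in for the submission to the Optimization Mind:

```python
# Sketch of the Splitter optimizer's flow: check that the requested
# splitting module exists, run the manifest through it, and submit the
# resulting manifests to the Mind. Registry and names are illustrative.

def run_splitter(manifest, registry, mind_submit):
    """Split `manifest` and submit each resulting manifest via `mind_submit`."""
    name = manifest.get("Splitter")
    if name not in registry:
        raise ValueError("Unknown splitter module: %s" % name)
    manifests = registry[name](manifest)
    for m in manifests:
        mind_submit(m)
    return len(manifests)

# A trivial parametric splitter used only for this illustration:
def parametric_split(manifest):
    return [dict(manifest, Parameter=p) for p in manifest.get("Parameters", [])]

submitted = []
n = run_splitter(
    {"Splitter": "Parametric", "Parameters": ["a", "b"]},
    registry={"Parametric": parametric_split},
    mind_submit=submitted.append,
)
# n == 2; `submitted` holds the two per-parameter manifests
```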

JobDB changes

We need a way to track all the jobs in a herd (all the jobs that come from a split). To do so we will use the MasterJobID and JobSplitType attributes. By default, all jobs have their own job id as MasterJobID and JobSplitType=Single. If the Splitter option is defined, the JobManager will set the JobSplitType to WillSplit.

Once a list of manifests comes back from the Splitter optimizer, the Optimization Mind will generate all the required jobs. All the new jobs will have the generating job id as their MasterJobID. The generating manifest will be stored as a MasterJDL in the JobDB.

An important note: the originating job will become the first of the new set of jobs; it will be replaced completely. For instance, if the splitter returns three manifests, the first one will be assigned to the originating job and two new jobs will be generated. This prevents regenerating the same herd of jobs if the originating job gets rescheduled. It also means we don't need to invent a new set of states for this job, because there are no special cases.
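The herd generation described above can be sketched as follows; generate_herd and the id-allocation scheme are illustrative, not the actual JobDB logic:

```python
# Sketch: turn a list of manifests into a herd of jobs. The first manifest
# is reassigned to the originating job, so rescheduling it does not
# regenerate the herd. All names and the id scheme are illustrative.

def generate_herd(master_job_id, manifests, next_free_id):
    """Return a list of (jobId, masterJobId, manifest) tuples."""
    # The originating job is reused as the first job of the herd
    herd = [(master_job_id, master_job_id, manifests[0])]
    for offset, manifest in enumerate(manifests[1:]):
        herd.append((next_free_id + offset, master_job_id, manifest))
    return herd

jobs = generate_herd(101, [{"p": 1}, {"p": 2}, {"p": 3}], next_free_id=200)
# Three jobs: 101 (the originating job) plus new jobs 200 and 201,
# all with MasterJobID = 101
```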

Manifest changes

Each splitter module can do different things. For instance, the ParametricSplitter will just make a list of jobs based on a list of parameters, the InputDataBySeSplitter will take the list of Input Data and generate jobs with a maximum number of Input Data files in the same SE...
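As a sketch of what an InputDataBySeSplitter might do, assuming replica information that maps each LFN to a single SE (a simplification), files could be grouped by SE and then chunked:

```python
# Sketch: group input files by the SE that hosts them, then chunk each
# group so no job gets more than `max_files` files. Purely illustrative.

from collections import defaultdict

def split_by_se(replicas, max_files):
    """replicas: dict mapping LFN -> SE name. Returns lists of LFNs, one per job."""
    by_se = defaultdict(list)
    for lfn, se in sorted(replicas.items()):
        by_se[se].append(lfn)
    jobs = []
    for se in sorted(by_se):
        lfns = by_se[se]
        for i in range(0, len(lfns), max_files):
            jobs.append(lfns[i:i + max_files])
    return jobs

replicas = {"f1": "SE-A", "f2": "SE-A", "f3": "SE-B", "f4": "SE-A"}
# With max_files=2: [["f1", "f2"], ["f4"], ["f3"]]
```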

Since splitter modules can define new options in the manifest, there must be a way for users to define where these new options have to be used.

To do so, a new Optimizer has to be created. This optimizer will look at the manifest, check that the requested splitting module exists, run the manifest through it, and submit the resulting manifests to the Mind to be stored in the JobDB as new jobs.
