-
Notifications
You must be signed in to change notification settings - Fork 1
Home
Welcome to the Distributed Nodeworks
wiki!
Support the development of a distributed workflow (i.e., coordinated execution of multiple tasks performed by different computers) by developing a prototype task management server. This server will accept a diverse range of workflows from an application like Nodeworks ( https://mfix.netl.doe.gov/nodeworks/ ) and manage a collection of task queues that are populated based on the workflow requirements. End points, or workers, running on distributed hardware will accept, execute, and publish results from the tasks back to the server so the next highest priority task in the workflow can be executed.
Recommended prerequisites: Familiarity with intermediate to advanced Python, Git, database server concepts and other tools like Flask. Also some patience & willingness to explore new things.
Deliverables:
- Flask server with a REST API that accepts new users, generates tokens, and accepts Nodeworks workflow files (json).
- Server populates task queues based on individual resource requirements in Nodeworks files, ensuring that the workflows are executed in the correct order based on the dependencies of the workflow.
- End points that continuously poll task queues, and execute jobs (bash, Python, SLURM batch queuing, etc.) in the queue on distributed resources both local and on the cloud (AWS, GCP, AZURE, etc.)
- React front end for displaying job status, endpoints, and queue tasks.
- React front end that displays and allows editing of the workflow.
- React front end that displays estimated start time of execution or completion of the complete workflow based on historical execution data and user prescribed workflow requirements.
The Orchestrator is the main communication point. It stores and executes the workflow DAGs, allowing authenticated users to submit edits. Each change to the DAG is tracked using git, providing git hashes that are directly tied to workflow artifacts, enabling provenance of the workflow results.
The Orchestrator also determines the execution order and populates the queue. Tasks that have similar runner requirements are grouped together as much as possible to reduce runner overhead.
Resources (similar projects)
The queue contains a collection of tasks that the runners need to execute. This queue is populated and managed by the orchestrator
Resources
The database will store the workflows, git history, log files, and job artifacts (although it is recognized that job artifacts could be large).
A feature rich web UI is not anticipated for the Orchestrator. Instead, a comprehensive API will be used.
call | action |
---|---|
/api/v1/status | get the status of the Orchestrator |
/api/v1/workflow/get | download the selected workflow |
/api/v1/workflow/publish | upload the selected workflow |
/api/v1/workflow/execute | execute the selected workflow |
/api/v1/workflow/cancel | stop execution of the selected workflow |
/api/v1/workflow/status | get the status of the selected workflow |
/api/v1/workflow/history | get the git history of the selected workflow |
/api/v1/workflow/delete | delete the selected workflow |
/api/v1/queue/status | get the status of the queue |
/api/v1/queue/clear | remove all tasks from the queue |
/api/v1/runners/status | get the status of the runners |
/api/v1/runners/register | register a runner |
/api/v1/runners/cancel | cancel runner job |
/api/v1/runners/shutdown | shutdown a runner |
/api/v1/events/status | get the status of events |
/api/v1/events/push | push a new event |
/api/v1/timers/status | get the status of a timer |
/api/v1/timers/start | start a new timer |
/api/v1/admin/cancel_all | cancel all workflows, clear all tasks from the queue |
/api/v1/admin/shutdown | cancel all workflows and shutdown |
Runners are small programs that will handle communications with the Orchestrator and queue(s). Following a similar model as the gitlab-runners, runners will be registered with the Orchestrator. Once the runner is registered, it will periodically check the queue for new jobs. If a new job is present, the runner will start the executor.
Runners will have a collection of tags describing the environment and the executor. These tags will be used to decide what jobs to execute with which runner.
The executor will actually handle running the job. Several executors will be available to support different jobs:
- Shell - execute a list of commands in a shell
- Docker - start a selected docker container and execute a list of commands in that container
- Nodeworks - create a nodeworks conda environment and execute the node in that environment. This could also be containerized
- Kubernetes
A web application using REACT will be created to provide an intuitive interface for visualizing, editing, and displaying workflows and the execution progress.
Resources