Home
Welcome to the Distributed Nodeworks wiki!
Support the development of a distributed workflow (i.e., coordinated execution of multiple tasks performed by different computers) by developing a prototype task management server. This server will accept a diverse range of workflows from an application like Nodeworks ( https://mfix.netl.doe.gov/nodeworks/ ) and manage a collection of task queues populated from the workflow requirements. Endpoints, or workers, running on distributed hardware will accept tasks, execute them, and publish results back to the server so that the next highest-priority task in the workflow can be executed.
Recommended prerequisites: familiarity with intermediate-to-advanced Python, Git, database server concepts, and tools like Flask, plus some patience and a willingness to explore new things.
Deliverables:
- Flask server with a REST API that accepts new users, generates tokens, and accepts Nodeworks workflow files (JSON).
- Server populates task queues based on the resource requirements in the Nodeworks files, ensuring that tasks execute in the correct order given the workflow's dependencies.
- Endpoints that continuously poll the task queues and execute queued jobs (bash, Python, SLURM batch queuing, etc.) on distributed resources, both local and in the cloud (AWS, GCP, Azure, etc.).
- React front end for displaying job status, endpoints, and queued tasks.
- React front end that displays and allows editing of the workflow.
- React front end that displays estimated start and completion times of the complete workflow based on historical execution data and user-prescribed workflow requirements.
The Orchestrator is the main communication point. It stores and executes the workflow DAGs, allowing authenticated users to submit edits. Each change to a DAG is tracked with git, producing hashes that are tied directly to workflow artifacts and thereby provide provenance for the workflow results.
The Orchestrator also determines the execution order and populates the queue. Tasks that have similar runner requirements are grouped together as much as possible to reduce runner overhead.
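The ordering-and-grouping step can be sketched with the standard library's `graphlib`. The task records and the `"local"`/`"hpc"` requirement tags are made-up examples; the point is that tasks are released in dependency order, with same-requirement tasks that become ready together sorted next to each other.

```python
from graphlib import TopologicalSorter

# Hypothetical task records: name -> (dependencies, runner requirement tag).
TASKS = {
    "fetch": ((),         "local"),
    "mesh":  (("fetch",), "hpc"),
    "solve": (("mesh",),  "hpc"),
    "plot":  (("solve",), "local"),
}

def queue_order(tasks):
    """Yield tasks in dependency order, grouping tasks with the same
    runner requirement together whenever the DAG allows it."""
    ts = TopologicalSorter({name: deps for name, (deps, _) in tasks.items()})
    ts.prepare()
    order = []
    while ts.is_active():
        ready = list(ts.get_ready())
        # Among tasks that are simultaneously ready, group by requirement
        # tag so a runner can take several similar tasks in a row.
        ready.sort(key=lambda name: tasks[name][1])
        for name in ready:
            order.append(name)
            ts.done(name)
    return order
```

A production scheduler would also weigh queue depth and runner availability, but the DAG constraint handling is the core of the ordering problem.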
The queue holds the collection of tasks that the runners need to execute; it is populated and managed by the Orchestrator.
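Conceptually the queue is a priority queue that the Orchestrator fills and the runners drain. A minimal in-process sketch (a real deployment would back this with a database or message broker; the task names and priorities are invented):

```python
import queue

# (priority, task_name) pairs; a lower number means higher priority.
task_queue = queue.PriorityQueue()
task_queue.put((0, "solve"))
task_queue.put((1, "plot"))

def next_task():
    # A runner polls for the highest-priority pending task,
    # or None when the queue is empty.
    try:
        return task_queue.get_nowait()[1]
    except queue.Empty:
        return None
```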
The database will store the workflows, git history, log files, and job artifacts (recognizing that job artifacts could be large).
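One hypothetical way to lay that out, sketched here with SQLite: workflows keyed by name, with the git hash linking each stored version to its artifacts and logs. The table and column names are illustrative assumptions, not a prescribed schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE workflows (
    name       TEXT PRIMARY KEY,
    git_hash   TEXT NOT NULL,      -- ties this version to its git commit
    definition TEXT NOT NULL       -- the workflow JSON
);
CREATE TABLE artifacts (
    id       INTEGER PRIMARY KEY,
    workflow TEXT REFERENCES workflows(name),
    git_hash TEXT,                 -- provenance: which commit produced it
    path     TEXT                  -- large artifacts stay on disk/object store
);
CREATE TABLE logs (
    id         INTEGER PRIMARY KEY,
    workflow   TEXT,
    message    TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
con.execute("INSERT INTO workflows VALUES (?, ?, ?)", ("demo", "abc123", "{}"))
row = con.execute("SELECT git_hash FROM workflows WHERE name = ?",
                  ("demo",)).fetchone()
```

Storing artifact paths rather than blobs is one way to keep large job outputs out of the database itself.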
A feature-rich web UI is not anticipated for the Orchestrator; instead, a comprehensive API will be used.
| call | action |
|---|---|
| /api/v1/status | get the status of the Orchestrator |
| /api/v1/workflow/get | download the selected workflow |
| /api/v1/workflow/publish | upload the selected workflow |
| /api/v1/workflow/execute | execute the selected workflow |
| /api/v1/workflow/cancel | stop execution of the selected workflow |
| /api/v1/workflow/status | get the status of the selected workflow |
| /api/v1/workflow/history | get the git history of the selected workflow |
| /api/v1/workflow/delete | delete the selected workflow |
| /api/v1/queue/status | get the status of the queue |
| /api/v1/queue/clear | remove all tasks from the queue |
| /api/v1/runners/status | get the status of the runners |
| /api/v1/runners/register | register a runner |
| /api/v1/runners/cancel | cancel a runner job |
| /api/v1/runners/shutdown | shut down a runner |
| /api/v1/events/status | get the status of events |
| /api/v1/events/push | push a new event |
| /api/v1/timers/status | get the status of a timer |
| /api/v1/timers/start | start a new timer |
| /api/v1/admin/cancel_all | cancel all workflows, clear all tasks from the queue |
| /api/v1/admin/shutdown | cancel all workflows and shut down |
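A runner or external client would drive these endpoints over plain HTTP. A minimal client-side sketch using only the standard library; the base address is an assumption, and only the paths come from the table above.

```python
import json
import urllib.request

BASE = "http://localhost:5000"  # assumed Orchestrator address

def api_url(call: str) -> str:
    # Build a full URL for one of the API calls listed in the table.
    return f"{BASE}/api/v1/{call}"

def get_json(call: str):
    # Hypothetical helper: GET an endpoint and decode its JSON body.
    with urllib.request.urlopen(api_url(call)) as resp:
        return json.load(resp)
```

A runner loop would then alternate `get_json("queue/status")` polls with task execution, pushing results back via the workflow endpoints.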