-
Notifications
You must be signed in to change notification settings - Fork 1
Home
Welcome to the Distributed Nodeworks
wiki!
Support the development of a distributed workflow (i.e., coordinated execution of multiple tasks performed by different computers) by developing a prototype task management server. This server will accept a diverse range of workflows from an application like Nodeworks ( https://mfix.netl.doe.gov/nodeworks/ ) and manage a collection of task queues that are populated based on the workflow requirements. End points, or workers, running on distributed hardware will accept, execute, and publish results from the tasks back to the server so the next highest priority task in the workflow can be executed.
Recommended prerequisites: Familiarity with intermediate to advanced Python, Git, database server concepts and other tools like Flask. Also some patience & willingness to explore new things.
Deliverables:
- Flask server with a REST API that accepts new users, generates tokens, and accepts Nodeworks workflow files (json).
- Server populates task queues based on individual resource requirements in Nodeworks files, ensuring that the workflows are executed in the correct order based on the dependencies of the workflow.
- End points that continuously poll task queues, and execute jobs (bash, Python, SLURM batch queuing, etc.) in the queue on distributed resources both local and on the cloud (AWS, GCP, AZURE, etc.)
- React front end for displaying job status, endpoints, and queue tasks.
- React front end that displays and allows editing of the workflow.
- React front end that displays estimated start time of execution or completion of the complete workflow based on historical execution data and user prescribed workflow requirements.
The Orchestrator is the main communication point. It stores and executes the workflow DAGs, allowing authenticated users to submit edits. Each change to the DAG is tracked using git, providing git hashes that are directly tied to workflow artifacts, enabling provenance of the workflow results.
The Orchestrator also determines the execution order and populates the queue. Tasks that have similar runner requirements are grouped together as much as possible to reduce runner overhead.
Resources (similar projects)
A simple Directed acyclic graph (DAG) solver will need to be developed to figure out the order which the nodes should be executed in (based on the individual dependencies of each node).
Need a better name for these files. Ideas?
The files that describe the nodes and the connections are json files. They contain all the information required to load a workflow. There are two primary objects in the files: nodes and connections.
Nodes look like this:
{
'type': 'Node', # type: Node or Connection
'name': <str>,
'uniquename': <str>,
'pos': [x position <float>,
y position <float>],
'path': [<str>,], # path to the actual node; ['numpy', 'array']
'terminals': {}, # dict of terminals
'tunnel': <bool>,
'forcerun': <bool>,
'hidden': <bool>,
'customState': {}, # node specific information
'layer': <int>
}
Connections look like this:
{
"type": "Connection", # type: Node or Connection
"name": <str>,
"line": "cubic", # line style: line/cubic
"uniquename": <str>,
"input": [ # input connection
<str>, # node unique name
<str> # terminal name
],
"output": [ # output connection
<str>, # node unique name
<str> # terminal name
],
"controlpoints": [], # list of control points to move the line
"feedback": false
}
The queue contains a collection of tasks that the runners need to execute. This queue is populated and managed by the orchestrator
Resources
The database will store the workflows, log files, and job artifacts (although it is recognized that job artifacts could be large).
A feature rich web UI is not anticipated for the Orchestrator. Instead, a comprehensive API will be used so that both a web app and Nodeworks desktop app can interact with the server.
call | action | access |
---|---|---|
/api/v1/status | get the status of the Orchestrator | public |
/api/v1/workflow/get | download the selected workflow | user |
/api/v1/workflow/publish | upload the selected workflow | user |
/api/v1/workflow/execute | execute the selected workflow | user |
/api/v1/workflow/cancel | stop execution of the selected workflow | user |
/api/v1/workflow/status | get the status of the selected workflow | user |
/api/v1/workflow/history | get the git history of the selected workflow | user |
/api/v1/workflow/delete | delete the selected workflow | user |
/api/v1/queue/status | get the status of the queue | user |
/api/v1/queue/clear | remove all tasks from the queue | user |
/api/v1/runners/status | get the status of the runners | user |
/api/v1/runners/register | register a runner | user |
/api/v1/runners/cancel | cancel runner job | user |
/api/v1/runners/shutdown | shutdown a runner | user |
/api/v1/events/status | get the status of events | user |
/api/v1/events/push | push a new event | user |
/api/v1/timers/status | get the status of a timer | user |
/api/v1/timers/start | start a new timer | user |
/api/v1/admin/cancel_all | cancel all workflows, clear all tasks from the queue | admin |
/api/v1/admin/shutdown | cancel all workflows and shutdown | admin |
Runners are small programs that will handle communications with the Orchestrator and queue(s). Following a similar model as the gitlab-runners, runners will be registered with the Orchestrator. Once the runner is registered, it will periodically check the queue for new jobs. If a new job is present, the runner will start the executor.
Runners will have a collection of tags describing the environment and the executor. These tags will be used to decide what jobs to execute with which runner.
The executor will actually handle running the job. Several executors will be available to support different jobs:
- Shell - execute a list of commands in a shell
- Docker - start a selected docker container and execute a list of commands in that container
- Nodeworks - create a nodeworks conda environment and execute the node in that environment. This could also be containerized
- Kubernetes
A web application using REACT will be created to provide an intuitive interface for visualizing, editing, and displaying workflows and the execution progress. The application will need to:
- allow users to signup/login
- get a user specific token for API access to the flask server (for the Nodeworks desktop app)
- See a list of workflows
- open and visualize a workflow (stretch: edit the workflow)
- Execute the workflow (submit to the flask server)
- Visualize the execution progress
- register Executors
REACT node libraries
- https://www.npmjs.com/package/react-node-graph
- https://github.com/uber/react-digraph
- https://reactflow.dev/
Resources
- PLynx - REACT app with workflow visualization
- flask + react examples