This repository has been archived by the owner on Jun 7, 2024. It is now read-only.
Justin Weber edited this page Jul 9, 2021 · 17 revisions

Welcome to the Distributed Nodeworks wiki!

Description

Support the development of a distributed workflow (i.e., coordinated execution of multiple tasks performed by different computers) by developing a prototype task-management server. This server will accept a diverse range of workflows from an application such as Nodeworks ( https://mfix.netl.doe.gov/nodeworks/ ) and manage a collection of task queues populated from the workflow requirements. Endpoints, or workers, running on distributed hardware will accept tasks, execute them, and publish the results back to the server so that the next highest-priority task in the workflow can be executed.

Recommended prerequisites: familiarity with intermediate to advanced Python, Git, database server concepts, and tools such as Flask, plus some patience and a willingness to explore new things.

Deliverables

  • Flask server with a REST API that accepts new users, generates tokens, and accepts Nodeworks workflow files (JSON).
  • Server populates task queues based on the individual resource requirements in the Nodeworks files, ensuring that workflows are executed in the correct order according to their dependencies.
  • Endpoints that continuously poll the task queues and execute the queued jobs (bash, Python, SLURM batch queuing, etc.) on distributed resources, both local and in the cloud (AWS, GCP, Azure, etc.).
  • React front end for displaying job status, endpoints, and queued tasks.
  • React front end that displays and allows editing of the workflow.
  • React front end that displays the estimated start and completion times of the complete workflow, based on historical execution data and user-prescribed workflow requirements.
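To make the first two deliverables concrete, the sketch below parses a workflow file and extracts each task's dependencies, which is the information the server needs to order execution. The schema shown (a top-level "tasks" list with "name", "command", and "depends_on" fields) is a hypothetical stand-in, not the actual Nodeworks file format.

```python
import json

def parse_workflow(raw):
    """Parse a workflow JSON document into tasks keyed by name.

    NOTE: this schema is an illustrative assumption, not the real
    Nodeworks format.
    """
    doc = json.loads(raw)
    tasks = {}
    for task in doc["tasks"]:
        tasks[task["name"]] = {
            "command": task["command"],
            "depends_on": task.get("depends_on", []),
        }
    return tasks

example = """
{
  "tasks": [
    {"name": "mesh",  "command": "python mesh.py"},
    {"name": "solve", "command": "sbatch solve.sh", "depends_on": ["mesh"]},
    {"name": "plot",  "command": "python plot.py",  "depends_on": ["solve"]}
  ]
}
"""

workflow = parse_workflow(example)
print(sorted(workflow))  # ['mesh', 'plot', 'solve']
```

From the `depends_on` lists the server can build the dependency graph and decide which tasks are ready to queue.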

Specification

Orchestrator

The Orchestrator is the main communication point. It stores and executes the workflow DAGs, allowing authenticated users to submit edits. Each change to the DAG is tracked using git, providing git hashes that are directly tied to workflow artifacts, enabling provenance of the workflow results.
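Because git hashes are derived purely from content, any stored workflow revision can be tied to its artifacts without extra bookkeeping. As an illustration, the hash git assigns to a file's contents (a "blob") can be computed directly; this mirrors what `git hash-object` prints for the same bytes:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Compute the git blob hash of some file contents.

    Git hashes the header "blob <size>\\0" followed by the raw bytes,
    so identical workflow files always yield identical hashes.
    """
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

workflow_json = b'{"tasks": []}'
print(git_blob_hash(workflow_json))
```

Storing this hash alongside each job's results gives exactly the provenance link described above: the results point back to the precise workflow revision that produced them.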

The Orchestrator also determines the execution order and populates the queue. Tasks that have similar runner requirements are grouped together as much as possible to reduce runner overhead.
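A minimal sketch of that scheduling step, using the standard-library `graphlib` topological sorter: tasks become ready once their dependencies finish, and within each ready set, tasks are grouped by a (hypothetical) runner-requirement label so they can share a queue.

```python
from graphlib import TopologicalSorter
from itertools import groupby

# Hypothetical workflow DAG: task -> set of prerequisite tasks.
deps = {
    "mesh":  set(),
    "solve": {"mesh"},
    "plot":  {"solve"},
    "stats": {"solve"},
}
# Hypothetical runner requirement per task (e.g. which worker type it needs).
requires = {"mesh": "cpu", "solve": "hpc", "plot": "cpu", "stats": "cpu"}

ts = TopologicalSorter(deps)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready(), key=requires.get)
    # Group same-requirement tasks so one runner type can take the batch,
    # reducing per-runner overhead.
    for req, group in groupby(ready, key=requires.get):
        batches.append((req, list(group)))
    ts.done(*ready)

print(batches)
```

Here "plot" and "stats" become ready together after "solve" and land in one "cpu" batch; a real scheduler would also weigh queue depth and runner availability.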

Resources (similar projects)

Queue

The queue contains the collection of tasks that the runners need to execute. It is populated and managed by the Orchestrator.
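The producer/consumer relationship can be sketched in-process with the standard-library `queue` module; a real runner would instead poll the Orchestrator over the REST API, but the shape is the same.

```python
import queue
import subprocess

# The Orchestrator fills the queue; runners drain it.
task_queue = queue.Queue()

# Hypothetical task records pushed by the Orchestrator.
for cmd in (["echo", "mesh done"], ["echo", "solve done"]):
    task_queue.put({"command": cmd})

results = []
while not task_queue.empty():
    task = task_queue.get()
    # Execute the job and publish its output back (here: collect locally).
    out = subprocess.run(task["command"], capture_output=True, text=True)
    results.append(out.stdout.strip())
    task_queue.task_done()

print(results)  # ['mesh done', 'solve done']
```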

Resources

Database

The database will store the workflows, git history, log files, and job artifacts (although it is recognized that job artifacts could be large).
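One possible shape for that storage, sketched here with an in-memory SQLite database purely for illustration; the real system would likely use a database server, and large job artifacts might be stored outside the database with only a reference kept in it.

```python
import sqlite3

# Hypothetical schema sketch, not a finalized design.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE workflows (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    git_hash   TEXT NOT NULL,   -- ties the stored DAG to its git revision
    definition TEXT NOT NULL    -- workflow JSON
);
CREATE TABLE jobs (
    id          INTEGER PRIMARY KEY,
    workflow_id INTEGER NOT NULL REFERENCES workflows(id),
    status      TEXT NOT NULL DEFAULT 'queued',
    log         TEXT,           -- log file contents
    artifact    BLOB            -- job artifact (could be large)
);
""")
con.execute(
    "INSERT INTO workflows (name, git_hash, definition) VALUES (?, ?, ?)",
    ("demo", "0" * 40, '{"tasks": []}'),
)
rows = con.execute("SELECT name, git_hash FROM workflows").fetchall()
print(rows)
```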

Proposed REST API

A feature-rich web UI is not anticipated for the Orchestrator; instead, a comprehensive REST API will be used.

call                          action
/api/v1/status                get the status of the Orchestrator
/api/v1/workflow/get          download the selected workflow
/api/v1/workflow/publish      upload the selected workflow
/api/v1/workflow/execute      execute the selected workflow
/api/v1/workflow/cancel       stop execution of the selected workflow
/api/v1/workflow/status       get the status of the selected workflow
/api/v1/workflow/history      get the git history of the selected workflow
/api/v1/workflow/delete       delete the selected workflow
/api/v1/queue/status          get the status of the queue
/api/v1/queue/clear           remove all tasks from the queue
/api/v1/runners/status        get the status of the runners
/api/v1/runners/register      register a runner
/api/v1/runners/cancel        cancel a runner's job
/api/v1/runners/shutdown      shut down a runner
/api/v1/events/status         get the status of events
/api/v1/events/push           push a new event
/api/v1/timers/status         get the status of a timer
/api/v1/timers/start          start a new timer
/api/v1/admin/cancel_all      cancel all workflows and clear all tasks from the queue
/api/v1/admin/shutdown        cancel all workflows and shut down
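A thin client sketch for these endpoints, using only the standard library; the base URL and bearer-token header are assumptions, since the token scheme is not yet specified.

```python
from urllib.parse import urljoin
from urllib.request import Request

class OrchestratorClient:
    """Builds authenticated requests for the Orchestrator API (sketch)."""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/") + "/"
        self.token = token

    def request(self, endpoint: str) -> Request:
        # endpoint is one of the calls above, e.g. "/api/v1/workflow/status".
        url = urljoin(self.base_url, endpoint.lstrip("/"))
        return Request(url, headers={"Authorization": f"Bearer {self.token}"})

# Hypothetical host and token, for illustration only.
client = OrchestratorClient("https://orchestrator.example", "my-token")
req = client.request("/api/v1/workflow/status")
print(req.full_url)  # https://orchestrator.example/api/v1/workflow/status
```

Sending `req` with `urllib.request.urlopen` (or swapping in a library such as `requests`) would then talk to the running server.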