- Showcase the combination of Dask and Argo Workflows to dynamically scale a computational workload
- Provide a basic Argo Workflows installation applicable to production-grade Kubernetes clusters
- The set-up has been tested on AWS EKS and would likely work for similar Kubernetes providers
- The set-up might work for a local Kubernetes installation, such as Docker Desktop or k3s (tested on an M3 Pro Mac with 18GB RAM)
- Package a Dask data pipeline into a Docker container
- Create an Argo Workflows WorkflowTemplate and the related resources required to scale out the Dask pipeline on Kubernetes (a minimal template sketch follows this list)
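For orientation, here is a minimal sketch of what such a WorkflowTemplate could look like. The template name `windy-city` and namespace `customer` match the URL used in the quickstart below; the image name, command, and scheduler address are illustrative assumptions rather than this repo's actual manifests.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: windy-city            # matches the template name used in the quickstart
  namespace: customer
spec:
  entrypoint: run-pipeline
  templates:
    - name: run-pipeline
      container:
        image: windy-city-pipeline:latest    # hypothetical tag for the Dockerised Dask pipeline
        command: ["python", "pipeline.py"]   # hypothetical entrypoint script
        env:
          - name: DASK_SCHEDULER_ADDRESS     # assumed address of the pre-existing Dask Scheduler
            value: tcp://dask-scheduler:8786
```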
This project includes a Dask data pipeline which showcases a simple set-up of the Futures interface; a minimal sketch of this pattern follows the list below. The pipeline will:
- Connect to a pre-existing Dask Scheduler
- Consider a set of timeseries weather data for major cities in Spain
- Submit, for each timestamp, a data-processing task to the available Dask Workers. Each task:
  - Takes a single timestamp argument and extracts the windspeed data at that timestamp for each city
  - Identifies the city with the highest windspeed
  - Returns that city's name
- Count the observations where each city had the fastest windspeed
- Report the city which is most often the windiest
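The sketch below illustrates this Futures pattern; it is not the repo's actual code. The scheduler address, the sample data, and the function name `windiest_city` are assumptions for illustration.

```python
from collections import Counter

import pandas as pd
from dask.distributed import Client


def windiest_city(timestamp, windspeeds: pd.DataFrame) -> str:
    """Return the name of the city with the highest windspeed at `timestamp`."""
    # One row per timestamp, one column per city, values are windspeeds.
    return windspeeds.loc[timestamp].idxmax()


if __name__ == "__main__":
    # Connect to a pre-existing Dask Scheduler rather than spawning a local cluster.
    client = Client("tcp://dask-scheduler:8786")

    # Stand-in for the real timeseries weather data for major Spanish cities.
    index = pd.date_range(start="2024-01-01", end="2024-01-02", periods=24)
    windspeeds = pd.DataFrame(
        {"Madrid": range(24), "Barcelona": range(24, 0, -1), "Valencia": [10] * 24},
        index=index,
    )

    # One task per timestamp; each future resolves to a city name.
    futures = [client.submit(windiest_city, ts, windspeeds) for ts in index]

    # Count how often each city had the fastest windspeed and report the winner.
    counts = Counter(client.gather(futures))
    print("Most often the windiest:", counts.most_common(1)[0][0])
```

Submitting one lightweight task per timestamp keeps each unit of work small, which is what lets Dask spread the job across however many workers the workflow scales out.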
With `kubectl` installed and Docker Desktop running with Kubernetes enabled:
- Run `make install`
- Go to http://localhost:2746/workflow-templates/customer/windy-city
- Click **Submit**
Pipekit is the control plane for Argo Workflows. Platform teams use Pipekit to manage data & CI pipelines at scale, while giving developers self-serve access to Argo. Pipekit's unified logging view, enterprise-grade RBAC, and multi-cluster management capabilities lower maintenance costs for platform teams while delivering a superior developer experience for Argo users. Sign up for a 30-day free trial at pipekit.io/signup.
Learn more about Pipekit's professional support for companies already using Argo at pipekit.io/services.