doc: Documentation improvements (#102)
pvanliefland authored Jan 4, 2024
1 parent 65568b9 commit 69366fc
Showing 33 changed files with 991 additions and 451 deletions.
16 changes: 6 additions & 10 deletions .pre-commit-config.yaml
@@ -1,11 +1,7 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.0.280
    hooks:
      - id: ruff
  - repo: https://github.com/psf/black
    rev: 23.7.0
    hooks:
      - id: black
        language_version: python3
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.8
    hooks:
      - id: ruff
        args: [ --fix, --exit-non-zero-on-fix ]
      - id: ruff-format
86 changes: 63 additions & 23 deletions README.md
@@ -1,12 +1,46 @@
# OpenHexa Python SDK
<div align="center">
<img alt="OpenHEXA Logo" src="https://raw.githubusercontent.com/BLSQ/openhexa-app/main/hexa/static/img/logo/logo_with_text_grey.svg" height="80">
</div>
<p align="center">
<em>Open-source Data integration platform</em>
</p>
<p align="center">
<a href="https://github.com/BLSQ/openhexa-app/actions/workflows/test.yml">
<img alt="Test Suite" src="https://github.com/BLSQ/openhexa-sdk-python/actions/workflows/ci.yml/badge.svg">
</a>
</p>

The OpenHexa Python SDK is a tool that helps you write code for the OpenHexa platform.
OpenHEXA Python SDK
===================

It is particularly useful to write OpenHexa data pipelines, but can also be used in the OpenHexa notebooks environment.
OpenHEXA is an open-source data integration platform developed by [Bluesquare](https://bluesquarehub.com).

## Quickstart
Its goal is to facilitate data integration and analysis workflows, in particular in the context of public health
projects.

### Writing and deploying pipelines
Please refer to the [OpenHEXA wiki](https://github.com/BLSQ/openhexa/wiki/Home) for more information about OpenHEXA.

This repository contains the code of the OpenHEXA SDK, a library that allows you to write code for the OpenHEXA platform.
It is particularly useful to write OpenHEXA data pipelines, but can also be used in the OpenHEXA notebooks environment.

The OpenHEXA wiki has a section dedicated to the SDK:
[Using the OpenHEXA SDK](https://github.com/BLSQ/openhexa/wiki/Using-the-OpenHEXA-SDK).

For more information about the technical aspects of OpenHEXA, you might be interested in the two following wiki pages:

- [Installing OpenHEXA](https://github.com/BLSQ/openhexa/wiki/Installation-instructions)
- [Technical Overview](https://github.com/BLSQ/openhexa/wiki/Technical-overview)

Requirements
------------

The OpenHEXA SDK requires Python version 3.9 or newer, but it is not yet compatible with Python 3.12 or later versions.

If you want to be able to run pipelines in a containerized environment on your machine, you will need
[Docker](https://www.docker.com/).
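
The Python version constraint stated above can be expressed as a small check. This is only an illustrative sketch: `is_supported_python` is a hypothetical helper, not part of the OpenHEXA SDK, and the bounds simply restate the requirement (3.9 or newer, below 3.12).

```python
import sys

# Bounds restate the requirement above; they are not read from the SDK itself.
MIN_VERSION = (3, 9)
FIRST_UNSUPPORTED = (3, 12)


def is_supported_python(version_info=None) -> bool:
    """Return True if the given (or current) interpreter version is in range."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); patch releases do not matter here.
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < FIRST_UNSUPPORTED


if __name__ == "__main__":
    print(is_supported_python())
```

Running such a check before creating the virtual environment can save a failed install on an unsupported interpreter.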

Quickstart
----------

Here's a super minimal example to get you started. First, create a new directory and a virtual environment:

@@ -17,21 +51,24 @@ python -m venv venv
source venv/bin/activate
```

You can then install the OpenHexa SDK:
You can then install the OpenHEXA SDK:

```shell
pip install --upgrade openhexa.sdk
```

💡 New OpenHEXA SDK versions are released on a regular basis. Don't forget to update your local installations with
`pip install --upgrade` from time to time!

Now that the SDK is installed within your virtual environment, you can use the `openhexa` CLI utility to create
a new pipeline:

```shell
openhexa pipelines init "My awesome pipeline"
```

Great! As you can see in the console output, the OpenHexa CLI has created a new directory, which contains the basic
structure required for an OpenHexa pipeline. You can now `cd` in the new pipeline directory and run the pipeline:
Great! As you can see in the console output, the OpenHEXA CLI has created a new directory, which contains the basic
structure required for an OpenHEXA pipeline. You can now `cd` into the new pipeline directory and run the pipeline:

```shell
cd my_awesome_pipeline
@@ -41,11 +78,11 @@ python pipeline.py
Congratulations! You have successfully run your first pipeline locally.

If you inspect the actual pipeline code, you will see that it doesn't do a lot of things, but it is still a perfectly
valid OpenHexa pipeline.
valid OpenHEXA pipeline.

Let's publish to an actual OpenHexa workspace so that it can run online.
Let's publish to an actual OpenHEXA workspace so that it can run online.

Using the OpenHexa web interface, within a workspace, navigate to the Pipelines tab and click on "Create".
Using the OpenHEXA web interface, within a workspace, navigate to the Pipelines tab and click on "Create".

Copy the command displayed in the popup and paste it into your terminal:

@@ -62,20 +99,17 @@ openhexa pipelines push
```

As it is the first time, the CLI will ask you to confirm the creation operation. After confirmation the console will
output the link to the pipeline screen in the OpenHexa interface.

You can now open the link and run the pipeline using the OpenHexa web interface.
output the link to the pipeline screen in the OpenHEXA interface.

### Using the SDK in the notebooks environment
You can now open the link and run the pipeline using the OpenHEXA web interface.

TBC
Contributing
------------

## Contributing
The following sections explain how you can set up a local development environment if you want to participate in the
development of the SDK.

The following sections explain how you can setup a local development environment if you want to participate to the
development of the SDK

### Development setup
### SDK development setup

Install the SDK in editable mode:

@@ -85,7 +119,13 @@ source venv/bin/activate # Activate the venv
pip install -e ".[dev]" # Necessary to be able to run the openhexa CLI
```

### Using a local installation of the OpenHexa backend to run pipelines
### Using a local installation of OpenHEXA to run pipelines

While it is possible to run pipelines locally using only the SDK, if you want to run OpenHEXA in a more realistic
setting you will need to install the OpenHEXA app and frontend components. Please refer to the
[installation instructions](https://github.com/BLSQ/openhexa/wiki/Installation-instructions) for more information.

You can then configure the OpenHEXA CLI to connect to your local backend:

```shell
openhexa config set_url http://localhost:8000
@@ -95,7 +135,7 @@ Notes: you can monitor the status of your pipelines using http://localhost:8000/

### Running the tests

Run the tests using pytest:
You can run the tests using pytest:

```shell
pytest
11 changes: 9 additions & 2 deletions examples/pipelines/logistic_stats/pipeline.py
@@ -1,3 +1,5 @@
"""Simple module for a sample logistic pipeline."""

import json
import typing
from io import BytesIO
@@ -26,6 +28,7 @@
)
@parameter("oul", name="Organisation unit level", type=int, default=2)
def logistic_stats(deg: str, periods: str, oul: int):
    """Run a basic logistic stats pipeline."""
    dhis2_data = dhis2_download(deg, periods, oul)
    gadm_data = gadm_download()
    worldpop_data = worldpop_download()
@@ -34,7 +37,8 @@ def logistic_stats(deg: str, periods: str, oul: int):


@logistic_stats.task
def dhis2_download(data_element_group: str, periods: str, org_unit_level: int) -> typing.Dict[str, typing.Any]:
def dhis2_download(data_element_group: str, periods: str, org_unit_level: int) -> dict[str, typing.Any]:
    """Download DHIS2 data."""
    connection = workspace.dhis2_connection("dhis2-play")
    base_url = f"{connection.url}/api"
    session = requests.Session()
@@ -61,6 +65,7 @@ def dhis2_download(data_element_group: str, periods: str, org_unit_level: int) -

@logistic_stats.task
def gadm_download():
    """Download administrative boundaries data from UCDavis."""
    url = "https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_SLE.gpkg"
    r = requests.get(url, timeout=30)

@@ -69,6 +74,7 @@ def gadm_download():

@logistic_stats.task
def worldpop_download():
    """Download population data from worldpop.org."""
    base_url = "https://data.worldpop.org/"
    url = f"{base_url}GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/SLE/sle_ppp_2020_UNadj_constrained.tif"
    r = requests.get(url)
@@ -77,7 +83,8 @@ def worldpop_download():


@logistic_stats.task
def model(dhis2_data: typing.Dict[str, typing.Any], gadm_data, worldpop_data):
def model(dhis2_data: dict[str, typing.Any], gadm_data, worldpop_data):
    """Build a basic data model."""
    # Load DHIS2 data
    dhis2_df = pd.DataFrame(dhis2_data["rows"], columns=[h["column"] for h in dhis2_data["headers"]])
    dhis2_df = dhis2_df.rename(columns={"Data": "Data element id", "Organisation unit": "Organisation unit id"})
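
The annotation change in this file (`typing.Dict[str, typing.Any]` to `dict[str, typing.Any]`) relies on built-in generic types, which are available from Python 3.9 onwards. A minimal sketch follows, using a made-up payload whose `headers` and `rows` keys mirror what `model` reads; `column_names` is a hypothetical helper, not part of the pipeline.

```python
import typing


def column_names(dhis2_data: dict[str, typing.Any]) -> list[str]:
    """Extract column names from the headers, as the `model` task does."""
    return [h["column"] for h in dhis2_data["headers"]]


# Illustrative payload only; a real DHIS2 response contains more fields.
sample = {
    "headers": [{"column": "Data"}, {"column": "Organisation unit"}, {"column": "Value"}],
    "rows": [["abc", "ou1", "12"]],
}

print(column_names(sample))
```

Because `dict[...]` and `list[...]` are subscriptable at runtime on 3.9+, no `typing.Dict` or `typing.List` import is needed for these annotations.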
Empty file.
8 changes: 8 additions & 0 deletions examples/pipelines/simple_io/pipeline.py
@@ -1,3 +1,5 @@
"""Simple module for a sample IO pipeline."""

import json

import pandas as pd
@@ -9,6 +11,7 @@

@pipeline("simple-io", name="Simple IO")
def simple_io():
    """Run a simple IO pipeline."""
    # Read and write from/to workspace files
    raw_files_data = load_files_data()
    transform_and_write_files_data(raw_files_data)
@@ -23,13 +26,15 @@ def simple_io():

@simple_io.task
def load_files_data():
    """Load data from workspace filesystem."""
    current_run.log_info("Loading files data...")

    return pd.read_csv(f"{workspace.files_path}/raw.csv")


@simple_io.task
def transform_and_write_files_data(raw_data: pd.DataFrame):
    """Simulate a transformation on the provided dataframe and write data to workspace filesystem."""
    current_run.log_info("Transforming files data...")

    transformed_data = raw_data.copy()
@@ -41,6 +46,7 @@ def transform_and_write_files_data(raw_data: pd.DataFrame):

@simple_io.task
def load_data_from_postgresql() -> pd.DataFrame:
    """Perform a simple SELECT query in the workspace database."""
    current_run.log_info("Loading Postgres data...")

    engine = create_engine(workspace.database_url)
@@ -50,6 +56,7 @@ def load_data_from_postgresql() -> pd.DataFrame:

@simple_io.task
def transform_and_write_sql_data(raw_data: pd.DataFrame):
    """Simulate a transform operation on the provided data and load it in the workspace database."""
    current_run.log_info("Transforming postgres data...")

    engine = create_engine(workspace.database_url)
@@ -61,6 +68,7 @@ def transform_and_write_sql_data(raw_data: pd.DataFrame):

@simple_io.task
def load_dhis2_data():
    """Load data from DHIS2."""
    current_run.log_info("Loading DHIS2 data...")

    connection = workspace.dhis2_connection("dhis2-play")
2 changes: 2 additions & 0 deletions openhexa/cli/__init__.py
@@ -1,3 +1,5 @@
"""CLI package."""

from .cli import app

__all__ = ["app"]
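
Most of the Python changes in this commit add module and function docstrings. One way to spot remaining gaps is a small script built on the standard-library `ast` module. This is a sketch rather than the project's tooling (the ruff hook in `.pre-commit-config.yaml` may already cover this, depending on which rules are enabled), and `missing_docstrings` is a hypothetical name.

```python
import ast


def missing_docstrings(source: str) -> list[str]:
    """Return names of the module, classes, and functions lacking a docstring."""
    tree = ast.parse(source)
    missing = []
    # The module itself is reported under a placeholder name.
    if ast.get_docstring(tree) is None:
        missing.append("<module>")
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing


# Small inline sample; in practice the source would be read from a file.
example = '''"""CLI package."""

def documented():
    """Do something."""

def undocumented():
    pass
'''

print(missing_docstrings(example))
```

The function works on source text rather than imported modules, so it can be run on pipeline files without executing them.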