doc: Documentation improvements (#102)

BLSQ · Jan 4, 2024 · 69366fc
1 parent 65568b9
commit 69366fc

Showing 33 changed files with 991 additions and 451 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -1,11 +1,7 @@
 repos:
-  - repo: https://github.com/astral-sh/ruff-pre-commit
-  # Ruff version.
-    rev: v0.0.280
-    hooks:
-      - id: ruff
-  - repo: https://github.com/psf/black
-    rev: 23.7.0
-    hooks:
-      - id: black
-        language_version: python3
+- repo: https://github.com/astral-sh/ruff-pre-commit
+  rev: v0.1.8
+  hooks:
+    - id: ruff
+      args: [ --fix, --exit-non-zero-on-fix ]
+    - id: ruff-format
diff --git a/README.md b/README.md
@@ -1,12 +1,46 @@
-# OpenHexa Python SDK
+<div align="center">
+   <img alt="OpenHEXA Logo" src="https://raw.githubusercontent.com/BLSQ/openhexa-app/main/hexa/static/img/logo/logo_with_text_grey.svg" height="80">
+</div>
+<p align="center">
+    <em>Open-source Data integration platform</em>
+</p>
+<p align="center">
+   <a href="https://github.com/BLSQ/openhexa-app/actions/workflows/test.yml">
+      <img alt="Test Suite" src="https://github.com/BLSQ/openhexa-sdk-python/actions/workflows/ci.yml/badge.svg">
+   </a>
+</p>
 
-The OpenHexa Python SDK is a tool that helps you write code for the OpenHexa platform.
+OpenHEXA Python SDK
+===================
 
-It is particularly useful to write OpenHexa data pipelines, but can also be used in the OpenHexa notebooks environment.
+OpenHEXA is an open-source data integration platform developed by [Bluesquare](https://bluesquarehub.com).
 
-## Quickstart
+Its goal is to facilitate data integration and analysis workflows, in particular in the context of public health 
+projects.
 
-### Writing and deploying pipelines
+Please refer to the [OpenHEXA wiki](https://github.com/BLSQ/openhexa/wiki/Home) for more information about OpenHEXA.
+
+This repository contains the code of the OpenHEXA SDK, a library that allows you to write code for the OpenHEXA platform.
+It is particularly useful to write OpenHEXA data pipelines, but can also be used in the OpenHEXA notebooks environment.
+
+The OpenHEXA wiki has a section dedicated to the SDK: 
+[Using the OpenHEXA SDK](https://github.com/BLSQ/openhexa/wiki/Using-the-OpenHEXA-SDK).
+
+For more information about the technical aspects of OpenHEXA, you might be interested in the two following wiki pages:
+
+- [Installing OpenHEXA](https://github.com/BLSQ/openhexa/wiki/Installation-instructions)
+- [Technical Overview](https://github.com/BLSQ/openhexa/wiki/Technical-overview)
+
+Requirements
+------------
+
+The OpenHEXA SDK requires Python version 3.9 or newer, but it is not yet compatible with Python 3.12 or later versions.
+
+If you want to be able to run pipelines in a containerized environment on your machine, you will need
+[Docker](https://www.docker.com/).
+
+Quickstart
+----------
 
 Here's a super minimal example to get you started. First, create a new directory and a virtual environment:
 
@@ -17,21 +51,24 @@ python -m venv venv
 source venv/bin/activate
 ```
 
-You can then install the OpenHexa SDK:
+You can then install the OpenHEXA SDK:
 
 ```shell
 pip install --upgrade openhexa.sdk
 ```
 
+💡 New OpenHEXA SDK versions are released on a regular basis. Don't forget to update your local installations with
+`pip install --upgrade` from time to time!
+
 Now that the SDK is installed within your virtual environment, you can use the `openhexa` CLI utility to create
 a new pipeline:
 
 ```shell
 openhexa pipelines init "My awesome pipeline"
 ```
 
-Great! As you can see in the console output, the OpenHexa CLI has created a new directory, which contains the basic 
-structure required for an OpenHexa pipeline. You can now `cd` in the new pipeline directory and run the pipeline:
+Great! As you can see in the console output, the OpenHEXA CLI has created a new directory, which contains the basic 
+structure required for an OpenHEXA pipeline. You can now `cd` in the new pipeline directory and run the pipeline:
 
 ```shell
 cd my_awesome_pipeline
@@ -41,11 +78,11 @@ python pipeline.py
 Congratulations! You have successfully run your first pipeline locally.
 
 If you inspect the actual pipeline code, you will see that it doesn't do a lot of things, but it is still a perfectly 
-valid OpenHexa pipeline.
+valid OpenHEXA pipeline.
 
-Let's publish to an actual OpenHexa workspace so that it can run online.
+Let's publish to an actual OpenHEXA workspace so that it can run online.
 
-Using the OpenHexa web interface, within a workspace, navigate to the Pipelines tab and click on "Create".
+Using the OpenHEXA web interface, within a workspace, navigate to the Pipelines tab and click on "Create".
 
 Copy the command displayed in the popup in your terminal:
 
@@ -62,20 +99,17 @@ openhexa pipelines push
 ```
 
 As it is the first time, the CLI will ask you to confirm the creation operation. After confirmation the console will 
-output the link to the pipeline screen in the OpenHexa interface.
-
-You can now open the link and run the pipeline using the OpenHexa web interface.
+output the link to the pipeline screen in the OpenHEXA interface.
 
-### Using the SDK in the notebooks environment
+You can now open the link and run the pipeline using the OpenHEXA web interface.
 
-TBC
+Contributing
+------------
 
-## Contributing
+The following sections explain how you can set up a local development environment if you want to participate in the
+development of the SDK.
 
-The following sections explain how you can setup a local development environment if you want to participate to the 
-development of the SDK
-
-### Development setup
+### SDK development setup
 
 Install the SDK in editable mode:
 
@@ -85,7 +119,13 @@ source venv/bin/activate # Activate the venv
 pip install -e ".[dev]"  # Necessary to be able to run the openhexa CLI
 ```
 
-### Using a local installation of the OpenHexa backend to run pipelines
+### Using a local installation of OpenHEXA to run pipelines
+
+While it is possible to run pipelines locally using only the SDK, if you want to run OpenHEXA in a more realistic 
+setting you will need to install the OpenHEXA app and frontend components. Please refer to the 
+[installation instructions](https://github.com/BLSQ/openhexa/wiki/Installation-instructions) for more information.
+
+You can then configure the OpenHEXA CLI to connect to your local backend:
 
 ```shell
 openhexa config set_url http://localhost:8000
@@ -95,7 +135,7 @@ Notes: you can monitor the status of your pipelines using http://localhost:8000/
 
 ### Running the tests
 
-Run the tests using pytest:
+You can run the tests using pytest:
 
 ```shell
 pytest
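
Note on the Requirements section added to the README in this commit: it states that the SDK needs Python 3.9 or newer but is not yet compatible with Python 3.12 or later. A minimal sketch of that range check follows. The `is_supported_python` helper is hypothetical, written only for illustration, and is not part of the SDK.

```python
import sys


def is_supported_python(version: tuple[int, int]) -> bool:
    """Return True when (major, minor) falls in the range the README states."""
    # Assumption: bounds are taken from the README text (>= 3.9 and < 3.12).
    return (3, 9) <= version < (3, 12)


# Check the interpreter that is running this snippet.
current = sys.version_info[:2]
status = "supported" if is_supported_python(current) else "outside the documented range"
print(f"Python {current[0]}.{current[1]} is {status}")
```

Running this inside the virtual environment before `pip install --upgrade openhexa.sdk` gives a quick hint on whether the interpreter matches the documented range.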

diff --git a/examples/pipelines/logistic_stats/pipeline.py b/examples/pipelines/logistic_stats/pipeline.py
@@ -1,3 +1,5 @@
+"""Simple module for a sample logistic pipeline."""
+
 import json
 import typing
 from io import BytesIO
@@ -26,6 +28,7 @@
 )
 @parameter("oul", name="Organisation unit level", type=int, default=2)
 def logistic_stats(deg: str, periods: str, oul: int):
+    """Run a basic logistic stats pipeline."""
     dhis2_data = dhis2_download(deg, periods, oul)
     gadm_data = gadm_download()
     worldpop_data = worldpop_download()
@@ -34,7 +37,8 @@ def logistic_stats(deg: str, periods: str, oul: int):
 
 
 @logistic_stats.task
-def dhis2_download(data_element_group: str, periods: str, org_unit_level: int) -> typing.Dict[str, typing.Any]:
+def dhis2_download(data_element_group: str, periods: str, org_unit_level: int) -> dict[str, typing.Any]:
+    """Download DHIS2 data."""
     connection = workspace.dhis2_connection("dhis2-play")
     base_url = f"{connection.url}/api"
     session = requests.Session()
@@ -61,6 +65,7 @@ def dhis2_download(data_element_group: str, periods: str, org_unit_level: int) -
 
 @logistic_stats.task
 def gadm_download():
+    """Download administrative boundaries data from UCDavis."""
     url = "https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_SLE.gpkg"
     r = requests.get(url, timeout=30)
 
@@ -69,6 +74,7 @@ def gadm_download():
 
 @logistic_stats.task
 def worldpop_download():
+    """Download population data from worldpop.org."""
     base_url = "https://data.worldpop.org/"
     url = f"{base_url}GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/SLE/sle_ppp_2020_UNadj_constrained.tif"
     r = requests.get(url)
@@ -77,7 +83,8 @@ def worldpop_download():
 
 
 @logistic_stats.task
-def model(dhis2_data: typing.Dict[str, typing.Any], gadm_data, worldpop_data):
+def model(dhis2_data: dict[str, typing.Any], gadm_data, worldpop_data):
+    """Build a basic data model."""
     # Load DHIS2 data
     dhis2_df = pd.DataFrame(dhis2_data["rows"], columns=[h["column"] for h in dhis2_data["headers"]])
     dhis2_df = dhis2_df.rename(columns={"Data": "Data element id", "Organisation unit": "Organisation unit id"})
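
Note on the annotation change in this file: `typing.Dict[str, typing.Any]` becomes the built-in `dict[str, typing.Any]`. Built-in generics are subscriptable at runtime from Python 3.9 onward (PEP 585), which lines up with the minimum version stated in the README. A minimal sketch follows; the `summarize` helper is hypothetical and does not appear in the commit.

```python
import typing


def summarize(payload: dict[str, typing.Any]) -> list[str]:
    """Return the sorted keys of a response-like mapping (hypothetical helper)."""
    return sorted(payload)


# The built-in generic alias can be introspected like its typing.Dict counterpart.
hints = typing.get_type_hints(summarize)
payload_origin = typing.get_origin(hints["payload"])  # the plain dict class
payload_args = typing.get_args(hints["payload"])  # (str, typing.Any)

# Keys mirror the "rows" and "headers" fields read by model() in this file.
keys = summarize({"rows": [], "headers": []})
```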

diff --git a/examples/pipelines/logistic_stats/tests/__init__.py b/examples/pipelines/logistic_stats/tests/__init__.py
diff --git a/examples/pipelines/simple_io/pipeline.py b/examples/pipelines/simple_io/pipeline.py
@@ -1,3 +1,5 @@
+"""Simple module for a sample IO pipeline."""
+
 import json
 
 import pandas as pd
@@ -9,6 +11,7 @@
 
 @pipeline("simple-io", name="Simple IO")
 def simple_io():
+    """Run a simple IO pipeline."""
     # Read and write from/to workspace files
     raw_files_data = load_files_data()
     transform_and_write_files_data(raw_files_data)
@@ -23,13 +26,15 @@ def simple_io():
 
 @simple_io.task
 def load_files_data():
+    """Load data from workspace filesystem."""
     current_run.log_info("Loading files data...")
 
     return pd.read_csv(f"{workspace.files_path}/raw.csv")
 
 
 @simple_io.task
 def transform_and_write_files_data(raw_data: pd.DataFrame):
+    """Simulate a transformation on the provided dataframe and write data to workspace filesystem."""
     current_run.log_info("Transforming files data...")
 
     transformed_data = raw_data.copy()
@@ -41,6 +46,7 @@ def transform_and_write_files_data(raw_data: pd.DataFrame):
 
 @simple_io.task
 def load_data_from_postgresql() -> pd.DataFrame:
+    """Perform a simple SELECT query in the workspace database."""
     current_run.log_info("Loading Postgres data...")
 
     engine = create_engine(workspace.database_url)
@@ -50,6 +56,7 @@ def load_data_from_postgresql() -> pd.DataFrame:
 
 @simple_io.task
 def transform_and_write_sql_data(raw_data: pd.DataFrame):
+    """Simulate a transform operation on the provided data and load it in the workspace database."""
     current_run.log_info("Transforming postgres data...")
 
     engine = create_engine(workspace.database_url)
@@ -61,6 +68,7 @@ def transform_and_write_sql_data(raw_data: pd.DataFrame):
 
 @simple_io.task
 def load_dhis2_data():
+    """Load data from DHIS2."""
     current_run.log_info("Loading DHIS2 data...")
 
     connection = workspace.dhis2_connection("dhis2-play")

diff --git a/openhexa/cli/__init__.py b/openhexa/cli/__init__.py
@@ -1,3 +1,5 @@
+"""CLI package."""
+
 from .cli import app
 
 __all__ = ["app"]