Welcome to Plastic Origin ETL repository

Plastic Origin is a project led by Surfrider Europe that aims to identify and quantify plastic pollution in rivers through space and time.

Project Overview

The next sections provide an overview of the ETL principles, as well as guidance to get started using the ETL.

Please note that this project is under active development and that frequent changes and updates happen over time.

ETL Data Management Process

Overview of the ETL process architecture, from blob download to database insertion.

https://user-images.githubusercontent.com/8882133/79349912-1a561780-7f37-11ea-84fa-cd6e12ecf2c8.png
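
For orientation, here is a minimal sketch of these stages as plain Python; every function name and signature below is hypothetical, for illustration only, not the repository's actual API:

    from typing import List

    def extract_media(blob_name: str) -> str:
        """Extract: download the video/GPX blob to local storage."""
        return f"/tmp/{blob_name}"  # hypothetical download location

    def detect_trash(media_path: str) -> List[dict]:
        """Transform: send the media to the AI API, get trash detections back."""
        return [{"label": "bottle", "frame": 42}]  # placeholder detection

    def geotag(detections: List[dict], media_path: str) -> List[dict]:
        """Transform: attach GPS coordinates from the GPX track to each detection."""
        return [dict(d, lat=48.85, lon=2.35) for d in detections]  # placeholder coords

    def load(records: List[dict]) -> None:
        """Load: insert each geotagged trash record into the database."""
        for record in records:
            print("INSERT", record)  # stands in for the actual DB insert

    media = extract_media("sample.mp4")
    load(geotag(detect_trash(media), media))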

Data sources

Four main sources of data:
  • GoPro 1080p, 24 img/s: mp4 format containing video + GPS as GPX tracks
  • Smartphone 1080p, 24 or 30 img/s + GPS in a separate GPX file
  • Smartphone picture + GPS data
  • OSM Tracker data: GPX file containing way-points with labeled trash and their coordinates

See samples of these in the /data folder.
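
Both the OSM Tracker way-points and the GoPro/smartphone tracks are plain GPX, so they can be inspected with the third-party gpxpy library; a minimal sketch (the file path is taken from the samples above):

    import gpxpy  # third-party: pip install gpxpy

    with open("data/osm_tracker/sample.gpx") as f:
        gpx = gpxpy.parse(f)

    # OSM Tracker: way-points carry the trash label as the way-point name.
    for waypoint in gpx.waypoints:
        print(waypoint.name, waypoint.latitude, waypoint.longitude, waypoint.time)

    # GoPro / smartphone: GPS comes as track points sampled along the route.
    for track in gpx.tracks:
        for segment in track.segments:
            for point in segment.points:
                print(point.latitude, point.longitude, point.time)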

Useful resources

Coming soon: links to the DB schema, AI repository, ...

Installation

Prerequisites

• Python 3 distribution

  • Clone the repository:

    $ git clone https://github.com/surfriderfoundationeurope/etl

  • Copy the file example.env, rename it to .env, and set the environment variables (see the illustrative .env sketch below)
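
For illustration, a filled-in .env might look like the sketch below; the variable names come from the Environment variables section further down, and the values are placeholders:

    CON_STRING=<Azure Blob Storage connection string>
    AI_URL=<URL to the AI API>
    PGSERVER=<PG server host>
    PGDATABASE=<PG database name>
    PGUSERNAME=<PG user>
    PGPWD=<PG password>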

Guidelines

Install this package by adding the following lines to your environment.yml file:

name: ...
dependencies:
- pip
- pip:
    - git+ssh://[email protected]/surfriderfoundationeurope/[email protected]#egg=etl
    # or
    - git+https://github.com/surfriderfoundationeurope/[email protected]#egg=etl

where vX.Y.Z is the exact version number that you need.

If you want to develop on Surfrider ETL at the same time as you are developing some other project, clone this repository and install it in development mode:

$ git clone [email protected]:surfriderfoundationeurope/etl.git
$ cd etl
$ pip install -e .

Note that this development mode is not recommended for reproducible analyses because you might end up with a locally modified version that is not available to other people.

How to create a virtual or conda env?

You have the choice between a virtual env and a conda env:

  • conda env
    $ conda env create -f environment.yml
    $ conda activate etl-env
  • virtual env
    $ python3 -m venv etl-env
    $ source etl-env/bin/activate
    $ pip install -r requirements.txt

Environment variables

This environment variable is required if you work with data-source 'azure':

.. envvar:: CON_STRING

  To connect to Azure Blob Storage

This environment variable is required if you need to call the AI:

.. envvar:: AI_URL

  URL to AI API

These environment variables are required if you need to insert trash into the db:

.. envvar:: PGSERVER, PGDATABASE, PGUSERNAME, PGPWD

  Connection info and credentials for the PG database
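
As an illustration of how these variables might be consumed, here is a sketch assuming psycopg2 and azure-storage-blob; the repository's actual connection code may differ:

    import os

    import psycopg2  # pip install psycopg2-binary
    from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

    # Azure Blob Storage client, used with data-source 'azure'
    blob_service = BlobServiceClient.from_connection_string(os.environ["CON_STRING"])

    # PostgreSQL connection, used with target-storage 'postgre'
    connection = psycopg2.connect(
        host=os.environ["PGSERVER"],
        dbname=os.environ["PGDATABASE"],
        user=os.environ["PGUSERNAME"],
        password=os.environ["PGPWD"],
    )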


Usage

Launch CLI

Once your environment etl-env is activated:
$ etl --help
     Usage: etl_cli.py [OPTIONS]

        Run the ETL

        Use command `python etl_cli.py --help` to get a list of the options .

      Options:
        --container TEXT                Name of Azure container to download the data
                                        from.

        --blob TEXT                     Name of Azure blob storage.
        --media TEXT                    Name of media to download the data from.
        --temp-dir DIRECTORY            Path to data directory to download the data.
        --data-dir DIRECTORY            Path to data directory to download the data.
                                        If given, data should be in local storage.
                                        (No download from Azure)

        --data-source [local|azure]     Source of data.
        --target-storage [local|postgre]
                                        Target to store the ETL output. If "local",
                                        will save in csv,  if "postgre", will insert
                                        each trash in Postgre database.

        --ai-url TEXT                   URL of AI. If not given, will be set from
                                        ENV.

        --help                          Show this message and exit.
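
The help text above is typical of a click-based CLI; as a sketch, such options could be declared as follows (simplified, not the repository's exact code):

    import click

    @click.command()
    @click.option("--container", type=str, help="Name of Azure container to download the data from.")
    @click.option("--media", type=str, help="Name of media to download the data from.")
    @click.option("--data-source", type=click.Choice(["local", "azure"]), default="azure", help="Source of data.")
    @click.option("--target-storage", type=click.Choice(["local", "postgre"]), default="postgre", help="Target to store the ETL output.")
    def run_etl(container, media, data_source, target_storage):
        """Run the ETL"""
        click.echo(f"container={container} media={media} source={data_source} target={target_storage}")

    if __name__ == "__main__":
        run_etl()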

Work 100% locally

  • With local data as input

If you wish to work with local data instead of downloading data from Azure blob storage, you need to set options --data-source local --data-dir path/to/your/data/folder

  • Dump output in CSV file

If you wish to dump the ETL result locally, instead of connecting to the Postgre database, you need to set the option --target-storage local; you will then get a file 'etl_output.csv' in the temp-dir directory (see the illustrative sample after the examples below).

  • Example to test locally:
    • On GoPro data: python etl_cli.py --data-source local --media sample.mp4 --data-dir data/gopro/ --target-storage local
    • On OSM Tracker file: python etl_cli.py --data-source local --media sample.gpx --data-dir data/osm_tracker/ --target-storage local
• On Smartphone video: python etl_cli.py --data-source local --media sample.mp4 --data-dir data/smartphone/ --target-storage local
• On Smartphone photo: not yet supported
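
For illustration, the local etl_output.csv could look like the following; the column names and values here are hypothetical, so check your actual output for the exact schema:

    id,label,latitude,longitude,timestamp
    0,bottle,48.8566,2.3522,2020-04-15T10:23:01
    1,fragment,48.8567,2.3524,2020-04-15T10:23:05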

Work In Progress

Docker

First, generate a GitHub private SSH key and store it in /conf.
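
For example, a dedicated key pair can be generated with ssh-keygen (the key file name is an assumption; the public key then has to be registered as a deploy key on the GitHub repository):

    $ ssh-keygen -t rsa -b 4096 -f conf/id_rsa -N ""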

In production

Coming soon: where to get the subscription info, etc.

Documentation

A readthedocs server is coming soon.

Notebooks

There is work in progress [here](https://github.com/surfriderfoundationeurope/etl/tree/master/scripts) to build the script architecture that will then allow deploying the ETL in production. Typically, we target deploying the ETL process on top of Azure Functions to support a serverless deployment architecture, or conversely to leverage open-source solutions like Apache Airflow or Docker containers to make the ETL portable, scalable and event-triggered.

Azure function

After you have installed the [Azure Functions for Python prerequisites](https://docs.microsoft.com/en-us/azure/azure-functions/functions-create-first-azure-function-azure-cli?pivots=programming-language-python&tabs=bash%2Cbrowser) on your local machine, you can run the ETL workflow as a local Azure function. First go to the azfunction folder, then start the local function host:
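
    $ func start

(func start is provided by the Azure Functions Core Tools installed with the prerequisites above.)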

This will run the ETL workflow as a local API using the Azure Functions utilities. You can then navigate to the ETL API endpoint in a browser, or execute the ETL process with a request such as:
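
    $ curl "http://localhost:7071/api/<function_name>"

(The local function host listens on port 7071 by default; the exact route and query parameters depend on the function defined in the azfunction folder, so <function_name> is a placeholder.)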

Please note that the function still needs to run within a Python environment with the ETL prerequisites installed, and that the large video file must be available locally.

Contribute
