[Issue 2130] Update analytics documentation #3562

Merged · 19 commits · Jan 21, 2025
66 changes: 21 additions & 45 deletions analytics/README.md
@@ -2,48 +2,24 @@

## Introduction

This is a command line interface (CLI) tool written in Python that is used to run analytics on operational data for the Simpler.Grants.gov initiative. For a more in-depth discussion of the tools used and the structure of the codebase, view the technical details for the analytics package.

## Project directory structure

Outlines the structure of the analytics codebase, relative to the root of the simpler-grants-gov repo.

```text
root
├── analytics
│   ├── src
│   │   └── analytics
│   │       ├── datasets        Create re-usable data interfaces for calculating metrics
│   │       ├── integrations    Integrate with external systems used to export data or metrics
│   │       └── metrics         Calculate the project's operational metrics
│   ├── tests
│   │   ├── integrations        Integration tests, mostly for src/analytics/integrations
│   │   ├── datasets            Unit tests for src/analytics/datasets
│   │   └── metrics             Unit tests for src/analytics/metrics
│   ├── config.py               Load configurations from environment vars or local .toml files
│   ├── settings.toml           Default configuration settings, tracked by git
│   ├── .secrets.toml           Gitignored file for secrets and configuration management
│   ├── Makefile                Frequently used commands for setup, development, and CLI usage
│   └── pyproject.toml          Python project configuration file
```

Comment on lines -11 to -22 (Collaborator): I know this file structure has changed a little, but would it be possible to update it rather than simply removing it?

Having a high-level overview of the main sub-packages and key root level files in a codebase is often helpful to understand how to contribute to it.

## Using the tool

Project maintainers and members of the public have a few options for interacting with the tool and the reports it produces. Read more about each option in the [usage guide](../documentation/analytics/usage.md):

1. [Viewing the reports in Slack](../documentation/analytics/usage.md#view-daily-reports-in-slack)
2. [Triggering reports from GitHub](../documentation/analytics/usage.md#trigger-a-report-from-github)
3. [Triggering reports from the command line](../documentation/analytics/usage.md#trigger-a-report-from-the-command-line)

## Contributing to the tool

Project maintainers or open source contributors are encouraged to contribute to the tool. Follow the guides linked below for more information:

1. [Technical overview](../documentation/analytics/technical-overview.md)
2. [Installation and development guide](../documentation/analytics/development.md)
- [Adding a new data source](../documentation/analytics/development.md#adding-a-new-dataset)
- [Adding a new metric](../documentation/analytics/development.md#adding-a-new-metric)
3. [Writing and running tests](../documentation/analytics/testing.md)
4. [Command line interface (CLI) user guide](../documentation/analytics/usage.md#using-the-command-line-interface)
5. [Description of existing metrics](../documentation/analytics/metrics/README.md)
This package encapsulates a data pipeline service. The service is responsible for extracting project data from GitHub and transforming the extracted data into rows in a data warehouse.

## Data Pipeline

The service in this package provides capabilities to satisfy the middle step (denoted as "ETL") in the following data flow diagram:

`SGG Project Data → GitHub → ETL → Postgres DW → Metabase → End User`

The service does not listen on a port or run as a daemon. Instead, it must be triggered manually, either via `make` commands on the command line or via a text-based interactive tool written in Python and referred to as the CLI.

In current practice, the service is triggered daily via an AWS Step Function (akin to a cron job) orchestrated with Terraform.
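
For orientation, here is a minimal conceptual sketch of that extract-transform-load flow in Python. Every name in it is a hypothetical placeholder for illustration, not the package's actual API; see the technical overview linked below for the real structure.

```python
# Conceptual sketch only: every name below is a hypothetical placeholder,
# not the analytics package's actual API.

def export_project_data() -> list[dict]:
    """Extract: pull raw project data (issues, sprints, etc.) from GitHub."""
    return [{"issue": 123, "sprint": "Sprint 7", "points": 3}]  # stub data

def transform_to_rows(raw: list[dict]) -> list[tuple]:
    """Transform: reshape the GitHub export into warehouse-ready rows."""
    return [(item["issue"], item["sprint"], item["points"]) for item in raw]

def load_rows(rows: list[tuple]) -> None:
    """Load: write the rows into the Postgres data warehouse."""
    for row in rows:
        print(f"INSERT ... VALUES {row}")  # stand-in for a real DB write

if __name__ == "__main__":
    load_rows(transform_to_rows(export_project_data()))
```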

## Developer Information

The service is open-source and can be installed and run in a local development environment, which is useful for project maintainers and/or open source contributors. Follow the links below for more information:

1. [Technical Overview](../documentation/analytics/technical-overview.md)
2. [Getting Started Guide for Developers](../documentation/analytics/development.md)
3. [Writing and Running Tests](../documentation/analytics/testing.md)
4. [Usage Guide: Data Pipeline Service & CLI](../documentation/analytics/usage.md)

203 changes: 55 additions & 148 deletions documentation/analytics/development.md
@@ -1,48 +1,23 @@
# Development <!-- omit in toc -->
# Getting Started Guide for Developers

> [!NOTE]
> All of the steps on this page should be run from the root of the [`analytics/`](../../analytics/) sub-directory
> All of the steps on this page should be run from the root of the [`analytics/`](../../analytics/) directory

<details>
<summary>Table of contents</summary>

- [Setting up the tool locally](#setting-up-the-tool-locally)
- [Docker vs Native](#docker-vs-native)
- [Running with Docker](#running-with-docker)
- [Running natively](#running-natively)
- [Configuring secrets](#configuring-secrets)
- [Prerequisites](#prerequisites)
- [Finding reporting channel ID](#finding-reporting-channel-id)
- [Finding slackbot token](#finding-slackbot-token)
- [Running the tool locally](#running-the-tool-locally)
- [Using the `make` commands](#using-the-make-commands)
- [Using the CLI tool](#using-the-cli-tool)
- [Common development tasks](#common-development-tasks)
- [Adding a new dataset](#adding-a-new-dataset)
- [Adding a new metric](#adding-a-new-metric)
- [Adding a new CLI entrypoint](#adding-a-new-cli-entrypoint)

</details>

## Setting up the tool locally

The following sections describe how to install and work with the analytics application on your own computer. If you don't need to run the application locally, view the [usage docs](usage.md) for other ways to monitor our operational metrics.
## Development Environment Setup

### Docker vs Native

This project runs itself inside of Docker by default. If you wish to run it natively, add PY_RUN_APPROACH=local to your environment variables. You can set this by either running `export PY_RUN_APPROACH=local` in your shell or adding it to your ~/.zshrc file (and running `source ~/.zshrc`).

After choosing your approach, follow the corresponding setup instructions:
This package runs in Docker by default, but can also be configured to run natively without Docker. Choose the option that's best for you, and then follow the instructions for that option:

- [Running with Docker](#running-with-docker)
- [Running natively](#running-natively)
- [Run with Docker](#run-with-docker)
- [Run Natively](#run-natively)

#### Running with Docker
#### Run with Docker

**Pre-requisites**

- Docker installed and running locally: `docker --version`
- Docker compose installed: `docker-compose --version`
- **Docker** installed and running locally: `docker --version`
- **Docker compose** installed: `docker-compose --version`

**Steps**

@@ -58,27 +33,23 @@ After choosing your approach, following the corresponding setup instructions:
- read:org
- admin:public_key
- project
3. Add `export GH_TOKEN=...` to your `zshrc` or similar
4. Set the slackbot token and the channel ID for Slack after following the instructions in [configuring secrets](#configuring-secrets). **Note:** replace the `...` with the value of these secrets:
```
export ANALYTICS_SLACK_BOT_TOKEN=...
export ANALYTICS_REPORTING_CHANNEL_ID=...
```
5. Run `make test-audit` to confirm the application is running correctly.
3. Add `GH_TOKEN=...` to your environment variables, e.g. in .zshrc or .bashrc
4. Run `make test-audit` to confirm the application is running correctly
5. Proceed to the next section to learn how to invoke commands

#### Running natively
#### Run Natively

**Pre-requisites**

- **Python version 3.12:** [pyenv](https://github.com/pyenv/pyenv#installation) is one popular option for installing Python,
or [asdf](https://asdf-vm.com/).
- **Poetry:** After installing and activating the right version of Python, [install poetry with the official installer](https://python-poetry.org/docs/#installing-with-the-official-installer) or alternatively use [pipx to install](https://python-poetry.org/docs/#installing-with-pipx).
- **Python version 3.12:** [pyenv](https://github.com/pyenv/pyenv#installation) is one popular option for installing Python, or [asdf](https://asdf-vm.com/)
- **Poetry:** [install poetry with the official installer](https://python-poetry.org/docs/#installing-with-the-official-installer) or alternatively use [pipx to install](https://python-poetry.org/docs/#installing-with-pipx)
@widal001 (Collaborator) commented on Jan 21, 2025: If you're running it natively you also need Postgres installed locally in order to have the psycopg_binary available on your computer.

(Screenshot attached: 2025-01-21 at 10:28 AM)

- **GitHub CLI:** [Install the GitHub CLI](https://github.com/cli/cli#installation)

**Steps**

1. Set up the project: `make setup` -- This will install the required packages and prompt you to authenticate with GitHub
2. Acquire a GitHub Token using one of the methods below
1. Add PY_RUN_APPROACH=local to your environment variables, e.g. in .zshrc or .bashrc
2. Set up the project: `make install` -- This will install the required packages and prompt you to authenticate with GitHub
3. Acquire a GitHub Token using one of the methods below
- Via AWS (Project Team)
- Retrieve GH_TOKEN from [AWS](https://us-east-1.console.aws.amazon.com/systems-manager/parameters/%252Fanalytics%252Fgithub-token/description?region=us-east-1&tab=Table#list_parameter_filters=Name:Contains:analytics%2Fgithub-token)
- Create your own in GitHub (Open Source)
@@ -89,122 +60,58 @@ After choosing your approach, following the corresponding setup instructions:
- read:org
- admin:public_key
- project
3. Add `export GH_TOKEN=...` to your `zshrc` or similar
4. Set the slackbot token and the channel ID for Slack after following the instructions in [configuring secrets](#configuring-secrets). **Note:** replace the `...` with the value of these secrets:
```
export ANALYTICS_SLACK_BOT_TOKEN=...
export ANALYTICS_REPORTING_CHANNEL_ID=...
```
5. Run `make test-audit` to confirm the application is running correctly.

### Configuring secrets

#### Prerequisites

In order to correctly set the value of the `slack_bot_token` and `reporting_channel_id` you will need:

1. To be a member of the Simpler.Grants.gov Slack workspace
2. To be a collaborator on the Sprint Reporting Bot Slack app

If you need to be added to the Slack workspace or to the list of collaborators for the app, contact a project maintainer.
4. Add `GH_TOKEN=...` to your environment variables, e.g. in .zshrc or .bashrc
5. Run `make test-audit` to confirm the application is running correctly
6. Proceed to the next section to learn how to invoke commands

#### Finding reporting channel ID
## Invoke Commands on the Service

1. In the Simpler.Grants.gov Slack workspace, navigate to the `#z_bot-sprint-reporting` channel. NB: Use the `#z_bot-analytics-ci-test` channel for testing.
2. Click on the name of the channel in the top left part of the screen.
3. Scroll down to the bottom of the resulting dialog box until you see `Channel ID`, and copy it.
### Using `make`

<img alt="Screenshot of dialog box with channel ID" src="../../analytics/static/screenshot-channel-id.png" height=500>
Several `make` commands are defined in the project [`Makefile`](../../analytics/Makefile). Commands can be invoked from the command line, as in the following examples:

#### Finding slackbot token

1. Go to [the dashboard](https://api.slack.com/apps) that displays the Slack apps for which you have collaborator access
2. Click on `Sprint Reporting Bot` to go to the settings for our analytics slackbot
3. From the side menu, select `OAuth & Permissions` and scroll down to the "OAuth tokens for your workspace" section
4. Copy the "Bot user OAuth token" which should start with `xoxb`

<img alt="Screenshot of slack app settings page with bot user OAuth token" src="../../analytics/static/screenshot-slackbot-token.png" width=750>

## Running the tool locally

While the [usage guide](usage.md) describes all of the options for running the `analytics` package locally, the following sections highlight some helpful commands to interact with the tool during development.

### Using the `make` commands

In earlier steps, you'll notice that we've configured a set of `make` commands that help streamline common developer workflows related to the `analytics` package. You can view the [`Makefile`](../../analytics/Makefile) for the full list of commands, but some common ones are also described below:

- `make install` - Checks that you have the prereqs installed, installs new dependencies, and prompts you to authenticate with the GitHub CLI.
- `make unit-test` - Runs the unit tests and prints a coverage report
- `make e2e-test` - Runs integration and end-to-end tests and prints a coverage report
- `make install` - Checks that prereqs are installed, installs new dependencies, and prompts for GitHub authentication
- `make unit-test` - Runs the unit tests and opens a coverage report in a web browser
- `make e2e-test` - Runs integration and end-to-end tests and opens a coverage report in a web browser
- `make lint` - Runs [linting and formatting checks](formatting-and-linting.md)
- `make sprint-reports-with-latest-data` - Runs the full analytics pipeline, which includes:
  - Exporting data from GitHub
  - Calculating the project's operational metrics
  - Either printing those metrics to the command line or posting them to Slack (if `ACTION=post-results` is passed)

### Using the CLI tool
### Using the CLI

The `analytics` package comes with a built-in CLI that you can use to discover the reporting features available. Start by simply typing `poetry run analytics --help` which will print out a list of available commands:
The package includes a CLI that can be used to discover the available commands. To run the CLI, type `poetry run analytics --help` at the command line, and the CLI should respond with a list of available commands.

![Screenshot of passing the --help flag to CLI entry point](../../analytics/static/screenshot-cli-help.png)

Additional guidance on working with the CLI tool can be found in the [usage guide](usage.md#using-the-command-line-interface).
## Example Development Tasks
Comment (Collaborator): Shouldn't block merging this, but it would be good to add instructions for accessing the postgres DB using Docker and psql.


## Common development tasks
### How To Add New Dataset

### Adding a new dataset
1. Create a new python file in `src/analytics/datasets/`
2. In that file, create a new class that inherits from the `BaseDataset`
3. Store the names of key columns as either class or instance attributes
4. If you need to combine multiple source files (or other datasets) to produce this dataset, consider creating a class method that can be used to instantiate this dataset from those sources
5. Create **at least** one unit test for each method that is implemented with the new class

1. Create a new python file in `src/analytics/datasets/`.
2. In that file, create a new class that inherits from the `BaseDataset`.
3. Store the names of key columns as either class or instance attributes.
4. If you need to combine multiple source files (or other datasets) to produce this dataset, consider creating a class method that can be used to instantiate this dataset from those sources.
5. Create **at least** one unit test for each method that is implemented with the new class.
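
To make these steps concrete, here is a minimal sketch of what a new dataset class might look like. The import path, constructor signature, and `from_source_files()` helper are illustrative assumptions, not the package's actual API; the real `BaseDataset` interface may differ.

```python
# Hypothetical sketch: the import path, constructor signature, and helper
# method below are illustrative assumptions, not the package's actual API.
import pandas as pd

from analytics.datasets.base import BaseDataset  # assumed module path


class SprintBoard(BaseDataset):
    """Example dataset joining issue data with sprint data."""

    # Store the names of key columns as class attributes
    ISSUE_COL = "issue_number"
    SPRINT_COL = "sprint_name"

    @classmethod
    def from_source_files(cls, issues_file: str, sprints_file: str) -> "SprintBoard":
        """Instantiate the dataset by combining multiple source files."""
        issues = pd.read_json(issues_file)
        sprints = pd.read_json(sprints_file)
        return cls(df=issues.merge(sprints, on=cls.ISSUE_COL))
```

Each method implemented on the class (here, `from_source_files()`) should get at least one unit test, per step 5 above.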
### How To Add New CLI Entrypoint

### Adding a new metric
1. Add a new function to [`cli.py`](../../analytics/src/analytics/cli.py)
2. Wrap this function with a [sub-command `typer` decorator](https://typer.tiangolo.com/tutorial/subcommands/single-file/)
3. If the function accepts parameters, [annotate those parameters](https://typer.tiangolo.com/tutorial/options/name/)
4. Add *at least* one unit test for the CLI entrypoint, optionally mocking potential side effects of calling the entrypoint
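
As a sketch of these steps, a new entrypoint might look like the following. The `metrics_app` sub-command group and the option names are illustrative assumptions; see [`cli.py`](../../analytics/src/analytics/cli.py) for the actual entrypoints.

```python
# Hypothetical sketch: `metrics_app` and the option names are illustrative
# assumptions; see cli.py for the package's actual entrypoints.
import typer

metrics_app = typer.Typer(help="Calculate operational metrics")


@metrics_app.command(name="sprint_burndown")
def sprint_burndown(
    # Annotate parameters so typer exposes them as CLI options
    sprint: str = typer.Option(..., "--sprint", help="Name of the sprint to report on"),
) -> None:
    """Calculate burndown for the given sprint."""
    typer.echo(f"Calculating burndown for sprint: {sprint}")


if __name__ == "__main__":
    metrics_app()
```

With the decorator above, the command would be invoked as `analytics calculate sprint_burndown --sprint "Sprint 7"`, assuming `metrics_app` is registered under a `calculate` sub-command.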

1. Create a new python file in `src/analytics/metrics/`.
2. In that file, create a new class that inherits from the `BaseMetric`.
3. Determine which dataset class this metric requires as an input. **Note:** If the metric requires a dataset that doesn't exist, review the steps to [add a dataset](#adding-a-new-dataset).
4. Implement the following methods on that class:
- `__init__()` - Instantiates the metric class and accepts any inputs needed to calculate the metric (e.g. `sprint` for `SprintBurndown`)
- `calculate()` - Calculates the metric and stores the output to a `self.results` attribute. **Tip:** It's often helpful to break the steps involved in calculating the metric into a series of private methods (i.e. methods that begin with an underscore, e.g. `_get_and_validate_sprint_name()`) that can be called from the main `calculate()` method.
- `get_stats()` - Calculates and returns key stats about the metric or input dataset. **Note:** Stats are different from metrics in that they represent single values and aren't meant to be visualized in a chart.
- `format_slack_message()` - Generate a string that will be included if the results are posted to Slack. This often includes a list of stats as well as the title of the metric.
5. Create *at least* one unit test for each of these methods to test them against a simplified input dataset to ensure the function has been implemented correctly. For more information review the [docs on testing](../../documentation/analytics/testing.md)
6. Follow the steps in [adding a new CLI entrypoint](#adding-a-new-cli-entrypoint) to expose this metric via the CLI.
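
The method list above maps naturally onto a class skeleton. Below is a minimal sketch, assuming an import path and simplified method bodies for illustration; the real `BaseMetric` interface may differ.

```python
# Hypothetical sketch: the import path and method bodies are illustrative
# assumptions, not the package's actual implementation.
from analytics.metrics.base import BaseMetric  # assumed module path


class SprintBurndown(BaseMetric):
    """Example metric tracking the issues remaining in a given sprint."""

    def __init__(self, dataset, sprint: str) -> None:
        # Accept any inputs needed to calculate the metric (e.g. `sprint`)
        self.dataset = dataset
        self.sprint = sprint
        self.results = None

    def calculate(self) -> None:
        # Store the output to a self.results attribute
        sprint_name = self._get_and_validate_sprint_name()
        self.results = self.dataset.df[self.dataset.df["sprint_name"] == sprint_name]

    def _get_and_validate_sprint_name(self) -> str:
        # Private helper called from the main calculate() method
        if not self.sprint:
            raise ValueError("A sprint name is required")
        return self.sprint

    def get_stats(self) -> dict:
        # Stats are single values, not data meant to be charted
        total = len(self.results) if self.results is not None else 0
        return {"total_issues": total}

    def format_slack_message(self) -> str:
        # Title of the metric plus key stats, for posting to Slack
        stats = self.get_stats()
        return f"Sprint burndown for {self.sprint}: {stats['total_issues']} issues"
```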
### How to Extend Analytics DB Schema

### Adding a new CLI entrypoint
1. Add a new migration file to [`integrations/etldb/migrations/versions/`](../../analytics/src/analytics/integrations/etldb/migrations/versions) and prefix file name with the next iteration number (ex: `0007_`)
2. Add valid Postgres SQL to the new integration file
3. Run the migration command: `make db-migrate`

1. Add a new function to [`cli.py`](../../analytics/src/analytics/cli.py)
2. Wrap this function with a [sub-command `typer` decorator](https://typer.tiangolo.com/tutorial/subcommands/single-file/). For example if you want to calculate sprint burndown with the entrypoint `analytics calculate sprint_burndown`, you'd use the decorator: `metrics_app.command(name="sprint_burndown")`
3. If the function accepts parameters, [annotate those parameters](https://typer.tiangolo.com/tutorial/options/name/).
4. Add *at least* one unit test for the CLI entrypoint, optionally mocking potential side effects of calling the entrypoint.

### Copying table from grants-db

1. Add a new SQL migration file in `src/analytics/integrations/etldb/migrations/versions` and prefix the file name with the next iteration number (ex: `0007`).
2. Use your database management system (ex: `pg_admin`, `db_beaver`...), right-click on the table you wish to copy, and select `SQL scripts`, then `request and copy original DDL`.
3. Paste the DDL into your new migration file. Fix any formatting issues; see previous migration files for reference.
4. Remove all references to schemas, roles, triggers, and the use of `default now()` for timestamp columns.

Example:
``` sql
create table if not exists opi.opportunity
(
...,
created_at timestamp with time zone default now() not null,
...
)
```
should be
``` sql
CREATE TABLE IF NOT EXISTS opportunity
(
    ...,
    created_at timestamp with time zone not null,
    ...
)
```

5. Run the migration via the `make db-migrate` command
### How To Run Linters

```bash
make lint
```

### How To Run Unit Tests

```bash
make unit-test
```