Infra configuration and deployments #20

Merged · 28 commits · Nov 11, 2024
3 changes: 3 additions & 0 deletions .editorconfig
@@ -10,3 +10,6 @@ insert_final_newline = true

[*.{js,json,md,sql,yaml}]
indent_size = 2

[Makefile]
indent_style = tab
1 change: 1 addition & 0 deletions .github/workflows/linter.yaml
@@ -33,3 +33,4 @@ jobs:
VALIDATE_JSCPD: false
VALIDATE_JAVASCRIPT_PRETTIER: false
VALIDATE_MARKDOWN_PRETTIER: false
VALIDATE_GITHUB_ACTIONS: false
8 changes: 2 additions & 6 deletions .gitignore
@@ -2,9 +2,5 @@ node_modules/
.DS_Store

# Terraform
tf/.terraform/
tf/temp

# Dataform
.df-credentials.json
.gitignore
infra/tf/.terraform/
**/*.zip
14 changes: 14 additions & 0 deletions Makefile
@@ -0,0 +1,14 @@
FN_NAME = dataform-trigger

.PHONY: *

start:
npx functions-framework --target=$(FN_NAME) --source=./infra/dataform-trigger/ --signature-type=http --port=8080 --debug

tf_plan:
terraform -chdir=infra/tf init -upgrade && terraform -chdir=infra/tf plan \
-var="FUNCTION_NAME=$(FN_NAME)"

tf_apply:
terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve \
-var="FUNCTION_NAME=$(FN_NAME)"
8 changes: 4 additions & 4 deletions README.md
@@ -1,4 +1,4 @@
# HTTP Archive BigQuery pipeline with Dataform
# HTTP Archive datasets pipeline

This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the `httparchive` dataset in BigQuery.

@@ -62,7 +62,7 @@ Tag: `crawl_results_legacy`

### Triggering workflows

[see here](./src/README.md)
To unify the workflow triggering mechanism, we use [a Cloud Run function](./src/README.md) that can be invoked in a number of ways (e.g. by listening to Pub/Sub messages), perform intermediate checks, and trigger the appropriate Dataform workflow execution configuration.
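For a quick local check, a minimal sketch of invoking the function over HTTP is shown below (the request body field name is an assumption; see the function README linked above for the real request example):

```js
// Minimal sketch, not the exact request contract: POST a trigger name to the
// locally running function (`make start`). The `name` field is an assumption.
(async () => {
  const res = await fetch('http://localhost:8080/', { // Node 18+ global fetch
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ name: 'cwv_tech_report' })
  })
  console.log(res.status, await res.text())
})()
```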

## Contributing

@@ -85,7 +85,7 @@ Tag: `crawl_results_legacy`

The issues within the pipeline are being tracked using the following alerts:

1. the event trigger processing fails - [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/3950167380893746326?authuser=7&project=httparchive)
2. a job in the workflow fails - [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/7137542315653007241?authuser=7&project=httparchive)
1. the event trigger processing fails - [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/570799173843203905?authuser=7&project=httparchive)
2. a job in the workflow fails - [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/16526940745374967367?authuser=7&project=httparchive)

Error notifications are sent to [#10x-infra](https://httparchive.slack.com/archives/C030V4WAVL3) Slack channel.
132 changes: 132 additions & 0 deletions docs/infrastructure.md
@@ -0,0 +1,132 @@
# Infrastructure

```mermaid
graph LR;
subgraph Cloud_Run_Functions
dataformTrigger[Dataform Trigger Function]
end

subgraph PubSub
crawl_complete_topic[Crawl Complete Topic]
dataformTrigger_subscription[Dataform Trigger Subscription]
crawl_complete_topic --> dataformTrigger_subscription
end

dataformTrigger_subscription --> dataformTrigger

subgraph Cloud_Scheduler
bq_poller_cwv_tech_report[CWV Report Poller Job]
bq_poller_cwv_tech_report --> dataformTrigger
end

subgraph Dataform
dataform_repo[Dataform Repository]
dataform_repo_release_config[Release Configuration]
dataform_repo_workflow[Workflow Execution]
end

dataformTrigger --> dataform_repo[Dataform Repository]
dataform_repo --> dataform_repo_release_config[Release Configuration]
dataform_repo_release_config --> dataform_repo_workflow[Workflow Execution]

subgraph BigQuery
bq_jobs[BigQuery Jobs]
bq_datasets[BigQuery Dataset Updates]
bq_jobs --> bq_datasets
end
dataform_repo_workflow --> bq_jobs

subgraph Logs_and_Alerts
cloud_run_logs[Cloud Run Logs]
dataform_logs[Dataform Logs]
bq_logs[BigQuery Logs]
alerting_policies[Alerting Policies]
slack_notifications[Slack Notifications]

cloud_run_logs --> alerting_policies
dataform_logs --> alerting_policies
bq_logs --> alerting_policies
alerting_policies --> slack_notifications
end

dataformTrigger --> cloud_run_logs
dataform_repo_workflow --> dataform_logs
bq_jobs --> bq_logs

```

## Triggering pipelines

[Configuration](./tf/functions.tf)

### Cloud Run Function

Triggers the Dataform workflow execution based on events or cron schedules.

- [dataformTrigger](https://console.cloud.google.com/functions/details/us-central1/dataformTrigger?env=gen2&project=httparchive)

[Source](./src/README.md)

### Cloud Scheduler

- [bq-poller-cwv-tech-report](https://console.cloud.google.com/cloudscheduler/jobs/edit/us-east4/bq-poller-cwv-tech-report?authuser=7&project=httparchive)

### Pub/Sub Subscription

- [dataform-trigger-subscription](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive)

## Dataform

Runs the batch processing workflows. There are two Dataform repositories for [development](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data-test/details/workspaces?authuser=7&project=httparchive) and [production](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workspaces?authuser=7&project=httparchive).

The test repository is used [for development and testing purposes](https://cloud.google.com/dataform/docs/workspaces) and is not connected to the rest of the pipeline infrastructure.

The pipeline can be [run manually](https://cloud.google.com/dataform/docs/code-lifecycle) from the Dataform UI.

[Configuration](./tf/dataform.tf)

### Dataform Development Workspace

1. [Create a new dev workspace](https://cloud.google.com/dataform/docs/quickstart-dev-environments) in the test Dataform repository.
2. Make adjustments to the Dataform configuration files and manually run a workflow to verify them.
3. Push your changes to a dev branch and open a PR with a link to the BigQuery artifacts generated in the test workflow.

*Some useful hints:*

1. In the workflow settings vars, set `dev_name: dev` to process sampled data in the dev workspace (see the sketch after this list for how these variables can be read in code).
2. Change the `current_month` variable to a month in the past; this may be helpful for testing pipelines based on `chrome-ux-report` data.
3. The `definitions/extra/test_env.sqlx` script helps set up the tables required to run pipelines in a dev workspace. It's disabled by default.
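As a rough illustration (how this repository actually consumes these variables may differ), Dataform exposes project variables to JS and SQLX code via `dataform.projectConfig.vars`:

```js
// Sketch for a JS file in the Dataform project — the variable names match the
// hints above, but the derived-suffix logic is an illustrative assumption.
const devName = dataform.projectConfig.vars.dev_name // e.g. "dev"
const currentMonth = dataform.projectConfig.vars.current_month // e.g. "2024-10-01"

// Route dev-workspace runs to sampled/dev tables via a dataset suffix.
const datasetSuffix = devName === 'dev' ? '_dev' : ''

module.exports = { currentMonth, datasetSuffix }
```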

## Monitoring

[Configuration](./tf/monitoring.tf)

### Dataform repository

- [Production Dataform Workflow Execution logs](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workflows?authuser=7&project=httparchive)
- [Logs Explorer](https://cloudlogging.app.goo.gl/k9qfqCh4RjFwTnQ56)

### Cloud Run logs

- [Trigger function logs](https://console.cloud.google.com/run/detail/us-central1/dataformtrigger/logs?authuser=7&project=httparchive)
- [Logs Explorer](https://cloudlogging.app.goo.gl/6Q879UjnTPDqtVBx5)

### BigQuery logs

- [Logs Explorer](https://cloudlogging.app.goo.gl/rFjRMcvejd1Tyi7KA)

### Alerting policies

- [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/3950167380893746326?authuser=7&project=httparchive)
- [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/7137542315653007241?authuser=7&project=httparchive)

## CI/CD pipeline

### Dataform / GitHub connection

A GitHub PAT is saved to a [Secret Manager secret](https://console.cloud.google.com/security/secret-manager/secret/GitHub_max-ostapenko_dataform_PAT/versions?authuser=7&project=httparchive).

- repository: HTTPArchive/dataform
- permissions:
- Commit statuses: read
- Contents: read, write
18 changes: 10 additions & 8 deletions src/README.md → infra/README.md
@@ -1,18 +1,20 @@
# Cloud function for triggering Dataform workflows
# Infrastructure for the HTTP Archive data pipeline

## Cloud function for triggering Dataform workflows

[dataformTrigger](https://console.cloud.google.com/functions/details/us-central1/dataformTrigger?env=gen2&authuser=7&project=httparchive) Cloud Run Function

This function may be triggered by a Pub/Sub message or a Cloud Scheduler job and starts a Dataform workflow based on the trigger configuration provided.

## Configuration
### Configuration

Trigger types:

1. `event` - immediately triggers a Dataform workflow using the tags provided in the configuration.

2. `poller` - first runs a BigQuery polling query. If the query returns TRUE, the Dataform workflow is triggered using the tags provided in the configuration.

See [available trigger configurations](https://github.com/HTTPArchive/dataform/blob/30a3304bf0e903ec0c54ce1318aa4eed8ae828ed/src/index.js#L4).
See [available trigger configurations](https://github.com/HTTPArchive/dataform/blob/main/src/index.js#L4).
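For orientation, here is a hypothetical sketch of a trigger entry — `repoName` and `tags` appear in the function source, but the remaining field names and values are assumptions rather than the exact schema:

```js
// Hypothetical trigger configuration entry (not the exact schema in index.js).
const TRIGGERS = {
  cwv_tech_report: {
    type: 'poller', // 'event' runs immediately; 'poller' checks BigQuery first
    query: 'SELECT COUNT(1) > 0 FROM `some_dataset.some_table`', // illustrative poller check
    actionArgs: {
      repoName: 'crawl-data', // production Dataform repository
      tags: ['cwv_tech_report'] // Dataform tags to execute
    }
  }
}
```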

Request body example with trigger name:

@@ -22,12 +24,12 @@ Request body example with trigger name:
}
```

## Local testing
### Local testing

Run the following command to test the function locally:

```bash
npm run start
make start
```

Then, in a separate terminal, run the following command to trigger the function:
Expand All @@ -42,10 +44,10 @@ curl -X POST http://localhost:8080/ \
}'
```

## Deployment
### Deployment

When you're under `src/` run:
When you're under `infra/` run:

```bash
npm run deploy
make deploy
```
File renamed without changes.
6 changes: 3 additions & 3 deletions src/index.js → infra/dataform-trigger/index.js
@@ -1,4 +1,6 @@
const functions = require('@google-cloud/functions-framework')
const { BigQuery } = require('@google-cloud/bigquery')
const { getCompilationResults, runWorkflow } = require('./dataform')

const TRIGGERS = {
cwv_tech_report: {
@@ -109,7 +111,6 @@ async function messageHandler (req, res) {
* @returns {boolean} Query result.
*/
async function runQuery (query) {
const { BigQuery } = require('@google-cloud/bigquery')
const bigquery = new BigQuery()

const [job] = await bigquery.createQueryJob({ query })
@@ -138,7 +139,6 @@ async function executeAction (actionName, actionArgs) {
* @param {object} args Action arguments.
*/
async function runDataformRepo (args) {
const { getCompilationResults, runWorkflow } = require('./dataform')
const project = 'httparchive'
const location = 'us-central1'
const { repoName, tags } = args
@@ -163,4 +163,4 @@ async function runDataformRepo (args) {
* }
* }
*/
functions.http('dataformTrigger', (req, res) => messageHandler(req, res))
functions.http('dataform-trigger', (req, res) => messageHandler(req, res))
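// Note: the target name registered above must match the Functions Framework
// `--target` flag (FN_NAME = dataform-trigger in the Makefile), hence the
// rename from 'dataformTrigger' to 'dataform-trigger'.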
10 changes: 10 additions & 0 deletions infra/dataform-trigger/package.json
@@ -0,0 +1,10 @@
{
"main": "index.js",
"dependencies": {
"@google-cloud/bigquery": "^7.9.1",
"@google-cloud/dataform": "^1.3.0",
"@google-cloud/functions-framework": "^3.4.2"
},
"name": "dataform-trigger",
"version": "1.0.0"
}
61 changes: 61 additions & 0 deletions infra/tf/.terraform.lock.hcl

Some generated files are not rendered by default.