Home
This repo contains packages used to run the AWS data pipeline (see README) for the Crowdbreaks project.
The repo structure is explained below:
- `.github/workflows` contains GitHub Actions workflows for deploying to ECS/Lambda/SageMaker.
- `awstools` package contains most of the functions that use the AWS SDK for Python, plus 'global' configs. These helpers are then used throughout the rest of the repo, including the Lambda functions, the streamer package and the SageMaker tools.
- AWS Lambda functions
  - Streamer
    - `lambda-es-rotation` is used for rotating Elasticsearch indices. It is triggered by an AWS EventBridge cron event, `crowdbreaks-es-monthly-rotation`.
    - `lambda-s3-to-es` is used to preprocess raw data to fit the Elasticsearch schema, including retrieving geo info and predictions using existing SageMaker endpoints.
    - `lambda-streamer-management` is used to manage the streamer status on the Crowdbreaks website.
  - Auto MTurking on the Crowdbreaks website
    - `lambda-sample-for-annotations` is used for creating random samples of the recent data for annotation.
    - `lambda-subsample-annotations` is used for creating a small evaluation subsample of the annotation results, which is used to evaluate them.
- Streamer
  - `streamer` package is used for streaming the data from the Twitter API v1.1 (v2 is also an option) filtering endpoint and sending it to AWS Kinesis Firehose.
  - `Dockerfile` is used to build `streamer` for ECS. If it is moved to another folder, please change the `.github/workflows/aws-*.yml` files to build from there.
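The hand-off from the streamer to Kinesis Firehose described above can be sketched roughly as follows. This is a minimal illustration and not the code of the `streamer` package: the function names and the newline-delimited JSON layout are assumptions, and only the boto3 `put_record` call is a real API.

```python
import json


def to_firehose_record(tweet: dict) -> dict:
    """Serialize one tweet as a newline-terminated JSON record (assumed layout)."""
    # Firehose concatenates records as they arrive, so a trailing newline
    # keeps one JSON object per line in the delivered files.
    data = json.dumps(tweet, ensure_ascii=False) + "\n"
    return {"Data": data.encode("utf-8")}


def send_tweet(tweet: dict, stream_name: str, client=None) -> dict:
    """Send one tweet to a Firehose delivery stream (hypothetical helper)."""
    if client is None:
        import boto3  # imported lazily so the serializer works without AWS access

        client = boto3.client("firehose")
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record=to_firehose_record(tweet),
    )
```

Passing `client` explicitly makes it possible to try the helper with a stub before pointing it at a real delivery stream.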
Currently, Lambda triggers are not set automatically (except the S3 triggers when a config is updated). If the functions are recreated from scratch (for example, by deleting all functions and running a 'push-create-lambda' workflow), make sure to set the corresponding triggers in the AWS console.
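After recreating the functions, one way to double-check that triggers were set is to read each function's resource policy, which is where push-style triggers such as S3 and EventBridge are recorded. The sketch below is an assumption-laden example, not part of this repo: the helper names are hypothetical, and it does not cover stream-based triggers, which are stored as event source mappings instead.

```python
import json


def _source_arn(condition: dict):
    """Find the aws:SourceArn value in a policy condition, if any."""
    for operands in condition.values():
        for key, value in operands.items():
            if key.lower() == "aws:sourcearn":
                return value
    return None


def trigger_sources(policy_json: str) -> list:
    """Return (service, source ARN) pairs that may invoke the function."""
    sources = []
    for statement in json.loads(policy_json).get("Statement", []):
        principal = statement.get("Principal", {})
        service = principal.get("Service") if isinstance(principal, dict) else principal
        sources.append((service, _source_arn(statement.get("Condition", {}))))
    return sources


def function_triggers(function_name: str, client=None) -> list:
    """Fetch and parse the resource policy of one Lambda function."""
    if client is None:
        import boto3  # imported lazily; the parser above needs no AWS access

        client = boto3.client("lambda")
    try:
        policy = client.get_policy(FunctionName=function_name)["Policy"]
    except client.exceptions.ResourceNotFoundException:
        return []  # no resource policy, so no S3/EventBridge trigger is attached
    return trigger_sources(policy)
```

An empty result for a function that should react to S3 or EventBridge events suggests its trigger still needs to be added in the console.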
There are three sources of secrets: AWS, Elasticsearch and Twitter.
The secrets are stored in four different places:
- GitHub Actions Secrets
  - Settings -> Security -> Secrets -> Actions
- AWS Secrets Manager
- Heroku environment variables
- 1Password
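As an illustration of how code might read a secret from two of these places, the sketch below checks an environment variable first and falls back to AWS Secrets Manager. The function name, the lookup order, and the assumption that secrets are stored as JSON strings are all hypothetical; the repo's actual lookup logic may differ.

```python
import json
import os


def load_secret(name: str, client=None) -> dict:
    """Load a JSON secret from the environment, else from AWS Secrets Manager."""
    raw = os.environ.get(name)
    if raw is None:
        if client is None:
            import boto3  # imported lazily; not needed when the env var is set

            client = boto3.client("secretsmanager")
        # Assumes the secret is stored as a string, not as binary data.
        raw = client.get_secret_value(SecretId=name)["SecretString"]
    return json.loads(raw)
```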
Elasticsearch is served through https://www.elastic.co; the credentials are stored in 1Password. Make sure that the clusters (esp. crowdbreaks-stg, since it has very little memory) are not overflowing. Delete the older indices if the used storage is getting close to the limit.
Elasticsearch usage and billing are shown on the AWS Marketplace account or on https://www.elastic.co; they are not shown in AWS Cost Explorer.
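To decide which older indices to delete when storage is getting close to the limit, a small helper along these lines could be used. It is a sketch under stated assumptions: the 80% threshold, the function name, and the `(name, creation_timestamp, size_bytes)` input shape are all made up for illustration, and the index data would need to be collected from the cluster separately.

```python
def indices_to_delete(indices, limit_bytes, threshold=0.8):
    """Pick the oldest indices until usage drops to threshold * limit_bytes.

    `indices` is a list of (name, creation_timestamp, size_bytes) tuples.
    """
    used = sum(size for _, _, size in indices)
    target = threshold * limit_bytes
    doomed = []
    # Walk from oldest to newest and stop as soon as usage is acceptable.
    for name, _, size in sorted(indices, key=lambda item: item[1]):
        if used <= target:
            break
        doomed.append(name)
        used -= size
    return doomed
```

The function only returns names; reviewing the list before deleting anything keeps the step reversible.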
In case Twitter API v1.1 gets deprecated, here is how to launch the streamer for Twitter API v2 (for storage only; nothing else is supported yet).
- Open the `Dockerfile` in the root folder of the repo.
- Change `CMD run-stream -> CMD run-stream-v2` and save.
- Run Actions -> Deploy to Amazon ECS (Production) (aws-prd.yml) -> Run workflow -> Branch: main.
- Restart the streamer using the website or AWS ECS.
- You can check the logs of the ECS task to make sure that the correct version is running: the first log should contain the version.
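The Dockerfile change in the steps above is a one-line edit, and it could also be scripted. The sketch below is only an illustration: the function name is hypothetical, and since the exact form of the `CMD` line is not shown here, the pattern is written to accept both the shell form and the JSON-array form.

```python
import re


def switch_to_v2(dockerfile_text: str) -> str:
    """Point the CMD instruction at run-stream-v2 instead of run-stream."""
    # Only touch CMD lines, and leave lines that already use the v2 command alone.
    pattern = re.compile(r"^(CMD\b.*\brun-stream)(?!-v2)\b", re.MULTILINE)
    return pattern.sub(r"\1-v2", dockerfile_text)
```

Applying it means reading the `Dockerfile`, passing its text through `switch_to_v2`, and writing the result back before running the deployment workflow.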
To make sure that the streams are running, either check the CrowdbreaksStreaming dashboard on AWS CloudWatch, or check that S3 is up to date for active streams.
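The S3 freshness check can be approximated with a short script. This is a sketch, not a tool from this repo: the function names and the one-hour staleness window are assumptions, and the bucket and prefix for a given stream have to be supplied by the caller.

```python
from datetime import datetime, timedelta, timezone


def is_stream_stale(last_modified, now=None, max_age=timedelta(hours=1)):
    """Return True when the newest object is missing or older than max_age."""
    if last_modified is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_modified > max_age


def newest_object_time(bucket, prefix, client=None):
    """Return the LastModified of the newest object under prefix, or None."""
    if client is None:
        import boto3  # imported lazily; the staleness check needs no AWS access

        client = boto3.client("s3")
    newest = None
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest:
                newest = obj["LastModified"]
    return newest
```

Listing a whole prefix can be slow for long-running streams, so narrowing the prefix to recent data, where the key layout allows it, keeps the check cheap.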
Also make sure that the right app is connected to the Crowdbreaks project on the Twitter Developer Portal. The bearer token will not work if the app is not linked to the project.