ALERTWildfire Scraper and Tweet Monitor

A multi-part service that collects training data for the USC research project "Early Fire Detection". It includes:

  • an ArangoDB instance that stores the URLs of ALERTWildfire's cameras (as collected by scripts/enumerator.py) and the Tweets of interest collected by the Tweet monitor
  • a distributed, asynchronous scraper that collects classic cam images from http://www.AlertWildfire.org and uploads a zip archive of the images to Google Drive after each full execution
  • a Tweet monitor that saves Tweets mentioning @AlertWildfire's Twitter account (potentially regarding a wildfire) to the database
  • an asynchronous scraper that retrieves infrared cam images from http://beta.alertwildfire.org/infrared-cameras/ and uploads the images to Google Drive

ALERTWildfire

"ALERTWildfire is a network of over 900 specialized camera installations in California, Nevada, Idaho and Oregon used by first responders and volunteers to detect and monitor wildfires." - Nevada Today

Contents

  1. Prerequisites
  2. Run It
  3. ArangoDB
  4. Redis
  5. RabbitMQ
  6. Classic Scraper
  7. Infrared Scraper

Prerequisites

  1. Create a Twitter Developer account, start a new project, and set the SEARCHTWEETS_ENDPOINT, SEARCHTWEETS_BEARER_TOKEN, SEARCHTWEETS_CONSUMER_KEY, and SEARCHTWEETS_CONSUMER_SECRET environment variables in docker-compose.yml accordingly. For details, see Twitter's "Step-by-step guide to making your first request to the new Twitter API v2".
  2. Create a Google Developer account and a new project with the Google Drive API enabled. Ensure that the scopes include read access to file metadata and write (file upload) access to Drive. Authenticate a user outside of Docker (I used Google's quickstart; a modified version is at scripts/gdrive-token-helper.py), then set the PROJECT_ID, TOKEN, REFRESH_TOKEN, and GDRIVE_PARENT_DIR environment variables accordingly.

Run It

docker-compose build --parallel && docker-compose up -d

ArangoDB

ArangoDB database instance that stores all classic camera URLs (as collected by scripts/enumerator.py), infrared camera URLs, and Tweets from the Tweet Alerts monitor.

Technologies:

  • Docker
  • ArangoDB (latest)

Collections

cameras example:

{
  "url": "http://www.alertwildfire.org/orangecoca/index.html?camera=Axis-DeerCanyon1",
  "timestamp": "2021-08-24T20:51:37.433870",
  "axis": "orangecoca.Axis-DeerCanyon1"
}
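In the example above, the axis value appears to combine the first path segment of the URL (the region) with the camera query parameter. A minimal sketch of that derivation, assuming the pattern holds for every classic camera URL. The helper name camera_document is hypothetical and is not taken from scripts/enumerator.py, and whether the real timestamp is local time or UTC is not stated in this README:

```python
from datetime import datetime
from urllib.parse import parse_qs, urlparse


def camera_document(url: str) -> dict:
    """Build a cameras-style document from a classic camera URL.

    Assumes the URL looks like http://host/<region>/index.html?camera=<name>.
    """
    parsed = urlparse(url)
    # The first path segment is taken to be the region, e.g. "orangecoca".
    region = parsed.path.strip("/").split("/")[0]
    # The "camera" query parameter carries the camera name.
    camera = parse_qs(parsed.query)["camera"][0]
    return {
        "url": url,
        # Assumption: naive local time in ISO 8601 format.
        "timestamp": datetime.now().isoformat(),
        "axis": f"{region}.{camera}",
    }
```

Called with the example URL, this sketch yields the axis orangecoca.Axis-DeerCanyon1.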

tweets example:

{
  "id": "1430287078156234757",
  "text": "RT @CphilpottCraig: Evening timelapse 5:25-6:25pm #CaldorFire Armstrong Lookout camera. @AlertWildfire viewing North from South side of fir…",
  "scrape_timestamp": "2021-08-24T22:55:25.862109"
}

ir-cameras example:

{
  "axis": "Danaher_606Z_Thermal",
  "epoch": 1631050791,
  "url": "https://weathernode.net/img/flir/Danaher_606Z_Thermal_1631050791.jpg",
  "timestamp": "2021-09-09T18:54:53.195532"
}
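The infrared image URL in this example seems to follow the pattern <axis>_<epoch>.jpg. A minimal sketch that recovers both fields from such a URL, assuming the epoch is always the part after the last underscore. The helper name ir_camera_document is hypothetical, and the real scraper may obtain these values differently:

```python
from datetime import datetime
from pathlib import PurePosixPath
from urllib.parse import urlparse


def ir_camera_document(url: str) -> dict:
    """Build an ir-cameras-style document from an infrared image URL."""
    # File name without extension, e.g. "Danaher_606Z_Thermal_1631050791".
    stem = PurePosixPath(urlparse(url).path).stem
    # Split on the last underscore: the left part is the axis,
    # the right part is the epoch in seconds.
    axis, epoch = stem.rsplit("_", 1)
    return {
        "axis": axis,
        "epoch": int(epoch),
        "url": url,
        # Assumption: naive local time in ISO 8601 format.
        "timestamp": datetime.now().isoformat(),
    }
```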

Redis

Celery result backend for the scraping app.

Technologies:

  • Docker
  • Redis (latest)

RabbitMQ

Celery message broker for the scraping app.

Technologies:

  • Docker
  • RabbitMQ (latest)
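Celery addresses its broker and result backend by URL. The sketch below shows one way the RabbitMQ and Redis environment variables documented later in this README could be combined into those URLs. The function name celery_urls, the default RabbitMQ virtual host ("/"), and Redis database 0 are assumptions for illustration, not settings confirmed by this project:

```python
import os


def celery_urls(env=os.environ) -> tuple:
    """Return (broker_url, backend_url) built from the README's variables."""
    # amqp://user:password@host:port/vhost ; the trailing "//" selects
    # the default virtual host "/" (an assumption).
    broker = "amqp://{user}:{password}@{host}:{port}//".format(
        user=env["RABBITMQ_DEFAULT_USER"],
        password=env["RABBITMQ_DEFAULT_PASS"],
        host=env["RABBITMQ_HOST"],
        port=env["RABBITMQ_PORT"],
    )
    # redis://host:port/db ; database 0 is an assumption.
    backend = "redis://{host}:{port}/0".format(
        host=env["REDIS_HOST"],
        port=env["REDIS_PORT"],
    )
    return broker, backend
```

The two strings could then be passed as the broker and backend arguments when constructing a Celery application.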

rabbitmq.conf

The RabbitMQ config file is located at rabbitmq/myrabbit.conf. consumer_timeout is set to 1 hour, expressed in milliseconds. That is 10 minutes longer than the timeout (expressed in seconds) that the Scraper's producer sets explicitly for each scraping task, so a task hits its own timeout before the broker force-closes the channel.

## Consumer timeout
## If a message delivered to a consumer has not been acknowledged before this timer
## triggers, the channel will be force closed by the broker. This ensures that
## faulty consumers that never ack will not hold on to messages indefinitely.
##
## Set to 1 hour in milliseconds
consumer_timeout = 3600000
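Because the broker timeout is in milliseconds and the task timeout is in seconds, the relationship is easy to get wrong. A small sketch of the arithmetic, using the 10-minute margin described above. The function and constant names are hypothetical:

```python
# Margin between the task timeout and the broker timeout: 10 minutes.
TASK_TIMEOUT_MARGIN_S = 10 * 60


def consumer_timeout_ms(task_timeout_s: int, margin_s: int = TASK_TIMEOUT_MARGIN_S) -> int:
    """Broker consumer_timeout (milliseconds) for a per-task timeout (seconds)."""
    return (task_timeout_s + margin_s) * 1000


def task_timeout_s(broker_timeout_ms: int, margin_s: int = TASK_TIMEOUT_MARGIN_S) -> int:
    """Per-task timeout (seconds) implied by a broker consumer_timeout (ms)."""
    return broker_timeout_ms // 1000 - margin_s
```

With consumer_timeout = 3600000, the implied task timeout is 3000 seconds (50 minutes).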

Classic Scraper

Producer

Queue producer for classic camera image scraping. The Twitter API is queried every minute, and the producer is invoked whenever a new Tweet mentioning AlertWildfire's Twitter account is found. If a Tweet's text mentions a camera by name or axis, that camera is prioritized when scraping.
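How the producer matches camera names against Tweet text is not documented here. As an illustration only, the sketch below assumes a case-insensitive substring match on either the full axis or the camera name after the dot, and moves matching cameras to the front of the list. The function name prioritize_cameras is hypothetical:

```python
def prioritize_cameras(tweet_text: str, axes: list) -> list:
    """Return axes with those mentioned in tweet_text first.

    An axis such as "orangecoca.Axis-DeerCanyon1" counts as mentioned when
    the full axis or the camera name after the dot appears in the text.
    The relative order of cameras is otherwise preserved.
    """
    text = tweet_text.lower()

    def mentioned(axis: str) -> bool:
        name = axis.split(".", 1)[-1]
        return axis.lower() in text or name.lower() in text

    # sorted() is stable and False sorts before True, so negate the flag
    # to place mentioned cameras first.
    return sorted(axes, key=lambda axis: not mentioned(axis))
```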

Technologies:

Environment Variables

RABBITMQ_HOST: RabbitMQ host

RABBITMQ_PORT: RabbitMQ port

RABBITMQ_DEFAULT_USER: RabbitMQ user

RABBITMQ_DEFAULT_PASS: RabbitMQ password

REDIS_HOST: Redis host

REDIS_PORT: Redis port

CONCURRENCY: integer number of concurrent celery tasks

DB_HOST: database host

DB_PORT: (arangodb) database port

DB_NAME: (arangodb) database name

DB_USER: (arangodb) database user

DB_PASS: (arangodb) database password

SEARCHTWEETS_ENDPOINT: Twitter Developer API endpoint

SEARCHTWEETS_BEARER_TOKEN: Twitter Developer API bearer token

SEARCHTWEETS_CONSUMER_KEY: Twitter Developer API key

SEARCHTWEETS_CONSUMER_SECRET: Twitter Developer API secret

CHUNK_SIZE: integer number of camera urls to be retrieved by asynchronous HTTP requests per celery task

QUEUE: name of the queue to push tasks to
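CHUNK_SIZE controls how many camera URLs each Celery task receives. A minimal sketch of that split, assuming the producer simply slices the URL list in order. The helper name chunk_urls is hypothetical:

```python
def chunk_urls(urls: list, chunk_size: int) -> list:
    """Split urls into consecutive lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
```

Each resulting chunk would then be sent as the argument of one task on the queue named by QUEUE. The last chunk may be shorter than CHUNK_SIZE.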

Logs

Logs are sent to stdout and stderr. This can be changed in classic-producer/conf/supervise-producer.conf.

Scraper (aka Consumer)

Distributed, asynchronous scraping service of classic images from ALERTWildfire cameras.

Technologies:

  • Docker
  • Python 3.9
  • Redis (latest)
  • RabbitMQ (latest)
  • Google Drive API
  • Free Proxyscrape API

Environment Variables

RABBITMQ_HOST: RabbitMQ host

RABBITMQ_PORT: RabbitMQ port

RABBITMQ_DEFAULT_USER: RabbitMQ user

RABBITMQ_DEFAULT_PASS: RabbitMQ password

REDIS_HOST: Redis host

REDIS_PORT: Redis port

CONCURRENCY: integer number of concurrent celery tasks

LOGLEVEL: logging level (e.g. info)

QUEUE: name of the queue to retrieve tasks from

DB_HOST: database host

DB_PORT: (arangodb) database port

DB_NAME: (arangodb) database name

DB_USER: (arangodb) database user

DB_PASS: (arangodb) database password

CLIENT_ID: Twitter API client ID

CLIENT_SECRET: Twitter API client secret

PROJECT_ID: Google Drive API project ID

TOKEN: Google Drive API token

REFRESH_TOKEN: Google Drive API refresh token

GDRIVE_PARENT_DIR: ID of Google Drive directory in which to save zip archives of the scraped images
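The zip archive that ends up in GDRIVE_PARENT_DIR can be built entirely in memory before upload. A minimal sketch using only the standard library, assuming the scraped images are available as a mapping of file name to bytes. The helper name zip_images is hypothetical, and the Google Drive upload step itself is omitted:

```python
import io
import zipfile


def zip_images(images: dict) -> bytes:
    """Pack {file name: image bytes} into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in images.items():
            archive.writestr(name, data)
    # The archive is complete once the context manager has closed it.
    return buffer.getvalue()
```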

Logs

Logs are sent to stdout and stderr. This can be changed in classic-scraper/conf/supervise-celery.conf.

Infrared Scraper

Producer

Infrared cameras image scraping queue producer.

Technologies:

  • Docker
  • ArangoDB (latest)
  • Python 3.9
  • Redis (latest)
  • RabbitMQ (latest)

Environment Variables

RABBITMQ_HOST: RabbitMQ host

RABBITMQ_PORT: RabbitMQ port

RABBITMQ_DEFAULT_USER: RabbitMQ user

RABBITMQ_DEFAULT_PASS: RabbitMQ password

REDIS_HOST: Redis host

REDIS_PORT: Redis port

CONCURRENCY: integer number of concurrent celery tasks

DB_HOST: database host

DB_PORT: (arangodb) database port

DB_NAME: (arangodb) database name

DB_USER: (arangodb) database user

DB_PASS: (arangodb) database password

QUEUE: name of the queue to push tasks to

Logs

Logs are sent to stdout and stderr. This can be changed in infrared-producer/conf/supervise-producer.conf.

Scraper (aka Consumer)

Distributed, asynchronous scraping service of infrared images from ALERTWildfire cameras.
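Within a single task, asynchronous scraping usually means issuing many requests concurrently while capping how many are in flight. The sketch below illustrates that pattern with asyncio and an injected fetch coroutine, so it does not depend on any particular HTTP client. The names fetch_all, fetch, and max_in_flight are hypothetical, and this is not necessarily how the scraper is implemented:

```python
import asyncio


async def fetch_all(urls, fetch, max_in_flight: int):
    """Run fetch(url) for every URL with at most max_in_flight at once.

    Returns a list of (url, result) pairs in the same order as urls.
    A failed fetch yields (url, exception) instead of aborting the batch.
    """
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:  # keep going if one camera fails
                return url, exc

    return await asyncio.gather(*(bounded(url) for url in urls))
```

A caller would supply its own fetch coroutine (for example, one that downloads an image through a proxy) and run the batch with asyncio.run(fetch_all(urls, fetch, 10)).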

Technologies:

  • Docker
  • ArangoDB (latest)
  • Python 3.9
  • Redis (latest)
  • RabbitMQ (latest)
  • Google Drive API
  • Free Proxyscrape API

Environment Variables

RABBITMQ_HOST: RabbitMQ host

RABBITMQ_PORT: RabbitMQ port

RABBITMQ_DEFAULT_USER: RabbitMQ user

RABBITMQ_DEFAULT_PASS: RabbitMQ password

REDIS_HOST: Redis host

REDIS_PORT: Redis port

CONCURRENCY: integer number of concurrent celery tasks

LOGLEVEL: logging level (e.g. info)

QUEUE: name of the queue to retrieve tasks from

DB_HOST: database host

DB_PORT: (arangodb) database port

DB_NAME: (arangodb) database name

DB_USER: (arangodb) database user

DB_PASS: (arangodb) database password

CLIENT_ID: Twitter API client ID

CLIENT_SECRET: Twitter API client secret

PROJECT_ID: Google Drive API project ID

TOKEN: Google Drive API token

REFRESH_TOKEN: Google Drive API refresh token

GDRIVE_PARENT_DIR: ID of Google Drive directory in which to save zip archives of the scraped images

Logs

Logs are sent to stdout and stderr. This can be changed in infrared-scraper/conf/supervise-celery.conf.