This repository contains the bare bones needed to set up a Python scraper.
- `scraper/` - Place all your source code in this directory.
  - `scraper.py` - Main scraper code. You can treat the `run_scrape` function as the entrypoint and write your code here.
  - `__main__.py` - Main entrypoint to the scraper. This will be invoked with an output `$filename`, as in `python -m scraper $JSON_FILE`.
- `requirements.txt` - List of package requirements for your code to run. You can modify these to your needs.
- `sample.json` - A sample output from your scraper.
Note: You may also upload additional binaries or files and reference them as needed, but please do not modify any of the other existing files (e.g. `Dockerfile`, `run.sh`, etc.).
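To make this layout concrete, here is a minimal sketch of how `scraper/scraper.py` and `scraper/__main__.py` might fit together. The `run_scrape(filename)` signature and the placeholder record structure are assumptions; only the module layout and the filename-argument convention come from this README.

```python
# scraper/scraper.py
import json

def run_scrape(filename):
    """Entrypoint: scrape data and write it to `filename` as JSON.

    The record below is a placeholder; replace it with whatever
    your scraper actually collects.
    """
    results = [{"id": 1, "value": "example"}]  # placeholder data
    with open(filename, "w") as f:
        json.dump(results, f)
```

```python
# scraper/__main__.py
import sys

from scraper.scraper import run_scrape

if __name__ == "__main__":
    # Invoked as `python -m scraper <filename>`, so the output
    # filename is the first positional argument.
    run_scrape(sys.argv[1])
```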
Install the necessary requirements into your Python environment. The command below will install the packages from `idr-requirements.txt` as well as your custom requirements.
$ pip install -r requirements.txt
Note: Make sure you re-run this command whenever you add a new requirement to the `requirements.txt` file.
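For example, if your scraper needs an HTTP client, you could append a line like the following to `requirements.txt` and re-run the install command above (the package and version pin are purely illustrative):

```
# appended to requirements.txt (illustrative example)
requests>=2.31
```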
To run your code, you must invoke the `scraper` module with a filename argument. This can be done using the `-m` option of the Python interpreter.
$ python -m scraper <filename>
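For example (the output filename here is arbitrary), you can run the scraper and then pretty-print the result with the standard-library `json.tool` module:

$ python -m scraper output.json
$ python -m json.tool output.json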
You can manually test your code by running the `run.sh` script in your terminal. This will invoke your scraper with a random filename and output a summary.
$ ./run.sh
A subset of the tests can be run as follows:
$ python -m scraper sample.json
$ python test_output.py sample.json
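If you want a quick local sanity check before running the real tests, a minimal validation along these lines can help. This is only an illustration, not the actual contents of `test_output.py`, and the expected schema (a non-empty JSON list) is an assumption:

```python
# check_output.py (hypothetical helper, not part of this repo)
import json
import sys

def main(path):
    with open(path) as f:
        data = json.load(f)  # fails loudly if the file is not valid JSON
    # Assumed schema: the scraper emits a non-empty list of records.
    assert isinstance(data, list) and data, "expected a non-empty JSON list"
    print(f"{path}: {len(data)} records OK")

if __name__ == "__main__":
    main(sys.argv[1])
```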
Any commits to the main branch will automatically trigger a GitHub Actions workflow. This will build and test your code in a containerized environment. The tests must pass for your code to be accepted.
During the build process, the contents of this repository will be copied to `/usr/src/scrape`. Your code must be able to run from this path.
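One practical consequence: avoid hard-coding paths relative to your shell's working directory. A common pattern is to resolve bundled files relative to the module itself, as in the sketch below (the `data.csv` filename is hypothetical):

```python
from pathlib import Path

# Directory containing this module, regardless of where the repo was copied.
PACKAGE_DIR = Path(__file__).resolve().parent

# Example: reference a file bundled next to the module (hypothetical name).
DATA_FILE = PACKAGE_DIR / "data.csv"
```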