refactor: move data files to public s3 bucket
anorthall committed Feb 7, 2025
1 parent 17da736 commit 7e67ace
Showing 2 changed files with 9 additions and 22 deletions.
5 changes: 0 additions & 5 deletions .gitattributes

This file was deleted.

26 changes: 9 additions & 17 deletions README.md
@@ -1,4 +1,5 @@
# Caving Incident Report DB

This project aims to digitise the archive of
[National Speleological Society](https://caves.org/) *American Caving Accidents*
caving incident reports, which cover most caving incidents that have occurred in the
@@ -13,6 +14,7 @@ to take a look. You may also wish to view the [about page](https://aca.caver.dev
the website for more information about the project.

## Django application

This fairly straightforward application lives within the `reportdb/` and `etc/` folders, and is
run using docker-compose (or Dokku in production). The application provides a basic CRUD interface for
incident reports, and has management commands (`import_json` and `import_csv`) to enable the mass
@@ -23,29 +25,19 @@ the work of the AI formatter before marking an incident report as 'approved' and
by the public.
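
As a minimal sketch, one of the import commands could be invoked from Python roughly as follows. This assumes a configured Django environment; the settings module name, the JSON file path, and the command's arguments are illustrative assumptions, not confirmed by this repository:

```python
import os

import django
from django.core.management import call_command

# Point Django at the project settings before calling setup().
# The settings module name here is an assumption, not taken from the repo.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "reportdb.settings")
django.setup()

# Bulk-import incident reports from a JSON file (path is illustrative).
call_command("import_json", "data/json/example.json")
```

In practice you would more likely run the same command as a `manage.py` invocation inside the docker-compose web container.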

## Data and processing
The `data/` directory contains the archive of ACA Journals in a number of formats, at varying levels of
processing.

The original PDFs of the journals are contained in `data/pdf/`. These PDF files were run
through Amazon Textract and processed with a simple script to generate the text files within
`data/processed/txt/`. These text files were then further processed by hand to produce those contained
within the `data/processed/txt-split/` directory, where non-incident-report text has been removed and each
incident report is separated by three dashes (`---`) within the text file to allow easier machine separation
of incidents.
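
As a rough illustration of how the `---` delimiter enables that separation (the file name and the exact placement of the delimiter are assumptions):

```python
from pathlib import Path

# Read one of the hand-split text files; the file name is illustrative.
text = Path("data/processed/txt-split/example.txt").read_text()

# Assumes each `---` separator sits on its own line between reports.
incidents = [part.strip() for part in text.split("\n---\n") if part.strip()]
print(f"Found {len(incidents)} incident reports")
```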

The files from `data/processed/txt-split/` were then processed using the OpenAI API via the script
contained within the `data/openai-formatter/` directory. This script produces a JSON array of incidents,
with relevant metadata (such as the cave name, date, incident report, and cavers involved) separated
into fields. The results from this are contained in the `data/json/` directory.
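
A sketch of reading one of these JSON files, assuming each file holds an array of objects; the file name and field names below are illustrative guesses, not the actual schema:

```python
import json
from pathlib import Path

# The file name and field names are assumptions for illustration only.
incidents = json.loads(Path("data/json/example.json").read_text())
for incident in incidents:
    print(incident.get("date"), incident.get("cave"), incident.get("cavers"))
```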

These JSON files are the final stage of processing before the data is added to the Django web application,
where volunteers check the work of the AI formatter before making each incident available
for all to view online.
The original data, including the full ACA report PDF files and the same material in a variety of manually
and machine-processed forms, is available in a public S3 bucket named `caving-incident-reports` in the
`eu-west-2` region. You can download the data from the bucket using the AWS CLI or any number of
graphical S3 clients. If you require any assistance in accessing the data, please join our
[Discord server](https://discord.gg/bUCYsmghVs) and ask for help.
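
For example, a minimal Python sketch using `boto3` with unsigned (anonymous) requests, which should work because the bucket is public; the downloaded object key is illustrative:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: no AWS credentials are needed for a public bucket.
s3 = boto3.client(
    "s3", region_name="eu-west-2", config=Config(signature_version=UNSIGNED)
)

# List the first page of objects (up to 1,000 keys) in the bucket.
response = s3.list_objects_v2(Bucket="caving-incident-reports")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download a single object; the key here is illustrative.
s3.download_file("caving-incident-reports", "example.pdf", "example.pdf")
```

Equivalently, the AWS CLI can mirror the whole bucket with something like `aws s3 sync s3://caving-incident-reports . --no-sign-request`.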

# Contributing

Contributions are welcome, both in the form of code and of volunteering to help edit incidents on
the production Django app. For more information, please join
[our Discord server](https://discord.gg/bUCYsmghVs).

# Licence

This project is licensed under the GNU GPL v3.0. For more information see the LICENCE file.
