refactor: move data files to public s3 bucket
anorthall committed Feb 7, 2025
1 parent f97a68d commit 48b319c
Showing 272 changed files with 9 additions and 1,611 deletions.
26 changes: 9 additions & 17 deletions README.md
@@ -1,4 +1,5 @@
# Caving Incident Report DB

This is a project which aims to digitise the archive of
[National Speleological Society](https://caves.org/) *American Caving Accidents*
caving incident reports, which cover most caving incidents that have happened in the
@@ -13,6 +14,7 @@ to take a look. You may also wish to view the [about page](https://aca.caver.dev
the website for more information about the project.

## Django application

This fairly straightforward application lives within the `reportdb/` and `etc/` folders, and is
run using docker-compose (or Dokku in production). The application provides a basic CRUD interface for
incident reports, and has management commands (`import_json` and `import_csv`) to enable the mass
@@ -23,29 +25,19 @@ the work of the AI formatter before marking an incident report as 'approved' and
by the public.

## Data and processing
The `data/` directory contains the archive of ACA Journals in a number of formats with varied levels of
processing.

The original PDFs of the journals are contained in `data/pdf/`. These PDF files were run
through Amazon AWS Textract, and processed with a simple script, to generate the text files within
`data/processed/txt`. These text files were then further processed by hand to generate the ones contained
within the `data/processed/txt-split/` directory, where non-incident report text has been removed and each
incident report separated by three dashes (`---`) within the text file to allow easier machine separation
of incidents.
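
The `---` separator convention described above makes the split files easy to break apart programmatically. The following is a minimal sketch, not the project's own script: the function name `split_incidents` is hypothetical, and it assumes a separator is a line containing only three dashes.

```python
def split_incidents(text: str) -> list[str]:
    """Split the contents of a txt-split file into individual incident reports.

    Assumption: a separator is a line that contains only `---`
    (surrounding whitespace is ignored).
    """
    reports: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if line.strip() == "---":
            # Separator reached: close off the report collected so far.
            reports.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    # Whatever follows the last separator is the final report.
    reports.append("\n".join(current).strip())
    # Drop empty chunks, e.g. from a trailing separator.
    return [report for report in reports if report]
```

Calling `split_incidents` on the text of one file returns one string per incident, ready to be passed to a later processing step.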

The files from `data/processed/txt-split/` were then processed using the OpenAI API by means of the script
contained within the `data/openai-formatter/` directory. This script produces JSON arrays of each incident,
with relevant metadata (such as the cave name, date, incident report, cavers involved) separated. The results
from this are contained in the `data/json/` directory.

These JSON files are the final stage of processing before the data is added to the Django web application,
which is then used by volunteers to check the work of the AI formatter before making the incident available
for all to view online.
The original data, including the full ACA report PDF files and the various manually and machine-processed
versions derived from them, is available in a public S3 bucket called `caving-incident-reports` in the
`eu-west-2` region. You can download the data from the bucket using the AWS CLI or any
graphical S3 client. If you need help accessing the data, please join our
[Discord server](https://discord.gg/bUCYsmghVs) and ask.
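
As an alternative to the AWS CLI, a public bucket can usually be read over plain HTTPS. The sketch below uses only the Python standard library and rests on several assumptions: that the bucket permits anonymous listing and reads through the standard virtual-hosted S3 endpoint, and that keys look like the hypothetical `pdf/ACA 1991.pdf` (the real layout of the bucket is not documented here). The helper names are illustrative only.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "caving-incident-reports"
REGION = "eu-west-2"
# Standard virtual-hosted-style S3 endpoint for a bucket in a given region.
BASE_URL = f"https://{BUCKET}.s3.{REGION}.amazonaws.com"
# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def object_url(key: str) -> str:
    """Build the HTTPS URL for an object key, percent-encoding spaces etc."""
    return f"{BASE_URL}/{urllib.parse.quote(key)}"


def parse_keys(listing_xml: bytes) -> list[str]:
    """Extract object keys from an S3 ListObjectsV2 XML response."""
    root = ET.fromstring(listing_xml)
    return [element.text for element in root.iter(f"{S3_NS}Key")]


def list_keys(prefix: str = "") -> list[str]:
    """List keys under a prefix (first page only, at most 1000 keys).

    Assumption: the bucket allows unauthenticated listing.
    """
    query = urllib.parse.urlencode({"list-type": "2", "prefix": prefix})
    with urllib.request.urlopen(f"{BASE_URL}/?{query}") as response:
        return parse_keys(response.read())


def download(key: str, destination: str) -> None:
    """Download one object to a local file path."""
    urllib.request.urlretrieve(object_url(key), destination)
```

If anonymous access is enabled, the AWS CLI should also work without credentials, for example `aws s3 sync s3://caving-incident-reports ./data --region eu-west-2 --no-sign-request`.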

# Contributing

Contributions are welcome - both in terms of code, and volunteering to help edit incidents on
the production Django app. For more information, please join
[our Discord server](https://discord.gg/bUCYsmghVs).

# Licence

This project is licensed under the GNU GPL v3.0. For more information see the LICENCE file.
3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 1991.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 1992.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 1993.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 1994-1995.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 1996-1998.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 1999-2001.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2002-2003.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2004-2005.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2006.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2007-2008.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2009-2010.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2011-2012.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2013-2014.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2015-2016 (50th Anniversary).pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2017-2018.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2019-2020.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/json/ACA 2021-2022.pdf.json

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 1991.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 1992.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 1993.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 1994-1995.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 1996-1998.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 1999-2001.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2002-2003.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2004-2005.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2006.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2007-2008.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2009-2010.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2011-2012.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2013-2014.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2015-2016 (50th Anniversary).pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2017-2018.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2019-2020.pdf.txt

This file was deleted.

3 changes: 0 additions & 3 deletions data/azure-ocr/txt/ACA 2021-2022.pdf.txt

This file was deleted.

4 changes: 0 additions & 4 deletions data/openai-formatter/.gitignore

This file was deleted.

30 changes: 0 additions & 30 deletions data/openai-formatter/fix-titles.py

This file was deleted.
