refactor: move data files to public s3 bucket
anorthall committed Feb 7, 2025
1 parent 17da736 commit 7e67ace
Showing 2 changed files with 9 additions and 22 deletions.
5 changes: 0 additions & 5 deletions .gitattributes

This file was deleted.

26 changes: 9 additions & 17 deletions README.md
@@ -1,4 +1,5 @@
# Caving Incident Report DB

This project aims to digitise the archive of
[National Speleological Society](https://caves.org/) *American Caving Accidents*
caving incident reports, which cover most caving incidents that have occurred in the
@@ -13,6 +14,7 @@ to take a look. You may also wish to view the [about page](https://aca.caver.dev
the website for more information about the project.

## Django application

This fairly straightforward application lives within the `reportdb/` and `etc/` folders, and is
run using docker-compose (or Dokku in production). The application provides a basic CRUD interface for
incident reports, and has management commands (`import_json` and `import_csv`) to enable the mass
@@ -23,29 +25,19 @@ the work of the AI formatter before marking an incident report as 'approved' and
by the public.
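
As a minimal sketch, one of the import commands could be invoked from Python roughly as follows. This assumes a configured Django environment; the settings module name, the JSON file path, and the command's arguments are illustrative assumptions, not confirmed by this repository:

```python
import os

import django
from django.core.management import call_command

# Point Django at the project settings before calling setup().
# The settings module name here is an assumption, not taken from the repo.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "reportdb.settings")
django.setup()

# Bulk-import incident reports from a JSON file (path is illustrative).
call_command("import_json", "data/json/example.json")
```

In practice you would more likely run the same command as a `manage.py` invocation inside the docker-compose web container.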

## Data and processing
The `data/` directory contains the archive of ACA Journals in a number of formats, at varying levels of
processing.

The original PDFs of the journals are contained in `data/pdf/`. These PDF files were run
through Amazon Textract and processed with a simple script to generate the text files within
`data/processed/txt/`. These text files were then further processed by hand to produce those contained
within the `data/processed/txt-split/` directory, where non-incident-report text has been removed and each
incident report is separated by three dashes (`---`) within the text file to allow easier machine separation
of incidents.
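
As a rough illustration of how the `---` delimiter enables that separation (the file name and the exact placement of the delimiter are assumptions):

```python
from pathlib import Path

# Read one of the hand-split text files; the file name is illustrative.
text = Path("data/processed/txt-split/example.txt").read_text()

# Assumes each `---` separator sits on its own line between reports.
incidents = [part.strip() for part in text.split("\n---\n") if part.strip()]
print(f"Found {len(incidents)} incident reports")
```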

The files from `data/processed/txt-split/` were then processed using the OpenAI API via the script
contained within the `data/openai-formatter/` directory. This script produces a JSON array of incidents,
with relevant metadata (such as the cave name, date, incident report, and cavers involved) separated
into fields. The results from this are contained in the `data/json/` directory.
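
A sketch of reading one of these JSON files, assuming each file holds an array of objects; the file name and field names below are illustrative guesses, not the actual schema:

```python
import json
from pathlib import Path

# The file name and field names are assumptions for illustration only.
incidents = json.loads(Path("data/json/example.json").read_text())
for incident in incidents:
    print(incident.get("date"), incident.get("cave"), incident.get("cavers"))
```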

These JSON files are the final stage of processing before the data is added to the Django web application,
where volunteers check the work of the AI formatter before making each incident available
for all to view online.
The original data, including the full ACA report PDF files and the same material in a variety of manually
and machine-processed forms, is available in a public S3 bucket named `caving-incident-reports` in the
`eu-west-2` region. You can download the data from the bucket using the AWS CLI or any number of
graphical S3 clients. If you require any assistance in accessing the data, please join our
[Discord server](https://discord.gg/bUCYsmghVs) and ask for help.
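
For example, a minimal Python sketch using `boto3` with unsigned (anonymous) requests, which should work because the bucket is public; the downloaded object key is illustrative:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: no AWS credentials are needed for a public bucket.
s3 = boto3.client(
    "s3", region_name="eu-west-2", config=Config(signature_version=UNSIGNED)
)

# List the first page of objects (up to 1,000 keys) in the bucket.
response = s3.list_objects_v2(Bucket="caving-incident-reports")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download a single object; the key here is illustrative.
s3.download_file("caving-incident-reports", "example.pdf", "example.pdf")
```

Equivalently, the AWS CLI can mirror the whole bucket with something like `aws s3 sync s3://caving-incident-reports . --no-sign-request`.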

# Contributing

Contributions are welcome, both in the form of code and of volunteering to help edit incidents on
the production Django app. For more information, please join
[our Discord server](https://discord.gg/bUCYsmghVs).

# Licence

This project is licensed under the GNU GPL v3.0. For more information see the LICENCE file.
