📝 Add the docs
simonwoerpel committed Jan 16, 2025
1 parent 2824806 commit 8095996
Showing 20 changed files with 738 additions and 94 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -6,7 +6,7 @@

`leakrfc` provides a _data standard_ and _archive storage_ for leaked data, private and public document collections. The concepts and implementations are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer).

`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides an high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).
`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).

## Installation

57 changes: 57 additions & 0 deletions docs/api.md
@@ -0,0 +1,57 @@
`leakrfc` provides a simple API powered by [FastAPI](https://fastapi.tiangolo.com/) for clients to retrieve file metadata and blobs. It acts as a proxy between the client and the archive, so that the client doesn't need to know where the actual blobs live. The API can handle authorization via [JSON Web Tokens](https://jwt.io).

## Start a local API server

This is for a quick testing setup:

```bash
export LEAKRFC_URI=./data
uvicorn leakrfc.api:app
```

!!! warning

    Never run the API with `DEBUG=1` in production, and make sure to have a proper setup with a load balancer (e.g. nginx) doing TLS termination in front of it. Also make sure to set a strong `LEAKRFC_API_SECRET_KEY` environment variable for the token authorization.

## Request a file

For public files:

```bash
# metadata only via headers
curl -I "http://localhost:5000/test_dataset/utf.txt"

HTTP/1.1 200 OK
date: Thu, 16 Jan 2025 08:44:59 GMT
server: uvicorn
content-length: 4
content-type: application/json
x-leakrfc-version: 0.0.3
x-leakrfc-dataset: test_dataset
x-leakrfc-key: utf.txt
x-leakrfc-sha1: 5a6acf229ba576d9a40b09292595658bbb74ef56
x-leakrfc-name: utf.txt
x-leakrfc-size: 19
x-mimetype: text/plain
content-type: text/plain
```

```bash
# bytes stream of file
curl -s "http://localhost:5000/<dataset>/<path>" > /tmp/file.pdf
```

Authorization expects a signed bearer token with the dataset and key lookup in the subject (token payload: `{"sub": "<dataset>/<key>"}`). Therefore, clients need to be able to create such tokens (knowing the secret key configured via `LEAKRFC_API_SECRET_KEY`) and handle dataset permissions.

Tokens should have a short expiration (via the `exp` claim in the payload).
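
A minimal sketch for creating such a token with [PyJWT](https://pyjwt.readthedocs.io/) could look like this (assuming the API verifies HS256-signed tokens with the shared secret; check your deployment for the actual algorithm and claims):

```python
import os
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

# Assumption: tokens are HS256-signed with the shared LEAKRFC_API_SECRET_KEY
secret = os.environ["LEAKRFC_API_SECRET_KEY"]

token = jwt.encode(
    {
        # dataset and file key the client is allowed to access
        "sub": "test_dataset/utf.txt",
        # keep the expiration short
        "exp": datetime.now(timezone.utc) + timedelta(minutes=5),
    },
    secret,
    algorithm="HS256",
)
print(token)
```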

```bash
# token in Authorization header
curl -H 'Authorization: Bearer <token>' ...

# metadata only via headers
curl -I "http://localhost:5000/file"

# bytes stream of file
curl -s "http://localhost:5000/file" > /tmp/file.lrfc
```
17 changes: 17 additions & 0 deletions docs/cache.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
For incremental processing of tasks, `leakrfc` uses a global cache to track task results. If a computed cache key for a specific task (e.g. sync a file, extract an archive) is already found in the cache, running the task again will be skipped. This is implemented at a granular level and applies to all kinds of operations, such as [crawl](./crawl.md), [make](./make.md) and the adapters (currently [aleph](./sync/aleph.md)).

`leakrfc` uses [anystore](https://docs.investigraph.dev/lib/anystore/cache/) for the cache implementation, so any supported backend is possible. Recommended backends are Redis or SQL, but a distributed cloud backend (such as a shared S3 bucket) can make sense, too.

By default, an in-memory cache is used, which doesn't persist.

## Configure

Via environment variables:

```bash
LEAKRFC_CACHE__URI=redis://localhost

# additional config
LEAKRFC_CACHE__DEFAULT_TTL=3600 # seconds
LEAKRFC_CACHE__BACKEND_CONFIG__REDIS_PREFIX=my-prefix
```
41 changes: 41 additions & 0 deletions docs/configuration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
A `leakrfc` archive can be configured via _environment variables_ or a yaml configuration file. Individual datasets within the archive can have their own configuration, which enables creating an archive with different _storage configurations_ per dataset.

## Using environment vars

Simply point to a local base folder containing the archive:

LEAKRFC_URI=./data/

Or point to a (local or remote) yaml configuration (see below):

LEAKRFC_URI=https://data.example.org/archive.yml

More granular configuration is possible via additional env vars. `leakrfc` uses [pydantic-settings](https://docs.pydantic.dev/latest/concepts/pydantic_settings/) to parse the configuration. Nested configuration keys can be accessed via the `__` delimiter.

LEAKRFC_ARCHIVE__URI=s3://leakrfc
LEAKRFC_ARCHIVE__PUBLIC_URL=https://cdn.example.org/{dataset}/{key}
LEAKRFC_ARCHIVE__STORAGE__READONLY=true

## YAML config

Create a base config and enable it via `LEAKRFC_URI=leakrfc.yml`:

```yaml
name: leakrfc-archive
storage:
  uri: ./archive
  # ...
```

Within the local archive, an individual dataset could actually live in the cloud:

`./archive/remote_dataset/.leakrfc/config.yml`:

```yaml
name: remote_dataset
storage:
  uri: s3://my_bucket/data
  # ...
```

This means the local folder `./archive/remote_dataset/` would only contain this yaml configuration, while the actual contents of the dataset remain in the remote location.
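
An illustrative layout of such a mixed archive (dataset names are just examples):

```
archive/
  local_dataset/
    .leakrfc/
    ...                     # blobs stored locally
  remote_dataset/
    .leakrfc/
      config.yml            # storage uri points to s3://my_bucket/data
```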
71 changes: 71 additions & 0 deletions docs/crawl.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
Crawl a local or remote location of documents (that supports file listing) into a `leakrfc` dataset. This operation stores the file metadata and actual file blobs in the [configured archive](./configuration.md).

This will create a new dataset or update an existing one. Incremental crawls are cached via the global [leakrfc cache](./cache.md).

Crawls can add files to a dataset but never delete files that no longer exist in the source.

## Basic usage

### Crawl a local directory

```bash
leakrfc -d my_dataset crawl /data/dump1/
```

### Crawl an HTTP location

The location needs to support file listing.

In this example, archives (zip, tar.gz, ...) will be extracted during import.

```bash
leakrfc -d ddos_blueleaks crawl --extract https://data.ddosecrets.com/BlueLeaks/
```

### Crawl from a cloud bucket

In this example, only PDF files are crawled:

```bash
leakrfc -d my_dataset crawl --include "*.pdf" s3://my_bucket/files
```

Under the hood, `leakrfc` uses [anystore](https://docs.investigraph.dev/lib/anystore) which uses [fsspec](https://filesystem-spec.readthedocs.io/en/latest/index.html) that allows a wide range of filesystem-like sources. For some, installing additional dependencies might be required.
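
For example, crawling from an S3 bucket typically requires the `s3fs` package (an assumption based on fsspec's optional protocol dependencies):

```bash
pip install s3fs
```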

### Extract

Source files can be extracted during import using [patool](https://pypi.org/project/patool/). This has a few caveats:

- When enabling `--extract`, archives won't be stored but only their extracted members, keeping the original (archived) directory structure.
- This can lead to file conflicts if several archives share the same directory structure (`file.pdf` from `archive2.zip` would replace the one from `archive1.zip`):

```
archive1.zip
subdir1/file.pdf
archive2.zip
subdir1/file.pdf
```

- To avoid this, use `--extract-ensure-subdir` to create a sub-directory named after its source archive to place the extracted members into. The result would look like:

```
archive1.zip/subdir1/file.pdf
archive2.zip/subdir1/file.pdf
```

- If keeping the source archives is desired, use `--extract-keep-source` (see the example below).
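
For example, a crawl that extracts archives into per-archive sub-directories while keeping the original archive files could look like this (a sketch combining the flags described above with the local directory example):

```bash
leakrfc -d my_dataset crawl \
  --extract \
  --extract-ensure-subdir \
  --extract-keep-source \
  /data/dump1/
```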

## Include / Exclude glob patterns

Only crawl a subdirectory:

--include "subdir/*"

Exclude .txt files from a subdirectory and all its children:

--exclude "subdir/**/*.txt"
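
Both patterns can be combined in one crawl, e.g. (a sketch based on the local directory example above, assuming the options can be passed together):

```bash
leakrfc -d my_dataset crawl \
  --include "subdir/*" \
  --exclude "subdir/**/*.txt" \
  /data/dump1/
```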


## Reference

::: leakrfc.crawl
49 changes: 49 additions & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# leakrfc

"_A RFC for leaks_"

[leak-rfc.org](https://leak-rfc.org)

`leakrfc` provides a _data standard_ and _archive storage_ for leaked data, private and public document collections. The concepts and implementations are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer).

`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).

## Installation

Requires Python 3.11 or later.

```bash
pip install leakrfc
```

## Quickstart

[>> get started here](quickstart.md)

## Development

This package uses [poetry](https://python-poetry.org/) for packaging and dependency management, so first [install it](https://python-poetry.org/docs/#installation).

Clone [this repository](https://github.com/investigativedata/leakrfc) to a local destination.

Within the repo directory, run

poetry install --with dev

This installs a few development dependencies, including [pre-commit](https://pre-commit.com/), which needs to be registered:

poetry run pre-commit install

Before creating a commit, this checks for correct code formatting (isort, black) and some other useful things (see `.pre-commit-config.yaml`).

### Testing

`leakrfc` uses [pytest](https://docs.pytest.org/en/stable/) as the testing framework.

make test

## License and Copyright

`leakrfc`, (c) 2024 [investigativedata.io](https://investigativedata.io)

`leakrfc` is licensed under the AGPLv3 or later license.
12 changes: 12 additions & 0 deletions docs/make.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
This generates or updates a dataset archive. The command should be used after files were added to or deleted from the archive.

The process can also be used to turn any existing directory or remote location into a `leakrfc` dataset.

```
leakrfc -d my_dataset make [OPTIONS]
```


## Reference

::: leakrfc.make
105 changes: 105 additions & 0 deletions docs/quickstart.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Quickstart

## Install

Requires Python 3.11 or later.

```bash
pip install leakrfc
```

## Build a dataset

`leakrfc` stores _metadata_ for the files, which then refers to the actual _source files_.

For example, take this public file listing archive: [https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/](https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/)

Crawl these documents into a _dataset_:

```bash
leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"
```

The _metadata_ and _source files_ are now stored in the archive (`./data` by default).

## Inspect files and archive

All _metadata_ and other information lives in the `ddos_patriotfront/.leakrfc` subdirectory. Files are keyed and accessible by their (relative) path.

Retrieve file metadata:

```bash
leakrfc -d ddos_patriotfront head Event.pdf
```
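
The exact schema isn't reproduced here, but judging from the headers exposed by the [api](./api.md), such a metadata record contains at least fields like these (illustrative sketch, not the actual `leakrfc` format):

```yaml
# illustrative sketch only
dataset: ddos_patriotfront
key: Event.pdf
name: Event.pdf
sha1: <sha1 checksum of the blob>
size: <size in bytes>
mimetype: application/pdf
```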

Retrieve actual file blob:

```bash
leakrfc -d ddos_patriotfront get Event.pdf > Event.pdf
```

Show the metadata of all files present in the dataset archive:

```bash
leakrfc -d ddos_patriotfront ls
```

Show only the file paths:

```bash
leakrfc -d ddos_patriotfront ls --keys
```

Show only the checksums (sha1 by default):

```bash
leakrfc -d ddos_patriotfront ls --checksums
```

### Tracking changes

The [`make`](./make.md) command (re-)generates the dataset's metadata.

Delete a file:

```bash
rm ./data/ddos_patriotfront/Event.pdf
```

Now regenerate:

```bash
leakrfc -d ddos_patriotfront make
```

The resulting output will indicate that 1 file was deleted.

## Configure storage

Storage can be configured per dataset, for example to use an S3-compatible bucket:

```yaml
storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
```
### Dataset config.yml

Follows the specification in [`ftmq.model.Dataset`](https://github.com/investigativedata/ftmq/blob/main/ftmq/model/dataset.py):

```yaml
name: my_dataset # also known as "foreign_id"
title: An awesome leak
description: >
  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name
  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
  similique asperiores quod et quae maiores. Et accusantium accusantium error
  et alias aut omnis eos. Omnis porro sit eum et.
updated_at: 2024-09-25
index_url: https://static.example.org/my_dataset/index.json
# add more metadata
leakrfc: # see above
```
4 changes: 4 additions & 0 deletions docs/reference/cli.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
::: mkdocs-typer
    :module: leakrfc.cli
    :prog_name: leakrfc
    :command: cli
