Commit
1 parent
2824806
commit 8095996
Showing
20 changed files
with
738 additions
and
94 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
`leakrfc` provides a simple API powered by [FastAPI](https://fastapi.tiangolo.com/) for clients to retrieve file metadata and blobs. It acts as a proxy between client and archive, so the client doesn't need to know where the actual blobs live. The API can handle authorization via [JSON Web Tokens](https://jwt.io). | ||
|
||
## Start local api server | ||
|
||
This is for a quick testing setup: | ||
|
||
```bash | ||
export LEAKRFC_URI=./data | ||
uvicorn leakrfc.api:app --port 5000 | ||
``` | ||
|
||
!!! warning | ||
|
||
Never run the API with `DEBUG=1` in production, and make sure to have a proper setup with a load balancer (e.g. nginx) doing TLS termination in front of it. Also make sure to set a strong `LEAKRFC_API_SECRET_KEY` environment variable for token authorization. | ||
|
||
## Request a file | ||
|
||
For public files: | ||
|
||
```bash | ||
# metadata only via headers | ||
curl -I "http://localhost:5000/test_dataset/utf.txt" | ||
|
||
HTTP/1.1 200 OK | ||
date: Thu, 16 Jan 2025 08:44:59 GMT | ||
server: uvicorn | ||
content-length: 4 | ||
content-type: application/json | ||
x-leakrfc-version: 0.0.3 | ||
x-leakrfc-dataset: test_dataset | ||
x-leakrfc-key: utf.txt | ||
x-leakrfc-sha1: 5a6acf229ba576d9a40b09292595658bbb74ef56 | ||
x-leakrfc-name: utf.txt | ||
x-leakrfc-size: 19 | ||
x-mimetype: text/plain | ||
content-type: text/plain | ||
``` | ||
|
||
```bash | ||
# bytes stream of file | ||
curl -s "http://localhost:5000/<dataset>/<path>" > /tmp/file.pdf | ||
``` | ||
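The same metadata lookup can be sketched from Python using only the standard library. This is a minimal illustration, not part of `leakrfc`: the helper names `leakrfc_metadata` and `head` are hypothetical, and the sketch only assumes that the metadata arrives in `x-leakrfc-*` response headers as in the example response above.

```python
from urllib.request import Request, urlopen

PREFIX = "x-leakrfc-"


def leakrfc_metadata(headers: dict[str, str]) -> dict[str, str]:
    """Collect the x-leakrfc-* response headers into a plain dict (hypothetical helper)."""
    return {
        name.lower()[len(PREFIX):]: value
        for name, value in headers.items()
        if name.lower().startswith(PREFIX)
    }


def head(url: str) -> dict[str, str]:
    """Send a HEAD request and return only the leakrfc metadata headers."""
    with urlopen(Request(url, method="HEAD")) as response:
        return leakrfc_metadata(dict(response.headers.items()))
```

With the local test server running, `head("http://localhost:5000/test_dataset/utf.txt")` would return the dataset, key, checksum, name and size reported by the API.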
|
||
Authorization expects an encrypted bearer token with the dataset and key lookup in the subject (token payload: `{"sub": "<dataset>/<key>"}`). Therefore, clients need to be able to create such tokens (knowing the secret key configured via `LEAKRFC_API_SECRET_KEY`) and handle dataset permissions. | ||
|
||
Tokens should have a short expiration (via `exp` property in payload). | ||
|
||
```bash | ||
# token in Authorization header | ||
curl -H 'Authorization: Bearer <token>' ... | ||
|
||
# metadata only via headers | ||
curl -I "http://localhost:5000/file" | ||
|
||
# bytes stream of file | ||
curl -s "http://localhost:5000/file" > /tmp/file.lrfc | ||
``` |
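A client that knows the secret key could build such a token roughly as follows. This is a sketch under assumptions: it produces a JWT signed with HMAC-SHA256 (`HS256`), which may not be the algorithm the API is configured for, and `make_token` is a hypothetical helper. In practice a maintained JWT library is preferable to hand-rolled signing.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_token(secret: str, dataset: str, key: str, ttl: int = 60) -> str:
    """Build a short-lived token for one file (assumes HS256 signing)."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": f"{dataset}/{key}", "exp": int(time.time()) + ttl}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{b64url(signature)}"
```

The resulting string goes into the `Authorization: Bearer <token>` header; the default lifetime of 60 seconds is an arbitrary choice for illustration.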
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
For incremental processing of tasks, `leakrfc` uses a global cache to track task results. If the computed cache key for a specific task (e.g. syncing a file, extracting an archive) is already found in the cache, the task is skipped when run again. This works at a very granular level and applies to all kinds of operations, such as [crawl](./crawl.md), [make](./make.md) and the adapters (currently [aleph](./sync/aleph.md)). | ||
|
||
`leakrfc` is using [anystore](https://docs.investigraph.dev/lib/anystore/cache/) for the cache implementation, so any supported backend is possible. Recommended backends are redis or sql, but a distributed cloud-backend (such as a shared s3 bucket) can make sense, too. | ||
|
||
By default, an in-memory cache is used, which doesn't persist between runs. | ||
|
||
## Configure | ||
|
||
Via environment var: | ||
|
||
```bash | ||
LEAKRFC_CACHE__URI=redis://localhost | ||
|
||
# additional config | ||
LEAKRFC_CACHE__DEFAULT_TTL=3600 # seconds | ||
LEAKRFC_CACHE__BACKEND_CONFIG__REDIS_PREFIX=my-prefix | ||
``` |
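The skip-if-cached idea described above can be illustrated with a small in-memory sketch. It does not use the `anystore` API; `cache_key` and `run_once` are hypothetical names, and the way `leakrfc` actually derives its cache keys is an assumption here.

```python
import hashlib
from typing import Callable

# Stand-in for a real cache backend (redis, sql, ...).
_cache: dict[str, str] = {}


def cache_key(task: str, *parts: str) -> str:
    """Derive a stable key from the task name and its arguments."""
    raw = "/".join((task, *parts))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def run_once(task: str, key: str, func: Callable[[str], str]) -> tuple[str, bool]:
    """Run func(key) unless a result is cached; return (result, skipped)."""
    ck = cache_key(task, key)
    if ck in _cache:
        return _cache[ck], True
    result = func(key)
    _cache[ck] = result
    return result, False
```

Because the dictionary lives only in process memory, this mirrors the default non-persistent behaviour; a shared backend is what makes repeated runs incremental.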
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
A `leakrfc` archive can be configured via _environment variables_ or a YAML configuration file. Individual datasets within the archive can have their own configuration, which makes it possible to create an archive with different _storage configurations_ per dataset. | ||
|
||
## Using environment vars | ||
|
||
Simply point to a local base folder containing the archive: | ||
|
||
LEAKRFC_URI=./data/ | ||
|
||
Or point to a (local or remote) yaml configuration (see below): | ||
|
||
LEAKRFC_URI=https://data.example.org/archive.yml | ||
|
||
Additional environment variables allow more granular configuration. `leakrfc` uses [pydantic-settings](https://docs.pydantic.dev/latest/concepts/pydantic_settings/) to parse the configuration. Nested configuration keys are addressed with the `__` delimiter. | ||
|
||
LEAKRFC_ARCHIVE__URI=s3://leakrfc | ||
LEAKRFC_ARCHIVE__PUBLIC_URL=https://cdn.example.org/{dataset}/{key} | ||
LEAKRFC_ARCHIVE__STORAGE__READONLY=true | ||
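To make the `__` convention concrete, the following sketch shows how such variable names map onto a nested structure. It is a simplified stand-in for what pydantic-settings does, not its implementation: `nested_settings` is a hypothetical helper, and values stay plain strings here, whereas pydantic would also validate and coerce them.

```python
def nested_settings(
    environ: dict[str, str], prefix: str = "LEAKRFC_", delimiter: str = "__"
) -> dict:
    """Turn PREFIX_A__B__C=value variables into {"a": {"b": {"c": value}}}."""
    settings: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        *parents, leaf = name[len(prefix):].lower().split(delimiter)
        node = settings
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return settings
```

Passing `dict(os.environ)` with the three variables above set would yield an `archive` mapping containing `uri`, `public_url` and a nested `storage` mapping with `readonly`.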
|
||
## YAML config | ||
|
||
Create a base config and enable it via `LEAKRFC_URI=leakrfc.yml`: | ||
|
||
```yaml | ||
name: leakrfc-archive | ||
storage: | ||
uri: ./archive | ||
# ... | ||
``` | ||
|
||
Within the local archive, one dataset could actually live in the cloud: | ||
|
||
`./archive/remote_dataset/.leakrfc/config.yml`: | ||
|
||
```yaml | ||
name: remote_dataset | ||
storage: | ||
uri: s3://my_bucket/data | ||
# ... | ||
``` | ||
|
||
This means the local folder `./archive/remote_dataset/` would contain only this YAML configuration, while the dataset's contents are read from the remote storage. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
Crawl a local or remote location of documents (that supports file listing) into a `leakrfc` dataset. This operation stores the file metadata and actual file blobs in the [configured archive](./configuration.md). | ||
|
||
This will create a new dataset or update an existing one. Incremental crawls are cached via the global [leakrfc cache](./cache.md). | ||
|
||
A crawl can add files to a dataset but never deletes files that no longer exist at the source. | ||
|
||
## Basic usage | ||
|
||
### Crawl a local directory | ||
|
||
```bash | ||
leakrfc -d my_dataset crawl /data/dump1/ | ||
``` | ||
### Crawl an HTTP location | ||
|
||
The location needs to support file listing. | ||
|
||
In this example, archives (zip, tar.gz, ...) will be extracted during import. | ||
|
||
```bash | ||
leakrfc -d ddos_blueleaks crawl --extract https://data.ddosecrets.com/BlueLeaks/ | ||
``` | ||
|
||
### Crawl from a cloud bucket | ||
|
||
In this example, only pdf files are crawled: | ||
|
||
```bash | ||
leakrfc -d my_dataset crawl --include "*.pdf" s3://my_bucket/files | ||
``` | ||
|
||
Under the hood, `leakrfc` uses [anystore](https://docs.investigraph.dev/lib/anystore), which builds on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/index.html) and therefore supports a wide range of filesystem-like sources. Some of them might require installing additional dependencies. | ||
|
||
### Extract | ||
|
||
Source files can be extracted during import using [patool](https://pypi.org/project/patool/). This has a few caveats: | ||
|
||
- When enabling `--extract`, archives won't be stored but only their extracted members, keeping the original (archived) directory structure. | ||
- This can lead to file conflicts, if several archives have the same directory structure (file.pdf from archive2.zip would replace the previous one): | ||
|
||
``` | ||
archive1.zip | ||
subdir1/file.pdf | ||
archive2.zip | ||
subdir1/file.pdf | ||
``` | ||
|
||
- To avoid this, use `--extract-ensure-subdir` to create a sub-directory named by its source archive to place the extracted members into. The result would look like: | ||
|
||
``` | ||
archive1.zip/subdir1/file.pdf | ||
archive2.zip/subdir1/file.pdf | ||
``` | ||
|
||
- If keeping the source archives is desired, use `--extract-keep-source` | ||
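The path conflict and its resolution can be sketched as a small function that computes where an extracted member ends up. This only illustrates the rules listed above; `target_key` is a hypothetical helper and not the actual `leakrfc` implementation.

```python
from pathlib import PurePosixPath


def target_key(archive: str, member: str, ensure_subdir: bool = False) -> str:
    """Return the dataset key for a member extracted from an archive."""
    archive_path = PurePosixPath(archive)
    # Without the sub-directory, members land next to the source archive.
    base = archive_path if ensure_subdir else archive_path.parent
    return str(base / member)
```

Without the flag, `archive1.zip` and `archive2.zip` both map `subdir1/file.pdf` to the same key, so the later one wins; with `ensure_subdir=True` the keys stay distinct.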
|
||
## Include / Exclude glob patterns | ||
|
||
Only crawl a subdirectory: | ||
|
||
--include "subdir/*" | ||
|
||
Exclude .txt files from a subdirectory and all its children: | ||
|
||
--exclude "subdir/**/*.txt" | ||
|
||
|
||
## Reference | ||
|
||
::: leakrfc.crawl |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# leakrfc | ||
|
||
"_A RFC for leaks_" | ||
|
||
[leak-rfc.org](https://leak-rfc.org) | ||
|
||
`leakrfc` provides a _data standard_ and _archive storage_ for leaked data, private and public document collections. The concepts and implementations are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer). | ||
|
||
`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/). | ||
|
||
## Installation | ||
|
||
Requires python 3.11 or later. | ||
|
||
```bash | ||
pip install leakrfc | ||
``` | ||
|
||
## Quickstart | ||
|
||
[>> get started here](quickstart.md) | ||
|
||
## Development | ||
|
||
This package is using [poetry](https://python-poetry.org/) for packaging and dependencies management, so first [install it](https://python-poetry.org/docs/#installation). | ||
|
||
Clone [this repository](https://github.com/investigativedata/leakrfc) to a local destination. | ||
|
||
Within the repo directory, run | ||
|
||
poetry install --with dev | ||
|
||
This installs a few development dependencies, including [pre-commit](https://pre-commit.com/) which needs to be registered: | ||
|
||
poetry run pre-commit install | ||
|
||
Before each commit, this checks for correct code formatting (isort, black) and runs a few other useful checks (see: `.pre-commit-config.yaml`). | ||
|
||
### Testing | ||
|
||
`leakrfc` uses [pytest](https://docs.pytest.org/en/stable/) as the testing framework. | ||
|
||
make test | ||
|
||
## License and Copyright | ||
|
||
`leakrfc`, (c) 2024 [investigativedata.io](https://investigativedata.io) | ||
|
||
`leakrfc` is licensed under the AGPLv3 or later license. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
This generates or updates a dataset archive. This command should be used after files were added to or deleted from the archive. | ||
|
||
The process can also be used to turn any existing directory or remote location into a `leakrfc` dataset. | ||
|
||
``` | ||
leakrfc -d my_dataset make [OPTIONS] | ||
``` | ||
|
||
|
||
## Reference | ||
|
||
::: leakrfc.make |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,105 @@ | ||
# Quickstart | ||
|
||
## Install | ||
|
||
Requires python 3.11 or later. | ||
|
||
```bash | ||
pip install leakrfc | ||
``` | ||
|
||
## Build a dataset | ||
|
||
`leakrfc` stores _metadata_ for each file, which in turn refers to the actual _source file_. | ||
|
||
For example, take this public file listing archive: [https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/](https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/) | ||
|
||
Crawl these documents into a _dataset_: | ||
|
||
```bash | ||
leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes" | ||
``` | ||
|
||
The _metadata_ and _source files_ are now stored in the archive (`./data` by default). | ||
|
||
## Inspect files and archive | ||
|
||
All _metadata_ and other information lives in the `ddos_patriotfront/.leakrfc` subdirectory. Files are keyed and accessible by their (relative) path. | ||
|
||
Retrieve file metadata: | ||
|
||
```bash | ||
leakrfc -d ddos_patriotfront head Event.pdf | ||
``` | ||
|
||
Retrieve actual file blob: | ||
|
||
```bash | ||
leakrfc -d ddos_patriotfront get Event.pdf > Event.pdf | ||
``` | ||
|
||
Show the metadata of all files present in the dataset archive: | ||
|
||
```bash | ||
leakrfc -d ddos_patriotfront ls | ||
``` | ||
|
||
Show only the file paths: | ||
|
||
```bash | ||
leakrfc -d ddos_patriotfront ls --keys | ||
``` | ||
|
||
Show only the checksums (sha1 by default): | ||
|
||
```bash | ||
leakrfc -d ddos_patriotfront ls --checksums | ||
``` | ||
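To cross-check a listed checksum against a file on disk, a SHA-1 digest can be computed locally. This sketch assumes the default `sha1` checksums mentioned above are plain hex digests of the file contents; `sha1sum` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def sha1sum(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

For instance, `sha1sum(Path("./data/ddos_patriotfront/Event.pdf"))` could be compared with the corresponding line of the `ls --checksums` output.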
|
||
### Tracking changes | ||
|
||
The [`make`](./make.md) command (re-)generates the dataset's metadata. | ||
|
||
Delete a file: | ||
|
||
```bash | ||
rm ./data/ddos_patriotfront/Event.pdf | ||
``` | ||
|
||
Now regenerate: | ||
|
||
```bash | ||
leakrfc -d ddos_patriotfront make | ||
``` | ||
|
||
The output will indicate that one file was deleted. | ||
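Conceptually, change tracking boils down to comparing the keys recorded in the metadata with the files currently present. The following sketch shows that comparison with plain sets; `diff_keys` is a hypothetical helper, and the actual `make` command may detect changes differently.

```python
def diff_keys(indexed: set[str], present: set[str]) -> dict[str, list[str]]:
    """Compare previously indexed keys with the files found now."""
    return {
        "added": sorted(present - indexed),
        "deleted": sorted(indexed - present),
    }
```

After removing `Event.pdf` as in the example above, `Event.pdf` would show up under `"deleted"`.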
|
||
## Configure storage | ||
|
||
```yaml | ||
storage_config: | ||
uri: s3://my_bucket | ||
backend_kwargs: | ||
endpoint_url: https://s3.example.org | ||
aws_access_key_id: ${AWS_ACCESS_KEY_ID} | ||
aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY} | ||
``` | ||
### dataset config.yml | ||
Follows the specification in [`ftmq.model.Dataset`](https://github.com/investigativedata/ftmq/blob/main/ftmq/model/dataset.py): | ||
|
||
```yaml | ||
name: my_dataset # also known as "foreign_id" | ||
title: An awesome leak | ||
description: > | ||
Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name | ||
labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim | ||
similique asperiores quod et quae maiores. Et accusantium accusantium error | ||
et alias aut omnis eos. Omnis porro sit eum et. | ||
updated_at: 2024-09-25 | ||
index_url: https://static.example.org/my_dataset/index.json | ||
# add more metadata | ||
leakrfc: # see above | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
::: mkdocs-typer | ||
:module: leakrfc.cli | ||
:prog_name: leakrfc | ||
:command: cli |