📝 Add the docs
simonwoerpel committed Jan 16, 2025
1 parent 2824806 commit 8095996
Showing 20 changed files with 738 additions and 94 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -6,7 +6,7 @@

`leakrfc` provides a _data standard_ and _archive storage_ for leaked data, private and public document collections. The concepts and implementations are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer).

`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides an high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).
`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).

## Installation

57 changes: 57 additions & 0 deletions docs/api.md
@@ -0,0 +1,57 @@
`leakrfc` provides a simple API powered by [FastAPI](https://fastapi.tiangolo.com/) for clients to retrieve file metadata and blobs. It acts as a proxy between the client and the archive, so that the client doesn't need to know where the actual blobs live. The API can handle authorization via [JSON Web Tokens](https://jwt.io).

## Start a local API server

This is for a quick testing setup:

```bash
export LEAKRFC_URI=./data
uvicorn leakrfc.api:app
```

!!! warning

    Never run the API with `DEBUG=1` in production, and make sure to have a proper setup with a load balancer (e.g. nginx) doing TLS termination in front of it. Also make sure to set a strong `LEAKRFC_API_SECRET_KEY` environment variable for the token authorization.

## Request a file

For public files:

```bash
# metadata only via headers
curl -I "http://localhost:5000/test_dataset/utf.txt"

HTTP/1.1 200 OK
date: Thu, 16 Jan 2025 08:44:59 GMT
server: uvicorn
content-length: 4
content-type: application/json
x-leakrfc-version: 0.0.3
x-leakrfc-dataset: test_dataset
x-leakrfc-key: utf.txt
x-leakrfc-sha1: 5a6acf229ba576d9a40b09292595658bbb74ef56
x-leakrfc-name: utf.txt
x-leakrfc-size: 19
x-mimetype: text/plain
content-type: text/plain
```

```bash
# bytes stream of file
curl -s "http://localhost:5000/<dataset>/<path>" > /tmp/file.pdf
```

Authorization expects a signed bearer token with the dataset and key lookup in the subject (token payload: `{"sub": "<dataset>/<key>"}`). Therefore, clients need to be able to create such tokens (knowing the secret key configured via `LEAKRFC_API_SECRET_KEY`) and handle dataset permissions.

Tokens should have a short expiration (via the `exp` claim in the payload).
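
A minimal sketch for creating such a token with [PyJWT](https://pyjwt.readthedocs.io/) could look like this (assuming the API verifies HS256-signed tokens with the shared secret; check your deployment for the actual algorithm and claims):

```python
import os
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

# Assumption: tokens are HS256-signed with the shared LEAKRFC_API_SECRET_KEY
secret = os.environ["LEAKRFC_API_SECRET_KEY"]

token = jwt.encode(
    {
        # dataset and file key the client is allowed to access
        "sub": "test_dataset/utf.txt",
        # keep the expiration short
        "exp": datetime.now(timezone.utc) + timedelta(minutes=5),
    },
    secret,
    algorithm="HS256",
)
print(token)
```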

```bash
# token in Authorization header
curl -H 'Authorization: Bearer <token>' ...

# metadata only via headers
curl -I "http://localhost:5000/file"

# bytes stream of file
curl -s "http://localhost:5000/file" > /tmp/file.lrfc
```
17 changes: 17 additions & 0 deletions docs/cache.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
For incremental processing of tasks, `leakrfc` uses a global cache to track task results. If a computed cache key for a specific task (e.g. sync a file, extract an archive) is already found in the cache, running the task again will be skipped. This is implemented at a granular level and applies to all kinds of operations, such as [crawl](./crawl.md), [make](./make.md) and the adapters (currently [aleph](./sync/aleph.md)).

`leakrfc` uses [anystore](https://docs.investigraph.dev/lib/anystore/cache/) for the cache implementation, so any supported backend is possible. Recommended backends are Redis or SQL, but a distributed cloud backend (such as a shared S3 bucket) can make sense, too.

By default, an in-memory cache is used, which doesn't persist.

## Configure

Via environment variables:

```bash
LEAKRFC_CACHE__URI=redis://localhost

# additional config
LEAKRFC_CACHE__DEFAULT_TTL=3600 # seconds
LEAKRFC_CACHE__BACKEND_CONFIG__REDIS_PREFIX=my-prefix
```
41 changes: 41 additions & 0 deletions docs/configuration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
A `leakrfc` archive can be configured via _environment variables_ or a yaml configuration file. Individual datasets within the archive can have their own configuration, which enables creating an archive with different _storage configurations_ per dataset.

## Using environment vars

Simply point to a local base folder containing the archive:

LEAKRFC_URI=./data/

Or point to a (local or remote) yaml configuration (see below):

LEAKRFC_URI=https://data.example.org/archive.yml

More granular configuration is possible via additional env vars. `leakrfc` uses [pydantic-settings](https://docs.pydantic.dev/latest/concepts/pydantic_settings/) to parse the configuration. Nested configuration keys can be accessed via the `__` delimiter.

LEAKRFC_ARCHIVE__URI=s3://leakrfc
LEAKRFC_ARCHIVE__PUBLIC_URL=https://cdn.example.org/{dataset}/{key}
LEAKRFC_ARCHIVE__STORAGE__READONLY=true

## YAML config

Create a base config and enable it via `LEAKRFC_URI=leakrfc.yml`:

```yaml
name: leakrfc-archive
storage:
  uri: ./archive
  # ...
```

Within the local archive, an individual dataset could actually live in the cloud:

`./archive/remote_dataset/.leakrfc/config.yml`:

```yaml
name: remote_dataset
storage:
  uri: s3://my_bucket/data
  # ...
```

This means the local folder `./archive/remote_dataset/` would only contain this yaml configuration, while the actual contents of the dataset remain in the remote location.
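
An illustrative layout of such a mixed archive (dataset names are just examples):

```
archive/
  local_dataset/
    .leakrfc/
    ...                     # blobs stored locally
  remote_dataset/
    .leakrfc/
      config.yml            # storage uri points to s3://my_bucket/data
```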
71 changes: 71 additions & 0 deletions docs/crawl.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
Crawl a local or remote location of documents (that supports file listing) into a `leakrfc` dataset. This operation stores the file metadata and actual file blobs in the [configured archive](./configuration.md).

This will create a new dataset or update an existing one. Incremental crawls are cached via the global [leakrfc cache](./cache.md).

Crawls can add files to a dataset but never delete files that no longer exist in the source.

## Basic usage

### Crawl a local directory

```bash
leakrfc -d my_dataset crawl /data/dump1/
```

### Crawl an HTTP location

The location needs to support file listing.

In this example, archives (zip, tar.gz, ...) will be extracted during import.

```bash
leakrfc -d ddos_blueleaks crawl --extract https://data.ddosecrets.com/BlueLeaks/
```

### Crawl from a cloud bucket

In this example, only PDF files are crawled:

```bash
leakrfc -d my_dataset crawl --include "*.pdf" s3://my_bucket/files
```

Under the hood, `leakrfc` uses [anystore](https://docs.investigraph.dev/lib/anystore) which uses [fsspec](https://filesystem-spec.readthedocs.io/en/latest/index.html) that allows a wide range of filesystem-like sources. For some, installing additional dependencies might be required.
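
For example, crawling from an S3 bucket typically requires the `s3fs` package (an assumption based on fsspec's optional protocol dependencies):

```bash
pip install s3fs
```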

### Extract

Source files can be extracted during import using [patool](https://pypi.org/project/patool/). This has a few caveats:

- When enabling `--extract`, archives won't be stored but only their extracted members, keeping the original (archived) directory structure.
- This can lead to file conflicts if several archives share the same directory structure (`file.pdf` from `archive2.zip` would replace the one from `archive1.zip`):

```
archive1.zip
subdir1/file.pdf
archive2.zip
subdir1/file.pdf
```

- To avoid this, use `--extract-ensure-subdir` to create a sub-directory named after its source archive to place the extracted members into. The result would look like:

```
archive1.zip/subdir1/file.pdf
archive2.zip/subdir1/file.pdf
```

- If keeping the source archives is desired, use `--extract-keep-source` (see the example below).
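
For example, a crawl that extracts archives into per-archive sub-directories while keeping the original archive files could look like this (a sketch combining the flags described above with the local directory example):

```bash
leakrfc -d my_dataset crawl \
  --extract \
  --extract-ensure-subdir \
  --extract-keep-source \
  /data/dump1/
```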

## Include / Exclude glob patterns

Only crawl a subdirectory:

--include "subdir/*"

Exclude .txt files from a subdirectory and all its children:

--exclude "subdir/**/*.txt"
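
Both patterns can be combined in one crawl, e.g. (a sketch based on the local directory example above, assuming the options can be passed together):

```bash
leakrfc -d my_dataset crawl \
  --include "subdir/*" \
  --exclude "subdir/**/*.txt" \
  /data/dump1/
```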


## Reference

::: leakrfc.crawl
49 changes: 49 additions & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# leakrfc

"_A RFC for leaks_"

[leak-rfc.org](https://leak-rfc.org)

`leakrfc` provides a _data standard_ and _archive storage_ for leaked data, private and public document collections. The concepts and implementations are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer).

`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).

## Installation

Requires Python 3.11 or later.

```bash
pip install leakrfc
```

## Quickstart

[>> get started here](quickstart.md)

## Development

This package uses [poetry](https://python-poetry.org/) for packaging and dependency management, so first [install it](https://python-poetry.org/docs/#installation).

Clone [this repository](https://github.com/investigativedata/leakrfc) to a local destination.

Within the repo directory, run

poetry install --with dev

This installs a few development dependencies, including [pre-commit](https://pre-commit.com/), which needs to be registered:

poetry run pre-commit install

Before creating a commit, this checks for correct code formatting (isort, black) and some other useful things (see `.pre-commit-config.yaml`).

### Testing

`leakrfc` uses [pytest](https://docs.pytest.org/en/stable/) as the testing framework.

make test

## License and Copyright

`leakrfc`, (c) 2024 [investigativedata.io](https://investigativedata.io)

`leakrfc` is licensed under the AGPLv3 or later license.
12 changes: 12 additions & 0 deletions docs/make.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
This generates or updates a dataset archive. The command should be used after files were added to or deleted from the archive.

The process can also be used to turn any existing directory or remote location into a `leakrfc` dataset.

```
leakrfc -d my_dataset make [OPTIONS]
```


## Reference

::: leakrfc.make
105 changes: 105 additions & 0 deletions docs/quickstart.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Quickstart

## Install

Requires Python 3.11 or later.

```bash
pip install leakrfc
```

## Build a dataset

`leakrfc` stores _metadata_ for the files, which then refers to the actual _source files_.

For example, take this public file listing archive: [https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/](https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/)

Crawl these documents into a _dataset_:

```bash
leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"
```

The _metadata_ and _source files_ are now stored in the archive (`./data` by default).

## Inspect files and archive

All _metadata_ and other information lives in the `ddos_patriotfront/.leakrfc` subdirectory. Files are keyed and accessible by their (relative) path.

Retrieve file metadata:

```bash
leakrfc -d ddos_patriotfront head Event.pdf
```
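
The exact schema isn't reproduced here, but judging from the headers exposed by the [api](./api.md), such a metadata record contains at least fields like these (illustrative sketch, not the actual `leakrfc` format):

```yaml
# illustrative sketch only
dataset: ddos_patriotfront
key: Event.pdf
name: Event.pdf
sha1: <sha1 checksum of the blob>
size: <size in bytes>
mimetype: application/pdf
```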

Retrieve actual file blob:

```bash
leakrfc -d ddos_patriotfront get Event.pdf > Event.pdf
```

Show the metadata of all files present in the dataset archive:

```bash
leakrfc -d ddos_patriotfront ls
```

Show only the file paths:

```bash
leakrfc -d ddos_patriotfront ls --keys
```

Show only the checksums (sha1 by default):

```bash
leakrfc -d ddos_patriotfront ls --checksums
```

### Tracking changes

The [`make`](./make.md) command (re-)generates the dataset's metadata.

Delete a file:

```bash
rm ./data/ddos_patriotfront/Event.pdf
```

Now regenerate:

```bash
leakrfc -d ddos_patriotfront make
```

The resulting output will indicate that 1 file was deleted.

## Configure storage

Storage can be configured per dataset, for example to use an S3-compatible bucket:

```yaml
storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
```
### Dataset config.yml

Follows the specification in [`ftmq.model.Dataset`](https://github.com/investigativedata/ftmq/blob/main/ftmq/model/dataset.py):

```yaml
name: my_dataset # also known as "foreign_id"
title: An awesome leak
description: >
  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name
  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
  similique asperiores quod et quae maiores. Et accusantium accusantium error
  et alias aut omnis eos. Omnis porro sit eum et.
updated_at: 2024-09-25
index_url: https://static.example.org/my_dataset/index.json
# add more metadata
leakrfc: # see above
```
4 changes: 4 additions & 0 deletions docs/reference/cli.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
::: mkdocs-typer
    :module: leakrfc.cli
    :prog_name: leakrfc
    :command: cli
