🔥 Drop logic for retrieval by checksum
simonwoerpel committed Jan 15, 2025
1 parent cae1620 commit 2824806
Showing 7 changed files with 23 additions and 203 deletions.
150 changes: 6 additions & 144 deletions README.md
@@ -1,162 +1,24 @@
# leakrfc

_An RFC for leaks_

[leak-rfc.org](https://leak-rfc.org)

`leakrfc` provides a _data standard_ and _archive storage_ for leaked data, private and public document collections. The concepts and implementations are originally inspired by [mmmeta](https://github.com/simonwoerpel/mmmeta) and [Aleph's servicelayer archive](https://github.com/alephdata/servicelayer).

`leakrfc` acts as a multi-tenant storage and retrieval mechanism for documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as [_ICIJ Datashare_](https://datashare.icij.org/), [_Liquid Investigations_](https://github.com/liquidinvestigations/), and [_Aleph_](https://docs.aleph.occrp.org/).

## Installation

Requires Python 3.11 or later.

```bash
pip install leakrfc
```

## build a dataset

`leakrfc` stores _metadata_ for each file, which then refers to the actual _source file_.

List the files in a publicly accessible source (using [`anystore`](https://github.com/investigativedata/anystore/)):

```bash
ANYSTORE_URI="https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/" anystore keys
```

Crawl these documents into this _dataset_:

```bash
leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"
```

The _metadata_ and _source files_ are now stored in the archive (`./data` by default). All _metadata_ and other information lives in the `ddos_patriotfront/.leakrfc` subdirectory. Files are keyed and retrievable by their checksum (default: `sha1`).
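The checksum key can be reproduced with a few lines of Python; a minimal sketch (the `file_key` helper is illustrative only, not part of the `leakrfc` API):

```python
# Compute the default sha1 file key leakrfc uses to store and retrieve files.
# `file_key` is a hypothetical helper for illustration.
import hashlib


def file_key(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


print(file_key(b"hello world"))
# → 2aae6c35c94fcfb415dbe95f408b9ce91ee846ed
```

For large files you would hash in chunks rather than reading the whole blob into memory.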

Retrieve file metadata:

```bash
leakrfc -d ddos_patriotfront head "19338a97797bcc0eeb832cf7169cbbafc54ed255"
```

Retrieve actual file blob:

```bash
leakrfc -d ddos_patriotfront get "19338a97797bcc0eeb832cf7169cbbafc54ed255" > file.pdf
```

## api

### run api

```bash
export LEAKRFC_ARCHIVE__URI=./data
uvicorn leakrfc.api:app
```

### request a file

For public files:

```bash
# metadata only via headers
curl -I "http://localhost:5000/<dataset>/<sha1>"

# bytes stream of file
curl -s "http://localhost:5000/<dataset>/<sha1>" > /tmp/file.lrfc
```

Authorization expects an encrypted bearer token that carries the dataset and file key in the subject (token payload: `{"sub": "<dataset>/<key>"}`). Clients therefore need to be able to create such tokens (i.e. know the secret key) and to handle dataset permissions themselves.

Tokens should have a short expiration (via `exp` property in payload).

```bash
# token in Authorization header
curl -H 'Authorization: Bearer <token>' ...

# metadata only via headers
curl -I "http://localhost:5000/file"

# bytes stream of file
curl -s "http://localhost:5000/file" > /tmp/file.s
```
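For illustration, a short-lived token with the payload shape described above could be minted like this. This is a minimal HS256 sketch using only the standard library, not `leakrfc`'s actual `create_access_token` implementation; the secret handling and TTL are assumptions:

```python
# Mint a short-lived bearer token with payload {"sub": "<dataset>/<key>"}.
# Hypothetical sketch; leakrfc's own token helper may use a different
# algorithm, claims, or secret handling.
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def make_token(dataset: str, key: str, secret: str, ttl: int = 60) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(
        json.dumps({"sub": f"{dataset}/{key}", "exp": int(time.time()) + ttl}).encode()
    )
    signing_input = f"{header}.{payload}".encode()
    signature = _b64url(
        hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    )
    return f"{header}.{payload}.{signature}"
```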

## configure storage

```yaml
storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
```

### pass through legacy aleph

```yaml
storage_config:
  uri: gcs://aleph_archive/
  legacy_aleph: true
  copy_over: true # subsequently merge legacy archive data into `leakrfc`
```

## layout

The _RFC_ is reflected by the following layout structure for a _Dataset_:

```bash
./archive/
  my_dataset/

    # metadata maintained by `leakrfc`
    .leakrfc/
      index.json          # generated dataset metadata served for clients
      config.yml          # dataset configuration
      documents.csv       # document database (all metadata combined)
      keys.csv            # hash -> uri mapping for all files
      state/              # processing state
        logs/
        created_at
        updated_at
      entities/
        entities.ftm.json
      files/                           # FILE METADATA STORAGE:
        a1/b1/a1b1c1.../info.json      # - file metadata as json REQUIRED
        a1/b1/a1b1c1.../txt            # - extracted plain text
        a1/b1/a1b1c1.../converted.pdf  # - converted file, e.g. from .docx to .pdf for better web display
        a1/b1/a1b1c1.../extracted/     # - extracted files from packages/archives
          foo.txt
      export/
        my_dataset.img.zst   # dump as image
        my_dataset.leakrfc   # dump as zipfile

    # actual (read-only) data
    Arbitrary Folder/
      Source1.pdf
      Tables/
        Another_File.xlsx
```
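The sharded paths under `files/` can be derived from a file's content hash; a minimal sketch (hypothetical helper, not `leakrfc`'s internal path logic):

```python
# Derive the metadata path for a content hash as shown in the layout above:
# the first two hex-pairs become directory levels. Hypothetical helper.
def info_path(content_hash: str) -> str:
    return f"{content_hash[:2]}/{content_hash[2:4]}/{content_hash}/info.json"


print(info_path("a1b1c1d2e3"))
# → a1/b1/a1b1c1d2e3/info.json
```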

### dataset config.yml

Follows the specification in [`ftmq.model.Dataset`](https://github.com/investigativedata/ftmq/blob/main/ftmq/model/dataset.py):

```yaml
name: my_dataset # also known as "foreign_id"
title: An awesome leak
description: >
  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Nam
  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
  similique asperiores quod et quae maiores. Et accusantium accusantium error
  et alias aut omnis eos. Omnis porro sit eum et.
updated_at: 2024-09-25
index_url: https://static.example.org/my_dataset/index.json
# add more metadata

leakrfc: # see above
```
## Documentation

[docs.investigraph.dev/lib/leakrfc](https://docs.investigraph.dev/lib/leakrfc)

## Development

25 changes: 4 additions & 21 deletions leakrfc/api/main.py
@@ -3,12 +3,7 @@
from fastapi.responses import StreamingResponse

from leakrfc import __version__
from leakrfc.api.auth import (
    Token,
    create_access_token,
    ensure_auth_context,
    ensure_token_context,
)
from leakrfc.api.auth import Token, create_access_token, ensure_auth_context
from leakrfc.api.util import Context, Errors, ensure_path_context, stream_file
from leakrfc.archive import get_archive
from leakrfc.logging import get_logger
@@ -40,7 +35,7 @@
if settings.debug:
    log.warning("Api is running in debug mode!")


@app.get("/{dataset}/{key}/token")
@app.get("/{dataset}/{key:path}/token")
async def get_token(
    response: Response,
    ctx: Context = Depends(ensure_path_context),
@@ -79,7 +74,7 @@ async def get_file_by_token(
        return stream_file(ctx)


@app.head("/{dataset}/{key}")
@app.head("/{dataset}/{key:path}")
async def head_file(
    response: Response, ctx: Context = Depends(ensure_path_context)
) -> None:
@@ -90,22 +85,10 @@ async def head_file(
    response.headers.update(ctx.headers)


@app.get("/{dataset}/{key}", response_model=None)
@app.get("/{dataset}/{key:path}", response_model=None)
async def get_file(ctx: Context = Depends(ensure_path_context)) -> StreamingResponse:
    """
    Stream contents of a public file
    """
    with Errors():
        return stream_file(ctx)


@app.get("/api/2/archive", response_model=None)
async def legacy_aleph_api(
    ctx: Context = Depends(ensure_token_context),
) -> StreamingResponse:
    """
    Stream contents of a file, mimic Aleph servicelayer api to act as a drop-in
    replacement
    """
    with Errors():
        return stream_file(ctx)
17 changes: 7 additions & 10 deletions leakrfc/api/util.py
@@ -16,17 +16,14 @@
BASE_HEADER = {"x-leakrfc-version": __version__}


def get_base_header(dataset: str, key: str | None = None) -> dict[str, str]:
    return clean_dict(
        {**BASE_HEADER, "x-leakrfc-dataset": dataset, "x-leakrfc-key": key}
    )


def get_file_header(file: File) -> dict[str, str]:
    return clean_dict(
        {
            **get_base_header(file.dataset, file.content_hash),
            "x-leakrfc-file": file.name,
            **BASE_HEADER,
            "x-leakrfc-dataset": file.dataset,
            "x-leakrfc-key": file.key,
            "x-leakrfc-sha1": file.content_hash,
            "x-leakrfc-name": file.name,
            "x-leakrfc-size": str(file.size),
            "x-mimetype": file.mimetype,
            "content-type": file.mimetype,
@@ -62,7 +59,7 @@ def __exit__(self, exc_cls, exc, _):

def get_file_info(dataset: str, key: str) -> File:
    storage = archive.get_dataset(dataset)
    return storage.lookup_file_by_content_hash(key)
    return storage.lookup_file(key)


def ensure_path_context(dataset: str, key: str) -> Context:
@@ -72,7 +69,7 @@ def ensure_path_context(dataset: str, key: str) -> Context:

def stream_file(ctx: Context) -> StreamingResponse:
    storage = archive.get_dataset(ctx.dataset)
    file = storage.lookup_file_by_content_hash(ctx.key)
    file = storage.lookup_file(ctx.key)
    return StreamingResponse(
        storage.stream_file(file),
        headers=ctx.headers,
4 changes: 0 additions & 4 deletions leakrfc/archive/dataset.py
@@ -44,10 +44,6 @@ def lookup_file(self, key: str) -> File:
        path = self._get_file_info_path(key)
        return self._storage.get(path, model=File)

    def lookup_file_by_content_hash(self, ch: str) -> File:
        key = self.documents.get_key_for_content_hash(ch)
        return self.lookup_file(key)

    def stream_file(self, file: File) -> BytesGenerator:
        yield from self._storage.stream(self._make_path(file.key))
17 changes: 0 additions & 17 deletions leakrfc/archive/documents.py
@@ -71,19 +71,6 @@ def iter_entities(self) -> CEGenerator:
        for document in self.iter_documents():
            yield document.to_proxy()

    def build_reversed(self) -> None:
        # build reversed hash -> key index
        if self._build_reversed:
            return
        log.info(
            "Building reversed index ...",
            dataset=self.dataset.name,
            cache=self.cache.uri,
        )
        for doc in self.iter_documents():
            self.cache.put(f"{self.ix_prefix}/{doc.content_hash}", doc.key)
        self._build_reversed = True

    def add(self, doc: Document) -> None:
        """Mark a document for addition /change"""
        self.cache.put(f"{self.prefix}/add/{doc.key}", doc)

@@ -135,10 +122,6 @@ def pop_cache(self, prefix: Literal["add", "del"]) -> Docs:
            data["dataset"] = self.dataset.name
            yield Document(**data)

    def get_key_for_content_hash(self, ch: str) -> str:
        self.build_reversed()
        return self.cache.get(f"{self.ix_prefix}/{ch}")

    def get_total_size(self) -> int:
        df = self.get_db()
        return int(df["size"].sum())
10 changes: 6 additions & 4 deletions tests/test_api.py
@@ -7,16 +7,18 @@
client = TestClient(app)

DATASET = "test_dataset"
KEY = "f26b980762285ab31143792df9b8d1dfa9643cb0"
SHA1 = "2aae6c35c94fcfb415dbe95f408b9ce91ee846ed"
KEY = "testdir/test.txt"
URL = f"{DATASET}/{KEY}"


def _check_headers(res):
    assert "application/pdf" in res.headers["content-type"]  # FIXME
    assert "text/plain" in res.headers["content-type"]  # FIXME
    assert res.headers["x-leakrfc-dataset"] == DATASET
    assert res.headers["x-leakrfc-key"] == KEY
    assert res.headers["x-leakrfc-file"] == "readme.pdf"
    assert res.headers["x-leakrfc-size"] == "73627"
    assert res.headers["x-leakrfc-sha1"] == SHA1
    assert res.headers["x-leakrfc-name"] == "test.txt"
    assert res.headers["x-leakrfc-size"] == "11"
    return True


3 changes: 0 additions & 3 deletions tests/test_archive.py
@@ -21,9 +21,6 @@ def _test_dataset(dataset: DatasetArchive | ReadOnlyDatasetArchive):
    key = "utf.txt"
    content_hash = "5a6acf229ba576d9a40b09292595658bbb74ef56"

    # lookup by content hash
    assert dataset.lookup_file_by_content_hash(content_hash) == dataset.lookup_file(key)

    # lookup by key
    assert dataset.exists(key)
    file = dataset.lookup_file(key)
