v0.0.5 #20

Open · wants to merge 8 commits into base: main
2 changes: 1 addition & 1 deletion .github/workflows/python.yml
@@ -43,7 +43,7 @@ jobs:
path: ~/.cache/pre-commit
key: pre-commit-${{ runner.os }}-${{ env.PY }}-${{ hashFiles('.pre-commit-config.yaml') }}
- name: Install dependencies
run: poetry install --with dev
run: poetry install --with dev --all-extras
- name: Run pre-commit hooks
run: poetry run pre-commit run
- name: Lint with flake8
2 changes: 1 addition & 1 deletion Makefile
@@ -4,7 +4,7 @@ api:
LEAKRFC_ARCHIVE__URI=./tests/fixtures/archive DEBUG=1 uvicorn leakrfc.api:app --reload --port 5000

install:
poetry install --with dev
poetry install --with dev --all-extras

lint:
poetry run flake8 leakrfc --count --select=E9,F63,F7,F82 --show-source --statistics
6 changes: 6 additions & 0 deletions README.md
@@ -1,3 +1,9 @@
[![leakrfc on pypi](https://img.shields.io/pypi/v/leakrfc)](https://pypi.org/project/leakrfc/)
[![Python test and package](https://github.com/investigativedata/leakrfc/actions/workflows/python.yml/badge.svg)](https://github.com/investigativedata/leakrfc/actions/workflows/python.yml)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
[![Coverage Status](https://coveralls.io/repos/github/investigativedata/leakrfc/badge.svg?branch=main)](https://coveralls.io/github/investigativedata/leakrfc?branch=main)
[![AGPLv3+ License](https://img.shields.io/pypi/l/leakrfc)](./LICENSE)

# leakrfc

"_A RFC for leaks_"
6 changes: 6 additions & 0 deletions docs/api.md
@@ -1,5 +1,11 @@
`leakrfc` provides a simple API powered by [FastAPI](https://fastapi.tiangolo.com/) for clients to retrieve file metadata and blobs. It therefore acts as a proxy between client and archive, so that the client doesn't need to know where the actual blobs live. The API can handle authorization via [JSON Web Tokens](https://jwt.io).

## Installation

The API feature needs some extra packages that are not installed by default. Install `leakrfc` with the `api` extra:

pip install leakrfc[api]

## Start local api server

This is for a quick testing setup:
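A minimal sketch of such a setup in Python, mirroring the `Makefile` target in this PR (the archive path, debug flag, and port are taken from that target; `uvicorn` is assumed to come with the `api` extra):

```python
import os

import uvicorn

# Point leakrfc at a local test archive before the app is imported
os.environ["LEAKRFC_ARCHIVE__URI"] = "./tests/fixtures/archive"
os.environ["DEBUG"] = "1"

# Serve the FastAPI app defined in `leakrfc.api` (passed as an import
# string so that reload-style hot reloading works)
uvicorn.run("leakrfc.api:app", port=5000, reload=True)
```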
5 changes: 5 additions & 0 deletions docs/cache.md
@@ -2,6 +2,8 @@ For incremental processing of tasks, `leakrfc` uses a global cache to track task

`leakrfc` uses [anystore](https://docs.investigraph.dev/lib/anystore/cache/) for the cache implementation, so any supported backend is possible. Recommended backends are Redis or SQL, but a distributed cloud backend (such as a shared S3 bucket) can make sense, too.

As long as caching is enabled (globally via `CACHE=1`, the default), all operations check the global cache to see whether a task has already been processed. When the cache is disabled (`CACHE=0`) for a run, it is not consulted but is still populated for subsequent runs.

By default, an in-memory cache is used, which doesn't persist.

## Configure
@@ -14,4 +16,7 @@ LEAKRFC_CACHE__URI=redis://localhost
# additional config
LEAKRFC_CACHE__DEFAULT_TTL=3600 # seconds
LEAKRFC_CACHE__BACKEND_CONFIG__REDIS_PREFIX=my-prefix

# disable cache
CACHE=0
```
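The same switches can also be set from Python before any `leakrfc` code runs, e.g. in a test or one-off script (the values below are just examples):

```python
import os

# Cache backend and TTL, picked up by leakrfc's settings (example values)
os.environ["LEAKRFC_CACHE__URI"] = "redis://localhost"
os.environ["LEAKRFC_CACHE__DEFAULT_TTL"] = "3600"

# ...or skip the cache for this run; it is still populated for later runs
os.environ["CACHE"] = "0"
```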
6 changes: 6 additions & 0 deletions docs/index.md
@@ -1,3 +1,9 @@
[![leakrfc on pypi](https://img.shields.io/pypi/v/leakrfc)](https://pypi.org/project/leakrfc/)
[![Python test and package](https://github.com/investigativedata/leakrfc/actions/workflows/python.yml/badge.svg)](https://github.com/investigativedata/leakrfc/actions/workflows/python.yml)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
[![Coverage Status](https://coveralls.io/repos/github/investigativedata/leakrfc/badge.svg?branch=main)](https://coveralls.io/github/investigativedata/leakrfc?branch=main)
[![AGPLv3+ License](https://img.shields.io/pypi/l/leakrfc)](./LICENSE)

# leakrfc

"_A RFC for leaks_"
2 changes: 1 addition & 1 deletion docs/sync/aleph.md
@@ -2,7 +2,7 @@ Sync a leakrfc dataset into an [Aleph](https://docs.aleph.occrp.org/) instance.

Collections will be created if they don't exist and their metadata will be updated (this can be disabled via `--no-metadata`). The Aleph collection's _foreign id_ can be set via `--foreign-id` and defaults to the leakrfc dataset name.

As long as using `--use-cache` (default) only new documents are synced. The cache handles multiple Aleph instances and keeps track of the individual status for each of them.
As long as the global cache is enabled (environment `CACHE=1`, the default), only new documents are synced. The cache handles multiple Aleph instances and keeps track of the individual status for each of them.
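
Under the hood the CLI calls `sync_to_aleph`; a rough Python-level sketch based on the signature in this PR (constructing the `dataset` archive instance is not shown here, and the host is a placeholder):

```python
from leakrfc.sync.aleph import AlephUploadStatus, sync_to_aleph


def push_dataset(dataset) -> AlephUploadStatus:
    """Sketch only: `dataset` is a leakrfc dataset archive instance."""
    return sync_to_aleph(
        dataset,
        host="https://aleph.example.org",  # can instead be set via env ALEPHCLIENT_HOST
        api_key=None,                      # can instead be set via env ALEPHCLIENT_API_KEY
        foreign_id="my_dataset",           # defaults to the leakrfc dataset name
        metadata=True,                     # False mirrors --no-metadata
    )
```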

Aleph API configuration can also be set via the command line:

5 changes: 2 additions & 3 deletions docs/sync/memorious.md
@@ -1,6 +1,6 @@
Import [memorious](https://github.com/alephdata/memorious) crawler results into a `leakrfc` dataset.

As long as using `--use-cache` (default) only new documents are synced.
As long as the global cache is enabled (environment `CACHE=1`, the default), only new documents are synced.

```bash
leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset
@@ -16,7 +16,7 @@ leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset --name-
leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset --strip-prefix "assets/docs"
```

Or use a template that will replace values from the original memorious "*.json" file for the source file. Given a json file stored by memorious like this:
Or use a template that is filled with values from the original memorious "\*.json" metadata file for each source file. Given a JSON file stored by memorious like this:

```json
{
@@ -49,7 +49,6 @@ To import this file as "2022/05/Berlin/Beratungsvorgang/19-11840.pdf":
leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset --key-template "{{ date[:4] }}/{{ date[5:7] }}/{{ state }}/{{ category }}/{{ reference.replace('/','-') }}.{{ url.split('.')[-1] }}"
```
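
The same import can be scripted in Python; a sketch based on the call in `cli.py` from this PR (the module path for `import_memorious` is assumed from the reference below, and `dataset` is a leakrfc dataset instance constructed elsewhere):

```python
from leakrfc.sync.memorious import import_memorious  # assumed module path


def sync_memorious_store(dataset):
    """Sketch only: mirrors `leakrfc -d my_dataset memorious sync -i ...`."""
    # key_func=None keeps the default key handling, as in `cli.py` when no
    # --name-only / --key-template option is given
    return import_memorious(dataset, "/memorious/data/store/my_dataset", None)
```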


## Reference

::: leakrfc.sync.memorious
16 changes: 2 additions & 14 deletions leakrfc/cli.py
@@ -159,7 +159,6 @@ def cli_diff(
@cli.command("make")
def cli_make(
out_uri: Annotated[str, typer.Option("-o")] = "-",
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
check_integrity: Annotated[
Optional[bool], typer.Option(help="Check checksums")
] = True,
@@ -181,9 +180,7 @@ def cli_make(
dataset.make_index()
obj = dataset._storage.get(dataset._get_index_path(), model=DatasetModel)
else:
obj = make_dataset(
dataset, use_cache, check_integrity, cleanup, metadata_only
)
obj = make_dataset(dataset, check_integrity, cleanup, metadata_only)
write_obj(obj, out_uri)


@@ -242,7 +239,6 @@ def cli_crawl(
out_uri: Annotated[
str, typer.Option("-o", help="Write results to this destination")
] = "-",
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
skip_existing: Annotated[
Optional[bool],
typer.Option(
@@ -276,7 +272,6 @@
crawl(
uri,
dataset,
use_cache=use_cache,
skip_existing=skip_existing,
extract=extract,
extract_keep_source=extract_keep_source,
@@ -300,7 +295,6 @@ def cli_export(out: str):
@memorious.command("sync")
def cli_sync_memorious(
uri: Annotated[str, typer.Option("-i")],
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
name_only: Annotated[
Optional[bool], typer.Option(help="Use only file name as key")
] = False,
@@ -323,13 +317,12 @@
key_func = get_file_name_templ_func(key_template)
else:
key_func = None
res = import_memorious(dataset, uri, key_func, use_cache=use_cache)
res = import_memorious(dataset, uri, key_func)
write_obj(res, "-")


@aleph.command("sync")
def cli_aleph_sync(
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
host: Annotated[Optional[str], typer.Option(help="Aleph host")] = None,
api_key: Annotated[Optional[str], typer.Option(help="Aleph api key")] = None,
folder: Annotated[Optional[str], typer.Option(help="Base folder path")] = None,
@@ -350,7 +343,6 @@
api_key=api_key,
prefix=folder,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)
write_obj(res, "-")
@@ -359,7 +351,6 @@
@aleph.command("load-dataset")
def cli_aleph_load_dataset(
uri: Annotated[str, typer.Argument(help="Dataset index.json uri")],
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
host: Annotated[Optional[str], typer.Option(help="Aleph host")] = None,
api_key: Annotated[Optional[str], typer.Option(help="Aleph api key")] = None,
foreign_id: Annotated[
@@ -378,7 +369,6 @@
host=host,
api_key=api_key,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)
write_obj(res, "-")
@@ -395,7 +385,6 @@
# Optional[list[str]],
# typer.Argument(help="Dataset foreign_ids to exclude, can be a glob"),
# ] = None,
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
host: Annotated[Optional[str], typer.Option(help="Aleph host")] = None,
api_key: Annotated[Optional[str], typer.Option(help="Aleph api key")] = None,
metadata: Annotated[
Expand All @@ -412,7 +401,6 @@ def cli_aleph_load_catalog(
api_key=api_key,
# include_dataset=include_dataset,
# exclude_dataset=exclude_dataset,
use_cache=use_cache,
metadata=metadata,
):
write_obj(res, "-")
3 changes: 0 additions & 3 deletions leakrfc/crawl.py
@@ -138,7 +138,6 @@ def crawl(
extract: bool | None = False,
extract_keep_source: bool | None = False,
extract_ensure_subdir: bool | None = False,
use_cache: bool | None = True,
write_documents_db: bool | None = True,
exclude: str | None = None,
include: str | None = None,
@@ -157,7 +156,6 @@
extract_ensure_subdir: Make sub-directories for extracted files with the
archive name to avoid overwriting existing files during extraction
of multiple archives with the same directory structure
use_cache: Use global processing cache to skip tasks
write_documents_db: Create csv-based document tables at the end of crawl run
exclude: Exclude glob for file paths not to crawl
include: Include glob for file paths to crawl
@@ -180,7 +178,6 @@
extract=extract,
extract_keep_source=extract_keep_source,
extract_ensure_subdir=extract_ensure_subdir,
use_cache=use_cache,
write_documents_db=write_documents_db,
exclude=exclude,
include=include,
6 changes: 1 addition & 5 deletions leakrfc/make.py
@@ -113,7 +113,6 @@ def done(self) -> None:

def make_dataset(
dataset: DatasetArchive,
use_cache: bool | None = True,
check_integrity: bool | None = True,
cleanup: bool | None = True,
metadata_only: bool | None = False,
@@ -129,15 +128,12 @@

Args:
dataset: leakrfc Dataset instance
use_cache: Use global processing cache to skip tasks
check_integrity: Check checksum for each file (logs mismatches)
cleanup: When checking integrity, fix mismatched metadata and delete
unreferenced metadata files
metadata_only: Only iterate through existing metadata files, don't look
for new source files

"""
worker = MakeWorker(
check_integrity, cleanup, metadata_only, dataset, use_cache=use_cache
)
worker = MakeWorker(check_integrity, cleanup, metadata_only, dataset)
return worker.run()
11 changes: 9 additions & 2 deletions leakrfc/settings.py
@@ -1,6 +1,6 @@
from typing import Optional

from anystore.io import smart_read
from anystore.io import DoesNotExist, smart_read
from anystore.model import StoreModel
from pydantic import BaseModel, HttpUrl
from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -40,6 +40,13 @@ class ApiContactSettings(BaseModel):
email: str | None


def get_api_doc() -> str:
try:
return smart_read("./README.md", "r")
except DoesNotExist:
return ""


class ApiSettings(BaseSettings):
model_config = SettingsConfigDict(
env_prefix="leakrfc_api_",
Expand All @@ -52,7 +59,7 @@ class ApiSettings(BaseSettings):
access_token_algorithm: str = "HS256"

title: str = "LeakRFC Api"
description: str = smart_read("./README.md", mode="r")
description: str = get_api_doc()
contact: ApiContactSettings | None = None

allowed_origin: Optional[HttpUrl] = "http://localhost:3000"
19 changes: 13 additions & 6 deletions leakrfc/sync/aleph.py
@@ -9,7 +9,9 @@

from anystore import anycache
from anystore.io import logged_items
from anystore.types import SDict
from anystore.worker import WorkerStatus
from banal import ensure_dict

from leakrfc.archive.cache import get_cache
from leakrfc.archive.dataset import DatasetArchive
@@ -39,6 +41,16 @@ def make_current_version_cache_key(self: "AlephUploadWorker") -> str:
return aleph.make_aleph_cache_key(self, version)


def get_source_url(data: SDict) -> str | None:
url = data.get("source_url")
if url:
return url
url = ensure_dict(data.get("extra")).get("source_url")
if url:
return url
return data.get("url")


class AlephUploadStatus(WorkerStatus):
uploaded: int = 0
folders_created: int = 0
@@ -107,7 +119,7 @@ def handle_task(self, task: File) -> dict[str, Any]:
foreign_id=self.foreign_id,
)
metadata = {**task.extra, "file_name": task.name, "foreign_id": task.key}
metadata["source_url"] = metadata.get("url")
metadata["source_url"] = get_source_url(metadata)
parent = self.get_parent(task.key, self.prefix)
if parent:
metadata["parent"] = parent
@@ -135,21 +147,17 @@ def sync_to_aleph(
api_key: str | None,
prefix: str | None = None,
foreign_id: str | None = None,
use_cache: bool | None = True,
metadata: bool | None = True,
) -> AlephUploadStatus:
"""
Incrementally sync a leakrfc dataset into an Aleph instance.

As long as using `use_cache`, only new documents will be imported.

Args:
dataset: leakrfc Dataset instance
host: Aleph host (can be set via env `ALEPHCLIENT_HOST`)
api_key: Aleph api key (can be set via env `ALEPHCLIENT_API_KEY`)
prefix: Add a folder prefix to import documents into
foreign_id: Aleph collection foreign_id (if different from leakrfc dataset name)
use_cache: Use global processing cache to skip tasks
metadata: Update Aleph collection metadata
"""
worker = AlephUploadWorker(
@@ -158,7 +166,6 @@
api_key=api_key,
prefix=prefix,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)
worker.log_info(f"Starting sync to Aleph `{worker.host}` ...")
4 changes: 0 additions & 4 deletions leakrfc/sync/aleph_entities.py
@@ -92,7 +92,6 @@ def load_dataset(
host: str | None,
api_key: str | None,
foreign_id: str | None = None,
use_cache: bool | None = True,
metadata: bool | None = True,
) -> AlephLoadDatasetStatus:
dataset = Dataset._from_uri(uri)
@@ -103,7 +102,6 @@
host=host,
api_key=api_key,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)
res = worker.run()
@@ -115,7 +113,6 @@
host: str | None,
api_key: str | None,
foreign_id: str | None = None,
use_cache: bool | None = True,
metadata: bool | None = True,
exclude_dataset: str | None = None,
include_dataset: str | None = None,
@@ -132,6 +129,5 @@
host=host,
api_key=api_key,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)