v0.0.5 #20

Open · wants to merge 8 commits into base: main
2 changes: 1 addition & 1 deletion .github/workflows/python.yml
@@ -43,7 +43,7 @@ jobs:
path: ~/.cache/pre-commit
key: pre-commit-${{ runner.os }}-${{ env.PY }}-${{ hashFiles('.pre-commit-config.yaml') }}
- name: Install dependencies
run: poetry install --with dev
run: poetry install --with dev --all-extras
- name: Run pre-commit hooks
run: poetry run pre-commit run
- name: Lint with flake8
2 changes: 1 addition & 1 deletion Makefile
@@ -4,7 +4,7 @@ api:
LEAKRFC_ARCHIVE__URI=./tests/fixtures/archive DEBUG=1 uvicorn leakrfc.api:app --reload --port 5000

install:
poetry install --with dev
poetry install --with dev --all-extras

lint:
poetry run flake8 leakrfc --count --select=E9,F63,F7,F82 --show-source --statistics
6 changes: 6 additions & 0 deletions README.md
@@ -1,3 +1,9 @@
[![leakrfc on pypi](https://img.shields.io/pypi/v/leakrfc)](https://pypi.org/project/leakrfc/)
[![Python test and package](https://github.com/investigativedata/leakrfc/actions/workflows/python.yml/badge.svg)](https://github.com/investigativedata/leakrfc/actions/workflows/python.yml)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
[![Coverage Status](https://coveralls.io/repos/github/investigativedata/leakrfc/badge.svg?branch=main)](https://coveralls.io/github/investigativedata/leakrfc?branch=main)
[![AGPLv3+ License](https://img.shields.io/pypi/l/leakrfc)](./LICENSE)

# leakrfc

"_A RFC for leaks_"
6 changes: 6 additions & 0 deletions docs/api.md
@@ -1,5 +1,11 @@
`leakrfc` provides a simple API powered by [FastAPI](https://fastapi.tiangolo.com/) for clients to retrieve file metadata and blobs. It therefore acts as a proxy between client and archive, so that the client doesn't need to know where the actual blobs live. The API can handle authorization via [JSON Web Tokens](https://jwt.io).

## Installation

The API feature needs some extra packages that are not installed by default. Install `leakrfc` with the `api` extra:

pip install leakrfc[api]

## Start local api server

This is for a quick testing setup:
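A minimal sketch of such a setup in Python, mirroring the `Makefile` target in this PR (the archive path, debug flag, and port are taken from that target; `uvicorn` is assumed to come with the `api` extra):

```python
import os

import uvicorn

# Point leakrfc at a local test archive before the app is imported
os.environ["LEAKRFC_ARCHIVE__URI"] = "./tests/fixtures/archive"
os.environ["DEBUG"] = "1"

# Serve the FastAPI app defined in `leakrfc.api` (passed as an import
# string so that reload-style hot reloading works)
uvicorn.run("leakrfc.api:app", port=5000, reload=True)
```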
5 changes: 5 additions & 0 deletions docs/cache.md
@@ -2,6 +2,8 @@ For incremental processing of tasks, `leakrfc` uses a global cache to track task

`leakrfc` uses [anystore](https://docs.investigraph.dev/lib/anystore/cache/) for the cache implementation, so any supported backend is possible. Recommended backends are Redis or SQL, but a distributed cloud backend (such as a shared S3 bucket) can make sense, too.

As long as caching is enabled (globally via `CACHE=1`, the default), all operations check the global cache to see whether a task has already been processed. When the cache is disabled (`CACHE=0`) for a run, it is not consulted but is still populated for subsequent runs.

By default, an in-memory cache is used, which doesn't persist.

## Configure
@@ -14,4 +16,7 @@ LEAKRFC_CACHE__URI=redis://localhost
# additional config
LEAKRFC_CACHE__DEFAULT_TTL=3600 # seconds
LEAKRFC_CACHE__BACKEND_CONFIG__REDIS_PREFIX=my-prefix

# disable cache
CACHE=0
```
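The same switches can also be set from Python before any `leakrfc` code runs, e.g. in a test or one-off script (the values below are just examples):

```python
import os

# Cache backend and TTL, picked up by leakrfc's settings (example values)
os.environ["LEAKRFC_CACHE__URI"] = "redis://localhost"
os.environ["LEAKRFC_CACHE__DEFAULT_TTL"] = "3600"

# ...or skip the cache for this run; it is still populated for later runs
os.environ["CACHE"] = "0"
```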
6 changes: 6 additions & 0 deletions docs/index.md
@@ -1,3 +1,9 @@
[![leakrfc on pypi](https://img.shields.io/pypi/v/leakrfc)](https://pypi.org/project/leakrfc/)
[![Python test and package](https://github.com/investigativedata/leakrfc/actions/workflows/python.yml/badge.svg)](https://github.com/investigativedata/leakrfc/actions/workflows/python.yml)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
[![Coverage Status](https://coveralls.io/repos/github/investigativedata/leakrfc/badge.svg?branch=main)](https://coveralls.io/github/investigativedata/leakrfc?branch=main)
[![AGPLv3+ License](https://img.shields.io/pypi/l/leakrfc)](./LICENSE)

# leakrfc

"_A RFC for leaks_"
2 changes: 1 addition & 1 deletion docs/sync/aleph.md
@@ -2,7 +2,7 @@ Sync a leakrfc dataset into an [Aleph](https://docs.aleph.occrp.org/) instance.

Collections will be created if they don't exist and their metadata will be updated (this can be disabled via `--no-metadata`). The Aleph collection's _foreign id_ can be set via `--foreign-id` and defaults to the leakrfc dataset name.

As long as using `--use-cache` (default) only new documents are synced. The cache handles multiple Aleph instances and keeps track of the individual status for each of them.
As long as the global cache is enabled (environment `CACHE=1`, the default), only new documents are synced. The cache handles multiple Aleph instances and keeps track of the individual status for each of them.
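
Under the hood the CLI calls `sync_to_aleph`; a rough Python-level sketch based on the signature in this PR (constructing the `dataset` archive instance is not shown here, and the host is a placeholder):

```python
from leakrfc.sync.aleph import AlephUploadStatus, sync_to_aleph


def push_dataset(dataset) -> AlephUploadStatus:
    """Sketch only: `dataset` is a leakrfc dataset archive instance."""
    return sync_to_aleph(
        dataset,
        host="https://aleph.example.org",  # can instead be set via env ALEPHCLIENT_HOST
        api_key=None,                      # can instead be set via env ALEPHCLIENT_API_KEY
        foreign_id="my_dataset",           # defaults to the leakrfc dataset name
        metadata=True,                     # False mirrors --no-metadata
    )
```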

Aleph API configuration can also be set via the command line:

5 changes: 2 additions & 3 deletions docs/sync/memorious.md
@@ -1,6 +1,6 @@
Import [memorious](https://github.com/alephdata/memorious) crawler results into a `leakrfc` dataset.

As long as using `--use-cache` (default) only new documents are synced.
As long as the global cache is enabled (environment `CACHE=1`, the default), only new documents are synced.

```bash
leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset
@@ -16,7 +16,7 @@ leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset --name-
leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset --strip-prefix "assets/docs"
```

Or use a template that will replace values from the original memorious "*.json" file for the source file. Given a json file stored by memorious like this:
Or use a template that is filled with values from the original memorious "\*.json" metadata file for each source file. Given a JSON file stored by memorious like this:

```json
{
@@ -49,7 +49,6 @@ To import this file as "2022/05/Berlin/Beratungsvorgang/19-11840.pdf":
leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset --key-template "{{ date[:4] }}/{{ date[5:7] }}/{{ state }}/{{ category }}/{{ reference.replace('/','-') }}.{{ url.split('.')[-1] }}"
```
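
The same import can be scripted in Python; a sketch based on the call in `cli.py` from this PR (the module path for `import_memorious` is assumed from the reference below, and `dataset` is a leakrfc dataset instance constructed elsewhere):

```python
from leakrfc.sync.memorious import import_memorious  # assumed module path


def sync_memorious_store(dataset):
    """Sketch only: mirrors `leakrfc -d my_dataset memorious sync -i ...`."""
    # key_func=None keeps the default key handling, as in `cli.py` when no
    # --name-only / --key-template option is given
    return import_memorious(dataset, "/memorious/data/store/my_dataset", None)
```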


## Reference

::: leakrfc.sync.memorious
16 changes: 2 additions & 14 deletions leakrfc/cli.py
@@ -159,7 +159,6 @@ def cli_diff(
@cli.command("make")
def cli_make(
out_uri: Annotated[str, typer.Option("-o")] = "-",
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
check_integrity: Annotated[
Optional[bool], typer.Option(help="Check checksums")
] = True,
@@ -181,9 +180,7 @@ def cli_make(
dataset.make_index()
obj = dataset._storage.get(dataset._get_index_path(), model=DatasetModel)
else:
obj = make_dataset(
dataset, use_cache, check_integrity, cleanup, metadata_only
)
obj = make_dataset(dataset, check_integrity, cleanup, metadata_only)
write_obj(obj, out_uri)


@@ -242,7 +239,6 @@ def cli_crawl(
out_uri: Annotated[
str, typer.Option("-o", help="Write results to this destination")
] = "-",
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
skip_existing: Annotated[
Optional[bool],
typer.Option(
@@ -276,7 +272,6 @@
crawl(
uri,
dataset,
use_cache=use_cache,
skip_existing=skip_existing,
extract=extract,
extract_keep_source=extract_keep_source,
@@ -300,7 +295,6 @@ def cli_export(out: str):
@memorious.command("sync")
def cli_sync_memorious(
uri: Annotated[str, typer.Option("-i")],
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
name_only: Annotated[
Optional[bool], typer.Option(help="Use only file name as key")
] = False,
@@ -323,13 +317,12 @@
key_func = get_file_name_templ_func(key_template)
else:
key_func = None
res = import_memorious(dataset, uri, key_func, use_cache=use_cache)
res = import_memorious(dataset, uri, key_func)
write_obj(res, "-")


@aleph.command("sync")
def cli_aleph_sync(
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
host: Annotated[Optional[str], typer.Option(help="Aleph host")] = None,
api_key: Annotated[Optional[str], typer.Option(help="Aleph api key")] = None,
folder: Annotated[Optional[str], typer.Option(help="Base folder path")] = None,
@@ -350,7 +343,6 @@
api_key=api_key,
prefix=folder,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)
write_obj(res, "-")
@@ -359,7 +351,6 @@
@aleph.command("load-dataset")
def cli_aleph_load_dataset(
uri: Annotated[str, typer.Argument(help="Dataset index.json uri")],
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
host: Annotated[Optional[str], typer.Option(help="Aleph host")] = None,
api_key: Annotated[Optional[str], typer.Option(help="Aleph api key")] = None,
foreign_id: Annotated[
@@ -378,7 +369,6 @@
host=host,
api_key=api_key,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)
write_obj(res, "-")
@@ -395,7 +385,6 @@
# Optional[list[str]],
# typer.Argument(help="Dataset foreign_ids to exclude, can be a glob"),
# ] = None,
use_cache: Annotated[Optional[bool], typer.Option(help="Use runtime cache")] = True,
host: Annotated[Optional[str], typer.Option(help="Aleph host")] = None,
api_key: Annotated[Optional[str], typer.Option(help="Aleph api key")] = None,
metadata: Annotated[
Expand All @@ -412,7 +401,6 @@ def cli_aleph_load_catalog(
api_key=api_key,
# include_dataset=include_dataset,
# exclude_dataset=exclude_dataset,
use_cache=use_cache,
metadata=metadata,
):
write_obj(res, "-")
3 changes: 0 additions & 3 deletions leakrfc/crawl.py
@@ -138,7 +138,6 @@ def crawl(
extract: bool | None = False,
extract_keep_source: bool | None = False,
extract_ensure_subdir: bool | None = False,
use_cache: bool | None = True,
write_documents_db: bool | None = True,
exclude: str | None = None,
include: str | None = None,
@@ -157,7 +156,6 @@
extract_ensure_subdir: Make sub-directories for extracted files with the
archive name to avoid overwriting existing files during extraction
of multiple archives with the same directory structure
use_cache: Use global processing cache to skip tasks
write_documents_db: Create csv-based document tables at the end of crawl run
exclude: Exclude glob for file paths not to crawl
include: Include glob for file paths to crawl
@@ -180,7 +178,6 @@
extract=extract,
extract_keep_source=extract_keep_source,
extract_ensure_subdir=extract_ensure_subdir,
use_cache=use_cache,
write_documents_db=write_documents_db,
exclude=exclude,
include=include,
6 changes: 1 addition & 5 deletions leakrfc/make.py
@@ -113,7 +113,6 @@ def done(self) -> None:

def make_dataset(
dataset: DatasetArchive,
use_cache: bool | None = True,
check_integrity: bool | None = True,
cleanup: bool | None = True,
metadata_only: bool | None = False,
@@ -129,15 +128,12 @@

Args:
dataset: leakrfc Dataset instance
use_cache: Use global processing cache to skip tasks
check_integrity: Check checksum for each file (logs mismatches)
cleanup: When checking integrity, fix mismatched metadata and delete
unreferenced metadata files
metadata_only: Only iterate through existing metadata files, don't look
for new source files

"""
worker = MakeWorker(
check_integrity, cleanup, metadata_only, dataset, use_cache=use_cache
)
worker = MakeWorker(check_integrity, cleanup, metadata_only, dataset)
return worker.run()
11 changes: 9 additions & 2 deletions leakrfc/settings.py
@@ -1,6 +1,6 @@
from typing import Optional

from anystore.io import smart_read
from anystore.io import DoesNotExist, smart_read
from anystore.model import StoreModel
from pydantic import BaseModel, HttpUrl
from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -40,6 +40,13 @@ class ApiContactSettings(BaseModel):
email: str | None


def get_api_doc() -> str:
try:
return smart_read("./README.md", "r")
except DoesNotExist:
return ""


class ApiSettings(BaseSettings):
model_config = SettingsConfigDict(
env_prefix="leakrfc_api_",
Expand All @@ -52,7 +59,7 @@ class ApiSettings(BaseSettings):
access_token_algorithm: str = "HS256"

title: str = "LeakRFC Api"
description: str = smart_read("./README.md", mode="r")
description: str = get_api_doc()
contact: ApiContactSettings | None = None

allowed_origin: Optional[HttpUrl] = "http://localhost:3000"
19 changes: 13 additions & 6 deletions leakrfc/sync/aleph.py
@@ -9,7 +9,9 @@

from anystore import anycache
from anystore.io import logged_items
from anystore.types import SDict
from anystore.worker import WorkerStatus
from banal import ensure_dict

from leakrfc.archive.cache import get_cache
from leakrfc.archive.dataset import DatasetArchive
@@ -39,6 +41,16 @@ def make_current_version_cache_key(self: "AlephUploadWorker") -> str:
return aleph.make_aleph_cache_key(self, version)


def get_source_url(data: SDict) -> str | None:
url = data.get("source_url")
if url:
return url
url = ensure_dict(data.get("extra")).get("source_url")
if url:
return url
return data.get("url")


class AlephUploadStatus(WorkerStatus):
uploaded: int = 0
folders_created: int = 0
@@ -107,7 +119,7 @@ def handle_task(self, task: File) -> dict[str, Any]:
foreign_id=self.foreign_id,
)
metadata = {**task.extra, "file_name": task.name, "foreign_id": task.key}
metadata["source_url"] = metadata.get("url")
metadata["source_url"] = get_source_url(metadata)
parent = self.get_parent(task.key, self.prefix)
if parent:
metadata["parent"] = parent
@@ -135,21 +147,17 @@ def sync_to_aleph(
api_key: str | None,
prefix: str | None = None,
foreign_id: str | None = None,
use_cache: bool | None = True,
metadata: bool | None = True,
) -> AlephUploadStatus:
"""
Incrementally sync a leakrfc dataset into an Aleph instance.

As long as using `use_cache`, only new documents will be imported.

Args:
dataset: leakrfc Dataset instance
host: Aleph host (can be set via env `ALEPHCLIENT_HOST`)
api_key: Aleph api key (can be set via env `ALEPHCLIENT_API_KEY`)
prefix: Add a folder prefix to import documents into
foreign_id: Aleph collection foreign_id (if different from leakrfc dataset name)
use_cache: Use global processing cache to skip tasks
metadata: Update Aleph collection metadata
"""
worker = AlephUploadWorker(
@@ -158,7 +166,6 @@
api_key=api_key,
prefix=prefix,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)
worker.log_info(f"Starting sync to Aleph `{worker.host}` ...")
4 changes: 0 additions & 4 deletions leakrfc/sync/aleph_entities.py
@@ -92,7 +92,6 @@ def load_dataset(
host: str | None,
api_key: str | None,
foreign_id: str | None = None,
use_cache: bool | None = True,
metadata: bool | None = True,
) -> AlephLoadDatasetStatus:
dataset = Dataset._from_uri(uri)
@@ -103,7 +102,6 @@
host=host,
api_key=api_key,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)
res = worker.run()
@@ -115,7 +113,6 @@
host: str | None,
api_key: str | None,
foreign_id: str | None = None,
use_cache: bool | None = True,
metadata: bool | None = True,
exclude_dataset: str | None = None,
include_dataset: str | None = None,
@@ -132,6 +129,5 @@
host=host,
api_key=api_key,
foreign_id=foreign_id,
use_cache=use_cache,
metadata=metadata,
)