Allow configuring input and output storage options in SEG-Y ingestion (#479)

* add input storage options and rename old one to output

* handle new ingestion storage opts

* update docs about storage options

* add segy settings to grid reader
tasansal authored Dec 13, 2024
1 parent 2e422e8 commit a4b26c8
Showing 3 changed files with 43 additions and 25 deletions.
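
The change replaces the single `storage_options` argument with separate read-side and write-side options. A minimal sketch of the updated Python call (argument names follow identifiers visible in the diff below; paths, header locations, and credentials are placeholders):

```python
from mdio.converters.segy import segy_to_mdio

# Placeholder paths and credentials for illustration only.
segy_to_mdio(
    segy_path="s3://source-bucket/survey.segy",  # read using storage_options_input
    mdio_path_or_buffer="gs://target-bucket/survey.mdio",  # written using storage_options_output
    index_bytes=(189, 193),
    storage_options_input={"key": "my_key", "secret": "my_secret"},
    storage_options_output={"token": "browser"},
)
```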
22 changes: 11 additions & 11 deletions docs/usage.md
@@ -41,8 +41,8 @@ The protocols that help choose a backend (i.e. `s3://`, `gs://`, or `az://`) can be
 prepended to the **_MDIO_** path.
 
 The connection string can be passed to the command-line-interface (CLI) using the
-`-storage, --storage-options` flag as a JSON string or the Python API with the `storage_options`
-keyword argument as a Python dictionary.
+`-storage-{input,output}, --storage-options-{input,output}` flag as a JSON string or the Python API with
+the `storage_options_{input,output}` keyword argument as a Python dictionary.
 
 ````{warning}
 On Windows clients, JSON strings are passed to the CLI with a special escape character.
@@ -66,7 +66,7 @@ If this is done incorrectly, you will get an invalid JSON string error from the CLI
 
 Credentials can be automatically fetched from pre-authenticated AWS CLI.
 See [here](https://s3fs.readthedocs.io/en/latest/index.html#credentials) for the order `s3fs`
-checks them. If it is not pre-authenticated, you need to pass `--storage-options`.
+checks them. If it is not pre-authenticated, you need to pass `--storage-options-{input,output}`.
 
 **Prefix:**
 `s3://`
@@ -82,7 +82,7 @@ mdio segy import \
   path/to/my.segy \
   s3://bucket/prefix/my.mdio \
   --header-locations 189,193 \
-  --storage-options '{"key": "my_super_private_key", "secret": "my_super_private_secret"}'
+  --storage-options-output '{"key": "my_super_private_key", "secret": "my_super_private_secret"}'
 ```
 
 Using Windows (note the extra escape characters `\`):
@@ -92,14 +92,14 @@ mdio segy import \
   path/to/my.segy \
   s3://bucket/prefix/my.mdio \
   --header-locations 189,193 \
-  --storage-options "{\"key\": \"my_super_private_key\", \"secret\": \"my_super_private_secret\"}"
+  --storage-options-output "{\"key\": \"my_super_private_key\", \"secret\": \"my_super_private_secret\"}"
 ```
 
 ### Google Cloud Provider
 
 Credentials can be automatically fetched from pre-authenticated `gcloud` CLI.
 See [here](https://gcsfs.readthedocs.io/en/latest/#credentials) for the order `gcsfs`
-checks them. If it is not pre-authenticated, you need to pass `--storage-options`.
+checks them. If it is not pre-authenticated, you need to pass `--storage-options-{input,output}`.
 
 GCP uses [service accounts](https://cloud.google.com/iam/docs/service-accounts) to pass
 authentication information to APIs.
@@ -117,7 +117,7 @@ mdio segy import \
   path/to/my.segy \
   gs://bucket/prefix/my.mdio \
   --header-locations 189,193 \
-  --storage-options '{"token": "~/.config/gcloud/application_default_credentials.json"}'
+  --storage-options-output '{"token": "~/.config/gcloud/application_default_credentials.json"}'
 ```
 
 Using browser to populate authentication:
@@ -127,14 +127,14 @@ mdio segy import \
   path/to/my.segy \
   gs://bucket/prefix/my.mdio \
   --header-locations 189,193 \
-  --storage-options '{"token": "browser"}'
+  --storage-options-output '{"token": "browser"}'
 ```
 
 ### Microsoft Azure
 
 There are various ways to authenticate with Azure Data Lake (ADL).
 See [here](https://github.com/fsspec/adlfs#details) for some details.
-If ADL is not pre-authenticated, you need to pass `--storage-options`.
+If ADL is not pre-authenticated, you need to pass `--storage-options-{input,output}`.
 
 **Prefix:**
 `az://` or `abfs://`
@@ -148,7 +148,7 @@ mdio segy import \
   path/to/my.segy \
   az://bucket/prefix/my.mdio \
   --header-locations 189,193 \
-  --storage-options '{"account_name": "myaccount", "account_key": "my_super_private_key"}'
+  --storage-options-output '{"account_name": "myaccount", "account_key": "my_super_private_key"}'
 ```
 
 ### Advanced Cloud Features
@@ -190,7 +190,7 @@ reduces object-store request costs.
 
 When combining advanced protocols like `simplecache` and using a remote store like `s3` the
 URL can be chained like `simplecache::s3://bucket/prefix/file.mdio`. When doing this the
-`--storage-options` argument must explicitly state parameters for the cloud backend and the
+`--storage-options-{input,output}` argument must explicitly state parameters for the cloud backend and the
 extra protocol. For the above example it would look like this:
 
 ```json
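
For the chained `simplecache::s3://` case above, fsspec expects the storage options keyed by protocol name. A sketch with placeholder values following that convention (the keys `s3` and `simplecache` name the protocols; the values mirror the earlier S3 example and are illustrative only):

```json
{
  "s3": {"key": "my_super_private_key", "secret": "my_super_private_secret"},
  "simplecache": {"cache_storage": "/custom/temp/location"}
}
```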
19 changes: 14 additions & 5 deletions src/mdio/commands/segy.py
@@ -115,10 +115,17 @@
     show_default=True,
 )
 @option(
-    "-storage",
-    "--storage-options",
+    "-storage-input",
+    "--storage-options-input",
     required=False,
-    help="Custom storage options for cloud backends",
+    help="Storage options for SEG-Y input file.",
     type=JSON,
 )
+@option(
+    "-storage-output",
+    "--storage-options-output",
+    required=False,
+    help="Storage options for the MDIO output file.",
+    type=JSON,
+)
 @option(
@@ -144,7 +151,8 @@ def segy_import(
     chunk_size: list[int],
     lossless: bool,
     compression_tolerance: float,
-    storage_options: dict[str, Any],
+    storage_options_input: dict[str, Any],
+    storage_options_output: dict[str, Any],
     overwrite: bool,
     grid_overrides: dict[str, Any],
 ):
@@ -347,7 +355,8 @@ def segy_import(
         chunksize=chunk_size,
         lossless=lossless,
         compression_tolerance=compression_tolerance,
-        storage_options=storage_options,
+        storage_options_input=storage_options_input,
+        storage_options_output=storage_options_output,
         overwrite=overwrite,
         grid_overrides=grid_overrides,
     )
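
With both flags registered, a hypothetical end-to-end invocation reading a SEG-Y from S3 and writing MDIO to GCS could look like this (bucket names and credentials are placeholders):

```bash
mdio segy import \
  s3://source-bucket/my.segy \
  gs://target-bucket/my.mdio \
  --header-locations 189,193 \
  --storage-options-input '{"key": "my_key", "secret": "my_secret"}' \
  --storage-options-output '{"token": "browser"}'
```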
27 changes: 18 additions & 9 deletions src/mdio/converters/segy.py
@@ -4,15 +4,16 @@
 
 import logging
 import os
+from collections.abc import Sequence
 from datetime import datetime
 from datetime import timezone
 from importlib import metadata
 from typing import Any
-from typing import Sequence
 
 import numpy as np
 import zarr
 from segy import SegyFile
+from segy.config import SegySettings
 from segy.schema import HeaderField
 
 from mdio.api.io_utils import process_url
@@ -113,7 +114,8 @@ def segy_to_mdio(  # noqa: C901
     chunksize: Sequence[int] | None = None,
     lossless: bool = True,
     compression_tolerance: float = 0.01,
-    storage_options: dict[str, Any] | None = None,
+    storage_options_input: dict[str, Any] | None = None,
+    storage_options_output: dict[str, Any] | None = None,
     overwrite: bool = False,
     grid_overrides: dict | None = None,
 ) -> None:
@@ -164,7 +166,9 @@ def segy_to_mdio(  # noqa: C901
             accuracy mode in ZFP guarantees there won't be any errors larger
             than this value. The default is 0.01, which gives about 70%
             reduction in size. Will be ignored if `lossless=True`.
-        storage_options: Storage options for the cloud storage backend.
+        storage_options_input: Storage options for SEG-Y input file.
+            Default is `None` (will assume anonymous)
+        storage_options_output: Storage options for the MDIO output file.
             Default is `None` (will assume anonymous)
         overwrite: Toggle for overwriting existing store
         grid_overrides: Option to add grid overrides. See examples.
@@ -355,20 +359,25 @@ def segy_to_mdio(  # noqa: C901
         )
         raise ValueError(message)
 
-    if storage_options is None:
-        storage_options = {}
+    # Handle storage options and check permissions etc
+    if storage_options_input is None:
+        storage_options_input = {}
+
+    if storage_options_output is None:
+        storage_options_output = {}
 
     store = process_url(
         url=mdio_path_or_buffer,
         mode="w",
-        storage_options=storage_options,
+        storage_options=storage_options_output,
         memory_cache_size=0,  # Making sure disk caching is disabled,
         disk_cache=False,  # Making sure disk caching is disabled
     )
 
     # Open SEG-Y with MDIO's SegySpec. Endianness will be inferred.
     mdio_spec = mdio_segy_spec()
-    segy = SegyFile(url=segy_path, spec=mdio_spec)
+    segy_settings = SegySettings(storage_options=storage_options_input)
+    segy = SegyFile(url=segy_path, spec=mdio_spec, settings=segy_settings)
 
     text_header = segy.text_header
     binary_header = segy.binary_header
@@ -380,7 +389,7 @@ def segy_to_mdio(  # noqa: C901
     for name, byte, format_ in zip(index_names, index_bytes, index_types):  # noqa: B905
         index_fields.append(HeaderField(name=name, byte=byte, format=format_))
     mdio_spec_grid = mdio_spec.customize(trace_header_fields=index_fields)
-    segy_grid = SegyFile(url=segy_path, spec=mdio_spec_grid)
+    segy_grid = SegyFile(url=segy_path, spec=mdio_spec_grid, settings=segy_settings)
 
     dimensions, chunksize, index_headers = get_grid_plan(
         segy_file=segy_grid,
@@ -482,7 +491,7 @@ def segy_to_mdio(  # noqa: C901
     store_nocache = process_url(
         url=mdio_path_or_buffer,
         mode="r+",
-        storage_options=storage_options,
+        storage_options=storage_options_output,
         memory_cache_size=0,  # Making sure disk caching is disabled,
         disk_cache=False,  # Making sure disk caching is disabled
     )
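
On the input side, the converter threads credentials to the `segy` library through `SegySettings`, so the main reader and the grid reader share one settings object. A condensed sketch of that pattern (the `spec` argument stands in for the `mdio_segy_spec()` result used in the diff):

```python
from segy import SegyFile
from segy.config import SegySettings


def open_segy_with_options(segy_path: str, spec, storage_options_input: dict | None) -> SegyFile:
    """Sketch: pass input-side storage options to the SEG-Y reader via SegySettings."""
    settings = SegySettings(storage_options=storage_options_input or {})
    return SegyFile(url=segy_path, spec=spec, settings=settings)
```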
