feat: add multi-storage-client backend for file open #1455

Open
wants to merge 2 commits into
base: master
from


@jayya2 jayya2 commented Feb 15, 2025

This PR adds support for the Multi-Storage Client (MSC) backend to handle object storage access in Lhotse. The changes include:

Features

  • New MSCIOBackend for handling MSC protocol URLs
  • URL transformation from existing protocols (e.g., s3://) to MSC format via
    • LHOTSE_MSC_OVERRIDE_PROTOCOLS env for supported protocols, e.g. s3:// -> msc://
    • LHOTSE_MSC_PROFILE env for profile/bucket name overrides, e.g. msc://my-bucket -> msc://my-profile

Implementation Details

  • Added URL transformation utilities for bucket/profile name handling
  • Integrated MSC backend into the default IO backend chain
  • Added unit tests for MSC functionality
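
To illustrate what joining a backend chain could mean, here is a toy sketch in which the first backend whose `is_applicable` returns true is selected. The classes `Backend`, `ToyMSCBackend`, `ToyLocalBackend` and the function `pick_backend` are hypothetical stand-ins written for this example; they are not Lhotse's `IOBackend` classes, and Lhotse's actual selection logic may differ.

```python
from typing import List


class Backend:
    """Toy stand-in for an IO backend; NOT Lhotse's IOBackend class."""

    name = "base"

    def is_applicable(self, identifier: str) -> bool:
        raise NotImplementedError


class ToyMSCBackend(Backend):
    name = "msc"

    def is_applicable(self, identifier: str) -> bool:
        # Assumption: this backend only claims msc:// URLs.
        return identifier.startswith("msc://")


class ToyLocalBackend(Backend):
    name = "local"

    def is_applicable(self, identifier: str) -> bool:
        # Assumption: anything without a URL scheme is a local path.
        return "://" not in identifier


def pick_backend(chain: List[Backend], identifier: str) -> Backend:
    """Return the first backend in the chain that accepts the identifier."""
    for backend in chain:
        if backend.is_applicable(identifier):
            return backend
    raise ValueError(f"No backend can handle {identifier!r}")


chain = [ToyMSCBackend(), ToyLocalBackend()]
print(pick_backend(chain, "msc://my-profile/audio/a.wav").name)
print(pick_backend(chain, "/data/audio/a.wav").name)
```

Ordering matters in a chain like this: a more specific backend placed earlier gets the first chance to claim an identifier.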

Configuration

MSC behavior can be configured through environment variables:

  • LHOTSE_MSC_OVERRIDE_PROTOCOLS: Comma-separated list of protocols to override (e.g., "s3,gs")
  • LHOTSE_MSC_PROFILE: Profile name to use for bucket override
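
A minimal sketch of how these two variables might drive the URL rewrite, based only on the description above. The helper name `to_msc_url` and its `env` parameter are hypothetical and are not taken from the PR. It also assumes that URLs whose scheme is not listed pass through unchanged and that query strings can be ignored.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse


def to_msc_url(url: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: rewrite e.g. s3://bucket/key to msc://profile/key."""
    env = os.environ if env is None else env

    # Comma-separated list such as "s3,gs"; empty entries are dropped.
    raw = env.get("LHOTSE_MSC_OVERRIDE_PROTOCOLS", "")
    protocols = [p.strip() for p in raw.split(",") if p.strip()]

    parsed = urlparse(url)
    if parsed.scheme not in protocols:
        # Assumption: schemes that are not listed are left untouched.
        return url

    # Without a profile override, the bucket name doubles as the profile name.
    profile = env.get("LHOTSE_MSC_PROFILE") or parsed.netloc
    return f"msc://{profile}{parsed.path}"


settings = {"LHOTSE_MSC_OVERRIDE_PROTOCOLS": "s3,gs", "LHOTSE_MSC_PROFILE": "my-profile"}
print(to_msc_url("s3://my-bucket/path/to/object.wav", settings))
```

Passing the environment as an argument keeps the sketch easy to test; the PR itself reads the variables through dedicated getter functions.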

Dependencies

  • Requires multistorageclient package to be installed

Collaborator

@pzelasko pzelasko left a comment

Thanks, it looks good! I left a few comments.

In addition to those, can you add MSC to the list of optional dependencies in README.md?

You might also need to add msc to the list of optional dependencies installed in the CI for its tests to succeed (see here).

@@ -815,6 +815,82 @@ def handles_special_case(self, identifier: Pathlike) -> bool:
    def is_applicable(self, identifier: Pathlike) -> bool:
        return is_valid_url(identifier)


@lru_cache(1)
Collaborator

We can remove the lru cache decorator since these environ lookups should be cheap - it looks like that would simplify the tests.
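
A sketch of what the uncached getter and a matching test could look like. The function name comes from the diff; the `Optional[str]` return annotation and the test function are assumptions made for this example. With `@lru_cache(1)` in place, the second assertion would see the stale value `"s3"` unless the test called `cache_clear()` first.

```python
import os
from typing import Optional
from unittest import mock


def get_lhotse_msc_override_protocols() -> Optional[str]:
    # Read on every call, so changes to the environment take effect immediately.
    return os.getenv("LHOTSE_MSC_OVERRIDE_PROTOCOLS")


def test_override_protocols_follow_environment() -> None:
    # mock.patch.dict restores os.environ when each block exits.
    with mock.patch.dict(os.environ, {"LHOTSE_MSC_OVERRIDE_PROTOCOLS": "s3"}):
        assert get_lhotse_msc_override_protocols() == "s3"
    with mock.patch.dict(os.environ, {"LHOTSE_MSC_OVERRIDE_PROTOCOLS": "s3,gs"}):
        assert get_lhotse_msc_override_protocols() == "s3,gs"


test_override_protocols_follow_environment()
```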

2. override the profile/bucket name by env LHOTSE_MSC_PROFILE if provided: msc://profile/path/to/my/object2,
if bucket name is not provided, then we expect the msc profile name to match with bucket name
"""

Collaborator

Please add an import guard here:

if not is_module_available("multistorageclient"):
    raise RuntimeError("Please run 'pip install multistorageclient' in order to use MSCIOBackend.")

(imported from lhotse.utils)


class MSCIOBackend(IOBackend):
    """
    Uses multi-storage client to download data from object store
Collaborator

Can you add a link to MSC here? It'd be good to add 1-2 sentences about how MSC is different and what its unique features are.


@lru_cache(1)
def get_lhotse_msc_override_protocols() -> Any:
    return os.getenv("LHOTSE_MSC_OVERRIDE_PROTOCOLS", None)
Collaborator

Please document these environment variables in Lhotse's top-level README.md where it lists all env vars used to modify lhotse behavior.

2 participants