Implement conversion from CutSet to HuggingFace dataset #1398

domklement
Contributor

This PR implements a simple conversion from a CutSet containing MonoCuts and single-source Recording to a HuggingFace dataset.

CutSet.to_huggingface_dataset: None -> Dataset converts the cutset into one of two formats, depending on whether every cut contains exactly one supervision or some cuts contain several. The formats are described in the method's docstring.
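
The choice between the two formats comes down to a single check over the cut set. A minimal sketch of such a check follows. `CutStub` and `all_single_supervision` are hypothetical stand-ins for illustration, not lhotse's actual types or the code in this PR.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class CutStub:
    """Hypothetical stand-in for a cut; only the supervision list matters here."""

    id: str
    supervisions: List[str] = field(default_factory=list)


def all_single_supervision(cuts: Iterable[CutStub]) -> bool:
    """Return True when every cut carries exactly one supervision."""
    return all(len(cut.supervisions) == 1 for cut in cuts)


# Two small example collections: one uniform, one with a multi-supervision cut.
single = [CutStub("a", ["hello"]), CutStub("b", ["world"])]
mixed = [CutStub("a", ["hello"]), CutStub("b", ["foo", "bar"])]

print(all_single_supervision(single))
print(all_single_supervision(mixed))
```

An exporter could branch on this result to pick one layout or the other. The layouts themselves are documented in the method's docstring and are not reproduced here.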

So far, conversion from CutSet containing MonoCut and single-source audio to HuggingFace dataset.
Collaborator

@pzelasko pzelasko left a comment

Nice work!

def has_one_audio_source(self) -> bool:
    return all(len(cut.recording.sources) == 1 for cut in self)

def _convert_cuts_info_to_hf(self) -> Tuple[Dict[str, Any], Dict[str, Any]]:
Collaborator

Since this file is pretty large and the functionality introduced is fairly well isolated, I suggest changing every method into a function (except the main HF export API) and moving them into a separate file, lhotse/hf.py. Keep the HF datasets imports local, as they are now. The body of to_huggingface_dataset(self) should import and call export_cuts_to_hf(cuts) from that file.
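
The suggested layout could look roughly like the sketch below. It is self-contained, so both pieces live in one file here. `CutSetSketch`, the dict-based cuts, and the column names are assumptions made for illustration. In the real code the function would sit in lhotse/hf.py and build a HuggingFace dataset rather than a plain dict.

```python
from typing import Any, Dict, Iterable, List


def export_cuts_to_hf(cuts: Iterable[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Stand-in for the module-level export function (would live in lhotse/hf.py)."""
    # The optional dependency would be imported here, inside the function,
    # so that importing the package does not require it to be installed.
    columns: Dict[str, List[Any]] = {"id": [], "duration": []}
    for cut in cuts:
        columns["id"].append(cut["id"])
        columns["duration"].append(cut["duration"])
    return columns


class CutSetSketch:
    """Hypothetical stand-in for CutSet, reduced to what the sketch needs."""

    def __init__(self, cuts: Iterable[Dict[str, Any]]) -> None:
        self.cuts = list(cuts)

    def __iter__(self):
        return iter(self.cuts)

    def to_huggingface_dataset(self) -> Dict[str, List[Any]]:
        # The real method would import the function locally from the new
        # module and then simply delegate to it.
        return export_cuts_to_hf(self)


cuts = CutSetSketch([{"id": "a", "duration": 1.5}, {"id": "b", "duration": 2.0}])
result = cuts.to_huggingface_dataset()
print(result)
```

Keeping the method as a thin wrapper leaves the large class file nearly unchanged while the export logic can be tested on its own.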

Contributor Author

Done.

Converts cut supervisions into a dictionary compatible with HuggingFace datasets format.

:param has_speaker: Whether the supervisions have speaker information.
:param has_language: Whether the supervisions have language information.
Collaborator

Shouldn't these be auto-deduced?
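
One way the two flags could be derived from the data instead of being passed in is sketched below. `SupervisionStub` and `deduce_flags` are hypothetical names used for illustration. Using `any` (rather than `all`) is an assumption, and the logic the PR ended up with is not shown in this conversation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class SupervisionStub:
    """Hypothetical stand-in for a supervision with optional metadata fields."""

    text: str
    speaker: Optional[str] = None
    language: Optional[str] = None


def deduce_flags(supervisions: Iterable[SupervisionStub]) -> Tuple[bool, bool]:
    """Return (has_speaker, has_language) by inspecting the supervisions."""
    sups = list(supervisions)  # materialize so the iterable can be scanned twice
    has_speaker = any(s.speaker is not None for s in sups)
    has_language = any(s.language is not None for s in sups)
    return has_speaker, has_language


sups = [
    SupervisionStub("hello", speaker="spk1"),
    SupervisionStub("world"),
]
flags = deduce_flags(sups)
print(flags)
```

With `any`, a column is emitted as soon as one supervision has the field, and missing values would need a placeholder. With `all`, the column would only appear when every supervision has it.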

@domklement domklement force-pushed the feature/export-to-huggingface-dataset branch from b4a5756 to 6247d99 Compare October 3, 2024 05:49
@domklement domklement force-pushed the feature/export-to-huggingface-dataset branch from 6247d99 to aa981c7 Compare October 3, 2024 05:53
@domklement domklement requested a review from pzelasko October 6, 2024 17:17
Converts a CutSet to a HuggingFace Dataset. Currently, only MonoCut with one recording source is supported.
Other cut types will be supported in the future.

More detailed description is in lhotse/hf.py
Collaborator

One last comment: can you copy (duplicate) the documentation from export_to_hf here? It will be more user-friendly for people using interactive suggestions (notebooks, IDEs).

Contributor Author

That's a good point. It's done.

@domklement domklement requested a review from pzelasko October 7, 2024 17:49
@pzelasko pzelasko merged commit e2b149d into lhotse-speech:master Oct 7, 2024
9 checks passed
Collaborator

pzelasko commented Oct 7, 2024

Thanks!!!

@pzelasko pzelasko added this to the v1.28.0 milestone Oct 7, 2024
yfyeung pushed a commit to yfyeung/lhotse that referenced this pull request Jan 8, 2025
…h#1398)

* Implement conversion from CutSet to HuggingFace dataset

So far, conversion from CutSet containing MonoCut and single-source audio to HuggingFace dataset.

* Refactor

* Add docs to set.py

---------

Co-authored-by: Piotr Żelasko <[email protected]>