Implement conversion from CutSet to HuggingFace dataset #1398

domklement · 2024-09-30T05:45:36Z

This PR implements a simple conversion from a CutSet containing MonoCuts with single-source Recordings to a HuggingFace dataset.

CutSet.to_huggingface_dataset (no arguments, returns a HuggingFace Dataset) converts the cutset into one of two formats, depending on whether every cut contains exactly one supervision or some cuts contain several. The formats are described in the method's docstring.
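The choice between the two formats can be sketched roughly as below. This is only an illustration: Supervision and Cut are stand-in dataclasses rather than the lhotse classes, and choose_export_format with its two labels is a hypothetical name, not something taken from the PR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Supervision:
    """Stand-in for a supervision segment; not the lhotse class."""
    text: str


@dataclass
class Cut:
    """Stand-in for a MonoCut; not the lhotse class."""
    id: str
    supervisions: List[Supervision] = field(default_factory=list)


def choose_export_format(cuts: List[Cut]) -> str:
    """Return a hypothetical label for the layout the export would use."""
    # Exactly one supervision in every cut -> flat layout;
    # anything else -> layout that keeps a list of segments per cut.
    if all(len(cut.supervisions) == 1 for cut in cuts):
        return "single_supervision"
    return "multi_supervision"
```

A single cut with two supervisions is enough to switch the whole export to the multi-supervision layout, since the check is over all cuts.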

So far, only conversion from a CutSet containing MonoCuts with single-source audio is supported.

pzelasko

Nice work!

pzelasko · 2024-10-01T22:30:11Z

lhotse/cut/set.py

+    def has_one_audio_source(self) -> bool:
+        return all(len(cut.recording.sources) == 1 for cut in self)
+
+    def _convert_cuts_info_to_hf(self) -> Tuple[Dict[str, Any], Dict[str, Any]]:


Since this file is pretty large and the functionality introduced is fairly well isolated, I suggest changing every method into a function (except the main HF export API) and moving them into a separate file, lhotse/hf.py. Keep the HF datasets imports local, as they are now. The body of to_huggingface_dataset(self) should import and call export_cuts_to_hf(cuts) from that file.
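A minimal sketch of the suggested layout, collapsed into one module so it runs on its own. FakeCut and the reduced CutSet are stand-ins, cuts_to_columns and its "id" and "duration" columns are hypothetical, and only the names export_cuts_to_hf and to_huggingface_dataset come from the comment above. Dataset.from_dict is the HuggingFace datasets constructor for column dictionaries.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass
class FakeCut:
    """Stand-in for a MonoCut with just two fields; not the lhotse class."""
    id: str
    duration: float


def cuts_to_columns(cuts: Iterable[Any]) -> Dict[str, List[Any]]:
    """Hypothetical helper: collect per-cut fields into column lists."""
    columns: Dict[str, List[Any]] = {"id": [], "duration": []}
    for cut in cuts:
        columns["id"].append(cut.id)
        columns["duration"].append(cut.duration)
    return columns


def export_cuts_to_hf(cuts: Iterable[Any]):
    """Would live in lhotse/hf.py; builds the HuggingFace dataset."""
    # Local import: `datasets` is only needed when the export is actually called.
    from datasets import Dataset

    return Dataset.from_dict(cuts_to_columns(cuts))


class CutSet:
    """Stand-in for lhotse's CutSet, reduced to what the sketch needs."""

    def __init__(self, cuts: List[Any]) -> None:
        self.cuts = cuts

    def __iter__(self):
        return iter(self.cuts)

    def to_huggingface_dataset(self):
        # In the split layout this line would be preceded by
        # `from lhotse.hf import export_cuts_to_hf`.
        return export_cuts_to_hf(self)
```

With this shape the class keeps a one-line public method, and importing the module does not require the datasets package to be installed.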

pzelasko · 2024-10-01T22:31:07Z

lhotse/cut/set.py

+        Converts cut supervisions into a dictionary compatible with HuggingFace datasets format.
+
+        :param has_speaker: Whether the supervisions have speaker information.
+        :param has_language: Whether the supervisions have language information.


Shouldn't these be deduced automatically?
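One way the two flags could be derived from the data instead of being passed in, as a rough sketch. Supervision and Cut are stand-in dataclasses that only mirror the optional speaker and language fields, and deduce_optional_fields is a hypothetical name, not the function that ended up in the PR.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class Supervision:
    """Stand-in for a supervision segment; not the lhotse class."""
    text: str
    speaker: Optional[str] = None
    language: Optional[str] = None


@dataclass
class Cut:
    """Stand-in for a MonoCut; not the lhotse class."""
    id: str
    supervisions: List[Supervision] = field(default_factory=list)


def deduce_optional_fields(cuts: Iterable[Cut]) -> Tuple[bool, bool]:
    """Return (has_speaker, has_language) by scanning the supervisions."""
    has_speaker = False
    has_language = False
    for cut in cuts:
        for sup in cut.supervisions:
            has_speaker = has_speaker or sup.speaker is not None
            has_language = has_language or sup.language is not None
        # Stop early once both kinds of metadata have been seen.
        if has_speaker and has_language:
            break
    return has_speaker, has_language
```

The scan costs one pass over the cuts in the worst case, which is the trade-off against asking the caller for the flags.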

pzelasko · 2024-10-07T17:10:05Z

lhotse/cut/set.py

+        Converts a CutSet to a HuggingFace Dataset. Currently, only MonoCut with one recording source is supported.
+        Other cut types will be supported in the future.
+
+        More detailed description is in lhotse/hf.py


One last comment: can you copy (duplicate) the documentation from export_to_hf here? It will be more user friendly for people using interactive suggestions (notebooks, IDEs)

That's a good point. It's done.

pzelasko · 2024-10-07T19:17:27Z

Thanks!!!

…h#1398)
* Implement conversion from CutSet to HuggingFace dataset. So far, conversion from CutSet containing MonoCut and single-source audio to HuggingFace dataset.
* Refactor
* Add docs to set.py

Co-authored-by: Piotr Żelasko

Implement conversion from CutSet to HuggingFace dataset

dd9e62f

So far, conversion from CutSet containing MonoCut and single-source audio to HuggingFace dataset.

pzelasko requested changes Oct 1, 2024

Merge branch 'lhotse-speech:master' into feature/export-to-huggingface-dataset

0b1d644

domklement force-pushed the feature/export-to-huggingface-dataset branch from b4a5756 to 6247d99 Compare October 3, 2024 05:49

Refactor

aa981c7

domklement force-pushed the feature/export-to-huggingface-dataset branch from 6247d99 to aa981c7 Compare October 3, 2024 05:53

domklement requested a review from pzelasko October 6, 2024 17:17

pzelasko reviewed Oct 7, 2024

pzelasko and others added 2 commits October 7, 2024 13:10

Merge branch 'master' into feature/export-to-huggingface-dataset

5f370bc

Add docs to set.py

fae3899

domklement requested a review from pzelasko October 7, 2024 17:49

pzelasko approved these changes Oct 7, 2024

pzelasko merged commit e2b149d into lhotse-speech:master Oct 7, 2024
9 checks passed

pzelasko added this to the v1.28.0 milestone Oct 7, 2024
