Dataset.from_dict() can't handle large dict #7366

CSU-OSS · 2025-01-11T02:05:21Z

Describe the bug

I have 26,000,000 3-tuples. When I use Dataset.from_dict() to load them, the code does not run successfully either as a .py script or in a Jupyter notebook. This is my code:

    from datasets import Dataset

    # len(example_data) is 26,000,000; each 'diff' is a text string
    diff1_list = [example_data[i].texts[0] for i in range(len(example_data))]
    diff2_list = [example_data[i].texts[1] for i in range(len(example_data))]
    label_list = [example_data[i].label for i in range(len(example_data))]

    # Build the dataset from three equal-length columns
    embedding_dataset = Dataset.from_dict({
        "diff1": diff1_list,
        "diff2": diff2_list,
        "label": label_list
    })

Steps to reproduce the bug

Build a large collection of 3-tuples, e.g. 26,000,000 items
Call Dataset.from_dict() to load them

Expected behavior

Dataset.from_dict() runs successfully

Environment info

sentence-transformers 3.3.1
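
A possible alternative to try, offered only as an untested sketch: instead of materializing three 26,000,000-element lists before calling Dataset.from_dict(), the rows could be yielded one at a time and passed to Dataset.from_generator(). The helper names iter_rows and build_dataset are hypothetical, the sample values are made up, and it is an assumption that each item exposes .texts and .label as in the code above. Whether this avoids the failure at this scale has not been verified.

```python
from types import SimpleNamespace


def iter_rows(examples):
    """Yield one row dict per example instead of building three full lists."""
    for ex in examples:
        yield {"diff1": ex.texts[0], "diff2": ex.texts[1], "label": ex.label}


def build_dataset(examples):
    # Imported lazily so iter_rows can be checked without the library installed.
    from datasets import Dataset

    # from_generator writes rows out incrementally rather than taking
    # fully built columns; not verified against the 26,000,000-row case.
    return Dataset.from_generator(iter_rows, gen_kwargs={"examples": examples})


# Tiny stand-in for example_data (hypothetical values, not from the report).
sample = [
    SimpleNamespace(texts=["old line", "new line"], label=1),
    SimpleNamespace(texts=["foo", "bar"], label=0),
]
rows = list(iter_rows(sample))
```

With the real data, the call would be build_dataset(example_data). Checking the generator on a small slice first keeps the feedback loop short.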
