Dataset.from_dict() can't handle large dict #7366

Open
CSU-OSS opened this issue Jan 11, 2025 · 0 comments

CSU-OSS commented Jan 11, 2025

Describe the bug

I have 26,000,000 3-tuples. When I load them with Dataset.from_dict(), the code does not run successfully, either as a .py script or in a Jupyter notebook. This is my code:

    from datasets import Dataset

    # len(example_data) is 26,000,000; each entry of `texts` is a diff (a text string)
    diff1_list = [example_data[i].texts[0] for i in range(len(example_data))]
    diff2_list = [example_data[i].texts[1] for i in range(len(example_data))]
    label_list = [example_data[i].label for i in range(len(example_data))]

    # Build the dataset from three equal-length columns
    embedding_dataset = Dataset.from_dict({
        "diff1": diff1_list,
        "diff2": diff2_list,
        "label": label_list
    })

Steps to reproduce the bug

  1. Build a large collection of 3-tuples, e.g. 26,000,000 of them
  2. Load it with Dataset.from_dict()
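
The steps above can be sketched at a smaller scale with synthetic data. This is only an illustration: the `Example` class, `make_examples`, `to_columns`, and `build_dataset` are hypothetical names that mimic the `texts` and `label` attributes used in my code, and the synthetic strings are much shorter than real diffs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Example:
    """Hypothetical stand-in for one entry of example_data (two texts plus a label)."""
    texts: Tuple[str, str]
    label: int


def make_examples(n: int) -> List[Example]:
    # Synthetic placeholder data; real diffs are presumably much longer strings.
    return [Example(texts=(f"diff a {i}", f"diff b {i}"), label=i % 2) for i in range(n)]


def to_columns(examples: List[Example]) -> Dict[str, list]:
    # Build the three columns in one pass over the examples.
    columns: Dict[str, list] = {"diff1": [], "diff2": [], "label": []}
    for ex in examples:
        columns["diff1"].append(ex.texts[0])
        columns["diff2"].append(ex.texts[1])
        columns["label"].append(ex.label)
    return columns


def build_dataset(columns: Dict[str, list]):
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import Dataset

    return Dataset.from_dict(columns)
```

Calling `build_dataset(to_columns(make_examples(1_000)))` should work at this small size; raising the count towards 26,000,000 approaches the scale at which I see the problem.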

Expected behavior

Dataset.from_dict() runs successfully.
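
For comparison, one thing that might be worth trying is to build the dataset in smaller pieces and join them with `concatenate_datasets`. This is an unverified sketch, not a confirmed fix: it assumes the failure is related to building all 26,000,000 rows in a single call, which I have not established. The helper names `chunk_bounds` and `build_in_chunks` and the default chunk size are hypothetical.

```python
from typing import Iterator, Sequence, Tuple


def chunk_bounds(total: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index pairs that together cover range(total)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


def build_in_chunks(example_data: Sequence, chunk_size: int = 1_000_000):
    # Imported lazily so chunk_bounds can be used without `datasets` installed.
    from datasets import Dataset, concatenate_datasets

    parts = []
    for start, stop in chunk_bounds(len(example_data), chunk_size):
        batch = [example_data[i] for i in range(start, stop)]
        # Each call to from_dict only sees chunk_size rows at most.
        parts.append(Dataset.from_dict({
            "diff1": [ex.texts[0] for ex in batch],
            "diff2": [ex.texts[1] for ex in batch],
            "label": [ex.label for ex in batch],
        }))
    # Assumes example_data is non-empty, so that parts has at least one dataset.
    return concatenate_datasets(parts)
```

If this variant succeeds where the single call does not, that would narrow the problem down to the size of one from_dict() call.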

Environment info

sentence-transformers 3.3.1
