
deduplication removes 98% of my data #71

Open
Yu-Shi opened this issue Sep 2, 2024 · 2 comments

Yu-Shi commented Sep 2, 2024

Hi authors, I'm using dedup/bff to run deduplication on my data. I split my data into 512 jsonl files, each containing ~170,000 docs; the total size is about 500 GB. I ran the following command:

cargo run --release bff  --inputs /path/to/my/data  --output-directory /path/to/output  --expected-ngram-count 1000000000  --fp-rate 0.01  --min-ngram-size 13  --max-ngram-size 13  --filtering-threshold 0.8  --remove-type naive-both

And it reported that 98% of my data was removed:

Creating new bloom filter...
Bloom filter has size 1.1 GiB | FP Rate 0.010000000289397546
Files 0/512 [00:00:00/00:00:00] [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]Completed setup phase in 0 seconds
Files 512/512 [00:20:19/00:20:19] [███████████████████████████████████████████████████████████████████████████████████████████████████████]Completed filtering all files in 1220 seconds
After running, BFF sparsity was 0.9774785737910421
Completed full BFF run in 1220 seconds
Stats: Saw 505.4 GiB of text | Removed 0.9832516704860044 of them

I don't think my data is of low enough quality to contain that many duplicates: I ran it in the 400M-1x setup without deduplication and achieved results similar to RPJ. Could I be setting some hyperparameters incorrectly, or is there something else I might be overlooking?

Mivg self-assigned this Sep 3, 2024

Mivg (Collaborator) commented Sep 4, 2024

Hey @Yu-Shi,
The likely reason is a low estimate of the n-gram count: with ~500 GB of data you likely have closer to 250B tokens, while --expected-ngram-count was set to 1B. Can you please try with a higher estimate?
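For intuition, here is a back-of-the-envelope sketch using the standard Bloom filter sizing formulas (an illustrative sketch, not bff's actual code; the 250B figure is the rough estimate above). A filter sized for 1B n-grams at a 1% FP rate comes out to ~1.1 GiB, matching the log, and loading ~250B n-grams into that same filter pushes the effective false-positive rate to essentially 1, so nearly every document looks like a duplicate:

// Standard Bloom filter sizing math (illustrative sketch, not bff's implementation).
fn main() {
    let fp_rate: f64 = 0.01;   // --fp-rate
    let expected: f64 = 1e9;   // --expected-ngram-count as originally given
    let actual: f64 = 250e9;   // rough true n-gram count for ~500 GB of text (assumption)

    // Optimal bit count and number of hash functions for the expected load:
    //   m = -n * ln(p) / (ln 2)^2,   k = (m / n) * ln 2
    let bits = -expected * fp_rate.ln() / 2f64.ln().powi(2);
    let hashes = (bits / expected * 2f64.ln()).round();
    println!("filter: {:.2} GiB, k = {}", bits / 8.0 / 1024f64.powi(3), hashes);
    // -> ~1.12 GiB with k = 7, matching "Bloom filter has size 1.1 GiB"

    // Effective FP rate after inserting the actual number of n-grams:
    //   p' = (1 - e^(-k * n_actual / m))^k
    let effective_fp = (1.0 - (-hashes * actual / bits).exp()).powf(hashes);
    println!("effective FP rate: {:.4}", effective_fp);
    // -> ~1.0, i.e. almost every n-gram is reported as already seen
}

If the reported "BFF sparsity" is the fraction of set bits, the 0.977 in the log is consistent with a badly over-full filter; a filter sized correctly for its load ends up around 0.5.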

Yu-Shi (Author) commented Sep 6, 2024

@Mivg Thank you for your reply! I changed --expected-ngram-count from 1B to 250B, and it still removed 72% of my data. Could you please provide more information on how to estimate this parameter?
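One rough way to estimate --expected-ngram-count is to count n-grams in a sample shard and extrapolate by size. A minimal sketch, assuming one JSON object per line (with the document text dominating the line) and whitespace tokenization; both are assumptions, and bff's real tokenization may differ:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let sample_path = "shard_000.jsonl";              // hypothetical sample shard
    let total_bytes = 505.4 * 1024f64.powi(3);        // full corpus size from the log above
    let ngram_size: usize = 13;                       // --min-ngram-size / --max-ngram-size

    let mut sample_bytes = 0f64;
    let mut sample_ngrams = 0f64;
    for line in BufReader::new(File::open(sample_path)?).lines() {
        let line = line?;
        sample_bytes += line.len() as f64;
        // Crude token count over the whole JSON line (keys and metadata included);
        // good enough for an order-of-magnitude estimate.
        let tokens = line.split_whitespace().count();
        sample_ngrams += tokens.saturating_sub(ngram_size - 1).max(1) as f64;
    }

    // Scale the sample's n-gram density up to the full corpus.
    let estimate = total_bytes * sample_ngrams / sample_bytes;
    println!("estimated --expected-ngram-count: {:.2e}", estimate);
    Ok(())
}

Overestimating the count only costs memory (the filter just gets bigger), while underestimating it inflates the false-positive rate, so it is safer to round up.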
