Hi authors, I'm using `dedup/bff` to run deduplication on my data. I split my data into 512 JSONL files, each containing ~170,000 docs, for a total of about 500 GB. When I ran it, it reported that 98% of my data was removed:
Creating new bloom filter...
Bloom filter has size 1.1 GiB | FP Rate 0.010000000289397546
Files 0/512 [00:00:00/00:00:00] Completed setup phase in 0 seconds
Files 512/512 [00:20:19/00:20:19] Completed filtering all files in 1220 seconds
After running, BFF sparsity was 0.9774785737910421
Completed full BFF run in 1220 seconds
Stats: Saw 505.4 GiB of text | Removed 0.9832516704860044 of them
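For reference, a 1.1 GiB filter at a 1% false-positive rate is roughly what the textbook Bloom-filter sizing formula gives for 1B expected n-grams, so the filter appears to have been sized for far fewer n-grams than a 500 GB corpus contains, and the reported sparsity of ~0.98 (presumably the fraction of filter bits set) suggests it ended up nearly saturated. A minimal sketch of that arithmetic, assuming bff follows the standard formula:

```python
import math

# Standard Bloom-filter sizing: m = -n * ln(p) / (ln 2)^2 bits for n items at
# false-positive rate p. Whether bff uses exactly this formula is an assumption,
# but it reproduces the 1.1 GiB reported above for 1B expected n-grams.
def bloom_size_gib(expected_ngrams: int, fp_rate: float) -> float:
    bits = -expected_ngrams * math.log(fp_rate) / math.log(2) ** 2
    return bits / 8 / 2**30

print(bloom_size_gib(1_000_000_000, 0.01))    # ~1.12 GiB, matching the log
print(bloom_size_gib(250_000_000_000, 0.01))  # ~279 GiB for 250B n-grams at the same FP rate
```

A filter that small relative to the corpus fills up almost completely, at which point nearly every n-gram tests positive and is treated as a duplicate.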
I don't think my data is of such low quality that it would contain that many duplicates: I trained on it in the 400M-1x setup without deduplication and achieved results similar to RPJ. Could I be setting some hyperparameters incorrectly, or is there something else I might be overlooking?
Hey @Yu-Shi,
The likely reason is that the expected n-gram count was estimated too low. With ~500 GB of data you probably have closer to 250B tokens, whereas you specified 1B. Can you please try again with a higher estimate?
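To make that estimate concrete, here is a rough back-of-the-envelope sketch; the bytes-per-token ratio is an assumption and varies with the corpus and tokenizer, but any plausible value lands orders of magnitude above 1B:

```python
corpus_bytes = 500 * 10**9  # ~500 GB of raw text

# Rough assumption: 2-4 bytes of raw text per token; the real ratio depends on
# the language, formatting overhead, and how bff splits text into n-grams.
for bytes_per_token in (2, 4):
    tokens = corpus_bytes / bytes_per_token
    print(f"{bytes_per_token} bytes/token -> ~{tokens / 1e9:.0f}B tokens")
# 2 bytes/token -> ~250B tokens
# 4 bytes/token -> ~125B tokens
```

Either way, `--expected-ngram-count` should be in the hundreds of billions for this corpus, not 1B.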
@Mivg Thank you for your reply! I changed `--expected-ngram-count` from 1B to 250B, and it still removed 72% of my data. Could you please provide more guidance on how to estimate this parameter?
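One way to go beyond a byte-based guess is to count words over a sample of the shards and extrapolate. This is only a hedged sketch, not the authors' recommended procedure: the `data/` path, the `text` field name, and whitespace splitting are all assumptions, and bff's own tokenization and n-gram definition may differ, so it is safer to round the result up:

```python
import json
import random
from pathlib import Path

# Hypothetical location of the 512 .jsonl shards.
files = sorted(Path("data/").glob("*.jsonl"))
sample = random.sample(files, k=min(8, len(files)))

words_in_sample = 0
for path in sample:
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            # Assumes each record stores its text under a "text" key.
            words_in_sample += len(doc.get("text", "").split())

# Extrapolate from the sampled shards to all of them (assumes shards are similar in size).
estimate = words_in_sample * len(files) / max(len(sample), 1)
print(f"estimated --expected-ngram-count: ~{estimate / 1e9:.0f}B")
```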