
deduplication removes 98% of my data #71

Open
Yu-Shi opened this issue Sep 2, 2024 · 2 comments

Yu-Shi commented Sep 2, 2024

Hi authors, I'm using dedup/bff to run deduplication on my data. I split my data into 512 jsonl files, each containing ~170,000 docs; the total size is about 500 GB. I ran the following command:

cargo run --release bff  --inputs /path/to/my/data  --output-directory /path/to/output  --expected-ngram-count 1000000000  --fp-rate 0.01  --min-ngram-size 13  --max-ngram-size 13  --filtering-threshold 0.8  --remove-type naive-both

And it reported that 98% of my data was removed:

Creating new bloom filter...
Bloom filter has size 1.1 GiB | FP Rate 0.010000000289397546
Files 0/512 [00:00:00/00:00:00] [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]Completed setup phase in 0 seconds
Files 512/512 [00:20:19/00:20:19] [███████████████████████████████████████████████████████████████████████████████████████████████████████]Completed filtering all files in 1220 seconds
After running, BFF sparsity was 0.9774785737910421
Completed full BFF run in 1220 seconds
Stats: Saw 505.4 GiB of text | Removed 0.9832516704860044 of them

I don't think my data is of low enough quality to contain that many duplicates: I ran it in the 400M-1x setup without deduplication and achieved results similar to RPJ. Could I be setting some hyperparameters incorrectly, or is there something else I might be overlooking?

Mivg self-assigned this Sep 3, 2024

Mivg (Collaborator) commented Sep 4, 2024

Hey @Yu-Shi,
The likely reason is a low estimate of the n-gram count: with ~500 GB of data you likely have closer to 250B tokens, while --expected-ngram-count was set to 1B. Can you please try with a higher estimate?
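For intuition, here is a back-of-the-envelope sketch using the standard Bloom filter sizing formulas (an illustrative sketch, not bff's actual code; the 250B figure is the rough estimate above). A filter sized for 1B n-grams at a 1% FP rate comes out to ~1.1 GiB, matching the log, and loading ~250B n-grams into that same filter pushes the effective false-positive rate to essentially 1, so nearly every document looks like a duplicate:

// Standard Bloom filter sizing math (illustrative sketch, not bff's implementation).
fn main() {
    let fp_rate: f64 = 0.01;   // --fp-rate
    let expected: f64 = 1e9;   // --expected-ngram-count as originally given
    let actual: f64 = 250e9;   // rough true n-gram count for ~500 GB of text (assumption)

    // Optimal bit count and number of hash functions for the expected load:
    //   m = -n * ln(p) / (ln 2)^2,   k = (m / n) * ln 2
    let bits = -expected * fp_rate.ln() / 2f64.ln().powi(2);
    let hashes = (bits / expected * 2f64.ln()).round();
    println!("filter: {:.2} GiB, k = {}", bits / 8.0 / 1024f64.powi(3), hashes);
    // -> ~1.12 GiB with k = 7, matching "Bloom filter has size 1.1 GiB"

    // Effective FP rate after inserting the actual number of n-grams:
    //   p' = (1 - e^(-k * n_actual / m))^k
    let effective_fp = (1.0 - (-hashes * actual / bits).exp()).powf(hashes);
    println!("effective FP rate: {:.4}", effective_fp);
    // -> ~1.0, i.e. almost every n-gram is reported as already seen
}

If the reported "BFF sparsity" is the fraction of set bits, the 0.977 in the log is consistent with a badly over-full filter; a filter sized correctly for its load ends up around 0.5.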

Yu-Shi (Author) commented Sep 6, 2024

@Mivg Thank you for your reply! I changed --expected-ngram-count from 1B to 250B, and it still removed 72% of my data. Could you please provide more information on how to estimate this parameter?
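One rough way to estimate --expected-ngram-count is to count n-grams in a sample shard and extrapolate by size. A minimal sketch, assuming one JSON object per line (with the document text dominating the line) and whitespace tokenization; both are assumptions, and bff's real tokenization may differ:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let sample_path = "shard_000.jsonl";              // hypothetical sample shard
    let total_bytes = 505.4 * 1024f64.powi(3);        // full corpus size from the log above
    let ngram_size: usize = 13;                       // --min-ngram-size / --max-ngram-size

    let mut sample_bytes = 0f64;
    let mut sample_ngrams = 0f64;
    for line in BufReader::new(File::open(sample_path)?).lines() {
        let line = line?;
        sample_bytes += line.len() as f64;
        // Crude token count over the whole JSON line (keys and metadata included);
        // good enough for an order-of-magnitude estimate.
        let tokens = line.split_whitespace().count();
        sample_ngrams += tokens.saturating_sub(ngram_size - 1).max(1) as f64;
    }

    // Scale the sample's n-gram density up to the full corpus.
    let estimate = total_bytes * sample_ngrams / sample_bytes;
    println!("estimated --expected-ngram-count: {:.2e}", estimate);
    Ok(())
}

Overestimating the count only costs memory (the filter just gets bigger), while underestimating it inflates the false-positive rate, so it is safer to round up.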
