
Any plans to release pools after refinedweb heuristic filtering + dedup? #59

CodeCreator opened this issue Aug 19, 2024 · 9 comments

@CodeCreator

Thank you for the great work!

The repo is great for reproducing the entire data processing pipeline, but a lot of people (including me) seem particularly interested in studying the final quality filtering step and ablating the dclm-baseline-fasttext classifier.

It would be really, really helpful if you could release the pools after RefinedWeb heuristic filtering + dedup, i.e. the stage before applying the dclm-baseline-fasttext classifier, because (1) it would make it much easier for small teams of researchers with limited resources / cloud compute budgets to compete, and (2) fixing the exact deduplication setting allows for cleaner comparisons at the quality filtering stage.

Uploading this for at least the 400M and the two 7B pools would be great!

@GeorgiosSmyrnis (Contributor) commented Aug 21, 2024

Hi @CodeCreator,

Thank you for your interest in our work!

We have made the entire dataset after RefinedWeb processing and deduplication available - information on how to access it can be found here: https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/index.html
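
A minimal sketch of streaming one shard over HTTPS, in case it helps (the shard path below is only a placeholder - real paths should be taken from the index linked above - and the files are assumed to be zstd-compressed JSONL, as elsewhere in the DCLM pipeline):

```python
import io
import json

import requests
import zstandard as zstd

BASE = "https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/"
# Placeholder path: substitute a real shard path listed in the index above.
SHARD = "example-global-shard/example-local-shard/example_shard.jsonl.zst"

with requests.get(BASE + SHARD, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    # Decompress the zstd stream on the fly and read it line by line.
    reader = zstd.ZstdDecompressor().stream_reader(resp.raw)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        doc = json.loads(line)
        # Peek at the first document's fields (field names may vary).
        print(doc.get("url"), len(doc.get("text", "")))
        break
```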

Examining the fasttext classifier is a great direction! Hope the above data proves useful!

@CodeCreator (Author)

Hi @GeorgiosSmyrnis,

Thanks for the quick response and for pointing me to the RefinedWeb data. Two questions about this data:

  1. Is there a canonical way of forming the competition pools from this overall data? E.g. specific global_shards / local_shards / indices?
  2. Just to confirm: is this data from before deduplication, so that deduplication should happen after forming the pools?

Thank you!

@GeorgiosSmyrnis (Contributor) commented Aug 23, 2024

Hi @CodeCreator,

  1. To create the pool, we randomly sampled jsonl files from the full dataset - I would recommend randomly sampling the number of jsonl files that corresponds to the scale you want to examine (see the sketch at the end of this comment).
  2. The above dataset is, in fact, after deduplication.

Let us know if the above helps!
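
For concreteness, a minimal sketch of that file-level sampling, assuming the processed shards have already been downloaded locally (the directory names, file count, and seed below are placeholders rather than the official pool settings):

```python
import random
import shutil
from pathlib import Path

SOURCE_DIR = Path("dclm-refinedweb")  # placeholder: local copy of the processed shards
POOL_DIR = Path("my-pool")            # placeholder: destination for the sampled pool
NUM_FILES = 100                       # placeholder: pick to match the target scale's token budget

# Enumerate all jsonl shards, then draw a fixed-size random sample of files.
all_files = sorted(SOURCE_DIR.rglob("*.jsonl.zst"))
rng = random.Random(0)                # fixed seed so the pool is reproducible
sampled = rng.sample(all_files, NUM_FILES)

# Copy the sampled shards into the pool directory, preserving relative paths
# so provenance (global/local shard) stays visible.
POOL_DIR.mkdir(parents=True, exist_ok=True)
for src in sampled:
    dst = POOL_DIR / src.relative_to(SOURCE_DIR)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
```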

@yuzc19 (Contributor) commented Aug 23, 2024

Hi @GeorgiosSmyrnis,

It seems that randomly sampling files is not equivalent to randomly sampling documents, as mentioned in this discussion. Also, the global and local shards seem to already be shuffled at the file level.

Could you explain more about why sampling jsonl files is correct? Thank you!

@GeorgiosSmyrnis (Contributor)

Hi @yuzc19,

Indeed, per the discussion you linked, this is not the same as sampling randomly at the document level.

The reason I recommended sampling at the file level from the RefinedWeb + dedup processed dataset is that this is how we created the pools from the raw data. Given what we have available right now, I believe this is the best way to get something as close as possible to running the RefinedWeb + dedup pipeline on the pools themselves.

@yuzc19 (Contributor) commented Aug 23, 2024

@GeorgiosSmyrnis Thank you for your prompt response! May I ask whether, for DCLM-baseline (not DCLM-refinedweb), you also sampled at the file level to form the final training data?

@GeorgiosSmyrnis (Contributor) commented Aug 23, 2024

@yuzc19 the full DCLM-baseline dataset available on S3 was actually created without sampling - we simply applied our filter to all the raw data we had available (namely, DCLM-pool).
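
As a rough illustration of what applying such a fastText-based filter to a shard of documents can look like (the model file, label name, and score threshold below are placeholders, not the exact DCLM-baseline settings):

```python
import json

import fasttext

model = fasttext.load_model("quality_classifier.bin")  # placeholder model path
POSITIVE_LABEL = "__label__hq"                          # placeholder label name
THRESHOLD = 0.5                                         # placeholder score cutoff

def keep(doc_text: str) -> bool:
    # fastText's predict() expects a single line, so strip newlines first.
    labels, probs = model.predict(doc_text.replace("\n", " "), k=2)
    score = dict(zip(labels, probs)).get(POSITIVE_LABEL, 0.0)
    return score >= THRESHOLD

# Filter one JSONL shard, keeping only documents the classifier scores highly.
with open("shard.jsonl") as fin, open("shard.filtered.jsonl", "w") as fout:
    for line in fin:
        doc = json.loads(line)
        if keep(doc["text"]):
            fout.write(line)
```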

I hope this clarifies things!

@yuzc19 (Contributor) commented Aug 23, 2024

Got it. Thanks

@yuzc19 (Contributor) commented Sep 7, 2024

Just a quick follow-up on this question, @GeorgiosSmyrnis:

Is directly sampling at the file level from the RefinedWeb + dedup processed dataset (with the same number of tokens as would be processed from the original DCLM-pool) considered an acceptable option for participating in the filtering track competition? Thank you!
