
Any plans to release pools after refinedweb heuristic filtering + dedup? #59

CodeCreator opened this issue Aug 19, 2024 · 9 comments

@CodeCreator

Thank you for the great work!

The repo is great for reproducing the entire data processing pipeline, but a lot of people (including me) seem particularly interested in studying the final quality filtering step and ablating the dclm-baseline-fasttext classifier.

It would be really, really helpful if you could release the pools after RefinedWeb heuristic filtering + dedup, i.e. the stage before applying the dclm-baseline-fasttext classifier, because (1) it would make it much easier for small teams of researchers with limited resources / cloud compute budgets to compete, and (2) fixing the exact deduplication setting allows for cleaner comparisons at the quality filtering stage.

Uploading this for at least the 400M and the two 7B pools would be great!

@GeorgiosSmyrnis (Contributor) commented Aug 21, 2024

Hi @CodeCreator,

Thank you for your interest in our work!

We have made the entire dataset after RefinedWeb processing and deduplication available - information on how to access it can be found here: https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/index.html
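
A minimal sketch of streaming one shard over HTTPS, in case it helps (the shard path below is only a placeholder - real paths should be taken from the index linked above - and the files are assumed to be zstd-compressed JSONL, as elsewhere in the DCLM pipeline):

```python
import io
import json

import requests
import zstandard as zstd

BASE = "https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/"
# Placeholder path: substitute a real shard path listed in the index above.
SHARD = "example-global-shard/example-local-shard/example_shard.jsonl.zst"

with requests.get(BASE + SHARD, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    # Decompress the zstd stream on the fly and read it line by line.
    reader = zstd.ZstdDecompressor().stream_reader(resp.raw)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        doc = json.loads(line)
        # Peek at the first document's fields (field names may vary).
        print(doc.get("url"), len(doc.get("text", "")))
        break
```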

Examining the fasttext classifier is a great direction! Hope the above data proves useful!

@CodeCreator (Author)

Hi @GeorgiosSmyrnis,

Thanks for the quick response and for pointing me to the RefinedWeb data. Two questions about this data:

  1. Is there a canonical way of forming the competition pools from this overall data? E.g. specific global_shards / local_shards / indices?
  2. Just to confirm: is this data from before deduplication, so that deduplication should happen after forming the pools?

Thank you!

@GeorgiosSmyrnis (Contributor) commented Aug 23, 2024

Hi @CodeCreator,

  1. To create the pool, we randomly sampled jsonl files from the full dataset - I would recommend randomly sampling the number of jsonl files that corresponds to the scale you want to examine (see the sketch at the end of this comment).
  2. The above dataset is, in fact, after deduplication.

Let us know if the above helps!
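
For concreteness, a minimal sketch of that file-level sampling, assuming the processed shards have already been downloaded locally (the directory names, file count, and seed below are placeholders rather than the official pool settings):

```python
import random
import shutil
from pathlib import Path

SOURCE_DIR = Path("dclm-refinedweb")  # placeholder: local copy of the processed shards
POOL_DIR = Path("my-pool")            # placeholder: destination for the sampled pool
NUM_FILES = 100                       # placeholder: pick to match the target scale's token budget

# Enumerate all jsonl shards, then draw a fixed-size random sample of files.
all_files = sorted(SOURCE_DIR.rglob("*.jsonl.zst"))
rng = random.Random(0)                # fixed seed so the pool is reproducible
sampled = rng.sample(all_files, NUM_FILES)

# Copy the sampled shards into the pool directory, preserving relative paths
# so provenance (global/local shard) stays visible.
POOL_DIR.mkdir(parents=True, exist_ok=True)
for src in sampled:
    dst = POOL_DIR / src.relative_to(SOURCE_DIR)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
```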

@yuzc19 (Contributor) commented Aug 23, 2024

Hi @GeorgiosSmyrnis,

It seems that randomly sampling files is not equivalent to randomly sampling documents, as mentioned in this discussion. Also, the global and local shards seem to already be shuffled at the file level.

Could you explain more about why sampling jsonl files is correct? Thank you!

@GeorgiosSmyrnis (Contributor)

Hi @yuzc19,

Indeed, per the discussion you linked, this is not the same as sampling randomly at the document level.

The reason I recommended sampling at the file level from the RefinedWeb + dedup processed dataset is that this is how we created the pools from the raw data. Given what we have available right now, I believe this is the best way to get something as close as possible to running the RefinedWeb + dedup pipeline on the pools themselves.

@yuzc19 (Contributor) commented Aug 23, 2024

@GeorgiosSmyrnis Thank you for your prompt response! May I ask whether, for DCLM-baseline (not DCLM-refinedweb), you also sampled at the file level to form the final training data?

@GeorgiosSmyrnis (Contributor) commented Aug 23, 2024

@yuzc19 the full DCLM-baseline dataset available on S3 was actually created without sampling - we simply applied our filter to all the raw data we had available (namely, DCLM-pool).
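
As a rough illustration of what applying such a fastText-based filter to a shard of documents can look like (the model file, label name, and score threshold below are placeholders, not the exact DCLM-baseline settings):

```python
import json

import fasttext

model = fasttext.load_model("quality_classifier.bin")  # placeholder model path
POSITIVE_LABEL = "__label__hq"                          # placeholder label name
THRESHOLD = 0.5                                         # placeholder score cutoff

def keep(doc_text: str) -> bool:
    # fastText's predict() expects a single line, so strip newlines first.
    labels, probs = model.predict(doc_text.replace("\n", " "), k=2)
    score = dict(zip(labels, probs)).get(POSITIVE_LABEL, 0.0)
    return score >= THRESHOLD

# Filter one JSONL shard, keeping only documents the classifier scores highly.
with open("shard.jsonl") as fin, open("shard.filtered.jsonl", "w") as fout:
    for line in fin:
        doc = json.loads(line)
        if keep(doc["text"]):
            fout.write(line)
```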

I hope this clarifies things!

@yuzc19 (Contributor) commented Aug 23, 2024

Got it. Thanks

@yuzc19 (Contributor) commented Sep 7, 2024

Just a quick follow-up on this question, @GeorgiosSmyrnis:

Is directly sampling at the file level from the RefinedWeb + dedup processed dataset (with the same number of tokens as would be processed from the original DCLM-pool) considered an acceptable option for participating in the filtering track competition? Thank you!
