Any plans to release pools after refinedweb heuristic filtering + dedup? #59

Thank you for the great work!

The repo is great for reproducing the entire data processing pipeline, but a lot of people (including me) seem particularly interested in studying the final quality filtering step and ablating the dclm-baseline-fasttext classifier.

It would be really helpful if you could release the pools after RefinedWeb heuristic filtering + dedup, i.e. the stage right before the dclm-baseline-fasttext classifier is applied, because (1) it would make it much easier for small teams of researchers with limited resources / cloud compute budgets to compete, and (2) fixing the exact deduplication setting allows for cleaner comparisons at the quality filtering stage.

Uploading this for at least the 400m and the two 7b pools would be great!
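For context, the filtering step in question scores each document with a fastText classifier and keeps the high-scoring ones. A rough sketch of that pattern follows; the model path, label name, and threshold are illustrative assumptions, not the actual DCLM-baseline configuration:

```python
# Hypothetical sketch of fastText-based quality filtering: score each document
# and keep those above a threshold. Model path, label name, and threshold are
# placeholders for illustration only.
import json
import fasttext

model = fasttext.load_model("quality_classifier.bin")  # hypothetical path
THRESHOLD = 0.5  # illustrative cutoff, not the DCLM-baseline setting

def keep(doc_text: str) -> bool:
    # fastText predicts on a single line of text; predict returns (labels, probs).
    labels, probs = model.predict(doc_text.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= THRESHOLD  # assumed label name

with open("shard.jsonl") as fin, open("filtered.jsonl", "w") as fout:
    for line in fin:
        doc = json.loads(line)
        if keep(doc["text"]):
            fout.write(line)
```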
Comments
Hi @CodeCreator, thank you for your interest in our work! We have made the entire dataset after RefinedWeb processing and deduplication available; information on how to access it can be found here: https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/index.html

Examining the fastText classifier is a great direction! Hope the above data proves useful!
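For readers new to the Common Crawl mirror, a minimal download sketch might look like the following. The shard extension and the index format are unverified assumptions here; adjust to whatever the index page actually lists:

```python
# Minimal sketch: list and fetch shards from the DCLM-refinedweb mirror on
# Common Crawl. Assumes (unverified) that the index page links to per-shard
# paths relative to the same prefix.
import requests

BASE = "https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/"

# Fetch the index and collect anything that looks like a shard path.
index = requests.get(BASE + "index.html", timeout=60)
index.raise_for_status()
shard_paths = [tok for tok in index.text.split('"') if tok.endswith(".jsonl.zst")]

# Download the first shard to local disk, streaming to avoid large memory use.
if shard_paths:
    url = BASE + shard_paths[0]
    with requests.get(url, stream=True, timeout=60) as r, open("shard.jsonl.zst", "wb") as f:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```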
Hi @GeorgiosSmyrnis, thanks for the quick response and for pointing me to the RefinedWeb data. Two questions about this data:
Thank you!
Hi @CodeCreator,

Let us know if the above helps!
It seems that randomly sampling files does not equal randomly sampling documents, as mentioned in this discussion. Also, the global and local shards seem to be already shuffled at the file level. Could you explain more about why sampling jsonl files is correct? Thank you!
Hi @yuzc19, indeed, per the discussion you linked this is not the same as sampling randomly at the document level. The reason I recommended sampling at the file level from the RefinedWeb + dedup processed dataset is that this is the way we created the pools from the raw data; given what we have available right now, I believe this is the closest approximation to running the RefinedWeb + dedup pipeline on the pools themselves.
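For concreteness, file-level sampling along these lines might look like the sketch below; the directory layout, file extension, and file count are illustrative assumptions, not the official pipeline:

```python
# Minimal sketch of file-level sampling: pick a random subset of .jsonl.zst
# shard files rather than sampling individual documents. Paths and the target
# file count are placeholders.
import random
from pathlib import Path

random.seed(0)  # fix the seed so the sampled pool is reproducible

all_shards = sorted(Path("DCLM-refinedweb").rglob("*.jsonl.zst"))
num_files = 100  # choose so the sampled shards hold roughly the token budget you need
sampled = random.sample(all_shards, k=min(num_files, len(all_shards)))

for shard in sampled:
    print(shard)  # feed these file paths into downstream filtering / training
```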
@GeorgiosSmyrnis Thank you for your prompt response! May I ask whether, for DCLM-baseline (not DCLM-refinedweb), you also sampled at the file level to form the final training data?
Got it. Thanks!
Just a quick follow-up on this question, @GeorgiosSmyrnis: is directly sampling at the file level from the RefinedWeb + dedup processed dataset (matching the token count that would come from processing the original dclm-pool) considered an acceptable option for participating in the filtering track of the competition? Thank you!
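One hypothetical way to match such a token budget at the file level is sketched below; the tokenizer choice, the "text" field name, the budget, and the assumption that shards have been decompressed to plain .jsonl are all illustrative:

```python
# Hypothetical sketch: greedily take shuffled shard files until a token budget
# is met, so the file-level sample matches a target token count.
import json
import random
from pathlib import Path
from transformers import AutoTokenizer

random.seed(0)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer
BUDGET = 1_000_000_000  # e.g. ~1B tokens; set to your pool's scale

shards = sorted(Path("DCLM-refinedweb").rglob("*.jsonl"))  # assumes decompressed shards
random.shuffle(shards)

total, chosen = 0, []
for shard in shards:
    with open(shard) as f:
        n = sum(len(tokenizer.encode(json.loads(line)["text"])) for line in f)
    chosen.append(shard)
    total += n
    if total >= BUDGET:
        break
print(f"selected {len(chosen)} files, ~{total} tokens")
```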