What are the pretraining scripts? #68
Hi @mathfinder, you can find the training configuration files under … Please let us know if the above helps!
A small follow-up: for the largest model that we trained (beyond the competition scales), we used the same hyperparameters as the 7B-2x scale, along with the cooldown process described in Appendix P of our paper.
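For context, a learning-rate cooldown typically means holding the peak learning rate and then decaying it toward zero over the final fraction of training. The sketch below is a generic linear cooldown for illustration only; the function name, the `cooldown_frac` parameter, and the linear shape are assumptions, not the exact schedule from the paper's Appendix P.

```python
# Hypothetical sketch of a linear learning-rate cooldown; NOT the exact
# schedule from the paper's Appendix P.
def lr_with_cooldown(step, total_steps, peak_lr, cooldown_frac=0.2):
    """Constant LR, then linear decay to zero over the last cooldown_frac of steps."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < cooldown_start:
        return peak_lr
    remaining = total_steps - cooldown_start
    return peak_lr * (total_steps - step) / remaining

print(lr_with_cooldown(0, 1000, 3e-4))    # peak phase
print(lr_with_cooldown(999, 1000, 3e-4))  # near zero at the end
```

The choice of `cooldown_frac` and the decay shape (linear vs. cosine, decaying to zero vs. a floor) are schedule design decisions; consult the paper for the values actually used.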
Thanks for your reply, but I am still confused. In the following pre-training script template, which one should be filled in?
Hi @mathfinder, for the DCLM-baseline dataset, you will first need to download DCLM-baseline from here: https://data.commoncrawl.org/contrib/datacomp/DCLM-baseline/index.html and create an untokenized JSON file similar to the ones found in … After doing so, you should tokenize it using the instructions in this repository. This will produce a new …
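The "untokenized JSON file" step above generally means a JSONL file with one raw-text document per line. The sketch below shows that shape; the filename and the `"text"` key are assumptions here, so check the reference files in the repository for the exact schema before tokenizing.

```python
import json

# Hypothetical example of an untokenized JSONL shard: one JSON object
# per line, each holding the raw document text. The "text" key and the
# filename are assumptions; match them to the repo's reference files.
docs = [
    {"text": "First document."},
    {"text": "Second document."},
]

with open("dclm_baseline_shard.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Reading it back: parse each line independently.
with open("dclm_baseline_shard.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

JSONL (rather than one big JSON array) lets the tokenizer stream and shard the data without loading everything into memory.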
Hi @GeorgiosSmyrnis,
And then I use the following script to tokenize the datasets:
If I tune …, my aim is to reproduce the highlighted experiment below.
And by the way, could you please provide the tokenized and shuffled dataset so that we can reproduce the experiment directly?
Hi @mathfinder !
It would be very helpful for reproduction with multiple versions of the data, especially the sampled ones!
Oh, that sounds amazing! I'm really looking forward to seeing your progress. |
It seems that … Is it necessary to turn off `--no_shuffle`? If I use Spark to tokenize, can I reproduce your experiment without keeping the same shuffle process as yours?
Hi @mathfinder, this script should work on local datasets as well, as long as you spin up a Ray cluster locally. Are you encountering any specific errors when trying to do so?
Thank you for your excellent work. If I want to use this data for pretraining and conduct a rigorous comparison with the DCLM-BASELINE 7B model mentioned here, what hyper-parameters should I use? Could you provide the corresponding script? Thank you.