What are the pretraining scripts? #68

Open
mathfinder opened this issue Aug 30, 2024 · 12 comments

@mathfinder

Thank you for your excellent work. If I want to use this data for pretraining and conduct a rigorous comparison with the DCLM-BASELINE 7B model mentioned here, what hyper-parameters should I use? Could you provide the corresponding script? Thank you.

@GeorgiosSmyrnis
Contributor

Hi @mathfinder ,

You can find the training configuration files under training/configs. E.g., for the 7B-1x scale, the corresponding config with all the hyperparameters is https://github.com/mlfoundations/dclm/blob/main/training/configs/7b_1x_fast_2e-3_lr_5e-6_zloss.json
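
(For reference, a quick way to pull down and inspect that config from the command line, assuming the usual raw.githubusercontent.com mirror of the file:)

curl -sL https://raw.githubusercontent.com/mlfoundations/dclm/main/training/configs/7b_1x_fast_2e-3_lr_5e-6_zloss.json \
  | python -m json.tool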

Please let us know if the above helps!

@GeorgiosSmyrnis
Contributor

GeorgiosSmyrnis commented Aug 30, 2024

A small follow-up: for the largest model that we trained (beyond the competition scales), we used the same hyperparameters as the 7B-2x scale, along with the cooldown process described in Appendix P of our paper.

@mathfinder
Author

Thanks for your reply, but I am still confused.

Given the following pre-training script template:

torchrun --nproc-per-node 8 -m training.train --scale <scale> <tokenized_json> --logs <log_dir> [--remote-sync <s3_bucket>] [--chinchilla-multiplier <multiplier>] [--clean-exp] [--report-to-wandb]

What should be filled in for <tokenized_json> to align with the results reported on the leaderboard?

@mathfinder
Author

I guess data-config should be exp_data/datasets/tokenized/c4_original.json? Something like the following script:
torchrun --nproc-per-node 8 -m training.train --scale="7b_2x_fast_2e-3_lr_5e-6_zloss" --data-config="exp_data/datasets/tokenized/c4_original.json" --report-to-wandb

@GeorgiosSmyrnis
Contributor

Hi @mathfinder ,

For the DCLM-baseline dataset, you will first need to download DCLM-baseline from here: https://data.commoncrawl.org/contrib/datacomp/DCLM-baseline/index.html and create an untokenized json file similar to the ones found in exp_data/datasets/raw_sources.

After doing so, you should tokenize it using the instructions in this repository. This will produce a new json file under exp_data/datasets/tokenized, which you can then pass as <tokenized_json> via --data-config.
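
(As a concrete sketch, assuming the tokenization step produced a file named exp_data/datasets/tokenized/dclm_baseline.json (the actual name will depend on your setup), the training launch for the 7B-2x scale would then look like:)

torchrun --nproc-per-node 8 -m training.train \
  --scale 7b_2x_fast_2e-3_lr_5e-6_zloss \
  --data-config exp_data/datasets/tokenized/dclm_baseline.json \
  --report-to-wandb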

@mathfinder
Author

Hi @GeorgiosSmyrnis ,
I have downloaded the dataset with the following code:

import os
import sys

# Enable hf_transfer for faster downloads (requires the hf_transfer package);
# must be set before huggingface_hub is imported.
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'
from huggingface_hub import snapshot_download

# Optional glob pattern (e.g. "sample/350BT/008_*") restricting which files to fetch.
pattern = None
if len(sys.argv) > 1:
    pattern = sys.argv[1]
    print(f'{"#"*30} {pattern} {"#"*30}')

snapshot_download(repo_id="mlfoundations/dclm-baseline-1.0",
                  repo_type="dataset",
                  revision="main",
                  allow_patterns=pattern,   # None downloads the full dataset
                  local_dir="path/to/dclm",
                  local_dir_use_symlinks=False,
                  resume_download=True,
                  max_workers=32)
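
(Hypothetical usage, assuming the snippet above is saved as download_dclm.py; the shard pattern below is only an illustration of the optional argument:)

python download_dclm.py "global-shard_01_of_10/*"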

Then I use the following script to tokenize the dataset:

python ray_processing/tokenize_shuffle.py \
--input /path/to/untokenized_data \
--readable_name dclm \
--output /path/to/tokenized_data \
--content_key text

If I turn on do_sample, the code looks for the non-existent file DCLM/ray_processing/tokenization_configs/rpj_lm_data.yaml. Do I need to turn it on?

My aim is to reproduce the highlighted experiment below.

[screenshot: the highlighted leaderboard experiment to reproduce]

@mathfinder
Author

And by the way, could you please provide the tokenized and shuffled dataset so that we can directly reproduce the experiment?

@GeorgiosSmyrnis
Contributor

Hi @mathfinder !

  • For the highlighted experiment, you don't need to do upsampling / downsampling of sources, so you don't need the --do_sample parameter or the associated yaml file - you can safely ignore this.
  • I will check in with the rest of the team regarding the tokenized dataset - given the size of these datasets there are some considerations regarding hosting multiple versions of the data.

@Taoer1996

Taoer1996 commented Sep 3, 2024


It would be very helpful for reproduction to have multiple versions of the data available, especially the sampled ones!

@mathfinder
Author


Oh, that sounds amazing! I'm really looking forward to seeing your progress.

@mathfinder
Author

It seems that ray_processing/tokenize_shuffle.py heavily depends on S3, so using a local dataset will require a lot of changes.

Is it necessary to leave --no_shuffle off (i.e., is shuffling required)? If I use Spark to tokenize, can I reproduce your experiment without keeping the same shuffle process as yours?

@GeorgiosSmyrnis
Contributor

Hi @mathfinder ,

This script should work on local datasets as well, as long as you spin up a ray cluster locally. Are you encountering any specific errors when trying to do so?
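
(A minimal sketch of a fully local run, assuming a single-node Ray cluster and local input/output paths; the flags are the same as in the command above:)

# Start a single-node Ray cluster on the local machine.
ray start --head

# Tokenize and shuffle using local paths instead of S3.
python ray_processing/tokenize_shuffle.py \
  --input /local/path/to/dclm_baseline \
  --readable_name dclm \
  --output /local/path/to/tokenized_data \
  --content_key text

# Shut the cluster down when finished.
ray stop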

Mivg assigned Mivg and GeorgiosSmyrnis, then unassigned Mivg, Sep 3, 2024