What are the pretraining scripts? #68

Open
mathfinder opened this issue Aug 30, 2024 · 12 comments

@mathfinder

Thank you for your excellent work. If I want to use this data for pretraining and conduct a rigorous comparison with the DCLM-BASELINE 7B model mentioned here, what hyper-parameters should I use? Could you provide the corresponding script? Thank you.

@GeorgiosSmyrnis
Contributor

Hi @mathfinder ,

You can find the training configuration files under training/configs. E.g., for the 7B-1x scale, the corresponding config with all the hyperparameters is https://github.com/mlfoundations/dclm/blob/main/training/configs/7b_1x_fast_2e-3_lr_5e-6_zloss.json
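
(For reference, a quick way to pull down and inspect that config from the command line, assuming the usual raw.githubusercontent.com mirror of the file:)

curl -sL https://raw.githubusercontent.com/mlfoundations/dclm/main/training/configs/7b_1x_fast_2e-3_lr_5e-6_zloss.json \
  | python -m json.tool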

Please let us know if the above helps!

@GeorgiosSmyrnis
Contributor

GeorgiosSmyrnis commented Aug 30, 2024

A small follow-up: for the largest model that we trained (beyond the competition scales), we used the same hyperparameters as the 7B-2x scale, along with the cooldown process described in Appendix P of our paper.

@mathfinder
Author

Thanks for your reply, but I am still confused.

Given the following pre-training script template:

torchrun --nproc-per-node 8 -m training.train --scale <scale> <tokenized_json> --logs <log_dir> [--remote-sync <s3_bucket>] [--chinchilla-multiplier <multiplier>] [--clean-exp] [--report-to-wandb]

What should be filled in for <tokenized_json> to align with the results reported on the leaderboard?

@mathfinder
Author

I guess data-config should be exp_data/datasets/tokenized/c4_original.json? Something like the following script:
torchrun --nproc-per-node 8 -m training.train --scale="7b_2x_fast_2e-3_lr_5e-6_zloss" --data-config="exp_data/datasets/tokenized/c4_original.json" --report-to-wandb

@GeorgiosSmyrnis
Contributor

Hi @mathfinder ,

For the DCLM-baseline dataset, you will first need to download DCLM-baseline from here: https://data.commoncrawl.org/contrib/datacomp/DCLM-baseline/index.html and create an untokenized json file similar to the ones found in exp_data/datasets/raw_sources.

After doing so, you should tokenize it using the instructions in this repository. This will produce a new json file under exp_data/datasets/tokenized, which you can then pass as <tokenized_json> via --data-config.
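
(As a concrete sketch, assuming the tokenization step produced a file named exp_data/datasets/tokenized/dclm_baseline.json (the actual name will depend on your setup), the training launch for the 7B-2x scale would then look like:)

torchrun --nproc-per-node 8 -m training.train \
  --scale 7b_2x_fast_2e-3_lr_5e-6_zloss \
  --data-config exp_data/datasets/tokenized/dclm_baseline.json \
  --report-to-wandb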

@mathfinder
Author

Hi @GeorgiosSmyrnis ,
I have downloaded the dataset with the following code:

import os
import sys

# Enable hf_transfer for faster downloads (requires the hf_transfer package);
# must be set before huggingface_hub is imported.
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'
from huggingface_hub import snapshot_download

# Optional glob pattern (e.g. "sample/350BT/008_*") restricting which files to fetch.
pattern = None
if len(sys.argv) > 1:
    pattern = sys.argv[1]
    print(f'{"#"*30} {pattern} {"#"*30}')

snapshot_download(repo_id="mlfoundations/dclm-baseline-1.0",
                  repo_type="dataset",
                  revision="main",
                  allow_patterns=pattern,   # None downloads the full dataset
                  local_dir="path/to/dclm",
                  local_dir_use_symlinks=False,
                  resume_download=True,
                  max_workers=32)
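
(Hypothetical usage, assuming the snippet above is saved as download_dclm.py; the shard pattern below is only an illustration of the optional argument:)

python download_dclm.py "global-shard_01_of_10/*"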

Then I use the following script to tokenize the dataset:

python ray_processing/tokenize_shuffle.py \
--input /path/to/untokenized_data \
--readable_name dclm \
--output /path/to/tokenized_data \
--content_key text

If I turn on do_sample, the code looks for the non-existent file DCLM/ray_processing/tokenization_configs/rpj_lm_data.yaml. Do I need to turn it on?

My aim is to reproduce the highlighted experiment below.

[screenshot: the highlighted leaderboard experiment to reproduce]

@mathfinder
Author

And by the way, could you please provide the tokenized and shuffled dataset so that we can directly reproduce the experiment?

@GeorgiosSmyrnis
Contributor

Hi @mathfinder !

  • For the highlighted experiment, you don't need to do upsampling / downsampling of sources, so you don't need the --do_sample parameter or the associated yaml file - you can safely ignore this.
  • I will check in with the rest of the team regarding the tokenized dataset - given the size of these datasets there are some considerations regarding hosting multiple versions of the data.

@Taoer1996

Taoer1996 commented Sep 3, 2024


It would be very helpful for reproduction to have multiple versions of the data available, especially the sampled ones!

@mathfinder
Author


Oh, that sounds amazing! I'm really looking forward to seeing your progress.

@mathfinder
Author

It seems that ray_processing/tokenize_shuffle.py heavily depends on S3, so using a local dataset will require a lot of changes.

Is it necessary to leave --no_shuffle off (i.e., is shuffling required)? If I use Spark to tokenize, can I reproduce your experiment without keeping the same shuffle process as yours?

@GeorgiosSmyrnis
Contributor

Hi @mathfinder ,

This script should work on local datasets as well, as long as you spin up a ray cluster locally. Are you encountering any specific errors when trying to do so?
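
(A minimal sketch of a fully local run, assuming a single-node Ray cluster and local input/output paths; the flags are the same as in the command above:)

# Start a single-node Ray cluster on the local machine.
ray start --head

# Tokenize and shuffle using local paths instead of S3.
python ray_processing/tokenize_shuffle.py \
  --input /local/path/to/dclm_baseline \
  --readable_name dclm \
  --output /local/path/to/tokenized_data \
  --content_key text

# Shut the cluster down when finished.
ray stop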

Mivg assigned Mivg and GeorgiosSmyrnis, then unassigned Mivg, Sep 3, 2024