
Commit

update
Junbo Li committed Dec 8, 2023
1 parent 126e827 commit 8784141
Showing 12 changed files with 94,135 additions and 29 deletions.
107 changes: 102 additions & 5 deletions README.md
@@ -3,16 +3,113 @@
This repository contains the code for preparing the training dataset for [CrystalCoder](https://huggingface.co/LLM360/CrystalCoder), a 7B-parameter language model pre-trained on code and natural language.

The processed dataset for each phase is available at [CrystalCoderDatasets](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets). This repository contains the code for processing the dataset from scratch.
Broadly, we adhere to the procedure outlined in [Cerebras' Model Zoo](https://github.com/Cerebras/modelzoo/tree/main/modelzoo/transformers/data_processing/scripts). Specifically, the data is prepared in the following steps:

1. Download the untokenized SlimPajama and StarCoder data from the sources.
2. Tokenize the data and concatenate documents up to the maximum sequence length. For the SlimPajama dataset, the tokenized files are split evenly into two sections by the parity (even or odd) of their file numbers, for use in Stage 1 and Stage 2, respectively.
3. Apply Fill-In-the-Middle (FIM) augmentation on the tokenized StarCoder data.
4. Shuffle data within each domain and across epochs if there are multiple epochs.

## Step 1: Data and Code Downloading
```
mkdir data
cd data
git lfs install
# SlimPajama data
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
# StarCoder data
git clone https://huggingface.co/datasets/bigcode/starcoderdata
cd ../
# Code
git clone https://github.com/Cerebras/modelzoo.git
```


## Step 2: Tokenization and Sequence Concatenation

We tokenize the [SlimPajama dataset](https://huggingface.co/datasets/cerebras/SlimPajama-627B) (in `jsonl` format) and [StarCoder dataset](https://huggingface.co/datasets/bigcode/starcoderdata) (in `parquet` format) to `hdf5` format. This is done using the [`create_hdf5_dataset.py`](https://github.com/Cerebras/modelzoo/blob/main/modelzoo/transformers/data_processing/scripts/hdf5_preprocessing/create_hdf5_dataset.py) script.

### SlimPajama data

#### Data split

The tokenized SlimPajama chunks are divided into two halves by the parity of their file numbers: even-numbered chunks go to Stage 1 and odd-numbered chunks to Stage 2.

```
# Split each tokenized SlimPajama directory into two halves by chunk parity:
# even-numbered chunks form part 0 (Stage 1), odd-numbered chunks part 1 (Stage 2).
for i in `ls | grep train_packed | grep -v "_part[01]of2"`
do
  echo $i
  for part in {0..1}
  do
    echo " Part $part"
    dirname="${i}_part${part}of2"
    mkdir -p $dirname
    pushd . >&/dev/null
    cd $dirname
    # Symlink every chunk whose numeric id matches the parity of this part.
    for h5chunk in `ls ../$i/data-*.h5 | sort`
    do
      # Extract the numeric chunk id, stripping leading zeros (empty -> 0).
      chunkid=`echo $h5chunk | sed 's/.*data-[0]*//' | sed 's/\.h5//' | sed 's/^$/0/'`
      if [ $(($chunkid % 2)) == $part ]
      then
        ln -s $h5chunk
      fi
    done
    popd >&/dev/null
  done
done
```


### StarCoder data

First, we convert the original `parquet` format to `jsonl` format.

```
python parquet2jsonl.py
```
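
The `parquet2jsonl.py` script is not reproduced here. A minimal sketch of such a conversion, assuming the StarCoder `parquet` files expose a `content` column and that the downstream preprocessor reads a `text` field from each `jsonl` line (both are assumptions, not confirmed by this repository), could look like:

```
# Hypothetical sketch of a parquet -> jsonl conversion, not the actual
# parquet2jsonl.py; the "content" column and "text" key are assumptions.
import json
from pathlib import Path

import pyarrow.parquet as pq

SRC = Path("data/starcoderdata")        # per-language subfolders of .parquet files
DST = Path("data/starcoderdata_jsonl")  # mirrored layout of .jsonl files

for parquet_file in SRC.rglob("*.parquet"):
    out_path = DST / parquet_file.relative_to(SRC).with_suffix(".jsonl")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    table = pq.read_table(parquet_file, columns=["content"])
    with out_path.open("w") as fout:
        for content in table.column("content").to_pylist():
            fout.write(json.dumps({"text": content}) + "\n")
```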

#### Stage 2

We tokenize the `jsonl` files from all programming languages together:

```
python -B modelzoo/transformers/data_processing/scripts/hdf5_preprocessing/create_hdf5_dataset.py LMData \
--params configs/star_tokenizer_config.yaml \
--input_dir ./data/starcoderdata_jsonl --eos_id 2 --pad_id 2 \
--max_seq_length 2048 --output_dir ./data/starcoderdata_tokenized \
--seed 45 --processes 4 --split_text_to_tokenize True \
--ignore_bos_in_split_text True \
--encoder_file ./tokenizer.json
```

#### Stage 3

Here we tokenize the `Python`, `HTML`, `JavaScript`, and `CSS` subfolders independently, using similar scripts:
```
bash scripts/script.sh
```
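
`scripts/script.sh` is not shown in this commit excerpt. A plausible sketch, assuming it simply loops over the four per-language configs added in this commit with the same flags as the Stage 2 command above (the actual script contents are not confirmed), is:

```
# Hypothetical sketch of scripts/script.sh; the loop and flags mirror the
# Stage 2 command above but are assumptions, not the actual script.
# Input/output directories come from each per-language config file.
for lang in python html javascript css
do
  python -B modelzoo/transformers/data_processing/scripts/hdf5_preprocessing/create_hdf5_dataset.py LMData \
    --params configs/tokenizer_config_${lang}.yaml \
    --eos_id 2 --pad_id 2 --max_seq_length 2048 \
    --seed 45 --processes 4 --split_text_to_tokenize True \
    --ignore_bos_in_split_text True \
    --encoder_file ./tokenizer.json
done
```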


## Step 3: FIM Augmentation for StarCoder data

In the tokenized StarCoder dataset, we apply **token-level** FIM augmentation with a constant SPM rate of 0.5, using the `fim_hdf5.py` script from this repository. For Stage 2, the FIM rate is set to 0.9; for Stage 3, it is lowered to 0.3. In both stages, we train on the corresponding StarCoder data for several epochs, and FIM is applied independently to each epoch. Consequently, we prepare and store the data for every epoch on disk before training begins.
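
To make the transformation concrete, here is a simplified, self-contained sketch of token-level FIM with PSM/SPM reordering. It only illustrates the idea; the actual logic lives in `fim_hdf5.py`, and the sentinel-token names below are placeholders rather than the tokenizer's real special tokens.

```
# Illustrative token-level FIM transform; the real fim_hdf5.py implementation
# and sentinel token ids may differ. Rates follow the text above.
import numpy as np

PREFIX, MIDDLE, SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"  # placeholders

def apply_fim(tokens, rng, fim_rate=0.9, spm_rate=0.5):
    """Randomly rewrite a token sequence into FIM order."""
    if rng.random() > fim_rate:
        # Leave (1 - fim_rate) of the documents untouched.
        return list(tokens)
    # Pick two token-level cut points and split the document into three spans.
    lo, hi = sorted(rng.integers(0, len(tokens) + 1, size=2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    if rng.random() < spm_rate:
        # SPM ordering: sentinels up front, suffix before prefix and middle.
        return [PREFIX, SUFFIX, *suffix, MIDDLE, *prefix, *middle]
    # PSM ordering: prefix, then suffix, then the middle to be infilled.
    return [PREFIX, *prefix, SUFFIX, *suffix, MIDDLE, *middle]

rng = np.random.default_rng(0)
print(apply_fim(list(range(12)), rng, fim_rate=0.9, spm_rate=0.5))
```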

#### Stage 2

```
python fim_hdf5_stage2.py
```

#### Stage 3
```
python fim_hdf5_stage3.py
```

## Step 4: Shuffling and Mixing

We shuffle and mix data from different sources and epochs for each stage, following [`h5_dataset_shuffle.py`](https://github.com/Cerebras/modelzoo/blob/main/modelzoo/transformers/data_processing/scripts/hdf5_shuffling/h5_dataset_shuffle.py).

```
bash scripts/shuffle.sh
```
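
For intuition, the following is a conceptual illustration of what shuffling packed `hdf5` sequences involves; the real pipeline uses `h5_dataset_shuffle.py` via `scripts/shuffle.sh`, and this in-memory sketch (with an assumed dataset name of `data`) is not that implementation.

```
# Conceptual shuffle of packed sequences across source/epoch files; a real
# implementation would avoid loading everything into memory at once.
# The hdf5 dataset name "data" is an assumption.
import glob
import h5py
import numpy as np

def shuffle_h5(input_glob, output_path, seed=45):
    chunks = []
    for path in sorted(glob.glob(input_glob)):
        with h5py.File(path, "r") as f:
            chunks.append(f["data"][:])
    samples = np.concatenate(chunks, axis=0)
    # Apply one global permutation over all sequences and write a single file.
    perm = np.random.default_rng(seed).permutation(len(samples))
    with h5py.File(output_path, "w") as f:
        f.create_dataset("data", data=samples[perm])

# Example (paths are illustrative):
# shuffle_h5("data/stage2_sources/*/data-*.h5", "data/stage2_shuffled.h5")
```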
35 changes: 35 additions & 0 deletions configs/star_tokenizer_config.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl"
output_dir: "data/starcoderdata_tokenized"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
35 changes: 35 additions & 0 deletions configs/tokenizer_config_css.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl_split/train/css"
output_dir: "data/starcoderdata_tokenized_stage3_split/train/css"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
35 changes: 35 additions & 0 deletions configs/tokenizer_config_html.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl_split/train/html"
output_dir: "data/starcoderdata_tokenized_stage3_split/train/html"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
35 changes: 35 additions & 0 deletions configs/tokenizer_config_javascript.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl_split/train/javascript"
output_dir: "data/starcoderdata_tokenized_stage3_split/train/javascript"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
35 changes: 35 additions & 0 deletions configs/tokenizer_config_python.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl_split/train/python"
output_dir: "data/starcoderdata_tokenized_stage3_split/train/python"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
