
Commit

update
Junbo Li committed Dec 8, 2023
1 parent 126e827 commit 8784141
Showing 12 changed files with 94,135 additions and 29 deletions.
107 changes: 102 additions & 5 deletions README.md
@@ -3,16 +3,113 @@
This repository contains the code for preparing the training dataset for [CrystalCoder](https://huggingface.co/LLM360/CrystalCoder), a 7B-parameter language model pre-trained on code and natural language.

The processed dataset for each phase is available at [CrystalCoderDatasets](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets). This repository contains the code for processing the dataset from scratch.
Broadly, we adhere to the procedure outlined in [Cerebras' Model Zoo](https://github.com/Cerebras/modelzoo/tree/main/modelzoo/transformers/data_processing/scripts). Specifically, the data is prepared in the following steps:

1. Download the untokenized SlimPajama and StarCoder data from the sources.
2. Tokenize the data and concatenate documents up to the maximum sequence length. For the SlimPajama dataset, the tokenized files are split evenly into two sections by the parity (even or odd) of their file numbers, for use in Stage 1 and Stage 2, respectively.
3. Apply Fill-In-the-Middle (FIM) augmentation on the tokenized StarCoder data.
4. Shuffle data within each domain and across epochs if there are multiple epochs.

## Step 1: Data and Code Downloading
```
mkdir data
cd data
git lfs install
# SlimPajama data
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
# StarCoder data
git clone https://huggingface.co/datasets/bigcode/starcoderdata
cd ../
# Code
git clone https://github.com/Cerebras/modelzoo.git
```


## Step 2: Tokenization and Sequence Concatenation

We tokenize the [SlimPajama dataset](https://huggingface.co/datasets/cerebras/SlimPajama-627B) (in `jsonl` format) and [StarCoder dataset](https://huggingface.co/datasets/bigcode/starcoderdata) (in `parquet` format) to `hdf5` format. This is done using the [`create_hdf5_dataset.py`](https://github.com/Cerebras/modelzoo/blob/main/modelzoo/transformers/data_processing/scripts/hdf5_preprocessing/create_hdf5_dataset.py) script.

### SlimPajama data

#### Data split

The tokenized SlimPajama chunks are divided into two halves by the parity of their file numbers: even-numbered chunks go to Stage 1 and odd-numbered chunks to Stage 2.

```
# Split each tokenized SlimPajama directory into two halves by chunk parity:
# even-numbered chunks form part 0 (Stage 1), odd-numbered chunks part 1 (Stage 2).
for i in `ls | grep train_packed | grep -v "_part[01]of2"`
do
  echo $i
  for part in {0..1}
  do
    echo " Part $part"
    dirname="${i}_part${part}of2"
    mkdir -p $dirname
    pushd . >&/dev/null
    cd $dirname
    # Symlink every chunk whose numeric id matches the parity of this part.
    for h5chunk in `ls ../$i/data-*.h5 | sort`
    do
      # Extract the numeric chunk id, stripping leading zeros (empty -> 0).
      chunkid=`echo $h5chunk | sed 's/.*data-[0]*//' | sed 's/\.h5//' | sed 's/^$/0/'`
      if [ $(($chunkid % 2)) == $part ]
      then
        ln -s $h5chunk
      fi
    done
    popd >&/dev/null
  done
done
```


### StarCoder data

First, we convert the original `parquet` format to `jsonl` format.

```
python parquet2jsonl.py
```
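
The `parquet2jsonl.py` script is not reproduced here. A minimal sketch of such a conversion, assuming the StarCoder `parquet` files expose a `content` column and that the downstream preprocessor reads a `text` field from each `jsonl` line (both are assumptions, not confirmed by this repository), could look like:

```
# Hypothetical sketch of a parquet -> jsonl conversion, not the actual
# parquet2jsonl.py; the "content" column and "text" key are assumptions.
import json
from pathlib import Path

import pyarrow.parquet as pq

SRC = Path("data/starcoderdata")        # per-language subfolders of .parquet files
DST = Path("data/starcoderdata_jsonl")  # mirrored layout of .jsonl files

for parquet_file in SRC.rglob("*.parquet"):
    out_path = DST / parquet_file.relative_to(SRC).with_suffix(".jsonl")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    table = pq.read_table(parquet_file, columns=["content"])
    with out_path.open("w") as fout:
        for content in table.column("content").to_pylist():
            fout.write(json.dumps({"text": content}) + "\n")
```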

#### Stage 2

We tokenize the `jsonl` files from all programming languages together:

```
python -B modelzoo/transformers/data_processing/scripts/hdf5_preprocessing/create_hdf5_dataset.py LMData \
--params configs/star_tokenizer_config.yaml \
--input_dir ./data/starcoderdata_jsonl --eos_id 2 --pad_id 2 \
--max_seq_length 2048 --output_dir ./data/starcoderdata_tokenized \
--seed 45 --processes 4 --split_text_to_tokenize True \
--ignore_bos_in_split_text True \
--encoder_file ./tokenizer.json
```

#### Stage 3

Here we tokenize the `Python`, `HTML`, `JavaScript`, and `CSS` subfolders independently, using similar scripts:
```
bash scripts/script.sh
```
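
`scripts/script.sh` is not shown in this commit excerpt. A plausible sketch, assuming it simply loops over the four per-language configs added in this commit with the same flags as the Stage 2 command above (the actual script contents are not confirmed), is:

```
# Hypothetical sketch of scripts/script.sh; the loop and flags mirror the
# Stage 2 command above but are assumptions, not the actual script.
# Input/output directories come from each per-language config file.
for lang in python html javascript css
do
  python -B modelzoo/transformers/data_processing/scripts/hdf5_preprocessing/create_hdf5_dataset.py LMData \
    --params configs/tokenizer_config_${lang}.yaml \
    --eos_id 2 --pad_id 2 --max_seq_length 2048 \
    --seed 45 --processes 4 --split_text_to_tokenize True \
    --ignore_bos_in_split_text True \
    --encoder_file ./tokenizer.json
done
```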


## Step 3: FIM Augmentation for StarCoder data

In the tokenized StarCoder dataset, we apply **token-level** FIM augmentation with a constant SPM rate of 0.5, using the `fim_hdf5.py` script from this repository. For Stage 2, the FIM rate is set to 0.9; for Stage 3, it is lowered to 0.3. In both stages, we train on the corresponding StarCoder data for several epochs, and FIM is applied independently to each epoch. Consequently, we prepare and store the data for every epoch on disk before training begins.
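
To make the transformation concrete, here is a simplified, self-contained sketch of token-level FIM with PSM/SPM reordering. It only illustrates the idea; the actual logic lives in `fim_hdf5.py`, and the sentinel-token names below are placeholders rather than the tokenizer's real special tokens.

```
# Illustrative token-level FIM transform; the real fim_hdf5.py implementation
# and sentinel token ids may differ. Rates follow the text above.
import numpy as np

PREFIX, MIDDLE, SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"  # placeholders

def apply_fim(tokens, rng, fim_rate=0.9, spm_rate=0.5):
    """Randomly rewrite a token sequence into FIM order."""
    if rng.random() > fim_rate:
        # Leave (1 - fim_rate) of the documents untouched.
        return list(tokens)
    # Pick two token-level cut points and split the document into three spans.
    lo, hi = sorted(rng.integers(0, len(tokens) + 1, size=2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    if rng.random() < spm_rate:
        # SPM ordering: sentinels up front, suffix before prefix and middle.
        return [PREFIX, SUFFIX, *suffix, MIDDLE, *prefix, *middle]
    # PSM ordering: prefix, then suffix, then the middle to be infilled.
    return [PREFIX, *prefix, SUFFIX, *suffix, MIDDLE, *middle]

rng = np.random.default_rng(0)
print(apply_fim(list(range(12)), rng, fim_rate=0.9, spm_rate=0.5))
```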

#### Stage 2

```
python fim_hdf5_stage2.py
```

#### Stage 3
```
python fim_hdf5_stage3.py
```

## Step 4: Shuffling and Mixing

We shuffle and mix data from different sources and epochs for each stage, following [`h5_dataset_shuffle.py`](https://github.com/Cerebras/modelzoo/blob/main/modelzoo/transformers/data_processing/scripts/hdf5_shuffling/h5_dataset_shuffle.py).

```
bash scripts/shuffle.sh
```
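
For intuition, the following is a conceptual illustration of what shuffling packed `hdf5` sequences involves; the real pipeline uses `h5_dataset_shuffle.py` via `scripts/shuffle.sh`, and this in-memory sketch (with an assumed dataset name of `data`) is not that implementation.

```
# Conceptual shuffle of packed sequences across source/epoch files; a real
# implementation would avoid loading everything into memory at once.
# The hdf5 dataset name "data" is an assumption.
import glob
import h5py
import numpy as np

def shuffle_h5(input_glob, output_path, seed=45):
    chunks = []
    for path in sorted(glob.glob(input_glob)):
        with h5py.File(path, "r") as f:
            chunks.append(f["data"][:])
    samples = np.concatenate(chunks, axis=0)
    # Apply one global permutation over all sequences and write a single file.
    perm = np.random.default_rng(seed).permutation(len(samples))
    with h5py.File(output_path, "w") as f:
        f.create_dataset("data", data=samples[perm])

# Example (paths are illustrative):
# shuffle_h5("data/stage2_sources/*/data-*.h5", "data/stage2_shuffled.h5")
```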
35 changes: 35 additions & 0 deletions configs/star_tokenizer_config.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl"
output_dir: "data/starcoderdata_tokenized"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
35 changes: 35 additions & 0 deletions configs/tokenizer_config_css.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl_split/train/css"
output_dir: "data/starcoderdata_tokenized_stage3_split/train/css"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
35 changes: 35 additions & 0 deletions configs/tokenizer_config_html.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl_split/train/html"
output_dir: "data/starcoderdata_tokenized_stage3_split/train/html"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
35 changes: 35 additions & 0 deletions configs/tokenizer_config_javascript.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl_split/train/javascript"
output_dir: "data/starcoderdata_tokenized_stage3_split/train/javascript"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
35 changes: 35 additions & 0 deletions configs/tokenizer_config_python.yaml
@@ -0,0 +1,35 @@

setup:
## for training data
input_dir: "data/starcoderdata_jsonl_split/train/python"
output_dir: "data/starcoderdata_tokenized_stage3_split/train/python"
processes: 4 # 64

dataset_processor: "LMDataPreprocessor"

processing:
tokenizer_type: "NeoXTokenizer"
encoder_file: "tokenizer.json" # replace with your directory
eos_id: 2
pad_id: 2

max_seq_length: 2048
short_seq_prob: 0.0

output_name: "examples"
files_per_record: 50000
write_in_batch: True

write_remainder: True
resume_from_checkpoint: False
display_pbar: True
seed: 45

dataset:
use_ftfy: True
ftfy_normalizer: "NFC"
wikitext_detokenize: False
min_sequence_len: 10
sep_token: null
# prompt_key: "source"
# completion_key: "target"
