Commit

update
Junbo Li committed Dec 8, 2023
1 parent b052d82 commit 55ad974
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -32,7 +32,7 @@ git clone https://github.com/Cerebras/modelzoo.git

We tokenize the [SlimPajama dataset](https://huggingface.co/datasets/cerebras/SlimPajama-627B) (in `jsonl` format) and [StarCoder dataset](https://huggingface.co/datasets/bigcode/starcoderdata) (in `parquet` format) to `hdf5` format. This is done using the [`create_hdf5_dataset.py`](https://github.com/Cerebras/modelzoo/blob/main/modelzoo/transformers/data_processing/scripts/hdf5_preprocessing/create_hdf5_dataset.py) script.

-### SlimPajama data
+### 1. SlimPajama data

#### Tokenization

@@ -65,7 +65,7 @@ done
```


-### StarCoder data
+### 2. StarCoder data

First, we convert the original `parquet` format to `jsonl` format.
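
The conversion step described above can be sketched as follows. This is a minimal illustration, not the repository's own script: the function names (`records_to_jsonl`, `iter_parquet_records`, `parquet_to_jsonl`) and the batch size are hypothetical, and it assumes the third-party `pyarrow` package is installed for reading `parquet` files. All columns are written out unchanged, one JSON object per line.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def records_to_jsonl(records: Iterable[Dict[str, Any]], jsonl_path: str) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for record in records:
            # default=str is a fallback for values json cannot encode natively
            # (e.g. timestamps); this choice is an assumption of the sketch.
            out.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
            count += 1
    return count


def iter_parquet_records(parquet_path: str, batch_size: int = 1024) -> Iterator[Dict[str, Any]]:
    """Yield the rows of a parquet file as dicts, reading one batch at a time."""
    # Imported lazily so the jsonl writer above works without pyarrow installed.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(parquet_path)
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        yield from batch.to_pylist()


def parquet_to_jsonl(parquet_path: str, jsonl_path: str) -> int:
    """Convert a single parquet file to jsonl (hypothetical helper)."""
    return records_to_jsonl(iter_parquet_records(parquet_path), jsonl_path)
```

Reading in batches keeps memory use bounded for large shards; calling `parquet_to_jsonl("input.parquet", "output.jsonl")` (placeholder paths) would convert one file.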
