move tokenize operator in RecDP (intel#433)
Signed-off-by: Xue, Chendi <[email protected]>
xuechendi authored Nov 7, 2023
1 parent 106a2a0 commit 4553256
Showing 7 changed files with 982 additions and 0 deletions.
1 change: 1 addition & 0 deletions RecDP/pyrecdp/LLM/README.md
@@ -26,6 +26,7 @@ RecDP LLM is a set of python components that enables quick and easy establish of
| [ Writer ](https://github.com/intel/e2eAIOK/blob/main/RecDP/pyrecdp/primitives/operations/text_writer.py#L7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/RecDP/examples/notebooks/llmutils/writer.ipynb) | write data to directory | jsonl, parquet | RefinedWeb - 1.7 TB |
| [ ClassifyWriter ](https://github.com/intel/e2eAIOK/blob/main/RecDP/pyrecdp/primitives/operations/text_writer.py#L47) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/RecDP/examples/notebooks/llmutils/classify.ipynb) | Classify and write data into sub buckets | meta fields, language | RefinedWeb - 1.7 TB |
| [ Prompt Enhancement ](#) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/RecDP/examples/notebooks/llmutils/prompt_enhancement.ipynb) | creates high-complexity instructions from existing instruct-tuned LLM models | PromptSource, self-instruct, evol-instruct(wizardLM) | alpaca |
| [ Tokenization ](https://github.com/intel/e2eAIOK/blob/main/RecDP/pyrecdp/primitives/llmutils/tokenize_and_save/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](#) | tokenize data with the LLAMA2 tokenizer and save in Megatron format | LLAMA2 tokenizer | RefinedWeb - 1.7 TB |

## LLM Data Quality Analysis

45 changes: 45 additions & 0 deletions RecDP/pyrecdp/primitives/llmutils/tokenize_and_save/README.md
@@ -0,0 +1,45 @@
# Dataset Tokenization for Ray Fine-tuning

## Step 1: Set up the Environment
Please follow [this guide](https://github.com/intel-sandbox/llm-ray/tree/main/tools/workload_in_containers) to set up the container environment.
Once the containers are up and running, enter the container on the head node with the following command:
```bash
docker exec -it ray-leader bash
```
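
Before submitting work, it can help to confirm the cluster is healthy; for example (assuming the `ray` CLI is available inside the container):
```bash
# list the nodes and resources of the running Ray cluster
ray status
```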

## Step 2: Run the Preprocessing Job
```bash
# run the preprocessing job; below is an example
python tokenize_and_save.py \
--input-dir /home/user/shared/PILE_dedup/EuroParl \
--data-field text \
--tokenizer togethercomputer/LLaMA-2-7B-32K \
--output-dir /home/user/shared/processed \
--load-batch-size 1000 \
--cpu-per-node 90
```
When the data preprocessing finishes, the total execution time of this script is printed in the command-line output.

To see all available parameters, run the following command:
```bash
python tokenize_and_save.py -h
```
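
For orientation, the heart of the job is mapping a Hugging Face tokenizer over the configured data field. Below is a minimal single-process sketch under that assumption; the actual `tokenize_and_save.py` parallelizes this across the Ray cluster and writes Megatron-format output, and the file name here is hypothetical:
```python
# minimal single-process sketch of the tokenization step; the real
# script distributes this work over the Ray cluster
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")

def tokenize_jsonl(path: str, data_field: str = "text"):
    """Yield the token ids of each record's text field in a JSONL file."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield tokenizer.encode(record[data_field])

# e.g. count the tokens produced from one input shard (hypothetical file)
total = sum(len(ids) for ids in tokenize_jsonl("europarl_00.jsonl"))
print(f"tokens in shard: {total}")
```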

## Step 3: Merge Multiple Megatron Data Files [Optional]
For Megatron-format data, you may need an extra step to merge multiple data files into one. You can use `merge_datasets.py` as follows:

```bash
python merge_datasets.py --input <directory_containing_megatron_files> --output-prefix <output_file_name_without_file_extensions>
```
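
Conceptually, merging appends each shard into a freshly built indexed dataset. A rough sketch of that logic, assuming the bundled `indexed_dataset` module mirrors Megatron-LM's builder API (`MMapIndexedDatasetBuilder`, `merge_file_`, `finalize`); paths and dtype here are hypothetical:
```python
# rough sketch of merging Megatron bin/idx shards; merge_datasets.py
# may differ in detail
import glob
import numpy as np
from indexed_dataset import MMapIndexedDatasetBuilder

output_prefix = "merged"  # hypothetical output prefix (no extension)
builder = MMapIndexedDatasetBuilder(f"{output_prefix}.bin", dtype=np.uint16)

# each input shard is addressed by its path prefix, without .bin/.idx
for bin_file in sorted(glob.glob("partitions/*.bin")):
    builder.merge_file_(bin_file[: -len(".bin")])

# write the combined index alongside the combined bin file
builder.finalize(f"{output_prefix}.idx")
```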

## Validation
When the data preprocessing finishes, the total execution time is printed at the end of the command-line output. It is then up to you to gather the data partition files from each worker onto the head node. Once all partition files are in one folder on the head node, you can run the `merge_datasets.py` script to merge the multiple Megatron `bin` and `idx` files into a single `bin` and `idx` file. To count the number of tokens in the dataset, you can use the `count_tokens.py` script, e.g.
```bash
python count_tokens.py <megatron_file_without_file_extension> <output_file_containing_the_token_number_per_row> <tokenizer_name>
```
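
Gathering the partitions can be done with any file-transfer tool; for example, a sketch using `rsync` (worker hostnames and paths here are hypothetical, and this step is unnecessary if the output directory already sits on shared storage):
```bash
# pull each worker's partitions into one folder on the head node
# (hypothetical hostnames/paths; skip this if storage is already shared)
for worker in ray-worker-1 ray-worker-2; do
  rsync -av "${worker}:/home/user/shared/processed/" /home/user/shared/processed/
done
```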

26 changes: 26 additions & 0 deletions RecDP/pyrecdp/primitives/llmutils/tokenize_and_save/count_tokens.py
@@ -0,0 +1,26 @@
from indexed_dataset import MMapIndexedDataset
from transformers import AutoTokenizer

import argparse

# read the dataset prefix, the output file, and the tokenizer name
parser = argparse.ArgumentParser()
parser.add_argument("file_name", help="the Megatron file prefix to read (without extension)")
parser.add_argument("output_file", help="the file to write the per-document token counts to")
parser.add_argument("tokenizer", help="tokenizer name")
args = parser.parse_args()

# memory-map the indexed dataset; each entry is an array of token ids
ds = MMapIndexedDataset(args.file_name)

# the tokenizer is loaded for reference only; counting uses the token ids
# already stored in the indexed dataset
tok = AutoTokenizer.from_pretrained(args.tokenizer)

# number of tokens in each document
num_tokens = [len(ds[i]) for i in range(len(ds))]

# write one count per line to the output file
with open(args.output_file, "w") as f:
    for n in num_tokens:
        f.write(f"{n}\n")

print(f'Total tokens: {sum(num_tokens)}')
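
As a usage note, the total printed by the script can be cross-checked by summing the per-row counts it writes out (paths here are hypothetical):
```bash
# count tokens in the merged dataset, then re-sum the per-row output
python count_tokens.py merged token_counts.txt togethercomputer/LLaMA-2-7B-32K
awk '{s += $1} END {print s}' token_counts.txt
```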