move tokenize operator in RecDP (intel#433)
Signed-off-by: Xue, Chendi <[email protected]>
Showing 7 changed files with 982 additions and 0 deletions.
RecDP/pyrecdp/primitives/llmutils/tokenize_and_save/README.md (45 additions, 0 deletions)
# Dataset Tokenization for Ray Fine-tuning

## Step 1: Set Up the Environment
Please follow [this guide](https://github.com/intel-sandbox/llm-ray/tree/main/tools/workload_in_containers) to set up the container environment.
Once the containers are up and running, enter the container on the head node with the following command:
```bash
docker exec -it ray-leader bash
```
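Optionally, before submitting the job, you can check that every worker node has joined the Ray cluster. This is a minimal sanity check, assuming the `ray` CLI is available inside the container:

```bash
# print the nodes and resources currently registered with the Ray cluster
ray status
```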

## Step 2: Run the Preprocessing Job
```bash
# run the preprocessing job; the command below is an example
python tokenize_and_save.py \
    --input-dir /home/user/shared/PILE_dedup/EuroParl \
    --data-field text \
    --tokenizer togethercomputer/LLaMA-2-7B-32K \
    --output-dir /home/user/shared/processed \
    --load-batch-size 1000 \
    --cpu-per-node 90
```
When the preprocessing job finishes, the total execution time of the script is printed in the command-line output.
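As a quick check, you can also list the directory passed via `--output-dir`; the path below is the one from the example command above and may differ in your setup:

```bash
# inspect the tokenized output produced by the preprocessing job
ls -lh /home/user/shared/processed
```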

To see all available parameters, run the following command:
```bash
python tokenize_and_save.py -h
```

## Step 3: Merge Multiple Megatron Data Files [Optional]
For Megatron-format data, you may need an extra step to merge multiple data files into one. You can use `merge_datasets.py` as follows:

```bash
python merge_datasets.py --input <directory_containing_megatron_files> --output-prefix <output_file_name_without_file_extensions>
```
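For example, assuming the Megatron partition files were gathered into `/home/user/shared/processed` (the directory and the output prefix `europarl_merged` below are placeholders for illustration):

```bash
# merge all Megatron .bin/.idx partitions found under --input into a single pair of files
python merge_datasets.py \
    --input /home/user/shared/processed \
    --output-prefix /home/user/shared/processed/europarl_merged
```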

## Validation
When the data preprocessing finishes, the total execution time is shown at the end of the command-line output. It is then up to you to gather the data partition files from each worker node onto the head node. Once all partition files are in one folder on the head node, you can run the `merge_datasets.py` script to merge the per-worker Megatron `bin` and `idx` files into a single `bin`/`idx` pair. To count the number of tokens in the dataset, use the `count_tokens.py` script, e.g.:
```bash
python count_tokens.py <megatron_file_without_file_extension> <output_file_containing_the_token_number_per_row> <tokenizer_name>
```
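For example, using a hypothetical merged prefix `europarl_merged` from Step 3 and the tokenizer from the preprocessing command above:

```bash
# count tokens in europarl_merged.bin/.idx and write one count per row to token_counts.txt
python count_tokens.py europarl_merged token_counts.txt togethercomputer/LLaMA-2-7B-32K
```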
RecDP/pyrecdp/primitives/llmutils/tokenize_and_save/count_tokens.py (26 additions, 0 deletions)
from indexed_dataset import MMapIndexedDataset
from transformers import AutoTokenizer

import argparse

# read the Megatron file prefix, the output file, and the tokenizer name from the command line
parser = argparse.ArgumentParser()
parser.add_argument("file_name", help="the file name to read")
parser.add_argument("output_file", help="the file name to write")
parser.add_argument("tokenizer", help="tokenizer name")
args = parser.parse_args()

ds = MMapIndexedDataset(args.file_name)

tok = AutoTokenizer.from_pretrained(args.tokenizer)

# number of tokens in each document of the indexed dataset
num_tokens = [
    len(ds[i]) for i in range(len(ds))
]

# write one token count per line to the output file
with open(args.output_file, "w") as f:
    for i in num_tokens:
        f.write(f"{i}\n")

print(f'Total tokens: {sum(num_tokens)}')
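Since the output file holds one token count per line, the per-row counts can also be summarized directly from the shell. This is a quick cross-check, using the hypothetical `token_counts.txt` file from the example above:

```bash
# number of rows equals the number of documents in the dataset
wc -l token_counts.txt
# the sum should match the script's reported total token count
awk '{ total += $1 } END { print total }' token_counts.txt
```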