move tokenize operator in RecDP (intel#433)
Signed-off-by: Xue, Chendi <[email protected]>
xuechendi authored Nov 7, 2023
1 parent 106a2a0 commit 4553256
Showing 7 changed files with 982 additions and 0 deletions.
1 change: 1 addition & 0 deletions RecDP/pyrecdp/LLM/README.md
@@ -26,6 +26,7 @@ RecDP LLM is a set of python components that enables quick and easy establish of
| [ Writer ](https://github.com/intel/e2eAIOK/blob/main/RecDP/pyrecdp/primitives/operations/text_writer.py#L7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/RecDP/examples/notebooks/llmutils/writer.ipynb) | write data to directory | jsonl, parquet | RefinedWeb - 1.7 TB |
| [ ClassifyWriter ](https://github.com/intel/e2eAIOK/blob/main/RecDP/pyrecdp/primitives/operations/text_writer.py#L47) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/RecDP/examples/notebooks/llmutils/classify.ipynb) | Classify and write data into sub buckets | meta fields, language | RefinedWeb - 1.7 TB |
| [ Prompt Enhancement ](#) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/RecDP/examples/notebooks/llmutils/prompt_enhancement.ipynb) | creates high-complexity instructions from existing instruct-tuned LLM models | PromptSource, self-instruct, evol-instruct(wizardLM) | alpaca |
| [ Tokenization ](https://github.com/intel/e2eAIOK/blob/main/RecDP/pyrecdp/primitives/llmutils/tokenize_and_save/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](#) | tokenize data with the LLAMA2 tokenizer and save in Megatron format | LLAMA2 tokenizer | RefinedWeb - 1.7 TB |

## LLM Data Quality Analysis

45 changes: 45 additions & 0 deletions RecDP/pyrecdp/primitives/llmutils/tokenize_and_save/README.md
@@ -0,0 +1,45 @@
# Dataset Tokenization for Ray Fine-tuning

## Step 1: Set up the Environment
Please follow [this guide](https://github.com/intel-sandbox/llm-ray/tree/main/tools/workload_in_containers) to set up the container environment.
Once the containers are up and running, enter the container on the head node with the following command:
```bash
docker exec -it ray-leader bash
```
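
Before submitting work, it can help to confirm the cluster is healthy; for example (assuming the `ray` CLI is available inside the container):
```bash
# list the nodes and resources of the running Ray cluster
ray status
```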

## Step 2: Run the Preprocessing Job
```bash
# run the preprocessing job; below is an example
python tokenize_and_save.py \
--input-dir /home/user/shared/PILE_dedup/EuroParl \
--data-field text \
--tokenizer togethercomputer/LLaMA-2-7B-32K \
--output-dir /home/user/shared/processed \
--load-batch-size 1000 \
--cpu-per-node 90
```
When the data preprocessing finishes, the total execution time of this script is printed in the command-line output.

To see all available parameters, run the following command:
```bash
python tokenize_and_save.py -h
```
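
For orientation, the heart of the job is mapping a Hugging Face tokenizer over the configured data field. Below is a minimal single-process sketch under that assumption; the actual `tokenize_and_save.py` parallelizes this across the Ray cluster and writes Megatron-format output, and the file name here is hypothetical:
```python
# minimal single-process sketch of the tokenization step; the real
# script distributes this work over the Ray cluster
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")

def tokenize_jsonl(path: str, data_field: str = "text"):
    """Yield the token ids of each record's text field in a JSONL file."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield tokenizer.encode(record[data_field])

# e.g. count the tokens produced from one input shard (hypothetical file)
total = sum(len(ids) for ids in tokenize_jsonl("europarl_00.jsonl"))
print(f"tokens in shard: {total}")
```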

## Step 3: Merge Multiple Megatron Data Files [Optional]
For Megatron-format data, you may need an extra step to merge multiple data files into one. You can use `merge_datasets.py` as follows:

```bash
python merge_datasets.py --input <directory_containing_megatron_files> --output-prefix <output_file_name_without_file_extensions>
```
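
Conceptually, merging appends each shard into a freshly built indexed dataset. A rough sketch of that logic, assuming the bundled `indexed_dataset` module mirrors Megatron-LM's builder API (`MMapIndexedDatasetBuilder`, `merge_file_`, `finalize`); paths and dtype here are hypothetical:
```python
# rough sketch of merging Megatron bin/idx shards; merge_datasets.py
# may differ in detail
import glob
import numpy as np
from indexed_dataset import MMapIndexedDatasetBuilder

output_prefix = "merged"  # hypothetical output prefix (no extension)
builder = MMapIndexedDatasetBuilder(f"{output_prefix}.bin", dtype=np.uint16)

# each input shard is addressed by its path prefix, without .bin/.idx
for bin_file in sorted(glob.glob("partitions/*.bin")):
    builder.merge_file_(bin_file[: -len(".bin")])

# write the combined index alongside the combined bin file
builder.finalize(f"{output_prefix}.idx")
```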

## Validation
When the data preprocessing finishes, the total execution time is printed at the end of the command-line output. It is then up to you to gather the data partition files from each worker onto the head node. Once all partition files are in one folder on the head node, you can run the `merge_datasets.py` script to merge the multiple Megatron `bin` and `idx` files into a single `bin` and `idx` file. To count the number of tokens in the dataset, you can use the `count_tokens.py` script, e.g.
```bash
python count_tokens.py <megatron_file_without_file_extension> <output_file_containing_the_token_number_per_row> <tokenizer_name>
```
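
Gathering the partitions can be done with any file-transfer tool; for example, a sketch using `rsync` (worker hostnames and paths here are hypothetical, and this step is unnecessary if the output directory already sits on shared storage):
```bash
# pull each worker's partitions into one folder on the head node
# (hypothetical hostnames/paths; skip this if storage is already shared)
for worker in ray-worker-1 ray-worker-2; do
  rsync -av "${worker}:/home/user/shared/processed/" /home/user/shared/processed/
done
```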

26 changes: 26 additions & 0 deletions RecDP/pyrecdp/primitives/llmutils/tokenize_and_save/count_tokens.py
@@ -0,0 +1,26 @@
from indexed_dataset import MMapIndexedDataset
from transformers import AutoTokenizer

import argparse

# read the dataset prefix, the output file, and the tokenizer name
parser = argparse.ArgumentParser()
parser.add_argument("file_name", help="the Megatron file prefix to read (without extension)")
parser.add_argument("output_file", help="the file to write the per-document token counts to")
parser.add_argument("tokenizer", help="tokenizer name")
args = parser.parse_args()

# memory-map the indexed dataset; each entry is an array of token ids
ds = MMapIndexedDataset(args.file_name)

# the tokenizer is loaded for reference only; counting uses the token ids
# already stored in the indexed dataset
tok = AutoTokenizer.from_pretrained(args.tokenizer)

# number of tokens in each document
num_tokens = [len(ds[i]) for i in range(len(ds))]

# write one count per line to the output file
with open(args.output_file, "w") as f:
    for n in num_tokens:
        f.write(f"{n}\n")

print(f'Total tokens: {sum(num_tokens)}')
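
As a usage note, the total printed by the script can be cross-checked by summing the per-row counts it writes out (paths here are hypothetical):
```bash
# count tokens in the merged dataset, then re-sum the per-row output
python count_tokens.py merged token_counts.txt togethercomputer/LLaMA-2-7B-32K
awk '{s += $1} END {print s}' token_counts.txt
```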