[Finetune] replace fine-tuning DefaultTrainer with transformers.Trainer (intel#204)

* replace fine-tuning DefaultTrainer with transformers.Trainer

* enable resume from checkpoint

* update docs
harborn authored May 13, 2024
1 parent db68b65 commit 3523011
Showing 22 changed files with 169 additions and 153 deletions.
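At a high level, the commit replaces the project's custom DefaultTrainer training loop with Hugging Face's `transformers.Trainer`, which brings checkpoint saving (`save_strategy`) and resuming (`resume_from_checkpoint`) along for free. The sketch below only illustrates that flow and is not the repository's actual code; the toy in-memory dataset is a stand-in, and the config-key mappings in the comments are my reading of the diff.

```python
# Illustrative only: a minimal transformers.Trainer flow in the spirit of this
# commit. Not the repository's code; the toy dataset and the config-key
# mappings in the comments are assumptions.
import datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "EleutherAI/gpt-j-6b"  # large; swap in a small model to smoke-test
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Toy dataset standing in for the workflow's real data loading.
raw = datasets.Dataset.from_dict({"text": ["hello world", "fine-tuning example"]})
train_dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="/tmp/llm-ray/output",   # General.output_dir
    save_strategy="no",                 # General.save_strategy
    per_device_train_batch_size=4,      # Training.batch_size
    num_train_epochs=3,                 # Training.epochs
    learning_rate=1e-5,                 # Training.learning_rate
    optim="adamw_torch",                # Training.optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# None starts fresh; a checkpoint path (General.resume_from_checkpoint) resumes.
trainer.train(resume_from_checkpoint=None)
```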
@@ -39,7 +39,7 @@ def update_finetune_config(config_file, base_model):

config["General"]["base_model"] = base_model
config["General"]["output_dir"] = "./output"
config["General"]["checkpoint_dir"] = "./checkpoint"
config["General"]["save_strategy"] = "no"
config["Training"]["device"] = "GPU"
config["Training"]["resources_per_worker"]["CPU"] = 1
config["Training"]["resources_per_worker"]["GPU"] = 1
7 changes: 4 additions & 3 deletions docs/finetune_parameters.md
@@ -9,8 +9,9 @@ The following are the parameters supported in the finetuning workflow.
|base_model| EleutherAI/gpt-j-6b|Path to pretrained model or model identifier from huggingface.co/models|
|tokenizer_name|None|Path to pretrained tokenizer from huggingface.co/models. If not provided, the tokenizer will be loaded from the `base_model`.|
|gpt_base_model|True|This parameter is for [Transformers#22482](https://github.com/huggingface/transformers/issues/22482). It needs to be set to True when the pretrained model is related to GPT; otherwise it is False.|
-|output_dir|/tmp/llm-ray/output|The output directory to store the finetuned model|
-|checkpoint_dir|/tmp/llm-ray/checkpoint|The directory to store checkpoint|
+|output_dir|/tmp/llm-ray/output|The output directory to store the finetuned model.|
+|resume_from_checkpoint|null|The path to a folder with a valid checkpoint for your model.|
+|save_strategy|no|The checkpoint save strategy to adopt during training. Possible values are: "no", "epoch", "steps".|
|config|trust_remote_code: False<br> use_auth_token: None|Will be passed to the transformers `from_pretrained()` method|
|lora_config|task_type: CAUSAL_LM<br>r: 8<br>lora_alpha: 32<br>lora_dropout: 0.1|Will be passed to the LoraConfig `__init__()` method, then it'll be used as config to build Peft model object.|
|deltatuner_config|"algo": "lora"<br>"denas": True<br>"best_model_structure": "/path/to/best_structure_of_deltatuner_model"|Will be passed to the DeltaTunerArguments `__init__()` method, then it'll be used as config to build [Deltatuner model](https://github.com/intel/e2eAIOK/tree/main/e2eAIOK/deltatuner) object.|
@@ -33,7 +34,7 @@ The following are the parameters supported in the finetuning workflow.
## Training Parameters
|Configuration Name| Default|Meaning|
|-|-|-|
-|optimizer|AdamW|The optimizer used for model training. Supported values: "Adadelta", "Adagrad", "Adam", "AdamW", "Adamax", "ASGD", "NAdam", "RAdam", "RMSprop", "Rprop", "SGD"|
+|optimizer|adamw_torch|The optimizer to use: adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision, or adafactor. For more optimizer names, see `OptimizerNames` [here](https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py).|
|batch_size|4|Batch size per training worker|
|epochs|3|Total number of training epochs to perform.|
|learning_rate|1e-5|Initial learning rate to use.|
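Since both new rows describe plain strings that transformers itself understands, a quick way to sanity-check a config is to compare the values against the library's own enums before building `TrainingArguments`. A hedged sketch, assuming the config has already been parsed into a dict; the dict literal is illustrative, not the project's parsing code:

```python
# Hedged sketch: validate the string-valued options against transformers' enums.
# The cfg dict is illustrative; it is not the project's parsed YAML object.
from transformers import TrainingArguments
from transformers.trainer_utils import IntervalStrategy
from transformers.training_args import OptimizerNames

cfg = {"save_strategy": "no", "optimizer": "adamw_torch"}

assert cfg["save_strategy"] in {s.value for s in IntervalStrategy}  # "no", "steps", "epoch"
assert cfg["optimizer"] in {o.value for o in OptimizerNames}        # "adamw_torch", "adafactor", ...

args = TrainingArguments(
    output_dir="/tmp/llm-ray/output",
    save_strategy=cfg["save_strategy"],
    optim=cfg["optimizer"],
)
print(args.save_strategy, args.optim)
```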
2 changes: 1 addition & 1 deletion examples/finetune/dolly1/dolly_1_finetune.yaml
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/gpt-j-6b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
2 changes: 1 addition & 1 deletion examples/finetune/dolly2/dolly_2_finetune.yaml
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/pythia-6.9b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
2 changes: 1 addition & 1 deletion examples/finetune/gpt_j_6b/finetune_hpu.yaml
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/gpt-j-6b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
2 changes: 1 addition & 1 deletion examples/finetune/gpt_j_6b/finetune_intel_gpu.yaml
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/gpt-j-6b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/gpt-j-6b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
14 changes: 10 additions & 4 deletions llm_on_ray/common/dataprocesser/general_processer.py
@@ -99,13 +99,10 @@ def torch_call(self, examples):


class GeneralProcesser(DataProcesser):
-def prepare(self, tokenizer, dataset):
-per_device_train_batch_size = self.config.get("per_device_train_batch_size")
-per_device_eval_batch_size = self.config.get("per_device_eval_batch_size")
+def tokenize_dataset(self, tokenizer, dataset):
max_length = self.config.get("max_length")
group = self.config.get("group")
block_size = self.config.get("block_size")
-shuffle = self.config.get("shuffle")
tokenizer.pad_token = tokenizer.eos_token

if isinstance(dataset, datasets.Dataset):
@@ -176,13 +173,22 @@ def group_texts(examples):
desc=f"Grouping texts in chunks of {block_size}",
)

+return tokenized_datasets

+def prepare_dataloader(self, tokenizer, dataset):
+per_device_train_batch_size = self.config.get("per_device_train_batch_size")
+per_device_eval_batch_size = self.config.get("per_device_eval_batch_size")
+shuffle = self.config.get("shuffle")

data_collator = DataCollatorForCompletionOnlyLM(
tokenizer=tokenizer,
mlm=False,
return_tensors="pt",
pad_to_multiple_of=8,
)

+tokenized_datasets = self.tokenize_dataset(tokenizer, dataset)

train_dataset = tokenized_datasets["train"]
train_dataloader_params = {
"shuffle": shuffle,
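The refactor above splits the old `prepare()` into `tokenize_dataset()` (tokenization and optional grouping) and `prepare_dataloader()` (collator plus DataLoader construction), so the Trainer-based path can reuse the tokenized dataset without building DataLoaders itself. Below is a minimal sketch of the same split built from stock Hugging Face pieces; `DataCollatorForLanguageModeling` stands in for the repository's custom `DataCollatorForCompletionOnlyLM`, and the function signatures and defaults are assumptions, not the project's exact interface.

```python
# Hedged sketch of the tokenize/prepare split, using stock HF components.
# DataCollatorForLanguageModeling stands in for the repository's custom
# DataCollatorForCompletionOnlyLM; signatures and defaults are assumptions.
import datasets
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling


def tokenize_dataset(tokenizer, dataset, max_length=512):
    tokenizer.pad_token = tokenizer.eos_token

    def tok(batch):
        return tokenizer(batch["text"], truncation=True, max_length=max_length)

    return dataset.map(tok, batched=True, remove_columns=["text"])


def prepare_dataloader(tokenizer, dataset, batch_size=4, shuffle=True):
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8, return_tensors="pt"
    )
    tokenized = tokenize_dataset(tokenizer, dataset)
    return DataLoader(tokenized, batch_size=batch_size, shuffle=shuffle, collate_fn=collator)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small placeholder tokenizer
    raw = datasets.Dataset.from_dict({"text": ["hello world", "llm-on-ray fine-tuning"]})
    loader = prepare_dataloader(tokenizer, raw, batch_size=2)
    batch = next(iter(loader))
    print(batch["input_ids"].shape)
```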
4 changes: 3 additions & 1 deletion llm_on_ray/common/trainer/default_trainer.py
@@ -130,7 +130,9 @@ def prepare(self, model, tokenizer, dataset, optimizer, accelerator):
f"model embedding size resize to {len(tokenizer)} because of tokenizer size"
)

-train_dataloader, eval_dataloader = self.dataprocesser.prepare(tokenizer, dataset)
+train_dataloader, eval_dataloader = self.dataprocesser.prepare_dataloader(
+tokenizer, dataset
+)

lr_scheduler_config = self.config.get("lr_scheduler")
if lr_scheduler_config:
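For the DefaultTrainer path that remains, the `lr_scheduler` config block read above typically ends up driving a scheduler like the one `transformers.get_scheduler` builds. A hedged sketch of that standard call, with a placeholder model and illustrative values rather than the project's exact wiring:

```python
# Hedged sketch: a typical transformers scheduler setup for the kind of
# lr_scheduler config block read above. Model and values are placeholders.
import torch
from transformers import get_scheduler

model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

lr_scheduler = get_scheduler(
    name="linear",              # e.g. "linear", "cosine", "constant"
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=1000,
)

for step in range(3):
    optimizer.step()
    lr_scheduler.step()          # advance the learning rate after each step
```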
(The remaining changed files are not shown here.)
