[Finetune] replace fine-tuning DefaultTrainer with transformers.Trainer (intel#204)

* replace fine-tuning DefaultTrainer with transformers.Trainer

* enable resume from checkpoint

* update docs
harborn authored May 13, 2024
1 parent db68b65 commit 3523011
Showing 22 changed files with 169 additions and 153 deletions.
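At a high level, the commit replaces the project's custom DefaultTrainer training loop with Hugging Face's `transformers.Trainer`, which brings checkpoint saving (`save_strategy`) and resuming (`resume_from_checkpoint`) along for free. The sketch below only illustrates that flow and is not the repository's actual code; the toy in-memory dataset is a stand-in, and the config-key mappings in the comments are my reading of the diff.

```python
# Illustrative only: a minimal transformers.Trainer flow in the spirit of this
# commit. Not the repository's code; the toy dataset and the config-key
# mappings in the comments are assumptions.
import datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "EleutherAI/gpt-j-6b"  # large; swap in a small model to smoke-test
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Toy dataset standing in for the workflow's real data loading.
raw = datasets.Dataset.from_dict({"text": ["hello world", "fine-tuning example"]})
train_dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="/tmp/llm-ray/output",   # General.output_dir
    save_strategy="no",                 # General.save_strategy
    per_device_train_batch_size=4,      # Training.batch_size
    num_train_epochs=3,                 # Training.epochs
    learning_rate=1e-5,                 # Training.learning_rate
    optim="adamw_torch",                # Training.optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# None starts fresh; a checkpoint path (General.resume_from_checkpoint) resumes.
trainer.train(resume_from_checkpoint=None)
```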
@@ -39,7 +39,7 @@ def update_finetune_config(config_file, base_model):

config["General"]["base_model"] = base_model
config["General"]["output_dir"] = "./output"
config["General"]["checkpoint_dir"] = "./checkpoint"
config["General"]["save_strategy"] = "no"
config["Training"]["device"] = "GPU"
config["Training"]["resources_per_worker"]["CPU"] = 1
config["Training"]["resources_per_worker"]["GPU"] = 1
7 changes: 4 additions & 3 deletions docs/finetune_parameters.md
@@ -9,8 +9,9 @@ The following are the parameters supported in the finetuning workflow.
|base_model| EleutherAI/gpt-j-6b|Path to pretrained model or model identifier from huggingface.co/models|
|tokenizer_name|None|Path to pretrained tokenizer from huggingface.co/models. If not provided, the tokenizer will be loaded from the `base_model`.|
|gpt_base_model|True|This parameter is for [Transformers#22482](https://github.com/huggingface/transformers/issues/22482). It needs to be set to True when the pretrained model is related to GPT; otherwise it is False.|
-|output_dir|/tmp/llm-ray/output|The output directory to store the finetuned model|
-|checkpoint_dir|/tmp/llm-ray/checkpoint|The directory to store checkpoint|
+|output_dir|/tmp/llm-ray/output|The output directory to store the finetuned model.|
+|resume_from_checkpoint|null|The path to a folder with a valid checkpoint for your model.|
+|save_strategy|no|The checkpoint save strategy to adopt during training. Possible values are: "no", "epoch", "steps".|
|config|trust_remote_code: False<br> use_auth_token: None|Will be passed to the transformers `from_pretrained()` method|
|lora_config|task_type: CAUSAL_LM<br>r: 8<br>lora_alpha: 32<br>lora_dropout: 0.1|Will be passed to the LoraConfig `__init__()` method, then it'll be used as config to build Peft model object.|
|deltatuner_config|"algo": "lora"<br>"denas": True<br>"best_model_structure": "/path/to/best_structure_of_deltatuner_model"|Will be passed to the DeltaTunerArguments `__init__()` method, then it'll be used as config to build [Deltatuner model](https://github.com/intel/e2eAIOK/tree/main/e2eAIOK/deltatuner) object.|
@@ -33,7 +34,7 @@ The following are the parameters supported in the finetuning workflow.
## Training Parameters
|Configuration Name| Default|Meaning|
|-|-|-|
-|optimizer|AdamW|The optimizer used for model training. Supported values: "Adadelta", "Adagrad", "Adam", "AdamW", "Adamax", "ASGD", "NAdam", "RAdam", "RMSprop", "Rprop", "SGD"|
+|optimizer|adamw_torch|The optimizer to use: adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision, or adafactor. For more optimizer names, see `OptimizerNames` [here](https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py).|
|batch_size|4|Batch size per training worker|
|epochs|3|Total number of training epochs to perform.|
|learning_rate|1e-5|Initial learning rate to use.|
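Since both new rows describe plain strings that transformers itself understands, a quick way to sanity-check a config is to compare the values against the library's own enums before building `TrainingArguments`. A hedged sketch, assuming the config has already been parsed into a dict; the dict literal is illustrative, not the project's parsing code:

```python
# Hedged sketch: validate the string-valued options against transformers' enums.
# The cfg dict is illustrative; it is not the project's parsed YAML object.
from transformers import TrainingArguments
from transformers.trainer_utils import IntervalStrategy
from transformers.training_args import OptimizerNames

cfg = {"save_strategy": "no", "optimizer": "adamw_torch"}

assert cfg["save_strategy"] in {s.value for s in IntervalStrategy}  # "no", "steps", "epoch"
assert cfg["optimizer"] in {o.value for o in OptimizerNames}        # "adamw_torch", "adafactor", ...

args = TrainingArguments(
    output_dir="/tmp/llm-ray/output",
    save_strategy=cfg["save_strategy"],
    optim=cfg["optimizer"],
)
print(args.save_strategy, args.optim)
```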
2 changes: 1 addition & 1 deletion examples/finetune/dolly1/dolly_1_finetune.yaml
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/gpt-j-6b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
2 changes: 1 addition & 1 deletion examples/finetune/dolly2/dolly_2_finetune.yaml
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/pythia-6.9b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
2 changes: 1 addition & 1 deletion examples/finetune/gpt_j_6b/finetune_hpu.yaml
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/gpt-j-6b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
2 changes: 1 addition & 1 deletion examples/finetune/gpt_j_6b/finetune_intel_gpu.yaml
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/gpt-j-6b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
@@ -2,7 +2,7 @@ General:
base_model: EleutherAI/gpt-j-6b
gpt_base_model: true
output_dir: /tmp/llm-ray/output
-checkpoint_dir: /tmp/llm-ray/checkpoint
+save_strategy: no
config:
trust_remote_code: false
use_auth_token: null
14 changes: 10 additions & 4 deletions llm_on_ray/common/dataprocesser/general_processer.py
@@ -99,13 +99,10 @@ def torch_call(self, examples):


class GeneralProcesser(DataProcesser):
-def prepare(self, tokenizer, dataset):
-per_device_train_batch_size = self.config.get("per_device_train_batch_size")
-per_device_eval_batch_size = self.config.get("per_device_eval_batch_size")
+def tokenize_dataset(self, tokenizer, dataset):
max_length = self.config.get("max_length")
group = self.config.get("group")
block_size = self.config.get("block_size")
-shuffle = self.config.get("shuffle")
tokenizer.pad_token = tokenizer.eos_token

if isinstance(dataset, datasets.Dataset):
@@ -176,13 +173,22 @@ def group_texts(examples):
desc=f"Grouping texts in chunks of {block_size}",
)

+return tokenized_datasets

+def prepare_dataloader(self, tokenizer, dataset):
+per_device_train_batch_size = self.config.get("per_device_train_batch_size")
+per_device_eval_batch_size = self.config.get("per_device_eval_batch_size")
+shuffle = self.config.get("shuffle")

data_collator = DataCollatorForCompletionOnlyLM(
tokenizer=tokenizer,
mlm=False,
return_tensors="pt",
pad_to_multiple_of=8,
)

+tokenized_datasets = self.tokenize_dataset(tokenizer, dataset)

train_dataset = tokenized_datasets["train"]
train_dataloader_params = {
"shuffle": shuffle,
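The refactor above splits the old `prepare()` into `tokenize_dataset()` (tokenization and optional grouping) and `prepare_dataloader()` (collator plus DataLoader construction), so the Trainer-based path can reuse the tokenized dataset without building DataLoaders itself. Below is a minimal sketch of the same split built from stock Hugging Face pieces; `DataCollatorForLanguageModeling` stands in for the repository's custom `DataCollatorForCompletionOnlyLM`, and the function signatures and defaults are assumptions, not the project's exact interface.

```python
# Hedged sketch of the tokenize/prepare split, using stock HF components.
# DataCollatorForLanguageModeling stands in for the repository's custom
# DataCollatorForCompletionOnlyLM; signatures and defaults are assumptions.
import datasets
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling


def tokenize_dataset(tokenizer, dataset, max_length=512):
    tokenizer.pad_token = tokenizer.eos_token

    def tok(batch):
        return tokenizer(batch["text"], truncation=True, max_length=max_length)

    return dataset.map(tok, batched=True, remove_columns=["text"])


def prepare_dataloader(tokenizer, dataset, batch_size=4, shuffle=True):
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8, return_tensors="pt"
    )
    tokenized = tokenize_dataset(tokenizer, dataset)
    return DataLoader(tokenized, batch_size=batch_size, shuffle=shuffle, collate_fn=collator)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small placeholder tokenizer
    raw = datasets.Dataset.from_dict({"text": ["hello world", "llm-on-ray fine-tuning"]})
    loader = prepare_dataloader(tokenizer, raw, batch_size=2)
    batch = next(iter(loader))
    print(batch["input_ids"].shape)
```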
4 changes: 3 additions & 1 deletion llm_on_ray/common/trainer/default_trainer.py
@@ -130,7 +130,9 @@ def prepare(self, model, tokenizer, dataset, optimizer, accelerator):
f"model embedding size resize to {len(tokenizer)} because of tokenizer size"
)

-train_dataloader, eval_dataloader = self.dataprocesser.prepare(tokenizer, dataset)
+train_dataloader, eval_dataloader = self.dataprocesser.prepare_dataloader(
+tokenizer, dataset
+)

lr_scheduler_config = self.config.get("lr_scheduler")
if lr_scheduler_config:
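For the DefaultTrainer path that remains, the `lr_scheduler` config block read above typically ends up driving a scheduler like the one `transformers.get_scheduler` builds. A hedged sketch of that standard call, with a placeholder model and illustrative values rather than the project's exact wiring:

```python
# Hedged sketch: a typical transformers scheduler setup for the kind of
# lr_scheduler config block read above. Model and values are placeholders.
import torch
from transformers import get_scheduler

model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

lr_scheduler = get_scheduler(
    name="linear",              # e.g. "linear", "cosine", "constant"
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=1000,
)

for step in range(3):
    optimizer.step()
    lr_scheduler.step()          # advance the learning rate after each step
```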
(The remaining changed files are not shown here.)
