
[WIP][Finetune] Report training metrics to Tensorboard #40

Closed · wants to merge 10 commits

Conversation

carsonwang (Contributor):

When you start finetuning, you will see a message like the one below:
To visualize your results with TensorBoard, run: tensorboard --logdir /xxx/ray_results/TorchTrainer_2024-01-08_16-51-19

This PR reports the training loss and perplexity metrics to Ray, which writes them to TensorBoard.
It also introduces a logging_steps parameter to control the logging frequency.
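The reporting flow described above can be sketched in isolation. This is a hypothetical simplification of the trainer loop, not code from this PR: `report` stands in for Ray Train's `ray.train.report`, and `losses` stands in for per-step loss values.

```python
import math

def training_loop(losses, logging_steps=10, report=print):
    """Report loss and perplexity every `logging_steps` steps (sketch)."""
    completed_steps = 0
    reported_at = []
    for loss in losses:
        completed_steps += 1
        # Only emit metrics on multiples of logging_steps to limit
        # how often TensorBoard receives a data point.
        if completed_steps % logging_steps == 0:
            report({"train_loss": loss, "perplexity": math.exp(loss)})
            reported_at.append(completed_steps)
    return reported_at
```

With `logging_steps=10` and 25 steps, metrics are reported at steps 10 and 20 only.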

@@ -130,7 +131,7 @@ def prepare(self, model, tokenizer, dataset, optimizer, accelerator):
     def train(self):
         num_train_epochs = self.config.get("num_train_epochs", 1)
         checkpoint = self.config.get("checkpoint")
-        log_step = self.config.get("log_step", 1)
+        logging_steps = self.config.get("logging_steps")
Contributor:

Please add a default value of 1, because the UI depends on it.

Contributor Author:

OK. I've set the default to 10 in the configuration file. Is that fine for the UI?

Contributor:

Got it. It's because start_ui.py currently does not use finetune.yaml, so 'logging_steps' will be None. Alternatively, I can modify start_ui.py later to read the configuration from finetune.yaml.

Contributor Author:

I understand now; sorry I didn't say it clearly. I mean I will update this line to logging_steps = self.config.get("logging_steps", 10), using 10 instead of 1 as the default.

Contributor:

OK, but the progress bar needs the values passed to report({}) to be updated every step; otherwise its status will only update every 10 steps.

Contributor Author:

I see. OK, I'll set it to 1 here.

self.completed_steps += 1

if self.completed_steps % logging_steps == 0:
    perplexity = math.exp(loss)
Contributor:

Would it be better to use loss.item() here?

Contributor Author:

Yes, thanks. I will update it.
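The reviewer's suggestion can be sketched as follows. This is a hypothetical helper, not code from this PR; the duck-typed fallback for plain floats is an assumption added for illustration.

```python
import math

def compute_perplexity(loss):
    # In PyTorch, `loss` is typically a 0-dim tensor; loss.item() converts
    # it to a plain Python float, so the logged value does not keep the
    # autograd graph (or GPU memory) alive and formats cleanly.
    value = loss.item() if hasattr(loss, "item") else float(loss)
    return math.exp(value)
```

A zero loss yields a perplexity of exactly 1.0, which is a handy sanity check.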

carsonwang pushed a commit to carsonwang/llm-on-ray that referenced this pull request Jan 9, 2024
@carsonwang changed the title from "[Finetune] Report training metrics to Tensorboard" to "[WIP][Finetune] Report training metrics to Tensorboard" on Jan 10, 2024.

@carsonwang (Contributor Author):

I will also update the logging format and data.

@@ -147,12 +148,19 @@ def train(self):
    if self.lr_scheduler is not None:
        self.lr_scheduler.step()
    self.optimizer.zero_grad()
    if step % log_step == 0:
        logger.info(f"train epoch:[{idx}/{num_train_epochs}]\tstep:[{step}/{total_steps}]\tloss:{loss:.6f}\tppl:{math.exp(loss):.6f}\ttime:{time.time()-start:.6f}")
    report({"train_epoch": idx, "total_epochs": num_train_epochs, "train_step": step, "total_steps": min(max_train_step, total_steps) if max_train_step else total_steps})
@KepingYan (Contributor), Jan 10, 2024:

This line updates the progress bar on the web UI; the keys of this dict can't be modified because of:

self.config["epoch_value"].put(trial.last_result["train_epoch"] + 1, block=False)
self.config["total_epochs"].put(trial.last_result["total_epochs"], block=False)
self.config["step_value"].put(trial.last_result["train_step"] + 1, block=False)
self.config["total_steps"].put(trial.last_result["total_steps"], block=False)
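For context, the UI side of that contract can be exercised with standard-library queues. `push_progress` is a hypothetical name; the dict keys and `.put(..., block=False)` calls mirror the start_ui.py snippet quoted above.

```python
from queue import Queue

def push_progress(config, last_result):
    # Mirrors the start_ui.py snippet: each reported value is pushed into a
    # bounded queue that the web UI polls to redraw the progress bar.
    # Renaming any of these keys on the trainer side would break this.
    config["epoch_value"].put(last_result["train_epoch"] + 1, block=False)
    config["total_epochs"].put(last_result["total_epochs"], block=False)
    config["step_value"].put(last_result["train_step"] + 1, block=False)
    config["total_steps"].put(last_result["total_steps"], block=False)

config = {k: Queue(maxsize=1) for k in
          ("epoch_value", "total_epochs", "step_value", "total_steps")}
push_progress(config, {"train_epoch": 0, "total_epochs": 1,
                       "train_step": 4, "total_steps": 100})
```

Note that the UI displays 1-based epoch and step numbers, hence the `+ 1` on the 0-based values from `trial.last_result`.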

Contributor Author:

Thanks for the info. Let me double-check how to update this; all of these keys and values will also show up in TensorBoard.
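One way to reconcile the two requirements raised in this thread (progress keys reported every step for the UI, loss/perplexity attached only every `logging_steps` steps for TensorBoard) is sketched below. `build_report` is a hypothetical helper, not code from this PR; the progress key names are the ones the web UI expects.

```python
import math

def build_report(step, total_steps, epoch, total_epochs,
                 loss=None, logging_steps=10):
    # Progress keys are always present so the web UI's progress bar
    # updates every step.
    metrics = {
        "train_epoch": epoch,
        "total_epochs": total_epochs,
        "train_step": step,
        "total_steps": total_steps,
    }
    # Loss/perplexity are attached only every `logging_steps` steps so
    # TensorBoard is not flooded with near-duplicate points.
    if loss is not None and step % logging_steps == 0:
        metrics["train_loss"] = loss
        metrics["perplexity"] = math.exp(loss)
    return metrics
```

The resulting dict would then be passed to a single report(...) call per step, keeping the UI contract intact while throttling the TensorBoard metrics.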

@carsonwang (Contributor Author):

I'll close this one. @harborn will continue to work on this and address the comments.

@carsonwang closed this on Jan 22, 2024.