[WIP][Finetune] Report training metrics to Tensorboard #40
Conversation
@@ -130,7 +131,7 @@ def prepare(self, model, tokenizer, dataset, optimizer, accelerator):
     def train(self):
         num_train_epochs = self.config.get("num_train_epochs", 1)
         checkpoint = self.config.get("checkpoint")
-        log_step = self.config.get("log_step", 1)
+        logging_steps = self.config.get("logging_steps")
Please add a default value of 1, because of the UI.
OK. I've set the default to 10 in the configuration file. Is that fine for the UI?
Got it. It's because start_ui.py currently does not use finetune.yaml, so 'logging_steps' will be None. Alternatively, I can modify start_ui.py later to get the configuration from finetune.yaml.
Understood. Sorry I didn't say it clearly: I mean I will update this to logging_steps = self.config.get("logging_steps", 10), using 10 instead of 1 as the default.
OK, but the progress bar needs the values passed to report({}) to be updated every step; otherwise the progress bar status will only be refreshed every 10 steps.
I see. OK, I will set it to 1 here.
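For reference, a minimal sketch of the resolution as discussed above, assuming self.config is a plain dict (and that 1 remains the agreed default):

```python
# Sketch based on the thread above (not necessarily the final code): default
# to 1 so the web UI progress bar is updated every step when "logging_steps"
# is absent, e.g. when start_ui.py does not read finetune.yaml.
logging_steps = self.config.get("logging_steps", 1)
```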
    self.completed_steps += 1
    if self.completed_steps % logging_steps == 0:
        perplexity = math.exp(loss)
Would it be better to use loss.item() here?
Yes, thanks. I will update. I will also update the logging format and data.
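For illustration, a minimal standalone sketch of the difference, assuming loss is a 0-dim PyTorch tensor:

```python
import math

import torch

loss = torch.tensor(2.0)  # stand-in for the per-step training loss tensor

# math.exp(loss) works only because the 0-dim tensor is implicitly converted
# to a float; loss.item() makes that conversion explicit and returns a plain
# Python float, which is the idiomatic way to pull a scalar out for logging.
perplexity = math.exp(loss.item())
print(perplexity)  # ~7.389
```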
@@ -147,12 +148,19 @@ def train(self):
                if self.lr_scheduler is not None:
                    self.lr_scheduler.step()
                self.optimizer.zero_grad()
                if step % log_step == 0:
                    logger.info(f"train epoch:[{idx}/{num_train_epochs}]\tstep:[{step}/{total_steps}]\tloss:{loss:.6f}\tppl:{math.exp(loss):.6f}\ttime:{time.time()-start:.6f}")
                report({"train_epoch": idx, "total_epochs": num_train_epochs, "train_step": step, "total_steps": min(max_train_step, total_steps) if max_train_step else total_steps})
This line is for updating the progress bar on the web UI; the keys of this dict can't be modified because of Lines 82 to 85 in f26343d:

self.config["epoch_value"].put(trial.last_result["train_epoch"] + 1, block=False)
self.config["total_epochs"].put(trial.last_result["total_epochs"], block=False)
self.config["step_value"].put(trial.last_result["train_step"] + 1, block=False)
self.config["total_steps"].put(trial.last_result["total_steps"], block=False)
Thanks for the info. Let me double-check how to update this. All the keys and values will also be shown in TensorBoard.
I'll close this one. @harborn will continue to work on this and address the comments.
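As a sketch of the constraint described above: the four progress-bar keys must be kept verbatim, while extra metrics for TensorBoard can be added alongside them. The metric names train_loss and train_perplexity and the dummy values are assumptions for illustration, and report() must run inside a Ray Train worker:

```python
import math

from ray import train

# Dummy stand-ins for the loop variables shown in the diff above.
idx, num_train_epochs = 0, 1
step, total_steps, max_train_step = 10, 100, None
loss_value = 1.25

# The web UI reads these four keys from trial.last_result, so their names
# must not change; additional keys (here: hypothetical loss/perplexity
# metrics for TensorBoard) can simply be added alongside them.
train.report({
    "train_epoch": idx,
    "total_epochs": num_train_epochs,
    "train_step": step,
    "total_steps": min(max_train_step, total_steps) if max_train_step else total_steps,
    "train_loss": loss_value,
    "train_perplexity": math.exp(loss_value),
})
```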
When you start finetuning, you will see a message like the one below:

To visualize your results with TensorBoard, run:
tensorboard --logdir /xxx/ray_results/TorchTrainer_2024-01-08_16-51-19

This PR reports the training metrics (train loss and perplexity) to Ray, which writes them to TensorBoard. It also introduces a logging_steps parameter to control the logging frequency.
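Putting it together, here is a self-contained sketch of that flow (assuming Ray >= 2.7; the loop body is a stand-in, not this repo's trainer): every report() call is written under the run's ray_results directory, which TensorBoard then reads.

```python
import math

from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Stand-in training loop: report a fake loss and its perplexity each step.
    for step in range(1, 101):
        loss = 1.0 / step
        train.report({"train_loss": loss, "train_perplexity": math.exp(loss)})


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=1),
)
result = trainer.fit()
# Point TensorBoard at the result directory, e.g.:
#   tensorboard --logdir <printed path>
print(result.path)
```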