
[WIP][Finetune] Report training metrics to Tensorboard #40

Closed · wants to merge 10 commits

Conversation

carsonwang (Contributor):

When you start finetuning, you will see a message like the one below:
To visualize your results with TensorBoard, run: tensorboard --logdir /xxx/ray_results/TorchTrainer_2024-01-08_16-51-19

This PR reports the training loss and perplexity metrics to Ray, which writes them to TensorBoard.
It also introduces a logging_steps parameter to control the logging frequency.
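The reporting flow described above can be sketched in isolation. This is a hypothetical simplification of the trainer loop, not code from this PR: `report` stands in for Ray Train's `ray.train.report`, and `losses` stands in for per-step loss values.

```python
import math

def training_loop(losses, logging_steps=10, report=print):
    """Report loss and perplexity every `logging_steps` steps (sketch)."""
    completed_steps = 0
    reported_at = []
    for loss in losses:
        completed_steps += 1
        # Only emit metrics on multiples of logging_steps to limit
        # how often TensorBoard receives a data point.
        if completed_steps % logging_steps == 0:
            report({"train_loss": loss, "perplexity": math.exp(loss)})
            reported_at.append(completed_steps)
    return reported_at
```

With `logging_steps=10` and 25 steps, metrics are reported at steps 10 and 20 only.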

@@ -130,7 +131,7 @@ def prepare(self, model, tokenizer, dataset, optimizer, accelerator):
     def train(self):
         num_train_epochs = self.config.get("num_train_epochs", 1)
         checkpoint = self.config.get("checkpoint")
-        log_step = self.config.get("log_step", 1)
+        logging_steps = self.config.get("logging_steps")
Contributor:

Please add a default value of 1, because the UI depends on it.

Contributor Author:

OK. I've set the default to 10 in the configuration file. Is that fine for the UI?

Contributor:

Got it. It's because start_ui.py currently does not use finetune.yaml, so 'logging_steps' will be None. Alternatively, I can modify start_ui.py later to read the configuration from finetune.yaml.

Contributor Author:

I understand now; sorry I didn't say it clearly. I mean I will update this line to logging_steps = self.config.get("logging_steps", 10), using 10 instead of 1 as the default.

Contributor:

OK, but the progress bar needs the values passed to report({}) to be updated every step; otherwise its status will only update every 10 steps.

Contributor Author:

I see. OK, I'll set it to 1 here.

self.completed_steps += 1

if self.completed_steps % logging_steps == 0:
    perplexity = math.exp(loss)
Contributor:

Would it be better to use loss.item() here?

Contributor Author:

Yes, thanks. I will update it.
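The reviewer's suggestion can be sketched as follows. This is a hypothetical helper, not code from this PR; the duck-typed fallback for plain floats is an assumption added for illustration.

```python
import math

def compute_perplexity(loss):
    # In PyTorch, `loss` is typically a 0-dim tensor; loss.item() converts
    # it to a plain Python float, so the logged value does not keep the
    # autograd graph (or GPU memory) alive and formats cleanly.
    value = loss.item() if hasattr(loss, "item") else float(loss)
    return math.exp(value)
```

A zero loss yields a perplexity of exactly 1.0, which is a handy sanity check.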

carsonwang pushed a commit to carsonwang/llm-on-ray that referenced this pull request Jan 9, 2024
@carsonwang changed the title from "[Finetune] Report training metrics to Tensorboard" to "[WIP][Finetune] Report training metrics to Tensorboard" on Jan 10, 2024.

@carsonwang (Contributor Author):

I will also update the logging format and data.

@@ -147,12 +148,19 @@ def train(self):
    if self.lr_scheduler is not None:
        self.lr_scheduler.step()
    self.optimizer.zero_grad()
    if step % log_step == 0:
        logger.info(f"train epoch:[{idx}/{num_train_epochs}]\tstep:[{step}/{total_steps}]\tloss:{loss:.6f}\tppl:{math.exp(loss):.6f}\ttime:{time.time()-start:.6f}")
    report({"train_epoch": idx, "total_epochs": num_train_epochs, "train_step": step, "total_steps": min(max_train_step, total_steps) if max_train_step else total_steps})
@KepingYan (Contributor), Jan 10, 2024:

This line updates the progress bar on the web UI; the keys of this dict can't be modified because of:

self.config["epoch_value"].put(trial.last_result["train_epoch"] + 1, block=False)
self.config["total_epochs"].put(trial.last_result["total_epochs"], block=False)
self.config["step_value"].put(trial.last_result["train_step"] + 1, block=False)
self.config["total_steps"].put(trial.last_result["total_steps"], block=False)
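For context, the UI side of that contract can be exercised with standard-library queues. `push_progress` is a hypothetical name; the dict keys and `.put(..., block=False)` calls mirror the start_ui.py snippet quoted above.

```python
from queue import Queue

def push_progress(config, last_result):
    # Mirrors the start_ui.py snippet: each reported value is pushed into a
    # bounded queue that the web UI polls to redraw the progress bar.
    # Renaming any of these keys on the trainer side would break this.
    config["epoch_value"].put(last_result["train_epoch"] + 1, block=False)
    config["total_epochs"].put(last_result["total_epochs"], block=False)
    config["step_value"].put(last_result["train_step"] + 1, block=False)
    config["total_steps"].put(last_result["total_steps"], block=False)

config = {k: Queue(maxsize=1) for k in
          ("epoch_value", "total_epochs", "step_value", "total_steps")}
push_progress(config, {"train_epoch": 0, "total_epochs": 1,
                       "train_step": 4, "total_steps": 100})
```

Note that the UI displays 1-based epoch and step numbers, hence the `+ 1` on the 0-based values from `trial.last_result`.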

Contributor Author:

Thanks for the info. Let me double-check how to update this; all of these keys and values will also show up in TensorBoard.
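One way to reconcile the two requirements raised in this thread (progress keys reported every step for the UI, loss/perplexity attached only every `logging_steps` steps for TensorBoard) is sketched below. `build_report` is a hypothetical helper, not code from this PR; the progress key names are the ones the web UI expects.

```python
import math

def build_report(step, total_steps, epoch, total_epochs,
                 loss=None, logging_steps=10):
    # Progress keys are always present so the web UI's progress bar
    # updates every step.
    metrics = {
        "train_epoch": epoch,
        "total_epochs": total_epochs,
        "train_step": step,
        "total_steps": total_steps,
    }
    # Loss/perplexity are attached only every `logging_steps` steps so
    # TensorBoard is not flooded with near-duplicate points.
    if loss is not None and step % logging_steps == 0:
        metrics["train_loss"] = loss
        metrics["perplexity"] = math.exp(loss)
    return metrics
```

The resulting dict would then be passed to a single report(...) call per step, keeping the UI contract intact while throttling the TensorBoard metrics.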

@carsonwang (Contributor Author):

I'll close this one. @harborn will continue to work on this and address the comments.

@carsonwang closed this on Jan 22, 2024.