
T5FineTuner issue "in training_epoch_end avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean() " #8

Open
GeYue opened this issue Dec 28, 2020 · 4 comments

Comments


GeYue commented Dec 28, 2020

Hi Suraj,
I am trying to use your T5FineTuner class to learn fine-tuning.
Unfortunately, when I run the program in my environment, I get this error:

in training_epoch_end
avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
RuntimeError: stack expects a non-empty TensorList

I tried to track down the cause and found that training_step is never called.
I suspect it is related to the ImdbDataset used for the train_dataloader, but I debugged it and it seems fine.
I have only just started with deep learning, so I may be missing something obvious.
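One quick sanity check (a generic sketch using a dummy TensorDataset, not the notebook's real dataset) is to confirm the train dataloader actually yields batches, since an empty dataloader means training_step never runs and training_epoch_end receives an empty outputs list:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the notebook's dataset class; the point is
# only the check itself, not the real data pipeline.
dataset = TensorDataset(torch.randn(8, 4))
loader = DataLoader(dataset, batch_size=2)

# If len(loader) is 0, training_step is never invoked and the
# torch.stack call in training_epoch_end fails with an empty list.
print(len(loader))  # number of batches the trainer will see
```

If the length comes back as zero, the dataset construction (file paths, tokenization) is the first thing to re-check.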

Do you have any idea what might be causing this?
Thank you, and I look forward to any feedback.

Best Regards

@MarcosFP97

Hi! I had the same problem and figured out it was a package-version issue. To make this notebook work properly, you need to use these versions:

!pip install transformers==2.9.0 
!pip install pytorch_lightning==0.7.5

@MarcosFP97

I have created a PR, but meanwhile you can download the fixed notebook from my fork: here

Best,
Marcos


Jackthebighead commented Nov 9, 2021

Thanks @MarcosFP97 for the answer. I hit the same issue, and the loss was 'nan' during training; switching to the right package versions solves it.

Alternatively, the problem may be caused by the self-defined optimizer_step function. Another fix is to pass closure=optimizer_closure to optimizer.step() inside optimizer_step(). This works because the overridden optimizer_step() needs the closure to run the last training forward/backward pass and return the loss that feeds the progress bar via tqdm_dict.

This solved my problem without changing the package versions. For example, pass closure=optimizer_closure in the function:

def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                   optimizer_closure, on_tpu=False, using_native_amp=False,
                   using_lbfgs=False):
    if self.trainer.use_tpu:
        xm.optimizer_step(optimizer)
    else:
        # Pass the closure so the forward/backward pass runs inside step()
        optimizer.step(closure=optimizer_closure)
    optimizer.zero_grad()
    self.lr_scheduler.step()
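To see why the closure matters, here is a minimal standalone sketch (plain PyTorch, not the notebook's code): optimizer.step(closure) calls the closure, which recomputes the loss and runs backward(), and then applies the update. If an overridden optimizer_step forgets to pass optimizer_closure, no forward or backward pass happens at all.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 2), torch.randn(8, 1)

def closure():
    # The closure re-evaluates the loss and computes gradients;
    # Lightning wraps training_step + backward in optimizer_closure.
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

# step() invokes the closure first, then updates the weights,
# and returns whatever the closure returned (here, the loss).
loss = opt.step(closure)
print(loss.item())
```

The same mechanism is why skipping the closure in a custom optimizer_step silently trains nothing: the weights are "updated" with zero gradients.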

@MarcosFP97

Thanks for your comments @Jackthebighead!
