Gradient Clipping #596
Comments
I know there are multiple ways to clip gradients (e.g. PyTorch has clip_grad_norm_ and clip_grad_value_). Do we know if one of these is more widely used than the other?
I think clip_grad_norm_ is more widely used; however, it is also more complex, as it first takes the norm over all of the gradients. clip_grad_value_ is used less often, but it is far more straightforward to implement, so I think it makes sense to add that first.
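To make the difference concrete, element-wise value clipping is just a clamp on every gradient entry. Here is a minimal sketch over plain f32 slices (the name and signature are placeholders, not the actual dfdx API):

```rust
/// Clamp every gradient element into [-clip_value, clip_value].
/// Illustrative only: plain f32 slices stand in for dfdx tensors.
fn clip_grad_value(grads: &mut [&mut [f32]], clip_value: f32) {
    for grad in grads.iter_mut() {
        for g in grad.iter_mut() {
            *g = g.clamp(-clip_value, clip_value);
        }
    }
}
```

Since each element is clamped independently, no global state is needed, which is why this is so much simpler than the norm-based version (that one has to see every gradient before it can compute the scale factor).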
It should be possible to implement a general …
That seems like all that would be needed for clip_grad_value_.
The PyTorch implementations of the above are pretty straightforward: https://github.com/pytorch/pytorch/blob/master/torch/nn/utils/clip_grad.py

I would say clip_grad_norm would be required to go through the TensorCollection API, so:

```rust
model.clip_grad_norm(&mut grads, 0.5);
model.clip_grad_value(&mut grads, 0.5);
```

Then we could implement clip_grad_norm with two passes with RecursiveWalker:
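Something like the following is the shape I'd expect for the two passes. This is a rough sketch over plain f32 vectors rather than the real TensorCollection/RecursiveWalker machinery; all names here are placeholders:

```rust
/// Two-pass gradient-norm clipping that returns rescaled copies of the gradients.
/// Pass 1 walks everything to accumulate the global L2 norm,
/// pass 2 walks again to apply a single scale factor.
/// Illustrative only: plain vectors stand in for dfdx tensors.
fn clip_grad_norm(grads: &[Vec<f32>], max_norm: f32) -> Vec<Vec<f32>> {
    // Pass 1: global L2 norm over every element of every gradient tensor.
    let total_norm: f32 = grads
        .iter()
        .flat_map(|g| g.iter())
        .map(|g| g * g)
        .sum::<f32>()
        .sqrt();

    // Pass 2: scale by min(1, max_norm / total_norm) so small gradients are untouched.
    let scale = (max_norm / (total_norm + 1e-6)).min(1.0);
    grads
        .iter()
        .map(|g| g.iter().map(|x| x * scale).collect())
        .collect()
}
```

The small epsilon in the denominator guards against division by zero when every gradient is zero; PyTorch's implementation does the same.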
If we wanted this all to be in-place:
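Again a rough sketch with placeholder types, this time mutating the gradient buffers directly instead of allocating new ones:

```rust
/// Same two-pass clipping as above, but applied in place.
/// Illustrative only: plain vectors stand in for dfdx tensors.
fn clip_grad_norm_inplace(grads: &mut [Vec<f32>], max_norm: f32) {
    // Pass 1: global L2 norm.
    let total_norm: f32 = grads
        .iter()
        .flat_map(|g| g.iter())
        .map(|g| g * g)
        .sum::<f32>()
        .sqrt();

    // Pass 2: only touch the gradients if they actually exceed the threshold.
    if total_norm > max_norm {
        let scale = max_norm / (total_norm + 1e-6);
        for grad in grads.iter_mut() {
            for g in grad.iter_mut() {
                *g *= scale;
            }
        }
    }
}
```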
Also separately, the …
Has any work been done on this?
I've submitted a draft PR, and once the examples are added I'll mark it as ready for review. But so far I think it's working correctly; I've been able to avoid exploding gradients.
When training large or deep models, exploding gradients are frequent and cause instability. Clipping them to a certain small amount is an effective way of stabilizing training.

To implement this, I believe a method on the Gradients struct would be needed (correct me if I'm wrong).
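For concreteness, here is a purely hypothetical shape such a method could take. The internal layout of Gradients shown here (a flat list of f32 buffers) is an assumption for illustration only, not dfdx's actual representation:

```rust
/// Hypothetical stand-in for the library's Gradients struct.
struct Gradients {
    /// Assumption: one flat f32 buffer per parameter's gradient.
    buffers: Vec<Vec<f32>>,
}

impl Gradients {
    /// Hypothetical method: clamp every gradient element into [-clip_value, clip_value].
    fn clip_value(&mut self, clip_value: f32) {
        for buf in self.buffers.iter_mut() {
            for g in buf.iter_mut() {
                *g = g.clamp(-clip_value, clip_value);
            }
        }
    }
}
```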