Subtle difference with Pytorch AdamW? #35504

kyleliang919 · 2025-01-04T04:09:40Z

transformers/src/transformers/optimization.py

Line 648 in e5fd865

denom = exp_avg_sq.sqrt().add_(group["eps"])

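Here the bias correction is applied after epsilon has been added to the denominator, whereas PyTorch bias-corrects the second-moment estimate first and only then adds epsilon: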
https://pytorch.org/docs/stable/_modules/torch/optim/adamw.html#AdamW

step = _get_value(step_t)

bias_correction1 = 1 - beta1**step
bias_correction2 = 1 - beta2**step

step_size = lr / bias_correction1

bias_correction2_sqrt = bias_correction2**0.5

if amsgrad:
    # Maintains the maximum of all 2nd moment running avg. till now
    torch.maximum(max_exp_avg_sqs[i], exp_avg_sq, out=max_exp_avg_sqs[i])

    # Use the max. for normalizing running avg. of gradient
    denom = (max_exp_avg_sqs[i].sqrt() / bias_correction2_sqrt).add_(eps)
else:
    denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)

param.addcdiv_(exp_avg, denom, value=-step_size)
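To make the difference concrete, here is a minimal scalar sketch of the two orderings. It is not code from either library. The function names are hypothetical. The `hf_style_update` body assumes that, after the quoted `denom` line, the transformers optimizer scales the step size by `sqrt(bias_correction2) / bias_correction1`, which the excerpt above does not show. Under that assumption the two variants agree when `eps` is zero, and otherwise the transformers ordering behaves like PyTorch with an effective epsilon of `eps / sqrt(bias_correction2)`.

```python
import math


def hf_style_update(exp_avg, exp_avg_sq, lr, beta1, beta2, eps, step):
    """Epsilon is added to sqrt(v) first; bias correction goes into the step size.

    Assumption: step_size = lr * sqrt(bc2) / bc1, not shown in the quoted line.
    """
    denom = math.sqrt(exp_avg_sq) + eps
    bias_correction1 = 1.0 - beta1 ** step
    bias_correction2 = 1.0 - beta2 ** step
    step_size = lr * math.sqrt(bias_correction2) / bias_correction1
    return step_size * exp_avg / denom


def torch_style_update(exp_avg, exp_avg_sq, lr, beta1, beta2, eps, step):
    """sqrt(v) is bias-corrected first; epsilon is added afterwards."""
    bias_correction1 = 1.0 - beta1 ** step
    bias_correction2 = 1.0 - beta2 ** step
    step_size = lr / bias_correction1
    denom = math.sqrt(exp_avg_sq) / math.sqrt(bias_correction2) + eps
    return step_size * exp_avg / denom


if __name__ == "__main__":
    # Illustrative values only: first step with a small gradient.
    lr, beta1, beta2, eps, step = 1e-3, 0.9, 0.999, 1e-8, 1
    grad = 1e-4
    exp_avg = (1.0 - beta1) * grad          # first moment after one step
    exp_avg_sq = (1.0 - beta2) * grad ** 2  # second moment after one step

    hf = hf_style_update(exp_avg, exp_avg_sq, lr, beta1, beta2, eps, step)
    pt = torch_style_update(exp_avg, exp_avg_sq, lr, beta1, beta2, eps, step)
    print("transformers-style update:", hf)
    print("PyTorch-style update:     ", pt)
    print("ratio hf / pt:            ", hf / pt)
```

The gap is largest in the first steps, when `bias_correction2` is small, and for parameters whose gradients are comparable in size to `eps`. As `step` grows, `bias_correction2` approaches 1 and the two updates converge.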
