normalize (sum to 1) attention score seems not right #16
@jacobgil, thanks for the code. I think the following line in the code is redundant: Line 31 in 15a81d3.
Reason: I have attached a screenshot of the original paper (page 3) below. There, the authors state that the W_attn matrix is already normalized. Since the identity matrix I is also a normalized matrix (all of its columns sum to one), multiplying W_attn plus I by 0.5 already yields a normalized matrix, so the extra normalization step is unnecessary.
Also, at Line 10 in 15a81d3, result is an identity matrix, whereas at Line 33 in 15a81d3 (result = torch.matmul(a, result)), the matrix a and the matrix result (an identity matrix) are multiplied, which should always just give a. Further, the recursive multiplication described in the original paper is not implemented. Anyway, thanks for the nice implementation of the technique.
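For reference, here is a minimal sketch of the rollout step under discussion (variable and function names are illustrative, not the repository's exact code):

```python
import torch

def attention_rollout(attentions):
    # attentions: list of [tokens, tokens] attention matrices,
    # each row-normalized (rows sum to 1, e.g. softmax outputs).
    result = torch.eye(attentions[0].size(-1))  # starts as identity (cf. Line 10)
    for attn in attentions:
        identity = torch.eye(attn.size(-1))
        # Account for residual connections: 0.5 * (A + I).
        # If every row of A sums to 1 and every row of I sums to 1,
        # then every row of 0.5 * (A + I) already sums to 1, which is
        # why the extra re-normalization (cf. Line 31) looks redundant.
        a = 0.5 * (attn + identity)
        # Recursive multiplication across layers (cf. Line 33): on the
        # first iteration result is the identity, so this returns a;
        # on later iterations result accumulates the product over layers.
        result = torch.matmul(a, result)
    return result

# Example: three random row-normalized attention matrices.
layers = [torch.softmax(torch.rand(5, 5), dim=-1) for _ in range(3)]
print(attention_rollout(layers).sum(dim=-1))  # each row sums to ~1
```

Note that on iterations after the first, result is no longer the identity, so the product does accumulate across layers.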
@vivekh2000 I did not check the paper, but I think @jacobgil also implemented the recursive multiplication.
keepdim=True should be correct.
Hi, thanks for sharing this nice work.
I noticed that you normalize the attention scores (each row sums to 1), as mentioned in the original attention rollout paper.
But it seems that when dividing by the row-wise sum of the attention scores, keepdim=True should be applied to ensure that each row actually sums to 1 after normalization. Maybe I'm wrong; please double-check this issue.
Thanks
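For concreteness, here is a small sketch of the broadcasting issue being described (a minimal example assuming a square attention matrix, not the repository's exact code):

```python
import torch

attn = torch.rand(4, 4)

# Without keepdim, the sum has shape [4]; broadcasting then divides
# each column by the row-sum vector, so rows do NOT sum to 1.
wrong = attn / attn.sum(dim=-1)

# With keepdim=True, the sum has shape [4, 1] and broadcasts across
# each row, so every row sums to 1 as intended.
right = attn / attn.sum(dim=-1, keepdim=True)

print(wrong.sum(dim=-1))  # generally not all ones
print(right.sum(dim=-1))  # tensor([1., 1., 1., 1.])
```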