
normalize (sum to 1) attention score seems not right #16

Open · jihwanp opened this issue Apr 24, 2022 · 3 comments

@jihwanp commented Apr 24, 2022

Hi, thanks for sharing this nice work.

I noticed that you normalize the attention scores (so each row sums to 1), as described in the original attention rollout paper:

    I = torch.eye(attention_heads_fused.size(-1))
    a = (attention_heads_fused + 1.0*I)/2
    a = a / a.sum(dim=-1)

But when dividing by the row sums, keepdim=True should be applied so that each row actually sums to 1 after the normalization:

    a = a / a.sum(dim=-1, keepdim=True)

Maybe I'm wrong; please double-check this issue.
Thanks
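
As a quick illustration of the broadcasting behavior described above (a toy sketch with made-up shapes and values, not the repo's code):

    import torch

    torch.manual_seed(0)
    # Toy 2D "attention" matrix whose rows do NOT already sum to 1
    # (e.g. after some entries have been zeroed out).
    a = torch.rand(4, 4)

    # Without keepdim: a.sum(dim=-1) has shape (4,), which broadcasts
    # along the LAST axis, so column j gets divided by the sum of row j.
    wrong = a / a.sum(dim=-1)
    print(wrong.sum(dim=-1))  # rows generally do not sum to 1

    # With keepdim=True: the sums have shape (4, 1) and broadcast along
    # the rows, so each row is divided by its own sum.
    right = a / a.sum(dim=-1, keepdim=True)
    print(right.sum(dim=-1))  # all ones (up to floating point)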

@vivekh2000 commented May 16, 2024

@jacobgil, thanks for the code. I think the following line in the code is redundant:

    a = a / a.sum(dim=-1)

Reason: I have attached a screenshot of the original paper (page 3) below.

[screenshot: excerpt from page 3 of the attention rollout paper]

There, the authors state that the attention matrix W_attn is already normalized. The identity matrix I is also a normalized matrix (each of its rows sums to one), so multiplying (W_attn + I) by 0.5 again yields a normalized matrix.

Also, at line 10,

    result = torch.eye(attentions[0].size(-1))

result is initialized as an identity matrix, whereas at line 33,

    result = torch.matmul(a, result)

the matrix a is multiplied by result (an identity matrix), which should always just give back a. Further, the recursive multiplication described in the original paper does not appear to be implemented. Anyway, thanks for the nice implementation of the techniques.
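
A quick numeric check of the redundancy claim (a sketch with toy values, not the repo's code): if the fused attention rows already sum to 1, averaging with I keeps them summing to 1, and the extra division changes nothing:

    import torch

    torch.manual_seed(0)
    N = 5
    # Row-normalized attention, as produced by a softmax over the last dim.
    attn = torch.softmax(torch.randn(N, N), dim=-1)
    I = torch.eye(N)
    a = (attn + I) / 2

    # Each row of a already sums to 1 ...
    print(a.sum(dim=-1))  # all (approximately) ones
    # ... so dividing by the row sums (done correctly, with keepdim=True)
    # is a no-op.
    print(torch.allclose(a, a / a.sum(dim=-1, keepdim=True)))  # True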

@eneserdo commented Aug 2, 2024

@vivekh2000 I did not check the paper, but @jacobgil also implemented a discard_ratio, which may not be in the paper; discarding entries obviously breaks the normalization of the matrix, so it is necessary to re-normalize it. Also, I agree with @jihwanp: there should be keepdim=True.
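
To illustrate this point, here is a simplified sketch of a discard step (the repo's actual discard_ratio logic differs in details, e.g. it operates per head and spares the class token): zeroing out low entries breaks the row normalization, so re-normalizing afterwards is needed:

    import torch

    torch.manual_seed(0)
    N = 6
    attn = torch.softmax(torch.randn(N, N), dim=-1)
    a = (attn + torch.eye(N)) / 2  # rows sum to 1 at this point

    # Zero out the smallest 50% of entries, mimicking a discard step.
    flat = a.flatten()
    _, idx = flat.topk(int(flat.numel() * 0.5), largest=False)
    flat[idx] = 0
    a = flat.reshape(N, N)

    print(a.sum(dim=-1))  # rows no longer sum to 1 (in general)
    a = a / a.sum(dim=-1, keepdim=True)  # re-normalize, with the keepdim fix
    print(a.sum(dim=-1))  # back to ones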

@gbZachYin

keepdim=True should be correct
