
Change align_corners to True in position encoding interpolation #468

Open

infinity1096 opened this issue Sep 24, 2024 · 2 comments

@infinity1096

In the current implementation we have align_corners=False, which, according to this post, leads the position encoding at the edges to be value-padded (i.e., to take the same value as the endpoints of the source values, the original PE) and to behave differently from the PE of patches at the center.

We therefore suggest changing to align_corners=True to avoid this discrepancy between the PE at the edges and the PE elsewhere.

[Figure: interpolation grids with align_corners=False vs. align_corners=True; image credit: Lucas Beyer]
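
To make the effect concrete, here is a minimal toy sketch (not the repo's code, just an assumed 1-D example) showing how align_corners changes the behavior at the edges:

import torch
import torch.nn.functional as F

# Toy 1-D "position encoding" with 4 source positions, upsampled to 12 positions.
pe = torch.arange(4, dtype=torch.float32).view(1, 1, 4)

up_pad = F.interpolate(pe, size=12, mode="linear", align_corners=False)
up_corners = F.interpolate(pe, size=12, mode="linear", align_corners=True)

# With align_corners=False the first and last two outputs repeat the boundary
# values (0., 0., ..., 3., 3.), i.e. the edges are effectively value-padded.
# With align_corners=True the outputs increase strictly and hit both endpoints exactly.
print(up_pad.flatten())
print(up_corners.flatten())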

@maxrohleder commented Nov 20, 2024

[just a curious note, not a request for changes]

I agree, corner alignment should be used when interpolating the position embedding to larger image sizes (I believe this is your use case). However, what worries me more is that during training this function is used to downsize the pos_enc tensor to the resolution of the local crops, and gradients are backpropagated through it!

Effect on gradient wrt position embeddings from local crops

In our experiments, a strange position embedding tensor emerged which we believe originates from the interpolation (resizing) of position encodings. Has anyone observed similar behavior?


Fig. 1: ViT-T-8 architecture trained with DINOv1 from scratch on custom data.

During training, gradients from both local crops (at shape 96x96) and global crops (224x224) contribute to the formation of the learned position encodings. Downsizing the position encoding tensor from 224x224 to 96x96 (global to local crop sizes) results in gradient distributions like this during backprop (assuming a constant gradient image):


Fig. 2: Constant gradient tensor (torch.ones) backpropagated through interpolation downsizing (F.interpolate()) with and without align_corners
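
For anyone who wants to reproduce this, a rough sketch of the experiment (the 28x28 → 12x12 grid sizes, i.e. 224/8 and 96/8 patches, are assumptions for illustration, not the exact training code):

import torch
import torch.nn.functional as F

def grad_through_downsize(align_corners, antialias=False):
    # 28x28 -> 12x12 mirrors downsizing the PE grid from global (224x224)
    # to local (96x96) crops for a patch size of 8.
    pe = torch.zeros(1, 1, 28, 28, requires_grad=True)
    out = F.interpolate(pe, size=(12, 12), mode="bicubic",
                        align_corners=align_corners, antialias=antialias)
    # Backpropagate a constant gradient image (torch.ones), as in Fig. 2.
    out.backward(torch.ones_like(out))
    return pe.grad[0, 0]

g_no_align = grad_through_downsize(align_corners=False)
g_align = grad_through_downsize(align_corners=True)
# Both gradient maps show an uneven, checkerboard-like accumulation pattern;
# aligning corners mostly changes the behaviour near the borders.
print(g_no_align.std().item(), g_align.std().item())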

Aligning corners appears to make only a small difference. What would really result in homogeneously distributed gradients from local and global crops is anti-aliasing, as shown in the following image:


Fig. 3: Gradient distribution with F.interpolate(x, antialias=True, align_corners=False)
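
The anti-aliased variant of the same sketch (the PyTorch keyword is antialias; sizes again assumed as 28x28 → 12x12):

import torch
import torch.nn.functional as F

pe = torch.zeros(1, 1, 28, 28, requires_grad=True)
out = F.interpolate(pe, size=(12, 12), mode="bicubic",
                    align_corners=False, antialias=True)
out.backward(torch.ones_like(out))
# With antialias=True every source position receives a much more uniform share
# of the constant output gradient, matching the homogeneous pattern in Fig. 3.
print(pe.grad.std().item())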

Note: I previously used DINOv1. I just saw that in DINOv2, specifically this commit 9c7e324#diff-c711d58dde9a8d285c684d67f6e1872bba4220631ca0bc88f023ca3684dfb890, a parameter was introduced to enable anti-aliasing. I wonder what the reason for this was?

patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.reshape(1, M, M, dim).permute(0, 3, 1, 2),
    mode="bicubic",
    antialias=self.interpolate_antialias,
    **kwargs,
)

@JoshuaScheuplein

In addition to @maxrohleder's comment, I would like to share the result of an additional experiment we conducted to investigate the influence of different parameter settings in torch.nn.functional.interpolate().

[Image 1: mode=bicubic, antialias=False, align_corners=False]
[Image 2: mode=bicubic, antialias=True, align_corners=False]

The second image (antialias=True) shows that anti-aliasing produces a learned position encoding without a checkerboard pattern, unlike the initial case with antialias=False. Additionally, the PCA-reduced features show how the model nicely captures different spatial directions in the encoding (e.g., from left to right or from top to bottom).
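
For context, a small sketch of how such a PCA visualization of the learned position encoding can be produced; pos_embed and the 28x28 grid below are illustrative assumptions, not the actual checkpoint layout:

import torch

# Learned position encoding: one row per patch position (28x28 grid, dim 192 assumed).
pos_embed = torch.randn(28 * 28, 192)

centered = pos_embed - pos_embed.mean(dim=0)
_, _, v = torch.pca_lowrank(centered, q=3)          # top-3 principal directions
pe_rgb = (centered @ v).reshape(28, 28, 3)          # project and fold back into the grid
pe_rgb = (pe_rgb - pe_rgb.min()) / (pe_rgb.max() - pe_rgb.min())  # scale to [0, 1]

# pe_rgb can be displayed as an RGB image; smooth left-right / top-bottom ramps
# in its channels correspond to the spatial directions described above.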

We believe this behavior is caused by the smoother gradients backpropagated through nn.functional.interpolate(antialias=True) compared to nn.functional.interpolate(antialias=False) (see comment #468 (comment)). We would really appreciate it if you could briefly verify our solution and explain whether self.interpolate_antialias in DINOv2 (commit 9c7e324) was introduced for the same reason.
