Change align_corners to True in position encoding interpolation #468
[just a curious note, not a request for changes] I agree, align_corners should be used when interpolating the position embedding to larger image sizes (I believe this is your use case). However, what worries me more is that during training this function is also used to downsize the position encodings.

Effect on gradient w.r.t. position embeddings from local crops

In our experiments, a strange position embedding tensor emerged, which we believe originates from the interpolation (resizing) of the position encodings. Has anyone observed similar behavior?

Fig. 1: ViT-T/8 architecture trained with DINOv1 from scratch on custom data. [figure omitted]

During training, gradients from both local crops (at shape 96x96) and global crops (224x224) contribute to the formation of the learned position encodings. Downsizing the position encoding tensor from 224x224 to 96x96 (global to local crop sizes) results in gradient distributions like this during backprop (assuming a constant gradient image):

Fig. 2: Constant gradient tensor [figure omitted; caption truncated]

Aligning corners appears to make only a small difference. What would really result in homogeneously distributed gradients from local and global crops is anti-aliasing, according to this image:

Fig. 3: Gradient distribution with [figure omitted; caption truncated]

Note: I previously used DINOv1. I just saw that in DINOv2, specifically in commit 9c7e324#diff-c711d58dde9a8d285c684d67f6e1872bba4220631ca0bc88f023ca3684dfb890, a parameter was introduced to enable anti-aliasing. I wonder what the reason for this was?

(dinov2/dinov2/models/vision_transformer.py, lines 203 to 208 at e1277af)
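For anyone who wants to reproduce this, here is a minimal sketch of the experiment described above (not code from the repo; the function name is ours, and we assume patch size 8 so that 224x224 / 96x96 crops correspond to 28x28 / 12x12 token grids). It backpropagates a constant gradient through the downsizing interpolation and inspects how it lands on the position-embedding grid:

```python
import torch
import torch.nn.functional as F

# Sketch of the experiment above (names and grid sizes are our assumptions).
# With patch size 8, crops of 224x224 / 96x96 give 28x28 / 12x12 token grids.
def pos_embed_grad(src=28, dst=12, align_corners=False, antialias=False):
    pe = torch.zeros(1, 1, src, src, requires_grad=True)
    small = F.interpolate(pe, size=(dst, dst), mode="bicubic",
                          align_corners=align_corners, antialias=antialias)
    small.backward(torch.ones_like(small))  # constant gradient image
    return pe.grad[0, 0]

for ac, aa in [(False, False), (True, False), (False, True)]:
    g = pos_embed_grad(align_corners=ac, antialias=aa)
    print(f"align_corners={ac!s:5} antialias={aa!s:5} "
          f"grad std={g.std():.4f} (lower = more homogeneous)")
```

(The `antialias` argument to `F.interpolate` requires PyTorch >= 1.11.)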
In addition to the comment of @maxrohleder, I would like to share the result of an additional experiment, which we conducted to investigate the influence of different parameter settings in the interpolation of the position encodings.
The figure on the right [figure omitted] compares these settings. We believe this behavior is caused by the smoother gradients backpropagating through the interpolation.
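One way to see why anti-aliasing smooths the gradients: during backprop, each source position receives gradient through the transpose of the interpolation operator, and with antialias=True the low-pass filter widens the footprint of every output pixel, spreading the gradient more evenly. A quick check of this, under the same assumptions as the sketch above:

```python
import torch
import torch.nn.functional as F

# Footprint of a single output pixel in the source grid, i.e. one row of the
# (linear) interpolation operator as seen during backprop.
def footprint(antialias, src=28, dst=12):
    pe = torch.zeros(1, 1, src, src, requires_grad=True)
    out = F.interpolate(pe, size=(dst, dst), mode="bicubic",
                        align_corners=False, antialias=antialias)
    out[0, 0, dst // 2, dst // 2].backward()
    return pe.grad[0, 0]

# antialias=True widens the support; neighboring footprints overlap more,
# which yields the more homogeneous gradient distribution described above.
print("nonzero support w/o antialias:", (footprint(False) != 0).sum().item())
print("nonzero support w/  antialias:", (footprint(True) != 0).sum().item())
```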
In the current implementation, we have align_corners=False, which, according to this post, leads to the position encodings at the edges being value-padded (i.e., taking the same values as the endpoints of the source values, the original PE) and behaving differently from the patches at the center.
We therefore suggest changing to align_corners=True to avoid this difference between the PE at the edges and the others. A small 1-D illustration follows the image below.
image credit: Lucas Beyer
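A tiny 1-D example (ours, not from the repo) of the endpoint replication described above: upsampling a linear ramp with align_corners=False flattens the values at the borders, while align_corners=True maps the endpoints exactly onto the source endpoints:

```python
import torch
import torch.nn.functional as F

# 1-D illustration: upsample a linear ramp [0, 1, 2, 3, 4] to 15 samples.
src = torch.arange(5, dtype=torch.float32).view(1, 1, -1)

up_false = F.interpolate(src, size=15, mode="linear", align_corners=False)
up_true  = F.interpolate(src, size=15, mode="linear", align_corners=True)

# align_corners=False: the outermost outputs repeat the endpoint value
# (the "value padding" effect), so edge positions behave differently.
print(up_false[0, 0, :3])  # ~tensor([0.0000, 0.0000, 0.3333])
# align_corners=True: the endpoints map exactly onto the source endpoints
# and the ramp stays linear all the way to the border.
print(up_true[0, 0, :3])   # ~tensor([0.0000, 0.2857, 0.5714])
```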