Semantic segmentation, image captioning, and image-text retrieval are implemented using self-supervised learning and image-text captioning. The code was run on Linux inside a conda virtual environment.
Goals of the project
1. Self-supervised learning with the DINOv2 model (a feature-matching sketch follows this list):
1.1 Nearest Neighbor Features with Negative Feature Space Distance as Score
1.2 Visualization of Nearest Neighbors and Calculating Crop Error with Feature Distance Ranking
1.3 Nearest Neighbor Features with Negative Cycle Distance as Score
1.4 Visualization of Nearest Neighbors and Calculating Crop Error with Cycle Distance Ranking
1.5 Nearest Neighbor Matching with One-to-Many Frames
1.6 Visualization of Nearest Neighbors Matches for One-to-Many Frames
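The items above revolve around matching DINOv2 patch features between frames under two scoring schemes. As a rough illustration, here is a minimal sketch of both; the helper names (`patch_features`, `nn_match`, `cycle_scores`) are hypothetical, and the exact distance and crop-error definitions may differ from the project's code. Only the torch.hub DINOv2 entry point is taken from the official repository.

```python
import torch
import torch.nn.functional as F

# Load a small DINOv2 backbone from torch.hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

@torch.no_grad()
def patch_features(image):
    # image: (3, H, W) with H and W multiples of the 14-pixel patch size
    out = model.forward_features(image.unsqueeze(0))
    return F.normalize(out["x_norm_patchtokens"][0], dim=-1)  # (num_patches, dim)

def nn_match(feats_a, feats_b):
    """Match each patch of A to its nearest patch in B; the score is the
    negative feature-space (L2) distance, so higher means a better match."""
    dists = torch.cdist(feats_a, feats_b)       # (Na, Nb)
    scores, idx = (-dists).max(dim=1)
    return idx, scores

def cycle_scores(feats_a, feats_b, grid):
    """Score matches by cycle consistency: map A -> B, then B -> A, and use
    the negative spatial distance back to the starting patch as the score."""
    ab, _ = nn_match(feats_a, feats_b)          # A -> B
    ba, _ = nn_match(feats_b, feats_a)          # B -> A
    back = ba[ab]                               # A patch reached after the round trip
    return -torch.norm(grid - grid[back], dim=1)

# Placeholder frames; in practice these would be normalized video frames/crops.
img_a, img_b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
fa, fb = patch_features(img_a), patch_features(img_b)
matches, scores = nn_match(fa, fb)

side = int(fa.shape[0] ** 0.5)                  # 16x16 patch grid for 224/14
ys, xs = torch.meshgrid(torch.arange(side), torch.arange(side), indexing="ij")
grid = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()
cyc = cycle_scores(fa, fb, grid)
```

For the one-to-many case (1.5, 1.6), the same `nn_match` call is simply repeated against the features of each candidate frame, keeping the frame with the highest score per patch.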
2. Image-Text Captioning: complete captions are generated with (see the decoding sketch after this list):
2.1 Greedy Search
2.2 Sampling
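As a rough illustration of the two decoding strategies, here is a minimal sketch using the Hugging Face BLIP captioning checkpoint as a stand-in; the project's own captioning model and decoding loop may differ, and `example.jpg` is a hypothetical input file.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg")               # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

# Greedy search: always take the most probable next token.
greedy_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print("greedy:", processor.decode(greedy_ids[0], skip_special_tokens=True))

# Sampling: draw the next token from the (temperature-scaled) distribution,
# so repeated runs yield different captions.
sample_ids = model.generate(**inputs, max_new_tokens=30,
                            do_sample=True, top_k=50, temperature=0.9)
print("sampled:", processor.decode(sample_ids[0], skip_special_tokens=True))
```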
3. Image-Text Retrieval is performed using the BLIP model, as sketched below.
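A minimal sketch of scoring image-caption pairs with BLIP's image-text matching (ITM) head. The Hugging Face checkpoint, the example file name, and the candidate captions are assumptions for illustration; the project's retrieval setup may differ in detail.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
model.eval()

image = Image.open("example.jpg")               # hypothetical query image
captions = ["a dog playing in the park", "a plate of food on a table"]

# Score every (image, caption) pair; the ITM head outputs match/no-match logits.
scores = []
with torch.no_grad():
    for text in captions:
        inputs = processor(images=image, text=text, return_tensors="pt")
        itm_logits = model(**inputs).itm_score                   # shape (1, 2)
        scores.append(itm_logits.softmax(dim=-1)[0, 1].item())   # P(match)

best = max(range(len(captions)), key=lambda i: scores[i])
print("best caption:", captions[best])
```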
The project was built using Python libraries. Anaconda was used for package installation and environment management. Special thanks to the open-source community for making such great contributions available.
A detailed report for the project has been attached.