This is the repository for our CS230 project, Deep Neural Vision. We will keep adding models and results as we run more experiments. So far, we have implemented the architecture from the Show, Attend, and Tell paper in PyTorch, with beam search added for decoding.
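As a rough illustration of the decoding step, here is a minimal beam search sketch in plain Python. It assumes a hypothetical `step_logprobs(seq)` callback that returns next-token log-probabilities for a partial sequence; in the real model those scores would come from the attention decoder at each time step.

```python
import math

def beam_search(step_logprobs, beam_size=3, max_len=4, eos=0):
    """Minimal beam search over a toy next-token distribution.

    `step_logprobs(seq)` returns {token: log-probability} for the next
    token given the partial sequence `seq` (a tuple of token ids).
    """
    beams = [((), 0.0)]  # (sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the top-k highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (completed if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    completed.extend(beams)
    return max(completed, key=lambda c: c[1])

# Toy distribution where greedy decoding is suboptimal: token 1 looks
# best at step one but leads to a low-probability ending.
_table = {
    (): {1: -0.1, 2: -0.2},
    (1,): {0: -3.0},
    (2,): {0: -0.1},
}

def toy_logprobs(seq):
    return _table.get(seq, {0: 0.0})
```

With `beam_size=2`, the search keeps both first tokens alive and recovers the sequence `(2, 0)` with total log-probability `-0.3`, which greedy decoding (always taking token 1 first) would miss.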
For metrics, we used pycocoeval and its pip wrapper package.
This model learns where to look: as it generates a caption word by word, you can watch its gaze shift across the image. This is made possible by its attention mechanism, which lets it focus on the part of the image most relevant to the word it is about to emit. Below are some captions generated on test images that appeared in neither the training nor the validation set.
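To make the "where to look" idea concrete, here is a minimal sketch of soft attention in plain Python. It is not the repo's implementation: the feature vectors stand in for the CNN's spatial grid of region features, the query for the decoder's hidden state, and a plain dot product replaces the small MLP the paper uses for scoring.

```python
import math

def soft_attention(features, query):
    """Soft attention over image region features.

    `features`: list of region feature vectors a_i; `query`: the
    decoder's hidden-state vector. Dot-product scores are passed
    through a softmax to get weights alpha_i summing to 1, and the
    context vector is the alpha-weighted average of the regions.
    """
    scores = [sum(a * q for a, q in zip(f, query)) for f in features]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(features[0])
    context = [
        sum(alphas[i] * features[i][d] for i in range(len(features)))
        for d in range(dim)
    ]
    return alphas, context
```

Visualizing `alphas` over the image grid at each decoding step is what produces the shifting-gaze maps: a region whose features align with the current query gets nearly all the weight.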