Unit 3 : Vision Transformers / Transfer Learning & Fine-Tuning Chapter Content #204

Merged
38 commits merged on Mar 13, 2024

Changes from 37 commits

Commits (38)
73d9c0d
added mdx
asusevski Feb 4, 2024
f603ab1
Add mdx for image segmentation with vision transformers
hanouticelina Feb 5, 2024
aaba4e6
image classification mdx
shreydan Feb 5, 2024
75f7331
Merge pull request #24 from shreydan/add-mdx-kd
asusevski Feb 5, 2024
9f18e18
rename, add titles
shreydan Feb 6, 2024
eabb8db
Add Colab link
hanouticelina Feb 6, 2024
7183a81
add colab button
asusevski Feb 6, 2024
33edf14
Merge pull request #25 from shreydan/anthony-add-colab
asusevski Feb 7, 2024
f511ac1
added transfer learning
sezan92 Feb 7, 2024
db943e1
updated summary
sezan92 Feb 7, 2024
1ebfc9d
Merge pull request #26 from shreydan/vit-transfer-learning-od
shreydan Feb 7, 2024
c00d727
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
147061b
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
801776e
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
73316e1
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
255c60f
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
85721ee
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
fc08fa8
Update chapters/en/Unit 3 - Vision Transformers/Vision Transformers f…
asusevski Feb 10, 2024
b9a22c6
correct as per suggestions
shreydan Feb 11, 2024
2673f50
corrections
shreydan Feb 11, 2024
4f0dcd9
Fixes after review
hanouticelina Feb 11, 2024
778481c
Merge branch 'main' of github.com:shreydan/computer-vision-course
hanouticelina Feb 11, 2024
bb12e2c
update as per suggestions
shreydan Feb 16, 2024
c93f86a
Fixes after 2nd review
hanouticelina Feb 16, 2024
7325118
Merge branch 'main' of github.com:shreydan/computer-vision-course
hanouticelina Feb 16, 2024
206c08a
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 19, 2024
85fcb65
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 19, 2024
85401ea
fix typo
asusevski Feb 22, 2024
89a97ed
add ref to KL divergence wiki page
asusevski Feb 22, 2024
c8a95b0
Merge branch 'johko:main' into main
shreydan Mar 1, 2024
4a7d956
updated table of contents
shreydan Mar 1, 2024
3d30a27
Update _toctree.yml
shreydan Mar 1, 2024
870496a
fixes
merveenoyan Mar 12, 2024
1181ecd
Fix grammatical mistakes
merveenoyan Mar 12, 2024
d6b6099
Fix grammar errors
merveenoyan Mar 12, 2024
fbf87b7
Update Vision Transformers for Image Segmentation.mdx
merveenoyan Mar 12, 2024
34bc730
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
merveenoyan Mar 12, 2024
4b24343
Merge branch 'main' into main
merveenoyan Mar 12, 2024
2 changes: 1 addition & 1 deletion chapters/en/Unit 13 - Outlook/hyena.mdx
@@ -67,7 +67,7 @@ This network takes the positional index and potentially positional encodings as
This implies that instead of learning the values of the convolution filter directly, we learn a mapping from a temporal positional encoding to the values, which is more computationally efficient, especially for long sequences.

<Tip>
It's important to note that the mapping function $\gamma_{\theta}$ can be conceptualized within various abstract models, such Neural Field or State Space Models (S4) as discussed in [H3 paper](https://arxiv.org/abs/2212.14052).
It's important to note that the mapping function can be conceptualized within various abstract models, such Neural Field or State Space Models (S4) as discussed in [H3 paper](https://arxiv.org/abs/2212.14052).
</Tip>
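
To make the idea concrete, here is a minimal sketch of a filter parameterized by a small network over positional encodings; the plain MLP, the dimensions, and the variable names are illustrative assumptions, not the Hyena authors' implementation:

```python
import torch
import torch.nn as nn

class ImplicitFilter(nn.Module):
    """Maps positional encodings to filter values instead of storing the filter directly."""

    def __init__(self, pos_dim: int = 16, hidden_dim: int = 64, channels: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, channels),
        )

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (seq_len, pos_dim) encodings of the time indices
        return self.mlp(positions)  # (seq_len, channels): one filter value per position and channel

# A filter as long as the sequence is generated on the fly from the positions,
# so the parameter count does not grow with the sequence length.
positions = torch.randn(1024, 16)  # stand-in for real positional encodings
filter_values = ImplicitFilter()(positions)
```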

### Implicit convolutions
@@ -84,4 +84,5 @@ class GoogLeNet(nn.Module):
x = self.pre_layers(x)
x = self.inception_blocks(x)
x = self.output_net(x)
return F.softmax(x, dim=1)
return F.softmax(x, dim=1)
```
@@ -0,0 +1,42 @@
# Knowledge Distillation with Vision Transformers

We are going to learn about Knowledge Distillation, the method behind [DistilGPT2](https://huggingface.co/distilgpt2) and [DistilBERT](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of *the most downloaded models on the Hugging Face Hub!*

Presumably, we've all had teachers who "teach" by simply providing us with the correct answers and then testing us on questions we haven't seen before. This is analogous to supervised learning, where we give a model a labeled dataset to train on.
Instead of having a model train only on labels, however, we can use [Knowledge Distillation](https://arxiv.org/abs/1503.02531) to arrive at a much smaller model that performs comparably to the larger model and runs much faster to boot.

## Intuition Behind Knowledge Distillation

Imagine you were given this multiple-choice question:

![Multiple Choice Question](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/multiple-choice-question.png)

If you had someone just tell you, "The answer is Draco Malfoy," that doesn't teach you a whole lot about each of the characters' relative relationships with Harry Potter.

On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and
I am very confident that it *is* Draco Malfoy", this gives you some information about these characters' relationships to Harry Potter!
This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.

## Distilling the Knowledge in a Neural Network

In the paper [*Distilling the Knowledge in a Neural Network*](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation,
taking inspiration from *insects*, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can
initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.

The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this
by re-writing the loss function to contain a *distillation loss*, which encourages the student model's distribution over the output space to approximate the teacher's.

The distillation loss is formulated as:

![Distillation Loss](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/KL-Loss.png)

The KL loss refers to the [Kullback-Leibler Divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) between the teacher and the student's output distributions.
The overall loss for the student model is then formulated as the sum of this distillation loss with the standard cross-entropy loss over the ground-truth labels.
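
As a rough sketch of what this combined loss can look like in PyTorch (the temperature, the weighting factor `alpha`, and the exact way the two terms are combined are illustrative choices; the notebook may do it differently):

```python
import torch.nn.functional as F

def knowledge_distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soften both output distributions with the temperature before comparing them.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between the teacher's and the student's output distributions.
    distillation = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2
    # Standard cross-entropy against the ground-truth labels.
    cross_entropy = F.cross_entropy(student_logits, labels)
    # Overall student loss: a weighted combination of the two terms.
    return alpha * distillation + (1 - alpha) * cross_entropy
```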

For the complete implementation and a fully worked-out example in Python, check out the [notebook for this section](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb).

<a target="_blank" href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

@@ -0,0 +1,94 @@
# Transfer Learning and Fine-tuning Vision Transformers for Image Classification

## Introduction

As the Transformer architecture scaled well in Natural Language Processing, the same architecture was applied to images by creating small patches of each image and treating them as tokens. The result was the Vision Transformer (ViT). Before we get started with transfer learning and fine-tuning concepts, let's compare Convolutional Neural Networks (CNNs) with Vision Transformers.

### CNN vs Vision Transformers: Inductive Bias

Inductive bias is a term used in machine learning to describe the set of assumptions that a learning algorithm uses to make predictions. In simpler terms, inductive bias is like a shortcut that helps a machine learning model make educated guesses based on the information it has seen so far.

Here are a couple of inductive biases we observe in CNNs:

- Translational Equivariance: an object can appear anywhere in the image, and CNNs can detect its features.
- Locality: pixels in an image interact mainly with their surrounding pixels to form features.

Vision Transformers lack these biases. So how do they perform so well? Because they are highly scalable and trained on massive amounts of images, they overcome the need for these inductive biases.

### Using pre-trained Vision Transformers

It's not feasible for everyone to train a Vision Transformer on millions of images to get good performance. Instead, one can use openly available models from places such as the [Hugging Face Hub](https://huggingface.co/models?sort=trending).

What do you do with the pre-trained model? You can apply transfer learning and fine-tune it!

## Transfer Learning & Fine-Tuning for Image Classification

The idea of transfer learning is that we can leverage the features learned by a Vision Transformer trained on a very large dataset and apply these features to our own dataset. This can lead to significant improvements in model performance, especially when our dataset has limited data available for training.

Since we are taking advantage of the learned features, we do not need to update the entire model either. By freezing most of the weights, we can train only certain layers to get excellent performance with less training time and low GPU consumption.
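
For illustration, here is a minimal sketch of this freezing strategy with 🤗 Transformers; the checkpoint and the number of labels (37, as in the Oxford-IIIT Pets dataset) are assumptions, and the notebook's actual setup may differ:

```python
from transformers import ViTForImageClassification

# Load a pre-trained ViT and attach a fresh classification head for our classes.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed checkpoint
    num_labels=37,
)

# Freeze the backbone so that only the classification head is updated during training.
for param in model.vit.parameters():
    param.requires_grad = False

# Only the classifier weights and bias remain trainable.
print([name for name, p in model.named_parameters() if p.requires_grad])
```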

### Multi-class Image Classification

You can go through the transfer learning tutorial using Vision Transformers for image classification in this notebook:

<a
target="_blank"
href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-image-classification.ipynb"
>
<img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
/>
</a>

This is what we'll be building: an image classifier to tell apart dog and cat breeds:

<iframe
src="https://shreydan-oxford-iiit-pets-classifier.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

---

It might be that the domain of your dataset is not very similar to the pre-trained model's dataset. Yet, instead of training a Vision Transformer from scratch, we can choose to update the weights of the entire pre-trained model albeit with a lower learning rate, which will "fine-tune" the model to perform well with our data.

<Tip>
However, in most scenarios, applying transfer learning is sufficient in the case of
Vision Transformers.
</Tip>

### Multi-label Image Classification

The tutorial above covers multi-class image classification, where each image has only one class assigned to it. What about scenarios where each image has multiple labels?

This notebook will walk you through a fine-tuning tutorial using Vision Transformer for multi-label image classification:

<a
target="_blank"
href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/fine-tuning-multilabel-image-classification.ipynb"
>
<img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
/>
</a>

We'll also be learning how to use [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/index) to write our custom training loops.
This is what you can expect to see as the outcome of the multi-label classification tutorial:

<iframe
src="https://shreydan-pascal-multilabel-classifier.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
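
For reference, below is a minimal sketch of a custom training loop with Accelerate for multi-label classification; the checkpoint, the 20 labels (as in PASCAL VOC), and the dummy batches are placeholders for the notebook's actual data pipeline:

```python
import torch
from accelerate import Accelerator
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed checkpoint
    num_labels=20,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.BCEWithLogitsLoss()  # multi-hot targets: one sigmoid per label

# Dummy batches stand in for the real preprocessed dataloader.
train_dataloader = [
    {"pixel_values": torch.randn(2, 3, 224, 224), "labels": torch.randint(0, 2, (2, 20)).float()}
    for _ in range(4)
]

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    # A real DataLoader prepared by Accelerate would move batches to the right device automatically.
    batch = {k: v.to(accelerator.device) for k, v in batch.items()}
    optimizer.zero_grad()
    outputs = model(pixel_values=batch["pixel_values"])
    loss = criterion(outputs.logits, batch["labels"])
    accelerator.backward(loss)  # replaces loss.backward() so the loop works on any device setup
    optimizer.step()
```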

---

### Additional Resources

- Original Vision Transformers Paper: _An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [Paper](https://huggingface.co/papers/2010.11929)_
- Swin Transformers Paper: _Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [Paper](https://huggingface.co/papers/2103.14030)_
- A systematic empirical study to better understand the interplay between the amount of training data, regularization, augmentation, model size, and compute budget for Vision Transformers: _How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [Paper](https://huggingface.co/papers/2106.10270)_
@@ -0,0 +1,100 @@
# Transformer-based image segmentation

In this section, we'll explore how Vision Transformers compare to Convolutional Neural Networks (CNNs) in image segmentation and detail the architecture of a vision transformer-based segmentation model as an example.

<Tip warning={true}>
This section assumes familiarity with image segmentation, Convolutional Neural
Networks (CNNs), and the basics of Vision Transformers. If you're new to these
concepts, we recommend exploring related materials in the course before
proceeding.
</Tip>

## CNNs vs Transformers for Segmentation

Before the emergence of Vision Transformers, CNNs had been the go-to choice for image segmentation. Models like [U-Net](https://arxiv.org/abs/1505.04597) and [Mask R-CNN](https://arxiv.org/abs/1703.06870) captured the details needed to distinguish different objects in an image, making them state-of-the-art for segmentation tasks.

Despite their excellent results over the past decade, CNN-based models have some limitations, which Transformers aim to solve:

- **Spatial limitations**: CNNs learn local patterns through small receptive fields. This local focus makes it hard for them to "link" features that are far apart but related within the image, affecting their ability to accurately segment complex scenes/objects. Unlike CNNs, ViTs are designed to capture global dependencies within an image, leveraging the attention mechanism. This means ViT-based models consider the entire image at once, allowing them to understand complex relationships between distant parts of an image. For segmentation, this global perspective can lead to a more accurate delineation of objects.
- **Task-Specific Components**: Methods like Mask R-CNN incorporate hand-designed components (e.g., non-maximum suppression, spatial anchors) to encode prior knowledge about segmentation tasks. These components add complexity and require manual tuning. In contrast, ViT-based segmentation methods simplify the segmentation process by eliminating the need for hand-designed components, making them more straightforward to optimize.
- **Segmentation Task Specialization**: CNN-based segmentation models approach semantic, instance, and panoptic segmentation tasks individually, leading to specialized architectures for each task and separate research efforts into each. Recent ViT-based models like [MaskFormer](https://arxiv.org/abs/2107.06278), [SegFormer](https://arxiv.org/abs/2105.15203) or [SAM](https://arxiv.org/abs/2304.02643) provide a unified approach to tackling semantic, instance, and panoptic segmentation tasks within a single framework.

## Spotlight on MaskFormer: Illustrating ViT for Image Segmentation

MaskFormer ([paper](https://arxiv.org/abs/2107.06278), [Hugging Face transformers documentation](https://huggingface.co/docs/transformers/en/model_doc/maskformer)), introduced in the paper "Per-Pixel Classification is Not All You Need for Semantic Segmentation", is a model that predicts segmentation masks for each class present in an image, unifying semantic and instance segmentation in one architecture.

### MaskFormer Architecture

The figure below shows the architecture diagram taken from the paper.

<img
width="600"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/maskformer_architecture.png"
/>

The architecture is composed of three components:

**Pixel-level Module**: Uses a backbone to extract image features and a pixel decoder to generate per-pixel embeddings.

**Transformer Module**: Employs a standard Transformer decoder to compute per-segment embeddings from image features and learnable positional embeddings (queries), encoding global information about each segment.

**Segmentation Module**: Generates class probability predictions and mask embeddings for each segment using a linear classifier and a Multi-Layer Perceptron (MLP), respectively. The mask embeddings are used in combination with per-pixel embeddings to predict binary masks for each segment.

The model is trained with a binary mask loss, the same one as [DETR](https://github.com/johko/computer-vision-course/blob/9ad9b01f2383377ac9482dcbe02c91465b573b0b/chapters/en/Unit%203%20-%20Vision%20Transformers/Common%20Vision%20Transformers%20-%20DETR.mdx), and a cross-entropy classification loss per predicted segment.

### Panoptic Segmentation Inference Example with Hugging Face Transformers

Panoptic segmentation is the task of labeling every pixel in an image with its category and identifying distinct objects within those categories, combining both semantic and instance segmentation.

Reviewer comment (Collaborator): maybe we can briefly explain what panoptic segmentation is (you can find it here https://huggingface.co/docs/transformers/tasks/semantic_segmentation) and explain what's going on below. Also, you could use pipeline, it's shorter.

```python
from transformers import pipeline
from PIL import Image
import requests

segmentation = pipeline("image-segmentation", "facebook/maskformer-swin-base-coco")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

results = segmentation(images=image, subtask="panoptic")
results
```

As you can see below, the results include multiple instances of the same classes, each with distinct masks.

```bash
[
{
"score": 0.993197,
"label": "remote",
"mask": <PIL.Image.Image image mode=L size=640x480 at 0x109363910>
},
{
"score": 0.997852,
"label": "cat",
"mask": <PIL.Image.Image image mode=L size=640x480 at 0x1093635B0>
},
{
"score": 0.998006,
"label": "remote",
"mask": <PIL.Image.Image image mode=L size=640x480 at 0x17EE84670>
},
{
"score": 0.997469,
"label": "cat",
"mask": <PIL.Image.Image image mode=L size=640x480 at 0x17EE87100>
}
]
```

## Fine-tuning Vision Transformer-based Segmentation Models

With many pre-trained segmentation models available, transfer learning and fine-tuning are commonly used to adapt these models to specific use cases, especially since transformer-based segmentation models like MaskFormer are data-hungry and challenging to train from scratch.
These techniques leverage pre-trained representations to adapt the model to new data efficiently. Typically, for MaskFormer, the backbone, the pixel decoder, and the transformer decoder are kept frozen to leverage their learned general features, while the segmentation module is fine-tuned to adapt its class prediction and mask generation capabilities to new segmentation tasks.
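
As a minimal sketch of such a partial freeze with 🤗 Transformers (the checkpoint is an assumption, and the module names `pixel_level_module`/`transformer_module` should be verified against your transformers version; which parts you freeze is a design choice):

```python
from transformers import MaskFormerForInstanceSegmentation

# Load a pre-trained MaskFormer checkpoint; pick one close to your domain.
model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-ade")

# Freeze the pixel-level module (backbone + pixel decoder) and the transformer decoder,
# leaving the segmentation head (class predictor and mask embedder) trainable.
for name, param in model.named_parameters():
    if "pixel_level_module" in name or "transformer_module" in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```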

[This notebook](https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) will walk you through a transfer learning tutorial on image segmentation using MaskFormer.

## References

- [MaskFormer Hugging Face documentation](https://huggingface.co/docs/transformers/en/model_doc/maskformer)
- [Image Segmentation Hugging Face Task Guide](https://huggingface.co/docs/transformers/en/tasks/semantic_segmentation)
6 changes: 6 additions & 0 deletions chapters/en/_toctree.yml
@@ -50,6 +50,12 @@
local: "Unit 3 - Vision Transformers/Swin Transformer"
- title: OneFormer
local: "Unit 3 - Vision Transformers/oneformer"
- title: Vision Transformers for Image Classification
local: "Unit 3 - Vision Transformers/Vision Transformers for Image Classification"
- title: Vision Transformers for Image Segmentation
local: "Unit 3 - Vision Transformers/Vision Transformers for Image Segmentation"
- title: Knowledge Distillation with Vision Transformers
local: "Unit 3 - Vision Transformers/KnowledgeDistillation"
- title: Unit 4 - Multimodal Models
sections:
- title: Introduction
@@ -17,7 +17,9 @@
"\twidth=\"850\"\n",
@MKhalusova (Collaborator), Feb 13, 2024:

Suggested changes:

Object detection is a computer vision task that identifies and localizes objects within an image or a video. It involves two primary steps:

  1. First, recognizing the types of objects present (such as cars, people, or animals).
  2. Second, determining their precise locations by drawing bounding boxes around them.

The input to these models is often an image (static or a video frame) containing multiple objects, for example a car, a person, and a bicycle. The output is a set of numbers that tells where each object is located (a regression output containing the coordinates of the bounding box) and what that object is (classification).

There are many use cases for object detection. In the field of autonomous driving, for instance, object detection is used to detect different objects (like pedestrians, road signs, and traffic lights) around the car, and this becomes one of the inputs for making decisions.

To learn more about object detection, check out the dedicated chapter about Object Detection 🤗


@MKhalusova (Collaborator), Feb 13, 2024:

Isn't this answered in the course chapter? I think we can skip this in the notebook


@MKhalusova (Collaborator), Feb 13, 2024:

"Just execute the below cells to install the necessary packages." => "Execute the below cells to install the necessary packages."

If you mention transformers and PyTorch, then maybe it's worth mentioning the other libraries as well, and what you'll use them for.


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Let's consider a real-world example: construction workers require the utmost safety when working in construction areas. Basic safety protocol requires wearing a helmet at all times. Since there are a lot of construction workers, it is hard to keep an eye on everyone all the time.

To improve safety, wouldn't it be helpful to have a camera system that can detect whether a person is wearing a helmet in real time?

Let's fine-tune a lightweight object detection model to do exactly that. Let's dive in.


@MKhalusova (Collaborator), Feb 13, 2024:

Line #3.    dataset = load_dataset("anindya64/hardhat")

It would be nice to say a couple of words about the dataset.


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Now that we know what a sample data point contains, let's plot a sample. We will first draw the image and then draw the corresponding bounding box.

Here is what we are going to do (see the sketch after this list):

  1. Get the image and its corresponding height and width.
  2. Make a draw object that can easily draw text and lines on the image.
  3. Get the annotations dict from the sample.
  4. Iterate over it.
  5. For each annotation, get the bounding box coordinates: x (where the bounding box starts horizontally), y (where the bounding box starts vertically), w (width of the bounding box), and h (height of the bounding box).
  6. If the bounding box measures are normalized, scale them; otherwise leave them as they are.
  7. Finally, draw the rectangle and the class category text.
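
Here is a minimal sketch of these steps with PIL; the `image`/`annotations` fields, the `bbox` and `category` keys, and the `id2label` mapping are assumptions about the dataset format rather than the notebook's actual code:

```python
from PIL import ImageDraw

def draw_sample(sample, id2label, normalized=False):
    # 1. Get the image and its corresponding width and height.
    image = sample["image"].copy()
    width, height = image.size
    # 2. Make a draw object that can draw text and lines on the image.
    draw = ImageDraw.Draw(image)
    # 3.-4. Get the annotations dict from the sample and iterate over it.
    annotations = sample["annotations"]
    for bbox, category in zip(annotations["bbox"], annotations["category"]):
        # 5. Bounding box coordinates in COCO format: x, y, width, height.
        x, y, w, h = bbox
        # 6. Scale the box if the measures are normalized.
        if normalized:
            x, y, w, h = x * width, y * height, w * width, h * height
        # 7. Draw the rectangle and the class category text.
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        draw.text((x, y), id2label[category], fill="white")
    return image
```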


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    # Now let's make a simple function on plotting multiple images

You can remove this code comment completely, or move it to markdown instead.


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Preprocessing the images

Before fine-tuning the model, we must preprocess the data so that it matches exactly the approach used during pre-training. The Hugging Face AutoImageProcessor takes care of processing the image data to create pixel_values, pixel_mask, and labels that a DETR model can train with.

Let's instantiate the image processor from the same checkpoint as the model we want to fine-tune.
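
A minimal sketch of that step (the checkpoint name is an assumption; use whichever checkpoint the notebook fine-tunes):

```python
from transformers import AutoImageProcessor

checkpoint = "facebook/detr-resnet-50-dc5"  # assumed base checkpoint
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
```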


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:


In this section, we will preprocess the dataset. Basically, we will apply different types of augmentations to the images, along with their corresponding bounding boxes.

In simple terms, augmentations are a set of random transformations like rotations, resizing, etc. These are applied for the following reasons:

  1. To get more samples.
  2. To make the vision model more robust to different image conditions.

We will use the albumentations library to achieve this. If you want to dig deeper into different types of augmentations, check out the corresponding unit to learn more.

Note: is there a link to the unit we could add here?
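
A minimal sketch of such an augmentation pipeline with albumentations (the specific transforms and the image size are illustrative assumptions, not necessarily what the notebook uses):

```python
import albumentations as A

# Random augmentations applied jointly to the image and its bounding boxes.
transform = A.Compose(
    [
        A.Resize(480, 480),
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
    ],
    # COCO-format boxes (x, y, width, height); the class labels travel with the boxes.
    bbox_params=A.BboxParams(format="coco", label_fields=["category"]),
)

# Usage: transformed = transform(image=image, bboxes=bboxes, category=categories)
```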


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    # now creating random augmentations using albumentations

There's no need for this code comment. Please remove


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Once we initialize all the transformations, we need to make a function that formats the annotations and returns them in a very specific format.

This is because the image_processor expects the annotations to be in the following format: {'image_id': int, 'annotations': List[Dict]}, where each dictionary is a COCO object annotation.
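
For example, a minimal formatting helper might look like this (the function and field names are illustrative assumptions):

```python
def format_annotations(image_id, categories, areas, bboxes):
    # Build COCO-style object annotations in the structure the image_processor expects.
    annotations = [
        {
            "image_id": image_id,
            "category_id": category,
            "iscrowd": 0,
            "area": area,
            "bbox": list(bbox),  # COCO format: [x, y, width, height]
        }
        for category, area, bbox in zip(categories, areas, bboxes)
    ]
    return {"image_id": image_id, "annotations": annotations}
```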


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Finally, we combine the individual image and annotation transformations to work on a whole batch of the dataset.

Here is the final code to do so:


@MKhalusova (Collaborator), Feb 13, 2024:

This seems redundant


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    train_dataset

This too is not needed


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    # Apply transformations for both train and test dataset

This is something you mention in the markdown, so there's no need to repeat it in the code comments.


@MKhalusova (Collaborator), Feb 13, 2024:

You can remove the last sentence.


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    train_dataset_transformed[0]

This can go before the previous markdown, meaning, before you explain what data collator does.


@MKhalusova (Collaborator), Feb 13, 2024:

Line #16.        remove_unused_columns=False,

It's a good idea to explain at least what this parameter does.


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    # delete this model, since it is already been uploaded to hub

I think you can skip this.


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Now we will run inference with our new fine-tuned model. We first write some very simple code for object detection inference on new images.

Then we will put everything together and make a function out of it.


@MKhalusova (Collaborator), Feb 13, 2024:

Why not reuse the function that you have defined earlier?


@MKhalusova (Collaborator), Feb 13, 2024:

It would be nice to only show the results with the highest scores. Showing everything makes the image look cluttered. The pipeline earlier returns the top two scoring results, let's only show those


@MKhalusova (Collaborator), Feb 13, 2024:

What does "Clubbing it altogather" mean? Also, it's "altogether"  



"\theight=\"450\">\n",
"</iframe>\n",
"```"
"```\n",
"\n",
"Also there is a small section if you are interested in Transfer learning instead of fine tuning only."
]
},
{
@@ -1526,6 +1528,55 @@
"source": [
"Well, that's not bad. We can improve the results if we fine-tune further. You can find this fine-tuned checkpoint [here](hf-vision/detr-resnet-50-dc5-harhat-finetuned). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How about Transfer learning ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook , we primarily discussed about the fine-tuning a certain model to our custom dataset. What if , we only want transfer learning? Actually that is easy peasy! In transfer learning , we have to keep the parameter values aka weights, of the pretrained model frozen. We just train the classifier layer (in some cases, one or two more layers). In this case before starting the training process, we can do the following, "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"from transformers import AutoModelForObjectDetection\n",
"\n",
"id2label = {0:'head', 1:'helmet', 2:'person'}\n",
"label2id = {v: k for k, v in id2label.items()}\n",
"\n",
"\n",
"model = AutoModelForObjectDetection.from_pretrained(\n",
" checkpoint,\n",
" id2label=id2label,\n",
" label2id=label2id,\n",
" ignore_mismatched_sizes=True,\n",
")\n",
"\n",
"for name,p in model.named_parameters():\n",
" if not 'bbox_predictor' in name or not name.startswith('class_label'):\n",
" p.requires_grad = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That means, after loading the model , we freeze all of the layers except last 6 layers. Which are `bbox_predictor.layers` and `class_labels_classifier`."
]
}
],
"metadata": {
Expand Down