Unit 3 : Vision Transformers / Transfer Learning & Fine-Tuning Chapter Content #204

Merged
38 commits merged on Mar 13, 2024

Changes from 37 commits

Commits (38)
73d9c0d
added mdx
asusevski Feb 4, 2024
f603ab1
Add mdx for image segmentation with vision transformers
hanouticelina Feb 5, 2024
aaba4e6
image classification mdx
shreydan Feb 5, 2024
75f7331
Merge pull request #24 from shreydan/add-mdx-kd
asusevski Feb 5, 2024
9f18e18
rename, add titles
shreydan Feb 6, 2024
eabb8db
Add Colab link
hanouticelina Feb 6, 2024
7183a81
add colab button
asusevski Feb 6, 2024
33edf14
Merge pull request #25 from shreydan/anthony-add-colab
asusevski Feb 7, 2024
f511ac1
added transfer learning
sezan92 Feb 7, 2024
db943e1
updated summary
sezan92 Feb 7, 2024
1ebfc9d
Merge pull request #26 from shreydan/vit-transfer-learning-od
shreydan Feb 7, 2024
c00d727
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
147061b
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
801776e
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
73316e1
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
255c60f
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
85721ee
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 10, 2024
fc08fa8
Update chapters/en/Unit 3 - Vision Transformers/Vision Transformers f…
asusevski Feb 10, 2024
b9a22c6
correct as per suggestions
shreydan Feb 11, 2024
2673f50
corrections
shreydan Feb 11, 2024
4f0dcd9
Fixes after review
hanouticelina Feb 11, 2024
778481c
Merge branch 'main' of github.com:shreydan/computer-vision-course
hanouticelina Feb 11, 2024
bb12e2c
update as per suggestions
shreydan Feb 16, 2024
c93f86a
Fixes after 2nd review
hanouticelina Feb 16, 2024
7325118
Merge branch 'main' of github.com:shreydan/computer-vision-course
hanouticelina Feb 16, 2024
206c08a
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 19, 2024
85fcb65
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
asusevski Feb 19, 2024
85401ea
fix typo
asusevski Feb 22, 2024
89a97ed
add ref to KL divergence wiki page
asusevski Feb 22, 2024
c8a95b0
Merge branch 'johko:main' into main
shreydan Mar 1, 2024
4a7d956
updated table of contents
shreydan Mar 1, 2024
3d30a27
Update _toctree.yml
shreydan Mar 1, 2024
870496a
fixes
merveenoyan Mar 12, 2024
1181ecd
Fix grammatical mistakes
merveenoyan Mar 12, 2024
d6b6099
Fix grammar errors
merveenoyan Mar 12, 2024
fbf87b7
Update Vision Transformers for Image Segmentation.mdx
merveenoyan Mar 12, 2024
34bc730
Update chapters/en/Unit 3 - Vision Transformers/KnowledgeDistillation…
merveenoyan Mar 12, 2024
4b24343
Merge branch 'main' into main
merveenoyan Mar 12, 2024
2 changes: 1 addition & 1 deletion chapters/en/Unit 13 - Outlook/hyena.mdx
@@ -67,7 +67,7 @@ This network takes the positional index and potentially positional encodings as
This implies that instead of learning the values of the convolution filter directly, we learn a mapping from a temporal positional encoding to the values, which is more computationally efficient, especially for long sequences.

<Tip>
It's important to note that the mapping function $\gamma_{\theta}$ can be conceptualized within various abstract models, such Neural Field or State Space Models (S4) as discussed in [H3 paper](https://arxiv.org/abs/2212.14052).
It's important to note that the mapping function can be conceptualized within various abstract models, such Neural Field or State Space Models (S4) as discussed in [H3 paper](https://arxiv.org/abs/2212.14052).
</Tip>
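
To make the idea concrete, here is a minimal sketch of a filter parameterized by a small network over positional encodings; the plain MLP, the dimensions, and the variable names are illustrative assumptions, not the Hyena authors' implementation:

```python
import torch
import torch.nn as nn

class ImplicitFilter(nn.Module):
    """Maps positional encodings to filter values instead of storing the filter directly."""

    def __init__(self, pos_dim: int = 16, hidden_dim: int = 64, channels: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, channels),
        )

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (seq_len, pos_dim) encodings of the time indices
        return self.mlp(positions)  # (seq_len, channels): one filter value per position and channel

# A filter as long as the sequence is generated on the fly from the positions,
# so the parameter count does not grow with the sequence length.
positions = torch.randn(1024, 16)  # stand-in for real positional encodings
filter_values = ImplicitFilter()(positions)
```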

### Implicit convolutions
@@ -84,4 +84,5 @@ class GoogLeNet(nn.Module):
x = self.pre_layers(x)
x = self.inception_blocks(x)
x = self.output_net(x)
return F.softmax(x, dim=1)
return F.softmax(x, dim=1)
```
@@ -0,0 +1,42 @@
# Knowledge Distillation with Vision Transformers

We are going to learn about Knowledge Distillation, the method behind [DistilGPT2](https://huggingface.co/distilgpt2) and [DistilBERT](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), two of *the most downloaded models on the Hugging Face Hub!*

Presumably, we've all had teachers who "teach" by simply providing us with the correct answers and then testing us on questions we haven't seen before. This is analogous to supervised learning, where we give a model a labeled dataset to train on.
Instead of having a model train only on labels, however, we can use [Knowledge Distillation](https://arxiv.org/abs/1503.02531) to arrive at a much smaller model that performs comparably to the larger model and runs much faster to boot.

## Intuition Behind Knowledge Distillation

Imagine you were given this multiple-choice question:

![Multiple Choice Question](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/multiple-choice-question.png)

If you had someone just tell you, "The answer is Draco Malfoy," that doesn't teach you a whole lot about each of the characters' relative relationships with Harry Potter.

On the other hand, if someone tells you, "I am very confident it is not Ron Weasley, I am somewhat confident it is not Neville Longbottom, and
I am very confident that it *is* Draco Malfoy", this gives you some information about these characters' relationships to Harry Potter!
This is precisely the kind of information that gets passed down to our student model under the Knowledge Distillation paradigm.

## Distilling the Knowledge in a Neural Network

In the paper [*Distilling the Knowledge in a Neural Network*](https://arxiv.org/abs/1503.02531), Hinton et al. introduced the training methodology known as knowledge distillation,
taking inspiration from *insects*, of all things. Just as insects transition from larval to adult forms that are optimized for different tasks, large-scale machine learning models can
initially be cumbersome, like larvae, for extracting structure from data but can distill their knowledge into smaller, more efficient models for deployment.

The essence of Knowledge Distillation is using the predicted logits from a teacher network to pass information to a smaller, more efficient student model. We do this
by re-writing the loss function to contain a *distillation loss*, which encourages the student model's distribution over the output space to approximate the teacher's.

The distillation loss is formulated as:

![Distillation Loss](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/KL-Loss.png)

The KL loss refers to the [Kullback-Leibler Divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) between the teacher and the student's output distributions.
The overall loss for the student model is then formulated as the sum of this distillation loss with the standard cross-entropy loss over the ground-truth labels.
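
As a rough sketch of what this combined loss can look like in PyTorch (the temperature, the weighting factor `alpha`, and the exact way the two terms are combined are illustrative choices; the notebook may do it differently):

```python
import torch.nn.functional as F

def knowledge_distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soften both output distributions with the temperature before comparing them.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between the teacher's and the student's output distributions.
    distillation = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2
    # Standard cross-entropy against the ground-truth labels.
    cross_entropy = F.cross_entropy(student_logits, labels)
    # Overall student loss: a weighted combination of the two terms.
    return alpha * distillation + (1 - alpha) * cross_entropy
```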

For the complete implementation and a fully worked-out example in Python, check out the [notebook for this section](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb).

<a target="_blank" href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

@@ -0,0 +1,94 @@
# Transfer Learning and Fine-tuning Vision Transformers for Image Classification

## Introduction

As the Transformer architecture scaled well in Natural Language Processing, the same architecture was applied to images by creating small patches of each image and treating them as tokens. The result was the Vision Transformer (ViT). Before we get started with transfer learning and fine-tuning concepts, let's compare Convolutional Neural Networks (CNNs) with Vision Transformers.

### CNN vs Vision Transformers: Inductive Bias

Inductive bias is a term used in machine learning to describe the set of assumptions that a learning algorithm uses to make predictions. In simpler terms, inductive bias is like a shortcut that helps a machine learning model make educated guesses based on the information it has seen so far.

Here are a couple of inductive biases we observe in CNNs:

- Translational Equivariance: an object can appear anywhere in the image, and CNNs can detect its features.
- Locality: pixels in an image interact mainly with their surrounding pixels to form features.

Vision Transformers lack these biases. So how do they perform so well? Because they are highly scalable and trained on massive amounts of images, they overcome the need for these inductive biases.

### Using pre-trained Vision Transformers

It's not feasible for everyone to train a Vision Transformer on millions of images to get good performance. Instead, one can use openly available models from places such as the [Hugging Face Hub](https://huggingface.co/models?sort=trending).

What do you do with the pre-trained model? You can apply transfer learning and fine-tune it!

## Transfer Learning & Fine-Tuning for Image Classification

The idea of transfer learning is that we can leverage the features learned by a Vision Transformer trained on a very large dataset and apply these features to our own dataset. This can lead to significant improvements in model performance, especially when our dataset has limited data available for training.

Since we are taking advantage of the learned features, we do not need to update the entire model either. By freezing most of the weights, we can train only certain layers to get excellent performance with less training time and low GPU consumption.
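
For illustration, here is a minimal sketch of this freezing strategy with 🤗 Transformers; the checkpoint and the number of labels (37, as in the Oxford-IIIT Pets dataset) are assumptions, and the notebook's actual setup may differ:

```python
from transformers import ViTForImageClassification

# Load a pre-trained ViT and attach a fresh classification head for our classes.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed checkpoint
    num_labels=37,
)

# Freeze the backbone so that only the classification head is updated during training.
for param in model.vit.parameters():
    param.requires_grad = False

# Only the classifier weights and bias remain trainable.
print([name for name, p in model.named_parameters() if p.requires_grad])
```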

### Multi-class Image Classification

You can go through the transfer learning tutorial using Vision Transformers for image classification in this notebook:

<a
target="_blank"
href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-image-classification.ipynb"
>
<img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
/>
</a>

This is what we'll be building: an image classifier to tell apart dog and cat breeds:

<iframe
src="https://shreydan-oxford-iiit-pets-classifier.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

---

It might be that the domain of your dataset is not very similar to the pre-trained model's dataset. Yet, instead of training a Vision Transformer from scratch, we can choose to update the weights of the entire pre-trained model albeit with a lower learning rate, which will "fine-tune" the model to perform well with our data.

<Tip>
However, in most scenarios, applying transfer learning is sufficient in the case of
Vision Transformers.
</Tip>

### Multi-label Image Classification

The tutorial above covers multi-class image classification, where each image has only one class assigned to it. What about scenarios where each image has multiple labels?

This notebook will walk you through a fine-tuning tutorial using Vision Transformer for multi-label image classification:

<a
target="_blank"
href="https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/fine-tuning-multilabel-image-classification.ipynb"
>
<img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
/>
</a>

We'll also be learning how to use [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/index) to write our custom training loops.
This is what you can expect to see as the outcome of the multi-label classification tutorial:

<iframe
src="https://shreydan-pascal-multilabel-classifier.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
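
For reference, below is a minimal sketch of a custom training loop with Accelerate for multi-label classification; the checkpoint, the 20 labels (as in PASCAL VOC), and the dummy batches are placeholders for the notebook's actual data pipeline:

```python
import torch
from accelerate import Accelerator
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed checkpoint
    num_labels=20,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.BCEWithLogitsLoss()  # multi-hot targets: one sigmoid per label

# Dummy batches stand in for the real preprocessed dataloader.
train_dataloader = [
    {"pixel_values": torch.randn(2, 3, 224, 224), "labels": torch.randint(0, 2, (2, 20)).float()}
    for _ in range(4)
]

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    # A real DataLoader prepared by Accelerate would move batches to the right device automatically.
    batch = {k: v.to(accelerator.device) for k, v in batch.items()}
    optimizer.zero_grad()
    outputs = model(pixel_values=batch["pixel_values"])
    loss = criterion(outputs.logits, batch["labels"])
    accelerator.backward(loss)  # replaces loss.backward() so the loop works on any device setup
    optimizer.step()
```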

---

### Additional Resources

- Original Vision Transformers Paper: _An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [Paper](https://huggingface.co/papers/2010.11929)_
- Swin Transformers Paper: _Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [Paper](https://huggingface.co/papers/2103.14030)_
- A systematic empirical study to better understand the interplay between the amount of training data, regularization, augmentation, model size, and compute budget for Vision Transformers: _How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [Paper](https://huggingface.co/papers/2106.10270)_
@@ -0,0 +1,100 @@
# Transformer-based image segmentation

In this section, we'll explore how Vision Transformers compare to Convolutional Neural Networks (CNNs) in image segmentation and detail the architecture of a vision transformer-based segmentation model as an example.

<Tip warning={true}>
This section assumes familiarity with image segmentation, Convolutional Neural
Networks (CNNs), and the basics of Vision Transformers. If you're new to these
concepts, we recommend exploring related materials in the course before
proceeding.
</Tip>

## CNNs vs Transformers for Segmentation

Before the emergence of Vision Transformers, CNNs had been the go-to choice for image segmentation. Models like [U-Net](https://arxiv.org/abs/1505.04597) and [Mask R-CNN](https://arxiv.org/abs/1703.06870) captured the details needed to distinguish different objects in an image, making them state-of-the-art for segmentation tasks.

Despite their excellent results over the past decade, CNN-based models have some limitations, which Transformers aim to solve:

- **Spatial limitations**: CNNs learn local patterns through small receptive fields. This local focus makes it hard for them to "link" features that are far apart but related within the image, affecting their ability to accurately segment complex scenes/objects. Unlike CNNs, ViTs are designed to capture global dependencies within an image, leveraging the attention mechanism. This means ViT-based models consider the entire image at once, allowing them to understand complex relationships between distant parts of an image. For segmentation, this global perspective can lead to a more accurate delineation of objects.
- **Task-Specific Components**: Methods like Mask R-CNN incorporate hand-designed components (e.g., non-maximum suppression, spatial anchors) to encode prior knowledge about segmentation tasks. These components add complexity and require manual tuning. In contrast, ViT-based segmentation methods simplify the segmentation process by eliminating the need for hand-designed components, making them more straightforward to optimize.
- **Segmentation Task Specialization**: CNN-based segmentation models approach semantic, instance, and panoptic segmentation tasks individually, leading to specialized architectures for each task and separate research efforts into each. Recent ViT-based models like [MaskFormer](https://arxiv.org/abs/2107.06278), [SegFormer](https://arxiv.org/abs/2105.15203) or [SAM](https://arxiv.org/abs/2304.02643) provide a unified approach to tackling semantic, instance, and panoptic segmentation tasks within a single framework.

## Spotlight on MaskFormer: Illustrating ViT for Image Segmentation

MaskFormer ([paper](https://arxiv.org/abs/2107.06278), [Hugging Face transformers documentation](https://huggingface.co/docs/transformers/en/model_doc/maskformer)), introduced in the paper "Per-Pixel Classification is Not All You Need for Semantic Segmentation", is a model that predicts segmentation masks for each class present in an image, unifying semantic and instance segmentation in one architecture.

### MaskFormer Architecture

The figure below shows the architecture diagram taken from the paper.

<img
width="600"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/maskformer_architecture.png"
/>

The architecture is composed of three components:

**Pixel-level Module**: Uses a backbone to extract image features and a pixel decoder to generate per-pixel embeddings.

**Transformer Module**: Employs a standard Transformer decoder to compute per-segment embeddings from image features and learnable positional embeddings (queries), encoding global information about each segment.

**Segmentation Module**: Generates class probability predictions and mask embeddings for each segment using a linear classifier and a Multi-Layer Perceptron (MLP), respectively. The mask embeddings are used in combination with per-pixel embeddings to predict binary masks for each segment.

The model is trained with a binary mask loss, the same one as [DETR](https://github.com/johko/computer-vision-course/blob/9ad9b01f2383377ac9482dcbe02c91465b573b0b/chapters/en/Unit%203%20-%20Vision%20Transformers/Common%20Vision%20Transformers%20-%20DETR.mdx), and a cross-entropy classification loss per predicted segment.

### Panoptic Segmentation Inference Example with Hugging Face Transformers

Panoptic segmentation is the task of labeling every pixel in an image with its category and identifying distinct objects within those categories, combining both semantic and instance segmentation.

Reviewer comment (Collaborator): maybe we can briefly explain what panoptic segmentation is (you can find it here https://huggingface.co/docs/transformers/tasks/semantic_segmentation) and explain what's going on below. Also, you could use pipeline, it's shorter.

```python
from transformers import pipeline
from PIL import Image
import requests

segmentation = pipeline("image-segmentation", "facebook/maskformer-swin-base-coco")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

results = segmentation(images=image, subtask="panoptic")
results
```

As you can see below, the results include multiple instances of the same classes, each with distinct masks.

```bash
[
{
"score": 0.993197,
"label": "remote",
"mask": <PIL.Image.Image image mode=L size=640x480 at 0x109363910>
},
{
"score": 0.997852,
"label": "cat",
"mask": <PIL.Image.Image image mode=L size=640x480 at 0x1093635B0>
},
{
"score": 0.998006,
"label": "remote",
"mask": <PIL.Image.Image image mode=L size=640x480 at 0x17EE84670>
},
{
"score": 0.997469,
"label": "cat",
"mask": <PIL.Image.Image image mode=L size=640x480 at 0x17EE87100>
}
]
```

## Fine-tuning Vision Transformer-based Segmentation Models

With many pre-trained segmentation models available, transfer learning and fine-tuning are commonly used to adapt these models to specific use cases, especially since transformer-based segmentation models like MaskFormer are data-hungry and challenging to train from scratch.
These techniques leverage pre-trained representations to adapt the model to new data efficiently. Typically, for MaskFormer, the backbone, the pixel decoder, and the transformer decoder are kept frozen to leverage their learned general features, while the segmentation module is fine-tuned to adapt its class prediction and mask generation capabilities to new segmentation tasks.
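
As a minimal sketch of such a partial freeze with 🤗 Transformers (the checkpoint is an assumption, and the module names `pixel_level_module`/`transformer_module` should be verified against your transformers version; which parts you freeze is a design choice):

```python
from transformers import MaskFormerForInstanceSegmentation

# Load a pre-trained MaskFormer checkpoint; pick one close to your domain.
model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-ade")

# Freeze the pixel-level module (backbone + pixel decoder) and the transformer decoder,
# leaving the segmentation head (class predictor and mask embedder) trainable.
for name, param in model.named_parameters():
    if "pixel_level_module" in name or "transformer_module" in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```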

[This notebook](https://colab.research.google.com/github/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) will walk you through a transfer learning tutorial on image segmentation using MaskFormer.

## References

- [MaskFormer Hugging Face documentation](https://huggingface.co/docs/transformers/en/model_doc/maskformer)
- [Image Segmentation Hugging Face Task Guide](https://huggingface.co/docs/transformers/en/tasks/semantic_segmentation)
6 changes: 6 additions & 0 deletions chapters/en/_toctree.yml
@@ -50,6 +50,12 @@
local: "Unit 3 - Vision Transformers/Swin Transformer"
- title: OneFormer
local: "Unit 3 - Vision Transformers/oneformer"
- title: Vision Transformers for Image Classification
local: "Unit 3 - Vision Transformers/Vision Transformers for Image Classification"
- title: Vision Transformers for Image Segmentation
local: "Unit 3 - Vision Transformers/Vision Transformers for Image Segmentation"
- title: Knowledge Distillation with Vision Transformers
local: "Unit 3 - Vision Transformers/KnowledgeDistillation"
- title: Unit 4 - Multimodal Models
sections:
- title: Introduction
@@ -17,7 +17,9 @@
"\twidth=\"850\"\n",
@MKhalusova (Collaborator), Feb 13, 2024:

Suggested changes:

Object detection is a computer vision task that identifies and localizes objects within an image or a video. It involves two primary steps:

  1. First, recognizing the types of objects present (such as cars, people, or animals).
  2. Second, determining their precise locations by drawing bounding boxes around them.

The input to these models is often an image (static or a video frame) containing multiple objects, for example a car, a person, and a bicycle. The output is a set of numbers that tells where each object is located (a regression output containing the coordinates of the bounding box) and what that object is (classification).

There are many use cases for object detection. In the field of autonomous driving, for instance, object detection is used to detect different objects (like pedestrians, road signs, and traffic lights) around the car, and this becomes one of the inputs for making decisions.

To learn more about object detection, check out the dedicated chapter about Object Detection 🤗


@MKhalusova (Collaborator), Feb 13, 2024:

Isn't this answered in the course chapter? I think we can skip this in the notebook


@MKhalusova (Collaborator), Feb 13, 2024:

"Just execute the below cells to install the necessary packages." => "Execute the below cells to install the necessary packages."

If you mention transformers and PyTorch, then maybe it's worth mentioning the other libraries as well, and what you'll use them for.


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Let's consider a real-world example: construction workers require the utmost safety when working in construction areas. Basic safety protocol requires wearing a helmet at all times. Since there are a lot of construction workers, it is hard to keep an eye on everyone all the time.

To improve safety, wouldn't it be helpful to have a camera system that can detect whether a person is wearing a helmet in real time?

Let's fine-tune a lightweight object detection model to do exactly that. Let's dive in.


@MKhalusova (Collaborator), Feb 13, 2024:

Line #3.    dataset = load_dataset("anindya64/hardhat")

It would be nice to say a couple of words about the dataset.


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Now that we know what a sample data point contains, let's plot a sample. We will first draw the image and then draw the corresponding bounding box.

Here is what we are going to do (see the sketch after this list):

  1. Get the image and its corresponding height and width.
  2. Make a draw object that can easily draw text and lines on the image.
  3. Get the annotations dict from the sample.
  4. Iterate over it.
  5. For each annotation, get the bounding box coordinates: x (where the bounding box starts horizontally), y (where the bounding box starts vertically), w (width of the bounding box), and h (height of the bounding box).
  6. If the bounding box measures are normalized, scale them; otherwise leave them as they are.
  7. Finally, draw the rectangle and the class category text.
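
Here is a minimal sketch of these steps with PIL; the `image`/`annotations` fields, the `bbox` and `category` keys, and the `id2label` mapping are assumptions about the dataset format rather than the notebook's actual code:

```python
from PIL import ImageDraw

def draw_sample(sample, id2label, normalized=False):
    # 1. Get the image and its corresponding width and height.
    image = sample["image"].copy()
    width, height = image.size
    # 2. Make a draw object that can draw text and lines on the image.
    draw = ImageDraw.Draw(image)
    # 3.-4. Get the annotations dict from the sample and iterate over it.
    annotations = sample["annotations"]
    for bbox, category in zip(annotations["bbox"], annotations["category"]):
        # 5. Bounding box coordinates in COCO format: x, y, width, height.
        x, y, w, h = bbox
        # 6. Scale the box if the measures are normalized.
        if normalized:
            x, y, w, h = x * width, y * height, w * width, h * height
        # 7. Draw the rectangle and the class category text.
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        draw.text((x, y), id2label[category], fill="white")
    return image
```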


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    # Now let's make a simple function on plotting multiple images

You can remove this code comment completely, or move it to markdown instead.


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Preprocessing the images

Before fine-tuning the model, we must preprocess the data so that it matches exactly the approach used during pre-training. The Hugging Face AutoImageProcessor takes care of processing the image data to create pixel_values, pixel_mask, and labels that a DETR model can train with.

Let's instantiate the image processor from the same checkpoint as the model we want to fine-tune.
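
A minimal sketch of that step (the checkpoint name is an assumption; use whichever checkpoint the notebook fine-tunes):

```python
from transformers import AutoImageProcessor

checkpoint = "facebook/detr-resnet-50-dc5"  # assumed base checkpoint
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
```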


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:


In this section, we will preprocess the dataset. Basically, we will apply different types of augmentations to the images, along with their corresponding bounding boxes.

In simple terms, augmentations are a set of random transformations like rotations, resizing, etc. These are applied for the following reasons:

  1. To get more samples.
  2. To make the vision model more robust to different image conditions.

We will use the albumentations library to achieve this. If you want to dig deeper into different types of augmentations, check out the corresponding unit to learn more.

Note: is there a link to the unit we could add here?
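
A minimal sketch of such an augmentation pipeline with albumentations (the specific transforms and the image size are illustrative assumptions, not necessarily what the notebook uses):

```python
import albumentations as A

# Random augmentations applied jointly to the image and its bounding boxes.
transform = A.Compose(
    [
        A.Resize(480, 480),
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
    ],
    # COCO-format boxes (x, y, width, height); the class labels travel with the boxes.
    bbox_params=A.BboxParams(format="coco", label_fields=["category"]),
)

# Usage: transformed = transform(image=image, bboxes=bboxes, category=categories)
```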


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    # now creating random augmentations using albumentations

There's no need for this code comment. Please remove


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Once we initialize all the transformations, we need to make a function that formats the annotations and returns them in a very specific format.

This is because the image_processor expects the annotations to be in the following format: {'image_id': int, 'annotations': List[Dict]}, where each dictionary is a COCO object annotation.
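
For example, a minimal formatting helper might look like this (the function and field names are illustrative assumptions):

```python
def format_annotations(image_id, categories, areas, bboxes):
    # Build COCO-style object annotations in the structure the image_processor expects.
    annotations = [
        {
            "image_id": image_id,
            "category_id": category,
            "iscrowd": 0,
            "area": area,
            "bbox": list(bbox),  # COCO format: [x, y, width, height]
        }
        for category, area, bbox in zip(categories, areas, bboxes)
    ]
    return {"image_id": image_id, "annotations": annotations}
```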


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Finally, we combine the individual image and annotation transformations to work on a whole batch of the dataset.

Here is the final code to do so:


@MKhalusova (Collaborator), Feb 13, 2024:

This seems redundant


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    train_dataset

This too is not needed


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    # Apply transformations for both train and test dataset

This is something you mention in the markdown, so there's no need to repeat it in the code comments.


@MKhalusova (Collaborator), Feb 13, 2024:

You can remove the last sentence.


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    train_dataset_transformed[0]

This can go before the previous markdown, meaning, before you explain what data collator does.


@MKhalusova (Collaborator), Feb 13, 2024:

Line #16.        remove_unused_columns=False,

It's a good idea to explain at least what this parameter does.


@MKhalusova (Collaborator), Feb 13, 2024:

Line #1.    # delete this model, since it is already been uploaded to hub

I think you can skip this.


@MKhalusova (Collaborator), Feb 13, 2024:

Suggestions:

Now we will run inference with our new fine-tuned model. We first write some very simple code for object detection inference on new images.

Then we will put everything together and make a function out of it.


@MKhalusova (Collaborator), Feb 13, 2024:

Why not reuse the function that you have defined earlier?


@MKhalusova (Collaborator), Feb 13, 2024:

It would be nice to only show the results with the highest scores. Showing everything makes the image look cluttered. The pipeline earlier returns the top two scoring results, let's only show those


@MKhalusova (Collaborator), Feb 13, 2024:

What does "Clubbing it altogather" mean? Also, it's "altogether"  



"\theight=\"450\">\n",
"</iframe>\n",
"```"
"```\n",
"\n",
"Also there is a small section if you are interested in Transfer learning instead of fine tuning only."
]
},
{
@@ -1526,6 +1528,55 @@
"source": [
"Well, that's not bad. We can improve the results if we fine-tune further. You can find this fine-tuned checkpoint [here](hf-vision/detr-resnet-50-dc5-harhat-finetuned). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How about Transfer learning ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook , we primarily discussed about the fine-tuning a certain model to our custom dataset. What if , we only want transfer learning? Actually that is easy peasy! In transfer learning , we have to keep the parameter values aka weights, of the pretrained model frozen. We just train the classifier layer (in some cases, one or two more layers). In this case before starting the training process, we can do the following, "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"from transformers import AutoModelForObjectDetection\n",
"\n",
"id2label = {0:'head', 1:'helmet', 2:'person'}\n",
"label2id = {v: k for k, v in id2label.items()}\n",
"\n",
"\n",
"model = AutoModelForObjectDetection.from_pretrained(\n",
" checkpoint,\n",
" id2label=id2label,\n",
" label2id=label2id,\n",
" ignore_mismatched_sizes=True,\n",
")\n",
"\n",
"for name,p in model.named_parameters():\n",
" if not 'bbox_predictor' in name or not name.startswith('class_label'):\n",
" p.requires_grad = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That means, after loading the model , we freeze all of the layers except last 6 layers. Which are `bbox_predictor.layers` and `class_labels_classifier`."
]
}
],
"metadata": {
Expand Down