Merge pull request #382 from huggingface/stage
merging stage into main
johko authored Feb 28, 2025
2 parents dcc536a + f5dfe12 commit 6f74f7f
Showing 3 changed files with 226 additions and 4 deletions.

As the Transformer architecture scaled well in Natural Language Processing, the same architecture was applied to images by splitting each image into small patches and treating them as tokens. The result was the Vision Transformer (ViT). Before we get started with transfer learning / fine-tuning concepts, let's compare Convolutional Neural Networks (CNNs) with Vision Transformers.

## Vision Transformer (ViT): A Summary

To summarize, in a Vision Transformer, images are reorganized into 2D grids of patches, and the model is trained on those patches.

The main idea is illustrated in the picture below:
![Vision Transformer](https://huggingface.co/datasets/hf-vision/course-assets/blob/main/Screenshot%20from%202024-12-27%2014-25-49.png)
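
To make the patch idea concrete, here is a minimal sketch (assuming PyTorch; the 224x224 image size and 16x16 patch size are just typical example values) of how an image can be turned into a sequence of flattened patch tokens:

```python
import torch

# A randomly generated stand-in for an RGB image: (channels, height, width).
image = torch.randn(3, 224, 224)
patch_size = 16  # ViT-Base, for example, uses 16x16 patches

# Cut the image into non-overlapping 16x16 patches along height and width,
# then flatten every patch into a single vector ("token").
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)

print(patches.shape)  # torch.Size([196, 768]): a 14x14 grid of 768-dimensional patch tokens
```

In an actual ViT, each of these flattened patches is then linearly projected to the model's hidden dimension and combined with a positional embedding before being fed to the transformer encoder.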

But there is a catch! Convolutional Neural Networks (CNNs) are designed with assumptions that are missing in the ViT. These assumptions are based on how we, as humans, perceive objects in images, and they are described in the following section.

## What are the differences between CNNs and Vision Transformers?

### Inductive Bias

Inductive bias is a term used in machine learning to describe the set of assumptions that a learning algorithm uses to make predictions. In simpler terms, inductive bias is like a shortcut that helps a machine learning model make educated guesses based on the information it has seen so far.

Here are a couple of inductive biases we observe in CNNs:
- Translational Equivariance: an object can appear anywhere in the image, and CNNs can detect its features.
- Locality: pixels in an image interact mainly with their surrounding pixels to form features.

CNN models are very good at exploiting these two biases, whereas ViTs do not have them built in. That is why, up to a certain dataset size, CNNs actually outperform ViTs. But ViTs have another strength: because the transformer architecture consists (mostly) of different kinds of linear operations, it is highly scalable, and with massive amounts of data that scalability lets ViTs overcome the lack of these two inductive biases.
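
To see what translational equivariance means in practice, here is a small illustrative sketch (again assuming PyTorch) showing that shifting the input of a convolution simply shifts its output by the same amount, a property the convolution gets for free from weight sharing and that plain self-attention does not enforce by design:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0                                    # a single "feature" at position (2, 2)
x_shifted = torch.roll(x, shifts=(3, 3), dims=(2, 3))  # the same feature moved to (5, 5)

y = conv(x)
y_shifted = conv(x_shifted)

# Shifting the input shifts the output by the same amount (away from the borders).
print(torch.allclose(torch.roll(y, shifts=(3, 3), dims=(2, 3)), y_shifted))  # True
```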


### But how can everyone get access to massive datasets?

It's not feasible for everyone to train a Vision Transformer on millions of images to get good performance. Instead, one can use openly available model weights from places such as the [Hugging Face Hub](https://huggingface.co/models?sort=trending).

What do you do with the pre-trained model? You can apply transfer learning and fine-tune it!
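
As a minimal sketch of that idea, the snippet below loads pre-trained ViT weights from the Hub with the `transformers` library and attaches a fresh classification head; the checkpoint name and the number of labels are just example choices:

```python
from transformers import AutoImageProcessor, ViTForImageClassification

# Example checkpoint: a ViT backbone pre-trained on ImageNet-21k.
checkpoint = "google/vit-base-patch16-224-in21k"

image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=10,  # size the new (randomly initialized) head for your own dataset
)

# From here the usual transfer-learning recipe applies: optionally freeze the
# backbone, then fine-tune the model on your labelled images.
```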
