Merge pull request #382 from huggingface/stage
merging stage into main
johko authored Feb 28, 2025
2 parents dcc536a + f5dfe12 commit 6f74f7f
Showing 3 changed files with 226 additions and 4 deletions.

As the Transformer architecture scaled well in Natural Language Processing, the same architecture was applied to images by splitting each image into small patches and treating them as tokens. The result was the Vision Transformer (ViT). Before we get started with transfer learning / fine-tuning concepts, let's compare Convolutional Neural Networks (CNNs) with Vision Transformers.

## Vision Transformer (ViT): A Summary

To summarize, in a Vision Transformer, images are reorganized into 2D grids of patches, and the model is trained on those patches.

The main idea is illustrated in the picture below:
![Vision Transformer](https://huggingface.co/datasets/hf-vision/course-assets/blob/main/Screenshot%20from%202024-12-27%2014-25-49.png)
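
To make the patch idea concrete, here is a minimal sketch (assuming PyTorch; the 224x224 image size and 16x16 patch size are just typical example values) of how an image can be turned into a sequence of flattened patch tokens:

```python
import torch

# A randomly generated stand-in for an RGB image: (channels, height, width).
image = torch.randn(3, 224, 224)
patch_size = 16  # ViT-Base, for example, uses 16x16 patches

# Cut the image into non-overlapping 16x16 patches along height and width,
# then flatten every patch into a single vector ("token").
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)

print(patches.shape)  # torch.Size([196, 768]): a 14x14 grid of 768-dimensional patch tokens
```

In an actual ViT, each of these flattened patches is then linearly projected to the model's hidden dimension and combined with a positional embedding before being fed to the transformer encoder.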

But there is a catch! Convolutional Neural Networks (CNNs) are designed with assumptions that are missing in the ViT. These assumptions are based on how we, as humans, perceive objects in images, and they are described in the following section.

## What are the differences between CNNs and Vision Transformers?

### Inductive Bias

Inductive bias is a term used in machine learning to describe the set of assumptions that a learning algorithm uses to make predictions. In simpler terms, inductive bias is like a shortcut that helps a machine learning model make educated guesses based on the information it has seen so far.

Here are a couple of inductive biases we observe in CNNs:
- Translational Equivariance: an object can appear anywhere in the image, and CNNs can detect its features.
- Locality: pixels in an image interact mainly with their surrounding pixels to form features.

CNN models are very good at exploiting these two biases, whereas ViTs do not have them built in. That is why, up to a certain dataset size, CNNs actually outperform ViTs. But ViTs have another strength: because the transformer architecture consists (mostly) of different kinds of linear operations, it is highly scalable, and with massive amounts of data that scalability lets ViTs overcome the lack of these two inductive biases.
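
To see what translational equivariance means in practice, here is a small illustrative sketch (again assuming PyTorch) showing that shifting the input of a convolution simply shifts its output by the same amount, a property the convolution gets for free from weight sharing and that plain self-attention does not enforce by design:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0                                    # a single "feature" at position (2, 2)
x_shifted = torch.roll(x, shifts=(3, 3), dims=(2, 3))  # the same feature moved to (5, 5)

y = conv(x)
y_shifted = conv(x_shifted)

# Shifting the input shifts the output by the same amount (away from the borders).
print(torch.allclose(torch.roll(y, shifts=(3, 3), dims=(2, 3)), y_shifted))  # True
```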


### But how can everyone get access to massive datasets?

It's not feasible for everyone to train a Vision Transformer on millions of images to get good performance. Instead, one can use openly available model weights from places such as the [Hugging Face Hub](https://huggingface.co/models?sort=trending).

What do you do with the pre-trained model? You can apply transfer learning and fine-tune it!
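
As a minimal sketch of that idea, the snippet below loads pre-trained ViT weights from the Hub with the `transformers` library and attaches a fresh classification head; the checkpoint name and the number of labels are just example choices:

```python
from transformers import AutoImageProcessor, ViTForImageClassification

# Example checkpoint: a ViT backbone pre-trained on ImageNet-21k.
checkpoint = "google/vit-base-patch16-224-in21k"

image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=10,  # size the new (randomly initialized) head for your own dataset
)

# From here the usual transfer-learning recipe applies: optionally freeze the
# backbone, then fine-tune the model on your labelled images.
```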
