What is a Vision Transformer?

The Vision Transformer (ViT) is a transformer applied to vision tasks. Google introduced it for image classification, where it achieves slightly better accuracy than the well-known ResNet. Surprisingly, it does not use convolution, which had long been the standard building block for computer vision models.

To understand the Vision Transformer, we first need to understand how transformers work.

Transformers

Transformers have taken over the role of Seq2Seq, Vec2Seq (generating a sequence output from a vector input, for example, generating a description of an image), and Seq2Vec models. This is because their encoder-decoder architecture allows us to process the inputs in parallel. The overall flow is shown in the figure below:

The abstract working of a transformer

Encoder

The raw input is converted into input embeddings. These embeddings are a numeric representation of the raw input, since computers only understand numbers. However, these numbers carry no information about order. To give the embeddings context, we add positional embeddings, which encode where each token sits in the sequence so that the model can tell the first token apart from the last.
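For illustration, here is a minimal sketch of the sinusoidal positional encoding used in the original Transformer paper (the Vision Transformer instead learns its positional embeddings). The sequence length, batch size, and model dimension below are illustrative assumptions.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal positional encodings from the original Transformer paper."""
    position = torch.arange(seq_len).unsqueeze(1)                      # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                       # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                       # odd dimensions
    return pe

# Add positional information to a batch of input embeddings (illustrative shapes).
embeddings = torch.randn(8, 20, 512)                                   # (batch, seq_len, d_model)
embeddings = embeddings + sinusoidal_positional_encoding(20, 512)      # broadcast over the batch
```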

The working of an Encoder in Transformer

After this, the combined embeddings (input embeddings plus positional embeddings) are forwarded to the multi-head attention block. This attention block allows the encoder to give more importance to the useful embeddings. It is followed by a skip connection and a normalization layer, then a feed-forward layer, which is again followed by a skip connection and a normalization layer. As a result, we get the encoded embeddings.
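The PyTorch sketch below shows one such encoder block under simplifying assumptions (post-layer normalization, and illustrative hyperparameters such as a model dimension of 512 and 8 attention heads); it is not a full Transformer implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: multi-head self-attention and a feed-forward
    network, each wrapped with a skip connection and layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attention(x, x, x)       # self-attention over the whole sequence
        x = self.norm1(x + attn_out)                # skip connection + normalization
        x = self.norm2(x + self.feed_forward(x))    # feed-forward, skip connection + normalization
        return x

# Encode a batch of eight 20-token sequences with 512-dimensional embeddings.
encoded = EncoderBlock()(torch.randn(8, 20, 512))
```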

Note: To learn more about how the attention mechanism works, refer to this.

Decoder

Just as on the encoder side, the output is converted into output embeddings. In some cases, these embeddings need to be different from the input embeddings. For example, when we translate English to French, the input embeddings are English-based, while the output embeddings are French-based. The output embeddings are combined with their respective positional embeddings and given to a masked multi-head attention block (masked so that each position can only attend to earlier output positions), followed by a skip connection and normalization.

The working of a Decoder in Transformer

Now the encoder output and the decoder embeddings are combined in another multi-head attention block (cross-attention), where the queries come from the decoder and the keys and values come from the encoder. This is followed by a feed-forward layer sandwiched between skip connections and normalization layers. The last layer is a softmax layer, which gives us the desired output.
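Here is a minimal sketch of one decoder block along the same lines, again with illustrative hyperparameters and a hand-built causal mask; a complete decoder stacks several of these blocks and adds the final projection and softmax.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One Transformer decoder block: masked self-attention over the output embeddings,
    cross-attention over the encoder output, then a feed-forward network, each followed
    by a skip connection and layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, tgt: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Causal mask: True marks positions a token is NOT allowed to attend to.
        seq_len = tgt.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        self_attn, _ = self.self_attention(tgt, tgt, tgt, attn_mask=mask)
        tgt = self.norm1(tgt + self_attn)
        # Cross-attention: queries from the decoder, keys/values from the encoder output.
        cross_attn, _ = self.cross_attention(tgt, memory, memory)
        tgt = self.norm2(tgt + cross_attn)
        return self.norm3(tgt + self.feed_forward(tgt))

# 20 encoder embeddings and 15 partially generated output embeddings, d_model = 512.
out = DecoderBlock()(torch.randn(8, 15, 512), torch.randn(8, 20, 512))
```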

Vision Transformer (ViT)

The Vision Transformer is essentially the encoder part of the transformer with a few changes. To convert an input image into a sequence, we divide it into fixed-size patches, flatten each patch, and project it into an embedding. The rest of the processing is the same as in the encoder, except that we prepend a CLS embedding to the start of the sequence, as sketched below.
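The following sketch shows this patch-embedding step. The hyperparameters (224x224 RGB images, 16x16 patches, 768-dimensional embeddings) are assumptions in the spirit of ViT-Base, not the exact configuration described here.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches, flatten each patch, and project it to the
    model dimension. A learnable CLS token is prepended and learnable positional
    embeddings are added."""

    def __init__(self, image_size: int = 224, patch_size: int = 16,
                 channels: int = 3, d_model: int = 768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2                 # 14 * 14 = 196
        self.patch_size = patch_size
        self.project = nn.Linear(channels * patch_size * patch_size, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, c, h, w = images.shape
        p = self.patch_size
        # (B, C, H, W) -> (B, num_patches, C * p * p): cut into patches and flatten each one.
        patches = images.unfold(2, p, p).unfold(3, p, p)              # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.project(patches)                                # (B, num_patches, d_model)
        cls = self.cls_token.expand(b, -1, -1)                        # one CLS token per image
        tokens = torch.cat([cls, tokens], dim=1)                      # prepend CLS
        return tokens + self.pos_embedding                            # add positional embeddings

# A batch of four 224x224 RGB images becomes a batch of 197-token sequences.
sequence = PatchEmbedding()(torch.randn(4, 3, 224, 224))              # (4, 197, 768)
```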

CLS is short for classification: it is a special learnable embedding used for classification purposes. Its encoded representation summarizes the class of the image. To make the final classification, we only use the encoded CLS embedding and can safely discard the rest of the encoder outputs.
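As a small illustration of that last step, the snippet below assumes a hypothetical encoder output (197 tokens of dimension 768, as in the patch-embedding sketch above) and a 1000-class problem; only the CLS token is fed to the classification head.

```python
import torch
import torch.nn as nn

# Hypothetical encoder output for a batch of 4 images: 1 CLS token + 196 patch tokens.
encoder_output = torch.randn(4, 197, 768)

# Only the encoded CLS token is used; the remaining patch embeddings are discarded.
cls_embedding = encoder_output[:, 0]            # (4, 768)
classifier = nn.Linear(768, 1000)               # e.g., 1000 image classes
logits = classifier(cls_embedding)              # (4, 1000)
predicted_class = logits.argmax(dim=-1)         # class index per image
```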

The working of vision transformer

Conclusion

Even though the Vision Transformer slightly outperformed ResNet, it required a massive amount of data to train. The Google research team trained the model on the JFT-300M dataset, which contains around 300 million images.
