What is a Vision Transformer?

The Vision Transformer (ViT) is a transformer applied to vision tasks. Google introduced it for image classification, where it achieves slightly better accuracy than the well-known ResNet. Surprisingly, it does not use convolution, which had long been the standard building block for computer vision models.

To understand the Vision Transformer, we first need to understand how transformers work.

Transformers

Transformers have taken over the role of Seq2Seq, Vec2Seq (generating a sequence output from a vector input, for example, generating a description of an image), and Seq2Vec models. This is because their encoder-decoder architecture allows us to process the inputs in parallel. The overall flow is shown in the figure below:

The abstract working of a transformer

Encoder

The raw input is converted into input embeddings. These embeddings are a numeric representation of the raw input, since computers only understand numbers. However, these numbers carry no information about order. To give the embeddings context, we add positional embeddings, which encode where each token sits in the sequence so that the model can tell the first token apart from the last.
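For illustration, here is a minimal sketch of the sinusoidal positional encoding used in the original Transformer paper (the Vision Transformer instead learns its positional embeddings). The sequence length, batch size, and model dimension below are illustrative assumptions.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal positional encodings from the original Transformer paper."""
    position = torch.arange(seq_len).unsqueeze(1)                      # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                       # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                       # odd dimensions
    return pe

# Add positional information to a batch of input embeddings (illustrative shapes).
embeddings = torch.randn(8, 20, 512)                                   # (batch, seq_len, d_model)
embeddings = embeddings + sinusoidal_positional_encoding(20, 512)      # broadcast over the batch
```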

The working of an Encoder in Transformer

After this, the combined embeddings (input embeddings plus positional embeddings) are forwarded to the multi-head attention block. This attention block allows the encoder to give more importance to the useful embeddings. It is followed by a skip connection and a normalization layer, then a feed-forward layer, which is again followed by a skip connection and a normalization layer. As a result, we get the encoded embeddings.
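The PyTorch sketch below shows one such encoder block under simplifying assumptions (post-layer normalization, and illustrative hyperparameters such as a model dimension of 512 and 8 attention heads); it is not a full Transformer implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: multi-head self-attention and a feed-forward
    network, each wrapped with a skip connection and layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attention(x, x, x)       # self-attention over the whole sequence
        x = self.norm1(x + attn_out)                # skip connection + normalization
        x = self.norm2(x + self.feed_forward(x))    # feed-forward, skip connection + normalization
        return x

# Encode a batch of eight 20-token sequences with 512-dimensional embeddings.
encoded = EncoderBlock()(torch.randn(8, 20, 512))
```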

Note: To learn more about how the attention mechanism works, refer to this.

Decoder

Just as on the encoder side, the output is converted into output embeddings. In some cases, these embeddings need to be different from the input embeddings. For example, when we translate English to French, the input embeddings are English-based, while the output embeddings are French-based. The output embeddings are combined with their respective positional embeddings and given to a masked multi-head attention block (masked so that each position can only attend to earlier output positions), followed by a skip connection and normalization.

The working of a Decoder in Transformer

Now the encoder output and the decoder embeddings are combined in another multi-head attention block (cross-attention), where the queries come from the decoder and the keys and values come from the encoder. This is followed by a feed-forward layer sandwiched between skip connections and normalization layers. The last layer is a softmax layer, which gives us the desired output.
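Here is a minimal sketch of one decoder block along the same lines, again with illustrative hyperparameters and a hand-built causal mask; a complete decoder stacks several of these blocks and adds the final projection and softmax.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One Transformer decoder block: masked self-attention over the output embeddings,
    cross-attention over the encoder output, then a feed-forward network, each followed
    by a skip connection and layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, tgt: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Causal mask: True marks positions a token is NOT allowed to attend to.
        seq_len = tgt.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        self_attn, _ = self.self_attention(tgt, tgt, tgt, attn_mask=mask)
        tgt = self.norm1(tgt + self_attn)
        # Cross-attention: queries from the decoder, keys/values from the encoder output.
        cross_attn, _ = self.cross_attention(tgt, memory, memory)
        tgt = self.norm2(tgt + cross_attn)
        return self.norm3(tgt + self.feed_forward(tgt))

# 20 encoder embeddings and 15 partially generated output embeddings, d_model = 512.
out = DecoderBlock()(torch.randn(8, 15, 512), torch.randn(8, 20, 512))
```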

Vision Transformer (ViT)

The Vision Transformer is essentially the encoder part of the transformer with a few changes. To convert an input image into a sequence, we divide it into fixed-size patches, flatten each patch, and project it into an embedding. The rest of the processing is the same as in the encoder, except that we prepend a CLS embedding to the start of the sequence, as sketched below.
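The following sketch shows this patch-embedding step. The hyperparameters (224x224 RGB images, 16x16 patches, 768-dimensional embeddings) are assumptions in the spirit of ViT-Base, not the exact configuration described here.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches, flatten each patch, and project it to the
    model dimension. A learnable CLS token is prepended and learnable positional
    embeddings are added."""

    def __init__(self, image_size: int = 224, patch_size: int = 16,
                 channels: int = 3, d_model: int = 768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2                 # 14 * 14 = 196
        self.patch_size = patch_size
        self.project = nn.Linear(channels * patch_size * patch_size, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, c, h, w = images.shape
        p = self.patch_size
        # (B, C, H, W) -> (B, num_patches, C * p * p): cut into patches and flatten each one.
        patches = images.unfold(2, p, p).unfold(3, p, p)              # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.project(patches)                                # (B, num_patches, d_model)
        cls = self.cls_token.expand(b, -1, -1)                        # one CLS token per image
        tokens = torch.cat([cls, tokens], dim=1)                      # prepend CLS
        return tokens + self.pos_embedding                            # add positional embeddings

# A batch of four 224x224 RGB images becomes a batch of 197-token sequences.
sequence = PatchEmbedding()(torch.randn(4, 3, 224, 224))              # (4, 197, 768)
```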

CLS is short for classification: it is a special learnable embedding used for classification purposes. Its encoded representation summarizes the class of the image. To make the final classification, we only use the encoded CLS embedding and can safely discard the rest of the encoder outputs.
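As a small illustration of that last step, the snippet below assumes a hypothetical encoder output (197 tokens of dimension 768, as in the patch-embedding sketch above) and a 1000-class problem; only the CLS token is fed to the classification head.

```python
import torch
import torch.nn as nn

# Hypothetical encoder output for a batch of 4 images: 1 CLS token + 196 patch tokens.
encoder_output = torch.randn(4, 197, 768)

# Only the encoded CLS token is used; the remaining patch embeddings are discarded.
cls_embedding = encoder_output[:, 0]            # (4, 768)
classifier = nn.Linear(768, 1000)               # e.g., 1000 image classes
logits = classifier(cls_embedding)              # (4, 1000)
predicted_class = logits.argmax(dim=-1)         # class index per image
```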

The working of vision transformer

Conclusion

Even though the Vision Transformer slightly outperformed ResNet, it required a massive amount of data to train. The Google research team trained the model on the JFT-300M dataset, which contains around 300 million images.
