Vision Transformer (ViT) For Image Classification

Identify images of different dog breeds.


Introduction

The research paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" introduced the Vision Transformer to computer vision, delivering state-of-the-art performance without relying on convolutional neural networks (CNNs). The architecture draws inspiration from the Transformer, which has become the dominant architecture in the natural language processing domain. The goal of this project is to implement the paper and grasp the essential concepts behind the Vision Transformer. We built the Vision Transformer (ViT) from scratch in PyTorch, trained it on augmented data, and then applied the model to classify images of dog breeds.

The figure below displays our project pipeline.

Project Pipeline

Task Goal
1. Build the Vision Transformer (ViT) from scratch. Implement each layer of the ViT.
2. Apply transfer learning. Use a pretrained ViT and compare it with the from-scratch model.
3. Perform image classification on a given image. Identify the breed of the dog in the image.

We use the Stanford Dogs Dataset, which contains 20,580 images of 120 dog breeds. The dataset is a subset of the large-scale image database ImageNet. You can download it from either of the two sites.
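As a minimal sketch of how the images could be loaded with torchvision (the folder paths and the exact augmentations are illustrative assumptions, not necessarily the ones used in our training script):

```python
import torch
from torchvision import datasets, transforms

# A simple augmentation pipeline for training (a sketch; the exact
# augmentations in the actual training script may differ).
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),      # ViT-Base expects 224 x 224 inputs
    transforms.RandomHorizontalFlip(),  # basic data augmentation
    transforms.ToTensor(),
])

# "data/train" is a hypothetical path; point it at the extracted
# Stanford Dogs images arranged one folder per breed.
train_data = datasets.ImageFolder("data/train", transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
```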

We trained the model on an NVIDIA RTX A6000; training takes about 20 minutes.

Vision Transformer (ViT)

ViT_model
Model Overview

This is the model overview of the Vision Transformer. A Vision Transformer model consists of three parts.

Block Functions
Patch + Position Embedding (inputs) The input image is divided into a sequence of patches, each of size 16 x 16. The patches are flattened into 1D embeddings through a trainable linear projection. A learnable class embedding is then prepended to the patch embeddings, and position embeddings are added: the position embeddings retain the spatial information of the image, while the class embedding serves as the representation used for classification (see the sketch after this list).
Transformer Encoder The encoder comprises a multi-head self-attention layer (MSA) and a multilayer perceptron (MLP). Transformer Encoder blocks can be stacked to create a deeper model. Further details about the encoder are provided below.
MLP Head The output of the Transformer Encoders, which serves as the image representation, goes into the MLP Head. In the context of image classification, it is referred to as a "classifier head". It is implemented as an MLP, and its output is the predicted class.
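As a concrete illustration of the patch and position embeddings, here is a minimal PyTorch sketch. The class and parameter names are our own, and the strided Conv2d is one common way to implement the trainable linear projection of flattened 16 x 16 patches.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches, embed them, and add class/position embeddings."""
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768, img_size=224):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared trainable linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.class_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) - sequence of patch embeddings
        cls = self.class_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the class token -> (B, 197, 768)
        return x + self.pos_embedding        # add position embeddings
```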
Transformer Encoder

A Transformer Encoder block contains multiple components.

MLP

A multilayer perceptron consisting of two linear layers with a GELU non-linearity.
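A minimal sketch of the MLP block, assuming the ViT-Base settings (embedding dimension 768, MLP dimension 3072); the class name is our own.

```python
from torch import nn

class MLPBlock(nn.Module):
    """Two-layer feed-forward block with GELU, as used inside the encoder."""
    def __init__(self, embed_dim=768, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```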

Multi-Head Self-Attention (MSA)

The attention mechanism plays a pivotal role in the Transformer model, as introduced in the paper "Attention Is All You Need". Multi-head attention runs several attention operations in parallel and combines their outputs.
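PyTorch ships a multi-head attention layer, so the MSA block can be written as a thin wrapper around nn.MultiheadAttention; the sketch below assumes the ViT-Base configuration (12 heads, embedding dimension 768) and uses batch_first=True so the input keeps its (batch, sequence, embedding) shape.

```python
from torch import nn

class MSABlock(nn.Module):
    """Multi-head self-attention: queries, keys and values are the same sequence."""
    def __init__(self, embed_dim=768, num_heads=12, dropout=0.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)

    def forward(self, x):                    # x: (B, num_patches + 1, embed_dim)
        out, _ = self.attn(query=x, key=x, value=x, need_weights=False)
        return out
```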

Norm

Layer Normalization (Norm) is applied before the MSA and MLP blocks. It normalizes the activations of each token, which stabilizes and speeds up training of the network.

Residual Connection

Residual connections (skip connections) are applied around every block. They help mitigate vanishing gradients and improve the training stability of deep neural networks. The technique was proposed in the paper "Deep Residual Learning for Image Recognition".
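Putting the pieces together, one encoder block applies LayerNorm before each sub-block and adds a residual connection around it. This is a pre-norm sketch that reuses the MSABlock and MLPBlock sketched above; the class name is our own.

```python
from torch import nn

class TransformerEncoderBlock(nn.Module):
    """Pre-norm encoder block: LayerNorm -> MSA -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.msa = MSABlock(embed_dim, num_heads)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = MLPBlock(embed_dim, mlp_dim, dropout)

    def forward(self, x):
        x = x + self.msa(self.norm1(x))   # residual connection around attention
        x = x + self.mlp(self.norm2(x))   # residual connection around the MLP
        return x
```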


ViT Training Results

After building the model from scratch, here are our training results.

LossPlot
AccuracyPlot
prediction1
prediction2
prediction3
Accuracy on test data: 43.846%

Analysis:

  1.   For the ViT model trained from scratch, the result is not good enough. The final accuracy on the test data reaches only 43.8%, and when we make predictions on test images, only one result is correct.
  2.   From the training/validation plots, the training loss has plateaued, the accuracy curves are unstable, and the accuracy values are low. The plots show that our model is underfitting.
  3.   This is likely caused by the limited size of the dataset used to train the model. Even with data augmentation, our training set comprises only about 2,500 images, whereas the original paper leveraged large-scale datasets such as ImageNet-21k and JFT-300M. Unfortunately, our computational resources prevent us from training on datasets of that scale, so we take advantage of a pretrained model and perform transfer learning.

Apply Transfer Learning:

Pretrained-ViT Training Results

After applying transfer learning with the pretrained ViT, here are our training results.

LossPlot
AccuracyPlot
prediction_pViT
prediction_pViT2
prediction_pViT3
Accuracy on test data: 97.44%

Analysis:

  1.   To employ a pretrained model from torchvision.models, we have to use the specific transforms adopted by that pretrained model: the image dataset needs to undergo the same transformation process as the original training data used for pretraining, so the images here are preprocessed accordingly (see the sketch after this list).
  2.   The training/validation plots show that the loss curves keep decreasing and the accuracy curves keep rising. The final accuracy on the test data is 97.44%, and for image classification on the test images, the predictions are all correct.
  3.   When we adopt the pretrained weights with the ViT model, the test accuracy for the image classification task improves dramatically. The model performs well even though our training dataset is small.
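A minimal sketch of how the pretrained ViT-B/16 from torchvision can be adapted: the weights object carries the matching preprocessing transforms, and only the classifier head is replaced for the 120 Stanford Dogs classes. The dataset path and variable names are our own assumptions.

```python
import torch
from torch import nn
from torchvision import datasets
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load the pretrained ViT-B/16 and its matching preprocessing transforms.
weights = ViT_B_16_Weights.DEFAULT
model = vit_b_16(weights=weights)
pretrained_transform = weights.transforms()

# The dataset must use the exact transforms the pretrained model was trained with.
train_data = datasets.ImageFolder("data/train", transform=pretrained_transform)  # hypothetical path

# Freeze the backbone and replace only the classifier head.
for param in model.parameters():
    param.requires_grad = False
model.heads = nn.Linear(in_features=768, out_features=120)  # 120 dog breeds
```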

Source Code Link


Development

Python

PyTorch

Reference

  1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  2. Attention Is All You Need
  3. PyTorch Paper Replicating