Vision Transformer (ViT) For Image Classification

Identify images of different dog breeds.


Introduction

The research paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" introduced the Vision Transformer to computer vision, delivering state-of-the-art performance without relying on convolutional neural networks (CNNs). The architecture draws inspiration from the Transformer, which has become the dominant architecture in the natural language processing domain. The goal of this project is to implement the paper and grasp the essential concepts behind the Vision Transformer. We built the Vision Transformer (ViT) from scratch in PyTorch, trained it on augmented data, and then applied the model to classify images of dog breeds.

The figure below displays our project pipeline.

Project Pipeline

Task Goal
1. Build the Vision Transformer (ViT) from scratch. Implement each layer of the ViT.
2. Apply transfer learning. Use a pretrained ViT and compare it with the from-scratch model.
3. Perform image classification on a given image. Identify the breed of the dog in the image.

We use the Stanford Dogs Dataset, which contains 20,580 images of 120 dog breeds. The dataset is a subset of the large-scale image database ImageNet. You can download it from either of the two sites.
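As a minimal sketch of how the images could be loaded with torchvision (the folder paths and the exact augmentations are illustrative assumptions, not necessarily the ones used in our training script):

```python
import torch
from torchvision import datasets, transforms

# A simple augmentation pipeline for training (a sketch; the exact
# augmentations in the actual training script may differ).
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),      # ViT-Base expects 224 x 224 inputs
    transforms.RandomHorizontalFlip(),  # basic data augmentation
    transforms.ToTensor(),
])

# "data/train" is a hypothetical path; point it at the extracted
# Stanford Dogs images arranged one folder per breed.
train_data = datasets.ImageFolder("data/train", transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
```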

We trained the model on an NVIDIA RTX A6000; training takes about 20 minutes.

Vision Transformer (ViT)

ViT_model
Model Overview

This is the model overview of the Vision Transformer. A Vision Transformer model consists of three parts.

Block Functions
Patch + Position Embedding (inputs) The input image is divided into a sequence of patches, each of size 16 x 16. The patches are flattened into 1D embeddings through a trainable linear projection. A learnable class embedding is then prepended to the patch embeddings, and position embeddings are added: the position embeddings retain the spatial information of the image, while the class embedding serves as the representation used for classification (see the sketch after this list).
Transformer Encoder The encoder comprises a multi-head self-attention layer (MSA) and a multilayer perceptron (MLP). Transformer Encoder blocks can be stacked to create a deeper model. Further details about the encoder are provided below.
MLP Head The output of the Transformer Encoders, which serves as the image representation, goes into the MLP Head. In the context of image classification, it is referred to as a "classifier head". It is implemented as an MLP, and its output is the predicted class.
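As a concrete illustration of the patch and position embeddings, here is a minimal PyTorch sketch. The class and parameter names are our own, and the strided Conv2d is one common way to implement the trainable linear projection of flattened 16 x 16 patches.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches, embed them, and add class/position embeddings."""
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768, img_size=224):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared trainable linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.class_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) - sequence of patch embeddings
        cls = self.class_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the class token -> (B, 197, 768)
        return x + self.pos_embedding        # add position embeddings
```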
Transformer Encoder

A Transformer Encoder block contains multiple components.

MLP

A multilayer perceptron consisting of two linear layers with a GELU non-linearity.
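A minimal sketch of the MLP block, assuming the ViT-Base settings (embedding dimension 768, MLP dimension 3072); the class name is our own.

```python
from torch import nn

class MLPBlock(nn.Module):
    """Two-layer feed-forward block with GELU, as used inside the encoder."""
    def __init__(self, embed_dim=768, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```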

Multi-Head Self-Attention (MSA)

The attention mechanism plays a pivotal role in the Transformer model, as introduced in the paper "Attention Is All You Need". Multi-head attention runs several attention operations in parallel and combines their outputs.
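PyTorch ships a multi-head attention layer, so the MSA block can be written as a thin wrapper around nn.MultiheadAttention; the sketch below assumes the ViT-Base configuration (12 heads, embedding dimension 768) and uses batch_first=True so the input keeps its (batch, sequence, embedding) shape.

```python
from torch import nn

class MSABlock(nn.Module):
    """Multi-head self-attention: queries, keys and values are the same sequence."""
    def __init__(self, embed_dim=768, num_heads=12, dropout=0.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)

    def forward(self, x):                    # x: (B, num_patches + 1, embed_dim)
        out, _ = self.attn(query=x, key=x, value=x, need_weights=False)
        return out
```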

Norm

Layer Normalization (Norm) is applied before the MSA and MLP blocks. It normalizes the activations of each token, which stabilizes and speeds up training of the network.

Residual Connection

Residual connections (skip connections) are applied around every block. They help mitigate vanishing gradients and improve the training stability of deep neural networks. The technique was proposed in the paper "Deep Residual Learning for Image Recognition".
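Putting the pieces together, one encoder block applies LayerNorm before each sub-block and adds a residual connection around it. This is a pre-norm sketch that reuses the MSABlock and MLPBlock sketched above; the class name is our own.

```python
from torch import nn

class TransformerEncoderBlock(nn.Module):
    """Pre-norm encoder block: LayerNorm -> MSA -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.msa = MSABlock(embed_dim, num_heads)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = MLPBlock(embed_dim, mlp_dim, dropout)

    def forward(self, x):
        x = x + self.msa(self.norm1(x))   # residual connection around attention
        x = x + self.mlp(self.norm2(x))   # residual connection around the MLP
        return x
```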


ViT Training Results

After building the model from scratch, here are our training results.

LossPlot
AccuracyPlot
prediction1
prediction2
prediction3
Accuracy on test data: 43.846%

Analysis:

  1.   For the ViT model trained from scratch, the result is not good enough. The final accuracy on the test data reaches only 43.8%, and when we make predictions on test images, only one result is correct.
  2.   From the training/validation plots, the training loss has plateaued, the accuracy curves are unstable, and the accuracy values are low. The plots show that our model is underfitting.
  3.   This is likely caused by the limited size of the dataset used to train the model. Even with data augmentation, our training set comprises only about 2,500 images, whereas the original paper leveraged large-scale datasets such as ImageNet-21k and JFT-300M. Unfortunately, our computational resources prevent us from training on datasets of that scale, so we take advantage of a pretrained model and perform transfer learning.

Apply Transfer Learning:

Pretrained-ViT Training Results

After applying transfer learning with the pretrained ViT, here are our training results.

LossPlot
AccuracyPlot
prediction_pViT
prediction_pViT2
prediction_pViT3
Accuracy on test data: 97.44%

Analysis:

  1.   To employ a pretrained model from torchvision.models, we have to use the specific transforms adopted by that pretrained model: the image dataset needs to undergo the same transformation process as the original training data used for pretraining, so the images here are preprocessed accordingly (see the sketch after this list).
  2.   The training/validation plots show that the loss curves keep decreasing and the accuracy curves keep rising. The final accuracy on the test data is 97.44%, and for image classification on the test images, the predictions are all correct.
  3.   When we adopt the pretrained weights with the ViT model, the test accuracy for the image classification task improves dramatically. The model performs well even though our training dataset is small.
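A minimal sketch of how the pretrained ViT-B/16 from torchvision can be adapted: the weights object carries the matching preprocessing transforms, and only the classifier head is replaced for the 120 Stanford Dogs classes. The dataset path and variable names are our own assumptions.

```python
import torch
from torch import nn
from torchvision import datasets
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load the pretrained ViT-B/16 and its matching preprocessing transforms.
weights = ViT_B_16_Weights.DEFAULT
model = vit_b_16(weights=weights)
pretrained_transform = weights.transforms()

# The dataset must use the exact transforms the pretrained model was trained with.
train_data = datasets.ImageFolder("data/train", transform=pretrained_transform)  # hypothetical path

# Freeze the backbone and replace only the classifier head.
for param in model.parameters():
    param.requires_grad = False
model.heads = nn.Linear(in_features=768, out_features=120)  # 120 dog breeds
```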

Source Code Link


Development

Python

PyTorch

Reference

  1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  2. Attention Is All You Need
  3. PyTorch Paper Replicating