Identify different dog breeds in images.
The research paper, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", introduced the Vision Transformer (ViT) to computer vision, delivering state-of-the-art performance without relying on convolutional neural networks (CNNs). This innovation draws inspiration from the Transformer architecture that has become dominant in the natural language processing domain. The goal of the project is to implement the paper and grasp the essential concepts in the Vision Transformer. We constructed the Vision Transformer (ViT) from scratch using PyTorch, trained it with augmented data, and then applied the model to classify images by dog breed.
The figure below displays our project pipeline.
Task | Goal |
---|---|
1. | Build the Vision Transformer (ViT) from scratch, implementing each layer of the model. |
2. | Apply transfer learning: use a pretrained ViT and compare it with the from-scratch model (see the sketch after this table). |
3. | Perform image classification on a given image, identifying the breed of the dog it shows. |
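As an illustration of the transfer-learning task, here is a minimal sketch using torchvision's pretrained ViT-B/16. It assumes torchvision >= 0.13; `NUM_BREEDS` is a placeholder for the number of breed classes in the dataset, not a value taken from this project.

```python
import torch
import torchvision

# Minimal transfer-learning sketch (assumes torchvision >= 0.13).
# NUM_BREEDS is a placeholder for the number of dog-breed classes.
NUM_BREEDS = 120

# Load ViT-B/16 with ImageNet-pretrained weights.
weights = torchvision.models.ViT_B_16_Weights.DEFAULT
model = torchvision.models.vit_b_16(weights=weights)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Swap the classifier head for one sized to our breed classes.
model.heads.head = torch.nn.Linear(model.heads.head.in_features, NUM_BREEDS)
```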
Below is an overview of the Vision Transformer. The model consists of three parts.
Block | Function |
---|---|
Patch + Position Embedding | The input image is divided into a sequence of patches, each of size 16 x 16. The patches are flattened and mapped to 1D embeddings through a trainable linear projection. A learnable class embedding is then prepended to the sequence of patch embeddings, and position embeddings are added. The position embeddings retain the spatial information of the image, while the final state of the class embedding serves as the image representation used for classification. A sketch of this stage follows the table. |
Transformer Encoder | The encoder comprises a multi-headed self-attention layer (MSA) and a multilayer perceptron (MLP). Transformer Encoder blocks can be stacked to create a deeper model. Further details about the encoder are provided below. |
MLP Head | The output of the Transformer Encoders, serving as the image representation, goes into the MLP Head. In the context of image classification, it is referred to as the "classifier head"; its output is the predicted class of the image. |
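To make the first stage concrete, here is a minimal PyTorch sketch of the patch + position embedding. The layer names, the image size of 224, and the ViT-Base embedding dimension of 768 are our assumptions for illustration, not values fixed by this project.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches, linearly embed them,
    prepend a class token, and add learnable position embeddings.
    (Illustrative sketch; sizes assume ViT-Base on 224x224 input.)"""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution is equivalent to flattening each patch
        # and applying a trainable linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.class_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.position_embedding = nn.Parameter(
            torch.randn(1, num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768)
        cls = self.class_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # (B, 197, 768)
        return x + self.position_embedding   # add position information
```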
A Transformer Encoder block contains multiple components.
A multilayer perceptron (MLP). In ViT, this consists of two linear layers with a GELU non-linearity between them.
The attention mechanism plays a pivotal role in the Transformer model, as introduced in the paper "Attention Is All You Need". Multi-Head Attention runs several attention operations in parallel and combines their outputs.
Layer Normalization (Norm) is applied before every sub-block. It normalizes activations across the feature dimension, which stabilizes and speeds up training.
Residual connections (skip connections) are applied after every sub-block. They mitigate the vanishing-gradient problem and improve the stability of deep neural networks, as proposed in the paper "Deep Residual Learning for Image Recognition".
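Putting these components together, a minimal pre-norm encoder block might look like the following in PyTorch. The hyperparameters assume the ViT-Base configuration (embedding dimension 768, 12 heads, MLP size 3072); they are illustrative, not a statement of this project's exact settings.

```python
import torch
from torch import nn

class TransformerEncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: Norm -> MSA -> residual,
    then Norm -> MLP -> residual. (Sketch assuming ViT-Base sizes.)"""

    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Norm is applied before each sub-block; the residual (skip)
        # connection adds the sub-block's output back to its input.
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x
```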
After building the model from scratch, here are our training results.