Full Transcript

Artificial Intelligence
8. Vision Transformers
16 Oct. 2024, Sang-Hoon Lee, Ajou University

Index
▪ Vision Transformer
▪ Masked Autoencoder (MAE)
▪ CLIP
▪ Vision Language Models (VLM)
▪ Generative Models: DiT (Diffusion Transformer), VDT (Video Diffusion Transformer)

Vision Transformer
▪ The Vision Transformer (ViT) is a transformer designed for computer vision.
▪ Image Patch: an input image is divided into patches, each of which is linearly mapped through a patch embedding layer before entering a standard Transformer encoder. The Transformer uses an image patch the way it uses a text token (see the patch-embedding sketch below).
▪ The image [H, W, C] is reshaped into a sequence of flattened 2D patches [N, P×P×C]:
✓ (H, W) is the resolution of the original image
✓ C is the number of channels
✓ (P, P) is the resolution of each image patch; ViT uses a 16×16 input patch size
✓ The Transformer's sequence length is inversely proportional to the square of the patch size, so models with smaller patch sizes are computationally more expensive
▪ Positional Embedding: when a sequence of flattened 2D patches replaces the image, the positional information of the image is lost, so ViT adds positional embeddings to the input patches to encode spatial information.
✓ While there is a large gap between the performance of a model with no positional embedding and models with positional embedding, there is little to no difference between the different ways of encoding positional information.
▪ Transformer Encoder: multi-head attention followed by an MLP.
✓ Self-attention allows ViT to integrate information across the entire image; each head attends to different information, and the per-head features are then integrated.
▪ Classification Head: the [CLS] token is fed into an MLP layer for classification.
▪ Results: Vision Transformer matches or exceeds the state of the art on many image classification datasets, while being relatively cheap to pre-train.

Masked Autoencoder (MAE)
▪ Goal: self-supervised learning for vision transformers.
▪ BERT-style pre-training with random masking of image patches (see the masking sketch below).
▪ MAE pre-training can improve performance on downstream tasks.

CLIP
▪ Contrastive Language-Image Pre-Training (CLIP) uses the pre-training task of predicting which caption goes with which image to learn image representations from scratch on a dataset of 400 million (image, text) pairs.
▪ After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks.
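To make the patch-embedding step concrete, here is a minimal PyTorch sketch of the ViT input pipeline. The class and argument names (PatchEmbed, embed_dim, and so on) are illustrative assumptions, not from the slides; the sizes follow ViT-Base (224×224 images, 16×16 patches, 768-dim embeddings).

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and linearly embed each one.

    A minimal sketch of the ViT input pipeline; hyperparameters follow
    ViT-Base, but all names here are illustrative assumptions.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # N = HW / P^2
        # A stride-P convolution is equivalent to flattening each
        # P x P x C patch and applying one shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learned 1D positional embeddings for [CLS] + N patch tokens.
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: [B, C, H, W]
        x = self.proj(x)                       # [B, D, H/P, W/P]
        x = x.flatten(2).transpose(1, 2)       # [B, N, D]
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # [B, N + 1, D]
        return x + self.pos_embed              # add spatial information

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                            # torch.Size([2, 197, 768])
```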
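The random-masking step of MAE pre-training can also be sketched in a few lines. This illustrates the idea only (the function name and shapes are assumptions): keep a random 25% of the patch tokens for the encoder and mark the remaining 75% for reconstruction, the masking ratio reported to work well for images in the MAE paper.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Per-sample random masking over patch tokens (MAE-style sketch).

    Keeps a random (1 - mask_ratio) subset of tokens for the encoder;
    a decoder would later reconstruct the masked patches.
    """
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                    # one random score per patch
    ids_shuffle = noise.argsort(dim=1)          # random permutation per sample
    ids_keep = ids_shuffle[:, :len_keep]
    kept = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    # Binary mask over patches: 1 = masked (to be reconstructed).
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)
    return kept, mask

kept, mask = random_masking(torch.randn(2, 196, 768))
print(kept.shape, mask.sum(dim=1))              # [2, 49, 768]; 147 masked each
```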
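The CLIP objective itself is a symmetric contrastive loss over a batch of (image, text) pairs. A minimal sketch, assuming pre-computed embeddings and a fixed temperature (CLIP actually learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs.

    Embeddings are L2-normalized, pairwise cosine similarities form a
    [B, B] logit matrix, and the model is trained to predict which
    caption goes with which image (and vice versa). The temperature
    value is an assumed initial setting.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # [B, B] similarities
    targets = torch.arange(logits.size(0))            # matches on the diagonal
    loss_i = F.cross_entropy(logits, targets)         # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```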
CLIP for Image Generation
▪ Text-to-image generation: text representations trained with the CLIP objective contain information similar to that of the image representations, so the generator can be conditioned on the text representation instead of the image representation.
✓ (Training) image representation → image generation
✓ (Inference) text representation → image generation

Vision Language Model
▪ MiniGPT-4 aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using a single projection layer.

Diffusion Transformer (DiT)
▪ Transformers can be used as the backbone of diffusion models (see the DiT block sketch at the end of this transcript).

Diffusion Models
▪ Denoising Diffusion Probabilistic Models (DDPM): a diffusion model is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time.
▪ Diffusion process: a Markov chain that gradually adds noise to the data, in the direction opposite to sampling, until the signal is destroyed (see the forward-process sketch below).
▪ Reverse process: if the diffusion consists of small amounts of Gaussian noise, it is sufficient to set the sampling-chain transitions to conditional Gaussians too, allowing a particularly simple neural-network parameterization.

Video Diffusion Transformer (VDT)
▪ Video generation models using a diffusion Transformer.

Conclusion
▪ Vision Transformer: image patches can be used as tokens for Transformers.
▪ MAE: pre-training with a masking strategy can increase the capacity of pre-trained models.
▪ CLIP/VLM: NLP + image.
▪ Generative models with Transformers: DiT, VDT.
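The DDPM forward (diffusion) process has a convenient closed form: any step can be sampled directly from the data as x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t = ∏_{s≤t}(1 − β_s). A minimal sketch, assuming the linear β schedule that the DDPM paper uses as its default:

```python
import torch

def forward_diffusion(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) for the DDPM forward process.

    Uses the closed form x_t = sqrt(alpha_bar_t) * x0
    + sqrt(1 - alpha_bar_t) * eps, so any step of the noising Markov
    chain is sampled directly. The beta schedule is an assumption.
    """
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # \bar{alpha}_t
    ab_t = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)          # the noise a denoiser learns to predict
    return ab_t.sqrt() * x0 + (1.0 - ab_t).sqrt() * eps, eps

betas = torch.linspace(1e-4, 0.02, 1000)                  # T = 1000 steps
x0 = torch.randn(4, 3, 32, 32)                            # a batch of "images"
t = torch.randint(0, 1000, (4,))                          # random timesteps
xt, eps = forward_diffusion(x0, t, betas)                 # noisy inputs + targets
```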
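For DiT, the key architectural choice is how to inject the diffusion timestep (and any class or text condition) into a standard Transformer block. Below is a simplified sketch of adaptive-LayerNorm conditioning; DiT's actual adaLN-Zero variant additionally regresses gating parameters initialized to zero, which is omitted here, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One Transformer block with adaptive LayerNorm conditioning.

    The condition vector c (e.g. a timestep embedding) is mapped by a
    linear layer to per-block shift/scale parameters that modulate
    LayerNorm before the attention and MLP sublayers.
    """
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 4 * dim)   # regress shift/scale from c

    def forward(self, x, c):                        # x: [B, N, D], c: [B, D]
        s1, b1, s2, b2 = self.ada(c).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + self.mlp(h)

x = torch.randn(2, 256, 384)                 # 256 latent-patch tokens
c = torch.randn(2, 384)                      # timestep/class embedding
print(DiTBlock()(x, c).shape)                # torch.Size([2, 256, 384])
```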
