Summary

This document is a lecture/presentation on deep learning, focusing on contrastive learning, with topics including metric learning, triplet loss, and pre-trained language-vision models. It also briefly touches on supervised, unsupervised, and weakly supervised learning.

Full Transcript

Deep Learning Week 14-2

Announcements
❖ Homework 2 is due on Dec 11th (Wed), 11:59pm
❖ The lecture ends on Dec 11th
❖ The final exam will take place on December 18th (Wed)
❖ There will be no sample questions for the final exam. However, the final exam follows a format similar to the midterm, including multiple-choice, true/false, and other question types

Content
❖ Metric Learning
  - Learning to Rank
  - Triplet Loss
  - Contrastive Learning
❖ Pre-trained Language-Vision Models
  - CLIP
  - BLIP
  - LLaVA

Supervised, Unsupervised, and Weakly Supervised Learning
❖ Supervised learning: there are explicit labels
  - Sentiment classification, linguistic acceptability (CoLA), machine translation
❖ Unsupervised learning: a model learns from data without any labels
  - Masked language modeling
  - Autoregressive language modeling
❖ Weak supervision: we do not know what the labels are, but labels come in the form of relative relationships (e.g., A > B)

Weak Supervision for the Language-Vision Domain
❖ Example: for a given image, "A man preparing desserts in a kitchen covered in frosting." > "A restaurant has modern wooden tables and chairs."

Metric Learning
❖ Metric learning aims to answer the following question: given two samples, how semantically close (= similar) are they?
❖ Metric learning aims to learn a 'distance function' over samples
  - Specifically, we learn a "semantic distance", which is "similarity"
  - However, the similarity is defined with respect to a given dataset
  - Hence metric learning aims to learn "relative similarity"
❖ It is hard to apply supervised machine learning algorithms directly, e.g., to annotate Distance(image 1, image 2) = 0.7 as an absolute target
❖ In the first place, why do we build a model that learns "relative similarity"? (= Why do we do metric learning?)
  - It is much easier to collect datasets that contain 'similarity' signals
  - Examples:
    - YouTube: watched vs. not-watched videos
    - Search engine: the first few results vs. the very last ones
  - Just take a look at the log information, that is it! No need to go over annotations to assign labels
❖ We do metric learning (= learn relative similarity) to learn representations of samples
❖ Is metric learning supervised or unsupervised?

Learning to Rank: Build a ranking model
❖ We build a ranking model to encode the 'relative similarity' of given data: the ultimate purpose of a ranking model is to produce a permutation of items in new, unseen lists in a way similar to the rankings in the training data
❖ The training data consists of lists of items. Depending on the format of the lists, we can divide approaches into:
  - Point-wise
  - Pair-wise
  - List-wise
❖ Examples of ranking models:
  - Document retrieval / web search engine: a text query and a list of documents ordered by relevance to the query
  - Collaborative filtering / recommendation system: a user and a list of items ordered by the user's preference

How can we build a ranking model?
❖ Point-wise
  - If score(A, B) = 0.7 and score(A, C) = 0.6, then we train a model to predict these scores directly
❖ Pair-wise (see the sketch after this list)
  - Example of training samples: an ordered pair of two items for a query. A is a query and the ordered pair says that, for A, B is preferred over C, i.e., B > C
  - How do we train a ranking model? We train a model to predict a score per item for A
  - We preserve the order, not the score itself = minimize the average number of 'inversions in ranking'
❖ List-wise
  - Example of training samples: an ordered list of more than two items for a query. A is a query and the ordered list is (B, C, D, E)
  - Computationally expensive! Why? Because the loss is defined over permutations of the list
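The pair-wise case above can be made concrete with a small, RankNet-style sketch. This is a minimal illustration, not the lecture's prescribed implementation: the scorer architecture, feature dimensions, and the particular loss form (a logistic loss on the score difference) are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

# Minimal RankNet-style pairwise ranking sketch (illustrative assumptions:
# `scorer` maps a concatenated (query, item) feature vector to one relevance score).
scorer = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def pairwise_rank_loss(query, preferred, other):
    """Encourage score(query, preferred) > score(query, other).

    Only the order of the two scores matters, not their absolute values,
    which matches the 'preserve the order, not the score itself' idea.
    """
    s_pos = scorer(torch.cat([query, preferred], dim=-1))
    s_neg = scorer(torch.cat([query, other], dim=-1))
    # -log sigmoid(s_pos - s_neg): penalize inversions where `other` outranks `preferred`.
    return nn.functional.softplus(s_neg - s_pos).mean()

# Toy usage with random 8-dimensional query/item features (batch of 4 pairs).
q, b, c = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
loss = pairwise_rank_loss(q, b, c)
loss.backward()
```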
Triplet Loss
(https://arxiv.org/pdf/1503.03832)
❖ Training data: (anchor, positive, negative) triplets
❖ We aim to build a model that can tell us that (anchor, positive) is closer than (anchor, negative)
❖ How can we do the training?
  - Measure (1) the distance between the anchor and the positive, and (2) the distance between the anchor and the negative
  - If (1) plus the margin is greater than (2), i.e., the positive is not closer than the negative by at least the margin, the loss is the difference (1) - (2) + margin, and we minimize it (otherwise the loss is zero)
  - What is the "margin"? To ensure that the negative sample stays sufficiently far away, we leverage a margin
❖ Examples of positives and negatives
  - Positive samples: if a user clicks on an item, it is treated as a positive sample
  - Negative samples: to construct negative samples, we randomly select items that the user did not click
❖ However, randomly selecting negative samples is not a good idea… Why?
  - Challenging negative samples: it is crucial to carefully construct challenging negative samples. Without this, the model may stop learning effectively (two examples are shown on the slides)

Contrastive Learning
❖ Contrastive learning trains a model via a pairwise loss function
  - We pair training samples and label a pair 1 if it is similar, 0 otherwise
  - We push dissimilar samples further away and pull similar samples closer
  - How? Make the distance between (anchor, positive) small and the distance between (anchor, negative) large
❖ If this is the case, contrastive learning and triplet loss look exactly the same
❖ However, these methods are different (see the sketches below)
  - Triplet loss is calculated from the difference between the two distances
  - The contrastive loss is calculated from each distance itself (not a difference)

Contrastive Learning: SimCLR, InfoNCE
❖ SimCLR: given that (i, j) is a positive pair, SimCLR calculates the loss as
  ℓ(i, j) = -log [ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ]
  where sim(·, ·) is cosine similarity, τ is a temperature, and the sum runs over all 2N samples in the batch except i
❖ Noise Contrastive Estimation (NCE)
  - Inspired by negative sampling
  - In word2vec, we pair a target word with a context word and train the model so that the score for the (target, context) pair is high
  - However, considering all words other than the context words would be very inefficient
  - So we just sample random words and use them as negative pairs!
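To make the triplet-loss vs. contrastive-loss distinction above concrete, here is a minimal PyTorch sketch of both losses. The Euclidean distance, margin values, and toy embeddings are illustrative assumptions, not the lecture's exact setup.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss: built from the DIFFERENCE of two distances.
    Penalizes triplets where the positive is not closer than the negative by
    at least `margin`."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def pairwise_contrastive_loss(x1, x2, label, margin=1.0):
    """Classic contrastive (pair) loss: built from EACH distance itself.
    label = 1 for similar pairs (pull together), 0 for dissimilar pairs
    (push apart until they are at least `margin` away)."""
    d = F.pairwise_distance(x1, x2)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()

# Toy usage with random 64-dimensional embeddings.
a, p, n = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
y = torch.randint(0, 2, (8,)).float()
print(triplet_loss(a, p, n), pairwise_contrastive_loss(a, p, y))
```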
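Likewise, a minimal sketch of the SimCLR-style NT-Xent (InfoNCE) loss, assuming a batch of N positive pairs produced by two augmented views of the same samples; the temperature value and the in-batch negative construction are simplified for illustration.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent loss for a batch of N positive pairs (z1[i], z2[i]).

    Every other embedding in the 2N-sample batch acts as a negative, which is
    the 'sample random negatives' idea from NCE / negative sampling.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    n = z1.size(0)
    # Mask out self-similarity so a sample is never its own negative.
    sim.fill_diagonal_(float("-inf"))
    # The positive for index i is i+N (and vice versa).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage: two augmented "views" of the same 4 samples, 32-dim embeddings.
z1, z2 = torch.randn(4, 32), torch.randn(4, 32)
print(nt_xent(z1, z2))
```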
CLIP: Contrastive Language-Image Pre-training
(https://openai.com/index/clip/, https://arxiv.org/pdf/2103.00020)
❖ There are various pre-trained language-vision models: VL-BERT, ViLBERT, ConVIRT, VirTex, …
❖ OpenAI collected a new dataset of 400M (image, text) pairs from the internet
❖ Using this collected dataset, they pre-train a model
  - Trial #1: using the objective function proposed in VirTex
❖ Apply metric learning, i.e., contrastive learning: we learn better representations of images and text (a sketch of this objective follows after the training details below)

CLIP: Contrastive Language-Image Pre-training (Inference)
❖ Prompt engineering
❖ Training details
  - Image encoder: 5 ResNets and 3 ViTs, with modifications
  - Text encoder: a Transformer, with modifications
  - Batch size: 32,768
  - Training time: 18 days on 592 V100 GPUs
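CLIP's contrastive objective can be sketched as a symmetric cross-entropy over the in-batch image-text similarity matrix. The snippet below is a minimal illustration assuming pre-computed, same-dimensional image and text embeddings; the fixed logit scale and toy dimensions are assumptions (in CLIP the scale is a learned temperature).

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, logit_scale):
    """Symmetric in-batch contrastive loss in the spirit of CLIP.

    image_emb[i] and text_emb[i] are embeddings of a matching (image, text)
    pair; every other text (resp. image) in the batch is a negative.
    """
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = logit_scale * image_emb @ text_emb.t()   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))            # i-th image matches i-th text
    loss_i = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i + loss_t) / 2

# Toy usage with random 512-dim embeddings for a batch of 8 pairs;
# in CLIP the scale is a learned parameter, here just a constant.
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_style_loss(img, txt, logit_scale=100.0))
```

At inference time, zero-shot classification reuses the same similarity: an image is scored against text prompts such as "a photo of a {label}", which is what the prompt-engineering bullet above refers to.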
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
❖ CLIP is an encoder-based model
  - It is hard to generate text (=> hard to do image captioning). Why? There is no decoder
  - The encoder handles understanding; the decoder handles generation
  - What if we just add a decoder and pre-train? (VL-T5, SimVLM) Not good at image-text retrieval. Why? Due to misalignment between image and text
❖ These models utilize data collected through web crawling, which is very noisy
❖ BLIP's two ideas
  - MED: a Multimodal Mixture of Encoder-Decoder (MED), proposed to effectively perform image-text generation and image-text retrieval (understanding). Instead of assigning understanding to the encoder and generation to the decoder, both the encoder and the decoder handle both understanding and generation
  - CapFilt: to address the issue of noisy data, they introduce a 'captioning and filtering' (CapFilt) approach

BLIP: the MED architecture
❖ Unimodal encoders: ViT (image) + BERT (text)
❖ Image-grounded text encoder: cross-attention between image and text
❖ Image-grounded text decoder: text generation using the image representations

BLIP: pre-training objectives
❖ BLIP is a pre-trained language-vision model. What are its pre-training objectives?
❖ Understanding-based objectives
  - Image-Text Contrastive Loss (ITC): aligns the latent spaces of image and text (representation learning)
  - Image-Text Matching Loss (ITM): learns whether an image-text pair is correct or not (classification)
❖ Generation-based objective
  - Language Modeling Loss (LM): autoregressive text generation

BLIP: Captioning and Filtering (CapFilt)
❖ Two data sources: a human-annotated dataset (high quality) and a web-crawled dataset (noisy)
❖ An image-grounded text encoder (the Filter) and an image-grounded text decoder (the Captioner) are fine-tuned using a human-annotated dataset, such as COCO
(Several figure-only BLIP slides follow here.)

BLIP: results
❖ Decoding algorithm comparison: top-p (nucleus) sampling vs. beam search
❖ BLIP achieves performance improvements on various downstream tasks: image-text retrieval, image captioning, visual question answering (VQA)
❖ However, there are so many training stages… => BLIP-2

LLaVA: Large Language and Vision Assistant (from the paper "Visual Instruction Tuning")
❖ Instruction tuning
  - Prompt-completion pairs, supervised fine-tuning (e.g., FLAN-T5)
  - Vicuna is an instruction fine-tuned Llama
❖ Architecture: Vicuna (language model) + a visual encoder
  - Visual encoder: the pre-trained CLIP visual encoder, ViT-L/14
❖ Pre-training: update only the projection matrix W (see the sketch below)
  - Image captioning (pre-training task)
  - Ensures alignment between the image feature space (H_v) and the LLM embedding space
  - "Visual tokenizer" training
❖ Fine-tuning: the visual encoder is frozen; the LLM and the projection layer are updated
  - Multimodal chatbot
  - Science QA
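As a rough illustration of the LLaVA-style setup described above, the sketch below projects frozen vision features into the LLM's embedding space with a single linear layer W. All module names, dimensions, and the toy feature tensors are assumptions for illustration, not the actual LLaVA code.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Toy LLaVA-style connector: a linear projection W that maps visual
    features H_v (from a frozen CLIP ViT) into the LLM token-embedding space,
    so image patches can be fed to the LLM as 'visual tokens'."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # this is the W updated in pre-training

    def forward(self, vision_features):             # (batch, num_patches, vision_dim)
        return self.proj(vision_features)           # (batch, num_patches, llm_dim)

# In LLaVA's pre-training stage the vision encoder and LLM are frozen and only W
# is updated; here we only instantiate the projector and its optimizer on toy features.
projector = VisionToLLMProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

h_v = torch.randn(2, 256, 1024)          # stand-in for ViT-L/14 patch features
visual_tokens = projector(h_v)           # ready to be prepended to the text embeddings
print(visual_tokens.shape)               # torch.Size([2, 256, 4096])
```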
