CS106EA: Lecture 2 - Introduction to Artificial Intelligence

Summary

This Stanford University lecture introduces machine learning paradigms (supervised, unsupervised, self-supervised, semi-supervised, and reinforcement learning), basic terminology, well-known example datasets, and classical machine learning techniques such as linear regression and decision trees. It emphasizes the role of training data and labeled examples.

Full Transcript


CS106EA: Lecture 2
Introduction to Artificial Intelligence
Patrick Young, PhD
Lecturer, Computer Science, Stanford University

Topics for Today
- Introduction to Machine Learning (continued)
- Classical Machine Learning
- Mathematical Background
- Jupyter Notebooks and Google Colab

Topics for Today
- Introduction to Machine Learning
  - Overview
  - Basic Machine Learning Model
  - Inference vs. Learning / Production vs. Training
  - Model Purpose – Predictive, Descriptive, Generative
  - Learning Paradigms – Supervised, Unsupervised, Reinforcement
  - Basic Terminology
  - Example Datasets

Basic Machine Learning Model
[Diagram: a Training Set made up of Training Instances feeds a Model, which produces a Prediction / Generated Output; a Cost / Objective Function and an Optimizer drive training.]

Let's go back to the basic machine learning model. Last lecture we talked about classifying models based on what their output was. This gave us predictive AI (which some practitioners further divide into regression and classification variants), descriptive AI, and generative AI. We are now going to consider what the training sets look like. I do want to emphasize training: this isn't about what the production data looks like, it is about how the training data is labeled. Production data is not labeled (often the objective of the model in actual use is to label production data, based on what it learned from the pre-labeled training data).

Model Purpose (reprise from last lecture)
- Predictive Systems
  - Classification
  - Regression
- Descriptive Systems
- Generative Systems

Learning Paradigms
We can generally divide AI systems based on key characteristics of their training datasets:
- whether their training data is labelled
- how it is labelled

Common paradigms:
- Supervised Learning
- Unsupervised Learning
- Self-Supervised Learning
- Semi-Supervised Learning
- Reinforcement Learning

Supervised Learning
- samples in the dataset are labelled with the expected output
- generally used for Predictive Models
Examples:
- dataset of credit card transactions; label: legitimate / fraudulent
- information on houses; label: price sold
- dataset of images; label of primary subject: dog, cat, horse, …

Unsupervised Learning
- no labels (expected output) provided
- generally used for Descriptive Models
Examples:
- streaming service data on customers
  - objective: cluster users for recommendations
  - provide as rich data as possible, but do not provide user labels
  - allow the system to make its own conclusions on how to group users
- anomaly detection
  - objective: identify outliers, with no labels provided
- document organization
  - take a large number of documents and place them into logical classes with no predefined labels

Self-Supervised Learning
- the system generates its own labels using a clearly defined algorithm
- widely used for Large Language Models (e.g., ChatGPT)
Examples:
- Next-Word Prediction: create labelled samples from a sentence; feed in part of the sentence and predict the next word
- Masked-Word Prediction: create samples by blocking out words in sentences

Next-Word Prediction
Sentence: the quick brown fox jumped over the lazy dog

Input → Label (Expected Output)
- the → quick
- the quick → brown
- the quick brown → fox
- the quick brown fox → jumped
- the quick brown fox jumped → over

This is a common method for creating data for generative AI LLMs such as ChatGPT.
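To make the label-generation idea concrete, here is a minimal Python sketch (mine, not from the lecture) that turns a sentence into (input, label) pairs for next-word prediction. The function name make_next_word_pairs is made up for illustration, and real LLM pipelines split text into subword tokens rather than whole words.

```python
def make_next_word_pairs(sentence: str):
    """Generate (input, label) pairs for next-word prediction.

    A real pipeline would operate on subword tokens; splitting on
    whitespace here just illustrates self-supervised labeling.
    """
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])   # everything seen so far
        label = words[i]                # the word to predict
        pairs.append((context, label))
    return pairs

for context, label in make_next_word_pairs(
        "the quick brown fox jumped over the lazy dog"):
    print(f"{context!r} -> {label!r}")
```

Run on the lecture's sentence, this prints exactly the rows in the table above, continuing through "… the lazy" → "dog".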
Masked Language Modeling
Sentence: the quick brown fox jumped over the lazy dog

Input → Label (Expected Output)
- quick brown fox jumped over the lazy dog → the
- the brown fox jumped over the lazy dog → quick
- the quick fox jumped over the lazy dog → brown
- the quick brown jumped over the lazy dog → fox
- the quick brown fox over the lazy dog → jumped

The key point is that the system comes up with its own labels. This self-supervised data labelling method is used for BERT-style transformers. BERT-style transformers are related to GPT-style transformers but are designed for understanding language, not for generating it.
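The masked-word scheme can be sketched the same way. Again, this is an illustration rather than the lecture's code: make_masked_word_pairs is a hypothetical name, and BERT-style models actually substitute a special [MASK] token rather than deleting the word outright.

```python
def make_masked_word_pairs(sentence: str):
    """Generate (input, label) pairs by removing one word at a time.

    BERT-style models replace the word with a [MASK] token;
    dropping it entirely keeps this sketch simple.
    """
    words = sentence.split()
    pairs = []
    for i, label in enumerate(words):
        masked = " ".join(words[:i] + words[i + 1:])
        pairs.append((masked, label))
    return pairs

for masked, label in make_masked_word_pairs(
        "the quick brown fox jumped over the lazy dog"):
    print(f"{masked!r} -> {label!r}")
```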
Self-Supervised Learning
- very useful for large datasets far too large for human labelling
- how large? the most advanced models no longer publish training data, but Meta's Llama 2 (July 2023) was trained on 2 trillion tokens
  - these are subword tokens, which we'll discuss in about 3-4 weeks
  - roughly equivalent to 1 trillion words, 4 billion pages of text, or 11 million 360-page novels

Semi-Supervised Learning
- really a catch-all category: any technique which combines supervised and unsupervised learning
Examples:
- provide a mix of labelled and unlabeled data
  - the system may try to produce new labels on the unlabeled data based on the previously provided set of labeled data
  - these need careful checking for accuracy
- start with no labels, let some clusters form, then add labels to the clusters at some point
- have the system request human input when needed; this is called active learning

The specific technique chosen will depend on the nature of the dataset, the ability to easily determine labels for the data, and what our specific system objectives are.

Reinforcement Learning
- the program learns by trial and error
- the program receives positive and negative feedback, which reinforces positive behavior and/or penalizes negative behavior
Examples:
- learning to play video games
- teaching a robot to walk and interact with the world
- autonomous vehicles driving
[Robot drawing generated by DALL·E]

Reinforcement Learning doesn't match exactly with our usual Machine Learning model: there is no pre-existing dataset. The data is determined by the model's interactions with the environment, which may be:
- a real-world environment
- a simulated real-world environment
- an entirely symbolic environment (e.g., an arcade or strategy video game)

The system gets data from the environment and carries out an action in the environment. This generates a new state of the world (which is unlabeled). Data on the revised state of the world acts as our new input.

The loss function is replaced by a reward function:
- if an action moves the world to a better state, we give a high reward
- if an action moves the world to a worse state, we give a low or negative reward
A good reward function is key to reinforcement learning. If we're giving negative penalties instead of positive rewards, we may refer to the function as just the Objective Function.

The optimizer updates the model based on the reward:
- for a high reward, the optimizer changes the model to increase the chances of similar actions in similar circumstances
- for low or negative rewards, the optimizer changes the model to prevent a repeat in similar circumstances
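The interaction just described can be summarized as a loop. This is a schematic sketch, not a working trainer: env, agent, reward_fn, and update are hypothetical stand-ins for the environment, the model, the reward function, and the optimizer step.

```python
# A generic reinforcement-learning loop, mirroring the slides:
# observe state -> act -> environment changes -> reward -> update model.
# All four arguments are hypothetical placeholders, not a real API.

def train(env, agent, reward_fn, update, episodes: int):
    for _ in range(episodes):
        state = env.reset()                       # initial (unlabeled) state
        done = False
        while not done:
            action = agent.act(state)             # model chooses an action
            next_state, done = env.step(action)   # world moves to a new state
            reward = reward_fn(state, action, next_state)
            # high reward -> make similar actions more likely in similar
            # circumstances; low/negative reward -> make them less likely
            update(agent, state, action, reward)
            state = next_state                    # revised state is the new input
```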
Basic Terminology
- Dataset; Sample
- These may be referred to with different terms: Sample = Observation or Instance
- Or with terms specific to a field of study: in Natural Language Processing (NLP), Dataset = Corpus and Sample = Document

Terms related to Samples
- Features: for example, a house listing ($595,000; 1bd, 1ba, 776 sqft) has features such as number of bedrooms, number of bathrooms, and square footage
- the term feature is occasionally used to refer to both input and output features, but more commonly refers to just the inputs
- Label: generally refers to the actual label on a sample, not the output of the model

Terms related to Labels
- Ground Truth
- Prediction

Label generally refers to the actual ground truth label associated with a sample in our dataset, not the output of the model. It's called ground truth because it typically originates from a real-world measurement of an actual data point or sample, where we know both the input features and the correct label. The output of the model, on the other hand, is called a prediction. A prediction represents what the model estimates the values should be given the input features. However, the prediction may not match reality: it may be slightly off, or it may be wildly inaccurate.

Example Datasets
Let's take a look at some well-known datasets. Generally, these are teaching datasets, but our last two are used for real-world applications as well.
- Predictive/Descriptive Datasets: datasets designed for prediction can often be used for descriptive practice as well
- Generative Datasets

Irises
- teaches multiclass classification: identify flowers as one of three species of Iris
- Features: petal length and width in centimeters; sepal length and width in centimeters
- Labels: samples identified as Setosa, Versicolor, or Virginica
- only 150 samples, 50 per species
- actually predates computers: developed in 1936 to teach mathematical techniques, originally for discriminating classes; more commonly used for predictive work now
[Iris image generated by DALL·E]

California Housing Dataset
- provides over 20,000 samples based on census block information
- objective: determine the average cost of a house in a census block (a predictive regression model)
- Features: eight features including latitude, longitude, average house age, and average number of bedrooms
- Labels: median house value in the census block
Don't try to buy a house based on output from a model trained on this dataset. The data is from 1990.
[House drawing generated by DALL·E]

Titanic Dataset
- includes data on 891 passengers
- objective: design a model that accurately determines survivability based on features
- Features: passenger class, gender, age, number of parents or children on board, embarkation port, …
- Labels: whether the passenger survived or not
This dataset is used to teach students how to handle missing data.
[Titanic image generated by DALL·E; the drawing gets the number of smokestacks wrong. Titanic had four.]
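For hands-on practice, these datasets are easy to load. A minimal sketch, assuming scikit-learn is installed; note that the Titanic data is not bundled with scikit-learn and is commonly fetched from OpenML, whose copy is the full passenger list rather than the 891-row teaching split mentioned above.

```python
from sklearn.datasets import load_iris, fetch_california_housing, fetch_openml

# Iris: 150 samples, 4 features, 3 species labels
iris = load_iris()
print(iris.data.shape, iris.target_names)   # (150, 4) ['setosa' 'versicolor' 'virginica']

# California Housing: ~20,000 census-block samples, 8 features,
# median house value as the regression label (1990 data!)
housing = fetch_california_housing()
print(housing.data.shape, housing.feature_names)

# Titanic, fetched from OpenML (larger than the 891-row split);
# the missing values (e.g., age) are what make it a good teaching dataset
titanic = fetch_openml("titanic", version=1, as_frame=True)
print(titanic.frame.isna().sum())
```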
MNIST
- 70,000 digits handwritten by high school students and Census Bureau workers
- objective: identify handwritten digits
- reading handwritten ZIP codes was one of the biggest early commercial successes in Machine Learning
- NIST is the National Institute of Standards and Technology; MNIST is a Modified version of an earlier dataset put out by NIST

Neural networks developed by Yann LeCun (now Chief AI Scientist at Meta) were used to read handwritten ZIP codes in the late 80s and early 90s. This was one of the first significant uses of neural networks for real-world production use. They were based on convolutional neural networks (CNNs), an architecture which we'll be exploring in a few weeks' time.

MNIST Features and Labels
- Features: 28×28-pixel greyscale images, with individual pixel intensities from 0-255
- we can treat this as 784 individual features, but we'll see later in the quarter that we can take advantage of the image's grid-like quality for better results
- Labels: the digit that the author intended to write

[Figure: three ambiguous MNIST samples, labeled 4, 2, and 9, shown with a trained network's top probability guesses; none of the guesses exceeds about 43% confidence.]

The dataset includes some odd samples, where it's not always clear from looking at them what the writer intended to write, but the labels do give the writer's intent. The figure above shows what my trained neural network thought of some of the somewhat ambiguous samples.

From this week's homework: using a very basic neural network (we'll see better networks for image processing later in the quarter), this one still achieves almost 98% accuracy. Even though my neural network doesn't quite know what to make of these three samples, overall when run on the full 70,000 samples it gets very high accuracy. In particular, it receives just shy of 98% accuracy on a subset called the testing subset; we'll talk next week about the difference between training, validation, and testing datasets.
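To see what "treat the image as 784 individual features" means in code, here is a minimal numpy sketch. The random image is a stand-in for a real MNIST digit; the real dataset can be fetched with, e.g., scikit-learn's fetch_openml("mnist_784").

```python
import numpy as np

# A 28x28 greyscale image with pixel intensities 0-255
# (random stand-in for a real MNIST digit)
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten the grid into 784 individual features
features = image.reshape(-1)      # shape: (784,)
features = features / 255.0       # scale intensities to 0-1 (common preprocessing)
print(features.shape)             # (784,)
```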
IMDb Reviews
- 50,000 movie reviews
- Feature: the text of the review (we'll study Natural Language Processing (NLP) extensively later in the quarter)
- Labels (2 versions of the dataset): numeric scores, or just a positive/negative classification
[IMDb logo is a trademark of IMDb.com, Inc.]

Common Crawl
- dataset formed by using bots to crawl the web
- the November 2024 version has 2.68 billion webpages, 405 TiB of data
- assumed to have been the foundational dataset for most modern chatbots, although they will have augmented the Common Crawl with their own curated data sources
- typically used for self-supervised learning
[Common Crawl logo from https://commoncrawl.org/]

Classical Machine Learning

Relationships of AI Subfields
[Diagram: nested ovals, from outermost to innermost: Artificial Intelligence, Machine Learning, Neural Networks, Deep Learning.]
Classical Machine Learning is the part of the Machine Learning oval that is not inside the Neural Networks oval.

Techniques
We can loosely break down AI techniques into:
- Neural Network-based Techniques
- Classical Machine Learning Techniques: modern AI techniques not involving Neural Networks
- Traditional Artificial Intelligence Techniques: older symbolic or rules-based techniques, sometimes referred to as GOFAI (Good-Old-Fashioned AI); still used in some fields, but for the most part considered outdated for most purposes

The terms Traditional AI and Classical AI are sometimes conflated or swapped, but the definitions I've given are the most common in use.

Classical vs. Neural Network Techniques
Classical techniques have important advantages over Neural Networks and Deep Learning:
- models are easier to understand
- less computationally expensive
- work with less data
- less prone to overfitting

Overfitting occurs when the model focuses too much on the specific training samples it has been given, memorizing their details rather than learning the general patterns or phenomena that underlie the data. As a result, the model performs well on the training data but struggles to generalize to new, unseen data. We'll explore this concept in more depth next week.

Classical Machine Learning Example: Linear Regression
[Plot: data points with a fitted line; y-axis = labels / expected value, x-axis = data features. A new live/production data input on the x-axis is mapped through the line to a predicted expected value.]

While I'm showing the input as a single value, the input can be much more complex. For example, it could be a pair of values. In fact, a typical input will be a feature vector of n values representing n different features. While some machine learning algorithms can work with non-numeric data, many are limited to working with numbers, so we'll have to come up with a way of converting non-numeric data features to either integers or real numbers.

Similarly, the output could also be a vector containing m different labels or target values. For the mathematically inclined: with multiple outputs we think of the model as consisting of a series of lines (or hyperplanes, if we have more than one input feature), one for each output value.

Technically, linear regression can be done without machine learning; there's a formula. But the cost of the formula goes up as the cube of the number of dimensions.
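To make the "there's a formula" remark concrete: the closed-form solution is the normal equation, theta = (X^T X)^(-1) X^T y, and solving that system is where the roughly cubic cost in the number of feature dimensions comes from. A minimal numpy sketch on synthetic data (not the lecture's example):

```python
import numpy as np

# Synthetic 1-D example: y is roughly 3x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 * x + 2 + rng.normal(0, 1, size=50)

# Design matrix with a bias column; with n features this would be (m, n+1)
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^(-1) X^T y.
# lstsq solves it in a numerically stable way; the matrix solve
# is where the ~n^3 cost in the number of dimensions comes from.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"intercept ~ {theta[0]:.2f}, slope ~ {theta[1]:.2f}")

# Inference on a new live/production input
x_new = 4.2
print("prediction:", theta[0] + theta[1] * x_new)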
Classical Machine Learning Example: Decision Trees
[Diagram: a decision tree. The root node splits on age (>= 65 vs. < 65). The >= 65 branch asks whether the patient was previously hospitalized (Yes / No); the < 65 branch asks whether blood pressure is > 140 or <= 140.]

As you can see, a decision tree provides a solution that can be double-checked by a human. The tree itself is built automatically for you. Parameters we can play with include: the maximum depth of the tree, …
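As a sketch of how a tree like this gets built automatically, here is a small scikit-learn example on made-up patient-style data (the features, values, and labels are invented for illustration); max_depth is the maximum-depth parameter the transcript mentions.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data in the spirit of the slide:
# [age, previously_hospitalized (0/1), blood_pressure]
X = [
    [70, 1, 150],
    [68, 0, 130],
    [45, 0, 145],
    [30, 0, 120],
    [72, 1, 160],
    [50, 0, 110],
]
y = [1, 0, 1, 0, 1, 0]  # hypothetical risk labels

# max_depth is one of the parameters the lecture mentions tuning
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Unlike a neural network, the learned rules can be printed out
# and double-checked by a human:
print(export_text(
    tree, feature_names=["age", "prev_hospitalized", "blood_pressure"]))
```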
