Machine Learning Module B Unit 2 PDF
Prof. Mohamed K. Watfa
Summary
This document provides an overview of machine learning concepts and applications. It covers the main machine learning approaches (supervised, unsupervised, and reinforcement learning), algorithms, statistical models, and practical examples, such as predicting violent crime rates from weather conditions.
Full Transcript
LAS 205 - Module B, Unit 2 - Prof. Mohamed K. Watfa
Machine Learning, Neural Networks & Deep Learning

Unit objectives
▪ Explain what machine learning, neural networks, and deep learning are.
▪ Describe what is meant by a statistical model and an algorithm.
▪ Describe data and data types.
▪ Describe machine learning types and approaches (supervised, unsupervised, and reinforcement).
▪ List different machine learning algorithms.
▪ Explain why neural networks and deep learning are important in today’s AI field.
▪ Describe machine learning components.
▪ List the steps in the process of building machine learning applications.
▪ Explain what domain adaptation is and its applications.

Unit 2: Sub-Units
▪ Unit 2a: What is Machine Learning? ML Algorithms
▪ Unit 2b: Naïve Bayes Classification
▪ Unit 2c: Regression (Linear and Logistic)
▪ Unit 2d: SVM & Decision Trees
▪ Unit 2e: K-Means Clustering
▪ Unit 2f: Neural Networks
▪ Unit 2g: Deep Learning & Model Evaluations
▪ Unit 2h: ML & NN Practice Problems

Unit 2a: What is Machine Learning?

Key Vocabulary
▪ Machine Learning: How computers recognize patterns and make decisions without being explicitly programmed.

Warm Up (CSD AI & Machine Learning, Lesson 1) - The Problem Solving Process
▪ How well did the model do? How can it be improved?
▪ What is the impact on society? Who is included? Who is excluded?

Machine Learning?
▪ What is machine learning?
▪ Machine learning algorithms
▪ What are neural networks?
▪ What is deep learning?
▪ How to evaluate a machine learning model?

Machine learning
▪ In 1959, the term “machine learning” was first introduced by Arthur Samuel.
▪ He defined it as the “field of study that gives computers the ability to learn without being explicitly programmed”.
▪ The learning process improves the machine model over time by using training data.
▪ The evolved model is used to make future predictions.

What is a statistical model?
▪ A model in a computer is a mathematical function that represents a relationship or mapping between a set of inputs and a set of outputs.
▪ Given new input data “X”, the model can predict the output “Y”.

Statistical model - Example
▪ Assume that a system is fed with data indicating that rates of violent crime are higher when the weather is warmer and more pleasant, even rising sharply during warmer-than-typical winter days.
▪ This model can then predict this year’s crime rate relative to last year’s based on the weather forecast.
▪ Returning to the mathematical representation of a model that predicts crime rate from temperature, we might propose a simple linear model, sketched below.
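The transcript does not preserve the slide’s actual equation, so the following is a minimal sketch assuming the simplest linear form; the coefficient names p0 and p1 and their values are hypothetical placeholders, not taken from the slides.

```python
# Hypothetical linear statistical model: crime_rate = p0 + p1 * temperature.
# The coefficients below are made-up placeholders; in practice they would be
# estimated from historical (temperature, crime rate) pairs.
p0 = 50.0   # baseline crime rate (hypothetical units)
p1 = 2.5    # increase in crime rate per degree (hypothetical)

def predict_crime_rate(temperature_c: float) -> float:
    """Map an input X (temperature) to a predicted output Y (crime rate)."""
    return p0 + p1 * temperature_c

print(predict_crime_rate(25.0))  # prediction for a 25 degree day
```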
Machine learning algorithms
▪ A machine learning algorithm is a technique through which the system extracts useful patterns from historical data.
▪ These patterns can be applied to new data.
▪ The objective is to have the system learn a specific input/output transformation.
▪ Finding the appropriate algorithms to solve complex problems in various domains, and knowing how and when to apply them, is an important skill that machine learning engineers should acquire.
▪ Because machine learning algorithms depend on data, understanding and acquiring high-quality data is crucial for accurate results.

Machine learning approaches
1) Supervised learning: Train by using labeled data, and learn and predict new labels for unseen input data.
▪ A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
▪ Classification is the task of predicting a discrete class label, such as “black, white, or gray” and “tumor or not tumor”.
▪ Regression is the task of predicting a continuous quantity, such as “weight”, “probability”, and “cost”.

Machine learning approaches (cont.)
2) Unsupervised learning: Detect patterns and relationships between data without using labeled data.
▪ Clustering algorithms: Discover how to split the data set into a number of groups such that the data points in the same group are more similar to each other than to data points in other groups.

Machine learning approaches (cont.)
3) Semi-supervised learning:
▪ A machine learning technique that falls between supervised and unsupervised learning.
▪ It combines some labeled data with a large amount of unlabeled data.
▪ Labeling data is an expensive or time-consuming process, and can introduce human biases.
Here is an example that uses pseudo-labeling (see the sketch after this list):
a) Use labeled data to train a model.
b) Use the model to predict labels for the unlabeled data.
c) Use the labeled data and the newly generated labeled data to create a new model.
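The three pseudo-labeling steps can be made concrete in a few lines. This is a minimal sketch, assuming scikit-learn is available and using logistic regression as the base model; the variable names and the toy data are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: a small labeled set and a larger unlabeled set (illustrative only).
X_labeled = np.array([[1.0], [2.0], [8.0], [9.0]])
y_labeled = np.array([0, 0, 1, 1])
X_unlabeled = np.array([[1.5], [2.5], [7.5], [8.5], [9.5]])

# a) Use labeled data to train a model.
model = LogisticRegression().fit(X_labeled, y_labeled)

# b) Use the model to predict labels ("pseudo-labels") for the unlabeled data.
pseudo_labels = model.predict(X_unlabeled)

# c) Combine the original labels with the pseudo-labels and train a new model.
X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, pseudo_labels])
new_model = LogisticRegression().fit(X_all, y_all)
```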
Machine learning approaches (cont.)
4) Reinforcement learning
▪ Reinforcement learning uses trial and error (a rewarding approach); it is goal-oriented learning.
▪ The algorithm discovers an association between the goal and the sequence of events that leads to a successful outcome.
▪ As the system performs certain actions, it finds out more about the world.
▪ Policy Optimization: Optimizes actions to maximize rewards.
▪ Value-Based Methods: Determines the value of actions for decision-making.
Example reinforcement learning applications:
▪ Robotics: A robot that must find its way.
▪ Self-driving cars.

Summary of machine learning approaches:

Supervised Learning
▪ When it’s used: When labeled data is available and the goal is prediction.
▪ Techniques: Classification (predicts categories); Regression (predicts continuous values).
▪ Example applications: Spam detection in emails, image recognition (classification); house price prediction, stock price forecasting (regression).

Unsupervised Learning
▪ When it’s used: When data is unlabeled, to find structure or patterns.
▪ Techniques: Clustering (groups similar data points); Dimensionality Reduction (simplifies data by reducing features).
▪ Example applications: Customer segmentation in marketing, anomaly detection (clustering); image compression, feature extraction (dimensionality reduction).

Semi-Supervised Learning
▪ When it’s used: When only a small portion of the data is labeled, leveraging both labeled and unlabeled data.
▪ Technique: Pseudo-labeling (generates labels for unlabeled data to improve model training).
▪ Example applications: Medical image classification (limited labeled data), language translation tasks.

Reinforcement Learning
▪ When it’s used: When learning optimal actions in an environment based on rewards.
▪ Techniques: Policy Optimization (optimizes actions to maximize rewards); Value-Based Methods (determines the value of actions for decision-making).
▪ Example applications: Game-playing AI (e.g., chess, Go), robotics (autonomous navigation), recommendation systems (e.g., personalized content recommendations).

Machine Learning Algorithms
Understanding your problem and the different types of ML algorithms helps in selecting the best algorithm. Here are some machine learning algorithms:
▪ Naïve Bayes classification (supervised classification - probabilistic)
▪ Linear regression (supervised regression)
▪ Logistic regression (supervised classification)
▪ Support vector machine (SVM) (supervised linear or non-linear classification)
▪ Decision tree (supervised non-linear classification)
▪ K-means clustering (unsupervised learning)

Unit 2b: Naïve Bayes Classification

Naïve Bayes classification
Naïve Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
▪ For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter.
▪ Features: Color, roundness, and diameter.
▪ Assumption: Each of these features contributes independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

Naïve Bayes classification - EXAMPLE
▪ Our objective is to make a prediction for an unknown object with the following features: Color: Red; Shape: Round; Diameter: 10 cm.
▪ Use Naïve Bayes to predict whether this red, round, 10 cm diameter object is an apple or not.

Naïve Bayes classification (cont.)
The algorithm basically depends on calculating two kinds of probability values:
▪ Class probabilities: The probabilities of each class occurring in the training data set.
▪ Conditional probabilities: The probabilities of each input feature given a specific class value.

To do a classification, you must perform the following steps:
1. Define two classes (CY and CN) that correspond to Apple = Yes and Apple = No.
2. Compute the probability of CY given x: p(CY | x) = p(Apple = Yes | Color = Red, Shape = Round, Diameter ≥ 10 cm).
3. Compute the probability of CN given x: p(CN | x) = p(Apple = No | Color = Red, Shape = Round, Diameter ≥ 10 cm).
4. Discover which conditional probability is larger: if p(CY | x) > p(CN | x), then it is an apple.
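The steps above compare two posteriors, while the steps that follow compute likelihoods and class probabilities; the rule connecting them is Bayes’ theorem, which the slides use implicitly. Stated explicitly in standard form (this note is added for reference, not from the transcript):

```latex
% Bayes' theorem links the posterior to the quantities computed next:
\[
p(C \mid x) = \frac{p(x \mid C)\,p(C)}{p(x)}
\]
% Since p(x) is identical for both classes, comparing p(C_Y | x) with p(C_N | x)
% reduces to comparing p(x | C_Y) p(C_Y) against p(x | C_N) p(C_N).
% The "naive" independence assumption then factorizes the likelihood:
\[
p(x \mid C) = \prod_{i=1}^{n} p(x_i \mid C)
\]
```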
Naïve Bayes classification (cont.)
5. Compute p(x | CY) = p(Color = Red, Shape = Round, Diameter ≥ 10 cm | Apple = Yes).
▪ Naïve Bayes assumes that the features of the input data (the apple parameters) are independent, so all the per-feature values are multiplied together.
Thus, we can rewrite p(x | CY) as:
p(x | CY) = p(Color = Red | Apple = Yes) X p(Shape = Round | Apple = Yes) X p(Diameter ≥ 10 cm | Apple = Yes)
Same for p(x | CN):
p(x | CN) = p(Color = Red | Apple = No) X p(Shape = Round | Apple = No) X p(Diameter ≥ 10 cm | Apple = No)

6. Calculate each conditional probability. For example, to calculate p(Color = Red | Apple = Yes), you are asking, “What is the probability of having a red-colored object, given that we know it is an apple?”
▪ p(Color = Red | Apple = Yes) = 3/5 (out of five apples, three of them were red)
▪ p(Color = Red | Apple = No) = 2/5
▪ p(Shape = Round | Apple = Yes) = 4/5
▪ p(Shape = Round | Apple = No) = 2/5
▪ p(Diameter ≥ 10 cm | Apple = Yes) = 2/5
▪ p(Diameter ≥ 10 cm | Apple = No) = 3/5

Naïve Bayes classification (cont.)
p(x | CY) = (3/5) x (4/5) x (2/5) = 0.192
p(x | CN) = (2/5) x (2/5) x (3/5) = 0.096
p(Apple = Yes) = 5/10
p(Apple = No) = 5/10

Compare p(CY | x) to p(CN | x):
p(CY | x) ∝ p(x | CY) x p(Apple = Yes) = 0.192 x 0.5 = 0.096
p(CN | x) ∝ p(x | CN) x p(Apple = No) = 0.096 x 0.5 = 0.048
Since 0.096 > 0.048, the verdict is that it is an apple.
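A minimal sketch of this hand calculation in Python, assuming the counts above (5 apples and 5 non-apples in a 10-example training set); the dictionaries are a direct transcription of the probabilities on the slides.

```python
# Conditional probabilities read off the training data (from the slides).
p_given_yes = {"red": 3/5, "round": 4/5, "diam_ge_10cm": 2/5}
p_given_no  = {"red": 2/5, "round": 2/5, "diam_ge_10cm": 3/5}

# Class (prior) probabilities: 5 apples and 5 non-apples out of 10 examples.
p_yes, p_no = 5/10, 5/10

# Naive Bayes: multiply the per-feature likelihoods, then weight by the prior.
likelihood_yes = p_given_yes["red"] * p_given_yes["round"] * p_given_yes["diam_ge_10cm"]
likelihood_no  = p_given_no["red"]  * p_given_no["round"]  * p_given_no["diam_ge_10cm"]

score_yes = likelihood_yes * p_yes   # 0.192 * 0.5 = 0.096
score_no  = likelihood_no  * p_no    # 0.096 * 0.5 = 0.048

print("apple" if score_yes > score_no else "not an apple")  # -> apple
```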
When to Use Naïve Bayes?
Because naïve Bayesian classifiers make such stringent assumptions about the data, they will generally not perform as well as more complicated models. That said, they have several advantages:
▪ They are extremely fast for both training and prediction.
▪ They provide straightforward probabilistic predictions.
▪ They are often very easily interpretable.
▪ They have very few (if any) tunable parameters.
These advantages mean a naïve Bayesian classifier is often a good choice as an initial baseline classification. Naïve Bayes classifiers tend to perform especially well:
▪ When the naive assumptions actually match the data (very rare in practice).
▪ For very well-separated categories, when model complexity is less important.

Unit 2c: Regression (Linear and Logistic)

Linear regression
▪ Linear regression is a linear equation that combines a specific set of input values (X) and an outcome (Y) that is the predicted output for that set of input values.
▪ Both the input and output values are numeric.
▪ This algorithm targets supervised regression problems, that is, the target variable is a continuous value.
▪ It works by fitting a line that is known as the regression line.
Example applications:
▪ Analyze the effectiveness of marketing, pricing, and promotions on the sales of a product.
▪ Forecast sales by analyzing the company’s monthly sales for the past few years.
▪ Predict house prices as the sizes of houses increase.
▪ Calculate causal relationships between parameters in biological systems.

Linear regression (cont.)
▪ Assume that we are studying the real estate market.
▪ Objective: Predict the price of a house given its size by using previous data.

Size (m²)   Price ($)
30          30,000
70          40,000
90          55,000
110         60,000
130         80,000
150         90,000
180         95,000
190         110,000

▪ Plot this data as a graph.
▪ Can you guess the best estimate for the price of a 140-square-meter house? Which one is correct?
A. $60,000
B. $95,000
C. $85,000

Linear regression (cont.)
▪ Target: A line that is within a “proper” distance from all points.
▪ Error: The aggregated distance between the data points and the assumed line.
▪ Solution: Calculate the error iteratively until you reach the most accurate line, with a minimum error value (that is, the minimum distance between the line and all points).

Linear regression (cont.)
▪ After the learning process, you get the most accurate line, along with the bias and the slope needed to draw it.
▪ Here is our linear regression model representation for this problem:
h(p) = p0 + p1 * X1, or
Price = 30,000 + 392 * Size
Price = 30,000 + 392 * 140 = 84,880 ≈ $85,000 (answer C)

Linear regression (cont.) - Higher Dimensions
▪ In higher dimensions, where we have more than one input (X), the line is called a plane or a hyperplane.
▪ The equation can be generalized from simple linear regression to multiple linear regression as follows:
Y(X) = p0 + p1*X1 + p2*X2 + ... + pn*Xn
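A minimal sketch of fitting the regression line to the table above, assuming NumPy is available; note that the least-squares coefficients recovered from these eight points need not exactly match the slide’s illustrative line Price = 30,000 + 392 * Size.

```python
import numpy as np

# House data from the table above: size (m^2) -> price ($).
size  = np.array([30, 70, 90, 110, 130, 150, 180, 190])
price = np.array([30_000, 40_000, 55_000, 60_000, 80_000, 90_000, 95_000, 110_000])

# Least-squares fit of a line: price = p0 + p1 * size.
p1, p0 = np.polyfit(size, price, deg=1)  # polyfit returns [slope, intercept]
print(f"price = {p0:,.0f} + {p1:,.0f} * size")

# Predict the price of a 140 m^2 house with the fitted line.
print(f"predicted price for 140 m^2: ${p0 + p1 * 140:,.0f}")
```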
Logistic regression
▪ A supervised classification algorithm.
▪ Target: The dependent variable (Y) is a discrete category or class (not a continuous variable as in linear regression).
▪ Example: Class 1 = Cancer, Class 2 = No Cancer.
▪ Assumes a linear relationship between the inputs and the log-odds of the outcome.

Logistic regression (cont.)
▪ Logistic regression is named for the function used at the core of the algorithm.
▪ The logistic function (sigmoid function) is an S-shaped curve used for data discrimination across multiple classes. It maps any real-valued input to a value between 0 and 1.

Logistic regression (cont.)
▪ During the learning process, the system tries to generate a model (estimate a set of parameters p0, p1, …) that can best predict the probability that Y will fall in class A or B given the input X.
▪ The sigmoid function squeezes the input value into [0, 1], so if the output is 0.77, it is closer to 1, and the predicted class is 1.
▪ Logistic regression equation:
Y = 1 / (1 + e^-(p0 + p1*X))
Y represents the predicted probability that the outcome is 1 (or “success”); p0 and p1 are parameters of the model; X is the input variable (feature) we’re using to predict Y.

Logistic regression (cont.) - Optimization
▪ The parameters p0 and p1 in logistic regression are generated through a process called training, or fitting the model to the data.
▪ The goal in logistic regression is to find values for p0 and p1 that make the model’s predictions as accurate as possible.
▪ This accuracy is usually defined by minimizing the logistic loss function (also known as cross-entropy loss), which measures how well the predicted probabilities match the actual class labels in the training data.

Logistic regression (EXAMPLE)
▪ Assume that the estimated values of the p’s for a certain model that predicts gender from a person’s height are p0 = -120 and p1 = 0.5.
▪ Height: X = 150 cm.
▪ Class 0 represents female and class 1 represents male.
▪ Y = 1 / (1 + e^-(-120 + 0.5*150)) = 1 / (1 + e^45) ≈ 2.9 × 10^-20
Interpretation of Y: This value is the predicted probability that a person with a height of 150 cm is male (class 1). Since the value is essentially 0, the model predicts that this person is almost certainly female (class 0): P(male | height = 150) ≈ 0.
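A minimal sketch of this prediction in Python, using the parameter values from the example above; math.exp is standard-library Python.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: squeezes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

p0, p1 = -120.0, 0.5   # parameters from the example above
height = 150.0

# Predicted probability that the person is male (class 1).
y = sigmoid(p0 + p1 * height)   # sigmoid(-45): a vanishingly small probability
print(y, "-> male" if y >= 0.5 else "-> female")
```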
Unit 2d: SVM & Decision Trees

Support vector machine
▪ SVM: The goal is to find a separating hyperplane between positive and negative examples of the input data.
▪ SVM is also called a “Large Margin Classifier”.
▪ The SVM algorithm seeks the hyperplane with the largest margin, that is, the largest distance to the nearest sample points.

Support vector machine (cont.)
▪ Assume that a data set lies in a two-dimensional space, so the hyperplane is a one-dimensional line. Although many lines (in light blue) separate all instances correctly, there is only one optimal hyperplane (the red line) that maximizes the distance to the closest points (in yellow).
▪ Large Margin: The SVM doesn’t just look for any line that separates the classes; it looks for the one that creates the largest margin.
▪ Support Vectors: The points that are closest to the hyperplane (from each class) are crucial in defining its position, which is why the algorithm is called a Support Vector Machine.
▪ Robust to New Points: A larger margin generally helps the model classify new points more accurately, making it a robust classifier.

Decision tree
▪ A supervised learning algorithm that uses a tree structure to model decisions.
▪ Can be used for classification and regression problems.
▪ It resembles a flowchart, or a chain of if-else cases.
▪ An example application is general business decision-making, such as predicting customers’ willingness to purchase a given product in a given setting, for example, online versus a physical store.

Decision tree (cont.)
▪ A decision tree includes three main entities: the root node, decision nodes, and leaves.
▪ A decision tree builds the classification or regression model in the form of a tree structure. It resembles a flowchart and is easy to interpret, because it breaks a data set down into smaller and smaller subsets while building the associated decision tree.

Decision tree - EXAMPLE
▪ The “Play Tennis” example is one of the most popular examples used to explain decision trees.
▪ In the data set, the label is represented by “PlayTennis”. The features are the rest of the columns: “Outlook”, “Temperature”, “Humidity”, and “Wind”.
▪ Our goal is to predict, based on some weather conditions, whether a player can play tennis or not. Eventually, we want to make a classification of PlayTennis = {Yes, No}.

Outlook    Temp.  Humidity  Wind    PlayTennis
Sunny      Hot    High      Weak    No
Sunny      Hot    High      Strong  No
Overcast   Hot    High      Weak    Yes
Rainy      Mild   High      Weak    Yes
Rainy      Cool   Normal    Weak    Yes
Rainy      Cool   Normal    Strong  No
Overcast   Cool   Normal    Strong  Yes
Sunny      Mild   High      Weak    No
Sunny      Cool   Normal    Weak    Yes
Rainy      Mild   Normal    Weak    Yes
Sunny      Mild   Normal    Strong  Yes
Overcast   Mild   High      Strong  Yes
Overcast   Hot    Normal    Weak    Yes
Rainy      Mild   High      Strong  No

Decision tree - EXAMPLE (cont.)
▪ The decision tree is built by selecting the features that best split the data, using metrics like information gain.
▪ Entropy: The measure of the amount of uncertainty and randomness in a set of data for the classification task.
▪ Information gain: Used for ranking the attributes or features to split on at a given node in the tree.
▪ Information gain = (entropy of the distribution before the split) - (entropy of the distribution after it)

Choosing the Best Feature & Splitting
1. Choosing the best feature: The algorithm calculates the information gain for each feature and selects the one with the highest information gain as the root node. In this data set, Outlook is chosen as the root feature because it has the highest information gain.
2. Splitting by Outlook: We create branches for each possible value of Outlook: Sunny, Overcast, and Rainy. For each branch, we then look at the remaining features (Temperature, Humidity, Wind) and continue splitting.
3. Recursion for subtrees: For example, under the “Sunny” branch we might choose Humidity next, as it best splits the remaining data for “Sunny” days. Similarly, for the “Rainy” branch we might select Wind.
4. Stopping conditions: We continue splitting until we reach subsets where all instances have the same label (e.g., all “Yes” or all “No”), or until further splitting does not provide significant information gain. In cases where splitting stops, we label the node as a leaf with the majority class in that branch.

Using the Decision Tree for Predictions
▪ Once the tree is built, it can be used to predict the label (PlayTennis) for new data (a sketch of the finished tree as code follows below).
▪ Suppose we have new data: Outlook = Sunny, Humidity = Normal.
▪ Outlook = Sunny ➔ Go to the “Sunny” branch.
▪ Humidity = Normal ➔ Go to the “Normal” branch.
▪ We reach a leaf node with the label Yes, so the prediction is Yes (the player can play tennis).

Advantages & Limitations
Advantages of decision trees:
▪ Interpretable: The structure of the tree makes it easy to understand the decision-making process.
▪ Non-linear: Decision trees can capture non-linear relationships in data by creating branches that represent complex rules.
Limitations:
▪ Overfitting: Decision trees can be prone to overfitting, especially if they grow too deep and capture noise in the training data.
▪ Bias in splitting: Information gain may favor features with more categories, potentially introducing bias in the splits.
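A minimal sketch of the finished tree as if-else code, following the splits described above (Outlook at the root, Humidity under Sunny, Wind under Rainy); the leaf labels are read off the training table, and the function name is illustrative.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> str:
    """Predict PlayTennis using the decision tree built in this section."""
    if outlook == "Overcast":        # every Overcast row in the table is Yes
        return "Yes"
    if outlook == "Sunny":           # Sunny days split on Humidity
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rainy":           # Rainy days split on Wind
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook}")

# The prediction walked through above: Outlook = Sunny, Humidity = Normal -> Yes.
print(play_tennis("Sunny", "Normal", "Weak"))  # -> Yes
```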