NCS490: Introduction to AI for Security | Lecture 1 | August 27-29, 2024
So What Is Machine Learning?
Automated automation: getting computers to program themselves
Writing software is the bottleneck, so let the data do the work instead
The field of study that gives computers the ability to learn without being explicitly programmed

Machine Learning Problem Types
Based on type of data
○ Supervised, unsupervised, semi-supervised, reinforcement learning
Based on type of output
○ Regression, classification, clustering
Based on type of model
○ Generative, discriminative

Types of Learning Based on Type of Data
Supervised learning
○ Training data includes desired outputs
○ Trying to learn a relation between the input data and the output
Unsupervised learning
○ Training data does not include desired outputs
○ Trying to "understand" the data
Semi-supervised learning
○ Training data includes a few desired outputs
Reinforcement learning
○ Rewards from a sequence of actions

Types of Learning Based on Type of Output
Regression: predicts continuous values
○ For example: What is the price of a house in California? (Prices of homes will vary.) What is the probability that a user will click on this ad?
Classification: predicts discrete values
○ For example: Is a given email message spam or not spam? Is this an image of a dog, a cat, or a hamster?
Clustering: an unsupervised machine learning technique designed to group unlabeled examples based on their similarity to each other
○ If the examples are labeled, this kind of grouping is called classification

Based on Type of Model
Generative: explicitly learns the actual distribution of each class
○ Generates new data similar to the data on which it was trained. These models are called generative because they create something new: they learn the patterns in the data and create something new from what they have learned.
○ Examples: Naive Bayes, Hidden Markov Models, Bayesian Networks, Markov Random Fields
Discriminative: learns the decision boundary between the classes
○ Do not generate content. They learn to distinguish between different kinds of data instances and are useful for tasks such as classification.
○ Examples: logistic regression, SVMs, traditional neural networks, nearest neighbor

Regression vs Classification vs Clustering
Regression gives us a continuous number
Classification gives us a discrete class (e.g., yes/no)
Clustering groups similar examples together

ML Terminologies
ML basic terminologies: labels, features, examples, models

Labels
A label is the thing, or the output, that we are predicting in a classification or regression task
This applies to both classification and regression problems. For instance, if you're trying to predict the type of pet someone will choose, your input features might include age, home region, family income, etc.
○ The label is the final choice, such as dog, fish, iguana, rock, etc.
Usually denoted with the variable y

Features
Features are the variables (a.k.a. attributes) that describe the input data
A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features
Usually denoted as x1, x2, ..., xn
In a spam detector example, the features could include the following:
○ Words in the email text
○ Sender's address
○ Time of day the email was sent
Basically, a feature is an input
○ For instance, if you're trying to predict the type of pet someone will choose, your input features might include age, home region, family income, etc. The label is the final choice, such as dog, fish, iguana, rock, etc.
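To make the feature/label terminology concrete, here is a minimal sketch of labeled examples for the spam-detector scenario above; the specific feature values and variable names are illustrative assumptions, not from the lecture.

```python
# Each example pairs a feature vector x with a label y (spam = 1, not spam = 0).
# Feature values below are made up for illustration.
examples = [
    # x = (count of "free" in text, sender is a known contact, hour email was sent)
    {"x": [3, 0, 2],  "y": 1},  # spam
    {"x": [0, 1, 14], "y": 0},  # not spam
]

for ex in examples:
    print("features:", ex["x"], "-> label:", ex["y"])
```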
Types of Attributes
Categorical: data that can be categorized into distinct groups or values; information that can be put into categories
○ Red, blue, brown, yellow
○ No natural ordering to the categories
○ Usually encoded as numbers
Ordinal: data that has a natural order
○ Finite set of discrete values with a ranked ordering between values
○ Poor, satisfactory, good, excellent
○ There is a natural ordering to the categories
○ Encoded as numbers that preserve the ordering
Numeric: data that can be represented as integer or real values
○ Can be expressed as a number
○ Height, weight, and temperature
○ Integers or real numbers
○ Meaningful to add, multiply, and compare
The process of generating the features for a machine learning problem is called feature engineering (a minimal encoding sketch follows the ML applications below)
○ Involves the extraction and transformation of variables from raw data, such as price lists, product descriptions, and sales volumes, so that you can use the features for training and prediction

Data Samples (Examples)
A data sample/example is a particular instance of data, x. (Note that x is a vector of features and it may have an associated label.)
We break examples into two categories:
○ Labeled examples: used for training
○ Unlabeled examples: used for inference/testing
Imagine you wanted to know the average height of people in your city. It would take forever to measure every single person!
○ But if you took a random sample of a few hundred people, you could calculate an average that would be pretty close to the actual average height

Model
A model defines a relationship between features and a label
Two phases of a model's life:
○ Training: creating or learning the model. You show the model labeled examples and enable the model to gradually learn the relationships between features and label.
○ Testing/Inference: applying the trained model to unlabeled examples. You use the trained model to make useful predictions (y′).

Machine Learning Applications
ML Application 1: Credit Approval
Numeric features (can be represented as integer or real values)
○ Loan amount (e.g., $1000)
○ Income (e.g., $65000)
Ordinal features (have a natural order)
○ Savings: (none, 1000)
○ Employed: {unemployed, 7yrs}
Categorical features (can be categorized into distinct groups)
○ Purpose: (car, appliance, repairs, education, business)
○ Personal: (single, married, divorced, separated)
Labels (categorical output)
○ Approve credit application
○ Disapprove credit application

ML Application 2: Handwritten Digit Recognition
Represent each pixel as a separate attribute, either categorical or ordinal:
Categorical features (can be categorized into distinct groups)
○ (white) or (black) based on a threshold
Ordinal features
○ Degree of "blackness" of a pixel
Labels (categorical)
○ {0,1,2,3,4,5,6,7,8,9}
A hard feature engineering process
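Here is the encoding sketch referenced above: a minimal illustration of how categorical, ordinal, and numeric attributes might be turned into a numeric feature vector for the credit-approval example. The category sets, ordinal levels, and function names are assumptions for illustration, not part of the lecture.

```python
# Minimal feature-encoding sketch (column names and category sets are assumed).
# Categorical -> one-hot vector (no ordering), ordinal -> ranked integer,
# numeric -> used as-is.

PURPOSES = ["car", "appliance", "repairs", "education", "business"]  # categorical
SAVINGS_ORDER = {"none": 0, "some": 1, "high": 2}                    # ordinal (assumed levels)

def encode_applicant(loan_amount, income, savings, purpose):
    """Turn one credit application into a flat numeric feature vector x."""
    one_hot = [1.0 if purpose == p else 0.0 for p in PURPOSES]  # categorical
    return [loan_amount, income,                                # numeric
            float(SAVINGS_ORDER[savings]),                      # ordinal
            *one_hot]

x = encode_applicant(loan_amount=1000, income=65000, savings="none", purpose="car")
print(x)  # [1000, 65000, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```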
Linear Regression
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data
○ Example: scientists found that crickets chirp more frequently on hotter days than on cooler days

A linear relationship:
○ True, the line doesn't pass through every dot
○ However, the line does clearly show the relationship between chirps and temperature
○ y = mx + b
Where:
○ y: the temperature in Celsius, the value we're trying to predict
○ m: the slope of the line
○ x: the number of chirps per minute, the value of our input feature
○ b: the y-intercept

In machine learning, we write the equation for a model slightly differently:
○ y′ = w1x1 + w0
Where:
○ y′: the predicted label (a desired output)
○ w1: the weight of feature 1. Weight is the same concept as the "slope" m
○ x1: feature 1
○ w0 (or b): the bias (the y-intercept)
Note:
○ A model that relies on three features might look as follows: y′ = w3x3 + w2x2 + w1x1 + w0
○ Bias in machine learning is a systematic error that occurs when an ML model makes incorrect assumptions during the learning process, or when some aspects of a dataset are given more weight and/or representation than others

Training and Loss
Training a model simply means learning (determining) good values for the weights and the bias from labeled examples
Loss is the penalty for a bad prediction
○ A measure of the difference between a model's predicted values and the actual values
○ A perfect prediction means the loss is zero
○ A bad model has high loss
Suppose we selected two different sets of weights and bias: the model on the right has lower loss; on the left, the line fails to pass near the data points

Squared loss: linear regression models use a popular loss function called squared loss
○ The overall amount of error in your model
○ Also known as L2 loss
○ Represented as follows: [observation(x) − prediction(x)]² = (y − y′)²
○ Why squared loss? Using the raw error, some negative values would cancel out some positive values
○ Can we use absolute loss? Yes; the absolute value gives the magnitude of the error without sign cancellation

Mean Squared Error (MSE)
Measures how well a predictive model performs by evaluating the average squared difference between the predicted and actual values in a dataset
The average squared loss per example over the whole dataset: MSE = (1/N) Σ(x,y)∈D (y − prediction(x))²
○ (x, y) is an example in which y is the label and x is the feature
○ prediction(x) is equal to y′ = w1x + w0
○ D is the dataset that contains all (x, y) pairs
○ N is the number of samples in D

Reducing Loss
Training is a feedback process that uses the loss function to improve the model parameters
Training is an iterative process
○ Repeating tasks to improve a product
What initial values should we set for w1 and w0? How do we update w1 and w0?
○ Use the error to update them

Gradient Descent
Looks at the current set of parameters and changes them one step at a time so that they output the desired values
○ We get closer and closer to the optimal set of parameters
Example: it is nightfall and you are on top of a hill and want to get to the village down low in the valley. Fortunately, you have a trusty flashlight that helps you see the steepest direction locally around you despite the darkness. You take each step in the direction of steepest descent using the flashlight and reach the village at the bottom fairly quickly.
Assume (for simplicity) we are only concerned with finding w1, and assume we had the time and the computing resources to calculate the loss for all possible values of w1
○ Regression problems yield convex loss-vs-weight plots
○ The bottom of the curve gives us the least loss
Gradient descent enables you to find the optimum without computing the loss for all possible values. It has the following steps (a runnable sketch follows the list):
1. Pick a random starting point for w
2. Calculate the gradient of the loss curve at w
3. Update w
4. Go to 2, until convergence
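A minimal sketch of these four steps for a one-feature model y′ = w1x + w0 with squared loss; the toy dataset, learning rate, and stopping threshold are illustrative assumptions, not values from the lecture.

```python
# Gradient descent for y' = w1*x + w0 with squared loss (toy data, assumed).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.2, 5.9, 8.1]           # roughly y = 2x, so w1 should approach 2

w1, w0 = 0.0, 0.0                   # step 1: pick a starting point
eta = 0.01                          # learning rate (illustrative choice)

for step in range(5000):
    n = len(xs)
    # step 2: gradient of MSE = (1/N) * sum (y - y')^2 w.r.t. w1 and w0
    g1 = sum(-2 * (y - (w1 * x + w0)) * x for x, y in zip(xs, ys)) / n
    g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in zip(xs, ys)) / n
    # step 3: move against the gradient
    w1 -= eta * g1
    w0 -= eta * g0
    # step 4: stop when the gradient is (almost) zero
    if abs(g1) < 1e-6 and abs(g0) < 1e-6:
        break

print(f"w1 = {w1:.3f}, w0 = {w0:.3f}")  # expect roughly w1 ≈ 2.0
```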
Note that a gradient is a vector, so it has both of the following characteristics:
○ Magnitude
○ Direction
The gradient descent algorithm takes a step in the direction of the negative gradient
○ dLoss/dw is the gradient of the loss
○ The sign of the gradient tells us whether to move left or right
The gradient descent algorithm moves the previous point by some fraction of the gradient's magnitude, the learning rate η: w ← w − η · dLoss/dw
○ "Draw a straight line tangent to this point, going downhill"

Convergence Criteria
Convergence is when a model reaches a stable state and stops improving its predictions
For convex ("smiley face") loss curves, the optimum occurs at the bottom of the curve, where the gradient is zero (dLoss/dw = 0)
In practice, stop when the loss no longer decreases by more than a small threshold between iterations

Learning Rate
A tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function
○ Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size)
Also called a hyperparameter (a configuration variable that is manually set before training a model)

Generalization and Gradient
For n features: y′ = wnxn + ... + w1x1 + w0
○ Note w0 is the bias (intercept), and x0 = 1
Vector representation: y′ = wᵀx
Loss: ℓ = (y − y′)²
Gradient derivation: ∂ℓ/∂wj = −2(y − y′)xj

Types of Gradient Descents
Batch gradient descent:
○ Uses the entire training dataset for each iteration of the learning process
○ Averages over the gradients of all the training data
○ Datasets often contain billions or even hundreds of billions of examples
○ Can take a very long time to compute
Stochastic gradient descent (SGD):
○ Processes individual data points instead of the entire dataset at once; uses only a single example (a batch size of 1) per iteration
○ Very noisy: random or unpredictable fluctuations that can disrupt the ability to identify the target patterns or relationships
Mini-batch gradient descent:
○ A compromise between full-batch iteration and SGD
○ Typically a batch of size between 10 and 1,000 examples, chosen at random
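A minimal sketch contrasting the three variants on the toy dataset from the earlier sketch; the batch size, helper name, and epoch count are illustrative assumptions.

```python
import random

# One epoch of mini-batch gradient descent for y' = w1*x + w0 (toy data, assumed).
# batch_size = len(data) gives batch GD; batch_size = 1 gives SGD.
def run_epoch(data, w1, w0, eta=0.01, batch_size=2):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        n = len(batch)
        # Average gradient of the squared loss over just this batch.
        g1 = sum(-2 * (y - (w1 * x + w0)) * x for x, y in batch) / n
        g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in batch) / n
        w1 -= eta * g1
        w0 -= eta * g0
    return w1, w0

data = [(1.0, 2.1), (2.0, 4.2), (3.0, 5.9), (4.0, 8.1)]
w1, w0 = 0.0, 0.0
for _ in range(2000):
    w1, w0 = run_epoch(data, w1, w0)
print(f"w1 = {w1:.3f}, w0 = {w0:.3f}")  # expect roughly w1 ≈ 2.0, noisier than batch GD
```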
Logistic Regression
Logistic regression accomplishes binary classification tasks by predicting the probability of a binary event occurring
○ Used for binary classification
○ The inputs are the feature values and the output (y′) is a probability from 0 to 1
○ Limited to two possible outcomes: yes/no, 0/1, or true/false
Note:
○ Logistic regression is a linear classifier
○ The equation of the decision boundary: 0 = w2x2 + w1x1 + w0
○ Class 0 condition: 0 > w2x2 + w1x1 + w0
○ Class 1 condition: 0 < w2x2 + w1x1 + w0
The dependent variable (y) is 0 or 1 depending on whether the event actually happened
Examples:
○ Suppose we were given past data about the number of hours students spent studying and their corresponding outcome on some exam, either pass or fail. Our objective is to predict whether a student will pass or fail the exam given the number of hours studied.
Logistic regression is really good because it's simple enough to be done without much data. It's also nice to know exactly how a variable affects the outcome. The downside is that it can't handle complex effects.

Sigmoid Function
In order to map predicted values to probabilities, we use the sigmoid function
○ σ(z) = 1 / (1 + e^(−z)) converts any input value into a probability between 0 and 1
We can set the decision boundary at z = w2x2 + w1x1 + w0, so that y′ = σ(z)

Logistic Loss Function
Since y′ in logistic regression is a probability between 0 and 1, our loss can be defined with the following loss function:
○ if y = 1: Loss = −log(y′)
○ if y = 0: Loss = −log(1 − y′)

Generalization and Gradient
For n features, vector representation: z = wᵀx, y′ = sigmoid(z) = σ(z)
ℓ = −y log(y′) − (1 − y) log(1 − y′)
Gradient derivation: ∂ℓ/∂wj = (y′ − y)xj

Summary: Linear vs Logistic Regression
In both models, the loss is a convex function of the weights (y′ is a function of the weights)
Loss: during the training of a supervised model, a measure of how far a model's prediction is from its label
○ Linear regression: squared loss, ℓ = (y − y′)²
○ Logistic regression: log loss, ℓ = −y log(y′) − (1 − y) log(1 − y′)
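To tie the pieces together, here is a minimal sketch of logistic regression trained by gradient descent on a one-feature version of the pass/fail example above, using the gradient (y′ − y)x from the derivation; the dataset, learning rate, and iteration count are illustrative assumptions.

```python
import math

# Logistic regression with one feature: hours studied -> P(pass).
# Toy data (assumed): (hours studied, passed 0/1)
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (5.0, 1)]

def sigmoid(z):
    """Map any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w1, w0, eta = 0.0, 0.0, 0.1

for _ in range(10000):
    # Gradient of the log loss: d(loss)/dw1 = (y' - y) * x, d(loss)/dw0 = (y' - y)
    g1 = sum((sigmoid(w1 * x + w0) - y) * x for x, y in data) / len(data)
    g0 = sum((sigmoid(w1 * x + w0) - y) for x, y in data) / len(data)
    w1 -= eta * g1
    w0 -= eta * g0

for hours in (1.0, 2.5, 4.0):
    p = sigmoid(w1 * hours + w0)
    print(f"{hours} hours -> P(pass) = {p:.2f}, predict {'pass' if p > 0.5 else 'fail'}")
```

The decision boundary is where w1x + w0 = 0, i.e., where the predicted probability crosses 0.5, matching the class 0/class 1 conditions in the notes.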