6CS012 AI & Machine Learning Lecture 2 PDF, Herald College
Document Details

Herald College Kathmandu
2025
Siman Giri
Summary
This document contains lecture slides on artificial intelligence and machine learning. It covers understanding the components of learning from a classification perspective and revisits machine learning with softmax regression. The slides belong to the 6CS012 module at Herald College Kathmandu.
Full Transcript
6CS012 – Artificial Intelligence and Machine Learning.
Lecture 02: Understanding the Components of Learning – A Classification Perspective.
Siman Giri {Module Leader – 6CS012}
Lec-02: Revisiting Machine Learning with an Example of Softmax Regression. (2/27/2025)

Learning Outcomes!!
- To revise the various components of (Machine) Learning discussed in 5CS037.
- To review and re-familiarize the above-mentioned components in the context of a classification task: "Logistic Regression".
- By the end of the week: understand the limitations of Logistic Regression and the challenges of Machine Learning, and justify the need for deep learning.

1. Understanding a Machine Learning Problem. {Components of a Machine Learning System.}

1.1 What is Machine Learning?
Machine/Deep learning is a sub-domain of artificial intelligence (AI) that uses statistics, pattern recognition, knowledge discovery and data mining to automatically learn and improve from experience without being explicitly programmed.
Disclaimer!!! "In Machine/Deep Learning we do not write a program to solve a specific problem or task; instead we write code to enable the machine to learn from the data."
A machine learning algorithm learns from the training data:
- Input: training data (e.g., emails x and their labels y).
- Output: a prediction function that produces output y given input x.
Almost any application that involves understanding data from the real world can be addressed well with machine learning. Good examples are image classification, object detection and many kinds of language-processing tasks.

1.1.1 What isn't Machine Learning?
It is not artificial intelligence: at least, not exactly, though they are in a relationship.
("It's complicated".) Provocatively: it's the bit of AI that works.
- It is not on the verge of ending civilization in a robot singularity.
- More prosaically: it isn't always the right tool for the job.
- It excels at well-defined questions with densely sampled data.
- It is not good at abstract reasoning; mostly it does not even operate in that space.
- Avoid the human trap of imagining that an ML model understands anything.
Machine learning is a very general and useful framework, but it is not "magic" and may not always work. To better understand when it will and will not work, it is useful to formalize the learning problem further.

1.2 Components of a Learning System.

1.3.1 Data and Problem Class.
There are many different problem classes in machine learning, which vary according to what kind of data is provided and what kind of conclusions are to be drawn from it. The problem classes can be broadly divided into:
- Supervised Learning: the learning system is given inputs and told which specific outputs should be associated with them. We expect the machine to learn the function that maps each input to its associated label.
- Unsupervised Learning: does not involve learning a function from inputs to outputs based on a set of input-output pairs. Instead, one is given a dataset and is generally expected to find patterns or structure inherent in it.
- Reinforcement Learning: the goal is to learn a mapping from input values (typically states of an agent or system; think, e.g., the velocity of a moving car) to output values (typically control actions; think, e.g., whether to accelerate or hit the brake).
1.3.2 Data and Data Formats.
Some terminology associated with datasets in practice:
- Predictor, feature or input variables, also referred to as independent variables. In general we denote them X_{n×d}, also called the feature matrix.
- Target or output variables, also referred to as dependent variables:
  - the actual target variable from the dataset: Y_n = (y_1, …, y_n), the label vector;
  - the predicted target variable, i.e. the label your model assigns to new input data: Ŷ_n = (ŷ_1, …, ŷ_n), the predicted label vector.
In general the data is denoted as {caution: X and Y must have the same number of rows n}:
  D = {X_{n×d}, Y_n} = {(x_1, y_1), …, (x_n, y_n)}

1.3.3 Data, Problem Class: Supervised Learning.
In the supervised learning setup, training data comes in pairs of inputs (x, y), where x ∈ ℝ^d is the input instance and y its label, which can be written as:
  D = {(x_1, y_1), …, (x_n, y_n)} ⊆ ℝ^d × C
Where:
- ℝ^d: the d-dimensional feature space;
- x_i: the input vector of the i-th sample;
- y_i: the label of the i-th sample;
- C: the label space.
Fig: Example Dataset for Image Classification.

1.3.3.1 Supervised Learning: Task.
- Regression: the target is numerical, i.e. Y ∈ C = ℝ. E.g., predict the number of days a patient has to stay in hospital.
- Classification: the target is categorical, i.e. Y ∈ C = {1, 2, …, K}:
  - if K = 2, i.e. two classes 0 or 1 → binary classification, e.g. predict one of two risk categories for a life-insurance customer;
  - else (K > 2) → multiclass classification.

1.3.4 To Summarize: Data in Supervised Learning.
In supervised learning, datasets are given in the following formal notation:
  D = {(x_1, y_1), …, (x_n, y_n)} ∈ (X × Y)^n
We call:
- X: the input space, defined by the dimension d = dim(X); thus the feature matrix is X_{n×d}, and x_j = (x_{j1}, …, x_{jd}) is the j-th feature vector.
- Y_n: the target or label vector, where
  Y_n ∈ C and C = ℝ → regression task;
  Y_n ∈ C and C = {0, 1, 2, …, K} → classification task.
We assume some kind of relationship between the features and the target, in the sense that the value of the target variable can be explained by a combination of the features.

1.4 Machine Learning Model.
Given some potentially multi-valued input x_j = (x_{j1}, …, x_{jd}), the j-th feature vector, predict a potentially multi-valued output y = (y_1, …, y_m). {Collectively, x and y are the model variables.}
Q. How do we select the model?

1.4.1 Model Selection.
How do we select a model from a model class? In some cases, the ML practitioner will have a good idea of what an appropriate model class is, and will specify it directly. In other cases, we may consider several model classes and choose the best based on some objective function. The restricted set of functions defining a specific model class is called a hypothesis space:
  𝓗 = {f : f belongs to a certain functional family}

1.4.2 Models and Parameters.
All models within one hypothesis space share a common functional structure, i.e.
they are fully defined by some properties; let's call them parameters and denote them θ ∈ Θ, the parameter space {in machine learning we normally use w, also called weights}. Functions are fully determined by their parameters, so we can re-write:
  𝓗 = {f_θ : f_θ belongs to a certain functional family parameterized by θ}
E.g., in the case of linear functions, y = θ_0 + θ_1 x, the parameters θ_0 (intercept) and θ_1 (slope) determine the relationship between y and x. We collect all parameters in a parameter vector θ = (θ_1, θ_2, …, θ_n) from the parameter space Θ.
Finding the optimal model means finding the optimal or best set of parameters for the chosen function from the model class 𝓜. In supervised learning, finding the best set of parameters usually means fitting or training a model on a labeled training dataset.

1.5 The Training of a Machine Learning Model.
The learning process, aka parameter fitting or, most commonly, model training, is the task of finding the optimal or best set of parameters for the chosen function from the model class 𝓜. The questions to ask here are:
- How do we fit or train the model so we get the best set of parameters?
- How do we determine which set of parameters best fits our function or model?

1.6 Elements of a Learning Process.
Governed by the framework of Empirical Risk Minimization (ERM) {learning theory}, the key elements of a supervised machine learning process are:
- Decision Process/Function (Representation/Model): machine learning models are used to infer or estimate outputs from input data. Input data can be labeled (supervised learning) or unlabeled (unsupervised learning).
- Error Function (Evaluation): a performance metric evaluates how well the model's predictions align with the actual outcomes. The choice of metric depends on the learning type (supervised or unsupervised) and the task type: classification (e.g., accuracy) or regression (e.g., mean squared error).
- Model Optimization Process: an iterative algorithm updates the model's parameters to minimize the error function. Optimization continues until a specified threshold or an acceptable evaluation metric is achieved. Common methods include gradient descent and its variants.

1.6.1 Model as a Decision Function.
A decision function (aka prediction function) takes an input x ∈ 𝓧 and produces an action a ∈ 𝓐:
  f: 𝓧 → 𝓐, x ↦ f(x)
In the context of supervised machine learning, the action a ∈ 𝓐 depends on the label y ∈ 𝓨:
- if 𝓨 = ℝ, the action is to predict the value of the label → regression task;
- if 𝓨 = 𝓒 = {0, 1, …, K}, the action is to assign a class label → classification task.

1.6.2 Learn the Parameters: Error Metrics.
Goal: find the parameters of a model that best fit the training data.
Intuition: if we can evaluate how good a prediction function is, we can turn this into an optimization problem. The quality of predictions from a learned model is often expressed in terms of a loss function.
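As a minimal sketch of the pointwise scoring idea above (plain Python, with hypothetical values), here are two common per-prediction losses: squared loss for regression and zero-one loss for classification. Neither implementation is from the slides; they illustrate the general concept.

```python
# Pointwise losses: score ONE prediction f(x) against its true label y.

def squared_loss(pred, true):
    """Typical regression loss: (f(x) - y)^2."""
    return (pred - true) ** 2

def zero_one_loss(pred, true):
    """Typical classification loss: 1 if the predicted class is wrong, else 0."""
    return 0 if pred == true else 1

print(squared_loss(2.5, 3.0))       # 0.25
print(zero_one_loss("cat", "dog"))  # 1
print(zero_one_loss("cat", "cat"))  # 0
```

Averaging such pointwise losses over a whole sample is what the next slides call risk.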
We specify evaluation criteria at two levels: how an individual prediction is scored → pointwise loss; and how the overall behavior of the prediction or estimation system is scored → average loss, or Risk.
"Caution!! Risk is a theoretical concept backed by Statistical Learning Theory. Let's formally define Risk first!"

1.6.3 Statistical Learning Theory.
Define a space where a decision or prediction function is applicable:
- Assume there is a data-generating distribution ℙ_{𝓧×𝓨}.
- All input/output pairs (x, y) are generated i.i.d. from ℙ_{𝓧×𝓨}.
One common expectation is to have a prediction function f(x) "that does well on average", i.e. the risk
  R(f) = E_{(x,y)∼ℙ}[ℓ(f(x), y)]
is small. We cannot compute this risk function, because we do not know the true distribution ℙ_{𝓧×𝓨}, hence the expectation cannot be computed. But in machine learning/statistics we can estimate it from our empirical observations, i.e. sampled data.

1.6.3.1 Empirical Risk.
Empirical risk is an estimate of the theoretical risk, computed on a provided set of sample data. Let D_n = {(x_1, y_1), …, (x_n, y_n)} be drawn i.i.d. from ℙ_{𝓧×𝓨}. The empirical risk is defined as:
  R̂_n(f) = (1/n) Σ_{i=1}^{n} ℓ(f(x_i), y_i)
Thus the empirical risk is the average loss on our sampled data, and we want it to be as small as possible.

1.6.4 Empirical Risk.
Thus we want a function f, described by a set of parameters θ* ∈ Θ, which produces the minimum loss on average. So, remember the questions:
- How do we fit or train the model so we get the best set of parameters? Using an evaluation measure, also called a loss function, which in general measures the difference between the true label and the predicted label.
- How do we determine which set of parameters best fits our function or model?
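The empirical risk is just an average of pointwise losses over the sample. A minimal sketch, with a hypothetical toy sample D_n and a hypothetical candidate predictor f(x) = 2x (none of these values come from the slides):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """R_hat_n(f): average pointwise loss of predictor f over the sample."""
    return np.mean([loss(f(xi), yi) for xi, yi in zip(X, y)])

squared_loss = lambda pred, true: (pred - true) ** 2

# Hypothetical sample D_n of n = 3 pairs (x_i, y_i).
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])

f = lambda x: 2.0 * x  # a candidate prediction function

print(empirical_risk(f, X, y, squared_loss))  # 1/3: losses are 0, 0, 1
```

A better parameter choice for f would lower this average, which is exactly what empirical risk minimization formalizes next.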
Whichever set of parameters describes the function that produces the minimum (estimated) risk, i.e. the minimum average loss over all our sampled data points. The answer is collectively described by → Empirical Risk Minimization.

1.6.5 Empirical Risk Minimization.
Among a collection of candidate functions f described by parameters θ ∈ Θ, we desire to select the function f, learned from sample data D_n = {(x_i, y_i)}, that is described by the parameter θ* ∈ Θ and produces the minimum average loss, the risk R(f). Formally:
  θ* = argmin_{θ∈Θ} (1/n) Σ_{i=1}^{n} ℓ(f_θ(x_i), y_i)

1.2 Components of a Learning System: Finding a Final Model – an Optimization Process.

1.7 An Optimization Process.
How do we find such a function f → argmin? It is an optimization problem: we can find an empirical risk minimizer f̂ by minimizing a loss function ℓ, also called the objective function. Machine learning as an optimization problem can be defined as: for a given loss function ℓ(f(x_i), y_i), we want to find the set of parameters θ* ∈ Θ that produces the minimum loss value, which can be written as:
  θ* = argmin_{θ∈Θ} 𝓛(x, y, θ, f)
How do we find such parameters?

1.7.1 An Iterative Approach.
In a nutshell, numerical solutions are an iterative, systematic search:
- If the loss is (more or less) differentiable, we can find the local gradient.
- Unless we are at a minimum, the destination must be further down.
- So keep taking small steps down the gradient until we get there: the "Gradient Descent Algorithm".
Iterative or numerical methods (e.g., gradient descent) systematically refine the solution by leveraging gradient information or other feedback, improving precision and convergence to an optimal solution.
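The "small steps down the gradient" loop can be sketched in a few lines. This is a toy example, not from the slides: a single parameter θ, a hypothetical loss L(θ) = (θ − 3)², learning rate 0.1, so the minimizer is θ* = 3.

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose minimizer is theta* = 3.
theta = 0.0   # initial guess
lr = 0.1      # step size (learning rate)

for _ in range(100):
    grad = 2.0 * (theta - 3.0)  # dL/dtheta at the current point
    theta -= lr * grad          # step in the direction opposite the gradient

print(round(theta, 4))  # ~3.0: iterates converge toward the minimizer
```

Real training does the same thing, except θ is a vector and the loss is the empirical risk over the training data.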
Iterative methods focus computation on regions of interest and adaptively refine the search, reducing unnecessary evaluations and overcoming inefficiency.

1.8 So, How Good is Your Model?
If you find a function f_{θ*} with low loss on your data D_n = {(x_i, y_i)}, how can you tell how good the learned model is? In practice, performance metrics (evaluation measures, also called measures of error) are used to tell how well you learned the model parameters, or how well the model fitted your training data. Various performance metrics are available depending on the task, regression or classification, and for some tasks we can also use the loss function itself as the evaluation measure. We will discuss this further when we cover various models in upcoming weeks, but the question we are exploring today is: for any evaluation metric used, how can you say that a model learned from some sampled training data will do well on real-world data that is not in the sample? "How do you know it will make a correct prediction on new data that is not in the training data D_n?" How well does your model generalize?

1.9 ML in Practice: Applied ML.
A machine learning problem can be defined with:
- an observed input x ∈ 𝓧 → input space;
- an action a ∈ 𝓐 → action space {regression or classification};
- an observed outcome y ∈ 𝓨 → outcome space;
- an evaluation of the action in relation to the outcome using the ERM framework, i.e.
given a loss function ℓ: 𝓐 × 𝓨 → ℝ, choose a hypothesis space 𝓕 from 𝓗 = {𝓕 : 𝓕 ∈ 𝓜, the class of models}, and use an optimization method to find an empirical risk minimizer f̂_n ∈ 𝓕:
  f̂_n = argmin_{f∈𝓕} (1/n) Σ_{i=1}^{n} ℓ(f(x_i), y_i)
(or find an f̃ that comes close to f̂).
Your job as an ML practitioner:
- Choose an appropriate model 𝓕 from the class of models 𝓜, fit it directly, and hope the model performs appropriately – the model-fitting problem;
- Or consider several models from the model class and choose the best based on some performance metric – the model-selection problem.

2. Understanding the Classification Task. {Building a Logistic Regression for a Classification Task.}

2.1 Definition: Task of Classification.
While for regression the model simply maps an independent variable 𝓧_{n×d} to a dependent variable 𝓨_n, i.e.
  f: 𝓧 ∈ ℝ^d → 𝓨 ∈ ℝ,
for classification it is slightly more complicated, as the dependent variable is a discrete output called a class, i.e.
  y ∈ 𝓨 = {C_1, …, C_k}
Here C_1, …, C_k are discrete classes with 2 ≤ k < ∞, for any given data D = {(x_1, y_1), …, (x_n, y_n)} ∈ [𝓧_{n×d} × 𝓨_n]. Based on the target or dependent variable, i.e. Y ∈ C = {1, 2, …, K}, the classification task can be of two types:
- if K = 2, i.e. two classes 0 or 1 → binary classification;
- else (K > 2) → multiclass classification.

2.1.1 Binary vs. Multiclass Classification.
For convenience, we often encode these classes differently:
- binary classification (k = 2): usually use 𝓨 = {0, 1};
- multiclass classification (k ≥ 3): could use 𝓨 = {0, 1, …, k}, but one-hot encoding is mostly preferred, i.e. a k-length vector representation for each class: o_i = 𝕀(y = i) ∈ {0, 1}.
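The one-hot scheme o_i = 𝕀(y = i) can be sketched directly (a minimal illustration, not from the slides):

```python
import numpy as np

def one_hot(y, k):
    """Encode class label y in {0, ..., k-1} as a k-length indicator vector."""
    o = np.zeros(k)  # o_i = 1 iff y == i, else 0
    o[y] = 1.0
    return o

print(one_hot(2, 4))  # [0. 0. 1. 0.]
```

Exactly one position is "hot", so the vector doubles as a degenerate probability distribution over classes, which is what softmax regression will compare its outputs against.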
For example, with k = 4 classes, class 2 is encoded as the vector (0, 0, 1, 0).

2.2 What are Linear Models?
Linear models are a class of models in machine learning and statistics that assume a linear relationship between the input features (𝓧) and the output variable(s) (𝓨). A linear model predicts the output as a weighted sum of the input features:
  y = w_1 x_1 + w_2 x_2 + … + w_d x_d + b
Where:
- y ∈ ℝ is the output (dependent variable);
- x_1, x_2, …, x_d are the input features (independent or explanatory variables), with 𝓧 ∈ ℝ^d and d the dimension {for the purpose of training we are given n observations};
- b is the bias (intercept).
Linear models are only useful if a linear relationship exists; in the context of a classification problem, linear models are only useful for linearly separable data. Linearly separable data are those for which there exists a line or hyperplane that can separate the classes.

2.2.1 Linear Models for Classification.
Can we use linear models for classification? Yes, if there exists a function that:
- takes the raw output of w_0 + wᵀx and produces a value strictly between 0 and 1;
- preserves the ordering of the input values, i.e. a larger w_0 + wᵀx implies a higher probability.
Does such a function exist? Yes, and it is called the sigmoid function.

2.2.2 General Introduction: Sigmoid Function.
The logistic (sigmoid) function σ maps the real line to the unit interval (0, 1):
  σ(t) = 1 / (1 + e^{−t}) = e^t / (1 + e^t),  −∞ < t < ∞
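The sigmoid formula above translates directly to code. A minimal sketch (the probe values are arbitrary) showing the two properties required of it: outputs lie strictly in (0, 1), and the ordering of inputs is preserved.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: sigma(t) = 1 / (1 + e^(-t)), mapping R to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))                    # 0.5: the midpoint
print(0.0 < sigmoid(-10.0))            # True: output stays strictly above 0
print(sigmoid(10.0) < 1.0)             # True: output stays strictly below 1
print(sigmoid(-2.0) < sigmoid(2.0))    # True: order-preserving (monotone)
```

Feeding the raw linear score w_0 + wᵀx through this function is what turns a linear model into logistic regression.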