Probabilistic Classification
Document Details
Umm Al-Qura University
Dr Idrees Alsolbi, Dr Omima Fallatah
Summary
This document presents lecture slides on probabilistic classification, focusing on Naive Bayes and Logistic Regression. It covers the different types of Naive Bayes models (Gaussian, Multinomial, and Bernoulli), their applications, and the advantages and limitations of each method. The transcript also includes follow-on lectures on Discriminant Analysis (LDA, QDA, RDA) and Ensemble Methods (bagging, random forests, and boosting).
Full Transcript
Probabilistic Classification
Dr Idrees Alsolbi, Assistant Professor, and Dr Omima Fallatah, Assistant Professor
College of Computers, Department of Data Science – Data Analysis 2

Probabilistic Classification
Some algorithms in machine learning and statistics use a probabilistic approach to make predictions. These algorithms rely on probability distributions, likelihoods, and Bayesian inference to classify data or make decisions. Common examples:
- Naïve Bayes
- Logistic Regression

What is Naive Bayes?
A supervised learning algorithm based on applying Bayes' theorem. It is one of the simplest and most effective classification algorithms and is primarily used for classification tasks. It assumes that features are independent (hence "naive"). Naive Bayes classifiers are widely used for their simplicity and efficiency in machine learning; they are effective for large datasets and useful for text classification.

What is Naive Bayes?
Naive Bayes is heavily used in text classification. In text classification tasks, the data is high-dimensional, since each word represents one feature. Examples: spam filtering, sentiment detection, rating classification. The main advantage of Naive Bayes is its speed: it is fast, and making predictions is easy even with high-dimensional data. Given a set of feature values, the model predicts the probability that an instance belongs to a class; it is therefore a probabilistic classifier. It assumes that each feature in the model is independent of the existence of any other feature. In other words, each feature contributes to the prediction with no relation to the others. In the real world, this condition is rarely satisfied.

Why is it Called Naive Bayes?
The "Naive" part of the name indicates the simplifying assumption made by the Naïve Bayes classifier: the features used to describe an observation are assumed to be conditionally independent, given the class label. The "Bayes" part of the name refers to Reverend Thomas Bayes, an 18th-century statistician and theologian who formulated Bayes' theorem.

How Naive Bayes Works
1. Calculate the prior probability of each class.
2. Calculate the likelihood of each feature value given each class.
3. Apply Bayes' theorem to compute the posterior probability of each class.
4. Choose the class with the highest posterior probability.

Example
The dataset is divided into two parts: the feature matrix and the response vector. The feature matrix contains all the vectors (rows) of the dataset, where each vector consists of the values of the features, e.g. 'Outlook', 'Temperature', 'Humidity' and 'Windy'. The response vector contains the value of the class variable (the prediction or output) for each row of the feature matrix, e.g. 'Play golf'.

Assumptions of Naive Bayes
- Feature independence: the features of the data are conditionally independent of each other, given the class label.
- Features are equally important: all features are assumed to contribute equally to the prediction of the class label.
- No missing data: the data should not contain any missing values.

Back to Our Golf Example
We assume that no pair of features is dependent. For example, the temperature being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has no effect on the wind. Hence, the features are assumed to be independent. Secondly, each feature is given the same weight (or importance). For example, knowing the temperature and humidity alone cannot predict the outcome accurately; none of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.
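To make the four steps above concrete, here is a minimal pure-Python sketch on a tiny made-up categorical dataset (the weather values and labels below are invented for illustration; they are not the lecture's golf data). It counts priors and likelihoods, multiplies them for a new instance, and picks the class with the highest score.

```python
# Minimal sketch of the four Naive Bayes steps on a tiny hypothetical dataset.
from collections import Counter, defaultdict

# Each row: (outlook, windy, play) -- illustrative values only.
data = [
    ("sunny", "false", "no"),
    ("sunny", "true", "no"),
    ("overcast", "false", "yes"),
    ("rainy", "false", "yes"),
    ("rainy", "true", "no"),
    ("overcast", "true", "yes"),
]

labels = [row[-1] for row in data]
n = len(data)

# Step 1: prior probability of each class.
priors = {c: count / n for c, count in Counter(labels).items()}

# Step 2: likelihood of each feature value given each class.
likelihood = defaultdict(lambda: defaultdict(Counter))  # feature -> class -> value counts
for outlook, windy, play in data:
    likelihood["outlook"][play][outlook] += 1
    likelihood["windy"][play][windy] += 1

def p(feature, value, cls):
    counts = likelihood[feature][cls]
    return counts[value] / sum(counts.values())

# Steps 3-4: posterior score for each class (up to the constant P(X)), pick the largest.
x_new = {"outlook": "sunny", "windy": "true"}
scores = {
    cls: priors[cls] * p("outlook", x_new["outlook"], cls) * p("windy", x_new["windy"], cls)
    for cls in priors
}
print(scores, "->", max(scores, key=scores.get))
```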
Bayes' Theorem
P(C | X) = P(X | C) · P(C) / P(X)
Posterior = (Likelihood × Prior) / Evidence

Example: Play Tennis

Learning Phase
We have four variables; for each one we calculate the conditional probability table.

Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny      2/9       3/5          Hot          2/9       2/5
Overcast   4/9       0/5          Mild         4/9       2/5
Rain       3/9       2/5          Cool         3/9       1/5

Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
High       3/9       4/5          Strong       3/9       3/5
Normal     6/9       1/5          Weak         6/9       2/5

P(Play=Yes) = 9/14, P(Play=No) = 5/14

Test Phase
Given a new instance x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                     P(Play=No) = 5/14

MAP rule:
P(Yes|x'): [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] · P(Play=Yes) = 0.0053
P(No|x'):  [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] · P(Play=No) = 0.0206
Since P(Yes|x') < P(No|x'), we label x' as "No".

Types of Naive Bayes Model

1. Gaussian Naive Bayes
Description: used when the features are continuous; it assumes that the continuous features follow a normal (Gaussian) distribution within each class.
Assumption: each class has a Gaussian distribution for each feature, characterized by a mean and a variance.
Use case example: classifying whether a patient has a disease based on continuous features like age, blood pressure, and cholesterol levels, which may follow a normal distribution.

2. Multinomial Naive Bayes
Description: suitable for discrete features that represent counts or frequencies; it assumes these features follow a multinomial distribution.
Assumption: the feature values are discrete and represent the counts or occurrences of different events (e.g., words in a document).
Use case example: text classification tasks such as spam detection or document categorization, where the features are the frequencies of words in an email or a document.

3. Bernoulli Naive Bayes
Description: used when the features are binary/boolean (e.g., 1/0, yes/no, true/false); it models whether a particular feature is present or absent in a given instance.
Assumption: the features are binary (e.g., presence or absence of a word in a document), and the likelihoods are calculated using the Bernoulli distribution.
Use case example: binary text classification, such as spam detection where the features indicate whether specific words appear in the email or not (rather than how many times they appear, as in Multinomial Naive Bayes).

Advantages of the Naive Bayes Classifier
- Easy to implement and computationally efficient.
- Effective in cases with a large number of features.
- Performs well even with limited training data.
- Performs well in the presence of categorical features.

Limitations of the Naive Bayes Classifier
- Assumes that features are independent, which may not always hold in real-world data.
- Can be influenced by irrelevant attributes.
- May assign zero probability to unseen events, leading to poor generalization.
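As a small illustration of the three variants above, the following sketch fits scikit-learn's GaussianNB, MultinomialNB and BernoulliNB on randomly generated data. It assumes scikit-learn and NumPy are installed; the data and scores are purely illustrative.

```python
# Minimal sketch of the three Naive Bayes variants in scikit-learn.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)  # two classes

# Gaussian NB: continuous features (e.g., age, blood pressure).
X_cont = rng.normal(loc=y[:, None], scale=1.0, size=(200, 3))
print("Gaussian:", GaussianNB().fit(X_cont, y).score(X_cont, y))

# Multinomial NB: count features (e.g., word frequencies in a document).
X_counts = rng.poisson(lam=2 + 3 * y[:, None], size=(200, 5))
print("Multinomial:", MultinomialNB().fit(X_counts, y).score(X_counts, y))

# Bernoulli NB: binary features (e.g., word present / absent).
X_bin = (X_counts > 0).astype(int)
print("Bernoulli:", BernoulliNB().fit(X_bin, y).score(X_bin, y))
```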
What is Logistic Regression?
Logistic regression is a supervised learning algorithm used for binary classification tasks, in which we use the sigmoid function. It predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value: Yes or No, 0 or 1, true or false, etc. However, instead of giving the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1. The model outputs a probability and, based on a threshold (commonly 0.5), classifies the input into one of the two classes.

Logistic Function – Sigmoid Function
The sigmoid function is a mathematical function used to map predicted values to probabilities. It maps any real value to a value in the range 0 to 1. The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so the function forms an "S"-shaped curve. It squashes the output of a linear model into the range [0, 1], making it interpretable as a probability.

How Logistic Regression Works
The model finds the best-fitting line (or hyperplane) that separates the two classes by learning coefficients β0, β1, …, βn. It estimates the probability of each class and applies a decision threshold to classify the input.

Logistic Regression Example
Example: predict whether a student will be admitted to a university based on their GPA and SAT score. Logistic regression estimates the probability of admission and classifies the student as admitted or not.

Logistic Regression vs. Linear Regression
- Linear regression predicts continuous values (e.g., house prices); logistic regression predicts probabilities for binary outcomes (e.g., pass/fail).
- In linear regression we predict the value of continuous variables; in logistic regression we predict the values of categorical variables.
- In linear regression we find a best-fit line; in logistic regression we find an S-curve.
- In linear regression the output must be a continuous value, such as price or age; in logistic regression the output must be a categorical value, such as 0 or 1, yes or no.
- In linear regression there may be collinearity between the independent variables; in logistic regression there should be little to no collinearity between the independent variables.

Logistic Regression Advantages and Limitations
Advantages:
- Simple to implement and interpret.
- Outputs reliable probabilities.
- Works well when the relationship between features and the target is approximately linear.
Limitations:
- Assumes a linear decision boundary.
- Sensitive to outliers.
- May not perform well on complex datasets with non-linear relationships.
- Works best with a reasonably large dataset.
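The following sketch ties the sigmoid, the learned coefficients and the 0.5 threshold together on the admission example. The GPA/SAT numbers are hypothetical and scikit-learn is assumed to be available; this is an illustrative sketch, not the lecture's worked example.

```python
# Minimal sketch of logistic regression with a 0.5 decision threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical training data: [GPA, SAT score] -> admitted (1) or not (0).
X = np.array([[3.9, 1450], [3.5, 1300], [2.8, 1100], [3.2, 1200],
              [2.5, 1000], [3.7, 1400], [2.9, 1150], [3.0, 1250]])
y = np.array([1, 1, 0, 1, 0, 1, 0, 0])

model = LogisticRegression(max_iter=1000).fit(X, y)

applicant = np.array([[3.4, 1280]])
p_admit = model.predict_proba(applicant)[0, 1]   # probability of class 1
decision = int(p_admit >= 0.5)                   # threshold at 0.5
print(f"P(admit) = {p_admit:.2f} -> class {decision}")

# The model itself is sigmoid(b0 + b1*GPA + b2*SAT):
z = model.intercept_[0] + applicant[0] @ model.coef_[0]
print("sigmoid(z) =", sigmoid(z))
```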
Conclusions
- Naive Bayes: a simple, fast, and interpretable classifier that relies on the assumption of independence between features. It is effective for text classification and other tasks where the independence assumption holds.
- Logistic Regression: a powerful, interpretable model for binary and multi-class classification, providing reliable probability outputs. It works well for linearly separable data.
Both models are grounded in probability theory and provide insights into how likely it is that a data point belongs to a particular class. Both are widely used in industry and academia for spam detection, disease diagnosis, and many other applications.

Any Questions?

Discriminant Analysis
Dr Idrees Alsolbi, Assistant Professor, and Dr Omima Fallatah, Assistant Professor
College of Computers, Department of Data Science – Data Analysis 2

Discriminant Analysis: What is the main problem?
- Training situation: data on p predictors, with known membership of one of g groups.
- Classification problem: data on p predictors, with unknown group membership.

[Scatter plot: Fisher's Iris data – Petal Width vs. Sepal Width. Can the three species be identified?]

What is Discriminant Analysis?
Discriminant Analysis is a statistical technique used to determine which variables discriminate between two or more naturally occurring groups. It helps classify a set of observations into predefined classes based on predictor variables (features). The main goal of Discriminant Analysis is to find a combination of features that best separates the classes. It creates a decision boundary, or discriminant function, that maximizes the separation between the groups. It is useful when you want to understand how well your data can be separated into distinct categories, such as customers, products, diseases, etc.

What is Discriminant Analysis? Examples
1. Classifying customers into two groups: those who buy a product vs. those who don't.
2. Classifying loan applicants as high-risk or low-risk based on their financial history and other factors.
Each group has certain characteristics, such as age, income, and shopping frequency. Discriminant Analysis identifies the combination of these characteristics that most effectively distinguishes between Group A and Group B.

Types of Discriminant Analysis
Linear Discriminant Analysis (LDA):
1. Assumes that the groups have the same covariance matrix and are normally distributed.
2. Suitable when the independent variables have a linear relationship.
Quadratic Discriminant Analysis (QDA):
1. Does not assume equal covariance matrices.
2. Useful when the relationship between variables is non-linear.
Regularized Discriminant Analysis (RDA):
1. A blend of LDA and QDA, introducing regularization to avoid overfitting.

Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis, or simply LDA, is a well-known feature extraction technique that has been used successfully in many statistical pattern recognition problems. LDA is a method that finds a linear combination of features that best separates two or more classes. It assumes that the features are normally distributed within each class. LDA is often called Fisher Discriminant Analysis (FDA).

Linear Discriminant Analysis (LDA)
LDA projects the data onto a line such that the separation between the class means is maximized while the variance within each class is minimized. It then uses this projection to classify new observations. The task: classify the item x at hand into one of J groups based on measurements on p predictors. Rule: assign x to the group j (j = 1, 2, …, J) whose mean is closest. Distance measure: the Mahalanobis distance, which takes the spread of the data into consideration.

Linear Discriminant Analysis (LDA)
LDA seeks discriminatory features that provide the best class separability. The discriminatory features are obtained by maximizing the between-class covariance matrix and minimizing the within-class covariance matrix.

Linear Discriminant Analysis (LDA) – Distance Measure
For j = 1, 2, …, J, compute
d_j(x) = (x - x̄_j)^T S_pl^{-1} (x - x̄_j)
and assign x to the group for which d_j is minimum. S_pl is the pooled estimate of the covariance matrix.
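In practice, a classifier of this kind (one shared, pooled covariance matrix across classes) is what scikit-learn's LinearDiscriminantAnalysis fits. A minimal sketch on the Iris data mentioned earlier, assuming scikit-learn is installed:

```python
# Minimal sketch of LDA as a classifier on the Iris data.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA assumes one shared (pooled) covariance matrix across the classes.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

print("test accuracy:", lda.score(X_test, y_test))
# The fitted model also gives the linear projection used for class separation:
print("projected shape:", lda.transform(X_test).shape)  # at most n_classes - 1 axes
```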
Quadratic Discriminant Analysis (QDA)
QDA is an extension of LDA that allows for non-linear boundaries by assuming that each class has its own covariance matrix. It is more flexible than LDA but requires more data to estimate the parameters. It does not assume equal covariance matrices; because of this, the decision boundaries in QDA are quadratic (curved) rather than linear.

Quadratic Discriminant Analysis (QDA)
Rule: assign x to group j if Q_j(x) is the largest. This rule is optimal if the J groups of measurements are multivariate normal. QDA is more flexible than LDA, especially with non-linear data, and can capture more complex patterns in the data.

Regularized Discriminant Analysis (RDA)
RDA is a compromise between LDA and QDA. It introduces a regularization parameter that shrinks the covariance estimates toward a common value. RDA combines LDA and QDA by adjusting the contributions of the shared and individual covariance matrices. It balances the flexibility of QDA with the simplicity of LDA and reduces the risk of overfitting, especially in small or noisy datasets.

Regularized Discriminant Analysis (RDA)
The regularization parameter (often denoted by λ) controls the trade-off: when λ = 0, RDA is equivalent to QDA, and when λ = 1, it is equivalent to LDA. In scenarios where the distinction between classes is neither purely linear nor fully quadratic, RDA provides a flexible solution.

LDA or PCA?
The primary purpose of LDA is to separate samples of distinct groups by transforming them to a space which maximises their between-class separability while minimising their within-class variability. LDA is a supervised method used for classification: it aims to find a linear combination of features that best separates two or more classes. PCA identifies the directions (principal components) along which the variance of the data is maximized, capturing the most important patterns in the data. PCA is an unsupervised method used for feature extraction and dimensionality reduction without considering any class labels.

LDA or PCA?
LDA seeks directions that are efficient for discriminating data, whereas PCA seeks directions that are efficient for representing data. The directions discarded by PCA might be exactly the directions that are necessary for distinguishing between groups.
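A minimal sketch of the LDA-versus-PCA contrast on the Iris data, assuming scikit-learn is installed: PCA projects the data without ever looking at the labels, while LDA uses the labels to choose its directions.

```python
# Minimal sketch: supervised LDA projection vs. unsupervised PCA projection.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA picks directions of maximum variance and ignores y.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA picks directions that maximise between-class vs. within-class spread,
# so it needs the class labels.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print("PCA projection:", X_pca.shape)   # (150, 2)
print("LDA projection:", X_lda.shape)   # (150, 2), at most n_classes - 1 = 2 axes
```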
Example 1: Loan Approval
Imagine a bank wants to decide whether a person should get a car loan. Task: the loan officer needs to determine whether the applicant is likely to repay the loan or to default. Process: the officer looks at the applicant's details (such as age and income), compares these details with past borrowers who repaid successfully and those who did not, and makes a decision based on which group the applicant is more similar to.

Example 2: HATCO
HATCO is a large industrial supplier. A marketing research firm surveyed 100 HATCO customers. There were two different types of customers: those using Specification Buying and those using Total Value Analysis. HATCO management believes that the two types of customers evaluate their suppliers differently.

In a B2B situation, HATCO wanted to know the perceptions that its customers had about it. Data were gathered on 7 variables:
1. Delivery speed
2. Price level
3. Price flexibility
4. Manufacturer's image
5. Overall service
6. Salesforce image
7. Product quality
Each variable was measured on a 10 cm graphic rating scale.

Example 2 – Stage 1: Objectives of Discriminant Analysis
Which perceptions of HATCO best distinguish firms using each buying approach?

Example 2 – Stage 2: Research design
a. The dependent variable is the buying approach of customers; it is categorical. The independent variables are X1 to X7 as listed above.
b. The overall sample size is 100. Each group exceeded the minimum of 20 per group.
c. The analysis sample size was 60 and the holdout sample size was 40.

Example 2 – Stage 3: Assumptions of discriminant analysis
All the assumptions were met.

Example 2 – Stage 4: Estimation of discriminant analysis and assessing fit
Before estimation, we first examine the group means for X1 to X7 and the significance of the differences in means.
a. Estimation is done using the stepwise procedure: the independent variable with the largest Mahalanobis D² distance is selected first, and so on, until none of the remaining variables is significant. The discriminant function is obtained from the unstandardized coefficients.

Example 3
Imagine we are trying to classify customers into two groups based on their likelihood of purchasing a product: Group 1 (Will Purchase) and Group 2 (Will Not Purchase). We have data on two independent variables: x1, annual income (in $1000s), and x2, age (in years). Suppose we have the following data for some customers:

Customer  Income (x1, $1000s)  Age (x2, years)  Purchase Group (Y)
1         50                   25               1 (Will Purchase)
2         60                   30               1 (Will Purchase)
3         40                   22               2 (Will Not Purchase)
4         70                   28               1 (Will Purchase)
5         45                   35               2 (Will Not Purchase)

If the discriminant score (Y) is above a certain threshold, say 0, the customer is classified into Group 1 (Will Purchase). If the score is below 0, the customer is classified into Group 2 (Will Not Purchase).

Example 3 – Steps
- Data collection: gather data for observations where group membership is known, including the values of the independent variables.
- Discriminant function estimation: use statistical software to calculate the discriminant function, which involves finding the best coefficients for the independent variables.
- Classification: apply the discriminant function to new observations to predict their group membership.
- Validation: test the accuracy of the discriminant function using a separate dataset or cross-validation techniques.
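The slides do not give the discriminant coefficients for Example 3, so the sketch below simply fits scikit-learn's LinearDiscriminantAnalysis to the five customers as a stand-in and scores a hypothetical new customer; it assumes scikit-learn and NumPy are installed, and the predicted group is illustrative only.

```python
# Minimal sketch of Example 3: fit LDA on the five customers and classify a new one.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Columns: income ($1000s), age (years); labels: 1 = will purchase, 2 = will not.
X = np.array([[50, 25], [60, 30], [40, 22], [70, 28], [45, 35]])
y = np.array([1, 1, 2, 1, 2])

lda = LinearDiscriminantAnalysis().fit(X, y)

new_customer = np.array([[55, 27]])  # hypothetical applicant
print("predicted group:", lda.predict(new_customer)[0])
# The signed discriminant score plays the role of the threshold at 0:
print("discriminant score:", lda.decision_function(new_customer)[0])
```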
Example 3 – Finance and Investment
Purpose: to classify companies or stocks into different risk categories. Example: financial analysts use discriminant analysis to classify companies into "High Risk" or "Low Risk" investment categories based on factors such as earnings, debt levels, and market trends. This helps investors make more informed decisions about where to allocate funds.

Example 3 – Medical Diagnosis
Objective: to classify patients into different diagnostic groups based on symptoms or test results. Application: in healthcare, discriminant analysis can be used to predict whether a patient has a particular disease based on their test results and symptoms. For example, by analyzing blood pressure, cholesterol levels, and other variables, doctors can classify patients as "healthy" or "at risk" for conditions like heart disease.

Discriminant Analysis Limitations
- Discriminant Analysis (both LDA and QDA) assumes that the features are normally distributed within each class. (This is not always the reality!)
- It is sensitive to outliers, which can mislead the estimated decision boundaries and lead to incorrect classifications. This is especially problematic in small datasets, where a few outliers can have a large impact.
- In very high-dimensional or large-scale datasets, DA methods can become computationally expensive and may not scale well, especially QDA due to its quadratic nature.

References
- http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html#a-summary-of-the-pca-approach
- http://cs.fit.edu/~dmitra/ArtInt/ProjectPapers/PcaTutorial.pdf
- Sebastian Raschka, Linear Discriminant Analysis Bit by Bit, http://sebastianraschka.com/Articles/2014_python_lda.html, 2014.
- Zhihua Qiao, Lan Zhou and Jianhua Z. Huang, Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data.

Any Questions?

Ensemble Methods
Dr Idrees Alsolbi, Assistant Professor, and Dr Omaima Fallatah, Assistant Professor
College of Computers, Department of Data Science – Data Analysis 2

Ensemble Methods
An ensemble is a composite model: it combines a series of low-performing classifiers with the aim of creating an improved classifier. The individual classifiers vote, and the final prediction label is returned by majority voting. Ensembles offer more accuracy than an individual (base) classifier. Ensemble methods can be parallelized by allocating each base learner to a different machine.

Ensemble Methods
Typical application: classification. An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples. The simplest approach:
1. Generate multiple classifiers.
2. Each votes on the test instance.
3. Take the majority vote as the classification.
Aim: take a simple, mediocre algorithm and transform it into a super classifier without requiring any fancy new algorithm.

Basic Ensemble Structure

Why Ensembles?
We want to minimize two sources of error:
1. Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model that keeps missing essential trends.
2. Variance quantifies how much a model's predictions fluctuate based on different training data. A high-variance model will over-fit the training population and perform poorly on any observation beyond the training data.

Bias–Variance Trade-off
The goal is to maintain a balance between these two types of errors. This is known as the trade-off management of bias-variance errors. Ensemble learning is one way to carry out this trade-off analysis.

Types of Ensemble Methods
1. Parallel training with different training sets: bagging.
2. Sequential training, iteratively re-weighting training examples so the current classifier focuses on hard examples: boosting.

Bootstrap
Bootstrap is a resampling method that involves repeatedly drawing samples with replacement from a dataset to create multiple "bootstrap samples." Each sample is the same size as the original dataset but may include duplicate data points. Example: suppose we have a dataset with 5 samples: A, B, C, D, E. A bootstrap sample might look like A, C, A, D, B (note that A appears twice and E is missing). Another bootstrap sample could be B, B, D, E, C.

Bagging
Bagging stands for bootstrap aggregation. It combines multiple learners in a way that reduces the variance of the estimates. For example, a random forest trains M decision trees: you can train M different trees on different random subsets of the data and perform voting for the final prediction. Bagging reduces the variance (overfitting) of the base learner, such as a decision tree, by averaging their outputs.
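A minimal sketch of the bootstrap example above (the A–E samples), using NumPy's sampling with replacement; the random seed and the particular samples drawn are illustrative only.

```python
# Minimal sketch of bootstrap resampling: same size as the original,
# drawn WITH replacement, so duplicates can appear and some points drop out.
import numpy as np

rng = np.random.default_rng(42)
data = np.array(["A", "B", "C", "D", "E"])

for i in range(3):
    sample = rng.choice(data, size=len(data), replace=True)
    print(f"bootstrap sample {i + 1}: {list(sample)}")

# Bagging then trains one base learner per bootstrap sample and
# combines them by voting (classification) or averaging (regression).
```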
Random Forest
Random forest is a supervised machine learning algorithm based on ensemble learning. It combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used for both regression and classification tasks.

How Random Forest Works
- Divide the training examples into multiple training sets (bagging).
- Train a decision tree on each set (each tree can also consider a randomly selected subset of the features).
- Aggregate the predictions of the trees to make the classification/regression decision:
  - For classification tasks, Random Forest makes a prediction by majority voting: the class predicted most frequently by the individual trees is selected.
  - For regression tasks, the prediction is the average of the predictions from all trees.

Random Forest (classification task)
Random Forest (regression)

Random Forest Example – Medical Diagnosis
Problem: predicting whether a patient has a particular disease based on various medical test results (features).
Application: Random Forest can be used to classify patients into "disease" or "no disease" categories. Each decision tree in the forest might look at different combinations of test results (e.g., blood pressure, cholesterol levels, age), and the final diagnosis is based on the majority vote across all the trees.
Outcome: this approach can help improve diagnostic accuracy by reducing the chance of overfitting to any single test result.

Random Forest Example – Credit Scoring
Problem: assessing the creditworthiness of loan applicants based on features like income, credit history, employment status, etc.
Application: a Random Forest model can be trained on past loan data to predict whether a new applicant is likely to default on a loan. Each tree in the forest might focus on different aspects of the applicant's profile, and the ensemble prediction helps provide a more reliable assessment.
Outcome: Random Forests are often used in financial services to build robust credit scoring models that are less prone to overfitting than single decision trees.

Random Forest Advantages and Downsides
Random Forests are among the most widely used algorithms:
- They don't require a lot of tuning.
- They are typically very accurate.
- They handle heterogeneous features well (trees).
- They implicitly select the most relevant features.
Downsides:
- Less interpretable and slower to train (but parallelizable).
- They don't work well on high-dimensional sparse data (e.g. text).

Boosting
Boosting algorithms combine a set of low-accuracy classifiers to create a highly accurate classifier. A low-accuracy (or weak) classifier offers accuracy better than random guessing. Boosting builds a model from the training data, then creates a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models has been added.

Boosting
Each classifier is trained with knowledge of the performance of the previously trained classifiers, so it focuses on the hard examples. A highly accurate (or strong) classifier offers an error rate close to 0. The boosting algorithm can track which models failed to predict accurately. The final classifier is a weighted sum of the component classifiers. Example: AdaBoost (Adaptive Boosting). By correcting mistakes iteratively, boosting reduces bias, making the final model more accurate.

AdaBoost
First train the base classifier on all the training data with equal importance weights on each case. Then re-weight the training data to emphasize the hard cases and train a second model:
- Assign higher weights to wrongly classified observations and lower weights to correctly classified ones.
- The more accurate classifiers get higher weights in the final combination.
Keep training new models on the re-weighted data. The final model is a weighted sum of the base classifiers.
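To contrast the two families, the sketch below trains a bagging ensemble (Random Forest) and a boosting ensemble (AdaBoost) on synthetic data; it assumes scikit-learn is installed, and the dataset and accuracies are illustrative only.

```python
# Minimal sketch: bagging (Random Forest) vs. boosting (AdaBoost) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many trees trained in parallel on bootstrap samples, majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Boosting: weak learners trained sequentially on re-weighted data
# (AdaBoost's default weak learner is a depth-1 decision tree, i.e. a stump).
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Random Forest accuracy:", rf.score(X_test, y_test))
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```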
AdaBoost
https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/

AdaBoost Example – Face Detection
Problem: detecting faces in images, a crucial task in computer vision.
Application: AdaBoost can be used to create a strong classifier for face detection by combining several weak classifiers. Each weak classifier might focus on different facial features, such as edges, texture, or color.
Outcome: AdaBoost iteratively boosts the importance of misclassified faces during training, making the final model very effective at distinguishing faces from non-faces. This technique was famously used in the Viola-Jones face detection framework.

AdaBoost Example – Fraud Detection
Problem: identifying fraudulent transactions in financial data.
Application: AdaBoost can be used to build a model that identifies fraudulent transactions based on features like transaction amount, location, time, and frequency. Initially, simple models might miss some of the fraudulent transactions, but AdaBoost focuses on these difficult cases in subsequent rounds.
Outcome: by boosting the misclassified transactions, AdaBoost creates a stronger overall model that is better at catching fraud while minimizing false positives.

Boosting Advantages and Disadvantages
Advantages:
- Fast learning.
- Capable of learning any function (given an appropriate weak learner).
- Simple and easy to implement.
Disadvantages:
- AdaBoost can be very sensitive to noisy data and outliers because it increases the weight of misclassified instances.

Summary
Ensemble learning methods are meta-algorithms that combine several machine learning methods into a single predictive model to increase performance.
- Parallel training with different training sets (bagging): train separate models on overlapping training sets and average their predictions.
- Sequential training, iteratively re-weighting training examples so the current classifier focuses on hard examples (boosting).
Bagging is a variance-reduction technique, while boosting is a bias-reduction technique.

References
- https://mitu.co.in/wp-content/uploads/2022/04/Ensemble-Learning.pdf
- Ensemble Techniques, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar.