Document Details


Universiti Malaysia Pahang

Tags

data mining, classification, machine learning, data analysis

Summary

This document is a presentation on data mining classification techniques. It covers learning objectives, introduces classification and regression, discusses the types of classification, input and output attributes, and the working of classification, and then presents the decision tree and Naïve Bayes classifiers together with metrics for assessing classifier quality.

Full Transcript


BPD2233: DATA MINING – Classification

Learning Objectives
▪ To comprehend the concept, types and working of classification
▪ To identify the major differences between classification and regression problems
▪ To become familiar with the working of classification
▪ To introduce the decision tree classification system with the concepts of information gain and Gini index
▪ To understand the workings of the Naïve Bayes method

Introduction
▪ Two forms of data analytics, both used for predicting the future by analyzing existing data:
Classification – predicts a discrete value or class. Example: Paul the octopus.
Regression – predicts a continuous value. Example: the growth of social media usage.

Classification
▪ A method used by machine learning researchers and statisticians for predicting the outcome of an unknown sample.
▪ Used to categorize objects (or things) into a given discrete number of classes.
▪ Classification problems can be of two types:
Binary – the target attribute can have only two possible values. Examples: a tumor is either cancerous or not; a team will either win or lose.
Multiclass – the target attribute can have more than two values. Examples: a tumor can be of type 1, type 2 or type 3 cancer; a winner can be happy, sad or speechless.

Classification
▪ Some examples of business situations where the classification technique is applied:
To analyze the credit history of bank customers and identify whether it would be risky or safe to grant them loans.
✓ How – the prediction will produce a discrete value representing either risky or safe.
To analyze the purchase history of a shopping mall's customers to predict whether they will buy a certain product or not.
✓ How – the prediction will produce yes or no.

Types of Classification
Posteriori – "from the later"; something derived by reasoning from the observed facts; a supervised machine learning approach. E.g.: apples are sweet.
Priori – "from the earlier"; something derived by reasoning from a self-evident proposition; an unsupervised machine learning approach. E.g.: every apple is a fruit.

Input and Output Attributes
▪ Data contains two types of attributes, namely:
Input attributes / independent attributes – all other attributes.
Output attribute / dependent attribute – the class attribute that represents the output of all other attributes.
▪ The attributes can be of different types:
Numerical attributes – attributes whose domain is numerical.
Nominal or categorical attributes – attributes whose domain is not numerical.
▪ During classification, it is important to have a database of sufficient size for training the model accurately.

Working of Classification
▪ A two-step process:
First step – a classifier is built based on the training data, obtained by analyzing database tuples and their associated class labels. By analyzing the data, the system learns and creates prediction rules.
Second step – these prediction rules are tested on some unknown instances, i.e., test data. The rules are used to make predictions about the output attribute or class, and the predictive accuracy of the classifier is calculated. The system performs in an iterative manner to improve its accuracy.
▪ Example – analyze data of previous loan applications.
▪ The same process of training and testing the classifier applies in general.

Guidelines for Size and Quality of the Training Dataset
▪ There should be a balance between the number of training samples and the number of independent attributes.
▪ The number of training samples required is likely to be relatively small if the number of independent (input) attributes is small and, similarly, likely to be relatively large if the number of independent (input) attributes is large.
▪ The quality of the classifier depends upon the quality of the training data.
▪ The training data should be available based on the number of classes.
▪ Classifiers include Decision Tree, Naïve Bayes, Support Vector Machine and Neural Network.

Decision Tree Classifier

Introduction
▪ In the decision tree classifier, predictions are made by using multiple "if…then…" conditions, which are similar to the control statements in different programming languages.
▪ The decision tree structure consists of a root node, branches and leaf nodes.
▪ Each internal node represents a condition on some input attribute.
▪ Each branch specifies the outcome of the condition.
▪ Each leaf node holds a class label.
▪ The root node is the topmost node in the tree.

Introduction
▪ Case study: predict whether a customer will buy a laptop or not.
▪ Task: from Figure 5.6, write the "if…then…" statements that represent the decision tree classifier.

Building a Decision Tree
▪ In the late 1970s and early 1980s, J. Ross Quinlan (a researcher in machine learning) developed the decision tree algorithm known as ID3 (Iterative Dichotomiser) and later proposed C4.5 (a successor of ID3), which became a benchmark against which newer supervised learning algorithms are compared.
▪ The decision tree is a common machine learning technique which has been implemented in many machine learning tools like Weka, R and Matlab, as well as in some programming languages such as Python, Java, etc.
▪ These algorithms are based on the concepts of Information Gain and Gini Index.

Concept of Information Theory
▪ The decision tree algorithm works on the basis of information theory.
▪ Information theory was developed by Claude Shannon.
▪ Information is directly related to uncertainty.
▪ If there is uncertainty, then there is information; if there is no uncertainty, then there is no information.
▪ Example: coin toss.
If a coin is biased, having a head on both sides, then the result of tossing it does not give any information.
If a coin is unbiased, having a head and a tail, then the result of the toss provides some information.

Defining Information in Terms of Probability
▪ Information theory defines entropy, which is the average amount of information given by a source of data.
▪ Entropy is a measure of the uncertainty or unpredictability in a set of possible outcomes. It quantifies the amount of "surprise" associated with a random variable (a small numerical sketch of this follows).
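A minimal numerical sketch of this idea in Python (the function name and the printed values are illustrative, not taken from the slides): it computes the Shannon entropy, in bits, of the two coin-toss scenarios above, showing that the two-headed coin carries no information while the fair coin carries one bit per toss.

import math

def entropy(probabilities):
    # Shannon entropy in bits: the average information of a source,
    # H = -sum(p * log2(p)) over outcomes with non-zero probability.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Biased coin (head on both sides): the outcome is certain, so no surprise.
print(entropy([1.0, 0.0]))   # 0.0 bits

# Unbiased coin (head and tail equally likely): maximum surprise for two outcomes.
print(entropy([0.5, 0.5]))   # 1.0 bit

The entropy of a dataset before and after a candidate split is what information gain, discussed next, compares.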
▪ Therefore, the total information for an event E with probability P(E) is calculated as I(E) = -log2 P(E).

Information Gain vs Gini Index
▪ Definition – Information Gain: measures the reduction of uncertainty (entropy). Gini Index: measures impurity, i.e., the degree to which a dataset or a subset of data is mixed with different classes.
▪ Interpretability – Information Gain: directly related to information and uncertainty. Gini Index: less intuitive; focuses on class purity.
▪ Bias – both can favor attributes with many values.
▪ Computation – Information Gain: more computationally intensive. Gini Index: faster and simpler to compute.
▪ Usage – Information Gain: preferred in algorithms like ID3 and C4.5. Gini Index: used in CART (Classification and Regression Trees).

Example Decision Tree
▪ Training data (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower):
1  Yes  Single    125K  No
2  No   Married   100K  No
3  No   Single    70K   No
4  Yes  Married   120K  No
5  No   Divorced  95K   Yes
6  No   Married   60K   No
7  Yes  Divorced  220K  No
8  No   Single    85K   Yes
9  No   Married   75K   No
10 No   Single    90K   Yes
▪ Model (decision tree) with splitting attributes:
Home Owner = Yes → NO
Home Owner = No → Marital Status (MarSt):
  Married → NO
  Single or Divorced → Annual Income:
    Income < 80K → NO
    Income > 80K → YES

Apply Model to Test Data (Introduction to Data Mining, 2nd Edition)
▪ Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?
▪ Start from the root of the tree: Home Owner = No → follow the Marital Status branch; Marital Status = Married → reach the leaf NO; assign Defaulted to "No".
▪ There could be more than one tree that fits the same data: a tree with Marital Status at the root (Married → NO; Single or Divorced → Home Owner, then Annual Income) fits the same training data equally well.

Decision Tree Classification Task
▪ A training set (records with attribute values and known class labels) is given to a tree induction algorithm, which learns a model (induction). The learned decision tree is then applied to a test set of records with unknown class labels to deduce their classes (deduction). A sketch of applying the example tree to the test record is shown below.
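As a minimal sketch in Python (the function and argument names are illustrative, not taken from the slides), the example tree above can be written directly as the nested "if…then…" conditions the slides describe; applying it to the test record reproduces the result Defaulted = "No".

def classify_defaulted(home_owner, marital_status, annual_income_k):
    # Apply the example decision tree: returns "Yes" or "No" for Defaulted Borrower.
    if home_owner == "Yes":
        return "No"                      # Home Owner = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                      # Married -> leaf NO
    # Single or Divorced: split on Annual Income at 80K
    if annual_income_k < 80:
        return "No"
    return "Yes"

# Test record from the slides: Home Owner = No, Marital Status = Married, Income = 80K
print(classify_defaulted("No", "Married", 80))   # -> "No" (Defaulted assigned "No")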
Decision Tree Classification
▪ Advantages:
Relatively inexpensive to construct.
Extremely fast at classifying unknown records.
Easy to interpret for small-sized trees.
Robust to noise (especially when methods to avoid overfitting are employed).
Can easily handle redundant attributes.
Can easily handle irrelevant attributes (unless the attributes are interacting).
▪ Disadvantages:
Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
Each decision boundary involves only a single attribute.

Practical Application
▪ Customer Churn Prediction: identifying customers who are likely to leave, allowing businesses to take preemptive action.
▪ Credit Risk Assessment: classifying loan applicants into "high risk" and "low risk" categories based on factors like credit history and income.
▪ Product Recommendation: segmenting customers to recommend relevant products, boosting sales and customer satisfaction.
▪ Fraud Detection: flagging potentially fraudulent transactions based on patterns found in past transactions.

Naïve Bayes Classifier

Introduction
▪ Naive Bayes is a family of probabilistic algorithms based on Bayes' Theorem, primarily used for classification tasks in data mining.
▪ It is particularly known for its simplicity and efficiency, making it suitable for large datasets.
▪ Bayes' Theorem: at the core of Naive Bayes is Bayes' Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. The theorem is expressed as P(C|X) = P(X|C) P(C) / P(X).

Naïve Assumption
▪ The term "naive" in Naive Bayes comes from the assumption that the features used for classification are independent given the class label. While this assumption rarely holds in real-world applications, Naive Bayes often performs surprisingly well despite this.

Classifier Types
▪ Gaussian Naive Bayes: assumes that the features follow a normal (Gaussian) distribution; suitable for continuous data.
▪ Multinomial Naive Bayes: used for discrete counts, particularly effective in document classification and natural language processing (NLP).
▪ Bernoulli Naive Bayes: similar to Multinomial but works with binary features (0 or 1); often used in text classification.

How It Works
▪ The Naive Bayes algorithm involves the following steps (a code sketch of both phases appears below):
Training phase – calculate the prior probabilities P(C) for each class C, and for each feature calculate the likelihood P(Fi|C), where Fi is the i-th feature.
Prediction phase – for a new instance, calculate the posterior probability for each class using Bayes' Theorem, P(C|F1,…,Fn) ∝ P(C) P(F1|C) … P(Fn|C); the class with the highest posterior probability is assigned as the predicted class.

Advantages and Limitations
▪ Advantages:
Simplicity: easy to implement and understand.
Efficiency: performs well with large datasets and has low computational cost.
Scalability: suitable for high-dimensional data, such as text data.
Works well with less data: can yield good performance even with small training datasets.
▪ Limitations:
Independence assumption: the assumption of feature independence may not hold in real-life scenarios, which can affect accuracy.
Zero probability problem: if a category of a feature is not present in the training data, it can lead to zero probability in predictions; this can be addressed using techniques like Laplace smoothing.
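A minimal sketch of the two phases in Python for categorical features; the toy records, the function names, and the use of Laplace smoothing with alpha = 1 are illustrative assumptions, not taken from the slides.

from collections import Counter, defaultdict

def train_naive_bayes(records, labels, alpha=1):
    # Training phase: estimate priors P(C) and likelihoods P(Fi = v | C),
    # with Laplace smoothing (alpha) to avoid the zero-probability problem.
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    counts = defaultdict(lambda: defaultdict(Counter))   # counts[class][feature][value]
    for rec, c in zip(records, labels):
        for i, value in enumerate(rec):
            counts[c][i][value] += 1
    values_per_feature = [set(r[i] for r in records) for i in range(len(records[0]))]

    def likelihood(value, i, c):
        total = sum(counts[c][i].values())
        return (counts[c][i][value] + alpha) / (total + alpha * len(values_per_feature[i]))

    return priors, likelihood

def predict(instance, priors, likelihood):
    # Prediction phase: posterior(C) is proportional to P(C) * product of P(Fi | C);
    # the class with the largest score is returned.
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(instance):
            score *= likelihood(value, i, c)
        scores[c] = score
    return max(scores, key=scores.get)

# Illustrative toy data: (Outlook, Wind) -> Play
records = [("Sunny", "Weak"), ("Sunny", "Strong"), ("Rain", "Weak"), ("Rain", "Strong")]
labels = ["Yes", "No", "Yes", "No"]
priors, likelihood = train_naive_bayes(records, labels)
print(predict(("Sunny", "Weak"), priors, likelihood))   # -> "Yes"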
Practical Application
▪ Spam Detection: classifying emails as spam or not spam based on their content.
▪ Sentiment Analysis: analyzing customer reviews to determine the sentiment (positive, negative, neutral).
▪ Customer Segmentation: classifying customers into different segments based on purchasing behavior.
▪ Recommendation Systems: predicting user preferences based on past behaviors.

Example: Play Tennis
▪ Learning phase for the tennis example:
P(Play=Yes) = 9/14, P(Play=No) = 5/14.
We have four variables; for each we calculate the conditional probability table:
Outlook – Sunny: 2/9 (Yes), 3/5 (No); Overcast: 4/9 (Yes), 0/5 (No); Rain: 3/9 (Yes), 2/5 (No).
Temperature – Hot: 2/9 (Yes), 2/5 (No); Mild: 4/9 (Yes), 2/5 (No); Cool: 3/9 (Yes), 1/5 (No).
Humidity – High: 3/9 (Yes), 4/5 (No); Normal: 6/9 (Yes), 1/5 (No).
Wind – Strong: 3/9 (Yes), 3/5 (No); Weak: 6/9 (Yes), 2/5 (No).
▪ Test phase for the tennis example:
Given a new instance x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), look up the calculated tables:
P(Outlook=Sunny|Play=Yes) = 2/9, P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9, P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9, P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9, P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14, P(Play=No) = 5/14
Use the MAP rule to decide Yes or No:
P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Given that P(Yes|x') < P(No|x'), we label x' as "No".

Metrics to Assess the Quality of a Classifier
▪ True Positive
▪ False Positive
▪ True Negative
▪ False Negative
▪ Precision
▪ Recall
▪ F-Measure

The Boy Who Cried Wolf
▪ Definition: "Wolf" – positive class; "No Wolf" – negative class.
▪ Wolf prediction – a 2 x 2 confusion matrix with four possible outcomes.

TP, FP, TN, FN
▪ True Positive (TP): occurs when the classifier correctly predicts a positive outcome. Example: imagine a classifier designed to detect spam emails; if an email is actually spam and the classifier correctly labels it as spam, this is a true positive.
▪ False Positive (FP): happens when the classifier incorrectly predicts a positive outcome for something that is actually negative; also known as a Type I error. Example: continuing with the spam filter, if an email is actually not spam (legitimate) but the classifier labels it as spam, this is a false positive.
▪ True Negative (TN): the classifier correctly predicts a negative outcome. Example: in the spam filter, a true negative would mean a legitimate email was correctly identified as not spam.
▪ False Negative (FN): the classifier incorrectly predicts a negative outcome for something that is actually positive. Example: in the spam filter, this would mean an actual spam email was incorrectly identified as not spam (missed spam).

Confusion Matrix
▪ These terms are often visualized in a confusion matrix:
                  Predicted Positive    Predicted Negative
Actual Positive   True Positive (TP)    False Negative (FN)
Actual Negative   False Positive (FP)   True Negative (TN)
▪ From this matrix we compute important metrics like accuracy, precision, recall, and F1-score, which help evaluate the classifier's effectiveness.

Accuracy
▪ Definition: measures the overall proportion of correct predictions among all predictions.
▪ Use: suitable for balanced datasets (a sketch of how accuracy and the related metrics are computed follows).
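A minimal sketch in Python of how accuracy, together with the precision, recall and F1 metrics defined on the following slides, is computed from confusion-matrix counts; the function name and the example counts are illustrative, not from the slides.

def classification_metrics(tp, fp, tn, fn):
    # Standard metrics computed from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + fp + tn + fn)              # proportion of correct predictions
    precision = tp / (tp + fp)                              # how many predicted positives are correct
    recall = tp / (tp + fn)                                 # how many actual positives were found
    f1 = 2 * precision * recall / (precision + recall)      # harmonic mean of precision and recall
    return accuracy, precision, recall, f1

# Illustrative counts for a spam filter: 40 spam caught, 5 legitimate emails flagged,
# 50 legitimate emails passed, 5 spam emails missed.
print(classification_metrics(tp=40, fp=5, tn=50, fn=5))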
▪ However, accuracy can be misleading with imbalanced datasets, where it might be high just because the model predicts the majority class well.

Precision (Positive Predictive Value)
▪ Definition: measures how many of the positive predictions are actually correct.
▪ Use: important when false positives are costly (e.g., in spam detection, where marking legitimate emails as spam is undesirable).

Recall (Sensitivity or True Positive Rate)
▪ Definition: measures how many of the actual positives the classifier correctly identified.
▪ Use: crucial when missing actual positives is costly (e.g., in medical diagnostics, where missing a disease could have serious consequences).

F-Measure or F1 Score
▪ Definition: the harmonic mean of precision and recall, balancing them when there is an uneven class distribution.
▪ Use: effective for imbalanced datasets; it provides a single metric to evaluate the balance between precision and recall, particularly useful when there is a trade-off between false positives and false negatives.

POP QUIZ…
▪ Which type of machine learning technique will be used for the following?
i. Prediction of the price of a house – regression
ii. Predicting the type of disease – classification
iii. Predicting tomorrow's temperature – regression
iv. Predicting if tomorrow will be cooler or hotter than today – classification

Remind Me…

Discrete vs Continuous
https://www.youtube.com/watch?v=Cg0W6mod9Hw

Terms
▪ Training dataset – a subset of your dataset that you use to teach a machine learning model to recognize patterns or perform to your criteria.
▪ Testing dataset – unseen data used to test your model; you use it to evaluate the performance and progress of your algorithm's training and to adjust or optimize it for improved results. It should represent the actual dataset and be large enough to generate meaningful predictions.
▪ Classifier – an algorithm that automatically orders or categorizes data into one or more of a set of "classes." Example: an email classifier that scans emails to filter them by class label: Spam or Not Spam.
▪ Entropy – a measure of randomness or disorder of a system.

THANK YOU
