UNIT-III
Classification: Introduction, General Framework for Classification, Applications. Decision Tree Induction: Introduction, Decision Tree Representation, Attribute Selection Measures, Decision Tree Learning Algorithm, Tree Pruning, Issues in Decision Trees, Metrics for Evaluating Classifier Performance.

3.1. Classification
Classification is a data mining technique that categorizes data objects into predefined classes, categories, or groups based on their features or attributes. It is a supervised learning technique that uses labeled data to build a model capable of predicting the class of new, unseen data. Classification is an important task in data mining because it enables organizations to make informed decisions based on their data. For example, a retailer may use classification to group customers into segments based on their purchase history and demographic data; this information can then be used to target marketing campaigns at each segment and improve customer satisfaction.

3.1.1. General Approach to Classification (How Classification Works)
Data classification is a two-step process, consisting of
1. a learning step, in which a classification model is constructed, and
2. a classification step, in which the model is used to predict class labels for given data.

1. Learning step (model construction): In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), in which a classification algorithm builds the classifier by analyzing, or "learning from", a training set made up of database tuples and their associated class labels.

Fig. (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is loan decision, and the learned model (classifier) is represented in the form of classification rules.

A tuple X is described by an n-dimensional attribute vector X = (x1, x2, ..., xn), recording n measurements made on the tuple from n database attributes A1, A2, ..., An, respectively. Each tuple X is assumed to belong to a predefined class determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered; it is categorical in that each of its values serves as a category or class. The individual tuples making up the training set are called training tuples and are selected from the database under analysis. In the context of classification, data tuples are also referred to as samples, instances, data points, or objects. Because the class label of each training tuple is provided, this step is known as supervised learning. It contrasts with unsupervised learning (clustering), in which the class label of each training tuple is unknown and the number or set of classes to be learned may not be known in advance.

2. Classification step (using the model to predict class labels): In the second step, the model is used for classification. First, the predictive accuracy of the classifier is estimated. If the training set itself were used to estimate the accuracy of the classifier, this estimate would be optimistic, because the classifier tends to overfit the data (i.e., during learning it may incorporate anomalies that are specific to the training data and not present in the general data set). Hence, a test set is used, made up of test tuples and their associated class labels.
These tuples are randomly selected from the general data set. They are independent of the training tuples, meaning that they were not used to construct the classifier.

Fig. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

3.2. Applications
Classification is a fundamental machine learning technique that involves categorizing data into predefined classes or labels. Its applications span numerous fields, providing valuable solutions to a wide array of problems. Some prominent applications are:
1. Healthcare:
o Disease Diagnosis: Classifying medical images (e.g., X-rays, MRIs) to detect diseases such as cancer or pneumonia.
o Predicting Patient Outcomes: Classifying patients based on their risk of developing certain conditions or the likely outcomes of treatments.
2. Finance:
o Credit Scoring: Classifying loan applicants as low or high risk.
o Fraud Detection: Identifying fraudulent transactions by classifying them as legitimate or suspicious.
3. Marketing:
o Customer Segmentation: Grouping customers based on their purchasing behavior.
o Churn Prediction: Predicting which customers are likely to stop using a service.
4. Image and Video Analysis:
o Object Detection and Recognition: Identifying and classifying objects within images or videos (e.g., facial recognition, autonomous driving).
o Content Moderation: Detecting and classifying inappropriate content in images and videos on social media platforms.
5. Natural Language Processing:
o Sentiment Analysis: Classifying text data based on sentiment (positive, negative, neutral).
o Spam Detection: Classifying emails or messages as spam or not spam.
o Language Translation: Automatically translating text from one language to another.
6. Cyber Security:
o Intrusion Detection: Classifying network activity to identify potential security threats.
o Malware Detection: Identifying and classifying malicious software.
7. Manufacturing:
o Quality Control: Classifying products as defective or non-defective.
o Predictive Maintenance: Classifying equipment status to predict failures and schedule maintenance.
8. Social Media:
o Recommendation Systems: Classifying user preferences to recommend content or products.
o Fake News Detection: Identifying and classifying news articles as real or fake.
9. Automotive:
o Autonomous Vehicles: Classifying objects around the vehicle to make driving decisions.
o Driver Monitoring: Classifying driver behavior to enhance safety features.
10. Human Resources:
o Resume Screening: Classifying job applications to shortlist candidates.
o Employee Performance: Classifying employees based on performance metrics for promotions or training needs.
11. Retail:
o Inventory Management: Classifying products based on sales data to manage stock levels.
o Demand Forecasting: Predicting product demand by classifying historical sales data.
Classification techniques, including decision trees, support vector machines, neural networks, and ensemble methods, provide the backbone for these applications, enabling automation, accuracy, and efficiency in various domains.
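To make the two-step process of Section 3.1.1 concrete, the sketch below trains a classifier on labeled tuples (learning step) and then predicts class labels for held-out tuples and estimates accuracy (classification step). It is a minimal illustration using scikit-learn and its bundled iris data rather than the loan-decision example above; the dataset and parameter choices are assumptions made only for demonstration.

```python
# Minimal sketch of the two-step classification process (learning + classification),
# assuming scikit-learn is available. The iris data stands in for any labeled tuples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # attribute vectors and class labels

# Keep a separate test set so accuracy is not estimated on the training tuples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Step 1: learning step -- build the classifier from the training tuples.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 2: classification step -- predict class labels for unseen tuples
# and estimate predictive accuracy on the independent test set.
y_pred = clf.predict(X_test)
print("Estimated accuracy on test set:", accuracy_score(y_test, y_pred))
```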
Metrics for Evaluating Classifier Performance
Evaluating the performance of a classification model is crucial to ensure its accuracy and effectiveness. In a classification problem, the category or class of the data is identified based on training data. The model learns from the given dataset and then classifies new data into classes or groups based on that training. It predicts class labels as output, such as Yes or No, 0 or 1, Spam or Not Spam, etc. To evaluate the performance of a classification model, different metrics are used, some of which are as follows:
o Accuracy
o Confusion Matrix
o Precision
o Recall

I. Classification Accuracy
Classification accuracy is a fundamental metric for evaluating the performance of a classification model, providing a quick snapshot of how well the model is performing in terms of correct predictions. It is calculated as the ratio of correct predictions to the total number of input samples:
Accuracy = (number of correct predictions) / (total number of predictions).
It works well when there is an equal number of samples in each class. For example, suppose the training set contains 90% samples of class A and 10% samples of class B. A model that simply predicts class A for every sample then achieves 90% accuracy on the training set. If the same model is tested on a test set with 60% samples from class A and 40% from class B, the accuracy falls to 60%. Classification accuracy is useful, but it can give a false sense of achieving high performance, because the chance of misclassifying minority-class samples is very high.

II. Confusion Matrix
A confusion matrix is a tabular representation of the prediction outcomes of a binary classifier; it is used to describe the performance of the classification model on a set of test data for which the true values are known. The confusion matrix is simple to implement, but the terminology used in it can be confusing for beginners. A typical confusion matrix for a binary classifier has the following layout (it can be extended to classifiers with more than two classes):

                 Predicted: Yes          Predicted: No
Actual: Yes      True Positive (TP)      False Negative (FN)
Actual: No       False Positive (FP)     True Negative (TN)

Example: Suppose a classifier was tested on 165 patients for the presence of a disease. From its confusion matrix we can determine the following:
o In the matrix, the columns hold the predicted values and the rows hold the actual values. Both take two possible classes, Yes or No. So, if we are predicting the presence of a disease in a patient, Prediction = Yes means the patient has the disease, and Prediction = No means the patient does not have the disease.
o In this example, the total number of predictions is 165, of which the classifier predicted Yes 110 times and No 55 times.
o In reality, there are 60 cases in which the patients do not have the disease and 105 cases in which they do.
In general, the table is divided into four terms, which are as follows:
1. True Positive (TP): The model predicts the positive class, and the actual class is also positive.
2. True Negative (TN): The model predicts the negative class, and the actual class is also negative.
3. False Positive (FP): The model predicts the positive class, but the actual class is negative.
4. False Negative (FN): The model predicts the negative class, but the actual class is positive.

III. Precision
The precision metric is used to overcome the limitation of accuracy. Precision is the proportion of positive predictions that are actually correct. It is calculated as the number of true positives divided by the total number of positive predictions (true positives plus false positives):
Precision = TP / (TP + FP).

IV. Recall or Sensitivity
Recall is similar to the precision metric; however, it measures the proportion of actual positives that are correctly identified. It is calculated as the number of true positives divided by the total number of actual positives, i.e., those correctly predicted as positive plus those incorrectly predicted as negative (true positives plus false negatives):
Recall = TP / (TP + FN).

Example Problem: A model classifies emails as spam or not spam. After running the model, you compare its predictions to the actual classifications and get the following results:
o True Positives (TP): 50 emails correctly identified as spam.
o True Negatives (TN): 40 emails correctly identified as not spam.
o False Positives (FP): 10 emails incorrectly identified as spam.
o False Negatives (FN): 5 emails incorrectly identified as not spam.
Using this example, the Accuracy, Precision, and Recall are:
1) Accuracy = (TP + TN) / (TP + TN + FP + FN) = (50 + 40) / (50 + 40 + 10 + 5) = 90 / 105 ≈ 0.857 (85.7%)
2) Precision = TP / (TP + FP) = 50 / (50 + 10) = 50 / 60 ≈ 0.833 (83.3%)
3) Recall = TP / (TP + FN) = 50 / (50 + 5) = 50 / 55 ≈ 0.909 (90.9%)
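The same numbers can be checked in code. The sketch below rebuilds the spam example from its four counts and lets scikit-learn compute the confusion matrix, accuracy, precision, and recall; the label arrays are synthetic stand-ins constructed only to match the counts above.

```python
# Rebuild the spam example (TP=50, TN=40, FP=10, FN=5) as label arrays and
# verify the hand-computed metrics with scikit-learn.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# 1 = spam (positive class), 0 = not spam (negative class)
y_true = np.array([1] * 50 + [0] * 40 + [0] * 10 + [1] * 5)
y_pred = np.array([1] * 50 + [0] * 40 + [1] * 10 + [0] * 5)

print(confusion_matrix(y_true, y_pred))                # rows = actual, columns = predicted
print("Accuracy :", accuracy_score(y_true, y_pred))    # ~0.857
print("Precision:", precision_score(y_true, y_pred))   # ~0.833
print("Recall   :", recall_score(y_true, y_pred))      # ~0.909
```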
3.3. Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled training tuples.

Decision tree: A decision tree is a flowchart-like tree structure in which each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The topmost node in a tree is the root node.

The decision tree is a supervised learning method used in data mining for classification and regression. It is a tree that helps us in decision making: it builds classification or regression models as a tree structure, separating a data set into smaller and smaller subsets while the tree is incrementally developed. The final tree consists of decision nodes and leaf nodes; a decision node has at least two branches, and the leaf nodes represent a classification or decision. Decision trees can deal with both categorical and numerical data.

For example, a decision tree for the concept buy_computer indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.

The benefits of having a decision tree are as follows:
o It does not require any domain knowledge.
o It is easy to comprehend.
o The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm
In 1980, the machine learning researcher J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). He later presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner. The algorithm is called with three parameters:
1. Data partition (D)
2. Attribute list
3. Attribute selection method
The data partition D is the complete set of training tuples and their associated class labels. The attribute list is the list of attributes describing the tuples. The attribute selection method specifies a heuristic procedure for selecting the attribute that "best" discriminates the given tuples according to class; this procedure employs an attribute selection measure such as information gain or the Gini index. Whether the tree is strictly binary is generally driven by the attribute selection measure. Some attribute selection measures, such as the Gini index, enforce a binary tree; others, like information gain, do not, thereby allowing multiway splits (i.e., two or more branches to be grown from a node).
Basic algorithm for inducing a decision tree from training tuples:
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class (steps 2 and 3). Note that steps 4 and 5 are terminating conditions; all terminating conditions are explained at the end of the algorithm. Otherwise, the algorithm calls the attribute selection method to determine the splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the "best" way to separate or partition the tuples in D into individual classes (step 6). It also tells us which branches to grow from node N with respect to the outcomes of the chosen test. More specifically, the splitting criterion indicates the splitting attribute and may also indicate either a split-point or a splitting subset. The splitting criterion is determined so that, ideally, the resulting partitions at each branch are as "pure" as possible; a partition is pure if all the tuples in it belong to the same class. In other words, if we split up the tuples in D according to the mutually exclusive outcomes of the splitting criterion, we hope for the resulting partitions to be as pure as possible. The node N is labeled with the splitting criterion, which serves as a test at the node (step 7). A branch is grown from node N for each of the outcomes of the splitting criterion, and the tuples in D are partitioned accordingly (steps 10 and 11).

There are three possible scenarios, as illustrated in Figure 8.4. Let A be the splitting attribute; A has v distinct values, {a1, a2, ..., av}, based on the training data.

Figure 8.4: Three possibilities for partitioning tuples based on the splitting criterion, each with examples. (a) If A is discrete-valued, then one branch is grown for each known value of A. (b) If A is continuous-valued, then two branches are grown, corresponding to A ≤ split point and A > split point. (c) If A is discrete-valued and a binary tree must be produced, then the test is of the form A ∈ SA, where SA is the splitting subset for A.

1. A is discrete-valued: The outcomes of the test at node N correspond directly to the known values of A, and one branch is grown for each known value (Figure 8.4a). Because all the tuples in a given partition have the same value for A, A need not be considered in any future partitioning of the tuples; therefore, it is removed from the attribute list (steps 8 and 9).

2. A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding to the conditions A ≤ split point and A > split point, respectively, where split point is the split-point returned by the attribute selection method as part of the splitting criterion. (In practice, the split-point, a, is often taken as the midpoint of two known adjacent values of A and therefore may not actually be a preexisting value of A from the training data.) Two branches are grown from N and labeled according to these outcomes (Figure 8.4b). The tuples are partitioned such that D1 holds the subset of class-labeled tuples in D for which A ≤ split point, while D2 holds the rest.

3. A is discrete-valued and a binary tree must be produced (as dictated by the attribute selection measure or algorithm being used): The test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A, returned by the attribute selection method as part of the splitting criterion. It is a subset of the known values of A. If a given tuple has value aj of A and aj ∈ SA, then the test at node N is satisfied. Two branches are grown from N (Figure 8.4c). By convention, the left branch out of N is labeled yes, so that D1 corresponds to the subset of class-labeled tuples in D that satisfy the test; the right branch out of N is labeled no, so that D2 corresponds to the subset of class-labeled tuples in D that do not satisfy the test.
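These three partitioning scenarios can be expressed directly in code. The sketch below is a small illustration, not part of the textbook algorithm: it partitions a list of tuples (dictionaries) by a discrete attribute, by a continuous split-point, and by a splitting subset; the attribute names and sample records are assumptions for demonstration only.

```python
# Illustrative partitioning of tuples for the three split types in Figure 8.4.
# Records and attribute names here are made up for demonstration.
from collections import defaultdict

D = [
    {"age": 25, "income": "high",   "student": "no",  "buys_computer": "no"},
    {"age": 38, "income": "medium", "student": "no",  "buys_computer": "yes"},
    {"age": 45, "income": "low",    "student": "yes", "buys_computer": "yes"},
    {"age": 52, "income": "medium", "student": "yes", "buys_computer": "yes"},
]

# (a) Discrete-valued attribute: one partition per known value of A.
def split_discrete(D, A):
    parts = defaultdict(list)
    for t in D:
        parts[t[A]].append(t)
    return dict(parts)

# (b) Continuous-valued attribute: two partitions, A <= split_point and A > split_point.
def split_continuous(D, A, split_point):
    D1 = [t for t in D if t[A] <= split_point]
    D2 = [t for t in D if t[A] > split_point]
    return D1, D2

# (c) Discrete-valued attribute with a binary tree: test "A in SA" for a splitting subset SA.
def split_subset(D, A, SA):
    D1 = [t for t in D if t[A] in SA]       # left branch: "yes"
    D2 = [t for t in D if t[A] not in SA]   # right branch: "no"
    return D1, D2

print(split_discrete(D, "income"))
print(split_continuous(D, "age", split_point=41.5))
print(split_subset(D, "income", SA={"low", "medium"}))
```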
The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition, Dj, of D (step 14). The recursive partitioning stops only when any one of the following terminating conditions is true:
1. All the tuples in partition D (represented at node N) belong to the same class (steps 2 and 3).
2. There are no remaining attributes on which the tuples may be further partitioned (step 4). In this case, majority voting is employed (step 5): node N is converted into a leaf and labeled with the most common class in D. Alternatively, the class distribution of the node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12). In this case, a leaf is created with the majority class in D (step 13).
The resulting decision tree is returned (step 15).

The computational complexity of the algorithm, given training set D, is O(n × |D| × log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D. This means that the computational cost of growing a tree grows at most as n × |D| × log(|D|) with |D| tuples. The proof is left as an exercise for the reader. Incremental versions of decision tree induction have also been proposed: when given new training data, these restructure the decision tree acquired from learning on previous training data, rather than relearning a new tree from scratch.

Advantages of Using Decision Trees:
1. Easy to understand and interpret: Decision trees are a visual and intuitive model that can be understood by both experts and non-experts.
2. Handle both numerical and categorical data: Decision trees can handle a mix of numerical and categorical data, which makes them suitable for many different types of datasets.
3. Can handle large amounts of data: Decision trees can handle large amounts of data and can be updated as new data becomes available.
4. Can be used for both classification and regression tasks: classification predicts a discrete outcome, while regression predicts a continuous outcome.

Disadvantages of Decision Tree Induction:
1. Prone to overfitting: Decision trees can become too complex and may not generalize well to new data, which can lead to poor performance on unseen data.
2. Sensitive to small changes in the data: A small change in the data can result in a significantly different tree.
3. Biased towards attributes with many levels: Decision trees can be biased towards attributes with many distinct values and may not perform as well on attributes with only a few values.
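The basic induction procedure above can be sketched in Python. This is a simplified outline under stated assumptions: discrete-valued attributes only, multiway splits, tuples represented as dictionaries; attribute_selection_method is a placeholder for any measure (such as the information-gain selector defined in the next section), and attribute_values maps each attribute to its known values so that empty branches can be detected.

```python
# Simplified sketch of top-down, recursive decision tree induction (ID3-style),
# assuming discrete-valued attributes and multiway splits. The parameters mirror
# the three inputs above: a data partition D (list of dicts), an attribute list,
# and a pluggable attribute selection method.
from collections import Counter

def majority_class(D, class_attr):
    """Most common class label in partition D (used for majority voting)."""
    return Counter(t[class_attr] for t in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_values,
                           attribute_selection_method, class_attr):
    classes = {t[class_attr] for t in D}

    # Terminating condition 1: all tuples in D belong to the same class.
    if len(classes) == 1:
        return {"leaf": classes.pop()}

    # Terminating condition 2: no remaining attributes -> majority voting.
    if not attribute_list:
        return {"leaf": majority_class(D, class_attr)}

    # Determine the splitting criterion with the attribute selection method.
    A = attribute_selection_method(D, attribute_list, class_attr)
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]  # A is removed (discrete case)

    # Grow one branch per known value of A and recurse on each partition Dj.
    for value in attribute_values[A]:
        Dj = [t for t in D if t[A] == value]
        if not Dj:
            # Terminating condition 3: empty partition -> leaf labeled with
            # the majority class of D.
            node["branches"][value] = {"leaf": majority_class(D, class_attr)}
        else:
            node["branches"][value] = generate_decision_tree(
                Dj, remaining, attribute_values, attribute_selection_method, class_attr)
    return node
```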
Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes. It determines how the tuples at a given node are to be split, and it provides a ranking for each attribute describing the given training tuples. The following measures are used for attribute selection:
1. Entropy
2. Information Gain
3. Gain Ratio
4. Gini Index

Entropy: Entropy is a common way to measure impurity. In a decision tree, it measures the randomness or impurity in the data set.

Information Gain: Information gain is used for deciding the best feature/attribute, i.e., the one that provides the most information about a class. It builds on entropy, aiming to reduce the level of entropy from the root node to the leaf nodes; the chosen attribute minimizes the information needed to classify the tuples in the resulting partitions.

Information Gain Algorithm: Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m distinct values defining m distinct classes Ci (for i = 1, ..., m). Let Ci,D be the set of tuples of class Ci in D, and let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively. The expected information needed to classify a tuple in D is given by
Info(D) = − Σ (i = 1 to m) pi log2(pi),
where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. Info(D) is the average amount of information needed to identify the class label of a tuple in D; it is also known as the entropy of D.

Now suppose we partition the tuples in D on some attribute A having v distinct values, {a1, a2, ..., av}. The expected information required to classify a tuple from D based on the partitioning by A is
InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj),
where the term |Dj| / |D| acts as the weight of the j-th partition. Information gain is defined as the difference between the original information requirement and the new requirement (i.e., the requirement obtained after partitioning on A):
Gain(A) = Info(D) − InfoA(D).
The attribute A with the highest information gain is chosen as the splitting attribute.

Information gain measures the expected reduction in entropy: entropy measures impurity in the data, and information gain measures the reduction in impurity achieved by a split. The feature that yields the purest partitions is chosen for the root node, and information gain is used to decide which feature to split on at each step in building the tree. Creating sub-nodes increases homogeneity, that is, decreases the entropy of these nodes; the more homogeneous the child nodes, the greater the reduction in impurity (sometimes described as variance reduction) after each split. The information gain at a node can therefore be calculated as the entropy of the parent node minus the weighted average entropy of the child nodes.

Example of Information Gain: A dataset has 10 observations belonging to two classes, YES and NO, where 6 observations belong to class YES and 4 observations belong to class NO. Then
Info(D) = −(6/10) log2(6/10) − (4/10) log2(4/10) ≈ 0.971 bits.
For a dataset with many features, the information gain of each feature is calculated; the feature with the maximum information gain is the most important feature and becomes the root node of the decision tree.
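The formulas above translate directly into code. The sketch below defines entropy and information gain for discrete attributes and checks the 6-YES / 4-NO example; the final helper matches the attribute_selection_method placeholder used in the induction sketch earlier. The record layout (dictionaries with a class attribute) is an assumption for illustration.

```python
# Entropy (Info(D)) and information gain (Gain(A)) for discrete-valued attributes.
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class distribution of `labels`."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(D, A, class_attr):
    """Gain(A) = Info(D) - Info_A(D) for attribute A over partition D (list of dicts)."""
    labels = [t[class_attr] for t in D]
    info_D = entropy(labels)
    info_A = 0.0
    for value in {t[A] for t in D}:
        Dj = [t[class_attr] for t in D if t[A] == value]
        info_A += (len(Dj) / len(D)) * entropy(Dj)
    return info_D - info_A

# The 6-YES / 4-NO example: Info(D) is about 0.971 bits.
print(entropy(["YES"] * 6 + ["NO"] * 4))

# A best-attribute selector built on information gain
# (usable as attribute_selection_method in the earlier induction sketch).
def select_by_information_gain(D, attribute_list, class_attr):
    return max(attribute_list, key=lambda A: information_gain(D, A, class_attr))
```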
Gain Ratio / Uncertainty Coefficient
Gain ratio is an alternative to information gain for selecting the splitting attribute in a decision tree. It is used to overcome the bias of information gain towards attributes with many outcomes. For example, suppose we have two features, "Color" and "Size", and we want to build a decision tree to predict the type of fruit based on these two features. The "Color" feature has three outcomes (red, green, yellow) and the "Size" feature has two outcomes (small, large). Using the information gain method, the "Color" feature would be chosen as the best feature to split on because it has the highest information gain. However, this could be a problem, because the "Size" feature might be the better feature to split on: it is less ambiguous and has fewer outcomes. To overcome this problem, we can use the gain ratio, a measure that takes into account both the information gain and the number of outcomes of a feature when determining the best feature to split on. The gain ratio normalizes the information gain by the split information:
GainRatio(A) = Gain(A) / SplitInfoA(D),
where SplitInfoA(D) = − Σ (j = 1 to v) (|Dj| / |D|) log2(|Dj| / |D|) is the potential information generated by splitting D into v partitions. From this formula it can be seen that if the split information is very small, the gain ratio will be high, and vice versa. To decide which attribute should be selected as the splitting criterion, Quinlan proposed the following procedure:
1. First, determine the information gain of all the attributes, and then compute the average information gain.
2. Second, calculate the gain ratio of all the attributes whose information gain is larger than or equal to the computed average information gain, and then pick the attribute with the highest gain ratio to split on.
Why use the gain ratio:
o Correcting for bias: While information gain tends to favor features with a large number of distinct values, the gain ratio corrects this bias, making it a more balanced criterion.
o Better feature selection: It helps select features that are not only good at splitting the data but also do so in a balanced way, leading to better generalization and performance of the decision tree.

Example Problem: Suppose we have a small dataset of 14 samples, each classified as either "Yes" or "No" for whether a person will play tennis, based on two features: Outlook and Wind.

Outlook     Wind     Play Tennis
Sunny       Weak     No
Sunny       Strong   No
Overcast    Weak     Yes
Rain        Weak     Yes
Rain        Weak     Yes
Rain        Strong   No
Overcast    Strong   Yes
Sunny       Weak     No
Sunny       Weak     Yes
Rain        Weak     Yes
Sunny       Strong   Yes
Overcast    Weak     Yes
Overcast    Strong   Yes
Rain        Strong   No

Gini Index / Gini Impurity
The Gini index (also called Gini impurity) can also be used for feature selection: the tree chooses the feature that minimizes the Gini impurity, and a higher value of the Gini index indicates higher impurity. The Gini index favors large partitions, is very simple to implement, and performs only binary splits. For categorical variables, it gives results in terms of "success" or "failure". The Gini index is another metric used in decision tree algorithms, particularly in CART (Classification and Regression Trees); it measures the impurity of a data set, where a lower Gini index indicates a purer node. The Gini index is calculated as
Gini(D) = 1 − Σ (i = 1 to c) pi²,
where c is the number of classes and pi is the probability associated with the i-th class.

Example Problem: Suppose we again want to decide whether a person will play tennis based on the Outlook and Wind features, using the same 14-sample dataset as in the previous example.
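Both example problems can be worked in code on the same 14 samples. The sketch below computes, for Outlook and Wind, the information gain, split information, gain ratio, and the weighted Gini index of the split; it is self-contained, so the small entropy and Gini helpers are redefined here, and the tuple layout is an assumption for illustration.

```python
# Gain ratio and Gini index for the 14-sample play-tennis data (Outlook, Wind).
import math
from collections import Counter

data = [
    ("Sunny", "Weak", "No"),       ("Sunny", "Strong", "No"),  ("Overcast", "Weak", "Yes"),
    ("Rain", "Weak", "Yes"),       ("Rain", "Weak", "Yes"),    ("Rain", "Strong", "No"),
    ("Overcast", "Strong", "Yes"), ("Sunny", "Weak", "No"),    ("Sunny", "Weak", "Yes"),
    ("Rain", "Weak", "Yes"),       ("Sunny", "Strong", "Yes"), ("Overcast", "Weak", "Yes"),
    ("Overcast", "Strong", "Yes"), ("Rain", "Strong", "No"),
]
FEATURES = {"Outlook": 0, "Wind": 1}
LABEL = 2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_scores(rows, feat_idx):
    labels = [r[LABEL] for r in rows]
    info_D, n = entropy(labels), len(rows)
    info_A = split_info = weighted_gini = 0.0
    for value in set(r[feat_idx] for r in rows):
        part = [r[LABEL] for r in rows if r[feat_idx] == value]
        w = len(part) / n
        info_A += w * entropy(part)            # Info_A(D)
        split_info += -w * math.log2(w)        # SplitInfo_A(D)
        weighted_gini += w * gini(part)        # Gini of the split
    gain = info_D - info_A                     # Gain(A)
    return gain, gain / split_info, weighted_gini

for name, idx in FEATURES.items():
    gain, gain_ratio, g = split_scores(data, idx)
    print(f"{name}: gain={gain:.3f}  gain_ratio={gain_ratio:.3f}  gini_split={g:.3f}")
# Outlook should come out ahead of Wind on both criteria
# (gain about 0.247 vs 0.048; weighted Gini about 0.343 vs 0.429).
```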
Tree Pruning (Pruning Decision Trees)
Decision tree pruning is a technique used to optimize decision tree models by reducing overfitting and improving generalization to new data. Pruning means changing the model by deleting the child nodes of a branch node; the pruned node is then regarded as a leaf node. Leaf nodes themselves cannot be pruned. Pruning reduces the size of the tree by removing sections that provide little to no power in predicting the target variable, with the goal of improving the model's generalization and preventing overfitting.

A decision tree consists of a root node, several branch nodes, and several leaf nodes. The root node represents the top of the tree; it has no parent node but has several child nodes. Branch nodes are in the middle of the tree; a branch node has a parent node and several child nodes. Leaf nodes represent the bottom of the tree; a leaf node has a parent node but no child nodes.

Types of Decision Tree Pruning
There are two main types of decision tree pruning:
1) Pre-Pruning (Early Stopping)
2) Post-Pruning (Reducing Nodes)

Pre-Pruning (Early Stopping): The growth of the decision tree can be stopped before it becomes too complex; this is called pre-pruning. It helps prevent overfitting of the training data, which would result in poor performance on new data. Common pre-pruning techniques include:
o Maximum Depth: Limit the maximum depth of the decision tree.
o Minimum Samples per Leaf: Set a minimum threshold for the number of samples in each leaf node.
o Minimum Samples per Split: Specify the minimum number of samples needed to split a node.
o Maximum Features: Restrict the number of features considered for splitting.
By pruning early, we end up with a simpler tree that is less likely to overfit the training data.

Post-Pruning (Reducing Nodes): After the tree is fully grown, post-pruning removes branches or nodes to improve the model's ability to generalize. Common post-pruning techniques include:
o Cost-Complexity Pruning (CCP): Assigns a cost to each subtree based on its accuracy and complexity, then selects the subtree with the lowest cost.
o Reduced Error Pruning: Removes branches that do not significantly affect the overall accuracy.
o Minimum Impurity Decrease: Prunes nodes if the decrease in impurity (Gini impurity or entropy) is below a certain threshold.
o Minimum Leaf Size: Removes leaf nodes with fewer samples than a specified threshold.
Post-pruning simplifies the tree while preserving its accuracy.

Decision tree pruning improves the performance and interpretability of decision trees by reducing their complexity and avoiding overfitting. Proper pruning leads to simpler and more robust models that generalize better to unseen data.

Example of Tree Pruning (Loan Approval): Imagine a bank wants to build a decision tree model to predict whether a loan applicant will repay a loan or default. The dataset includes the following features:
o Income (High, Medium, Low)
o Credit Score (Excellent, Good, Fair, Poor)
o Loan Amount (High, Medium, Low)
o Loan Purpose (Home, Car, Education, Personal)
o Repayment Status (Yes, No) [target variable]
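In scikit-learn, pre-pruning corresponds to constructor limits such as max_depth, min_samples_leaf, and min_samples_split, while post-pruning is available through cost-complexity pruning (ccp_alpha). The sketch below shows both; since the loan dataset above is not available here, make_classification generates a synthetic stand-in, which is an assumption for illustration only.

```python
# Pre-pruning (depth / leaf-size limits) and post-pruning (cost-complexity pruning)
# with scikit-learn. Synthetic data stands in for the loan-approval example.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with structural limits.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                    min_samples_split=20, random_state=0)
pre_pruned.fit(X_train, y_train)
print("Pre-pruned accuracy:", pre_pruned.score(X_test, y_test))

# Post-pruning: compute the cost-complexity pruning path, then pick an alpha.
# (In practice, a validation set or cross-validation would be used to choose alpha.)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    score = tree.score(X_test, y_test)
    if score >= best_score:
        best_alpha, best_score = alpha, score
print("Best ccp_alpha:", best_alpha, "accuracy:", best_score)
```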
Issues in Decision Trees
1. Avoiding Overfitting the Data: When we design a machine learning model, the model is said to be good if it generalizes properly to new input data from the problem domain. This allows us to make predictions on future data that the model has never seen.

2. Underfitting: A machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data. Underfitting destroys the accuracy of the model; its occurrence simply means that the model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases the rules of the model are too simple and too general to be applied to such minimal data, and the model will probably make many wrong predictions. Underfitting can be avoided by using more data and by reducing the number of features through feature selection.

3. Overfitting: A machine learning algorithm is said to overfit when it fits the training data too closely: the model starts learning from the noise and inaccurate entries in the data set, and then fails to categorize new data correctly because of all that detail and noise. Overfitting is typically caused by non-parametric and non-linear methods, because these algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. One way to avoid overfitting is to use a linear algorithm if the data are linear, or to use parameters such as the maximum depth when using decision trees.

Practical issues in learning decision trees include determining how deeply to grow the decision tree, as well as:
1. handling continuous attributes,
2. choosing an appropriate attribute selection measure,
3. handling training data with missing attribute values,
4. handling attributes with differing costs, and
5. improving computational efficiency.

Model Evaluation and Selection
Model evaluation is the process of using different evaluation metrics to understand a machine learning model's performance, as well as its strengths and weaknesses. It is important for assessing the efficacy of a model during the initial research phases, and it also plays a role in model monitoring. Model selection is an essential phase in developing powerful and accurate predictive models: it is the process of deciding which algorithm and model architecture is best suited for a particular task or dataset.

Steps in Model Evaluation and Selection
1. Data Splitting
o Training Set: Used to train the model.
o Validation Set: Used to tune model parameters and select the best model.
o Test Set: Used to evaluate the final model's performance.
Common data splitting methods include:
o Holdout Method: Splitting the data into training, validation, and test sets.
o Cross-Validation: Especially k-fold cross-validation, where the data is split into k subsets and the model is trained and validated k times, each time using a different subset as the validation set and the remaining k-1 subsets as the training set.
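A brief sketch of the holdout split plus k-fold cross-validation using scikit-learn follows; the dataset (breast cancer) and the choice of k = 5 are assumptions made only for illustration.

```python
# Holdout split and k-fold cross-validation with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Holdout: keep a test set aside for the final evaluation only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# 5-fold cross-validation on the training data: each fold serves once as the
# validation set while the other k-1 folds are used for training.
model = DecisionTreeClassifier(max_depth=4, random_state=1)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
```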
2. Choosing Evaluation Metrics
The choice of evaluation metric depends on the problem type (classification, regression, etc.) and the specific goals of the model.
Classification Metrics:
o Accuracy: The proportion of correctly predicted instances.
o Precision: The proportion of true positive predictions among all positive predictions.
o Recall (Sensitivity): The proportion of true positives identified among all actual positives.
o F1 Score: The harmonic mean of precision and recall, useful when classes are imbalanced.
o ROC-AUC: The area under the receiver operating characteristic curve, assessing the trade-off between true positive rate and false positive rate.
Regression Metrics:
o Mean Absolute Error (MAE): The average of absolute differences between predicted and actual values.
o Mean Squared Error (MSE): The average of squared differences between predicted and actual values.
o Root Mean Squared Error (RMSE): The square root of MSE, providing error in the same units as the target variable.
o R-squared: The proportion of variance in the target variable explained by the model.
Other Metrics:
o Logarithmic Loss (Log Loss): Used for probabilistic predictions in classification tasks.
o Confusion Matrix: A summary of prediction results on a classification problem.

3. Model Selection Techniques
o Cross-Validation: The most common technique for model selection, helping to avoid overfitting and underfitting by providing a more reliable estimate of model performance.
o Grid Search: An exhaustive search over specified parameter values for an estimator, often combined with cross-validation.
o Random Search: Similar to grid search but randomly samples parameter combinations, which can be more efficient.
o Bayesian Optimization: An advanced method that builds a probabilistic model to explore the parameter space more efficiently.

4. Bias-Variance Tradeoff
o Bias: Error due to overly simplistic assumptions in the learning algorithm. High bias can cause underfitting.
o Variance: Error due to model complexity and sensitivity to small fluctuations in the training data. High variance can cause overfitting.
o Goal: Find a balance between bias and variance that minimizes total error.

5. Hyperparameter Tuning
Models often have hyperparameters that must be set before the learning process begins, and these can significantly affect model performance. Techniques such as grid search, random search, and Bayesian optimization are used to find the best hyperparameter values (see the sketch after this list of steps).

6. Model Comparison
o Model Performance: Compare models based on performance metrics. For example, in classification, a model with higher accuracy and F1 score may be preferred.
o Model Complexity: Simpler models are often preferred if they perform comparably to more complex ones.
o Computational Efficiency: Consider the training and inference time, especially for large datasets or real-time applications.

7. Ensemble Methods
Combine the predictions of multiple models to create a more robust model. Common ensemble techniques include:
o Bagging: Builds multiple models (typically of the same type) and combines their predictions, e.g., Random Forest.
o Boosting: Sequentially builds models that correct the errors of previous models, e.g., AdaBoost, Gradient Boosting.
o Stacking: Combines different models (meta-modeling) to leverage the strengths of each.

8. Final Model Evaluation
Once the best model is selected, it should be evaluated on the test set to estimate its performance on unseen data. This step provides a final estimate of the model's generalization ability and helps confirm that the model selection process was effective.
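As referenced in the hyperparameter tuning step above, the sketch below combines grid search with cross-validation using scikit-learn's GridSearchCV; the parameter grid and dataset are illustrative assumptions, not a prescribed configuration.

```python
# Hyperparameter tuning with grid search combined with cross-validation (GridSearchCV).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1 score:", search.best_score_)
```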
Example: Model Evaluation and Selection in a Classification Task
Suppose you are working on a binary classification problem to predict whether a customer will churn (leave the service). You have several candidate models: Logistic Regression, Decision Tree, Random Forest, and a Support Vector Machine (SVM).
1. Split the Data: Use 70% of the data for training and 30% for testing. Use cross-validation on the training set to evaluate the different models.
2. Choose Evaluation Metrics:
o Primary metric: F1 Score (due to class imbalance).
o Secondary metrics: Accuracy, Precision, Recall, and ROC-AUC.
3. Model Selection:
o Perform cross-validation to compare models.
o Use grid search to tune hyperparameters for each model.
o Compare models based on cross-validation performance.
4. Bias-Variance Tradeoff:
o Analyze learning curves to check for underfitting or overfitting.
o If the Random Forest overfits, try reducing the number of trees or the maximum depth.
5. Final Model:
o Suppose the Random Forest performs best, with an F1 score of 0.85 on cross-validation.
o Evaluate the final Random Forest model on the test set to confirm its performance.
6. Result:
o If the test set F1 score is consistent with the cross-validation results, the Random Forest model is selected for deployment.
By following these steps, you ensure that the chosen model is the best fit for your data, balances bias and variance, and is likely to perform well on unseen data. A sketch of this workflow in code is given below.
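The sketch below mirrors the churn example under clearly stated assumptions: there is no real churn dataset here, so make_classification generates an imbalanced synthetic stand-in, and the candidate models are the four named above with near-default settings.

```python
# End-to-end sketch of the churn model-selection workflow on synthetic,
# imbalanced stand-in data (no real churn dataset is used here).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2],
                           random_state=42)  # roughly 20% "churn" class

# Step 1: 70/30 split; cross-validation is done on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42),
}

# Steps 2-3: compare candidates by cross-validated F1 score (primary metric).
best_name, best_model, best_cv = None, None, -1.0
for name, model in candidates.items():
    cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()
    print(f"{name}: mean CV F1 = {cv_f1:.3f}")
    if cv_f1 > best_cv:
        best_name, best_model, best_cv = name, model, cv_f1

# Steps 5-6: fit the selected model and confirm its performance on the held-out test set.
best_model.fit(X_train, y_train)
test_f1 = f1_score(y_test, best_model.predict(X_test))
print(f"Selected {best_name}; test F1 = {test_f1:.3f} (compare with CV F1 = {best_cv:.3f})")
```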