Fundamentals of Data Science PDF

Document Details


Dr. Nermeen Ghazy

Tags

data visualization, data science, decision trees, data exploration

Summary

These lecture notes cover fundamental topics in data science, focusing on data visualization techniques. Concepts and methods for data representation are detailed. The notes also include a roadmap for data exploration, along with examples of using decision trees.

Full Transcript

Fundamentals of Data Science (DS302)
Dr. Nermeen Ghazy

Reference Books
Data Science: Concepts and Practice, Vijay Kotu and Bala Deshpande, 2019.
Data Science: Foundation & Fundamentals, B. S. V. Vatika, L. C. Dabra, Gwalior, 2023.

Data Visualization
Visualizing data is one of the most important techniques of data discovery and exploration. Though visualization is not considered a data science technique in itself, terms like visual mining or pattern discovery based on visuals are increasingly used in the context of data science, particularly in the business world. The discipline of data visualization encompasses the methods of expressing data in an abstract visual form. The visual representation of data provides easy comprehension of complex data with multiple attributes and their underlying relationships.

The motivation for using data visualization includes:

Comprehension of dense information: A simple visual chart can easily include thousands of data points. By using visuals, the user can see the big picture as well as longer-term trends that are extremely difficult to interpret purely by expressing the data in numbers.

Relationships: Visualizing data in Cartesian coordinates enables exploration of the relationships between the attributes. Although representing more than three attributes on the x-, y-, and z-axes is not feasible in Cartesian coordinates, there are a few creative solutions available, such as changing properties like the size, color, and shape of data markers, or using flow maps, where more than two attributes are shown in a two-dimensional medium.

Univariate Visualization
Visual exploration starts with investigating one attribute at a time using univariate charts. The techniques discussed in this section give an idea of how the attribute values are distributed and the shape of the distribution.

Histogram
A histogram is one of the most basic visualization techniques for understanding the frequency of occurrence of values. It shows the distribution of the data by plotting the frequency of occurrence within a range. In a histogram, the attribute under inquiry is shown on the horizontal axis and the frequency of occurrence on the vertical axis. For a continuous numeric data type, the range or binning value used to group values needs to be specified. The quantile plot is another univariate chart shown in the slides.

Multivariate Visualization
Multivariate visual exploration considers more than one attribute in the same visual. The techniques discussed in this section focus on the relationship of one attribute with another attribute. These visualizations examine two to four attributes simultaneously.
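As a concrete illustration of univariate and multivariate exploration, the following is a minimal Python sketch using pandas and matplotlib (not part of the original notes); the file name and the columns "age", "income", and "purchases" are hypothetical placeholders.

```python
# Minimal sketch: a univariate histogram and a multivariate scatter/bubble chart.
# The dataset, file name, and column names are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")  # hypothetical dataset

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Univariate: histogram of one attribute, binned into ranges.
ax1.hist(df["age"], bins=20)
ax1.set_xlabel("age")
ax1.set_ylabel("frequency")
ax1.set_title("Histogram (univariate)")

# Multivariate: scatter plot of two attributes; marker size encodes a third.
ax2.scatter(df["age"], df["income"], s=df["purchases"] * 10, alpha=0.5)
ax2.set_xlabel("age")
ax2.set_ylabel("income")
ax2.set_title("Bubble chart (multivariate)")

plt.tight_layout()
plt.show()
```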
Multivariate chart types shown in the slides include the class-stratified quartile plot, distribution plot, scatter plot, scatter multiple, scatter matrix, bubble chart, density chart, parallel chart, deviation chart, and Andrews curves.

Roadmap for Data Exploration
If there is a new dataset that has not been investigated before, having a structured way to explore and analyze the data will be helpful. Here is a roadmap for inquiring into a new dataset. Not all steps may be relevant for every dataset, and the order may need to be adjusted for some sets, so this roadmap is intended as a guideline.

1. Organize the data set
2. Find the central point for each attribute
3. Understand the spread of the attributes
4. Visualize the distribution of each attribute
5. Pivot the data
6. Watch out for outliers
7. Understand the relationship between attributes
8. Visualize the relationship between attributes
9. Visualize high-dimensional data sets

Models: Classification
Classification is where one enters the realm of data science: it is the process in which historical records are used to predict an uncertain future. At a fundamental level, most data science problems can be categorized into either class prediction or numeric prediction problems. In classification, or class prediction, one tries to use the information from the predictors or independent variables to sort the data samples into two or more distinct classes or buckets. In numeric prediction, one tries to predict the numeric value of a dependent variable using the values assumed by the independent variables. In classification, the target variable is categorical, while the predictors can be of any data type.

Decision Trees
A decision tree is a supervised learning algorithm used for both classification and regression problems. Simply put, it takes the form of a tree whose branches represent the potential answers to a given question. Several metrics are used to train decision trees; one of them is information gain, based on entropy.

Decision trees (also known as classification trees) are probably one of the most intuitive and frequently used data science techniques. From an analyst's point of view, they are easy to set up; from a business user's point of view, they are easy to interpret. Classification trees are used to separate a dataset into classes belonging to the response variable. Usually the response variable has two classes: yes or no (1 or 0). If the response variable has more than two categories, variants of the decision tree algorithm have been developed that may be applied. Classification trees are used when the response or target variable is categorical in nature.

How It Works
A decision tree model takes the form of a decision flowchart where an attribute is tested at each node. At the end of a decision tree path is a leaf node where a prediction is made. The nodes split the dataset into subsets. In a decision tree, the idea is to split the dataset based on the homogeneity of the data.

Tree Split: Entropy
Entropy is an information theory metric that measures the impurity or uncertainty in a group of observations. It determines how a decision tree chooses to split the data. Entropy is defined as log2(1/p), or -log2(p), where p is the probability of an event occurring. If the probability of all events is not identical, a weighted expression is needed and, thus, the entropy H is adjusted as

H = -Σ p_k log2(p_k)

where the sum runs over k = 1, 2, 3, ..., m, the m classes of the target variable, and p_k represents the proportion of samples that belong to class k.
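To illustrate how this definition behaves, here is a small Python sketch (not from the original notes) that computes H for a pure group and for increasingly mixed groups:

```python
# Sketch: entropy H = -sum(p_k * log2(p_k)) as a measure of impurity.
import math

def entropy(proportions):
    """Entropy of a class distribution given as proportions p_k summing to 1."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0]))        # 0.0   -> perfectly homogeneous (pure) group
print(entropy([0.9, 0.1]))   # ~0.47 -> mostly one class, low impurity
print(entropy([0.5, 0.5]))   # 1.0   -> maximum uncertainty for two classes
```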
Gini Index
The Gini index, G, is similar to the entropy measure in its characteristics and is defined as

G = 1 - Σ p_k^2

summed over the m classes k = 1, 2, ..., m. The value of G ranges between 0 and a maximum value of 0.5, but otherwise it has properties identical to H, and either of these formulations can be used to create partitions in the data.

Decision Tree: The Golf Example
All datasets used in the book are available at the companion website; see also http://archive.ics.uci.edu/ml/datasets/.

Start by partitioning the data on each of the four regular attributes. Let us start with Outlook. There are three categories for this variable: sunny, overcast, and rain. When it is overcast, there are four examples where the outcome was Play = yes for all four cases (Fig. 4.2), so the proportion of examples in this case is 100%, or 1.0. If the dataset is split here, the resulting four-sample partition will be 100% pure for Play = yes. Mathematically, the entropy for this partition is zero, since -1.0 × log2(1.0) = 0. Similarly, the entropy in the other two situations for Outlook (sunny and rain) can be calculated.

The total information is calculated as the weighted sum of these component entropies. There are four instances of Outlook = overcast, so the proportion for overcast is p_Outlook=overcast = 4/14. The other proportions (for Outlook = sunny and rain) are 5/14 each:

I_Outlook = (4/14) × H_overcast + (5/14) × H_sunny + (5/14) × H_rain

Had the data not been partitioned along the three values for Outlook, the total information would have been simply the weighted average of the respective entropies for the two classes, whose overall proportions were 5/14 (Play = no) and 9/14 (Play = yes):

I_Outlook,no partition = -(5/14) log2(5/14) - (9/14) log2(9/14) ≈ 0.940

By creating these splits or partitions, some entropy has been reduced (and, thus, some information has been gained). This is called, aptly enough, information gain. In the case of Outlook, it is given simply by:

Information gain = I_Outlook,no partition - I_Outlook

Similar information gain values for the other three attributes can now be computed, as shown in Table 4.2. It is clear that if the dataset is partitioned into three sets along the three values of Outlook, the largest information gain would be experienced.

When to Stop Splitting Data?
In real-world datasets, it is very unlikely to get terminal nodes that are 100% homogeneous, as was just seen in the golf dataset. In this case, the algorithm needs to be instructed when to stop. There are several situations where the process can be terminated:

1. No attribute satisfies a minimum information gain threshold (such as the one computed in Table 4.2).
2. A maximal depth is reached: as the tree grows larger, not only does interpretation get harder, but the model also risks overfitting.
3. There are fewer than a certain number of examples in the current subtree.
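To make the information gain calculation above concrete, here is a short Python sketch (not part of the original notes). The per-partition class counts are assumptions consistent with the proportions quoted in the text for the standard 14-row golf dataset: 9 Play = yes and 5 Play = no overall, with overcast containing 4 yes, sunny 2 yes/3 no, and rain 3 yes/2 no.

```python
# Sketch: information gain of the Outlook split on the classic golf dataset.
# Class counts per partition are assumptions consistent with the text
# (overcast: 4 yes / 0 no; sunny: 2 yes / 3 no; rain: 3 yes / 2 no).
import math

def entropy(counts):
    """Shannon entropy of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Entropy of the target variable before any split: 9 yes vs. 5 no.
i_no_partition = entropy([9, 5])                                           # ~0.940

# Weighted entropy after partitioning on Outlook.
partitions = {"overcast": [4, 0], "sunny": [2, 3], "rain": [3, 2]}
i_outlook = sum((sum(c) / 14) * entropy(c) for c in partitions.values())   # ~0.694

print(f"Information gain for Outlook: {i_no_partition - i_outlook:.3f}")   # ~0.247
```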
Now the application of the decision tree algorithm can be summarized with this simple five-step process:

1. Using Shannon entropy, sort the dataset into homogeneous (by class) and non-homogeneous variables. Homogeneous variables have low information entropy and non-homogeneous variables have high information entropy. This was done in the calculation of I_Outlook,no partition.
2. Weight the influence of each independent variable on the target variable using the entropy-weighted averages. This was done during the calculation of I_Outlook in the example.
3. Compute the information gain, which is essentially the reduction in the entropy of the target variable due to its relationship with each independent variable. This is simply the difference between the information entropy found in step 1 and the joint entropy calculated in step 2. This was done during the calculation of I_Outlook,no partition - I_Outlook.
4. The independent variable with the highest information gain becomes the root, or the first node on which the dataset is divided. This was done using the calculation of the information gain table.
5. Repeat this process for each variable for which the Shannon entropy is nonzero.

Measures of Impurity
Every split tries to make the child nodes more pure. Common measures of impurity are Gini impurity, information gain (entropy), and misclassification error.

Decision Tree: How to Implement?
The complete RapidMiner process for implementing the decision tree model is shown in Fig. 4.5. The key building blocks for this process are the training dataset, test dataset, model building, predicting using the model, predicted dataset, model representation, and performance vector.

The decision tree process has two input datasets: the training dataset and the test dataset. The modeling block builds the decision tree using the training dataset. The Apply Model block predicts the class label of the test dataset using the developed model and appends the predicted label to the dataset. The predicted dataset is one of the three outputs of the process and is shown in Fig. 4.7. Note that the predicted test dataset has both the predicted and the original class label. The model has predicted the correct class for nine of the records, but not for all; the five incorrect predictions are highlighted in Fig. 4.7.

The decision tree model developed using the training dataset is shown in Fig. 4.8. This is a simple decision tree with only three nodes, and the leaf nodes are pure, with a clean split of the data.

The performance evaluation block compares the predicted class label and the original class label in the test dataset to compute performance metrics such as accuracy and recall. Fig. 4.9 shows the accuracy results of the model and the confusion matrix. The model got 9 of the 14 class predictions correct and 5 of the 14 (in boxes) wrong, which translates to about 64% accuracy.
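The notes implement this workflow in RapidMiner. As a rough analogue only (not the RapidMiner process itself), a similar train/apply/evaluate pipeline could be sketched in Python with scikit-learn; the file names and the "Play" target column below are hypothetical placeholders.

```python
# Sketch of an analogous train/apply/evaluate workflow using scikit-learn
# (the notes use RapidMiner; this is only an illustrative substitute).
# File names and column names are hypothetical placeholders.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

train = pd.read_csv("golf_train.csv")   # training dataset
test = pd.read_csv("golf_test.csv")     # test dataset

target = "Play"
# One-hot encode categorical predictors; align test columns with training columns.
X_train = pd.get_dummies(train.drop(columns=[target]))
X_test = pd.get_dummies(test.drop(columns=[target])).reindex(columns=X_train.columns, fill_value=0)

# Model building: an entropy-based tree, mirroring the information gain criterion.
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, train[target])

# Apply the model: append predicted labels to the test dataset.
test["prediction"] = model.predict(X_test)

# Performance vector: accuracy and confusion matrix.
print("Accuracy:", accuracy_score(test[target], test["prediction"]))
print(confusion_matrix(test[target], test["prediction"]))
```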
