Fundamentals of Data Science DS302 PDF
Dr. Nermeen Ghazy
Summary
This document is a lecture on fundamentals of data science. It covers topics like modeling, application, knowledge extraction and data exploration using visual tools like histograms. It also discusses techniques for handling different types of data.
Full Transcript
Reference Books
Data Science: Concepts and Practice, Vijay Kotu and Bala Deshpande, 2019.
DATA SCIENCE: FOUNDATION & FUNDAMENTALS, B. S. V. Vatika, L. C. Dabra, Gwalior, 2023.

Modeling
A model is the abstract representation of the data and the relationships in a given dataset. A simple rule of thumb like “mortgage interest rate reduces with increase in credit score” is a model; although there is not enough quantitative information to use in a production scenario, it provides directional information by abstracting the relationship between credit score and interest rate.

Fig. 2.4 shows the steps in the modeling phase of predictive data science. Association analysis and clustering are descriptive data science techniques where there is no target variable to predict; hence, there is no test dataset. However, both predictive and descriptive models have an evaluation step.

Splitting training and test data sets
The modeling step creates a representative model inferred from the data. The dataset used to create the model, with known attributes and target, is called the training dataset. The validity of the created model also needs to be checked with another known dataset, called the test dataset or validation dataset. To facilitate this process, the overall known dataset can be split into a training dataset and a test dataset. A standard rule of thumb is that two-thirds of the data are used as the training dataset and one-third as the test dataset.

Application
In business applications, the results of the data science process have to be assimilated into the business process, usually in software applications. Deployment is the stage at which the model becomes production-ready. The model deployment stage has to deal with:
1. Product readiness
2. Technical integration
3. Model response time
4. Remodeling
5. Assimilation

Knowledge
The data science process provides a framework to extract nontrivial information from data. To extract knowledge from these massive data assets, advanced approaches such as data science algorithms need to be employed. The data science process starts with prior knowledge and ends with posterior knowledge, which is the incremental insight gained. The data science process can also bring up spurious, irrelevant patterns from the dataset.

Data Exploration
Objectives of Data Exploration: Data exploration aims to understand the dataset's structure, uncover underlying patterns, and assess data quality before performing in-depth analysis. Its objectives are:
1. Data understanding
2. Data preparation
3. Data science tasks
4. Interpreting the results

1. Data understanding
Data exploration provides a broad overview of each attribute (or variable) in the dataset and examines interactions between attributes. It helps answer questions such as: What is a typical value for each attribute? How much do data points vary from this typical value? Are there any extreme values present?

2. Data preparation
Before applying data science algorithms, the dataset must be prepared to address any anomalies that may exist. These anomalies include outliers, missing values, and highly correlated attributes. Certain data science algorithms perform poorly when input attributes are correlated, so it is essential to identify and remove these correlated attributes.

3. Data science tasks
Basic data exploration can sometimes serve as a substitute for the entire data science process. For instance, scatterplots can reveal clusters in low-dimensional data or assist in developing regression or classification models by providing simple, visually based rules.

4. Interpreting the results
Data exploration aids in understanding prediction, classification, and clustering outcomes in the data science process.
Histograms, for example, are useful for visualizing attribute distributions and can also help in assessing numeric predictions, estimating error rates, and more.

Datasets
The Iris dataset, introduced by Ronald Fisher in his 1936 work on discriminant analysis, is one of the most popular datasets for learning data science. Iris is a widely distributed flowering plant genus containing over 300 species, each with unique physical traits, such as variations in flower and leaf shapes and sizes. The dataset comprises 150 observations from three species (Iris setosa, Iris virginica, and Iris versicolor), with 50 observations each. Each observation includes four attributes: sepal length, sepal width, petal length, and petal width, alongside the species label as the fifth attribute. In most flowers, the petals are the brightly colored inner part and the sepals form the outer part and are usually green. In Iris flowers, however, sepals and petals are both bright purple, although they can be distinguished by differences in shape, as shown in Fig 3.1.

In the Iris dataset, all four attributes are continuous numeric values measured in centimeters. One species, Iris setosa, can be easily identified with a simple rule, such as having a petal length of less than 2.5 cm. Differentiating between the Iris virginica and Iris versicolor classes, however, requires more complex rules that involve additional attributes. The dataset is widely available across standard data science tools, like RapidMiner, or can be downloaded from public sources such as the University of California Irvine’s Machine Learning Repository (Bache & Lichman, 2013). This dataset can be accessed through the book's companion website: www.IntroDataScience.com.

[Figure: Iris versicolor]

The Iris dataset is widely used in data science education due to its simplicity, making it easy to understand and explore.
It serves as an ideal example to demonstrate how different data science algorithms approach a common problem using a standard dataset. With three class labels, one of which (Iris setosa) is easily identifiable through visual inspection, the dataset illustrates straightforward classification. Distinguishing the remaining two classes requires more nuanced rules, which reinforces the insights from visual rules while also showing where data science methods must go beyond the limits of visual analysis.

Types of data
Data comes in various formats and types, and understanding the properties of each attribute or feature gives insight into what kinds of operations are possible. For instance, in traffic data, the attribute "traffic density" can be represented in several ways:
As a numeric value, like 50 cars per kilometer or 75 vehicles per square mile.
Using ordered labels, such as "high," "medium," or "low."
By the number of hours per day where traffic density exceeds a certain threshold, like 5 hours with high traffic density.
Although each of these representations indicates traffic density in an area, they are different data types. Some of these data types, such as numeric values and ordered labels, may also be converted into each other for analysis.

Descriptive Statistics
Descriptive statistics involve analyzing the overall summaries of a dataset to better understand its characteristics. These measures are commonly applied in everyday contexts. For instance, calculating the average age of employees in a company, finding the median rental price for apartments in a city, or determining the range of monthly utility bills in a region are all examples of descriptive statistics in action. In general, descriptive analysis focuses on key attributes of a sample or population, such as central tendency (mean or median), spread (range or variance), and overall distribution.
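
As a minimal sketch of the summaries described above, the following Python snippet computes measures of central tendency and spread for a small sample. The utility-bill values and the `summarize` helper are hypothetical and chosen only for illustration; the variance here is assumed to be the population form (dividing by the number of data points).

```python
import statistics

# Hypothetical monthly utility bills (illustrative values, not from the lecture)
bills = [72.0, 85.5, 90.0, 64.5, 110.0, 85.5, 78.0]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),      # arithmetic average
        "median": statistics.median(values),  # middle value of the sorted data
        "range": max(values) - min(values),   # maximum minus minimum
        # Population variance: squared deviations from the mean, divided by N
        "variance": statistics.pvariance(values),
    }


summary = summarize(bills)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The same helper can be applied to any numeric attribute, one attribute at a time.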
Descriptive statistics can be broadly classified into two categories based on the number of attributes analyzed:
1. Univariate exploration
2. Multivariate exploration
Univariate exploration focuses on a single attribute to summarize its characteristics, while multivariate exploration examines the relationships and interactions between multiple attributes. This classification helps determine the complexity of the analysis and the insights that can be derived from the data.

Descriptive Statistics - Univariate
Univariate data exploration denotes analysis of one attribute at a time. The example Iris dataset for one species, I. setosa, has 50 observations and 4 attributes, as shown in Table 3.1. Here, some of the descriptive statistics for the sepal length attribute are explored.

Measures of Central Tendency
The goal of identifying the central location of an attribute is to summarize the dataset with a single representative value.
Mean: The mean is the arithmetic average of all observations in the dataset. It is calculated by adding all the data points together and dividing the sum by the total number of data points.
Median: The median represents the central point in the distribution. To find the median, all observations are sorted from smallest to largest, and the middle observation in the sorted list is selected. If there is an even number of data points, the median is determined by averaging the two middle values.
Mode: The mode is the most frequently occurring value in the dataset.

The mean, median, and mode of an attribute can differ, reflecting the shape of the distribution. Outliers affect the mean, but the median is usually unaffected. The mode may differ from the mean and median, especially in datasets with multiple natural distributions.

Choosing the Right Measure:
Mean: Best used for quantitative data where values are evenly distributed.
Median: Preferred in skewed distributions or when outliers are present, as it provides a better representation of the central tendency.
Mode: Useful for understanding frequency in categorical data or for identifying the most common value.

Example A
Consider a dataset containing students' exam scores in different subjects. The dataset includes Mathematics, Science, English, and History scores for each student. The steps below show how to organize the dataset and apply the measures described above.

1. Organize the Data Set
Begin by structuring the dataset in a tabular format, where each row represents a student and each column corresponds to an attribute (subject score).

Student ID  Mathematics  Science  English  History
1           85           78       92       88
2           90           82       85       84
3           70           75       80       70
4           95           88       89       90
5           60           68       72       65

2. Find the Central Point for Each Attribute
Calculate the mean (average) or median for each subject to find the central tendency.
Mean for Mathematics: (85+90+70+95+60)/5 = 400/5 = 80
Mean for Science: (78+82+75+88+68)/5 = 391/5 = 78.2
Mean for English: (92+85+80+89+72)/5 = 418/5 = 83.6
Mean for History: (88+84+70+90+65)/5 = 397/5 = 79.4
To calculate the median and mode, for example for Mathematics:
Sorted data: 60, 70, 85, 90, 95
Median: 85 (middle value)
Mode: No mode (all values are unique)

Descriptive Statistics - Univariate
Measure of spread
There are two common metrics to quantify spread.
Range: The range is the difference between the maximum value and the minimum value of the attribute. The range is simple to calculate and articulate, but it is severely affected by outliers and fails to consider the distribution of all the other data points in the attribute.
Deviation: The variance and standard deviation measure the spread by considering all the values of the attribute. Deviation is simply measured as the difference between any given value ($x_i$) and the mean of the sample ($\mu$).
The variance is the sum of the squared deviations of all data points divided by the number of data points. For a dataset with N observations, the variance is given by the following equation:
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]

Standard deviation is the square root of the variance. Since the standard deviation is measured in the same units as the attribute, it is easy to understand the magnitude of the metric. A high standard deviation means the data points are spread widely around the central point. A low standard deviation means the data points are closer to the central point. If the distribution of the data aligns with the normal distribution, then about 68% of the data points lie within one standard deviation of the mean.

Example A (continued)
3. Understand the Spread of the Attributes
Calculate measures of spread such as the standard deviation and range for each subject.
The range for Mathematics: 95 − 60 = 35
Standard deviation for Mathematics (using the formula above, with mean $\mu = 400/5 = 80$ and $N = 5$):
\[ \sigma = \sqrt{\frac{(85-80)^2 + (90-80)^2 + (70-80)^2 + (95-80)^2 + (60-80)^2}{5}} = \sqrt{\frac{850}{5}} = \sqrt{170} \approx 13.04 \]

Descriptive Statistics - Univariate
Fig. 3.2 provides the univariate summary of the Iris dataset with all 150 observations, for each of the four numeric attributes.

Multivariate exploration
Multivariate exploration is the study of more than one attribute in the dataset simultaneously. This technique is critical to understanding the relationship between the attributes, which is central to data science methods. As with univariate exploration, the measures of central tendency and variance in the data will be discussed.

Central data point
In the Iris dataset, each data point can be expressed as a set of all four attributes:
observation i: {sepal length, sepal width, petal length, petal width}
For example, observation one: {5.1, 3.5, 1.4, 0.2}.
This observation point can also be expressed in four-dimensional Cartesian coordinates and plotted in a graph (although plotting more than three dimensions in a visual graph can be challenging).

In this way, all 150 observations can be expressed in Cartesian coordinates. If the objective is to find the most “typical” observation point, it would be a data point made up of the mean of each attribute in the dataset, computed independently. For the Iris dataset shown in Table 3.1, the central mean point is {5.006, 3.418, 1.464, 0.244}. This data point may not be an actual observation; it is a hypothetical data point with the most typical attribute values.

Correlation
Correlation measures the statistical relationship between two attributes, particularly the dependence of one attribute on another. When two attributes are highly correlated, they vary at the same rate with each other, either in the same or in opposite directions. For example, consider the average temperature of the day and ice cream sales. Statistically, two attributes that are correlated are dependent on each other, and one may be used to predict the other. If there are sufficient data, future sales of ice cream can be predicted if the temperature forecast is known. However, correlation between two attributes does not imply causation; that is, one does not necessarily cause the other.

Correlation between two attributes is commonly measured by the Pearson correlation coefficient (r), which measures the strength of linear dependence. Correlation coefficients take a value from -1< r