Document Details


Uploaded by SignificantMarsh6895

Dr. Amal Fouad

Tags

data science, data analysis, machine learning, artificial intelligence

Summary

This document provides an introduction to data science, covering concepts such as data definition, databases, and data science applications in various industries. It compares data science with other related fields like AI and ML. The document also discusses different tools and tasks within data science.

Full Transcript

Dr. Amal Fouad, Senior Data Scientist

INTRODUCTION TO DATA SCIENCE

Agenda
Why should we study Data Science? How does Data Science impact organizations? Application and competitive advantage. Importance of Data Science. What we'll discuss: DS vs AI vs ML vs DL.

Data Definition
Data is a collection of details or facts in the form of figures, text, symbols, descriptions, or other observations of entities. Data takes various forms, such as letters, numbers, images, or characters. Computer data, for instance, is represented as 0's and 1's that can be interpreted to form a fact or value. Information is data collated to derive meaning.

Databases
Databases are logical structures, based on set theory, whose relationships are defined by unique primary and foreign keys. These links allow data to be processed across tables while preserving data integrity. Data science, by contrast, is the study and analysis of data using methods, processes, and insights that apply to structured or unstructured data. The difference is that databases are actual objects, while data science is a set of methods and processes applied to structured or unstructured data.

Data Science Definition
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, and to apply knowledge and actionable insights from data across a broad range of application domains. It has been described as a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science.

Data Science and related fields: Statistics, Big Data Analytics, Business Analytics, Business Intelligence, Data(base) Management, Visualization, Machine Learning, Data Mining, Artificial Intelligence, Predictive Modelling.

How YouTube uses AI
AI tasks in YouTube: automatically removing objectionable content, recommending videos based on user choice, and organizing the content.

How Facebook uses AI
Facebook researchers have developed many tools to address these problems: DeepText, machine translation, automatic image recognition, chatbots, deepfake detection, and targeted ads.

How Google uses AI
Google uses data science in all its products: Gmail (smart replies, spam detection), Google Search (suggestions based on past searches), Google Maps (ETA based on analysis of location, day of week, trip time, ...), Google Translate, and speech recognition.
Companies Using Data Science (Big Data Science Tasks): Facebook, Amazon, Google, LinkedIn, Netflix, YouTube, Microsoft.

Roles in Data Science: Data Analyst, Data Scientist.

Data Scientist Starter Pack: learning data science with Python - libraries and tools.

Data Scientist - Applications

Data Science - Importance
1. Data science helps brands understand their customers in a much enhanced and empowered manner.
2. It allows brands to communicate their story in an engaging and powerful manner.
3. Big Data is a new field that is constantly growing and evolving.
4. Its findings and results can be applied to almost any sector, such as travel, healthcare, and education, among others.
5. Data science is accessible to almost all sectors.

THANK YOU

Fundamentals of Data Science - LECTURE 2: INTRODUCTION

Course Book: Data Science: Concepts and Practice. Authors: Vijay Kotu & Bala Deshpande. Publisher: Morgan Kaufmann (free download).

What is Data Science?
❑ Data science is a collection of techniques used to extract value from data.
❑ It has become an essential tool for any organization that collects, stores, and processes data as part of its operations.
❑ Data science techniques rely on finding useful patterns, connections, and relationships within data.
❑ Data science is also commonly referred to as knowledge discovery, machine learning, predictive analytics, and data mining.
❑ Data science enables businesses to process huge amounts of data to detect patterns.
❑ This in turn allows companies to increase efficiencies, manage costs, identify new market opportunities, and boost their market advantage.
❑ DS is the practice of mining large sets of raw data, both structured and unstructured, to identify patterns and extract actionable insight from them.
❑ It is an interdisciplinary field; the foundations of data science include statistics, computer science, predictive analytics, machine learning algorithm development, and new technologies for gaining insights from big data.

AI, MACHINE LEARNING, AND DATA SCIENCE
Artificial intelligence, machine learning, and data science are all related to each other.
❑ Artificial intelligence is about giving machines the capability of mimicking human behavior, particularly cognitive functions. Examples: facial recognition, automated driving, sorting mail based on postal code.
❑ A wide range of techniques fall under artificial intelligence: linguistics, natural language processing, decision science, bias, vision, robotics, planning, etc.
❑ Machine learning, which can be considered either a sub-field or one of the tools of artificial intelligence, provides machines with the capability of learning from experience.
❑ Experience for machines comes in the form of data. Data that is used to teach machines is called training data.
❑ Machine learning turns the traditional programming model upside down.
❑ A program, a set of instructions to a computer, transforms input signals into output signals using predetermined rules and relationships. Machine learning algorithms, in contrast, infer those rules from example inputs and outputs (the training data).
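To make that inversion concrete, here is a minimal sketch (assuming scikit-learn is installed; the credit-score values, labels, and the hand-written is_high_risk rule are made up for illustration): a hand-coded rule next to a classifier that infers a comparable rule from labeled training data.

```python
# Traditional programming: the rule is written by hand.
def is_high_risk(credit_score):
    return credit_score < 600          # hypothetical, hand-chosen threshold

# Machine learning: the rule is inferred from training data (inputs + known outputs).
from sklearn.tree import DecisionTreeClassifier

X = [[520], [580], [610], [700], [750], [810]]   # made-up credit scores (inputs)
y = [1, 1, 0, 0, 0, 0]                           # made-up labels: 1 = high risk (outputs)
model = DecisionTreeClassifier().fit(X, y)       # the "experience" is the (X, y) pairs

print(is_high_risk(590), model.predict([[590]])) # both apply a rule; only one was learned
```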
AI, MACHINE LEARNING, AND DATA SCIENCE (cont.)
❑ Data science is the business application of machine learning, artificial intelligence, and other quantitative fields like statistics, visualization, and mathematics.
❑ It is an interdisciplinary field that extracts value from data.

What is implied by Data Science? Compare AI, ML, and DS.

WHAT IS DATA SCIENCE?
❑ Data science starts with data, which can range from a simple array of a few numeric observations to a complex matrix of millions of observations with thousands of variables.
❑ Data science utilizes certain specialized computational methods in order to discover meaningful and useful structures within a dataset.

KEY FEATURES AND MOTIVATIONS OF DATA SCIENCE
1- Extracting meaningful patterns
2- Building representative models
3- Combination of statistics, machine learning, and computing
4- Learning algorithms
5- Associated fields

What Can Data Science Be Used For?
Used in fields almost worldwide, including healthcare, marketing, banking and finance, and policy work; social intelligence; political planning and economics; almost all real-life fields with huge amounts of accumulated data.

Data Science vs Data Analytics
❑ Although the work of data scientists and data analysts is sometimes conflated, these fields are not the same.
❑ In summary, a data scientist is more likely to look ahead, predicting or forecasting as they look at data.
❑ A data analyst is more likely to focus on specific questions to answer, digging into existing data sets that have already been processed for insights.

What are the Data Science Tasks?

Types of Data
Qualitative and quantitative data; simple quantitative analysis; simple qualitative analysis.
Quantitative data: expressed as numbers.
Qualitative data: difficult to measure sensibly as numbers, e.g., counting the number of words to measure dissatisfaction.
Quantitative analysis: numerical methods to ascertain size, magnitude, amount (a solid number, i.e., a fact).
Qualitative analysis: expresses the nature of elements and is represented as themes, patterns, and stories.
Be careful how you manipulate data and numbers!

Quantitative and Qualitative - Examples
» Red, Blue, Yellow, Black: qualitative nominal
» Very Unhappy, Unhappy: qualitative ordinal
» 15, 123, 12, -10: quantitative discrete
» 14.2, 11.09, 1000: quantitative continuous
» Baby, Child, Teenager, Adult, Old: qualitative ordinal
» Alexandria, Cairo, Luxor: qualitative nominal
» East, West, North, South: qualitative nominal
» Student age: quantitative discrete
» Product weight: quantitative continuous
» Price: quantitative discrete
» Gender: qualitative nominal
» Phone numbers: qualitative nominal
» Course grades (A, B, C, ...): qualitative ordinal
» User feedback paragraph: qualitative nominal
» Neutral, Happy, Very Happy: qualitative ordinal

Quantitative vs Qualitative
» Quantitative: objective variables for data gathering; values that can be counted, such as age, weight, volume, and scale; researchers aim to increase the sample size (the more data points, the more accurate), and it is costly to have a large sample size; a definite, fixed, and measurable reality.
» Qualitative: subjective parameters for data gathering; things that can be described using the five senses, such as color, smell, taste, touch or feeling, typology, and shapes; researchers aim to get a variety of data values to examine and understand; a dynamic and negotiable reality.
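As a small illustration of these data types in code, the sketch below (the column names and values are made up) uses pandas categorical dtypes to distinguish nominal from ordinal attributes, and plain numeric dtypes for discrete and continuous ones.

```python
import pandas as pd

# Made-up observations covering the four kinds of attributes listed above.
df = pd.DataFrame({
    "city": ["Alexandria", "Cairo", "Luxor"],               # qualitative nominal
    "satisfaction": ["Unhappy", "Neutral", "Very Happy"],   # qualitative ordinal
    "age": [21, 34, 45],                                     # quantitative discrete
    "weight_kg": [70.5, 82.1, 64.3],                         # quantitative continuous
})

df["city"] = df["city"].astype("category")                   # unordered categories
df["satisfaction"] = df["satisfaction"].astype(
    pd.CategoricalDtype(
        ["Very Unhappy", "Unhappy", "Neutral", "Happy", "Very Happy"], ordered=True
    )
)
print(df.dtypes)
print(df["satisfaction"].min())   # ranking is meaningful only for the ordinal column
```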
Simple Quantitative Analysis
Averages:
– Mean: add up the values and divide by the number of data points.
– Median: the middle value of the data when ranked.
– Mode: the figure that appears most often in the data.
Other simple tools: histograms, skewness, outliers.

Skewness
A distribution can be symmetric, negatively skewed, or positively skewed.

Qualitative Ordinal
It has a rank or order; it establishes a relative rank. It has no standardized interval scale. The median and mode can be analyzed, but it cannot have a mean.

Lie Factor
The lie factor describes the relation between the size of an effect shown in a graphic and the size of the effect shown in the data:
Lie Factor = (size of effect shown in the graphic) / (size of effect shown in the data)
It is acceptable for the lie factor to be between 0.95 and 1.05.
» Graphic = (1550 - 600) / 600 = 1.58
» Actual = (11750 - 10800) / 10800 = 0.088
» Lie Factor = 1.58 / 0.088 ≈ 18

Data Cleaning
Reasons for data cleansing: the data to be analyzed may be:
◉ Incomplete: data is missing.
◉ Noisy: data may contain errors or outlier values.
◉ Inconsistent: data may contain discrepancies in the values.

How can Data be Cleaned?
Filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistency.

Typical Example (Sample Table)
Incomplete data: some field values are missing.
Data value errors: the zip code consists of five digits and cannot contain any letters; income must be a positive number; age must be a positive number.
Outlier values: outliers are data values that deviate from the expected values of the rest of the data set; the values 10000000 and -40000 look very divergent from the rest of the values.
Ambiguity: "S" in Marital Status could refer to "Single" or "Separated", so there is a kind of ambiguity in the data.

Reasons for Incomplete Data
Relevant data may not be recorded because of a misunderstanding by the data entry personnel or because of equipment failure. Relevant data may not be available because it is unknown or because providing it is optional.

Dealing with Incomplete Data
There are several ways to deal with missing data:
◉ Replace the missing value with some default value.
◉ Replace the missing value with the field mean for numerical fields, or the mode (if one exists) for categorical fields.
◉ Replace the missing value with a value generated at random from the observed field distribution.

Mean, Median, and Mode
The mean for a population of size n is µ = (x1 + x2 + ... + xn) / n.
◉ Consider the following list of 9 numbers: 13, 15, 12, 17, 22, 11, 13, 19, 12.
Mean = (13 + 15 + 12 + 17 + 22 + 11 + 13 + 19 + 12) / 9 = 14.88889
The median is the middle value of the ordered list of numbers.
◉ For the same list of 9 numbers, first order them: 11, 12, 12, 13, 13, 15, 17, 19, 22. Hence, the median is 13.
◉ Consider the following list of 10 numbers: 13, 15, 12, 17, 22, 11, 13, 19, 12, 14. Ordered: 11, 12, 12, 13, 13, 14, 15, 17, 19, 22. Hence, the median is (13 + 14)/2 = 13.5.
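A minimal sketch of these measures using Python's statistics module, reproducing the worked numbers above:

```python
import statistics

nine = [13, 15, 12, 17, 22, 11, 13, 19, 12]      # the 9-number list from the example
print(statistics.mean(nine))                     # 14.888...
print(statistics.median(nine))                   # 13
print(statistics.multimode(nine))                # [13, 12] (both occur twice)

ten = [13, 15, 12, 17, 22, 11, 13, 19, 12, 14]   # the 10-number list
print(statistics.median(ten))                    # 13.5 (average of the two middle values)
```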
Mean, Median, and Mode (cont.)
The mode of a set of data is the value in the set that occurs most often.
◉ Consider the list: 13, 15, 12, 17, 22, 11, 13, 19, 13. Occurrences: 13 appears 3 times; 11, 12, 15, 17, 19, and 22 each appear once. The mode is 13.
◉ Consider the list: 13, 15, 12, 17, 22, 11, 13, 19, 12. Occurrences: 13 and 12 each appear twice; 11, 15, 17, 19, and 22 each appear once. The modes are 13 and 12 (bimodal).
◉ Consider the list: 13, 15, 12, 17, 22, 11, 19. Every value appears exactly once, so there is no mode.

Handling Missing Values
A sample table with a set of fields (Field1-Field4) containing missing values is used in the following examples.

Handling Missing Values (Using Default Values)
Defaults: Field1 default 0; Field2 default N; Field3 default 240; Field4 default 50.

Handling Missing Values (Using Means and Modes)
Use the mean for numeric fields and the mode (if one exists) for categorical fields. If no mode exists, rely on either a default value or a random value.
Numeric fields: Field1, Field3, and Field4.
Field1 mean = (21 + 24 + 22 + 12 + 11 + 16 + 16 + 17 + 18) / 9 = 17.44
Field3 mean = 334.44
Field4 mean = 81.78
If a field does not accept decimal values, round the mean.
Field2 is categorical, so the mode is computed from the existing values. Occurrences: A appears 3 times, B once, W twice, C twice. Hence, the mode is A.
Assumptions: Field1 and Field4 do not accept decimal numbers, so their means are rounded; Field3 accepts decimal numbers, so its mean is not rounded.

Handling Missing Values (Using Random Values)
Missing values can also be filled with values drawn at random from the observed field distribution.

Handling Outliers
Outliers are data values that deviate from the expected values of the rest of the data set. They are extreme values that lie near the limits of the data range or go against the trend of the remaining data. Normally, outliers need more investigation to make sure they are not due to mistakes during data entry.

Handling Outliers Using the Inter-quartile Range
(Diagram: sorted data items with Q1, Q2, and Q3 marking the 25%, 50%, and 75% cut points.)
A quartile is any of the three values that divide the sorted data set into four equal parts. The first quartile (Q1) cuts off the lowest 25% of the data. The second quartile (Q2) cuts the data set in half (it is the median of the data set). The third quartile (Q3) cuts off the highest 25% of the data, or the lowest 75%.

Computing Q1, Q2, and Q3
◉ Order the given data set in ascending order.
◉ Use the median to divide the ordered data set into two halves. This median is the second quartile (Q2). Exclude this median (if it is one of the data items) from any further computation.
◉ The first quartile (Q1) is the median of the lower half of the data.
◉ The third quartile (Q3) is the median of the upper half of the data.

Example #1 of Computing Q1, Q2, and Q3
Compute Q1, Q2, and Q3 for the data set: 6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36.
◉ Order the data set in ascending order: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49.
◉ Q2 = 40 (the median of the data set).
◉ Q1 is the median of the lower half of the data: Q1 = 15.
◉ Q3 is the median of the upper half of the data: Q3 = 43.
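The quartile procedure above can be written directly in Python. This is a sketch of that specific median-of-halves method (library functions such as numpy.percentile use interpolation and can return slightly different quartiles):

```python
import statistics

def quartiles(data):
    # Median-of-halves method: Q2 is the overall median; when it is itself a
    # data item (odd-length list), it is excluded from both halves.
    s = sorted(data)
    n, half = len(s), len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 else s[half:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

# Example #1 above: expect Q1 = 15, Q2 = 40, Q3 = 43
print(quartiles([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]))
```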
Example #2 of Computing Q1, Q2, and Q3
Compute Q1, Q2, and Q3 for the data set: 39, 36, 7, 40, 41, 17.
◉ Order the data set in ascending order: 7, 17, 36, 39, 40, 41.
◉ Q2 = (36 + 39)/2 = 37.5 (the median of the data set). Note that the number of data items is even, so the median is the average of the middle two items.
◉ The median is not a data item, so all items in the lower half are used to compute Q1 and the remaining items are used to compute Q3.
◉ Q1 = 17.
◉ Q3 is the median of the upper half of the data: Q3 = 40.

Detecting Outliers Using the Inter-quartile Range
Compute the inter-quartile range (IQR) as IQR = Q3 - Q1.
A data value is an outlier if its value is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR.

Example of Detecting Outliers Using the Inter-quartile Range
Data set that might contain outliers: 75000, -40000, 10000000, 50000, 99999.
Ordered data set: -40000, 50000, 75000, 99999, 10000000.
Q2 = 75000
Q1 = (-40000 + 50000)/2 = 5000
Q3 = (99999 + 10000000)/2 = 5049999.5
IQR = Q3 - Q1 = 5049999.5 - 5000 = 5044999.5
Q1 - 1.5*IQR = 5000 - 1.5*5044999.5 = -7562499.25
Q3 + 1.5*IQR = 5049999.5 + 1.5*5044999.5 = 12617498.75
All data in the data set fall within this range, hence there are no outliers in this example.

Another data set that might contain outliers: 75000, 40000, 10000000, 50000, 99999, 75000.
Ordered data set: 40000, 50000, 75000, 75000, 99999, 10000000.
Q2 = (75000 + 75000)/2 = 75000
Q1 = 50000
Q3 = 99999
IQR = Q3 - Q1 = 99999 - 50000 = 49999
Q1 - 1.5*IQR = 50000 - 1.5*49999 = -24998.5
Q3 + 1.5*IQR = 99999 + 1.5*49999 = 174997.5
Hence, the data item 10000000 is an outlier and should be re-investigated for data-entry errors.
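Putting the quartile computation and the 1.5*IQR rule together, here is a short self-contained sketch that reproduces both worked examples (no outliers in the first data set, 10000000 flagged in the second):

```python
import statistics

def quartiles(data):
    # Median-of-halves method described earlier; Q2 is excluded from the
    # halves when it is itself a data item (odd-length list).
    s = sorted(data)
    n, half = len(s), len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 else s[half:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

def iqr_outliers(data):
    # A value is an outlier if it lies below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    return [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(iqr_outliers([75000, -40000, 10000000, 50000, 99999]))        # []
print(iqr_outliers([75000, 40000, 10000000, 50000, 99999, 75000]))  # [10000000]
```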
Noisy Data
Noisy data are data that contain incorrect values. Some reasons for noisy data:
◉ Data collection instruments may be faulty.
◉ Human or computer errors may occur during data entry.
◉ Transmission errors may occur.
◉ Technology limitations, such as buffer sizes, may cause errors during data entry.

Smoothing Noisy Data
By smoothing noisy data we can correct the errors. Smoothing noisy data is performed by:
◉ Validation and correction
◉ Standardization

Validation and Correction of Noisy Data
This step examines the data for data-entry errors and tries to correct them automatically as far as possible, according to the following guidelines:
◉ Spell checking based on dictionary lookup is useful for identifying and correcting misspellings. Example: "Kairo" can be spell-checked and corrected into "Cairo".
◉ Using dictionaries of geographic names and zip codes helps to correct address data. Example: the zip code 1243456 can be detected as an error since no zip code matches this value.
◉ Check validation rules and make sure field values follow them; for example, age must be a positive number and not less than a certain amount. Example: if a rule governing your data says that age must be between 20 and 60, then ages of 18, 15, and 68 are detected as errors.
◉ Each categorical value must belong to one of the defined categories. Example: if the only categories you have are A, B, C, and D, then the values W or N are declared as errors.
◉ Check fields that have ambiguous values for possible data-entry errors. Example: using the same category value to refer to different meanings, such as "S" in the "Marital Status" field, which could refer to "Single" or "Separated".

Standardization to Smooth Noisy Data
Data values should be consistent and have a uniform format. For example:
◉ Date and time entries should have a specific format: "Oct. 19, 2009", "10/19/2009", and "19/10/2009" all refer to the same date, and all dates must be written in the same agreed-upon format (e.g., Day/Month/Year).
◉ Names and other string data should be converted to either upper or lower case, e.g., "MOHAMED AHMED" instead of "Mohamed Ahmed".
◉ Prefixes and suffixes should be removed from names, e.g., "Mohamed Ahmed" instead of "Mr. Mohamed Ahmed" or "Mohamed Ahmed, Ph.D.".
◉ Abbreviations and encoding schemes should be resolved consistently by consulting special dictionaries or applying predefined conversion rules, e.g., "US" is the standard abbreviation of "United States".

Data Inconsistency
Data inconsistency means that different data items contain discrepancies in their values. It can occur when data items depend on other data items and their values do not match; for example:
◉ Age and birth date: age can be computed from the birth date, so the value of Age must match the value computed from the birth date.
◉ City and phone area code: each city has a certain area code.
◉ Total price, unit price, and quantity: the total price can be computed from the unit price and the quantity.
These dependencies can be used to detect errors, substitute missing values, or correct wrong values.

Data Inconsistency Example
(Sample table of inconsistent data with the inconsistencies marked: an incorrect area code, and a Total_Price that does not equal Quantity * Unit_Price.)
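As a rough sketch of these cleaning steps in pandas (pandas 2.0 or later is assumed for format="mixed"; the records, column names, and values are made up), the snippet below standardizes name casing and date formats and flags a price inconsistency:

```python
import pandas as pd

# Made-up records with mixed date formats, inconsistent name casing, and a
# derived field (total_price) that should equal quantity * unit_price.
df = pd.DataFrame({
    "name": ["MR. MOHAMED AHMED", "mohamed ahmed"],
    "date": ["Oct. 19, 2009", "19/10/2009"],
    "quantity": [3, 2],
    "unit_price": [10.0, 20.0],
    "total_price": [30.0, 35.0],
})

df["name"] = (df["name"]
              .str.replace(r"^(Mr|Mrs|Dr)\.\s*", "", case=False, regex=True)  # drop prefixes
              .str.title())                                                   # uniform casing
df["date"] = pd.to_datetime(df["date"], format="mixed")      # one standard date format
df["price_ok"] = df["total_price"] == df["quantity"] * df["unit_price"]  # consistency check
print(df)
```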
Fundamentals of Data Science - LECTURE 2: DATA SCIENCE PROCESS

Data Science Process
The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process. The standard data science process consists of:
1. Prior Knowledge: understanding the problem.
2. Preparation: preparing the data samples.
3. Modeling: developing the model.
4. Application: applying the model to a dataset to see how it may work in the real world.
5. Knowledge: deploying and maintaining the models.

What Motivated Data Mining? Why Is It Important?
The wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Data mining can be viewed as a result of the natural evolution of information technology.

Data Science Process (cont.)
The fundamental objective of any process that involves data science is to address the analysis question. The learning algorithm used to solve the business question could be a decision tree, an artificial neural network, or a scatterplot. The software tool used to develop and implement the data science algorithm could be custom coding, RapidMiner, R, Weka, SAS, Oracle Data Miner, or Python.
(Diagram: business and data understanding -> 1. Prior Knowledge; prepare data -> 2. Preparation; build a model from training data using algorithms -> 3. Modeling; apply the model to test data, evaluate performance, and deploy -> 4. Application; knowledge and actions -> 5. Knowledge.)

1. Prior Knowledge
Prior knowledge refers to information that is already known about a subject. The prior knowledge step in the data science process helps to define what problem is being solved, how it fits in the business context, and what data is needed to solve the problem. It involves gaining information on: (1) the objective of the problem, (2) the subject area of the problem, and (3) the data.

1- Objective of the problem
- The data science process starts with a need for analysis, a question, or a business objective.
- This is the most important step in the data science process.
- Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
- As an iterative process, it is common to go back to previous data science process steps and revise the assumptions and approach.

2- Subject area of the problem
- The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes.
- But the problem is that it uncovers a lot of patterns.
- False or spurious (fake) signals are a major problem in the data science process.
- Hence, it is essential to know the subject matter, the context, and the business process generating the data.

3- Data
Similar to prior knowledge of the subject area, prior knowledge of the data can also be gathered. Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process. This part of the step surveys all the data available to answer the business question and narrows down any new data that needs to be sourced. There is a range of factors to consider: the quality of the data, the quantity of data, the availability of data, gaps in the data, whether a lack of data compels the practitioner to change the business question, etc.

Terminology used in the data science process:
- A dataset (example set) is a collection of data with a defined structure. Table 2.1 has a well-defined structure with 10 rows and 3 columns along with the column headers. This structure is also sometimes referred to as a "data frame".
- A data point (record, object, or example) is a single instance in the dataset. Each row in the table is a data point, and each instance has the same structure as the dataset.
- An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each column in the table is an attribute. Attributes can be numeric, categorical, date-time, text, or Boolean data types.
- A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes. In Table 2.1, the interest rate is the output variable.
- Identifiers are special attributes used for locating or providing context to individual records. For example, common attributes like names, account numbers, and employee ID numbers are identifier attributes. In Table 2.1, the attribute ID is the identifier.
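A tiny pandas sketch of this terminology (the values are made-up stand-ins for Table 2.1, which is not reproduced in this transcript):

```python
import pandas as pd

dataset = pd.DataFrame({                    # the dataset / "data frame"
    "id": [1, 2, 3, 4],                     # identifier: locates individual records
    "credit_score": [500, 600, 700, 800],   # attribute (feature / predictor)
    "interest_rate": [9.5, 8.0, 6.5, 5.0],  # label (target / response) to be predicted
})

X = dataset[["credit_score"]]               # input attributes
y = dataset["interest_rate"]                # the label
print(dataset.shape)                        # (4, 3): 4 data points (rows), 3 columns
```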
2. Data Preparation
Preparing the dataset to suit a data science task is the most time-consuming part of the process. It is extremely rare that datasets are available in the form required by the data science algorithms. Most data science algorithms require data to be structured in a tabular format, with records in the rows and attributes in the columns. If the data is in any other format, it needs to be transformed by applying pivot, type conversion, join, or transpose functions, etc., to condition it into the required structure. The data preparation steps are:
1. Data exploration
2. Data quality
3. Handling missing values
4. Data type conversion
5. Transformation
6. Outliers
7. Feature selection
8. Sampling

1- Data Exploration
- Data exploration, also known as exploratory data analysis, provides a set of simple tools for achieving a basic understanding of the data.
- Data exploration approaches involve computing descriptive statistics and visualizing the data.
- These approaches can expose the structure of the data, the distribution of the values, and the presence of extreme values, and they highlight the inter-relationships within the dataset.
- Descriptive statistics like the mean, median, mode, standard deviation, and range for each attribute provide an easily readable summary of the key characteristics of the distribution of the data.
- Fig. 2.3 shows the scatterplot of credit score vs. loan interest rate; it can be observed that as the credit score increases, the interest rate decreases.

2- Data Quality
Data quality is an ongoing concern wherever data is collected, processed, and stored. In the interest rate dataset (Table 2.1), how does one know whether the credit score and interest rate data are accurate? What if a credit score has a recorded value of 900 (beyond the theoretical limit), or if there was a data entry error? Errors in data will impact the representativeness of the model.
Organizations use data alerts, cleansing, and transformation techniques to improve and manage the quality of the data and store it in company-wide repositories called data warehouses. Data sourced from well-maintained data warehouses have higher quality, as proper controls are in place to ensure a level of data accuracy for new and existing data. Data cleansing practices include eliminating duplicate records, quarantining outlier records that exceed the bounds, standardizing attribute values, substituting missing values, etc.

3- Handling Missing Values
One of the most common data quality issues is that some records have missing attribute values. There are several mitigation methods, but each has pros and cons. The first step in managing missing values is to understand the reason the values are missing. Missing credit score values can be replaced with a credit score derived from the dataset (the mean, minimum, or maximum value, depending on the characteristics of the attribute). This method is useful if the missing values occur randomly and the frequency of occurrence is quite rare. Alternatively, all the data records with missing values or poor data quality can be ignored when building the representative model; this method reduces the size of the dataset.
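A minimal sketch of the two mitigation options described above (the records and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"credit_score": [520, 640, np.nan, 710],
                   "interest_rate": [9.8, 7.5, 8.1, 6.2]})

# Option 1: substitute a value derived from the attribute (here, the mean).
filled = df.assign(credit_score=df["credit_score"].fillna(df["credit_score"].mean()))

# Option 2: ignore records with missing values (shrinks the dataset).
dropped = df.dropna(subset=["credit_score"])

print(filled)
print(len(df), "->", len(dropped))    # one record fewer after dropping
```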
4- Data Type Conversion
The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer numeric (credit score), or categorical. For example, the credit score can be expressed as categorical values (poor, good, excellent) or as a numeric score. In the case of linear regression models, the input attributes have to be numeric; if the available data are categorical, they must be converted to continuous numeric attributes. Numeric values can be converted to categorical data types by a technique called binning, where a range of values is specified for each category; for example, a score between 400 and 500 can be encoded as "low", and so on.

5- Transformation
In some data science algorithms, such as k-nearest neighbor (k-NN), the input attributes are expected to be numeric and normalized, because the algorithm compares the values of different attributes and calculates distances between data points. Normalization prevents one attribute from dominating the distance results because of large values. For example, consider income (expressed in USD, in thousands) and credit score (in hundreds): the distance calculation will always be dominated by slight variations in income. One solution is to convert the ranges of income and credit score to a more uniform scale from 0 to 1 by normalization. This way, a consistent comparison can be made between two attributes with different units.

6- Outliers
Outliers are anomalies, i.e., abnormal values, in a given dataset. Outliers may occur because of correct data capture (a few people with incomes in the tens of millions) or erroneous data capture (a human height of 1.73 cm instead of 1.73 m). Regardless, the presence of outliers needs to be understood, and they will require special treatment. Detecting outliers may be the primary purpose of some data science applications, such as fraud or intrusion detection.

7- Feature Selection
The example dataset shown in Table 2.1 has one attribute or feature (the credit score) and one label (the interest rate). In practice, many data science problems involve a dataset with hundreds to thousands of attributes. A large number of attributes significantly increases the complexity of a model and may degrade its performance due to the curse of dimensionality. Not all attributes are equally important or useful in predicting the target.

8- Sampling
Sampling is the process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. The sample data serve as a representative of the original dataset with similar properties, such as a similar mean. Sampling reduces the amount of data that needs to be processed and speeds up the modeling build process. In most cases, it is sufficient to work with samples to gain insights, extract information, and build representative predictive models. Theoretically, the error introduced by sampling impacts the relevancy of the model, but the benefits far outweigh the risks.
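A short sketch of binning and min-max normalization with pandas (the score and income values are made up, and the bin edges are illustrative only):

```python
import pandas as pd

scores = pd.Series([450, 560, 680, 790])     # made-up credit scores
income = pd.Series([40, 95, 150, 300])       # made-up income, in thousands of USD

# Binning: numeric -> categorical, one range of values per category.
bands = pd.cut(scores, bins=[300, 500, 650, 850], labels=["low", "medium", "high"])

# Min-max normalization: rescale to the 0-1 range so that neither attribute
# dominates a distance calculation (e.g., in k-NN).
def min_max(s):
    return (s - s.min()) / (s.max() - s.min())

print(bands.tolist())
print(min_max(scores).round(2).tolist())
print(min_max(income).round(2).tolist())
```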
3. Modeling Splitting training and test data sets The modeling step creates a representative model inferred from the data. The dataset used to create the model, with known attributes and target, is called the training dataset. The validity of the created model will also need to be checked with another known dataset called the test dataset or validation dataset. To facilitate this process, the overall known dataset can be split into a training dataset and a test dataset. A standard rule of thumb is two-thirds of the data are to be used as training and one-third as a test dataset 3. Modeling Splitting training and test data sets Training Data Test Data 4. Application In business applications, the results of the data science process have to be assimilated into the business process — usually in software applications —. Deployment is the stage at which the model becomes production ready The model deployment stage has to deal with: 1. Product readiness 2. Technical integration 3. Model response time 4. Remodeling 5. Assimilation 5. Knowledge The data science process provides a framework to extract nontrivial information from data To extract knowledge from these massive data assets, advanced approaches need to be employed, like data science algorithms. The data science process starts with prior knowledge and ends with posterior knowledge, which is the incremental insight gained. The data science process can bring up spurious irrelevant patterns from the dataset. 3. Data Exploration Objectives of Data Exploration 1. Data understanding 2. Data preparation 3. Data science tasks 4. Interpreting the results Objectives of Data Exploration 1. Data understanding: Data exploration provides a high-level overview of each attribute (also called variable) in the dataset and the interaction between the attributes. Data exploration helps answers the questions like what is the typical value of an attribute Objectives of Data Exploration 2. Data preparation: Before applying the data science algorithm, the dataset has to be prepared for handling any of the anomalies that may be present in the data. These anomalies include outliers, missing values, or highly correlated attributes. Objectives of Data Exploration 3. Data science tasks: Basic data exploration can sometimes substitute the entire data science process. For example, scatterplots can identify clusters in low-dimensional data or can help develop regression or classification models with simple visual rules. Objectives of Data Exploration 4. Interpreting the results: Data exploration is used in understanding the prediction, classification, and clustering of the results of the data science process. Histograms help to comprehend the distribution of the attribute and can also be useful for visualizing numeric prediction, error rate estimation, etc. Data Sets The most popular datasets used to learn data science is the Iris dataset, Iris is a flowering plant. The Iris dataset contains 150 observations of three different species (type), Iris setosa, Iris virginica, and Iris versicolor, with 50 observations each. Each observation consists of four attributes: sepal length, sepal width, petal length, and petal width. The fifth attribute, the label, is the name of the species observed, which takes the values I. setosa, I. virginica, and I. versicolor. Data Sets The dataset is available in all standard data science tools, such as RapidMiner, This dataset and other datasets used can be accessed from the book companion website: www.IntroDataScience.com. 
Descriptive Statistics
Descriptive statistics refers to the study of the aggregate quantities of a dataset. These measures are some of the most commonly used notations in everyday life. Examples include the average annual income, the median home price in a neighborhood, and the range of credit scores of a population. Descriptive statistics can be broadly classified into univariate and multivariate exploration, depending on the number of attributes under analysis.

Descriptive Statistics - Univariate
Univariate data exploration denotes the analysis of one attribute at a time. The example Iris dataset for one species, I. setosa, has 50 observations and 4 attributes, as shown in Table 3.1.

Measures of Central Tendency
The objective of finding the central location of an attribute is to quantify the dataset with one central or most common number.
- Mean: the arithmetic average of all observations in the dataset, calculated by summing all the data points and dividing by the number of data points.
- Median: the value of the central point in the distribution, calculated by sorting all the observations from small to large and selecting the mid-point observation in the sorted list. If the number of data points is even, the average of the middle two data points is used.
- Mode: the most frequently occurring observation. Data points may be repetitive, and the most repetitive data point is the mode of the dataset.

Measures of Spread
There are two common metrics to quantify spread.
- Range: the difference between the maximum and minimum values of the attribute. It is severely impacted by the presence of outliers and fails to consider the distribution of the other data points in the attribute.
- Deviation: the variance and standard deviation measure the spread by considering all the values of the attribute. The population variance is sigma^2 = (1/n) * sum of (xi - µ)^2, and the population standard deviation sigma is its square root. This formula is valid only if the values form the complete population. If the values are instead a random sample drawn from some larger parent population (for example, students chosen randomly and independently from a very large class), then one divides by n - 1 instead of n in the denominator; the result is then called the sample variance, and its square root is the sample standard deviation, denoted by s instead of sigma.

Worked example:

Criteria               1st      2nd      3rd      4th      5th
Sub1                   44       67       84       11       211
Sub2                   98       79       84       10       211
Sub3                   44       79       84       11       145
Sub4                   123      154      153      120      198
Mean (µ)               77.25    94.75    101.25   38.00    191.25
Median                 71       79       84       11       204.5
Mode                   44       79       84       11       211
Variance (population)  1183.69  1194.19  892.69   2241.50  741.19
Standard deviation     34.40    34.56    29.88    47.34    27.22
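The univariate measures above can be checked with Python's statistics module; this sketch uses the first column of the worked table:

```python
import statistics

first = [44, 98, 44, 123]                  # the "1st" criteria column above

print(statistics.mean(first))              # 77.25
print(statistics.median(first))            # 71.0
print(statistics.mode(first))              # 44
print(statistics.pvariance(first))         # 1183.6875  (population: divide by n)
print(round(statistics.pstdev(first), 2))  # 34.4       (population standard deviation)
print(statistics.variance(first))          # 1578.25    (sample: divide by n - 1)
```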
What is Covariance?
Covariance is a measure of the relationship between two random variables.
- Positive covariance indicates that two variables tend to move in the same direction.
- Negative covariance indicates that two variables tend to move in inverse directions.
The (population) covariance of x and y is cov(x, y) = (1/n) * sum of (xi - x̄)(yi - ȳ).

Correlation
A correlation is a relationship between two variables. The data can be represented by ordered pairs (x, y), where x is the independent variable and y is the dependent variable. The independent variable is the cause; its value is independent of the other variables in your study. The dependent variable is the effect; its value depends on changes in the independent variable. A scatter plot can be used to determine whether a linear (straight-line) correlation exists between two variables.

The Correlation Coefficient
The correlation coefficient is a measure of the strength and direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. It is the normalized version of covariance, making it dimensionless and easier to interpret:
r = sum of (xi - x̄)(yi - ȳ) / sqrt( sum of (xi - x̄)^2 * sum of (yi - ȳ)^2 ), i.e., r = cov(x, y) / (sx * sy).
The range of the correlation coefficient is -1 ≤ r ≤ +1.
- If x and y have a strong positive linear correlation, r is close to 1.
- If x and y have a strong negative linear correlation, r is close to -1.
- If there is no linear correlation or only a weak linear correlation, r is close to 0.

Example: The following data represent the number of hours 12 different students watched television during the weekend and the score each student earned on a test the following Monday. Display the scatter plot and calculate the correlation coefficient r.

Data Visualization
Visualizing data is one of the most important techniques of data discovery and exploration. The visual representation of data provides easy comprehension of complex data with multiple attributes and their underlying relationships. The motivations for using data visualization include:
- Comprehension of dense information: a simple visual chart can easily include thousands of data points. By using visuals, the user can see the big picture, as well as longer-term trends that are extremely difficult to interpret purely by expressing data in numbers.
- Relationships: visualizing data in Cartesian coordinates enables exploration of the relationships between the attributes. Although representing more than three attributes on the x, y, and z axes is not feasible in Cartesian coordinates, there are a few creative solutions, such as changing properties like the size, color, and shape of data markers, or using flow maps, where more than two attributes are shown in a two-dimensional medium.

Univariate Visualization
Visual exploration starts with investigating one attribute at a time using univariate charts. These techniques give an idea of how the attribute values are distributed and the shape of the distribution.

Multivariate Visualization
Multivariate visual exploration considers more than one attribute in the same visual. These techniques focus on the relationship of one attribute with another and examine two to four attributes simultaneously.
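To connect the covariance and correlation formulas above to code, here is a small sketch with made-up (hours of TV, test score) pairs; it computes r from the definition and shows the expected strong negative correlation:

```python
import math

x = [0, 1, 2, 3, 4, 5]            # made-up hours of TV watched
y = [98, 95, 90, 85, 80, 74]      # made-up test scores

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n    # population covariance
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
r = cov / (sx * sy)

print(round(cov, 2), round(r, 3))   # negative covariance, r close to -1
```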
