Fundamentals of Data Science DS302
Dr. Nermeen Ghazy
Summary
This document provides an overview of the data science process, including steps such as understanding the problem, preparing the data, modeling, and deploying models. It also covers the phases of the data science life cycle and the tools commonly used.
Full Transcript
Fundamentals of Data Science DS302
Dr. Nermeen Ghazy

Reference Books
Data Science: Concepts and Practice, Vijay Kotu and Bala Deshpande, 2019.
Data Science: Foundation & Fundamentals, B. S. V. Vatika, L. C. Dabra, Gwalior, 2023.

Data Science Process
The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process. The standard data science process consists of:
1- Understanding the problem,
2- Preparing the data samples,
3- Developing the model,
4- Applying the model on a dataset to see how the model may work in the real world,
5- Deploying and maintaining the models.

These steps correspond to:
1. Prior Knowledge
2. Preparation
3. Modeling
4. Application
5. Knowledge

Why Is It Important?
Data science matters because of the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge. Data mining can be viewed as a result of the natural evolution of information technology.

Data Science Process Frameworks
One of the most popular data science process frameworks is CRISP-DM, an acronym for Cross Industry Standard Process for Data Mining. CRISP-DM is the most widely adopted framework for developing data science solutions. Other data science frameworks are SEMMA (Sample, Explore, Modify, Model, and Assess); DMAIC (Define, Measure, Analyze, Improve, and Control), used in Six Sigma practice; and the Selection, Preprocessing, Transformation, Data Mining, Interpretation, and Evaluation framework used in the knowledge discovery in databases (KDD) process.

CRISP-DM Process
CRISP-DM is a process model with six phases that naturally describes the data science life cycle. It is like a set of guardrails to help you plan, organize, and implement your data science (or machine learning) project. The Business Understanding phase focuses on understanding the objectives and requirements of the project and the customer's needs. The Data Understanding phase focuses on identifying, collecting, and analyzing the data sets. The remaining phases are Data Preparation, Modeling, Evaluation, and Deployment.

Data Science Process
The data science process is a generic set of steps that is problem, algorithm, and data science tool agnostic. The fundamental objective of any process that involves data science is to address the analysis question. The learning algorithm used to solve the business question could be a decision tree, an artificial neural network, or a scatterplot. The software tool used to develop and implement the data science algorithm could be custom coding, RapidMiner, R, Weka, SAS, Oracle Data Miner, or Python.

Prior Knowledge
Prior knowledge refers to information that is already known about a subject. The prior knowledge step in the data science process helps to define what problem is being solved, how it fits in the business context, and what data is needed in order to solve the problem. It involves gaining information on:
1. The objective of the problem
2. The subject area of the problem
3. The data

Prior Knowledge: 1. Objective of the problem
The data science process starts with a need for analysis, a question, or a business objective. This is the most important step in the data science process. Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm. As an iterative process, it is common to go back to previous data science process steps and revise the assumptions, approach, and tactics. However, it is imperative to get the first step, the objective of the whole process, right.

Prior Knowledge: 2. Subject area of the problem
The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes. The problem is that it uncovers a lot of patterns; false or spurious signals are a major problem in the data science process. It is up to the practitioner to sift through the exposed patterns and accept the ones that are valid and relevant to the answer of the objective question. Hence, it is essential to know the subject matter, the context, and the business process generating the data.

Prior Knowledge: 3. Data
Similar to prior knowledge in the subject area, prior knowledge in the data can also be gathered. Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process. This part of the step surveys all the data available to answer the business question and narrows down the new data that need to be sourced. There is quite a range of factors to consider: the quality of the data, the quantity of data, the availability of data, gaps in the data, and whether a lack of data compels the practitioner to change the business question. The objective of this step is to come up with a dataset to answer the business question through the data science process.

Terminology used in the data science process:
A dataset (example set) is a collection of data with a defined structure. Table 2.1 has a well-defined structure with 10 rows and 3 columns along with the column headers. This structure is also sometimes referred to as a "data frame".
A data point (record, object, or example) is a single instance in the dataset. Each row in the table is a data point.
A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes. In Table 2.1, the interest rate is the output variable.
Identifiers are special attributes used for locating or providing context to individual records. For example, common attributes like names, account numbers, and employee ID numbers are identifier attributes. In Table 2.1, the attribute ID is the identifier.
Data Preparation
Preparing the dataset to suit a data science task is the most time-consuming part of the process. It is extremely rare that datasets are available in the form required by the data science algorithms. Most data science algorithms require data to be structured in a tabular format, with records in the rows and attributes in the columns. If the data is in any other format, it needs to be transformed by applying pivot, type conversion, join, or transpose functions, etc., to condition the data into the required structure. The main data preparation steps are:
1. Data exploration
2. Data quality
3. Handling missing values
4. Data type conversion
5. Transformation
6. Outliers
7. Feature selection
8. Sampling

1- Data Exploration
Data exploration, also known as exploratory data analysis, provides a set of simple tools to achieve a basic understanding of the data.
– Data exploration approaches involve computing descriptive statistics and visualizing the data.
– These approaches can expose the structure of the data, the distribution of the values, and the presence of extreme values, and highlight the inter-relationships within the dataset.
Descriptive statistics like mean, median, mode, standard deviation, and range for each attribute provide an easily readable summary of the key characteristics of the distribution of the data. Fig. 2.3 shows the scatterplot of credit score vs. loan interest rate; it can be observed that as the credit score increases, the interest rate decreases.
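A minimal exploration sketch, assuming pandas and matplotlib and reusing the hypothetical interest-rate data frame from the terminology example above: it computes descriptive statistics per attribute and draws a credit score vs. interest rate scatterplot in the spirit of Fig. 2.3.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical interest-rate data, as in the earlier terminology sketch.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "credit_score": [500, 600, 700, 750, 800],
    "interest_rate": [9.5, 8.0, 7.1, 6.4, 5.9],
})

# Descriptive statistics: mean, standard deviation, min/max, and quartiles per attribute.
print(df[["credit_score", "interest_rate"]].describe())

# Scatterplot of credit score vs. interest rate; the downward trend shows the
# interest rate decreasing as the credit score increases.
plt.scatter(df["credit_score"], df["interest_rate"])
plt.xlabel("Credit score")
plt.ylabel("Interest rate (%)")
plt.title("Credit score vs. interest rate")
plt.show()
```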
2- Data Quality
Data quality is an ongoing concern wherever data is collected, processed, and stored. In the interest rate dataset (Table 2.1), how does one know whether the credit score and interest rate data are accurate? What if a credit score has a recorded value of 900 (beyond the theoretical limit), or if there was a data entry error? Errors in the data will impact the representativeness of the model.
Organizations use data alerts, cleansing, and transformation techniques to improve and manage the quality of the data and store it in company-wide repositories called data warehouses. Data sourced from well-maintained data warehouses have higher quality, as there are proper controls in place to ensure a level of data accuracy for new and existing data. Data cleansing practices include the elimination of duplicate records, quarantining outlier records that exceed the bounds, standardization of attribute values, and substitution of missing values.

3- Handling Missing Values
One of the most common data quality issues is that some records have missing attribute values. There are several mitigation methods to deal with this problem, but each method has pros and cons. The first step in managing missing values is to understand the reason why the values are missing. Missing credit score values can be replaced with a credit score derived from the dataset (mean, minimum, or maximum value, depending on the characteristics of the attribute). This method is useful if the missing values occur randomly and the frequency of occurrence is quite rare. Alternatively, to build a representative model, all the data records with missing values, or records with poor data quality, can be ignored. This method reduces the size of the dataset.

4- Data Type Conversion
The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer numeric (credit score), or categorical. For example, the credit score can be expressed as categorical values (poor, good, excellent) or as a numeric score. In the case of linear regression models, the input attributes have to be numeric; if the available data are categorical, they must be converted to continuous numeric attributes. Numeric values can be converted to categorical data types by a technique called binning, where a range of values is specified for each category; for example, a score between 400 and 500 can be encoded as "low", and so on.

5- Transformation
In some data science algorithms, like k-nearest neighbor (k-NN), the input attributes are expected to be numeric and normalized, because the algorithm compares the values of different attributes and calculates the distance between data points. Normalization prevents one attribute from dominating the distance results because of large values. For example, consider income (expressed in USD, in thousands) and credit score (in hundreds); the distance calculation will always be dominated by slight variations in income. One solution is to convert the ranges of income and credit score to a more uniform scale from 0 to 1 by normalization. This way, a consistent comparison can be made between two different attributes with different units.
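The sketch below ties together steps 3-5 (handling missing values, binning, and normalization) on a hypothetical data frame. The column names, bin boundaries, and values are illustrative assumptions, not the book's actual data.

```python
import pandas as pd

# Hypothetical records with a missing credit score and an income attribute.
df = pd.DataFrame({
    "credit_score": [520, None, 680, 745, 810],
    "income": [35_000, 48_000, 62_000, 90_000, 120_000],  # USD
    "interest_rate": [9.8, 8.9, 7.3, 6.2, 5.5],
})

# Step 3 (missing values): substitute the attribute mean for the missing credit score.
df["credit_score"] = df["credit_score"].fillna(df["credit_score"].mean())

# Step 4 (type conversion): bin the numeric credit score into categories.
df["score_band"] = pd.cut(
    df["credit_score"],
    bins=[400, 500, 600, 700, 900],
    labels=["low", "fair", "good", "excellent"],
)

# Step 5 (transformation): min-max normalization to a 0-1 scale, so income does not
# dominate distance calculations simply because of its larger magnitude.
for col in ["credit_score", "income"]:
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

print(df)
```

Min-max scaling is used here because the text describes mapping both attributes onto a common 0-1 range; other normalization schemes (such as z-scores) would serve the same purpose of making distances comparable.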
6- Outliers
Outliers are anomalies (abnormal values) in a given dataset. Outliers may occur because of correct data capture (a few people with incomes in the tens of millions) or erroneous data capture (a human height recorded as 1.73 cm instead of 1.73 m). Regardless, the presence of outliers needs to be understood and will require special treatment. Detecting outliers may be the primary purpose of some data science applications, such as fraud or intrusion detection.

7- Feature Selection
The example dataset shown in Table 2.1 has one attribute or feature (the credit score) and one label (the interest rate). In practice, many data science problems involve a dataset with hundreds to thousands of attributes. A large number of attributes in the dataset significantly increases the complexity of a model and may degrade its performance due to the curse of dimensionality. Not all the attributes are equally important or useful in predicting the target.

8- Sampling
Sampling is the process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. The sample serves as a representative of the original dataset with similar properties, such as a similar mean. Sampling reduces the amount of data that needs to be processed and speeds up the modeling process. In most cases, it is sufficient to work with samples to gain insights, extract the information, and build representative predictive models. Theoretically, the error introduced by sampling affects the relevancy of the model, but its benefits far outweigh the risks.
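A short sampling sketch, assuming pandas and a synthetic dataset: it draws a 10% random sample and compares the sample mean with the original mean as a quick check that the sample is representative.

```python
import pandas as pd

# Synthetic "full" dataset; the trend is made up for illustration only.
full = pd.DataFrame({
    "credit_score": range(300, 900),
    "interest_rate": [12 - 0.01 * s for s in range(300, 900)],
})

# Draw a 10% random sample; random_state makes the sample reproducible.
sample = full.sample(frac=0.1, random_state=42)

# The sample should preserve key properties of the original, such as the mean.
print("full mean:  ", full["interest_rate"].mean())
print("sample mean:", sample["interest_rate"].mean())
```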