Data Analysis Introduction PDF
Summary
This document provides an introduction to data analysis and the different types of data, including structured and unstructured types, and surveys applications of data science. Later units cover data science frameworks and the CRISP-DM methodology, basic statistics and regression, and data pre-processing. The document is suitable for an introductory data analysis course, or a broader introductory business course.
Unit 1: Introduction
1. Essentials of data
2. What is Data Analysis?
3. Applications of Data Science

Data
Raw data
Information/Processed data

Various forms of data
Structured Data
⚫ Tabular data (rectangular data), i.e. rows and columns from a database.
⚫ Structured data usually resides in relational databases (RDBMS).
⚫ Structured data is generated by both humans and machines.
Types of Structured Data
1) Numerical
2) Categorical
1) Numerical Data: data that is expressed on a numerical scale.
i) Continuous: data that can take any value in an interval. For example, the speed of a car, heart rate, height, weight, etc.
ii) Discrete: data that can only take certain values, such as counts. For example, the number of heads in 20 flips of a coin, or the shoe size of a person.
2) Categorical Data
⚫ Data that can take only a specific set of values representing possible categories.
⚫ Binary — a special case of categorical data where the features are dichotomous, i.e. can accept only 0/1 or True/False.
⚫ Ordinal — categorical data that has an explicit ordering. For example, the five-star rating of a restaurant (1, 2, 3, 4, 5).
Unstructured Data
⚫ Data with no pre-defined model/structure. For example: images, textual data, audio and video.
⚫ Text files: word processing, spreadsheets, presentations, emails, logs.
⚫ Email: email has some internal structure thanks to its metadata, and we sometimes refer to it as semi-structured. However, its message field is unstructured and traditional analytics tools cannot parse it.
⚫ Social media: data from Facebook, Twitter, LinkedIn.
⚫ Websites: YouTube, Instagram, photo sharing sites.
⚫ Mobile data: text messages, locations.
⚫ Communications: chat, IM, phone recordings, collaboration software.
⚫ Media: MP3, digital photos, audio and video files.
⚫ Business applications: MS Office documents, productivity applications.
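To make the structured data types above (continuous, discrete, binary, ordinal) concrete, here is a small, hypothetical pandas sketch; the column names and values are invented purely for illustration:

import pandas as pd

# Hypothetical tabular (structured) data illustrating the feature types above
df = pd.DataFrame({
    "height_cm": [165.0, 172.5, 158.2],                  # numerical, continuous
    "heads_in_20_flips": [9, 12, 8],                      # numerical, discrete (a count)
    "is_member": [True, False, True],                     # categorical, binary
    "star_rating": pd.Categorical([3, 5, 4],              # categorical, ordinal (1-5 stars)
                                  categories=[1, 2, 3, 4, 5],
                                  ordered=True),
})
print(df.dtypes)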
What is Data Analysis?
Data analysis is the process of collecting, modeling, and analyzing data to extract insights that support decision-making.
Data analysis is the practice of working with data to glean useful information, which can then be used to make informed decisions.
Data analysis is defined as a process of cleaning, transforming, and modeling data to discover useful information for business decision-making.
The purpose of data analysis is to extract useful information from data and to take decisions based upon that analysis.

Why is Data Analysis important?
⚫ Thinking — we take decisions in day-to-day life.
⚫ In short, analyzed data reveals insights that tell you what your customers need and where you need to focus your efforts.
⚫ By analyzing data, you can find out what customers love or hate about your products and services.
⚫ From a management perspective, you can also benefit from analyzing your data, as it helps you make business decisions based on facts and not simple intuition. For example, you can understand where to invest your capital, detect growth opportunities, predict your income, or tackle uncommon situations before they become problems.
⚫ If your business is not growing, then you have to look back, acknowledge your mistakes, and make a plan again without repeating those mistakes.

Applications of Data Science
⚫ Fraud and Risk Detection
⚫ Healthcare
⚫ Internet Search
⚫ Targeted Advertising
⚫ Website Recommendations
⚫ Advanced Image Recognition
⚫ Speech Recognition
⚫ Airline Route Planning
⚫ Gaming
⚫ Augmented Reality

Fraud and Risk Detection
⚫ Companies were fed up with bad debts and losses every year.
⚫ Loan sanctioning requires collection of data.
⚫ Banking companies learned to divide and conquer data.
⚫ Customer profiling, past expenditures, and other essential variables are used to analyze the probabilities of risk and default.

Healthcare
⚫ Medical image analysis: X-rays, MRIs, CT scans.
⚫ Drug development.
⚫ Virtual assistance for patients and customer support, e.g. Your.MD and Ada.
⚫ Genetics & genomics: genomic data science applies statistical techniques to genomic sequences, allowing bioinformaticians and geneticists to understand defects in genetic structures. It is also helpful in classifying diseases that are genetic in nature.

Internet Search
⚫ Search engines: Google, Yahoo, Bing, Ask, AOL, etc.
⚫ They return the best results for a searched query in a fraction of a second.
⚫ Google processes more than 20 petabytes of data every day.

Targeted Advertising
⚫ Digital ads have been able to get a much higher CTR (click-through rate) than traditional advertisements.
⚫ They can be targeted based on a user's past behavior.
⚫ This is the reason why we might see ads for things we previously searched for.

Website Recommendations
⚫ Websites like Amazon not only help you find relevant products from the billions of products available with them but also add a lot to the user experience.
⚫ A lot of companies have used recommendation engines to promote their products in accordance with users' interests and the relevance of information.
⚫ Ex: Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb.

Advanced Image Recognition
⚫ When we upload an image with friends on Facebook, we start getting suggestions to tag our friends.
⚫ This automatic tag suggestion feature uses a face recognition algorithm.
⚫ In their latest update, Facebook has outlined the additional progress they have made in this area, making specific note of their advances in image recognition accuracy and capacity.

Speech Recognition
⚫ Examples: Google Voice, Siri, Cortana, etc.
⚫ With a speech-recognition feature, even if you aren't in a position to type a message, your life wouldn't stop.
⚫ Simply speak out the message and it will be converted to text.

Airline Route Planning
⚫ Predict flight delays.
⚫ Decide which class of airplanes to buy.
⚫ Decide whether to fly directly to the destination or take a halt in between (for example, a flight can take a direct route from New Delhi to New York, or it can choose to halt in another country on the way).
⚫ Effectively drive customer loyalty programs.

Gaming
⚫ Games are now designed using machine learning algorithms which improve/upgrade themselves as the player moves up to a higher level.
⚫ In motion gaming also, your opponent (the computer) analyzes your previous moves and shapes up its game accordingly.
⚫ EA Sports, Zynga, Sony, Nintendo, and Activision Blizzard have taken the gaming experience to the next level using data science.

Augmented Reality
⚫ It is our reality augmented with digital content.
⚫ A VR headset incorporates computing hardware, algorithms, and data to provide you with the best viewing experience, so data science and virtual reality have a close connection.
⚫ The popular game Pokemon GO is a modest step in the right direction.

References:
⚫ https://www.datacamp.com/
⚫ https://towardsdatascience.com/
⚫ https://www.kaggle.com/
⚫ https://www.analyticsvidhya.com
⚫ https://www.guru99.com

Questions?

Unit 2: Frameworks of Data Science
Introduction
Frameworks in data science
CRISP-DM Methodology

What is a Framework?
Most people probably have a basic understanding that programming involves writing lines of code.
However, writing lines of code from scratch for each and every project is tedious. Frameworks and libraries shorten the creative process and allow programmers to take advantage of tried-and-true programmatic solutions to common problems.
Frameworks and libraries are essentially starting blocks for creating code. These code blocks have been built, tested, and optimized by a community.
A framework is a collection of individual software components, available in code form and ready to run (what we call libraries), that can be run independently or together to achieve a complicated task on any machine. The important part is "ready to run".

Why use a framework? / Benefits
Frameworks create better code.
Frameworks are pre-tested and pre-optimized: they save time by using pre-tested and pre-optimized code rather than starting from scratch.
Faster implementation: teams can spend less time designing and testing and more time analyzing and optimizing the models.

Frameworks in Data Science
TensorFlow, Scikit-learn, Keras, Pandas, Spark MLlib, PyTorch, Matplotlib, NumPy, Seaborn

TensorFlow
TensorFlow is an "end-to-end open source machine learning platform" that helps data scientists develop and train machine learning (ML) models. It is especially useful for efficiently building fast prototypes. Models can be trained and deployed in the cloud using languages data scientists are already familiar with.

Scikit-learn
Scikit-learn is an easy-to-learn, open-source Python library for machine learning built on NumPy, SciPy, and Matplotlib. It can be used for data classification, regression, clustering, dimensionality reduction, model selection, and pre-processing. A short example follows below.

Pandas
Pandas (the name derives from "panel data") is a Python library used for exploring, cleaning, transforming, and visualizing data so it can be used in machine learning models and training. It is open source and built on top of NumPy. Pandas provides the Series and DataFrame data structures (older versions also had a Panel structure).
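As a minimal illustration of the classification use case mentioned above, here is a sketch using scikit-learn's built-in iris toy dataset (this example is not part of the original slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a simple classifier and check its accuracy on the held-out data
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))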
Data Analytics Framework / CRISP-DM Framework (Cross-Industry Standard Process for Data Mining)
CRISP-DM is a process model describing the life cycle of a data science project. It guides you through all the phases of planning, organizing, and implementing your data mining project.

Why is CRISP-DM required?
Handling complex business problems.
Dealing with multiple data sources and their quality.
Applying various data mining techniques.
Measuring the success of data mining projects.

CRISP-DM Phases:
1. Business Understanding — understanding project objectives and requirements; data mining problem definition.
2. Data Understanding — initial data collection and familiarization; identify data quality issues; initial, obvious results.
3. Data Preparation — record and attribute selection; data cleansing.
4. Modelling — run the data mining tools.
5. Evaluation — determine if results meet business objectives; identify business issues that should have been addressed earlier.
6. Deployment — put the resulting models into practice; set up for continuous mining of the data.

1. Business Understanding
States the goal in business terminology.
Determine business objectives; describe the problem in general terms; business questions; expected benefits.
Define success criteria, resources, constraints, risks, costs, and benefits.
Understand the project objectives and requirements from a business perspective.
Convert this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

2. Data Understanding
Collect data: what are the data sources?
Data description.
Data exploration.
Getting familiar with the data.
Identifying data quality problems.
Discovering first insights into the data.

3. Data Preparation
Integration of data; select data.
Cleaning, constructing, formatting, and integrating data.
Covers all activities to construct the final dataset (the data that will be fed into the modeling tool(s)) from the initial raw data.
Data preparation tasks are likely to be performed multiple times, and not in any prescribed order.
Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools.

4. The Modelling Phase
Select appropriate modelling techniques; build models; assess models.
Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.
Typically, there are several techniques for the same data mining problem type.
Some techniques have specific requirements on the form of the data, so stepping back to the data preparation phase is often needed.

5. The Evaluation Phase
Validate the model; evaluate results.
At this stage, a model (or models) that appears to have high quality, from a data analysis perspective, has been built.
Before proceeding to final deployment of the model, it is important to evaluate it more thoroughly and review the steps executed to construct it, to be certain it properly achieves the business objectives.
A key objective is to determine whether there is some important business issue that has not been sufficiently considered.
At the end of this phase, a decision on the use of the data mining results should be reached.

6. The Deployment Phase
Plan deployment; maintenance; report generation and review.
Creation of the model is generally not the end of the project.
Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use.
In many cases it will be the customer, not the data analyst, who carries out the deployment steps.

References:
https://www.jigsawacademy.com/blogs/data-science/data-science-framework/
Data science process: https://towardsdatascience.com/the-data-science-process-a19eb7ebc41b
https://aptude.com/data-science/entry/31-data-science-programming-frameworks-and-interfaces

Unit III: Essentials of Statistical Learning
Basics of Statistics: mean, median, standard deviation, variance, correlation, covariance
Introduction to Regression

⦁ Statistics is the field of mathematics that deals with the collection, tabulation, and interpretation of numerical data.
⦁ It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation.
⦁ Population — a collection (set) of individuals, objects, or events whose properties are to be analyzed.
⦁ Sample — a subset of a population.
⦁ Descriptive statistics uses data to provide a description of the population, either through numerical calculations or graphs or tables.
⦁ It provides a graphical summary of data. It is simply used for summarizing objects.
⦁ Measure of central tendency — also known as summary statistics, it is used to represent the center point or a typical value of a data set or sample.
⦁ In statistics, there are three common measures of central tendency:
Mean
Median
Mode
Note: mean, median, and mode are also known as central parameters.
⦁ The mean value is the average value.
⦁ To calculate the mean, find the sum of all values and divide the sum by the number of values.
In Python, to calculate the mean:
1) mean()
Syntax: mean([data-set])
Parameters: [data-set] : list or tuple of a set of numbers.
Example 1:
import statistics
# list of positive integer numbers
data1 = [1, 3, 4, 5, 7, 9, 2]
x = statistics.mean(data1)
# Printing the mean
print("Mean is :", x)
Example 2:
from statistics import mean
# tuple of positive integer numbers
data1 = (11, 3, 4, 5, 7, 9, 2)
print(mean(data1))
2) numpy.mean(arr, axis=None)
Parameters:
arr : [array_like] input array.
axis : [int or tuple of ints] axis along which we want to calculate the arithmetic mean. Otherwise, arr is considered flattened (the mean is taken over all axes). axis = 0 means along the column and axis = 1 means along the row.
# Python program illustrating the numpy.mean() method
import numpy as np
# 1D array
arr = [20, 2, 7, 1, 34]
print("arr : ", arr)
print("mean of arr : ", np.mean(arr))

⦁ The mode of a set of data values is the value that appears most often, i.e. the value which appears the maximum number of times in the given data.
Example: given the data set [1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 7, 7, 8], the mode is 4.
mode([data-set])
Parameters: [data-set] : a tuple, list, or iterator of real-valued numbers or strings.
import statistics
# declaring a simple data-set consisting of integers
set1 = [1, 2, 3, 3, 4, 4, 4, 5, 5, 6]
print(statistics.mode(set1))

⦁ The median is the middle value of the sorted data set (for an even number of values, it is the average of the two middle values).
Example: 77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103 → (86 + 87) / 2 = 86.5
In Python, to calculate the median use:
1) median([data-set])
Syntax: median([data-set])
Parameters: [data-set] : list or tuple or an iterable with a set of numeric values.
import statistics
# unsorted list of random integers
data1 = [2, -2, 3, 6, 9, 4, 5, -1]
# Printing median of the random data-set
print("Median of data-set is : ", statistics.median(data1))
2) Using the numpy module:
import numpy as np
speed = [99, 86, 87, 88, 86, 103, 87, 94, 78, 77, 85, 86]
x = np.median(speed)
print(x)

⦁ A measure of variability is a summary statistic that represents the amount of dispersion in a dataset.
⦁ How spread out are the values? While a measure of central tendency describes the typical value, measures of variability define how far away the data points tend to fall from the center.
⦁ Variability is described in the context of a distribution of values.
⦁ Low dispersion indicates that the data points tend to be clustered tightly around the center.
⦁ High dispersion signifies that they tend to fall further away.
Measures of variability:
1) Range
2) Variance
3) Standard Deviation
4) Covariance
5) Correlation

⦁ Range:
✓ The range of the data is the difference between the maximum and the minimum values of the observations in the data.
✓ For example, say we have data on the number of customers walking into the store over a week:
10, 14, 8, 10, 15, 4, 7
Minimum value in the data = 4
Maximum value in the data = 15
Range = maximum value − minimum value = 15 − 4 = 11
a = [1, 2, 3, 4, 5]
print("range=", max(a) - min(a))

⦁ Variance is calculated by taking the difference of each number in the dataset from the mean, squaring each difference, summing all the squared differences, and finally dividing by the number of values in the dataset.
⦁ A large variance indicates that the numbers in the dataset are far from the mean and far from each other.
⦁ A small variance, on the other hand, indicates that the numbers are close to the mean and to each other.
⦁ A variance of 0 indicates that all the numbers in the dataset are identical.
⦁ The value of variance is always non-negative (0 or more).
⦁ It is useful to be able to visualize the distribution of data in a dataset.
⦁ In Python, variance() is used for this.
⦁ Syntax: variance([data], xbar)
Parameters:
[data] : an iterable with real-valued numbers.
xbar (optional) : takes the actual mean of the data-set as value.
Return type: returns the actual variance of the values passed as parameter.
# Importing the statistics module
import statistics
# Creating a sample of data
sample = [2.74, 1.23, 2.63, 2.22, 3, 1.98]
# Prints variance of the sample set
print("Variance of sample set is ", statistics.variance(sample))
We can also use numpy.var():
numpy.var(sample)

⦁ Exercise: calculate the range and variance of the following data.
1) -3 -3 -3 -3 0 3 3 3 3
2) -4 -2 0 -2 6 4 6 0 -6 4

⦁ Standard deviation is the square root of variance.
⦁ Variance is the average of the squared differences of the values in a data set from the mean value.
⦁ Standard deviation is a measure of spread in statistics.
⦁ It is used to quantify the spread, or variation, of a set of data values.
⦁ It is very similar to variance: it gives the measure of deviation, whereas variance provides the squared value.
⦁ A low standard deviation indicates that the data are less spread out,
⦁ whereas a high standard deviation shows that the data in the set are spread far apart from their mean value.
SD = sqrt( Σ (xi − x̄)² / N )
⦁ where x1, x2, x3, ..., xn are the observed values in the sample data, x̄ is the mean value of the observations,
⦁ and N is the number of sample observations.
⦁ Syntax: stdev([data-set], xbar)
Parameters:
[data] : an iterable with real-valued numbers.
xbar (optional) : takes the actual mean of the data-set as value.
Return type: returns the actual standard deviation of the values passed as parameter.
import statistics
# creating a simple data-set
sample = [1, 2, 3, 4, 5]
# Prints the standard deviation of the sample
# (xbar is not supplied, so the mean is computed from the sample itself)
print("Standard Deviation of sample is", statistics.stdev(sample))
⦁ numpy.std() parameters:
a : array_like — calculate the standard deviation of these values.
axis : None or int or tuple of ints, optional.
Example:
import numpy as np
a = [1, 2, 3, 4, 5, 6]
print(np.std(a))

⦁ Variance measures the spread of data around its mean value,
⦁ whereas covariance measures the relationship between two random variables.
⦁ In statistics, covariance is the measure of the directional relationship between two random variables:
⦁ how much will one variable change when another variable changes.
⦁ If COV(xi, xj) = 0 then the variables are uncorrelated.
⦁ If COV(xi, xj) > 0 then the variables are positively correlated.
⦁ If COV(xi, xj) < 0 then the variables are negatively correlated.
Covariance between two random variables is calculated by taking the product of the differences between the values of the random variables and their means, summing all the products, and finally dividing by the number of values in the dataset.
⦁ Syntax: numpy.cov(m)
⦁ m : [array_like] a 1D or 2D array of variables;
by default, each row of m represents a variable and each column a single observation of all those variables.
Example:
# Python code to demonstrate the use of numpy.cov
from numpy import array
from numpy import cov
x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(x)
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
print(y)
Sigma = cov(x, y)
print(Sigma)
A 2D array can also be passed directly:
import numpy as np
x = np.array([[0, 3, 4], [1, 2, 4], [3, 4, 5]])
print("Shape of array:\n", np.shape(x))
print("Covariance matrix of x:\n", np.cov(x))

⦁ Correlation is a statistical technique that helps to measure and analyze the degree of relationship between two variables.
⦁ It refers to the extent to which two variables are associated with each other.
The measure of correlation is called the correlation coefficient.
The degree of relationship is expressed by a coefficient which ranges from -1 to +1.
The direction of change is indicated by the sign.
Correlation analysis enables us to get an idea about the degree and direction of the relationship between the two variables under study.
⦁ Correlation is a statistical technique used to determine the degree to which two variables are related.
⦁ If a change in one variable is associated with a change in the other variable, the variables are said to be correlated. The correlation may be positive, negative, or zero.
⦁ Positive correlation: the value of attribute A increases with an increase in the value of attribute B, and vice versa. Ex. income & expenses, height & weight.
⦁ Negative correlation: the value of attribute A decreases with an increase in the value of attribute B, and vice versa. Ex. expenses and savings.
⦁ Zero correlation: the values of attribute A vary at random with respect to B, and vice versa.
Positive relationships:
⦁ water consumption and temperature.
⦁ study time and grades.
Negative relationships:
⦁ alcohol consumption and driving ability.
⦁ price & quantity demanded.
⦁ numpy.corrcoef(x, y):
Parameters:
x : array_like — a 1-D or 2-D array containing multiple variables and observations.
y : array_like — a 1-D or 2-D array.
Example:
import numpy as np
exp = np.array([1, 2, 3, 4, 5])
sal = np.array([15000, 20000, 30000, 40000, 50000])
mat = np.corrcoef(exp, sal)
print("correlation matrix=\n", mat)
Displays the output as:
[[1.         0.99388373]
 [0.99388373 1.        ]]

⦁ Regression analysis is a way to find trends in data.
⦁ Regression analysis is a statistical technique for analysing and understanding the connection between two or more variables of interest.
⦁ The methodology used to perform regression analysis helps in understanding which factors are significant, which may be ignored, and how they interact with one another.
⦁ It helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables.
⦁ It is mainly used for prediction, forecasting, time series modelling, and determining the cause-and-effect relationship between variables.
⦁ Dependent variable: the main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.
⦁ Independent variable: the factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also called predictors.
⦁ Outliers: an outlier is an observation which contains either a very low value or a very high value in comparison to other observed values. An outlier may hamper the result, so it should be avoided.

1. Linear Regression:
⦁ The relationship between a dependent variable and a single independent variable is described using a basic linear regression methodology.
⦁ Linear regression is a statistical regression method which is used for predictive analysis.
⦁ A simple linear regression model shows a linear (straight-line) relationship, hence the name.
⦁ The simple linear model is expressed using the following equation:
Y = a + bX + ϵ
Where:
Y – dependent variable
X – independent (explanatory) variable
a – intercept
b – slope
ϵ – residual (error)
⦁ Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression.
⦁ If there is only one input variable (x), such linear regression is called simple linear regression. If there is more than one input variable, it is called multiple linear regression.
⦁ The relationship between the variables in the linear regression model can be shown as a scatter plot with a fitted line.
⦁ Here, for example, we predict the salary of an employee on the basis of years of experience.

⦁ Logistic regression is another algorithm, used to solve classification problems.
⦁ The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or not spam, etc.
⦁ It is a predictive analysis algorithm which works on the concept of probability.
⦁ Logistic regression is a type of regression, but it differs from the linear regression algorithm in how it is used.
⦁ Logistic regression uses the sigmoid function (logistic function) to model the data. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
f(x) = output between 0 and 1
x = input to the function
e = base of the natural logarithm
When we provide the input values (data) to the function, it gives the S-curve, as shown in the sketch below.
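The original slide shows the S-curve as a figure; as a stand-in, here is a small NumPy/Matplotlib sketch that computes the sigmoid from the formula above and plots its S-shape (the variable names are invented for illustration):

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); the output always lies between 0 and 1
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 200)
print(sigmoid(np.array([-5, 0, 5])))   # approx [0.0067, 0.5, 0.9933]

plt.plot(x, sigmoid(x))
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Sigmoid (logistic) function")
plt.show()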
Data collection
Why Pre-processing?
Methods of pre-processing
Data Cleaning
Data Integration
Data Reduction: attribute subset selection, histograms, clustering and sampling
Data Transformation & Data Discretization: normalization, binning, histogram analysis

⦁ Data is truly considered a resource in today's world.
⦁ As per the World Economic Forum, by 2025 we will be generating about 463 exabytes of data globally per day!
⦁ But is all this data fit enough to be used by machine learning algorithms? How do we decide that?
⦁ For this, data pre-processing is needed — transforming the data such that it becomes machine-readable.
⦁ Data pre-processing is the process of transforming raw data into an understandable format.
⦁ We cannot work with raw data. The quality of the data should be checked before applying machine learning algorithms.
⦁ In other words, the features of the data can then be easily interpreted by the algorithm.
⦁ Data gets transformed, or encoded, to bring it to such a state that the machine can easily understand it.
⦁ Data objects are described by a number of features that capture the basic characteristics of an object,
⦁ for example the mass of a physical object or the time at which an event occurred.
⦁ Features are often called variables, characteristics, fields, attributes, or dimensions.
⦁ A feature is an individual measurable property or characteristic of a phenomenon being observed.
⦁ For example: color, mileage, and power can be considered features of a car.
⦁ There are different types of features that we can come across when we deal with data.
⦁ Categorical: features whose values are taken from a defined set of values. For example: days of the week, the Boolean set.
⦁ Numerical: features whose values are continuous or integer-valued. For example: the number of steps you walk in a day, or the speed at which you are driving your car.

Major Tasks in Data Pre-processing:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation

⦁ Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from datasets.
⦁ Data cleaning aims at filling in missing values, smoothing out noise while determining outliers, and rectifying inconsistencies in the data.
⦁ What is data cleaning — removing null records, dropping unnecessary columns, treating missing values, rectifying junk values (otherwise called outliers), restructuring the data into a more readable format, etc., is known as data cleaning.
⦁ It also replaces missing values.
⦁ There are several techniques in data cleaning:

1) Removing null/duplicate records (ignoring the tuple)
⦁ If a significant amount of data is missing in a particular row, it is better to drop that row, as it would not add any value to our model.
name  score  address  height  weight
A     56     Goa      165     56
B     45     Mumbai   3       65
C     87     Delhi    170     58
D
E     99     Mysore   167     60
⦁ As we can see, most of the data corresponding to student "D" is missing, hence we drop that particular row.
Example:
import pandas as pd
df = pd.read_csv("student.csv")
newdf = df.dropna()
The output will be:
name  score  address  height  weight
A     56     Goa      165     56
B     45     Mumbai   3       65
C     87     Delhi    170     58
E     99     Mysore   167     60
Student "D" is removed here.

2) Dropping unnecessary columns:
⦁ When we receive data from stakeholders, it is generally huge. There can be a lot of data that does not add any value to our model.
⦁ Such data is better removed, as it would otherwise consume valuable resources like memory and processing time.
⦁ Example: when looking at students' performance on a test, the students' weight or height does not contribute anything to the model.
df.drop(['height', 'weight'], axis=1, inplace=True)
# Drops the height and weight columns from the dataframe
(Note: the parameter inplace=True makes the changes in the original dataframe.)

3) Treating missing values:
⦁ Missing data is a deceptively tricky issue in machine learning.
⦁ We cannot just ignore or remove the missing observations.
⦁ They must be handled carefully, as they can be an indication of something important.
⦁ We can also fill in a missing value with the help of the attribute mean.
⦁ In Python we have:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
fillna() is used to fill NaN values using the specified method.
Parameters:
value : any value, for example zero
method : bfill, ffill
axis : 0 or 'index', 1 or 'columns'
inplace : bool, default False
limit : int, default None
Student_df['col_name'].fillna((Student_df['col_name'].mean()), inplace=True)
⦁ This syntax fills the NaN values of a specific column with the mean value of that column.
We can also use the interpolate() method for missing values:
The linear method ignores the index, treats missing values as equally spaced, and finds the best point to fit the missing value based on the previous points. If the missing value is at the first index, it is left as NaN. Let's apply it to our dataframe (see the sketch after this section):
Syntax: df.interpolate(method='linear')
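A minimal runnable sketch of the fillna-with-mean and interpolate() techniques above, using a small hypothetical DataFrame (the column names and values are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C", "D"],
                   "score": [56, np.nan, 87, 99]})

# Fill the missing score with the column mean
filled = df.copy()
filled["score"] = filled["score"].fillna(filled["score"].mean())

# Or interpolate linearly between the neighbouring values
interpolated = df.copy()
interpolated["score"] = interpolated["score"].interpolate(method="linear")

print(filled)
print(interpolated)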
⦁ Data integration is a data pre-processing technique that involves combining data from multiple heterogeneous data sources into a coherent data store and providing a unified view of the data.
⦁ "Data integration is the process of combining data from different sources into a single, unified view."
⦁ It delivers trusted data from various sources.
⦁ The focus is on identification and retrieval of datasets from internal sources and storage of these datasets into warehouses.
⦁ Integration of data from multiple sources into one or more targets.
Challenges in data integration:
⦁ Schema integration & object matching: different companies have different ways of designing their datasets, so it becomes very difficult to integrate columns from them.
⦁ Redundancy: unwanted columns/information.
⦁ Detection & resolution of data value conflicts: for example, company A has a price column in rupees and company B has prices in dollars ($); then again it is very difficult to integrate them.

⦁ Data reduction techniques help to minimize the size of a dataset without affecting the result.
⦁ Data reduction techniques ensure the integrity of data while reducing the data.
⦁ Data reduction techniques:
1) Dimensionality reduction: eliminates attributes from the data set under consideration, thereby reducing the volume of the original data. For example: wavelet transform, Principal Component Analysis (PCA), attribute subset selection.
2) Numerosity reduction: reduces the volume of the original data and represents it in a much smaller form. For example: regression, histograms, clustering, sampling, data cube aggregation.
3) Data compression: a technique where a data transformation is applied to the original data in order to obtain compressed data. If the compressed data can be reconstructed back into the original data without losing any information, it is a 'lossless' data reduction.

⦁ In the data transformation process, data are transformed from one format to another format that is more appropriate for data mining.
⦁ In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.
⦁ It involves processes such as smoothing, aggregation, generalization, and normalization:
1. Smoothing: removing the noise from data (binning, regression, clustering).
2. Aggregation: summary/aggregate functions are applied to construct a data cube.
3. Generalization: low-level concepts are replaced with high-level concepts. Example: street is converted into city/country.
4. Normalization: attribute values are normalized by scaling them so that they fall in a specified range (MinMaxScaler, z-score, etc.).

⦁ Equal-width (distance) partitioning:
⦁ It divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A) / N.
⦁ It is the most straightforward approach, but outliers may dominate the presentation.
⦁ Skewed data is not handled well.
⦁ Equal-depth (frequency) partitioning:
⦁ It divides the range into N intervals, each containing approximately the same number of samples.
⦁ It gives good data scaling, but managing categorical attributes can be tricky.
A short pandas sketch of both partitioning schemes follows below.
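Here is a small sketch of the two partitioning schemes using pandas (pd.cut for equal-width bins and pd.qcut for equal-depth bins), applied to the price data used in the binning example that follows; the variable names are chosen for illustration only:

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width partitioning: 3 intervals of equal size, W = (max - min) / N
equal_width = pd.cut(prices, bins=3)
print(equal_width.value_counts())

# Equal-depth (frequency) partitioning: 3 intervals with roughly equal numbers of samples
equal_depth = pd.qcut(prices, q=3)
print(equal_depth.value_counts())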
⦁ Binning methods are used for data smoothing.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

⦁ A histogram is a graphical representation of the distribution of a dataset.
⦁ A histogram is a plot that lets you show the underlying frequency distribution or the probability distribution of a single continuous numerical variable.
⦁ Histograms are two-dimensional plots with two axes; the vertical axis is a frequency axis, whilst the horizontal axis is divided into a range of numeric values (intervals or bins) or time intervals.
⦁ The frequency of each bin is shown by the area of vertical rectangular bars.
⦁ Each bar covers a range of continuous numeric values of the variable under study.
⦁ The vertical axis shows frequency values derived from counts for each bin.
A histogram reveals:
⦁ The frequency of different data points in the dataset.
⦁ The location of the center of the data.
⦁ The spread of the dataset.
⦁ The skewness/variance of the dataset.
⦁ The presence of outliers in the dataset.

Example: blood sugar readings per patient.
Patient  Blood sugar
AA       113
BB       85
CC       90
DD       150
EE       149
FF       88
GG       93
HH       115
II       135
JJ       80
KK       77
LL       82
MM       129
Ranges: 80-100 Normal, 100-125 Pre-diabetic, 125 and above Diabetic.
Questions:
1) How many patients are normal?
2) How many of them are pre-diabetic?
3) How many are diabetic?

import matplotlib.pyplot as plt
blood_sugar = [85, 90, 150, 149, 88, 93, 115, 135, 80, 77, 82, 129]
plt.hist(blood_sugar)
(By default it plots 10 bins; a bin is an interval of values.)
You can specify any number of bins, as in
plt.hist(blood_sugar, bins=3, rwidth=0.95)
We can also specify the bin edges explicitly:
plt.hist(blood_sugar, bins=[80, 100, 125, 150], rwidth=0.95, color='g')
plt.xlabel("sugar range")
plt.ylabel("total no of patients")
plt.title("blood sugar analysis")
We can also add the blood sugar readings of women to the same data and plot the histogram.

⦁ Sampling is a method that allows us to get information about a population based on statistics from a subset of the population (a sample), without having to investigate every individual.
⦁ Sampling is done to draw conclusions about populations from samples, and it enables us to determine a population's characteristics by directly observing only a portion (or sample) of the population.
⦁ Selecting a sample requires less time than selecting every item in a population.
⦁ Sample selection is a cost-efficient method.
⦁ Analysis of a sample is less cumbersome and more practical than analysis of the entire population.

References:
⦁ https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/
⦁ https://www.geeksforgeeks.org/data-reduction-in-data-mining/
⦁ https://www.javatpoint.com/data-preprocessing-machine-learning
⦁ https://towardsdatascience.com/introduction-to-data-preprocessing-in-machine-learning-a9fa83a5dc9d
⦁ https://www.analyticsvidhya.com/blog/2020/12/tutorial-to-data-preparation-for-training-machine-learning-model/
⦁ https://youtu.be/EC_IeIBlGto