Unit 5: Data Wrangling
Dipesh Joshi
Summary
This document covers data wrangling: the process of transforming and mapping data from one format to another for downstream purposes such as analytics. It describes cleaning and unifying messy data sets for easier access and analysis, and introduces Python libraries and functions for data wrangling and exploratory data analysis.
UNIT 5: DATA WRANGLING

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis. It involves processing data in various ways, such as merging, grouping, and concatenating, in order to analyse the data or get it ready to be used with another data set. Python has built-in features for applying wrangling methods to various data sets, but we will also use the Scikit-learn package.

Playing with Scikit-learn - Understanding classes in Scikit-learn

Understanding how classes work is an important prerequisite for using the Scikit-learn package appropriately. Scikit-learn is the package for machine learning and data science experimentation favored by most data scientists. It contains a wide range of well-established learning algorithms, error functions, and testing procedures. At its core, Scikit-learn features some base classes on which all the algorithms are built.

Apart from BaseEstimator, the class from which all other classes inherit, there are four class types covering all the basic machine-learning functionalities:
- Classifying
- Regressing
- Grouping by clusters
- Transforming data

Even though each base class has specific methods and attributes, the core functionalities for data processing and machine learning are guaranteed by one or more series of methods and attributes called interfaces. The interfaces provide a uniform Application Programming Interface (API) that enforces similarity of methods and attributes across all the different algorithms in the package. There are four Scikit-learn object-based interfaces:
- Estimator: for fitting parameters, learning them from data according to the algorithm
- Predictor: for generating predictions from the fitted parameters
- Transformer: for transforming data by applying the fitted parameters
- Model: for reporting goodness of fit or other score measures

The package groups the algorithms built on the base classes and one or more object interfaces into modules, each module specializing in a particular type of machine-learning solution. For example, the linear_model module is for linear modeling, and metrics is for score and loss measures. To find a specific algorithm in Scikit-learn, we first find the module containing the kind of algorithm that interests us, and then select it from the module's list of contents. The algorithm is typically a class itself, whose methods and attributes are already familiar because they are common to other algorithms in Scikit-learn.
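As a minimal sketch of this uniform API (an addition to the slides, using a synthetic regression problem), two different regressors from the linear_model module expose the same fit, predict, and score methods:

# Sketch (not from the slides): the uniform Estimator/Predictor/Model interfaces
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=0.5, random_state=0)

# Two different algorithms, one interface: fit() learns parameters,
# predict() generates predictions, score() reports goodness of fit (R^2).
for algorithm in (LinearRegression(), Ridge(alpha=1.0)):
    algorithm.fit(X, y)
    print(type(algorithm).__name__, algorithm.predict(X[:1]), algorithm.score(X, y))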
Playing with Scikit-learn - Defining applications for data science

from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
print(X.shape, y.shape)

Output: (506, 13) (506,)

from sklearn.linear_model import LinearRegression
hypothesis = LinearRegression(normalize=True)
hypothesis.fit(X, y)
print(hypothesis.coef_)

import numpy as np
new_observation = np.array([1, 0, 1, 0, 0.5, 7, 59, 6, 3, 200, 20, 350, 4], dtype=float)
print(hypothesis.predict(new_observation.reshape(1, -1)))

Output: 25.8972783977

hypothesis.score(X, y)

Output: 0.74060774286494291

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X)
print(scaler.transform(new_observation.reshape(1, -1)))

Output: [ 0.01116872 0. 0.01979472 0. 0.23662551 0.65893849 0.57775489 0.44288845 0.08695652 0.02480916 0.78723404 0.88173887 0.06263797]

Performing the Hashing Trick

Scikit-learn provides most of the data structures and functionality we need to complete our data science projects. There are even classes for the trickiest and most advanced problems. For instance, when dealing with text, one of the most useful solutions provided by the Scikit-learn package is the hashing trick.

A more serious data science challenge is to analyze online-generated text flows, such as those coming from social networks or large online text repositories. This scenario poses quite a challenge when trying to turn the text into a data matrix suitable for analysis. When working through such problems, knowing the hashing trick gives us quite a few advantages:
- Handling large data matrices based on text on the fly
- Fixing unexpected values or variables in our textual data
- Building scalable algorithms for large collections of documents

Performing the Hashing Trick - Using hash functions

Hash functions can transform any input into an output whose characteristics are predictable. Usually they return a value bound to a specific interval, whose extremities range from negative to positive numbers or span only positive numbers. We can imagine them as enforcing a standard on our data: no matter what values we provide, they always return a specific data product.

The most useful characteristic of hash functions is that, given a certain input, they always provide the same numeric output value. Consequently, they are called deterministic functions. For example, input a word like "keyboard" and the hashing function always returns the same number. In a certain sense, hash functions are like a secret code, transforming everything into numbers. Unlike secret codes, however, we can't convert the hashed code back to its original value. In some rare cases, different words generate the same hashed result (called a hash collision).

Performing the Hashing Trick - Demonstrating the hashing trick

There are many hash functions, with MD5 (often used to check file integrity, because we can hash entire files) and SHA (used in cryptography) being the most popular.
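As a quick illustration of deterministic hashing (a minimal sketch using Python's standard hashlib module, an addition to the slides), the same input always yields the same digest:

# Sketch (not from the slides): deterministic hashing with Python's hashlib
import hashlib

text = 'keyboard'
# The same input always produces the same MD5 and SHA-256 digests.
print(hashlib.md5(text.encode('utf-8')).hexdigest())
print(hashlib.sha256(text.encode('utf-8')).hexdigest())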
Python possesses a built-in hash function named hash that we can use to compare data objects before storing them in dictionaries. For instance, we can test how Python hashes its name:

hash('Python')

-539294296

(In Python 3, string hashes are randomized per interpreter session unless PYTHONHASHSEED is set, so the exact value differs between runs.)

A Scikit-learn hash function can also return an index in a specific positive range. We can obtain something similar with the built-in hash by employing standard division and its remainder:

abs(hash('Python')) % 1000

296

When we take the remainder of the absolute value of the hash result, we get a number that is always smaller than the value we used for the division.

To see how this works, pretend that we want to transform a text string from the Internet into a numeric vector (a feature vector) so that we can use it to start a machine-learning project. A good strategy for managing this task is one-hot encoding, which produces a bag of words. Here are the steps for one-hot encoding a string ("Python for data science") into a vector:
1. Assign a number to each word, for instance Python=0, for=1, data=2, science=3.
2. Initialize the vector, with a length equal to the number of unique words assigned a code in Step 1.
3. Use the codes assigned in Step 1 as indexes for populating the vector, assigning a 1 wherever the corresponding word exists in the phrase.

The resulting feature vector is the sequence [1,1,1,1], made of exactly four elements. We have started the machine-learning process, telling the program to expect sequences of four text features, when suddenly a new phrase arrives and we must vectorize the following text as well: "Python for machine learning". Now we have two new words, "machine" and "learning", to work with. The following steps create the new vectors:
1. Assign the new codes: machine=4, learning=5.
2. Enlarge the previous vector to include the new words: [1,1,1,1,0,0].
3. Compute the vector for the new string: [1,1,0,0,1,1].

One-hot encoding is quite optimal because it creates efficient and ordered feature vectors. Unfortunately, it becomes difficult to handle when our project experiences a lot of variability in its inputs. This is a common situation in data science projects working with text or other symbolic features, where data flowing from the Internet or other online environments can suddenly add to or change our initial data.

Using hash functions is a smarter way to handle unpredictability in our inputs:
1. Define a range for the hash function outputs. All our feature vectors will use that range. The example uses indexes from 0 to 19 (a vector of size 20).
2. Compute an index for each word in our string using the hash function.
3. Assign a unit value to the vector positions corresponding to the word indexes.
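A minimal sketch of Step 2 (an addition to the slides), computing a bounded index for each word with the built-in hash and a vector size of 20; because Python 3 randomizes string hashes per session, the indexes differ between runs:

# Sketch (not from the slides): bounded hash indexes for each word, assuming vector_size = 20
vector_size = 20
for word in 'Python for data science'.split(' '):
    print(word, abs(hash(word)) % vector_size)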
Putting the steps together in a function:

def hashing_trick(input_string, vector_size=20):
    feature_vector = [0] * vector_size
    for word in input_string.split(' '):
        index = abs(hash(word)) % vector_size
        feature_vector[index] = 1
    return feature_vector

Performing the Hashing Trick - Working with deterministic selection

Sparse matrices are the answer when dealing with data that has few non-zero values, that is, when most of the matrix values are zeroes. Sparse matrices store just the coordinates of the cells and their values, instead of storing the information for all the cells in the matrix. When an application requests data from an empty cell, the sparse matrix returns a zero value after looking for the coordinates and not finding them.

from scipy.sparse import csc_matrix
print(csc_matrix([1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]))

The SciPy package offers a large variety of sparse matrix structures, each one storing the data in a different way and each one performing differently. Usually the csc_matrix is a good choice because most Scikit-learn algorithms accept it as input and it is optimal for matrix operations.

Scikit-learn offers HashingVectorizer, a class that rapidly transforms any collection of text into a sparse data matrix using the hashing trick.

import sklearn.feature_extraction.text as txt
sklearn_hashing_trick = txt.HashingVectorizer(n_features=20, binary=True, norm=None)
text_vector = sklearn_hashing_trick.transform(['Python for data science', 'Python for machine learning'])
text_vector

Considering Timing and Performance

Profiling the time that operations require, and measuring how much memory adding more data or performing a transformation on our data takes, can help us spot the bottlenecks in our code and start looking for alternative solutions. IPython is the perfect environment for experimenting, tweaking, and improving our code. Working on blocks of code, recording the results and outputs, and writing additional notes and comments help our data science solutions take shape in a controlled and reproducible way.

Considering Timing and Performance - Benchmarking with timeit

We compare two alternatives for encoding textual information into a data matrix, each addressing different needs:
- CountVectorizer: optimally encodes text into a data matrix but cannot address subsequent novelties in the text.
- HashingVectorizer: provides flexibility when the application is likely to receive new data, but is less optimal than count-based encoding.

Although their advantages are quite clear in terms of how they handle the data, we may wonder what impact using one or the other has on our data processing in terms of speed and memory.

Concerning speed, IPython offers an easy, out-of-the-box solution: the line magic %timeit and the cell magic %%timeit.
- %timeit: calculates the best performance time for a single instruction.
- %%timeit: calculates the best performance time for all the instructions in a cell, apart from the one placed on the same line as the cell magic (which can therefore be an initialization instruction).
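The line magic is demonstrated just below; as a minimal sketch of the cell magic (an addition to the slides, reusing the txt alias imported above), the statement placed on the magic's own line initializes the data and is excluded from the timing:

%%timeit texts = ['Python for data science', 'Python for machine learning'] * 1000
# The setup statement above builds the corpus once and is not timed;
# only the transform below is measured.
hashing = txt.HashingVectorizer(n_features=20, binary=True, norm=None)
vectors = hashing.transform(texts)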
%timeit l = [k for k in range(10**6)]

10 loops, best of 3: 61.5 ms per loop

Considering Timing and Performance - Working with the memory profiler

When testing our application code for performance (speed) characteristics, we can obtain analogous information about memory usage. Keeping track of memory consumption can tell us about possible problems in the way data is processed or transmitted to the learning algorithms. The memory_profiler package implements the required functionality. This package is not provided as a default Python or IPython package, so it requires installation:

pip install psutil
pip install memory_profiler

Use the following command in each IPython session we want to monitor:

%load_ext memory_profiler

After performing these tasks, we can easily track how much memory a command consumes:

texts = ['Python for data science', 'Python for machine learning']
hashing = sklearn_hashing_trick.transform(texts)
%memit dense_hashing = hashing.toarray()

peak memory: 68.79 MiB, increment: 0.14 MiB

Obtaining a complete overview of memory consumption is possible by saving an IPython cell to disk and then profiling it with the line magic %mprun on an externally imported function.

Running in Parallel

Most computers today are multicore (two or more processors in a single package), and some have multiple physical CPUs. One of the most important limitations of Python is that it uses a single core by default. Data science projects require quite a lot of computation. In particular, part of the scientific aspect of data science relies on repeated tests and experiments on different data matrices. Using more CPU cores accelerates a computation by a factor that almost matches the number of cores; for example, having four cores means working at best four times faster.

We don't receive a full fourfold increase because there is overhead in starting a parallel process: new running Python instances have to be set up with the right in-memory information and launched. Consequently, the improvement is less than what is potentially achievable, but still significant. Knowing how to use more than one CPU is therefore an advanced but incredibly useful skill for increasing the number of analyses completed and for speeding up our operations, both when setting up and when using our data products.

Running in Parallel - Performing multicore parallelism

To perform multicore parallelism with Python, we integrate the Scikit-learn package with the joblib package for time-consuming operations, such as replicating models to validate results or searching for the best hyperparameters.
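As a minimal sketch of joblib on its own (an addition to the slides; joblib ships as a Scikit-learn dependency), tasks are described with delayed and executed across cores by Parallel:

# Sketch (not from the slides): running a function across cores with joblib
from math import sqrt
from joblib import Parallel, delayed

# n_jobs=-1 uses all available cores; each call to sqrt runs as a separate task.
results = Parallel(n_jobs=-1)(delayed(sqrt)(i) for i in range(10))
print(results)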
In particular, Scikit-learn allows multiprocessing when:
- Cross-validating: testing the results of a machine-learning hypothesis using different training and testing data
- Grid-searching: systematically changing the hyperparameters of a machine-learning hypothesis and testing the consequent results
- Multilabel prediction: running an algorithm multiple times against multiple targets when there are many different target outcomes to predict at the same time
- Ensemble machine-learning methods: modeling a large host of classifiers, each one independent of the others, such as when using RandomForest-based modeling

We don't have to do anything special to take advantage of parallel computation: we activate parallelism by setting the n_jobs parameter to a number of cores greater than 1, or by setting the value to -1, which means we want to use all the available CPU instances.

Running in Parallel - Demonstrating multiprocessing

It's a good idea to use IPython when running a demonstration of how multiprocessing can really save time during data science projects. Using IPython provides the advantage of the %timeit magic command for timing execution. We start by loading a multiclass dataset, a complex machine-learning algorithm (the Support Vector Classifier, or SVC), and a cross-validation procedure for estimating reliable resulting scores from all the procedures.

from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
%timeit single_core_learning = cross_val_score(SVC(), X, y, cv=20, n_jobs=1)
%timeit multi_core_learning = cross_val_score(SVC(), X, y, cv=20, n_jobs=-1)

EXPLORING DATA ANALYSIS

Data science relies on complex algorithms for building predictions and spotting important signals in data, and each algorithm presents different strong and weak points. In short, we select a range of algorithms, we have them run on the data, we optimize their parameters as much as we can, and finally we decide which one will best help us build our data product or generate insight into our problem. It sounds a little bit automatic and, partially, it is, thanks to powerful analytical software and scripting languages like Python. Learning algorithms are complex, and their sophisticated procedures naturally seem automatic and a bit opaque to us. Keep in mind, though, that GIGO ("Garbage In/Garbage Out") applies: the results are only as good as the data we feed in.

Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple summary statistics and graphic visualizations in order to gain a deeper understanding of the data. EDA helps us become more effective in the subsequent data analysis and modeling.

The EDA Approach

EDA was developed at Bell Labs by John Tukey, a mathematician and statistician who wanted to promote more questions and actions on data based on the data itself (the exploratory motif), in contrast to the dominant confirmatory approach of the time. A confirmatory approach relies on the use of a theory or procedure; the data is just there for testing and application. Tukey could already see that certain activities, such as testing and modeling, were easy to make automatic.
In Tukey's view, exploration and discovery cannot be automated in the same way. This explains why, as data scientists, our role and tools aren't limited to automatic learning algorithms but extend to manual and creative exploratory tasks. Computers are unbeatable at optimizing, but humans are strong at discovery, taking unexpected routes and trying unlikely but very effective solutions.

EDA is a bit different because it goes beyond the basic assumptions about data workability, which are actually the concern of Initial Data Analysis (IDA). In IDA we:
- Complete observations or mark missing cases by appropriate features
- Transform text or categorical variables
- Create new features based on domain knowledge of the data problem
- Have at hand a numeric dataset where rows are observations and columns are variables

EDA goes further than IDA. It's moved by a different attitude: going beyond the basic assumptions. With EDA, we:
- Describe our data
- Closely explore data distributions
- Understand the relations between variables
- Notice unusual or unexpected situations
- Place the data into groups
- Notice unexpected patterns within groups
- Take note of group differences

Defining Descriptive Statistics for Numeric Data

The first actions we can take with the data are to produce some synthetic measures that help us figure out what is going on in it. We acquire knowledge of measures such as maximum and minimum values, and we define which intervals are the best place to start.

from sklearn.datasets import load_iris
iris = load_iris()

import pandas as pd
import numpy as np
print('Your pandas version is: %s' % pd.__version__)
print('Your NumPy version is %s' % np.__version__)
iris_nparray = iris.data
iris_dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_dataframe['group'] = pd.Series([iris.target_names[k] for k in iris.target], dtype="category")

Defining Descriptive Statistics for Numeric Data - Measuring central tendency

Mean and median are the first measures to calculate for numeric variables when starting EDA. They provide a reliable estimate of a variable's central location when the distribution is roughly symmetric. Using pandas, we can quickly compute both means and medians.

print(iris_dataframe.mean(numeric_only=True))
print(iris_dataframe.median(numeric_only=True))

When checking central tendency measures, we should:
- Verify whether the means are zero
- Check whether the means are different from each other
- Notice whether the median is different from the mean

Defining Descriptive Statistics for Numeric Data - Measuring variance and range

We can check the variance by squaring the standard deviation. The variance is a good indicator of whether a mean is a suitable summary of the variable's distribution.
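A minimal sketch of that check (an addition to the slides), computing the variance directly with pandas and confirming it is the square of the standard deviation:

# Sketch (not from the slides): variance is the square of the standard deviation
print(iris_dataframe.var(numeric_only=True))
print(iris_dataframe.std(numeric_only=True) ** 2)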
print(iris_dataframe.std(numeric_only=True))
print(iris_dataframe.max(numeric_only=True) - iris_dataframe.min(numeric_only=True))

Defining Descriptive Statistics for Numeric Data - Working with percentiles

Apart from the minimum and maximum, the position at 25 percent of the values (the lower quartile) and the position at 75 percent (the upper quartile) are useful for figuring out how the data distribution works, and they are the basis of an illustrative graph called a boxplot.

print(iris_dataframe.quantile(np.array([0, .25, .50, .75, 1]), numeric_only=True))

The difference between the upper and lower quartiles constitutes the interquartile range (IQR), which is a measure of the scale of a variable that is of particular interest.

Defining Descriptive Statistics for Numeric Data - Defining measures of normality

Skewness describes the asymmetry of data with respect to the mean. If the skew is negative, the left tail is too long and the mass of the observations is on the right side of the distribution. If it is positive, it is exactly the opposite. Kurtosis shows whether the data distribution, especially its peak and tails, has the right shape. If the kurtosis is above zero, the distribution has a marked peak; if it is below zero, the distribution is too flat.

When performing the kurtosis and skewness tests, we check whether the p-value is less than or equal to 0.05. If so, we have to reject normality, which implies that we could obtain better results if we try to transform the variable into a normal one.

from scipy.stats import kurtosis, kurtosistest
k = kurtosis(iris_dataframe['petal length (cm)'])
zscore, pvalue = kurtosistest(iris_dataframe['petal length (cm)'])
print('Kurtosis %0.3f z-score %0.3f p-value %0.3f' % (k, zscore, pvalue))

from scipy.stats import skew, skewtest
s = skew(iris_dataframe['petal length (cm)'])
zscore, pvalue = skewtest(iris_dataframe['petal length (cm)'])
print('Skewness %0.3f z-score %0.3f p-value %0.3f' % (s, zscore, pvalue))

Counting for Categorical Data

The Iris dataset is made of four metric variables and a qualitative target outcome. Because the dataset is made up of metric measurements (widths and lengths in centimeters), we must render it qualitative by dividing it into bins according to specific intervals. The pandas package features two useful functions, cut and qcut, that can transform a metric variable into a qualitative one:
- cut expects a series of edge values used to cut the measurements, or an integer number of groups used to cut the variables into equal-width bins.
- qcut expects a series of percentiles used to cut the variable.
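A minimal sketch of the difference between the two functions (an addition to the slides), binning one Iris column with equal-width bins and then with quartiles:

# Sketch (not from the slides): equal-width bins with cut vs. quantile-based bins with qcut
print(pd.cut(iris_dataframe['petal length (cm)'], bins=4).value_counts())
print(pd.qcut(iris_dataframe['petal length (cm)'], q=[0, .25, .5, .75, 1]).value_counts())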
We can obtain a new categorical DataFrame with the following command, which concatenates a binning for each variable:

iris_binned = pd.concat([
    pd.qcut(iris_dataframe.iloc[:,0], [0, .25, .5, .75, 1]),
    pd.qcut(iris_dataframe.iloc[:,1], [0, .25, .5, .75, 1]),
    pd.qcut(iris_dataframe.iloc[:,2], [0, .25, .5, .75, 1]),
    pd.qcut(iris_dataframe.iloc[:,3], [0, .25, .5, .75, 1]),
], join='outer', axis=1)

Counting for Categorical Data - Understanding frequencies

We can obtain a frequency count for each categorical variable of the dataset, both for the predictive variables and for the outcome, using the following code:

print(iris_dataframe['group'].value_counts())
print(iris_binned['petal length (cm)'].value_counts())
print(iris_binned.describe())

Frequencies can signal a number of interesting characteristics of qualitative features:
- The mode of the frequency distribution, that is, the most frequent category
- The other most frequent categories, especially when they are comparable with the mode (a bimodal distribution) or when there is a large difference between them
- The distribution of frequencies among categories, whether rapidly decreasing or equally distributed
- Rare categories that gather together

Counting for Categorical Data - Creating contingency tables

By matching different categorical frequency distributions, we can display the relationship between qualitative variables. The pandas.crosstab function can match variables or groups of variables, helping to locate possible data structures or relationships.

print(pd.crosstab(iris_dataframe['group'], iris_binned['petal length (cm)']))

Creating Applied Visualization for EDA

The data is rich in information because it offers a perspective that goes beyond the single variable, presenting more variables with their reciprocal variations. The way to use more of the data is to create a bivariate exploration. This is also the basis for complex data analysis based on a multivariate approach. If the univariate approach inspects a limited number of descriptive statistics, matching different variables or groups of variables increases the number of possibilities. Visualization is a rapid way to limit tests and analysis to only the interesting traces and hints.

Creating Applied Visualization for EDA - Inspecting boxplots

Boxplots provide a way to represent distributions and their extreme ranges, signaling whether some observations are too far from the core of the data, a problematic situation for some learning algorithms. The following code shows how to create a basic boxplot using the Iris dataset:

boxplots = iris_dataframe.boxplot(return_type='axes')

Creating Applied Visualization for EDA - Performing t-tests after boxplots

After we have spotted a possible group difference relative to a variable, a t-test or a one-way Analysis Of Variance (ANOVA) can provide a statistical verification of the significance of the difference between the groups' means.
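Spotting such a group difference typically starts from per-group boxplots; a minimal sketch (an addition to the slides) using the by argument of pandas' boxplot:

# Sketch (not from the slides): one boxplot of petal length per iris group
group_boxplots = iris_dataframe.boxplot(column='petal length (cm)', by='group', return_type='axes')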
from scipy.stats import ttest_ind
group0 = iris_dataframe['group'] == 'setosa'
group1 = iris_dataframe['group'] == 'versicolor'
group2 = iris_dataframe['group'] == 'virginica'
print('var1 %0.3f var2 %0.3f' % (iris_dataframe['petal length (cm)'][group1].var(), iris_dataframe['petal length (cm)'][group2].var()))

The t-test compares two groups at a time, and it requires that we state whether the groups have similar variance or not.

t, pvalue = ttest_ind(iris_dataframe['sepal width (cm)'][group1], iris_dataframe['sepal width (cm)'][group2], axis=0, equal_var=False)
print('t statistic %0.3f p-value %0.3f' % (t, pvalue))

We can check more than two groups simultaneously using the one-way ANOVA test.

from scipy.stats import f_oneway
f, pvalue = f_oneway(iris_dataframe['sepal width (cm)'][group0],
                     iris_dataframe['sepal width (cm)'][group1],
                     iris_dataframe['sepal width (cm)'][group2])
print("One-way ANOVA F-value %0.3f p-value %0.3f" % (f, pvalue))

Creating Applied Visualization for EDA - Observing parallel coordinates

Parallel coordinates can help us spot which groups in the outcome variable we could easily separate from the others. It is a truly multivariate plot, because at a glance it represents all the data at the same time.

from pandas.plotting import parallel_coordinates
# The 'group' column already holds the class names, so it serves directly as the label column.
pll = parallel_coordinates(iris_dataframe, 'group')

Creating Applied Visualization for EDA - Graphing distributions

We usually render the information that boxplots and descriptive statistics provide as a curve or a histogram, which shows an overview of the complete distribution of values.

densityplot = iris_dataframe[iris_dataframe.columns[:4]].plot(kind='density')
single_distribution = iris_dataframe['petal length (cm)'].plot(kind='hist')

Creating Applied Visualization for EDA - Plotting scatterplots

In scatterplots, the two compared variables provide the coordinates for plotting the observations as points on a plane.

colors_palette = {0: 'red', 1: 'yellow', 2: 'blue'}
colors = [colors_palette[c] for c in iris.target]
simple_scatterplot = iris_dataframe.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)', c=colors)

from pandas.plotting import scatter_matrix
matrix_of_scatterplots = scatter_matrix(iris_dataframe, figsize=(6, 6), color=colors, diagonal='kde')

Understanding Correlation

Just as the relationship between variables is graphically representable, it is also measurable by a statistical estimate. When working with numeric variables, the estimate is a correlation, and Pearson's correlation is the most famous. Pearson's correlation is the foundation for complex linear estimation models. When we work with categorical variables, the estimate is an association, and the chi-square statistic is the most frequently used tool for measuring association between features.
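For reference (an addition to the slides), Pearson's correlation is the covariance of the two variables rescaled by the product of their standard deviations, which bounds it between -1 and +1:

r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}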
Understanding Correlation - Using covariance and correlation

Covariance is the first measure of the relationship between two variables. It determines whether both variables behave coincidentally with respect to their means. If the individual values of two variables are usually above or below their respective averages at the same time, the two variables have a positive association. It means that they tend to agree, and we can figure out the behavior of one of the two by looking at the other. In such a case, their covariance will be a positive number, and the higher the number, the higher the agreement.

If, instead, one variable is usually above and the other usually below their respective averages, the two variables are negatively associated. Even though the two disagree, it's an interesting situation for making predictions, because by observing the state of one of them, we can figure out the likely state of the other. In this case, their covariance will be a negative number.

A third state is that the two variables don't systematically agree or disagree with each other. In this case, the covariance tends to be zero, a sign that the variables don't share much and have independent behaviours.

Ideally, when we have a numeric target variable, we want it to have a high positive or negative covariance with the predictive variables. However, a high positive or negative covariance among the predictive variables themselves is a sign of information redundancy. Information redundancy signals that the variables point to the same data; that is, they are telling us the same thing in slightly different ways.

Computing a covariance or correlation matrix is straightforward using pandas:

iris_dataframe.cov(numeric_only=True)
iris_dataframe.corr(numeric_only=True)

Understanding Correlation - Using nonparametric correlation

Correlations work fine when our variables are numeric and their relationship is strictly linear. Sometimes a feature may be ordinal, or we may suspect some nonlinearity due to non-normal distributions in our data. A possible solution is to test the doubtful correlations with a nonparametric correlation, such as the Spearman correlation. A Spearman correlation transforms the numeric values into rankings and then correlates the rankings, thus minimizing the influence of any nonlinear relationship between the two variables under scrutiny.

from scipy.stats import spearmanr, pearsonr
spearmanr_coef, spearmanr_p = spearmanr(iris_dataframe['sepal length (cm)'], iris_dataframe['sepal width (cm)'])
pearsonr_coef, pearsonr_p = pearsonr(iris_dataframe['sepal length (cm)'], iris_dataframe['sepal width (cm)'])
print('Pearson correlation %0.3f | Spearman correlation %0.3f' % (pearsonr_coef, spearmanr_coef))

Understanding Correlation - Considering chi-square for tables

We can apply another nonparametric test for relationships when working with cross-tables. This test is applicable to both categorical and numeric data, once the numeric data has been discretized into bins.
The chi-square statistic tells us when the table distribution of two variables is statistically comparable to a table in which the two variables are hypothesized as not related to each other.

from scipy.stats import chi2_contingency
table = pd.crosstab(iris_dataframe['group'], iris_binned['petal length (cm)'])
chi2, p, dof, expected = chi2_contingency(table.values)
print('Chi-square %0.2f p-value %0.3f' % (chi2, p))

Modifying Data Distributions

As a by-product of data exploration, in an EDA phase we can do the following:
- Create new features from the combination of different but related variables
- Spot hidden groups or strange values lurking in our data
- Try some useful modifications of our data distributions by binning

Modifying Data Distributions - Using the normal distribution

The normal, or Gaussian, distribution is the most useful distribution in statistics thanks to its frequent recurrence and its particular mathematical properties. During data science practice, we meet a wide range of different distributions, some of them named by probability theory.

Modifying Data Distributions - Creating a Z-score standardization

In our EDA process, we may have realized that our variables have different scales and are heterogeneous in their distributions. As a consequence of our analysis, we need to transform the variables in a way that makes them easily comparable:

from sklearn.preprocessing import scale
stand_sepal_width = scale(iris_dataframe['sepal width (cm)'])

Modifying Data Distributions - Transforming other notable distributions

When we check variables with high skewness and kurtosis for their correlation, the results may disappoint us. Using a nonparametric measure of correlation, such as Spearman's, may tell us more about two variables than Pearson's r can.
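A minimal sketch of why (an addition to the slides): Spearman correlation depends only on ranks, so a monotonic but skew-inducing transformation leaves it untouched, while Pearson's r drops:

# Sketch (not from the slides): Pearson reacts to a skew-inducing monotonic transform, Spearman does not
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = iris_dataframe['petal length (cm)']
y = iris_dataframe['petal width (cm)']
x_skewed = np.exp(x)  # a strongly skewed, nonlinear version of x

print('Pearson original %0.3f vs transformed %0.3f' % (pearsonr(x, y)[0], pearsonr(x_skewed, y)[0]))
print('Spearman original %0.3f vs transformed %0.3f' % (spearmanr(x, y)[0], spearmanr(x_skewed, y)[0]))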