Exploratory Data Analysis PDF
Document Details
Uploaded by TruthfulBoltzmann
Summary
This document is about exploratory data analysis (EDA). It covers why EDA matters, the main types of EDA, and how to apply common techniques in Python.
Full Transcript
Exploratory Data Analysis

Objectives
At the end of this chapter, students will be able to answer:
- What is Exploratory Data Analysis (EDA)?
- Why is EDA important in data science?
- What are the benefits of EDA?
- What are the types of EDA?

What is Exploratory Data Analysis (EDA)?
EDA is the process of analyzing a dataset and summarizing its main features, often by data visualization methods. EDA performs preliminary investigations on data in order to uncover patterns, detect anomalies, test hypotheses and verify assumptions. It is primarily used by data scientists to understand the various aspects of the data and to see what the data can reveal beyond formal modeling or hypothesis testing.

Importance of Exploratory Data Analysis (EDA)
- Quality control: EDA helps to identify errors and inconsistencies in the data, ensuring that the results of further analysis are accurate and reliable.
- Hypothesis generation: EDA helps to generate hypotheses about the relationships between variables, which can be tested through further analysis.
- Data understanding: EDA helps analysts to understand the data, its distribution, and its characteristics, allowing for more informed decision-making.
- Data visualization: EDA helps to communicate the data and its insights to stakeholders, making it easier for them to understand and act on the results.

EDA reveals the ground truth about the content of the data without making any underlying assumptions.

Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of data. Python provides tools for each of these: pandas for summarizing, scipy (along with others) for statistical analysis, and matplotlib and plotly for visualization.

Why is EDA important to data scientists?
The main purpose of EDA is to look at data before making any assumptions.
It can help data scientists identify obvious errors, better understand patterns within the data, detect outliers and anomalies, and find relationships between different variables. In general, EDA focuses on understanding the characteristics of a dataset before deciding what we want to do with that dataset.

EDA is used to:
- Generate questions about your data.
- Search for answers by visualizing, transforming and modeling your data.
- Use what we learn to refine our questions or generate new questions.

What are the benefits of EDA?
- Spotting missing or incorrect data.
- Understanding the structure of data.
- Testing our hypotheses and assumptions.
- Identifying the most important variables.
- Determining error margins.
- Identifying the most appropriate statistical tools for the analysis.

Types of EDA
- Univariate non-graphical
- Univariate graphical
- Multivariate non-graphical
- Multivariate graphical

Univariate Non-graphical
This is the simplest form of data analysis used in practice. We use just one variable, whose data (referred to as the population) is compiled and studied. The main aim of univariate non-graphical EDA is to find out the details of the distribution of the population data and to estimate some specific statistical parameters.

The significant parameters estimated from a distribution point of view are as follows:

Central Tendency: This term refers to values located at the data's central position or middle zone. The three generally estimated parameters of central tendency are mean, median, and mode. The mean is the average of all values in the data, the mode is the value that occurs the maximum number of times, and the median is the middle value of the sorted data, with an equal number of observations to its left and right.

Range: The range is the difference between the maximum and minimum value in the data, thus indicating the total spread of the data from its lowest to its highest value.
Variance and Standard Deviation: Variance is the most commonly used measure of dispersion and indicates the spread of all data points in a data set. It is the mean of the squared differences between each data point and the mean. The standard deviation is the square root of the variance.

Note: Non-graphical methods are quantitative and objective, but they are not able to give the complete picture of the data.

Example Python code

    import pandas as pd

    # Create some sample data
    data = [10, 20, 30, 30, 40, 40, 40, 60, 100]
    mydata = pd.Series(data)

    # Calculate measures of central tendency
    mean = mydata.mean()
    median = mydata.median()
    mode = mydata.mode()  # returns a Series, since there can be several modes

    # Calculate measures of variability
    # (named data_range so that it does not shadow Python's built-in range)
    data_range = mydata.max() - mydata.min()
    std_dev = mydata.std()  # pandas uses the sample formula (ddof=1) by default

    # Calculate measures of skewness and kurtosis
    skewness = mydata.skew()
    kurtosis = mydata.kurtosis()

Note: This can be done on any single column of a data frame.

Univariate Graphical
Graphical methods involve a degree of subjective analysis. Some common types of univariate graphics are:

Stem-and-leaf plots: This is a very simple but powerful EDA method used to display quantitative data in a shortened format. It displays the values in the data set, keeping each observation intact but separating it into a stem (the leading digits) and leaves (the remaining or trailing digits).

Example Python code

    import matplotlib.pyplot as plt

    # Create data
    data = [13, 25, 30, 14, 17, 22, 16, 35, 58, 82, 19]

    # Separate the stem parts (the leading tens digit of each value)
    stems = [value // 10 for value in data]

    # Create the plot. Note that ax.stem draws a stem (lollipop) chart with one
    # vertical line per value at its stem position; it only approximates a
    # classical text-based stem-and-leaf display.
    fig, ax = plt.subplots()
    ax.stem(stems, data)
    ax.set_xlabel('Stems')
    ax.set_ylabel('Values')
    ax.set_title('Stem and Leaf Plot Example')

    # Show plot
    plt.show()

Histograms (Bar Charts): These plots are used to display both grouped and ungrouped data. The values of the variable are plotted on the x-axis, while the y-axis shows the number of observations or frequencies.
Histograms are a simple way to understand your data quickly: they reveal properties such as central tendency, dispersion, and outliers.

Example Python code – Histogram

    import matplotlib.pyplot as plt

    # Create data
    data = [13, 25, 30, 14, 17, 22, 16, 35, 58, 82, 19]

    # Create histogram
    fig, ax = plt.subplots()
    ax.hist(data, bins=8)

    # Add x-axis and y-axis labels
    ax.set_xlabel('Data')
    ax.set_ylabel('Frequency')

    # Add a title
    ax.set_title('Histogram Example')

    # Show plot
    plt.show()

Types of Histograms

Simple Bar Charts: These are used to represent categorical variables with rectangular bars, where the different lengths correspond to the values of the variables.

Multiple or Grouped charts: Grouped bar charts represent multiple sets of data items side by side for comparison, with one color used for each series in the dataset.

Example Python code – Simple Bar Chart

    import matplotlib.pyplot as plt

    # Create the dataset
    data = {'Database': 20, 'C++': 15, 'Java': 30, 'Python': 35}
    courses = list(data.keys())
    values = list(data.values())

    fig = plt.figure(figsize=(10, 5))

    # Create the bar plot
    plt.bar(courses, values, color='b', width=0.4)
    plt.xlabel("Courses offered")
    plt.ylabel("No. of students enrolled")
    plt.title("Students enrolled in different courses")
    plt.show()

Example Python code – Multiple Bar Chart

    import numpy as np
    import matplotlib.pyplot as plt

    barWidth = 0.25
    fig, ax = plt.subplots(figsize=(12, 8))

    # Students passed per year, one list per branch
    IT, ECE, CSE = [12, 30, 1, 8, 22], [28, 6, 16, 5, 10], [29, 3, 24, 25, 17]

    # Set position of bar on X axis
    br1 = np.arange(len(IT))
    br2 = [x + barWidth for x in br1]
    br3 = [x + barWidth for x in br2]

    plt.bar(br1, IT, color='r', width=barWidth, edgecolor='grey', label='IT')
    plt.bar(br2, ECE, color='g', width=barWidth, edgecolor='grey', label='ECE')
    plt.bar(br3, CSE, color='b', width=barWidth, edgecolor='grey', label='CSE')

    # The x-axis ticks are years; the branches are shown in the legend
    plt.xlabel('Year', fontweight='bold', fontsize=15)
    plt.ylabel('Students passed', fontweight='bold', fontsize=15)
    plt.xticks([r + barWidth for r in range(len(IT))],
               ['2015', '2016', '2017', '2018', '2019'])
    plt.legend()
    plt.show()

Percentage Bar Charts: These are bar graphs that depict the data in the form of percentages for each category.
[Figure: percentage bar chart with dummy values]

Box Plots: These are used to display the distribution of a quantitative variable in the data. If the data set also contains a categorical variable, box plots drawn per category allow the categories to be compared. Further, if outliers are present in the data, they can be easily identified. These graphs are very useful when comparisons are to be shown in terms of percentiles, such as the 25th, 50th, and 75th percentiles (the quartiles).
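The quartiles a box plot draws, and the points it marks as outliers, can also be computed directly. The following is a minimal sketch, not part of the original slides. It assumes the common 1.5 × IQR rule for flagging outliers and reuses the small sample list from the histogram example; the variable names are illustrative.

```python
import numpy as np

# Same sample values as in the histogram example
data = np.array([13, 25, 30, 14, 17, 22, 16, 35, 58, 82, 19])

# Quartiles: the 25th, 50th (median) and 75th percentiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Interquartile range and the 1.5 * IQR fences (an assumed, commonly used rule)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the candidates a box plot would draw as outliers
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Q1, median, Q3:", q1, q2, q3)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers.tolist())
```

With this rule, only values far from the middle half of the data are flagged; a different multiplier would flag more or fewer points.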
Example Python code – Percentage Bar Chart

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the Iris dataset
    iris = pd.read_csv('Iris.csv')

    # Calculate the percentage of each species in the dataset
    species_counts = iris['Species'].value_counts(normalize=True) * 100

    # Create a bar chart of the percentage of each species
    species_counts.plot(kind='bar', color='blue')

    # Add labels and title to the chart
    plt.xlabel('Species')
    plt.ylabel('Percentage')
    plt.title('Percentage of Each Species in Iris Dataset')

    # Display the chart
    plt.show()

Example Python code – Box Plot

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the Iris dataset
    iris = pd.read_csv('Iris.csv')

    # Create a box plot of the sepal length for each species
    iris.boxplot(column='SepalLengthCm', by='Species')

    # Display the chart
    plt.show()

Univariate Graphical – Pie Chart

    import matplotlib.pyplot as plt

    data = [123, 123, 144, 147, 152, 162, 175, 175, 182, 175]

    # One wedge per value; explode pulls the last wedge out from the center
    plt.pie(data, explode=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2])
    plt.show()

Multivariate Non-Graphical
The multivariate non-graphical exploratory data analysis technique is usually used to show the relationship between two or more variables with the help of either cross-tabulation or statistics.

For categorical data, an extension of tabulation called cross-tabulation is extremely useful. For two variables, cross-tabulation builds a two-way table whose column headings are the levels of one variable and whose row headings are the levels of the other, then fills each cell with the count of all subjects that share that pair of levels.
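The two-way table described above can be sketched with `pandas.crosstab`. This example is an illustration rather than part of the original slides: the `department` and `result` columns and their values are invented for the purpose.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame({
    "department": ["IT", "IT", "CSE", "CSE", "ECE", "IT", "CSE", "ECE"],
    "result": ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"],
})

# Rows are the levels of one variable, columns the levels of the other;
# each cell counts the subjects sharing that pair of levels
table = pd.crosstab(df["department"], df["result"])
print(table)

# margins=True adds row and column totals under the label "All"
table_with_totals = pd.crosstab(df["department"], df["result"], margins=True)
print(table_with_totals)
```

Passing `normalize="index"` to `pd.crosstab` would show row proportions instead of counts, which can make categories of different sizes easier to compare.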
Example Python code

    import pandas as pd

    # Load some sample data into a DataFrame
    data = pd.read_csv('datasets/Iris.csv')
    print(data.head(6))

    # Keep only the numeric columns: the statistics below are not defined
    # for text columns such as the species label
    numeric = data.select_dtypes(include='number')

    # Calculate measures of central tendency for each variable
    means = numeric.mean()
    medians = numeric.median()
    modes = numeric.mode()

    # Calculate measures of variability for each variable
    ranges = numeric.max() - numeric.min()
    std_devs = numeric.std()

    # Calculate correlations between pairs of variables
    correlations = numeric.corr()

For one categorical variable and one quantitative variable, we compute statistics for the quantitative variable separately for every level of the categorical variable, then compare the statistics across the levels. Comparing the means is an informal version of one-way ANOVA, and comparing the medians is a robust informal version of one-way ANOVA.

Multivariate Graphical
Graphics are used in multivariate graphical EDA to show the relationships between two or more variables. Some common types of multivariate graphics include:

Scatterplot: For two quantitative variables, the essential graphical EDA technique is the scatterplot, which has one variable on the x-axis, one on the y-axis, and a point for every case in your dataset.

Run chart: A line graph of data plotted over time.

Heat map: A graphical representation of data where values are depicted by color.

Multivariate chart: A graphical representation of the relationships between factors and a response.

Bubble chart: A data visualization that displays multiple circles (bubbles) in a two-dimensional plot.

Descriptive statistics methods
Descriptive statistics methods are used to summarize and describe the main features of a dataset. Measures of central tendency, measures of variability, frequency distributions, percentiles/quantiles, and skewness and kurtosis are some common descriptive statistics methods used to describe data in exploratory data analysis (EDA).
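Several of the descriptive statistics listed above can be obtained from a single pandas Series. The following is a minimal sketch, not part of the original slides. It reuses the sample values from the univariate example; the variable names are illustrative.

```python
import pandas as pd

# Same sample values as in the univariate non-graphical example
values = pd.Series([10, 20, 30, 30, 40, 40, 40, 60, 100])

# Frequency distribution: how often each distinct value occurs
frequencies = values.value_counts().sort_index()

# Percentiles/quantiles: here the three quartiles
quartiles = values.quantile([0.25, 0.5, 0.75])

# Shape of the distribution
skewness = values.skew()   # positive when the right tail is longer
kurtosis = values.kurt()   # excess kurtosis relative to a normal distribution

# describe() bundles count, mean, std, min, quartiles and max in one call
summary = values.describe()

print(frequencies)
print(quartiles)
print(skewness, kurtosis)
print(summary)
```

For a whole DataFrame, `describe()` produces the same summary for every numeric column at once.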
Correlation between variables is also one way to describe and understand data. It measures the strength and direction of the linear relationship between two variables. Correlation can be positive, negative, or zero.

Correlation Coefficient
If a scatter plot shows a possible linear relationship, then the correlation coefficient indicates how strong the relationship is between x and y. We use the letter r for the correlation coefficient. Usually, -1