Exploratory Data Analysis - Lecture Six - Al Turath University
Al-Turath University
Doaa Mohammad
Summary
This document is a lecture on exploratory data analysis (EDA). It describes EDA as a method for analyzing data sets and summarizing their characteristics, often using graphical techniques. The lecture also discusses various techniques used for EDA, including histograms, boxplots, and how to effectively visualize data.
Full Transcript
Al Turath University
Data Science
By Doaa Mohammad, Assistant Lecturer
Lecture Six

Exploratory data analysis (EDA) is an approach to analyzing data sets that summarizes their main characteristics, often using statistical graphics and other data visualization methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond formal modeling, and it thereby contrasts with traditional hypothesis testing.

Exploratory data analysis refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

EDA is an approach or philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.

EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue: it is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more direct approach of allowing the data itself to reveal its underlying structure and model.

EDA is an important first step in any data analysis. Understanding where outliers occur and how variables are related can help one design statistical analyses that yield meaningful results.

Exploratory data analysis

During exploratory data analysis you take a deep dive into the data.
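The summary statistics and outlier detection mentioned above can be sketched in a few lines of Python. This is only an illustration, not the lecture's own method: the sample `ages`, the helper names `summarize` and `iqr_outliers`, and the choice of the 1.5 x IQR rule for flagging outliers are all assumptions made for this example.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    # Quartiles with the "inclusive" method treat the data as the full population.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

def iqr_outliers(values, k=1.5):
    """Flag values outside [q1 - k*IQR, q3 + k*IQR] (an assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: ages with one suspicious entry.
ages = [61, 63, 64, 66, 67, 68, 70, 71, 72, 120]

summary = summarize(ages)
outliers = iqr_outliers(ages)
print(summary)
print("possible outliers:", outliers)  # flags 120 for this sample
```

A flagged value is only a candidate for inspection; whether it is an error or a genuine observation is a judgment the analyst still has to make.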
Information becomes much easier to grasp when shown in a picture, so you mainly use graphical techniques to gain an understanding of your data and the interactions between variables. This phase is about exploring data, so keeping your mind open and your eyes peeled is essential. The goal isn’t to cleanse the data, but it’s common to discover anomalies you missed before, forcing you to take a step back and fix them.

The visualization techniques you use in this phase range from simple line graphs or histograms, as shown in figure 2, to more complex diagrams. Sometimes it’s useful to compose a composite graph from simple graphs to get even more insight into the data. Mike Bostock has interactive examples of almost any type of graph. It’s worth spending time on his website, though most of his examples are more useful for data presentation than data exploration.

Figure 2. From top to bottom, a bar chart, a line plot, and a distribution are some of the graphs used in exploratory analysis.

These plots can be combined to provide even more insight, as shown in figure 3. Overlaying several plots is common practice.

Figure 3. Drawing multiple plots together can help you understand the structure of your data over multiple variables. (Panel axis labels: Q28_1, Q28_2, Q28_3, Q28_4, Q28_5.)

Figure 4. A Pareto diagram is a combination of the values and a cumulative distribution. It’s easy to see from this diagram that the first 50% of the countries contain slightly less than 80% of the total amount. If this graph represented customer buying power and we sold expensive products, we probably wouldn’t need to spend our marketing budget in every country; we could start with the first 50%.
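The numbers behind a Pareto diagram like the one described for figure 4 are the sorted values and their cumulative share of the total. The following is a minimal sketch under stated assumptions: the country names and amounts are made up, and `pareto_table` and `categories_to_reach` are hypothetical helper names, not part of any library.

```python
from itertools import accumulate

def pareto_table(totals):
    """Return (name, amount, cumulative_share) rows, largest amount first."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    grand_total = sum(amount for _, amount in ranked)
    running = accumulate(amount for _, amount in ranked)
    return [
        (name, amount, cumulative / grand_total)
        for (name, amount), cumulative in zip(ranked, running)
    ]

def categories_to_reach(table, target_share):
    """Count how many top categories are needed to reach target_share."""
    for position, (_, _, share) in enumerate(table, start=1):
        if share >= target_share:
            return position
    return len(table)

# Hypothetical buying power per country (arbitrary units).
buying_power = {"A": 50, "B": 30, "C": 10, "D": 6, "E": 4}

table = pareto_table(buying_power)
for name, amount, share in table:
    print(f"{name}: {amount:>3}  cumulative share {share:.0%}")

needed = categories_to_reach(table, 0.75)
print("countries needed for 75% of the total:", needed)
```

In a plot, the amounts would be drawn as bars and the cumulative share as a line over them; the table is the same data in tabular form.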
In figure 4, we combine simple graphs into a Pareto diagram.

Figure 5. Link and brush allows you to select observations in one plot and highlight the same observations in the other plots.

Figure 5 shows another technique: brushing and linking. With brushing and linking you combine and link different graphs and tables (or views) so that changes in one graph are automatically transferred to the other graphs. This interactive exploration of data facilitates the discovery of new insights.

Figure 5 shows the average score per country for questions. Not only does this indicate a high correlation between the answers, but it’s also easy to see that when you select several points on one subplot, they correspond to similar points on the other graphs. In this case the selected points on the left graph correspond to points on the right graph.

Two other important graphs are the histogram shown in figure 6 and the boxplot shown in figure 7. In a histogram a variable is cut into discrete categories, and the number of occurrences in each category is summed up and shown in the graph. The boxplot, on the other hand, doesn’t show how many observations are present but does offer an impression of the distribution within categories. It can show the maximum, minimum, median, and other characterizing measures at the same time.

Figure 6. Example histogram: the number of people in age groups of 5-year intervals.

Figure 7. Example boxplot: each user category has a distribution of the appreciation each has for a certain picture on a photography website.

The techniques we described in this phase are mainly visual, but in practice exploratory analysis is certainly not limited to visualization. Tabulation, clustering, and other modeling techniques can also be a part of exploratory analysis. Even building simple models can be a part of this step.
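The histogram and boxplot described above both reduce to simple tabulations. The sketch below computes the bin counts a histogram would draw and the five values a boxplot typically shows. It is an illustration only: the sample data, the 5-year bin width starting at 60, and the helper names `histogram_counts` and `five_number_summary` are assumptions for this example.

```python
import statistics
from collections import Counter

def histogram_counts(values, bin_width, start):
    """Count values per bin; each bin is labeled by its left edge."""
    counts = Counter()
    for value in values:
        left_edge = start + ((value - start) // bin_width) * bin_width
        counts[left_edge] += 1
    return dict(sorted(counts.items()))

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical ages, binned into 5-year intervals as in a histogram.
ages = [60, 62, 66, 67, 68, 71, 73, 74, 78, 81, 86]
age_bins = histogram_counts(ages, bin_width=5, start=60)
print(age_bins)

# Hypothetical appreciation scores per user category, as in a boxplot.
scores = {
    "amateur": [2, 3, 3, 4, 5],
    "professional": [4, 5, 5, 5, 6],
}
for category, values in scores.items():
    print(category, five_number_summary(values))
```

The histogram keeps the counts but hides the individual values, while the five-number summary keeps the shape of each category's distribution but drops the counts, which matches the contrast drawn between the two graphs.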
Now that you’ve finished the data exploration phase and you’ve gained a good grasp of your data, it’s time to move on to the next phase: building models.