Exploratory Data Analysis Lecture Notes PDF

Document Details

Uploaded by BreathtakingMoonstone873

Menoufia National University

Dr. Amira Abdelatey

Tags

exploratory data analysis, data analysis, visualization, statistics

Summary

These lecture notes cover exploratory data analysis (EDA). The document details various types of data analysis, including descriptive analysis. It also includes graphical representations like histograms, bar graphs, and pie charts as examples.

Full Transcript

Exploratory Data Analysis
Dr. Amira Abdelatey

Exploratory data analysis (EDA)

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

Three Types of Analytics

Prescriptive analytics transforms data-driven insights into actionable strategies, bridging the gap between knowledge and effective decision-making, by using optimization or heuristic techniques.

Choosing the right type depends on your business.

Descriptive Analysis

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphical analysis, they form the basis of virtually every quantitative analysis of data.

Three types:
1. Measures of distribution
2. Measures of dispersion
3. Measures of central tendency

Measures of Distribution

Grouping the data elements into categories and charting the frequency within these categories gives a graphical illustration of how the data is distributed throughout its range.

Frequency distribution: A frequency distribution shows how often each item occurs, in graphical or tabular form.

Example: The following are the scores of 10 students in the G.K. quiz released by Mr. Chris: 15, 17, 20, 15, 20, 17, 17, 14, 14, 20. Find out the number of students who got the same marks.

Quiz Marks    No. of Students
15            2
17            3
20            3
14            2

Types of frequency distribution

Ungrouped frequency distribution: It shows the frequency of each separate data value rather than of groups of data values.

Marks obtained in Test    No. of Students
5                         3
10                        4
15                        5
18                        4
20                        4
Total                     20

Grouped frequency distribution: In this type, the data is arranged and separated into groups called class intervals.

Marks obtained in Test (class intervals)    No. of Students (Frequency)
0–5                                         3
6–10                                        4
11–15                                       5
16–20                                       8
Total                                       20

Frequency Distribution Graphs

Histograms: A histogram is a graphical presentation of data using rectangular bars of different heights. In a histogram, there is no space between the rectangular bars.

Bar Graphs: Bar graphs represent data using rectangular bars of uniform width with equal spacing between the bars.

Pie Chart: A pie chart displays data as a circle divided into sectors, each showing one part of the data as a share of the whole.

Frequency Polygon: A frequency polygon is drawn by joining the midpoints of the tops of the bars in a histogram.

Histogram

Height Range (ft)    Number of Trees (Frequency)
60 - 65              3
66 - 70              3
71 - 75              8
76 - 80              10
81 - 85              5
86 - 90              1

Bar graphs

A survey of 145 people asked them "Which is the nicest fruit?":

Fruit        People
Apple        35
Orange       30
Banana       10
Kiwifruit    25
Blueberry    40
Grapes       5

Pie chart

Pie charts represent relative (percentage) frequencies by displaying how much of the whole pie each category represents.

Example: "Who would win in a battle of superheroes?" [pie chart]

Central Tendency

The 3 most common measures of central tendency are the mode, median, and mean.

Mode: the most frequent value.
Median: the middle number in an ordered data set.
Mean: the sum of all values divided by the total number of values.

Dispersion (Spread)

Dispersion refers to the spread of the values around the central tendency.

1. Range
2. Variance
3. Standard Deviation
4. Skewness
5. Interquartile Range (IQR)

Normal Distribution

We say the data is "normally distributed". The Normal Distribution has:
- mean = median = mode
- symmetry about the center
- 50% of values less than the mean and 50% greater than the mean

Dispersion

Range: The range is the difference between the largest and smallest values in the data.

Variance: The variance measures the average degree to which each point differs from the mean. It is the average of the squared differences from the mean.

Standard deviation: The standard deviation measures the spread of a group of numbers around the mean. It is the square root of the variance.

To calculate the variance, follow these steps:
1. Work out the mean.
2. For each number, subtract the mean and square the result (the squared difference).
3. Work out the average of those squared differences.

Example: You and your friends have just measured the heights of your dogs (in millimeters). The heights (at the shoulders) are 600 mm, 470 mm, 170 mm, 430 mm and 300 mm.

The mean (average) height is (600 + 470 + 170 + 430 + 300) / 5 = 394 mm.

Now we calculate each dog's difference from the mean: 206, 76, -224, 36 and -94.

To calculate the variance, take each difference, square it, and then average the results:

Variance = (42,436 + 5,776 + 50,176 + 1,296 + 8,836) / 5 = 108,520 / 5 = 21,704 mm²
Standard deviation = √21,704 ≈ 147 mm

Using the standard deviation, we have a "standard" way of knowing what is normal, and what is extra large or extra small. Rottweilers are tall dogs. And Dachshunds are a bit short, right?

Interquartile Range

The interquartile range tells you the spread of the middle half of your distribution. It is found by subtracting the Q1 value from the Q3 value:

IQR = Q3 - Q1

Five number summary

A boxplot, or box-and-whisker plot, summarizes a data set visually using a five-number summary. Every distribution can be organized using these five numbers:

1. Lowest value (minimum)
2. Q1: 25th percentile
3. Median (Q2)
4. Q3: 75th percentile
5. Highest value (maximum)

The placement of the box tells you the direction of the skew. A box that is much closer to the right side means you have a negatively skewed distribution, and a box closer to the left side tells you that you have a positively skewed distribution.

Outlier detection

An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error. Outliers can cause serious problems in analyses, so they are usually excluded from the dataset.

We can use the IQR method of identifying outliers to set up a "fence" outside of Q1 and Q3. Any values that fall outside of this fence are considered outliers. To build the fence, we take 1.5 times the IQR, subtract this value from Q1, and add it to Q3:

Lower fence = Q1 - 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR

These are the minimum and maximum fence posts that each observation is compared to. Any observation more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3 is considered an outlier.

Notes

The standard deviation and IQR are more accurate and detailed estimates of dispersion than the range, because a single outlier can greatly exaggerate the range.

The standard deviation is bigger when the differences from the mean are more spread out.

The standard deviation, boxplot and IQR can be used to compare how much the data varies in two variables from two datasets, for example classroom results.

A high standard deviation shows that the data is widely spread (less reliable), and a low standard deviation shows that the data is clustered closely around the mean (more reliable).
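As a minimal sketch, the two frequency-distribution examples from the notes (the G.K. quiz scores and the test-marks tables) can be reproduced in Python with the standard library. The helper name `grouped_frequency` and the list `test_marks`, which expands the ungrouped table back into raw values, are assumptions made for illustration; the notes give only the tables.

```python
from collections import Counter

# Quiz scores of the 10 students from the frequency-distribution example.
scores = [15, 17, 20, 15, 20, 17, 17, 14, 14, 20]

# Ungrouped frequency distribution: one count per distinct mark.
ungrouped = Counter(scores)


def grouped_frequency(values, intervals):
    """Count how many values fall in each inclusive (low, high) class interval."""
    return {
        (low, high): sum(low <= v <= high for v in values)
        for low, high in intervals
    }


# Raw marks rebuilt from the ungrouped table (5 -> 3 students, 10 -> 4, ...).
# This expansion is an assumption; the notes list only the frequencies.
test_marks = [5] * 3 + [10] * 4 + [15] * 5 + [18] * 4 + [20] * 4
class_intervals = [(0, 5), (6, 10), (11, 15), (16, 20)]
grouped = grouped_frequency(test_marks, class_intervals)

for mark in sorted(ungrouped):
    print(f"marks={mark}: {ungrouped[mark]} students")
for (low, high), count in grouped.items():
    print(f"{low}-{high}: {count} students")
```

The grouped counts match the class-interval table in the notes: the marks 18 and 20 both fall in the 16–20 interval, which is why its frequency is 8.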
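The notes define the mode, median and mean but do not work an example. The sketch below applies Python's standard `statistics` module to the ten G.K. quiz scores from the frequency-distribution example; reusing that data here is an assumption made for illustration.

```python
import statistics

# Quiz scores reused from the frequency-distribution example.
scores = [15, 17, 20, 15, 20, 17, 17, 14, 14, 20]

# Mean: the sum of all values divided by the number of values.
mean_score = statistics.mean(scores)

# Median: the middle value of the ordered data
# (the average of the two middle values when the count is even).
median_score = statistics.median(scores)

# Mode: the most frequent value. multimode returns every value tied
# for the highest frequency, so ties are not silently dropped.
modes = statistics.multimode(scores)

print("mean:", mean_score)
print("median:", median_score)
print("modes:", modes)
```

In this data 17 and 20 each occur three times, so there are two modes; `statistics.mode` would report only the first of them.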
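The variance steps from the Dispersion slides can be sketched directly in Python using the dog heights from the notes. The variable names are illustrative, and the final check of which heights lie within one standard deviation of the mean is one possible reading of the "standard way of knowing what is normal" remark, not something the notes spell out.

```python
import math

# Dog heights at the shoulders, in millimeters (from the notes).
heights_mm = [600, 470, 170, 430, 300]

# Step 1: work out the mean.
mean = sum(heights_mm) / len(heights_mm)

# Step 2: for each number, subtract the mean and square the result.
squared_diffs = [(h - mean) ** 2 for h in heights_mm]

# Step 3: average the squared differences (population variance).
variance = sum(squared_diffs) / len(heights_mm)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

# Assumed interpretation: heights within one standard deviation of the mean.
within_one_sd = [h for h in heights_mm if abs(h - mean) <= std_dev]

print(f"mean = {mean} mm")
print(f"variance = {variance} mm^2")
print(f"standard deviation = {std_dev:.0f} mm")
print("within one standard deviation:", within_one_sd)
```

This divides by the number of values, as the notes do, which gives the population variance; `statistics.pvariance` and `statistics.pstdev` compute the same figures, whereas `statistics.variance` divides by n - 1 and would give a larger result.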
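The five-number summary and the IQR fence method for outliers can be sketched as follows. The notes give no dataset for this part, so the list `results` is hypothetical, as are the function names. Quartile conventions also differ between textbooks and tools; the "inclusive" method of `statistics.quantiles` used here is an assumption, and other methods can give slightly different Q1 and Q3 values.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Return the (lower, upper) fences and the values outside them."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return (lower_fence, upper_fence), outliers


# Hypothetical classroom results, not taken from the notes.
results = [40, 62, 65, 68, 70, 73, 75, 78, 99]

summary = five_number_summary(results)
fences, outliers = iqr_outliers(results)

print("five-number summary:", summary)
print("fences:", fences)
print("outliers:", outliers)
```

With this data Q1 is 65 and Q3 is 75, so the IQR is 10 and the fences sit at 50 and 90; the values 40 and 99 fall outside them and are flagged.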
