Data Analytics Lecture 4 2024 PDF
Sheridan College
2024
Dr. Ameera Al-Karkhi
Summary
This document covers a lecture on data visualization, with examples of analyzing house price data using tables and charts. It also introduces the main types of visualization (e.g. histograms).
Full Transcript
1 Data Analytics Dr. Ameera Al-Karkhi Lecture 4 Subject Code: ENGI 55612 2024

2 Data visualization
Data visualization is the presentation of data in a format, such as a table or chart, chosen to convey specific information to a reader. Data presented as text only is often difficult to understand.
Example: Given this text-only data on 2013 median house prices in southern California counties, finding the price for a particular county is inconvenient: Los Angeles $405,000; Orange $661,000; Riverside $306,000; San Bernardino $192,000; San Diego $473,000; Ventura $464,000.

3 Instead, displaying the data visually as a table better conveys the information. A table displays data using rows and columns.
Southern California median house prices by county (2013)

4 As another example, the following data represents California median house prices from 2000-2010: 2000 $241,000; 2001 $262,000; 2002 $316,000; 2003 $372,000; 2004 $451,000; 2005 $523,000; 2006 $556,000; 2007 $560,000; 2008 $348,000; 2009 $275,000; 2010 $305,000. A table conveys the information better than text, but if the goal is to illustrate the housing price "bubble" that grew and then burst in 2008, a chart is even better.
California median house prices, 2000-2010.

5 California median house prices, 2000-2010.

6 Q: Refer to the table above showing California house prices by county (2013). A company is considering moving offices to San Bernardino county. What is the median house price in that county?
A: $192,000
A table is effective when the goal is to look up specific values. When the goal is to understand relative prices, a chart shows them better.

7 Q: Refer to the table above showing California house prices from 2000-2010. In what year did the price bubble burst? That is, in what year was the price drastically lower than the previous year?
A: 2008
2008's median house price of $348,000 is drastically lower than 2007's price of $560,000. However, using a table required comparing pairs of years until the drop was detected.

8 Q: Refer to the following figure, showing California house prices from 2000-2010. In what year was the peak of house prices?
A: 2007
The highest point is at 2007. The peak is readily visible. The actual dollar values are not relevant.
California median house prices, 2000-2010.
Q: Referring to the figure above, what was the relative difference between the highest and lowest California house prices? Answer with: double, triple, or quadruple.
A: double
Visually, the highest point in 2007 is about double the lowest points of either 2000 or 2009. That information is available in the table too, but requires more scanning and mental calculation.

9 Uses of data visualization
Expressing data as a table or chart allows the viewer to comprehend data more quickly than data presented as a list of numbers. A chart is particularly helpful in analyzing large datasets where a list, or even a table, of the data would be incomprehensible. Visual representation is also more intuitively grasped than numbers.
▪ The pie chart below shows the per person availability of milk in the United States in 2013. The viewer is able to quickly understand that plain 2% milk has the greatest availability, and gains an intuitive sense of how much more 2% milk is available than any other category.
Per person availability of milk in the United States in 2013
Moore, Karleigh, et al. "Data Presentation - Pie Charts." Brilliant.org. Retrieved 16 July 2018, brilliant.org/wiki/data-presentation-pie-charts/.

10 A chart can show (identify) trends in the data for the viewer. The graph below depicts the price of gold from 1971 until 2019. With the exception of market crashes in 1981 and 2013 and brief downward movements during that time, the main trend is upward.
As a result, someone considering investing in gold can come to the conclusion that, when done right, gold is generally a solid long-term investment.
The price of gold from 1971 until 2019
USDA Economic Research Service. "Table. dymfg." United States Department of Agriculture. www.ers.usda.gov/data-products/food-availability-per-capita-data-system/.

11 Q: Charts are useful for small datasets, but become too crowded when used with large datasets. True False
Q: Charts can be used to identify relationships between different variables. True False

12 Factors to take into account while displaying data:
▪ The size of the dataset
▪ The cardinality of the dataset
- Cardinality is the number of unique elements in a dataset. For example, the set of student IDs of students in a class has high cardinality, since each ID is unique, whereas the set of student ages has lower cardinality, since many students have the same age.
- Certain chart types, such as pie charts or bar charts, are well-suited to data with low cardinality, but not well-suited to high-cardinality data, as illustrated by the pie charts below.

13 - The chart on the left displays high-cardinality data and is difficult to read, while the chart on the right displays low-cardinality data.
Note: too many categories can be confusing. Be careful of putting too much information in a pie chart. The pie chart on the right gives a clear idea of the representation of types relative to the whole sample. The pie chart on the left is more difficult to interpret, with too many categories.

14 For high-cardinality data, other visualisations, including scatter plots, line charts, and histograms, work well. The histogram in the image below displays the distribution of numerous different data points by grouping the data into eight equal-sized bins. The scatter plot relates two high-cardinality variables.
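The ideas of cardinality and equal-sized bins described above can be sketched in a few lines of Python. This is only an illustrative sketch: the student IDs, ages, and measurements are made-up values, and the helper names `cardinality` and `equal_width_bin_counts` are hypothetical, not part of the lecture.

```python
import random

# Hypothetical class data (not from the lecture): IDs are unique, ages repeat.
student_ids = [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008]
student_ages = [19, 20, 19, 21, 20, 19, 20, 22]


def cardinality(values):
    """Return the number of unique elements in a collection."""
    return len(set(values))


def equal_width_bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical high-cardinality numeric data, grouped into eight bins
# as a histogram would do.
rng = random.Random(0)
measurements = [rng.gauss(50, 10) for _ in range(200)]
bin_counts = equal_width_bin_counts(measurements, 8)

print("ID cardinality:", cardinality(student_ids))
print("Age cardinality:", cardinality(student_ages))
print("Counts per bin:", bin_counts)
```

The IDs have as many unique values as there are students, while the ages collapse to a handful of categories, which is why ages would suit a pie or bar chart and the binned measurements would suit a histogram.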
15 The kind of data being presented and the information to be communicated both influence the type of chart that is used.
▪ A pie chart, histogram, or box plot can be used to illustrate a dataset with only one variable, or when only one variable needs to be presented.
▪ A scatter plot or line chart may be ideal for visualising a dataset that has two or more linked variables.
▪ A bar graph, pie chart, or violin plot can be used to visualise a dataset in which one of the variables is categorical. For instance, the bar graph below displays the quantity of exoplanets found by each of the ten techniques.
Bar chart

16 Boxplots
To show the centre, spread, and distribution of your data, boxplots use the 5-number summary (minimum and maximum values with the three quartiles). They provide a great numerical and visual description of the data when used in conjunction with histograms.
A histogram and boxplot of a normal distribution.

17 A histogram and boxplot of a left-skewed distribution.

18 A histogram and boxplot of a right-skewed distribution.

19 Which chart better conveys the most common number of discovered planets in a star system?

20 Which chart better conveys how gas mileage for cars has changed over time?

21 Which chart is more appropriate for showing data from three unrelated categories?

22 Visualizing Quantitative Data Using a Histogram
A histogram shows a quantitative variable's whole distribution:
▪ Partition all possible values into bins (bars).
▪ Then count the number of cases that fall into each bin.
▪ The bars (bins) are adjacent and of equal width (interval scale).
▪ The bins, together with these counts, give the distribution of the data.
▪ A gap in the histogram indicates that there is a bin with no cases in it.
▪ The number of bins is typically 5 to 30, unless very large data sets require more bins.
▪ The number of bins depends on the sample size.

23 Identifying the Shape of a Histogram
You need to identify whether there is any peak in the graph.
Does it have a single peak (one mode), or several peaks (modes)?
Note:
▪ A histogram with one peak/mode is called “unimodal”.
▪ A histogram with two peaks is called “bimodal”.
▪ A histogram with three or more peaks is called “multimodal”.
▪ A histogram with no apparent mode, in which all the bars are about the same height, is a uniform distribution.
Note: The mode is the value on the horizontal axis at which the histogram peaks.

24 Identifying the Shape of a Histogram
You need to identify whether it is symmetric or not:
▪ Fold the histogram along a vertical line in the middle and see if the edges match closely.
You need to identify whether it is skewed to the right or to the left:
▪ Skewness occurs when an observation (or a few observations) deviates from the overall pattern of the data.
▪ See if the thinner ends of a distribution, the tails, pull the distribution to their side.
▪ If one tail stretches out farther than the other, then the histogram is skewed toward the side of the longer tail.

25 Descriptive Univariate Analysis: Dispersion statistics
There are two main groups of univariate statistics:
◼ Location statistics
◼ Dispersion statistics

26 ◼ Location statistics:
◼ Minimum: is the lowest value
◼ Maximum: is the largest value
◼ Mean: is the average value
◼ Mode: is the most frequent value
◼ The value that is larger than:
◼ 25% of all values is the 1st quartile
◼ 50% of all values is the median or 2nd quartile
◼ 75% of all values is the 3rd quartile

27
Friend     Max temp  Weight  Height  Gender  Company
Andrew     25        77      175     M       Good
Bernhard   31        110     195     M       Good
Carolina   15        70      172     F       Bad
Dennis     20        85      180     M       Good
Eve        10        65      168     F       Bad
Fred       12        75      173     M       Good
Gwyneth    16        75      180     F       Bad
Hayden     26        63      165     F       Bad
Irene      15        55      158     F       Bad
James      21        66      163     M       Good
Kevin      30        95      190     M       Bad
Lea        13        72      172     F       Good
Marcus     8         83      185     F       Bad
Nigel      12        115     192     M       Good

28 ◼ Let us use as an example the attribute weight from our data set
Location statistic       Weight (kg)
Min                      55.00
Max                      115.00
Mean or average          79.00
Mode                     75.00
1st quartile             65.75
2nd quartile or median   75.00
3rd quartile             87.50

29 ◼ Box-plots present the minimum, the 1st quartile, the median, the 3rd quartile and the maximum statistics, in this order, bottom-up or from left to right
◼ The attribute height
Figure panels: Normal; right skewed (+ve); left skewed (-ve)

30 Box-plots: a way to plot the summary of positions
▪ A box-plot uses the 5-number summary statistics (minimum, 1st quartile, median, 3rd quartile, maximum).
▪ It reflects the shape of the data.
▪ It summarizes both the centre (median) and the spread (Interquartile Range: IQR = Q3 - Q1).
▪ The median is marked by a line drawn within the box, but it is not necessarily in the middle of the box for all data sets.

31 ▪ The box of the box plot contains the middle 50% of the data, from the Lower Quartile (1st Quartile, Q1 = 25th percentile) to the Upper Quartile (3rd Quartile, Q3 = 75th percentile).
▪ The lines extending from the box of the boxplot are called whiskers. These extend to the minimum and the maximum data values, except for suspect outliers, which are plotted individually with a circle.
▪ Data values more than 1.5(IQR) below Q1 or above Q3 are suspect outliers.
▪ Data values more than 3(IQR) below Q1 or above Q3 are highly suspect outliers.

32 Figure labels: First quartile (Q1), median (Q2), third quartile (Q3)

33 ◼ Box-plots can also be used to describe the symmetry/skewness of an attribute
◼ The median and the mode are more robust measures of central tendency than the mean in the presence of extreme values or strongly skewed distributions

34 Illustration of skewed and symmetric distributions
Note:
▪ In an approximately symmetric distribution, the mean and the median will be close to each other.
▪ In a skewed distribution:
- If Mean < Median, the data is left skewed.
- If Mean > Median, the data is right skewed.

35 Common Shapes of Histograms

36 Normal Distribution
An important class of smooth continuous curves is the symmetric, unimodal (bell-shaped) curves, also known as normal curves. They describe the normal distribution.
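The location statistics and box-plot rules from the earlier slides can be checked in code. The following is a minimal sketch using the weight attribute from the friends data set. It assumes the quartiles on the slide were computed with the "exclusive" interpolation method, which is the default of Python's `statistics.quantiles` and reproduces the slide's values. The variable names are illustrative only.

```python
import statistics

# Weight (kg) of the 14 friends, in the order listed in the data set.
weights = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]

minimum, maximum = min(weights), max(weights)
mean = statistics.mean(weights)
mode = statistics.mode(weights)

# Quartiles with the default "exclusive" method (assumed to match the slide).
q1, median, q3 = statistics.quantiles(weights, n=4)

# Interquartile range and the 1.5 * IQR fences for suspect outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
suspect_outliers = [w for w in weights if w < lower_fence or w > upper_fence]

print("Five-number summary:", minimum, q1, median, q3, maximum)
print("Mean:", mean, "Mode:", mode)
print("IQR:", iqr, "Fences:", lower_fence, upper_fence)
print("Suspect outliers:", suspect_outliers)
print("Mean > median (hint of right skew):", mean > median)
```

With these data the quartiles come out as 65.75, 75 and 87.5, no weight falls outside the fences, and the mean (79) sits above the median (75), which by the rule on slide 34 suggests a slight right skew.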
37 ▪ A normal distribution is characterized by its mean μ and its standard deviation σ: x ~ N(μ, σ)
▪ The mean μ is located at the centre of the symmetric curve and is the same as the median (and the mode).
▪ The standard deviation σ controls the spread of a normal curve.
▪ Normal distributions are an important model of probability distribution, because they:
1. approximate the real-world distributions of many variables well.
2. are the most important distributions for statistical inference.

38 Plot Normal Distribution Using Boxplot
For a normal distribution the plot should be:
▪ Symmetric about the median

39 Descriptive Univariate Analysis: Common Univariate Probability Distributions
◼ Many everyday quantities follow well-studied distributions, for example the height of adult men, the value of a random number, or the number of cars passing through a given highway toll. We present two of these distributions:
◼ The Uniform distribution
◼ The Normal distribution, also known as the Gaussian
◼ Both are continuous distributions and have known probability density functions

40 Uniform Distribution
When a continuous random variable is uniformly distributed, every possible value of x is equally likely.
In common with all continuous random variables, the area under the function over all the possible values of X is equal to 1. As a result, the probability density function of X can be worked out for any uniform distribution using a simple formula:

41 Uniform distribution definition: given that a random variable X has possible values from a to b such that all possible values are equally likely, it is said to be uniformly distributed, that is: $x \sim U(a, b)$
The probability density function of X is:
$$f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & \text{elsewhere} \end{cases}$$
The area under the density is a rectangle of length $L = b - a$ and height $h = \frac{1}{b-a}$:
$$A = L \times h = (b-a) \times \frac{1}{b-a} = 1$$

42 Properties of uniform distribution
◼ An attribute x that follows the uniform distribution with parameters a and b has equal frequency of occurrence of values in any interval of a given size
◼ $x \sim U(a, b)$
◼ Mean: $\mu_x = \dfrac{a+b}{2}$
◼ Variance: $\sigma_x^2 = \dfrac{(b-a)^2}{12}$
▪ $$P(x < x_0) = \begin{cases} 0, & \text{if } x_0 < a \\ \dfrac{x_0 - a}{b - a}, & \text{if } a \le x_0 \le b \\ 1, & \text{if } x_0 > b \end{cases}$$
For $x \sim U(0, 1)$, the probability of x < 0.3 is given by the proportion of the area under f(x) that lies to the left of 0.3.
The probability density function, f(x), of x ∼ U(0, 1)

43 Uniform distribution application
Example: What is the probability that a randomly selected 8-week-old baby would smile for two to eighteen seconds?
Answer: Find P(2 left skewed (-ve) =median => right skewed(+ve) No skewed (Normal)
❖ mean will always be to the right of the median

72

73 ▪ To calculate the skewness of data manually, you can use the following formula (Pearson's second skewness coefficient, as implemented in the example):
Skewness = 3 * (mean - median) / std_dev
Example in Python:

    import pandas as pd
    import numpy as np

    # Hypothetical sample data; replace with your own DataFrame
    df = pd.DataFrame({'data': [2, 3, 3, 4, 4, 4, 5, 9]})
    # Replace 'data' with the actual column name
    data_column = 'data'
    # Calculate the mean, median, and standard deviation of the data column
    # (np.std uses the population standard deviation by default, ddof=0)
    mean = np.mean(df[data_column])
    median = np.median(df[data_column])
    std_dev = np.std(df[data_column])
    # Calculate the skewness using the formula
    skewness = (3 * (mean - median)) / std_dev
    # Print the skewness value
    print("Skewness:", skewness)

74 ▪ To calculate the kurtosis of data manually, you can use the following formula:
Kurtosis = (1 / n) * Σ((xi - mean) ^ 4) / (std_dev ^ 4)
Example in Python:

    import pandas as pd
    import numpy as np

    # Hypothetical sample data; replace with your own DataFrame
    df = pd.DataFrame({'data': [2, 3, 3, 4, 4, 4, 5, 9]})
    # Replace 'data' with the actual column name
    data_column = 'data'
    # Calculate the mean and standard deviation of the data column
    mean = np.mean(df[data_column])
    std_dev = np.std(df[data_column])
    # Calculate the kurtosis using the formula
    n = len(df[data_column])
    kurtosis = (1 / n) * np.sum(((df[data_column] - mean) ** 4) / (std_dev ** 4))
    # Print the kurtosis value
    print("Kurtosis:", kurtosis)

75 Class Activity: Variance
Q: Why are the deviations squared when computing the variance?
- To avoid negative values, because a distance or dispersion cannot be negative.
- Also, positive and negative deviations would otherwise cancel each other out and give a misleading result.
- Calculate the variance of the following data:
Annual income: 62000 64000 49000 324000 1264000 54330 64000 51000 55000 48000 53000

76 Class Activity: Use the following data and study the skewness of each of the three data sets.
Hint: calculate the mean, median and mode, then plot each set to find the type of skewness.
col1  col2  col3
1     1     1
1     1     2
1     2     3
1     2     3
2     3     4
2     3     4
2     3     4
2     4     5
2     4     5
2     4     5
3     4     5
3     4     6
3     5     6
3     5     6
4     5     6
4     6     6
5     6     6
5     7     7
7     7     7

77 Exercise: Find the mean, median and mode for the set of scores in the frequency distribution table below:
x  f
5  2
4  3
3  2
2  2
1  1
Exercise: The following data represent verbal comprehension test scores of males and females.
▪ Female: 26 25 24 24 23 23 22 22 21 21 21 20 20
▪ Male: 20 19 18 17 22 21 21 26 26 26 23 23 22
1. Calculate the mean, mode, and median for males and females separately.
2. What kind of distribution is this?

78
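One way to check hand calculations for a frequency-table exercise like the one above is to expand the table into raw scores and let Python's `statistics` module compute the three measures. This is only a sketch: the frequency table below is a made-up example, deliberately not the exercise data, and the variable names are illustrative.

```python
import statistics

# Hypothetical frequency table (score -> frequency); NOT the exercise data.
freq_table = {10: 1, 20: 3, 30: 2}

# Expand the table into the raw list of scores: each score x appears f times.
scores = [x for x, f in freq_table.items() for _ in range(f)]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
# multimode returns every most-frequent value, so ties are not hidden.
modes = statistics.multimode(scores)

print("Scores:", scores)
print("Mean:", mean_score)
print("Median:", median_score)
print("Mode(s):", modes)
```

Replacing `freq_table` with the values from an exercise gives a quick check of the mean, median and mode, and comparing the mean with the median then indicates the direction of any skew.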