SAS Topic 2 Descriptive Statistics PDF
Uploaded by PraisingBowenite9541
Temasek Polytechnic
Summary
This document provides a detailed outline of SAS Topic 2 on descriptive statistics. The learning outcomes cover concepts such as differentiating between population and sample statistics, discrete and continuous data, calculations, graph construction, and graphical representations. Software skills like utilizing calculator functions and software for statistical calculations are also touched upon.
Full Transcript
SAS Topic 2 Descriptive Statistics

Learning Outcomes
1. Differentiate between population and sample statistics
2. Differentiate between discrete and continuous data
3. Calculate the sum, mean, mode, median, quartiles, range, variance, standard deviation, and standard error for a given set of data
4. Construct graphs from datasets
5. Explain graphical representations of data

Calculator/software skills
a) Compute basic descriptive statistics (mean, standard deviation) from common calculator models
b) Compute basic descriptive statistics (mean, mode, median, standard deviation) from software
c) Chart data using software

Statistics is all around us!
Source: https://www.talkwalker.com/blog/social-media-statistics-singapore
On average, TP students spend about 2 hours every day checking out social media accounts.
The most common way TP students come to school is by public bus.
I spend about $10 on average every week on bubble tea.
50% of students in the diploma scored more than 65 marks in this quiz.

On average, TP students spent about 2 hours every day checking out social media accounts.
Questions:
How did one get to this average duration of 2 hrs?
Can we assume that every student in TP also spends roughly 2 hours on the internet?

Imagine the process
Student ID | Time spent (hr)
1 | 2.5
2 | 0.5
3 | 1

Learning Outcome 1: Difference between a population and a sample
Population: every single student in TP
Sample: a subset of TP students

A population is a complete collection of measurements, objects, or individuals under study.
A sample is a portion or subset taken from a population.
Example:
Population: all TP students
Sample: Year 2 TP students

Sample statistic: average social media hours = 2 hrs, computed from the sampled students only
Population parameter: the actual true average
Student ID | Time spent (hr)
1 | 2.5
2 | 0.5
3 | 1
… | …
A parameter is a characteristic or measure obtained by using all the data values from a specific population.
A statistic is a characteristic or measure obtained by using the data values from a sample.

Example:
Parameter: mean (or average) social media hours of all TP students
Statistic: mean social media hours of Year 2 students

To differentiate these two means, we use different symbols:
Population mean (parameter): $\mu$ (read as "mu")
Sample mean (statistic): $\bar{x}$ (read as "x-bar")

Learning Outcome 2: Differentiate between discrete and continuous data

Person ID | Time spent (hr) | Diploma course | Gender
1 | 2.5 | PHS | Male
2 | 0.5 | ChE | Male
3 | 1 | MBT | Female
4 | 3.2 | FNC | Male
… | … | … | …
Each column is a variable. Each row is an observation (data point).

A variable is a characteristic of interest. It can assume different values.
Types of variables:
Quantitative: has numerical values. E.g. time spent on social media, weight of students, height, monthly salary.
Categorical: takes on a specific category or label. E.g. gender, diploma course, occupation type.

In the table above, diploma course and gender are categorical variables. Time spent on social media is a quantitative variable.

Quantitative variables are either discrete or continuous:
Discrete: fixed values, "countable". E.g. number of household pets, number of siblings, number of mobile devices owned.
Continuous: an infinite collection of possible values, so one cannot "pinpoint" a particular value. E.g. height of TP students (162.1 cm, 159 cm, 182.5 cm, ...).
Another example of continuous data: hours spent on social media. In the table of Person ID, Time spent (hr), Diploma course and Gender, time spent on social media is a continuous variable.

Another way of classifying data is to use four levels of measurement: Nominal, Ordinal, Interval and Ratio. Nominal and ordinal data are more commonly associated with categorical variables.

Nominal: purely labels or names.
E.g. Female, Male; bus, cab, train; red, blue, brown.

Ordinal: values that can be arranged in a particular order that makes sense. The difference between categories is not always constant or fixed.
E.g. grades A, B, C, D; strongly agree, agree, disagree.

Interval: like ordinal data, interval data can be arranged in order. Unlike ordinal data, the differences between data values are always fixed and meaningful. A zero data point or value is meaningless.
Example: the years 1000, 2000, 1776 and 1942 (there is no Year 0). Year 1776 and Year 1777 are 1 year apart, and so are Year 1942 and Year 1943.

Ratio: ratio data can also be arranged in order, and differences between values can be found and are meaningful. In addition, ratios between values are consistent and meaningful, because there is a natural zero starting point.
Examples: time spent on social media (can be 0 hr); weight of 7-week-old puppies (can be 0 theoretically!).
Time spent on social media (can be 0 hr): the ratio 3 hr : 1 hr equals the ratio 6 hr : 2 hr.
A 7-week-old puppy at 6 kg is twice as heavy as one at 3 kg. Another puppy which weighs 7 kg is also twice as heavy as a 3.5 kg puppy.

While ratio and interval data are commonly associated with numbers (quantitative variables), they can also be ranked and so have ordinal character. Ratio measurements may be thought of as possessing all the "underlying" scales:
Ratio: absolute zero
Interval: equal distance
Ordinal: order
Nominal: categories
E.g. weight of children in a primary school (the actual weight can be classified as Underweight, Normal weight and Overweight).

Types of variables (overview):
Quantitative (has numerical values): discrete or continuous; interval or ratio level
Categorical (takes on a specific category or label): nominal or ordinal level

Practice question: Determine the nature of the variables shown in the table. The first two rows are filled in as examples.
Variable | Variable type | Nominal | Ordinal | Interval | Ratio
Hair color (Grey, Black, White) | Categorical | ✔ | ❌ | ❌ | ❌
Age (25, 13, 8.5, 55 etc) | Quantitative | ✔ | ✔ | ✔ | ✔
Length of panda cubs in a zoo | | | | |
Types of social media platform (IG, TikTok etc) | | | | |
Monthly allowance ($210, $350…) | | | | |
Sleeping time (9pm, 10pm, 11pm, …) | | | | |

Solution:
Variable | Variable type | Nominal | Ordinal | Interval | Ratio
Hair color (Grey, Black, White) | Categorical | ✔ | ❌ | ❌ | ❌
Age (25, 13, 8.5, 55 etc) | Quantitative | ✔ | ✔ | ✔ | ✔
Length of panda cubs in a zoo | Quantitative | ✔ | ✔ | ✔ | ✔
Types of social media platform (IG, TikTok etc) | Categorical | ✔ | ❌ | ❌ | ❌
Monthly allowance ($210, $350…) | Quantitative | ✔ | ✔ | ✔ | ✔
Sleeping time (9pm, 10pm, 11pm, …) | Quantitative | ✔ | ✔ | ✔ | ❌

Summary of Learning Outcomes 1 and 2
1. Differentiate between population and sample statistics
2. Differentiate between discrete and continuous data
A sample is a subset of a population.
Variables can be numerical (quantitative) or categorical.
Observations or data points collected for each variable are measured on the NOIR scale (Nominal, Ordinal, Interval or Ratio).

On average, TP students spent about 2 hours every day checking out social media accounts.
Questions:
How did one get to this average duration of 2 hrs?
Can we assume that every student in TP also spends roughly 2 hours on the internet?

Learning Outcome 3 : Calculate the sum, mean, mode, median, quartiles, range, variance, standard deviation, and standard error for a given set of data

The 1st question depends on the variable collected. Time is a quantitative variable, so it makes sense to describe the data in terms of:
the mean (average) social media hours;
the usage duration that the highest number of students clocked;
the usage duration that accounts for 50% of the students (or even 25%, 60% or 75%).
Descriptive statistics is the process of describing and summarizing the data from the sample.

The 2nd question depends on whether the result from the sample is very close to the true population mean. The sampling method is important. How far can we generalize from our small experiment and predict the unknown mean social media hours of ALL TP students? This is the domain of inferential statistics.

We describe or summarize data to look for trends or patterns. Where do the observations tend to "crowd"?
3Ms: Mean, Median and Mode, collectively termed measures of central tendency.

The mean is the mathematical average of the data values in the sample. If there are n observations, $x_1, x_2, \ldots, x_n$, the sample mean is given by the formula

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$\bar{x}$ is read as "x-bar" and represents the mean value of x in the sample.
$\sum_{i=1}^{n} x_i$ is read as "summation of all values of $x_i$ from i = 1 to the nth value" in the sample, i.e. the total sum of the observations.
n is the sample size, i.e. how many data points there are.

Example 1: A class of 15 children were asked to fold paper flowers. The numbers of flowers the children folded in an hour are as follows:
5 3 7 2 9 3 8 5 6 4 4 5 6 11 10
Calculate the mean number of flowers folded, leaving your answer to 3 significant figures.

Example 1 solution: Let $x_i$ = number of flowers child i folded in an hour, where i = 1, 2, 3, ..., 15.

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{88}{15} \approx 5.87$$

The mean number of flowers folded in an hour is 5.87.

Calculator/software skills
While one may enter "5 + 3 + … =" and then divide the answer by 15 …
It is possible to use the Statistics function in most calculators to obtain the mean.
The Statistics function takes the data as input in a different way. It skips the arithmetic and is a shortcut to different descriptive statistics.
Software like Excel has an additional tab to handle statistical calculations (Analysis ToolPak).
Specialized statistical analysis software (SPSS, Jamovi, R) uses menu interfaces, code and programming.

The mean is easily influenced by extreme values because it depends on every measurement. Back to the flower-folding classroom in Example 1, with one child's count changed to 30:
5 3 7 2 9 3 8 5 6 4 30 5 6 11 10

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{114}{15} = 7.60$$

The old mean was 5.87.

The median is another measure of central tendency. It is the middle value when all measurements are arranged from the largest to the smallest value (or vice versa). Unlike the mean, the median is not as affected by extreme values.

Calculating the median:
(1) Arrange the data (observations) from largest to smallest or vice versa.
(2) Locate the middle observation. This is the median.
(3) If the sample size n is an odd number, the median observation is at the $\left(\frac{n+1}{2}\right)$th position.
(4) If the sample size n is an even number, the median is the average of the 2 observations located at the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th positions.
(5) Note: these formulae DO NOT give you the median itself, only the positions of the observations needed to get the median.

Calculator/software skills
The median cannot be obtained directly using a calculator; you must sort the data manually. Software sorts the data and calculates the median automatically.

Example 2: Using the data from Example 1, find the median number of flowers folded by the class.
5 3 7 2 9 3 8 5 6 4 4 5 6 11 10
Solution: There are 15 children (n is an odd number). Rearranging the data from smallest to largest:
2 3 3 4 4 5 5 5 6 6 7 8 9 10 11
The median observation is at the $\left(\frac{15+1}{2}\right)$th position, i.e. the 8th observation, which is 5.
The median number of flowers is 5. This means that half the children in the class folded 5 or fewer flowers and the other half folded 5 or more.

Example 3: The teacher realized there was a mistake in counting the number of flowers folded by one of the children. The corrected set of values is:
5 3 7 2 9 3 8 5 6 4 4 5 6 17 10
Solution: There are 15 children (n is an odd number). The median observation is at the $\left(\frac{15+1}{2}\right)$th position, i.e. the 8th observation:
2 3 3 4 4 5 5 5 6 6 7 8 9 10 17
As in the previous data set (Example 2), the median number of flowers remains unchanged at 5 flowers. An extreme value has less impact on the median but would change the mean.

Example 4: Let's assume another child joins the class. Find the new median number of flowers folded by the class.
5 3 7 2 9 3 8 7 5 6 4 4 5 6 11 10
Solution: There are 16 children (n is an even number). Rearranging the data from smallest to largest:
2 3 3 4 4 5 5 5 6 6 7 7 8 9 10 11
The median is the average of the data points at the $\left(\frac{16}{2}\right)$th and $\left(\frac{16}{2}+1\right)$th positions, i.e. the 8th and 9th observations, which are 5 and 6.
The median number of flowers is $\frac{5+6}{2} = 5.5$ flowers.

The median is also known as the 50th percentile: 50% of the observations are larger than the median value, while the other half are smaller than the median value.
Other "position" values are sometimes used, e.g. the 25th percentile, 75th percentile and 90th percentile.

The mode is another measure of central tendency. It is the value that occurs most frequently in a set of data.
It is not necessarily a single value: there can be more than one mode in a set of data (multi-modal).
Similarly, if no data value occurs more frequently than the others, the data has no mode.
Like the median, the mode cannot be obtained directly by calculator.

Example 5: Using data from Example 1, find the mode of the number of flowers folded by the children.
5 3 7 2 9 3 8 5 6 4 4 5 6 11 10
Solution:
No. of flowers | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
No. of children* | 1 | 2 | 2 | 3 | 2 | 1 | 1 | 1 | 1 | 1
* Ensure the total number of children adds up to 15.
Since the value 5 occurs most frequently, the mode is 5 flowers.
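The hand calculations in Examples 1 to 5 can also be checked with software. The following is a minimal sketch only, assuming Python 3.8 or later and its standard-library `statistics` module. Python is an assumption here (the slides name Excel, SPSS, Jamovi and R), and the names `central_tendency`, `corrected` and `bigger_class` are illustrative.

```python
from statistics import mean, median, multimode

# Number of flowers folded by the 15 children in Example 1
flowers = [5, 3, 7, 2, 9, 3, 8, 5, 6, 4, 4, 5, 6, 11, 10]


def central_tendency(data):
    """Return (mean rounded to 2 decimal places, median, list of modes)."""
    return round(mean(data), 2), median(data), multimode(data)


# Examples 1, 2 and 5: the original class of 15 children
print(central_tendency(flowers))

# Example 3: the count of 11 is corrected to 17 (an extreme value)
corrected = [17 if x == 11 else x for x in flowers]
print(central_tendency(corrected))

# Example 4: a 16th child who folded 7 flowers joins the class
bigger_class = flowers + [7]
print(central_tendency(bigger_class))
```

The first call reproduces the worked answers (mean 5.87, median 5, mode 5). In the second call the mean changes but the median stays at 5, and in the third call the median becomes 5.5. Note that `multimode` returns every value when all values tie, whereas the slides report such data as having no mode.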
Example 6: Find the mode of the height (in cm) of 10 students from a swimming club, measured as shown:
161 163 157 167 164 165 176 179 165 164
Solution: Both 164 and 165 occur most frequently (twice each), while the other heights occur only once. Thus the modes are 164 cm and 165 cm.

Do you notice the difference between finding the mode and the median?
Median | Mode
Must arrange the data in ascending or descending order | No need to arrange the data
Calculate the mean of the two middle observations (when n is even) | Mode values are reported separately (no mean values)

Example 7: A six-sided fair dice has the numbers 1, 2, 3, 4, 5 and 6 printed on its faces, one number per face. What is the mode?
Solution: Since each number appears the same number of times (once), there is no mode.

Practice question: 12 students from a Year 1 diploma course in TP were interviewed to find out how they travel to school. The data is as follows:
Student ID | Transport type
1 | Public bus
2 | Taxi
3 | MRT
4 | MRT
5 | Taxi
6 | Taxi
7 | Public bus
8 | Public bus
9 | MRT
10 | Cycle
11 | Walk
12 | Cycle

(a) What type of variable is "Transport type"? (Quantitative or categorical?)
(b) What level of measurement is "Transport type"?
(Nominal, ordinal, interval or ratio?)
(c) What is the central tendency measure that you can report for this study? (Mean, median or mode?)

Solution to practice question:
(a) What type of variable is "Transport type"? (Ans: Categorical)
(b) What level of measurement is "Transport type"? (Ans: Nominal. It does not make sense to rank "bus", "taxi" etc. in any logical order!)
(c) What is the central tendency measure that you can report for this study? (Ans: We can report the mode of the transport type, i.e. the type of transport used by the largest number of students.)
(d) What is the value of the statistic reported in (c)? (Ans: The modal transport types are public bus, MRT and taxi.
These transport categories attract equal and highest student counts (3 students each).)

Another way of reporting the data is to assess how far apart individual observations are from the mean. These are called measures of dispersion or spread. They are necessary because using only measures of central tendency may be misleading.

For example, the following data gives the overall exam results for two groups of 9 students:
Group | Scores | Mean | Gap between each value and mean | Dispersion/spread
A | 78, 70, 73, 73, 76, 80, 70, 74, 72 | 74 | Closer to mean | Small
B | 50, 87, 92, 78, 53, 66, 61, 84, 95 | 74 | Further from mean | Large
Though the mean values for both groups are the same, notice that:
Group A marks are closer to the mean value of 74.
Group B individual students' marks are further from the mean.
They have different degrees of dispersion or spread.

Numerical measures of dispersion:
Range
Interquartile range (IQR)
Standard deviation (s.d.)
Variance (var)

Starting with the range: the difference between the maximum and the minimum value.
Range = maximum value - minimum value
It is easy to calculate and understand. However, it ignores how the values between the two extremes vary, and it is highly susceptible to extreme values (called outliers).

Example 8: Back to the overall exam results of the 9 students from Groups A and B. Calculate the respective ranges.
Group | Scores
A | 78, 70, 73, 73, 76, 80, 70, 74, 72
B | 50, 87, 92, 78, 53, 66, 61, 84, 95
Solution:
Range of Group A = 80 - 70 = 10 marks
Range of Group B = 95 - 50 = 45 marks

Quartiles and Interquartile Range
Recall that the median observation is also called the 50th percentile: 50% of the data values are lower than or equal to the median value.
In general, the pth percentile of a sample is the data point such that p percent of the observations are equal to or less than it. E.g. the 80th percentile of students' height in a class is 156 cm.

It is more common to report the quartiles, i.e. the 25th and 75th percentiles:
25th percentile = Quartile 1 or Q1
75th percentile = Quartile 3 or Q3
Arranged from the smallest to the largest observation, the sample splits into four parts:
Smallest value | first 25% of sample | Q1 | next 25% of sample | Median | third 25% of sample | Q3 | top 25% of sample | Largest value

The interquartile range (IQR) = 75th percentile data point - 25th percentile data point, i.e. IQR = Q3 - Q1.

Calculator/software skills
Formulae for Q1 and Q3 are not as straightforward as those for the mean or median. In practice, software is used to compute the percentiles of a data set.

Standard deviation and variance
Consider the monthly income of ten randomly chosen workers from City A and City B (in thousand US dollars, rounded to nearest thousand).
City A | 2 9 11 3 8 6 12 4 15 17
City B | 2 15 16 2 17 2 3 16 3 5
Do a tally of the number of workers at each salary value.

[Figure: dot-plots of workers' salaries in City A and City B. The horizontal axis shows the salary points. Each "dot" is one observation or worker. Workers with identical salaries are "stacked" on top of each other.]
Both cities have identical ranges of salaries: Range = 17000 - 2000 = 15000 USD.
The shapes (distributions) of the graphs are very different! City B has salaries clustered at both ends of the scale, while City A is more "evenly" spread out.

The range is not resistant to extreme values or outliers. Because the range uses only the maximum and minimum values, it does not take every value into account. Therefore it does not truly reflect the variation between all the observations.

Standard deviation and variance
Both the standard deviation and the variance measure how much the observations in a sample are spread out.
The variance measures how far the observations deviate from the mean, using the squared differences.
The standard deviation is the square root of the variance.

For a sample of n observations, $x_1, x_2, \ldots, x_n$, the sample variance $s^2$ is given by the formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$$

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$$

Sample standard deviation = $\sqrt{\text{sample variance}}$

Calculator/software skills
In practice, instead of using the (complicated) formula and risking mistakes:
Use the calculator's Statistics function to input the observations.
Obtain the standard deviation (SD).
To get the variance, square the value of the SD.

Population standard deviation and variance
The formulae are very similar. For a population of N observations, $x_1, x_2, \ldots, x_N$, the variance is given by the formula:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Observations might be larger or smaller than the mean, so an individual difference may be positive or negative. However, because the differences are squared, the variance is never negative (it is 0 only when all observations are identical).

Statistical symbols for population parameters and sample statistics
 | Sample | Population
Standard deviation | $s$ | $\sigma$
Variance | $s^2$ | $\sigma^2$
Mean | $\bar{x}$ | $\mu$

Example 9: Back to the salaries of workers in City A and City B. Calculate the standard deviation of the monthly income of the ten randomly chosen workers from City A and City B (in thousand US dollars). Comment on your answers.
Solution: Let $x_i$ denote the salary of the ith worker (i = 1, 2, 3, ..., 10). For City A:
Step (1) Calculate the sample mean: $\bar{x} = \frac{87}{10} = 8.7$
Step (2) Calculate the difference between each salary point and the sample mean
Step (3) Square each difference from Step (2)
Step (4) Total up the squares from Step (3): $\sum (x_i - \bar{x})^2 = 232.1$

With n - 1 = 9, the sample standard deviation for City A is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{232.1}{9}} \approx 5.08$$

Repeat Steps (1) to (4) for City B!

Another method uses the second form of the formula, which does not require the sample mean:

$$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$$

For City A:
Step (1) Find the square of each salary value
Step (2) Total up the squares from Step (1): $\sum x_i^2 = 989$
Step (3) Total up the original salary values: $\sum x_i = 87$
(On the slide, the data for each step are shown in an Excel table.)

With n - 1 = 9, the standard deviation for City A is

$$s = \sqrt{\frac{989 - \frac{87^2}{10}}{9}} = \sqrt{\frac{232.1}{9}} \approx 5.08$$

Both formulae give the same SD value.

Solution: The standard deviations of the monthly income of the 10 sampled workers from City A and City B are 5.08 and 6.87 thousand USD respectively. Compared with City A, the standard deviation for City B is larger. Hence the data for City B has a larger spread.
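The two calculation routes in Example 9 can be sketched in code to confirm that they agree. This is a minimal illustration only, assuming Python with its standard-library `statistics` and `math` modules (Python is not a tool named in the slides). The function names `sd_from_deviations` and `sd_shortcut` are hypothetical, and the City B list assumes the tenth value is 5, as in the salary data for the two cities.

```python
from math import sqrt
from statistics import mean, stdev

# Monthly income in thousand USD (Example 9)
city_a = [2, 9, 11, 3, 8, 6, 12, 4, 15, 17]
city_b = [2, 15, 16, 2, 17, 2, 3, 16, 3, 5]


def sd_from_deviations(data):
    """Steps (1) to (4): squared deviations from the sample mean, divided by n - 1."""
    x_bar = mean(data)
    squared_devs = [(x - x_bar) ** 2 for x in data]
    return sqrt(sum(squared_devs) / (len(data) - 1))


def sd_shortcut(data):
    """Second method: uses the sum of squares and the sum, not the mean."""
    n = len(data)
    sum_of_squares = sum(x * x for x in data)
    total = sum(data)
    return sqrt((sum_of_squares - total ** 2 / n) / (n - 1))


for name, data in (("City A", city_a), ("City B", city_b)):
    data_range = max(data) - min(data)
    print(name, "range:", data_range,
          "SD:", round(sd_from_deviations(data), 2),
          "shortcut SD:", round(sd_shortcut(data), 2),
          "library SD:", round(stdev(data), 2))
```

Both cities have the same range (15), yet the standard deviations differ (5.08 against 6.87), which is the point the dot-plots make. `statistics.stdev` divides by n - 1, so it matches the sample formula; `statistics.pstdev` would be the population version that divides by N.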
Note: Manual calculation is tedious, so Example 9 is for illustration purposes. "Shortcut" method in practice: use the calculator Statistics function or software to obtain the standard deviation.

A special type of standard deviation
Suppose we draw many samples from the population we want to research. The sample standard deviation (s) can be calculated for each sample, and so can the mean.
The spread between the means of the samples is calculated as $\frac{s}{\sqrt{n}}$ or $\frac{\sigma}{\sqrt{n}}$, where n is the sample size.
This spread is known as the standard error (more in Topic 3).

"When communicating results to non-technical types there is nothing better than a clear visualization to make your point."
John Tukey, statistician (1915-2000)

Learning Outcome 4: Construct graphs from datasets
Learning Outcome 5: Explain graphical representations of data

Why do we need graphs?
They help when working with large sample sizes.
They are a visual means to describe and summarize the data.
They are a quick way to examine the data distribution: its central tendency and its spread or dispersion.
Though graphs can be created manually, software produces them more accurately and easily.

Common types of visualizations
Dotplots
Pie charts
Time series graphs
Scatterplots**
Bar graphs
Histograms*
Box-and-whisker plots*
* Focus of current topic
** Focus of Topic 6 (Regression and Correlation)

What visualizations should I choose?
The appropriate type of visualization depends on the variables studied (categorical or quantitative) and the questions you want to ask about the data. Typically, we are interested in counts or frequencies of data values (how many or how much), or in seeing the dispersion or spread.

Dotplots
Using Example 9, the salaries of workers from 2 cities, in '000 USD.
[Figure: dotplots for City A and City B.]
The horizontal axis shows the data values. Each "dot" is one observation or worker. Workers with identical salaries are "stacked" on top of each other.
The horizontal axis is a quantitative variable.
A dotplot shows the shape of the distribution of the data (notice that the City B dotplot shows extreme high and low salaries).
It is easy to create from the original data values.

Pie charts
Only for categorical variables.
The size of each slice shows the proportion of data values belonging to that category.
Example: pie chart of causes of air disasters (Pearson).

Time series graphs
Quantitative data collected over time (e.g. yearly, monthly or daily).
Shows information or trends over time.
Example: Singapore's GDP from 1988 to 2022, in billions USD (https://www.statista.com/statistics/378648/gross-domestic-product-gdp-in-singapore/).

Scatterplots
A plot of paired (x, y) quantitative data from each observation.
It has a horizontal x-axis and a vertical y-axis.
- The horizontal axis is used for the first variable (x)
- The vertical axis is used for the second variable (y)
- Shows the relationship between the paired data
- More on scatterplots in Topic 6 (Regression and Correlation)

Example of a scatterplot
7 adults in a medical examination each had their height AND weight recorded.
[Figure: Scatter Plot of Height/cm against Weight/kg; axes labelled Height (cm) and Weight]
In general, as height increases, weight also increases.

Bar graphs
- Bars of equal width show the frequencies of the categories of categorical (or qualitative) data.
- Bars may or may not be separated by gaps
- The heights of the bars show the distribution of data across the different categories

Bar graphs
What is the modal grade category?
Histogram
- Looks like a bar graph, but has no gaps between the bars along the horizontal axis
- Used for quantitative, continuous data
- The horizontal axis represents classes (categories) of numerical values
- The vertical axis corresponds to frequency counts

Histogram
Very useful for descriptive statistics purposes:
- Mean
- Median
- Mode
- Outliers
- Skewness (i.e. uneven or asymmetrical distribution)

Histogram
Example 10: Pulse rate (beats per min) of 50 students

89 68 92 74 76 65 77 83 75 87
85 64 79 77 96 80 70 85 80 80
82 81 86 71 90 87 71 72 62 78
77 90 83 81 73 80 78 81 81 75
82 88 79 79 94 82 66 78 74 72

Some issues to bear in mind:
- It is hard to describe the data just by looking at the table
- Pulse rate is a quantitative variable, so other chart types are not very useful
- Let’s group the pulse rates into classes

Histogram – creating data classes
- “Park” or “bin” each raw value into a category / class
- There is no specific rule on the number of classes (usually between 5 and 20)
- The width of each class can be approximately calculated by: class width ≈ (maximum – minimum) / number of classes

Using the 50 pulse rates of Example 10:
- Maximum = 96, minimum = 62
- Let’s create 8 categories (it is up to you to decide; the more categories, the narrower the bin width)
- Width of each class = (96 – 62) / 8 = 4.25, round up to 5
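As a rough illustration, the class-width arithmetic above and the counting step that follows from it can be sketched in Python. This is only a sketch under stated assumptions: the function names `class_width` and `bin_counts` are made up for this example, rounding the width up to the next multiple of 5 is one reading of "round up to 5", and the first class is assumed to start at 60.

```python
import math
from collections import Counter

# Pulse rates (beats/min) of the 50 students in Example 10.
pulse_rates = [
    89, 68, 92, 74, 76, 65, 77, 83, 75, 87,
    85, 64, 79, 77, 96, 80, 70, 85, 80, 80,
    82, 81, 86, 71, 90, 87, 71, 72, 62, 78,
    77, 90, 83, 81, 73, 80, 78, 81, 81, 75,
    82, 88, 79, 79, 94, 82, 66, 78, 74, 72,
]


def class_width(values, n_classes, round_to=5):
    """Return (raw width, width rounded up to a multiple of round_to).

    Rounding up to a multiple of 5 is an assumption made for this sketch.
    """
    raw = (max(values) - min(values)) / n_classes
    return raw, math.ceil(raw / round_to) * round_to


def bin_counts(values, start, width):
    """Count the values falling in each class [lower, lower + width - 1]."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width - 1): counts[k]
        for k in range(max(counts) + 1)
    }


raw_width, width = class_width(pulse_rates, n_classes=8)
table = bin_counts(pulse_rates, start=60, width=width)  # start=60 is assumed

print(f"raw width = {raw_width}, class width used = {width}")
for (low, high), count in table.items():
    print(f"{low} - {high}: {count}")
```

Running the sketch gives a raw width of 4.25, a class width of 5, and counts of 2, 3, 8, 12, 13, 7, 4 and 1 for the classes 60 - 64 up to 95 - 99, which add up to the 50 students.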
Choose a convenient round number at or just below the minimum value to start the first bin (in this example, start from 60).

The “binned” data on pulse rate

Pulse rate (beats/min)    Counts/frequency
60 – 64                   2
65 – 69                   3
70 – 74                   8
75 – 79                   12
80 – 84                   13
85 – 89                   7
90 – 94                   4
95 – 99                   1

Note that the bin range of 60 to 64 includes the values 60 and 64, so the width is 5 units (60, 61, 62, 63, 64). A student with a pulse rate of 60/min or 64/min goes into this bin.

Excel can plot this for you
[Figure: Histogram of pulse rates. Horizontal axis: Pulse rate (beats/min), in the classes 60 – 64 to 95 – 99. Vertical axis: Frequency counts. Bar heights: 2, 3, 8, 12, 13, 7, 4, 1.]
The red curve connects the mid-points of the upper edges of the bars.

Conclusions from the pulse-rate histogram
- 80 – 84 beats/min is the modal class (the class containing the largest number of students)
- The median (middle value) most likely lies between 75 and 84 beats/min
- The mean pulse rate is also likely to be between 75 and 84 beats/min

Conclusions from the pulse-rate histogram
- The distribution looks “normal”
- Imagine a vertical line that divides the red curve into 2 equal halves at the middle of the graph
- This tells you that the data are evenly distributed to the left and right of the modal class
- Not many extreme values (or outliers)

Some variables can be “unevenly” distributed, or skewed. The diagram here shows a positive (or right) skew.
Values have a trailing tail at the positive (+ve) end of the curve.
[Figure: positive skew]

This distribution of a variable is skewed to the left (negatively skewed) as it has a trailing left tail. Values have a trailing tail at the negative (-ve) end of the curve.
[Figure: negative skew]

Box-and-whisker plot (or boxplot)
A box-and-whisker plot (or just boxplot) is a graph that shows the location of the quartiles, the maximum and minimum values, and any outliers.
The quartiles are:
- 25th percentile, or Q1
- 50th percentile, Q2 or median
- 75th percentile, or Q3

Box-and-whisker plot (or boxplot)
- The centre, spread and overall range of the distribution are immediately visualized
- Usually used to describe a skewed distribution, or a distribution with outliers, when the mean and standard deviation may not give an accurate description of the data

A boxplot can be plotted horizontally or vertically.
1) Determine the Q1, Q2, Q3, minimum and maximum values of the data points. This requires arranging the raw data values in ascending or descending order.
2) Scales are marked on the left-hand side if the boxplot is plotted vertically.
3) Draw a box: the lower edge is Q1 and the upper edge is Q3 (if you plot a vertical boxplot).
A line inside the box marks the median.

To plot a boxplot
4) Determine if there are outliers using the “1.5 × IQR” criterion, where IQR = Q3 – Q1: an observation is a possible outlier if it lies below Q1 - (1.5 × IQR) or above Q3 + (1.5 × IQR).

To plot a boxplot
5) Observations more than 1.5 × IQR outside the central box are plotted individually as possible outliers.
6) Lines extend from the lower and upper edges of the box to the smallest and largest observations that are not suspected outliers.

Schematic of a vertical boxplot (without outliers)
[Figure: data-value axis with the labels Max value, Q3 + (1.5 × IQR), Q3, Q2 or median, IQR, Q1, Q1 - (1.5 × IQR), Min value]
If the maximum and minimum values are within the 1.5 × IQR criterion, there are no outliers.

Schematic of a boxplot (with outliers)
[Figure: data-value axis with the labels Outlier, 2nd max value, Q3 + (1.5 × IQR), Q3, Q2 or median, IQR, Q1, Q1 - (1.5 × IQR), 2nd min value, Outlier]
If the maximum or minimum values exceed the 1.5 × IQR criterion, they are plotted individually. The 2nd largest and 2nd smallest values, which are not outliers, are shown as lines extending from the box.

Schematic of a boxplot (with outlier)
[Figure: data-value axis with the labels Outlier, 2nd max value, Q3 + (1.5 × IQR), Q3, Q2 or median, IQR, Q1, Q1 - (1.5 × IQR), min value]
Only the maximum data value exceeds the Q3 + 1.5 × IQR criterion, so it is shown individually on the boxplot. The second largest value is shown as a line extending from the top edge of the box.
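The outlier check and the whisker rule in steps 4 to 6 can be sketched in Python as follows. This is a minimal sketch, not a prescribed method: the function names `outlier_limits` and `boxplot_parts` and the small data set are made up for illustration, and Q1 and Q3 are passed in as given values because different calculators and software use slightly different quartile conventions.

```python
def outlier_limits(q1, q3):
    """Return the lower and upper limits of the 1.5 x IQR criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def boxplot_parts(values, q1, q3):
    """Split the data into possible outliers and the whisker end points."""
    lower, upper = outlier_limits(q1, q3)
    # Step 5: observations beyond the limits are possible outliers.
    outliers = [v for v in values if v < lower or v > upper]
    # Step 6: whiskers reach the smallest and largest non-outlier values.
    inside = [v for v in values if lower <= v <= upper]
    return {
        "limits": (lower, upper),
        "outliers": outliers,
        "whiskers": (min(inside), max(inside)),
    }


# Hypothetical data and quartiles, chosen only for illustration.
data = [12, 18, 20, 25, 30, 35, 40, 44, 75]
parts = boxplot_parts(data, q1=20.0, q3=40.0)
print(parts)
```

With these made-up numbers the IQR is 20, so the limits are -10 and 70; 75 is flagged as a possible outlier and the whiskers end at 12 and 44.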
Example 11: Plotting the boxplot
50 students were tested for the number of push-ups they can do without rest. Here are the results, arranged in ascending order. The Q1 and Q3 for the data below are 16.75 and 47 respectively.

Solution:
The “5-number” summary

Min    Q1       Median    Q3    Max
5      16.75    28.5      47    95

IQR = Q3 – Q1 = 47 – 16.75 = 30.25
1.5 × IQR = 1.5 × 30.25 = 45.375
Check for possible outliers:
Q1 – 1.5 × IQR = 16.75 – 45.375 = -28.625
Q3 + 1.5 × IQR = 47 + 45.375 = 92.375
The minimum, 5, is above -28.625, so there are no outliers at the low end. Since 95 > Q3 + 1.5 × IQR = 92.375, 95 is the only outlier.

Some notes:
- 95 is drawn as a circle to indicate that it is an outlier
- 81 is the next largest value; it is not an outlier because it lies within 1.5 × IQR of Q3
- 5 is the smallest value that is not an outlier

Calculator/software skills
Additional points about the boxplot:
- The median can be obtained by hand computation (see Learning Outcome 3)
- When drawing by hand, the data value axis should reflect approximately the scale and location of the 5 numbers
- For hand-plotting, values for Q1 and Q3 would be provided
- Excel can compute Q1 and Q3 by syntax (i.e. formulae), and can also plot the boxplot

The end: Recap of learning outcomes
1. Differentiate between population and sample statistics
2. Differentiate between discrete and continuous data
3. Calculate the sum, mean, mode, median, quartiles, range, variance, standard deviation, and standard error for a given set of data
4. Construct graphs from datasets
5. Explain graphical representations of data