Mathematics of Data Management Notes (Lower Canada College)
Summary
These course notes, from Lower Canada College, cover the mathematics of data management focusing on descriptive statistics and probability. The document explains concepts like data types, sampling techniques, and graphing data. It also includes calculation methods for mean, median, mode, and more. It features examples and exercises to illustrate the different types of graphs such as bar graphs, Pareto charts, etc.
Full Transcript
Mathematics of Data Management
Part 1 – Descriptive Statistics
Course Notes
Lower Canada College
Name: ______________________________________________

Unit 1 – Introduction to statistics

A. Basics

What is a survey?
1. Data:
2. Population:
3. Sample:

Example: Identify the data, population, and sample in the following statement.
To determine the means of transportation used by workers, we interviewed the first 100 workers to arrive in the morning.

How do participants respond to a survey (response selection)?
A voluntary response sample:
Example:
A simple random sample:
Example:

Generating A Simple Random Sample
I want to give a quiz to a few students in my class to see if they read the chapter assigned the night before. I have 25 students total but would like a sample of 5.

Statistical bias is any factor that favours certain outcomes or responses and consequently skews the survey results.

Sources of Bias
Sampling Bias
Non-Response Bias
Measurement Bias
Response Bias

B. Types of Data

Where does the data come from?
Data that describes some characteristic of a:
a)
b)

Example: Identify the parameter(s) and statistic(s) in the following statement.
In a Harris Poll, 2320 adults in the US were surveyed about body piercings, and 5% of the respondents said that they had a body piercing, but not on the face. Based on the latest data available at the time of this writing, there are 241,472,385 adults in the US.

What does the data represent?
Data that describes:
a)
b)

Example: Identify whether the following data are quantitative or qualitative.
a) The ages (in years) of survey respondents.
b) The political party affiliations of survey respondents.
c) The numbers 12, 74, 77, 76, 73, 78, 19, 9, 23 and 25 sewn onto jerseys for a baseball team.

How is the data counted?
Quantitative data values that:
a)
b)

Example: Identify the following data as discrete or continuous.
a) The number of eggs that hens lay in one week.
b) The number of rolls of a die required to get an outcome of 2.
c) During a year, a cow might yield an amount of milk that can be any value between 0 and 7000 liters.

Levels of measurement tell you how precisely variables are recorded.

Type | Description | Example of question | Example of response options
Qualitative Data
Quantitative Data

Example: Identify the level of measurement of each of the following situations.
a) The times of 50 minutes and 100 minutes for a statistics class.
b) Survey responses of yes, no, and undecided.
c) A college professor assigns grades of A, B, C, D or F.
d) The years 1492 and 1776.

C. Collecting Data

Sampling Technique | Visual

Systematic Sampling
Pro:
Con:

Stratified Sampling
Pro:
Con:

Cluster Sampling
Pro:
Con:

Convenience Sampling
Pro:
Con:

Example: Identify the type of sampling technique in each of the following situations.
1. A professor asks the first 5 students who arrive to class to participate in a research study about young adult sleep patterns.
2. A large bakery mass produces cakes on an assembly line. Each shift, a quality control expert randomly selects one of the first ten finished cakes, and every tenth cake thereafter. Employees weigh those cakes and give the cakes a detailed visual check.
3. A student council surveys 100 students by taking random samples of 25 freshmen, 25 sophomores, 25 juniors, and 25 seniors.
4. A TV show host asks his viewers to visit his website and respond to an online poll.
5. One day, an airline company wants to survey its customers, so they randomly select 5 flights that day and survey every passenger on those flights.
6. A manager associated each employee's name with a number on one ball in a container, then drew balls without looking to select a sample of 5 employees.

Unit 2 – Graphing and summarizing data

A. Graphing data

Why do we need to display data?

Survey #1: How do you take your coffee/tea?
Milk | Non-Dairy Milk | Black

Survey #2: What is your sitting and standing heart rate?
Sitting Heart Rate | Standing Heart Rate

1. Bar Graph
2. Pareto Chart
Inputting & ordering data
3. Frequency Distribution Table
4. Grouped Frequency Distribution Table
5. Histogram
6. Frequency Polygon
7. Ogive (Cumulative Frequency Polygon)
8. Stem-and-leaf Plot

B. Summarizing data

Measures of Central Tendency
Mean
Median
Mode

Example: Below is a table listing the number of chocolate chips per cookie in five different packages of chocolate chip cookies. Calculate the mean, median and mode for each brand.

Brand | Mean | Median | Mode
Chips Ahoy (regular)
Chips Ahoy (chewy)
Chips Ahoy (reduced fat)
Keebler
Hannaford

Calculating mean and median (raw data/list of values)

Example: Below is a table listing the number of chocolate chips per cookie in five different packages of chocolate chip cookies. Calculate the mean, medial class, and modal class for each brand.

Chips Ahoy (regular)
Class | Frequency | Mid Value
19-21 | 6 |
22-24 | 16 |
25-27 | 15 |
28-30 | 3 |
Mean:
Medial Class:
Modal Class:

Hannaford
Class | Frequency | Mid Value
11-13 | 8 |
14-16 | 14 |
17-19 | 1 |
20-22 | 1 |
Mean:
Medial Class:
Modal Class:

Calculating mean (grouped data)

Measures of Dispersion
Range
Standard Deviation
Variance

Example: You and your friends have just measured the heights of your dogs (in mm). The heights (at the shoulders) are: 600 mm, 470 mm, 170 mm, 430 mm and 300 mm.
Step #1: Calculate the mean
Step #2: Subtract the mean from each value and square the difference
Step #3: Average the squared differences

Example: Consider the overall results of two math classes. One class has an average of 80% and a standard deviation of 5%, and the other class also has an average of 80% but with a standard deviation of 10%. In which class is an individual result of 90% more impressive?

Example: Below is a table listing the number of chocolate chips per cookie in five different packages of chocolate chip cookies. Calculate the range, standard deviation, and variance.
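The range, variance, and standard deviation asked for here follow the same three steps as the dog-height example above. The sketch below is one possible way to check the arithmetic in Python. It uses the five dog heights, since the cookie counts are not reproduced in these notes, and it assumes the population form of the variance (dividing by n), which matches the wording of Step #3. The variable names are illustrative only.

```python
import math

# Dog shoulder heights in mm, taken from the example above.
heights = [600, 470, 170, 430, 300]

# Step 1: calculate the mean.
mean = sum(heights) / len(heights)

# Step 2: subtract the mean from each value and square the difference.
squared_diffs = [(h - mean) ** 2 for h in heights]

# Step 3: average the squared differences.
# Assumption: population variance (divide by n), as Step #3 is worded.
variance = sum(squared_diffs) / len(heights)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

# The range is the largest value minus the smallest value.
data_range = max(heights) - min(heights)

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
print("range:", data_range)
```

If the data were treated as a sample rather than a whole population, the sum of squared differences would be divided by n - 1 instead of n, which gives a larger variance.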
Brand | Range | Standard Deviation | Variance
Chips Ahoy (regular)
Chips Ahoy (chewy)
Chips Ahoy (reduced fat)
Keebler
Hannaford

Calculating standard deviation

Measures of Position

Percentiles
Finding the percentile of a data value (x)
Finding the data value that is at a percentile

Example: Find the percentile of a cookie with 29 chocolate chips.
Example: How many chocolate chips would a cookie have if it is in the 45th percentile?

Quartiles
Example: Find the quartiles of the Keebler data.
Calculating quartiles

Boxplots (visual representation of quartiles)
Example: Draw a box plot for the Keebler data.
Creating boxplots
Example: Draw a parallel boxplot for the Keebler and Hannaford data.

Modified Box Plot
Example: Draw a modified box plot for the Hannaford data.
Creating modified boxplots

Unit 3 – Probability

A. Basics of Probability

In considering probability, we deal with procedures that produce outcomes.
Event:
Simple Event:
Sample Space:

Example: In the following display, we use "b" to denote a baby boy and "g" to denote a baby girl.

Procedure | Example of an event | Sample space
Single birth | 1 girl |
3 births | 2 boys and 1 girl |

Notation
P:
A, B and C:
P(A):

Notes:
0 ≤ P(A) ≤ 1
The probability of an impossible event is 0.
The probability of an event that is certain to occur is 1.

Calculating probability
Conduct (or observe) a procedure and count the number of times that event A occurs.
P(A) =

Example: A recent survey of 1010 adults showed that 202 of them smoke. Find the probability that a randomly selected adult is a smoker.

Example: If giving birth to a boy or a girl is equally likely, find the probability of getting three children of the same gender when three children are born.

Example: In a study of U.S. paper currency, bills from 17 large cities were analyzed for the presence of cocaine. Here are the results: 23 bills were not tainted by cocaine and 211 were tainted by cocaine. If a bill is randomly selected, find the probability that it is tainted by cocaine.

Types of Events

a) Complementary events
The complement of event A
Rule of complementary events

Example: Based on data from a poll, the probability of randomly selecting someone who holds religious beliefs is 0.60. If a person is randomly selected, find the probability of getting someone who does not hold religious beliefs.

b) Compound events
P(A or B) =

Example: If someone is randomly selected from the 1000 subjects given a drug test, find the probability of selecting a subject who had a positive test result (A) or uses drugs (B).

 | Positive test result (Drug use is indicated) | Negative test result (Drug use is not indicated)
Subject states they use drugs | 44 | 6
Subject states they do not use drugs | 90 | 860

P(A or B) =

c) Independent and dependent events
Independent events (with replacement):
Dependent events (without replacement):
P(A and B) =

Example: If you have ten pens in your pencil case: 2 red, 3 green, 1 purple, and 4 blue, calculate the probability that you choose:
a) A red then a green pen (with replacement).
b) A red then another red pen (with replacement).
c) A green then a purple pen (without replacement).

d) Conditional Probability
P(B|A) =

Example:

 | Positive test result (Drug use is indicated) | Negative test result (Drug use is not indicated)
Subject states they use drugs | 44 | 6
Subject states they do not use drugs | 90 | 860

a) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject has a positive test result (A), given that the subject uses drugs (B).
b) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject uses drugs (B), given that he or she had a positive test result (A).

B. Counting

Permutations:
Combinations:

Counting Rules
1. Fundamental Counting Rule:
2. Factorial Rule:
Factorial
3. Permutations Rule (when all the items are different):
Permutations
4. Combinations Rule:
5. Permutations Rule (when some items are identical to others):
Combinations

Example: Computers are typically designed so that the most basic unit of information is a bit, which represents either a 0 or a 1. Letters, digits, and punctuation symbols are represented as a byte, which is a sequence of eight bits in a particular order. For example, the ASCII coding system represents the number 7 as 00110111. How many different characters are possible if they are all to be represented as bytes?

Example: A history pop quiz has one question in which students are asked to arrange the following prime ministers in chronological order: Martin, Harper, Trudeau, MacDonald, Chretien, Mulroney. If an unprepared student makes random guesses, what is the probability of selecting the correct chronological order?

Example: In horse racing, a bet on an exacta in a race is won by correctly selecting the horses that finish first and second, and you must select those two horses in the correct order. The 136th running of the Kentucky Derby had a field of 20 horses. If a bettor randomly selects two of those horses for an exacta bet, what is the probability of winning by selecting Super Saver to win and Ice Box to finish second (as they did)?

Example: When designing a survey, to see if subjects thoughtlessly spew answers just to finish the survey, pollsters repeat a question with some rewording and check to see if the answers are consistent. For one survey with 10 questions, 2 of the questions are the same, and 3 other questions are also identical. For this survey, how many different arrangements are possible?

Example: In the Pennsylvania Match 6 Lotto, winning the jackpot requires that you select six different numbers from 1 to 49, and the same six numbers must be drawn in the lottery. The winning numbers can be drawn in any order, so order does not make a difference. Find the probability of winning the jackpot when one ticket is purchased.

C. Discrete Probability Distributions

A random variable:
A discrete random variable:
A continuous random variable:
A probability distribution
Three requirements:
1.
2.
3.

Example: Consider two births with the following random variable, x = number of girls in two births.

Number of girls x | P(x)
0 | 0.25
1 | 0.50
2 | 0.25

a) Does the table describe a probability distribution?
b) Graph the probability distribution using a probability histogram.

Parameters of a Probability Distribution
Mean
Variance
Standard Deviation

Example: Calculate the mean, variance, and standard deviation.

Number of girls x | P(x)
0 | 0.25
1 | 0.50
2 | 0.25

Calculating parameters of a probability distribution

Range Rule of Thumb
Example: If a couple has two children, is two girls an unusually high number of girls?

Binomial Probability Distribution
Four requirements:
1.
2.
3.
4.

Binomial Distribution Formula
n:
p:
q:
x:

Example: Given that there is a 0.85 probability that a randomly selected adult knows what Twitter is, use the binomial probability formula to find the probability of getting exactly three adults who know what Twitter is when five adults are randomly selected.

Calculating probability of binomial distribution

Example: Based on a poll, 60% of adults hold religious beliefs. If we randomly select six adults, calculate:
a) The probability that exactly two of the six adults hold religious beliefs.
b) The probability that the number of adults who hold religious beliefs is at least two.

Calculating probability of binomial distribution (at least)

Parameters of a Binomial Distribution
Mean
Variance
Standard Deviation

Example: The McDonald's brand name has a 95% recognition rate. A special focus group consists of 12 randomly selected adults to be used for extensive market testing. For such random groups of 12 people, find the mean and standard deviation for the number of people who recognize the brand name of McDonald's.

Unit 4 – Normal Distribution

A. Normal Distribution and Standard Deviations

Example: Assume that the heights of students at LCC are normally distributed with μ = 1.4 m and σ = 0.15 m. Indicate the critical values on the bell curve below and then calculate how tall the middle
a) 68% of the kids are.
b) 95% of the kids are.
c) 99.7% of the kids are.

The Central Limit Theorem:

Example: Giselle is 168 cm tall. In her high school, boys' heights are normally distributed with a mean of 174 cm and a standard deviation of 6 cm.
a) What is the probability that the first boy Giselle meets at school tomorrow will be taller than she is?
b) What is the probability that the first boy Giselle meets at school tomorrow will be shorter than she is?
c) What percentage of the boys are between 168 cm and 186 cm?
d) What percentage of the boys are less than 186 cm?
e) What percentage of the boys are between 156 cm and 162 cm?

B. Skewness

Left (Negatively) Skewed
Right (Positively) Skewed
Pearson's Index/Coefficient of Skewness

Example: Is this data skewed?

Heights (inches) | Number of Persons
58 | 10
59 | 18
60 | 30
61 | 42
62 | 35
63 | 28
64 | 16
65 | 8

C. Standard Normal Distribution and z-scores

Example: Who did better?
o In Tommy's data management class, he got 88% on his test. The class average is 75% and the standard deviation is 6.9%.
o In Clive's data management class, he got 88% on his test and the class average was also 75%. However, the standard deviation is 11.52%.

Example: Which of the following two data values is more extreme?
o A chocolate chip cookie with 30 chocolate chips in a bag with a mean of 24 chocolate chips and a standard deviation of 2.6 chocolate chips.
o A can of soda weighing 0.8295 lbs, where the mean weight is 0.81682 lbs and the standard deviation is 0.00751 lbs.

Z-scores to identify "unusual values"

D. Percents and z-scores (Standard Normal Distribution)

Standard Normal Distribution Table

Example:
a) Find the percent of values below -2.83.
b) Find the percent of values below 1.11.
c) Find the percent of values below -1.11.
d) Find the percent of values above -0.66.
e) Find the percent of values above 0.56.
f) Find the percent of values between 2.10 and 3.25.
g) Find the percent of values between -2.99 and -1.62.
h) Find the z-score for a percent of 13.6%.
i) Find the z-score for a percent of 92.65%.

Example: A bone mineral density test can be helpful in identifying the presence or likelihood of osteoporosis. The result of a bone density test is commonly measured as a z score. The population of z scores is normally distributed with a mean of 0 and standard deviation of 1. If a randomly selected adult undergoes a bone density test:
Question 1: Find the probability that the result is a reading less than 1.27.
Question 2: Find the probability that the result is a reading above -1.00.
Question 3: Find the probability that the result is a reading between -2.50 and -1.00.
Question 4: Find the bone density score of the lower 93.7% of the data.
Question 5: Find the bone density score that defines the upper 40%.
Question 6: Between which two values do the middle 40% lie?

E. Percents and values (Normal Distribution)

Example: The social organization Tall Clubs International has a requirement that women must be at least 70 in. tall. Women's heights are normally distributed with a mean of 63.8 in. and a standard deviation of 2.6 in.

Identifying percent given value
a) Find the probability that a randomly chosen woman would be allowed into Tall Clubs International.
b) Find the probability that a randomly chosen woman is shorter than 66 in.
c) Find the probability that a randomly chosen woman is between 55 and 62 in. tall.

Identifying value given percent
d) Using the same Tall Clubs International example, find the height of a woman at the lower 93.7% of the data.
e) Determine the value at the upper 70%.
f) Between which two values do the middle 50% lie?

F. Proving Normalcy

Example: A toy tricycle comes with the label: "Easy-To-Assemble. An adult can complete this assembly in 20 mins or less." Thirty-six adults were asked to complete the assembly of the tricycle and record their times. Here are the results:

16 10 20 22 19 14 30 22 12 13 18 19
17 21 29 22 16 28 24 20 8 17 21 32
18 25 22 28 15 11 26 17 23 24 21 20

Are the assembly times normally distributed?
1. Visual
Creating a histogram/boxplot
2. Mean vs. median
3. Pearson's Index of Skewness
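
Checks 2 and 3 for the tricycle data can be sketched in Python as below. The notes leave the formula for Pearson's index blank, so this sketch assumes the commonly taught form, 3 × (mean − median) ÷ standard deviation, using the sample standard deviation. The rule of thumb that an index at or beyond ±1 indicates significant skew is also an assumption, not something stated in these notes. The function name pearson_index is hypothetical.

```python
import statistics

# Assembly times (minutes) for the 36 adults, copied from the example above.
times = [
    16, 10, 20, 22, 19, 14, 30, 22, 12, 13, 18, 19,
    17, 21, 29, 22, 16, 28, 24, 20, 8, 17, 21, 32,
    18, 25, 22, 28, 15, 11, 26, 17, 23, 24, 21, 20,
]


def pearson_index(data):
    """Assumed form: 3 * (mean - median) / sample standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    return 3 * (mean - median) / s


# Check 2: compare the mean and the median.
mean_time = statistics.mean(times)
median_time = statistics.median(times)

# Check 3: Pearson's index of skewness (assumed formula, see above).
index = pearson_index(times)

print("mean:", mean_time)
print("median:", median_time)
print("Pearson index:", index)
# Assumed rule of thumb: |index| >= 1 suggests significant skew.
print("significantly skewed:", abs(index) >= 1)
```

For these 36 times the mean and the median both come out to 20 minutes, so the index is 0 under this formula. The visual check (histogram or boxplot) still has to be done separately.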