STAT401 Lecture 02 PDF
Rutgers University
Summary
This document is lecture notes on basic statistics, covering various topics including descriptive and inferential statistics, organizing data, and different types of distributions. It also describes measures of center and variation of datasets, alongside the five-number summary and boxplots.
Full Transcript
Lecture 2: Basics of Statistics II (Chapters 1 – 3)

Review of Lecture 1
  Basic notions of Statistics
    Descriptive and Inferential Statistics
    Population, Sample, Variable, Data
    Statistical model, parameter, statistic
  Organizing Data (table, chart, diagram)
    Frequency table, Pie chart, Bar chart, histogram, dotplot, stem-and-leaf diagram
    Shapes of distributions: modality, symmetry, skewness, (approximate) normal distributions
  Descriptive measures of a dataset
    Measures of center: mean, median, mode

Recall: Population; Sample; Variable: height $x$
  Population: Students of our University
  Sample: Students of our class, e.g. {178} {167} {190} {180} {176} {156}
  Sample mean: $\bar{x} = \frac{1}{100}\sum_{i=1}^{100} x_i = 172.2$
  Sample variance: $s^2 = \cdots = 12.4$
  Statistical model: $x \sim \mathcal{N}(\mu, \sigma^2)$
  Parameters: $\mu, \sigma^2$
  Statistics: $\bar{x}, s^2$
  Confidence Interval: We are 98% confident that $167.5 < \mu < 190.4$
  Hypothesis testing: At significance level 5%, reject that $\sigma^2 = 1.01$

Recall: Qualitative (categorical) and quantitative variable / data
  Qualitative (categorical) variable: takes non-numerical names or labels
    Gender, Birth month, Favorite movie, Major, Campus…
  Discrete variable: takes finitely or countably many different numbers
    Age, Household size, Number of siblings, Shoe size…
  Continuous variable: takes a range / interval of numbers
    Height, Weight, Distance, Temperature, Foot length…

Recall: Use table, chart, diagram to organize data
  Categorical data: non-numerical names or labels
    (Relative) frequency table, Pie chart, Bar chart
  Discrete data: finitely or countably many different numbers
    Dotplot, stem-and-leaf diagram (stemplot)
    Dotplots and stemplots recover the data
    DATA: “PULSE”, a dataset with integer values
  Continuous data: a range / interval of numbers
    (Relative) frequency table, Histogram: group the observations into classes (categories or bins) and treat the classes as the distinct values of qualitative data.
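The grouping step for continuous data can be sketched in a few lines of Python. This is only an illustration: the six values are the heights displayed on the recall slide, while the class start (150) and class width (10) are assumptions chosen for the example, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Six heights displayed on the recall slide (not the full sample of 100).
heights = [178, 167, 190, 180, 176, 156]

def frequency_table(values, start, width):
    """Group values into classes [start + k*width, start + (k+1)*width)."""
    # Each value is mapped to the index k of the class that contains it.
    counts = Counter((v - start) // width for v in values)
    n = len(values)
    table = []
    for k in sorted(counts):
        lower = start + k * width
        # (class lower bound, class upper bound, frequency, relative frequency)
        table.append((lower, lower + width, counts[k], counts[k] / n))
    return table

# Class start and width are assumptions made for this illustration.
table = frequency_table(heights, start=150, width=10)
for lower, upper, freq, rel in table:
    print(f"[{lower}, {upper}): frequency={freq}, relative frequency={rel:.3f}")
```

Once the observations are grouped, each class plays the role of one distinct value, so the table can be drawn as a histogram in the same way a bar chart is drawn for categorical data.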
Recall: density curve
  Shape of a distribution: modality, symmetry and skewness
  Normal (or approximately normal) distribution: a distribution with an (approximately) bell-shaped density curve

Descriptive Measures
DESCRIBING, EXPLORING, AND COMPARING DATA

Descriptive measures
  Measures of center
  Measures of variation
    Three standard deviation rule
  The five-number summary
    Quartiles, IQR, Upper and lower limits
    Boxplot (box-and-whisker plot)
  Descriptive measures of populations
    Population mean
    Population standard deviation

Measures of Center (mean, median, mode)
  Measures of center measure the central location of a dataset, i.e., where most of the values of a dataset are located.
  Mean (Quantitative data)
  Median (Quantitative data)
  Mode (Quantitative and Qualitative data)
  Midrange
  Mean, median and mode of a data set are often different.
  Mean is sensitive to extreme observations; median is not.
  Median is a resistant measure of center, which is preferred for data sets with extreme observations.
  Example (figure): Elon Musk net worth (Google Search). Where is the mode?
  DATA: “PULSE”

Sample mean (a statistic)
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where each $x_i$ is an observation (value) of the variable in the sample.
  Check:
    $\sum (x_i - \bar{x})^2 = \sum (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)$
    $= \sum x_i^2 - \sum 2\bar{x} x_i + \sum \bar{x}^2$
    $= \sum x_i^2 - 2\bar{x} \sum x_i + n\bar{x}^2$
    $= \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2$
    $= \sum x_i^2 - \left(\sum x_i\right)^2 / n$

Measures of Variation (Range, standard deviation, variance)
  $s^2$ is the sample variance (compute the sample mean and variance).
  Since $\sum (x_i - \bar{x})^2 = \sum x_i^2 - \left(\sum x_i\right)^2 / n$, we have
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
  Also recall $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Example: Heights of basketball players
  Find the sample median, mode, mean, variance, midrange, range
  Table (rows: Team I, Team II; columns: Mean, Median, Mode, Midrange, Range, Variance)

  Data Set I: $\bar{x} = 50$, $s = 7.4$
  Data Set II: $\bar{x} = 50$, $s = 14.2$

Empirical Rule: for data sets with approximately bell-shaped distribution
  About 68% of all values fall within 1 standard deviation of the mean
  About 95% of all values fall within 2 standard deviations of the mean
  About 99.7% of all values fall within 3 standard deviations of the mean

The five-number summary
  Example: Q1 = 23, Q2 = 30.5, and Q3 = 36.5.
  Observations that lie below the lower limit or above the upper limit are potential outliers.
  Adjacent values are the most extreme observations that still lie within the lower and upper limits; they are the most extreme observations that are not potential outliers.
  If a data set has no potential outliers, the adjacent values are just the minimum and maximum observations.
  Example: Q1 = 23, Q2 = 30.5, Q3 = 36.5; Potential outlier: 66; Adjacent values: 5 and 43.

Boxplot (box-and-whisker plot)
  (Application: compare datasets)
  (Application: detect potential outliers)
  Outliers are observations that fall well outside the overall pattern of the data.
  An outlier may be the result of a measurement or recording error, or an observation from a different population.
  An extreme observation need not be an outlier; it may instead be an indication of skewness.
  Figure: Boxplots for right-skewed, symmetric, and left-skewed distributions

Descriptive measures of populations
  For a particular variable on a particular population:
    There is only one population mean $\mu$ and one population standard deviation $\sigma$
    There are many sample means and sample standard deviations
    Sample mean / variance are estimates of population mean / variance.
  The z-score of an observation tells us the number of standard deviations that the observation is away from the mean.

Sample z-Score
  When the population $\mu$ or $\sigma$ is not available,
    $z_i = \frac{x_i - \bar{x}}{s}$
  defines the sample z-score $z_i$ for an observation $x_i$.
  By the three-standard-deviation rule, the sample z-score is almost always in (-3, 3).
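The shortcut formula for $s^2$, the sample z-score, and the lower and upper limits can be checked numerically with a short Python sketch. Two assumptions are worth flagging: the six heights are only the values displayed on the recall slide (not the full sample of 100), and the limits are taken to follow the common Q1 - 1.5·IQR and Q3 + 1.5·IQR convention, which these notes do not spell out. The helper name `outlier_limits` is hypothetical.

```python
import math

# Six heights displayed on the recall slide (not the full sample of 100).
heights = [178, 167, 190, 180, 176, 156]
n = len(heights)
mean = sum(heights) / n

# Sum of squared deviations, by definition and by the shortcut formula.
ss_definition = sum((x - mean) ** 2 for x in heights)
ss_shortcut = sum(x * x for x in heights) - sum(heights) ** 2 / n

s2 = ss_shortcut / (n - 1)  # sample variance
s = math.sqrt(s2)           # sample standard deviation

# Sample z-scores: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in heights]

def outlier_limits(q1, q3, k=1.5):
    """Lower and upper limits, assuming the Q1 - k*IQR and Q3 + k*IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles quoted in the notes: Q1 = 23, Q3 = 36.5.
lower, upper = outlier_limits(23, 36.5)
is_potential_outlier = {x: not (lower <= x <= upper) for x in (5, 43, 66)}

print("sum of squares (definition, shortcut):", ss_definition, ss_shortcut)
print("sample variance:", s2)
print("z-scores:", [round(z, 2) for z in z_scores])
print("limits:", lower, upper)
print("potential outliers:", is_potential_outlier)
```

Under the 1.5·IQR assumption the limits come out as 2.75 and 56.75, so 66 is flagged as a potential outlier while 5 and 43 are not, which agrees with the adjacent values quoted in the notes.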
Preview: Probability theory for inferential statistics
  Sample (SRS): {178} {167} {190} {180} {176} {156}
  $\bar{x} = \frac{1}{100}\sum_{i=1}^{100} x_i = 172.2$, $s^2 = \cdots = 12.4$
  Statistical model: $x \sim \mathcal{N}(\mu, \sigma^2)$
  Why do we need to study Probability Theory for inferential Statistics?

  To understand the data generating models on populations, e.g.,
    $x \sim \mathcal{N}(\mu, \sigma^2)$
    $x \sim \mathrm{Bern}(p)$
    $x \sim \mathrm{Multi}(p_1, p_2, \ldots, p_k, k)$
    $y \mid x \sim \mathcal{N}(\beta_0 + \beta_1 x, \sigma^2)$

  To understand the sampling distributions of the statistics, e.g.,
    $\bar{x} \sim \mathcal{N}(\mu, \sigma^2 / n)$
    $\frac{\bar{x} - \mu}{s / \sqrt{n}} \sim t_{n-1}$
    $\frac{(n-1) s^2}{\sigma^2} \sim \chi^2_{n-1}$

  To formulate and understand the inferential statements
    Data Set 31: Commute Times
    Reported daily commute times (minutes) to work from 7494 workers (first five rows shown here) of age 16 and older in different cities. Data are from the U.S. Census Bureau’s 2017 American Community Survey.
    Based on the data, we are 98% confident that the average commute time in Boston is between 26 and 48 minutes.
    Based on the data, we are 95% confident that, on average, the commute time in Boston is shorter than in New York City.
    At 2% significance level, the data provide sufficient evidence that, on average, the commute time in Chicago is longer than 25 minutes.
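The sampling-distribution statement in the preview, $\bar{x} \sim \mathcal{N}(\mu, \sigma^2 / n)$, can be explored by simulation. The sketch below is not taken from the lecture: the population parameters, the sample size, and the number of repetitions are hypothetical values chosen only to illustrate the idea.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

mu, sigma = 172.0, 6.0  # hypothetical population mean and standard deviation
n = 25                  # hypothetical sample size
reps = 5000             # number of simulated samples

# Draw `reps` samples of size n from N(mu, sigma^2) and record each sample mean.
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)

print(f"mean of sample means:     {mean_of_means:.3f} (theory: {mu})")
print(f"variance of sample means: {var_of_means:.3f} (theory: {sigma ** 2 / n:.3f})")
```

The simulated sample means should cluster around $\mu$ with variance close to $\sigma^2 / n$, which is the kind of fact that confidence intervals and hypothesis tests like the commute-time statements rely on.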