Inferential Statistics - Statistics for Data Science PDF
Summary
This document provides an overview of inferential statistics for the Statistics for Data Science course taught at Apex Institute of Technology. It details course objectives and outcomes, covering topics like summary statistics, frequency distributions, and graphical representations. The document also lists suggested readings for the course.
Full Transcript
APEX INSTITUTE OF TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING Statistics for Data Science(23CSH-233) Faculty: Prof. (Dr.) Madan Lal Saini(E13485) Inferential Statistics DISCOVER. LEARN. EMPOWER 1 Statistics for Data Science : Course Objectives COURSE OBJECTIVES The Course aims to: 1. To equip students with the skills to summarize and interpret data using descriptive statistics and visualization techniques. 2. To develop a foundational understanding of probability and its applications in data science. 3. To enable students to perform hypothesis testing and construct confidence intervals for statistical inference. 4. To teach students how to build and assess linear and logistic regression models for predictive analysis. 5. To provide hands-on experience with statistical software for data manipulation, analysis, and visualization. 2 COURSE OUTCOMES On completion of this course, the students shall be able to:- Summarize and describe the main features of a dataset using measures such as mean, CO1 median, mode, variance, and standard deviation, as well as graphical representations like histograms, box plots, and scatter plots. Understand of probability theory, including concepts such as random variables, CO2 probability distributions, and the law of large numbers, enabling them to model and reason about uncertainty in data. Apply/perform statistical inference, including hypothesis testing, confidence interval CO3 estimation, and p-value computation, to draw valid conclusions from sample data about larger populations. Apply linear and logistic regression techniques to identify relationships between CO4 variables, make predictions, and evaluate model performance. Utilize statistical software tools to perform data analysis, including data cleaning, CO5 transformation, visualization, and implementing various statistical methods. 3 Unit-3 Syllabus Unit-3 Inferential Statistics Inferential Statistical Inference Terminology, Statistics & Hypothesis Testing, Hypothesis Parametric Tests, Testing Non-parametric Tests Industry Hypothesis Testing using Excel Application Industry Practices & Applications of Statistics 4 SUGGESTIVE READINGS TEXT BOOKS: T1. Hastie, Trevor, et al., The elements of statistical learning. Vol. 2. No. 1. New York: Publisher: Springer, Edition: Second Edition (2009), ISBN: 978-0387848570 T2. Montgomery, Douglas C., and George C. Runger. Applied statistics and probability for engineers. John Wiley & Sons, 2010. T3. Probability and Statistics The Science of Uncertainty Second Ed., Michael J. Evans and Jeffrey S. Rosenthal. REFERENCE BOOKS: R1. Practical Statistics for Data Scientists: 50 Essential Concepts, Authors: Peter Bruce, et al, Publisher: O'Reilly Media, Edition: Second Edition (2020), ISBN: 978-1492072942 R2. An Introduction to Statistical Learning: with Applications in R, Authors: Gareth James, et al, Publisher: Springer, Edition: Second Edition (2021), ISBN: 978-1071614174 R3. Think Stats: Exploratory Data Analysis in Python, Author: Allen B. Downey, Publisher: O'Reilly Media, Publication Year: 2014 (2nd Edition), ISBN: 978-1491907337 5 What is a Statistic???? Sample Sample Sample Population Sample Parameter: value that describes a population Statistic: a value that describes a sample PSYCH always using samples!!! 
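To make the parameter/statistic distinction concrete, here is a minimal Python sketch with a made-up population (all names and numbers are illustrative only):

```python
# Minimal sketch: a parameter describes the population, a statistic describes a sample.
import random
import statistics as st

random.seed(0)
population = [random.gauss(100, 15) for _ in range(50_000)]  # hypothetical IQ-like scores

parameter = st.mean(population)          # population mean (a parameter, usually unknown)
sample = random.sample(population, 50)   # in practice we only see a sample
statistic = st.mean(sample)              # sample mean (a statistic, our estimate)

print(f"population mean mu = {parameter:.2f}, sample mean x-bar = {statistic:.2f}")
```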
Descriptive & Inferential Statistics. Descriptive statistics organize, summarize, and simplify data: presentation of data, describing data. Inferential statistics generalize from samples to populations: hypothesis testing, relationships among variables, making predictions.

Descriptive Statistics – 3 types: 1. Frequency distributions – the number of subjects that fall in a particular category. 2. Graphical representations – graphs and tables. 3. Summary statistics – describe data in just one number.

1. Frequency Distributions: the number of subjects that fall in a particular category. Example: how many males and how many females are in our class? For each category record the frequency and the percentage (frequency/total × 100); the scale of measurement here is nominal. We can also categorize on the basis of more than one variable at the same time – a CROSS-TABULATION. Example: Democrats 24 and 1 (total 25), Republicans 19 and 6 (total 25), column totals 43 and 7 (grand total 50). Another example: how many brothers and sisters do you have? Tally the frequency for each count from 0 through 7.

2. Graphical Representations: graphs and tables. Bar graph – typically for categorical data; histogram – for quantitative data; polygon – a line graph. Example: plot the class brothers-and-sisters data as a histogram. A jagged histogram approaches a smooth curve as the sample grows (Altman, D. G., et al., BMJ 1995;310:298).

Central Limit Theorem: the larger the sample size, the closer a distribution will approximate the normal distribution; equivalently, a distribution of scores taken at random from any distribution will tend to form a normal curve. Normal Distribution: symmetrical, with half the scores above the mean and half below. About 68% of scores fall within one standard deviation of the mean and about 95% within two, leaving 2.5% in each tail (roughly 13.5% of scores lie between one and two standard deviations on each side). In a non-directional (two-tailed) test, the 5% region of rejection of the null hypothesis is split across the two tails. Examples of roughly normal variables: IQ, body temperature, shoe sizes, diameters of trees, weight, height.

Summary Statistics: describe data in just 2 numbers – a measure of central tendency (the typical or average score) and a measure of variability (the typical or average variation). Measures of Central Tendency. Quantitative data: Mode – the most frequently occurring observation; Median – the middle value in the data (the 50/50 point); Mean – the arithmetic average. Qualitative data: the mode is always appropriate; the mean is never appropriate.

Mean Notation: the mean is the most common and most useful average. Mean = sum of all observations / number of all observations. Sample mean = X̄; population mean = μ. Summation sign = Σ; sample size = n; population size = N. Observations can be added in any order. Special Property of the Mean – Balance Point: the sum of all observations expressed as positive and negative deviations from the mean always equals zero, so the mean is the single point of equilibrium (balance) in a data set. The mean is affected by all values in the data set: if you change a single value, the mean changes. See for yourself – let's do the math.

Measures of variability (typical average variation): 1. Range: distance from the lowest to the highest value (uses 2 data points). 2. Variance (uses all data points). 3. Standard deviation. 4. Standard error of the mean.
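Most of the measures just listed (and defined in the next part) are one-liners in Python; a minimal sketch on a made-up "brothers and sisters" sample:

```python
# Minimal sketch: summary statistics for a small, hypothetical "brothers & sisters" sample.
import statistics as st
import math

x = [0, 1, 1, 2, 2, 2, 3, 5]            # hypothetical class data

mean = st.mean(x)
print("mode:", st.mode(x))               # most frequent value
print("median:", st.median(x))           # middle value (50/50 point)
print("mean:", mean)

# Balance-point property: deviations from the mean sum to (essentially) zero.
print("sum of deviations:", sum(xi - mean for xi in x))

# Measures of variability
print("range:", max(x) - min(x))
s2 = st.variance(x)                      # sample variance (divides by n - 1)
sd = st.stdev(x)                         # sample standard deviation = sqrt(variance)
sem = sd / math.sqrt(len(x))             # standard error of the mean = SD / sqrt(n)
print("variance:", s2, "SD:", sd, "SEM:", sem)
```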
2. Variance (uses all data points): the average squared deviation of each score from the mean. Notation for the sample variance: s². 3. Standard deviation: SD = √s². 4. Standard error of the mean: SEM = SD/√n.

Inferential Statistics: from samples we draw inferences about the larger group, the population. Sampling error is variability among samples due to chance. Are the differences we observe between samples (or between a sample and the population) true differences, or are they just due to sampling error? Here "error" means misleading variation, not a mistake; probability is how we answer the question.

Probability: a numerical indication of how likely it is that a given event will occur (general definition: "what's the probability it will rain?"). Statistical probability: the odds that what we observed in the sample did not occur because of error (random and/or systematic): "what's the probability that my results are not just due to chance?" In other words, the probability associated with a statistic is the level of confidence we have that the sample group we measured actually represents the total population. Are our inferences valid? The best we can do is calculate a probability about our inferences.

Inferential statistics uses sample data to evaluate the credibility of a hypothesis about a population. The NULL hypothesis (nullus, Latin: "not any") states that there are no differences between the means: H0: μ1 = μ2 ("H-naught"). We always test the null hypothesis. The scientific or alternative hypothesis predicts that there are differences between the groups: H1: μ1 ≠ μ2. A hypothesis is a statement about what findings are expected: the null hypothesis says "the two groups will not differ"; the alternative says "group A will do better than group B" or "groups A and B will not perform the same."

When comparing two sample means there are two possibilities (the null hypothesis is true, or it is false) and two decisions (reject the null, or fail to reject it). Possible outcomes in hypothesis testing:
Null is true and we accept it: correct decision.
Null is true and we reject it: Type I error (rejecting a true null hypothesis).
Null is false and we accept it: Type II error (accepting a false null hypothesis).
Null is false and we reject it: correct decision.
Is a given decision right or wrong? We cannot know for certain, but we can know the probability of being wrong, and we can specify and control the probability of making a Type I or Type II error; we try to keep it small. ALPHA is the probability of making a Type I error. It depends on the criterion you use to accept or reject the null hypothesis and is also called the significance level (the smaller you make alpha, the less likely you are to commit a Type I error). With alpha = 0.05 there are 5 chances in 100 that the observed difference was really due to sampling error; 5% of the time a Type I error will occur. In the outcomes table, alpha attaches to the case where the null is true but we reject it – the difference observed is really just sampling error: alpha is the probability of a Type I error.
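A quick simulation shows what alpha = 0.05 means in practice; a sketch using hypothetical normal populations and scipy's independent-samples t-test:

```python
# Minimal sketch: when H0 is true, about alpha (5%) of tests reject it (Type I errors).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, rejections, trials = 0.05, 0, 2000

for _ in range(trials):
    # Two samples drawn from the SAME population, so H0 (mu1 = mu2) is true.
    a = rng.normal(loc=100, scale=15, size=30)
    b = rng.normal(loc=100, scale=15, size=30)
    t, p = stats.ttest_ind(a, b)
    if p < alpha:
        rejections += 1                  # rejecting a true null = Type I error

print("observed Type I error rate:", rejections / trials)   # close to 0.05
```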
When we do a statistical analysis: if the p-value is greater than the significance level (alpha, typically 0.05) we accept (fail to reject) the null hypothesis; if it is equal to or less than 0.05 we reject the null and conclude there is a difference between the means. Two-tailed (non-directional) test: the 5% region of rejection of the null hypothesis is split into 2.5% in each tail. One-tailed (directional) test: the whole 5% region of rejection lies in one tail.

BETA is the probability of making a Type II error, which occurs when we fail to reject the null when we should have: the difference observed is real, but we failed to reject the null. POWER is the ability to avoid a Type II error; power = 1 − beta (power analysis): the power to find an effect if an effect is present. Ways to increase power: 1. increase the sample size n; 2. decrease variability; 3. use more precise measurements. Effect size is a measure of the size of the difference between means attributed to the treatment. In significance testing, keep practical significance distinct from statistical significance.

Inferential statistics used for testing for mean differences. T-test: used when experiments include only 2 groups; a. independent, b. correlated (i. within-subjects, ii. matched). Based on the t statistic, with critical values determined by the degrees of freedom and the alpha level. Analysis of Variance (ANOVA): used when comparing more than 2 groups; 1. between-subjects, 2. within-subjects (repeated measures). Based on the F statistic, with critical values determined by the degrees of freedom and the alpha level. More than one IV = factorial design (IVs = factors); only one IV = one-way ANOVA. Meta-analysis allows for statistical averaging of results from independent studies of the same phenomenon.

References
Books:
Hastie, Trevor, et al. The Elements of Statistical Learning. 2nd ed., Springer, New York, 2009. ISBN 978-0387848570.
Bruce, Peter, et al. Practical Statistics for Data Scientists: 50 Essential Concepts. 2nd ed., O'Reilly Media, 2020. ISBN 978-1492072942.
Research papers:
Garg, Ram, and Ruchi Goyal. "Inferential Statistics as a Measure of Judging the Short-Term Solvency: An Empirical Study of Three Steel Companies in India." International Journal of Advanced Studies of Scientific Research, vol. 4, no. 1, 2019. Available at SSRN: https://ssrn.com/abstract=3329388
Alacaci, C. "Inferential Statistics: Understanding Expert Knowledge and its Implications for Statistics Education." Journal of Statistics Education, vol. 12, no. 2, 2004. https://doi.org/10.1080/10691898.2004.11910737
Websites:
https://www.simplilearn.com/inferential-statistics-article/
https://builtin.com/data-science/inferential-statistics
Videos:
https://www.youtube.com/watch?v=cjTgyRUaD1s&list=PLbRMhDVUMngeD_vOeveVE-3b7wu_AZph9
https://www.youtube.com/watch?v=ZmCBF5JXOPM&list=PLFW6lRTa1g80s2MWqXNg2o0haq1k14v2I

THANK YOU. For queries, email: [email protected]
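Before the Excel/PHStat walkthrough in the next part, here is a minimal sketch of the two-group t-test and one-way ANOVA mentioned above, using scipy on made-up group scores (all data and variable names are hypothetical):

```python
# Minimal sketch: independent-samples t-test (2 groups) and one-way ANOVA (3 groups).
from scipy import stats

group_a = [23, 25, 28, 30, 26, 24]       # hypothetical scores
group_b = [31, 29, 35, 32, 30, 33]
group_c = [27, 26, 29, 31, 28, 30]

# T-test: only two (independent) groups, based on the t statistic.
t, p = stats.ttest_ind(group_a, group_b)
print(f"t = {t:.3f}, p = {p:.4f}")        # reject H0 (mu_a = mu_b) if p <= 0.05

# One-way ANOVA: more than two groups, based on the F statistic.
f, p = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f:.3f}, p = {p:.4f}")
```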
Hypothesis Testing: Excel Instructions. Activate the PHStat plug-in (instructions shown in a separate video). 1) Open the file Z_Jeans_Dutch_Disposable_Income_Fashion_1000.xlsx. 2) Locate the (unzipped) PHSTAT 4 folder on your computer. 3) Click the PHSTAT.XLAM file. 4) Click 'Enable Macros' when prompted; the PHStat menu will appear in your Excel spreadsheet. 5) Click PHStat > One-Sample Tests > t Test for the Mean, sigma unknown. 6) Enter the appropriate values in the dialog box: the hypothesized population mean μ and the significance level α; select the sample from the appropriate range in the worksheet and select the appropriate test. 7) A new worksheet named 'Hypothesis' will contain the following output for the Z-Jeans data:
Z-Jeans Data (PHStat output)
Null hypothesis: μ = 1500
Level of significance (α): 0.05
Sample size: 1000
Sample mean: 1584.87
Sample standard deviation: 756.5145619
Intermediate calculations – standard error of the mean: 23.9231; degrees of freedom: 999
t test statistic: 3.5475
Two-tail test – lower critical value: -1.9623; upper critical value: 1.9623; p-value: 0.0004
Statistical decision: reject the null hypothesis.
Make sure to translate the statistical decision into a business decision!
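As a cross-check on the PHStat output above, the same t statistic and p-value can be reproduced from the summary statistics alone; a minimal scipy sketch:

```python
# Minimal sketch: reproduce the one-sample t test from the summary statistics above.
import math
from scipy import stats

mu0, xbar, s, n = 1500, 1584.87, 756.5145619, 1000

se = s / math.sqrt(n)                    # standard error of the mean ~ 23.9231
t = (xbar - mu0) / se                    # t test statistic ~ 3.5475
df = n - 1                               # 999
p = 2 * stats.t.sf(abs(t), df)           # two-tailed p-value ~ 0.0004
crit = stats.t.ppf(1 - 0.05 / 2, df)     # upper critical value ~ 1.9623

print(f"t = {t:.4f}, p = {p:.4f}, critical = +/-{crit:.4f}")
# p <= 0.05, so reject H0: the mean disposable income differs from 1500.
```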
Statistics in Our Life: finance, crimes and the legal system, medicine, quality control, etc. (https://www.youtube.com/watch?v=jbkSRLYSojo). Statistical lies: http://www.physics.csbsju.edu/stats/display.html

Definitions (the following slides are copied from different resources).

Variables: A variable is a characteristic or condition that can change or take on different values. Most research begins with a general question about the relationship between two variables for a specific group of individuals.

Experiments: The goal of an experiment is to demonstrate a cause-and-effect relationship between two variables; that is, to show that changing the value of one variable causes changes to occur in a second variable.

Types of Variables: Variables can be classified as discrete or continuous. Discrete variables (such as class size) consist of indivisible categories, and continuous variables (such as time or weight) are infinitely divisible into whatever units a researcher may choose. For example, time can be measured to the nearest minute, second, half-second, etc.

Population: The entire group of individuals is called the population. For example, a researcher may be interested in the relation between class size (variable 1) and academic performance (variable 2) for the population of third-grade children.

Sample: Usually populations are so large that a researcher cannot examine the entire group. Therefore, a sample is selected to represent the population in a research study. The goal is to use the results obtained from the sample to help answer questions about the population.

Simple Random Sample: A simple random sample (SRS) of size n is a sample chosen by a method in which each collection of n population items is equally likely to comprise the sample, just as in a lottery.

Independent Items: The items in a sample are independent if knowing the values of some of the items does not help to predict the values of the others. Items in a simple random sample may be treated as independent in most cases encountered in practice. The exception occurs when the population is finite and the sample comprises a substantial fraction (more than 5%) of the population.
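A small simulation ties these definitions together: repeated simple random samples from the same population give different sample means, which is exactly the sampling error defined next. A sketch with made-up numbers:

```python
# Minimal sketch: repeated simple random samples give varying sample means (sampling error).
import random
import statistics as st

random.seed(1)
population = [random.gauss(170, 40) for _ in range(100_000)]   # hypothetical body weights
mu = st.mean(population)                                       # population parameter

sample_means = []
for _ in range(1000):
    srs = random.sample(population, k=64)    # simple random sample of size n = 64
    sample_means.append(st.mean(srs))        # each sample mean is a statistic

print("population mean (parameter):", round(mu, 2))
print("one sample mean (statistic):", round(sample_means[0], 2))
print("spread of sample means (SD):", round(st.stdev(sample_means), 2))  # ~ sigma/sqrt(n) = 5
```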
Descriptive Statistics: Descriptive statistics are methods for organizing and summarizing data. For example, tables or graphs are used to organize data, and descriptive values such as the average score are used to summarize data. A descriptive value for a population is called a parameter, and a descriptive value for a sample is called a statistic.

Inferential Statistics: Inferential statistics are methods for using sample data to make general conclusions (inferences) about populations. Because a sample is typically only a part of the whole population, sample data provide only limited information about the population. As a result, sample statistics are generally imperfect representatives of the corresponding population parameters.

Sampling Error: The discrepancy between a sample statistic and its population parameter is called sampling error. Defining and measuring sampling error is a large part of inferential statistics.

Types of Data: Data are numerical or quantitative if a numerical quantity is assigned to each item in the sample. Continuous examples: height, weight, age. Discrete examples: number of students in a class, number of pieces of equipment in a project. Data are categorical or qualitative if the sample items are placed into categories (always discrete). Nominal (no natural order between the categories): gender, hair color, zip code. Ordinal (an ordering exists): customer satisfaction surveys, students' grades.

Summary Statistics.
Sample mean: x̄ = (1/n) Σᵢ xᵢ.
Sample variance: s² = (1/(n−1)) Σᵢ (xᵢ − x̄)² = (1/(n−1)) (Σᵢ xᵢ² − n·x̄²).
The sample standard deviation is the square root of the sample variance.

Definition of a Median: The median is another measure of center, like the mean. To find it: if n is odd, the sample median is the number in position (n+1)/2; if n is even, the sample median is the average of the numbers in positions n/2 and n/2 + 1.

Summary: importance of statistics; population versus sample; mean, median and standard deviation.

Statistical Inference: After we have selected a sample, we know the responses of the individuals in the sample.
However, the reason for taking the sample is to infer from that data some conclusion about the wider population represented by the sample. Statistical inference provides methods for drawing conclusions about a population from sample data. Population Collect data from a Sample representative sample... Make an inference about the population. 1 Confidence Interval A level C confidence interval for a parameter has two parts: An interval calculated from the data, which has the form estimate ± margin of error A confidence level C, where C is the probability that the interval will capture the true parameter value in repeated samples. In other words, the confidence level is the success rate for the method. We usually choose a confidence level of 90% or higher because we want to be quite sure of our conclusions. The most common confidence level is 95%. 2 Statistical Estimation Note: Assume we know the stdev σ of the population, σ = 100. 3 We know that sample mean 𝑥ҧ is an unbiased estimator for the (unknown) population mean µ. So we can take 𝑥ҧ = 495 as a good estimate. But how reliable is this estimate? If we take repeated samples, the sample means will vary. Note: Numbers in these figures are different from the previous example. 4 5 Statistical Confidence Because of the Central Limit Theorem, sample mean 𝑥ҧ is normally distributed. From the 68-95-99.7 rule, we know 95% of the values are 𝝈 between +/- 2 standard deviations 𝑥ҧ ± 𝟐 ∙. 𝒏 𝜎 100 And 2 ∙ = 2 ∙ = 2 ∙ 4.5 = 9. 𝑛 500 So we say that the true population mean µ lies somewhere in the interval 495 ± 9 = [486, 504] with 95% confidence. This is the 95% confidence interval for the population mean. 6 Confidence Interval for a Population Mean To calculate a confidence interval for µ, we use the formula: estimate ± (critical value) (standard deviation of statistic) Z* 80% 1.282 85% 1.440 90% 1.645 95% 1.960 99% 2.576 99.5% 2.807 Choose an SRS of size n from a population having unknown mean µ and known standard deviation σ. A level C confidence interval for µ is 𝜎 𝑥ҧ ± 𝑧 ∗ 𝑛 The critical value z* is found from the standard Normal distribution. 7 The Margin of Error The confidence level C determines the value of z* (in Table D). The margin of error also depends on z*. m z * n Higher confidence C implies a larger margin of error m (thus less precision in our estimates). A lower confidence level C produces a C smaller margin of error m (thus better precision in our estimates). m m −z* z* 8 9 10 11 12 Choosing the Sample Size You may need a certain margin of error (e.g., in drug trials or manufacturing specs). In most cases, we have no control over the population variability (), but we can choose the number of measurements (n). The confidence interval for a population mean will have a specified margin of error m when the sample size is z * 2 m z* n n m Remember, though, that sample size is not always stretchable at will. There are typically costs and constraints associated with large samples. The best approach is to use the smallest sample size that can give you useful results. 13 Sample Size Example How many undergraduates should we survey? Suppose we are planning a survey about college savings programs. We want the margin of error of the amount contributed to be $30 with 95% confidence. Let us assume the population standard deviation, σ, equals $1483. How many measurements should you take? For a 95% confidence interval, z* = 1.96. æ z *s ö æ 1.96 *1483 ö 2 2 n =ç ÷ Þ n =ç ÷ = 9387.54. 
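The sample-size rule n = (z*·σ/m)² and the earlier 495 ± 9 interval can both be checked in a few lines; a minimal sketch with scipy (the slide rounds z* to 2 for the interval, so its margin comes out slightly wider):

```python
# Minimal sketch: z confidence interval for a mean, and sample size for a target margin of error.
import math
from scipy import stats

# 95% CI example from the slides: x-bar = 495, known sigma = 100, n = 500.
xbar, sigma, n, conf = 495, 100, 500, 0.95
z_star = stats.norm.ppf(1 - (1 - conf) / 2)        # ~1.96
margin = z_star * sigma / math.sqrt(n)
print(f"95% CI: {xbar - margin:.1f} to {xbar + margin:.1f}")   # roughly 486 to 504

# Sample-size example: margin m = $30 at 95% confidence, sigma = $1483.
m, sigma = 30, 1483
n_needed = (z_star * sigma / m) ** 2               # ~9387.5
print("required n:", math.ceil(n_needed))          # round UP -> 9388
```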
è m ø è 30 ø Using only 9387 measurements will not be enough to ensure that m is no more than $30. Therefore, we need at least 9388 measurements. 14 6.2 Tests of Significance The reasoning of tests of significance Stating hypotheses Test statistics P-values Statistical significance Tests for a population mean Two-sided significance tests and confidence intervals 15 Statistical Inference 2 The second common type of Statistical inference, called tests of significance, is to assess evidence in the data about some claim concerning a population. A test of significance is a formal procedure for comparing observed data with a claim (also called a hypothesis) whose truth we want to assess. The claim is a statement about a parameter such as the population proportion p or the population mean µ. We express the results of a significance test in terms of a probability, called the P-value, which measures how well the data and the claim agree. 16 Four Steps of Tests of Significance Tests of Significance: Four Steps 1. State the null and alternative hypotheses. 2. Calculate the value of the test statistic. 3. Find the P-value for the observed data. 4. State a conclusion. We will learn the details of many tests of significance in the following chapters. The proper test statistic is determined by the hypotheses and the data collection design. 17 1. Stating Hypotheses A significance test starts with a careful statement of the claims we want to compare. The claim tested by a statistical test is called the null hypothesis (H0). The test is designed to assess the strength of the evidence against the null hypothesis. Often, the null hypothesis is a statement of “no effect” or “no difference in the true means.” The claim about the population for which we’re trying to find evidence is the alternative hypothesis (Ha). 18 19 2. Test Statistic A test of significance is based on a statistic that estimates the parameter that appears in the hypotheses. When H0 is true, we expect the estimate to be near the parameter value specified in H0. Values of the estimate far from the parameter value specified by H0 give evidence against H0. A test statistic calculated from the sample data measures how far the data diverge from what we would expect if the null hypothesis H0 were true. estimate − hypothesized value 𝑧= standard deviation of the estimate Large values of the statistic show that the data are not consistent with H0. 20 21 3. P-Value The probability, computed assuming H0 is true, that the statistic would take a value as or more extreme than the one actually observed is called the P-value of the test. The smaller the P-value, the stronger the evidence against H0. 22 23 4. Conclusion We make one of two decisions based on the strength of the evidence against the null hypothesis ―reject H0 or fail to reject H0. P-value small → reject H0 → conclude Ha (in context), P-value large → fail to reject H0 → cannot conclude Ha (in context). If the P-value is smaller than , we say that the data are statistically significant at level . The quantity is called the significance level or the level of significance. 
When we use a fixed level of significance to draw a conclusion in a significance test, P-value < → reject H0 → conclude Ha (in context) P-value ≥ → fail to reject H0 → cannot conclude Ha (in context) 24 25 Tests for a Population Mean One-sided, upper-tail test One-sided, lower-tail test Two-sided test – count both sides 26 27 28 Two-Sided Significance Tests and Confidence Intervals Because a two-sided test is symmetrical, we can also use a 1 – confidence interval to test a two-sided hypothesis at level . Confidence level C and for a two-sided test are related as follows: C=1– /2 /2 29 30 31 More About P-Values 32 6.3 Use and Abuse of Tests Choosing a significance level What statistical significance does not mean Do not ignore lack of significance Beware of searching for significance 33 Cautions About Significance Tests 1 Choosing the significance level Factors often considered: What are the consequences of rejecting the null hypothesis when it is actually true? What might happen if we concluded that global warming was real when it really wasn’t? Suppose an innocent person was convicted of a crime. Are you conducting a preliminary study? If so, you may want a larger so that you will be less likely to miss an interesting result. 34 Choosing Significance Some conventions: Level Typically, the standards of our field of work are used. There are no sharp cutoffs for P-values: for example, there is no practical difference between 4.9% and 5.1%. It is the order of magnitude of the P-value that matters: “somewhat significant,” “significant,” or “very significant.” 35 Cautions About Significance Tests 2 Do not ignore lack of significance Consider this provocative title from the British Medical Journal: “Absence of evidence is not evidence of absence.” Having no proof that a particular suspect committed a murder does not imply that the suspect did not commit the murder. Indeed, failing to find statistical significance in results means that “the null hypothesis is not rejected.” This is very different from actually accepting the null hypothesis. The sample size, for instance, could be too small to overcome large variability in the population. 36 Cautions About Significance Tests 3 Statistical inference not valid for all sets of data 37 6.4 Power and Inference as a Decision Power Increasing the power The common practice of testing hypotheses 38 Power of Test 39 40 41 42 43 Type When I and Type II Errors we draw a conclusion from a significance test, we hope our conclusion will be correct. But sometimes it will be wrong. There are two types of mistakes we can make. If we reject H0 when H0 is true, we have committed a Type I error. If we fail to reject H0 when H0 is false, we have committed a Type II error. Truth about the population H0 false H0 true (Ha true) Conclusion Correct based on Reject H0 Type I error conclusion sample Fail to reject Correct Type II error H0 conclusion 44 45 Increasing the Power Suppose we have performed a power calculation and found that the power is too small. Four ways to increase power are 1. Increase the significance level α. It is more difficult to reject a null hypothesis with a larger α level. 2. Consider a particular alternate value for μ that is farther from the null value. Values of μ that are farther from the hypothesized value are easier to detect. 3. Increase the sample size. More data will provide better information about the sample average, so we have a better chance of distinguishing values of μ. 4. Decrease σ. 
Improving the measuring process and restricting attention to a subpopulation are possible ways to decrease σ. 46 Measurement Measurement: We often want to measure properties of data or models. For the data: Basic properties: Min, max, mean, std. deviation of a dataset. Relationships: between fields (columns) in a tabular dataset, via scatter plots, regression, correlation etc. And for models: Accuracy: How well does our model match the data (e.g. predict hidden values)? Performance: How fast is a ML system on a dataset? How much memory does it use? How does it scale as the dataset size grows? Measurement on Samples Many datasets are samples from an infinite population. We are most interested in measures on the population, but we have access only to a sample of it. A sample measurement is called a “statistic”. Examples: Sample min, max, mean, std. deviation Measurement on Samples Many datasets are samples from an infinite population. We are most interested in measures on the population, but we have access only to a sample of it. That makes measurement hard: Sample measurements are “noisy,” i.e. vary from one sample to the next Sample measurements may be biased, i.e. systematically be different from the measurement on the population. Measurement on Samples Many datasets are samples from an infinite population. We are most interested in measures on the population, but we have access only to a sample of it. That makes measurement hard: Sample measurements have variance: variation between samples Sample measurements have bias, systematic variation from the population value. Examples of Statistics Unbiased: σ𝑛 𝑖=1 𝑥𝑖 Sample mean (sample of n values) 𝑥ҧ = ൗ𝑛 Sample median (kth largest in 2k-1 values) Biased: Min Max 2 σ𝑛 2 𝑖=1 𝑥−𝑥ҧ ൗ Sample variance 𝜎 = 𝑛 (but this does correctly give population variance in the limit as 𝑛 → ∞) For biased estimators, the bias is usually worse on small samples. Statistical Notation We’ll use upper case symbols “𝑋” to represent random variables, which you can think of as draws from the entire population. Lower case symbols “𝑥” represent particular samples of the population, and subscripted lower case symbols to represent instances of a sample: 𝑥𝑖 Normal Distributions, Mean, Variance The mean of a set of values is just the average of the values. Variance a measure of the width of a distribution. Specifically, the variance is the mean squared deviation of points from the mean: 𝑛 1 𝑉𝑎𝑟 𝑋 = 𝑋𝑖 − 𝑋ത 2 𝑛 𝑖=1 The standard deviation is the square root of variance. The normal distribution is completed characterized by mean and variance. mean Standard deviation Central Limit Theorem The distribution of the sum (or mean) of a set of n identically-distributed random variables Xi approaches a normal distribution as n . The common parametric statistical tests, like t-test and ANOVA assume normally-distributed data, but depend on sample mean and variance measures of the data. They typically work reasonably well for data that are not normally distributed as long as the samples are not too small. Correcting distributions Many statistical tools, including mean and variance, t-test, ANOVA etc. assume data are normally distributed. Very often this is not true. The box-and-whisker plot is a good clue Whenever its asymmetric, the data cannot be normal. The histogram gives even more information Correcting distributions In many cases these distribution can be corrected before any other processing. Examples: X satisfies a log-normal distribution, Y=log(X) has a normal dist. 
X poisson with mean k and sdev. sqrt(k). Then sqrt(X) is approximately normally distributed with sdev 1. Histogram Normalization Its not difficult to turn histogram normalization into an algorithm: 0.04 0.10 Draw a normal distribution, and compute its histogram into k bins. Normalize (scale) the areas of the bars to add up to 1. If the left bar has area 0.04, assign the top 0.04-largest values to it, and reassign them a value “60”. If the next bar has area 0.10, assign the next 0.10-largest values to it, and reassign them a value “65” etc. Distributions Some other important distributions: Poisson: the distribution of counts that occur at a certain “rate”. Observed frequency of a given term in a corpus. Number of visits to a web site in a fixed time interval. Number of web site clicks in an hour. Exponential: the interval between two such events. Zipf/Pareto/Yule distributions: govern the frequencies of different terms in a document, or web site visits. Binomial/Multinomial: The number of counts of events (e.g. die tosses = 6) out of n trials. You should understand the distribution of your data before applying any model. Measurement Statistics Measurement Hypothesis Testing Featurization Feature selection Feature Hashing Visualizing Accuracy Autonomy Corp Rhine Paradox* Joseph Rhine was a parapsychologist in the 1950’s (founder of the Journal of Parapsychology and the Parapsychological Society, an affiliate of the AAAS). He ran an experiment where subjects had to guess whether 10 hidden cards were red or blue. He found that about 1 person in 1000 had ESP, i.e. they could guess the color of all 10 cards. Q: what’s wrong with his conclusion? * Example from Jeff Ullman/Anand Rajaraman Autonomy Corp Rhine Paradox He called back the “psychic” subjects and had them do the same test again. They all failed. He concluded that the act of telling psychics that they have psychic abilities causes them to lose it…(!) Hypothesis Testing We want to prove a hypothesis HA but its hard so we try to disprove a null hypothesis H0 A test statistic is some measurement we can make on the data which is likely to be big under HA but small under H0. Hypothesis Testing Example: We suspect that a particular coin isn’t fair. We toss it 10 times, it comes up heads every time… We conclude it’s not fair, why? How sure are we? Now we toss a coin 4 times, and it comes up heads every time. What do we conclude? Hypothesis Testing We want to prove a hypothesis HA (the coin is biased), but its hard so we try to disprove a null hypothesis H0 (the coin is fair). A test statistic is some measurement we can make on the data which is likely to be big under HA but small under H0. the number of heads after k coin tosses – one sided the difference between number of heads and k/2 – two-sided Note: tests can be either one-tailed or two-tailed. Here a two- tailed test is convenient because it checks either very large or very small counts of heads. Hypothesis Testing Another example: Two samples a and b, normally distributed, from A and B. H0 hypothesis that mean(A) = mean(B) test statistic is: s = mean(a) – mean(b). s has mean zero and is normally distributed* under H0. But its “large” if the two means are different. * - We need to use the fact that the sum of two independent, normally-distributed variables is also normally distributed. Hypothesis Testing – contd. s = mean(a) – mean(b) is our test statistic, H0 the null hypothesis that mean(A)=mean(B) We reject if Pr(x > s | H0 ) < p, i.e. 
the probability of a statistic value at least as large as s, should be small. p is a suitable “small” probability, say 0.05. This threshold probability is called a p-value. P directly controls the false positive rate (rate at which we expect to observe large s even if is H0 true). As we make p smaller, the false negative rate increase – situations where mean(A), mean(B) differ but the test fails. Common values 0.05, 0.02, 0.01, 0.005, 0.001 Two-tailed Significance From G.J. Primavera, “Statistics for the Behavioral Sciences” When the p value is less than 5% (p <.05), we reject the null hypothesis Hypothesis Testing From G.J. Primavera, “Statistics for the Behavioral Sciences” Three important tests T-test: compare two groups, or two interventions on one group. CHI-squared and Fisher’s test. Compare the counts in a “contingency table”. ANOVA: compare outcomes under several discrete interventions. T-test Single-sample: Compute the test statistic: 𝑥ҧ t= 𝜎ത where 𝑥ҧ is the sample mean and 𝜎ത is the sample standard deviation, which is the square root of the sample variance Var(x). If X is normally distributed, t is almost normally distributed, but not quite because of the presence of 𝜎. ത It has a t-distribution. You use the single-sample test for one group of individuals in two conditions. Just subtract the two measurements for each person, and use the difference for the single sample t-test. This is called a within-subjects design. T-statistic and T-distribution We use the t-statistic from the last slide to test whether the mean of our sample could be zero. If the underlying population has mean zero, the t-distribution should be distributed like this: The area of the tail beyond our measurement tells us how likely it is under the null hypothesis. If that probability is low (say < 0.05) we reject the null hypothesis. Two sample T-test In this test, there are two samples 𝑥1 and 𝑥2 of sizes 𝑛1 and 𝑛2. A t-statistic is constructed from their sample means and sample standard deviations: 𝑥ҧ1 − 𝑥ҧ2 𝑡= 𝜎𝑥ҧ1 −𝑥ҧ2 𝜎1 2 𝜎2 2 where: 𝜎𝑥ҧ1 −𝑥ҧ2 = + and 𝜎1 and 𝜎2 are sample sdevs, 𝑛1 𝑛2 You should try to understand the formula, but you shouldn’t need to use it. most stats. software exposes a function that takes the samples 𝑥1 and 𝑥2 as inputs directly. This design is called a between-subjects test. Chi-squared test Often you will be faced with discrete (count) data. Given a table like this: Prob(X) Count(X) X=0 0.3 10 X=1 0.7 50 Where Prob(X) is part of a null hypothesis about the data (e.g. that a coin is fair). The CHI-squared statistic lets you test whether an observation is consistent with the data: Oi is an observed count, and Ei is the expected value of that count. It has a chi-squared distribution, whose p-values you compute to do the test. Fisher’s exact test In case we only have counts under different conditions Count1(X) Count2(X) X=0 a b X=1 c d We can use Fisher’s exact test (n = a+b+c+d): Which gives the probability directly (its not a statistic). Non-Parametric Tests All the tests so far are parametric tests that assume the data are normally distributed, and that the samples are independent of each other and all have the same distribution (IID). They may be arbitrarily inaccurate is those assumptions are not met. Always make sure your data satisfies the assumptions of the test you’re using. e.g. watch out for: Outliers – will corrupt many tests that use variance estimates. Correlated values as samples, e.g. if you repeated measurements on the same subject. 
Skewed distributions – give invalid results. Non-parametric tests These tests make no assumption about the distribution of the input data, and can be used on very general datasets: K-S test Permutation tests Bootstrap confidence intervals K-S test The K-S (Kolmogorov-Smirnov) test is a very useful test for checking whether two (continuous or discrete) distributions are the same. In the one-sided test, an observed distribution (e.g. some observed values or a histogram) is compared against a reference distribution. In the two-sided test, two observed distributions are compared. The K-S statistic is just the max distance between the CDFs of the two distributions. While the statistic is simple, its distribution is not! But it is available in most stat packages. K-S test The K-S test can be used to test whether a data sample has a normal distribution or not. Thus it can be used as a sanity check for any common parametric test (which assumes normally-distributed data). It can also be used to compare distributions of data values in a large data pipeline: Most errors will distort the distribution of a data parameter and a K-S test can detect this. Bootstrap samples Often you have only one sample of the data, but you would like to know how some measurement would vary across similar samples (i.e. the variance or histogram of a statistics). You can get a good approximation to related samples by “resampling your sample”. This is called bootstrap sampling (by analogy to lifting yourself up by your bootstraps). For a sample S of N values, a bootstrap sample is a set SB of N values drawn with replacement from S. Idealized Sampling idealized original population (through an oracle) take samples apply test statistic (e.g. mean) histogram of statistic values compare test statistic on the given data, compute p Bootstrap Sampling Original pop. Given data (sample) bootstrap samples, drawn with replacement apply test statistic (e.g. mean) histogram of statistic values The region containing 95% of the samples is a 95% confidence interval (CI) Bootstrap Confidence Interval tests Then a test statistic outside the 95% Confidence Interval (CI) would be considered significant at 0.05, and probably not drawn from the same population. e.g. Suppose the data are differences in running times between two algorithms. If the 95% bootstrap CI does not contain zero, then original distribution probably has a mean other than zero, i.e. the running times are different. We can also test for values other than zero. If the 95% CI contains only values greater than 2, we conclude that the difference in running times is significantly larger than 2. Bootstrap Test for Regression Suppose we have a single sample of points, to which we fit a regression line? How do we know whether this line is “significant”? And what do we mean by that? Bootstrap Test for Regression ANS: Take bootstrap samples, and fit a line to each sample. The possible regression lines are shown below: What we really want to know is “how likely is a line with zero or negative slope”. Bootstrap Test for Regression ANS: Take bootstrap samples, and fit a line to each sample. The possible regression lines are shown below: What we really want to know is “how likely is a line with zero or negative slope”. Negative slope Positive slope histogram of slope values The region containing 95% of the samples is a 95% confidence interval (CI) Updates We’re in 110/120 Jacobs again on Weds. Project work only this Weds. Check Course page for project suggestions / team formation help. 
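The bootstrap confidence-interval procedure described above takes only a few lines of code; a minimal sketch on made-up data (the sample and the number of resamples are arbitrary):

```python
# Minimal sketch: 95% bootstrap (percentile) confidence interval for the mean.
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=80)       # one observed sample, not normal

boot_means = []
for _ in range(10_000):
    resample = rng.choice(sample, size=sample.size, replace=True)  # draw WITH replacement
    boot_means.append(resample.mean())

lo, hi = np.percentile(boot_means, [2.5, 97.5])    # middle 95% of the bootstrap means
print(f"sample mean = {sample.mean():.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
# If a hypothesized value (e.g. 0) falls outside this interval, the difference is
# significant at about the 0.05 level, as described in the slides.
```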
Train-Test-Validation Sets When making measurements on a ML algorithm, we have additional challenges. With a sample of data, any model fit to it models both: 1. Structure in the entire population 2. Structure in the specific sample not true of the population 1. is good because it will generalize to other samples. 2. is bad because it wont. Example: a 25-year old man and a 30-year old woman. Age predicts gender perfectly. (age < 27 => man else woman) Gender predicts age perfectly. (gender == man => 25 else 30) Neither result generalizes. This is called over-fitting. Train-Test-Validation Sets Train/Test split: By (randomly) partitioning our data into train and test sets, we can avoid biased measurements of performance. The model now fits a different sample from the measurement. ML models are trained only on the training set, and then measured on the test set. Example: Build a model of age/gender based on the man/woman above. Now select a test set of 40 random people (men + women). The model will fail to make reliable predictions on this test set. Validation Sets Statistical models often include “tunable” parameters that can be adjusted to improve accuracy. You need a test-train split in order to measure performance for each set of parameters. But now you’ve used the test set in model-building which means the model might over-fit the test set. For that reason, its common to use a third set called the validation set which is used for parameter tuning. A common dataset split is 60-20-20 training/validation/test Model Tuning Tune Parameters Training Data ML model Validation Data Build Evaluate Model Test Data Final Model Scores A Brief History of Machine Learning Before 2012*: Cleverly- Input Data Designed ML model Features Most of the “heavy lifting” in here. Final performance only as good as the feature set. * Before publication of Krizhevsky et al.’s ImageNet CNN paper. A Brief History of Machine Learning After 2012: Deep Learning Input Data Features model Features and model learned together, mutually reinforcing A Brief History of Machine Learning But this (pre-2012) picture is still typical of many pipelines. We’ll focus on one aspect of feature design: feature selection, i.e. choosing which features from a list of candidates to use for a ML problem. Cleverly- Input Data Designed ML model Features Method 1: Ablation Train a model on features (𝑓1 , … , 𝑓𝑛 ), measure performance 𝑄0 Now remove a feature 𝑓𝑘 and train on (𝑓1 , … , 𝑓𝑘−1 , 𝑓𝑘+1 , … , 𝑓𝑛 ), producing performance 𝑄1. If performance 𝑄1 is significantly worse than 𝑄0 , keep 𝑓𝑘 otherwise discard it. Q: How do we check if “𝑄1 is significantly worse than 𝑄0 ” If Method 1: Ablation Train a model on features (𝑓1 , … , 𝑓𝑛 ), measure performance 𝑄0 Now remove a feature 𝑓𝑘 and train on (𝑓1 , … , 𝑓𝑘−1 , 𝑓𝑘+1 , … , 𝑓𝑛 ), producing performance 𝑄1. If performance 𝑄1 is significantly worse than 𝑄0 , keep 𝑓𝑘 otherwise discard it. Q: How do we check if “𝑄1 is significantly worse than 𝑄0 ” If we know 𝑄0 , 𝑄1 are normally-distributed with variance 𝜎 we can do a t-test. Method 1: Ablation Train a model on features (𝑓1 , … , 𝑓𝑛 ), measure performance 𝑄0 Now remove a feature 𝑓𝑘 and train on (𝑓1 , … , 𝑓𝑘−1 , 𝑓𝑘+1 , … , 𝑓𝑛 ), producing performance 𝑄1. If performance 𝑄1 is significantly worse than 𝑄0 , keep 𝑓𝑘 otherwise discard it. Q: How do we check if “𝑄1 is significantly worse than 𝑄0 ” Do bootstrap sampling on the training dataset, and compute 𝑄0 , 𝑄1 on each sample. Then use an appropriate statistical test (e.g. 
a CI) on vectors of 𝑄0 𝑄1 values generated by bootstrap samples. Method 1: Ablation Question: Why do you think ablation starts with all the features and removes one-at-a-time rather than starting with no features, and adding one-at-a-time? Method 2: Mutual Information Mutual information measures the extent to which knowledge of one feature influences the distribution of another (the classifier output). Where U is a random variable which is 1 if term et is in a given document, 0 otherwise. C is 1 if the document is in the class c, 0 otherwise. These are called indicator random variables. Mutual information can be used to rank features, the highest will be kept for the classifier and the rest ignored. Method 3: CHI-Squared CHI-squared is an important statistic to know for comparing count data. Here it is used to measure dependence between word counts in documents and in classes. Similar to mutual information, terms that show dependence are good candidates for feature selection. CHI-squared can be visualized as a test on contingency tables like this one: Right-Handed Left-Handed Total Males 43 9 52 Females 44 4 48 Total 87 13 100 Example of Feature Count vs. Accuracy Feature Hashing Challenge: many prediction problems involve very, very rare features, e.g. URLs or user cookies. There are billions to trillions of these, too many to represent explicitly in a model (or to run feature selection on!) Most of these features are not useful, i.e. don’t help predict the target class. A small fraction of these features are very important for predicting the target class (e.g. user clicks on a BMW dealer site has some interest in BMWs). Feature Hashing Word Hash Function Feature Count The 1 2 Quick Feature table 2 2 much smaller Brown 3 3 than feature Fox set. 4 1 Jumps 5 0 Over 6 1 the Lazy We train a classifier on Dog these features Feature Hashing Feature 3 receives “Brown”, “Lazy” and “Dog”. The first two of these are not very salient to the category of the sentence, but “Dog” is. Classifiers trained on hashed features often perform surprisingly well – although it depends on the application. They work well e.g. for add targeting, because the false positive cost (target dog ads to non-dog-lovers) is low compared to the false negative cost (miss an opportunity to target a dog-lover). Feature Hashing and Interactions One very important application of feature hashing is to interaction features. Interaction features (or just interactions) are tuples (usually pairs) of features which are treated as single features. E.g. the sentence “the quick brown fox…” has interaction features including: “quick-brown”, “brown-fox”, “quick-fox” etc. Interaction features are often worth “more than the sum of their parts” e.g. “BMW-tires,” “ipad-charger,” “school-bags” There are N2 interactions among N features, but very few are meaningful. Hashing them produces many collisions but most don’t matter. Why not to use “accuracy” directly The simplest measure of performance would be the fraction of items that are correctly classified, or the “accuracy” which is: tp + tn tp + tn + fp + fn (tp = true positive, fn = false negative etc.). But this measure is dominated by the larger set (of positives or negatives) and favors trivial classifiers. e.g. if 5% of items are truly positive, then a classifier that always says “negative” is 95% accurate. ROC plots ROC is Receiver-Operating Characteristic. 
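Anticipating the ROC details on the next slides (true-positive rate versus false-positive rate across score thresholds), here is a minimal sketch with scikit-learn on hypothetical scores and labels:

```python
# Minimal sketch: ROC curve points and ROC AUC for a toy classifier's scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]                          # hypothetical labels
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.65, 0.3]    # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)  # false/true positive rate per threshold
auc = roc_auc_score(y_true, scores)               # 0.5 = random ordering, 1.0 = perfect

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th:.2f}: FPR = {f:.2f}, TPR = {t:.2f}")
print("ROC AUC:", auc)
```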
ROC plots
ROC is the Receiver-Operating Characteristic. The points of the curve are generated by sweeping the classifier score threshold.
Y-axis: true positive rate = tp / (tp + fn), the same as recall.
X-axis: false positive rate = fp / (fp + tn) = 1 - specificity.

ROC AUC
ROC AUC is the "Area Under the Curve": a single number that captures the overall quality of the classifier. It should be between 0.5 (a random classifier; random ordering gives area = 0.5) and 1.0 (a perfect classifier).

Lift Plot
A variation of the ROC plot is the lift plot, which compares the performance of the actual classifier/search engine against random ordering, or sometimes against another classifier. Lift at a given point is the ratio of the classifier's curve to the random-ordering baseline (the ratio of the two lengths in the plot). Lift plots emphasize initial precision (typically what you care about) and performance in a problem-independent way.
Note: the lift plot points should be computed at regular spacing, e.g. 1/100 or 1/1000; otherwise the initial lift value can be excessively high and unstable.
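In practice these curves are rarely computed by hand. A short sketch, assuming Python with NumPy and scikit-learn: roc_curve and roc_auc_score are standard scikit-learn functions, while the synthetic scores and the lift_at helper (lift at the top fraction q of the ranking, relative to random ordering) are illustrative assumptions.

# Sketch: ROC curve, ROC AUC, and a simple lift comparison against random ordering.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.2).astype(int)      # 0/1 labels, about 20% positive
scores = y_true * 0.5 + rng.random(1000)           # noisy scores correlated with the label

fpr, tpr, thresholds = roc_curve(y_true, scores)   # x = false positive rate, y = true positive rate
auc = roc_auc_score(y_true, scores)                # 0.5 = random ordering, 1.0 = perfect

def lift_at(q, y_true, scores):
    # Recall achieved in the top fraction q of the ranking, divided by the recall
    # a random ordering would achieve at that point (which is just q).
    n_top = max(1, int(q * len(y_true)))
    top = np.argsort(-scores)[:n_top]              # indices of the highest-scoring items
    recall = y_true[top].sum() / y_true.sum()
    return recall / q

print(auc, lift_at(0.01, y_true, scores))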
Contents: Null and Alternative Hypotheses; Test Statistic; P-Value; Significance Level; One-Sample z Test; Power and Sample Size.

Terms Introduced in a Prior Chapter
Population: all possible values.
Sample: a portion of the population.
Statistical inference: generalizing from a sample to a population with a calculated degree of certainty. The two forms of statistical inference are hypothesis testing and estimation.
Parameter: a characteristic of the population, e.g., the population mean µ.
Statistic: calculated from data in the sample, e.g., the sample mean (x̄).

Distinctions Between Parameters and Statistics (Chapter 8 review)

                Parameters         Statistics
Source          Population         Sample
Notation        Greek (e.g., μ)    Roman (e.g., x̄)
Vary            No                 Yes
Calculated      No                 Yes

Sampling Distributions of a Mean (introduced in Ch 8)
The sampling distribution of a mean (SDM) describes the behavior of a sample mean:
x̄ ~ N(µ, SEx̄), where SEx̄ = σ / √n

Hypothesis Testing
Hypothesis testing is also called significance testing. It tests a claim about a parameter using evidence (data in a sample). The technique is introduced by considering a one-sample z test. The procedure is broken into four steps, and each element of the procedure must be understood.

Hypothesis Testing Steps
A. Null and alternative hypotheses
B. Test statistic
C. P-value and interpretation
D. Significance level (optional)

§9.1 Null and Alternative Hypotheses
Convert the research question to null and alternative hypotheses. The null hypothesis (H0) is a claim of "no difference in the population". The alternative hypothesis (Ha) claims "H0 is false". Collect data and seek evidence against H0 as a way of bolstering Ha (deduction).

Illustrative Example: "Body Weight"
The problem: in the 1970s, 20–29 year old men in the U.S. had a mean body weight of μ = 170 pounds, with standard deviation σ = 40 pounds. We test whether mean body weight in the population now differs.
Null hypothesis H0: μ = 170 ("no difference").
The alternative hypothesis can be either Ha: μ > 170 (one-sided test) or Ha: μ ≠ 170 (two-sided test).

§9.2 Test Statistic
This is an example of a one-sample test of a mean when σ is known. Use this statistic to test the problem:
zstat = (x̄ − μ0) / SEx̄, where μ0 is the population mean assuming H0 is true and SEx̄ = σ / √n.

Illustrative Example: z statistic
For the illustrative example, μ0 = 170 and we know σ = 40. Take an SRS of n = 64. Therefore SEx̄ = 40 / √64 = 5.
If we found a sample mean of 173, then zstat = (173 − 170) / 5 = 0.60.
If we found a sample mean of 185, then zstat = (185 − 170) / 5 = 3.00.

Reasoning Behind zstat
Under H0: µ = 170 with n = 64, the sampling distribution of x̄ is x̄ ~ N(170, 5).

§9.3 P-value
The P-value answers the question: what is the probability of the observed test statistic, or one more extreme, when H0 is true? This corresponds to the area under the Standard Normal curve in the tail beyond zstat.
Converting z statistics to P-values:
For Ha: μ > μ0, P = Pr(Z > zstat) = right tail beyond zstat
For Ha: μ < μ0, P = Pr(Z < zstat) = left tail beyond zstat
For Ha: μ ≠ μ0, P = 2 × the one-tailed P-value
Use Table B or software to find these probabilities.
(Figures: one-sided P-values for zstat = 0.6 and for zstat = 3.0.)

Two-Sided P-Value
A one-sided Ha uses the area in the tail beyond zstat. A two-sided Ha considers potential deviations in both directions, so we double the one-sided P-value.
Examples: if the one-sided P = 0.0010, then the two-sided P = 2 × 0.0010 = 0.0020. If the one-sided P = 0.2743, then the two-sided P = 2 × 0.2743 = 0.5486.

Interpretation
The P-value answers the question: what is the probability of the observed test statistic, or one more extreme, when H0 is true? Thus, smaller and smaller P-values provide stronger and stronger evidence against H0. A small P-value means strong evidence.

Interpretation Conventions*
P > 0.10: non-significant evidence against H0
0.05 < P ≤ 0.10: marginally significant evidence
0.01 < P ≤ 0.05: significant evidence against H0
P ≤ 0.01: highly significant evidence against H0
Examples: P = 0.27 gives non-significant evidence against H0; P = 0.01 gives highly significant evidence against H0.
* It is unwise to draw firm borders for "significance".

α-Level (used in some situations)
Let α ≡ the probability of erroneously rejecting H0. Set an α threshold (e.g., let α = 0.10, 0.05, or whatever). Reject H0 when P ≤ α; retain H0 when P > α.
Example: set α = 0.10; find P = 0.27, so retain H0.
Example: set α = 0.01; find P = 0.001, so reject H0.

(Summary) One-Sample z Test
A. Hypothesis statements: H0: µ = µ0 vs. Ha: µ ≠ µ0 (two-sided), or Ha: µ < µ0 (left-sided), or Ha: µ > µ0 (right-sided)
B. Test statistic: zstat = (x̄ − μ0) / SEx̄, where SEx̄ = σ / √n
C. P-value: convert zstat to a P-value
D. Significance statement (usually not necessary)

§9.5 Conditions for z test
- σ known (not calculated from the data)
- Population approximately Normal, or a large sample (central limit theorem)
- SRS (or a reasonable facsimile)
- Valid data
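A minimal sketch of steps A–D in Python, assuming SciPy is available; the function name one_sample_z_test and its arguments are illustrative, not from the slides.

# Sketch: one-sample z test (sigma known), following steps A-D above.
from math import sqrt
from scipy.stats import norm

def one_sample_z_test(xbar, mu0, sigma, n, tail="two-sided"):
    se = sigma / sqrt(n)              # standard error of the sample mean
    z = (xbar - mu0) / se             # B. test statistic
    if tail == "greater":             # Ha: mu > mu0
        p = norm.sf(z)                # right-tail area beyond z
    elif tail == "less":              # Ha: mu < mu0
        p = norm.cdf(z)               # left-tail area beyond z
    else:                             # Ha: mu != mu0
        p = 2 * norm.sf(abs(z))       # double the one-sided P-value
    return z, p

# Body-weight example: x-bar = 173, mu0 = 170, sigma = 40, n = 64
# gives z = 0.60 and a one-sided P of about 0.27, as on the slides.
print(one_sample_z_test(173, 170, 40, 64, tail="greater"))

The Lake Wobegon example that follows can be checked the same way: one_sample_z_test(112.8, 100, 15, 9, tail="greater") gives z = 2.56 and a one-sided P of about 0.0052.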
The Lake Wobegon Example ("where all the children are above average")
Let X represent Wechsler Adult Intelligence Scale (WAIS) scores. Typically, X ~ N(100, 15). Take an SRS of n = 9 from the Lake Wobegon population. The data are {116, 128, 125, 119, 89, 99, 105, 116, 118}, giving x̄ = 112.8. Does the sample mean provide strong evidence that the population mean μ > 100?

Example: "Lake Wobegon"
A. Hypotheses: H0: µ = 100 versus Ha: µ > 100 (one-sided) or Ha: µ ≠ 100 (two-sided)
B. Test statistic: SEx̄ = σ / √n = 15 / √9 = 5; zstat = (x̄ − μ0) / SEx̄ = (112.8 − 100) / 5 = 2.56
C. P-value: P = Pr(Z ≥ 2.56) = 0.0052. A P-value of 0.0052 means it is unlikely the sample came from this null distribution, i.e. strong evidence against H0.

Two-Sided P-value: Lake Wobegon
Ha: µ ≠ 100 considers random deviations "up" and "down" from μ0, i.e. the tails above and below ±zstat. Thus, the two-sided P = 2 × 0.0052 = 0.0104.

§9.6 Power and Sample Size
Two types of decision errors:
Type I error = erroneous rejection of a true H0
Type II error = erroneous retention of a false H0

                     Truth
Decision         H0 true               H0 false
Retain H0        Correct retention     Type II error
Reject H0        Type I error          Correct rejection

α ≡ probability of a Type I error
β ≡ probability of a Type II error

Power
β ≡ probability of a Type II error: β = Pr(retain H0 | H0 false) (the "|" is read as "given").
1 − β ≡ "power", the probability of avoiding a Type II error: 1 − β = Pr(reject H0 | H0 false).

Power of a z test
1 − β = Φ( −z_{1−α/2} + |μ0 − μa| √n / σ )
where Φ(z) represents the cumulative probability of the Standard Normal Z, μ0 represents the population mean under the null hypothesis, and μa represents the population mean under the alternative hypothesis.

Calculating Power: Example
A study of n = 16 retains H0: μ = 170 at α = 0.05 (two-sided); σ is 40. What was the power of the test's conditions to identify a population mean of 190?
1 − β = Φ( −z_{1−α/2} + |μ0 − μa| √n / σ ) = Φ( −1.96 + |170 − 190| √16 / 40 ) = Φ(0.04) = 0.5160

Reasoning Behind Power
Consider two competing sampling distributions: the top curve (next page) assumes H0 is true, the bottom curve assumes Ha is true, and α is set to 0.05 (two-sided). We will reject H0 when the sample mean exceeds 189.6 (right tail, top curve). The probability of getting a value greater than 189.6 on the bottom curve is 0.5160, corresponding to the power of the test.

Sample Size Requirements
Sample size for a one-sample z test:
n = σ² (z_{1−β} + z_{1−α/2})² / Δ²
where 1 − β ≡ the desired power, α ≡ the desired significance level (two-sided), σ ≡ the population standard deviation, and Δ = μ0 − μa ≡ the difference worth detecting.

Example: Sample Size Requirement
How large a sample is needed for a one-sample z test with 90% power and α = 0.05 (two-tailed) when σ = 40? Let H0: μ = 170 and Ha: μ = 190 (thus Δ = μ0 − μa = 170 − 190 = −20).
n = σ² (z_{1−β} + z_{1−α/2})² / Δ² = 40² (1.28 + 1.96)² / (−20)² = 41.99
Round up to 42 to ensure adequate power.

(Figure: illustration of the conditions for 90% power.)
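A short sketch of the two formulas above in Python, assuming SciPy; the function names are illustrative. It reproduces the worked examples: power of about 0.516 for n = 16, and roughly 42 observations for 90% power.

# Sketch: power and required sample size for a one-sample z test with a two-sided alpha.
from math import sqrt
from scipy.stats import norm

def power_z_test(mu0, mua, sigma, n, alpha=0.05):
    # 1 - beta = Phi(-z_{1-alpha/2} + |mu0 - mua| * sqrt(n) / sigma)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(-z_crit + abs(mu0 - mua) * sqrt(n) / sigma)

def sample_size_z_test(sigma, delta, power=0.90, alpha=0.05):
    # n = sigma^2 * (z_{1-beta} + z_{1-alpha/2})^2 / delta^2; round up in practice
    return sigma**2 * (norm.ppf(power) + norm.ppf(1 - alpha / 2))**2 / delta**2

print(power_z_test(170, 190, 40, 16))   # about 0.516, as in the worked example
print(sample_size_z_test(40, 20))       # about 42.0 (the slide's rounded z values, 1.28 and 1.96, give 41.99)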