ADDA15.lec12_Bayes (1).pptx
PSYC40005 - 2023 ADVANCED DESIGN AND DATA ANALYSIS
Lecture 12: Bayesian Data Analysis and Cross Validation
Adam Osth, Psychological Sciences, University of Melbourne, Melbourne Connect, [email protected]

Lecture 12: BAYESIAN ANALYSIS

Null Hypothesis Significance Testing
Null hypothesis significance testing (NHST)
– The traditional analysis and the alternative to Bayesian analyses
– Conduct an analysis (regression, ANOVA) and assess whether there is a significant result
– But what is "significance"? The p value is the probability of obtaining the data if the null hypothesis is true. If that value is small, we reject the null hypothesis.

Null Hypothesis Significance Testing
What can a lack of statistical significance tell you? (p > .05)
– The probability of the data under the null hypothesis is higher than our significance threshold
– This DOES NOT mean the null hypothesis is true!
– The other possibility is that you may not have had enough data to reject the null hypothesis
– You cannot distinguish between these two possibilities with NHST analyses
P values are frequently misinterpreted!

Null Hypothesis Significance Testing
What's missing from NHST?
– …the likelihood of the data under the alternative hypothesis!
The case of Sally Clark
– Two of her children died at a very early age, and Sally Clark was put on trial for murder
– The defense claimed the deaths were due to SIDS (Sudden Infant Death Syndrome)
– The probability of two children dying of SIDS was put at 1 in 73 million, and Sally Clark was found guilty

Sally Clark, cont'd
– The 1 in 73 million can be considered the p value: it is very unlikely that the deaths of her children were due to chance alone
– …but what is the likelihood of two deaths under the alternative hypothesis, that she murdered her two children? The murder of two children by a parent on two separate occasions is also extremely unlikely.
– We should be taking a ratio of these two probabilities to make a judgment. This ratio is a Bayes Factor.

Some Basics of Probability Theory
A conditional probability p(A | B) is the probability of A given B, for example "the probability that I will go running under the circumstance that it is raining". Note that the probability that I will go running if it's raining is not the same as the probability that it's raining if I'm running. Calculating the probability of both events occurring together involves multiplication, p(A and B) = p(A | B) p(B), which rearranges to

p(A | B) = p(A and B) / p(B)

Since p(A and B) can equally be written as p(A) p(B | A), we get

p(A | B) = p(A) p(B | A) / p(B)

This is Bayes' rule!

Bayes Rule
In other words: Bayes' rule provides a formal means for reversing a conditional probability, using the given conditional probability p(B | A) and the probabilities of each of the events, p(A) and p(B). Applied to data analysis:

p(parameters | data) = p(data | parameters) p(parameters) / p(data)

Parameters are things like slopes, intercepts, and so on. The left-hand side is the posterior probability of the parameters; maximum likelihood estimation maximizes only the likelihood term, p(data | parameters). Bayesian methods expand on maximum likelihood estimation by also incorporating p(parameters), the prior probability, and p(data), the marginal likelihood.
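As a concrete illustration of the formula above, here is a minimal sketch, not from the lecture: the data and the Normal(5, 2) prior are invented, and the single parameter is the mean of a normal distribution with a known standard deviation of 1. The posterior is obtained by multiplying the likelihood by the prior over a grid of candidate values and normalising by p(data).

```python
import numpy as np

# A minimal sketch (not from the lecture): p(parameters | data) is proportional
# to p(data | parameters) * p(parameters).  The parameter is the mean of a
# normal with SD 1; the data and Normal(5, 2) prior are invented.
data = np.array([4.2, 5.1, 3.8, 4.9, 5.4])     # hypothetical observations
grid = np.linspace(0, 10, 1001)                # candidate values for the mean
dx = grid[1] - grid[0]

prior = np.exp(-0.5 * ((grid - 5) / 2) ** 2)   # Normal(5, 2) prior, unnormalised
likelihood = np.prod(np.exp(-0.5 * (data[:, None] - grid) ** 2), axis=0)
posterior = prior * likelihood                 # numerator of Bayes' rule
posterior /= posterior.sum() * dx              # dividing by p(data) normalises it

print("posterior mean:", np.sum(grid * posterior) * dx)
```

The posterior mean sits between the prior mean and the sample mean, which is exactly the "prior plus likelihood" compromise that Bayes' rule formalises.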
An example of Bayesian posterior probability calculation
A researcher comes up with a test for malaria. The test comes up positive 99% of the time when people have malaria and returns a positive result 2% of the time when people do not have the disease. The prior probability of having the disease is .1%. If I take the test and get a positive result, what is the probability that I actually have malaria?

p(disease | positive) = p(disease) p(positive | disease) / p(positive)

We have to expand the denominator and marginalize over all cases that involve a positive result:

p(disease | positive) = p(disease) p(positive | disease) / [ p(disease) p(positive | disease) + p(no disease) p(positive | no disease) ]

p(disease | positive) = (.001 × .99) / ((.001 × .99) + (.999 × .02)) ≈ .047

Bayes Rule Expanded for Hypothesis Testing

p(M1 | Data) = p(M1) p(Data | M1) / [ p(M1) p(Data | M1) + p(M2) p(Data | M2) ]

The Bayes Factor is the ratio of evidence between two models M1 and M2, p(Data | M1) / p(Data | M2), although the framework can be generalized to any number of models. In practice, we can specify M1 and M2 as the null and alternative hypotheses and test between them.

Another Way to Calculate Bayes Factors: The Savage-Dickey Method

p(parameters | data) = p(data | parameters) p(parameters) / p(data)

– Estimate the posterior distribution for the effect size δ
– Use a prior distribution for δ that is centered on zero
– Compare the posterior density of δ at zero to the prior density of δ at zero
– In other words: "What is our evidence that the effect size is zero after we have seen the data (posterior density) compared to before we have seen the data (prior density)?"

Savage-Dickey Bayes Factors
[Figure: the prior distribution of the effect size and the posterior distribution of the effect size, i.e., how we've updated our beliefs about the effect size after having seen the data. The density of the data under the null hypothesis is the posterior density at an effect size of 0; the density under the alternative hypothesis is the prior density at an effect size of 0. In this example the data are six times more likely under the null hypothesis, so we are more confident that the null hypothesis is true after having seen the data.]

Advantages of Bayesian Hypothesis Testing
Being able to find evidence for the null hypothesis
– In stats classes, we have repeatedly told you that you can only reject the null hypothesis, you cannot accept it
– With Bayesian hypothesis testing, you can find evidence for the null or the alternative hypothesis. This ratio of evidence is quantified by the Bayes Factor.

Bayes Factors
Suggested interpretations of Bayes Factors exist, but like all guidelines in statistics, these are merely suggestions! They are not meant to be taken as law. If BF01 is 100, it means the data are 100 times more likely under the null hypothesis than under the alternative hypothesis.

Bayes Factors
What happens if you don't have enough data? Does this mean the null hypothesis is supported?
– No, not at all! Often under these circumstances you get a BF around 1.0, meaning the data are equally likely under the null and alternative hypotheses.
– As more data are collected, the BF will move toward the more supported hypothesis. BFs also become more extreme as more data are collected.

Bayes Factors
"But wait Adam, you're not supposed to inspect data as they're being collected! That's a big no-no!"
– Actually, this is less problematic for Bayesian analyses!
– "Yesterday's posterior is tomorrow's prior": you can use the outcome of a prior Bayesian hypothesis test as the prior probability for the next test. This naturally allows you to update your results as data are being collected. You can also use these as priors for your next analysis in a sequence of experiments.
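The Savage-Dickey ratio and the way Bayes factors sharpen with more data can both be seen in a tiny simulation. The sketch below is not from the lecture: it uses invented coin-flip data, takes M0 to be "the probability of heads is exactly 0.5", lets M1 place a uniform Beta(1, 1) prior on that probability, and computes BF01 as the posterior density at 0.5 divided by the prior density at 0.5 at a few sample sizes.

```python
import numpy as np
from scipy import stats

# A hedged sketch of the Savage-Dickey idea on invented coin-flip data.
# M0: probability of heads is exactly 0.5.  M1: uniform Beta(1, 1) prior.
# With this conjugate prior, the posterior after n flips with h heads is
# Beta(1 + h, 1 + n - h), so BF01 is the posterior density at 0.5 divided
# by the prior density at 0.5.
rng = np.random.default_rng(1)
flips = rng.random(500) < 0.5          # here the null happens to be true

for n in (20, 100, 500):               # check the evidence at several sample sizes
    h = int(flips[:n].sum())
    prior_at_null = stats.beta(1, 1).pdf(0.5)                 # = 1 for a uniform prior
    posterior_at_null = stats.beta(1 + h, 1 + n - h).pdf(0.5)
    bf01 = posterior_at_null / prior_at_null                  # evidence for the null
    print(f"n = {n:3d}, heads = {h:3d}, BF01 = {bf01:.2f}")
```

Because the coin really is fair in this simulation, BF01 will typically grow as more flips are observed, which is the sense in which Bayesian hypothesis testing can accumulate evidence for the null rather than merely failing to reject it.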
Prior probabilities
What can you place prior probabilities on?
– Model parameters
– Effect size: you can specify a prior distribution on the direction and magnitude of a particular effect
– Coefficients of a regression model: you may already have a sense of both the direction and magnitude of a particular predictor. For example, we know that there is a positive relationship between smoking and heart disease, meaning positive slopes should be a priori more likely. A literature review may be able to uncover what the values of those slopes are.

Why is the prior probability important?
When conducting Bayesian data analyses, we can use prior probabilities to capture our sense of which results or hypotheses are more or less likely.
– This can be based on intuition, or on a review of the literature
– If there is a high probability of a positive effect in a particular experimental paradigm, this can be reflected as high prior probabilities for effect sizes greater than zero. Likewise, if we are confident that there is no effect, we can use a prior distribution with high probability density at an effect size of zero.

Bem's (2011) Result
In 2011, Daryl Bem published a paper in the top social psychology journal with evidence for precognition (ESP). It is one of the most controversial papers published in the history of psychology. The paper has been criticized on several grounds:
– It reported several experiments, and not all of them found evidence for ESP
– Conditions and groups were analyzed separately, often without any form of statistical correction

Response from Wagenmakers et al. (2011)
A passage: "…precognition – if it exists – is an anomalous phenomenon, because it conflicts with what we know to be true about the world (e.g., weather forecasting agencies do not employ clairvoyants, casinos make profits)."
In the paper, Wagenmakers et al. re-analyze the Bem data with Bayesian methods.
– They adopt a prior probability that heavily favors the null hypothesis (more than 10,000 to 1 against the alternative hypothesis). This captures the idea that "extraordinary claims require extraordinary evidence."
– The data from 8 of the 9 experiments showed strong evidence for the null hypothesis.

More on Bayesian Methods
Bayesian methods are easier than ever to implement. The JASP software is easy to use, free, and intuitive:
– http://jasp-stats.org
– It can conduct Bayesian equivalents of ANOVAs, t-tests, and regressions
I also strongly recommend the following paper, which is long but easy to read and illustrates the foundations:
– Etz, A., & Vandekerckhove, J. (2018). Introduction to Bayesian inference for psychology. Psychonomic Bulletin & Review, 25, 5-34.

Lecture 12: CROSS VALIDATION

Overfitting
Sometimes a fit to the data that is too good is a bad thing!
– If you're fitting the data perfectly, you're not just fitting the data, you're also fitting the noise
– A good fit can result from a model that is overly complex (Pitt & Myung, 2002), and such a model may not necessarily generalize to new datasets!

Overfitting
You can always improve the fit to data by adding complexity to the model. This complexity can take the form of:
– Additional predictors in a regression model
– Additional interaction terms
– More complex functional relationships between the model and the data, e.g., instead of linear relationships between model parameters and data, using quadratic or polynomial relationships
This complexity comes at a cost: poorer generalization to new data, as the sketch below illustrates.
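Here is a minimal sketch of that cost, with invented data: the true relationship is linear, but a 9th-degree polynomial has enough flexibility to chase the noise, so it tends to fit the training half of the data better while predicting the held-out half worse.

```python
import numpy as np

# A minimal sketch with invented data: the true relationship is linear.
# A 9th-degree polynomial can chase the noise in the training half, which
# typically improves training MSE while worsening held-out MSE.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)   # linear signal + noise

train = np.arange(0, 40, 2)                        # every other point
test = np.arange(1, 40, 2)                         # the points held out

for degree in (1, 9):
    coefs = np.polyfit(x[train], y[train], degree)
    mse_train = np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2)
    mse_test = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
    print(f"degree {degree}: training MSE = {mse_train:.3f}, held-out MSE = {mse_test:.3f}")
```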
Overfitting
Why is poorer generalization to new data bad?
– The model is unable to generalize to new samples, new paradigms, or to the population at large!

Overfitting
We have actually talked during the course about some of these issues!
– E.g., loglinear models: assessing whether interaction terms should be included in the model (added complexity)
– In that instance, we evaluated whether the added interactions were necessary by comparing different models. If removing interactions did not result in a significant difference, the interactions were removed.
– We had to justify the inclusion of the added complexity: it had to add something to the goodness of fit!

Overfitting
[Figure: the blue line fits the data perfectly, but will it generalize to new data?]

Cross Validation
Instead of fitting *all* of your data…
– You fit a subset of your data (the training data) and evaluate the performance of the model on the remaining subset that it was not trained on (the validation data)
– The fit performance on the validation data is referred to as out-of-sample prediction
This is an extremely effective way of comparing models.
– The model that performs better on the validation data should be preferred: it exhibited better prediction, i.e., better generalization to data it was not trained on!
[Figure: a simple linear model (one parameter) is compared with a complex polynomial model (ten parameters) on MSE, the mean squared error, a measure of badness of fit. The linear model fits the training data worse but fits the test data better (better generalization), so we should prefer the linear model.]

What kinds of models would you compare?
In your research, you may have choices in your data analysis about:
– How many predictors go in your regression model (e.g., a two-predictor model, a three-predictor model, etc.)
– Whether interaction terms should be included
It can be tempting to try out a number of different models and evaluate how well they perform. Techniques like cross validation are a *safeguard* against overfitting.

Cross Validation
There are several methods of performing this.
– Leave-one-out cross validation (LOOCV) is the most common: the validation data is a single subject (or even a single data point), and the rest are training data
– Repeat the process, leaving out each subject/data point in turn (if you have N subjects/data points, you repeat the process N times)
– Evaluate the performance of each model on the predicted (left-out) data
– If you are performing parameter estimation (e.g., estimating regression weights), you can average over the parameters from all N fits

Cross Validation
Downsides of the method
– Time intensive: you have to fit a large number of models to the data
– Not easy to perform in SPSS: there are specialized packages in R, MATLAB, or Python (see the sketch below)
– Talk to your advisers about this! Not just my recommendation!
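As one example of such a package, the sketch below uses Python with scikit-learn on invented data: it runs LOOCV for a one-predictor and a two-predictor regression and reports their out-of-sample mean squared error. The predictor names and data are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# A hedged sketch with invented data: compare a one-predictor regression with
# a two-predictor regression by leave-one-out cross validation.  The second
# predictor is pure noise, so adding it should not improve out-of-sample error.
rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                       # unrelated to the outcome
y = 0.8 * x1 + rng.normal(size=n)

candidates = {
    "one predictor": x1[:, None],
    "two predictors": np.column_stack([x1, x2]),
}
for name, X in candidates.items():
    # Each of the N fits leaves one observation out and predicts it.
    scores = cross_val_score(LinearRegression(), X, y,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    print(f"{name}: LOOCV mean squared error = {-scores.mean():.3f}")
# The model with the lower LOOCV error generalizes better and should be preferred.
```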
Regularization
Regularization is a related technique for reducing the complexity of a regression model. The most common technique is lasso regression: include an additional term in the error function that is the sum of the absolute values of the coefficients.
– The error term in OLS (ordinary least squares) regression is just the sum of squared errors (deviations between model predictions and data)
– In other words, having high values on all of the coefficients makes the model perform worse
– This pushes the estimates for small or weak predictors to zero

Regularization
Lasso regression requires specifying the regularization term (the penalty applied to large coefficient values).
– Major downside of the method: this can be difficult to specify in some cases
[Figures: coefficient estimates plotted against the regularization parameter. At a moderate value, a number of the coefficients are pushed down to zero while about five coefficients still have non-zero values; when the regularization is too strong, all of the coefficients are pushed to zero.]

Regularization
Why would you want to use it?
– Predictors almost inevitably get non-zero coefficient estimates even if they're not doing anything
– Lasso regression naturally produces a simpler model in which fewer predictors have non-zero coefficients
How can you use it?
– It IS available in SPSS: https://www.ibm.com/docs/en/spss-statistics/SaaS?topic=catreg-categorical-regression-regularization (Analyze -> Regression -> Optimal Scaling (CATREG))
– Also available in R: https://www.pluralsight.com/guides/linear-lasso-and-ridge-regression-with-r

Combining Techniques
Can you combine these techniques?
– Yes! Cross validation, Bayesian methods, and lasso regression are not mutually exclusive.
– The prior distribution in Bayesian analyses can behave like regularization: if a prior distribution for a parameter is centered on zero, it "pulls" the parameter toward zero, similar to the lasso.

Lecture 12: SOME NOTES IN CLOSING / WHERE TO FROM HERE?

Some Notes in Closing
This course was just a survey of different methods: you will want to learn more if you are using these techniques in your research.
Don't be afraid to Google!
– You are reaching the point where your classroom education only takes you so far; you need to learn the rest!
– It's important to know how to find the information or answers you need. There are tons of YouTube videos available on how to perform statistical analyses.

Some Notes in Closing
This material is often challenging.
– Don't feel pressured to understand it right away! It takes repetition, concentration, and a lot of thinking to really understand this content.
Use visualizations when you can!
– Even the brightest statisticians use visualizations pretty extensively.
Use simulations (if you can; this is easier in R)!
– Are you in doubt as to whether you've broken the assumptions or can use a technique? Simulate a bunch of fake datasets (~1,000+) with some given assumptions (maybe the null hypothesis is true), perform the technique on each of them, and check the Type 1 error rate. If it's much higher than it should be, don't use the technique! A sketch of this idea follows below.
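Here is a minimal Python sketch of that simulation recipe (the same idea carries over to R). The scenario is invented purely for illustration: both groups are drawn from the same skewed population, so the null hypothesis is true, and we ask how often a two-sample t-test rejects it.

```python
import numpy as np
from scipy import stats

# A minimal sketch of the simulation recipe above (scenario invented for
# illustration): generate many fake datasets in which the null is true,
# run the analysis on each, and count how often p < .05.
rng = np.random.default_rng(3)
n_sims, n_per_group, alpha = 2000, 30, 0.05

false_positives = 0
for _ in range(n_sims):
    a = rng.exponential(scale=1.0, size=n_per_group)   # fake dataset, null true
    b = rng.exponential(scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(a, b)                        # perform the technique
    false_positives += p < alpha

print(f"Estimated Type 1 error rate: {false_positives / n_sims:.3f}")
# If this were much higher than alpha, the technique would not be safe here.
```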