Inferential Statistics PDF
Summary
This document provides an overview of inferential statistics, focusing on sampling error, hypothesis testing, and sampling distributions. The document explains the concepts and processes using examples and illustrations.
Full Transcript
○ Sampling error = variability due to chance. It indicates that the numerical value of a sample statistic will probably be in error (it will deviate from the existing parameter). I.e., when I have a population with a mean of 50 and I take ten random people and calculate their mean, it will be 49; another ten people would have a mean of 51, etc. (see the short R sketch after this block).
○ Hypothesis testing - we need to know when the ambiguity in our data is worth our attention.
   - Hypothesis = a statement about parameters in the population. I.e., is it possible that the variability in our data is due to chance, or not?
   - We use sampling distributions - they tell us what degree of sample-to-sample variability we can expect by chance as a function of sampling error. A sampling distribution is a distribution of MEANS, not of specific scores: the distribution of values obtained for a statistic over repeated sampling; a distribution of a statistic derived from all possible samples of a specific size from a population. (A sample distribution, by contrast, is the distribution of values within one sample.)
   - Is the variability in our data due to the independent variable, and would we obtain the same differences if we were to take the samples from populations with the same mean? We calculate the probability of obtaining the same sample difference if the population means were equal - if that probability is low, the differences are due to the independent variable.
   - The whole process:
      - Use the null hypothesis (which generally contradicts what we hope to show). We can never prove something to be true, but we can prove something to be false. It provides us with a starting point - we state that nothing is going on in the population.
      - What happens when we need to interpret a nonrejection? We probably do not have enough data; we should accept the null hypothesis as true only for the time being, until we get better data.
      - The p-value = the probability of finding this test statistic, or a more extreme one, if the null hypothesis is true. How special is my sample? How different is it from "nothing is going on"? A lower p-value means more evidence against H0: the more unlikely this kind of sample is if H0 is true, the more probable it is that H0 is not true.
      - Decision making - we must decide whether a certain probability is sufficiently unlikely to cause us to reject H0, using conventions: reject if p is less than or equal to 0.05 (or 0.01), i.e., using rejection levels/significance levels. (Saying that a difference is statistically significant at the .05 level means that a difference that large would occur less than 5% of the time if the null were true.)
      - Critical value - the value that marks the boundary of the rejection region.
   - Rejecting hypotheses:
      - Type I error - rejecting H0 when it is actually true. Its conditional probability is designated alpha (the cutoff level).
      - Type II error - failing to reject H0 when it is false and H1 is true. Its probability is designated beta. The problem is that we never know HOW wrong H0 is.
      - Reducing the Type I error rate increases the Type II error rate.
      - Power - the probability of rejecting H0 when it is actually false.
   - One-tailed/directional test - rejecting H0 only for the lowest (or only the highest) mean differences.
   - Two-tailed/nondirectional test - rejecting H0 for both the highest and the lowest outcomes. More common, because:
      - Investigators usually have no idea which way the data will lean.
      - Investigators are fairly sure the data will come out one way but want to be safe.
      - One-tailed tests are not really definable.
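A minimal R sketch of the sampling-error idea above. The population mean of 50 and the samples of ten come from the example; the normal shape and standard deviation of the population are assumptions made purely for illustration:

   # Sampling error: sample means scatter around the population mean by chance
   set.seed(42)
   population <- rnorm(100000, mean = 50, sd = 10)   # assumed population with mean 50

   # Draw repeated random samples of ten people and record each sample mean
   sample_means <- replicate(1000, mean(sample(population, size = 10)))

   head(round(sample_means, 1))   # e.g. 49.3, 51.2, ... each deviates from 50
   sd(sample_means)               # approximates the standard error, 10 / sqrt(10)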
What does it actually mean to reject the null hypothesis?
   - We usually confuse the probability of the hypothesis given the data with the probability of the data given the hypothesis - these are different conditional probabilities. When we obtain a probability of 0.045, that is the probability of the data given that H0 is true, not the probability of H0 being true given the data.
   - Alternative approach: instead of starting from a null hypothesis to be rejected in the first place - we sometimes just don't have the data to allow us to draw conclusions about which mean is larger. Instead of saying the null hypothesis is supported/rejected, we simply say we don't know. Do not specify which tail we are testing; just collect data and determine whether the result is extreme in either tail. We cannot make an error by rejecting the null because it is not there - the only error we can make is getting the direction of the actual relationship reversed.
○ Standard error - the standard deviation of a distribution of differences between sample means. It reflects the variability we would expect to find in the values of that statistic over repeated trials.
○ Sample statistics - describe characteristics of samples.
○ Test statistics - associated with specific statistical procedures and have their own sampling distributions.
○ Analyzing measurement data - we either focus on differences between groups or on the relationship between variables.
○ Sampling distribution of the mean
   - The central limit theorem: we use it to infer what the sampling distribution looks like based on our sample alone.
   - A simple random sample should be a subset of the population; each individual should be picked at random and should have an equal probability of being picked.
   - The standard deviation of the sampling distribution of the mean is always SMALLER than the standard deviation of the population distribution (we are taking MEANS, so there is less variability - we are averaging everything out).
   - E.g. we have a uniform (rectangular) distribution - every value between 0 and 100 is equally likely - with 50,000 observations. Mean = one half of the range (50); sd = the range divided by the square root of 12 (28.87); variance = 833.33. Draw 5,000 samples of size 5 -> plotting the means results in an approximately normal distribution with values close to the ones mentioned above. More samples with more observations per sample result in a more normal-looking distribution.
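A short R sketch of the uniform-distribution demonstration just described; the 50,000 observations, the 0-100 range, and the 5,000 samples of size 5 are taken from the notes:

   # Central limit theorem demo: means of samples from a uniform population
   set.seed(1)
   population <- runif(50000, min = 0, max = 100)   # rectangular distribution

   mean(population)   # close to 50 (= range / 2)
   sd(population)     # close to 100 / sqrt(12) = 28.87

   # 5000 samples of size 5; their means pile up in a roughly normal shape
   sample_means <- replicate(5000, mean(sample(population, size = 5)))
   hist(sample_means)   # approximately normal, centered near 50
   sd(sample_means)     # near 28.87 / sqrt(5): smaller than the population sd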
○ Testing hypotheses about means
   - Knowing the standard deviation:
      - Use the central limit theorem to obtain the sampling distribution when H0 is true.
      - Calculate the z score from the sample mean, the population mean under the null, and the standard error (sd divided by the square root of n).
      - We assume that we are sampling from a normal population - the test works even when the tested population is not normal, provided the sample size is big enough.
      - If H0 is true, we expect the z score to be close to zero.
   - Not knowing the standard deviation -> using t tests:
      - Use the sample standard deviations.
      - The sampling distribution of s^2 - an unbiased estimate of the variance. Its shape is positively skewed for small samples -> we cannot use the same table as for z scores.
      - Using s^2 leads to a sampling distribution called Student's t distribution.
○ Bootstrapping - sample with replacement from the obtained data -> the effect of sampling from an infinite population of exactly the same shape as my sample.
○ Confidence interval on the population mean
   - Point estimate - one specific estimate of a parameter.
   - We want to find confidence limits on the mean that will be the borders of the confidence interval: how big/small could the value of the mean be without causing us to reject H0 if we ran a t test on the obtained sample mean? Plug in the positive or the negative t value based on the critical two-tailed alpha -> solve for the mean to get the upper and the lower confidence limit.
○ Extreme cases with small sample sizes (that we cannot treat as a population) - using a different formula - looking up t values for all specific scores.
○ One-sample t-test: hypothesis -> sampling distribution -> test statistic -> p-value -> statistical conclusion -> substantive conclusion.
○ Paired/matched samples t-test (see the R sketch after this block)
   - Dependent observations.
   - Related samples = per person, measured at different times (e.g. scores before and after treatment).
   - Matched samples = per pair of individuals, paired on specific properties (e.g. treatment of people matched on gender and age).
   - We test the difference between the two dependent measurements. The difference score is di = xi1 - xi2.
   - n = the number of pairs of measurements (the number of difference scores; e.g. we measure two things on 12 people -> n is 12).
   - Usually H0: delta = 0, so we leave it out.
   - Confidence intervals - solve the same equation for the mean using the +/- t value based on the degrees of freedom.
   - Cohen's d - an effect size estimate (how many standard deviations the new measure lies from the first measure): (population mean after - population mean before) / standard deviation of the population before.
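A minimal R sketch of the paired t-test and the Cohen's d estimate just described. The 12 before/after scores are invented for illustration:

   # Paired t-test: one group of 12 people measured before and after treatment
   before <- c(12, 15, 11, 14, 13, 16, 12, 15, 14, 13, 16, 15)   # hypothetical scores
   after  <- c(14, 16, 13, 15, 15, 18, 13, 17, 15, 15, 18, 16)

   t.test(after, before, paired = TRUE)   # tests H0: mean difference = 0

   # Cohen's d, using the sd of the first (before) measure as the yardstick
   (mean(after) - mean(before)) / sd(before)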
○ Independent samples t-test
   - When we test two independent groups and their means differ, the question is: is the difference sufficiently large to justify the conclusion that the two samples were drawn from different populations? We don't know the individual scores.
   - The variance sum law - the variance of a sum or difference of two independent variables is equal to the sum of their variances.
   - Confidence intervals and effect size: Cohen's d again, but using the pooled sd (sp) in the denominator.
   - But when the variances in the two groups are unequal, the sampling distribution of the difference in means does not follow an ordinary t-distribution.
   - Use Levene's test for equality of variances - it tests the difference between the two variances; its H0 says there is no difference.
   - We calculate the weighted average of the variances (the pooled variance estimate), which we then use in the denominator instead of the individual sd's. The degrees of freedom are then df = n1 + n2 - 2.
   - But pooling is not good when the groups have very unequal amounts of data -> Welch's t-test - the sampling distribution approximates a t-distribution but with different degrees of freedom. In R this is the default. Manually, a conservative shortcut: take the smaller of the two sample sizes minus one (with groups of 17 and 15, the smaller is 15, and 15 - 1 = 14). (See the R sketch after this block.)
○ Combined t-test - when we have to deal with missing data: use a matched-samples test for the cases where I have both variables + an independent t-test for the cases where I have only one variable -> combine them and use special tables.
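A small R sketch contrasting the pooled-variance test with Welch's correction. The two groups are invented (their sizes, 17 and 15, echo the example above); in base R, t.test uses Welch by default and var.equal = TRUE gives the pooled test:

   # Independent samples t-test: pooled variance vs Welch's correction
   set.seed(7)
   group1 <- rnorm(17, mean = 100, sd = 15)   # hypothetical group, n1 = 17
   group2 <- rnorm(15, mean = 108, sd = 25)   # hypothetical group, n2 = 15

   t.test(group1, group2, var.equal = TRUE)   # pooled: df = 17 + 15 - 2 = 30
   t.test(group1, group2)                     # Welch: fractional df
   # Conservative manual shortcut from the notes: df = min(17, 15) - 1 = 14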
○ Probability
   - Analytic view - what is the probability that you will pull a blue ball out of a sack of balls?
   - Frequentist view - drawing balls over and over again and noting the colors. Sampling with replacement - each ball is replaced before the next one is drawn. Probability = the limit of the relative frequency of occurrence of the desired event as the number of draws increases.
   - Subjective probability = an individual's subjective belief in the likelihood of the occurrence of an event.
   - Terminology and rules:
      - The basic bit of data for a probability theorist - the event (e.g. the occurrence of pulling a blue ball).
      - Independent events - the occurrence or nonoccurrence of one has no effect on the other.
      - Mutually exclusive - the occurrence of one event precludes the occurrence of the other.
      - Exhaustive - a set of events that includes all possible outcomes.
   - Laws:
      - Additive rule: given a set of mutually exclusive events, the probability of the occurrence of one event or another is the sum of their separate probabilities.
      - Multiplicative rule: the probability of the joint occurrence of two or more independent events is the product of their individual probabilities - e.g. if I want to draw a blue ball and then a green ball.
      - Joint probabilities: the probability of the co-occurrence of two or more events (for independent events, the multiplicative rule above). Two random variables are independent if the conditional probabilities are equal across all conditions and equal to the marginal probabilities (the general probability of one of the things alone; e.g. the probability of being Dutch given that you have a partner would have to equal the overall probability of being Dutch - which it does not).
      - Conditional probabilities: the probability that one event will occur given that some other event has occurred. This is what we have been doing with H0: what is the probability of obtaining a score more extreme than this one, given that H0 is true?
      - The general law: the chance that both events happen at the same time is the product of the probability of one event times the probability of the other event given the first.
○ Categorical data consist of frequencies of observations that fall into each of two or more categories - the data are counts of observations in each category.
○ A single event can have two or more possible outcomes, or we can have two variables and test null hypotheses concerning their independence.
   - Using the chi-square test:
      - It is a one-tailed distribution.
      - We use it to test whether two categorical variables are independent or not. We take a sample from one population; the variables are categorical (nominal or ordinal).
      - Two meanings of "chi-square": a particular mathematical distribution that exists in and of itself, without any necessary referent in the outside world; and a statistical test whose resulting statistic is distributed approximately as a chi^2 distribution.
      - The denominator of its density involves a gamma function (a generalized factorial). The distribution has only one parameter (k - we will use it as our degrees of freedom); mean = k, variance = 2k.
   - The goodness-of-fit test - asks whether the deviations from what would be expected by chance are large enough to lead us to conclude that responses weren't random.
      - Comparing the observed and expected frequencies: observed frequencies - what we actually observe; expected frequencies - what we would expect if H0 were true. The statistic is chi^2 = sum((O - E)^2 / E).
      - We obtain the value of chi^2 and refer it to the chi^2 distribution to determine the probability of a value at least this extreme if the null hypothesis of a chance distribution were true. Using the tabled chi^2 distribution and degrees of freedom again (the degrees of freedom come from how many categories the experiment has; the game rock, paper, scissors has three -> 2 df).
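A quick R sketch of the goodness-of-fit test using the rock-paper-scissors example from the notes. The counts are invented; with three categories there are 2 df:

   # Goodness-of-fit: are rock/paper/scissors choices random (equally likely)?
   observed <- c(rock = 30, paper = 21, scissors = 9)   # hypothetical counts, n = 60
   chisq.test(observed, p = c(1/3, 1/3, 1/3))           # expected = 20 per category
   # chi^2 = sum((O - E)^2 / E), referred to a chi-square distribution with 2 df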
○ What if we are interested in asking whether two or more variables are independent of one another - i.e., whether the distribution of one variable is contingent on/conditional on a second variable?
   - Using the contingency table (see the R sketch after this block).
   - Expected frequencies - what we would expect if the two variables forming the table were independent. Obtained by multiplying together the totals of the row and column in which the cell is located (the marginal totals) and dividing by the total sample size: Eij = (Ri x Cj) / N, where Eij is the expected frequency for the cell in row i and column j, and Ri and Cj are the row and column totals.
   - Then we calculate chi^2 as before, using the E values we just calculated.
   - The degrees of freedom are different: df = (R - 1)(C - 1), where R and C are the numbers of rows and columns.
   - We assume independence of observations - the order of observations doesn't matter and doesn't affect the outcome.
   - Nonoccurrences - negative responses. They can significantly affect whether we reject H0.
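A compact R sketch of the contingency-table test just described. The 2 x 2 counts are invented; chisq.test computes the expected frequencies Eij = Ri x Cj / N and uses (R - 1)(C - 1) df:

   # Chi-square test of independence on a 2 x 2 contingency table
   tab <- matrix(c(33, 251, 33, 508), nrow = 2,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("yes", "no")))   # hypothetical counts
   result <- chisq.test(tab, correct = FALSE)
   result             # chi^2 statistic with (2 - 1) * (2 - 1) = 1 df
   result$expected    # the Eij = row total * column total / N values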
○ Measurement of agreement: kappa
   - Not based on chi-square, but uses contingency tables.
   - Measures interjudge agreement; used to examine the reliability of ratings.
   - Calculate the expected frequencies for each of the diagonal (agreement) cells, assuming the judgements are independent: kappa = (sum of Fo - sum of Fe) / (N - sum of Fe), where Fo = observed diagonal frequencies and Fe = expected diagonal frequencies.
○ Problems with p values
   - Even when the difference between H0 and the true population value is tiny, it will be significant if the sample is big enough. A significant result doesn't mean there's a big difference/effect.
   - A low p value means we have found SOMETHING -> more publishing of results.
   - A high p value does not prove H0; it just marks an absence of evidence against it.
   - Hypotheses about the population are not random; they always depend on prior knowledge.
   - A p value never tells us how large the effect is, even when the test is significant.
○ Categorical variables and effect sizes
   - Not all statistically significant results are practically significant.
   - D-family measures: based on one or more measures of the differences between groups or levels of the independent variable. E.g. the probability of receiving a death sentence for a crime is higher if you are a person of color than if you are white.
   - The concepts of risks and odds:
      - Risk estimates - how likely someone is to experience a certain condition.
      - Risk difference - the proportion of people with the condition in one group minus the proportion with the condition in the other group. However, it depends on the overall level of risk (for a rare event/condition, the difference between two small proportions will itself look small).
      - Risk ratio/relative risk - the ratio of the two proportions.
      - Odds ratio - similar to the risk ratio, but: risk = the number having the condition / the total number of people; odds = the number having the condition / the number not having it. The odds ratio is better for retrospective studies - risks are future oriented.
   - R-family measures: represent a correlation coefficient between the two variables; also called measures of association.
      - Effect sizes for chi-square tests:
         - Phi - the correlation between two variables, each of which is a dichotomy (a dichotomy = a variable that takes on one of two distinct values). Rules of thumb: 0.1 small, 0.3 medium, 0.5 large.
         - Cramer's V - for tables larger than 2 x 2; k is the smaller of R and C.
   - Prospective/cohort studies - treatments are applied and the future outcome is determined; also called randomized clinical trials.
   - Retrospective studies/case-control designs - selecting people who have a condition and looking into their past to see whether they all have something in common.
○ Effect sizes for t-tests
   - Better than just being satisfied with a t test or a confidence interval.
   - D-family measures: Cohen's d.
      - Independent test - the sd of one group (whichever is better for the experiment: do we have a control group? Then use its sd, since it's the "normal" variation we expect in "normal" data) or of both samples (if both - the pooled sd, or the square root of the mean variance).
      - Paired-samples test - the sd of one measure, of both, or of the difference scores -> difference scores are better for power analysis.
      - For one-sample tests the numerator is the difference between the sample mean and mu; for independent tests, the difference between the sample means.
   - For any t-test -> eta squared: gives the proportion of variance accounted for; values between 0 and 1.
○ Absolute effect sizes - when it is easy to interpret what a higher/lower score means. What I actually found.
○ Relative effect sizes - when we have awkward scales and a raw difference practically speaking doesn't mean anything (what does confidence three points higher mean?) -> we report how many sd's away it is. All the standardized measures - Cohen's d, etc.
○ Confidence intervals
   - With alpha = 0.05 we can expect about 95% of intervals constructed this way to bracket the value we are estimating.
   - More precise when there is less variation, when the sample is larger, and when the coverage is lower (a smaller % means a higher alpha and therefore a smaller critical t value).
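A brief R illustration of those precision claims (the data are made up; note how the interval narrows as n grows and as the coverage drops):

   # Confidence interval width depends on variation, n, and coverage
   set.seed(3)
   x_small <- rnorm(10, mean = 100, sd = 15)    # hypothetical scores, n = 10
   x_large <- rnorm(100, mean = 100, sd = 15)   # same population, n = 100

   t.test(x_small)$conf.int                     # 95% CI from a small sample: wide
   t.test(x_large)$conf.int                     # larger n -> narrower interval
   t.test(x_large, conf.level = 0.90)$conf.int  # lower coverage -> narrower still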
○ Power
   - What is the probability of a successful replication? The percentage of outcomes exceeding the critical value.
   - A function of alpha (the probability of a Type I error), the true alternative hypothesis, the sample size, and the particular test to be employed.
   - Power depends on the degree of overlap between the sampling distributions under H0 and Ha. We use effect size as a measure of the degree to which mu1 and mu0 differ in terms of the standard deviation of the parent population.
   - Estimating the effect size (d):
      - Prior research - using previous studies to make guesses at the values we might expect for the differences in means and for the standard deviation.
      - Personal assessment of how large a difference is important - we know the difference in means, so we only need to estimate the standard deviation from other data.
      - Special conventions - the values proposed by Cohen -> rules of thumb.
   - Combining the effect size with the sample size: the delta statistic, a noncentrality parameter - i.e., how wrong the null hypothesis is. Larger than normal values of t are to be expected because H0 is false, but we will occasionally obtain small values by chance -> beta, the probability of a Type II error. Power = 1 - beta. E.g., delta for a one-sample t test = d * sqrt(n).
   - Increasing power:
      - Raising the alpha level.
      - Increasing the sample size - estimating a required sample size: state what power we want, look up the delta needed for it in the tables, and solve for n. A power of 80% still means a 20% chance of committing a Type II error.
      - Influencing the effect size d - the larger the effect, the larger the power.
○ Nonparametric tests/distribution-free tests
   - Do not rely on parameter estimation - parametric tests usually assume a normal distribution and independent observations.
   - For us to use the central limit theorem we need a large sample (the larger it is, the more normal the sampling distribution looks). And are we really taking a random sample, or do the cases have something in common?
   - Means are strongly affected by extreme scores -> but we can use medians.
   - We can either transform our data or use a different test that doesn't require the assumption.
      - Transformation using log: all the extremes move closer to the mean; the reference line (abline) and H0 now refer to the log of the mean, not the actual mean.
      - Using sqrt: H0 now concerns the square root of the mean.
   - Using nonparametric tests (see the R sketch after this block):
      - Wilcoxon's rank-sum test (instead of the independent t-test):
         - Does not assume the data have some specific shape. Use it when the measurement level of the data is numerical but n is not large, or when the measurement level is ordinal.
         - Replace all scores with ranks from 1 (low) to n (high). Sum the ranks in the separate groups, giving two numbers for the two groups (but we only need one sum). Compare the sum (W) with its expected value mu_W under H0, calculate the standard deviation of W, and do a z test for the difference between W and mu_W.
         - Under the assumption that the two distributions have the same shape.
         - Effect size - the correlation r = |z| / sqrt(n); ranges from 0 to 1.
      - Wilcoxon's signed-rank test (instead of the paired t-test):
         - One group of people, data before and after treatment.
         - Again, under the assumption that the two distributions have the same shape, we can interpret a significant result as a difference between medians.
         - Uses the ranks of the difference scores; otherwise the same.
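A minimal R sketch of both Wilcoxon tests via base R's wilcox.test; the scores are invented:

   # Wilcoxon rank-sum test: nonparametric alternative to the independent t-test
   group1 <- c(4, 7, 9, 10, 14, 15)   # hypothetical scores
   group2 <- c(1, 2, 3, 5, 6, 8)
   wilcox.test(group1, group2)        # ranks all scores, then compares rank sums

   # Wilcoxon signed-rank test: nonparametric alternative to the paired t-test
   before <- c(10, 12, 9, 14, 11, 13)
   after  <- c(12, 13, 9.5, 17, 12.5, 15.5)
   wilcox.test(after, before, paired = TRUE)   # ranks the difference scores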
      - Using permutation tests:
         - Two groups of data: calculate the median of each and the difference between them.
         - Randomize the group labels, so now some people from group B are counted with group A, etc. (if there is no difference between these groups, swapping the people in the groups should not affect the statistic).
         - Calculate the difference in medians again, and repeat for all the possible combinations of the data.
         - Count how many of all the differences are as extreme as the one found first (BOTH POSITIVE AND NEGATIVE). Divide that count by the number of all the possible differences in medians -> that's our p value.
         - We can do this with other statistics as well (like the mean) and obtain the sampling distribution.
○ Distribution free -> more valuable to us.
○ Resampling tests
   - Base their conclusions on the results of drawing a large number of samples from some population under the assumption that the null hypothesis is true, and then comparing the obtained result with the resampled results. Bootstrapping and randomization tests.
   - They are good because they do not rely on any assumptions concerning the shape of the sampled population (such as normality).
   - More sensitive to medians than to means.
   - However - lower power relative to the corresponding parametric test: they require more observations to reach the same amount of power.
○ Bootstrapping (see the R sketch below)
   - Used when we are interested in medians, whose sampling distribution and standard error cannot be derived analytically.
   - Most often used for parameter estimation, especially of variation.
   - Resample the data with replacement: we assume the population is distributed exactly as our sample and draw as many observations as we need, with replacement, from that sample (as if we were sampling from an infinitely large population).
○ Wilcoxon rank-sum test
   - For two independent samples. H0: the two samples were drawn at random from identical populations. Sensitive to population differences in central tendency -> when we reject H0, we interpret it as the two distributions having different central tendencies.
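A short R sketch of the bootstrap idea above, estimating the standard error of a median by resampling with replacement; the data vector is made up:

   # Bootstrapping the median: resample with replacement from the observed data
   set.seed(11)
   scores <- c(3, 5, 6, 6, 7, 9, 12, 14, 15, 21)   # hypothetical sample

   boot_medians <- replicate(10000, median(sample(scores, replace = TRUE)))

   median(scores)                            # the point estimate
   sd(boot_medians)                          # bootstrap standard error of the median
   quantile(boot_medians, c(0.025, 0.975))   # a simple percentile 95% CI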