Statistical Inference
Universität Wien
Georgios Halkias
Summary
This document provides an overview of statistical inference, explaining how to analyze sample data to make inferences about a population. It covers concepts like probability distributions, confidence levels, and sampling error. The text also includes examples and calculations to illustrate the concepts.
Full Transcript
Statistical inference
…we analyze sample data to make inferences about the population: derive estimates and test hypotheses about specific population characteristics (parameters), contrasts and comparisons, and associations and relationships, and fit the model (statistical testing).

Statistical model
► Statistically model the hypothesis using a certain test statistic.
► Get a random/representative sample.
► Summarize the sample data with your test statistic.
► Use the probability distribution of the test statistic to make inferences about the population.

Probability (frequency) distribution
…a function that describes how likely different values of a random variable are. The possible values of this variable are based on the underlying probability distribution.
[Figure: bar chart of the number of times respondents got sick since the COVID-19 outbreak (n = 100), split into mask wearers (0.95 vs. 0.05) and non-wearers (0.59 vs. 0.41). Refers to discrete probabilities; continuous probability distributions follow the same reasoning.]

Probability distribution: normal and standard normal distribution
Every normal distribution (regardless of what the variable represents…) has these properties: mean = median = mode, symmetric, and the 68–95–99.7 empirical rule. A normal distribution can be standardized: the standard normal has mean = 0 and SD = 1, in both population and sample.

Example… How likely is it that students score less than 58 points? Mean = 55 pts, SD = 5 pts, n = 60, so z = (58 − 55)/5 = 0.6.
50.00% (half below the mean) + 22.57% = 72.57%
► ~73% probability that any observed grade is less than 58 pts (or z < 0.6).
► ~73% of the grades will be below 58 pts.
► ~27% of the grades are 58 pts or higher.

Statistical inference
Wait a minute! I operate on a sample, not the population. How confident can I be that what I find applies to the whole population?
Sample statistic = known (based on the empirical sample); population parameter = unknown. Using a sample always implies sampling error. Given a level of confidence, this error can be calculated, allowing us to draw inferences.

Parameter estimation
We collect data to get a sample statistic (e.g., mean, proportion, etc.) with which we estimate the corresponding population parameter.
[Figure: a population of N = 10 (μ = 21.20, which nobody knows) with individual values 19, 18, 20, 19, 22, 18, 21, 18, 22, 19 shown in the plot; three samples of size n = 3 are drawn from it, with sample standard deviations S_A = 2.309, S_B = 1.528, S_C = 1.528.]

Sampling error (margin of error)
Population parameter (unknown) = sample statistic (known) ± sampling error.
► The population parameter is fixed, but the sample statistic varies across samples.
► Sampling error may overestimate or underestimate the parameter.
The range of values due to sampling error can be theoretically estimated using:
(a) the variability of the sample statistic (e.g., the mean) in the population, i.e., the standard error (SE);
(b) the critical value in the probability distribution that corresponds to our confidence level/error rate.

Parameter estimation – Sampling distribution & standard error
Standard deviation (S): variability of observations from the sample mean.
Standard error (σ_x̄): variability of means across samples drawn from the same population = the standard deviation of the sampling distribution.
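To make the standard error concrete, here is a minimal numpy simulation (my illustration, not part of the slides) with an arbitrary, deliberately non-normal population: the standard deviation of the simulated sample means comes close to the s/√n approximation introduced below.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical, skewed population (any shape works for n > 30, per the CLT).
population = rng.exponential(scale=3.0, size=100_000)

n = 50              # sample size
n_samples = 10_000  # number of repeated samples

# Draw repeated samples and record each sample mean.
means = np.array([rng.choice(population, size=n).mean()
                  for _ in range(n_samples)])

# Standard deviation of the sampling distribution (the standard error)...
print("empirical SE:", means.std(ddof=1))

# ...approximated from a single sample as s / sqrt(n):
one_sample = rng.choice(population, size=n)
print("estimated SE:", one_sample.std(ddof=1) / np.sqrt(n))
```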
If the population is normally distributed, the sampling distribution of the mean is also normally distributed, with a mean equal to the population mean (e.g., μ = 3). For large enough sample sizes (n > 30; see the central limit theorem, CLT), the sampling distribution of the statistic will be approximately normally distributed regardless of the population distribution. The standard deviation of the sampling distribution (the standard error) can then be approximated as:
σ_x̄ = s / √n

Parameter estimation – Confidence level
Confidence level: how often are estimations expected to capture the true parameter? → the frequency (%) of all possible sample estimations that are expected to include the true population parameter.
The significance level is the probability of rejecting the null hypothesis when it is actually true; it represents the threshold for making a Type I error. Specifying a confidence level also determines how much "risk" (α, alpha) you are willing to take — the likelihood that your estimation is wrong. Confidence level = 1 − α, where α is the significance level (significance level = risk = alpha).
Typical confidence levels are 95%, 99%, 99.9%, with corresponding "risk" levels α = 5%, 1%, 0.1%. A 95% confidence level corresponds to a 5% significance level, meaning we are 95% confident in our results, with a 5% risk of error. The risk level α (Type I error rate, error rate, significance level) is the complement of the confidence level: the probability of incorrectly rejecting a true null hypothesis. If we want to estimate with a 95% confidence level, we also allow 5% of "wrong estimations."
Confidence and significance level (α) indicate how "strict" we are and specify the critical (cut-off) values on the probability distribution.

Parameter estimation: critical values and confidence/significance level
Two-tailed tests are used when the direction of the effect is not predicted; they check for deviations in both directions (higher or lower). One-tailed tests are used when the direction of the effect is predicted; they check for deviations in only one direction (either higher or lower).
When α is low, we need more evidence to reject the null hypothesis; this means setting our critical values further away from the mean, to reduce the probability of making a Type I error.
No expectation of directionality → 2 tails.
[Figure: standard normal curve with 95% in the middle and 2.5% in each tail; critical z-scores −1.96 and +1.96.]

Parameter estimation – Confidence interval
The range of values due to sampling error can be theoretically estimated using:
(a) the variability of our sample statistic (e.g., the mean), that is, the standard error (SE);
(b) the critical value in the probability distribution that corresponds to our confidence level (for 95% confidence, the two-tailed critical value z_(α/2) with α = 5%).
Sample statistic ± critical value × standard error → we end up with a lower and an upper limit for our statistic: the confidence interval.
95% confidence interval: if you get repeated samples, for 95% of them the confidence intervals will contain the true value of the population mean → we can be 95% confident that this range of values contains the true population parameter.
[Figure: confidence intervals from repeated samples plotted around the true population parameter.]
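A minimal sketch of this interval in Python, using scipy's normal quantile for the critical value; the numbers fed in below are Sample A from the slides, so the output should reproduce its interval.

```python
import numpy as np
from scipy import stats

def z_confidence_interval(mean, sd, n, confidence=0.95):
    """CI for a mean: statistic +/- critical value * standard error."""
    alpha = 1 - confidence
    z_crit = stats.norm.ppf(1 - alpha / 2)  # 1.96 for 95%, two-tailed
    se = sd / np.sqrt(n)
    return mean - z_crit * se, mean + z_crit * se

# Sample A from the slides: mean 20.67, SD 2.309, n = 3
print(z_confidence_interval(20.67, 2.309, 3))  # ~ (18.06, 23.28)
```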
Parameter estimation: critical values and confidence intervals
The CIs around the sample mean account for the "uncertainty" of our estimation. I use the two-tailed critical value (z_(α/2)) because the true value may be "higher or lower" (two-tailed test).
Sample A: 20.67 ± 1.96 × (2.309/√3) → 18.06 < μ < 23.28
Sample B: 19.33 ± 1.96 × (1.528/√3) → 17.60 < μ < 21.06
Sample C: 19.66 ± 1.96 × (1.528/√3) → 17.93 < μ < 21.39
Q1. Does the confidence interval include the population value (i.e., 21.20)?

What is a "hypothesis"?
A hypothesis is a prediction about the state of the world. It is a scientific statement that must be able to be empirically disproved, i.e., be falsifiable: testable and able to be disconfirmed based on evidence. …translated into relationships between variables that can be empirically measured (in a valid and reliable manner).
Hypothesis: Being in a bad mood makes people spend more money.
Independent variable (predictor variable) → mood (good/bad). Dependent variable (outcome/criterion variable) → money spending.
Which of the following statements represents a hypothesis and which one doesn't?
► Small and large companies evoke different levels of consumer trust.
► Psychotherapy leads to improved well-being.
► Most people who commit suicide regret doing so.
► If one had studied medicine, they would make more money.
► Dreaming duration for males is longer than that for females.

Types of hypotheses
Directional hypotheses relate to 1-tail testing: the researcher indicates a priori the direction (positive or negative) of the expected relationship. E.g., global brands evoke higher perceptions of quality than local brands (perceived quality of global brands > perceived quality of local brands); advertising creativity increases consumer attitudes.
Non-directional (exploratory) hypotheses relate to 2-tail testing: the researcher expects an effect but has no a priori expectation about its direction. E.g., global and local brands evoke different perceptions of quality (perceived quality of global brands ≠ perceived quality of local brands); advertising creativity influences consumer attitudes.
One-tailed test example: a pharmaceutical company wants to test whether a new drug increases the recovery rate of patients more than the current drug. The hypothesis is directional (an increase), so it is a right-tailed test — H1: recovery rate > current rate.
Two-tailed test example: if the company wants to test whether the new drug has a different effect on the recovery rate (it could be either better or worse), the hypothesis is non-directional — H1: recovery rate ≠ current rate.
[Figure: conceptual model in which brand globalness relates positively to quality perception (H1+), and quality perception relates positively to willingness to pay (H2+).]

Types of hypotheses (…they come in pairs)
Alternative hypothesis (H1): our prediction/expectation of how things in the real world are; usually, that there is an effect (e.g., a difference or a relationship) in the population.
…each alternative hypothesis has a corresponding null hypothesis (H0), which is the opposite of (it "nullifies") H1 and usually states that no effect exists. The null (H0) together with the alternative hypothesis (H1) account for all potential outcomes regarding the relationship being studied.
Example: H1: Heavy metal fans have above-average IQ. H0: Heavy metal fans do not have above-average IQ.

Types of hypotheses (…they come in pairs, but why?)
We NEVER prove the alternative hypothesis using statistical testing; we ONLY collect evidence against H0! Rejecting H0 doesn't prove H1 (it merely supports it). Failing to reject H0 doesn't prove H0 (it merely maintains it).
NHST considers the chances of observing our sample data (results), assuming that the null hypothesis is true.
► How likely is it to find 20 (out of 100) metalheads with IQ above average, if…?
► How likely is it to find 95 (out of 100) metalheads with IQ above average, if…?
Hypothesis testing is based on the probability (p-value) of obtaining such sample data, or more extreme, if, hypothetically speaking, the null is true.

The process behind NHST
► Formulate the alternative (H1: there is some effect) and the null (H0: there is no effect) hypothesis.
► Model H1 (in the form of a test statistic) to later see how well it fits the data.
► Specify an acceptable risk/error rate or significance level (α = 1 − confidence level).
► To determine how well the hypothesized model fits the data, calculate the test statistic and use its probability distribution to find the probability (p-value) of getting this "model" (test statistic) or a more extreme one, assuming the null hypothesis were true.
► If the probability associated with the test statistic (p-value) is less than the significance level (α), reject H0 in favor of H1 (the effect is statistically significant).

Test statistic
A numerical summary of the dataset that "models" the expected effect (hypothesis), defined by a formula/equation that depends on the statistical test applied: z-test (one sample), t-test (two samples), ANOVA, chi-square test. The probability distribution of a test statistic can become known; thus, we can calculate how frequently different results occur.

Type I and II error
No statistical test is certain; there is always a chance of drawing incorrect conclusions. Decision (based on the sample) vs. reality in the population:
► H0 true (there is no effect), fail to reject H0 (no effect found) → correct decision (1 − α).
► H0 true, reject H0 (effect found) → Type I error (α).
► H0 false (there is an effect), fail to reject H0 → Type II error (β).
► H0 false, reject H0 → correct decision (1 − β).
α → the likelihood of making a Type I error (false positive: finding an effect that does not exist).
β → the likelihood of making a Type II error (false negative: not finding an effect that exists).
[Figure: falsely rejecting H0 vs. falsely "accepting" (i.e., failing to reject) H0.]

Significance level (α, alpha)
Would you put an innocent person in jail or fail to sentence a guilty one? With H0 = innocent: finding an innocent person not guilty is a correct decision (1 − α); finding an innocent person guilty is a Type I error (α, false positive); finding a guilty person not guilty is a Type II error (β, false negative); finding a guilty person guilty is a correct decision (1 − β).
▪ The maximum risk we are willing to take to reject a true null hypothesis (Type I error) is known as the significance level (α, alpha).
▪ The probability of our results ("model" or test statistic) is contrasted against the significance level (the maximum acceptable likelihood of a Type I error) to determine statistical significance.
▪ Usual (yet arbitrary) levels: α = .05 (5%, minimum), .01 (1%, strong) and .001 (0.1%, stronger).
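To make α tangible, a small simulation sketch (my illustration, not from the slides): when H0 is true by construction, the p-value falls below .05 in roughly 5% of repeated studies — exactly the Type I error rate we chose.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_studies, n = 10_000, 30

false_positives = 0
for _ in range(n_studies):
    # H0 is true by construction: both groups come from the same population.
    a = rng.normal(loc=100, scale=15, size=n)
    b = rng.normal(loc=100, scale=15, size=n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(false_positives / n_studies)  # ~0.05 = alpha
```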
Test statistic, critical value, and p-value
(…the probability distribution of test statistics under the null hypothesis is known; therefore, we can calculate how frequently different results occur.)
The probability associated with obtaining a test statistic (or a bigger one) is called the p-value. It shows how likely it is to get a test statistic at least as big as the one observed, if the null hypothesis is true (there is no effect).
► Assuming no effect, it is very unlikely (p ≤ α) that I would get these results (test statistic). I reject the null → there must be an effect: statistically significant results.
► Assuming no effect, it is likely (p > α) that I would get these results (test statistic). I cannot reject the null → there seems to be no effect: statistically non-significant results.
Statistical significance is determined by: p-value vs. α-level, or test statistic vs. critical value (of the test statistic).

Regions of rejection
One tail (directional): H1: x̄ < μ (negative relationship) or H1: x̄ > μ (positive relationship).
Two tails (non-directional): H1: x̄ ≠ μ (some relationship).
If |statistic| > |critical|, then p < α.
Typical α levels: .05 (5%), .01 (1%), .001 (0.1%).
Critical values are determined by your confidence/α level.

Example…
I believe that on average customers spend more than €18 in restaurants.
H1: Spending is higher than €18. H0: Spending is not higher than €18.
[Figure: population values (19, 18, 20, 19, 22, 18, 21, 18, 22, 19) with samples of size n = 3 and the standard deviation of each sample.]

Model H1 (using the appropriate test)
Compare against a fixed value: x̄ greater than 18 → z-test, 1-tail. H1: x̄ > 18; H0: x̄ ≤ 18.
A z-test for a sample mean indicates by how many standard errors the sample mean and the hypothesized (population) mean differ.
Note: if x̄ is greater than 18, then z > 0; if x̄ is equal to 18, then z = 0; if x̄ is smaller than 18, then z < 0.

Set the α (alpha) level for the z-test
For 5% α (and 95% confidence), the critical value of a z-test is z_critical = 1.645 (1.96 for 2-tail).
1-tail (x̄ > μ): 95% of the z values lie below (to the left of) z_critical = 1.645 (spend more than €18).
2-tail (x̄ ≠ μ): 95% of the z values lie within z_critical = ±1.96 (do not spend €18): z_critical = −1.96 (lower tail), z_critical = +1.96 (upper tail).
We have a directional hypothesis, so 1-tail testing is appropriate. Using the 2-tail critical value would make us more strict (as if α = 2.5%).

Test statistic (p-value) vs. critical value (α-level)
Calculate the test statistic and find its p-value, then compare with the critical value for the α-level (software makes the comparison automatically and gives you p for significance). Decision:
Sample A: z = 2.003 > 1.645 = z_critical, p ≤ .05 (α) → reject H0 → H1 accepted.
Sample C: z = 1.882 > 1.645 = z_critical, p ≤ .05 (α) → reject H0 → H1 accepted.

Significance level and statistical significance
→ same test (z-test), different research setting: Sample C mean = 19.66, z = 1.882; Sample A mean = 20.67, z = 2.003.
What if I want to be more confident (e.g., 97.5%)? What if I have a non-directional hypothesis (2-tail)? Then z_critical = 1.96.
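A sketch of this one-sample z-test in Python. scipy does not ship a one-sample z-test for summary data, so the formula is coded directly; the inputs are Samples A and C from the slides, so the z values should match those above.

```python
import numpy as np
from scipy import stats

def one_sample_ztest(mean, sd, n, mu0, tail="right"):
    """z = (sample mean - hypothesized mean) / standard error."""
    z = (mean - mu0) / (sd / np.sqrt(n))
    p = stats.norm.sf(z) if tail == "right" else 2 * stats.norm.sf(abs(z))
    return z, p

# H1: spending > 18 (directional, right-tailed)
for label, m, s in [("A", 20.67, 2.309), ("C", 19.66, 1.528)]:
    z, p = one_sample_ztest(m, s, n=3, mu0=18)
    print(f"Sample {label}: z = {z:.3f}, one-tailed p = {p:.4f}")
# z exceeds 1.645 in both cases -> p <= .05 -> reject H0
```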
Statistical significance and confidence intervals (CIs)
Calculate the confidence intervals for the sample you drew (see the previous session) and see whether they include the H0 value. We'll use the critical value for a 2-tail test (z_(α/2)) and a 95% confidence level to be a bit more conservative; note that this is similar to applying a 1-tail test with a 97.5% confidence level.
▪ If the CI overlaps with what H0 says, then reject H1 (= fail to reject the null).
H1: x̄ > 18; H0: x̄ ≤ 18. Decision:
Sample A: 20.67 ± 1.96 × (2.309/√3) → 18.06 < μ < 23.28; the 95% CI does not contain 18 → H1 accepted.
Sample C: 19.66 ± 1.96 × (1.528/√3) → 17.93 < μ < 21.39; the 95% CI contains 18 → H1 rejected.

Statistical significance and confidence intervals (CIs): a note…
[Figure: CIs with one sample vs. with two samples. As a rule of thumb for two samples, two means can still differ significantly under moderate overlap of their CIs, where moderate overlap ≈ (length_a/2 + length_b/2) / 2, i.e., no more than ~50% of the average margin of error.]

Inferential rules
► |test| ≥ |test_critical| ↔ p ≤ α ↔ the (1 − α)% confidence interval does NOT include the H0 value → reject H0, accept H1.
► |test| < |test_critical| ↔ p > α ↔ the (1 − α)% confidence interval DOES include the H0 value → accept H0, reject H1.
* The test-statistic (and critical) values depend on the test applied (e.g., z, t, χ², F…).

Statistical and substantive significance
Statistical significance is not the same thing as actual importance or substantive significance. …small and unimportant effects can turn out to be statistically significant just because of huge samples, while large and important effects can be missed simply because of small samples (and, thus, a lot of sampling error)… The problem with the p-value is that it gives virtually no information about whether the results really matter. Hypothesis testing and statistical significance do not tell us anything about the importance or magnitude of an effect, the so-called effect size.

Effect size
…assesses the magnitude of an observed effect. An effect size is a standardized measure of the size of an effect → we can compare effect sizes across different studies that have measured different variables or have used different scales of measurement. There are several effect size measures, such as Cohen's d and Pearson's r:
► r = .1, d = .2 (small effect)
► r = .3, d = .5 (medium effect)
► r = .5, d = .8 (large effect)
Beware: the size of an effect should be placed within the research context!

Statistical power
Power is the ability of a test to detect an effect of a particular size → statistical power is the probability that a test will find an effect, assuming that one exists in the population. This is the complement of the probability that a test will not find an effect that exists in the population, i.e., the Type II error rate (β). Therefore, power = 1 − β. Statistical power of (at least) .80 (β = .20) is desirable. Power depends on:
► sample size
► effect size
► α-level
► methods

Sample size and statistical significance
[Figure: CIs from samples with the same means and SDs but different sizes ("50% off" vs. "1+1" conditions); larger samples give narrower intervals …because of sampling error…]
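A sketch (my own, using the slides' one-tailed z-test with α = 5%) of how power grows with sample size for a fixed standardized effect size d:

```python
import numpy as np
from scipy import stats

def ztest_power(d, n, alpha=0.05):
    """Power of a one-tailed z-test for standardized effect size d."""
    z_crit = stats.norm.ppf(1 - alpha)          # 1.645 for alpha = .05
    return stats.norm.cdf(d * np.sqrt(n) - z_crit)

for n in (10, 30, 80, 200):
    print(n, round(ztest_power(d=0.5, n=n), 2))
# Power rises toward 1 as n grows; the desirable .80 needs n of about 25
# for a medium effect (d = .5).
```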
Type I (α) and II (β) errors, p-values, power & effect size
Imagine that you find a statistically significant effect… However, there is always a chance you are making a mistake. How likely is it that this effect will occur again in the future? Future research might not be able to reproduce and corroborate results obtained from underpowered studies.
[Figure: sampling distributions under H0 and H1 for a z-test (1-tail, α = 5%); graphs on the same scale.]

Beyond statistical significance (p-value)…
…Would you make a decision based on something that is not likely to happen again? …What if your blood tests tell you that you are OK and you do not need treatment (p < .05, statistical power of 50%)?
Given the α and the effect size, as the sample size increases, the sampling distributions under H0 and H1 change (get narrower) and statistical power increases!
[Figure: the same z-test (1-tail, α = 5%) with a larger sample; graphs on the same scale.]

Key terms
Units of analysis ► "who" is being studied (a.k.a. "cases", "observations", "subjects" or "respondents").
Variables ► "what" is being studied (i.e., the characteristics of interest).
Values ► the link between the who and the what (i.e., the "responses" or "scores" of the units of analysis on the variables).

Data matrix
Rows are respondents (cases), columns are variables, cells are values:
Case# | Age | Sex | Family | Income
1 | 19 | 1 | 2 | 10000.00
2 | 21 | 2 | 1 | 25357.70
3 | 20 | 1 | 1 | 5210.50
4 | 22 | 1 | 3 | 34567.00
(Value: 1 = female)

Data format
Please indicate your gender: female / male.
Which of the following brands do you know? Apple, Sony, HTC, Samsung, Huawei, LG, Other.
How likely is it that you recommend your smartphone brand to your family and friends? 0–100%.
How would you describe the smartphone brand you own? modern ⃝ ⃝ ⃝ ⃝ ⃝ old-fashioned.
Tell me the first word that comes to mind when you think of your smartphone brand: …
Can you make the data matrix for the questions above?

Levels of measurement
► Nominal (categorical): the simplest type of measurement (binary). Numbers are used as labels only; they do not have any mathematical properties. Examples: sex, nationality, preferred newspaper…
► Ordinal: numbers are used to indicate whether a respondent has more or less of a given characteristic, so numbers have a natural rank order. Examples: education, rank, order…
► Interval (continuous; metric, "scale" in SPSS): possesses all characteristics of an ordinal scale and, in addition, is characterized by equality of intervals. Examples: scales from 1–5 or 1–7 (typical questionnaire items).
► Ratio: possesses all characteristics of an interval scale and, in addition, an absolute zero point. Examples: age, income, weight, size, share, sales, savings…
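The slides' data matrix translates directly into a pandas DataFrame; a minimal sketch (the mapping of sex codes beyond "1 = female" is my assumption):

```python
import pandas as pd

# The slides' data matrix as a DataFrame: one row per case, one column per variable.
df = pd.DataFrame({
    "age":    [19, 21, 20, 22],                          # ratio
    "sex":    [1, 2, 1, 1],                              # nominal: 1 = female (2 = male assumed)
    "family": [2, 1, 1, 3],                              # categorical code
    "income": [10000.00, 25357.70, 5210.50, 34567.00],   # ratio
}, index=pd.Index([1, 2, 3, 4], name="case"))

# Metric variables can be averaged; nominal codes should not be.
print(df["income"].mean())                               # meaningful (ratio)
print(df["sex"].map({1: "female", 2: "male"}).mode())    # mode fits a nominal variable
```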
Variables and constructs
Objectivity and complexity: age, sex, education, income → ethicality, innovativeness, brand prestige.
A construct is a notion about the real world used in research to capture more complex phenomena: ethical behavior, consumer innovativeness, brand prestige, ethnocentrism, consumer–brand identification… It requires conceptual definition, operationalization, and measurement (usually combining multiple items/questions).
Definition of consumer innovativeness: "a predisposition to buy new and different products and brands rather than remain with previous choices and consumer patterns".

Form new variables – constructs
Usually, complex constructs (see slide 7) are measured with multiple items. E.g., Perceived Brand Localness:
▪ PBL1: "I associate Brand X with things that are Austrian" (1–7)
▪ PBL2: "For me Brand X represents the true Austria" (1–7)
▪ PBL3: "For me Brand X is a very good symbol of Austria" (1–7)
We can incorporate all the information of these three items in one summated scale variable (composite variable) for further use in our analysis. Compute variable: PBL = (PBL1 + PBL2 + PBL3)/3.
Note 1: The items should be measured on the same scale in order to develop a composite.
Note 2: Make sure all items have the same directionality.
Note 3: Statistically assess the reliability of the new composite variable.

Measurement error
There is no perfect measurement: observed score (O) = true score (T) + error (E), where the error is systematic (S) and/or unsystematic (R, random).
Validity: the extent to which a measure captures what it is supposed to capture (E = 0).
Reliability: the extent to which a measure is free of random error (R = 0).

The dartboard analogy
[Figure: dartboards illustrating reliability and validity, relatively and overall.]

The mode
The most frequently occurring value for a variable (most frequent score).
Advantages: particularly suitable for nominal and ordinal variables; no influence of extreme values.
Disadvantages: no information about the absolute frequency of (or difference between) values; bimodal/multimodal distributions.
Dataset: income categories (coded 1–4, e.g., up to the >2500 bracket = 4). Mode: 2. Interpretation: most respondents' income falls in the 800–1500 category.
Example questions where the mode is relevant: What colour should our new product package have? Which flavour is preferred most?

The median
The middle score when scores are ordered, or the value above and below which 50% of the values fall. Applicable for ordinal and scale variables.
Advantages: gives more information than the mode; is not affected by extreme values (outliers); splits the sample in half for further analyses.
Disadvantages: hardly captures variability in datasets — two distributions with completely different variability might have the same median; sensitive to additional values.
Dataset: age. Median: 32. Interpretation: 50% are below 32 and 50% are above 32 years old.
Example question where the median is relevant: What is the median salary?

The median & the percentiles
Percentiles (p) give you the value on your measurement scale below which p% of observations fall and above which (100 − p)% of observations lie. Important percentiles: the 25th percentile (1st quartile), the value below which 25% of observations fall; the 50th percentile (2nd quartile) = the median; the 75th percentile (3rd quartile), the value below which 75% of observations fall.
[Figure: boxplot of age — minimum, 25th percentile (~25), median, 75th percentile (~42), maximum; the middle 50% spans the interquartile range (IQR). An outlier (75 yrs) lies beyond 1.5 × IQR; such values can be errors when fixed scales are used.]

The mean
The arithmetic average of all the values, defined as the sum of scores divided by the number of scores. Applicable for continuous (metric, scale) variables.
Advantages: gives more nuanced information about the central location; is more stable across samples than the mode and median; less sensitive to additional values.
Disadvantages: sensitive to extreme values (outliers), which may distort the results.
Dataset: age. Mean: 35.26. Interpretation: on average, respondents in our sample are 35.26 years old.
Measures of central tendency (mean, median, mode) are fundamental statistics for testing!
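The three measures side by side on a made-up age series (my illustration; note how the single outlier at 75 pulls the mean but not the median):

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one outlier (75).
age = pd.Series([18, 22, 25, 28, 31, 32, 32, 36, 41, 44, 52, 75])

print("mode:",   age.mode().tolist())   # most frequent value(s)
print("median:", age.median())          # 50th percentile, robust to the outlier
print("mean:",   age.mean())            # pulled upward by the outlier
print("quartiles:", np.percentile(age, [25, 50, 75]))  # 1st, 2nd, 3rd quartile
```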
Range
The difference between the smallest and the largest value. Applicable for metric (scale) variables.
Advantages: gives information about the spread/dispersion of the data distribution.
Disadvantages: sensitive to extreme values and unstable across different samples.
Dataset: age. Min: 18, Max: 75, Range: 75 − 18 = 57.
Isn't there a better way to assess dispersion?

Deviance
Got it! We can see how values spread by looking at how far each value is from the center of the distribution: for each observation, we can calculate the distance from the mean. Hmm… the mean is the "center of gravity", so the total deviance (the sum of these distances) is always zero.
[Figure: ratings for Product A and Product B plotted around a mean of 3, with deviations +2, +1, 0, −1, −2 that sum to zero.]

Variance
Sum of squared errors (SS): if I square each difference, I can overcome this problem and come up with a (positive) measure of total deviance (variation) from the mean. But this is not comparable across samples/group sizes, because it gets bigger as the number of observations (sample size) increases! So take the sample size into consideration by looking at the average total deviance, the average squared distance from the mean — dividing by the "degrees of freedom". The variance (a.k.a. mean squared error) reflects variability (or error) and is a very important concept in many analytical techniques!

Standard deviation
The variance gives us whatever it is that we measure in units squared. We can bring it back to its original unit of measurement by taking the square root of the variance; that is the standard deviation (SD) — …the "average distance from the mean."
The sum of squares (within a specific context), variance, and standard deviation reflect the same thing:
► the variability in the data
► how well the mean represents the observed data
► error
► homogeneity (similarity) / heterogeneity (dissimilarity) of ratings

Data frequencies
Applicable for nominal, ordinal, and metric variables; they give a picture of the frequency distribution of one variable across all respondents.
Example: socio-demographic sample characteristics — N = 273; gender: 50.8% female; income: >2500 category (17.2%).

Frequency tables
[Figure: frequency table.]

Bar charts: categorical vs. continuous
[Figure: bar chart of the distribution by gender (categorical) vs. a continuously distributed variable.]

Grouped frequency distribution
[Figure: grouped frequency table.]

Histograms
[Figure: histogram.]

Scatterplots
Inspect bivariate relationships → the distribution of values for pairs of variables.

Assumptions
Statistical requirements which ensure that the analytical methods deliver accurate results. Depending on the test applied, other assumptions might be necessary. Parametric tests make assumptions about the population parameters and the distributions from which the data are drawn:
▪ Additivity and linearity
▪ Independence
▪ Normality
▪ Homogeneity of variance
Assumptions met → parametric tests. Assumptions violated → non-parametric ("distribution-free") tests or transformations.
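A minimal sketch of checking the last two assumptions with scipy, on hypothetical ratings for two groups (Levene's test and the variance ratio are both introduced in detail below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(50, 10, 40)   # hypothetical ratings, group 1
group2 = rng.normal(55, 18, 40)   # hypothetical ratings, group 2

# Normality: skewness/kurtosis near 0 and a non-significant Shapiro-Wilk
# test are consistent with a normal distribution.
print("skew:", stats.skew(group1), "kurtosis:", stats.kurtosis(group1))
print("Shapiro-Wilk:", stats.shapiro(group1))

# Homogeneity of variance: Levene's test (H0: variances are equal);
# a significant result means heterogeneous variances.
print("Levene:", stats.levene(group1, group2))

# Hartley's Fmax (variance ratio): below ~2-3 for n >= 30 suggests homogeneity.
v1, v2 = group1.var(ddof=1), group2.var(ddof=1)
print("VR:", max(v1, v2) / min(v1, v2))
```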
Assumptions: linearity and additivity
► Relationships among variables are described by a linear function (generalized linear model, GLM).
► The combined effect of several predictors is best described by adding their effects together.

Assumptions: independence (i.i.d.)
► Observations are independent (rows in the matrix are independent) and come from the same probability distribution.
► Errors (residuals) in the model are independent / not correlated; otherwise there is systematic under-/over-estimation.

Normality
Strictly speaking, "data should be normally distributed" is not quite the case…
▪ The estimates in the population (i.e., the sampling distribution) need to be normally distributed.
▪ The residuals/errors need to be normally distributed to get optimal estimates.
We don't have access to the sampling distribution of estimates (in the population), so we usually test the observed data: if our data themselves are normal, then the above tend to be true! Note: this assumption refers to every (sub-)sample for which we derive a parameter estimate.

Normal shape
Mode = median = mean. Deviations in the symmetry (skewness) and the "heaviness" (kurtosis) indicate non-normality. A perfect normal distribution has zero skewness and zero kurtosis.

Skewness (left- and right-skewed)
Skewness > 0: frequent scores are clustered at the low end; mode < median < mean.
Skewness < 0: frequent scores are clustered at the high end; mean < median < mode.

Kurtosis
Leptokurtic: kurtosis > 0. Mesokurtic: kurtosis = 0. Platykurtic: kurtosis < 0.
For n > 30, the sampling distribution of the estimate typically tends to converge to normal anyway, so normality can be assumed.

Homogeneity of variance / homoscedasticity
When testing several groups, the variance of the outcome variable across these groups should be similar (similar spread of scores) → categorical X, continuous Y ("comparisons").
Levene's test: H0: the variances are the same; H1: the variances are not the same. Significant → unequal variances (heterogeneous); non-significant → equal variances (homogeneous).
Hartley's Fmax (a.k.a. variance ratio): VR = largest variance / smallest variance. If VR < 2–3 (for n ≥ 30), homogeneity can be assumed.
[Figure: money spending across customer-loyalty groups (low = Group 1, high = Group 2).]

Homogeneity of variance / homoscedasticity
When testing relationships, the variance of the outcome variable (Y) should be constant across the levels of the predictor variable (X) → continuous X, continuous Y ("relationships").
▪ Plots should display random residuals that are uncorrelated and uniform; patterns might yield untrustworthy results.
▪ Violation of homoscedasticity (i.e., heteroscedasticity) produces a distinctive funnel or cone shape in residual plots (as predicted values increase, the variance of the residuals also increases).
Check by saving the residuals of Y as a variable and creating a scatterplot (see Regression Analysis).

Comparing independent samples/groups
Two groups — nature of the dependent variable: nominal → χ² test; ordinal → Mann-Whitney U test; interval/ratio → independent t-test.
More than two groups — nominal → χ² test; ordinal → Kruskal-Wallis one-way ANOVA; interval/ratio → one-way ANOVA.
Methods in bold are discussed in class. Note: nominal variable → compare frequencies; ordinal data → compare ranks; scale/metric data → compare means.
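The decision chart maps directly onto scipy functions; a sketch on hypothetical data (the frequencies in the crosstab are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y1, y2 = rng.normal(5, 1, 30), rng.normal(5.5, 1, 30)  # metric DV, 2 groups

# Nominal DV -> chi-square test on a crosstab of observed frequencies.
table = np.array([[30, 10],
                  [20, 25]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print("chi-square p:", p)

# Ordinal DV, 2 groups -> Mann-Whitney U test (compares ranks).
print("Mann-Whitney p:", stats.mannwhitneyu(y1, y2).pvalue)

# Metric DV, 2 groups -> independent t-test (compares means).
print("t-test p:", stats.ttest_ind(y1, y2).pvalue)

# >2 groups: ordinal DV -> Kruskal-Wallis; metric DV -> one-way ANOVA.
y3 = rng.normal(6, 1, 30)
print("Kruskal-Wallis p:", stats.kruskal(y1, y2, y3).pvalue)
print("one-way ANOVA p:", stats.f_oneway(y1, y2, y3).pvalue)
```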
Determine the type (measurement level) of DV/IV
► Do more liberals, conservatives, or moderates vote for Democrats vs. Republicans?
► Is the unemployment rate higher in Vienna or Copenhagen?
► Do Coke or Pepsi employees earn (on average) more?
► Is the choice between luxury and mainstream clothes the same for university professors and fund managers?
► Does using a red as opposed to a blue background in print ads make more consumers remember (or not) the advertised brand?

Example research questions for χ² tests
▪ Are B2B or B2C companies more likely to engage in digital promotion?
▪ Are heavy, moderate, or light Coke drinkers more likely to switch to Pepsi?
▪ Does gender affect the choice between Apple and Samsung smartphones?
▪ Are Russians, Greeks, or Norwegians more likely to choose Asia as their summer destination?
▪ Does choosing a career in academia vs. consulting differ depending on the university you studied at (WU vs. UNIVIE)?

χ² (chi-square) test
Association/dependence between categorical variables with two (or more) levels (by cross-tabulating and comparing their frequencies).
Assumptions:
1. Independent samples (each person allocated to ONE group only).
2. The expected count in each cell should be > 5 (in large tables, at most 20% of cells below that).

Independent t-test
t is not significant (t < t_critical, p > α) → support for H0: μ₁ = μ₂.
t is significant (t > t_critical, p < α) → support for H1: μ₁ ≠ μ₂.
Pooled variance when group sizes are not equal (weighted):
s_p² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)

Output of an independent-samples t-test
Are female or male customers more satisfied?
Levene's test checks whether the variances of the two independent groups are equal. H0: the variances of the two groups are equal; H1: the variances of the two groups are not equal. If sig. > .05, H0 is accepted: the variances are equal and consequently we read the first line of the table; otherwise, the opposite.
Equality of means tests whether the means of the two groups are different or not (is their difference zero or not). H0: men are equally satisfied as women with the service; H1: men are not equally satisfied as women with the service. Mean difference: 4.3096 − 4.3190 = −.0094. If sig. > .05, H0 is accepted: men and women are equally satisfied with the service. If sig. < .05, there is a significant difference in the direction shown by the mean difference. If the test is not significant, the 95% interval of the difference will contain 0, and vice versa.
(SPSS automatically gives two-tailed p-values; if you have a directional hypothesis, divide by 2 to obtain the 1-tailed p-value.)

How to calculate the effect size (Pearson's r, Cohen's d, Hedges' g):
r = √(t² / (t² + df))
where t is the t-statistic and df the degrees of freedom (both taken from the output table); Cohen's d divides the mean difference by the pooled SD.
Reporting: we found a (non-)significant difference in the intention to talk positively about the brand (t(df) = value, p < .05 or p > .05).
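A sketch of the full routine on hypothetical satisfaction ratings: Levene's test first, then the t-test, then the slides' r formula for the effect size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
satisfaction_f = rng.normal(4.31, 0.8, 50)  # hypothetical ratings, female customers
satisfaction_m = rng.normal(4.32, 0.8, 50)  # hypothetical ratings, male customers

# Levene's test decides which t-test variant to use (equal variances or not).
equal_var = stats.levene(satisfaction_f, satisfaction_m).pvalue > .05

res = stats.ttest_ind(satisfaction_f, satisfaction_m, equal_var=equal_var)
df = len(satisfaction_f) + len(satisfaction_m) - 2  # pooled df, as in the slides' formula

# Effect size from the slides: r = sqrt(t^2 / (t^2 + df))
r = np.sqrt(res.statistic**2 / (res.statistic**2 + df))
print(f"t({df}) = {res.statistic:.3f}, two-tailed p = {res.pvalue:.3f}, r = {r:.3f}")
```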
Comparing related samples/measures
Compare the same respondents across different measures (or points of time). For more than two related measures: ordinal data → a rank-based one-way ANOVA; interval/ratio data → repeated-measures ANOVA.

Example research questions for paired-samples t-tests
▪ Do companies in the UK spend more money on employee bonuses or on employee training?
▪ Which characteristic (e.g., friendly staff, speed of service) do customers consider more important in services?
▪ Do consumers spend more money on cosmetics as they grow older?
▪ Has company Z increased its market share between 2007 and 2017?
▪ Has your understanding of statistics improved after taking the course?

Paired-samples t-test
Compare the means of two different variables from the same sample.
Assumptions:
1. Related (paired) observations (i.e., within-group or repeated-measures design).
2. Normal distribution (of the difference!) → K-S test, skewness/kurtosis.
Process:
Step 1: Check the assumptions (using the new difference variable).
Step 2: Calculate the paired-samples t-test → significant or not?
Step 3: Calculate the effect size (manually/online).

Paired-samples t-test
H0: The related measures do not differ (μ_v1 = μ_v2). H1: The related measures differ (μ_v1 ≠ μ_v2).
t = (observed difference between variable means − expected difference between variable means if the null hypothesis is true) / estimate of the standard error of the difference between the two means:
t = (D̄ − μ_D) / (s_D / √N)
where D̄ is the mean difference in the sample, μ_D the expected difference in the population, s_D the standard deviation of the differences, and N the sample size.
t is not significant (t < t_critical, p > α) → support for H0: μ_v1 = μ_v2.
t is significant (t > t_critical, p < α) → support for H1: μ_v1 ≠ μ_v2.

Output of a paired-samples t-test
Training prog. A vs. Training prog. B. Mean difference: 4.9333 − 6.0667 = −1.1333. If the test is significant, the 95% interval will NOT contain 0, and vice versa.
The paired-differences test checks whether the difference between the means of the two variables is zero. H0: Training A is perceived as equally good as Training B; H1: Training A is NOT perceived as equally good as Training B (two-tailed). If sig. > .05, H0 is accepted: the two training programs are seen as equally effective. If sig. < .05, there is a significant difference in how effective the two programs are; in particular, the mean difference shows that A is considered worse than B.

How to calculate the effect size:
r = √(t² / (t² + df))
with the t-statistic and degrees of freedom taken from the output table.
Reporting: a significant difference between speed of service and friendliness of staff (t(df) = value, p < .05).

One-way ANOVA
The F-ratio only tells us whether group means differ: if F is significant (F > F_critical(df_M, df_R), p < α), the group means differ.
► It does not tell us "how".
► Additional, follow-up tests are necessary.

Post-hoc comparisons & planned contrasts
Post-hoc tests (two-tailed*): pairwise comparisons between all group means.
▪ Assumptions met: e.g., REGWQ or Tukey HSD; suggested options: Bonferroni, Tukey's.
▪ Unequal sample sizes: Gabriel's (small n), Hochberg's GT2 (large n).
▪ Unequal variances: Games-Howell.
Planned contrasts (one-tailed): hypothesis-driven comparisons that are planned a priori (not post hoc).
▪ They break down the variance further according to specific hypotheses about how the groups differ.
▪ They consider directionality by looking into one tail of the distribution.
See Field, sections 12.4 (12.4.4) and 12.5.
* You can divide the two-tailed p-value of the post-hoc test by 2 to get the one-tailed p-value: p_2-tailed/2 = p_1-tailed.
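A sketch of a one-way ANOVA with a simple follow-up, on hypothetical willingness-to-buy scores for three promotional strategies (Bonferroni-corrected pairwise t-tests stand in here for the named post-hoc procedures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Hypothetical willingness-to-buy under three promotional strategies.
control  = rng.normal(3.2, 1.0, 40)
buy1get1 = rng.normal(3.4, 1.0, 40)
half_off = rng.normal(4.1, 1.0, 40)

f, p = stats.f_oneway(control, buy1get1, half_off)
print(f"F = {f:.2f}, p = {p:.4f}")  # a significant F says means differ, not "how"

# Follow-up pairwise comparisons with a Bonferroni correction.
pairs = [("control", control, "1+1", buy1get1),
         ("control", control, "50% off", half_off),
         ("1+1", buy1get1, "50% off", half_off)]
for name1, g1, name2, g2 in pairs:
    t, p_pair = stats.ttest_ind(g1, g2)
    print(name1, "vs", name2, "adjusted p =", min(p_pair * len(pairs), 1.0))
```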
Example output of one-way ANOVA
There was a significant difference in WtB (willingness to buy) across promotional strategies, F(df_M, df_R) = value, p < .05, η² = value. Post-hoc comparisons showed that the "50% off" option is significantly better than the current strategy (M_control, SD vs. M_50%, SD, p < .05)…

ANCOVA: covariates in the ANOVA model
► To test for differences between group means when we know that an extraneous (third) variable (i.e., a covariate) affects the outcome variable (DV).
► Used to "control" for (actually, adjust for) the effect of extraneous, confounding variables: total variance = variance explained by the model + variance explained by the covariate + unexplained variance.
► Reduces the error variance (SS_R) by explaining some of the otherwise unexplained variance.
► We see the effect of the factor "above and beyond" the effect of the covariate.

ANCOVA: adjusted and unadjusted means
"What would the mean scores be if the groups had the same level of price sensitivity?" Pairwise comparisons are now based on the adjusted means.
[Figure: bar chart of willingness to buy for control, 1+1, and 50% off, comparing unadjusted means with means adjusted for price sensitivity (values shown include 5.15, 4.88, 4.85, 4.71, 3.44, 3.22, 3.13, 2.92, 2.00).]

Example output of ANCOVA
Tables with descriptives (M, SD) and adjusted M (and SE).

Factorial ANOVA
Factorial ANOVA involves two (or more) independent variables (factors). We can look at interactions between factors/IVs: whether one factor moderates the effect of the other, i.e., whether the effect of one factor (IV) depends on the levels of another factor (IV).
Assumptions:
1. Independent measures (groups).
2. Independent and normally distributed errors/residuals.
3. Homogeneity of variances (across samples)*.
Process:
Step 1: Do the groups differ? Based on which factor? Is there an interaction? → F-ratio.
Step 2: Which groups differ from each other? → post-hoc tests.
* If group sizes are equal, ANOVA is said to be fairly robust against violations of homogeneity (as long as normality can reasonably be assumed!).

Factorial ANOVA: main & interaction effects
Two-way ANOVA (model with two factors), e.g., money spent (DV) by gender and mood (happy/sad), with marginal means per factor.
Main effect: the individual effect of each factor on the dependent variable.
Interaction effect: an interaction is present when the main effect of an independent variable is different depending on the levels of another variable.
[Figure: four example patterns — gender sig., mood ns, gender×mood ns; gender ns, mood ns, gender×mood sig.; gender ns, mood sig., gender×mood ns; gender ns, mood sig., gender×mood sig.]

Factorial ANOVA: main & interaction effects
[Figure: line plots of preference for ZARA vs. H&M — parallel lines (no interaction) vs. non-parallel lines (interaction); a 2×2 and a 3×3 two-way ANOVA design.]

Theory of factorial ANOVA
Partitions the total variability among the components of the model and compares the variability explained by the factors to the error variability (not explained by the factors).
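A sketch of this partitioning with statsmodels, previewing the music × gender example that follows; the data are simulated (with an interaction deliberately built in), not the slide's data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
music = np.repeat(["classical", "pop", "rock"], 16)
gender = np.tile(np.repeat(["female", "male"], 8), 3)
taste = rng.normal(60, 10, 48)
# Build in an interaction: rock tastes much worse for males.
taste[(music == "rock") & (gender == "male")] -= 25

df = pd.DataFrame({"music": music, "gender": gender, "taste": taste})

# Two-way ANOVA with interaction: taste ~ music + gender + music:gender
model = smf.ols("taste ~ C(music) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p for both main effects and the interaction
```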
Two-way ANOVA applied: the influence of music on taste and the moderating role of gender
A restaurant manager intends to add music in his store because he read on Twitter that music can positively influence the perception of taste. To find the right type of music, he designs an experiment in which he also investigates whether gender can influence (moderate) the potential effects of music. Tastefulness is measured on a standardized 0–100 scale, with eight respondents per cell:

Music: Classical | Classical | Pop | Pop | Rock | Rock
Gender: Female | Male | Female | Male | Female | Male
65 | 50 | 70 | 45 | 55 | 30
50 | 55 | 65 | 60 | 65 | 30
70 | 80 | 60 | 85 | 70 | 30
45 | 65 | 70 | 65 | 55 | 55
55 | 70 | 65 | 70 | 55 | 35
30 | 75 | 60 | 70 | 60 | 20
70 | 75 | 60 | 80 | 50 | 45
55 | 65 | 50 | 60 | 50 | 40
Total | 485 | 535 | 500 | 535 | 460 | 285
Mean | 60.625 | 66.875 | 62.50 | 66.875 | 57.50 | 35.625
Variance | 24.55 | 106.70 | 42.86 | 156.70 | 50.00 | 117.41

Two-way ANOVA applied
SS_T = Σ(x_i − x̄_grand)² = s²_grand(N − 1), with df = N − 1 = 47.
SS_R = Σ(x_i − x̄_group1)² + Σ(x_i − x̄_group2)² + … + Σ(x_i − x̄_groupi)²
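A minimal numpy sketch of this variance partitioning, computed from the cell summaries in the table above (each balanced cell contributes s²(n − 1) to the residual sum of squares):

```python
import numpy as np

# Cell summaries from the table above (6 cells, n = 8 each).
means = np.array([60.625, 66.875, 62.50, 66.875, 57.50, 35.625])
variances = np.array([24.55, 106.70, 42.86, 156.70, 50.00, 117.41])
n = 8
N = n * len(means)            # 48 observations in total

grand_mean = means.mean()     # cells are balanced, so this equals the grand mean

# Residual (within-cell) SS: each cell contributes s^2 * (n - 1).
ss_r = np.sum(variances * (n - 1))

# Model (between-cell) SS: squared distances of cell means from the grand mean.
ss_m = np.sum(n * (means - grand_mean) ** 2)

print(f"SS_M = {ss_m:.1f}, SS_R = {ss_r:.1f}, "
      f"SS_T = {ss_m + ss_r:.1f} (df_T = {N - 1})")
```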