BIO1109 6. Comparison of Population Proportions PDF
Far Eastern University
Frederick Gella
Summary
This document presents lecture notes on comparison of population proportions in biostatistics. It covers basic concepts, different methodologies, and examples. The source is from Le and Eberly's Introductory Biostatistics (2nd ed).
Full Transcript
6. Comparison of Population Proportions
Presented by Frederick Gella
Le and Eberly (2016). Introductory Biostatistics, 2nd ed.

Content
6.1 One-Sample Problem with Binary Data
6.2 Analysis of Pair-Matched Data
6.3 Comparison of Two Proportions
6.4 Mantel-Haenszel Method
6.5 Inferences for General Two-Way Tables
6.6 Fisher’s Exact Test
6.7 Ordered 2 × k Contingency Tables

Basic Concepts
Let $X_1$ and $X_2$ denote two categorical variables, $X_1$ having $I$ levels and $X_2$ having $J$ levels, giving $IJ$ combinations of classifications. We display the data in a rectangular table having $I$ rows for the categories of $X_1$ and $J$ columns for the categories of $X_2$; the $IJ$ cells represent the $IJ$ combinations of outcomes. When the cells contain frequencies of outcomes, the table is called a contingency table or cross-classified table, also referred to as an $I$ by $J$ or $I \times J$ table.

6.1 One-Sample Problem with Binary Data
In this type of problem, we have a sample of binary data $(n, x)$, with $n$ an adequately large sample size and $x$ the number of positive outcomes among the $n$ observations, and we consider the null hypothesis
$$H_0: \pi = \pi_0$$
where $\pi_0$ is a fixed and known number between 0 and 1; for example,
$$H_0: \pi = 0.25.$$
$\pi_0$ is often a standardized or referenced figure, for example, the effect of a standardized drug or therapy or the national smoking rate (where the national sample is often large enough to produce negligible sampling error in $\pi_0$). Or we could be concerned with a research question such as: Does the side effect (of a certain drug) exceed a regulated limit $\pi_0$?
In a typical situation, the null hypothesis of a statistical test is concerned with a parameter $\pi$, while the statistic is a sample proportion $p$; the corresponding sampling distribution is obtained easily by invoking the central limit theorem. With a large sample size and assuming that the null hypothesis $H_0$ is true, it is the normal distribution with mean and variance given by
$$\mu_p = \pi_0 \qquad (1)$$
$$\sigma_p^2 = \frac{\pi_0(1 - \pi_0)}{n} \qquad (2)$$
respectively. From this sampling distribution, the observed value of the sample proportion can be standardized and converted to a standard unit: the number of standard errors away from the hypothesized value $\pi_0$. In other words, to perform a test of significance for $H_0$, we proceed with the following steps:

1. Decide whether a one- or a two-sided test is appropriate.
2. Choose a level of significance $\alpha$, a common choice being 0.05.
3. Calculate the z score:
$$z = \frac{p - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}}$$
4. From the table for the standard normal distribution and the choice of $\alpha$ (e.g., $\alpha = 0.05$), the rejection region is determined by:
   For a one-sided test: $z \le -1.65$ for $H_A: \pi < \pi_0$; $z \ge 1.65$ for $H_A: \pi > \pi_0$.
   For a two-sided test, $H_A: \pi \ne \pi_0$: $z \le -1.96$ or $z \ge 1.96$.

6.2 Analysis of Pair-Matched Data
The method applies to cases where each subject or member of a group is observed twice for the presence or absence of a certain characteristic (e.g., at admission to and discharge from a hospital), or matched pairs are observed for the presence or absence of the same characteristic.

Example.
A popular application is an epidemiological design called a pair-matched case–control study. In case–control studies, cases of a specific disease are ascertained as they arise from population-based registers or lists of hospital admissions, and controls are sampled either as disease-free persons from the population at risk or as hospitalized patients having a diagnosis other than the one under investigation. As a technique to control confounding factors, individual cases are matched, often one-to-one, to controls chosen to have similar values for confounding variables such as age, gender, and race.

For pair-matched data with a single binary exposure (e.g., smoking vs. nonsmoking), the data can be represented by a 2 × 2 table (Table 6.1), where (+, −) denotes the (exposed, nonexposed) outcome. In this 2 × 2 table, $a$ denotes the number of pairs with two exposed members, $b$ denotes the number of pairs where the case is exposed but the matched control is unexposed, $c$ denotes the number of pairs where the case is unexposed but the matched control is exposed, and $d$ denotes the number of pairs with two unexposed members.

Goal. Compare the incidence of exposure among the cases versus the controls; the parts of the data showing no difference, the number $a$ of pairs with two exposed members and the number $d$ of pairs with two unexposed members, contribute nothing as evidence in such a comparison.

In other words, the analysis of pair-matched data with a single binary exposure can be seen as a special case of the one-sample problem with binary data of Section 6.1 with $n = b + c$, $x = b$, and $\pi_0 = 0.5$.
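The reduction above can be sketched in a few lines of Python. This is only an illustration: the function name `one_sample_z` and all counts are hypothetical, not taken from the lecture or the textbook. It computes the z score of Section 6.1 and then reuses it for pair-matched data by setting n = b + c, x = b, and π0 = 0.5.

```python
import math

def one_sample_z(x, n, pi0):
    """z score for H0: pi = pi0, given x positive outcomes among n observations."""
    p = x / n                                # sample proportion
    se = math.sqrt(pi0 * (1 - pi0) / n)      # standard error under H0
    return (p - pi0) / se

# Hypothetical one-sample data: 40 positives among 100, testing H0: pi = 0.25
z_one = one_sample_z(40, 100, 0.25)

# Hypothetical pair-matched data: only the discordant pairs b and c are used
b, c = 25, 10
z_pair = one_sample_z(b, b + c, 0.5)

print(z_one, z_pair)
```

With π0 = 0.5 the pair-matched z score is algebraically equal to (b − c)/√(b + c), so the concordant counts a and d never enter the calculation.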
Recall the form of the test statistic of Section 6.1; we have
$$
\begin{aligned}
z &= \frac{p - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}} \\
  &= \frac{p - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/(b + c)}} \\
  &= \frac{[b/(b + c)] - 1/2}{\sqrt{(1/2)[1 - (1/2)]/(b + c)}} \\
  &= \frac{b - c}{\sqrt{b + c}}
\end{aligned}
$$

The decision is based on the standardized z score, referred to the percentiles of the standard normal distribution, or, in the two-sided form, on the square of the statistic above, denoted by
$$\chi^2 = \frac{(b - c)^2}{b + c}$$
and the test is known as McNemar’s chi-square.

If the test is one-sided, $z$ is used and the null hypothesis is rejected at the 0.05 level when $z \ge 1.65$.
If the test is two-sided, $\chi^2$ is used and the null hypothesis is rejected at the 0.05 level when $\chi^2 \ge 1.96^2 = 3.84$.

6.3 Comparison of Two Proportions
In this type of problem we have two independent samples of binary data $(n_1, x_1)$ and $(n_2, x_2)$, where the $n$s are adequately large sample sizes that may or may not be equal. The $x$s are the numbers of “positive” outcomes in the two samples, and we consider the null hypothesis
$$H_0: \pi_1 = \pi_2$$
expressing the equality of the two population proportions. To perform a test of significance for $H_0$, we proceed with the following steps:

1. Decide whether a one-sided test, say, $H_A: \pi_2 > \pi_1$ or $H_A: \pi_1 > \pi_2$, or a two-sided test, $H_A: \pi_1 \ne \pi_2$, is appropriate.
2. Choose a significance level $\alpha$, a common choice being 0.05.
3. Calculate the z score based on $p_2 - p_1$:
$$z = \frac{p_2 - p_1}{\sqrt{p(1 - p)(1/n_1 + 1/n_2)}}$$
where $p$ is the pooled proportion, defined by
$$p = \frac{x_1 + x_2}{n_1 + n_2}$$
which is an estimate of the common proportion under $H_0$.
4. Refer to the table for the standard normal distribution to select a cut point. If $\alpha$ is 0.05, the rejection region is determined by:
   For the one-sided alternative $H_A: \pi_2 > \pi_1$, $z \ge 1.65$.
   For the one-sided alternative $H_A: \pi_2 < \pi_1$, $z \le -1.65$.
   For the two-sided alternative $H_A: \pi_1 \ne \pi_2$, $z \le -1.96$ or $z \ge 1.96$.

In the two-sided form, the square of the z score, denoted $\chi^2$, is more often used. The test is referred to as the chi-square test. The test statistic can also be obtained using the shortcut formula
$$\chi^2 = \frac{(n_1 + n_2)[x_1(n_2 - x_2) - x_2(n_1 - x_1)]^2}{n_1 n_2 (x_1 + x_2)(n_1 + n_2 - x_1 - x_2)}$$
and the null hypothesis is rejected at the 0.05 level when $\chi^2 \ge 3.84$. With data in a 2 × 2 table (Table 6.4), the chi-square statistic above is simply
$$\chi^2 = \frac{(a + b + c + d)(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}$$
its denominator being the product of the four marginal totals.

6.4 Mantel-Haenszel Method
We are often interested only in investigating the relationship between two binary variables (e.g., a disease and an exposure); however, we have to control for confounders. A confounding variable is a variable that may be associated with either the disease or exposure or both.

Example. In Example 1.2, a case–control study was undertaken to investigate the relationship between lung cancer and employment in shipyards during World War II among male residents of coastal Georgia.
In this case, smoking is a confounder; it has been found to be associated with lung cancer and it may be associated with employment because construction workers are likely to be smokers. Specifically, we could investigate:

1. Among smokers, whether or not shipbuilding and lung cancer are related;
2. Among nonsmokers, whether or not shipbuilding and lung cancer are related.

Are shipbuilding and lung cancer independent, conditional on smoking? However, we do not want to reach separate conclusions, one at each level of smoking. Assuming that the confounder, smoking, is not an effect modifier (i.e., smoking does not alter the relationship between lung cancer and shipbuilding), we want to pool data for a combined decision. When both the disease and the exposure are binary, a popular method to achieve this task is the Mantel–Haenszel method. The process can be summarized as follows:

1. We form 2 × 2 tables, one at each level of the confounder.
2. At a level of the confounder, we have the frequencies as shown in Table 6.8.

Under the null hypothesis and fixed marginal totals, the cell (1, 1) frequency $a$ is distributed with mean and variance
$$E_0(a) = \frac{r_1 c_1}{n}$$
$$\mathrm{Var}_0(a) = \frac{r_1 r_2 c_1 c_2}{n^2 (n - 1)}$$
and the Mantel–Haenszel test is based on the z statistic
$$z = \frac{\sum [a - (r_1 c_1 / n)]}{\sqrt{\sum (r_1 r_2 c_1 c_2)/[n^2 (n - 1)]}}$$
where the summation ($\sum$) is across levels of the confounder. Of course, one can use the square of the z score, a chi-square test at one degree of freedom, for two-sided alternatives.
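As a rough illustration, the z statistic above can be computed as in the following Python sketch. Table 6.8 is not reproduced in this transcript, so the layout is an assumption: each stratum is taken to be a 2 × 2 table with cells (a, b, c, d), row totals r1 = a + b and r2 = c + d, column totals c1 = a + c and c2 = b + d, and n the stratum total. The function name `mantel_haenszel_z` and the counts are hypothetical.

```python
import math

def mantel_haenszel_z(tables):
    """Mantel-Haenszel z statistic.

    tables: list of (a, b, c, d) tuples, one 2 x 2 table per confounder level.
    Assumed layout: r1 = a + b, r2 = c + d, c1 = a + c, c2 = b + d.
    """
    numerator = 0.0   # sum over strata of a - E0(a)
    variance = 0.0    # sum over strata of Var0(a)
    for a, b, c, d in tables:
        r1, r2 = a + b, c + d
        c1, c2 = a + c, b + d
        n = r1 + r2
        numerator += a - r1 * c1 / n
        variance += r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))
    return numerator / math.sqrt(variance)

# Hypothetical data: one 2 x 2 table at each of two confounder levels
strata = [(10, 20, 5, 40), (15, 10, 10, 20)]
z_mh = mantel_haenszel_z(strata)
print(z_mh, z_mh ** 2)  # z score and its square (chi-square, 1 df)
```

Each stratum contributes its own observed-minus-expected difference and its own variance, and the pooling happens only in the two sums, which is why the result is a single combined decision.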
When the test above is statistically significant, the association between the disease and the exposure is real. Since we assume that the confounder is not an effect modifier, the odds ratio is constant across its levels.

The odds ratio at each level is estimated by $ad/bc$, while the Mantel–Haenszel procedure pools data across levels of the confounder to obtain a combined estimate:
$$\mathrm{OR}_{MH} = \frac{\sum (ad/n)}{\sum (bc/n)}$$

6.5 Inferences for General Two-Way Tables
Consider the general case of an $I \times J$ table, say, resulting from a survey of size $n$. Let $X_1$ and $X_2$ denote two categorical variables, $X_1$ having $I$ levels and $X_2$ having $J$ levels; there are $IJ$ combinations of classifications. The $IJ$ cells represent these combinations; their probabilities are $\{\pi_{ij}\}$, where $\pi_{ij}$ denotes the probability that the outcome $(X_1, X_2)$ falls in the cell in row $i$ and column $j$. When the two categorical variables forming the two-way table are independent, all $\pi_{ij} = \pi_{i+}\pi_{+j}$. Here $\pi_{i+}$ and $\pi_{+j}$ are the two marginal or univariate probabilities. The estimate of $\pi_{ij}$ under this condition is
$$\hat{\pi}_{ij} = \hat{\pi}_{i+}\hat{\pi}_{+j} = p_{i+}p_{+j} = \frac{x_{i+}}{n} \cdot \frac{x_{+j}}{n} = \frac{x_{i+}x_{+j}}{n^2}$$
where the $x$s are the observed frequencies. Under the assumption of independence, we would have in cell $(i, j)$:
$$e_{ij} = n\hat{\pi}_{ij} = \frac{x_{i+}x_{+j}}{n} = \frac{(\text{row total})(\text{column total})}{\text{sample size}}$$
The $e_{ij}$ are called estimated expected frequencies, the frequencies we expect to have under the null hypothesis of independence. They have the same marginal totals as those of the data observed.

Goal.
We want to see if the two factors or variables $X_1$ and $X_2$ are related; the task we perform is a test for independence. We achieve that by comparing the observed frequencies, the $x$s, with those expected under the null hypothesis of independence, the expected frequencies, the $e$s.

This needed comparison is done through Pearson’s chi-square statistic:
$$\chi^2 = \sum_{i,j} \frac{(x_{ij} - e_{ij})^2}{e_{ij}}$$
For large samples (all $e_{ij} \ge 5$), $\chi^2$ has approximately a chi-square distribution under the null hypothesis of independence, with degrees of freedom
$$df = (I - 1)(J - 1)$$
and with greater values leading to a rejection of $H_0$.

6.6 Fisher’s Exact Test
Even with a continuity correction, goodness-of-fit test statistics such as Pearson’s $\chi^2$ are not suitable when the sample is small. Generally, statisticians suggest using them only if no expected frequency in the table is less than 5. For studies with small samples, we introduce a method known as Fisher’s exact test. For tables in which use of the chi-square test is appropriate, the two tests give very similar results.

Goal. We want to find the exact significance level associated with an observed table. The central idea is to enumerate all possible outcomes consistent with the observed marginal totals and add up the probabilities of those tables that are as extreme as or more extreme than the one observed.

Conditional on the margins, a 2 × 2 table is a one-dimensional random variable having a known distribution, so the exact test is relatively easy to implement. The probability of observing a table with cells $a$, $b$, $c$, and $d$ (with total $n$) is
$$\Pr(a, b, c, d) = \frac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{n!\,a!\,b!\,c!\,d!}$$
The process for doing hand calculations would be as follows:

1. Rearrange the rows and columns of the table observed so the smaller row total is in the first row and the smaller column total is in the first column.
2. Start with the table having 0 in the (1, 1) cell (top left cell). The other cells in this table can be calculated from the fixed row and column margins.
3. Construct the next table by increasing the (1, 1) cell from 0 to 1 and adjusting the other cells accordingly.
4. Continue to increase the (1, 1) cell by 1 until one of the other cells becomes 0. At that point we have enumerated all possible tables.
5. Calculate and add up the probabilities of those tables with cell (1, 1) having values from 0 to the observed frequency (left side for a one-sided test); double the smaller side for a two-sided test.

6.7 Ordered 2 × k Contingency Tables
In general, consider an ordered 2 × k table with the frequencies shown in Table 6.18. The number of concordances is calculated by
$$C = a_1(b_2 + \cdots + b_k) + a_2(b_3 + \cdots + b_k) + \cdots + a_{k-1}b_k.$$
The number of discordances is
$$D = b_1(a_2 + \cdots + a_k) + b_2(a_3 + \cdots + a_k) + \cdots + b_{k-1}a_k.$$
To perform the test, we calculate the statistic
$$S = C - D$$
then standardize it to obtain
$$z = \frac{S - \mu_S}{\sigma_S}$$
where $\mu_S = 0$ is the mean of $S$ under the null hypothesis and
$$\sigma_S = \sqrt{\frac{AB}{3N(N - 1)}\left(N^3 - n_1^3 - n_2^3 - \cdots - n_k^3\right)}$$
The standardized z score is distributed as standard normal if the null hypothesis is true.
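The calculation of C, D, S, and the standardized z score can be sketched in Python as follows. Table 6.18 is not reproduced in this transcript, so the symbol meanings are assumptions: A and B are taken to be the totals of the two rows (a's and b's), n_1, ..., n_k the column totals, and N the grand total. The function name `ordered_2xk_z` and the counts are hypothetical.

```python
import math

def ordered_2xk_z(a, b):
    """Concordance/discordance test for an ordered 2 x k table.

    a, b: the two rows of frequencies (lists of equal length k).
    Assumed meanings: A = sum(a), B = sum(b), n_j = a_j + b_j, N = A + B.
    """
    k = len(a)
    # Concordances: each a_i paired with all b_j in later columns (j > i)
    C = sum(a[i] * sum(b[i + 1:]) for i in range(k - 1))
    # Discordances: each b_i paired with all a_j in later columns (j > i)
    D = sum(b[i] * sum(a[i + 1:]) for i in range(k - 1))
    S = C - D
    A, B = sum(a), sum(b)
    col_totals = [ai + bi for ai, bi in zip(a, b)]
    N = A + B
    sigma_S = math.sqrt(
        A * B / (3 * N * (N - 1)) * (N ** 3 - sum(nj ** 3 for nj in col_totals))
    )
    return S, sigma_S, S / sigma_S  # mu_S = 0 under the null hypothesis

# Hypothetical ordered 2 x 3 table
S, sigma_S, z = ordered_2xk_z([10, 20, 30], [30, 20, 10])
print(S, sigma_S, z)
```

The sign of z depends on which row is labeled a and which b, so the direction of the one-sided alternative has to be matched to that labeling.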
For a one-sided alternative, which is a natural choice for this type of test, the null hypothesis is rejected at the 5% level if $z > 1.65$ (or $z < -1.65$ if the terms concordance and discordance are switched).

References
1. Le and Eberly (2016). Introductory Biostatistics (2nd ed.). Wiley and Sons.