Statistics in Health Sciences A2-G01 2023 PDF

Statistics in Health Sciences A2-G01
Andrea Gonzalez (1603921) & Alejandro Donaire (1600697)
3rd December 2023

Contents

1 Introduction
2 Comparing incidence rates
  2.1 Adding extra columns to Table 1
  2.2 Informative sentences for age group 45-54
  2.3 Informative sentences for age group 75-84
  2.4 Graphical visualization of the Incidence Rate Ratio
  2.5 Questions and answers related to Figure 1
3 Modeling incidence rates with a Generalized Linear Model
  3.1 Remarks regarding the shape of the curve in Figure 1
  3.2 Deriving the form of the expected IRR according to the GLM
  3.3 Expected sign for $\delta_{1j}$
  3.4 Fitting the proposed GLM
  3.5 Extracting the estimated parameters of the fitted GLM
  3.6 Estimated IRRs using the fitted GLM and Formula 1
  3.7 Solving again questions 2 and 3 in Section 2 using the GLM results
4 Discussion
5 References
A R code
  A.1 Loading the data
  A.2 Generating the initial table
  A.3 Adding extra columns to the initial table
  A.4 Generating output for sentences of age group 45-54
  A.5 Generating output for sentences of age group 75-84
  A.6 Graphical visualization of the IRR
  A.7 Logic to partially generate the answers related to Figure 1
  A.8 Predictions for age groups 45-54 and 75-84 using the fitted GLM

1 Introduction

There is nothing to do in this section. However, we take advantage of this space to load the data and replicate the first table, so that it can be referenced and expanded later on. Only the table is shown here. The code that loads the data and creates the table can be found in Sections A.1 and A.2 of the Appendix, respectively.

Table 1: Data on coronary death rates, corresponding to the breslow dataset [2], originally from Doll and Hill (1961) [1].

            Person-years             Coronary deaths
Age      Nonsmokers   Smokers     Nonsmokers   Smokers
35-44         18790     52407              2        32
45-54         10673     43248             12       104
55-64          5710     28612             28       206
65-74          2585     12663             28       186
75-84          1462      5317             31       102

2 Comparing incidence rates

2.1 Adding extra columns to Table 1

Next, Table 1 is expanded to include the sample coronary deaths per 1000 person-years for smokers and non-smokers in each age group, as well as the resulting incidence rate ratio. The code to create the table can be found in Section A.3 of the Appendix.

Table 2: Data on coronary death rates. The incidence rates correspond to the sample coronary death rates per 1000 person-years. The last column, 'IRR', corresponds to the Incidence Rate Ratio. Data from Doll and Hill (1961) [1].

            Person-years            Coronary deaths         Incidence Rate
Age      Non-smokers   Smokers   Non-smokers   Smokers   Non-smokers   Smokers     IRR
35-44          18790     52407             2        32         0.106     0.611   5.737
45-54          10673     43248            12       104         1.124     2.405   2.139
55-64           5710     28612            28       206         4.904     7.200   1.468
65-74           2585     12663            28       186        10.832    14.688   1.356
75-84           1462      5317            31       102        21.204    19.184   0.905

2.2 Informative sentences for age group 45-54

The figures in the following sentences are generated automatically from the data. The code that preprocesses the data and generates them can be found in Section A.4 of the Appendix.

(a) The incidence rate among smokers was 2.405 per 1000 person-years.
(b) The incidence rate among non-smokers was 1.124 per 1000 person-years.
(c) The incidence rate ratio was 2.139.
(d) The p-value of the Wald test to decide if the incidence rate among smokers is the same as among non-smokers was 0.011.
(e) For age group 45-54, the incidence rate of coronary deaths was 2.405 per 1000 person-years among smokers and 1.124 per 1000 person-years among non-smokers. This resulted in an incidence rate ratio of 2.139, which means that smokers died of coronary causes at 2.139 times the rate of non-smokers. Using the Wald test to statistically analyze these results, a p-value of 0.011 was obtained. This indicates that the observed increase in coronary deaths among smokers in the 45-54 age group is unlikely due to chance, underscoring a potential causal relationship.

2.3 Informative sentences for age group 75-84

Just like in Section 2.2, the figures in the following sentences are generated automatically from the data. The code that preprocesses the data and generates them can be found in Section A.5 of the Appendix.

(a) The incidence rate among smokers was 19.184 per 1000 person-years.
(b) The incidence rate among non-smokers was 21.204 per 1000 person-years.
(c) The incidence rate ratio was 0.905.
(d) The p-value of the Wald test to decide if the incidence rate among smokers is the same as among non-smokers was 0.625.
(e) For age group 75-84, the incidence rate of coronary deaths was 19.184 per 1000 person-years among smokers and 21.204 per 1000 person-years among non-smokers. This resulted in an incidence rate ratio of 0.905, which means that non-smokers died of coronary causes at a rate about 11% higher than smokers (1/0.905 ≈ 1.105). Using the Wald test to statistically analyze these results, however, a p-value of 0.625 was obtained. This suggests that the observed difference in coronary deaths between smokers and non-smokers in the 75-84 age group is potentially due to chance.

2.4 Graphical visualization of the Incidence Rate Ratio

The following is a graphical visualization of the column 'IRR' of Table 2. The code to generate this visualization is in Section A.6 of the Appendix.

[Figure 1 image: two line plots under the common title "Incidence Rate Ratio (IRR) of Coronary Deaths in Smokers vs. Non-smokers by Age Group". x-axis: Age Group (35-44, 45-54, 55-64, 65-74, 75-84); y-axis: IRR.]

Figure 1: The incidence rate ratios are calculated as the ratio of the incidence rate of coronary deaths in smokers to the incidence rate of coronary deaths in non-smokers. The left plot shows the incidence rate ratios on the natural scale, while the right plot shows them on a logarithmic scale. Data from Doll and Hill (1961) [1].

2.5 Questions and answers related to Figure 1

Some remarks can be made regarding Figure 1. Some of the answers to these questions are automated and depend on the data. The code that manages the logic behind the answers can be found in Section A.7 of the Appendix.

(a) Is the rate ratio greater than 1 for all age groups? What does it mean?

No, the rate ratio is not greater than 1 for all age groups. For group 75-84 the rate ratio is 0.905.
This means that for this group the incidence of coronary deaths is actually greater in the group of non-smokers than in the group of smokers, i.e., that smoking has a protective effect, contrary to common sense. This might be due to other factors not taken into account here, or simply to chance, as mentioned in part (e) of Section 2.3.

(b) Is the rate ratio constant over age groups? What does it mean?

No, the rate ratio does not appear to be constant over the age groups. In particular, the older the age group, the lower the rate ratio. The points form a curve that is approximately exponential on the natural scale, which is the same as saying that it is approximately linear when the IRR is plotted on a logarithmic scale, as can be seen in Figure 1. This might be due to the fact that, as people age, the probability of dying of coronary heart disease increases irrespective of whether they smoke or not. This would inevitably diminish the relative importance of smoking, resulting in a lower IRR. In contrast, if the rate ratio were constant, it would suggest that the probability of suffering a coronary death caused by factors other than smoking is the same across age groups.

3 Modeling incidence rates with a Generalized Linear Model

3.1 Remarks regarding the shape of the curve in Figure 1

In order to connect the exponential variation of the incidence rate with Figure 1, which displays the IRR, it is useful to consider the definition of the IRR for a given age group $j$:

\[ \mathrm{IRR}_j = \frac{Ir_{1j}}{Ir_{0j}} \]

This definition shows that $\mathrm{IRR}_j$ is a ratio of incidence rates. If we assume that the incidence rates follow an exponential function, then we have a ratio of exponentials, which is itself an exponential. This means that we should expect $\mathrm{IRR}_j$, that is, the curve in Figure 1, to follow an exponential shape.

There are some indicators suggesting that the values in Figure 1 do vary exponentially. Firstly, a rapid decrease, followed by a slow decrease, can be observed.
Most importantly, when the incidence rate ratio is represented on a logarithmic scale, the variation of the values becomes approximately linear. Additionally, on the natural scale, each value is obtained from the previous one by dividing by a roughly similar factor. In summary, even with limited data, an exponential trend can be identified. Thus, the assumption that the incidence rate varies exponentially with age is consistent with Figure 1.

3.2 Deriving the form of the expected IRR according to the GLM

This demonstration is done in two parts. With $j$ denoting the age group, the first part treats the case $j = 0$, and the second the case $j = k$, for $k = 1, 2, 3, 4$. Both cases share the same starting point: we take the definition of the IRR and substitute the expected values of the incidence rates given by the proposed model.

\[ \mathrm{IRR}_j = \frac{Ir_{1j}}{Ir_{0j}} = \frac{E(Ir_{1j})}{E(Ir_{0j})} \]

We now develop this expression for the two cases described above.

For $j = 0$

\[ \frac{E(Ir_{10})}{E(Ir_{00})} = \frac{\exp(L(E_1, A_0))}{\exp(L(E_0, A_0))} = \frac{\exp(\alpha + \beta_1)}{\exp(\alpha)} = \exp(\beta_1) \]

Here we have used the properties of the exponential and substituted the expanded form of the linear predictor.

For $j = k$

\[ \frac{E(Ir_{1k})}{E(Ir_{0k})} = \frac{\exp(L(E_1, A_k))}{\exp(L(E_0, A_k))} = \frac{\exp(\alpha + \beta_1 + \gamma_k + \delta_{1k})}{\exp(\alpha + \gamma_k)} = \exp(\beta_1 + \delta_{1k}) \]

As in the case $j = 0$, we have used the properties of the exponential and the expanded form of the linear predictor. In both cases, we have also used the definition of the indicator function $1_{x=a}$ to determine which parameters remain. This is equivalent to what we wanted to prove, namely:

\[ \mathrm{IRR}_j = \begin{cases} \exp(\beta_1), & \text{if } j = 0 \\ \exp(\beta_1 + \delta_{1j}), & \text{if } j \neq 0 \end{cases} \tag{1} \]

3.3 Expected sign for $\delta_{1j}$

To determine the sign we expect for $\delta_{1j}$, we consider all three cases (zero, positive or negative), see what each implies theoretically, and then compare that with Figure 1.
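The three cases considered in this section can also be previewed numerically. The following is a minimal sketch of Formula 1 in R, not part of the original analysis: the function name `irr_from_formula` and the values chosen for $\beta_1$ and $\delta_{1j}$ are hypothetical and serve only to illustrate how the sign of $\delta_{1j}$ moves the IRR relative to the first age group.

```r
# Formula 1: IRR_0 = exp(beta1) and IRR_j = exp(beta1 + delta_1j) for j = 1, ..., 4.
# `delta` holds (delta_11, ..., delta_14); the reference group j = 0 has no delta term.
# (Hypothetical helper, introduced only for this illustration.)
irr_from_formula <- function(beta1, delta) {
  exp(beta1 + c(0, delta))
}

beta1 <- log(2)  # hypothetical value, chosen so that IRR_0 = 2

irr_zero <- irr_from_formula(beta1, rep(0, 4))     # all delta_1j = 0
irr_pos  <- irr_from_formula(beta1, rep(0.5, 4))   # all delta_1j > 0
irr_neg  <- irr_from_formula(beta1, rep(-0.5, 4))  # all delta_1j < 0

# Compare each later age group with the first one
print(rbind(zero = irr_zero, positive = irr_pos, negative = irr_neg))
```

With these illustrative values, the first row is constant, the second lies above its first entry for every later age group, and the third lies below it.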

For $\delta_{1j} = 0$

In this case, for all $j$ we obtain the following expression:

\[ \mathrm{IRR}_j = \exp(\beta_1) \]

This is simply a constant over the age groups, so we should expect a horizontal line in Figure 1.

For $\delta_{1j} > 0$

When $j = 0$, we simply have $\exp(\beta_1)$, which does not depend on the age group. To simplify the notation, we call this constant $c$. When $j$ is greater than 0, we have $\exp(\beta_1 + \delta_{1j})$, which can be expanded to $\exp(\beta_1) \cdot \exp(\delta_{1j})$ or, equivalently, $c \cdot \exp(\delta_{1j})$. Since the exponential of a value greater than zero is greater than one, $\exp(\delta_{1j})$ is greater than one. Therefore, $c \cdot \exp(\delta_{1j})$ must be greater than $c$. This implies that, for all age groups except the first one, we should see IRR values in Figure 1 greater than the one corresponding to the first age group.

For $\delta_{1j} < 0$

Contrary to the previous case, $\delta_{1j}$ is less than zero, so $\exp(\delta_{1j})$ is less than one. As a result, $c \cdot \exp(\delta_{1j})$ is less than $c$. Accordingly, for all age groups except the first one, we should expect IRR values in Figure 1 lower than the one corresponding to the first age group.

Of the three cases, only the last one matches what we actually see. Therefore, the sign that we expect for $\delta_{1j}$ is negative.

3.4 Fitting the proposed GLM

In this section, the specified GLM is fitted. The estimated coefficients of the model are extracted using the function coefficients() and saved to a variable so that they can be used in subsequent sections.

> # Fitting the specified GLM
> mod <- glm(deaths ~ smoker * age + offset(log(personYears)),
+            family = poisson,
+            data = breslow)
> # Saving the coefficients in a variable for later use
> coefs <- coefficients(mod)

3.5 Extracting the estimated parameters of the fitted GLM

\hat{\alpha} = -9.1479329
\hat{\beta}_1 = 1.7468734
\hat{\gamma}_1 = 2.3573671
\hat{\gamma}_2 = 3.8301631
\hat{\gamma}_3 = 4.6226566
\hat{\gamma}_4 = 5.2943595
\hat{\delta}_{11} = -0.9866229
\hat{\delta}_{12} = -1.3628089
\hat{\delta}_{13} = -1.44229
\hat{\delta}_{14} = -1.8469916

3.6 Estimated IRRs using the fitted GLM and Formula 1

\widehat{\mathrm{IRR}}_{35-44} = 5.7366382
\widehat{\mathrm{IRR}}_{45-54} = 2.1388118
\widehat{\mathrm{IRR}}_{55-64} = 1.4682401
\widehat{\mathrm{IRR}}_{65-74} = 1.3560598
\widehat{\mathrm{IRR}}_{75-84} = 0.9047304

3.7 Solving again questions 2 and 3 in Section 2 using the GLM results

For age group 45-54:

Using the model, we can predict the incidence rate for a given smoking status $i$ and age group $j$. In particular, the incidence rate among smokers in the age group 45-54 corresponds to $i = 1$ and $j = 1$. In this case, the prediction provided by the GLM is:

\[ \exp(L(E_1, A_1)) = \exp(\alpha + \beta_1 + \gamma_1 + \delta_{11}) \]

Similarly, the incidence rate among non-smokers in the age group 45-54 corresponds to $i = 0$ and $j = 1$, which translates to the following prediction:

\[ \exp(L(E_0, A_1)) = \exp(\alpha + \gamma_1) \]

The Incidence Rate Ratio can be calculated using the expression proved in Section 3.2. For the age group 45-54 in particular, the following is obtained:

\[ \mathrm{IRR}_1 = \exp(\beta_1 + \delta_{11}) \]

The actual predictions can be calculated using the code in Section A.8 of the Appendix. The following is a summary of the results:

(a) The incidence rate among smokers was 2.405 per 1000 person-years.
(b) The incidence rate among non-smokers was 1.124 per 1000 person-years.
(c) The incidence rate ratio was 2.139.

For age group 75-84:

In the case of the age group 75-84, the incidence rate among smokers can be determined by setting $i = 1$ and $j = 4$.
The corresponding prediction made by the GLM is:

\[ \exp(L(E_1, A_4)) = \exp(\alpha + \beta_1 + \gamma_4 + \delta_{14}) \]

Similarly, for non-smokers in the age group 75-84, setting $i = 0$ and $j = 4$ yields the following prediction:

\[ \exp(L(E_0, A_4)) = \exp(\alpha + \gamma_4) \]

The Incidence Rate Ratio (IRR) for the age group 75-84 can be calculated using the formula derived in Section 3.2, i.e., Formula 1:

\[ \mathrm{IRR}_4 = \exp(\beta_1 + \delta_{14}) \]

The actual predictions can be obtained using the code provided in Section A.8 of the Appendix. A summary of the results is presented below:

(a) The incidence rate among smokers was 19.184 per 1000 person-years.
(b) The incidence rate among non-smokers was 21.204 per 1000 person-years.
(c) The incidence rate ratio was 0.905.

In both age groups the predictions coincide with the sample values of Sections 2.2 and 2.3, because the model with the smoking-by-age interaction has as many parameters as there are cells in Table 1.

4 Discussion

In this analysis, we have examined whether smoking is a risk factor for coronary death, first through a descriptive analysis of the data and then through a Poisson GLM with a logarithmic link. We accounted for age, a potential confounder, by including it in the model together with its interaction with smoking. We found that smoking is positively associated with coronary death in all age groups except those aged 75-84, for whom the association was negative, although the Wald test for this group (p-value 0.625) does not rule out chance. The declining rate ratio might be explained by the fact that, as people age, the probability of dying of coronary heart disease increases irrespective of whether they smoke or not. The most dramatic difference was observed in the youngest age group (35-44), where the coronary death rate among smokers was almost 6 times as high as the rate among non-smokers. Finally, this study has some limitations, including the exclusive focus on male subjects and the dated nature of the data, which may not accurately reflect current tobacco characteristics or population dynamics, and which therefore weaken the conclusiveness of the results. We propose repeating this study with updated and more inclusive data.

5 References

[1] Doll R, Hill A.B. Mortality of British doctors in relation to smoking: Observations on coronary thrombosis.
National Cancer Institute Monograph. 1966;19:205-268.
[2] Breslow N.E. Cohort Analysis in Epidemiology. In: Atkinson A.C., Fienberg S.E. (editors). A Celebration of Statistics. Springer-Verlag. 1985;109-143.

A R code

A.1 Loading the data

> # Packages assumed by the code in this appendix (not shown in the original listing)
> library(knitr)       # kable()
> library(kableExtra)  # add_header_above(), kable_styling(), %>%
> library(epitools)    # rateratio.wald()
> # Read the data
> load("breslow.RData")
> # Assign the data to a variable
> data <- breslow

A.2 Generating the initial table

> # Extracting the columns in the desired format
> Age <- unique(data$age)
> NS_PY <- data[data$smoker == "no", ]['personYears']
> S_PY <- data[data$smoker == "yes", ]['personYears']
> NS_CD <- data[data$smoker == "no", ]['deaths']
> S_CD <- data[data$smoker == "yes", ]['deaths']
> # Creating a data frame to store and sort the columns
> data_processed <- data.frame(Age, NS_PY, S_PY, NS_CD, S_CD)
> # Converting the data frame into a table
> kable(data_processed,
+       col.names = c("Age", "Nonsmokers", "Smokers", "Nonsmokers", "Smokers"),
+       caption = "Data on coronary death rates, corresponding to the \\texttt{breslow} dataset \\cite{breslow}, originally from Doll and Hill (1961) \\cite{doll_hill}.",
+       booktabs = TRUE, label = "initial_table") %>%
+   add_header_above(c(" " = 1, "Person-years" = 2, "Coronary deaths" = 2)) %>%
+   kable_styling(latex_options = "hold_position")

A.3 Adding extra columns to the initial table

> # Extracting sample coronary death rates per 1000 person-years (non-smokers)
> NS_IR <- (NS_CD / NS_PY) * 1000
> # Extracting sample coronary death rates per 1000 person-years (smokers)
> S_IR <- (S_CD / S_PY) * 1000
> # Extracting incidence rate ratio for smokers vs. non-smokers
> IRR <- S_IR / NS_IR
> # Merging the new columns with the data frame created previously
> data_processed <- cbind(data_processed, NS_IR, S_IR, IRR)
> # Create the table using kable
> kable(data_processed,
+       col.names = c("Age", "Non-smokers", "Smokers", "Non-smokers", "Smokers", "Non-smokers", "Smokers", " "),
+       caption = "Data on coronary death rates. The incidence rates correspond to the sample coronary death rates per 1000 person-years. The last column, 'IRR', corresponds to the Incidence Rate Ratio. Data from Doll and Hill (1961) \\cite{doll_hill}.",
+       booktabs = TRUE, digits = 3, label = "extra_cols_table") %>%
+   add_header_above(c(" " = 1, "Person-years" = 2, "Coronary deaths" = 2, "Incidence Rate" = 2, "IRR" = 1)) %>%
+   kable_styling(latex_options = "hold_position")

A.4 Generating output for sentences of age group 45-54

> # Renaming columns so it is easier to retrieve information
> # (the "PS_" prefix denotes the person-years columns)
> colnames(data_processed) <- c("Age", "PS_NS", "PS_S", "CD_NS", "CD_S", "IR_NS", "IR_S", "IRR")
> # Determining the age group
> age_group <- '45-54'
> # Selecting data from specific age group
> selected_data <- data_processed[data_processed$Age == age_group, ]
> # Preparing the data for the Wald test
> disease_counts <- c(Unexposed = as.numeric(selected_data['CD_NS']),
+                     Exposed = as.numeric(selected_data['CD_S']))
> person_years <- c(Unexposed = as.numeric(selected_data['PS_NS']),
+                   Exposed = as.numeric(selected_data['PS_S']))
> # Wald test to decide if the incidence rate among smokers is the same as among non-smokers
> wald_test <- rateratio.wald(disease_counts, person_years)
> # Creating a custom function to handle the conclusion. It is different depending on the
> # final p-value obtained when performing the Wald test.
> conclusion <- function(p_value, age_group) {
+   # The direction of the effect is read from the global variable selected_data
+   if (selected_data['IRR'] > 1) {population <- "smokers"} else {population <- "non-smokers"}
+   if (p_value < 0.05) {
+     return(paste("This indicates that the observed increase in coronary deaths among", population, "in the", age_group, "age group is unlikely due to chance, underscoring a potential causal relationship."))
+   } else {
+     return(paste("This suggests that the observed difference in coronary deaths between smokers and non-smokers in the", age_group, "age group is potentially due to chance."))
+   }
+ }

A.5 Generating output for sentences of age group 75-84

> # Determining the age group
> age_group <- '75-84'
> # Selecting data from specific age group
> selected_data <- data_processed[data_processed$Age == age_group, ]
> # Preparing the data for the Wald test
> disease_counts <- c(Unexposed = as.numeric(selected_data['CD_NS']),
+                     Exposed = as.numeric(selected_data['CD_S']))
> person_years <- c(Unexposed = as.numeric(selected_data['PS_NS']),
+                   Exposed = as.numeric(selected_data['PS_S']))
> # Wald test to decide if the incidence rate among smokers is the same as among non-smokers
> wald_test <- rateratio.wald(disease_counts, person_years)

A.6 Graphical visualization of the IRR

> # Defining the data that is going to be used
> age_groups <- as.vector(data_processed$Age)
> IRR <- as.vector(data_processed$IRR)
> par(mfrow = c(1, 2), mar = c(5, 5, 4, 2))
> # Plot 1: Natural scale
> plot(IRR, type = "b", xaxt = "n", ylim = range(IRR), xlab = "Age Group", ylab = "IRR",
+      cex.main = 2, cex.lab = 2, cex.axis = 1.7, cex = 2)
> axis(1, at = seq_along(age_groups), labels = age_groups, cex.axis = 1.7)
> # Plot 2: Logarithmic scale
> plot(IRR, type = "b", xaxt = "n", log = "y", ylim = range(IRR), xlab = "Age Group", ylab = "IRR",
+      cex.main = 2, cex.lab = 2, cex.axis = 1.7, cex = 2)
> axis(1, at = seq_along(age_groups), labels = age_groups, cex.axis = 1.7)
> # Adding a common title to both plots
> title("Incidence Rate Ratio (IRR) of Coronary Deaths in Smokers vs. Non-smokers by Age Group",
+       outer = TRUE, line = -2, cex.main = 2.2)

A.7 Logic to partially generate the answers related to Figure 1

> # First question
> first_answer <- function(age_groups, IRR){
+   # All greater than 1
+   if(all(IRR > 1)){
+     return(paste("Yes, the rate ratio is greater than 1 for all age groups. This, a priori, indicates that, for all age groups, there seems to be a greater incidence of coronary deaths for smokers compared to non-smokers"))
+   # Some are less than 1
+   } else {
+     IRRs_selected <- IRR[IRR < 1]
+     age_groups_selected <- age_groups[IRR < 1]
+     # Number of age groups with an IRR below 1
+     n <- length(age_groups_selected)
+     # Deciding between exactly one group below 1, or several groups below 1
+     if (n == 1) {
+       group_word <- "group"
+       this_word <- "this"
+       group_values <- age_groups_selected
+       group_IRRs <- paste0(sprintf("%.3f", IRRs_selected), ".")
+     } else {
+       group_word <- "groups"
+       this_word <- "these"
+       # "a, b and c" style enumeration of the selected groups and their IRRs
+       group_values <- paste(paste(head(age_groups_selected, n - 1), collapse = ", "), "and",
+                             tail(age_groups_selected, 1))
+       group_IRRs <- paste(paste(sprintf("%.3f", head(IRRs_selected, n - 1)), collapse = ", "), "and",
+                           sprintf("%.3f", tail(IRRs_selected, 1)), "respectively.")
+     }
+     return(paste("No, the rate ratio is not greater than 1 for all age groups. For", group_word,
+                  group_values, "the rate ratio is", group_IRRs, "This means that for", this_word,
+                  group_word, "the incidence of coronary deaths is actually greater in the group of non-smokers than in the group of smokers, i.e., that smoking has a protective effect, contrary to common sense. This might be due to other factors not taken into account here, or simply to chance, as mentioned in part (e) of Section \\ref{sec:sentences_75_84}."))
+   }
+ }

A.8 Predictions for age groups 45-54 and 75-84 using the fitted GLM

> # For age group 45-54 (rates are scaled to deaths per 1000 person-years)
> S_IR_45_54 <- 1000 * exp(coefs["(Intercept)"] + coefs["smokeryes"] + coefs["age45-54"] +
+                          coefs["smokeryes:age45-54"])
> NS_IR_45_54 <- 1000 * exp(coefs["(Intercept)"] + coefs["age45-54"])
> IRR_45_54 <- exp(coefs["smokeryes"] + coefs["smokeryes:age45-54"])
> # For age group 75-84 (rates are scaled to deaths per 1000 person-years)
> S_IR_75_84 <- 1000 * exp(coefs["(Intercept)"] + coefs["smokeryes"] + coefs["age75-84"] +
+                          coefs["smokeryes:age75-84"])
> NS_IR_75_84 <- 1000 * exp(coefs["(Intercept)"] + coefs["age75-84"])
> IRR_75_84 <- exp(coefs["smokeryes"] + coefs["smokeryes:age75-84"])
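
As a cross-check on the hand-built expressions in A.8, the same quantities could plausibly be obtained with predict(). The following is a sketch under stated assumptions rather than the code used for the report: the data frame `breslow_tbl` is typed in by hand from Table 1 so that the example does not depend on breslow.RData, and the names `breslow_tbl`, `mod_check`, `new_rows` and `irr_check` are introduced here only for illustration.

```r
# Data from Table 1, entered by hand so the example is self-contained.
# Column names mirror those used in the model formula of Section 3.4.
breslow_tbl <- data.frame(
  age = rep(c("35-44", "45-54", "55-64", "65-74", "75-84"), times = 2),
  smoker = rep(c("no", "yes"), each = 5),
  personYears = c(18790, 10673, 5710, 2585, 1462,
                  52407, 43248, 28612, 12663, 5317),
  deaths = c(2, 12, 28, 28, 31, 32, 104, 206, 186, 102)
)

# Same model specification as in Section 3.4
mod_check <- glm(deaths ~ smoker * age + offset(log(personYears)),
                 family = poisson, data = breslow_tbl)

# Setting personYears = 1000 makes the predicted count equal to the
# predicted rate per 1000 person-years.
new_rows <- expand.grid(age = c("45-54", "75-84"), smoker = c("no", "yes"),
                        stringsAsFactors = FALSE)
new_rows$personYears <- 1000
new_rows$rate_per_1000 <- predict(mod_check, newdata = new_rows, type = "response")

# IRR per age group: predicted smoker rate over predicted non-smoker rate
irr_check <- with(new_rows,
                  rate_per_1000[smoker == "yes"] / rate_per_1000[smoker == "no"])
names(irr_check) <- c("45-54", "75-84")

print(new_rows)
print(round(irr_check, 3))
```

Because the offset is part of the model formula, predict() applies log(personYears) from the new data, so no manual multiplication by 1000 is needed in this variant.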
