🎧 New: AI-Generated Podcasts Turn your study notes into engaging audio conversations. Learn more

Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...

Document Details

PraiseworthyHammeredDulcimer

Uploaded by PraiseworthyHammeredDulcimer

Tags

statistics health sciences data analysis

Full Transcript

Statistics in Health Sciences A3-G01 Andrea Gonzalez (1603921) & Alejandro Donaire (1600697) December 27th 2023 Contents 1 Introduction 2 Head circumference growth 2.1 Data . . . . . . . . . . . . 2.2 Data visualization . . . . 2.2.1 Creation . . . . . . 2.2.2 Interpretation . . . 2.3 Modeling ....

Statistics in Health Sciences A3-G01 Andrea Gonzalez (1603921) & Alejandro Donaire (1600697) December 27th 2023 Contents 1 Introduction 2 Head circumference growth 2.1 Data . . . . . . . . . . . . 2.2 Data visualization . . . . 2.2.1 Creation . . . . . . 2.2.2 Interpretation . . . 2.3 Modeling . . . . . . . . . 2.3.1 Model fitting . . . 2.4 Results . . . . . . . . . . . 1 in children . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 1 1 1 2 2 3 3 Girth and volume in trees 3.1 Expected value for β . . . . . . . . . . . . . . . . . 3.2 Transformation to get a linear regression model . . 3.3 Estimation of β . . . . . . . . . . . . . . . . . . . . 3.4 Goodness of fit . . . . . . . . . . . . . . . . . . . . 3.5 Effect of a 10% increase in the girth on the volume 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 4 4 5 5 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A R code A.1 Visualization of the head circumference data . . . . . . . . . . . . . . . . . . . . . . . . . A.2 Models for the relationship between head circumference and age in girls and boys . . . . . A.3 Visualization of the models relating head circumference and age for each gender . . . . . . A.4 Information extraction from the head circumference growth models . . . . . . . . . . . . . A.5 Beta (β) estimation in the linear regression model for the trees data . . . . . . . . . . . A.6 Information extraction and pretty printing the p-values from the linear regression model (tree data) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.7 Visualization of the residuals from the linear regression model (tree data) . . . . . . . . . A.8 Calculation of the effect of a 10% increase in the girth on the volume . . . . . . . . . . . . A.9 Summary for the relationship of girth and volume of a tree according to the linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 6 6 6 7 7 8 8 8 8 1 Introduction Nothing to do in this section. 2 Head circumference growth in children 2.1 Data In this section, the files of interest (i.e. hcfa-girls-0-5-zscores.xlsx and hcfa-boys-0-5-zscores.xlsx) will be imported and prepared for use in later sections. 1. The files are loaded into a data.frame: > data_boys <- read_xlsx("hcfa-boys-0-5-zscores.xlsx") > data_girls <- read_xlsx("hcfa-girls-0-5-zscores.xlsx") 2. They are then merged into a single data.frame, retaining only the columns for age and head circumference, as well ass adding an identifier for sex: > > > > > > > > > # Including an identifier for sex in each dataset data_boys$sex = ’M’ data_girls$sex = ’F’ # Merging the two dataframes into one data <- rbind(data_boys, data_girls) # Only keeping age, head circumference and sex data <- data[c("Month", "M", "sex")] 3. Subsequently, data at birth are excluded: > # Data at birth is data at month 0 > data <- data[data$Month != 0, ] 4. Finally, the column names are relabeled: > colnames(data) <- c("age", "hc", "sex") 2.2 Data visualization Within this section, the relationship between head circumference and age is explored visually and then interpreted. 2.2.1 Creation Figure 1 serves a graphical representation of the data. The code to generate this visualization can be found in Section A.1 of the Appendix. 2.2.2 Interpretation The first observation that can be made is that the growth of the head circumference slows down with age. In particular, the shape of the curve in Figure 1 suggests that each time age doubles, the head circumference increases by a fixed amount of centimeters. This can be more technically described as a logarithmic variation of head circumference with respect to age. An additional remark is that there seems to be a difference in head circumference between males and females. More specifically, the head circumference of males consistently surpasses that of females, and the difference remains relatively constant over time, at least over the range of ages covered by the dataset. 1 Head Circumference by Age and Sex 38 Head Circumference (cm) 40 42 44 46 48 50 Males Females 36 36 38 Head Circumference (cm) 40 42 44 46 48 50 Males Females 0 10 20 30 40 Age (months) 50 60 1 2 4 8 16 Age (months) 32 64 Figure 1: Standards for head circumference in girls and boys from birth to 5 years of age. The head circumference is plotted for various months since birth, excluding data at birth. The left plot displays months in a natural scale, while the right plot represents months in a logarithmic scale. The data is sourced from the World Health Organization. Available at https://www.who.int/tools/child-growth-standards/standards/head-circumference-for-age. 2.3 2.3.1 Modeling Model fitting The expressions for both models in R, for girls and boys, can be found in Section A.2 of the Appendix. Next, an explanation for the configuration of each argument in the tlm() function, is provided: ˆ y: The name of the response variable. In this case, is head circumference (hc). ˆ x: The name of the explanatory variable. In this case, age, which has been previously logtransformed and added to the dataset as a new column (logage). ˆ family: The link function. In this case the model is a linear model, so it should be “gaussian”. ˆ data: The data.frame containing the variables y and x. In this case, this data.frame corresponds to the variable data (see Section 2.1). However, for each sex, only a subset of the data has been used, filtering for males or females when needed. ˆ xpow: Indicates whether the variable x has been transformed. For this case, a 0 should be specified to denote a logarithmic transformation. ˆ ypow: Similar to xpow. This variable has not been transformed, so a 1 should be specified. Figure 2 shows the fitted model for each gender in the original variable space, along the real observations. The R code that generates this plot can be found in Section A.3 of the Appendix. Analyzing Figure 2, it can be seen that the fit of the models is rather good, being slightly better in females than in males. This makes even more evident the rapid growth in head size during early years, later leveling off as age increases. Despite minor deviations between the observed data and the fitted model, the assumption that head circumference grows logarithmically with age appears to be appropriate. 2 Fitted models for each sex with log−transformed independent variable 50 Females Males Model Observations 36 38 38 Head Circumference (cm) 40 42 44 46 48 Head Circumference (cm) 40 42 44 46 48 50 Model Observations 0 Figure 2: 2.4 10 20 30 40 Age (months) 50 60 0 10 20 30 40 Age (months) 50 60 The plot shows two different fitted models (solid red lines), one for each gender (left for girls and right for boys). These models describe the relationship between standards for head circumference and age (from birth to 5 years, excluding data at birth). The same model has been used for both genders, namely, a linear regression with log-transformed independent variable. The real observations have been included for comparison as small, dark circles on the plot. The data is sourced from the World Health Organization. Available at https://www.who.int/tools/child-growth-standards/standards/head-circumference-for-age Results Under the fitted models, with each 2-fold increase in age, the mean head circumference increases by 2.31 centimeters in males and 2.35 centimeters in females. In multiple repetitions of this experiment, we would expect to find the true value between 2.26 and 2.36 centimeters in 95% of the cases (95% CI) for males, and between 2.32 and 2.38 centimeters for females.1 The code that has been used to extract these results from the fitted models is available in Section A.4 of the Appendix. 1 More information regarding confidence intervals can be found at: https://en.wikipedia.org/wiki/Confidence interval 3 3 3.1 Girth and volume in trees Expected value for β Making the simplied assumption that trees are cylinders, it is then easy to establish a relation between volume and girth. Let’s consider a circular right cylinder like the one in Figure 3, with height h and diameter d = 2r. Then its volume is V = πr2 h and its girth is G = 2πr, which corresponds to the circumference of its circular cross-section. By solving for r in the girth equation and substituting the result into the volume equation, we derive r = G/2π, leading to V = π(G/2π)2 h. This can be simplified to: V = 1 2 G h 4π Therefore, we see that the girth, G, is raised to the power of 2 in the equation obtained for the volume of a simplified Figure 3: A circular right cylinder of height h and diameter d = 2r (See Wikipedia). tree. This indicates that, before doing any data analysis, we should expect a value relatively close to 2 for β. 3.2 Transformation to get a linear regression model The proposed power law model, as stated by the assignment instructions, is expressed as: V = γGβ δ, δ ∼ Lognormal(0, σ 2 ) To transform this model into a linear regression model we can begin by taking logarithms as follows: log(V ) = log(γGβ δ) Utilizing the properties logb (mn) = logb (m) + logb (n), and that logb (mp ) = p logb (m), we obtain: log(V ) = log(γ) + β log(G) + log(δ) Finally, renaming for convenience, and noting that the logarithm of a lognormal random variable is normally distributed, we achieve the desired linear regression form: log(V ) = α + β log(G) + ϵ, ϵ ∼ Normal(0, σ 2 ) Where, α := log(γ) and ϵ := log(δ). 3.3 Estimation of β Using the code in Section A.5 of the Appendix, the estimation for β obtained through the tlm() function is 2.20, with a corresponding 95% confidence interval ranging from 2.02 to 2.38. Indeed, the value of β is close to 2 as hypothesized in Section 3.1 by making the simplified assumption that trees can be approximated by cylinders. 4 3.4 Goodness of fit To evaluate the goodness of the fit for the fitted model several things can be examined. The first thing that can be considered is the significance of the coefficients. In this case, p-values of 4.18 · 10−11 and 6.36 · 10−21 are obtained for the intercept and the β coefficient, respectively, indicating robust statistical significance. On the other hand, the R-squared for the model fit is 0.95, which can be interpreted as 95% of the variance in the dependent variable explained by the independent variable. This is a rather high value and thus indicates a good fit.2 Additionally, the residuals can be analysed. Specifically, they can be plotted, and their normality can be assessed. Figure 4 illustrates these checks. Sample Quantiles 0.0 0.1 −0.1 −0.2 −0.2 −0.1 Model residuals 0.0 0.1 0.2 Normal Q−Q Plot for the Residuals 0.2 Residuals 0 5 10 15 Index 20 25 30 −2 −1 0 1 Theoretical Quantiles 2 Figure 4: The left side displays the residuals of the model, while the right side features the Normal Q-Q Plot of the residuals. These visualizations help decide whether the fitted model is adequate or not. The fitted model is a power law model that establishes a relationship between the girth and volume of trees, utilizing data from the trees dataset available in R. As shown in Figure 4, the residuals are approximately random and approximately normally distributed, supporting a positive evaluation of the fitted model. Finally, the estimate for the β coefficient, as mentioned previously, agrees with the common-sense prediction outlined in Section 3.1, adding another layer of support for the model’s adequacy. The R code related to this Section can be found in the Appendix, in Sections A.6 and A.7. 3.5 Effect of a 10% increase in the girth on the volume According to the fitted model, increasing the girth of the tree 10% results in a 23% increase in the median or geometric mean of the tree’s volume. The associated 95% confidence interval spans from 21% to 26%. The R code associated is available in Section A.8 of the Appendix. 3.6 Summary Under the fitted model, a 2-fold increase in the girth of the tree is associated to a 359% increase in the median or geometric mean of the volume of the tree. The associated 95% confidence interval ranges from 305% to 422% (see the R code in Section A.9 of the Appendix). 2 To improve the printing of the p-values, a function by Jose Barrera Gómez has been used. Check Section A.6 to see the corresponding R code. 5 A A.1 > > > > > + + > > > + + + > > > + + > > > > + + + > > > + # Plot 1 (left): Normal scale plot(NULL, xlim = range(data$age), ylim = range(data$hc), xlab = "Age (months)", ylab = "Head Circumference (cm)", cex.axis = 1.7, cex.lab = 1.7) lines(data$age[data$sex == "M"], data$hc[data$sex == "M"], col = "blue", lty="dashed", lwd=3) lines(data$age[data$sex == "F"], data$hc[data$sex == "F"], col = "red", lty="dotted", lwd=4) legend("topleft", legend = c("Males", "Females"), col = c("blue", "red"), lty = c("dashed", "dotted"), lwd = c(3, 4), cex=1.5) # Plot 2 (right): Logarithmic scale (x-axis) plot(NULL, xlim = range(data$age), ylim = range(data$hc), log = "x", xlab = "Age (months)", ylab = "Head Circumference (cm)", xaxt = ’n’, cex.axis = 1.7, cex.lab = 1.7) axis(1, at = 2^c(0:ceiling(log(max(data$age), base = 2))), cex.axis = 1.7) lines(data$age[data$sex == "M"], data$hc[data$sex == "M"], col = "blue", lty="dashed", lwd=3) lines(data$age[data$sex == "F"], data$hc[data$sex == "F"], col = "red", lty="dotted", lwd=4) legend("topleft", legend = c("Males", "Females"), col = c("blue", "red"), lty = c("dashed", "dotted"), lwd = c(3, 4), cex=1.5) # Adding a common title to both plots title("Head Circumference by Age and Sex", outer = TRUE, line = -2, cex.main = 2.2) Models for the relationship between head circumference and age in girls and boys # Adding the log-transformed age column to the dataset data$logage <- log(data$age) # Fitting the model for MALES model_males <- tlm(y = hc, x = logage, family = gaussian, data = data[data$sex=="M", ], xpow = 0, ypow = 1) # Fitting the model for FEMALES model_females <- tlm(y = hc, x = logage, family = gaussian, data = data[data$sex=="F", ], xpow = 0, ypow = 1) A.3 > > > > > > > > > + + Visualization of the head circumference data # Double plot par(mfrow = c(1, 2), mar=c(5,5,4,2)) A.2 > > > > > + > > > + R code Visualization of the models relating head circumference and age for each gender # Obtaining the predictions for each model pred_females <- predict(model_females$model) pred_males <- predict(model_males$model) # Double plot, with adjusted inner and outer margins par(mfrow=c(1,2), mar=c(5,5,4,2), oma=c(0,0,1.5,0)) # Plot for FEMALES plot(pred_females, type = "l", col="red", lwd=2, xlab = "Age (months)", ylab = "Head Circumference (cm)", main = "Females", cex.axis = 1.7, cex.lab = 1.7, cex.main = 1.9) 6 > > + + + + + > > > + + > > + + + + + > > > + points(x = data[data$sex=="F", ]$age, y = data[data$sex=="F", ]$hc) legend("topleft", legend = c("Model", "Observations"), col = c("red", "black"), lty = c(1, NA), lwd = c(2, 1), pch = c(NA, 1), cex = 1.5) # Plot for MALES plot(pred_males, type = "l", col="red", lwd=2, xlab = "Age (months)", ylab = "Head Circumference (cm)", main = "Males", cex.axis = 1.7, cex.lab = 1.7, cex.main = 1.9) points(x = data[data$sex=="M", ]$age, y = data[data$sex=="M", ]$hc) legend("topleft", legend = c("Model", "Observations"), col = c("red", "black"), lty = c(1, NA), lwd = c(2, 1), pch = c(NA, 1), cex = 1.5) # Adding a common title to both plots title("Fitted models for each sex with log-transformed independent variable", outer = TRUE, line = 0, cex.main = 2.2) A.4 > > + > > > > > > > > > > > > > # Ratio of quartile 3 to quartile 1 exact_q <- as.numeric(quantile(data[data$sex=="F", ]$age)[’75%’]/ quantile(data[data$sex=="F", ]$age)[’25%’]) # More understandable choice of q chosen_q <- floor(exact_q) # Extracting info for MALES effect_males <- effect(model_males, q = chosen_q, verbose = FALSE) estimate_males <- effect_males$effect$Estimate CI_males <- c(lower = effect_males$effect$‘lower95%‘, upper = effect_males$effect$‘upper95%‘) # Extracting info for FEMALES effect_females <- effect(model_females, q = chosen_q, verbose = FALSE) estimate_females <- effect_females$effect$Estimate CI_females <- c(lower = effect_females$effect$‘lower95%‘, upper = effect_females$effect$‘upper95%‘) A.5 > > > > > > > > > + > > > > > > > > > Information extraction from the head circumference growth models Beta (β) estimation in the linear regression model for the trees data # Loading the data data(trees) # Adding the log-transformed volume and girth columns to the dataset trees$Log_Volume <- log(trees$Volume) trees$Log_Girth <- log(trees$Girth) # Fitting the model model_trees <- tlm(y = Log_Volume, x = Log_Girth, family = gaussian, data = trees, xpow = 0, ypow = 0) # Extracting the relevant information from the fitted model model_info <- effectInfo(model_trees) estimate <- model_info$beta[’Log_Girth’, ’Estimate’] std_err <- model_info$beta[’Log_Girth’, ’Std. Error’] # Calculating a 95% confidence interval CI_values <- as.numeric(confint(model_trees$model, parm="Log_Girth")) CI_estimate <- c(lower = CI_values[1], upper = CI_values[2]) 7 A.6 > > + + + + + + + > > > > > > # Custom function to pretty print scientific notation scinotLaTeX <- function(x, dig = 2) { x1 <- sprintf(paste0("%.", dig, "e"), x) x2 <- unlist(strsplit(x1, "e")) a <- x2[1] b <- as.character(as.numeric(x2[2])) res <- paste0("$", a, "\\cdot 10^{", b, "}$") return(res) } # Extracting relevant information from the fitted model model_sum <- summary(model_trees) r_squared <- model_sum$summary$r.squared pvalue_intercept <- model_sum$summary$coefficients["(Intercept)", "Pr(>|t|)"] pvalue_beta <- model_sum$summary$coefficients["Log_Girth", "Pr(>|t|)"] A.7 > > > > > > > > + + > > > + + > # Double plot par(mfrow=c(1,2)) # Plot of the residuals plot(model_residuals, col = "black", pch = 21, main = "Residuals", ylab = "Model residuals", lwd = 2, cex = 2, cex.axis = 1.7, cex.lab = 1.7, cex.main = 1.9) # Quantile-Quantile plot for the residuals qqnorm(model_residuals, col = "black", pch = 21, main = "Normal Q-Q Plot for the Residuals", lwd = 2, cex = 2,cex.axis = 1.7, cex.lab = 1.7, cex.main = 1.9) qqline(model_residuals, col = "red", lwd=2) Calculation of the effect of a 10% increase in the girth on the volume # Extracting the relevant information for q = 1.1 (r = 10%) effect_girth <- effect(model_trees, q = 1.1, verbose = FALSE) estimate_effect <- effect_girth$effect$Estimate CI_effect <- c(lower = effect_girth$effect$‘lower95%‘, upper = effect_girth$effect$‘upper95%‘) A.9 > > + > > > > > > > > Visualization of the residuals from the linear regression model (tree data) # Extracting the residuals model_residuals <- model_sum$summary$residuals A.8 > > > > Information extraction and pretty printing the p-values from the linear regression model (tree data) Summary for the relationship of girth and volume of a tree according to the linear regression model # Ratio of quartile 3 to quartile 1 exact_q <- as.numeric(quantile(trees$Girth)[’75%’]/ quantile(trees$Girth)[’25%’]) # More understandable choice of q chosen_q <- ceiling(exact_q) # Extracting the relevant information effect_girth <- effect(model_trees, q = chosen_q, verbose = FALSE) estimate_effect <- effect_girth$effect$Estimate CI_effect <- c(lower = effect_girth$effect$‘lower95%‘, upper = effect_girth$effect$‘upper95%‘) 8

Use Quizgecko on...
Browser
Browser