Summary Exam R PDF
This document provides an overview of R programming, data analysis, and statistical modeling techniques. It covers basic concepts such as the interactive environment, interpreted vs. compiled languages, packages, and project management, and continues with logistic regression assumptions, classification performance measures (confusion matrix, ROC/AUC, cross-entropy error), and worked exam practice questions on linear and logistic regression.
Summary Slides Lecture 1

1. R Basics

R is a statistical (and interpreted) programming language designed primarily for data analysis, visualization, and statistical modeling. Unlike point-and-click data analysis software, R requires you to type in commands, which gives you more control and flexibility over your analyses.

Interactive Environment
When you type commands into the R Console (in R or via RStudio), these commands are executed immediately. This interactive feature means you can instantly see the results and experiment with different techniques without having to compile or build a complete "project" each time.

Interpreted vs. Compiled Languages
R is an interpreted language. Each command you submit is read and executed on the spot, which contrasts with compiled languages (such as C++), where you first compile your code into an executable and then run it.

RStudio
RStudio is an Integrated Development Environment (IDE) for R. An IDE provides a more user-friendly interface to develop code. Its features include:
1. Multiple panes (e.g., Source, Console, Environment/History, Files/Plots/Help) that help organize your workflow.
2. Tools for debugging, version control (Git integration), and easy package management.
3. Syntax highlighting, auto-completion, and other helpful coding features.
Although RStudio adds "window dressing" and convenience features around R, it remains open-source and is maintained independently of R.

2. Packages

In R, functionality beyond the base installation is often accessed through packages. A package typically includes R functions, compiled code, and documentation that cater to specific tasks or areas of interest (e.g., data wrangling, machine learning, visualization).

Installing a package:
install.packages("NAME_OF_PACKAGE")
This downloads the package from an online repository (usually CRAN: the Comprehensive R Archive Network) and makes it available on your system.

Loading a package:
library(NAME_OF_PACKAGE) or require(NAME_OF_PACKAGE)
Loading is necessary each time you start a new R session if you plan to use that package. The require() function works similarly to library(), but it returns a logical value indicating whether the package was successfully loaded (TRUE/FALSE).

3. Project Management

Organizing your files and directories efficiently will dramatically simplify your data analysis workflow. Three key concepts are crucial:

1. Working Directories
Your working directory in R is the folder where R reads and saves files by default. You can check your current working directory with getwd() and set it with setwd("/path/to/your/directory").

2. Directory Structures and File Paths
Understanding relative vs. absolute paths will help you avoid headaches when you move or share projects.
- Absolute path: starts at the root of your file system (e.g., C:/Users/You/Documents/...).
- Relative path: starts from your current working directory (e.g., data/my_file.csv if your working directory is the folder that contains the data folder).

3. RStudio Projects
RStudio Projects help you keep everything self-contained. When you open a Project, RStudio automatically sets the working directory to the project folder, which ensures all file paths remain consistent. It also helps maintain a separate environment for each project, avoiding conflicts with other analyses.
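A minimal sketch of the install-then-load workflow from Section 2. The package name "dplyr" is only an illustrative choice; any CRAN package follows the same pattern.

```r
# Install once per machine (downloads from CRAN), then load once per session
install.packages("dplyr")
library(dplyr)

# require() returns TRUE/FALSE instead of throwing an error,
# so it can guard optional code paths:
if (!require(dplyr)) {
  message("dplyr is not installed; skipping this step")
}
```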
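And a short sketch of the working-directory and path ideas from Section 3. The folder and file names below are hypothetical; adjust them to your own machine.

```r
getwd()                                   # show the current working directory
setwd("C:/Users/You/Documents/project")   # absolute path: starts at the file-system root

# With the working directory set to the project folder,
# a relative path is enough to reach files inside it:
my_data <- read.csv("data/my_file.csv")   # looks inside <working directory>/data/

# file.path() builds paths that work on both Windows and macOS/Linux:
file.path("data", "my_file.csv")          # "data/my_file.csv"
```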
4. Delimited Data Types

A common first step in data analysis is importing data. R provides functions for reading in different types of delimited text files:

read.csv()
Assumes that values in your file are separated by commas. This is the typical CSV format in many parts of the world, especially North America.

read.csv2()
Assumes that values are separated by semicolons (;). This function is often used for European (EU) style CSV files, where semicolons are the default delimiter and commas sometimes indicate decimal points.

Make sure you know the delimiter used in your dataset so you can choose the correct function.

5. R Functions

In R, anything in the form word(...) is typically a function call. Functions are the building blocks of R; they take inputs (arguments), perform calculations or operations, and then return a result.

Infix Operators
Operators like <-, +, and %% are also functions; they are simply written in infix form, between their arguments, rather than in the usual word(...) form.

A VIF (variance inflation factor) greater than 10 is a common flag for severe collinearity. Ensure N > P so that parameters are estimable.

3.3 IID Binomial (No Residual Clustering)
If the data are clustered (e.g., multiple observations per person, location, or group), the straightforward assumption that each observation is independent breaks down. You can detect leftover clustering by:
- Checking the intraclass correlation (ICC) on the deviance residuals.
- Adding random effects or robust standard errors if needed.

4. Computational Considerations

Logistic regression relies on iterative methods (e.g., Newton-Raphson) rather than a direct formula. As a result:

1. Sample Size Requirements
- Typically larger than in linear regression for stable convergence.
- Common rules of thumb:
  - 10 cases per predictor (especially for the rarer outcome).
  - N = 10P / π0, where π0 is the proportion of the minority class.
  - N = 100 + 50 per predictor in more complex cases.

2. Class Imbalance
- If one class (e.g., positives) is rare, this can hurt estimates.
- Remedies: oversample the minority class, undersample the majority class, or use class weights in the model.

3. No Perfect Separation
- If a predictor perfectly separates 0 vs. 1, standard logistic regression cannot estimate a finite coefficient for that predictor.
- Regularization (e.g., ridge or LASSO) can help in these cases.

5. Influential Observations

As with linear models, you can have overly influential data points:
- Check Cook's distance using the linear predictor (η) rather than the raw outcome.
- Influential observations may drastically change π̂_n by shifting the slope and intercept.

6. Evaluating Classification Performance

6.1 Confusion Matrix
Once a threshold (often 0.5) is chosen for π̂_n, each case is classified as 0 or 1. The confusion matrix tallies:

             Predicted 0   Predicted 1
Actual 0         TN            FP
Actual 1         FN            TP

TP: True Positive; TN: True Negative; FP: False Positive (Type I error); FN: False Negative (Type II error).

From this matrix, we derive:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Error rate = 1 − accuracy
- Sensitivity (recall) = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- False positive rate = 1 − specificity = FP / (TN + FP)
- Positive predictive value (precision) = TP / (TP + FP)
- Negative predictive value = TN / (TN + FN)

6.2 ROC Curve & AUC
The ROC (Receiver Operating Characteristic) curve plots sensitivity against 1 − specificity for all possible classification thresholds (0 to 1). The AUC (Area Under the Curve) summarizes performance:
- 0.7–0.8: acceptable
- 0.8–0.9: excellent
- > 0.9: outstanding
AUC is threshold-independent, letting you judge classifier quality regardless of a chosen cutoff.
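The assumption checks above (collinearity, influential observations) can be run directly on a fitted model. A minimal sketch, assuming a hypothetical data frame dat with a binary outcome y and predictors x1 and x2, and that the car package is installed:

```r
# Fit a logistic regression with base R's glm(); family = binomial uses the logit link
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
summary(fit)               # coefficients are on the log-odds scale

# Collinearity check: variance inflation factors (car package)
car::vif(fit)              # values > 10 are a common flag for severe collinearity

# Influence check: Cook's distance for each observation
cd <- cooks.distance(fit)
which(cd > 4 / nrow(dat))  # one common (rough) cutoff for flagging influential cases
```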
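To make the confusion-matrix formulas in Section 6.1 concrete, here is a sketch that classifies cases at the 0.5 threshold and computes the main metrics. It reuses the hypothetical fit and dat from the previous sketch.

```r
# Predicted probabilities and hard classifications at the 0.5 threshold
pi_hat <- predict(fit, type = "response")
pred   <- ifelse(pi_hat >= 0.5, 1, 0)

# Confusion matrix: rows = actual, columns = predicted
cm <- table(Actual = dat$y, Predicted = pred)
TN <- cm["0", "0"]; FP <- cm["0", "1"]
FN <- cm["1", "0"]; TP <- cm["1", "1"]

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)   # positive predictive value
```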
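The ROC curve and AUC from Section 6.2 do not have to be computed by hand. One option (an assumption, not part of the course material) is the pROC package, which works from the observed outcomes and predicted probabilities:

```r
# ROC curve and AUC with the pROC package (install.packages("pROC") if needed)
library(pROC)

roc_obj <- roc(response = dat$y, predictor = pi_hat)  # pi_hat from the previous sketch
auc(roc_obj)   # area under the curve; > 0.9 would be "outstanding" by the rule of thumb above
plot(roc_obj)  # sensitivity vs. specificity across all thresholds
```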
7. Alternative Performance Measures

7.1 Cross-Entropy Error (CEE)
When you care about how confident a model's predictions are (not just correct/incorrect classification), the Cross-Entropy Error is a useful metric:

CEE = −(1/N) Σ_{n=1}^{N} [ Y_n ln(π̂_n) + (1 − Y_n) ln(1 − π̂_n) ]

- Penalizes overconfidence when π̂_n is far from the truth.
- Helps distinguish two models that might yield the same misclassification rate but differ in how confidently they make correct/incorrect predictions.

7.2 Advantages Over Misclassification Rate
Two models can yield the same confusion matrix (and thus the same misclassification rate) but differ in the probability estimates they produce. CEE rewards well-calibrated probability estimates and penalizes extreme, incorrect predictions more heavily.

Important Formulas and Exam Prep Questions

1. Working with the Linear Regression Equation

Linear regression model: Yi = β0 + β1·Xi + εi, where:
- Yi is the response (or outcome) variable for the i-th observation,
- Xi is the predictor (or explanatory) variable for the i-th observation,
- β0 is the intercept term,
- β1 is the slope (coefficient) for the predictor X,
- εi is the error term (residual) for the i-th observation.

1a. Calculate Predicted Values Given Certain Inputs
When we have estimates for β0 and β1, we can compute the predicted value of Y (denoted Ŷi) for any input Xi by ignoring the error term and plugging in Xi:
Ŷi = β̂0 + β̂1·Xi
Example: Suppose the estimated regression line is Ŷi = 2 + 3·Xi. If Xi = 4, then Ŷi = 2 + 3 × 4 = 2 + 12 = 14. So, the predicted Y value is 14 when X = 4.

1b. Interpret Parameter Estimates
- Intercept (β0): the expected value of Y when X = 0.
- Slope (β1): the expected change in Y for a one-unit increase in X.

1c. Evaluate Hypotheses/Research Questions
In simple linear regression, a common null hypothesis is H0: β1 = 0 (no relationship between X and Y) versus the alternative Ha: β1 ≠ 0 (there is some relationship). We often use t-tests or confidence intervals to test or estimate β1. If the p-value is below a chosen significance level (e.g., 0.05), we reject H0 and conclude there is evidence of a relationship between X and Y.

2. Differences Between the Full Regression Model and the Best-Fit Line

1. Full regression model (the "population" or "theoretical" model): Yi = β0 + β1·Xi + εi
   - Here, β0 and β1 are the true (unknown) population parameters.
   - εi is the true error (random noise) for observation i.
2. Equation for the best-fit (estimated) line: Ŷi = β̂0 + β̂1·Xi
   - β̂0 and β̂1 are the estimated parameters from the sample.
   - Ŷi is the predicted (or fitted) value of Y for Xi.
   - We no longer include εi because for predictions we ignore the random error term.

3. Definition of a Residual

The residual ε̂i is the difference between the observed value Yi and the predicted (fitted) value Ŷi:
ε̂i = Yi − Ŷi
A positive residual (ε̂i > 0) means the observed Yi was higher than predicted. A negative residual (ε̂i < 0) means the observed Yi was lower than predicted.

4. Relationship Between Probabilities and Odds (for Binary Outcomes)

When dealing with binary outcomes (e.g., success/failure, yes/no), we often talk about a probability p = P("success"). The odds of success are defined as:
odds = p / (1 − p)
- If p = 0.5, the odds are 0.5 / 0.5 = 1. We often say "1 to 1" or "even odds."
- If p = 0.75, the odds are 0.75 / 0.25 = 3. We say "3 to 1 odds."
Understanding odds is crucial for logistic regression, which models the log of the odds (i.e., the logit).
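A short sketch of the cross-entropy error from Section 7.1, computed from observed outcomes and predicted probabilities (it reuses the hypothetical dat$y and pi_hat from the earlier sketches):

```r
# Cross-entropy error: mean of -[ y*log(pi_hat) + (1 - y)*log(1 - pi_hat) ]
cee <- -mean(dat$y * log(pi_hat) + (1 - dat$y) * log(1 - pi_hat))
cee

# Two models with the same misclassification rate can still differ here:
# the one with better-calibrated probabilities has the lower CEE.
```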
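The predicted-value and residual calculations in Sections 1a and 3 map directly onto base R functions. A minimal sketch with made-up numbers; hours and score are hypothetical variable names:

```r
# Toy data: exam score as a function of hours studied (made-up numbers)
study <- data.frame(hours = c(1, 2, 3, 4, 5),
                    score = c(52, 58, 61, 70, 74))

fit_lm <- lm(score ~ hours, data = study)
coef(fit_lm)                                      # estimated intercept and slope

# Predicted value Y-hat = beta0-hat + beta1-hat * X for a new observation
predict(fit_lm, newdata = data.frame(hours = 4))

residuals(fit_lm)                                 # observed minus fitted values
summary(fit_lm)                                   # t-test and p-value for H0: beta1 = 0
```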
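The probability/odds conversions in Section 4 are one-liners. This sketch defines two small helper functions (hypothetical names) and checks the examples from the text:

```r
# Convert a probability to odds, and odds back to a probability
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

prob_to_odds(0.5)    # 1   ("even odds")
prob_to_odds(0.75)   # 3   ("3 to 1")
odds_to_prob(4)      # 0.8
```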
5. Definition of the Logit Function

logit(p) = ln( p / (1 − p) )

The logit function transforms a probability p (which ranges from 0 to 1) into a value ranging from −∞ to +∞.
- When p = 0.5, logit(0.5) = ln(1) = 0.
- When p → 1, p / (1 − p) → ∞, so logit(p) → +∞.
- When p → 0, p / (1 − p) → 0, so logit(p) → −∞.
The logit function forms the foundation of logistic regression because it provides a linear scale for probabilities.

6. The Logistic Function and Its Role in Logistic Regression

Logistic regression models the logit of the probability pi (of "success" for observation i) as a linear function of Xi:

ln( pi / (1 − pi) ) = β0 + β1·Xi

This can be rewritten to express pi explicitly:

pi = exp(ηi) / (1 + exp(ηi)),

where ηi = β0 + β1·Xi is the linear predictor. The logistic function (sometimes called the sigmoid function) is the inverse of the logit. It maps any real number (from −∞ to +∞) to a probability between 0 and 1.

Interpretation in Logistic Regression
- β1 in logistic regression can be interpreted as the change in the log-odds associated with a one-unit increase in X.
- Exponentiating β1 (i.e., exp(β1)) gives the odds ratio for a one-unit increase in X.

Practice Question 1: Linear Regression Prediction
Question: Given the estimated regression equation Ŷ = −5 + 2.5X:
1. What is the predicted value of Y when X = 4?
2. Interpret the slope and intercept in context (assume X is "hours studied" and Y is "exam score").
Answer:
1. Predicted value: Ŷ = −5 + 2.5 × 4 = −5 + 10 = 5. So, if a student studies 4 hours, the model predicts an exam score of 5 (out of 100, presumably).
2. Interpretation:
   - Intercept (β̂0 = −5): when X = 0 (i.e., 0 hours studied), the model predicts an exam score of −5. Of course, a negative score may not be realistic, indicating the model might not be valid at very low study times.
   - Slope (β̂1 = 2.5): each additional hour of study is associated with a 2.5-point increase in the exam score, on average.

Practice Question 2: Residual Calculation
Question: Suppose we have an observation where Yi = 12, but our model predicts Ŷi = 10. What is the residual, and how do you interpret it?
Answer: ε̂i = Yi − Ŷi = 12 − 10 = 2.
Interpretation: the model under-predicted the actual value by 2; the observed value was 2 points higher than expected by the model.

Practice Question 3: Probability vs. Odds
Question: If a certain event has probability p = 0.20, what are the odds for that event? Conversely, if the odds are 4 (i.e., "4 to 1"), what is p?
Answer:
1. Odds when p = 0.20: odds = 0.20 / (1 − 0.20) = 0.20 / 0.80 = 0.25. We say "0.25 to 1," or more simply, "1 to 4 against."
2. Probability when odds = 4: p = odds / (1 + odds) = 4 / (1 + 4) = 4/5 = 0.80.

Practice Question 4: Logit Function
Question: Compute the logit of p = 0.75. Then invert that logit to confirm you get back p = 0.75.
Answer:
- Logit: logit(0.75) = ln( 0.75 / (1 − 0.75) ) = ln( 0.75 / 0.25 ) = ln(3) ≈ 1.0986.
- Inverse (the logistic function): p = exp(1.0986) / (1 + exp(1.0986)) = 3 / (1 + 3) = 3/4 = 0.75.

Practice Question 5: Logistic Regression Interpretation
Question: Suppose a logistic regression model is:

ln( pi / (1 − pi) ) = −2 + 0.8·Xi

1. What is the odds ratio for a one-unit increase in X?
2. What is the predicted probability pi when Xi = 3?
Answer:
1. Odds ratio = exp(0.8) ≈ 2.2255. This means each one-unit increase in X multiplies the odds of success by about 2.23.
2. Predicted probability when X = 3:
   - Compute the linear predictor: η = −2 + 0.8 × 3 = −2 + 2.4 = 0.4.
   - Convert to probability: pi = exp(0.4) / (1 + exp(0.4)) = 1.4918 / (1 + 1.4918) ≈ 0.60.
So the model predicts a 60% chance of "success" when X = 3.
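Base R already provides the logit and its inverse (the logistic function) as qlogis() and plogis(). A short sketch reproducing Practice Question 4:

```r
# logit and inverse logit with base R
qlogis(0.75)            # logit(0.75) = ln(3) ≈ 1.0986
plogis(qlogis(0.75))    # back-transform: 0.75

# the same by hand, matching the formulas above
log(0.75 / (1 - 0.75))
exp(1.0986) / (1 + exp(1.0986))
```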
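And a sketch of the calculations in Practice Question 5: the coefficients −2 and 0.8 come from the question itself; everything else is plain arithmetic in R.

```r
b0 <- -2
b1 <- 0.8

exp(b1)             # odds ratio for a one-unit increase in X: ≈ 2.23

eta <- b0 + b1 * 3  # linear predictor at X = 3: 0.4
plogis(eta)         # predicted probability: ≈ 0.60
```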