
STAT360 Applied Regression, Fall 2024 PDF


Summary

These notes cover Unit 1 of STAT360 Applied Regression (Fall 2024), focusing on the linear model: initial data analysis, the general form and matrix representation of the linear model, least squares estimation of the coefficients, QR decomposition, identifiability, and orthogonality.

Full Transcript


STAT360: Applied Regression
Unit 1: Linear Model
United Arab Emirates University
Department of Statistics and Business Analytics, College of Business and Economics
Ibrahim Alfaki, PhD
Fall 2024

Outline (Topic 1: Linear Model)

1 Introduction
2 General Form of Linear Model
3 Matrix Representation of Linear Model
4 Estimating β (Regression Coefficients)

Further topics: QR Decomposition, Gauss-Markov Theorem, Goodness of Fit, Identifiability, Orthogonality.

Introduction

Statistics starts with a problem, proceeds with the collection of data, continues with the data analysis and finishes with conclusions. Before starting any complex analysis, the statistician should understand the objectives and whether the data are appropriate for the kind of analysis proposed. A correct formulation of the problem requires:

- Understanding the physical background of the problem, working in collaboration with others to understand the subject area.
- Understanding the objective and the questions that need to be addressed, working with a collaborator from the subject area. Make sure you know what the client wants, and avoid doing analysis far more complicated than the client needs.
- Expressing the problem in statistical terms, taking care to avoid irreparable errors.

It is important to understand how the data were collected.
1. How the data were collected has a crucial impact on what conclusions can be made. Are the data observational or experimental? Are the data a sample of convenience, or were they obtained via a designed sample survey?
2. Is there non-response? The data you do not see may be just as important as the data you do see.
3. Are there missing values? This is a common problem that is troublesome and time-consuming to handle.
4. How are the data coded? In particular, how are the categorical variables represented?
5. What are the units of measurement?
6. Beware of data-entry errors and other corruption of the data; perform some data sanity checks.

Initial Data Analysis

Initial Data Analysis (IDA) consists of critical steps performed after data collection/entry and before formal statistical analysis. IDA is vital for getting data into a suitable form for analysis and for minimizing the risk of incorrect or misleading results. The process uses:

- Numerical summaries, including means, standard deviations, maxima/minima, and measures appropriate to the specific data.
- Graphical summaries, which include a variety of techniques:
  - For one variable at a time: boxplots, histograms, density plots, etc.
  - For two or more variables: scatterplots, dynamic plots, etc.

In the plots, look for outliers, data-entry errors, and skewed or unusual distributions and structure; also check whether the data are distributed according to prior expectations.

Initial Data Analysis: Example

The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
The following variables were recorded: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (weight in kg / (height in m)²), diabetes pedigree function, age (years), and a test of whether the patient showed signs of diabetes (coded zero if negative, one if positive). For more description of the "pima" dataset visit https://www.kaggle.com/uciml/pima-indians-diabetes-database or http://stats4stem.weebly.com/r-pimatr-data.html. Importantly, find out about the purpose of the study. How were the data collected? Then look at the data by conducting an IDA.

Initial Data Analysis: Pima Indians Data

The "pima" dataset is part of the R package "faraway". To make the dataset available in the R session, you first need to install the package:

> install.packages("faraway")
> library(faraway)
> data(pima, package="faraway")
> head(pima)
  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1
6        5     116        74       0       0 25.6    0.201  30    0

Initial Data Analysis: Summary Measures

We can start the IDA process with some numerical summaries:

> summary(pima)
    pregnant         glucose        diastolic         triceps         insulin
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00   Min.   :  0.0
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00   1st Qu.:  0.0
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00   Median : 30.5
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54   Mean   : 79.8
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00   3rd Qu.:127.2
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00   Max.   :846.0
      bmi           diabetes           age             test
 Min.   : 0.00   Min.   :0.0780   Min.   :21.00   Min.   :0.000
 1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00   1st Qu.:0.000
 Median :32.00   Median :0.3725   Median :29.00   Median :0.000
 Mean   :31.99   Mean   :0.4719   Mean   :33.24   Mean   :0.349
 3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00   3rd Qu.:1.000
 Max.   :67.10   Max.   :2.4200   Max.   :81.00   Max.   :1.000

Look for unusual or unexpected values that might indicate a data-entry error. Checking each variable's minimum and maximum, the "pregnant" variable has a maximum of 17. This is large, but not impossible. However, the next five variables have minimums of zero. A diastolic blood pressure of zero is not compatible with health; something must be wrong.

Initial Data Analysis: Summary Measures (continued)

Further investigation of the variable "diastolic", by sorting, reveals 35 zeros, raising further questions for the researchers about what really happened.

> sort(pima$diastolic)
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0 24 30 30 38 40 44 44 44 44
  ...

It turned out that at data entry zero was used as a missing-value code, which is not a good choice: analysis results would be obscured if the zeros were treated as real observations. Therefore, set the zeros to the missing-value code NA in all five variables.
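The same recoding idea can be sketched outside R as well. The following is an illustrative Python/NumPy sketch with made-up values (not the pima data): a sentinel zero standing in for a missing value is replaced by NaN, so that summaries ignore it instead of being biased by it.

```python
import numpy as np

# Hypothetical diastolic readings where 0 was (mis)used as a missing-value code.
diastolic = np.array([72.0, 66.0, 0.0, 64.0, 0.0, 40.0, 74.0])

# Treating the zeros as real observations biases the mean downward.
mean_with_zeros = diastolic.mean()

# Recode: replace the sentinel 0 with NaN, then use NaN-aware summaries.
recoded = np.where(diastolic == 0.0, np.nan, diastolic)
n_missing = int(np.isnan(recoded).sum())
mean_recoded = np.nanmean(recoded)
```

With these invented numbers, two readings are flagged as missing and the NaN-aware mean is noticeably higher than the naive one, which is exactly the distortion the recoding step guards against.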
> pima$diastolic[pima$diastolic == 0] <- NA
> pima$glucose[pima$glucose == 0] <- NA
> pima$triceps[pima$triceps == 0] <- NA
> pima$insulin[pima$insulin == 0] <- NA
> pima$bmi[pima$bmi == 0] <- NA

The variable "test" is categorical, not quantitative, so it should be stored as a factor:

> pima$test <- factor(pima$test)
> summary(pima$test)
  0   1
500 268
> levels(pima$test) <- c("negative", "positive")

Initial Data Analysis: Univariate Plots

> hist(pima$diastolic, xlab="Diastolic", main="")
> plot(density(pima$diastolic, na.rm=TRUE), main="")
> plot(sort(pima$diastolic), ylab="Sorted Diastolic")

[Figure 1: (a) histogram of diastolic blood pressure; (b) kernel density estimate (N = 733, bandwidth = 2.872); (c) sorted diastolic values plotted against index.]

Initial Data Analysis: Bivariate Plots

A common bivariate plot is the standard scatterplot. It shows the relationship between two quantitative variables. The observed pattern reveals possible outliers in addition to the form of the relationship (e.g., linear), its direction (positive/negative), and its strength. Comment on Figure 2(a), the relationship between "bmi" (body mass index) and "diabetes" (the diabetes pedigree function). Boxplots are useful for providing a side-by-side comparison of a quantitative variable against a categorical variable, Figure 2(b).

> plot(diabetes ~ bmi, pima)
> plot(diabetes ~ test, pima)

[Figure 2: (a) scatterplot of the diabetes pedigree function against bmi; (b) boxplots of the diabetes pedigree function by test result (negative/positive).]

Initial Data Analysis: Why Are Graphs Important?

Good graphs help avoid mistakes when it comes to suggesting the form of the modeling to come (e.g., linear). They are important in communicating the results of the analysis. They can be so useful that the formal analysis becomes just a confirmation of what has already been seen.
When to Use Linear Modeling

Linear modeling is used to explain or model the relationship between a single variable Y, called the response, outcome, output or dependent variable, and one or more predictor, input, independent or explanatory variables, X1, ..., Xp, where p is the number of predictors. Another term for linear modeling is regression analysis, although regression can also be nonlinear. When p = 1, it is called simple regression. If p > 1 it is multiple regression or sometimes multivariate regression. When there is more than one response, it is called multivariate multiple regression.

The response must be a continuous variable, but the explanatory variables can be continuous, discrete or categorical. Regression has two main objectives:

1. Prediction of future responses given specified values of the predictors.
2. Assessment of the effect of, or relationship between, explanatory variables and the response; that is, to infer causal relationships if possible.

Knowing the true model is rare, except in a few cases in the precise physical sciences. In most applications, the model is an empirical construct designed to answer questions about prediction or causation.

Exercise 1 (Due next week by the time of the first meeting)

The dataset teengamb concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.
General Form of Linear Model

Suppose we want to model the response Y in terms of p predictors, X1, X2, ..., Xp. One general form for the model would be:

Y = f(X1, X2, ..., Xp) + ε

where f is some unknown function and ε is an error term. To estimate f we can assume that it has some restricted linear form:

Y = β0 + β1 X1 + β2 X2 + ... + βp Xp + ε

where βi, i = 0, 1, 2, ..., p, are unknown parameters (the model's coefficients); β0 is called the intercept term.

In a linear model the parameters enter linearly; the predictors themselves do not have to be linear. For example, Y = β0 + β1 X1^β2 + ε is not linear, because the parameter β2 does not enter linearly. Linear models seem restrictive, but because predictors can be transformed and combined, they are actually very flexible. "Linear" is also used to refer to straight lines, but linear models can be curved, e.g. by including polynomial terms such as squared or cubed predictors.

Where Do Models Come From?

Three different sources can be distinguished:

1. Physical theory may suggest a model. For example, Hooke's law says that the extension of a spring is proportional to the weight attached. Models like these usually arise in the physical sciences and engineering.
2. Experience with past data. Similar data used in the past were modeled in a particular way, and it is natural to see whether the same model will work with the current data. Models like these usually arise in the social sciences.
3. No prior idea exists; the model comes from an exploration of the data. We use skill and judgment to pick a model. Sometimes it does not work and we have to try again.
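The point that a linear model must be linear in the parameters, not in the predictors, can be sketched with a small simulation. This is an illustrative Python/NumPy sketch with invented coefficients (not the course's R code): a model that is quadratic in x is still fit by ordinary least squares, because the coefficients enter linearly.

```python
import numpy as np

# Illustrative sketch with made-up values: simulate y = b0 + b1*x + b2*x^2 + noise.
# The fitted curve is not a straight line in x, yet the model is linear in the
# coefficients (b0, b1, b2), so ordinary least squares applies directly.
rng = np.random.default_rng(42)
n = 50
x = rng.uniform(0.0, 10.0, n)
beta = np.array([1.0, 2.0, 0.5])                  # assumed "true" coefficients
y = beta[0] + beta[1] * x + beta[2] * x**2 + rng.normal(0.0, 1.0, n)

# Design matrix with an intercept column and a transformed (squared) predictor.
X = np.column_stack([np.ones(n), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares estimate
```

Transforming a predictor (here, squaring it) just adds another column to the design matrix; the estimation machinery is unchanged.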
Models that derive directly from physical theory are relatively uncommon, so the linear model can usually only be regarded as an approximation to a complex reality. We hope it predicts well or explains relationships usefully, but usually we do not believe it is exactly true.

Matrix Representation of Linear Model

The tabular form of a dataset with a response Y and p predictors X1, X2, ..., Xp can be presented as

y1   x11  x12  ...  x1p
y2   x21  x22  ...  x2p
...  ...  ...  ...  ...
yn   xn1  xn2  ...  xnp

where n is the number of observations, or cases, in the dataset. Given the actual data values, we may write the model as:

yi = β0 + β1 xi1 + β2 xi2 + ... + βp xip + εi,   i = 1, ..., n.

In matrix/vector representation the regression equation can be written as

y = Xβ + ε

where y = (y1, ..., yn)ᵀ, ε = (ε1, ..., εn)ᵀ, β = (β0, ..., βp)ᵀ, and the design matrix X is given by

    | 1  x11  x12  ...  x1p |
X = | 1  x21  x22  ...  x2p |
    | ...              ...  |
    | 1  xn1  xn2  ...  xnp |

Estimating β (Regression Coefficients)

Geometric Representation of the Estimation of β

The regression model, y = Xβ + ε, partitions the response into a systematic component Xβ and a random component ε. We would like to choose β so that the systematic part explains as much of the response as possible.
Geometrically, the response y lies in an n-dimensional space, y ∈ Rⁿ, while β ∈ Rᵖ, where p here counts the number of predictors plus one (including the intercept). The problem is to find the best estimate β̂ of β (the regression coefficients) so that Xβ̂ is as close to y as possible. The response predicted by the model is ŷ = Xβ̂, or Hy, where H is an orthogonal projection matrix. The ŷ are called predicted or fitted values. The difference between the actual response y and the predicted response ŷ is denoted by ε̂ and is called the residual.

In the geometric representation, the data vector y is projected orthogonally onto the model space spanned by the columns of X. The fit is represented by the projection ŷ = Xβ̂, with the difference between the fit and the data represented by the residual vector ε̂.

Least Squares Estimation of β

The standard approach to estimating β is to minimize the sum of squared errors (vertical distances), using the method of least squares:

Σ εi² = εᵀε = (y − Xβ)ᵀ(y − Xβ)

Differentiating with respect to β and setting to zero gives:

−2Xᵀ(y − Xβ) = 0,   i.e.   XᵀXβ = Xᵀy

These are called the normal equations of the least squares method. To obtain the parameter estimates β̂, the equations can be solved using matrix algebra techniques such as matrix inversion or QR decomposition. The same result can be derived using the geometric approach. If XᵀX is invertible:

β̂ = (XᵀX)⁻¹Xᵀy
Xβ̂ = X(XᵀX)⁻¹Xᵀy
ŷ = Hy

H = X(XᵀX)⁻¹Xᵀ is called the hat matrix and is the orthogonal projection of y onto the space spanned by the columns of X. The predicted or fitted values are ŷ = Hy = Xβ̂. The residuals are ε̂ = y − Xβ̂ = y − ŷ = (I − H)y.
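The least squares quantities above can be sketched numerically. The following is an illustrative Python/NumPy sketch on a small made-up dataset (not the course's R code): we solve the normal equations, form the hat matrix H, and compute the error-variance estimate and coefficient standard errors.

```python
import numpy as np

# Hypothetical small dataset: n = 6 cases, an intercept plus two predictors.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 4.0, 0.0],
              [1.0, 6.0, 1.0],
              [1.0, 8.0, 0.0],
              [1.0, 5.0, 1.0],
              [1.0, 3.0, 0.0]])
y = np.array([3.1, 5.2, 7.9, 9.1, 6.8, 4.2])
n, p = X.shape

# Solve the normal equations X'X beta = X'y (solve() is preferred to explicit inversion).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X', fitted values, and residuals.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
resid = y - y_hat

# Unbiased estimate of the error variance: RSS / (n - p).
rss = resid @ resid
sigma2_hat = rss / (n - p)

# Standard errors: square roots of the diagonal of (X'X)^{-1} * sigma2_hat.
se = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * sigma2_hat)
```

Two properties worth checking against the geometry: H is idempotent (projecting twice changes nothing), and the residual vector is orthogonal to every column of X, i.e. Xᵀε̂ = 0.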
The residual sum of squares (RSS) is ε̂ᵀε̂ = yᵀ(I − H)ᵀ(I − H)y = yᵀ(I − H)y, using the fact that (I − H) is symmetric and idempotent.

Properties of the Least Squares Estimate

The least squares estimate is the best possible estimate of β when the errors ε are uncorrelated and have equal variance, i.e., var ε = σ²I. Provided var ε = σ²I, β̂ is unbiased and has variance (XᵀX)⁻¹σ². Since E(ε̂ᵀε̂) = σ²(n − p), an unbiased estimate σ̂² of σ² is given by:

σ̂² = ε̂ᵀε̂ / (n − p) = RSS / (n − p)

where (n − p) is the degrees of freedom of the model. The standard error for a particular component of β̂ is

se(β̂ᵢ₋₁) = sqrt( (XᵀX)⁻¹ᵢᵢ ) σ̂.

It is usually not possible to find explicit formulae for the parameter estimates βi unless XᵀX has a simple form. Typically, statistical packages on the computer are used to fit such models.

Example: Galapagos Islands Data

The dataset contains 30 cases (islands) and seven variables. The variables include: Species, the number of species found on the island; Area, the area of the island (km²); Elevation, the highest elevation of the island (m); Nearest, the distance from the nearest island (km); Scruz, the distance from Santa Cruz Island (km); and Adjacent, the area of the adjacent island (km²).

> data(gala, package="faraway")
> head(gala[,-2]) # 2nd column (Endemics) omitted
             Species   Area Elevation Nearest Scruz Adjacent
Baltra            58  25.09       346     0.6   0.6     1.84
Bartolome         31   1.24       109     0.6  26.3   572.33
Caldwell           3   0.21       114     2.8  58.7     0.78
Champion          25   0.10        46     1.9  47.4     0.18
Coamano            2   0.05        77     1.9   1.9   903.82
Daphne.Major      18   0.34       119     8.0   8.0     1.84

Example: Galapagos Islands Data (model fitting)

Fitting a linear model in R is done using the lm() function.
Using the least squares method with five predictors and the response variable Species, the β̂i are produced:

> lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
> summary(lmod)

Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
    data = gala)

Residuals:
     Min       1Q   Median       3Q      Max
-111.679  -34.898   -7.862   33.460  182.584

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.068221  19.154198   0.369 0.715351
Area        -0.023938   0.022422  -1.068 0.296318
Elevation    0.319465   0.053663   5.953 3.82e-06 ***
Nearest      0.009144   1.054136   0.009 0.993151
Scruz       -0.240524   0.215402  -1.117 0.275208
Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07

The fitted equation is:

Species = 7.068 − 0.024 Area + 0.319 Elevation + 0.009 Nearest − 0.241 Scruz − 0.075 Adjacent

Example: Galapagos Islands Data (extracting quantities)

Although the lm() function produces almost all the needed quantities, we can still extract or compute many statistics directly from the data. For example, the investigator can extract the X matrix and the response y, construct (XᵀX)⁻¹, and get β̂:

> x <- model.matrix(~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
> y <- gala$Species
> xtxi <- solve(t(x) %*% x)  # (X'X)^{-1}
> solve(crossprod(x,x), crossprod(x,y)) # computes regression coefficients, same result as lm()
                     [,1]
(Intercept)   7.068220709
Area         -0.023938338
Elevation     0.319464761
Nearest       0.009143961
Scruz        -0.240524230
Adjacent     -0.074804832

Regression quantities that can be extracted from the model object fitted by lm() include:

> names(lmod) # lmod is the regression model object
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"
[11] "terms"         "model"

QR Decomposition (allowing for efficient estimation of β)

The QR decomposition factorizes the n × p model matrix X into a product X = QR of an n × p matrix Q with orthonormal columns (QᵀQ = I) and a p × p upper triangular matrix R (Rij = 0 for i > j). In the linear regression model, the QR decomposition gives a more numerically stable and efficient solution for the regression coefficients β:

1. Substituting X = QR into the linear regression equation y = Xβ + ε gives y = QRβ + ε.
2. Multiplying by Qᵀ gives Qᵀy = QᵀQRβ + Qᵀε, which reduces to Qᵀy = Rβ + e, where e = Qᵀε.
3. Letting z = Qᵀy, we now have z = Rβ + e. Since R is an upper triangular matrix, we can solve for β by back substitution, starting from the last row of R and working upwards.

By using the QR decomposition, we avoid the explicit inversion of the matrix XᵀX, which can be computationally expensive and numerically unstable.

Example: Calculating β̂ Using the QR Decomposition

Use the QR decomposition and the Galapagos Islands data (gala) to calculate β̂ for the linear model y = Xβ + ε. First, extract the design matrix X and the response vector y from the gala dataset. Second, compute the QR decomposition, and then the vector f = Qᵀy. Finally, solve the linear system Rβ = f using backward substitution:

> x <- model.matrix(~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
> y <- gala$Species
> QR <- qr(x)
> Q <- qr.Q(QR)
> R <- qr.R(QR)
> f <- t(Q) %*% y
> (beta.hat <- backsolve(R, f))

Identifiability

The least squares estimate is not unique when XᵀX is singular, that is, when the model matrix X is not of full rank. Problems also arise when the number of parameters p approaches or exceeds the number of cases n. When p = n, we may perhaps estimate all the parameters, but with no degrees of freedom left to estimate any standard errors or do any testing. Such a model is called saturated. When p > n, the model is sometimes called supersaturated.
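The QR-based computation of β̂ described above can be sketched numerically. The following is an illustrative Python/NumPy sketch on a made-up design matrix (not the gala data): factor X = QR, form f = Qᵀy, back-substitute through R, and confirm the answer matches the normal equations.

```python
import numpy as np

# Hypothetical design matrix (intercept + two predictors) and response.
X = np.array([[1.0, 2.0, 0.5],
              [1.0, 4.0, 1.5],
              [1.0, 6.0, 1.0],
              [1.0, 8.0, 2.5],
              [1.0, 5.0, 2.0]])
y = np.array([1.0, 2.5, 3.1, 4.8, 3.3])

# Thin QR: Q is n x p with orthonormal columns, R is p x p upper triangular.
Q, R = np.linalg.qr(X)
f = Q.T @ y

# Back substitution for R beta = f, from the last row upwards.
p = R.shape[0]
beta_qr = np.zeros(p)
for i in range(p - 1, -1, -1):
    beta_qr[i] = (f[i] - R[i, i + 1:] @ beta_qr[i + 1:]) / R[i, i]

# The same answer as solving the normal equations directly.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
```

The explicit loop plays the role of R's backsolve(); in practice one would call a library triangular solver, but writing it out shows why an upper triangular R makes the system cheap to solve.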
How the Problem of Identifiability Is Handled

Statistics packages handle non-identifiability differently:

- In the regression case, some packages may return error messages, and some may fit models anyway because rounding error may remove the exact identifiability.
- In other cases, constraints may be applied.
- R, by default, fits the largest identifiable model by removing variables in the reverse order of their appearance in the model formula.

Lack of identifiability is obviously a problem, but it is usually easy to identify and work around. The issue is more problematic in situations where we are close to unidentifiability. In most cases, the cause of unidentifiability can be revealed with some thought about the variables; failing that, an eigendecomposition of XᵀX will reveal the linear combination(s) that gave rise to the unidentifiability. This will be dealt with later when discussing principal components analysis.

Example: How R Deals with the Problem of Identifiability

Create a new variable for the Galapagos dataset: the difference in area between the island and its adjacent island. Then fit a regression model with the new variable added.

> gala$Adiff <- gala$Area - gala$Adjacent
> lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff, gala)
> sumary(lmod)

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  7.068221  19.154198  0.3690 0.7153508
Area        -0.023938   0.022422 -1.0676 0.2963180
Elevation    0.319465   0.053663  5.9532 3.823e-06
Nearest      0.009144   1.054136  0.0087 0.9931506
Scruz       -0.240524   0.215402 -1.1166 0.2752082
Adjacent    -0.074805   0.017700 -4.2262 0.0002971

n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77

R warned about singularity of XᵀX because the rank of the design matrix X (the number of linearly independent columns) is six, less than its seven columns.
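Both the exact dependence just described and the near-dependence case can be sketched numerically. The following is an illustrative Python/NumPy sketch mirroring the Adiff construction on made-up data (not the actual gala values): an exactly dependent column leaves the rank unchanged, so XᵀX is singular, while a tiny perturbation restores full rank at the cost of severe ill-conditioning.

```python
import numpy as np

# Hypothetical design matrix: intercept, "area", "adjacent" columns.
rng = np.random.default_rng(1)
n = 30
area = rng.uniform(0.1, 100.0, n)
adjacent = rng.uniform(0.1, 100.0, n)
X = np.column_stack([np.ones(n), area, adjacent])

# Adding an exact linear combination (like Adiff = Area - Adjacent) leaves the
# rank at 3, so the 4-column matrix is rank deficient and X'X is singular.
X_exact = np.column_stack([X, area - adjacent])
rank_exact = np.linalg.matrix_rank(X_exact)

# Perturbing the dependent column with tiny noise restores full rank, but the
# matrix is now ill-conditioned: estimates exist, yet the (X'X)^{-1} diagonal
# entries that scale the coefficient variances become huge.
noisy = area - adjacent + 0.001 * (rng.uniform(size=n) - 0.5)
X_near = np.column_stack([X, noisy])
rank_near = np.linalg.matrix_rank(X_near)
cond_near = np.linalg.cond(X_near)
var_factors = np.diag(np.linalg.inv(X_near.T @ X_near))
```

This is the numerical side of the story told by the two gala examples: exact dependence produces a singularity message, while near-dependence produces estimates with enormous standard errors.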
R dealt with the problem by fitting the largest identifiable model (five predictors): it removed the variable Adiff because it is a linear combination of two variables already in the model, namely the island area and the area of its adjacent island.

Example: Cases Close to Unidentifiability

Use the Galapagos dataset and the newly created variable Adiff. Add to Adiff a small random variate (noise) from a uniform distribution. To keep generating the same random numbers you can use the set.seed() function in R. Then fit a regression model using all predictors and the response Species.

> set.seed(123) # setting a seed to get the same random numbers
> Adiffe <- gala$Adiff + 0.001*(runif(30) - 0.5)
> lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiffe, gala)
> sumary(lmod)
              Estimate  Std. Error t value  Pr(>|t|)
(Intercept) 3.2964e+00  1.9434e+01  0.1696    0.8668
Area       -4.5123e+04  4.2583e+04 -1.0596    0.3003
Elevation   3.1302e-01  5.3870e-02  5.8107 6.398e-06
Nearest     3.8273e-01  1.1090e+00  0.3451    0.7331
Scruz      -2.6199e-01  2.1581e-01 -1.2140    0.2371
Adjacent    4.5123e+04  4.2583e+04  1.0596    0.3003
Adiffe      4.5123e+04  4.2583e+04  1.0596    0.3003

n = 30, p = 7, Residual SE = 60.81975, R-Squared = 0.78

Notice that now all the parameters are estimated (no singularity is detected), but the standard errors are very large because we cannot estimate them in a stable way. We generally need to be able to identify such situations; this will be discussed later under the topic of collinearity.

Orthogonality

An orthogonal model means that all the independent variables in the model are uncorrelated. Orthogonality is a useful property because it allows easier interpretation of the effect of one predictor without regard to another. Orthogonality is desirable, but it will only occur when X is chosen by the experimenter; it is a feature of a good design.
In observational data, the investigator has no direct control over X, and this is the source of many of the interpretational difficulties associated with non-experimental data.

Exercise 2 (Due Tuesday September 13, 2022)

In this question, we investigate the relative merits of methods for computing the coefficients. Generate some artificial data by:

> x <- 1:20
> y <- x + rnorm(20)
