Summary

These lecture notes cover statistical learning, a framework for machine learning focused on prediction. Key topics include linear regression, logistic regression, and principal component analysis, along with applications in text mining, image processing, and speech recognition. The notes also outline the course prerequisites and main topics, including an introduction to the R software.

Full Transcript

STATISTICAL LEARNING - Stefanucci [email protected] 12/09
It is a framework for machine learning mainly interested in prediction, with applications in text mining, image processing, speech recognition and bioinformatics. All machine learning methods are built upon statistical foundations: it is about the process of obtaining information from raw data (numbers). It has grown a lot in the last decade because it developed powerful prediction models. We use models because we are modeling the data and predicting new values. For instance, linear regression is a simple method and its power is limited; moving toward machine learning methods, the models become much more complicated and gain power to predict new data, at the price of being harder to interpret. The prediction performances of these models are just good, but a best model does not exist. A statistical model can be applied to all types of data, whatever their structure, economic or environmental: from the statistical point of view they are just numbers. It is important to link the type of data to the right model; a single prescribed application does not exist. Applications are mainly of three kinds: text mining (classification), image processing (each picture being treated as a number), speech recognition.
PREREQUISITES introductory statistics, math (probability theory), statistical inference (trying to build a model for the data and to draw conclusions)
MAIN TOPICS
introduction to the R software ⇒ free ⇒ it comes with some basic functions (mean computation, ...) ⇒ mainly used by researchers who need complex models in order to develop other models, so R is extended with packages: if you need something specific, it has probably been developed by someone else, so there is no industry behind it
linear regression (observing the relationship between variables and trying to model it) ⇒ simplest with 2 variables ⇒ limitation: it is mainly used with continuous data, with no constraints ⇒ built upon the normal distribution
logistic regression (extension of the linear model)
principal component analysis (when there are several variables, it is complicated to do just descriptive statistics) ⇒ dimension reduction, meaning transforming the data in order to extract all the relevant information while reducing the number of variables ⇒ n rows (observations, subjects) ⇒ p columns (variables) ⇒ useful when the data set has a large number of variables
cluster analysis (creating groups of homogeneous subjects) ⇒ no target variable
machine learning methods for supervised learning (second module)
modern applications: text mining, image processing
DIFFERENCE between regression models (linear and logistic) & principal component and cluster analysis
Suppose we have y = dependent variable (presence of the disease) and x = independent, or explanatory, variable (medical image).
REGRESSION: trying to predict the outcome of a variable given other variables ⇒ supervised prediction through other variables that can be observed directly; there is a target variable that is treated differently: one variable is considered the independent (predictor) variable x and the other the dependent (outcome) variable y.
PRINCIPAL COMPONENT and CLUSTER ANALYSIS: all variables are on the same footing; there is no most important one to be your target, no specific target variable ⇒ unsupervised.
Supervised learning techniques are used to predict a target variable based on predictors, and/or to assess the interrelationships among predictors and a target variable (linear and logistic regression). As an example, suppose you want to predict the risk that a family will be materially deprived next year.
This can be done by using data that can be measured at baseline (number of family members, disposable income, health status, etc.) and use these to predict material deprivation for a sample of families with known status. Incidentally, you will also understand how health status affects the risk of material deprivation. Unsupervised learning techniques are used to find groups in data, that is, to predict target categorical variables that are not measured (cluster analysis). Additionally, they are used to summarize data (dimension reduction, done with principal component analysis in this course). As an example, suppose you want to assess an unmeasurable trait, like happiness. Suppose your target units are geographic regions. Happiness can be measured indirectly through a series of variables (questionnaires, indices, etc.). A general score is obtained through dimension reduction by finding the optimal weighted average of all measurements. Cluster analysis will separate regions in few (two, three, four) groups, with respect to levels of happiness. Different policies can then be scheduled for each group. WHY Statistical Softwares Most computations are not available anymore with just mathematics, so softwares are specialized computer programs for analysis on statistics and econometrics huge amount of data are modeled in a simple way (descriptive or inferential analysis are easily manageable) Often the estimation of unknown model parameters requires numerical approximation. For ex, often there is no closed form solution to the maximization of the likelihood function programming languages allow to extend existing models or develop model created ad hoc in particular circumstances 13/09 INTRODUCTION TO R R and RSTUDIO R is a programming language for statistical computing and graphics (blank page to be written on) R STUDIO is an Integrated Development Environment (IDE) for R, more friendly version 2 Main features 1.Data processing 2.Programming R BASICS arithmetic objects and structures (it can be a number, a vector, an image) which are treated differently importing and exporting data (the first step to statistical analysis is to import data from external source) installing packages getting help BASIC STATISTICS with R descriptive statistics and graphics statistical tests linear regression R PROS vs CONS PROS ⇒ versatile, flexible, powerful, free, based on command-lines: reproducible CONS ⇒ based on command-lines: not user friendly and still slightly cumbersome with big data WHAT CAN R DO? 1. Import data 2. Manipulate any kind of data IMPORT = basic text files, excel files, several other formats MANIPULATE = indexing and subsetting = merging and reshaping GENERAL INSTRUCTIONS and FUNCTIONS If the layout is white, functions are in black key words are in red data are in blue In R, functions are tools that are identified by words (like mean or sd) and followed by the parenthesis, inside which there is the argument of the function (which is different for every function). So data will have a name in order to be an object in the environment. Mandatory arguments have no default and optional arguments do have it mean (x,...) x is an R object c will allow you to create a vector if you have a lot of variables. ?c stands for Combine values and arguments ⇒into a Vector or List For instance, the function “mean” requires a vector as a mandatory argument inside the parentheses, because in order to compute the mean of something, you need multiple variables. 
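As a concrete sketch of how R functions and their arguments work (the numeric values here are arbitrary, chosen only for illustration):

# create a vector by combining values with c()
x <- c(12, 34, 3)
mean(x)                # mandatory argument: the R object x
sd(x)                  # standard deviation of the same vector
y <- c(12, 34, 3, NA)
mean(y)                # returns NA because one value is missing
mean(y, na.rm = TRUE)  # optional argument na.rm has default FALSE
?c                     # help page: Combine Values into a Vector or List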
mean (x=c(12,34,3)) mean (c(12,34,3) R will automatically delete x= because it knows it’s the argument vec library(MASS) example of a function to load a package’s name plots will make you draw points >plot(a,b) a and b are understood as x and y >plot(x=c(2,3), y=c(3,4)) a and b as vectors SYMBOLS for mathematical computations * is for multiplications / is for divisions ^ is for powers sqrt is for square root log (4,2) log of 4 with the base of 2 4 is the x argument, 2 is the base In case of logs, if we specify the name of the arguments, the order is no longer important, such as >log (x=4, base=2) 14/09 TYPES of DATA 1. Numeric variables for ex 3.1 2. Character variables for ex “male” 3. Logical variables for ex T or F converted in numbers 1 for true and 0 for false KIND OF DATA STRUCTURE that we can manage in R The most simple structure is single values. So what if we have several values of the same kind? we can bind them in a sequence thru a vector (function c) for ex x 0 where λ = parameter for intensity, high λ more cases Representing the number of random events (customers, emails, covid cases) in some time interval We might model Y as a Poisson distribution with mean E(Y ) = λ = 5. This means that the probability of no users during this particular hour is. The probability that there is exactly one user is and so on. FAMOUS PROBABILITY DISTRIBUTIONS for CONTINUOUS r.v. 1) Uniform: X ∼ Unif (a, b) for b > a Probability for each value inside a specific interval whose outcomes have the same probability Density function Cumulative distribution function 2) Exponential: X ∼ Exp(λ) for λ > 0 Managing r.v. that no longer “work” ; positive i continuous time (for ex. Lifetime of something) Density function Cumulative distribution function 3) Normal: X ∼ N(µ, σ2) or (0,1) for −∞ < µ < ∞ and σ > 0 Bell-shaped density function MEAN of a r.v. The original name is expected value E(X) For ex., let X be a Bernoulli r.v. Then p(1) = p, p(0) = 1 − p ⇒ E(X) = 1 × p + 0 × (1 − p) = p For ex., let X be a discrete r.v. with probability p(x) = 1/3 for x = −1, 0, 1 For ex., Let X be a continuous r.v. with density fX(x). The expected value of X (or mean of X) is X ∼ Binomial(N, p) ⇒ E(X) = N p X ∼ Poisson(λ) ⇒ E(X) = λ X ∼ Exp(λ) ⇒ E(X) = 1/λ X ∼ N(0, 1) ⇒ E(X) = 0 VARIANCE The average distance that each person has to walk to get to the mean (half of the mean). If it is rooted, it’s called standard deviation (sd). For any random variable X , the variance of X is the expected value of the squared difference between X and its expected value. For normal random variables, X ∼ N(µ, σ2), the 2 parameters are respectively the mean and the variance. Bivariate r.v When the outcome of the random experiment is a pair of numbers (X, Y) we call (X, Y) a bivariate (or two-dimensional) random variable. Example: result of a football match, cases of COVID19 today and tomorrow, income of husband and income of wife, cost and outcome of a medical treatment... When X and Y are both discrete we call Z = (X, Y ) a bivariate discrete random variable When X and Y are both continuous r.v. we call Z = (X, Y ) a bivariate continuous random variable STATISTICAL INFERENCE / PREDICTION Statistical learning is a framework for machine learning that draws from statistics and functional analysis. It deals with finding a predictive function based on the data presented. Aim → draw conclusions from data and make predictions. 2 main types of data 1. DEPENDENT (y) → whose value depend on other variables → It Is also called RESPONSE - OUTPUT 2. 
INDEPENDENT (x) → whose value does not depend on the vales of other variable → EXPLANATORY - INPUT - COVARIATES The independent (explanatory) variable will affect the dependent (response) variable. In our statistical model we will observe how the explanatory variable influences the response variable. More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2,...,Xp. We assume that there is some relationship between Y and X = (X1, X2,...,Xp), which can be written in the very general form Y = f(X) + ϵ Where f is some fixed but unknown function of X1…Xp and ϵ is a random error term, which is independent of X and has mean zero. We note that some of the observations lie above the curve and some lie below it; overall, the errors have approximately mean zero. In this formulation, f represents the systematic information that X provides about Y. ⇒ To sum up, statistical learning refers to a set of approaches for estimating f, called statistical models, representing a relationship between the dependent and independent variable. it can be a y= mx + q, where m is the gradient and q the intercept. This relation between the two variables is measured by: covariance → determines types of interaction, how 2 variables moves (+ or -) correlation → how 2 variables are strongly related ⇒We are often interested in understanding the association between Y and X1,...,Xp. In this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y. Which predictors are associated with the response? It is often the case that only a small fraction of the available predictors are substantially associated with Y. Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application. What is the relationship between the response and each predictor? Some predictors may have a positive relationship with Y , in the sense that larger values of the predictor are associated with larger values of Y. Other predictors may have the opposite relationship. Depending on the complexity of f, the relationship between the response and a given predictor may also depend on the values of the other predictors. Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? Historically, most methods for estimating f have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables. Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating f may be appropriate. For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches can potentially provide quite accurate predictions for Y , but this comes at the expense of a less interpretable model for which inference is more challenging. SUPERVISED vs. UNSUPERVISED LEARNING Most statistical learning problems fall into one of two categories: supervised supervised or unsupervised. For each observation of the predictor measurement(s) xi, i = 1,...,n there is an associated response measurement yi. 
We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference). Many classical statistical learning methods such as linear regression and logistic regression operate in the supervised learning domain. By contrast, unsupervised learning describes the somewhat more challenging situation in which for every observation i = 1,...,n, we observe a vector of measurements xi but no associated response yi (no target variable). It is not possible to fit a linear regression model, since there is no response variable to predict. QUANTITATIVE vs. QUALITATIVE Variables can be characterized as either quantitative or qualitative (also known as categorical). We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems. However, the distinction is not always that crisp. Least squares linear regression is used with a quantitative response, whereas logistic regression is typically used with a qualitative (two-class, or binary) response. Thus, despite its name, logistic regression is a classification method. But since it estimates class probabilities, it can be thought of as a regression method as well. We tend to select statistical learning methods on the basis of whether the response is quantitative or qualitative; i.e. we might use linear regression when quantitative and logistic regression when qualitative. However, whether the predictors are qualitative or quantitative is generally considered less important. Most of the statistical learning methods can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed. HOW DO WE ESTIMATE f ? We will always assume that we have observed a set of n different data points. These observations are called the training data because we will use these training data observations to train, or teach, our method how to estimate f. Let xij represent the value of the jth predictor, or input, for observation i, where i = 1, 2,...,n and j = 1, 2,...,p. Correspondingly, let yi represent the response variable for the ith observation. Then our training data consist of {(x1, y1),(x2, y2),...,(xn, yn)} where xi = (xi1, xi2,...,xip)^T Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function f. In other words, we want to find a function ˆf such that Y ≈ ˆf(X) for any observation (X, Y). LINEAR MODELS usually involve parametric methods, in which we need to estimate the parameters β0, β1,..., βp. That is, we want to find values of these parameters such that Y ≈ β0 + β1X1 + β2X2 + ··· + βpXp. The parametric model-based approach reduces the problem of estimating f down to one of estimating a set of parameters. Assuming a parametric form for f simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters, such as β0, β1,..., βp in the linear model, than it is to fit an entirely arbitrary function f. Therefore, since we have assumed a linear relationship between the response and the two predictors, the entire fitting problem reduces to estimating β0, β1, and β2 by using least squares linear regression (the most common approach to fitting the model). 
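A minimal R sketch of this parametric approach on simulated training data (the true coefficient values 2 and 0.5, the error standard deviation, and the sample size are invented for illustration):

set.seed(123)                       # make the simulation reproducible
n   <- 100
x   <- runif(n, 0, 10)              # a single predictor
eps <- rnorm(n, mean = 0, sd = 1)   # error term with mean zero
y   <- 2 + 0.5 * x + eps            # training data generated from Y = f(X) + eps

cor(x, y)                           # strength of the linear association
fit <- lm(y ~ x)                    # least squares estimates of beta0 and beta1
coef(fit)                           # should be close to the true values 2 and 0.5
predict(fit, newdata = data.frame(x = 5))  # predicted response at x = 5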
However, the potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. So given some data y and a statistical model with parameters θ, we may ask: What values for θ are most consistent with y? → point estimation Is some prespecified restriction on θ consistent with y? → hypothesis testing Which of several alternative models/hypothesis is most consistent with y? → model selection What ranges of values of θ are consistent with y? → interval estimation Is the model consistent with y for any values of θ? → model checking The data gathering process can be optimized? → experimental or sampling design POINT ESTIMATION TECHNIQUES 1. Maximum likelihood estimation: ⇒ assume a statistical model f(x; θ) based on the observed data; ⇒ define and maximize the likelihood function L(θ) w.r.t. θ. 2. Ordinary least squares (linear regression) ⇒ minimize the sum of the squares of the differences between the observed dependent variable and those predicted by the linear function of the independent variable. LIKELIHOOD FUNCTION It’s the probability function or density function of the observed data x, viewed as a function of the unknown parameter θ. The parameter θ can be a scalar or a vector (for example in the normal case we have (µ, σ2 )). Vector parameters will be denoted with θ. The space of all possible realizations of X will be denoted with T and called sample space; so the parameter θ takes values in the parameter space Θ. Parameter values that make the observed data appear relatively probable are more likely to be correct than parameter values that make the observed data appear relatively improbable. MAX LIKELIHOOD ESTIMATE The maximum likelihood estimate (MLE) ˆθML of a parameter θ is the point where the likelihood assumes the maximum value Plausible values of θ should have a relatively high likelihood. CONFIDENCE INTERVALS (CI) A confidence interval is a range of estimates for an unknown parameter. A confidence interval is computed at a designated confidence level, which represents the long-run proportion of corresponding CIs that contain the true value of the parameter. For example, out of all intervals computed at the 95% level, 95% of them should contain the parameter’s true value. Typically a rule for constructing confidence intervals is closely tied to a particular way of finding a point estimate of the quantity being considered. PP.2 “ INTRODUCTION TO LINEAR REGRESSION” LINEAR REGRESSION MODEL Regression models so0me variable through others Fixed variables: two or more, known as explanatory/covariates (x or x1, x2…xn) Random variables: known as responses. Only one variable is regarded as a response (y). Simplest regression model: analyze how 1 variable depends on others; a single response depends linearly on a single covariate. The least squares approach is the most commonly used approach to fit this model. SIMPLE LINEAR REGRESSION: It is a very straight-forward approach to predict a quantitative response Y on the basis of a single predictor variable X. This model is a relatively inflexible approach, because it can only generate linear functions such as lines or planes, so it assumes that there is approximately a linear relationship between x and y. Mathematically, we can write this linear relationship as: “≈” means “is approximately modeled as”, “we are regressing Y on X (or Y onto X)”. 
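The displayed equation was lost in these notes; a standard reconstruction, in the β0/β1 notation used a few lines below, is:

\[
Y \approx \beta_0 + \beta_1 X,
\qquad\text{i.e.}\qquad
y_j = \beta_0 + \beta_1 x_j + \epsilon_j,\quad j = 1,\dots,n,
\qquad \epsilon_j \stackrel{\text{iid}}{\sim} N(0,\sigma^2).
\]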
➔ b1 = the intercept (average value of the response, representing the value of Yj when Xj=0; unknown coefficient) ➔ b2 = is the slope of the regression line (influence of x on y) ; if x increases (decreases) y will increase (decrease) of b2. (unknown coefficient) ➔ ej= the error. The relationship is the general trend (line) but some single observations can fall below or above it. If you remove the error, all the observations become the same ⇒ the randomness comes from the error which we assume is normally distributed STRAIGHT-LINE or SIMPLE REGRESSION MODEL !! Assumes that random variables Yj satisfy Where x1... , xn are known constants β1, β2 and σ 2 are unknown parameters ϵ1,... , ϵn are i.i.d. N(0, σ2 ) (homoscedasticity) Together b1 and b2 are the UNKNOWN coefficients or parameters. Since they are unknown, in order to make predictions we must use the training data to estimate coefficients, or also estimates of b1 and b2. The training data is constant, so fixed, can’t change, it’s not random. Let (x1,y1) + (x2,y2)... (xn, yn) represent n observation pairs, each of which consists of a measurement of X and a measurement of Y. Our goal is to obtain coefficient estimates βˆ0 and βˆ1 such that the linear model Y≈β0 + β1x fits the available data well. ⇒ The data arise as pairs (x1, y1)... ,(xn, yn), from which β1, β2 and σ^2 are to be estimated. ⇒ so that yi ≈ βˆ0 + βˆ1xi for i = 1,...,n. In other words, we want to find an intercept βˆ0 and a slope βˆ1 such that the resulting line is as close as possible to the n data points. Once we have used our coefficient parameter training data to produce estimates βˆ0 and βˆ1 for the model coefficients, we can predict future responses on the basis of a particular value by computing where ˆy indicates a prediction of Y on the basis of X = x. Here we use a hat symbol, ˆ , to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response. There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion, which chooses estimated b0 and b1 to minimize the RSS. LEAST SQUARE ESTIMATES: minimizing the distance in order to estimate β1 and β2. Let ˆyi = βˆ0 + βˆ1xi be the prediction for Y based on the ith value of X. Then ei = yi −yˆi represents the ith residual (difference between the ith observed response value and the ith response value). We define the residual sum of squares (RSS) or equivalently as as the smallest sum of square SS (β1, β1) attainable by fitting the linear regression model to the data. To estimate e β1 and β2 we can minimize the distance which is the sum of squared vertical deviations between the yj and their means β1 + β2 xj under the linear model. We do the SUM because positive and negative errors tend to compensate, so we square them otherwise they will be 0. This is equivalent to find among all the possible straight lines β0 + β1x the one which minimizes the sum of the vertical distances between the points yj and β0 + β1xj ASSESSING the ACCURACY of the COEFFICIENT ESTIMATES We assume that the true relationship between X and Y takes the form Y = f(X) + ϵ for some unknown function f, where ϵ is a mean-zero random error term. If f is to be approximated by a linear function, then we can write this relationship as Where ❖ β0- intercept term, the expected value of Y when X = 0 ❖ β1- slope, the average increase in Y associated with a one-unit increase in X. 
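The displayed formulas for this stretch were also lost; the standard expressions (population model, residual sum of squares, and the least squares solution that minimizes it) are:

\[
Y = \beta_0 + \beta_1 X + \epsilon,
\qquad
\mathrm{RSS}(\beta_0,\beta_1) = \sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2,
\]
\[
\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}},
\qquad
\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}.
\]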
This defines the population regression line, which is the best linear approximation to the true relationship between X and Y. 1 The least squares regression coefficient estimates (3.4) characterize the least squares line. The true relationship is generally not known for real data, but the least squares line can always be computed using the coefficient estimates. In the same way, the unknown coefficients β0 and β1 in linear regression define the population regression line. We seek to estimate these unknown coefficients using βˆ0 and βˆ1. These coefficient estimates define the least squares line. 2 σ estimator We continue the analogy with the estimation of the population mean µ of a random variable Y. A natural question is as follows: how accurate is the sample mean ˆµ as an estimate of µ? We have established that the average of ˆµ’s over many data sets will be very close to µ, but that a single estimate ˆµ may be a substantial underestimate or overestimate of µ. How far off will that single estimate of ˆµ be? In general, we answer this question by computing the standard error of ˆµ, written as SE(ˆµ). We have the well-known formula where σ is the standard deviation of each of the realizations yi of Y. 2 Roughly speaking, the standard error tells us the average amount that this estimate ˆµ differs from the actual value of µ. It also tells us how this deviation shrinks with n—the more observations we have, the smaller the standard error of ˆµ. In a similar vein, we can wonder how close βˆ0 and βˆ1 are to the true values β0 and β1. 2 σ = Var(ϵ) The errors ϵi for each observation have common variance σ2 and are uncorrelated. 2 In general, σ is not known, but can be estimated from the data. This estimate of σ is known 2 as the residual standard error (RSE); strictly speaking, when σ is estimated from the data we should write SE(βˆ1) to indicate that an estimate has been made. Standard errors can be used to compute confidence intervals. A 95 % confidence interval is defined as a range of values such that with 95 % probability, the range will contain the true unknown value of the parameter. Since the simple linear model assumes With Then 2 We can estimate σ by calculating the variance of the residuals. ASSESSING the ACCURACY of the MODEL Once we have obtained the fitted value yˆj it is important to evaluate how they fit the observed values yj , that is we need to measure the goodness of fit of the regression model We have rejected the null hypothesis in favor of the alternative hypothesis, and it is natural to want to quantify the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using 2 related quantities: the 2 residual standard error (RSE) and the 𝑅 statistic. COEFFICIENT OF DETERMINATION The explained sum of squares (ESS) is the sum of the squares of the deviations of the predicted values from their mean: It is opposed to the RSS (residual sum of squares) Where the total sum of squares (TSS) is Thus we have the following identity: TSS = ESS + RSS In general, the greater the ESS, the better the estimated model performs. In fact ESS represents the data variability explained by the regression model. If the variability of the error it’s low, it means that it has a low distance from the line. 2 The coefficient of determination 𝑅 𝑠𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐 represents an index of goodness of fit for the simple regression model. It measures the fraction of data variability explained by the 2 2 regression model. 
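Continuing the simulated example from the earlier sketch (objects fit, x and y), the decomposition TSS = ESS + RSS and the coefficient of determination can be checked directly in R:

yhat <- fitted(fit)              # fitted values from the earlier lm() fit
TSS  <- sum((y - mean(y))^2)     # total sum of squares
RSS  <- sum((y - yhat)^2)        # residual sum of squares
ESS  <- sum((yhat - mean(y))^2)  # explained sum of squares
TSS - (ESS + RSS)                # ~ 0 up to rounding: TSS = ESS + RSS
1 - RSS / TSS                    # coefficient of determination R^2
summary(fit)$r.squared           # the same value reported by summary()
cor(x, y)^2                      # equals R^2 in simple linear regression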
Note that 0 ≤ 𝑅 ≤ 1 and values of 𝑅 approaching 1 represents a perfect 2 2 fit. It is straightforward to prove that 𝑅 = 𝑟 where r is the correlation coefficient sxy/(sxsy). It provides an alternative measure of fit, it takes the form of a proportion (variance) and so a value between 0 and 1. TSS measures the total variance in the response Y , and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or 2 removed) by performing the regression, and 𝑅 measures the proportion of variability in Y 2 that can be explained using X. An 𝑅 statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. A number near 0 indicates that the regression does not explain much of the variability in the response; CONFIDENCE INTERVALS and HYPOTHESIS TEST In real applications, we have access to a set of observations from which we can compute the least squares line; however, the population regression line is unobserved. If we generate ten different data sets from the model and plot the corresponding ten least squares lines, we notice that different data sets generated from the same true model result in slightly different least squares lines, even though the unobserved population regression line does not change. At first glance, the difference between the population regression line and the least squares line may seem subtle and confusing. We only have one data set, and so what does it mean that two different lines describe the relationship between the predictor and the response? Fundamentally, the concept of these two lines is a natural extension of the standard statistical approach of using information from a sample to estimate characteristics of a large population. For example, suppose that we are interested in knowing the population mean µ of some random variable Y. Unfortunately, µ is unknown, but we do have access to n observations from Y , y1,...,yn, which we can use to estimate µ. A reasonable estimate is , where is the sample mean. The sample mean and the population mean are different, but in general the sample mean will provide a good estimate of the population mean. In the same way, the unknown coefficients β0 and β1 in linear regression define the population regression line. We seek to estimate these unknown coefficients using βˆ0 and βˆ1, which define the least square line. ⇒ Standard errors can be used to compute confidence intervals. A 95 % confidence interval is defined as a range of values such that with 95 % probability, the range will contain the true unknown value of the parameter. The range is defined in terms of lower and upper limits computed from the sample of data. A 95% confidence interval has the following property: if we take repeated samples and construct the confidence interval for each sample, 95% of the intervals will contain the true unknown value of the parameter. For linear regression, the 95 % confidence interval for β1 approximately takes the form that is, there is approximately a 95% chance that the interval will contain the true value of β1. ⇒ Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the NULL HYPOTHESIS vs. the ALTERNATIVE HYPOTHESIS. 
null hypothesis of H0: there is no relationship between X and Y alternative hypothesis of Ha: there is some relationship between X and Y Mathematically, this corresponds to testing vs. since if β1 = 0 then the model reduces to Y = β0 + ϵ, and X is not associated with Y. To test the null hypothesis, we need to determine whether βˆ1, our estimate for β1, is sufficiently far from zero that we can be confident that β1 is non-zero. How far is far enough? This of course depends on the accuracy of βˆ1, that depends on SE(βˆ1): If SE(βˆ1) is small, then even relatively small values of βˆ1 may provide strong evidence that β1 ̸= 0, and hence that there is a relationship between X and Y. In contrast, if SE(βˆ1) is large, then βˆ1 must be large in absolute value in order for us to reject the null hypothesis. In practice, we compute a t-statistic which measures the number of standard deviations that βˆ1 is away from 0. Consequently, it is a simple matter to compute the probability of observing any number equal to |t| or larger in absolute value, assuming β1 = 0. We call this probability the p-value. Roughly speaking, we interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence, if we see a small p-value, then we can infer that there is an association between the predictor and the response. We reject the null hypothesis, that is, we declare a relationship to exist between X and Y , if the p-value is small enough (typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%). If we want to investigate the relationship between 2 variables Y and X by defining (i.i.d = independent and identically distributed) We assume that there is a causal relationship. One cannot ”search” for causality with the regression, the regression can only be used if a causal relationship is assumed. 2 If (0, σ ) are high = far from the line 2 If (0, σ ) are low = close to the line Linear model ⇒ CAUSAL RELATIONSHIP - identifying one variable as the response and the other as explanatory - Assuming then a direction - Relation that needs to be checked Correlation ⇒ to assess RELIABILITY - descriptive and symmetric measure - Stating how much the relation between the 2 variables is strong - If the correlation is high, there is a good fit SUMMARY ITINERARY for linear regression model: 1) Model specification and assumptions 2) Point estimation 3) Interpretation of the coefficients 4) Calculation of standard errors 2 5) Diagnostics ( 𝑅 , t-test, test for homoscedasticity aka constant variance) PP.3 “MULTIPLE LINEAR REGRESSION” MULTIPLE LINEAR REGRESSION Up to now, the explanatory variable (x) was just a single one. Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor. Therefore, when we have several explanatory variables, instead of fitting a separate simple linear regression model for each predictor, a better approach is to extend the simple linear regression model, so that it can directly accommodate multiple predictors. We can do this by giving each predictor a separate slope coefficient in a single model, isolating the effect of each of them. 
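A hedged R sketch of this idea with two simulated predictors (the names x1, x2 and the coefficient values are made up); it also shows the confidence intervals, t-statistics and p-values discussed above:

set.seed(42)
n  <- 200
x1 <- rnorm(n)                 # first predictor
x2 <- rnorm(n)                 # second predictor
y2 <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

mfit <- lm(y2 ~ x1 + x2)       # one slope coefficient per predictor
coef(mfit)                     # estimates of beta0, beta1, beta2
confint(mfit, level = 0.95)    # 95% confidence intervals for the coefficients
summary(mfit)$coefficients     # estimates, standard errors, t-statistics, p-values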
We have several regression coefficients, not just the intercept and the slope Regression: generalization of a line and answer to the question what happens to y when x varies? ; now looking for the best fitting plane The Multiple Linear Regression Model takes the form of Same as the linear one but with several explanatory variables added ★ Xj = j-th predictor ★ β0 = intercept ★ βj = quantifies the association between that variable and the response / average effect on Y a one unit increase in Xj, holding all other predictors fixed (vs. in simple regression the other predictors were ignored) ★ β1 = coefficient related to the first variable ★ β2 = coefficient related to the second variable As was the case in the simple linear regression setting, the regression coefficients β0, β1,..., βp are unknown, and must be estimated. Given estimates βˆ0, βˆ1,..., βˆp, we can make predictions using the formula The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression. We choose β0, β1,..., βp to minimize the sum of squared residuals The values βˆ0, βˆ1,..., βˆp that minimize the sum of squared residuals are the multiple least squares regression coefficient estimates. Unlike the simple linear regression estimates, they have somewhat complicated forms that are most easily represented using matrix algebra. In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane. The plane is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane. MATRIX ALGEBRA Matrix = generalization of vectors Vector = sequence of numbers usually written in rows 𝑎 (1, 3, 4) [dimension 1x3]; its transport will be in columns The linear regression model with design matrix X that can also be written as. In an extended form as a result of a SCALAR PRODUCT: In R STUDIO: %*% for scalar product of vectors How to multiply a MATRIX to a VECTOR? Only if the 2° dimension of the matrix corresponds to the 1° dimension of the vector 2x2 and 2x1 STRAIGHT-LINE REGRESSION in MATRIX NOTATION For the straight line regression model yj = β1 + β2xj + ϵj for j = 1,... , n, the matrix form of the model is: X is a nx2 MATRIX β is a 2x1 VECTOR of parameters ★ 𝑌𝑗 = 𝑥1𝑗 β1 + 𝑥2𝑗 β2 +... 𝑥𝑝𝑗 β𝑝 + 𝑒𝑗 𝑤𝑖𝑡ℎ 𝑗 = 1,... 𝑛 LEAST SQUARE ESTIMATES The least square estimate of β is obtained by the value that minimizes the sum of squares In simple cases it is possible to have analytical expressions for the least square estimates. For example in the straight-line regression model the X matrix of the representation y = Xβ + ϵ is. Then we have that. After some algebra we obtain FITTED VALUES AND RESIDUALS The sum of squares SS(β) plays a central role. Its minimum value is called residual sum of squares. It is the squared discrepancy between the observations y and the fitted values yˆ = Xβ The vector yˆ = Xβˆ is the linear combination of the columns of X that minimizes the squared distance with the data y. The unobservable error ϵj = yj − x t jβ is estimated by the jth residual. TWO GROUPS COMPARISON in MATRIX NOTATION Suppose that the response variable y has been observed on two groups of observations of size n1 and n2. Let y1j for j = 1,... n1 be the observations of the first group and let y2j for j = 1... n2 be the observation of the second group. Let β1 and β1 + β2 be the means of the variable y in the two groups. 
Hence We can write the model for the two groups comparison in matrix notation y = Xβ + ϵ where For this model we have The two groups comparison can be extended to more than two groups: SOME IMPORTANT QUESTIONS when performing multiple linear regression: 1. Is at least one of the predictors x1, x2,... , xp useful in predicting the response? 2. Do all the predictors help to explain y, or is only a subset of the predictors useful? 3. How well does the model fit the data? 4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction? 1. Is at least one of the predictors useful? ⇒In the simple linear regression setting, in order to determine whether there is a relationship between the response and the predictor we can simply check whether β1 = 0. ⇒In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero, i.e. whether β1 = β2 = · · · = βp = 0. So the first step is to compute F-statistics in order to examine the associated p-value. We test the null hypothesis: With the alternative hypothesis: This hypothesis test is performed by computing the F-statistic: ★ When there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. On the other hand, if Ha is true, then E{(TSS − RSS)/p} > σ2, we expect F to be greater than 1. How large does the F-statistic need to be before we can reject H0 and conclude that there is a relationship? It turns out that the answer depends on the values of n and p. When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small. When H0 is true and the errors ϵi have a normal distribution, the F-statistic follows an F-distribution POTENTIAL PROBLEMS When we fit a linear regression model to a particular data set, many problems may occur. Most common among these are the following: 1. Non-linearity of the response-predictor relationships ⇒ tool of residual plots 2. Correlation of error terms. 3. Non-constant variance of error terms. 4. Outliers. 5. High-leverage points. 6. (Multi)Collinearity. RESIDUAL PLOTS Residual plots are a useful graphical tool for identifying non-linearity. ⇒Given a simple linear regression model, we can plot the residuals, ei = yi − yˆi , versus the predictor xi. ⇒In the case of a multiple regression model, since there are multiple predictors, we instead plot the residuals versus the predicted (or fitted) values yˆi. Ideally, the residual plot will show no visible pattern because its aim is identifying non-linearity. The presence of a pattern may indicate a problem with some aspect of the linear model. THE MULTICOLLINEARITY PROBLEM Multicollinearity (collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation, the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Let consider a (n × p) design matrix X; if |cor(xi , xj )| = 1 for i ̸= j and i, j ∈ {1, p}, then there is perfect collinearity and the product matrix XtX is not invertible. The coefficient estimates βˆ0, βˆ1,..., βˆp are estimates for β0, β1,..., βp. 
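A small sketch of these matrix computations and of the collinearity problem, reusing the simulated x1, x2 and y2 from the previous block (the extra column x3 is constructed to be perfectly collinear on purpose):

X <- cbind(1, x1, x2)                          # design matrix with a column of ones
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y2  # (X'X)^{-1} X'y
cbind(beta_hat, coef(mfit))                    # matches the lm() estimates

plot(fitted(mfit), resid(mfit))                # residual plot: ideally no visible pattern

# perfect collinearity: x3 is an exact linear combination of x1 and x2
x3 <- 2 * x1 + 3 * x2
Xc <- cbind(1, x1, x2, x3)
# solve(t(Xc) %*% Xc) would now fail: X'X is not invertible
qr(Xc)$rank                                    # rank 3 < 4 columns reveals the problem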
That is, the least squares plane is only an estimate for the true population regression plane The inaccuracy in the coefficient estimates is related to the reducible error; we can compute a confidence interval in order to determine how close Yˆ will be to f(X). LIFE CYCLE SAVINGS DATA LifeCycleSavings is data set 5 with variables observed on 50 different countries. The variables are: 1. sr aggregate personal savings, 2. pop15 % of population under 15, 3. pop75 % of population over 75, 4. dpi real per-capita disposable income, 5. ddpi % growth rate of dpi This data set is available in R > summary (lifecyclesavings) Under the life-cycle savings hypothesis as developed by Franco Modigliani, the savings ratio (aggregate personal saving divided by disposable income) is explained by per-capita disposable income, the percentage rate of change in per-capita disposable income, and two demographic variables: the percentage of population less than 15 years old and the percentage of the population over 75 years old. The data are averaged over the decade 1960-1970 to remove the business cycle or other short-term fluctuations. In this case we might fit the model where y is the saving ratio and x2, x3, x4 and x5 are the variables pop15, pop75,dpi and ddpi. Looking the data we may expect a negative value for β2 and a positive value for β5 while the relationship between the saving ratio and the variables pop75 and dpi is not clear. The X matrix has dimension 50 × 5 and is SUMMARY ITINERARY for MULTIPLE LINEAR REGRESSION 1. Model specification and assumptions 2. Check for multicollinearity (new entry) 3. Estimate the model parameters 2 2 4. Diagnostics (𝑅 or adjusted 𝑅 , t-test, test for homoskedasticity) 5. Interpretation HOMOSCEDASTICITY = equal variance The assumption that the variance of the dependent variable is the same for all the data. PP.4 “ GENERALIZED LINEAR MODELS” GENERALIZED LINEAR MODELS The linear model is often adequate to describe the relation between a set of explanatory variables x1,... , xp and the response y, assuming this is quantitative. There are cases, however, where the linear model is not a good solution. Imagine that your variable of interest is the presence (or absence) of a disease as a function of, for example, age. The only possible values for Y are 1 or 0 (presence or absence). Even if we think of a number as the probability of having such a disease, any number outside the interval (0,1) does not make sense. The linear model will produce predictions that are not constrained to be 0 or 1 and not even in the interval (0,1). The same will happen if you want to model the number of customers entering a shop as a function of the hour. Predicted customers should be a positive number (more, an integer) and the linear model does not ensure this will happen. So in order to build a model to predict a non quantitative Y for any given value of X1 and X2, the simple linear regression model is not a good choice. THE TECHNICAL PROBLEM The problem is that we defined the linear model as a model for observations instead of parameters. The general formulation is the same as where the mean is equal to the linear predictor. Unfortunately, the relation observation = mean + random noise does not apply if the data are not symmetric with unbounded range of variation. We could rephrase the linear model as We are now directly modeling the parameter of the distribution. To summarize, there are at least two reasons not to perform classification using a regression method: a. 
a regression method cannot accommodate a qualitative response with more than two classes; b. a regression method will not provide meaningful estimates of Pr(Y |X), even with just two classes. Thus, it is preferable to use a classification method that is truly suited for qualitative response values. Every example by now refers to x variable that is continuous; but in a model you can include variables that are categorical (gender, geographical region…) and we are interested in the effect of that variable on the response. X is not numerical so x = {} If x is continuous you will have a different outcome for y, because everytime x will be different; but since here x is either 1 or 0 you will have just 2 possible predicted outcomes. Therefore, ★ a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. GLMs overcome some of the limits of the linear model, namely - implicit (or not) gaussian assumption, - Homoscedasticity. Hence, there are cases in which for example the response variable is instead qualitative aka categorical. The process studying the approaches for predicting qualitative responses is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods. One of the many possible classification techniques (classifiers) to use in order to predict a qualitative response is the LOGISTIC REGRESSION (well-suited for the case of a binary qualitative response). Classification problems occur often, perhaps even more so than regression problems. Some examples include: 1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have? 2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth. 3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not. Just as in the regression setting, in the classification setting we have a set of training observations (x1, y1),...,(xn, yn) that we can use to build a classifier. We could consider encoding these values as a quantitative response variable, Y , as follows: Each of these codings would produce fundamentally different line ar models that would ultimately lead to different sets of predictions on test observations. So the last formulation is amenable of generalizations such as where f(·) is a nonlinear function. 
Its role is changing the outcome to a different space interval (in order to deal with other data structures such as presence/absence and positive data) However, the usual GLM formulation is so g is the link function and inverse function of f: g(·) = f −1 (·) Depending on what you choose g to be, the different choices would lead to different models: ❖ g(x) = log(x), log-linear regression ❖ g(x) = logit(x) = log( x / 1−x ), logit regression These functions are chosen to ”force” predictions to be in some interval INVERSE FUNCTIONS ❖ ❖ ❖ The function exp(θi) constraints the linear predictor to be positive. ❖ The function constraints the linear predictor to lie in the interval (0, 1). ❖ 𝑔(𝑥) = 𝑙𝑜𝑔(𝑥). ❖ 𝑓 is the inverse function, and since the inverse function of the log is the exponential, 𝑓 will be: 𝑓(𝑥) = 𝑒𝑥𝑝 (𝑥) = positive number as the outcome for modeling data LINK WITH RANDOM VARIABLES A parameter only exists if you specify a random variable, that would establish a link between the r.v. and a nonlinear function g(·) ❖ Presence/absence data are treated as Bernoulli random variables, with parameter pi. In this case is an appropriate transform. ❖ Positive counts data are treated as Poisson random variables, with parameter λi. In this case is an appropriate transform. ❖ In the end, it is the parameter (or a nonlinear function of it) that is modeled in a linear way. THE VARIANCE The expected value 𝐸 is equal to the mean 𝐸𝑦𝑖 =μ𝑖 So, μ = (β0 + β1xi) 2 ❖ 𝑌𝑖~𝑁(μ𝑖, σ ) 2 The 2 parameters μ and σ have no restrictions for the Gaussian/normal distribution, so we don’t need 𝑓= identity function which has the same values. In this way, 𝑔(𝑥) = 𝑥 (GLM) So the mean depends on the first parameter = first parameter ❖ 𝐸(𝑦𝑖) =μ𝑖 The variance is linked only to the second parameter 2 ❖ 𝑉(𝑦𝑖) = σ ⇒ This is only true for the Gaussian random variables. ⇒Since with GLMs we move from Gaussian assumptions to other random variables, what’s the variance of our data? ❖ Logit (Bernoulli) model: Yi~Bernoulli(pi) 𝐸(𝑦𝑖) = 𝑝𝑖 MEAN=PARAMETER 𝑉(𝑦𝑖) = 𝑝𝑖(1 − 𝑝𝑖) the VARIANCE is not related to a different parameter in this case. It will depend on the MEAN, so no longer constant because it depends on the observation i. ❖ Log-linear (Poisson) model: Yi~Poisson (λi) 𝐸(𝑦𝑖) = λ𝑖 MEAN=PARAMETER 𝑉(𝑦𝑖)= λ𝑖 VARIANCE=PARAMETER Since the residuals are on average 0, depending on the length of the line, the variance of the error can be smaller or larger (if the variance depends on the observation we can remove the constant variance assumption of the linear model). GLM MODEL CONSTRUCTION In order to define a GLM, we have to: 1. specify distribution for the dependent variable y; 2. specify a link function g(·); 3. specify a linear predictor; 4. a model for the variance of the outcome (usually) automatically follows, hence heteroscedasticity. LOGIT REGRESSION to model between (0,1) when the outcome is a probability POISSON REGRESSION for counts (integers/positive numbers) ESTIMATION PROCEDURE The estimation procedure is based on maximum likelihood: the likelihood function L(θ, y) is maximized. So in a linear model, if you use estimation or least square, the outcome is the same. Instead here we want to maximize a nonlinear, since for most GLMs the likelihood equations are nonlinear functions of β: we need an iterative method to solve nonlinear equations and determine the maximum of a likelihood function. Two main (similar) algorithms are used for this purpose: 1. Newton-Raphson 2. 
Fisher scoring
DEVIANCE
The deviance is used to test the significance of the model, or the superiority of one model with respect to another. In a linear model with 3 variables, Yi = β0 + β1x1i + β2x2i + β3x3i + ei, when computing the p-values we remove any variable that is not significant based on the t-test. In GLMs the p-value would rely on a t-test that in turn relies on Gaussianity, so only an approximate t-test is available; for this reason it is preferable to use the deviance (comparing two models based on the likelihood). Essentially, the deviance is the likelihood-ratio statistic for testing the null hypothesis that the model M0 holds against the alternative that a more general model M1 holds. If we denote by θˆ0 the vector of estimated parameters under model M0 and by θˆ1 the vector of estimated parameters under model M1, the deviance can be computed as D = 2{ℓ(θˆ1) − ℓ(θˆ0)} = −2 log[L(θˆ0)/L(θˆ1)], where ℓ denotes the log-likelihood.
LOGISTIC REGRESSION (aka GLM with a binomial random variable)
★ Categorical binary response variable
★ Outcome: success or failure (1 and 0)
★ The parameter is pi and it represents the probability of success (as when tossing a coin)
Observation i has yi ∼ Bernoulli(pi), with E(yi) = pi and V(yi) = pi(1 − pi). If you consider a data set where the response falls into one of two categories, Yes or No, rather than modeling this response Y directly, logistic regression models the probability that Y belongs to a particular category. If you want to model the outcome, the link function for pi is the LOGIT: GLMs using the logit link function are called logistic regression models and have the form logit(p(X)) = log[p(X)/(1 − p(X))] = β0 + β1X1 + ··· + βpXp, where X = (X1,...,Xp) are p predictors. This formula can also be rewritten as p(X) = exp(β0 + β1X1 + ··· + βpXp) / [1 + exp(β0 + β1X1 + ··· + βpXp)].
INTERPRETATION of PARAMETERS
A transformation of p depends linearly on the explanatory variable. To simplify notation, we focus on the case of a single quantitative explanatory variable x. The model is logit(pi) = β0 + β1xi, for which pi = exp(β0 + β1xi)/(1 + exp(β0 + β1xi)). The inverse of the logit is the exponential, which loses linearity but indirectly gives the probability of a success. The curve for P(y = 1) is monotone in x, meaning always increasing or always decreasing in x. We have 2 (3) possible cases depending on β1 (which plays the role of the slope in the linear model): 1. If β1 > 0 (positive), increasing the value of x increases the probability of success P(y = 1); 2. If β1 < 0 (negative), increasing the value of x decreases the probability of success P(y = 1); 3. If β1 = 0, x has no effect on P(y = 1). The interpretation for the Poisson (log-linear) model is analogous, on the scale of the expected count: if β1 > 0 you have an increasing relation (if you increase x you will have a higher average number of events λ), and if β1 < 0 increasing x lowers the expected number of events.
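To connect this to R: glm() fits both models. The data below are simulated just to show the calls; the variable names and coefficient values are invented for illustration.

set.seed(7)
n <- 300
x <- rnorm(n)

# logistic regression: binary outcome, logit link
p  <- exp(-0.5 + 1.2 * x) / (1 + exp(-0.5 + 1.2 * x))
yb <- rbinom(n, size = 1, prob = p)
fit_logit <- glm(yb ~ x, family = binomial(link = "logit"))
summary(fit_logit)                        # deviance-based output, approximate z-tests
predict(fit_logit, newdata = data.frame(x = 1), type = "response")  # P(y = 1 | x = 1)

# Poisson (log-linear) regression: counts, log link
lambda <- exp(0.3 + 0.4 * x)
yc <- rpois(n, lambda)
fit_pois <- glm(yc ~ x, family = poisson(link = "log"))
coef(fit_pois)                            # beta1 > 0 here, so E(y) increases with x

# deviance comparison of nested models (M0: intercept only vs M1: with x)
fit0 <- glm(yb ~ 1, family = binomial)
anova(fit0, fit_logit, test = "Chisq")    # likelihood-ratio (deviance) test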
