Lecture notes for MA40198 (Applied Statistical Inference)
Karim Anaya-Izquierdo (based on notes by Simon N. Wood)
2025-11-10

Table of contents

1 MA40198: Applied Statistical Inference

Overview of Applied Statistical Inference
  Objective
  Learning outcomes
  Summative assessment
  Moodle page

2 Optimisation in Statistics
  2.1 Regression analysis
    2.1.1 Linear location regression
    2.1.2 Generalised linear mean regression
    2.1.3 Likelihood function
    2.1.4 Maximum likelihood estimation
  2.2 Unconstrained optimisation theory
    2.2.1 Global and local minima
    2.2.2 Conditions for local minima
    2.2.3 Descent directions
  2.3 Optimisation algorithms
    2.3.1 Line-search algorithms
    2.3.2 Step-length selection
    2.3.3 Stopping criterion
  2.4 Raw Newton's algorithm
  2.5 Fisher's scoring algorithm
  2.6 Quasi-Newton algorithms
    2.6.1 BFGS algorithm

3 Likelihood Theory
  3.1 Large sample properties of the MLE
  3.2 Likelihood as a random variable
  3.3 Estimators of the asymptotic variance
  3.4 Reparametrisations
  3.5 Delta Method
  3.6 Generalised likelihood ratio test (GLRT)
    3.6.1 Asymptotic testing based on the GLRT
    3.6.2 Example: Normal distribution
  3.7 Model selection
    3.7.1 Incorrectly specified models
    3.7.2 Information criteria
    3.7.3 Practical model selection

4 Bayesian Inference
  4.1 Example: Bernoulli distribution
  4.2 Example: Poisson distribution
  4.3 Bayesian statistical inference
  4.4 Example: Bernoulli distribution (continued)
  4.5 Example: Breaking strength of fibres

Appendices

A Prerequisites
  A.1 Numerical
    A.1.1 Vectors and matrices in R
  A.2 Linear Algebra
  A.3 Vector calculus
  A.4 Convex functions
    A.4.1 Examples of convex functions
  A.5 Random Vectors
    A.5.1 Transformation of multivariate random vectors
    A.5.2 Multivariate Normal distribution
  A.6 Inequalities
  A.7 The Monte Carlo algorithm for approximating sampling distributions
    A.7.1 Example

B Optimisation and differentiation in R
  B.1 Default optimisation in R
  B.2 Automatic differentiation in R

Chapter 1
MA40198: Applied Statistical Inference
Overview of Applied Statistical Inference

Objective

To provide students with an introduction to some of the key quantitative methods available for making statistical inferences about non-standard and non-linear models from data, in order to make inferences and predictions about the system that the data and model relate to.

Learning outcomes

By the end of the course students should be able to:

- take a simple non-standard and non-linear model of a system, together with appropriate data, and write down the likelihood for a sensibly parametrised version of the model; maximise this likelihood, or use it as part of a Bayesian analysis, with R
- compare alternative models appropriately, find approximate confidence intervals for model parameters and check models critically
- handle simple stochastic model variants via approximate likelihood-based methods, or stochastic simulation.

Summative assessment

- Coursework: 40% of unit mark, group electronic submission.
- Exam: 60% of unit mark.

Moodle page

Please see the Moodle page for this unit for a more detailed overview of the organisation and expectations for Applied Statistical Inference this year.

Chapter 2
Optimisation in Statistics

library(kableExtra)
library(magick)
library(pdftools)
library(tidyverse)

Many methods in Statistics can be posed as optimisation problems where the objective is to find the value of the vector 𝜽 that maximises or minimises some objective function 𝜓(𝜽). The entries in the vector 𝜽 typically represent unknown parameters in a statistical model. One of the most important such methods is maximum likelihood estimation, which is described in this chapter in the context of regression analysis.

2.1 Regression analysis

Regression analysis is, broadly, the study of relationships between random variables. One of the variables, called the response, outcome or dependent variable, is singled out due to its importance, and any other variables that are used to explain or predict the random behaviour of the response variable are called explanatory, covariate, regressor, predictor or independent variables.

- 𝒴 denotes the real random variable describing the random behaviour of the response variable in the population of interest.
- 𝓧 denotes the random vector in IR^p that describes the joint random behaviour of the 𝑝 explanatory variables in the population of interest. We have:

$$\boldsymbol{\mathcal{X}} = \begin{pmatrix} \mathcal{X}_1 \\ \mathcal{X}_2 \\ \vdots \\ \mathcal{X}_p \end{pmatrix}$$

- An observation from the conditional 𝒴|𝓧 = 𝐱 will be denoted by (𝐱, 𝑦). This is not to be confused with an observation of the joint random variable (𝓧, 𝒴). In this course we will almost always use the conditional rather than the joint distribution, hence the preference in notation.
- The conditional random variable 𝒴|𝓧 = 𝐱 describes the random behaviour of the response variable in an individual whose measurements of the explanatory variables are equal to 𝐱. It therefore describes the random behaviour in a sub-population of the population of interest.
- The probability density function (if 𝒴 is continuous) or probability mass function (if 𝒴 is discrete) corresponding to 𝒴|𝓧 = 𝐱 will be denoted by 𝑓∗(𝑦|𝐱). We will use the notation

$$\mathcal{Y}\mid \boldsymbol{\mathcal{X}}=\mathbf{x} \;\sim\; f_*(y\mid\mathbf{x}) \tag{2.1}$$

For brevity, we will usually call 𝑓∗(𝑦|𝐱) simply a density function, irrespective of the underlying random variable being discrete or continuous.
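To fix ideas, here is a small R sketch of what a conditional density 𝑓∗(𝑦|𝐱) looks like when, purely for illustration, we assume a Normal conditional model with mean 𝜃1 + 𝜃2 𝑥 and a fixed standard deviation; all parameter and data values below are made up and not taken from the notes.

# illustrative conditional density f(y | x) under an assumed Normal model
theta <- c(1, 2)      # hypothetical parameter values
sigma <- 0.5          # hypothetical conditional standard deviation

f_cond <- function(y, x) dnorm(y, mean = theta[1] + theta[2] * x, sd = sigma)

f_cond(y = 3, x = 1)  # density of the response at y = 3 in the sub-population with x = 1
f_cond(y = 3, x = 0)  # the same y value is far less plausible in the sub-population with x = 0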
In regression, the relationship between the response and the explanatory variables is explained via the conditional density (2.1). If we knew this density function, we could answer any regression question. Unfortunately, it is almost always unknown.

We will assume we have a sample of size 𝑛 from the population of interest. Then:

- 𝒴1, 𝒴2, …, 𝒴n denote the response random variables for the 𝑛 individuals in the sample. The actual observed values will be denoted by 𝑦1, …, 𝑦n.
- 𝐱1, 𝐱2, …, 𝐱n denote the values of the 𝑝 explanatory variables corresponding to the 𝑛 individuals in the sample.

Regression analysis

The main objective in regression analysis is to estimate the unknown conditional density 𝑓∗(𝑦|𝐱) in (2.1) using data of the form

$$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n) \tag{2.2}$$

that is, response variable measurements on 𝑛 individuals from the population of interest whose explanatory variable measurements have been fixed or controlled at 𝐱1, …, 𝐱n.

It is generally difficult to estimate the whole conditional distribution unless we have a substantial amount of data. In real applications, interest is focused on key summaries of the conditional distribution such as the mean, median or other measures of the location of the distribution. In what follows we focus on the location, but later we indicate how to deal with other summaries such as dispersion or quantiles.

2.1.1 Linear location regression

In linear location regression we assume the response variable can be written directly in terms of the explanatory variables in a linear fashion as follows:

$$\mathcal{Y}_i = \mathbf{x}_i^T\boldsymbol{\theta}^*_{(1)} + \mathcal{E}_i \tag{2.3}$$

for 𝑖 = 1, 2, …, 𝑛.

- The vector of parameters $\boldsymbol{\theta}^*_{(1)} = (\theta^*_1, \theta^*_2, \ldots, \theta^*_p)^T$ is unknown and is to be estimated from the data.
- ℰ1, …, ℰn are independent and identically distributed random variables that are symmetric around zero, that is, ℰi has the same distribution as −ℰi. The random variables ℰi are called the residuals.
- The common distribution of the residuals may depend on an unknown vector of parameters $\boldsymbol{\theta}^*_{(2)} = (\theta^*_{p+1}, \theta^*_{p+2}, \ldots, \theta^*_{p+m})^T$, also to be estimated from the data.
- The first objective is to estimate the unknown joint parameter

$$\boldsymbol{\theta}^* = \begin{pmatrix} \boldsymbol{\theta}^*_{(1)} \\ \boldsymbol{\theta}^*_{(2)} \end{pmatrix}$$

Symmetry of the residuals implies that 𝐱iᵀ𝜽*(1) is the median of the conditional distribution and, if the conditional expectation exists, also that 𝐸[𝒴i | 𝓧i = 𝐱i] = 𝐱iᵀ𝜽*(1).

Example 2.1 (Location-scale linear regression). A continuous random variable 𝒴 follows a location-scale model with location parameter 𝜇 ∈ IR and scale parameter 𝜈 > 0 if its probability density function can be written as

$$f_\phi(y\mid\mu,\nu) = \frac{1}{\nu}\,\phi\!\left(\frac{y-\mu}{\nu}\right), \qquad y \in \mathrm{IR}$$

for some continuous density function 𝜙 over the real line. If 𝒴 ∼ 𝜙(𝜇, 𝜈) then we can write

$$\mathcal{Y} = \mu + \nu\,\mathcal{Z}$$

where 𝒵 ∼ 𝜙(0, 1), that is, 𝒵 has density function 𝜙(𝑧). Obviously 𝜇 controls the location of the distribution of 𝒴, and 𝜈 controls the variance in the sense that

$$\mathrm{Var}[\mathcal{Y}] = \nu^2\,\mathrm{Var}[\mathcal{Z}]$$

Specific instances are given below:

- 𝒵 ∼ 𝑁(0, 1), that is, 𝒵 follows a standard normal distribution with density function

$$\phi_{\mathrm{norm}}(z) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{z^2}{2}\right)$$

In the corresponding location-scale model, 𝜇 is both the mean and the median.

- 𝒵 follows a standard Logistic distribution with density function

$$\phi_{\mathrm{logis}}(z) = \frac{\exp(-z)}{\bigl(1+\exp(-z)\bigr)^{2}}$$

Then, in the corresponding location-scale model, 𝜇 is again both the mean and the median.
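The two standard densities just mentioned are available in R as dnorm() and dlogis(). The short sketch below simulates from the corresponding location-scale models 𝒴 = 𝜇 + 𝜈𝒵 and checks, for made-up values of 𝜇 and 𝜈, that 𝜇 is (approximately) both the mean and the median in each case; the values and sample size are illustrative only.

set.seed(42)
mu <- 3; nu <- 2                      # illustrative location and scale

z_norm  <- rnorm(1e5)                 # Z ~ N(0, 1)
z_logis <- rlogis(1e5)                # Z ~ standard Logistic
y_norm  <- mu + nu * z_norm           # Y = mu + nu Z, Normal location-scale model
y_logis <- mu + nu * z_logis          # Y = mu + nu Z, Logistic location-scale model

c(mean(y_norm),  median(y_norm))      # both approximately mu = 3
c(mean(y_logis), median(y_logis))     # both approximately mu = 3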
Given a set of explanatory variables, assume the unknown conditional density function 𝑓∗(𝑦|𝐱) is of the form

$$f_*(y\mid\mathbf{x}) = f_\phi\bigl(y\mid \mathbf{x}^T\boldsymbol{\theta}^*_{(1)},\,\theta^*_{p+1}\bigr) = \frac{1}{\theta^*_{p+1}}\,\phi\!\left(\frac{y-\mathbf{x}^T\boldsymbol{\theta}^*_{(1)}}{\theta^*_{p+1}}\right)$$

for some known density 𝜙(𝑧) and some unknown 𝜽*(1) ∈ IR^p and 𝜃*_{p+1} > 0. Equivalently, we are assuming that

$$f_*(y\mid\mathbf{x}) \in \mathcal{F}^{\phi}_{\mathrm{loc.sca}} := \left\{ f_\phi\bigl(y\mid \mathbf{x}^T\boldsymbol{\theta}_{(1)},\,\theta_{p+1}\bigr) : \boldsymbol{\theta}_{(1)} \in \mathrm{IR}^p,\ \theta_{p+1} > 0 \right\}$$

for some known density 𝜙(𝑧) which is symmetric over the real line.

Example 2.2 (Student's t linear regression). A continuous random variable 𝒵 follows a Student's t distribution with degrees of freedom 𝛿 > 0 if its probability density function can be written as

$$\phi_{\mathrm{t}}(z\mid\delta) = \frac{\Gamma\!\left(\frac{\delta+1}{2}\right)}{\sqrt{\pi\delta}\;\Gamma\!\left(\frac{\delta}{2}\right)} \left(1+\frac{z^2}{\delta}\right)^{-\frac{\delta+1}{2}}, \qquad z \in \mathrm{IR}$$

If 𝒵 follows this distribution, then the distribution of

$$\mathcal{Y} = \mu + \nu\,\mathcal{Z}$$

is said to be a scaled t distribution with location 𝜇 ∈ IR, scale 𝜈 > 0 and degrees of freedom 𝛿 > 0. We will use the notation 𝒴 ∼ t(𝜇, 𝜈, 𝛿). We will also restrict 𝛿 > 2 in order to guarantee that the variance is finite.

Given a set of explanatory variables, we assume the unknown conditional density function 𝑓∗(𝑦|𝐱) is of the form

$$f_*(y\mid\mathbf{x}) = \frac{1}{\theta^*_{p+1}}\,\phi_{\mathrm{t}}\!\left(\frac{y-\mathbf{x}^T\boldsymbol{\theta}^*_{(1)}}{\theta^*_{p+1}} \;\middle|\; \theta^*_{p+2}\right) = \frac{\Gamma\!\left(\frac{\theta^*_{p+2}+1}{2}\right)}{\theta^*_{p+1}\sqrt{\pi\,\theta^*_{p+2}}\;\Gamma\!\left(\frac{\theta^*_{p+2}}{2}\right)} \left(1+\frac{1}{\theta^*_{p+2}}\left(\frac{y-\mathbf{x}^T\boldsymbol{\theta}^*_{(1)}}{\theta^*_{p+1}}\right)^{\!2}\right)^{-\frac{\theta^*_{p+2}+1}{2}}$$

for some unknown 𝜽*(1) ∈ IR^p, 𝜃*_{p+1} > 0 and 𝜃*_{p+2} > 2. Equivalently, we are assuming that

$$f_*(y\mid\mathbf{x}) \in \mathcal{F}_{\mathrm{t}} := \left\{ \frac{1}{\theta_{p+1}}\,\phi_{\mathrm{t}}\!\left(\frac{y-\mathbf{x}^T\boldsymbol{\theta}_{(1)}}{\theta_{p+1}} \;\middle|\; \theta_{p+2}\right) : \boldsymbol{\theta}_{(1)} \in \mathrm{IR}^p,\ \theta_{p+1} > 0,\ \theta_{p+2} > 2 \right\}$$

The linear form (2.3) implies that we can accommodate all the random variables corresponding to a sample of 𝑛 individuals in matrix form as follows:

$$\boldsymbol{\mathcal{Y}} = \mathbf{X}\boldsymbol{\theta}^*_{(1)} + \boldsymbol{\mathcal{E}} \tag{2.4}$$

where

$$\boldsymbol{\mathcal{Y}} = \begin{pmatrix} \mathcal{Y}_1 \\ \mathcal{Y}_2 \\ \vdots \\ \mathcal{Y}_n \end{pmatrix} \qquad \mathbf{X} = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} \qquad \boldsymbol{\mathcal{E}} = \begin{pmatrix} \mathcal{E}_1 \\ \mathcal{E}_2 \\ \vdots \\ \mathcal{E}_n \end{pmatrix}$$

That is:

- 𝓨 is the random vector of response variables of the 𝑛 individuals in the sample. An observation from 𝓨 will be denoted by 𝐲 = (𝑦1, 𝑦2, …, 𝑦n)ᵀ, where 𝑦1, …, 𝑦n are the actual observed values of the response variable for the 𝑛 individuals in the sample.
- 𝐗 is the 𝑛 × 𝑝 matrix whose rows are the vectors of explanatory variable values corresponding to the 𝑛 individuals in the sample; it is called the model matrix (also called the design matrix) of the linear model (2.4). Unless otherwise stated, we assume that 𝑛 > 𝑝 and that the columns of 𝐗 are linearly independent vectors in IR^n, so that 𝐗 has rank 𝑝.
- 𝓔 is the random vector of the 𝑛 residuals.

You can find details about the statistical properties of the linear model (2.4), as well as worked examples, in the MA20227 Statistics 2B Lecture Notes, especially for the case where the normality assumption is adopted.

When the distribution of the response is assumed to be Normal (that is, 𝒵 ∼ 𝑁(0, 1) in the location-scale model of Example 2.1), it turns out that the majority of key quantities (estimators, variances, etc.) are simple closed-form expressions involving 𝐲 and 𝐗. When the distribution of the response is assumed to be Logistic, for example, such key quantities no longer have closed-form expressions in terms of 𝐲 and 𝐗 and can only be computed numerically and iteratively, rather than by simple evaluation. In this course you will learn how to perform such computations, as well as some of the theory behind them.
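As a concrete illustration of the matrix form (2.4), here is a minimal R sketch with made-up covariate and parameter values: it builds a small model matrix 𝐗 with an intercept column, simulates responses with Normal residuals, and uses lm() to compute the closed-form estimate mentioned above. With Logistic or scaled-t residuals no such closed form is available, and the fit must instead be computed iteratively, as developed later in the chapter.

set.seed(1)
n  <- 50
x1 <- runif(n)                    # one explanatory variable (made-up values)
X  <- cbind(1, x1)                # n x p model matrix with an intercept column (p = 2)
theta1 <- c(2, -1)                # hypothetical theta*_(1): intercept and slope
nu     <- 0.5                     # hypothetical residual scale

E <- nu * rnorm(n)                # Normal residuals: E_i = nu * Z_i with Z_i ~ N(0, 1)
y <- as.vector(X %*% theta1 + E)  # responses generated from the linear model (2.4)

coef(lm(y ~ x1))                  # closed-form estimate of theta*_(1) under normality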
2.1.2 Generalised linear mean regression

In mean regression, the main objective is to estimate the unknown conditional mean function. In the continuous case, this is given by

$$E_*[\mathcal{Y}\mid\boldsymbol{\mathcal{X}}=\mathbf{x}] := \int y\, f_*(y\mid\mathbf{x})\, dy$$

When the conditional mean function depends on a vector of unknown parameters we will denote it by 𝜇(𝜽*, 𝐱), that is,

$$E_*[\mathcal{Y}\mid\boldsymbol{\mathcal{X}}=\mathbf{x}] = \mu(\boldsymbol{\theta}^*, \mathbf{x})$$

to emphasise its dependence on both the unknown parameter 𝜽* and the covariates 𝐱.

In generalised linear mean regression, the response random variables in the sample 𝒴1, …, 𝒴n are assumed to be independent. Furthermore, the transformed conditional mean function is assumed to be a linear function of the explanatory variables, that is,

$$h\bigl(\mu(\boldsymbol{\theta}^*_{(1)}, \mathbf{x}_i)\bigr) = \mathbf{x}_i^T\boldsymbol{\theta}^*_{(1)}$$

for 𝑖 = 1, …, 𝑛, where ℎ : IR → IR is a known smooth function and 𝜽*(1) ∈ IR^p is a vector of unknown parameters to be estimated from the data. The linear term 𝐱ᵀ𝜽*(1) is called the linear predictor. The function ℎ is called the link function, as it links the conditional mean with the linear predictor.

The conditional mean alone does not uniquely determine the conditional probability distribution: two very different conditional distributions can share the same conditional mean. We therefore need to specify other characteristics of the conditional distribution, apart from the mean, in order to characterise it uniquely. For example, the Gaussian (Normal) distribution is uniquely characterised when we specify both its mean and variance.

The unknown conditional density 𝑓∗(𝑦|𝐱) is assumed to satisfy

$$f_*(y\mid\mathbf{x}) \in \mathcal{F} = \left\{ f\bigl(y\mid \mu(\boldsymbol{\theta}_{(1)}, \mathbf{x}),\, \boldsymbol{\theta}_{(2)}\bigr) : \boldsymbol{\theta} = \begin{pmatrix}\boldsymbol{\theta}_{(1)}\\ \boldsymbol{\theta}_{(2)}\end{pmatrix} \in \boldsymbol{\Theta} \subseteq \mathrm{IR}^{p+m} \right\}$$

where 𝑓(𝑦|𝜇, 𝜽(2)), 𝜽(2) ∈ IR^m, is a density function parametrised by 𝑚 + 1 parameters, one of which is 𝜇, the corresponding expectation. That is,

$$E[\mathcal{Y}\mid \mu, \boldsymbol{\theta}_{(2)}] = \mu \qquad \forall\, \boldsymbol{\theta}_{(2)} \in \mathrm{IR}^m.$$

The first objective is to estimate the unknown joint parameter

$$\boldsymbol{\theta}^* = \begin{pmatrix} \boldsymbol{\theta}^*_{(1)} \\ \boldsymbol{\theta}^*_{(2)} \end{pmatrix}$$

For example, when the response variable can only take positive values, one can assume that the conditional mean has the form

$$\mu(\boldsymbol{\theta}^*, \mathbf{x}) = \exp(\mathbf{x}^T\boldsymbol{\theta}^*) \tag{2.5}$$

since the exponential function forces the conditional mean to be positive. In other situations the mean of the response variable can only take values in the unit interval (0, 1), e.g. a proportion; it is then common to assume the conditional mean has the form

$$\mu(\boldsymbol{\theta}^*, \mathbf{x}) = \frac{\exp(\mathbf{x}^T\boldsymbol{\theta}^*)}{1+\exp(\mathbf{x}^T\boldsymbol{\theta}^*)} \tag{2.6}$$

which maps the real line onto the unit interval.

Note we can trivially rewrite (2.5) and (2.6) as

$$\log\bigl(\mu(\boldsymbol{\theta}^*, \mathbf{x})\bigr) = \mathbf{x}^T\boldsymbol{\theta}^* \qquad \text{and} \qquad \mathrm{logit}\bigl(\mu(\boldsymbol{\theta}^*, \mathbf{x})\bigr) = \mathbf{x}^T\boldsymbol{\theta}^*$$

where logit is the so-called logit function defined as

$$\mathrm{logit}(u) := \log\!\left(\frac{u}{1-u}\right), \qquad u \in (0, 1)$$

This is the natural logarithm of the odds of an event with probability 𝑢 of occurring. In both cases, the transformed conditional mean function is assumed to be a linear function of the explanatory variables.
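A small R sketch of the two mean functions (2.5) and (2.6): exp() keeps the mean positive, while the inverse logit (available in R as plogis(), with qlogis() being the logit itself) keeps it in (0, 1). The parameter and covariate values below are made up for illustration.

theta <- c(0.5, 1.2)           # illustrative parameter values theta*
x     <- c(1, -0.3)            # intercept and one covariate value
eta   <- sum(x * theta)        # linear predictor x^T theta

exp(eta)                       # log link:   mu = exp(x^T theta) > 0, as in (2.5)
exp(eta) / (1 + exp(eta))      # logit link: mu lies in (0, 1), as in (2.6)
plogis(eta)                    # the same value, via the inverse logit function
qlogis(plogis(eta))            # logit(mu) recovers the linear predictor eta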
Example 2.3 (Exponential family generalised linear models). A random variable 𝒴 (continuous or discrete) follows an exponential family of distributions with canonical (a.k.a. natural) parameter 𝜂 and dispersion parameter 𝛿 > 0 if its density function can be written as

$$f_{\text{EF-GLM}}(y\mid\eta,\delta) = \exp\!\left(\frac{\eta\, y - \Psi(\eta)}{\delta} + c(y, \delta)\right)$$

for some known functions 𝑐(𝑦, 𝛿) and Ψ(𝜂). The function Ψ is called the cumulant function and determines the mean and variance in the sense that

$$E[\mathcal{Y}] = \Psi'(\eta) = \mu \qquad V[\mathcal{Y}] = \delta\,\Psi''(\eta)$$

We assume that 𝜂 = 𝑔(𝜇), for some function 𝑔, is the unique solution to Ψ′(𝜂) = 𝜇 for a given 𝜇.

Specific instances are given below:

- The Normal (Gaussian) distribution with mean 𝜇 and variance 𝛿:
  - canonical parameter 𝜂 = 𝜇
  - dispersion 𝛿
  - 𝑐(𝑦, 𝛿) = −𝑦²/(2𝛿) − log(2𝜋𝛿)/2
  - cumulant function Ψ(𝜂) = 𝜂²/2
- The Poisson distribution with mean 𝜇:
  - canonical parameter 𝜂 = log(𝜇)
  - dispersion 𝛿 = 1
  - 𝑐(𝑦, 𝛿) = −log(𝑦!)
  - cumulant function Ψ(𝜂) = exp(𝜂)
- The Exponential distribution with mean 𝜇:
  - canonical parameter 𝜂 = −1/𝜇
  - dispersion 𝛿 = 1
  - 𝑐(𝑦, 𝛿) = 0
  - cumulant function Ψ(𝜂) = −log(−𝜂)
- The Binomial distribution with mean 𝜇 and index parameter 𝑁:
  - canonical parameter 𝜂 = logit(𝜇/𝑁)
  - dispersion 𝛿 = 1
  - $c(y, \delta) = \log\binom{N}{y} - N\log(N)$
  - cumulant function $\Psi(\eta) = -N\log\bigl(N(1+\exp(\eta))^{-1}\bigr)$

Given a set of explanatory variables, we assume that the unknown conditional density function 𝑓∗(𝑦|𝐱) is of the form

$$f_*(y\mid\mathbf{x}) = f_{\text{EF-GLM}}\bigl(y \mid g(h^{-1}(\mathbf{x}^T\boldsymbol{\theta}^*_{(1)})),\, \theta^*_{p+1}\bigr) = \exp\!\left(\frac{g(h^{-1}(\mathbf{x}^T\boldsymbol{\theta}^*_{(1)}))\, y - \Psi\bigl(g(h^{-1}(\mathbf{x}^T\boldsymbol{\theta}^*_{(1)}))\bigr)}{\theta^*_{p+1}} + c(y, \theta^*_{p+1})\right)$$

for some known link function ℎ such that

$$h\bigl(\mu(\boldsymbol{\theta}^*_{(1)}, \mathbf{x})\bigr) = \mathbf{x}^T\boldsymbol{\theta}^*_{(1)}$$

and some unknown 𝜽*(1) ∈ IR^p and 𝜃*_{p+1} > 0. Equivalently, we are assuming that

$$f_*(y\mid\mathbf{x}) \in \mathcal{F}^{h}_{\text{EF-GLM}} := \left\{ f_{\text{EF-GLM}}\bigl(y \mid g(h^{-1}(\mathbf{x}^T\boldsymbol{\theta}_{(1)})),\, \theta_{p+1}\bigr) : \boldsymbol{\theta}_{(1)} \in \mathrm{IR}^p,\ \theta_{p+1} > 0 \right\}$$

for some link function ℎ. Note that if we choose ℎ = 𝑔 (the so-called canonical link) we have a simplification, namely

$$f_*(y\mid\mathbf{x}) = f_{\text{EF-GLM}}\bigl(y \mid \mathbf{x}^T\boldsymbol{\theta}^*_{(1)},\, \theta^*_{p+1}\bigr) = \exp\!\left(\frac{(\mathbf{x}^T\boldsymbol{\theta}^*_{(1)})\, y - \Psi(\mathbf{x}^T\boldsymbol{\theta}^*_{(1)})}{\theta^*_{p+1}} + c(y, \theta^*_{p+1})\right)$$

for some unknown 𝜽*(1) ∈ IR^p and 𝜃*_{p+1} > 0.

2.1.3 Likelihood function

Consider a sample (𝐱1, 𝑦1), …, (𝐱n, 𝑦n) of independent measurements from 𝑛 individuals in the population of interest, as described above. The corresponding unknown joint density function of the conditional random vector 𝓨 | 𝓧1 = 𝐱1, …, 𝓧n = 𝐱n is given by

$$f_*(\mathbf{y}\mid\mathbf{x}_1,\ldots,\mathbf{x}_n) = \prod_{i=1}^{n} f_*(y_i\mid\mathbf{x}_i). \tag{2.7}$$

If we assume that each 𝑓∗(𝑦i|𝐱i) belongs to the parametric family ℱ above, then the unknown joint density function 𝑓∗(𝐲|𝐱1, …, 𝐱n) is such that

$$f_*(\mathbf{y}\mid\mathbf{x}_1,\ldots,\mathbf{x}_n) \in \left\{ \prod_{i=1}^{n} f\bigl(y_i\mid \mu(\boldsymbol{\theta}_{(1)}, \mathbf{x}_i),\, \boldsymbol{\theta}_{(2)}\bigr) : \boldsymbol{\theta} = \begin{pmatrix}\boldsymbol{\theta}_{(1)}\\ \boldsymbol{\theta}_{(2)}\end{pmatrix} \in \boldsymbol{\Theta} \subseteq \mathrm{IR}^{p+m} \right\}$$

Likelihood function

Definition 2.1. Given independent samples (𝐱1, 𝑦1), …, (𝐱n, 𝑦n), where each (𝐱i, 𝑦i) is an observation from an unknown density 𝑓∗(𝑦|𝐱i), and assuming that

$$f_*(y\mid\mathbf{x}_i) \in \mathcal{F} = \left\{ f\bigl(y\mid \mu(\boldsymbol{\theta}_{(1)}, \mathbf{x}),\, \boldsymbol{\theta}_{(2)}\bigr) : \boldsymbol{\theta} = \begin{pmatrix}\boldsymbol{\theta}_{(1)}\\ \boldsymbol{\theta}_{(2)}\end{pmatrix} \in \boldsymbol{\Theta} \subseteq \mathrm{IR}^{p+m} \right\}$$

the likelihood function relative to the parametric family ℱ is defined by

$$L(\boldsymbol{\theta}\mid\mathbf{y}) := \prod_{i=1}^{n} f\bigl(y_i\mid \mu(\boldsymbol{\theta}_{(1)}, \mathbf{x}_i),\, \boldsymbol{\theta}_{(2)}\bigr)$$

𝚯 is the parameter space, that is, the set of allowed values of all the parameters.

For numerical and convenience reasons that will be explained later, we will usually work with the natural logarithm of the likelihood function, defined below:

$$\ell(\boldsymbol{\theta}\mid\mathbf{y}) := \log L(\boldsymbol{\theta}\mid\mathbf{y}) = \sum_{i=1}^{n} \log f\bigl(y_i\mid \mu(\boldsymbol{\theta}_{(1)}, \mathbf{x}_i),\, \boldsymbol{\theta}_{(2)}\bigr)$$

We will call ℓ(𝜽|𝐲) the loglikelihood function. For brevity, the dependence of the likelihood (or loglikelihood) on 𝐱1, …, 𝐱n has been dropped from the notation.

The likelihood function is simply the joint density function seen as a function of 𝜽 rather than as a function of the samples, since these are fixed.

In the discrete case, the likelihood function evaluated at a particular value 𝜽† ∈ 𝚯 (say) can be seen as a measure of agreement between the observed 𝐲 and 𝜽†. If 𝐿(𝜽†|𝐲) is close to one then the probability of observing 𝐲 is very high when 𝜽 = 𝜽†, so we say that 𝜽† agrees with the observed 𝐲. If 𝐿(𝜽†|𝐲) is close to zero then the probability of observing 𝐲 is very small when 𝜽 = 𝜽†, so we say that 𝜽† does not agree with the observed 𝐲.
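A tiny numerical illustration of this idea, with made-up Bernoulli data and candidate parameter values: for independent observations the likelihood is the product of the probabilities of the observed responses, so a parameter value that makes the observed data probable receives a larger likelihood.

y <- c(1, 1, 0, 1, 1)                                 # hypothetical Bernoulli observations

lik <- function(theta) prod(dbinom(y, size = 1, prob = theta))

lik(0.8)   # close to the sample mean: larger likelihood, this theta "agrees" with y
lik(0.1)   # far from the sample mean: likelihood near zero, "does not agree" with y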
In general, the likelihood function is a measure of agreement between data and parameter values. Given that the data are fixed, it is then reasonable to try to find the parameter values that maximise the agreement with the data. The corresponding optimisation problem is called maximum likelihood estimation.

2.1.4 Maximum likelihood estimation

Maximum likelihood estimation

Definition 2.2. Given independent samples 𝑦1|𝐱1, …, 𝑦n|𝐱n, the maximum likelihood estimate (MLE) relative to the parametric family

$$\mathcal{F} = \left\{ f\bigl(y\mid \mu(\boldsymbol{\theta}_{(1)}, \mathbf{x}),\, \boldsymbol{\theta}_{(2)}\bigr) : \boldsymbol{\theta} = \begin{pmatrix}\boldsymbol{\theta}_{(1)}\\ \boldsymbol{\theta}_{(2)}\end{pmatrix} \in \boldsymbol{\Theta} \subseteq \mathrm{IR}^{p+m} \right\}$$

is defined as follows:

$$\hat{\boldsymbol{\theta}}_n(\mathbf{y}) = \underset{\boldsymbol{\theta}\in\boldsymbol{\Theta}}{\mathrm{argmax}}\; L(\boldsymbol{\theta}\mid\mathbf{y})$$

Since the logarithm is a monotone function, the optimisation problem above is equivalent to maximising the loglikelihood function ℓ(𝜽|𝐲), that is:

$$\hat{\boldsymbol{\theta}}(\mathbf{y}) = \underset{\boldsymbol{\theta}\in\boldsymbol{\Theta}}{\mathrm{argmax}}\; L(\boldsymbol{\theta}\mid\mathbf{y}) = \underset{\boldsymbol{\theta}\in\boldsymbol{\Theta}}{\mathrm{argmax}}\; \ell(\boldsymbol{\theta}\mid\mathbf{y})$$

Since most optimisation software is set up by default to perform minimisation rather than maximisation, we will also define the maximum likelihood estimate as the solution to the following minimisation problem:

$$\hat{\boldsymbol{\theta}}(\mathbf{y}) = \underset{\boldsymbol{\theta}\in\boldsymbol{\Theta}}{\mathrm{argmax}}\; \ell(\boldsymbol{\theta}\mid\mathbf{y}) = \underset{\boldsymbol{\theta}\in\boldsymbol{\Theta}}{\mathrm{argmin}}\; \phi(\boldsymbol{\theta}\mid\mathbf{y})$$

where 𝜙(𝜽|𝐲) := −ℓ(𝜽|𝐲) is the negative loglikelihood function.

Example 2.4 (Poisson loglinear regression). Consider the following artificial regression data:

  x    y
 -1    2
 -1    3
  0    6
  0    7
  0    8
  0    9
  1   10
  1   12
  1   15

Figure 2.1 shows the data, and we can see a positive relation between the mean response and the values of the explanatory variable 𝑥.

library(tidyverse)
x
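Below is a minimal sketch (not the notes' own code) of how the maximum likelihood estimate for this Poisson loglinear model, with log 𝜇(𝜽, 𝑥) = 𝜃1 + 𝜃2 𝑥, could be computed by numerically minimising the negative loglikelihood; the use of optim() with the BFGS method and the starting values are illustrative choices only.

# data from the table in Example 2.4
x <- c(-1, -1, 0, 0, 0, 0, 1, 1, 1)
y <- c( 2,  3, 6, 7, 8, 9, 10, 12, 15)

# negative loglikelihood of the Poisson loglinear model: log mu_i = theta1 + theta2 * x_i
negloglik <- function(theta) {
  mu <- exp(theta[1] + theta[2] * x)
  -sum(dpois(y, lambda = mu, log = TRUE))
}

# minimise the negative loglikelihood numerically (illustrative starting values and method)
fit <- optim(par = c(0, 0), fn = negloglik, method = "BFGS")
fit$par                                 # approximate MLE of (theta1, theta2)
fit$value                               # minimised negative loglikelihood

coef(glm(y ~ x, family = poisson))      # same model fitted by R's built-in GLM routine, for comparison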