Linear models

Advanced Statistics and Data Analysis

Davide Risso

Linear Algebra

Linear Algebra

Linear algebra, also called matrix algebra, and its mathematical notation greatly facilitate the understanding of the concepts behind linear models.

Here, we spend some time reviewing key concepts of linear algebra, with particular focus on matrix operations and matrix notation.

We will also see how to perform matrix operations in R.

Matrix Notation

Linear algebra was created to solve systems of linear equations, e.g.

\[\begin{align*} a + b + c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &=1 \end{align*}\]

which becomes, in matrix notation,

\[ \begin{bmatrix} 1 & 1 & 1 \\ 3 & -2 & 1 \\ 2 & 1 & -1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \\ 1 \end{bmatrix} \]

Matrix Notation

The system is solved by

\[ \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 \\ 3 & -2 & 1 \\ 2 & 1 & -1 \end{bmatrix}^{-1} \begin{bmatrix} 6 \\ 2 \\ 1 \end{bmatrix} \]

where the \(-1\) denotes the inverse of the matrix.
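
In R, this system can be solved directly with the solve() function (introduced in more detail later in this lecture). The matrix below is the coefficient matrix above, entered column by column:

A <- matrix(c(1, 3, 2, 1, -2, 1, 1, 1, -1), nrow = 3)  # coefficient matrix
solve(A, c(6, 2, 1))  # the solution (a, b, c)
[1] 1 2 3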

We can borrow this notation to solve linear models in statistics.

Notation

We define vectors as column vectors, i.e., matrices with just one column.

This means that all the operations defined for matrices will be applicable to vectors as well.

We can write matrices by either concatenating vectors or by explicitly writing their elements.

\[ Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \quad \quad X = [X_1 \, X_2] = \begin{bmatrix} X_{1,1} & X_{1, 2}\\ X_{2,1} & X_{2, 2}\\ \vdots & \vdots\\ X_{n,1} & X_{n,2} \end{bmatrix} \]

Vectors, matrices

We already know how to create these objects in R

x1 <- c(1, 4, 6)
x2 <- c(7, 7, 2)
cbind(x1, x2)
     x1 x2
[1,]  1  7
[2,]  4  7
[3,]  6  2
matrix(data = c(1, 4, 6, 7, 7, 2), nrow = 3, ncol = 2)
     [,1] [,2]
[1,]    1    7
[2,]    4    7
[3,]    6    2

Matrix operations

Here, we review some of the operations that can be performed on matrices, both in mathematical terms and in R.

  • Multiplication by a scalar
  • Transposition
  • Matrix multiplication
  • Inverse

Multiplication by a scalar

This is the simplest operation: each element of the matrix is multiplied by the scalar.

R automatically recognizes when a scalar multiplies a matrix and “does the right thing.”

a <- 2
x <- matrix(data = c(1, 4, 6, 7, 7, 2), nrow = 3, ncol = 2)
x
     [,1] [,2]
[1,]    1    7
[2,]    4    7
[3,]    6    2
a * x
     [,1] [,2]
[1,]    2   14
[2,]    8   14
[3,]   12    4

The transpose

Transposition is a simple operation that turns the columns of a matrix into rows. We use the \(\top\) sign to denote the transpose of a matrix.

\[ X^\top = \begin{bmatrix} X_{1,1} & \ldots & X_{n, 1}\\ X_{1,2} & \ldots &X_{n, 2}\\ \vdots & \ddots &\vdots\\ X_{1,p} & \ldots & X_{n,p} \end{bmatrix} \]

The transpose

In R, we can use the t() operator to transpose a matrix.

t(x)
     [,1] [,2] [,3]
[1,]    1    4    6
[2,]    7    7    2

Matrix multiplication

If \(A\) is an \(n \times m\) matrix and \(B\) is an \(m \times p\) matrix, their matrix product \(X = AB\) is an \(n \times p\) matrix whose element in row \(i\) and column \(j\) is the sum of the products of the elements of row \(i\) of \(A\) with the corresponding elements of column \(j\) of \(B\).

In R, we can use the %*% operator.

y <- matrix(c(5, 6, 7, 2), ncol=2, nrow=2)
x %*% y
     [,1] [,2]
[1,]   47   21
[2,]   62   42
[3,]   42   46

Properties of matrix multiplication

In general, matrix multiplication has the following properties (see the R check below):

  1. Not commutative: \(A \, B \neq B \, A.\)
  2. Distributive over matrix addition: \(A(B+C) = AB + AC.\)
  3. Compatible with scalar multiplication: \(a(AB) = (aA)B\)
  4. Transposition: \((AB)^\top = B^\top A^\top.\)
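
As a quick sanity check, these properties can be verified numerically in R (the small matrices below are arbitrary examples):

A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(0, 1, -1, 2), nrow = 2)
C <- matrix(c(2, 0, 1, 1), nrow = 2)
isTRUE(all.equal(A %*% B, B %*% A))                  # not commutative
[1] FALSE
isTRUE(all.equal(A %*% (B + C), A %*% B + A %*% C))  # distributive
[1] TRUE
isTRUE(all.equal(2 * (A %*% B), (2 * A) %*% B))      # compatible with scalar multiplication
[1] TRUE
isTRUE(all.equal(t(A %*% B), t(B) %*% t(A)))         # transpose of a product
[1] TRUE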

The identity matrix

For scalars we have the number \(1\), a number such that \(1 x = x\) for any \(x\).

The analogue for matrices is the identity matrix \[ I = \begin{bmatrix} 1 & 0 & 0 & \ldots & 0 & 0\\ 0 & 1 & 0 & \ldots & 0 & 0\\ 0 & 0 & 1 & \ldots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & \ldots & 1 & 0\\ 0 & 0 & 0 & \ldots & 0 & 1 \end{bmatrix} \]

By definition, for any matrix \(X\), \(X \, I = X\) and \(I \, X = X\).

The identity matrix

In R we can use the function diag().

x %*% diag(2)
     [,1] [,2]
[1,]    1    7
[2,]    4    7
[3,]    6    2
diag(3) %*% x
     [,1] [,2]
[1,]    1    7
[2,]    4    7
[3,]    6    2

Warning

Note the different behavior of diag(x) when x is a scalar or a vector!
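
A brief illustration (the values 1, 4, 6 are arbitrary):

diag(3)           # scalar argument: a 3 x 3 identity matrix
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
diag(c(1, 4, 6))  # vector argument: a diagonal matrix with these entries
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    4    0
[3,]    0    0    6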

The inverse

A matrix is called square if it has the same number of rows and columns.

The inverse of a square matrix \(X\), denoted by \(X^{-1}\), is the matrix that, when multiplied by \(X\), returns the identity matrix.

\[ X \, X^{-1} = I. \]

Note that not all matrices have a defined inverse.

We will see later how this plays a role for linear models.
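
For instance (an illustrative matrix, not part of the lecture data), a matrix whose second column is a multiple of the first is singular, and solve() raises an error:

S <- matrix(c(1, 2, 2, 4), nrow = 2)  # second column is twice the first
try(solve(S))                         # error: S is exactly singular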

The inverse

If the inverse exists, it can be computed using solve().

y
     [,1] [,2]
[1,]    5    7
[2,]    6    2
solve(y)
        [,1]     [,2]
[1,] -0.0625  0.21875
[2,]  0.1875 -0.15625
y %*% solve(y)
     [,1]         [,2]
[1,]    1 2.220446e-16
[2,]    0 1.000000e+00

Warning

The solve() function is numerically unstable and should be used with caution.

Using matrix algebra in statistics

  • Why do we need linear algebra and matrix notation in statistics?

  • Linear algebra is a very convenient and compact mathematical way of formalizing many of the concepts of statistics, especially with regards to linear models.

  • Let’s focus on a couple of simple examples.

Example: The mean

  • Assume that we have a vector \(X\) that contains the data for a sample of \(n\) observations. We can use matrix notation to denote the sample mean of that vector.

  • Define an \(n \times 1\) matrix \(\mathbf{1} = [1 \, \ldots \, 1]^\top\).

  • We can compute the mean simply as \(\frac{1}{n} \mathbf{1}^\top X\). \[ \frac{1}{n} \mathbf{1}^\top X = \frac{1}{n} [1 \, \ldots \, 1] \begin{bmatrix} X_1\\ \vdots\\ X_n \end{bmatrix} = \frac{1}{n} \sum_{i=1}^n X_i = \bar{X}. \]

Example: The mean

n <- 10
x <- rnorm(n)
mean(x)
[1] 0.4672474
A <- rep(1, n)
1/n * t(A) %*% x
          [,1]
[1,] 0.4672474

Example: The variance

The same is true for the variance: we can compute it by multiplying the transpose of the centered vector by the vector itself.

\[ R = X - \bar{X} = \begin{bmatrix} X_1 - \bar{X}\\ \vdots\\ X_n - \bar{X} \end{bmatrix} \]

Then, \[ \frac{1}{n-1} R^\top R = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2. \]

Example: The variance

var(x)
[1] 0.9345514
r <- x - mean(x)
1/(n-1) * t(r) %*% r
          [,1]
[1,] 0.9345514

The crossprod function

The operation \(X^\top Y\) is so important in statistics that R has a shortcut function for it, the crossprod() function.

A <- matrix(rnorm(4), nrow=2, ncol=2)
B <- matrix(rnorm(4), nrow=2, ncol=2)
t(A) %*% B
          [,1]       [,2]
[1,] -0.704367  0.6050073
[2,]  0.375773 -0.1231196
crossprod(A, B)
          [,1]       [,2]
[1,] -0.704367  0.6050073
[2,]  0.375773 -0.1231196

The crossprod function

The crossprod function can be used to compute the variance.

var(x)
[1] 0.9345514
crossprod(r)/(n-1)
          [,1]
[1,] 0.9345514

Note that crossprod(r) is a further shortcut for crossprod(r, r).

Least-squares regression

Least-Squares Regression

  • If we want to measure the linear association between two continuous variables, we can use the correlation coefficient.

  • An alternative way to look at the association between two variables is to determine the best line that describes their relationship.

  • What do we mean by “best”? In the context of regression, “best” means the line that minimizes the sum of squared distances.

A simple example

(a series of illustrative plots; figures not reproduced here)

Least-Squares Regression

Mathematically, we can describe the linear relation between the two quantities \(x\) and \(y\) as \[ \hat{y} = \alpha + \beta x. \]

Finding the best line corresponds to minimizing the sum of squared distances between the values of \(y\) and the values of \(\alpha + \beta x\), i.e. \[ \min_{\alpha, \beta} \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2. \]

Least-Squares Regression

To minimize this quantity, as usual, we set the partial derivatives to zero.

\[\begin{align*} \frac{\partial}{\partial \alpha} &= -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0,\\ \frac{\partial}{\partial \beta} &= -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i) (x_i) = 0,\\ \end{align*}\]

We can show that \[\begin{align*} \alpha &= \bar{y} - \beta \bar{x},\\ \beta &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}. \end{align*}\]
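
As a sketch, these formulas can be checked on simulated data (the x and y below are simulated for illustration, not part of the lecture data) and compared with lm():

set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
beta <- cov(x, y) / var(x)         # slope
alpha <- mean(y) - beta * mean(x)  # intercept
c(alpha, beta)
coef(lm(y ~ x))                    # same estimates as the closed-form solution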

Multiple Regression

The ideas of linear regression can be applied in the case where we have more than one covariate.

Instead of fitting the line \(\hat{y} = \alpha + \beta x\), we can consider the more general form \[ \hat{y} = \alpha + \beta_1 x_1 + \ldots + \beta_p x_p, \] where \(p\) is the number of covariates and \(x_1, \ldots, x_p\) are \(p\) variables.

For symmetry, the intercept \(\alpha\) is often denoted with \(\beta_0\).

Multiple Regression: matrix notation

Given a set of \(n\) independent observations, and \(p\) covariates, with \(n > p\), the linear regression model is defined, in matrix notation, as \[ \hat Y = X \beta, \] where

  • \(Y\) is an \(n \times 1\) vector containing the response variable.
  • \(X\) is an \(n \times p\) matrix of covariates.
  • \(\beta\) is a \(p \times 1\) vector of regression parameters.

Least squares: multiple regression

The least-squares criterion is easier to write in matrix notation:

\[ (Y - X\beta)^\top (Y - X\beta) \]

Setting the derivative to zero, in matrix notation, gives \[ -2 X^\top (Y - X\beta) = 0, \] which leads to the normal equations \[ X^\top X \beta = X^\top Y \] and \[ \hat\beta = (X^\top X)^{-1}X^\top Y \]

Example: Cherry trees

head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

Example: Cherry trees

For the Cherry tree data, suppose we model the volume (\(y\)) of the trees as a function of girth (\(x_1\)) and height (\(x_2\)): \[ Y_i=\beta_0+ \beta_1 x_{i1}+ \beta_2 x_{i2}+\varepsilon_i\,,\,\,\, i=1, \ldots, 31. \]

y <- trees$Volume
X <- cbind(1, trees$Girth, trees$Height)
head(y)
[1] 10.3 10.3 10.2 16.4 18.8 19.7
head(X)
     [,1] [,2] [,3]
[1,]    1  8.3   70
[2,]    1  8.6   65
[3,]    1  8.8   63
[4,]    1 10.5   72
[5,]    1 10.7   81
[6,]    1 10.8   83

Example: Cherry trees

This yields

crossprod(X)
       [,1]     [,2]     [,3]
[1,]   31.0   410.70   2356.0
[2,]  410.7  5736.55  31524.7
[3,] 2356.0 31524.70 180274.0
crossprod(X, y)
         [,1]
[1,]   935.30
[2,] 13887.86
[3,] 72962.60
solve(crossprod(X)) %*% crossprod(X, y)
            [,1]
[1,] -57.9876589
[2,]   4.7081605
[3,]   0.3392512

Example: Cherry trees

fit <- lm(Volume ~ Girth + Height, data = trees)
fit

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Coefficients:
(Intercept)        Girth       Height  
   -57.9877       4.7082       0.3393  

Regression as a statistical model

  • Up until now, we have only defined the regression line as a mathematical entity that we can compute starting from a vector \(y\) and a matrix \(X\).

  • However, in inferential statistics, both \(Y\) and \(X\) are random variables, and what we observe are the values of our random sample from the population.

  • This means that we want to infer the true relationship between \(X\) and \(Y\) in the population from the relationship that we observe in the sample.

Regression as a statistical model

  • In other words, there is a true parameter \(\beta\) in the population, and we need to estimate it with the values of our sample.

  • In the context of inference, regression is often referred to as the linear model.

Linear models

Simple linear Models

  • Suppose that, on \(n\) cases, we observe a continuous response \(y_i\) and one explanatory variable \(x_{i}\).

  • A simple linear model postulates that the observations are realizations of independent rv’s \(Y_i\) such that: \[ E(Y_i | X_i = x_i)= \beta_0 + \beta_1 x_{i}, \,\,\,i=1,\,\,\,\ldots,n. \]

  • Therefore, observations are realizations of independent rv’s whose averages lie on the straight line \(\beta_0 + \beta_1 x\).

Simple linear regression

Another way to look at this is: \[ Y_i = \beta_0 + \beta_1 x_{i} + \varepsilon_i\,\,\,i=1,\ldots,n, \] where \(\varepsilon_i\) are random errors with mean 0.

Often, it is assumed that the variance of the error terms is the same for all \(i=1,\ldots,n\), i.e., the errors are homoscedastic.

Interpretation of slope

Consider two subjects:

  • A has covariate value \(x_{A}\)
  • B has covariate value \(x_{B}=x_{A}+1\)

The expected difference in the response of the two subjects is: \[ E(Y_B) - E(Y_A) = \beta_0 + \beta_1 (x_{A}+1) - \beta_0 - \beta_1 x_{A}=\beta_1, \] i.e., \(\beta_1\) is the effect of a one unit increase in \(x.\)

Multiple linear regression

In multiple linear regression, rather than one predictor, we have more than one, \(p\), say. \[ Y_i = \beta_0 + \beta_1 x_{i1} + \ldots +\beta_p x_{ip} + \varepsilon_i\,\,\,i=1,\ldots,n \]

or equivalently \[Y = X\beta + \varepsilon.\]

Interpretation of parameters

Consider two subjects:

  • A has covariate values \(x_{A1},x_{A2}, \ldots, x_{Ap}\)
  • B has covariate values \(x_{A1},x_{A2}+1, \ldots, x_{Ap}\)

Expected difference in the response of the two subjects: \[ \begin{aligned} E(Y_B) - E(Y_A) &= \beta_0 + \beta_1 x_{A1} + \beta_2 (x_{A2}+1)+ \ldots + \beta_p x_{Ap}\\ &\quad - \beta_0 - \beta_1 x_{A1} - \beta_2 x_{A2}- \ldots - \beta_p x_{Ap}\\ &= \beta_2, \end{aligned} \] i.e., \(\beta_2\) is the effect of a one unit increase in \(x_2\) for fixed levels of the other predictors.

Coefficients \(\beta_1, \ldots, \beta_p\) are called (partial) regression coefficients because they “allow” for the (partial) effect of other variables.

The linear model: matrix notation

Given a set of \(n\) independent observations, and \(p\) covariates, with \(n > p\), the linear model is defined, in matrix notation, as \[ Y = X \beta + \varepsilon, \] where

  • \(Y\) is an \(n \times 1\) vector containing the response variable.
  • \(X\) is an \(n \times p\) matrix of covariates, called the design matrix.
  • \(\beta\) is a \(p \times 1\) vector of regression parameters.
  • \(\varepsilon\) is an \(n \times 1\) vector of random errors.

The linear model

As with all statistical models, it is always a good idea to identify the observed / unobserved quantities, as well as the random / fixed quantities.

In this case,

  • \(X\) and \(Y\) are observed random variables.
  • \(\varepsilon\) is an unobserved random variable.
  • \(\beta\) is a set of unknown parameters that we will need to estimate.

The linear model

In some books, \(X\) is not considered a random variable, but a fixed quantity. In this course, we will consider it random, but we will condition our analysis on the observed values of \(X\).

For practical purposes, we do not care about the random variation of \(X\).

A note on linearity

The term linear in the linear model refers to the linearity of the model with respect to the parameter \(\beta\).

It does not imply linearity in \(X\): the design matrix \(X\) may contain non-linear transformations of the data. For instance, the following models are valid linear models.

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon \]

\[ Y = \beta_0 + \beta_1 \log X + \varepsilon \]

A note on the design matrix

Recall that \[ \hat\beta = (X^\top X)^{-1}X^\top Y. \]

This means that \(X^\top X\) needs to be invertible, which implies that \(X\) is full rank.

Definition

The rank of a matrix is the maximum number of linearly independent columns.

If the rank is equal to the number of columns, the matrix is said to be full rank.

This is often referred to as the non-collinearity condition.
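
To see what goes wrong with a collinear design (an illustrative example; the names z1, z2 and Z are made up here): if one column is an exact linear combination of the others, \(X^\top X\) is singular and cannot be inverted.

z1 <- rnorm(10)
z2 <- 2 * z1              # z2 is an exact linear combination of z1
Z <- cbind(1, z1, z2)     # three columns, but rank 2: not full rank
try(solve(crossprod(Z)))  # error: Z^T Z is singular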

Assumptions

As with all statistical models, we need some assumptions.

  1. There is a linear relation between \(Y\) and \(X\).
  2. The error terms \(\varepsilon\) are i.i.d. with mean 0 and constant variance:
    • \(E[\varepsilon] = 0\);
    • \(Var(\varepsilon) = \sigma^2\) (homoscedasticity).
  3. The error terms \(\varepsilon\) are independent of \(X\):
    • \(\varepsilon {\perp\!\!\!\!\perp} X\).

Conditional Expectation

As a consequence of these assumptions, we can see that the conditional expectation of \(Y\) is

\[ E[Y | X] = X\beta. \]

Exercise

Why?

Why fit linear models

Explanation

Here the idea is that the system under study really is (approximately) linear, and we are interested in the coefficients per se.

It is often of particular interest to find a minimal set of explanatory variables.

Example

We can use a linear model to understand the effect of a certain treatment on gene expression.

We want to understand which genes are affected by the treatment and why.

Why fit linear models

Prediction

Here a model is a convenient means to create predictions for new cases.

The only interest is in the quality of the predictions.

Example

We can use a linear model to predict whether a patient will react to a therapy based on their gene expression.

We want to get the best possible prediction; we do not care much about which genes are involved.

To solve now

  1. Fitting the model: how do we estimate \((\beta, \sigma^2)\)?

  2. Inference: what can we say about \(\beta\) (rarely, about \(\sigma^2\)) based on the \(n\) observations?

Ordinary Least Squares

Ordinary Least Squares

We do not observe the true parameter \(\beta\) and we need to estimate it from the data.

Luckily, we already know how to estimate \(\beta\), as the value that minimizes the sum of the squares of the differences between \(Y\) and \(X\beta\).

This value, \[ \hat\beta=(X^\top X)^{-1}X^\top y, \] is called the ordinary least squares (OLS) estimator.

Ordinary Least Squares

To estimate the variance of the error terms, it seems natural to consider the variability of the residuals. An unbiased estimator for \(\sigma^2\) is given by \[ s^2=\displaystyle\frac{\sum(y_i - \hat y_i)^2}{n-p}. \]

Ordinary Least Squares

In matrix form we have that: \[ \hat y = X\hat\beta, \]

\[ e=y - \hat y, \]

\[ s^2=e^\top e/(n-p)=(y - \hat y)^\top(y - \hat y)/(n-p). \]

Why least squares?

We can prove that \(\hat\beta\) is conditionally unbiased, i.e., \[ E[\hat \beta | X] = \beta. \]

The variance of the OLS estimator is

\[ Var(\hat\beta | X) = \sigma^2 (X^\top X)^{-1}. \]

The OLS estimator is conditionally unbiased

\[\begin{align*} \hat\beta &= (X^\top X)^{-1} X^\top (X\beta + \varepsilon)\\ &= (X^\top X)^{-1} X^\top X\beta + (X^\top X)^{-1} X^\top \varepsilon\\ &= \beta + (X^\top X)^{-1} X^\top \varepsilon. \end{align*}\]

Hence, \[ \hat\beta = \beta + \eta \quad \text{where} \quad \eta = (X^\top X)^{-1} X^\top \varepsilon. \]

The OLS estimator is conditionally unbiased

To show that \(E[\hat \beta | X] = \beta\), we need to show that \(E[\eta | X] = 0.\)

\[ E[\eta | X] = (X^\top X)^{-1} X^\top E[\varepsilon | X]. \]

Since \(\varepsilon {\perp\!\!\!\!\perp} X\), conditioning on \(X\) does not influence the distribution of \(\varepsilon\), and we know by assumption that \(E[\varepsilon] = 0\). Hence, \[ E[\hat\beta | X] = \beta. \]
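
A small simulation sketch of this result (simulated data; the names Xs, beta_true, sim_betahat and betahats are introduced here for illustration): keeping the design fixed and averaging the OLS estimates over many simulated data sets gives values close to the true \(\beta\).

set.seed(123)
Xs <- cbind(1, rnorm(100))  # a design matrix, kept fixed across simulations
beta_true <- c(1, 2)
sim_betahat <- function() {
  ys <- Xs %*% beta_true + rnorm(100)  # errors with mean 0, independent of Xs
  drop(solve(crossprod(Xs)) %*% crossprod(Xs, ys))
}
betahats <- replicate(1000, sim_betahat())
rowMeans(betahats)          # close to the true values (1, 2)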

Standard Error of the OLS estimator

As is true in general for estimation, it is not enough to have a point estimate of the regression parameters; we also want to know the distribution of the estimator.

First, we need to compute the standard errors of \(\hat\beta\).

In order to do that we have to remember the assumptions of the linear model, especially those related to the error term \(\varepsilon\).

Standard Error of the OLS estimator

We denote with \(\Sigma\) the variance/covariance matrix of the error term. In particular, \[ \Sigma_{i,j} = Cov(\varepsilon_i, \varepsilon_j) = \begin{cases} \sigma^2 & i=j\\ 0 & i \neq j \end{cases} \]

Hence, we can write \[ \Sigma = \sigma^2 I. \]

Standard Error of the OLS estimator

As a consequence, since \(\varepsilon\) is independent of \(X\), the conditional variance of \(Y\) is \[ Var(Y | X) = Var(X\beta + \varepsilon | X) = Var(\varepsilon) = \sigma^2 I \]

Variance of a linear combination

In matrix notation, the variance of a linear combination \(AY\) can be computed as \[ Var(AY) = A Var(Y) A^\top \]

Since \(\hat\beta\) is a linear combination of \(Y\), we can use the same rule to compute its variance.
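
A quick simulation sketch of this rule, with an arbitrary matrix \(A\) and \(Y\) simulated so that \(Var(Y) = 4I\) (names and values chosen for illustration):

set.seed(42)
A <- matrix(c(1, 0, 2, 1, 1, -1), nrow = 2)      # an arbitrary 2 x 3 matrix
Y <- matrix(rnorm(3 * 10000, sd = 2), nrow = 3)  # 10000 draws of a 3-vector, Var(Y) = 4 I
var(t(A %*% Y))                                  # empirical variance of AY
A %*% (4 * diag(3)) %*% t(A)                     # theoretical A Var(Y) A^T (approximately equal)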

Standard Error of the OLS estimator

\[\begin{align*} Var(\hat\beta | X) &= Var((X^\top X)^{-1} X^\top Y | X)\\ &= (X^\top X)^{-1} X^\top Var(Y | X) ((X^\top X)^{-1} X^\top)^\top. \end{align*}\]

Note that:

  • \(X^\top X\) is symmetric;
  • if \(A\) is symmetric \(A^\top = A\);
  • \((AB)^\top = B^\top A^\top\).
  • \((A^\top)^\top = A\).

Standard Error of the OLS estimator

Hence, \[ ((X^\top X)^{-1} X^\top)^\top = X (X^\top X)^{-1} \]

Therefore,

\[\begin{align*} Var(\hat\beta | X) &= (X^\top X)^{-1} X^\top Var(Y|X) X (X^\top X)^{-1} \\ &= (X^\top X)^{-1} X^\top \sigma^2 I X (X^\top X)^{-1}\\ &= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1}\\ &= \sigma^2 (X^\top X)^{-1}. \end{align*}\]

Hence, the square roots of the diagonal elements of this matrix are the standard errors of \(\hat\beta\).

How to estimate \(\sigma^2\)?

If we could observe \(\varepsilon\), we could simply estimate \(\sigma^2\) by \[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \varepsilon_i^2. \]

Since we do not observe \(\varepsilon\), we can use the residuals \(e = Y - X\hat\beta\).

How to estimate \(\sigma^2\)?

We also need to correct for the degrees of freedom (more on this later), to get our estimator: \[ s^2 = \frac{1}{n-p} \sum_{i=1}^n e_i^2. \]

This estimator is conditionally unbiased, i.e., \[ E[s^2 | X] = \sigma^2. \]

A note on \(n-p\)

The fact that we divide by \(n-p\) derives from the proof that \(s^2\) is unbiased.

This proof is quite complicated mathematically, so we will skip it for now.

However, it derives from the geometric interpretation of regression, which we will see if we have time at the end of the course.

For now, notice that because we assumed that \(n > p\), we do not have problems in estimating \(\sigma^2\).

Also, note the similarity with the usual estimator of the variance: Here, instead of \(n - 1\) we use \(n-p\), because we have \(p\) parameters, and hence \(n-p\) degrees of freedom.

A note on the assumptions

Note which assumptions we did not need to get to these results.

  1. We did not assume normality!
  2. We did not assume that the covariates are independent of each other, just that the columns of \(X\) are not collinear.

These results ensure that linear models can be applied in a lot of different settings, without requiring too stringent assumptions.

Example: Cherry trees

betahat <- solve(crossprod(X)) %*% crossprod(X, y)
res <- y - X %*% betahat
n <- nrow(X)
p <- ncol(X)
s2 <- sum(res^2)/(n-p)
s2
[1] 15.06862
sdb <- sqrt(diag(s2 * solve(crossprod(X))))
sdb
[1] 8.6382259 0.2642646 0.1301512

Example: Cherry trees

s <- summary(fit)
s$sigma^2
[1] 15.06862
s$coefficients[,2]
(Intercept)       Girth      Height 
  8.6382259   0.2642646   0.1301512 

The Normal Linear Model

Normality assumption

Remember that, so far, we have not made any assumptions about the distribution of the data. We only assumed that:

  1. There is a linear relation between \(Y\) and \(X\); with \(X\) full-rank and \(p < n\).
  2. The error terms \(\varepsilon\) are i.i.d. with mean 0 and constant variance:
    • \(E[\varepsilon] = 0\);
    • \(Var(\varepsilon) = \sigma^2\) (homoscedasticity).
  3. The error terms \(\varepsilon\) are independent of \(X\): \(\varepsilon\perp\!\!\!\!\perp X\).

Normality assumption

For the next results to hold, we need to restrict our model to the following assumption: \[ \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2). \]

We need the normality assumption in order to make inference on the coefficients of the model, i.e., to create confidence intervals and test hypotheses on the parameter \(\beta\).

The main results that we will see are the \(t\)-test and the \(F\)-test.

In the general model

It can be proved that:

  • \(\hat\beta \sim N_p(\beta, \sigma^2(X^\top X)^{-1})\);

  • \((n-p)s^2/\sigma^2 \sim \chi^2_{n-p}\);

  • \(\hat\beta\) and \(s^2\) are independent rv.

We have, for \(j=1,\ldots,p\)

  1. \(\hat\beta_j\sim N(\beta_j, \sigma^2[(X^\top X)^{-1}]_{jj})\);

  2. \(T_j = \frac{\hat\beta_j-\beta_j}{\sqrt{\hat{V}(\hat\beta_j)}} \sim t_{n-p}.\)

Testing \(H_0 : \beta_j = 0\)

Can be tested with a t-test: \[ T_j = \frac{\hat\beta_j-\beta_j}{\sqrt{\hat{V}(\hat\beta_j)}} \sim t_{n-p}. \]

Equivalently, using sums of squares, we can use: \[ F = \frac{(e^\top_Re_R - e^\top_Fe_F)/1}{e^\top_Fe_F/(n-p)} \sim F_{1, n-p} \] Reject \(H_0\) at level \(\alpha\) if \(F^{obs} > F_{1, n-p; 1-\alpha}.\)

Note that, in this case, \(F=T_j^2.\)

Overall goodness of fit

Consider the model \[ Y_i = \beta_0 + \beta_1 x_{i1} +\beta_2 x_{i2} + \varepsilon_i. \]

Want to test the overall goodness of fit \[ H_0: \beta_1 =\beta_2 = 0. \]

Overall goodness of fit

Two models

  • Full: \(Y_i = \beta_0 + \beta_1 x_{i1} +\beta_2 x_{i2} + \varepsilon_i\)

  • Reduced: \(Y_i = \beta_0 + \varepsilon_i\)

F-statistic, under \(H_0\) (details similar as before): \[ F = \frac{(e^\top_Re_R - e^\top_Fe_F)/2}{e^\top_Fe_F/(n-3)} \sim F_{2, n-3} \] Reject \(H_0\) at level \(\alpha\) if \(F^{obs} > F_{2, n-3; 1-\alpha}.\)

Example: t-test

# Girth
c(betahat[2], sdb[2], betahat[2]/sdb[2])
[1]  4.7081605  0.2642646 17.8160841
# Height
c(betahat[3], sdb[3], betahat[3]/sdb[3])
[1] 0.3392512 0.1301512 2.6065936
s$coefficients
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -57.9876589  8.6382259 -6.712913 2.749507e-07
Girth         4.7081605  0.2642646 17.816084 8.223304e-17
Height        0.3392512  0.1301512  2.606594 1.449097e-02
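
The p-values in the last column can be recovered from the \(t_{n-p}\) distribution (a sketch using the objects betahat, sdb, n and p defined earlier; tstat is a new name):

tstat <- betahat / sdb           # t statistics for H0: beta_j = 0
2 * pt(-abs(tstat), df = n - p)  # two-sided p-values, matching Pr(>|t|) above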

Example: F-test

## Overall goodness-of-fit test, via anova() and then "manually"
fit0 <- lm(Volume ~ 1, data=trees)
anova(fit0, fit)
Analysis of Variance Table

Model 1: Volume ~ 1
Model 2: Volume ~ Girth + Height
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     30 8106.1                                  
2     28  421.9  2    7684.2 254.97 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
res0 <- y - mean(y)
ss <- (crossprod(res0) - crossprod(res))
rss <- crossprod(res)
c(rss, ss, (ss/2)/(rss/(n-3)))
[1]  421.9214 7684.1625  254.9723

Dropping subsets

Suppose we have the model \[ Y_i = \beta_0 +\beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i\,\,\,i=1,\ldots,n, \] and we want to test whether we can simplify the model by dropping \(k\) variables, i.e., we want to test \[ H_0 : \beta_{j_1} = \beta_{j_2} = \ldots = \beta_{j_k} = 0 \] Two models

  • Full: above

  • Reduced: model with the \(k\) columns \(x_{j_1}, x_{j_2}, \ldots, x_{j_k}\) omitted from the design matrix.

Dropping subsets

F-statistic, under \(H_0\) (details similar to before): \[ F = \frac{(e^\top_Re_R - e^\top_Fe_F)/k}{e^\top_Fe_F/(n-p)} \sim F_{k, n-p}. \] Reject \(H_0\) at level \(\alpha\) if \(F^{obs} > F_{k, n-p; 1-\alpha}.\)
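
As a sketch on the Cherry tree data (dropping the single variable Height, so \(k = 1\); fit_red is a new name), anova() compares the reduced and full fits, and the resulting F statistic should equal the square of the Height \(t\) statistic seen earlier.

fit_red <- lm(Volume ~ Girth, data = trees)  # reduced model: Height dropped
anova(fit_red, fit)                          # F test of H0: beta_Height = 0; here F = t^2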

Maximum Likelihood Estimator

What about Maximum Likelihood? Why don’t we estimate the parameters with MLE like we did before?

Actually, one can show that if \[ Y | X \sim \mathcal{N}(X \beta, \sigma^2 I), \] the estimator \[ \hat \beta = (X^\top X)^{-1} X^\top Y \] is the maximum likelihood estimator (MLE).
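
A quick numerical check of this claim, using the Cherry tree objects defined earlier: the gradient of the normal log-likelihood with respect to \(\beta\) is \(X^\top(y - X\beta)/\sigma^2\), and it vanishes at the OLS estimate, so \(\hat\beta\) solves the maximum-likelihood score equations.

crossprod(X, y - X %*% betahat)  # essentially zero (up to rounding error)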

The math behind lm()

The math behind lm()

We have seen with the Cherry tree example that we can compute \(\hat \beta\) using the crossprod() and the solve() functions.

However, lm() uses a more efficient and numerically stable method, based on the QR decomposition.

solve() is numerically unstable

n <- 50; M <- 500
x <- seq(1, M, len=n)
X <- cbind(1, x, x^2, x^3)
beta <- matrix(c(1,1,1,1), nrow=4, ncol=1)

y <- X%*%beta+rnorm(n,sd=1)

solve(crossprod(X)) %*% crossprod(X,y)
Error in solve.default(crossprod(X)): system is computationally singular: reciprocal condition number = 2.93617e-17

This happens because \(X^\top X\) contains elements of hugely different magnitudes, so it is ill-conditioned and its inverse cannot be computed reliably.

crossprod(X)
                        x                          
          50        12525 4.217364e+06 1.597455e+09
x      12525      4217364 1.597455e+09 6.454018e+11
     4217364   1597455115 6.454018e+11 2.716100e+14
  1597455115 645401757068 2.716100e+14 1.175658e+17

lm() is numerically stable

fit <- lm.fit(X, y)
fit$coefficients
                  x                     
0.7702449 1.0043048 0.9999759 1.0000000 

The QR decomposition

Definition

We can decompose any full-rank \(n \times p\) matrix, with \(n \geq p\), as \[ X = Q R, \] with

  • \(Q\) an \(n \times p\) matrix such that \(Q^\top Q = I\),
  • \(R\) a \(p \times p\) upper triangular matrix (see the R check below).
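
A quick check of the decomposition, using the polynomial design matrix X from the previous example (qrX is a new name):

qrX <- qr(X)
isTRUE(all.equal(crossprod(qr.Q(qrX)), diag(ncol(X))))                   # Q^T Q = I
[1] TRUE
isTRUE(all.equal(qr.Q(qrX) %*% qr.R(qrX), X, check.attributes = FALSE))  # X = Q R
[1] TRUE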

Upper triangular matrix

Upper triangular matrices are very useful to solve linear systems of equations, as we can show with this example: \[ \begin{bmatrix} 1 & 2 & -1 \\ 0 & 1 & 2 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} 6 \\ 4 \\ 1 \end{bmatrix} \]

It is immediate to see from the last row that \(c=1\), and we can then work our way up to solve for the other variables (back substitution).
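
In R, back substitution is performed by backsolve(); applied to the triangular system above (U is a new name for the matrix shown):

U <- matrix(c(1, 0, 0, 2, 1, 0, -1, 2, 1), nrow = 3)  # the upper triangular matrix above
backsolve(U, c(6, 4, 1))                              # solves U %*% c(a, b, c) = c(6, 4, 1)
[1] 3 2 1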

Finding \(\hat \beta\) with QR

We can rewrite the least squares equation using the QR decomposition:

\[\begin{align*} X^\top X \beta &= X^\top Y \\ (QR)^\top (QR) \beta &= (QR)^\top Y \\ R^\top (Q^\top Q) R \beta &= R^\top Q^\top Y \\ R^\top R \beta &= R^\top Q^\top Y \\ R \beta &= Q^\top Y \\ \beta &= R^{-1} Q^\top Y \end{align*}\]

Finding \(\hat \beta\) with QR

We can now use the same trick that we used in the example equations to solve for \(\beta\).

In R:

QR <- qr(X)
Q <- qr.Q(QR)
R <- qr.R(QR)

backsolve(R, crossprod(Q,y))
          [,1]
[1,] 0.7702449
[2,] 1.0043048
[3,] 0.9999759
[4,] 1.0000000
fit$coefficients
                  x                     
0.7702449 1.0043048 0.9999759 1.0000000 

QR

The QR decomposition can also be used to compute the fitted values and the variance (and hence the standard errors) of \(\hat \beta\).

\[ \hat y = X \hat \beta = (QR) R^{-1} Q^\top Y = Q Q^\top Y \] and

\[ \text{Var}(\hat \beta) = \sigma^2 (X^\top X)^{-1} = \sigma^2 (R^\top Q^\top Q R)^{-1} = \sigma^2 (R^\top R)^{-1} \]
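
As a sketch of these two identities on the running example (Q, R, fit, X and y as defined above; chol2inv() computes \((R^\top R)^{-1}\) from the triangular factor \(R\)):

isTRUE(all.equal(drop(Q %*% crossprod(Q, y)), fit$fitted.values, check.attributes = FALSE))
[1] TRUE
s2 <- sum(fit$residuals^2) / (nrow(X) - ncol(X))  # estimate of sigma^2
sqrt(diag(s2 * chol2inv(R)))  # standard errors of betahat, without inverting X^T X directly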