6  Linear models (part 2)

6.1 Lecture Slides

View slides in full screen

6.2 Lab: confounding and adjustment 1

In this Lab, we will use simulations to show the effect of modelling choices on inference.

Let’s consider a rather general situation, in which a continuous outcome \(y\) is influenced by a binary treatment \(z\) (the variable we are interested in) and a continuous covariate \(x\) that may act as a confounder: \[ y_i = \beta_0 + \beta_1 z_i + \beta_2 x_i + \varepsilon_i. \]

Note that here we are using the outcome/treatment/confounder terminology, typical of epidemiology.

6.2.1 Experiment 1

Let’s simulate some data, starting from an ideal situation.

set.seed(1503)

n <- 200
z <- rep(c(0, 1), each=n/2)  # binary treatment
x <- runif(n)                # continuous covariate, simulated independently of z
beta0 <- 0
beta1 <- 2    # treatment effect
beta2 <- 1    # effect of x
sigma <- .2

y <- beta0 + beta1 * z + beta2 * x + rnorm(n, sd = sigma)

Let’s plot the data.

library(ggplot2)

df <- data.frame(x, y, z=factor(z))
ggplot(df, aes(x = x, y = y, color = z)) +
    geom_point() +
    # black line: fit ignoring z; colored lines: fits within each group
    geom_smooth(method = "lm", se = FALSE, color = "black") +
    geom_smooth(method = "lm", se = FALSE)

Let’s look at the marginal relation between \(y\) and \(z\).

ggplot(df, aes(y = y, x = z, fill=z)) +
    geom_boxplot()


Finally, let’s look at the relation between \(x\) and \(z\).

ggplot(df, aes(y = x, x = z, fill=z)) +
    geom_boxplot()

The marginal relation between \(z\) and \(y\) is consistent with the conditional relation. This is because \(x\) and \(z\) are independent.
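
As a quick numerical check, we can verify that \(x\) is balanced across the two treatment groups, for instance by comparing the group means and running a two-sample t-test:

# x was simulated independently of z, so its group means should be close
tapply(x, z, mean)
# a two-sample t-test should find no significant difference
t.test(x ~ z)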

Let’s fit a linear model.

fit1 <- lm(y ~ z + x)
summary(fit1)

Call:
lm(formula = y ~ z + x)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4650 -0.1437 -0.0031  0.1498  0.4484 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.02005    0.03073  -0.652    0.515    
z            1.99099    0.02751  72.385   <2e-16 ***
x            1.04417    0.04888  21.362   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1945 on 197 degrees of freedom
Multiple R-squared:  0.9668,    Adjusted R-squared:  0.9665 
F-statistic:  2869 on 2 and 197 DF,  p-value: < 2.2e-16

We can see that the estimates are not far from the true values.

data.frame(estimate = round(fit1$coefficients, 2),
           true =c(beta0, beta1, beta2))
            estimate true
(Intercept)    -0.02    0
z               1.99    2
x               1.04    1
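
A complementary check is to look at the confidence intervals, which we would expect to cover the true values:

# 95% confidence intervals for the coefficients; the true values
# (0, 2, 1) should typically fall inside them
confint(fit1)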

What happens if we omit \(x\) from the model?

fit1b <- lm(y ~ z)
summary(fit1b)

Call:
lm(formula = y ~ z)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78648 -0.26678  0.00481  0.26019  0.80658 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.48822    0.03533   13.82   <2e-16 ***
z            1.99867    0.04996   40.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3533 on 198 degrees of freedom
Multiple R-squared:  0.8899,    Adjusted R-squared:  0.8894 
F-statistic:  1600 on 1 and 198 DF,  p-value: < 2.2e-16

In this case the estimate of the coefficient for \(z\) is still good, since \(x\) does not confound the relation between \(z\) and \(y\).

data.frame(estimate = round(fit1b$coefficients, 2),
           true =c(beta0, beta1))
            estimate true
(Intercept)     0.49    0
z               2.00    2
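
Omitting \(x\) is not completely free, though: the variation that \(x\) explained now ends up in the residuals (residual standard error 0.35 vs 0.19), which inflates the standard error of the \(z\) coefficient. A quick way to see this is to compare the confidence intervals for \(z\) from the two fits:

# the interval from the unadjusted model is noticeably wider
confint(fit1, "z")
confint(fit1b, "z")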

6.2.2 Experiment 2

Let’s now introduce some dependence between \(x\) and \(z\) in the data generation, by simulating \(x\) differently for the two categories of \(z\).

set.seed(1538)

n <- 200
z <- rep(c(0, 1), each=n/2)
x <- c(runif(n/2), runif(n/2, min = 1.5, max = 2.5))
beta0 <- 0
beta1 <- 2
beta2 <- 1
sigma <- .2

y <- beta0 + beta1 * z + beta2 * x + rnorm(n, sd = sigma)

Let’s plot the data.

df <- data.frame(x, y, z=factor(z))
ggplot(df, aes(x = x, y = y, color = z)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "black") +
    geom_smooth(method = "lm", se = FALSE)

Now we can see that, as expected, \(x\) depends on \(z\).

ggplot(df, aes(y = x, x = z, fill=z)) +
    geom_boxplot()

This translates into a difference between the marginal and conditional relation between \(z\) and \(y\).

ggplot(df, aes(y = y, x = z, fill=z)) +
    geom_boxplot()

Notice the “compound effect” of \(x\) and \(z\) on \(y\).
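
We can quantify this with the raw difference in mean \(y\) between the two groups, which mixes the effect of \(z\) with the effect of the larger \(x\) values in the treated group:

# marginal (unadjusted) difference in mean y between the z groups;
# it is larger than the true treatment effect beta1 = 2
diff(tapply(y, z, mean))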

Let’s fit the correct model.

fit2 <- lm(y ~ z + x)
summary(fit2)

Call:
lm(formula = y ~ z + x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45873 -0.14348  0.01642  0.12145  0.49850 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.03663    0.03300   -1.11    0.268    
z            1.99659    0.07518   26.56   <2e-16 ***
x            1.02294    0.04944   20.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1988 on 197 degrees of freedom
Multiple R-squared:  0.9874,    Adjusted R-squared:  0.9872 
F-statistic:  7694 on 2 and 197 DF,  p-value: < 2.2e-16

As expected, we get good estimates of the coefficients.

data.frame(estimate=round(fit2$coefficients, 2), 
           true=round(c(beta0, beta1, beta2), 2))
            estimate true
(Intercept)    -0.04    0
z               2.00    2
x               1.02    1

What happens if we now omit the \(x\) confounder?

fit2b <- lm(y ~ z)
summary(fit2b)

Call:
lm(formula = y ~ z)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.68982 -0.25933  0.00184  0.25912  0.92836 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.50825    0.03533   14.39   <2e-16 ***
z            3.43924    0.04996   68.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3533 on 198 degrees of freedom
Multiple R-squared:  0.9599,    Adjusted R-squared:  0.9597 
F-statistic:  4739 on 1 and 198 DF,  p-value: < 2.2e-16

We are overestimating the treatment effect!

data.frame(estimate=round(fit2b$coefficients, 2), 
           true=c(beta0, beta1))
            estimate true
(Intercept)     0.51    0
z               3.44    2
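
This is the classical omitted-variable bias: for a binary treatment, the unadjusted estimate is the difference in the group means of \(y\), which (up to the averaged noise) equals \(\beta_1 + \beta_2 (\bar{x}_1 - \bar{x}_0)\). A quick sketch of this decomposition:

# bias = beta2 times the difference in mean x between the groups;
# adding it to beta1 essentially reproduces the estimate above
bias <- beta2 * unname(diff(tapply(x, z, mean)))
c(true = beta1, biased = beta1 + bias)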

6.2.3 Experiment 3

Things get more complicated when we simulate coefficients with opposite signs.

set.seed(1547)

n <- 200
z <- rep(c(0, 1), each=n/2)
x <- c(runif(n/2), runif(n/2, min = 1, max = 2))
beta0 <- 0
beta1 <- -1
beta2 <- 1
sigma <- .2

y <- beta0 + beta1 * z + beta2 * x + rnorm(n, sd = sigma)

Let’s plot the data.

df <- data.frame(x, y, z=factor(z))
ggplot(df, aes(x = x, y = y, color = z)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "black") +
    geom_smooth(method = "lm", se = FALSE)

Now that the effects of the two correlated variables have opposite signs, the marginal effect of \(z\) on \(y\) is almost 0.

ggplot(df, aes(y = y, x = z, fill=z)) +
    geom_boxplot()
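
Indeed, the raw difference in mean \(y\) between the two groups is essentially zero: the positive contribution of the larger \(x\) values in the treated group cancels the negative treatment effect.

# marginal difference in mean y: the z effect (beta1 = -1) and the
# effect of the higher x values (about +1) cancel almost exactly
diff(tapply(y, z, mean))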

Let’s fit a linear model.

fit3 <- lm(y ~ z + x)
summary(fit3)

Call:
lm(formula = y ~ z + x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48522 -0.15172 -0.00461  0.14135  0.54230 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01783    0.03188  -0.559    0.577    
z           -0.98838    0.05627 -17.566   <2e-16 ***
x            1.00277    0.04980  20.138   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.193 on 197 degrees of freedom
Multiple R-squared:  0.673, Adjusted R-squared:  0.6697 
F-statistic: 202.8 on 2 and 197 DF,  p-value: < 2.2e-16

Again, the correctly specified model recovers the true parameters.

data.frame(estimate=round(fit3$coefficients, 2), 
           true=c(beta0, beta1, beta2))
            estimate true
(Intercept)    -0.02    0
z              -0.99   -1
x               1.00    1

But when omitting \(x\), we lose the significant association between \(z\) and \(y\).

fit3b <- lm(y ~ z)
summary(fit3b)

Call:
lm(formula = y ~ z)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.80393 -0.25652  0.04517  0.22719  0.84438 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.493279   0.033660  14.655   <2e-16 ***
z           0.002515   0.047602   0.053    0.958    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3366 on 198 degrees of freedom
Multiple R-squared:  1.41e-05,  Adjusted R-squared:  -0.005036 
F-statistic: 0.002792 on 1 and 198 DF,  p-value: 0.9579

In fact, the treatment effect is completely masked: its estimate shrinks to essentially 0!

data.frame(estimate=round(fit3b$coefficients, 2), 
           true=c(beta0, beta1))
            estimate true
(Intercept)     0.49    0
z               0.00   -1

6.2.4 Experiment 4

Even more dramatic effects can be observed when the variable of interest is \(x\).

set.seed(1547)

n <- 200
z <- rep(c(0, 1), each=n/2)
x <- c(runif(n/2), runif(n/2, min = 1, max = 2))
beta0 <- 0
beta1 <- -2
beta2 <- 1
sigma <- .2

y <- beta0 + beta1 * z + beta2 * x + rnorm(n, sd = sigma)

Let’s plot the data.

df <- data.frame(x, y, z=factor(z))
ggplot(df, aes(x = x, y = y, color = z)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "black") +
    geom_smooth(method = "lm", se = FALSE)

In this case, while the conditional effect of \(x\) on \(y\) is positive, its marginal effect is negative!
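
We can check this directly by comparing the slope of \(y\) on \(x\) within each treatment group with the slope obtained by pooling the two groups:

# within each group the slope is close to the true beta2 = 1,
# but pooling the groups reverses its sign
coef(lm(y ~ x, data = subset(df, z == "0")))["x"]
coef(lm(y ~ x, data = subset(df, z == "1")))["x"]
coef(lm(y ~ x, data = df))["x"]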

Let’s fit a linear model.

fit4 <- lm(y ~ z + x)
summary(fit4)

Call:
lm(formula = y ~ z + x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48522 -0.15172 -0.00461  0.14135  0.54230 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01783    0.03188  -0.559    0.577    
z           -1.98838    0.05627 -35.339   <2e-16 ***
x            1.00277    0.04980  20.138   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.193 on 197 degrees of freedom
Multiple R-squared:  0.8984,    Adjusted R-squared:  0.8974 
F-statistic: 870.9 on 2 and 197 DF,  p-value: < 2.2e-16

Once more, the correctly specified model recovers the true parameters.

data.frame(estimate=round(fit4$coefficients, 2), 
           true=c(beta0, beta1, beta2))
            estimate true
(Intercept)    -0.02    0
z              -1.99   -2
x               1.00    1

But when omitting \(z\), we invert the sign of the \(x\) coefficient!

fit4b <- lm(y ~ x)
summary(fit4b)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.28221 -0.28897  0.04716  0.35760  1.31902 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.53271    0.07517   7.087 2.36e-11 ***
x           -0.53615    0.06526  -8.216 2.71e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5214 on 198 degrees of freedom
Multiple R-squared:  0.2542,    Adjusted R-squared:  0.2505 
F-statistic:  67.5 on 1 and 198 DF,  p-value: 2.706e-14

data.frame(estimate=round(fit4b$coefficients, 2), 
           true=c(beta0, beta2))
            estimate true
(Intercept)     0.53    0
x              -0.54    1

6.3 Homework 2

Wolff and Gorb (2013) studied the frictional coefficients of the different legs of a spider.

Please, read the article, with particular focus on the analysis described in its Figure 4.

Briefly, the goal is to compare the pulling and pushing motions of different leg pairs.

Load the data available in this csv file.

Perform the following analyses (a starter code sketch is provided after the list):

  1. Visual inspection of the data: use a boxplot or similar plot to compare the distribution of the different forces (pull and push) and legs.

  2. Specify a linear model with friction as response and type as covariate. What is the effect of the force type on friction?

  3. Specify a linear model with friction as response and type and leg as covariates. What is the effect of the force type on friction? Did it change compared to the previous model? What is the effect of leg?

  4. Specify a contrast to compare leg pairs L3 and L2. Is the difference in friction significant?

  5. Specify a linear model that includes, in addition to the two covariates, their interaction. Is the interaction significant? How do you interpret the coefficients?

  6. Specify a contrast to compare the push and pull force of leg L2. Is the difference significant?
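
As a starting point, here is a minimal scaffold, not a solution: it assumes the csv contains columns named friction, type (levels pull/push), and leg (levels L1–L4), and uses "spider.csv" as a placeholder for the file linked above; the contrast in step 4 is illustrated with the multcomp package, but any equivalent approach is fine.

library(ggplot2)
library(multcomp)

# placeholder filename: replace with the csv provided above
spider <- read.csv("spider.csv")
spider$type <- factor(spider$type)
spider$leg  <- factor(spider$leg)

# 1. visual inspection: one box per combination of type and leg
ggplot(spider, aes(x = leg, y = friction, fill = type)) +
    geom_boxplot()

# 2.-3. models with type only and with type + leg
fit_type <- lm(friction ~ type, data = spider)
fit_add  <- lm(friction ~ type + leg, data = spider)

# 4. contrast L3 vs L2: with treatment coding (L1 as reference) this is
#    the difference between the legL3 and legL2 coefficients
K <- matrix(0, nrow = 1, ncol = length(coef(fit_add)),
            dimnames = list("L3 - L2", names(coef(fit_add))))
K[, c("legL3", "legL2")] <- c(1, -1)
summary(glht(fit_add, linfct = K))

# 5. model with the type x leg interaction
fit_int <- lm(friction ~ type * leg, data = spider)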

6.4 Further reading

References


  1. This lab was inspired by Regression Models for Data Science in R.

  2. This homework is from Data Analysis for the Life Sciences. Please try to solve this yourself before looking at the solution there.