7  Generalized Linear Models (part 2)

7.1 Lecture Slides

View slides in full screen

7.2 Lab

8 1. Soay sheep fitness data 1

To illustrate the Poisson regression model, we use a simulated dataset consisting of 50 female Soay sheep.

The variables available in the dataset are:

  • fitness: Lifetime reproductive success (number of offspring born to a female during her life);
  • body.size: Body mass (size).

Read in the data and glimpse at it.

library(tidyverse)
theme_set(theme_minimal(base_size = 20))
soay <- read_csv("slides/data/SoaySheep.csv")
glimpse(soay)
Rows: 50
Columns: 2
$ body.size <dbl> 5.075262, 9.050320, 5.756232, 5.803993, 4.403065, 6.646015, …
$ fitness   <dbl> 3, 13, 4, 1, 2, 1, 2, 3, 3, 6, 1, 3, 5, 3, 3, 5, 8, 4, 4, 2,…

The goal is to study whether there is an association between body size and lifetime reproductive success, i.e., does lifetime reproductive success increase with ewe body mass?

8.0.1 Data wrangling and exploration

We can visualize the relationship between lifetime reproductive success and body mass.

ggplot(soay, aes(x = body.size, y = fitness)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_smooth(method = "glm", se = FALSE,
              method.args = list(family = "poisson"),
              color = "blue") +
  geom_smooth(method = "lm", se = FALSE, colour = "red")

Exercise

Fit a simple linear model. What does the plot Residuals vs Fitted tell you?

Solution
fit1 <- lm(fitness ~ body.size, data = soay)
summary(fit1)

Call:
lm(formula = fitness ~ body.size, data = soay)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8965 -1.9029 -0.7107  2.2464  6.2924 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.4021     2.2564  -5.053 6.73e-06 ***
body.size     2.2808     0.3268   6.979 7.91e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.577 on 48 degrees of freedom
Multiple R-squared:  0.5037,    Adjusted R-squared:  0.4933 
F-statistic: 48.71 on 1 and 48 DF,  p-value: 7.906e-09
par(mfrow=c(2,2))
plot(fit1)

From the Residuals vs Fitted plot, we can see that there is a nonlinear relationship between fitness and body.size. Therefore, a different approach may be necessary.

8.0.2 Fit a Poisson regression model

fit2 <- glm(fitness ~ body.size, family = "poisson", data = soay)
summary(fit2)

Call:
glm(formula = fitness ~ body.size, family = "poisson", data = soay)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.0770     0.4256  -4.880 1.06e-06 ***
body.size     0.4896     0.0560   8.743  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 125.589  on 49  degrees of freedom
Residual deviance:  53.653  on 48  degrees of freedom
AIC: 208.46

Number of Fisher Scoring iterations: 5
par(mfrow=c(2,2))
plot(fit2)

Exercise

What does the output of Poisson regression tell you?

The fitted model is \[\log(\lambda)=-2.0770 + 0.4896 \,\text{body.size}\] Exponentiating the coefficient on body.size gives the multiplicative factor by which the mean count changes. In this case, the mean number of offspring born changes by a factor of \(\exp(0.4896)=1.6317\), i.e., it increases by about 63.17% (since \(1.6317-1=0.6317\)) for each one-unit increase in body mass.
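This interpretation can be checked with a quick computation; the coefficient below is transcribed from the summary above.

```r
# Coefficient on body.size from summary(fit2) above
b1 <- 0.4896

# Multiplicative change in the mean count per one-unit increase in body mass
rate_ratio <- exp(b1)
rate_ratio               # about 1.632

# Expressed as a percent increase in the expected number of offspring
100 * (rate_ratio - 1)   # about 63.2%
```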

The null deviance (for the intercept-only model of fitness) is 125.589, and the deviance decreases by \(125.589-53.653=71.936\) when body size is included as a predictor. This is a large reduction, meaning the predictor explains a lot of variation, i.e., body size is an important predictor of fitness. We can verify this conclusion by performing a deviance test.

anova(fit2)
Analysis of Deviance Table

Model: poisson, link: log

Response: fitness

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                         49    125.589              
body.size  1   71.936        48     53.653 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test statistic has a value of 71.936, and the corresponding p-value is very small. This means that adding the predictor body.size significantly improves the model.

8.0.3 Prediction

Let’s go back to parameter estimation. Having estimated the two regression coefficients, we can also compute confidence intervals for those estimates.

confint(fit2)
                 2.5 %     97.5 %
(Intercept) -2.9166158 -1.2472904
body.size    0.3792023  0.5988234

This allows us to make more precise inferential statements about the relationship between lifetime reproductive success and body size.
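Because the model is on the log scale, exponentiating the interval for body.size gives a confidence interval on the rate-ratio scale; the endpoints below are transcribed from the confint() output above.

```r
# 95% CI for the body.size coefficient, from confint(fit2) above
ci_log <- c(0.3792023, 0.5988234)

# Back-transform to the multiplicative (rate-ratio) scale
exp(ci_log)  # roughly (1.46, 1.82)
```

So each additional unit of body mass multiplies the expected number of offspring by a factor between about 1.46 and 1.82.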

However, very often one of the goals of a generalized linear model is prediction. For instance, we can predict the average number of offspring produced by a ewe with body size 6, 7, or 8.

# for body.size = 6
exp(-2.0770 + 0.4896 * 6)
[1] 2.364579
# for body.size = 7
exp(-2.0770 + 0.4896 * 7)
[1] 3.858197
# for body.size = 8
exp(-2.0770 + 0.4896 * 8)
[1] 6.295279
Exercise

Predict the number of offspring produced by a ewe with body size 6, 7, and 8 using the predict() function. What do you notice?

8.0.4 Plotting the fitted curve

min.size <- min(soay$body.size)
max.size <- max(soay$body.size)

new.x <- expand.grid(body.size=seq(min.size,max.size,length=100))
new.y <- predict(fit2, newdata=new.x, se.fit=TRUE)

newdata <- data.frame(new.x, new.y)
newdata <- newdata %>% mutate(
  fitness=exp(fit),
  lwr = exp(fit-1.96*se.fit),
  upr = exp(fit+1.96*se.fit))

ggplot(soay, aes(x=body.size, y=fitness))+
  geom_point(size=3, alpha=0.5)+
  geom_smooth(data=newdata,
              aes(ymin=lwr, ymax=upr), stat='identity') +  # identity means that y values are provided
  theme_bw()

8.0.5 Check for overdispersion

Dispersion index (DI) is defined as \[DI=\frac{\text{Residual deviance}}{\text{Residual degrees of freedom}}\]

  • If \(DI \approx 1\): no overdispersion
  • If \(1.5 < DI < 2\): possible overdispersion
  • If \(DI \ge 2\): strong overdispersion
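The index can also be computed directly from a fitted glm object. As a sketch (the helper name `dispersion_index` is ours, not a standard R function), demonstrated on R's built-in warpbreaks data:

```r
# Dispersion index: residual deviance over residual degrees of freedom
dispersion_index <- function(fit) deviance(fit) / df.residual(fit)

# Example on a built-in dataset (warpbreaks is analyzed later in this lab)
fit <- glm(breaks ~ wool + tension, data = warpbreaks, family = "poisson")
dispersion_index(fit)  # about 4.2, well above 2
```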

In this example,

DI=53.653/48
DI
[1] 1.117771

Since \(DI=1.12\approx 1\), there is no overdispersion, so the Poisson model is suitable for these data.

9 2. Warpbreaks data 2

The warpbreaks dataset is included in the datasets package. It gives the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn, for different types of wool and levels of tension.

A data frame with 54 observations on 3 variables.

  • breaks: The number of breaks
  • wool: The type of wool (A or B)
  • tension: The level of tension (L, M, H)

There are measurements on 9 looms for each of the six wool-tension combinations, for a total of 54 entries in the dataset.

library(datasets)
data("warpbreaks")
glimpse(warpbreaks)
Rows: 54
Columns: 3
$ breaks  <dbl> 26, 30, 54, 25, 70, 52, 51, 26, 67, 18, 21, 29, 17, 12, 18, 35…
$ wool    <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A,…
$ tension <fct> L, L, L, L, L, L, L, L, L, M, M, M, M, M, M, M, M, M, H, H, H,…

9.1 Exploration

We can look at the distribution of breaks in the data

ggplot(warpbreaks, aes(x=breaks)) +
    geom_histogram(fill = "dodgerblue")

Check the mean and variance of this variable

mean(warpbreaks$breaks)
[1] 28.14815
var(warpbreaks$breaks)
[1] 174.2041

The variance is much greater than the mean, which suggests overdispersion relative to a Poisson model.
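We can also compare the mean and variance within each wool-tension cell; a sketch using dplyr (loaded with the tidyverse):

```r
library(dplyr)

# Mean and variance of breaks within each wool-tension combination
warpbreaks %>%
  group_by(wool, tension) %>%
  summarise(mean = mean(breaks), var = var(breaks), .groups = "drop")
```

In every cell the variance exceeds the mean, again pointing to overdispersion.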

9.2 Fit Poisson model

We fit a Poisson regression model for the number of breaks as a function of the predictors wool and tension.

fit3 <- glm(breaks ~ wool+tension, data=warpbreaks, family = "poisson")
summary(fit3)

Call:
glm(formula = breaks ~ wool + tension, family = "poisson", data = warpbreaks)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 297.37  on 53  degrees of freedom
Residual deviance: 210.39  on 50  degrees of freedom
AIC: 493.06

Number of Fisher Scoring iterations: 4
Exercise

What do these results tell you?

9.3 Model selection

Is the inclusion of the tension variable informative or not? Which model is better, with or without this variable?

To test the global significance of the tension variable, we can fit a model without it and perform an analysis of deviance (likelihood ratio test) comparing the Poisson GLMs with and without tension.

fit4 <- glm(breaks ~ wool, data=warpbreaks, family = "poisson")
summary(fit4)

Call:
glm(formula = breaks ~ wool, family = "poisson", data = warpbreaks)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.43518    0.03454  99.443  < 2e-16 ***
woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 297.37  on 53  degrees of freedom
Residual deviance: 281.33  on 52  degrees of freedom
AIC: 560

Number of Fisher Scoring iterations: 4
anova(fit3,fit4)
Analysis of Deviance Table

Model 1: breaks ~ wool + tension
Model 2: breaks ~ wool
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        50     210.39                          
2        52     281.33 -2  -70.942 3.938e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A significant Chi-square test indicates that tension significantly improves model fit.

9.3.1 Consider interaction

fit5 <- glm(breaks ~ wool*tension, data=warpbreaks, family = "poisson")
summary(fit5)

Call:
glm(formula = breaks ~ wool * tension, family = "poisson", data = warpbreaks)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     3.79674    0.04994  76.030  < 2e-16 ***
woolB          -0.45663    0.08019  -5.694 1.24e-08 ***
tensionM       -0.61868    0.08440  -7.330 2.30e-13 ***
tensionH       -0.59580    0.08378  -7.112 1.15e-12 ***
woolB:tensionM  0.63818    0.12215   5.224 1.75e-07 ***
woolB:tensionH  0.18836    0.12990   1.450    0.147    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 297.37  on 53  degrees of freedom
Residual deviance: 182.31  on 48  degrees of freedom
AIC: 468.97

Number of Fisher Scoring iterations: 4
anova(fit3,fit5)
Analysis of Deviance Table

Model 1: breaks ~ wool + tension
Model 2: breaks ~ wool * tension
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        50     210.39                          
2        48     182.31  2   28.087 7.962e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is very small, which means the interaction is significant, i.e., the difference in the number of breaks between wool types changes across tension levels.

Note

If the interaction is significant, do not interpret the main effects alone. Instead, interpret how the effect of one variable changes across the levels of the other variable.

Effects of interest = linear combinations of main + interaction terms.
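As an illustration of such a linear combination, the expected number of breaks for wool B at tension M sums the intercept, both main effects, and the interaction term. Because the interaction model has one parameter per wool-tension cell, this fitted mean coincides with the observed cell mean:

```r
# Refit the interaction model (same as fit5 above)
fit5 <- glm(breaks ~ wool * tension, data = warpbreaks, family = "poisson")
cf <- coef(fit5)

# Expected breaks for wool B, tension M:
# intercept + both main effects + interaction term, then back-transform
mu_BM <- exp(cf["(Intercept)"] + cf["woolB"] + cf["tensionM"] + cf["woolB:tensionM"])

# Observed mean for that cell
obs_BM <- with(warpbreaks, mean(breaks[wool == "B" & tension == "M"]))

c(fitted = unname(mu_BM), observed = obs_BM)  # both about 28.78
```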

9.3.2 Plot fitted value

new.x <- expand.grid(wool=c("A","B"), tension=c("L","M","H"))
new.y <- predict(fit5, newdata=new.x, se.fit=TRUE)

newdata <- data.frame(new.x, new.y)
newdata <- newdata %>% mutate(
  breaks = exp(fit),
  lwr = exp(fit-1.96*se.fit),
  upr = exp(fit+1.96*se.fit))

ggplot(newdata, aes(x=tension, y=breaks, group=wool, col=wool))+
  geom_point()+
  geom_line() +
  geom_errorbar(aes(ymin=lwr, ymax=upr), width=0.1)+
  scale_color_manual(values=c("black","dodgerblue"))+ 
  xlab("Tension")+
  ylab("Breaks")+
  theme_bw()

9.3.3 Check overdispersion

The dispersion index is

DI = 182.31/48
DI
[1] 3.798125

Since \(DI= 3.8 > 2\), there is strong overdispersion. Hence, the Poisson model underestimates the variability in the data, and a different approach may be necessary. A common solution is to fit a negative binomial model.

library(MASS)
fit6 <- glm.nb(breaks ~ wool * tension, data=warpbreaks)
summary(fit6)

Call:
glm.nb(formula = breaks ~ wool * tension, data = warpbreaks, 
    init.theta = 12.08216462, link = log)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)      3.7967     0.1081  35.116  < 2e-16 ***
woolB           -0.4566     0.1576  -2.898 0.003753 ** 
tensionM        -0.6187     0.1597  -3.873 0.000107 ***
tensionH        -0.5958     0.1594  -3.738 0.000186 ***
woolB:tensionM   0.6382     0.2274   2.807 0.005008 ** 
woolB:tensionH   0.1884     0.2316   0.813 0.416123    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(12.0822) family taken to be 1)

    Null deviance: 86.759  on 53  degrees of freedom
Residual deviance: 53.506  on 48  degrees of freedom
AIC: 405.12

Number of Fisher Scoring iterations: 1

              Theta:  12.08 
          Std. Err.:  3.30 

 2 x log-likelihood:  -391.125 
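One way to compare the Poisson and negative binomial fits is through AIC (a sketch; lower is better):

```r
library(MASS)  # for glm.nb

fitP  <- glm(breaks ~ wool * tension, data = warpbreaks, family = "poisson")
fitNB <- glm.nb(breaks ~ wool * tension, data = warpbreaks)

AIC(fitP, fitNB)  # Poisson about 469, negative binomial about 405
```

The negative binomial model has a substantially lower AIC, consistent with the overdispersion diagnosed above.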

10 Campus Crime 3

This dataset collects and reports information about crime occurring on campus. The data consist of 81 observations on six variables:

  • Enrollment: enrollment at the school;
  • type: college (C) or university (U);
  • nv: the number of violent crimes for that institution for the given year;
  • nvrate: number of violent crimes per 1000 students;
  • enroll1000: enrollment at the school, in thousands;
  • region: region of the country (C = Central, MW = Midwest, NE = Northeast, SE = Southeast, SW = Southwest, and W = West)

Read in the data and take a look at the first rows.

c_data <- read_csv("https://raw.githubusercontent.com/proback/BeyondMLR/refs/heads/master/data/c_data.csv")
head(c_data)
# A tibble: 6 × 6
  Enrollment type     nv nvrate enroll1000 region
       <dbl> <chr> <dbl>  <dbl>      <dbl> <chr> 
1       5590 U        30 5.37         5.59 SE    
2        540 C         0 0            0.54 SE    
3      35747 U        23 0.643       35.7  W     
4      28176 C         1 0.0355      28.2  W     
5      10568 U         1 0.0946      10.6  SW    
6       3127 U         0 0            3.13 SW    

The goal is to study whether there is an association between the number of violent crimes, the region, and the type of school.

10.1 Data wrangling and exploration

We combine the SW and SE categories into a single South category because the Southwestern colleges have an observed rate of 0.

c_data <- c_data %>%
  mutate(
    region2 = case_when(
      region %in% c("SW", "SE") ~ "S",
      TRUE ~ region
    ))

We then turn the categorical variables into factors.

c_data <- c_data |>
    mutate(type = factor(type), region2 = factor(region2)) 
head(c_data)
# A tibble: 6 × 7
  Enrollment type     nv nvrate enroll1000 region region2
       <dbl> <fct> <dbl>  <dbl>      <dbl> <chr>  <fct>  
1       5590 U        30 5.37         5.59 SE     S      
2        540 C         0 0            0.54 SE     S      
3      35747 U        23 0.643       35.7  W      W      
4      28176 C         1 0.0355      28.2  W      W      
5      10568 U         1 0.0946      10.6  SW     S      
6       3127 U         0 0            3.13 SW     S      

We can now look at the violent crime rates across school types and regions.

c_data |>
    ggplot(aes(region2, nvrate, fill=type)) +
    geom_boxplot()

This plot shows that

  • In general, the mean violent crime rate of universities is lower than that of colleges within a region (except in the Northeast).
  • The regional pattern of rates at universities appears to differ from that of the colleges across different regions.
  • There are several outliers.

We filter out the outliers

c_data_filt<- c_data %>%
  filter(nvrate <= 2)

Plot the density of the number of violent crimes

ggplot(c_data_filt, aes(x=nv)) +
    geom_histogram(aes(y = after_stat(density)), fill = "dodgerblue", bins = 12)

From the histogram, we can see that many schools reported few crimes while a few schools reported a large number of crimes. Moreover, the shape of the distribution does not look normal. Therefore, Poisson regression should be considered to model the data.

10.2 Fit Poisson linear models

It is also important to note that the counts are not directly comparable because they come from schools of different sizes. We expect schools with more students to have more reports of violent crime, since there are more students who could be affected. Therefore, we take the differences in enrollment into account by including an offset in our model, i.e., we consider a Poisson regression of the form \[\log\left(\frac{\lambda}{\text{enroll1000}}\right)=\beta_0+\beta_1 \text{type}\] or equivalently \[\log(\lambda)=\beta_0+\beta_1 \text{type}+\log(\text{enroll1000})\]
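As a quick sanity check of the offset formulation (a simulated sketch, not part of the campus crime data), we can generate counts whose mean is proportional to an exposure and verify that a Poisson GLM with a log-exposure offset recovers the rate parameters:

```r
set.seed(42)
n <- 500
exposure <- runif(n, 1, 40)              # e.g., enrollment in thousands
x <- rbinom(n, 1, 0.5)                   # a binary predictor (like type)
lambda <- exposure * exp(-1 + 0.5 * x)   # rate model: log(lambda/exposure) = -1 + 0.5 x
y <- rpois(n, lambda)

fit <- glm(y ~ x, family = "poisson", offset = log(exposure))
coef(fit)  # estimates should be close to (-1, 0.5)
```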

fit7 <- glm(nv ~ type + region2, family = "poisson", offset=log(enroll1000), data=c_data_filt)
summary(fit7)

Call:
glm(formula = nv ~ type + region2, family = "poisson", data = c_data_filt, 
    offset = log(enroll1000))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.7969     0.1870  -9.611  < 2e-16 ***
typeU         0.5567     0.1536   3.625 0.000289 ***
region2MW     0.1004     0.1775   0.565 0.571763    
region2NE     0.5663     0.1607   3.524 0.000425 ***
region2S      0.5836     0.1490   3.918 8.94e-05 ***
region2W      0.3077     0.1874   1.641 0.100701    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 323.91  on 77  degrees of freedom
Residual deviance: 283.86  on 72  degrees of freedom
AIC: 499.55

Number of Fisher Scoring iterations: 5

These results tell us that the Northeast and the South differ significantly from the Central region. The estimated coefficient of 0.5663 means that the violent crime rate per 1000 students in the Northeast is 1.76 (\(=\exp(0.5663)\)) times that of the Central region, for the same type of school.
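The same rate-ratio computation applies to the other significant coefficients; the values below are transcribed from the summary of fit7 above.

```r
# Estimated coefficients from summary(fit7) above
est <- c(typeU = 0.5567, region2NE = 0.5663, region2S = 0.5836)

# Rate ratios relative to the baseline (colleges in the Central region)
round(exp(est), 2)  # typeU 1.74, region2NE 1.76, region2S 1.79
```

For example, universities report about 74% more violent crimes per 1000 students than colleges in the same region.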

Exercise
  1. Fit the Poisson linear model on the original data (without filtering out outliers). What do you notice?

We also note that the effect of the type may vary depending on the region, so we also consider a Poisson regression model with an interaction between type and region.

fit8 <- glm(nv ~ type * region2, family = "poisson", offset = log(enroll1000), data = c_data_filt)
summary(fit8)

Call:
glm(formula = nv ~ type * region2, family = "poisson", data = c_data_filt, 
    offset = log(enroll1000))

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.4741     0.3536  -4.169 3.05e-05 ***
typeU             0.1959     0.3775   0.519  0.60377    
region2MW        -1.9765     1.0607  -1.863  0.06239 .  
region2NE         1.1732     0.3994   2.938  0.00331 ** 
region2S         -0.1562     0.4859  -0.322  0.74779    
region2W         -1.8337     0.7906  -2.319  0.02037 *  
typeU:region2MW   2.1965     1.0765   2.040  0.04132 *  
typeU:region2NE  -0.8016     0.4381  -1.830  0.06732 .  
typeU:region2S    0.8121     0.5108   1.590  0.11185    
typeU:region2W    2.4106     0.8140   2.962  0.00306 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 323.91  on 77  degrees of freedom
Residual deviance: 236.02  on 68  degrees of freedom
AIC: 459.7

Number of Fisher Scoring iterations: 6
Exercise
  1. What do these results tell you?

These results confirm the significance of the interaction between the effect of region and the type of institution. We can also perform a test to see whether the interaction is significant.

anova(fit7, fit8)
Analysis of Deviance Table

Model 1: nv ~ type + region2
Model 2: nv ~ type * region2
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        72     283.86                          
2        68     236.02  4   47.841 1.018e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

10.3 Check overdispersion

The dispersion index is

DI = 236.02/68
DI
[1] 3.470882

Since \(DI= 3.47 > 2\), there is strong overdispersion. Hence, a different approach, such as a negative binomial model, is necessary.

10.4 Further reading

  • S. Holmes and W. Huber. Modern Statistics for Modern Biology. Chapter 8.

  1. The warpbreaks and Soay sheep fitness examples are based on the Poisson regression activities lecture.↩︎

  2. The warpbreaks and Soay sheep fitness examples are based on the Poisson regression activities lecture.↩︎

  3. The campus crime example is from Beyond Multiple Linear Regression.↩︎