8 Generalized Linear Models
8.1 Lecture Slides
8.2 Lab 1
In the United States, there have been significant, historic declines in cigarette smoking. In the 1970s, 75% of high school seniors were smoking; that number is now below 10%. This progress is largely due to the tobacco control movement and its focus on initiatives such as ending advertising to children (like Joe Camel), passing indoor smoking laws, and health communication.
According to a recent report, overall tobacco/nicotine use increased among youths (middle school and high school students) in the United States in 2017 and 2018, despite previous years of declining use.
This major increase is attributed to an increase in the use of electronic cigarette (e-cigarette) products.
The main questions are:
- How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
- How does e-cigarette use compare between males and females?
- What vaping brands and flavors appear to be used the most frequently?
- Is there a relationship between e-cigarette/vaping use and other tobacco use?
8.2.1 Guided solution
Read in the data: because we focused on data wrangling in a previous lab, I suggest that you start from the already cleaned-up version of the data, which you can find here.
- Note that you have to use the function `load()`.
- The “codebook” with the explanation of the variables can be found here.
- In addition to the variables in the codebook, some other variables have been defined that sum all the e-cigarette / non-e-cigarette products.
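Because `load()` restores objects under the names they were saved with, rather than returning the data, it can be unclear what appears in the workspace. The following is a minimal sketch of that behavior; the object `toy` and the temporary file are hypothetical stand-ins, since the real file and object names are not given here.

```r
# Hypothetical stand-in for the cleaned lab data (NOT the real dataset)
toy <- data.frame(year = c(2015, 2016), n = c(10, 12))

# Save it to a temporary .rda file, then remove it from the workspace
path <- tempfile(fileext = ".rda")
save(toy, file = path)
rm(toy)

# load() recreates the saved objects under their original names and
# invisibly returns those names, so capture them to see what was restored
loaded <- load(path)
loaded
str(get(loaded[1]))
```

Capturing the return value of `load()` is a convenient way to discover the object names when the file comes from someone else.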
Create plots to visualize the data and answer the questions above graphically.
Consider only data from 2015 and fit a model to compare current use of e-cigarettes between males and females.
- Compute “by hand” the Odds Ratio (OR) of the use of e-cigarettes between males and females. Who is more likely to smoke e-cigarettes in 2015? By how much?
- Fit a logistic regression model. Is the difference significant? How do you interpret \(\beta_1\)? What is its relationship with the OR calculated above? Is it the same?
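As a sketch of how the two computations relate, the example below uses a made-up 2x2 table (the counts are hypothetical, not the survey data). With a single binary covariate, the logistic model is saturated, so \(\exp(\hat\beta_1)\) reproduces the OR computed from the table.

```r
# Hypothetical counts (NOT the survey data): rows = sex, columns = current e-cig use
counts <- matrix(c(900, 100,
                   850, 150),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Female", "Male"),
                                 ecig = c("No", "Yes")))

# Odds of use within each sex, then their ratio (male vs. female)
odds <- counts[, "Yes"] / counts[, "No"]
or_by_hand <- unname(odds["Male"] / odds["Female"])

# Same data in long format: one row per sex-by-use cell, weighted by its count
long <- as.data.frame(as.table(counts))
fit <- glm(ecig ~ sex, family = binomial, data = long, weights = Freq)

or_by_hand                  # OR from the table
exp(coef(fit)["sexMale"])   # OR from the model: exp(beta_1)
summary(fit)$coefficients   # Wald test for beta_1
```

Here "Female" is the reference level, so \(\beta_1\) is the log odds ratio of males versus females; switching the reference level flips its sign and inverts the OR.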
Go back to the full data, and fit a new logistic model that includes year and “non_ecig_ever” in addition to Sex. If appropriate, consider including interactions. What do you conclude?
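One possible way to decide whether an interaction is "appropriate" is to compare nested models with a likelihood-ratio test. The sketch below does this on simulated data: the variable names `Sex`, `year` and `non_ecig_ever` mirror the lab, while the response name `ecig_current`, the range of years and all values are assumptions made for illustration.

```r
set.seed(1)
n <- 2000

# Simulated stand-in data; values are made up, not taken from the survey
sim <- data.frame(
  Sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  year = sample(2015:2019, n, replace = TRUE),
  non_ecig_ever = rbinom(n, 1, 0.3)
)

# Linear predictor used only to generate the toy response (no true interaction)
eta <- -2 + 0.3 * (sim$Sex == "Male") +
  0.1 * (sim$year - 2015) + 1.2 * sim$non_ecig_ever
sim$ecig_current <- rbinom(n, 1, plogis(eta))

# Main-effects model versus a model adding a Sex-by-non_ecig_ever interaction
fit_main <- glm(ecig_current ~ Sex + year + non_ecig_ever,
                family = binomial, data = sim)
fit_int <- glm(ecig_current ~ Sex * non_ecig_ever + year,
               family = binomial, data = sim)

# Likelihood-ratio test of the nested models
anova(fit_main, fit_int, test = "Chisq")
```

Whether `year` should enter as a numeric trend, as here, or as a factor is a separate modeling choice that the plots from the earlier step can help inform.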
Bonus:
- Use the `glmnet` package to fit a lasso regression that has the current use of e-cigarettes as response and any appropriate variable in the dataset as covariates.
  - Hint: use the `cv.glmnet` function to perform the cross-validation and `coef(fit.cv, s = "lambda.1se")` to obtain the coefficient estimates.
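The hint can be sketched on simulated data as follows. The covariate matrix `x` and response `y` are made up for illustration, and the sketch assumes the `glmnet` package is installed; only `cv.glmnet()` and the `coef()` call come from the hint itself.

```r
library(glmnet)

set.seed(1)
n <- 500
p <- 10

# Simulated covariates and binary response; only x1 and x2 truly matter
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1 * x[, 2]))

# alpha = 1 selects the lasso penalty; family = "binomial" gives
# penalized logistic regression, with lambda chosen by cross-validation
fit.cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the largest lambda within one standard error of the CV minimum
coef(fit.cv, s = "lambda.1se")
```

`glmnet` expects a numeric matrix rather than a data frame, so with the lab data one option is to build `x` with `model.matrix()`, which expands factors into dummy variables, and then drop the intercept column.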
8.3 Further reading
- S. Holmes and W. Huber. Modern Statistics for Modern Biology. Chapter 8.
This homework is one of the Open Case Studies. Please try to solve this yourself before looking at the solution there.