3  Data Wrangling and Visualization (part 2)

3.1 Lecture Slides

View slides in full screen

3.2 Lab: tidy data in R

3.2.1 What’s not covered (i.e., prerequisites)

Basic R syntax is not covered in this lab, as we assume that you are already familiar with it. Most of the concepts in this lab should be accessible to people with minimal exposure to R (Googling what you don’t remember is allowed – and encouraged!).

If you need help getting started with R, this is a good and free tutorial: https://swcarpentry.github.io/r-novice-gapminder/

3.2.2 What’s covered (i.e., outline)

In this second lab, we will cover how to:

  • import the data into R
  • clean the data and transform them into tidy data
  • work with more complex data structures

3.2.3 Import and export the data

There are several functions to import data into R, depending on the format in which the data live outside of R. First, we will assume that the data exist in a standard text format, such as tab-delimited or comma-separated. We will then briefly mention some packages that can be used to import data from MS Excel, Matlab, and other non-standard formats.

Base R has many useful functions for importing flat data tables, read.table and read.csv among others. However, we will use the readr package here; in particular, we will illustrate the read_csv, read_tsv, and read_delim functions. These functions are faster and offer more flexibility than their base R counterparts. Sometimes, however, read.table can still be useful, for instance when one wants to use row.names.
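As a minimal sketch of the row.names use case (samples.csv is a hypothetical file whose first column contains unique sample identifiers):

## base R import, promoting the first column to row names
df <- read.table("samples.csv", header = TRUE, sep = ",", row.names = 1)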

3.2.3.1 Read the data into R

Often the data come in the form of csv files (comma-separated values). Excel spreadsheets can also be saved as csv files, making csv a useful format for reading into R data generated with Excel.

Another popular format is tsv files (tab-separated values). This is, for instance, how the gapminder data have been saved to file in the gapminder package.

Finally, read_delim can be used to read files whose fields are separated by any character (most commonly a space), which needs to be specified in the delim argument.
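For example, a space-delimited file could be read as follows (measurements.txt is a hypothetical file name):

## read a file whose fields are separated by a single space
dat <- read_delim("measurements.txt", delim = " ")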

Here, we will illustrate the use of read_tsv since the other two functions are very similar.

First, we need a string that describes the location on disk of the file that we want to import into R. Note that the string will change depending on your OS.

library(readr)
gap_tsv <- system.file("extdata/gapminder.tsv", package = "gapminder")
gap_tsv
[1] "/Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library/gapminder/extdata/gapminder.tsv"
gap <- read_tsv(gap_tsv)
Rows: 1704 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): country, continent
dbl (4): year, lifeExp, pop, gdpPercap

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gap
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Note how the read_tsv function printed a message with the type that it inferred for each of the columns.

3.2.3.2 Write the data to file

Similarly to read_tsv and read_csv, the functions write_tsv and write_csv can be used to write R tables to files on disk.
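For instance, we could write the gap tibble that we just imported back to disk (the output file name is arbitrary):

## write the gapminder tibble to a tab-separated file in the working directory
write_tsv(gap, "gapminder_copy.tsv")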

Note that writing flat tables to csv or tsv files is the easiest and most reproducible approach. For complex data structures, say a histogram or a clustering tree, one can use the save function to store a binary R representation of the object in a .rda file. Similarly, the load function will read .rda files back into R.
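A minimal sketch (the object and file names here are arbitrary):

## a clustering tree is not a flat table, so save a binary representation
fit <- hclust(dist(mtcars))
save(fit, file = "fit.rda")
## load() restores the object, under its original name, in a later session
load("fit.rda")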

3.2.3.3 Additional resources for data import/export

Sometimes you may need to import data from different formats, e.g., because your collaborator has the data in a MS Excel file or because you are continuing in R an analysis started with SPSS, Stata, or SAS.

Fortunately, R has many packages that can be used for data import/export:

  • haven can be used to read SPSS, Stata, and SAS files.
  • readxl to read Excel files (both .xls and .xlsx).
  • DBI to import data from a variety of databases (advanced).
  • jsonlite to import JSON files.
  • xml2 to import XML files.
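As an example, reading the first sheet of an Excel file with readxl might look like this (budget.xlsx is a hypothetical file):

library(readxl)
## read the first sheet of an Excel workbook into a tibble
budget <- read_excel("budget.xlsx", sheet = 1)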

3.2.4 Tidy the data: the tidyr package

Often, a large amount of time is spent on “cleaning” the data, in what is sometimes referred to as data wrangling.

This is because often the data are plagued by missing or implausible values, either because of technical issues with the data collection or because of human error. Moreover, data are often recorded in a way that is useful for storing them, but not ideal for analyzing them.

The objective of data wrangling is to take a messy dataset and make it tidy, so that it becomes easier to work with.

3.2.4.1 What are tidy data?

The concept of tidy data was introduced by Hadley Wickham in his seminal paper.

There are three fundamental rules which make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

The advantage of tidy data is both conceptual and practical: it allows you to have consistency between datasets and to use tools designed for this data structure. Moreover, because R works extremely well with vectorized operations, it is very efficient to operate on the columns of a data frame.
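For instance, with the tidy gapminder data each variable is its own column, so a new variable can be computed in a single vectorized step (a sketch using the gap tibble imported above):

## total GDP is the elementwise product of two columns
gap$gdp <- gap$pop * gap$gdpPercap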

3.2.4.2 Examples of tidy and untidy data

To illustrate the concept of tidy data, we can use the examples in the EDAWR package. First we need to download and install it from GitHub with the remotes package.

library(remotes)
install_github("rstudio/EDAWR")
library(EDAWR)
storms
    storm wind pressure       date
1 Alberto  110     1007 2000-08-03
2    Alex   45     1009 1998-07-27
3 Allison   65     1005 1995-06-03
4     Ana   40     1013 1997-06-30
5  Arlene   50     1010 1999-06-11
6  Arthur   45     1010 1996-06-17
cases
  country  2011  2012  2013
1      FR  7000  6900  7000
2      DE  5800  6000  6200
3      US 15000 14000 13000
pollution
      city  size amount
1 New York large     23
2 New York small     14
3   London large     22
4   London small     16
5  Beijing large    121
6  Beijing small     56
Exercise

Take some time to list and describe the variables present in each dataset. Which of these datasets is tidy?

Note that somebody’s tidy data is someone else’s untidy data. For instance, in the pollution example one could argue that the two variables in addition to city are particle size and particle amount; in this case the dataset is tidy. But another argument is that the variables are the amount of small particles and the amount of large particles; in this case the dataset is untidy.

For this reason, I actually prefer another terminology: long data and wide data. The pollution dataset is stored in long format, while the cases dataset is stored in wide format.

To switch between long and wide format, the tidyr package provides two extremely useful functions: pivot_longer() and pivot_wider().

3.2.4.3 Make your data “tall”: pivot_longer()

The function pivot_longer() can be used to transform the data from wide to tall.

The cases data frame is in wide form. If we want to make it tall, we can use the following command.

library(tidyr)
pivot_longer(cases, names_to = "year", values_to = "n", cols = 2:4)
# A tibble: 9 × 3
  country year      n
  <chr>   <chr> <dbl>
1 FR      2011   7000
2 FR      2012   6900
3 FR      2013   7000
4 DE      2011   5800
5 DE      2012   6000
6 DE      2013   6200
7 US      2011  15000
8 US      2012  14000
9 US      2013  13000

The names_to and values_to arguments are the names of the new columns that will contain the old column names and their values, respectively. The cols argument specifies the columns that contain the data and that should be transformed.
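Note that cols accepts tidyselect expressions, so the same call can be written by excluding the country column instead of using column positions (an equivalent formulation of the call above):

## all columns except country are pivoted
pivot_longer(cases, names_to = "year", values_to = "n", cols = -country)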

3.2.4.4 Make your data “wide”: pivot_wider()

Analogously, the pivot_wider() function lets us go back to wide data from tall data.

If we pivot_wider a table that we previously pivot_longered, we should return to the original data representation.

cases
  country  2011  2012  2013
1      FR  7000  6900  7000
2      DE  5800  6000  6200
3      US 15000 14000 13000
pivot_longer(cases, names_to = "year", values_to = "n", cols = 2:4)  |>
  pivot_wider(names_from = "year", values_from = "n")
# A tibble: 3 × 4
  country `2011` `2012` `2013`
  <chr>    <dbl>  <dbl>  <dbl>
1 FR        7000   6900   7000
2 DE        5800   6000   6200
3 US       15000  14000  13000

As we mentioned, sometimes the wide format is the tidy format. This could be the case for the pollution data, which we can transform in the following way.

pivot_wider(pollution, names_from = "size", values_from = "amount")
# A tibble: 3 × 3
  city     large small
  <chr>    <dbl> <dbl>
1 New York    23    14
2 London      22    16
3 Beijing    121    56

Additional functions useful for transforming the data are separate() and unite(). You can explore them further as an exercise; a quick illustration is below.
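As a self-contained sketch (df is a made-up toy table): separate() splits one column into several, and unite() does the reverse.

library(tibble)
df <- tibble(id = c("a-1", "b-2", "c-3"))  # a toy table with a combined column
df |> separate(id, into = c("letter", "number"), sep = "-")  # split one column into two
df |>
    separate(id, into = c("letter", "number"), sep = "-") |>
    unite("id", letter, number, sep = "-")  # paste them back together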

Exercise

Explore the dataset airquality:

  • What are the variables? What are the observations?
  • Transform it into a taller dataset, so that it will look like this:
# A tibble: 612 × 4
   Month   Day variable value
   <int> <int> <chr>    <dbl>
 1     5     1 Ozone     41  
 2     5     1 Solar.R  190  
 3     5     1 Wind       7.4
 4     5     1 Temp      67  
 5     5     2 Ozone     36  
 6     5     2 Solar.R  118  
 7     5     2 Wind       8  
 8     5     2 Temp      72  
 9     5     3 Ozone     12  
10     5     3 Solar.R  149  
# ℹ 602 more rows
  • Starting from the created dataset, use pivot_wider to turn it back into the original
Solution
Code
air_long <- airquality |>
    pivot_longer(c(Ozone, Solar.R, Wind, Temp), 
                 names_to = "variable", 
                 values_to = "value")

air_wide <- air_long |>
    pivot_wider(names_from = "variable", values_from = "value")

3.2.4.5 Joining tables

In the tidy data framework, each table has one row per observational unit and one column per variable. But what if our analysis involves multiple types of observational units? In that case, there might be one table per observational unit type, and you might need to join tables to perform certain data analyses.

The dplyr package contains several join functions. See ?inner_join for a full list. Here, we will cover only the inner and full joins, with the other operations left for homework exercises.

To illustrate joins, we will use the songs and artists datasets available in the EDAWR package. As you can see the observational units are quite different in the two tables, but nonetheless we might need both sets of variables in a single analysis.

songs
                 song  name
1 Across the Universe  John
2       Come Together  John
3      Hello, Goodbye  Paul
4           Peggy Sue Buddy
artists
    name  plays
1 George  sitar
2   John guitar
3   Paul   bass
4  Ringo  drums

The inner join only returns those observations that are present in both datasets. You can think of it as an “intersection.”

library(dplyr)
inner_join(songs, artists, by = "name")
                 song name  plays
1 Across the Universe John guitar
2       Come Together John guitar
3      Hello, Goodbye Paul   bass

Conversely, the full join returns all observations that are present in either dataset. You can think of it as a “union.”

full_join(songs, artists, by = "name")
                 song   name  plays
1 Across the Universe   John guitar
2       Come Together   John guitar
3      Hello, Goodbye   Paul   bass
4           Peggy Sue  Buddy   <NA>
5                <NA> George  sitar
6                <NA>  Ringo  drums

Finally, the left and right join keep the left or right observations, respectively.

left_join(songs, artists)
                 song  name  plays
1 Across the Universe  John guitar
2       Come Together  John guitar
3      Hello, Goodbye  Paul   bass
4           Peggy Sue Buddy   <NA>
right_join(songs, artists)
                 song   name  plays
1 Across the Universe   John guitar
2       Come Together   John guitar
3      Hello, Goodbye   Paul   bass
4                <NA> George  sitar
5                <NA>  Ringo  drums

Note how we can omit the by argument if we want to join on all the variable names common to the two tables.

Exercise

Load the nycflights13 package to explore the four datasets flights, airlines, weather, and planes.

  • Join the tables flights and airlines in a way that each flight contains the name of its airline
  • Join the resulting table with the weather information
  • Join the resulting table with the airplane information
  • Use tidyverse functions to extract the complete information (including weather and airplane info) of the UA 1545 flight that departed on Jan 1st, 2013.
  • Plot departure delay versus wind speed: is there a relation between these variables? (Hint: avoid overplotting)
Solution
Code
## 1-3
library(nycflights13)
library(ggplot2)

complete_data <- flights |>
    left_join(airlines) |>   # joins by carrier: adds the airline name
    left_join(weather) |>    # joins by origin and date/time: adds the weather
    ## planes also has a `year` column (year built), which clashes with the
    ## flight year, so join by tail number only and suffix the clashing names
    left_join(planes, by = "tailnum", suffix = c("", "_plane"))

## 4
complete_data |>
    filter(year == 2013, month == 1, day == 1, carrier == "UA", flight == 1545)

## 5: geom_hex() bins the points, which avoids overplotting
## (requires the hexbin package)
complete_data |>
    ggplot(aes(x = wind_speed, y = dep_delay)) +
    geom_hex()

3.2.5 The “tidyverse” vs. “base R”

As you look at examples or find answers to your questions online, you will notice that people refer to the set of packages that includes dplyr, tidyr, and ggplot2 as the tidyverse. You will often see people asking for a “tidyverse” way of doing something or for a “base R” way of doing something.

There is no right or wrong way to achieve something in R, but considerations about code readability and efficiency may be important depending on the application.
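As an illustration, here is the same subsetting operation in both styles (a sketch, assuming the gap tibble from earlier is still in memory and dplyr is loaded):

## "base R" way: bracket indexing with a logical condition
gap[gap$continent == "Asia" & gap$year == 2007, ]

## "tidyverse" way: the dplyr filter verb
gap |> filter(continent == "Asia", year == 2007)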

3.3 Homework 1

Health policy in the United States is complicated, and several forms of healthcare coverage exist, including both coverage by federal government-led healthcare policy and by private insurance companies. Before making any inference about the relationship between health condition and health policy, it is important for us to have a general idea about healthcare economics in the United States. Thus, we are interested in getting a sense of healthcare coverage and healthcare spending across States. More specifically, the questions are:

  1. Is there a relationship between healthcare coverage and healthcare spending in the United States?
  2. How does the spending distribution change across geographic regions in the United States?
  3. Does the relationship between healthcare coverage and healthcare spending in the United States change from 2013 to 2014?

3.3.1 Guided solution

  1. Read-in the data: use the read_csv function (readr package) to read the healthcare-coverage.csv and the healthcare-spending.csv files (found at data/KFF).

  2. Load the state information found in the state datasets (datasets package). Note that you need to manually add information about the District of Columbia (see the sketch after this list).

  3. Add the abbreviation and the region of each state to the coverage dataset.

  4. Join the coverage and spending datasets.

  5. Use ggplot to produce a scatterplot of the proportion of coverage by state vs. the spending per capita. Color code the points by region and add state abbreviations as labels. Use facet to stratify the analysis by type of coverage.

  6. Use ggplot to create a boxplot of spending per capita stratified by region.

  7. Repeat the graph in point 5 but faceting by year, in addition to coverage type. Hint: use the facet_grid() function.
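For step 2, a possible sketch (tibble and bind_rows come from the tidyverse; the District of Columbia belongs to the South Census region):

library(dplyr)
## collect the built-in state vectors into one table
state_data <- tibble(name   = state.name,
                     abb    = state.abb,
                     region = as.character(state.region))
## the built-in vectors do not include the District of Columbia: add it manually
state_data <- bind_rows(state_data,
                        tibble(name = "District of Columbia",
                               abb = "DC",
                               region = "South"))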

3.4 Further reading


  1. This homework is one of the Open Case Studies. Please try to solve it yourself before looking at the solution there.