The SummarizedExperiment class

One of the main strengths of the Bioconductor project lies in the use of a common data infrastructure that powers interoperability across packages.

Users should be able to analyze their data using functions from different Bioconductor packages without the need to convert between formats. To this end, the SummarizedExperiment class (from the SummarizedExperiment package) serves as the common currency for data exchange across hundreds of Bioconductor packages.

This class implements a data structure that stores all aspects of the data (gene-by-sample expression data, per-sample metadata, and per-gene annotation) and manipulates them in a synchronized manner.

Let’s start with an example dataset.

library(SummarizedExperiment)
library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
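To make the "synchronized" part concrete, here is a minimal sketch that subsets the example object. The name `sub` is only chosen for illustration. Subsetting rows and columns of the object should carry the row and column annotation along with the assay.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

# Keep the first five genes and the samples where dex == "trt"
sub <- airway[1:5, airway$dex == "trt"]

dim(sub)            # assay dimensions after subsetting
dim(rowData(sub))   # one annotation row per remaining gene
dim(colData(sub))   # one annotation row per remaining sample
```

A single indexing operation is enough. There is no need to subset the count matrix and the annotation tables separately.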

We can think of this class (and others like it) as a container that holds several different pieces of data in so-called slots.

Getter methods extract information from the slots and setter methods add or replace information in the slots. These methods are the recommended way to interact with the objects, rather than accessing the slots directly.

Depending on the object, slots can contain different types of data (e.g., numeric matrices, lists, etc.). We will here review the main slots of the SummarizedExperiment class as well as their getter/setter methods.
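As a rough sketch of the getter/setter pattern, the code below works on a copy of the example object. The copy `se` and the metadata field `note` are made up for this illustration.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

se <- airway   # work on a copy

# Getters: functions named after the component they return
assayNames(se)
head(colnames(se), 3)

# Setters: the same function name on the left-hand side of an assignment
metadata(se)$note <- "added via the metadata() setter"   # 'note' is a made-up field
metadata(se)$note

# The original object is unchanged, since R copies on modification
"note" %in% names(metadata(airway))
```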

The assays

The assays are arguably the most fundamental part of the object: they hold the count matrix and, potentially, other matrices with transformed data. We can access the list of matrices with the assays function and individual matrices with the assay function.

assay(airway)[1:3, 1:3]
##                 SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003        679        448        873
## ENSG00000000005          0          0          0
## ENSG00000000419        467        515        621

You will notice that in this case we have a regular matrix inside the object. More generally, any “matrix-like” object can be used, e.g., sparse matrices or HDF5-backed matrices.
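The following sketch shows how a transformed matrix might be stored next to the counts, and how a sparse matrix can serve as an assay. The assay name `logcounts`, the pseudocount of 1, and the toy object `toy` are assumptions made for this example, not part of the dataset.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

se <- airway

# Store a log-transformed copy of the counts as a second assay
assay(se, "logcounts") <- log2(assay(se, "counts") + 1)
assayNames(se)

# assay() returns the first assay by default; others are requested by name
assay(se, "logcounts")[1:3, 1:3]

# A sparse matrix from the Matrix package can be used as an assay as well
m <- matrix(c(0, 0, 3, 0, 1, 0), nrow = 3,
            dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
toy <- SummarizedExperiment(
  assays = list(counts = Matrix::Matrix(m, sparse = TRUE))
)
is(assay(toy), "sparseMatrix")
```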

The colData and rowData

Conceptually, these are two data frames that annotate the columns and the rows of your assay, respectively.

One can interact with them as usual, e.g., by extracting columns or adding additional variables as columns.

colData(airway)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677
rowData(airway)
## DataFrame with 63677 rows and 10 columns
##                         gene_id     gene_name  entrezid   gene_biotype
##                     <character>   <character> <integer>    <character>
## ENSG00000000003 ENSG00000000003        TSPAN6        NA protein_coding
## ENSG00000000005 ENSG00000000005          TNMD        NA protein_coding
## ENSG00000000419 ENSG00000000419          DPM1        NA protein_coding
## ENSG00000000457 ENSG00000000457         SCYL3        NA protein_coding
## ENSG00000000460 ENSG00000000460      C1orf112        NA protein_coding
## ...                         ...           ...       ...            ...
## ENSG00000273489 ENSG00000273489 RP11-180C16.1        NA      antisense
## ENSG00000273490 ENSG00000273490        TSEN34        NA protein_coding
## ENSG00000273491 ENSG00000273491  RP11-138A9.2        NA        lincRNA
## ENSG00000273492 ENSG00000273492    AP000230.1        NA        lincRNA
## ENSG00000273493 ENSG00000273493  RP11-80H18.4        NA        lincRNA
##                 gene_seq_start gene_seq_end              seq_name seq_strand
##                      <integer>    <integer>           <character>  <integer>
## ENSG00000000003       99883667     99894988                     X         -1
## ENSG00000000005       99839799     99854882                     X          1
## ENSG00000000419       49551404     49575092                    20         -1
## ENSG00000000457      169818772    169863408                     1         -1
## ENSG00000000460      169631245    169823221                     1          1
## ...                        ...          ...                   ...        ...
## ENSG00000273489      131178723    131182453                     7         -1
## ENSG00000273490       54693789     54697585 HSCHR19LRC_LRC_J_CTG1          1
## ENSG00000273491      130600118    130603315          HG1308_PATCH          1
## ENSG00000273492       27543189     27589700                    21          1
## ENSG00000273493       58315692     58315845                     3          1
##                 seq_coord_system        symbol
##                        <integer>   <character>
## ENSG00000000003               NA        TSPAN6
## ENSG00000000005               NA          TNMD
## ENSG00000000419               NA          DPM1
## ENSG00000000457               NA         SCYL3
## ENSG00000000460               NA      C1orf112
## ...                          ...           ...
## ENSG00000273489               NA RP11-180C16.1
## ENSG00000273490               NA        TSEN34
## ENSG00000273491               NA  RP11-138A9.2
## ENSG00000273492               NA    AP000230.1
## ENSG00000273493               NA  RP11-80H18.4

Note the $ shortcut, which accesses the columns of colData directly from the object.

identical(colData(airway)$cell, airway$cell)
## [1] TRUE
airway$my_sum <- colSums(assay(airway))
colData(airway)
## DataFrame with 8 rows and 10 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
##            Experiment    Sample    BioSample    my_sum
##              <factor>  <factor>     <factor> <numeric>
## SRR1039508  SRX384345 SRS508568 SAMN02422669  20637971
## SRR1039509  SRX384346 SRS508567 SAMN02422675  18809481
## SRR1039512  SRX384349 SRS508571 SAMN02422678  25348649
## SRR1039513  SRX384350 SRS508572 SAMN02422670  15163415
## SRR1039516  SRX384353 SRS508575 SAMN02422682  24448408
## SRR1039517  SRX384354 SRS508576 SAMN02422673  30818215
## SRR1039520  SRX384357 SRS508579 SAMN02422683  19126151
## SRR1039521  SRX384358 SRS508580 SAMN02422677  21164133
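The row annotation can be extended in the same way, although it has to go through the rowData getter because the $ shortcut refers to colData. Below is a small sketch. The column name `total` and the object `expressed` are hypothetical names chosen for this example.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

se <- airway

# Add a per-gene summary as a new rowData column
rowData(se)$total <- rowSums(assay(se))   # 'total' is a made-up column name

# Use the annotation to filter: drop genes with zero counts in every sample
expressed <- se[rowData(se)$total > 0, ]
nrow(expressed) < nrow(se)

# Column annotation can be summarized with ordinary R functions
table(se$dex)
```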

The rowRanges

You might have noticed that our example object belongs to a special subclass of SummarizedExperiment.

class(airway)
## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"

This means that, in addition to the slots already discussed, it contains a rowRanges component, which is a GRanges or GRangesList object with the genomic locations of the genes. In this dataset it is a GRangesList with one element per gene, each listing the exons of that gene.

class(rowRanges(airway))
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
rowRanges(airway)
## GRangesList object of length 63677:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        X 99883667-99884983      - |    667145 ENSE00001459322
##    [2]        X 99885756-99885863      - |    667146 ENSE00000868868
##    [3]        X 99887482-99887565      - |    667147 ENSE00000401072
##    [4]        X 99887538-99887565      - |    667148 ENSE00001849132
##    [5]        X 99888402-99888536      - |    667149 ENSE00003554016
##    ...      ...               ...    ... .       ...             ...
##   [13]        X 99890555-99890743      - |    667156 ENSE00003512331
##   [14]        X 99891188-99891686      - |    667158 ENSE00001886883
##   [15]        X 99891605-99891803      - |    667159 ENSE00001855382
##   [16]        X 99891790-99892101      - |    667160 ENSE00001863395
##   [17]        X 99894942-99894988      - |    667161 ENSE00001828996
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome
## 
## ...
## <63676 more elements>
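As a quick sketch of how rowRanges lines up with the rest of the object, the checks below compare it against the rows of the assay. The variable name `rr` is just a convenience for this example.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

rr <- rowRanges(airway)

# One list element per row of the assay, in the same order
length(rr) == nrow(airway)
identical(names(rr), rownames(airway))

# Elements can be pulled out by gene identifier
rr[["ENSG00000000003"]]

# Subsetting the object subsets the ranges along with everything else
length(rowRanges(airway[1:3, ]))
```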

GRanges and GRangesList operations

A number of operations are implemented for GRanges and GRangesList objects, including subset, split, length, subsetByOverlaps and others.

For instance, we can find the number of exons of the first gene and their average length in nucleotides with the following code.

# Extract the first gene
gene1 <- rowRanges(airway)[[1]]

# Number of exons
length(gene1)
## [1] 17
# Average exon length
gene1 |> width() |> mean()
## [1] 216.9412
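The same kind of summary can be computed for all genes at once, and subsetByOverlaps can be applied to the whole object rather than to a single gene. The sketch below assumes a region of interest on chromosome X. The coordinates of `roi` were picked by hand for illustration so that they cover the first gene shown above.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

rr <- rowRanges(airway)

# Number of exons for every gene at once
n_exons <- elementNROWS(rr)
n_exons[["ENSG00000000003"]]

# Hypothetical region of interest (coordinates chosen for illustration)
roi <- GRanges(seqnames = "X",
               ranges = IRanges(start = 99800000, end = 99900000))

# Keep only the genes with at least one exon overlapping the region
hits <- subsetByOverlaps(airway, roi)
rownames(hits)
```

Because `hits` is again a RangedSummarizedExperiment, its assay, colData and rowData are restricted to the overlapping genes as well.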

Session Info

## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] airway_1.20.0               SummarizedExperiment_1.30.2
##  [3] Biobase_2.60.0              GenomicRanges_1.52.0       
##  [5] GenomeInfoDb_1.36.2         IRanges_2.34.1             
##  [7] S4Vectors_0.38.1            BiocGenerics_0.46.0        
##  [9] MatrixGenerics_1.12.3       matrixStats_1.0.0          
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.7              bitops_1.0-7            stringi_1.7.12         
##  [4] lattice_0.21-8          digest_0.6.33           magrittr_2.0.3         
##  [7] evaluate_0.21           grid_4.3.0              fastmap_1.1.1          
## [10] rprojroot_2.0.3         jsonlite_1.8.7          Matrix_1.6-1           
## [13] purrr_1.0.2             textshaping_0.3.6       jquerylib_0.1.4        
## [16] abind_1.4-5             cli_3.6.1               rlang_1.1.1            
## [19] crayon_1.5.2            XVector_0.40.0          cachem_1.0.8           
## [22] DelayedArray_0.26.7     yaml_2.3.7              S4Arrays_1.0.5         
## [25] tools_4.3.0             memoise_2.0.1           GenomeInfoDbData_1.2.10
## [28] vctrs_0.6.3             R6_2.5.1                lifecycle_1.0.3        
## [31] zlibbioc_1.46.0         stringr_1.5.0           fs_1.6.3               
## [34] ragg_1.2.5              desc_1.4.2              pkgdown_2.0.7          
## [37] bslib_0.5.1             glue_1.6.2              systemfonts_1.0.4      
## [40] xfun_0.40               highr_0.10              knitr_1.43             
## [43] htmltools_0.5.6         rmarkdown_2.24          compiler_4.3.0         
## [46] RCurl_1.98-1.12