library(SummarizedExperiment)
library(GenomicRanges)
library(airway)
SummarizedExperiment class
One of the main strengths of the Bioconductor project lies in the use of a common data infrastructure that powers interoperability across packages.
Users should be able to analyze their data using functions from
different Bioconductor packages without the need to convert between
formats. To this end, the SummarizedExperiment class (from the SummarizedExperiment package) serves as the common currency for data exchange across hundreds of Bioconductor packages.
This class implements a data structure that stores all aspects of the data (gene-by-sample expression data, per-sample metadata and per-gene annotation) and manipulates them in a synchronized manner.
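To make "synchronized" concrete, here is a minimal sketch that builds a small object by hand and subsets it. All values and names (gene1, sample1, condition, biotype) are made up for illustration and are not part of the airway data used below.

```r
library(SummarizedExperiment)

# Toy data: every value and name here is invented for illustration
counts <- matrix(
  1:12, nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)
sample_info <- DataFrame(
  condition = c("ctrl", "trt", "ctrl"),
  row.names = colnames(counts)
)
gene_info <- DataFrame(
  biotype = c("protein_coding", "lincRNA", "protein_coding", "antisense"),
  row.names = rownames(counts)
)

se <- SummarizedExperiment(
  assays = list(counts = counts),
  colData = sample_info,
  rowData = gene_info
)

# One subsetting call trims the assay and both annotation tables together
se_sub <- se[c("gene1", "gene3"), se$condition == "ctrl"]
dim(se_sub)
rownames(rowData(se_sub))
rownames(colData(se_sub))
```

Because the three components travel together, there is no way for the sample annotation to fall out of step with the columns of the count matrix after filtering.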
Let’s start with an example dataset.
data(airway)
airway
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
We can think of this class (and others like it) as a container that holds several different pieces of data in so-called slots.
Getter methods extract information from the slots and setter methods add or replace information in them. Using these methods, rather than accessing the slots directly, is the intended way to interact with the objects.
Depending on the object, slots can contain different types of data (e.g., numeric matrices, lists, etc.). Here we review the main slots of the SummarizedExperiment class as well as their getter/setter methods.
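As a quick sketch of the getter/setter pattern, the same function name is used on the right-hand side to read a component and on the left-hand side of an assignment to write it. The note element added to the metadata below is a hypothetical entry chosen only for illustration.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

# Getters: call the function named after the component
assayNames(airway)        # names of the stored matrices
dim(colData(airway))      # dimensions of the sample annotation table
length(metadata(airway))  # free-form list of experiment-level metadata

# Setter: the same function name, used on the left of an assignment
metadata(airway)$note <- "example annotation added for illustration"
names(metadata(airway))
```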
assays
This is arguably the most fundamental part of the object: it contains the count matrix and, potentially, other matrices with transformed data. We can access the list of matrices with the assays function and individual matrices with the assay function.
assay(airway)[1:3, 1:3]
## SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003 679 448 873
## ENSG00000000005 0 0 0
## ENSG00000000419 467 515 621
You will notice that in this case we have a regular matrix inside the object. More generally, any “matrix-like” object can be used, e.g., sparse matrices or HDF5-backed matrices.
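A transformed matrix can be stored next to the raw counts as a second assay. The sketch below assumes a log2 transformation with a pseudocount of 1; the assay name logcounts is an illustrative choice, not something the airway object ships with.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

# Add a log-transformed copy of the counts as a second assay.
# Both the name "logcounts" and the pseudocount of 1 are illustrative choices.
assay(airway, "logcounts") <- log2(assay(airway, "counts") + 1)

# The object now holds two matrices with identical dimensions and dimnames
assayNames(airway)
assay(airway, "logcounts")[1:3, 1:3]
```

All assays in one object must share the same dimensions, which is what allows rows and columns to be subset consistently across them.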
colData and rowData
Conceptually, these are two data frames that annotate the columns and the rows of your assay, respectively.
One can interact with them as usual, e.g., by extracting columns or adding additional variables as columns.
colData(airway)
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
rowData(airway)
## DataFrame with 63677 rows and 10 columns
## gene_id gene_name entrezid gene_biotype
## <character> <character> <integer> <character>
## ENSG00000000003 ENSG00000000003 TSPAN6 NA protein_coding
## ENSG00000000005 ENSG00000000005 TNMD NA protein_coding
## ENSG00000000419 ENSG00000000419 DPM1 NA protein_coding
## ENSG00000000457 ENSG00000000457 SCYL3 NA protein_coding
## ENSG00000000460 ENSG00000000460 C1orf112 NA protein_coding
## ... ... ... ... ...
## ENSG00000273489 ENSG00000273489 RP11-180C16.1 NA antisense
## ENSG00000273490 ENSG00000273490 TSEN34 NA protein_coding
## ENSG00000273491 ENSG00000273491 RP11-138A9.2 NA lincRNA
## ENSG00000273492 ENSG00000273492 AP000230.1 NA lincRNA
## ENSG00000273493 ENSG00000273493 RP11-80H18.4 NA lincRNA
## gene_seq_start gene_seq_end seq_name seq_strand
## <integer> <integer> <character> <integer>
## ENSG00000000003 99883667 99894988 X -1
## ENSG00000000005 99839799 99854882 X 1
## ENSG00000000419 49551404 49575092 20 -1
## ENSG00000000457 169818772 169863408 1 -1
## ENSG00000000460 169631245 169823221 1 1
## ... ... ... ... ...
## ENSG00000273489 131178723 131182453 7 -1
## ENSG00000273490 54693789 54697585 HSCHR19LRC_LRC_J_CTG1 1
## ENSG00000273491 130600118 130603315 HG1308_PATCH 1
## ENSG00000273492 27543189 27589700 21 1
## ENSG00000273493 58315692 58315845 3 1
## seq_coord_system symbol
## <integer> <character>
## ENSG00000000003 NA TSPAN6
## ENSG00000000005 NA TNMD
## ENSG00000000419 NA DPM1
## ENSG00000000457 NA SCYL3
## ENSG00000000460 NA C1orf112
## ... ... ...
## ENSG00000273489 NA RP11-180C16.1
## ENSG00000273490 NA TSEN34
## ENSG00000273491 NA RP11-138A9.2
## ENSG00000273492 NA AP000230.1
## ENSG00000273493 NA RP11-80H18.4
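The row annotation can be used to filter the object or extended with new variables. The sketch below keeps only the protein-coding genes and adds a per-gene total; the column name total_count is hypothetical.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

# Select rows using a column of rowData; which() guards against NA values
keep <- which(rowData(airway)$gene_biotype == "protein_coding")
airway_pc <- airway[keep, ]
dim(airway_pc)

# Add a new per-gene variable (hypothetical column name "total_count")
rowData(airway_pc)$total_count <- rowSums(assay(airway_pc))
head(rowData(airway_pc)[, c("gene_name", "gene_biotype", "total_count")], 3)
```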
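Note the $ shortcut: applied directly to the object, it reads or writes a column of colData, so airway$dex is equivalent to colData(airway)$dex.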
## [1] TRUE
## DataFrame with 8 rows and 10 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample my_sum
## <factor> <factor> <factor> <numeric>
## SRR1039508 SRX384345 SRS508568 SAMN02422669 20637971
## SRR1039509 SRX384346 SRS508567 SAMN02422675 18809481
## SRR1039512 SRX384349 SRS508571 SAMN02422678 25348649
## SRR1039513 SRX384350 SRS508572 SAMN02422670 15163415
## SRR1039516 SRX384353 SRS508575 SAMN02422682 24448408
## SRR1039517 SRX384354 SRS508576 SAMN02422673 30818215
## SRR1039520 SRX384357 SRS508579 SAMN02422683 19126151
## SRR1039521 SRX384358 SRS508580 SAMN02422677 21164133
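Output like the above can be obtained with the $ shortcut along the following lines. This is a sketch under two assumptions: the column compared in the first call (dex) is an arbitrary choice, and my_sum is taken to be the per-sample sum of the count matrix.

```r
library(SummarizedExperiment)
library(airway)
data(airway)

# `$` on the object reaches directly into colData
identical(airway$dex, colData(airway)$dex)

# The shortcut also works for assignment.
# Assumption: my_sum holds the total count per sample.
airway$my_sum <- colSums(assay(airway))
colData(airway)
```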
rowRanges
You might have noticed that our example object belongs to a special subclass of SummarizedExperiment.
class(airway)
## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
This means that, in addition to the slots already discussed, it contains a rowRanges component, which stores the genomic locations of the genes as a GRanges or GRangesList object. Here it is a GRangesList with one element per gene, each holding that gene's exons.
class(rowRanges(airway))
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
rowRanges(airway)
## GRangesList object of length 63677:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] X 99883667-99884983 - | 667145 ENSE00001459322
## [2] X 99885756-99885863 - | 667146 ENSE00000868868
## [3] X 99887482-99887565 - | 667147 ENSE00000401072
## [4] X 99887538-99887565 - | 667148 ENSE00001849132
## [5] X 99888402-99888536 - | 667149 ENSE00003554016
## ... ... ... ... . ... ...
## [13] X 99890555-99890743 - | 667156 ENSE00003512331
## [14] X 99891188-99891686 - | 667158 ENSE00001886883
## [15] X 99891605-99891803 - | 667159 ENSE00001855382
## [16] X 99891790-99892101 - | 667160 ENSE00001863395
## [17] X 99894942-99894988 - | 667161 ENSE00001828996
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome
##
## ...
## <63676 more elements>
A number of operations are implemented for GRanges and GRangesList objects, including subset, split, length, subsetByOverlaps and others.
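subsetByOverlaps can also be applied to the whole RangedSummarizedExperiment, which keeps the assays and annotation in sync with the selected ranges. The region below is a hypothetical interval on chromosome X, chosen only because it covers the first two genes shown earlier.

```r
library(SummarizedExperiment)
library(GenomicRanges)
library(airway)
data(airway)

# Hypothetical region of interest; airway names this chromosome "X", not "chrX"
roi <- GRanges(seqnames = "X", ranges = IRanges(start = 99800000, end = 99900000))

# Keep only the genes with at least one exon overlapping the region
airway_roi <- subsetByOverlaps(airway, roi)
dim(airway_roi)
head(rownames(airway_roi))
```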
For instance, we can find the number of exons of the first gene and their average length in nucleotides with the following code.
# Extract the first gene
gene1 <- rowRanges(airway)[[1]]
# Number of exons
length(gene1)
## [1] 17
# Average exon length in nucleotides
mean(width(gene1))
## [1] 216.9412
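The same accessors extend from one gene to all genes. As a small sketch, lengths() returns the number of exons per gene for the whole GRangesList, and reduce() merges overlapping exons before their widths are summed; the variable names are illustrative.

```r
library(SummarizedExperiment)
library(GenomicRanges)
library(airway)
data(airway)

# Number of exons for every gene at once
n_exons <- lengths(rowRanges(airway))
head(n_exons, 3)

# Exonic length of the first gene, counting overlapping exons only once
gene1 <- rowRanges(airway)[[1]]
sum(width(reduce(gene1)))
```

Merging with reduce() matters here because some exons of the first gene overlap, so a plain sum of widths would count those bases more than once.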
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] airway_1.20.0 SummarizedExperiment_1.30.2
## [3] Biobase_2.60.0 GenomicRanges_1.52.0
## [5] GenomeInfoDb_1.36.2 IRanges_2.34.1
## [7] S4Vectors_0.38.1 BiocGenerics_0.46.0
## [9] MatrixGenerics_1.12.3 matrixStats_1.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 bitops_1.0-7 stringi_1.7.12
## [4] lattice_0.21-8 digest_0.6.33 magrittr_2.0.3
## [7] evaluate_0.21 grid_4.3.0 fastmap_1.1.1
## [10] rprojroot_2.0.3 jsonlite_1.8.7 Matrix_1.6-1
## [13] purrr_1.0.2 textshaping_0.3.6 jquerylib_0.1.4
## [16] abind_1.4-5 cli_3.6.1 rlang_1.1.1
## [19] crayon_1.5.2 XVector_0.40.0 cachem_1.0.8
## [22] DelayedArray_0.26.7 yaml_2.3.7 S4Arrays_1.0.5
## [25] tools_4.3.0 memoise_2.0.1 GenomeInfoDbData_1.2.10
## [28] vctrs_0.6.3 R6_2.5.1 lifecycle_1.0.3
## [31] zlibbioc_1.46.0 stringr_1.5.0 fs_1.6.3
## [34] ragg_1.2.5 desc_1.4.2 pkgdown_2.0.7
## [37] bslib_0.5.1 glue_1.6.2 systemfonts_1.0.4
## [40] xfun_0.40 highr_0.10 knitr_1.43
## [43] htmltools_0.5.6 rmarkdown_2.24 compiler_4.3.0
## [46] RCurl_1.98-1.12