[1] -1.4813459 1.7060283 0.7773327 4.7381955 1.1061364 -1.8361890
[7] 2.2874821 0.2966952 1.0253173 2.0514608
Advanced Statistics and Data Analysis
One of my personal goals for this course is to convince you of how important the concept of reproducible research is.
This is a fundamental topic in modern statistics and data science, and it is particularly important when dealing with large and complex data, such as those we find in computational biology.
Here I will illustrate some practical tools that we can use to ensure reproducibility of the results.
From a collaborator:
Davide,
Please find attached the updated data.
It would be best to re-run the whole analysis to update the figures for the manusctipt.
Best,
Your lovely collaborator
Important
A data analsyis is reproducible if, starting from the raw data and the code used to analyze them, another analyst can reproduce the results of the original analysis.
A study is reproducible if starting from the same data we arrive at the same results.
A study is replicable if starting from new, independent data we arrive at the same conclusions.
Warning
Reproducibility does not imply correctness!
Source: Karl Broman Tools for reproducible research
knitr / R markdown / Quarto
Git / version control
Docker
Source: Karl Broman Tools for reproducible research
Literate programming was introduced by Donald Knuth as a way to weave together programming and natural language.
The idea is that the natural language explanation and the program itself live in the same document.
“I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature.”
“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
R has a long history of supporting literate programming.
knitr, its predecessor Sweave, and rmarkdown, are three packages that make it possible.
Lately, Quarto has been developed as a language-agnostic alternative, able to combine R and Python with markdown to produce html and pdf documents.
Quarto can be used for websites, books, and more.
Markdown is a mark-up language, very easy to write and human readable.
It’s easy to create lists, italic, bold text, e.g.:
- One
- _Two_
- __Three__
Renders as
Code “chunks” can be included with the following syntax.
Depending on the options, we can decide to show the code or not, and to execute it or not, printing outputs and even figures.
Depending on the options, we can decide to show the code or not, and to execute it or not, printing outputs and even figures.
YAML is a human-readable language that can be used to include options at the beginning of your documents.
There are essentially three possibilities when you work on a long project (think about your thesis!).
Obviously 2 is better than 1!
But what happens if something goes wrong and you need to find (quickly) the last working version?
What happens if you send a version to your collaborator and then keep working on the file?
Git is a version control system developed by Linus Torvalds to work on the Linux project.
It tracks all file types, but not ideal for binary or proprietary files.
It’s integrated in R Studio and is freely available as an open source software.
Free websites offer it as a service (Github, Gitlab, …)
Important
How come?
When R packages (or R itself) are updated, the developers may change default values or fix bugs.
This may cause your code to not work anymore.
Or worse, it could change your results.
One solution is to add the command sessionInfo() to all your documents.
This way, we can record the R version and the version of all the packages that we have loaded in our session.
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS 15.1.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Rome
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.4.0 fastmap_1.2.0 cli_3.6.3 tools_4.4.0
[5] htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10 rmarkdown_2.29
[9] knitr_1.49 jsonlite_1.8.9 xfun_0.50 digest_0.6.37
[13] rlang_1.1.4 evaluate_1.0.3
Create and share a “virtual machine” that contains all the software needed to reproduce your analysis.
This container will have the exact same versions of all the software that you need and will guarantee that whoever runs it will get the exact results that you had.
Popular software to achieve this are Docker and Singularity.
A good R-centric first intro to Docker and singularity is provided by the Bioconductor project here.