This site gathers and describes all scripts used for the analysis of single-cell RNA sequencing (scRNAseq) data, from count matrices to manuscript figures.
All computational analyses were conducted using the R language within RStudio. We used the rmarkdown package to develop notebooks in RMD format. An RMD file is organized into two types of sections. “Chunk” sections contain code in a programming language, such as R, and their output is displayed below the chunk. Text sections, placed around the chunks, hold explanatory comments in human-readable language and can be added at will. The text is generally formatted with the Markdown language. Since the notebook is then compiled into an HTML file, the HTML language can also be used.
RMD files were compiled to generate HTML files, which are shared through this site. Section titles are written in bold. The plain-text items listed in the contents box on the left are clickable, and each one links to one HTML file. Once a file is open, a link in the top left corner lets you open it in a new tab. In the top right corner, the “Code” button lets you download the RMD file behind the HTML.
This section covers data availability, including a succinct presentation of the datasets considered in this study.
All datasets considered in this study, such as FASTQ files and count matrices, are available in the functional genomics data collection from EMBL-EBI (ArrayExpress) under the accession number E-MTAB-13334.
Eleven datasets were analyzed in the article, whose main characteristics are described in the table below.
In this table, “Project name” corresponds to the unique name given to the DNA library associated with the tissue and subjected to sequencing. “Genotype” is provided in an abbreviated form, using the following definitions:
Mice from all models were sacrificed at the age of 3 months, and the tissue considered corresponds to subcutaneous nerves. Each project name corresponds to a single mouse, except that several mice were pooled together to obtain enough cells from plexiform neurofibromas (pNF). Sample identifiers were generated based on genotype and condition, stressing the difference between samples from nerves and from pNFs.
“Condition” corresponds to the housing condition of the mice, which were either grouped together in cages or kept separately.
Finally, we defined a specific custom color for each sample type (genotype + condition), depicted as a colored bullet in the first column, while the associated HTML color codes are provided in the last column. Control mice kept alone are associated with green shades, while control mice kept together are in gold shades. Grey, blue, pink and red denote the other sample types.
This section explains how computational analyses were conducted. The analyses were spread over several notebooks, following a branching organisation (i.e. a tree structure).
A workflow of our analysis from raw count matrices to manuscript figures is presented below.
First (Aa), each BCL folder, corresponding to one project name, is processed to obtain a count matrix. The count matrix contains as many rows as there are annotated genes in the transcriptome annotation, and as many columns as there were droplets containing RNA. This initial matrix is called the raw count matrix. Step (B) is performed before the next steps, in order to generate the metadata associated with all samples (see table above), a specific color for each cell type, and marker genes for cell type annotation. Next (Ab), each sample (“project name”) is analyzed individually, in order to filter out low-quality cells, generate a two-dimensional (2D) projection to visualize cells, and annotate cells with clusters, cell type, cell cycle, etc. Further analyses are organised along two branches. Along the first branch (C), all datasets are combined to generate a common 2D representation and visualize all cells as an atlas. Along the second branch (D), a population of interest is extracted from each individual dataset, and the cells, together with their annotations, are merged into a new dataset. As for the atlas, this new dataset is processed to generate a 2D representation to visualize the selected cells. Dataset construction and annotation are always kept separate from the analysis: separate notebooks are developed to conduct differential expression, enrichment analysis, and so on. Finally (E), figures for the manuscript are made from the various files saved at each step.
Analyses were conducted in specific directories, organised in a tree as shown below:
cat tree.txt
## tree -n -P "*Rmd" ./.. > tree.txt
##
## ./..
## ├── 0_intro
## │   ├── 1_welcome.Rmd
## │   └── 2_mat_and_meth.Rmd
## ├── 1_input
## │   ├── 2020_09
## │   ├── 2020_17
## │   ├── 2022_35
## │   ├── 2022_36
## │   ├── 2022_37
## │   ├── 2022_38
## │   ├── 2022_41
## │   ├── 2022_42
## │   ├── 2023_02
## │   ├── 2023_04
## │   └── 2023_06
## ├── 2_metadata
## │   └── 1_build_metadata.Rmd
## ├── 3_individual
## │   ├── 1_make_individual.Rmd
## │   └── datasets
## ├── 4_combined
## │   ├── 41_combined_ctrl
## │   │   └── 1_make_combined_ctrl.Rmd
## │   └── 42_combined_all
## │       └── 1_make_combined_all.Rmd
## ├── 5_zoom
## │   ├── 51_zoom_SC
## │   │   ├── 1_zoom_SC_dataset.Rmd
## │   │   └── 2_zoom_SC_analysis.Rmd
## │   ├── 52_zoom_epiFb
## │   │   ├── 1_zoom_epiFb_dataset.Rmd
## │   │   └── 2_zoom_epiFb_analysis.Rmd
## │   ├── 53_zoom_periFb
## │   │   ├── 1_zoom_periFb_dataset.Rmd
## │   │   └── 2_zoom_periFb_analysis.Rmd
## │   ├── 54_zoom_endoFb
## │   │   ├── 1_zoom_endoFb_dataset.Rmd
## │   │   └── 2_zoom_endoFb_analysis.Rmd
## │   └── 55_zoom_immune
## │       ├── 1_zoom_immune_dataset.Rmd
## │       └── 2_zoom_immune_analysis.Rmd
## ├── 6_figures
## │   ├── figures_detail
## │   └── figures.Rmd
## └── index_builder
##
## 28 directories, 17 files
These directories can be described as follows:
Note that the menu on the left follows almost the same structure.
This section explains how we ensure the stability of notebook compilation. We used more than 40 packages of interest to conduct the analysis, and each package has its own set of dependencies, often requiring specific versions. Dependencies include other R packages, as well as operating system-related tools. Hence, it is challenging to compile such notebooks on another computer, without paying attention to the required environment. Moreover, to ensure full reproducibility of the analysis, the exact same versions of all packages should be deployed.
To enforce the same R package environment for all analyses, we generated a Singularity container holding the specific versions of all R packages used.
To build the container, we defined a dependency tree for all packages of interest:
package_of_interest = c("base", "aquarius",
                        "Seurat", "ggplot2", "patchwork", "infercnv",
                        "dplyr", "slingshot", "TInGa", "ComplexHeatmap",
                        "nichenetr", "org.Mm.eg.db", "infercnv", "rtracklayer",
                        "stringr", "RColorBrewer", "viridis", "circlize",
                        "ggvenn", "gridExtra", "gtable", "grid", "harmony",
                        "dynplot", "dynmethods", "AnnotationDbi", "msigdbr",
                        "clusterProfiler", "enrichplot",
                        "EnhancedVolcano", "AUCell", "Matrix", "hdf5r",
                        "knitr", "kableExtra", "corrplot", "dynutils", "ggtext")
Next, we generated a table defining an installation order for all these packages, such that, when a package is installed, all its dependencies have already been installed. This way, we avoid installation conflicts. These two steps are performed by the aquarius::repro_installation_order function, which generates a table including links to the stored tar.gz files.
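The notion of an installation order can be illustrated with a topological sort. The sketch below is not the aquarius::repro_installation_order implementation; it only uses the standard tsort utility on a short, hand-written edge list (a simplified assumption, not the real dependency metadata) to show how dependencies end up before the packages that need them. The function name install_order is hypothetical.

```shell
# Each input line reads "dependency package": the left-hand item must be
# installed before the right-hand one. The edge list is a hand-written,
# simplified illustration, not real dependency metadata.
install_order() {
  tsort <<'EOF'
Matrix Seurat
ggplot2 Seurat
ggplot2 patchwork
EOF
}

install_order   # prints one package per line, dependencies first
```

Several valid orders may exist for the same edge list; any of them guarantees that a package is listed after everything it depends on.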
Finally, the container was built using the definition file (https://github.com/audrey-onfroy/Mansour_et_al).
aquarius package
We include all our wrapper and home-made functions in an R package called aquarius, which is accessible in the GitHub repository (https://github.com/audrey-onfroy/Mansour_et_al). The version used for the analyses described here is embedded in the Singularity container.
To ensure reproducibility of all the analyses from count matrices to the final figures and results, RMD files were compiled using a Singularity container. First, RMD notebooks were developed within RStudio. Next, we opened a Unix terminal and ran bash commands to compile the RMD files using Singularity. For example, the following code:
file="1_welcome"
singularity exec \
  --bind /home \
  /path/to/singularity/image/singularity.simg \
  Rscript -e "rmarkdown::render(input = '${file}.Rmd', output_file = '${file}.html')"
was used to launch the compilation of the notebook that produces this “Welcome” document itself. We tell Singularity to execute a command (the 5th line) using the container specified on the 4th line, binding the home directory to the container (the 3rd line) so that data stored outside the container are available within it. On our machine, the analysis directory is located in the home directory.
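Since the same command applies to every parameter-free notebook, it can be wrapped in a small shell function. The sketch below is a hypothetical helper, not part of the original workflow: the names render_notebook, SIMG and DRY_RUN are assumptions introduced for illustration, and the image path is the placeholder used in the command above.

```shell
# Hypothetical helper: compile one notebook through the container.
# SIMG defaults to the placeholder image path used in the command above.
SIMG="${SIMG:-/path/to/singularity/image/singularity.simg}"

render_notebook() {
  file="${1%.Rmd}"   # accept "1_welcome" or "1_welcome.Rmd"
  r_call="rmarkdown::render(input = '${file}.Rmd', output_file = '${file}.html')"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only show the command that would be run.
    printf '%s\n' "singularity exec --bind /home $SIMG Rscript -e \"$r_call\""
  else
    singularity exec --bind /home "$SIMG" Rscript -e "$r_call"
  fi
}

# Usage: DRY_RUN=1 render_notebook 1_welcome
```

The dry-run mode prints the assembled command without calling Singularity, which is convenient for checking the paths before launching a long compilation.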
Except for the notebook focusing on individual datasets, all notebooks are parameter-free, so they can be compiled using the same command. The 1_make_individual notebook requires one parameter called sample_name, which corresponds to the project name described earlier. The project name should be defined in the metadata file, along with the color, sample identifier and so on, because this information is used to annotate the individual datasets. The project name should also correspond to a folder name in the input folder. To generate the individual datasets, we thus use the following bash commands:
for sample in $(ls ../1_input/)
do
  echo $sample
  singularity exec \
    --bind /home \
    /path/to/singularity/image/singularity.simg \
    Rscript -e "rmarkdown::render(input = './1_make_individual.Rmd',
                                  output_file = './${sample}.html',
                                  params = list(sample_name = '${sample}'))"
done
This code is very similar to the preceding one, except that we define a “sample” variable before running the singularity command. The variable is passed as the sample_name parameter to compile the notebook. Moreover, to avoid copying and pasting the same command multiple times, we iterate it (for loop) over all the subdirectories of the input folder. To follow the progress of the calculations, the sample name is printed at each iteration of the for loop.
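A possible variant of this loop, given here only as a sketch, iterates over a glob pattern instead of the output of ls. With a trailing slash, the pattern matches subdirectories only, so stray files in the input folder are ignored and names containing spaces are preserved. The function names for_each_sample and render_sample are hypothetical; the singularity command is the one shown above.

```shell
# Run a command once per subdirectory of an input folder, passing the
# subdirectory name (the project name) as the last argument.
for_each_sample() {
  input_dir="$1"
  shift
  for dir in "$input_dir"/*/; do
    [ -d "$dir" ] || continue      # the glob matched nothing
    "$@" "$(basename "$dir")"
  done
}

# Compile the individual notebook for one sample (same command as above).
render_sample() {
  echo "$1"
  singularity exec \
    --bind /home \
    /path/to/singularity/image/singularity.simg \
    Rscript -e "rmarkdown::render(input = './1_make_individual.Rmd',
                                  output_file = './${1}.html',
                                  params = list(sample_name = '${1}'))"
}

# Usage, from the 3_individual directory:
#   for_each_sample ../1_input render_sample
```

Running `for_each_sample ../1_input echo` first lists the project names that would be processed, without compiling anything.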
Since R libraries are not stored in the same location inside the Singularity container and on the local machine, we call the .libPaths() R command at the beginning of each notebook. In the compiled HTML file, we can thus easily check whether the notebook was compiled through the container. An improvement would be to print the version of the Singularity container used. In the Singularity container, libraries are located in /usr/local/lib/R/library.
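One way to approach the improvement mentioned above is to report which container, if any, is running. The sketch below assumes that the Singularity runtime defines the SINGULARITY_CONTAINER and SINGULARITY_NAME environment variables inside the container (to be checked against the installed Singularity version); the function name describe_runtime is hypothetical. The image name only stands in for a version if the image file name encodes one.

```shell
# Report whether the current shell runs inside a Singularity container.
# Assumption: the runtime sets SINGULARITY_CONTAINER (image path) and
# SINGULARITY_NAME (image name) inside the container.
describe_runtime() {
  if [ -n "${SINGULARITY_CONTAINER:-}" ]; then
    echo "container: ${SINGULARITY_NAME:-unknown} (${SINGULARITY_CONTAINER})"
  else
    echo "container: none (local machine)"
  fi
}

describe_runtime
```

The same variables could be read from within a notebook and printed next to the .libPaths() output.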
To test the reproducibility of the analyses, we ran them several times at intervals of 2-3 months on the same Linux-based machine. To compare two compilations of the same notebook made on different days, we focused on the number of cells after the quality control filtering steps, the tSNE representation, cell type annotation, clustering and Gene Set Enrichment Analysis results. On the same Linux-based machine, the results turned out to be stable. Between a Mac-based machine and a Linux-based machine, everything except the 2D representation turned out to be identical. In the tSNE, the main visual trend was the same, but some details differed. Since tSNE is only used for visualization and not for further analysis, and since the differences were minor, we consider that our analyses are globally reproducible. The differences observed with tSNE could be due to machine accuracy and numerical rounding.
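Part of such a comparison can be automated when the quantities of interest are exported to text files. The sketch below is not how the comparison above was performed; it assumes, hypothetically, that each compilation writes its results (for example the number of cells after filtering) as CSV files into its own directory, and it reports which files are byte-identical between two runs. The function name compare_runs and the directory layout are assumptions.

```shell
# Compare the CSV files of a first run against the files of the same name
# in a second run. Returns 0 when every file is identical, 1 otherwise.
compare_runs() {
  rc=0
  for file in "$1"/*.csv; do
    [ -e "$file" ] || continue     # the glob matched nothing
    name=$(basename "$file")
    if cmp -s "$file" "$2/$name" 2>/dev/null; then
      echo "identical: $name"
    else
      echo "different or missing: $name"
      rc=1
    fi
  done
  return "$rc"
}

# Usage: compare_runs run_linux run_mac
```

A byte-wise comparison is strict: small floating-point differences, such as those observed in the tSNE coordinates, would be reported as differences and would need a tolerance-based comparison instead.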
Each RMD notebook can be downloaded using the button in the top right corner of each HTML document, as well as directly from the GitHub repository associated with this site (https://github.com/audrey-onfroy/Mansour_et_al).