This site gathers and describes all scripts used for the analysis of single-cell RNA sequencing (scRNAseq) data, from count matrices to manuscript figures.

All computational analyses were conducted using the R language within RStudio. We used the rmarkdown package to develop notebooks in RMD format. An RMD file is organized into two types of sections. “Chunk” sections contain code in a programming language, such as R, and the output is displayed below each chunk. Around the chunks, text sections hold explanatory comments in human-readable language, which can be added at will. The Markdown language is generally used to format the text. Since the notebook is then compiled into an HTML file, HTML can also be used.

RMD files were compiled to generate HTML files, which are shared through this site. Section titles are written in bold. The plain-text items listed in the contents box on the left are clickable, and each one links to one HTML file. Once a file is open, a link in the top left corner lets you open it in a new tab. In the top right corner, the “Code” button lets you download the RMD file behind the HTML.

About data

This section covers data availability, including a succinct presentation of the datasets considered in this study.

Data availability

All datasets considered in this study, such as FASTQ files and count matrices, are available in the functional genomics data collection from EMBL-EBI (ArrayExpress) under the accession number E-MTAB-13334.

Data presentation

Eleven datasets were analyzed in the article. Their main characteristics are described in the table below.

In this table, “Project name” corresponds to the unique name given to the DNA library associated with the tissue and subjected to sequencing. “Genotype” is provided in an abbreviated form, using the following definitions:

ctrl: Prss56^Cre/+;R26^tdTom/+ mouse, i.e. Prss56+ cells and their derivatives permanently express the Tomato reporter gene (Tomato+)
flfl (Nf1-KO^wt): Prss56^Cre/+;Nf1^fl/fl;R26^tdTom/+ mouse, i.e. Prss56+ derivatives are Tomato+ and Nf1^-/-, while the genetic background is wild-type for Nf1 (Nf1^+/+)
flmn (Nf1-KO^het): Prss56^Cre/+;Nf1^fl/-;R26^tdTom/+ mouse, i.e. Prss56+ derivatives are Tomato+ and Nf1^-/-, while the genetic background is heterozygous for Nf1 (Nf1^+/-)
pNF: 5-month-old Nf1-KO^wt mice

Mice were sacrificed at the age of 3 months (5 months for pNF, as defined above), and the tissue analyzed corresponds to subcutaneous nerves. Each project name corresponds to a single mouse, except for plexiform neurofibromas (pNF), for which several mice were pooled to obtain enough cells. Sample identifiers were generated based on genotype and condition, stressing the difference between samples from nerves and samples from pNFs.

“Condition” corresponds to the housing condition of the mice, which were either grouped together in cages or kept separately.

Finally, we defined a custom color for each sample type (genotype + condition), depicted as a colored bullet in the first column, while the associated HTML color codes are provided in the last column. Control mice kept alone are associated with green shades, while control mice kept together are in gold shades. Grey, blue, pink and red denote the other sample types.

Analysis strategy

This section explains how computational analyses were conducted. The analyses were spread over several notebooks, following a branching organisation (i.e. a tree structure).

Analysis skeleton

A workflow of our analysis from raw count matrices to manuscript figures is presented below.

First (Aa), each BCL folder, corresponding to one project name, is processed to obtain a count matrix. This matrix contains as many rows as there are genes in the transcriptome annotation, and as many columns as there were droplets containing RNA; it is called the raw count matrix. Step (B) is performed before the following steps, in order to generate the metadata associated with all samples (see table above), a specific color for each cell type, and marker genes for cell type annotation. Next (Ab), each sample (“project name”) is analyzed individually, in order to filter out low-quality cells, generate a two-dimensional (2D) projection to visualize cells, and annotate cells with clusters, cell type, cell cycle, etc. Further analyses are organised along two branches. Along the first branch (C), all datasets are combined to generate a common 2D representation and visualize all cells as an atlas. Along the second branch (D), a population of interest is extracted from each individual dataset, and the cells, together with their annotations, are merged into a new dataset. As for the atlas, this new dataset is processed to generate a 2D representation of the selected cells. Dataset construction and annotation are always kept separate from the analysis: separate notebooks are developed to conduct differential expression, enrichment analysis, and so on. Finally (E), figures for the manuscript are made from the files saved at each step.

Directory skeleton

Analyses were conducted in specific directories, organised in a tree as shown below:

cat tree.txt

## tree -n -P "*Rmd" ./.. > tree.txt
## 
## ./..
## ├── 0_intro
## │   ├── 1_welcome.Rmd
## │   └── 2_mat_and_meth.Rmd
## ├── 1_input
## │   ├── 2020_09
## │   ├── 2020_17
## │   ├── 2022_35
## │   ├── 2022_36
## │   ├── 2022_37
## │   ├── 2022_38
## │   ├── 2022_41
## │   ├── 2022_42
## │   ├── 2023_02
## │   ├── 2023_04
## │   └── 2023_06
## ├── 2_metadata
## │   └── 1_build_metadata.Rmd
## ├── 3_individual
## │   ├── 1_make_individual.Rmd
## │   └── datasets
## ├── 4_combined
## │   ├── 41_combined_ctrl
## │   │   └── 1_make_combined_ctrl.Rmd
## │   └── 42_combined_all
## │       └── 1_make_combined_all.Rmd
## ├── 5_zoom
## │   ├── 51_zoom_SC
## │   │   ├── 1_zoom_SC_dataset.Rmd
## │   │   └── 2_zoom_SC_analysis.Rmd
## │   ├── 52_zoom_epiFb
## │   │   ├── 1_zoom_epiFb_dataset.Rmd
## │   │   └── 2_zoom_epiFb_analysis.Rmd
## │   ├── 53_zoom_periFb
## │   │   ├── 1_zoom_periFb_dataset.Rmd
## │   │   └── 2_zoom_periFb_analysis.Rmd
## │   ├── 54_zoom_endoFb
## │   │   ├── 1_zoom_endoFb_dataset.Rmd
## │   │   └── 2_zoom_endoFb_analysis.Rmd
## │   └── 55_zoom_immune
## │       ├── 1_zoom_immune_dataset.Rmd
## │       └── 2_zoom_immune_analysis.Rmd
## ├── 6_figures
## │   ├── figures_detail
## │   └── figures.Rmd
## └── index_builder
## 
## 28 directories, 17 files

These directories can be described as follows:

intro contains this introductory file.
input stores all the input data, i.e. the raw count matrices; this directory contains one folder per project name, and each folder contains the three files associated with the raw count matrix.
metadata contains the notebook used to build global satellite data, such as cell type markers to annotate cells, colors associated with cell types, and sample information (project name, associated color and sample identifier, as described earlier).
individual is the directory where all individual datasets are generated and stored, in the datasets directory.
combined is the directory where some individual datasets are combined.
zoom directories are named according to a population of interest, and are used to build and analyze the dataset corresponding to that specific population.
figures contains the notebook generating all figures for the article; upon compilation, each figure is automatically stored in the figures_detail folder as a PDF file and a PNG file, named according to the chunk from which it is generated.
index_builder contains the CSS and JS files required to format this web site.

Note that the menu on the left follows almost the same structure.
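As a quick sanity check of the input layout described above, the following bash sketch verifies that each project folder contains exactly three files. The function name check_input_dirs is hypothetical and is not part of the repository; the names of the three files are not checked, since they are not listed here.

```shell
#!/usr/bin/env bash
# Sketch: verify that every project folder under an input directory
# contains exactly three files (assumed layout; file names are not checked).
# check_input_dirs is a hypothetical helper, not part of the repository.

check_input_dirs() {
    local input_dir="$1"
    local status=0
    local dir n_files
    for dir in "${input_dir}"/*/; do
        # Skip the unexpanded pattern when there is no subfolder at all
        [ -d "${dir}" ] || continue
        # Count regular files directly inside the project folder
        n_files=$(find "${dir}" -maxdepth 1 -type f | wc -l | tr -d ' ')
        if [ "${n_files}" -eq 3 ]; then
            echo "OK      $(basename "${dir}")"
        else
            echo "PROBLEM $(basename "${dir}") (${n_files} files)"
            status=1
        fi
    done
    # Non-zero exit status if at least one folder is incomplete
    return "${status}"
}
```

From the 3_individual directory, it could be called as check_input_dirs ../1_input before launching any compilation.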

Stability

This section explains how we ensure the stability of notebook compilation. We used more than 40 packages of interest to conduct the analysis, and each package has its own set of dependencies, often requiring specific versions. Dependencies include other R packages, as well as operating system-related tools. Hence, it is challenging to compile such notebooks on another computer, without paying attention to the required environment. Moreover, to ensure full reproducibility of the analysis, the exact same versions of all packages should be deployed.

Singularity

To enforce the use of the same R package environment for all analyses, we generated a Singularity container containing the specific versions of all R packages used.

To build the container, we defined a dependency tree for all packages of interest:

# Packages of interest (each package is listed once)
package_of_interest <- c("base", "aquarius",
                         "Seurat", "ggplot2", "patchwork", "infercnv",
                         "dplyr", "slingshot", "TInGa", "ComplexHeatmap",
                         "nichenetr", "org.Mm.eg.db", "rtracklayer",
                         "stringr", "RColorBrewer", "viridis", "circlize",
                         "ggvenn", "gridExtra", "gtable", "grid", "harmony",
                         "dynplot", "dynmethods", "AnnotationDbi", "msigdbr",
                         "clusterProfiler", "enrichplot",
                         "EnhancedVolcano", "AUCell", "Matrix", "hdf5r",
                         "knitr", "kableExtra", "corrplot", "dynutils", "ggtext")

Next, we generated a table defining an installation order for all these packages, such that, when a package is installed, all its dependencies have already been installed. This way, we avoid installation conflicts. These two steps are performed by the aquarius::repro_installation_order function, which generates a table including links to the stored tar.gz files. Finally, the container was built using the definition file (https://github.com/audrey-onfroy/Mansour_et_al).
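Installing every dependency before the package that needs it amounts to a topological sort of the dependency tree. The sketch below illustrates the idea with the standard tsort utility. It is not how aquarius::repro_installation_order is implemented, and both the installation_order wrapper and the dependency pairs are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: derive an installation order from "dependency package" pairs
# using the standard tsort utility. The pairs below are illustrative only
# and installation_order is a hypothetical wrapper.

installation_order() {
    # Each input line reads "A B", meaning A must be installed before B.
    tsort
}

deps='Matrix Seurat
ggplot2 Seurat
ggplot2 patchwork
patchwork Seurat'

# Prints one package per line, dependencies first
printf '%s\n' "${deps}" | installation_order
```

Several valid orders may exist for the same dependency tree; any of them guarantees that a package is never installed before its dependencies.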

`aquarius` package

We included all our wrapper and home-made functions in an R package called aquarius, which is accessible in the GitHub repository (https://github.com/audrey-onfroy/Mansour_et_al). The version used for the analyses described here is embedded in the Singularity container.

RMD knitting

To ensure reproducibility of all the analyses from count matrices to the final figures and results, RMD files were compiled using a Singularity container. First, RMD notebooks were developed within RStudio. Next, we opened a Unix terminal and ran bash commands to compile the RMD files using Singularity. For example, the following code:

file="1_welcome"
singularity exec \
--bind /home \
/path/to/singularity/image/singularity.simg \
Rscript -e "rmarkdown::render(input = '${file}.Rmd', output_file = '${file}.html')"

was used to compile the notebook that produced this “Welcome” document itself. We tell the singularity tool to execute a command (5th line) using the container specified on the 4th line, and to bind the home directory to the container (3rd line) so that data stored outside the container are available within it. On our machine, the analysis directory is located in the home directory.

Except for the notebook dedicated to individual datasets, all notebooks are parameter-free, so they can all be compiled using the same command. The 1_make_individual notebook requires one parameter, called sample_name, which corresponds to the project name described earlier. The project name should be defined in the metadata file, along with its color, sample identifier and so on, because this information is used to annotate the individual datasets. The project name should also match a folder name in the input folder. To generate the individual datasets, we thus use the following bash commands:

# Iterate over the project folders stored in the input directory
for dir in ../1_input/*/
do
sample=$(basename "${dir}")
echo "${sample}"
singularity exec \
--bind /home \
/path/to/singularity/image/singularity.simg \
Rscript -e "rmarkdown::render(input = './1_make_individual.Rmd',
                              output_file = './${sample}.html',
                              params = list(sample_name = '${sample}'))"
done

This code is very similar to the preceding one, except that we define a “sample” variable before running the singularity command. The variable is passed as the sample_name parameter to compile the notebook. Moreover, to avoid copy-pasting the same command multiple times, we iterate (for loop) over all the subdirectories of the input folder. To follow the progress of the computation, the sample name is printed at each iteration of the for loop.
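When only some of the individual notebooks need to be compiled again, the loop can be restricted to the project names without an HTML report. The sketch below is one possible way to list them. The pending_samples function is hypothetical, and it assumes that each compiled report is named after its project, as in the loop above.

```shell
#!/usr/bin/env bash
# Sketch: list the project names that still lack a compiled HTML report.
# Assumes one subfolder per project in the input directory and one
# <project>.html file per compiled notebook in the output directory.
# pending_samples is a hypothetical helper, not part of the repository.

pending_samples() {
    local input_dir="$1"
    local output_dir="$2"
    local dir sample
    for dir in "${input_dir}"/*/; do
        # Skip the unexpanded pattern when there is no subfolder at all
        [ -d "${dir}" ] || continue
        sample=$(basename "${dir}")
        # Print the project name only if its report is missing
        if [ ! -f "${output_dir}/${sample}.html" ]; then
            echo "${sample}"
        fi
    done
}
```

For instance, pending_samples ../1_input . prints the project names whose report is missing from the current directory, and this list can be fed to the for loop instead of the full content of the input folder.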

Since R libraries are not stored at the same location within the Singularity container and on the local machine, we call the .libPaths() R command at the beginning of each notebook. In the compiled HTML file, we can thus easily check whether the notebook was compiled through the container or not. An improvement would be to print the version of the Singularity container used. In the Singularity container, libraries are located in /usr/local/lib/R/library.
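A possible starting point for the improvement mentioned above is to report the container image from the environment. The sketch below assumes that Singularity exposes the path of the running image in the SINGULARITY_CONTAINER environment variable; this should be checked against the Singularity version in use. The container_label function is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report which container image (if any) the current shell runs in.
# Relies on the SINGULARITY_CONTAINER variable that Singularity is expected
# to set inside a container (assumption: verify on your version).
# container_label is a hypothetical helper.

container_label() {
    if [ -n "${SINGULARITY_CONTAINER:-}" ]; then
        echo "singularity: ${SINGULARITY_CONTAINER}"
    else
        echo "local machine (no container detected)"
    fi
}

container_label
```

Printed at the start of a compilation, this line would complement the .libPaths() output by naming the image file itself.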

Reproducibility

To test the reproducibility of the analyses, we ran them several times at intervals of 2-3 months on the same Linux-based machine. To compare two compilations of the same notebook made on different days, we focused on the number of cells after the quality control filtering steps, the tSNE representation, cell type annotation, clustering and Gene Set Enrichment Analysis results. On the same Linux-based machine, the results turned out to be stable. Between a Mac-based machine and a Linux-based machine, everything except the 2D representation turned out to be identical. In the tSNE, the main visual trend was the same, but some details differed. Since tSNE is only used for visualization and not for further analysis, and since the differences were minor, we consider our analyses to be globally reproducible. The differences observed with tSNE could be due to machine precision and numerical rounding.
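Such a comparison could be partly automated. The sketch below assumes that each compilation exports its key results (for example, tables of cell counts or cluster assignments) to a dedicated directory, which is not described on this site. The compare_runs function and the directory layout are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare the result files exported by two compilations.
# Assumes each compilation wrote its tables to its own directory
# (hypothetical layout); compare_runs is a hypothetical helper.

compare_runs() {
    local run_a="$1"
    local run_b="$2"
    # diff -rq lists the files that differ or exist on one side only
    if diff -rq "${run_a}" "${run_b}"; then
        echo "identical"
    else
        return 1
    fi
}
```

Compiled HTML files are probably poor candidates for such a byte-wise comparison, since they may embed dates or other run-specific content; plain tables are a safer choice.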

Code availability

Each RMD notebook can be downloaded using the button in the top right corner of each HTML document, as well as directly from the GitHub repository associated with this site (https://github.com/audrey-onfroy/Mansour_et_al).

Welcome

2023-09-04