# Snakemake Workflow: single cell RNA-seq analysis
The methods follow Luecken and Theis, 2019 (https://doi.org/10.15252/msb.20188746), which describes current best practices in single cell RNA-seq analysis, and have been extended to multimodal analysis using Specter (https://www.biorxiv.org/content/10.1101/2020.06.15.151910v1). We are currently preparing a standalone version of Specter that will no longer require users to hold a MATLAB license.

## Installations
* install Snakemake (workflow management system)
* install conda (package management system)
* install mamba (package management system) (optional)
* install bamtofastq and cellranger by 10X Genomics (used for preprocessing of data)
* install Sphetcher (downsampling algorithm) (optional)
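
Before starting the workflow, it can help to confirm that the tools listed above are reachable from the shell. The following is a minimal sketch, not part of the workflow itself. The executable names are assumptions (the 10x Genomics bamtofastq binary, for instance, may carry a platform or version suffix), and Sphetcher is left out because its executable name depends on how it was built.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED = ["snakemake", "conda", "bamtofastq", "cellranger"]
OPTIONAL = ["mamba"]


def find_tools(names):
    """Map each executable name to its location on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED):
        print(f"missing required tool: {name}")
    for name in missing_tools(OPTIONAL):
        print(f"missing optional tool: {name}")
```

If the script prints nothing, every listed executable was found on PATH.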

## Usage
1. Clone this repository recursively (because of the submodule Specter)
2. Configure the workflow by editing the config.yaml file (the parameters are described in the file)
3. Start the workflow from the folder that contains the Snakefile by running one of the following commands:
```
snakemake --use-conda --cores x
snakemake --use-conda --cores x --conda-frontend mamba
```
x specifies the number of cores used by the workflow. The second command uses the package manager mamba instead of conda and should be used if creating the environments takes too long.
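
For scripted runs, the choice between the two commands can be automated. The sketch below only assembles the argument list from the flags shown above; the function name `snakemake_command` and its parameters are hypothetical helpers, not part of this repository.

```python
import shutil


def snakemake_command(cores, use_mamba=None):
    """Build the argument list for one of the two commands shown above.

    cores:     number of cores passed to --cores (the x in the commands)
    use_mamba: True or False to force the choice; None selects mamba
               only if a mamba executable is found on PATH
    """
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    if use_mamba is None:
        use_mamba = shutil.which("mamba") is not None
    cmd = ["snakemake", "--use-conda", "--cores", str(cores)]
    if use_mamba:
        cmd += ["--conda-frontend", "mamba"]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(snakemake_command(cores=4)))
```

To actually start the workflow, the returned list could be passed to `subprocess.run(cmd, check=True)` from the folder that contains the Snakefile.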

## Data
The workflow starts from BAM files, defined in two TSV files that are linked in the config.yaml. Samples.tsv has two columns: the first defines the sample and the second the path to the corresponding BAM file. Units.tsv has three to five columns and provides further information about the samples. The first column specifies the sample and the second the alias used in the workflow. The third column defines the region of each sample (used in visualizations across regions). The last two columns are optional and should only be used if differential testing is to be performed. In that case, the fourth column is named contrast and specifies the two groups between which differential testing is performed (defined by the letters A and B). The fifth column can be named freely by the user and specifies another source of variability, which is accounted for in the differential testing.

Example Samples.tsv:

|sample|path                |
|---|---------------------|
|sample1|path_to_sample1.bam|
|sample2|path_to_sample2.bam|

Example Units.tsv:

sample | sample_alias | region | contrast | variable_name
-------|--------------|-------|---------|--------
sample1 | alias1 | reg1 | A | x
sample2 | alias2 | reg2 | B | y
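
The column rules described in this section can be checked before a run. The following sketch is not part of the workflow; the function names are hypothetical, and two checks are assumptions rather than documented behavior: that every sample in Units.tsv must also appear in Samples.tsv, and that every row must carry A or B when the contrast column is present.

```python
import csv
import io


def read_tsv(handle):
    """Read tab-separated text and return (header, data_rows)."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        raise ValueError("file is empty")
    return rows[0], rows[1:]


def check_samples(handle):
    """Check Samples.tsv (sample, path) and return the sample names."""
    header, rows = read_tsv(handle)
    for row in [header] + rows:
        if len(row) != 2:
            raise ValueError(f"Samples.tsv: expected 2 columns, got {row}")
    return [row[0] for row in rows]


def check_units(handle, samples):
    """Check Units.tsv and return its rows as dictionaries keyed by header."""
    header, rows = read_tsv(handle)
    if not 3 <= len(header) <= 5:
        raise ValueError(f"Units.tsv: expected 3 to 5 columns, got {len(header)}")
    has_contrast = len(header) >= 4
    if has_contrast and header[3] != "contrast":
        raise ValueError("Units.tsv: the fourth column must be named 'contrast'")
    records = []
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"Units.tsv: row does not match header: {row}")
        # Assumption: each unit must refer to a sample listed in Samples.tsv.
        if row[0] not in samples:
            raise ValueError(f"Units.tsv: unknown sample {row[0]!r}")
        # Assumption: every row is assigned to group A or group B.
        if has_contrast and row[3] not in ("A", "B"):
            raise ValueError(f"Units.tsv: contrast must be A or B, got {row[3]!r}")
        records.append(dict(zip(header, row)))
    return records


if __name__ == "__main__":
    # The example tables from this README, written as tab-separated text.
    samples_tsv = "sample\tpath\nsample1\tpath_to_sample1.bam\nsample2\tpath_to_sample2.bam\n"
    units_tsv = (
        "sample\tsample_alias\tregion\tcontrast\tvariable_name\n"
        "sample1\talias1\treg1\tA\tx\n"
        "sample2\talias2\treg2\tB\ty\n"
    )
    names = check_samples(io.StringIO(samples_tsv))
    units = check_units(io.StringIO(units_tsv), names)
    print(f"{len(names)} samples and {len(units)} units passed the checks")
```

For real files, open them with `open("Samples.tsv", newline="")` and pass the file handle in place of the `io.StringIO` object.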