Commit f1831026 authored by james94

Update README.md

parent 2ac6e7f6
# README
# Genome Evaluation Pipeline (GEP)
<br>
* User-friendly and **all-in-one** **quality control and evaluation** pipeline for genome assemblies
* User-friendly and **all-in-one quality control and evaluation** pipeline for genome assemblies
* Run **multiple genome evaluations** in one go (as many as you want!)
* Run **multiple genome evaluations** in parallel (as many as you want!)
* Seamlessly **scaled to server, cluster, grid and cloud environments**
* **Scales to server and HPC environments**
* Required **software** **stack** **automatically deployed** to any execution environment using **snakemake** and **conda**
* Required **software stack automatically deployed** using conda
---
# Getting Started
GEP can be run in two independent modes - the inputs for which are specified in respective sample sheets. The general idea of these two modes is as follows:
## The Workflow - 2 Steps
1. Create meryl database (Step 4.1)
- Inputs: WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
1. [Build meryl k-mer databases](#build-meryl-k-mer-databases)
- Requires: WGS sequencing libraries (PacBio HiFi or Illumina PE short-insert)
- Output: (`.meryl`) k-mer database
2. Run evaluation (Step 4.2)
- Inputs: (`.meryl`) k-mer database and corresponding genome assembly you wish to evaluate
2. [Assembly evaluation](#assembly-evaluation)
- Requires: (`.meryl`) k-mer database and corresponding genome assembly you wish to evaluate
- Output: Evaluation results and report
## **Step 1. Downloading the workflow**
<br>
---
## **Downloading the workflow**
To clone the repository, use the following command:
```
git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git
```
<br>
---
## **Step 2. Conda management**
- Conda (v4.10.3) *but may work on older versions*
*If you already have conda installed on your system, please skip to **step 3***
## Installing Conda
- Conda (v4.11+) *but may work on older versions*
If you already have conda installed on your system, please skip to [Creating our Snakemake conda environment](#creating-our-snakemake-conda-environment)
Download the linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html
......@@ -56,20 +69,23 @@ conda update conda
```
If you get a `conda: command not found` error, close and re-open your terminal so that the conda installation takes effect, and then run the update again.
<br>
---
<div id="creating-our-snakemake-conda-environment"></div>
## **Step 3. Creating our Snakemake conda environment**
## Creating our Snakemake conda environment
The pipeline requires the following software to run:
- snakemake (6.6.1+)
- python (3.9.1+)
- tabulate (0.8.7+)
- beautifulsoup4 (4.9+)
- mamba (0.15.2)
- pandoc (2.2.1) *most recent version 2.16.2 causes error*
- snakemake (v6.6.1)
- python (v3.9.10)
- tabulate (v0.8.7)
- beautifulsoup4 (v4.9)
- mamba (v0.15.2) *[Newest version causes error]*
- pandoc (v2.15)
- tectonic (v0.8.2)
The easiest method to install this software stack is to create a GEP conda environment with the provided `installGEP.yaml` ***Note**
The easiest method to install this software stack is to create a GEP conda environment with the provided `installGEP.yaml` (see ***Note**)
```
......@@ -81,68 +97,109 @@ conda activate GEP
snakemake --version
```
**Note** *If you already have a snakemake installation and would like to avoid installing it again, ensure that the above software are all in your `PATH` and you have conda installed/activated.*
***Note** *If you already have a snakemake (or suitable Python) installation and would like to avoid installing again, ensure that all of the above software are in your `PATH`. If you do this instead of installing from the provided GEP environment (`installGEP.yaml`), you will still need at least the base conda installed/activated - as it's required to handle software dependencies of all the tools used within the workflow itself*
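
For the case described in the note, where an existing installation is reused, the following is a minimal sketch of one way to check which of the command-line tools listed above are visible on your `PATH`. It only tests for presence, not versions, and the choice of tool names to probe is an assumption based on the software list above.

```shell
# Report which of the command-line tools listed above are on PATH.
# tabulate and beautifulsoup4 are Python packages and are not checked here.
missing=0
for tool in snakemake python mamba pandoc tectonic; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool ($(command -v "$tool"))"
    else
        echo "missing: $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing of 5 tools missing"
```

If any tool is reported missing, installing from the provided `installGEP.yaml` is likely the simpler route.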
<br>
## **Step 4. SampleSheets `.tsv` and Config `.yaml`**
---
GEP can be run in two independent modes - the inputs for which are specified in respective [sample sheets](configuration/exampleSampleSheets/). The general idea of these two modes is as follows:
<div id="configure-sample"></div>
1. Create meryl database
- Inputs: Sample sheet containing paths to WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
- Output: (`.meryl`) k-mer database
## **Configure Sample Sheet `.tsv`s**
<span style="color:darkred">For the moment, GEP can run with *only one* of the below sample sheets at a given time. </span>
2. Assembly Evaluation
- Inputs: Sample sheet containing paths to (`.meryl`) k-mer database and corresponding genome assembly you wish to evaluate
- Output: Evaluation results and report
**Depending on which sample sheet is provided, GEP will automatically run in one of the following two modes:**
1. **Build**
- This mode is run if either the [Illumina Sample Sheet `.tsv`](#illumina-sample) or the [PacBio HiFi Sample Sheet `.tsv`](#hifi-sample) is provided to GEP.
### **Step 4.1. Build k-mer databases**
2. **Evaluate**
- This mode is run if the [Assembly Evaluation Sample Sheet `.tsv`](#assembly-eval) is provided.
*If you already have a [meryl](https://github.com/marbl/merqury/wiki/1.-Prepare-meryl-dbs) k-mer database corresponding respectively to the genomes you wish to evaluate, you can skip to step 4.2.*
#### Illumina Sample Sheet `.tsv`
If you are a little uncertain about which short-read libraries you should use, see [How to choose your illumina libraries](#how-to-choose-your-illumina-libraries) further down the page.
If you already have meryl k-mer databases [(see the meryl github for more details)](https://github.com/marbl/merqury/wiki/1.-Prepare-meryl-dbs) for the genomes you wish to evaluate, you can skip **Build** mode.
<br>
---
See [GEP/configuration/exampleSampleSheets/build_illumina_example.tsv](configuration/exampleSampleSheets/build_illumina_example.tsv) for an idea of how to set up the illumina sample sheet.
<div id="illumina-sample"></div>
#### Illumina Sample Sheet `.tsv` - for **Build** mode
(see example [GEP/configuration/exampleSampleSheets/build_illumina_example.tsv](configuration/exampleSampleSheets/build_illumina_example.tsv))
|sample|Library_R1|Library_R2|meryl_kmer_size| trim10X| trimAdapters| fastQC|
|:----|:---|:---|:---|:---|:----|:---|
|Your identifier for the sample to which the provided WGS libraries belong| Full path to Forward (R1) read of PE library/s in `fastq` format. Can be `.gz`ipped | Full path to Reverse (R2) read of PE library/s in `fastq` format. Can be `.gz`ipped| Your choice of k-mer size used to count k-mers in the reads. Recommended for Illumina reads is `21`| Remove the first 23bp from R1 of the library. Only use this if your reads were sequenced using the 10X sequencing platform. Possible options are `True` or `False` | Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are `True` or `False` |Run FastQC on the library pair provided in `Library_R1` and `Library_R2`. Possible options are `True` or `False`|
|Your preferred identifier (results will include this sample name in output files) | Full path to Forward (R1) read of PE library/s in `fastq` format. Can be `.gz`ipped | Full path to Reverse (R2) read of PE library/s in `fastq` format. Can be `.gz`ipped| Your choice of k-mer size used to count k-mers in the reads. Recommended for Illumina reads is `21`| Remove the first 23bp from R1 of the library. Only use this if your reads were sequenced using the 10X sequencing platform. Possible options are `True` or `False` | Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are `True` or `False` |Run FastQC on the library pair provided in `Library_R1` and `Library_R2`. Possible options are `True` or `False`|
#### PacBio HiFi Sample Sheet `.tsv`
<br>
Additional Info in case of multiple PE libraries for a single sample:
- **sample**: Provide the same sample ID for each of the pairs of a sample (the final output will be a single `.meryl` database consisting of k-mers from all PE libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample.
- **Library_R1** and **Library_R2**: Each library pair is provided as one line in the tsv. If you have three PE libraries, then you will have three lines in the tsv for this sample.
- **meryl_kmer_size**: This should be consistent for libraries that have the same sample ID.
- **trim10X**: This does not need to be consistent. If you wish to build a database using a combination of 10x and non-10x barcoded libraries, you can do so. Only provide `True` option if you definitely want to trim the `Library_R1` provided in that same line.
- **trimAdapters**: Similarly, you may select `True` or `False` for each library independent of whether they are part of the same sample or not.
- **fastQC**: If any library from a sample has `True` in this column, then all libraries with the identical sample ID will have their quality checked with FastQC, even if these other libraries have `False` in this column.
If you are a little uncertain about which short-read libraries you should use, see [How to choose your illumina libraries](#how-to-choose-your-illumina-libraries) further down the page.
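
As an illustration of the multi-library rules above, here is a minimal sketch that writes an Illumina sample sheet for one sample with two PE libraries and then sanity-checks it. The file name, sample ID and read paths are hypothetical, and the presence of a header row is an assumption; compare against the provided `build_illumina_example.tsv` before using this layout.

```shell
# Hypothetical output file; sample sheets can be saved under any name.
sheet="build_illumina_sketch.tsv"

# Header row (assumed to match the column names in the table above).
printf 'sample\tLibrary_R1\tLibrary_R2\tmeryl_kmer_size\ttrim10X\ttrimAdapters\tfastQC\n' > "$sheet"

# Two PE libraries of the same sample: one line per library pair,
# identical sample ID and k-mer size, per-library trimming options.
# All paths below are placeholders.
printf 'speciesA\t/path/to/libA_R1.fastq.gz\t/path/to/libA_R2.fastq.gz\t21\tFalse\tTrue\tTrue\n' >> "$sheet"
printf 'speciesA\t/path/to/libB_R1.fastq.gz\t/path/to/libB_R2.fastq.gz\t21\tTrue\tTrue\tFalse\n' >> "$sheet"

# Check 1: every line has exactly 7 tab-separated fields.
bad_fields=$(awk -F'\t' 'NF != 7 {n++} END {print n+0}' "$sheet")

# Check 2: meryl_kmer_size (column 4) is the same on all lines of a sample.
bad_kmer=$(awk -F'\t' 'NR > 1 { if (($1 in k) && k[$1] != $4) n++; k[$1] = $4 }
                       END { print n+0 }' "$sheet")

echo "lines with wrong field count: $bad_fields"
echo "lines with inconsistent k-mer size: $bad_kmer"
```

Both counts should be zero for a well-formed sheet. The `trim10X` values differ between the two lines on purpose, since that column does not need to be consistent within a sample.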
<br>
---
<div id="hifi-sample"></div>
See [GEP/configuration/exampleSampleSheets/build_hifi_example.tsv](configuration/exampleSampleSheets/build_hifi_example.tsv) for an idea of how to set up the PacBio sample sheet.
#### PacBio HiFi Sample Sheet `.tsv`
(see example [GEP/configuration/exampleSampleSheets/build_hifi_example.tsv](configuration/exampleSampleSheets/build_hifi_example.tsv))
|sample|hifi_reads|meryl_kmer_size| trimSMRTbell| fastQC|
|:----|:---|:---|:----|:---|
|Your identifier for the sample to which the provided WGS libraries belong| Full path to HiFi library/s in `fastq` format. Can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads. Recommended for PacBio HiFi is `31`| Check for and remove any SMRT-bell adapter sequences. Possible options are `True` or `False` | Run FastQC on the library provided in `hifi_reads`. Possible options are `True` or `False` |
|Your preferred identifier (results will include this sample name in output files) | Full path to HiFi library/s in `fastq` format. Can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads. Recommended for PacBio HiFi is `31`| Check for and remove any SMRT-bell adapter sequences. Possible options are `True` or `False` | Run FastQC on the library provided in `hifi_reads`. Possible options are `True` or `False` |
**With one of the above sample sheets complete, you can now run the database building step of GEP.** See **Step 4.3** for how to configure your run.
<br>
### **Step 4.2. Assembly Evaluation**
Additional Info in case of multiple HiFi libraries for a single sample:
- **sample**: Provide the same sample ID for each HiFi library of a sample (the final output will be a single `.meryl` database consisting of k-mers from all HiFi libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample.
- **hifi_reads**: Each library is provided as one line in the tsv. If you have three HiFi libraries, then you will have three lines in the tsv for this sample.
- **meryl_kmer_size**: This should be consistent for libraries that have the same sample ID.
- **trimSMRTbell**: You may select `True` or `False` for each library independent of whether they are part of the same sample or not.
- **fastQC**: If any library from a sample has `True` in this column, then all libraries with the identical sample ID will have their quality checked with FastQC, even if these other libraries have `False` in this column.
<br>
---
<span style="color:darkred">You may also wish to concatenate the libraries of a sample together prior to building the database. In this case, you will only need to provide one line per sample in the respective sample sheets. However, the run time will increase, as the pipeline is designed to process multiple libraries in parallel. </span>
---
<div id="assembly-eval"></div>
#### Assembly Evaluation Sample Sheet `.tsv`
See [GEP/configuration/exampleSampleSheets/runEval_example.tsv](configuration/exampleSampleSheets/runEval_example.tsv) for an idea of how to set up the evaluation sample sheet.
|ID|PRI_asm|ALT_asm| merylDB| merylDB_kmer |genomeSize|
|:----|:---|:---|:----|:---|:---|
|Identifier for results and reporting| Full path to primary assembly you wish to evaluate in `fasta` format. Can be `.gz`ipped | Full path to alternate assembly (haplotype) in `fasta` format. Can be `.gz`ipped. **If you do not have one, write** `None` | Full path to `.meryl` database | The k-mer size used to build your provided `.meryl` db | Provide a size estimate (in bp) for the corresponding assembly/species. Can **leave blank** and it will be inferred during evaluation |
|ID|PRI_asm|ALT_asm| merylDB| merylDB_kmer |genomeSize|HiC_R1|HiC_R2|
|:----|:---|:---|:----|:---|:---|:---|:------------------|
|Identifier for results and reporting| Full path to primary assembly you wish to evaluate in `fasta` format. Can be `.gz`ipped | Full path to alternate assembly (haplotype) in `fasta` format. Can be `.gz`ipped. **If you do not have one, write** `None` | Full path to `.meryl` database | The k-mer size used to build your provided `.meryl` db | Provide a size estimate (in bp) for the corresponding assembly/species. **If you do not have one, write** `auto` |Full path to Forward (R1) read of HiC library in `fastq` format. Can be `.gz`ipped |Full path to Reverse (R2) read of HiC library in `fastq` format. Can be `.gz`ipped|
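
To make the column layout concrete, here is a minimal sketch that writes a one-row evaluation sample sheet using the eight columns of the newer table, with `None` for a missing alternate assembly and `auto` for an unknown genome size. The file name, ID and all paths are hypothetical, and the header row is an assumption; check the provided `runEval_example.tsv` for the layout GEP actually expects.

```shell
# Hypothetical output file; sample sheets can be saved under any name.
sheet="runEval_sketch.tsv"

# Header row (assumed to match the column names in the table above).
printf 'ID\tPRI_asm\tALT_asm\tmerylDB\tmerylDB_kmer\tgenomeSize\tHiC_R1\tHiC_R2\n' > "$sheet"

# One assembly to evaluate: no alternate assembly (None) and no
# genome size estimate (auto). All paths below are placeholders.
printf 'speciesA_v1\t/path/to/speciesA_pri.fasta.gz\tNone\t/path/to/speciesA.meryl\t21\tauto\t/path/to/hic_R1.fastq.gz\t/path/to/hic_R2.fastq.gz\n' >> "$sheet"

# Sanity check: every line has exactly 8 tab-separated fields.
bad_fields=$(awk -F'\t' 'NF != 8 {n++} END {print n+0}' "$sheet")
echo "lines with wrong field count: $bad_fields"

# Show the ID, ALT_asm and genomeSize values of each data row.
awk -F'\t' 'NR > 1 {print $1, $3, $6}' "$sheet"
```

The `merylDB_kmer` value (`21` here) should match the k-mer size that was used when the `.meryl` database was built.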
### 4.3 Configuration `.yaml`
**Note** There are multiple `config.yaml` files found inside the GEP project directories. Unlike the above sample sheets - which can be saved in any location and with any filename you wish - these config files must always remain in their existing locations, and their names should not be changed.
First you must provide some run-specific information in [GEP/configuration/config.yaml](configuration/exampleSampleSheets/runEval_example.tsv)
First you must provide some run-specific information in [GEP/configuration/config.yaml](configuration/config.yaml)
```
Results: # e.g. "/srv/public/users/james94/insecta_results_05_11_2021"
......@@ -155,9 +212,11 @@ Once you have a sample sheet ready, you need to configure your GEP run.
Modify the
**Step 5. Running the workflow**
-
## Running the workflow
<div id="build-meryl-k-mer-databases"></div>
## **Build meryl k-mer databases**
*Make sure your GEP environment is activated.*
......