Update README.md

f1831026 · james94 · 2ac6e7f6 · f1831026
Commit f1831026 authored 3 years ago by james94
--- a/README.md
+++ b/README.md
+# README
+
+
 # Genome Evaluation Pipeline (GEP)
+<br>

-* User-friendly and **all-in-one** **quality control and evaluation** pipeline for genome assemblies
+* User-friendly and **all-in-one quality control and evaluation** pipeline for genome assemblies

-* Run **multiple genome evaluations** in one go (as many as you want!)
+* Run **multiple genome evaluations** in parallel (as many as you want!)

-* Seamlessly **scaled to server, cluster, grid and cloud environments**
+* **Scales to server and HPC environments**

-* Required **software** **stack** **automatically deployed** to any execution environment using **snakemake** and **conda**
+* Required **software stack automatically deployed** using conda



+---


-# Getting Started

-GEP can be run in two independent modes - the inputs for which are specified in respective sample sheets.  The general idea of these two modes are as follows:
+## The Workflow - 2 Steps

-1. Create meryl database (Step 4.1)
-     - Inputs: WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
+
+1. [Build meryl k-mer databases](#build-meryl-k-mer-databases)
+     - Requires: WGS sequencing libraries (PacBio HiFi or Illumina PE short-insert)
     - Output: (`.meryl`) k-mer database  


-2. Run evaluation (Step 4.2)
-     - Inputs: (`.meryl`) k-mer database and corresponding genome assembly you wish to evaluate
+2. [Assembly evaluation](#assembly-evaluation)
+     - Requires: (`.meryl`) k-mer database and corresponding genome assembly you wish to evaluate
     - Output: Evaluation results and report

-## **Step 1. Downloading the workflow**
+<br>
+
+---
+
+
+## **Downloading the workflow**


 To clone the repository, use the following command:
 ```
 git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git
 ```
+<br>

 ---

-## **Step 2. Conda management**

- Conda (v4.10.3)  *but may work on older versions*

-*If you already have conda installed on your system, please skip to **step 3***
+## Installing Conda
+
+- Conda (v4.11+)  *but may work on older versions*
+
+If you already have conda installed on your system, please skip to [Creating our Snakemake conda environment](#creating-our-snakemake-conda-environment)

 Download the linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html

@@ -56,20 +69,23 @@ conda update conda
 ```

 If you get `conda command not found`, please close and re-open your terminal for the conda installation to take effect, and then update.
+<br>

 ---
+<div id="creating-our-snakemake-conda-environment"></div>

-## **Step 3. Creating our Snakemake conda environment**
+## Creating our Snakemake conda environment

 The pipeline requires the following software to run:
- snakemake (6.6.1+)
- python (3.9.1+)
- tabulate (0.8.7+)
- beautifulsoup4 (4.9+)
- mamba (0.15.2)
- pandoc (2.2.1) *most recent version 2.16.2 causes error*
+- snakemake (v6.6.1)
+- python (v3.9.10)
+- tabulate (v0.8.7)
+- beautifulsoup4 (v4.9)
+- mamba (v0.15.2) *[Newest version causes error]*
+- pandoc (v2.15) 
+- tectonic (v0.8.2)

-The easiest method to install this software stack is to create a GEP conda environment with the provided `installGEP.yaml` ***Note**
+The easiest method to install this software stack is to create a GEP conda environment with the provided `installGEP.yaml` (see ***Note**)


 ```
@@ -81,68 +97,109 @@ conda activate GEP

 snakemake --version
 ```
-**Note** *If you already have a snakemake installation and would like to avoid installing it again, ensure that the above software are all in your `PATH` and you have conda installed/activated.*
+***Note** *If you already have a snakemake (or suitable Python) installation and would like to avoid installing it again, ensure that all of the above software is in your `PATH`.  If you do this instead of installing from the provided GEP environment (`installGEP.yaml`), you will still need at least the base conda installed/activated, as it is required to handle the software dependencies of all the tools used within the workflow itself.*
+<br>

-## **Step 4. SampleSheets `.tsv` and Config `.yaml`**
+---


-GEP can be run in two independent modes - the inputs for which are specified in respective [sample sheets](configuration/exampleSampleSheets/).  The general idea of these two modes are as follows:

+<div id="configure-sample"></div>

-1. Create meryl database
-     - Inputs: Sample sheet containing paths to WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
-     - Output: (`.meryl`) k-mer database  
+## **Configure Sample Sheet `.tsv`s**

+<span style="color:darkred">For the moment, GEP can run with *only one* of the below sample sheets at a given time. </span> 

-2. Assembly Evaluation
-     - Inputs: Sample sheet containing paths to  (`.meryl`) k-mer database and corresponding genome assembly you wish to evaluate
-     - Output: Evaluation results and report
+**Depending on which sample sheet is provided, GEP will automatically run in one of the following two modes:**

+1. **Build**

+    - This mode is run if either the [Illumina Sample Sheet `.tsv`](#illumina-sample) or the [PacBio HiFi Sample Sheet `.tsv`](#hifi-sample) is provided to GEP.

-### **Step 4.1. Build k-mer databases**
+2. **Evaluate**
+    - This mode is run if the [Assembly Evaluation Sample Sheet `.tsv`](#assembly-eval) is provided.

-*If you already have a [meryl](https://github.com/marbl/merqury/wiki/1.-Prepare-meryl-dbs) k-mer database corresponding respectively to the genomes you wish to evaluate, you can skip to step 4.2.*

-#### Illumina Sample Sheet `.tsv`

-If you are a little uncertain about which short-read libraries you should use, see [How to choose your illumina libraries](#how-to-choose-your-illumina-libraries) further down the page.
+If you already have meryl k-mer databases [(see the Merqury wiki for more details)](https://github.com/marbl/merqury/wiki/1.-Prepare-meryl-dbs) for the genomes you wish to evaluate, you can skip **Build** mode.
+
+
+<br>
+
+---

-See [GEP/configuration/exampleSampleSheets/build_illumina_example.tsv](configuration/exampleSampleSheets/build_illumina_example.tsv) for an idea of how to set up the illumina sample sheet.
+<div id="illumina-sample"></div>
+
+#### Illumina Sample Sheet `.tsv` - for **Build** mode
+
+
+(see example [GEP/configuration/exampleSampleSheets/build_illumina_example.tsv](configuration/exampleSampleSheets/build_illumina_example.tsv))


 |sample|Library_R1|Library_R2|meryl_kmer_size| trim10X| trimAdapters| fastQC|
 |:----|:---|:---|:---|:---|:----|:---|
-|Your identifier for the sample from which the provided WGS libraries belong| Full path to Forward (R1) read of PE library/s in `fastq` format.  Can be `.gz`ipped  | Full path to Reverse (R2) read of PE library/s in `fastq` format.  Can be `.gz`ipped| Your choice of k-mer size used to count k-mers in the reads. Recommended for illumina reads is `21`| Remove first 23bp from R1 of the library. This is only if your reads were sequenced using 10X sequencing platform. Possible options are `True` or `False` | Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are `True` or `False`  |Run FastQC on the library pair provided in `hifi_reads`. Possible options are `True` or `False`|
+|Your preferred identifier (results will include this sample name in output files) | Full path to Forward (R1) read of PE library/s in `fastq` format.  Can be `.gz`ipped  | Full path to Reverse (R2) read of PE library/s in `fastq` format.  Can be `.gz`ipped| Your choice of k-mer size used to count k-mers in the reads. Recommended for Illumina reads is `21`| Remove the first 23bp from R1 of the library. Only use this if your reads were sequenced using the 10X sequencing platform. Possible options are `True` or `False` | Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are `True` or `False`  |Run FastQC on the library pair provided in `Library_R1` and `Library_R2`. Possible options are `True` or `False`|

-#### PacBio HiFi Sample Sheet `.tsv`
+<br>
+
+Additional Info in case of multiple PE libraries for a single sample:
+ - **sample**: Provide the same sample ID for each of the pairs of a sample (the final output will be a single `.meryl` database consisting of k-mers from all PE libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample.
+ - **Library_R1** and **Library_R2**: Each library pair is provided as one line in the tsv.  If you have three PE libraries, then you will have three lines in the tsv for this sample.
+ - **meryl_kmer_size**:  This should be consistent for libraries that have the same sample ID.
+ - **trim10X**: This does not need to be consistent.  If you wish to build a database using a combination of 10x and non-10x barcoded libraries, you can do so.  Only provide `True` option if you definitely want to trim the `Library_R1` provided in that same line.
+ - **trimAdapters**: Similarly, you may select `True` or `False` for each library independent of whether they are part of the same sample or not. 
+ - **fastQC**:  If any library from a sample has `True` in this column, then all libraries with the identical sample ID will have their quality checked with FastQC, even if these other libraries have `False` in this column.
+ 
+
+If you are a little uncertain about which short-read libraries you should use, see [How to choose your illumina libraries](#how-to-choose-your-illumina-libraries) further down the page.
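
As a rough illustration of the multi-library rules above, the following sketch writes an Illumina sample sheet with two PE libraries for a single sample and then checks that `meryl_kmer_size` is consistent per sample. The sample name, file name and read paths are hypothetical placeholders, and the header row is an assumption based on the column names in the table; compare against the provided `build_illumina_example.tsv` before using it.

```shell
# Hypothetical output file name; the sheet can be saved anywhere.
sheet="build_illumina_mySample.tsv"

# Header row (assumed to match the column names in the table above).
printf 'sample\tLibrary_R1\tLibrary_R2\tmeryl_kmer_size\ttrim10X\ttrimAdapters\tfastQC\n' > "$sheet"

# One line per PE library. Both lines share the sample ID and k-mer size;
# trim10X and trimAdapters may differ between libraries.
printf 'mySample\t/path/to/lib1_R1.fastq.gz\t/path/to/lib1_R2.fastq.gz\t21\tFalse\tTrue\tTrue\n' >> "$sheet"
printf 'mySample\t/path/to/lib2_R1.fastq.gz\t/path/to/lib2_R2.fastq.gz\t21\tTrue\tTrue\tFalse\n' >> "$sheet"

# Sanity check: 7 tab-separated fields per line, and one k-mer size per sample ID.
awk -F'\t' '
  NF != 7 { print "line " NR ": expected 7 fields, got " NF; bad = 1 }
  NR > 1 {
    if (($1 in k) && k[$1] != $4) { print "inconsistent k-mer size for " $1; bad = 1 }
    k[$1] = $4
  }
  END { exit bad }
' "$sheet"
```

Because one library has `True` in the `fastQC` column, both libraries of `mySample` would be quality checked under the rule described above.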
+
+<br>
+
+---
+<div id="hifi-sample"></div>

-See [GEP/configuration/exampleSampleSheets/build_hifi_example.tsv](configuration/exampleSampleSheets/build_hifi_example.tsv) for an idea of how to set up the PacBio sample sheet.
+#### PacBio HiFi Sample Sheet `.tsv`
+(see example [GEP/configuration/exampleSampleSheets/build_hifi_example.tsv](configuration/exampleSampleSheets/build_hifi_example.tsv))

 |sample|hifi_reads|meryl_kmer_size| trimSMRTbell| fastQC|
 |:----|:---|:---|:----|:---|
-|Your identifier for the sample from which the provided WGS libraries belong| Full path to hifi library/s in `fastq` format.  Can be `.gz`ipped  | Your choice of k-mer size used to count k-mers in the reads. Recommended for PacBio Hifi is `31`| Check for and remove any SMRT-bell adapter sequences. Possible options are `True` or `False` | Run FastQC on the library provided in `hifi_reads`. Possible options are `True` or `False` |
+|Your preferred identifier (Results will include this sample name in output files)   | Full path to hifi library/s in `fastq` format.  Can be `.gz`ipped  | Your choice of k-mer size used to count k-mers in the reads. Recommended for PacBio Hifi is `31`| Check for and remove any SMRT-bell adapter sequences. Possible options are `True` or `False` | Run FastQC on the library provided in `hifi_reads`. Possible options are `True` or `False` |


-**With one of the above sample sheets complete, you can now run the database building step of GEP.**   See **Step 4.3** for how configure your run.

+<br>

-### **Step 4.2. Assembly Evaluation**
+Additional Info in case of multiple HiFi libraries for a single sample:
+ - **sample**: Provide the same sample ID for each HiFi library of a sample (the final output will be a single `.meryl` database consisting of k-mers from all HiFi libraries given for a sample, with this identifier as a prefix). Every line with the same sample ID will be considered as coming from the same sample.
+ - **hifi_reads**: Each library is provided as one line in the tsv.  If you have three HiFi libraries, then you will have three lines in the tsv for this sample.
+ - **meryl_kmer_size**:  This should be consistent for libraries that have the same sample ID.
+ - **trimSMRTbell**: You may select `True` or `False` for each library independent of whether they are part of the same sample or not.
+ - **fastQC**:  If any library from a sample has `True` in this column, then all libraries with the identical sample ID will have their quality checked with FastQC, even if these other libraries have `False` in this column.
+<br>
+
+---
+<span style="color:darkred">You may also wish to concatenate the libraries of a sample together prior to building the database.  In this case, you will only need to provide one line per sample in the respective sample sheets.  However, the run-time will likely increase, as the pipeline is designed to process multiple libraries in parallel.</span>
+
+---
+
+<div id="assembly-eval"></div>

 #### Assembly Evaluation Sample Sheet `.tsv`

 See [GEP/configuration/exampleSampleSheets/runEval_example.tsv](configuration/exampleSampleSheets/runEval_example.tsv) for an idea of how to set up the evaluation sample sheet.

-|ID|PRI_asm|ALT_asm| merylDB| merylDB_kmer |genomeSize|
-|:----|:---|:---|:----|:---|:---|
-|Identifier for results and reporting| Full path to primary assembly you wish to evaluate in`fasta` format.  Can be `.gz`ipped  | Full path to alternate assembly (haplotype) in`fasta` format.  Can be `.gz`ipped. **If you do not have one, write** `None`  | Full path to `.meryl` database | The k-mer size used to build your provided `.meryl` db | Provide a size estimate (in bp) for the corresponding assembly/species.  Can **leave blank** and it will be inferred during evaluation |
+
+|ID|PRI_asm|ALT_asm| merylDB| merylDB_kmer |genomeSize|HiC_R1|HiC_R2|
+|:----|:---|:---|:----|:---|:---|:---|:------------------|
+|Identifier for results and reporting| Full path to primary assembly you wish to evaluate in `fasta` format.  Can be `.gz`ipped  | Full path to alternate assembly (haplotype) in `fasta` format.  Can be `.gz`ipped. **If you do not have one, write** `None`  | Full path to `.meryl` database | The k-mer size used to build your provided `.meryl` db | Provide a size estimate (in bp) for the corresponding assembly/species.  **If you do not have one, write** `auto` |Full path to Forward (R1) read of HiC library in `fastq` format.  Can be `.gz`ipped |Full path to Reverse (R2) read of HiC library in `fastq` format.  Can be `.gz`ipped|
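
A minimal sketch of an evaluation sample sheet following the columns above, for an assembly with no alternate haplotype and no genome size estimate. The ID, file name and all paths are hypothetical placeholders, and the header row is assumed from the table; the provided `runEval_example.tsv` remains the reference for the exact layout.

```shell
# Hypothetical output file name; the sheet can be saved anywhere.
sheet="runEval_mySpecies.tsv"

# Header row (assumed to match the column names in the table above).
printf 'ID\tPRI_asm\tALT_asm\tmerylDB\tmerylDB_kmer\tgenomeSize\tHiC_R1\tHiC_R2\n' > "$sheet"

# One assembly: no alternate assembly (None), genome size left to GEP (auto).
printf 'mySpecies_v1\t/path/to/primary.fasta.gz\tNone\t/path/to/mySpecies.meryl\t21\tauto\t/path/to/hic_R1.fastq.gz\t/path/to/hic_R2.fastq.gz\n' >> "$sheet"

# Sanity check: 8 tab-separated fields and an integer k-mer size on every data line.
awk -F'\t' '
  NR > 1 && (NF != 8 || $5 !~ /^[0-9]+$/) { print "line " NR " looks malformed"; bad = 1 }
  END { exit bad }
' "$sheet"
```

The value in `merylDB_kmer` should be the k-mer size that was used when the `.meryl` database was built.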


 ### Configuration `.yaml`

 **Note** There are multiple `config.yaml` files found inside the GEP project directories. Unlike the above sample sheets, which can be saved in any location and with any filename you wish, these config files must always reside in their existing locations and their names should not be changed.

-First you must provide some run-specific information in [GEP/configuration/config.yaml](configuration/exampleSampleSheets/runEval_example.tsv)
+First you must provide some run-specific information in [GEP/configuration/config.yaml](configuration/config.yaml)

 ```
 Results:                # e.g. "/srv/public/users/james94/insecta_results_05_11_2021"
@@ -155,9 +212,11 @@ Once you have a sample sheet ready, you need to configure your GEP run.

 Modify the

-**Step 5. Running the workflow**
-
+## Running the workflow
+
+<div id="build-meryl-k-mer-databases"></div>

+## **Build meryl k-mer databases**

 *Make sure your GEP environment is activated.*