README

Commit bbe5ec73 authored 3 years ago by james94
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@

 * Run **multiple genome evaluations** in one go (as many as you want!)

-* Seamlessly **scaled to server, cluster, grid and cloud environments** 
+* Seamlessly **scaled to server, cluster, grid and cloud environments**

 * Required **software** **stack** **automatically deployed** to any execution environment using **snakemake** and **conda**

@@ -14,8 +14,19 @@

 # Getting Started

-**Step 1. Downloading the workflow**
-
+GEP can be run in two independent modes, with the inputs for each specified in its own sample sheet.  The general idea of the two modes is as follows:
+
+1. Create meryl database (Step 4.1)
+     - Inputs: WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
+     - Output: (`.meryl`) k-mer database  
+
+
+2. Run evaluation (Step 4.2)
+     - Inputs: (`.meryl`) k-mer database and corresponding genome assembly you wish to evaluate
+     - Output: Evaluation results and report
+
+## **Step 1. Downloading the workflow**
+

 To clone the repository, use the following command:
 ```
@@ -24,11 +35,11 @@ git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git

 ---

-**Step 2. Conda management**
-
+## **Step 2. Conda management**
+
 - Conda (v4.10.3)  *but may work on older versions*

-If you already have conda installed on your system, please **skip to step 3**
+*If you already have conda installed on your system, please skip to **step 3***

 Download the linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html

@@ -48,14 +59,15 @@ If  `conda command not found` please close and re-open your terminal for conda i

 ---

-**Step 3. Creating our Snakemake conda environment**
-
+## **Step 3. Creating our Snakemake conda environment**
+
 The pipeline requires the following software to run:
 - snakemake (6.6.1+)
 - python (3.9.1+)
 - tabulate (0.8.7+)
 - beautifulsoup4 (4.9+)
 - mamba (0.15.2)
+- pandoc (2.2.1) *most recent version 2.16.2 causes error*

 The easiest method to install this software stack is to create a GEP conda environment with the provided `installGEP.yaml` ***Note**

@@ -69,34 +81,83 @@ conda activate GEP

 snakemake --version
 ```
-**Note** If you already have a snakemake installation and would like to avoid installing it again, ensure that the above software are all in your `PATH` and you have conda installed/activated.
+**Note** *If you already have a snakemake installation and would like to avoid installing it again, ensure that the above software are all in your `PATH` and you have conda installed/activated.*

-**Step 4. Set up sample sheet and configuration file**
-
-***BELOW NOT COMPLETE***
+## **Step 4. SampleSheets `.tsv` and Config `.yaml`**
+
+
+GEP can be run in two independent modes, with the inputs for each specified in its own [sample sheet](configuration/exampleSampleSheets/).  The general idea of the two modes is as follows:

-GEP can be run in two modes:
-1. Create meryl database 
-     - Input: Sample sheet outlining either [Illumina PE](configuration/exampleSampleSheets/build_illumina_example.tsv) or [PacBio HiFi](configuration/exampleSampleSheets/build_hifi_example.tsv) reads
+
+1. Create meryl database
+     - Inputs: Sample sheet containing paths to WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
     - Output: (`.meryl`) k-mer database  
-     
-2. Run evaluation
-     - Input: Sample sheet outlining k-mer database (`.meryl`) and corresponding assembly [(example)](configuration/exampleSampleSheets/runEval_example.tsv)
+
+
+2. Assembly Evaluation
+     - Inputs: Sample sheet containing paths to the (`.meryl`) k-mer database and the corresponding genome assembly you wish to evaluate
     - Output: Evaluation results and report



+### **Step 4.1. Build k-mer databases**
+
+*If you already have a [meryl](https://github.com/marbl/merqury/wiki/1.-Prepare-meryl-dbs) k-mer database for each of the genomes you wish to evaluate, you can skip to step 4.2.*
+
+#### Illumina Sample Sheet `.tsv`
+
+If you are uncertain about which short-read libraries you should use, see [How to choose your Illumina libraries](#how-to-choose-your-illumina-libraries) further down the page.
+
+See [GEP/configuration/exampleSampleSheets/build_illumina_example.tsv](configuration/exampleSampleSheets/build_illumina_example.tsv) for an idea of how to set up the Illumina sample sheet.
+

+|sample|Library_R1|Library_R2|meryl_kmer_size| trim10X| trimAdapters| fastQC|
+|:----|:---|:---|:---|:---|:----|:---|
+|Your identifier for the sample to which the provided WGS libraries belong| Full path to the forward (R1) reads of the PE library/s in `fastq` format.  Can be `.gz`ipped | Full path to the reverse (R2) reads of the PE library/s in `fastq` format.  Can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads. The recommended value for Illumina reads is `21` | Remove the first 23bp from R1 of the library. Use this only if your reads were sequenced on the 10X sequencing platform. Possible options are `True` or `False` | Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are `True` or `False` | Run FastQC on the library pair provided in `Library_R1` and `Library_R2`. Possible options are `True` or `False`|

+#### PacBio HiFi Sample Sheet `.tsv`

-Save your sample sheet and provide its location to the `samplesTSV` key in the [config.yaml](configuration/config.yam), which can be found in the `configuration` folder.
+See [GEP/configuration/exampleSampleSheets/build_hifi_example.tsv](configuration/exampleSampleSheets/build_hifi_example.tsv) for an idea of how to set up the PacBio sample sheet.

-GEP will run in either meryl building mode or evaluation mode depending on which sample sheet you provide to the `config.yaml`
+|sample|hifi_reads|meryl_kmer_size| trimSMRTbell| fastQC|
+|:----|:---|:---|:----|:---|
+|Your identifier for the sample to which the provided WGS libraries belong| Full path to the HiFi library/s in `fastq` format.  Can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads. The recommended value for PacBio HiFi is `31` | Check for and remove any SMRT-bell adapter sequences. Possible options are `True` or `False` | Run FastQC on the library provided in `hifi_reads`. Possible options are `True` or `False` |
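
As a rough illustration of the two build-mode sheets described above, the following Python sketch writes one row for each. The column names are taken from the tables, with padding stripped (an assumption about the real headers). Every sample name and path is a hypothetical placeholder, and the helper `render_sheet` is not part of GEP.

```python
import csv
import io

# Column names copied from the tables above, with surrounding spaces stripped
# (assumption: the real sheets use the unpadded names).
ILLUMINA_COLUMNS = ["sample", "Library_R1", "Library_R2", "meryl_kmer_size",
                    "trim10X", "trimAdapters", "fastQC"]
HIFI_COLUMNS = ["sample", "hifi_reads", "meryl_kmer_size", "trimSMRTbell", "fastQC"]


def render_sheet(columns, rows):
    """Return rows (a list of dicts) as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample names and paths, for illustration only.
illumina_text = render_sheet(ILLUMINA_COLUMNS, [{
    "sample": "speciesA",
    "Library_R1": "/data/speciesA/lib1_R1.fastq.gz",
    "Library_R2": "/data/speciesA/lib1_R2.fastq.gz",
    "meryl_kmer_size": 21,      # value recommended above for Illumina reads
    "trim10X": "False",
    "trimAdapters": "True",
    "fastQC": "True",
}])

hifi_text = render_sheet(HIFI_COLUMNS, [{
    "sample": "speciesB",
    "hifi_reads": "/data/speciesB/hifi.fastq.gz",
    "meryl_kmer_size": 31,      # value recommended above for PacBio HiFi
    "trimSMRTbell": "True",
    "fastQC": "True",
}])

print(illumina_text)
print(hifi_text)
```

Writing the text to any `.tsv` file of your choosing would then give a sheet whose path can be supplied to the run configuration.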


+**With one of the above sample sheets complete, you can now run the database building step of GEP.**   See **Step 4.3** for how to configure your run.
+
+
+### **Step 4.2. Assembly Evaluation**
+
+#### Assembly Evaluation Sample Sheet `.tsv`
+
+See [GEP/configuration/exampleSampleSheets/runEval_example.tsv](configuration/exampleSampleSheets/runEval_example.tsv) for an idea of how to set up the evaluation sample sheet.
+
+|ID|PRI_asm|ALT_asm| merylDB| merylDB_kmer |genomeSize|
+|:----|:---|:---|:----|:---|:---|
+|Identifier for results and reporting| Full path to the primary assembly you wish to evaluate in `fasta` format.  Can be `.gz`ipped | Full path to the alternate assembly (haplotype) in `fasta` format.  Can be `.gz`ipped. **If you do not have one, write** `None` | Full path to the `.meryl` database | The k-mer size used to build your provided `.meryl` db | Provide a size estimate (in bp) for the corresponding assembly/species.  Can be **left blank**, in which case it will be inferred during evaluation |
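
The rules in the table above (`None` for a missing alternate assembly, an optional `genomeSize`) can be sketched as a small pre-flight check. This is only an illustration under assumptions: `check_eval_sheet` is a hypothetical helper, not GEP's own validation, the header is assumed to use the unpadded column names, and the paths are placeholders.

```python
import csv
import io

# Column names from the table above, with padding stripped (assumption).
EVAL_COLUMNS = ["ID", "PRI_asm", "ALT_asm", "merylDB", "merylDB_kmer", "genomeSize"]


def check_eval_sheet(text):
    """Return a list of human-readable problems found in the sheet text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = [name.strip() for name in (reader.fieldnames or [])]
    if header != EVAL_COLUMNS:
        return [f"unexpected header: {header}"]

    problems = []
    for line_no, raw in enumerate(reader, start=2):
        if None in raw:  # DictReader stores surplus fields under the key None
            problems.append(f"line {line_no}: too many columns")
            continue
        row = {key.strip(): (value or "").strip() for key, value in raw.items()}
        for column in ("ID", "PRI_asm", "merylDB"):
            if not row[column]:
                problems.append(f"line {line_no}: {column} is empty")
        if not row["ALT_asm"]:
            problems.append(f"line {line_no}: ALT_asm must be a path or None")
        if not row["merylDB_kmer"].isdigit():
            problems.append(f"line {line_no}: merylDB_kmer must be an integer")
        if row["genomeSize"] and not row["genomeSize"].isdigit():
            problems.append(f"line {line_no}: genomeSize must be blank or an integer")
    return problems


header_line = "\t".join(EVAL_COLUMNS) + "\n"
# Hypothetical rows: the first follows the rules, the second breaks two of them.
good_sheet = header_line + "speciesA\t/data/a_pri.fasta.gz\tNone\t/data/a.meryl\t21\t\n"
bad_sheet = header_line + "speciesB\t/data/b_pri.fasta\t\t/data/b.meryl\tabc\t\n"

print(check_eval_sheet(good_sheet))
print(check_eval_sheet(bad_sheet))
```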
+
+
+### **Step 4.3. Configuration `.yaml`**
+
+**Note** There are multiple `config.yaml` files found inside the GEP project directories. Unlike the above sample sheets, which can be saved in any location and with any filename you wish, these config files must always remain in their existing locations and their names should not be changed.  
+
+First you must provide some run-specific information in [GEP/configuration/config.yaml](configuration/config.yaml)
+
+```
+Results:                # e.g. "/srv/public/users/james94/insecta_results_05_11_2021"
+
+samplesTSV:             # e.g. "/srv/public/users/james94/GEP/configuration/buildPRI.tsv"
+
+busco5Lineage:          # e.g. "insecta"
+```
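
Before launching a run, it can help to confirm that none of the three keys shown above was left empty. The sketch below does this with plain string handling rather than a YAML parser, so it assumes simple `key: value` lines whose values contain no `#`. `unset_keys` is a hypothetical helper, and the filled-in paths are placeholders.

```python
REQUIRED_KEYS = ["Results", "samplesTSV", "busco5Lineage"]


def unset_keys(config_text):
    """Return the required keys that are absent or have no value."""
    values = {}
    for line in config_text.splitlines():
        # Drop trailing comments (assumes values never contain '#').
        content = line.split("#", 1)[0].strip()
        if ":" not in content:
            continue
        key, _, value = content.partition(":")
        values[key.strip()] = value.strip().strip('"')
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# The untouched template, as shown in the block above.
template = '''
Results:                # e.g. "/srv/public/users/james94/insecta_results_05_11_2021"

samplesTSV:             # e.g. "/srv/public/users/james94/GEP/configuration/buildPRI.tsv"

busco5Lineage:          # e.g. "insecta"
'''

# A filled-in version with placeholder paths.
filled = '''
Results: "/path/to/results"
samplesTSV: "/path/to/sample_sheet.tsv"
busco5Lineage: "insecta"
'''

print(unset_keys(template))
print(unset_keys(filled))
```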
+Once you have a sample sheet ready, you need to configure your GEP run.
+
+Modify the
+
 **Step 5. Running the workflow**
 -
-You should be inside the main GEP folder where the `Snakefile` is directly accessible.
+

 *Make sure your GEP environment is activated.*

@@ -148,21 +209,22 @@ The software/tools used as part of our genome evaluation are as follows:


 #### Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs)
-* BUSCOv4 			(*Seppey M., Manni M., Zdobnov E.M. (2019)* https://busco.ezlab.org/ )
+* BUSCOv5 			(*Seppey M., Manni M., Zdobnov E.M. (2019)* https://busco.ezlab.org/ )
+
+
+

 #### Scaffold/contig statistics: N# and L# stats, scaffold metrics, sequence counts, GC content, Estimated genome size
 * Python scripts (*Mike Trizna. assembly_stats 0.1.4 (Version 0.1.4). Zenodo. (2020)*.  http://doi.org/10.5281/zenodo.3968775 )

 #######################################################################
-# How to choose your illumina libraries
+# How to choose your Illumina libraries

-Variations in sequencing methods/protocols can lead to an increase in bias in the corresponding raw sequencing libraries.  Sequencing a biological sample may often consist of both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert-sizes 180, 250, 500, 800bp) paired-end libraries, respectively.  Usually you can deduce the insert sizes and library types from the metadata found within NCBI or and SRA archive.  In order to maintain a little bias as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.
-
-If your library/s was sequenced using 10x barcodes (10X Genomics), you should assign a value of `True` to the trim10x column in the relevant sample sheet.

+Variations in sequencing methods/protocols can lead to an increase in bias in the corresponding raw sequencing libraries.  The sequencing data for a biological sample often consists of both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert sizes of 180, 250, 500, 800bp) paired-end libraries.  

+Usually you can deduce the insert sizes and library types from the metadata found within NCBI or an SRA archive.  In order to keep bias as low as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.



 # Reporting
-