Commit bbe5ec73 authored by james94's avatar james94
README

parent ebf9f9c4
......@@ -4,7 +4,7 @@
* Run **multiple genome evaluations** in one go (as many as you want!)
* Seamlessly **scaled to server, cluster, grid and cloud environments**
* Seamlessly **scaled to server, cluster, grid and cloud environments**
* Required **software** **stack** **automatically deployed** to any execution environment using **snakemake** and **conda**
......@@ -14,8 +14,19 @@
# Getting Started
**Step 1. Downloading the workflow**
-
GEP can be run in two independent modes, and the inputs for each are specified in its own sample sheet. The general idea of the two modes is as follows:
1. Create meryl database (Step 4.1)
- Inputs: WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
- Output: (`.meryl`) k-mer database
2. Run evaluation (Step 4.2)
- Inputs: (`.meryl`) k-mer database and corresponding genome assembly you wish to evaluate
- Output: Evaluation results and report
## **Step 1. Downloading the workflow**
To clone the repository, use the following command:
```
......@@ -24,11 +35,11 @@ git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git
---
**Step 2. Conda management**
-
## **Step 2. Conda management**
- Conda (v4.10.3) *but may work on older versions*
If you already have conda installed on your system, please **skip to step 3**
*If you already have conda installed on your system, please skip to **step 3***
Download the linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html
......@@ -48,14 +59,15 @@ If `conda command not found` please close and re-open your terminal for conda i
---
**Step 3. Creating our Snakemake conda environment**
-
## **Step 3. Creating our Snakemake conda environment**
The pipeline requires the following software to run:
- snakemake (6.6.1+)
- python (3.9.1+)
- tabulate (0.8.7+)
- beautifulsoup4 (4.9+)
- mamba (0.15.2)
- pandoc (2.2.1) *most recent version 2.16.2 causes error*
The easiest method to install this software stack is to create a GEP conda environment with the provided `installGEP.yaml` ***Note**
......@@ -69,34 +81,83 @@ conda activate GEP
snakemake --version
```
**Note** If you already have a snakemake installation and would like to avoid installing it again, ensure that the above software are all in your `PATH` and you have conda installed/activated.
**Note** *If you already have a snakemake installation and would like to avoid installing it again, ensure that the above software are all in your `PATH` and you have conda installed/activated.*
**Step 4. Set up sample sheet and configuration file**
-
***BELOW NOT COMPLETE***
## **Step 4. SampleSheets `.tsv` and Config `.yaml`**
GEP can be run in two independent modes, and the inputs for each are specified in its own [sample sheet](configuration/exampleSampleSheets/). The general idea of the two modes is as follows:
GEP can be run in two modes:
1. Create meryl database
- Input: Sample sheet outlining either [Illumina PE](configuration/exampleSampleSheets/build_illumina_example.tsv) or [PacBio HiFi](configuration/exampleSampleSheets/build_hifi_example.tsv) reads
1. Create meryl database
- Inputs: Sample sheet containing paths to WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
- Output: (`.meryl`) k-mer database
2. Run evaluation
- Input: Sample sheet outlining k-mer database (`.meryl`) and corresponding assembly [(example)](configuration/exampleSampleSheets/runEval_example.tsv)
2. Assembly Evaluation
- Inputs: Sample sheet containing paths to (`.meryl`) k-mer database and corresponding genome assembly you wish to evaluate
- Output: Evaluation results and report
### **Step 4.1. Build k-mer databases**
*If you already have a [meryl](https://github.com/marbl/merqury/wiki/1.-Prepare-meryl-dbs) k-mer database for each of the genomes you wish to evaluate, you can skip to step 4.2.*
#### Illumina Sample Sheet `.tsv`
If you are unsure which short-read libraries to use, see [How to choose your Illumina libraries](#how-to-choose-your-illumina-libraries) further down the page.
See [GEP/configuration/exampleSampleSheets/build_illumina_example.tsv](configuration/exampleSampleSheets/build_illumina_example.tsv) for an idea of how to set up the illumina sample sheet.
|sample|Library_R1|Library_R2|meryl_kmer_size| trim10X| trimAdapters| fastQC|
|:----|:---|:---|:---|:---|:----|:---|
|Your identifier for the sample to which the provided WGS libraries belong| Full path to the forward (R1) reads of the PE library/libraries in `fastq` format. Can be `.gz`ipped | Full path to the reverse (R2) reads of the PE library/libraries in `fastq` format. Can be `.gz`ipped| Your choice of k-mer size used to count k-mers in the reads. Recommended for Illumina reads is `21`| Remove the first 23 bp from R1 of the library. Use this only if your reads were sequenced using the 10X sequencing platform. Possible options are `True` or `False` | Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are `True` or `False` |Run FastQC on the library pair provided in `Library_R1` and `Library_R2`. Possible options are `True` or `False`|
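
As a rough illustration of the columns described above, the following Python sketch writes a one-row Illumina sample sheet as tab-separated text. The sample name and read paths are hypothetical placeholders, and the exact header spelling should be checked against the provided `build_illumina_example.tsv` before use.

```python
import csv
import io

# Column names copied from the table above.
ILLUMINA_COLUMNS = ["sample", "Library_R1", "Library_R2", "meryl_kmer_size",
                    "trim10X", "trimAdapters", "fastQC"]


def illumina_sheet(rows):
    """Return tab-separated sample-sheet text for a list of row dicts."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=ILLUMINA_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample name and read paths, for illustration only.
example_rows = [{
    "sample": "speciesA",
    "Library_R1": "/path/to/reads/speciesA_R1.fastq.gz",
    "Library_R2": "/path/to/reads/speciesA_R2.fastq.gz",
    "meryl_kmer_size": 21,    # recommended k-mer size for Illumina reads
    "trim10X": False,         # True only for 10X-sequenced reads
    "trimAdapters": True,
    "fastQC": True,
}]

sheet_text = illumina_sheet(example_rows)
print(sheet_text, end="")
```

To save the sheet, write `sheet_text` to any file name and location you like; only its path needs to go into the config file.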
#### PacBio HiFi Sample Sheet `.tsv`
Save your sample sheet and provide its location to the `samplesTSV` key in the [config.yaml](configuration/config.yaml), which can be found in the `configuration` folder.
See [GEP/configuration/exampleSampleSheets/build_hifi_example.tsv](configuration/exampleSampleSheets/build_hifi_example.tsv) for an idea of how to set up the PacBio sample sheet.
GEP will run in either meryl building mode or evaluation mode depending on which sample sheet you provide to the `config.yaml`
|sample|hifi_reads|meryl_kmer_size| trimSMRTbell| fastQC|
|:----|:---|:---|:----|:---|
|Your identifier for the sample to which the provided WGS libraries belong| Full path to the HiFi library/libraries in `fastq` format. Can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads. Recommended for PacBio HiFi is `31`| Check for and remove any SMRT-bell adapter sequences. Possible options are `True` or `False` | Run FastQC on the library provided in `hifi_reads`. Possible options are `True` or `False` |
**With one of the above sample sheets complete, you can now run the database building step of GEP.** See **Step 4.3** for how to configure your run.
### **Step 4.2. Assembly Evaluation**
#### Assembly Evaluation Sample Sheet `.tsv`
See [GEP/configuration/exampleSampleSheets/runEval_example.tsv](configuration/exampleSampleSheets/runEval_example.tsv) for an idea of how to set up the evaluation sample sheet.
|ID|PRI_asm|ALT_asm| merylDB| merylDB_kmer |genomeSize|
|:----|:---|:---|:----|:---|:---|
|Identifier for results and reporting| Full path to the primary assembly you wish to evaluate, in `fasta` format. Can be `.gz`ipped | Full path to the alternate assembly (haplotype) in `fasta` format. Can be `.gz`ipped. **If you do not have one, write** `None` | Full path to the `.meryl` database | The k-mer size used to build your provided `.meryl` db | A size estimate (in bp) for the corresponding assembly/species. Can be **left blank**, in which case it will be inferred during evaluation |
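
Since the three sample sheets differ only in their header columns, a quick check of the header tells you which kind of sheet you have written. The sketch below is a hypothetical helper for sanity-checking your own files. It is not how GEP itself selects a mode, and it assumes the headers are spelled exactly as in the tables above.

```python
import csv
import io

# Header columns as listed in the three tables above.
BUILD_ILLUMINA = {"sample", "Library_R1", "Library_R2", "meryl_kmer_size",
                  "trim10X", "trimAdapters", "fastQC"}
BUILD_HIFI = {"sample", "hifi_reads", "meryl_kmer_size", "trimSMRTbell", "fastQC"}
EVALUATION = {"ID", "PRI_asm", "ALT_asm", "merylDB", "merylDB_kmer", "genomeSize"}


def guess_sheet_type(sheet_text):
    """Classify a tab-separated sample sheet by comparing its header row."""
    reader = csv.reader(io.StringIO(sheet_text), delimiter="\t")
    header = {name.strip() for name in next(reader)}
    if header == BUILD_ILLUMINA:
        return "build (Illumina PE)"
    if header == BUILD_HIFI:
        return "build (PacBio HiFi)"
    if header == EVALUATION:
        return "evaluation"
    raise ValueError(f"Unrecognised sample sheet header: {sorted(header)}")


# Hypothetical evaluation sheet: no alternate assembly, genome size left blank.
example = ("ID\tPRI_asm\tALT_asm\tmerylDB\tmerylDB_kmer\tgenomeSize\n"
           "speciesA\t/path/to/speciesA.fasta\tNone\t/path/to/speciesA.meryl\t21\t\n")
print(guess_sheet_type(example))
```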
### **Step 4.3. Configuration `.yaml`**
**Note** There are multiple `config.yaml` files inside the GEP project directories. Unlike the above sample sheets, which can be saved in any location and under any filename you wish, these config files must always remain in their existing locations and must not be renamed.
First you must provide some run-specific information in [GEP/configuration/config.yaml](configuration/config.yaml)
```
Results: # e.g. "/srv/public/users/james94/insecta_results_05_11_2021"
samplesTSV: # e.g. "/srv/public/users/james94/GEP/configuration/buildPRI.tsv"
busco5Lineage: # e.g. "insecta"
```
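
Before starting a run, it can help to confirm that all three keys shown above have actually been filled in. The following is a minimal sketch of such a check. It is a hypothetical helper rather than part of GEP, and it uses simple line splitting instead of a YAML parser, so it assumes plain `key: value` lines with no `#` inside the values.

```python
# Keys taken from the config snippet above.
REQUIRED_KEYS = ("Results", "samplesTSV", "busco5Lineage")


def missing_config_values(config_text):
    """Return the required keys that are absent or have no value."""
    found = {}
    for line in config_text.splitlines():
        # Drop trailing comments, then split "key: value" on the first colon.
        content = line.split("#", 1)[0]
        if ":" not in content:
            continue
        key, value = content.split(":", 1)
        found[key.strip()] = value.strip().strip('"')
    return [key for key in REQUIRED_KEYS if not found.get(key)]


# The unfilled template: every key is present but has only a comment after it.
template = (
    'Results:         # e.g. "/srv/public/users/james94/insecta_results_05_11_2021"\n'
    'samplesTSV:      # e.g. "/srv/public/users/james94/GEP/configuration/buildPRI.tsv"\n'
    'busco5Lineage:   # e.g. "insecta"\n'
)

# The same file with the example values from the comments filled in.
filled = (
    'Results: "/srv/public/users/james94/insecta_results_05_11_2021"\n'
    'samplesTSV: "/srv/public/users/james94/GEP/configuration/buildPRI.tsv"\n'
    'busco5Lineage: "insecta"\n'
)

print(missing_config_values(template))
print(missing_config_values(filled))
```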
Once you have a sample sheet ready, you need to configure your GEP run.
Modify the
**Step 5. Running the workflow**
-
You should be inside the main GEP folder where the `Snakefile` is directly accessible.
*Make sure your GEP environment is activated.*
......@@ -148,21 +209,22 @@ The software/tools used as part of our genome evaluation are as follows:
#### Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs)
* BUSCOv4 (*Seppey M., Manni M., Zdobnov E.M. (2019)* https://busco.ezlab.org/ )
* BUSCOv5 (*Seppey M., Manni M., Zdobnov E.M. (2019)* https://busco.ezlab.org/ )
#### Scaffold/contig statistics: N# and L# stats, scaffold metrics, sequence counts, GC content, Estimated genome size
* Python scripts (*Mike Trizna. assembly_stats 0.1.4 (Version 0.1.4). Zenodo. (2020)*. http://doi.org/10.5281/zenodo.3968775 )
#######################################################################
# How to choose your illumina libraries
# How to choose your Illumina libraries
Variations in sequencing methods/protocols can lead to an increase in bias in the corresponding raw sequencing libraries. Sequencing a biological sample may often consist of both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert-sizes 180, 250, 500, 800bp) paired-end libraries, respectively. Usually you can deduce the insert sizes and library types from the metadata found within NCBI or an SRA archive. In order to keep bias as low as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.
If your libraries were sequenced using 10x barcodes (10X Genomics), you should assign a value of `True` to the `trim10X` column in the relevant sample sheet.
Variations in sequencing methods/protocols can lead to an increase in bias in the corresponding raw sequencing libraries. Sequencing a biological sample may often consist of both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert-sizes 180, 250, 500, 800bp) paired-end libraries, respectively.
Usually you can deduce the insert sizes and library types from the metadata found within NCBI or an SRA archive. In order to keep bias as low as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.
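
If you have collected insert sizes from the archive metadata, the selection described above amounts to a simple filter. The sketch below is illustrative only. The library names are hypothetical, and the 1000 bp cut-off is an assumption chosen to sit between the short-insert (180 to 800 bp) and long-insert (5k bp and above) examples given in the text, not a value defined by GEP.

```python
# Assumed cut-off between short-insert and mate-pair/long-insert libraries.
# This is an illustrative choice, not a GEP setting.
MAX_SHORT_INSERT_BP = 1000


def short_insert_libraries(libraries):
    """Keep the names of libraries whose insert size is at or below the cut-off."""
    return [name for name, insert_bp in libraries
            if insert_bp <= MAX_SHORT_INSERT_BP]


# Hypothetical (library name, insert size in bp) pairs taken from metadata.
libraries = [("lib_180", 180), ("lib_500", 500), ("lib_5k", 5000), ("lib_20k", 20000)]
selected = short_insert_libraries(libraries)
print(selected)
```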
# Reporting