Genome Evaluation Pipeline (GEP)
- User-friendly and all-in-one quality control and evaluation pipeline for genome assemblies
- Run multiple genome evaluations in one go (as many as you want!)
- Seamlessly scaled to server, cluster, grid and cloud environments
- Required software stack automatically deployed to any execution environment using snakemake and conda
Getting Started
GEP can be run in two independent modes, the inputs for which are specified in their respective sample sheets. The general idea of these two modes is as follows:
- Create meryl database (Step 4.1)
  - Inputs: WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
  - Output: (.meryl) k-mer database
- Run evaluation (Step 4.2)
  - Inputs: (.meryl) k-mer database and corresponding genome assembly you wish to evaluate
  - Output: Evaluation results and report
Step 1. Downloading the workflow
To clone the repository, use the following command:
git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git
Step 2. Conda management
- Conda (v4.10.3), although older versions may also work
If you already have conda installed on your system, please skip to Step 3.
Download the Linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html
Run the miniconda3 installation and check if it worked:
bash /<your_path_to>/Miniconda3-latest-Linux-x86_64.sh
##Follow miniconda3 installation instructions##
source ~/.bashrc
conda update conda
If the conda command is not found, please close and re-open your terminal for the conda installation to take effect, and then update.
Step 3. Creating our Snakemake conda environment
The pipeline requires the following software to run:
- snakemake (6.6.1+)
- python (3.9.1+)
- tabulate (0.8.7+)
- beautifulsoup4 (4.9+)
- mamba (0.15.2)
- pandoc (2.2.1); the most recent version (2.16.2) causes an error
The easiest method to install this software stack is to create a GEP conda environment with the provided installGEP.yaml
conda env create -f /<your_path_to>/GEP/installGEP.yaml
conda activate GEP
##check snakemake installed correctly
snakemake --version
Note: If you already have a snakemake installation and would like to avoid installing it again, ensure that all of the above software is in your PATH and that you have conda installed/activated.
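If you are reusing an existing installation, a quick way to confirm that the command-line tools are reachable is to look them up on your PATH. The following is a minimal sketch, not part of GEP itself; the list of executable names is an assumption based on the software list above, and the Python packages (tabulate, beautifulsoup4) are not covered because they are not executables.

```python
import shutil

# Assumed executable names for the command-line tools listed above.
REQUIRED_TOOLS = ["conda", "mamba", "snakemake", "pandoc"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found in PATH:", ", ".join(missing))
else:
    print("All required tools were found in PATH.")
```

The `which` parameter only exists so the lookup can be swapped out when testing; in normal use the default `shutil.which` is what you want.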
Step 4. Sample sheets (.tsv) and Config (.yaml)
GEP can be run in two independent modes, the inputs for which are specified in their respective sample sheets. The general idea of these two modes is as follows:
- Create meryl database
  - Inputs: Sample sheet containing paths to WGS sequencing libraries (PacBio Sequel II/HiFi or Illumina PE short-insert)
  - Output: (.meryl) k-mer database
- Assembly Evaluation
  - Inputs: Sample sheet containing paths to (.meryl) k-mer database and corresponding genome assembly you wish to evaluate
  - Output: Evaluation results and report
Step 4.1. Build k-mer databases
If you already have a meryl k-mer database for each of the genomes you wish to evaluate, you can skip to Step 4.2.
Illumina Sample Sheet (.tsv)
If you are a little uncertain about which short-read libraries you should use, see "How to choose your Illumina libraries" further down the page.
See GEP/configuration/exampleSampleSheets/build_illumina_example.tsv for an idea of how to set up the Illumina sample sheet.
sample | Library_R1 | Library_R2 | meryl_kmer_size | trim10X | trimAdapters | fastQC |
---|---|---|---|---|---|---|
Your identifier for the sample to which the provided WGS libraries belong | Full path to the forward (R1) reads of the PE library/libraries in fastq format. Can be gzipped (.gz) | Full path to the reverse (R2) reads of the PE library/libraries in fastq format. Can be gzipped (.gz) | Your choice of k-mer size used to count k-mers in the reads. The recommended size for Illumina reads is 21 | Remove the first 23 bp from R1 of the library. Use this only if your reads were sequenced using the 10X sequencing platform. Possible options are True or False | Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are True or False | Run FastQC on the library pair provided in Library_R1 and Library_R2. Possible options are True or False |
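As an illustration of the columns described above, the following sketch writes one Illumina row as tab-separated text. It assumes the sheet is tab-delimited with a header row that uses the column names exactly as shown; the provided build_illumina_example.tsv remains the authoritative template. The sample name and paths are hypothetical.

```python
import csv
import io

ILLUMINA_COLUMNS = [
    "sample", "Library_R1", "Library_R2", "meryl_kmer_size",
    "trim10X", "trimAdapters", "fastQC",
]


def illumina_sheet(rows):
    """Render rows (dicts keyed by column name) as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=ILLUMINA_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample name and paths, for illustration only.
example = illumina_sheet([{
    "sample": "speciesA",
    "Library_R1": "/path/to/speciesA_R1.fastq.gz",
    "Library_R2": "/path/to/speciesA_R2.fastq.gz",
    "meryl_kmer_size": 21,      # recommended k-mer size for Illumina reads
    "trim10X": False,           # only True for 10X-sequenced reads
    "trimAdapters": True,
    "fastQC": True,
}])
print(example)
```

Writing the returned text to a file of your choice gives a sheet whose path can then be supplied when configuring the run.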
PacBio HiFi Sample Sheet (.tsv)
See GEP/configuration/exampleSampleSheets/build_hifi_example.tsv for an idea of how to set up the PacBio sample sheet.
sample | hifi_reads | meryl_kmer_size | trimSMRTbell | fastQC |
---|---|---|---|---|
Your identifier for the sample to which the provided WGS libraries belong | Full path to the HiFi library/libraries in fastq format. Can be gzipped (.gz) | Your choice of k-mer size used to count k-mers in the reads. The recommended size for PacBio HiFi reads is 31 | Check for and remove any SMRTbell adapter sequences. Possible options are True or False | Run FastQC on the library provided in hifi_reads. Possible options are True or False |
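Before starting a run, it can help to check a HiFi sample sheet for the most common slips: a wrong header, a non-numeric k-mer size, or an option that is not exactly True or False. The sketch below is an independent pre-flight check, not GEP's own validation, and assumes a tab-delimited sheet with a header row matching the column names above. The example path is hypothetical.

```python
import csv
import io

HIFI_COLUMNS = ["sample", "hifi_reads", "meryl_kmer_size", "trimSMRTbell", "fastQC"]
BOOLEAN_COLUMNS = ["trimSMRTbell", "fastQC"]


def check_hifi_sheet(text):
    """Return a list of human-readable problems found in a HiFi sample sheet."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != HIFI_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for row in reader:
        line = reader.line_num  # line number of the row just read
        if not (row["meryl_kmer_size"] or "").isdigit():
            problems.append(f"line {line}: meryl_kmer_size is not an integer")
        for column in BOOLEAN_COLUMNS:
            if row[column] not in ("True", "False"):
                problems.append(f"line {line}: {column} must be True or False")
    return problems


# Hypothetical sheet content, for illustration only.
sheet = (
    "sample\thifi_reads\tmeryl_kmer_size\ttrimSMRTbell\tfastQC\n"
    "speciesA\t/path/to/speciesA_hifi.fastq.gz\t31\tTrue\tFalse\n"
)
for problem in check_hifi_sheet(sheet):
    print(problem)
```

An empty result means none of these particular checks failed; it does not guarantee that the read files exist or are valid fastq.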
With one of the above sample sheets complete, you can now run the database building step of GEP. See Step 4.3 for how to configure your run.
Step 4.2. Assembly Evaluation
Assembly Evaluation Sample Sheet (.tsv)
See GEP/configuration/exampleSampleSheets/runEval_example.tsv for an idea of how to set up the evaluation sample sheet.
ID | PRI_asm | ALT_asm | merylDB | merylDB_kmer | genomeSize |
---|---|---|---|---|---|
Identifier for results and reporting | Full path to the primary assembly you wish to evaluate, in fasta format. Can be gzipped (.gz) | Full path to the alternate assembly (haplotype), in fasta format. Can be gzipped (.gz). If you do not have one, write None | Full path to the .meryl database | The k-mer size used to build your provided .meryl database | A size estimate (in bp) for the corresponding assembly/species. Can be left blank, in which case it will be inferred during evaluation |
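The two optional fields in this sheet (ALT_asm written as None, and a blank genomeSize) are easy to get wrong, so here is a small sketch of how such rows could be read and normalised. It is not GEP's own parser; it assumes a tab-delimited sheet with a header row using the column names above, and the record type `EvalEntry` and all paths are hypothetical.

```python
import csv
import io
from typing import NamedTuple, Optional


class EvalEntry(NamedTuple):
    ID: str
    PRI_asm: str
    ALT_asm: Optional[str]      # None when no alternate assembly is given
    merylDB: str
    merylDB_kmer: int
    genomeSize: Optional[int]   # None when left blank (to be inferred)


def read_eval_sheet(text):
    """Parse evaluation sample sheet text into EvalEntry records."""
    entries = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        alt = row["ALT_asm"].strip()
        size = (row["genomeSize"] or "").strip()
        entries.append(EvalEntry(
            ID=row["ID"],
            PRI_asm=row["PRI_asm"],
            ALT_asm=None if alt == "None" else alt,
            merylDB=row["merylDB"],
            merylDB_kmer=int(row["merylDB_kmer"]),
            genomeSize=int(size) if size else None,
        ))
    return entries


# Hypothetical sheet: no alternate assembly and no genome size estimate.
sheet = (
    "ID\tPRI_asm\tALT_asm\tmerylDB\tmerylDB_kmer\tgenomeSize\n"
    "speciesA\t/path/to/pri.fasta\tNone\t/path/to/speciesA.meryl\t21\t\n"
)
print(read_eval_sheet(sheet))
```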
Step 4.3. Configuration (.yaml)
Note: There are multiple config.yaml files found inside the GEP project directories. Unlike the above sample sheets, which can be saved in any location and with any filename you wish, these config files must always reside in their existing locations and their names should not be changed.
First, you must provide some run-specific information in GEP/configuration/config.yaml:
Results: # e.g. "/srv/public/users/james94/insecta_results_05_11_2021"
samplesTSV: # e.g. "/srv/public/users/james94/GEP/configuration/buildPRI.tsv"
busco5Lineage: # e.g. "insecta"
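To illustrate what the three filled-in values might look like, the sketch below renders them as quoted YAML lines that can be compared against, or pasted into, the existing file. It is only an assumption that these three keys are the ones you need to set; the real config.yaml may contain other keys, so avoid overwriting the file wholesale. The function name `render_config` is hypothetical and the paths are the examples from the comments above.

```python
import json


def render_config(results_dir, samples_tsv, busco_lineage):
    """Return YAML lines for the three run-specific keys shown above."""
    values = {
        "Results": results_dir,
        "samplesTSV": samples_tsv,
        "busco5Lineage": busco_lineage,
    }
    # json.dumps yields double-quoted strings, which are also valid YAML scalars.
    return "".join(f"{key}: {json.dumps(value)}\n" for key, value in values.items())


print(render_config(
    "/srv/public/users/james94/insecta_results_05_11_2021",
    "/srv/public/users/james94/GEP/configuration/buildPRI.tsv",
    "insecta",
))
```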
Once you have a sample sheet ready, you need to configure your GEP run.
Modify the
Step 5. Running the workflow
Make sure your GEP environment is activated.
First, you should run GEP in dry-run mode:
snakemake -n
This will check whether any of your parameters/paths have been modified incorrectly. Further, it will install all the necessary environments to be utilised by the workflow, as well as download the busco5 database if it doesn't already exist. Unfortunately, when downloading the busco5 database, there will be lots of output in the terminal, a product of the limitations of the wget command used for downloading.
After the dry-run and downloading have completed, you can simply run the full pipeline with:
Where --cores # is the maximum number of cores (synonymous with threads in this case) you want to utilise.
For example if you run snakemake with the command:
Citations
#######################################################################
The software/tools used as part of our genome evaluation are as follows:
Pre-Processing (least biased short-read dataset available):
- Trimmomatic (Bolger, A. M., Lohse, M., & Usadel, B. (2014). http://www.usadellab.org/cms/?page=trimmomatic)
- Trim_galore (Felix Krueger bioinformatics.babraham.ac.uk)
- Fastqc (Simon Andrews https://github.com/s-andrews/FastQC)
- Multiqc (Ewels, P., Magnusson, M., Lundin, S., Käller, M. (2016). https://doi.org/10.1093/bioinformatics/btw354)
Reference-free Genome Profiling
- GenomeScope2 (Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. (2020) https://github.com/tbenavi1/genomescope2.0)
K-mer distribution (copy-number spectra) analysis
- meryl (Rhie, A., Walenz, B.P., Koren, S. et al. (2020). https://doi.org/10.1186/s13059-020-02134-9)
- merqury (Rhie, A., Walenz, B.P., Koren, S. et al. (2020). https://doi.org/10.1186/s13059-020-02134-9)
Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs)
- BUSCOv5 (Seppey M., Manni M., Zdobnov E.M. (2019) https://busco.ezlab.org/ )
Scaffold/contig statistics: N# and L# stats, scaffold metrics, sequence counts, GC content, Estimated genome size
- Python scripts (Mike Trizna. assembly_stats 0.1.4 (Version 0.1.4). Zenodo. (2020). http://doi.org/10.5281/zenodo.3968775 )
#######################################################################
How to choose your Illumina libraries
Variations in sequencing methods/protocols can lead to an increase in bias in the corresponding raw sequencing libraries. Sequencing a biological sample may often consist of both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert-sizes 180, 250, 500, 800bp) paired-end libraries, respectively.
Usually you can deduce the insert sizes and library types from the metadata found within NCBI or an SRA archive. In order to maintain as little bias as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.
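The selection rule described above can be sketched as a simple filter over library metadata that you have collected by hand. Everything here is illustrative: the `Library` record, the library names, and in particular the 1000 bp cut-off are assumptions, chosen only because the short-insert examples above go up to 800 bp and the long-insert examples start at 5 kbp.

```python
from typing import NamedTuple


class Library(NamedTuple):
    name: str
    layout: str        # e.g. "paired-end" or "mate-pair"
    insert_size: int   # nominal insert size in bp, taken from the run metadata


# Assumed boundary between short-insert and long-insert libraries.
MAX_SHORT_INSERT_BP = 1000


def short_insert_libraries(libraries, max_insert=MAX_SHORT_INSERT_BP):
    """Keep paired-end libraries whose insert size is at most max_insert."""
    return [
        lib for lib in libraries
        if lib.layout == "paired-end" and lib.insert_size <= max_insert
    ]


# Hypothetical libraries, for illustration only.
candidates = [
    Library("lib_180", "paired-end", 180),
    Library("lib_500", "paired-end", 500),
    Library("lib_5k", "mate-pair", 5000),
    Library("lib_20k", "mate-pair", 20000),
]
for lib in short_insert_libraries(candidates):
    print(lib.name)
```

If coverage from the remaining libraries is too low, that is a trade-off to weigh yourself; the filter only encodes the preference for short-insert paired-end data.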