# Genome Evaluation Pipeline (GEP)

- User-friendly, all-in-one quality-control and evaluation Snakemake pipeline for genome assemblies
- Run multiple genome evaluations in parallel (as many as you want!)
- Scales to server and HPC environments
- Required software stack automatically deployed using Conda
## The GEP Workflow - Two Modes

### 1. Build Mode

Builds meryl k-mer databases from sequencing reads. If you have already generated these databases, skip to Evaluate Mode.

Requires:
- Raw Illumina short-insert whole-genome sequencing reads
  and/or
- Raw PacBio HiFi sequencing reads

Process and tools:
- (Optionally) trim raw sequencing reads
  - Trim Galore (remove Illumina adapters) and Trimmomatic (remove 10x barcodes if present) for Illumina reads
  - Cutadapt (remove SMRTbell adapters) for HiFi reads
- (Optionally) perform a quality check of the raw sequencing reads
  - FastQC and MultiQC
- Build k-mer databases from the raw sequencing reads to be used for evaluation
  - Meryl
### 2. Evaluate Mode

Runs numerous whole-genome assembly evaluation tools, generating a user-friendly report for easy viewing and distribution (as well as the individual outputs normally generated by the tools, in a well-defined folder structure).

Requires:
- Meryl k-mer databases (as produced by Build Mode)
- Genome assembly(s) to evaluate - including (optionally) a phased alternate assembly
- (Optionally) Hi-C libraries
- (Optionally) Estimated genome sizes

Process and tools:
- K-mer completeness/correctness, QV score, spectra-CN (k-mer multiplicity) plots
  - Merqury
- Genome profiling based on raw sequencing reads
  - GenomeScope2 and Smudgeplot
- Extensive contiguity statistics such as N# and L# stats, scaffold and contig counts, GC content, etc.
  - Gfastats
- Gene completeness/correctness (i.e. recovered complete, duplicated, and missing BUSCOs)
  - BUSCO5
- (Optionally) Hi-C contact maps
  - BWA-MEM2 and samtools for aligning Hi-C data
  - Pretext for generating contact maps
## Downloading the workflow

To clone the repository, use the following command:

```
git clone https://git.imp.fu-berlin.de/begendiv/gep.git
```
## Installing Conda

- Conda (v4.11+; may work on older versions)

If you already have conda installed on your system, please skip to *Creating our GEP (Snakemake) Conda environment*.

Download the Linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html

Run the Miniconda3 installation and check that it worked:

```
bash /<your_path_to>/Miniconda3-latest-Linux-x86_64.sh
## Follow the Miniconda3 installation instructions ##
source ~/.bashrc
conda update conda
```

If the `conda` command is not found, please close and re-open your terminal for the conda installation to take effect, and then update.
## Creating our GEP (Snakemake) Conda environment

The pipeline requires the following software to run (some versions need not match exactly):
- snakemake (v7.8.5)
- beautifulsoup4 (v4.9.3)
- pandas (v1.4.2)
- numpy (v1.22.3)
- python (v3.10)
- mamba (v0.25)

The easiest way to install this software stack is to create a GEP conda environment from the provided installGEP.yaml (see *Note below):

```
conda env create -f /<your_path_to>/GEP/installGEP.yaml
conda activate GEP

## Check that snakemake installed correctly
snakemake --version
```

You may need to close and reopen your terminal after the `conda env create ...` command, or use `source ~/.bashrc`, depending on your system.

*Note: If you already have a snakemake (or suitable Python) installation and would like to avoid installing again, ensure that all of the above software is in your PATH. If you do this instead of installing from the provided GEP environment (installGEP.yaml), you will still need at least the base conda installed/activated, as it is required to handle the software dependencies of all the tools used within the workflow itself.
## Configure Sample Sheets (.tsv)

**IMPORTANT**: GEP can run with only one of the below sample sheets at a given time. A sample sheet can, however, include multiple different samples, datasets, and assemblies for a given run.

Depending on which sample sheet is provided, GEP will automatically run in one of the following two modes:

- **Build** - This mode is run if either the Illumina sample sheet (.tsv) or the PacBio HiFi sample sheet (.tsv) is provided to GEP.
- **Evaluate** - This mode is run if the assembly evaluation sample sheet (.tsv) is provided.

If you already have meryl k-mer databases (see the meryl github for more details) for the genomes you wish to evaluate, you can skip Build mode.
### Illumina sample sheet (.tsv) - for Build mode

See the example GEP/configuration/exampleSampleSheets/build_illumina_example.tsv.

| sample | Library_R1 | Library_R2 | meryl_kmer_size | trim10X | trimAdapters | fastQC |
|---|---|---|---|---|---|---|
| Your preferred identifier (results will include this sample name in output files); string | Full path to the forward (R1) read of a PE library in fastq format; can be `.gz`ipped | Full path to the reverse (R2) read of a PE library in fastq format; can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads; 21 is recommended for Illumina reads | Remove the first 23 bp from R1 of the library; only needed if your reads were sequenced on a 10X platform; possible options are True or False | Check for and remove any other sequencing adapters that may still be present in the reads; possible options are True or False | Run FastQC on the library pair provided in the Library_R1 and Library_R2 columns; possible options are True or False |
Additional info in case of multiple PE libraries for a single sample:

- sample: Provide the same sample ID for each of the pairs of a sample (the final output will be a single .meryl database consisting of k-mers from all PE libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample.
- Library_R1 and Library_R2: Each library pair is provided as one line in the tsv. If you have three PE libraries, then you will have three lines in the tsv for this sample.
- meryl_kmer_size: This should be consistent for libraries that have the same sample ID.

If you are a little uncertain about which short-read libraries you should use, see How to choose your Illumina libraries further down the page.
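For illustration, the commands below write a minimal two-library sheet for one sample and sanity-check that every row has the seven expected tab-separated columns. All paths are hypothetical, and you should compare against the shipped example file (build_illumina_example.tsv) for the exact expected layout:

```shell
# Hypothetical two-library Illumina sample sheet; columns are:
# sample, Library_R1, Library_R2, meryl_kmer_size, trim10X, trimAdapters, fastQC.
# Both lines share the sample ID "mySample", so both libraries feed one meryl DB.
printf 'mySample\t/data/lib1_R1.fastq.gz\t/data/lib1_R2.fastq.gz\t21\tFalse\tTrue\tTrue\n'  > build_illumina.tsv
printf 'mySample\t/data/lib2_R1.fastq.gz\t/data/lib2_R2.fastq.gz\t21\tFalse\tTrue\tTrue\n' >> build_illumina.tsv

# Sanity-check: every row must have exactly 7 tab-separated fields.
awk -F'\t' 'NF != 7 { exit 1 }' build_illumina.tsv && echo "sample sheet OK"
```

The `awk` check is a quick way to catch a missed column or an accidental space-instead-of-tab before handing the sheet to GEP.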
### PacBio HiFi sample sheet (.tsv) - for Build mode

See the example GEP/configuration/exampleSampleSheets/build_hifi_example.tsv.

| sample | hifi_reads | meryl_kmer_size | trimSMRTbell | fastQC |
|---|---|---|---|---|
| Your preferred identifier (results will include this sample name in output files); string | Full path to a PacBio HiFi library in fastq format; can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads; 31 is recommended for PacBio HiFi | Check for and remove any SMRTbell adapter sequences; possible options are True or False | Run FastQC on the library provided in the hifi_reads column; possible options are True or False |
Additional info in case of multiple HiFi libraries for a single sample:

- sample: Provide the same sample ID for each HiFi library of a sample (the final output will be a single .meryl database consisting of k-mers from all HiFi libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample.
- hifi_reads: Each library is provided as one line in the tsv. If you have three HiFi libraries for the same sample, then you will have three lines in the tsv for this sample.
- meryl_kmer_size: This should be consistent for libraries that have the same sample ID.

Note: You may also wish to concatenate the libraries of a sample together prior to building the database. In this case, you will only need to provide one line per sample in the respective sample sheet. However, execution run-time will suffer, as the pipeline is designed to run on multiple libraries in parallel.
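If you do opt for concatenation, gzip streams can be concatenated directly, so a plain `cat` of the `.gz` files is enough. A sketch with tiny stand-in files (real HiFi library paths are hypothetical):

```shell
# Two tiny dummy gzipped FASTQ files standing in for HiFi libraries.
printf '@read1\nACGT\n+\nIIII\n' | gzip > lib1.hifi.fastq.gz
printf '@read2\nTTTT\n+\nJJJJ\n' | gzip > lib2.hifi.fastq.gz

# Concatenated gzip members form one valid gzip file:
# one sample, one line in the sample sheet.
cat lib1.hifi.fastq.gz lib2.hifi.fastq.gz > sample.hifi.fastq.gz

# Both reads survive: 2 records x 4 FASTQ lines = 8 lines.
gunzip -c sample.hifi.fastq.gz | wc -l
```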
### Assembly evaluation sample sheet (.tsv) - for Evaluate mode

See GEP/configuration/exampleSampleSheets/runEval_example.tsv for an idea of how to set up the evaluation sample sheet.

| ID | ASM_LEVEL | busco_lineage | PRI_asm | ALT_asm | merylDB | merylDB_kmer | genomeSize | HiC_R1 | HiC_R2 |
|---|---|---|---|---|---|---|---|---|---|
| Identifier for results and reporting | Level the genome is assembled to; possible options are chr (chromosome), scaff (scaffold), or cont (contig) | Name of your desired BUSCO database; can be a full name (i.e. insecta_odb10) or a substring (i.e. insecta) | Full path to the primary assembly you wish to evaluate, in fasta format (any extension allowed); can be `.gz`ipped | Full path to the alternate assembly (haplotype), in fasta format (any extension allowed); can be `.gz`ipped; if you do not have one, write None | Full path to the .meryl database; must have the .meryl extension | The k-mer size used to build the corresponding .meryl database | A genome size estimate (in bp, as an integer) for the corresponding assembly/species; if you do not have one, write auto | Full path to the forward (R1) reads of the Hi-C library* in fastq format; can be `.gz`ipped | Full path to the reverse (R2) reads of the Hi-C library* in fastq format; can be `.gz`ipped |
IMPORTANT

- *Please concatenate all Hi-C libraries for a sample into a single library pair (one file for R1 and one file for R2), providing this concatenated pair to the sample sheet.
- Have a look at all the BUSCO databases here: https://busco-data.ezlab.org/v5/data/lineages/. You can also manually download and unpack your desired BUSCO lineage (or use a BUSCO database you already have downloaded) and provide the full path to it. In other words, if the necessary BUSCO database already exists on your system and you do not wish for GEP to re-download it, provide the path to the existing BUSCO db:

```
busco5Lineage: "/<PathTo>/manuallyDownloaded/vertebrata_odb10"
```

If a BUSCO database matching the given name already exists in the GEP/buscoLineage/ folder, either from a manual download or from a previous automatic download (from previously executed runs), the pipeline will skip re-downloading.
### Difference between the `sample` column (Build Mode sample sheets) and the `ID` column (Evaluate Mode sample sheet)

In practice, 'sample' from the Build Mode sample sheets and 'ID' from the Evaluate Mode sample sheet may often be synonymous; however, they are defined differently for cases where multiple assemblies (defined in the sample sheet with unique IDs) are evaluated using the same k-mer database from a single 'sample', and so the results need to be differentiated.
## Run configuration (.yaml)

For both Build and Evaluate modes, you will need to configure the config.yaml found in the GEP/configuration folder:

```
Results: "<path>"
samplesTSV: "<path>"
resources: "configuration/define_resources.yaml"
EAR: False
smudgeplot: True
```

- Results: Provide a path where all results for the run will be stored. It does not have to be inside the project folder (it can be any location accessible by the user). You also do not need to create this folder yourself; Snakemake will create it if it does not yet exist (together with any non-existent parent folders).
- samplesTSV: This is where you provide the full path to one of the sample sheets after you have filled out the necessary information. Depending on which kind of sample sheet you provide, GEP will run the corresponding mode (Build - Illumina, Build - HiFi, or Evaluate).
- EAR: Only required for creating an ERGA Assembly Report. Set to False by default. See the section below.
- smudgeplot: Set to True by default. Change it to False to skip running Smudgeplot.
## Configure the run submission (some more .yaml's)

You will also need to configure a separate config.yaml pertaining to the execution of Snakemake, depending on whether or not GEP will be executed on HPC infrastructure.

### Default (non-HPC) configuration

Modify the Snakemake parameters in the file located at GEP/SUBMIT_CONFIG/default/config.yaml.

Here you can provide a key-value pair for any Snakemake parameter, such as the maximum number of threads or jobs to utilise at any given time for the entire execution. Some commonly used parameters are already included in the provided file, but you can add any Snakemake parameter to it (run `snakemake --help` for the possible options).
```
# Example
cores: 64
dry-run: False
use-conda: True
latency-wait: 60
rerun-incomplete: True
printshellcmds: False
keep-going: False
```
### HPC (SLURM) configuration

Modify the Snakemake parameters in the file located at GEP/SUBMIT_CONFIG/slurm/config.yaml. As with the default execution, any Snakemake-specific parameter can be configured here. Additionally, you can provide default resources to be used by every job submitted to the Slurm scheduler, such as `--partition` and `--qos`.
### cores, jobs, resources, and parallelisation

Memory (in MB), wall-time, and number of threads for non-locally executed jobs are defined in the file GEP/configuration/define_resources.yaml. The current values are likely overestimations and should allow GEP to execute without any issues. These values should be increased or decreased as required. If running on non-HPC infrastructure, the wall-time will be ignored.

The Snakemake parameter `cores` (as found in SUBMIT_CONFIG/<default_OR_slurm>/config.yaml) is the maximum number of cores (synonymous with threads in this case) GEP will utilise at any time during execution. For example, during Evaluate Mode, if a single BUSCO job uses 16 threads (as currently set in define_resources.yaml), then with the maximum number of cores set to 64, GEP will run up to (64/16 =) 4 BUSCO jobs in parallel. For steps that require fewer than 16 threads, the number of jobs will be scaled up to utilise as many of the total requested cores as possible without exceeding the maximum.
Similarly, you can provide a maximum number of jobs (e.g. `jobs: 10`). GEP will never exceed either the maximum number of cores or the maximum number of jobs, whichever is reached first. In the above example, if you were to provide `cores: 64` AND `jobs: 2`, then GEP would only ever run two jobs in parallel, regardless of whether the maximum number of cores has been reached. Consider how many samples there are (and therefore how many instances of a single step will be run), as well as how many resources are available.
## Running GEP

Checklist before running:
- Installed and activated the GEP conda environment
- Filled out the respective sample sheet correctly
- Filled out the run configuration (i.e. provided a results folder and the path to the sample sheet) in GEP/configuration/config.yaml
- Changed the maximum number of cores and jobs, as well as any other Snakemake-specific parameters you wish to use, in the file GEP/SUBMIT_CONFIG/default/config.yaml (or, for executing on HPC (Slurm) infrastructure, the file GEP/SUBMIT_CONFIG/slurm/config.yaml)
- Modified the per-job resources (i.e. threads, memory, and wall-time) in the file GEP/configuration/define_resources.yaml

For simplicity, make sure you are inside the main GEP folder before running.

For both local and HPC (Slurm) execution, it is good practice to run GEP as a dry run before any full execution. This will give you an idea of whether the sample sheet and configuration/config.yaml are filled out correctly (by checking whether the paths exist), and whether there is anything wrong with your installation:

```
snakemake --cores 1 --dry-run
```
### Default Execution

```
snakemake --profile SUBMIT_CONFIG/default/
```

### HPC (Slurm) Execution

```
snakemake --profile SUBMIT_CONFIG/slurm/
```
### Run in background

You can run GEP with `nohup` and `&` added to the command (both default and slurm). `nohup` runs the Snakemake command in a way that cannot be interrupted when you lose connection to the server/cluster, and the trailing `&` runs the command in the background of your current terminal, allowing you to freely use the terminal instance in the meantime.

```
nohup snakemake --profile SUBMIT_CONFIG/slurm/ &
```
## More Useful Tips

You can directly override any of the Snakemake parameters (and paths) set in the respective configuration files through the command line when executing GEP. For example, to override the path to the main results folder set in GEP/configuration/config.yaml without editing the file itself:

```
snakemake --profile SUBMIT_CONFIG/default/ --config [Results=/Alternative/ResultsFolder]
```

Or to override the BUSCO lineage, for example:

```
snakemake --profile SUBMIT_CONFIG/default/ --config [busco5Lineage=vertebrata]
```

To override the maximum number of cores and jobs, as well as setting the flag for printing the shell commands (a Snakemake-specific parameter) otherwise configured in GEP/SUBMIT_CONFIG/default/config.yaml:

```
snakemake --profile SUBMIT_CONFIG/default/ --cores 144 --jobs 20 --printshellcmds
```

You can also run GEP only up to and including a specific rule in the pipeline if you do not wish to run the full evaluation:

```
snakemake --profile SUBMIT_CONFIG/default/ --until <ruleName>
```

In such cases, all files that this rule depends on will be produced, but the final PDF reports will not be created.
## Known issues

There is a known Merqury issue where the PNG k-mer plots are not generated, causing the pipeline to fail. If this happens to you, please use this fix:

- Go to the GEP installation folder.
- Run the following command:

```
VAR=$(grep merqury .snakemake/conda/*.yaml | awk -F'[:/.]' '{print $4}'); sed -i '/function check_module()/,/^}/s/echo \$?/echo 1/' .snakemake/conda/${VAR}/share/merqury/util/util.sh
```
## ERGA Assembly Reports

If you are evaluating pre-curation and post-curation assemblies, you can automatically obtain the filled-out yaml file needed to produce an ERGA Assembly Report (EAR). Please run like this:

```
nohup snakemake --profile SUBMIT_CONFIG/slurm/ --config EAR=True &
```

The TSV table should look like this (here for a phased assembly):

| ID | ASM_LEVEL | busco_lineage | PRI_asm | ALT_asm | merylDB | merylDB_kmer | genomeSize | HiC_R1 | HiC_R2 |
|---|---|---|---|---|---|---|---|---|---|
| myID_pre | scaff | mammalia | /my_asm_dir/Hap1_precuration.fa | /my_asm_dir/Hap2_precuration.fa | /my_meryl_dir/my_data.meryl | 31 | auto | None | None |
| myID_curated | chr | mammalia | /my_asm_dir/Hap1_curated.fa | /my_asm_dir/Hap2_curated.fa | /my_meryl_dir/my_data.meryl | 31 | auto | /my_hic_dir/Forward.fq | /my_hic_dir/Reverse.fq |

In the ASM_LEVEL column, the assemblies must be "scaff" for pre-curation and "chr" for post-curation. It is not necessary to add the Hi-C read paths for the pre-curation assembly. If there is only one haplotype (e.g. collapsed, pri, or just hap1), write None in the ALT_asm column.

Please visit the EARs repo for more information.
## Results

### Folder structure of results

- Build Mode
- Evaluate Mode

### PDF Report

One page per assembly for key results:

- A: Name of the assembly and k-mer size used for the databases.
- B: Key contiguity statistics. Full contiguity statistics can be viewed in the respective output files for that step.
- C: GenomeScope profiles (untransformed linear and log versions).
- D: QV score and k-mer completeness tables.
- E: BUSCO results, including the name of the BUSCO clade.
- F: K-mer multiplicity plots for the primary assembly only (top) and both the primary and alternate assemblies together (bottom). If no alternate assembly is provided, the same plot will be present in both positions.

This is followed by up to six (2 by 3) Hi-C contact maps per page, then a page with an aggregated table (coloured as a heatmap) for all assemblies evaluated in a given run, and finally a page with an aggregated table coloured according to tiers, also for all assemblies evaluated in the run. The threshold values for the different tiers are loosely based on the Earth Biogenome Project's [TABLE OF PROPOSED EBP METRICS](https://www.earthbiogenome.org/assembly-standards).
### Tables in different formats

The above tables can also be found in the results folder, /<pathToResults>/1evaluation/finalResults/tables/, in various formats (i.e. .tsv, .md, .html).

### All result files

All result files output by each evaluation tool can also be found in their respective folders.
## How to choose your Illumina libraries

Variations in sequencing methods/protocols can increase bias in the corresponding raw sequencing libraries. Sequencing a biological sample may often produce both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert sizes of 180, 250, 500, 800 bp) paired-end libraries. Usually you can deduce the insert sizes and library types from the metadata found in NCBI or an SRA archive. In order to retain as little bias as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.
## Citations

The software/tools used as part of our genome evaluation are as follows. Please see [GEP/envs] for the exact versions of the tools, as they may be updated frequently:

Pre-processing:
- Trimmomatic (Bolger, A. M., Lohse, M., & Usadel, B. (2014). http://www.usadellab.org/cms/?page=trimmomatic)
- Trim Galore (Felix Krueger, bioinformatics.babraham.ac.uk)
- Cutadapt (Martin, M. (2011). https://doi.org/10.14806/ej.17.1.200)
- FastQC (Simon Andrews, https://github.com/s-andrews/FastQC)
- MultiQC (Ewels, P., Magnusson, M., Lundin, S., Käller, M. (2016). https://doi.org/10.1093/bioinformatics/btw354)

Reference-free genome profiling:
- GenomeScope2 (Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. (2020). https://github.com/tbenavi1/genomescope2.0)

K-mer distribution analysis:
- meryl and merqury (Rhie, A., Walenz, B.P., Koren, S. et al. (2020). https://doi.org/10.1186/s13059-020-02134-9)

Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs):
- BUSCOv5 (Seppey M., Manni M., Zdobnov E.M. (2019). https://busco.ezlab.org/)

Hi-C alignment and contact maps:
- samtools (Danecek, P., et al. (2021). https://doi.org/10.1093/gigascience/giab008)
- BWA (Li H. and Durbin R. (2009). https://doi.org/10.1093/bioinformatics/btp324)
- Pretext (https://github.com/wtsi-hpag/PretextMap)

Generating PDF reports from markdown files:
- Pandoc (https://pandoc.org/MANUAL.html)

Scaffold/contig statistics (N# and L# stats, scaffold metrics, sequence counts, GC content, estimated genome size):
- Python scripts (Mike Trizna. assembly_stats 0.1.4 (Version 0.1.4). Zenodo. (2020). http://doi.org/10.5281/zenodo.3968775; this script was used as a foundation, being modified and improved upon for inclusion in GEP)