
Genome Evaluation Pipeline (GEP)


  • User-friendly and all-in-one quality control and evaluation Snakemake pipeline for genome assemblies

  • Run multiple genome evaluations in parallel (as many as you want!)

  • Scales to server and HPC environments

  • Required software stack automatically deployed using Conda


The GEP Workflow - Two Modes

1. Build Mode:

Builds meryl k-mer databases from sequencing reads. If you have already generated these databases, skip to Evaluate Mode.


Requires:

  • Raw Illumina short-insert whole-genome sequencing reads

and/or

  • Raw PacBio HiFi sequencing reads

Process and tools:

  • (Optionally) trim raw sequencing reads
    • Trim Galore (remove Illumina adapters) and Trimmomatic (remove 10x barcodes if present) for Illumina reads
    • Cutadapt (remove SMRTbell adapters) for HiFi reads
  • (Optionally) perform quality check of raw sequencing reads
    • FastQC and MultiQC
  • Builds k-mer databases from raw sequencing reads to be used for evaluation
    • Meryl
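Conceptually, building the database amounts to sliding a window of length k over every read and counting occurrences of each canonical k-mer; Meryl does this at scale with an efficient on-disk format. A minimal Python sketch of the idea, purely illustrative and no substitute for Meryl:

```python
from collections import Counter

def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    comp = str.maketrans("ACGT", "TGCA")
    return min(kmer, kmer.translate(comp)[::-1])

def count_kmers(reads, k):
    """Count canonical k-mers across a collection of reads."""
    counts = Counter()
    for read in reads:
        # a read of length L yields L - k + 1 overlapping k-mers
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts
```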



2. Evaluation Mode:

Runs numerous whole-genome assembly evaluation tools, generating a user-friendly report for easy viewing and distribution (as well as individual outputs as normally generated by the tools in a well-defined folder structure).


Requires:

  • Meryl k-mer databases (as produced by Build Mode)
  • Genome assembly (or assemblies) to evaluate - including (optionally) a phased alternate assembly
  • (Optionally) Hi-C libraries
  • (Optionally) Estimated genome sizes

Process and tools:

  • K-mer completeness/correctness, QV Score, Spectra-CN (k-mer multiplicity) plots
    • Merqury
  • Genome profiling based on raw sequencing reads
    • GenomeScope2 and Smudgeplot
  • Extensive contiguity statistics such as N# and L# stats, Scaffold and contig counts, GC Content, etc.
    • Gfastats
  • Gene completeness/correctness (i.e. recovered complete, duplicated, missing BUSCOs)
    • BUSCO5
  • (Optionally) Hi-C contact maps
    • BWA-MEM2 and samtools for aligning Hi-C data
    • Pretext for generating contact maps
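To give an idea of what Merqury reports: the consensus QV is derived from the fraction of assembly k-mers that are absent from the read set, since a single base error corrupts up to k overlapping k-mers, and k-mer completeness is the fraction of reliable read k-mers recovered in the assembly. A sketch of the calculation (input numbers below are made up for illustration):

```python
import math

def merqury_qv(asm_only_kmers: int, total_asm_kmers: int, k: int) -> float:
    """Consensus QV: E = 1 - (1 - asm_only/total)^(1/k); QV = -10*log10(E)."""
    err_fraction = asm_only_kmers / total_asm_kmers
    base_error = 1 - (1 - err_fraction) ** (1 / k)
    return -10 * math.log10(base_error)

def kmer_completeness(shared_read_kmers: int, reliable_read_kmers: int) -> float:
    """Percentage of reliable read k-mers found in the assembly."""
    return 100 * shared_read_kmers / reliable_read_kmers

# e.g. 100 assembly-only k-mers out of 10 million, k=21 -> QV around 63
qv = merqury_qv(asm_only_kmers=100, total_asm_kmers=10_000_000, k=21)
```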


Downloading the workflow

To clone the repository, use the following command:

git clone https://git.imp.fu-berlin.de/begendiv/gep.git

Installing Conda

  • Conda (v4.11+; older versions may also work)

If you already have conda installed on your system, please skip to Creating our GEP (Snakemake) Conda environment

Download the linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html

Run the miniconda3 installation and check if it worked:

bash /<your_path_to>/Miniconda3-latest-Linux-x86_64.sh
##Follow miniconda3 installation instructions##

source ~/.bashrc

conda update conda

If the conda command is not found, please close and re-open your terminal for the conda installation to take effect, and then update.


Creating our GEP (Snakemake) Conda environment

The pipeline requires the following software to run (some versions need not match exactly):

  • snakemake (v7.8.5)
  • beautifulsoup4 (v4.9.3)
  • pandas (v1.4.2)
  • numpy (v1.22.3)
  • python (v3.10)
  • mamba (v0.25)

The easiest method to install this software stack is to create a GEP conda environment with the provided installGEP.yaml (see *Note)

conda env create -f /<your_path_to>/GEP/installGEP.yaml

conda activate GEP

##check snakemake installed correctly

snakemake --version

You may need to close and reopen your terminal after the conda env create ... command, or use source ~/.bashrc depending on your system.

*Note: If you already have a Snakemake (or suitable Python) installation and would like to avoid installing it again, ensure that all of the above software is in your PATH. If you do this instead of installing from the provided GEP environment (installGEP.yaml), you will still need at least a base conda installation activated, as it is required to handle the software dependencies of the tools used within the workflow itself


Configure Sample .tsv Sheets

IMPORTANT: GEP can run with only one of the below sample sheets at a given time. A sample sheet can however include multiple different samples, datasets, and assemblies for a given run.

Depending on which sample sheet is provided, GEP will automatically run in one of the following two modes:

  1. Build

  2. Evaluate

If you already have meryl k-mer databases (see the meryl github for more details) for the genomes you wish to evaluate, you can skip Build mode.



Illumina Sample Sheet .tsv - for Build mode

(see example GEP/configuration/exampleSampleSheets/build_illumina_example.tsv)

| sample | Library_R1 | Library_R2 | meryl_kmer_size | trim10X | trimAdapters | fastQC |
| --- | --- | --- | --- | --- | --- | --- |
| Your preferred identifier (results will include this sample name in output files). String | Full path to the forward (R1) read of a PE library in fastq format. Can be gzipped | Full path to the reverse (R2) read of a PE library in fastq format. Can be gzipped | Your choice of k-mer size used to count k-mers in the reads. Recommended for Illumina reads: 21 | Remove the first 23 bp from R1 of the library. Only needed if your reads were sequenced on a 10x platform. Possible options are True or False | Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are True or False | Run FastQC on the library pair provided in the Library_R1 and Library_R2 columns. Possible options are True or False |

Additional Info in case of multiple PE libraries for a single sample:

  • sample: Provide the same sample ID for each of the pairs of a sample (the final output will be a single .meryl database consisting of k-mers from all PE libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample.
  • Library_R1 and Library_R2: Each library pair is provided as one line in the tsv. If you have three PE libraries, then you will have three lines in the tsv for this sample.
  • meryl_kmer_size: This should be consistent for libraries that have the same sample ID.
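For illustration, a hypothetical sheet for one sample with two PE libraries might look like this (sample name and paths are placeholders; columns are tab-separated):

```tsv
sample	Library_R1	Library_R2	meryl_kmer_size	trim10X	trimAdapters	fastQC
mySample	/path/to/lib1_R1.fastq.gz	/path/to/lib1_R2.fastq.gz	21	False	True	True
mySample	/path/to/lib2_R1.fastq.gz	/path/to/lib2_R2.fastq.gz	21	False	True	True
```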

If you are uncertain about which short-read libraries to use, see How to choose your Illumina libraries further down the page.



PacBio HiFi Sample Sheet .tsv - for Build mode

(see example GEP/configuration/exampleSampleSheets/build_hifi_example.tsv)

| sample | hifi_reads | meryl_kmer_size | trimSMRTbell | fastQC |
| --- | --- | --- | --- | --- |
| Your preferred identifier (results will include this sample name in output files). String | Full path to a PacBio HiFi library in fastq format. Can be gzipped | Your choice of k-mer size used to count k-mers in the reads. Recommended for PacBio HiFi: 31 | Check for and remove any SMRTbell adapter sequences. Possible options are True or False | Run FastQC on the library provided in the hifi_reads column. Possible options are True or False |

Additional Info in case of multiple HiFi libraries for a single sample:

  • sample: Provide the same sample ID for each HiFi library of a sample (the final output will be a single .meryl database consisting of k-mers from all HiFi libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample.
  • hifi_reads: Each library is provided as one line in the tsv. If you have three HiFi libraries for the same sample, then you will have three lines in the tsv for this sample.
  • meryl_kmer_size: This should be consistent for libraries that have the same sample ID.
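As with the Illumina sheet, a hypothetical example for one sample with two HiFi libraries (names and paths are placeholders; columns are tab-separated):

```tsv
sample	hifi_reads	meryl_kmer_size	trimSMRTbell	fastQC
mySample	/path/to/hifi_lib1.fastq.gz	31	True	True
mySample	/path/to/hifi_lib2.fastq.gz	31	True	True
```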


Note: You may also wish to concatenate the libraries of a sample together prior to building the database. In this case, you only need to provide one line per sample in the respective sample sheet. However, the total run-time will be longer, as the pipeline is designed to process multiple libraries in parallel.
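If you do choose to concatenate, note that gzip files can be safely concatenated at the byte level and remain valid. A minimal sketch of such a helper (file names in the example are hypothetical):

```python
import shutil

def concat_files(inputs, output):
    """Concatenate files byte-for-byte; valid for plain fastq and for .gz files."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as fh:
                shutil.copyfileobj(fh, out)

# e.g. concat_files(["lib1_R1.fastq.gz", "lib2_R1.fastq.gz"], "all_R1.fastq.gz")
```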


Assembly Evaluation Sample Sheet .tsv - for Evaluate mode

See GEP/configuration/exampleSampleSheets/runEval_example.tsv for an idea of how to set up the evaluation sample sheet.

| ID | ASM_LEVEL | busco_lineage | PRI_asm | ALT_asm | merylDB | merylDB_kmer | genomeSize | HiC_R1 | HiC_R2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Identifier for results and reporting | Level the genome is assembled to. Possible options are chr (chromosome), scaff (scaffold), or cont (contig) | Name of your desired BUSCO database. Can be the full name (i.e. insecta_odb10) or a substring (i.e. insecta) | Full path to the primary assembly you wish to evaluate, in fasta format (any extension allowed). Can be gzipped | Full path to the alternate assembly (haplotype), in fasta format (any extension allowed). Can be gzipped. If you do not have one, write None | Full path to the .meryl database. Must have the .meryl extension | The k-mer size used to build the provided corresponding .meryl database | A size estimate (in bp, as an integer) for the corresponding assembly/species. If you do not have one, write auto | Full path to the forward (R1) reads of the Hi-C library* in fastq format. Can be gzipped | Full path to the reverse (R2) reads of the Hi-C library* in fastq format. Can be gzipped |

IMPORTANT

  • Please concatenate all Hi-C libraries for a sample into a single library pair (one file for R1 and one file for R2), and provide this concatenated pair in the sample sheet

  • Have a look at all the BUSCO databases here: https://busco-data.ezlab.org/v5/data/lineages/. You can also manually download and unpack your desired BUSCO lineage and provide its full path. Alternatively, if you already have the necessary BUSCO database on your system and do not want GEP to re-download it, provide the path to the existing BUSCO db:

busco5Lineage: "/<PathTo>/manuallyDownloaded/vertebrata_odb10"

If a BUSCO database matching the given name already exists in the GEP/buscoLineage/ folder (whether from a manual download or from a previously executed run), the pipeline will skip re-downloading it.

Difference between sample column in Build Mode sample sheets and ID column in Evaluate Mode sample sheet

In practice, 'sample' from the Build Mode sample sheets and 'ID' from the Evaluate Mode sample sheet may often be synonymous. However, they are defined separately for cases where multiple assemblies (given unique IDs in the sample sheet) are evaluated using the same k-mer database from a single 'sample', and the results therefore need to be differentiated.


Configuration .yaml

For both Build and Evaluate modes, you will need to configure the config.yaml found in the GEP/configuration folder.

Results: "<path>"

samplesTSV: "<path>"

resources: "configuration/define_resources.yaml"

EAR: False
smudgeplot: True
Results:

Provide the path where all results for the run will be stored. It does not have to be inside the project folder (it can be any location accessible by the user). You also do not need to create this folder yourself; Snakemake will create it if it does not yet exist (together with any non-existent parent folders).

samplesTSV:

This is where you provide the full path for one of the sample sheets after you have filled out the necessary information. Depending on which kind of sample sheet you provide, GEP will run the corresponding mode (Build - Illumina, Build - HiFi, or Evaluate)

EAR - Only required for creating an ERGA Assembly Report

Set to False by default. See the ERGA Assembly Reports section below.

smudgeplot:

Set to True by default. Change it to False to skip running Smudgeplot.
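As a concrete illustration, a filled-in configuration/config.yaml might look like this (all paths below are hypothetical placeholders):

```yaml
Results: "/scratch/myProject/gepResults"                  # hypothetical output location
samplesTSV: "/home/user/sampleSheets/build_illumina.tsv"  # hypothetical sample sheet
resources: "configuration/define_resources.yaml"
EAR: False
smudgeplot: True
```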


Configure run (some more .yaml's)

The user will also need to configure a separate config.yaml file pertaining to the execution of Snakemake, depending on whether or not GEP will be executed on HPC infrastructure.

Default (non-HPC) configuration

Modify the Snakemake parameters in the file located at GEP/SUBMIT_CONFIG/default/config.yaml.

Here the user can provide a key-value pair for any Snakemake parameter, such as the maximum number of threads or jobs to utilise at any given time for the entire execution. Some commonly used parameters are already included in the provided file, but the user can add any Snakemake parameter (run snakemake --help for possible options).

#Example

cores: 64
dry-run: False
use-conda: True
latency-wait: 60
rerun-incomplete: True
printshellcmds: False
keep-going: False

HPC (SLURM) configuration

Modify the Snakemake parameters in the file located at GEP/SUBMIT_CONFIG/slurm/config.yaml. Similar to the default execution, any Snakemake-specific parameter can be configured here. Additionally, the user can provide default resources to be used by every job submitted to the Slurm scheduler, such as --partition and --qos.
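A sketch of what such a Slurm profile might contain (partition and qos names are site-specific placeholders, and the exact quoting of default-resources values can vary between Snakemake versions, so check snakemake --help on your system):

```yaml
jobs: 20
use-conda: True
latency-wait: 60
default-resources:
  - "partition='standard'"
  - "qos='normal'"
```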


cores, jobs, resources, and parallelisation

Memory (in Mb), wall-time, and number of threads for non-locally executed jobs are defined in the file GEP/configuration/define_resources.yaml. The current values are likely overestimates and should allow GEP to run without issues; increase or decrease them as required. If running on non-HPC infrastructure, the wall-time is ignored.

The Snakemake parameter cores (as found in SUBMIT_CONFIG/<default_OR_slurm>/config.yaml ) is the maximum number of cores (synonymous with threads in this case) GEP will utilise at any time during execution. For example during Evaluation Mode if a single BUSCO job uses 16 threads (as is set currently in define_resources.yaml), then by setting the maximum number of cores to 64, GEP will run up to (64/16 =) 4 BUSCO jobs in parallel. For other steps that require less than 16 threads, the number of jobs will be scaled up to utilise as many of the total requested cores as possible without exceeding the maximum.

Similarly, the user can provide a maximum number of jobs (e.g. jobs: 10). GEP will never exceed either the maximum number of cores or the maximum number of jobs, whichever is reached first. In the above example, if the user were to provide cores: 64 AND jobs: 2, then GEP would only ever run two jobs in parallel, regardless of whether the maximum number of cores has been reached. The user needs to consider how many samples there are (and therefore how many instances of a single step will run), as well as how many resources are available.
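The interplay of cores, jobs, and per-rule threads can be sketched as a simple min calculation (illustrative only; Snakemake's scheduler handles this internally):

```python
def max_parallel(cores: int, jobs: int, threads_per_job: int) -> int:
    """How many instances of a rule can run at once under both caps."""
    return min(jobs, cores // threads_per_job)

# 64 cores and 16-thread BUSCO jobs allow up to 4 in parallel if the job cap permits...
print(max_parallel(cores=64, jobs=10, threads_per_job=16))  # 4
# ...but jobs: 2 wins when it is the tighter limit.
print(max_parallel(cores=64, jobs=2, threads_per_job=16))   # 2
```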


Running GEP

Checklist before running:

For simplicity, make sure you are inside the main GEP folder before running

For both local and HPC (Slurm) execution, it is good practice to start with a dry run before any full execution. This will give you an idea of whether the sample sheet and configuration/config.yaml are filled out correctly (by checking that the paths exist), and whether there is anything wrong with your installation.

snakemake --cores 1 --dry-run

Default Execution

snakemake --profile SUBMIT_CONFIG/default/

HPC (Slurm) Execution

snakemake --profile SUBMIT_CONFIG/slurm/

Run in background

For those not aware, you can run GEP with nohup and & added to the command (both default and slurm). nohup will run the Snakemake command in a way that cannot be interrupted when you lose connection to the server/cluster. The trailing & will simply run the command in the background of your current terminal, allowing you to freely use the terminal instance in the meantime.

nohup snakemake --profile SUBMIT_CONFIG/slurm/ &

More Useful Tips

The user can directly override any of the Snakemake parameters (and paths) set in the respective configuration files through the command line when executing GEP. For example, to override the values in the sample configuration GEP/configuration/config.yaml for the path to the main results folder without having to edit the document itself:

snakemake --profile SUBMIT_CONFIG/default/ --config Results=/Alternative/ResultsFolder

Or to override the BUSCO lineage for example:

snakemake --profile SUBMIT_CONFIG/default/ --config busco5Lineage=vertebrata

To override values for maximum number of cores and jobs, as well as setting the flag for printing the shell commands (a Snakemake-specific parameter) in the submission configuration as found in GEP/SUBMIT_CONFIG/default/config.yaml:

snakemake --profile SUBMIT_CONFIG/default/ --cores 144 --jobs 20 --printshellcmds

The user can also run GEP only up to and including specific rules in the pipeline if they do not wish to run the full evaluation:

snakemake --profile SUBMIT_CONFIG/default/ --until <ruleName>

In such cases, all rules that the given rule depends on will also be run, but the final PDF reports will not be created.


Known issues

There is a known Merqury issue where the png k-mer plots are not generated, causing the pipeline to fail. If this happens to you, please use this fix:

  1. Go to the gep installation folder.
  2. Run the following command:
VAR=$(grep merqury .snakemake/conda/*.yaml | awk -F'[:/.]' '{print $4}'); sed -i '/function check_module()/,/^}/s/echo \$?/echo 1/' .snakemake/conda/${VAR}/share/merqury/util/util.sh

ERGA Assembly Reports

If you are evaluating pre-curation and post-curation assemblies, you can obtain a yaml file automatically filled in to produce an ERGA Assembly Report (EAR). Please run like this:

nohup snakemake --profile SUBMIT_CONFIG/slurm/ --config EAR=True &

The TSV table should look like this (here for a phased assembly):

| ID | ASM_LEVEL | busco_lineage | PRI_asm | ALT_asm | merylDB | merylDB_kmer | genomeSize | HiC_R1 | HiC_R2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| myID_pre | scaff | mammalia | /my_asm_dir/Hap1_precuration.fa | /my_asm_dir/Hap2_precuration.fa | /my_meryl_dir/my_data.meryl | 31 | auto | None | None |
| myID_curated | chr | mammalia | /my_asm_dir/Hap1_curated.fa | /my_asm_dir/Hap2_curated.fa | /my_meryl_dir/my_data.meryl | 31 | auto | /my_hic_dir/Forward.fq | /my_hic_dir/Reverse.fq |

In the ASM_LEVEL column, assemblies must be "scaff" for pre-curation and "chr" for post-curation. It is not necessary to add the Hi-C read paths for the pre-curation assembly. If there is only one haplotype (e.g. a collapsed or primary assembly, or just hap1), put None in the ALT_asm column.

Please visit the EARs repo for more information.


Results

Folder Structure of results

Build Mode

Evaluate Mode

PDF Report

One page per-assembly for key results

  • A: Name of the assembly and k-mer size used for databases.
  • B: Key contiguity statistics. Full contiguity statistics can be viewed in the respective output files for that step.
  • C: Genomescope Profiles (Untransformed Linear and Log versions).
  • D: QV Score and K-mer Completeness tables.
  • E: BUSCO results including name of BUSCO clade.
  • F: K-mer multiplicity plots for primary assembly only (Top) and both primary and alternate assemblies together (Bottom). If no alternate assembly is provided, the same plot will be present in both positions.

Followed by up to six (2 by 3) Hi-C contact maps per page


Followed by a page with an aggregated table (coloured as a heatmap) for all assemblies evaluated in a given run


Followed by a page with an aggregated table coloured according to tiers, also for all assemblies evaluated in a given run. The threshold values for the different tiers were loosely based on the Earth Biogenome Project's [table of proposed EBP metrics](https://www.earthbiogenome.org/assembly-standards).


Tables in different formats

The above tables can also be found in the results folder: /<pathToResults>/1evaluation/finalResults/tables/ in various formats (i.e. .tsv, .md, .html).


All result files

All result files as output by each evaluation tool can also be found in their respective folders.


How to choose your Illumina libraries

Variations in sequencing methods/protocols can introduce bias into the resulting raw sequencing libraries. A sequenced biological sample may often comprise both mate-pair/long-insert (e.g. insert sizes of 5 kb, 10 kb, 20 kb) and short-insert (e.g. insert sizes of 180, 250, 500, 800 bp) paired-end libraries.

Usually you can deduce the insert sizes and library types from the metadata found in NCBI or SRA archives. To keep bias as low as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.
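When weighing up what counts as decent coverage, a quick back-of-the-envelope calculation helps. Note that effective k-mer coverage is somewhat lower than base coverage, because a read of length L contributes only L - k + 1 k-mers. A rough sketch, with purely illustrative numbers:

```python
def base_coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth across the genome."""
    return total_bases / genome_size

def kmer_coverage(base_cov: float, read_len: int, k: int) -> float:
    """Approximate k-mer depth implied by a given base depth."""
    return base_cov * (read_len - k + 1) / read_len

# e.g. 45 Gbp of reads over a 1.5 Gbp genome:
cov = base_coverage(45e9, 1.5e9)               # 30x base coverage
kcov = kmer_coverage(cov, read_len=150, k=21)  # 26x k-mer coverage
```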


Citations

#######################################################################

The software/tools used as part of our genome evaluation are as follows. Please see [GEP/envs] for exact versions of tools, as they may be updated frequently:

Pre-Processing:

Reference-free Genome Profiling

K-mer distribution analysis

Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs)

Hi-C Alignment and Contact Maps

Generate PDF reports from markdown files

Scaffold/contig statistics: N# and L# stats, scaffold metrics, sequence counts, GC content, Estimated genome size

  • Python scripts (Mike Trizna. assembly_stats 0.1.4 (Version 0.1.4). Zenodo. (2020). http://doi.org/10.5281/zenodo.3968775 ) (This script was used as a foundation, being modified and improved upon for inclusion in GEP)

#######################################################################