# Genome Evaluation Pipeline (GEP)

- User-friendly and all-in-one quality control and evaluation pipeline for genome assemblies
- Run multiple genome evaluations in parallel (as many as you want!)
- Scales to server and HPC environments
- Required software stack automatically deployed using conda
Note: links to other parts of this README are currently not working. Links to other files in the GitLab repository do work!
## General Workflow of Genome Evaluation

### The GEP Workflow - Two Modes
1. **Build**
   - Requires: WGS sequencing libraries (PacBio HiFi or Illumina PE short-insert `.fastq`'s)
   - Output: a (`.meryl`) k-mer database
2. **Evaluate**
   - Requires: (`.meryl`) k-mer databases and the corresponding genome assemblies (`.fasta`'s) you wish to evaluate, as well as HiC libraries as optional input
   - Output:
     - Assembly stats (L#, N#, scaffold, contig, etc.)
     - BUSCO results
     - QV score
     - K-mer completeness
     - GenomeScope profile
     - Pretext map (optional)
     - PDF report aggregating all of the above
## Downloading the workflow

To clone the repository, use the following command:

```bash
git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git
```
## Installing Conda

- Conda (v4.11+; may work on older versions)

If you already have conda installed on your system, please skip ahead to the section *Creating our Snakemake conda environment*.

Download the Linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html

Run the Miniconda3 installer and check that it worked:

```bash
bash /<your_path_to>/Miniconda3-latest-Linux-x86_64.sh
## Follow the Miniconda3 installation instructions ##
source ~/.bashrc
conda update conda
```

If the `conda` command is not found, please close and re-open your terminal for the conda installation to take effect, and then update.
## Creating our Snakemake conda environment

The pipeline requires the following software to run (some versions may not need to match exactly):

- snakemake (v7.6.2)
- beautifulsoup4 (v4.9.3)
- pandas (v1.4.2)
- numpy (v1.22.3)
- python (v3.10.4)

The easiest way to install this software stack is to create a GEP conda environment from the provided `installGEP.yaml` (see *Note below). You may need to close and reopen your terminal after the `conda env create ...` command, or use `source ~/.bashrc`, depending on your system.

```bash
conda env create -f /<your_path_to>/GEP/installGEP.yaml
conda activate GEP

## Check that snakemake installed correctly
snakemake --version
```

Installing mamba within your GEP environment is recommended, as it makes conda package management/installation quicker. To do so, run the following command after the GEP environment is installed and activated:

```bash
conda install -c conda-forge mamba
```

If you do not install mamba, you will need to make sure the following flag is included in your snakemake command at the time of execution: `--conda-frontend=conda`
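For example, a local run without mamba might look like this (the profile path is the one used later in this README):

```bash
# Without mamba installed, tell snakemake to fall back to the plain conda frontend
snakemake --profile SUBMIT_CONFIG/local/ --conda-frontend=conda
```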
*Note: if you already have a snakemake (or suitable Python) installation and would like to avoid installing again, ensure that all of the above software is in your `PATH`. If you do this instead of installing from the provided GEP environment (`installGEP.yaml`), you will still need at least the base conda installed and activated, as it is required to handle the software dependencies of all the tools used within the workflow itself.
## Configure Sample Sheets (`.tsv`'s)

For the moment, GEP can run with only one of the below sample sheets at a given time. Depending on which sample sheet is provided, GEP will automatically run in one of the following two modes:
1. **Build**
   - This mode runs if either the Illumina Sample Sheet (`.tsv`) or the PacBio HiFi Sample Sheet (`.tsv`) is provided to GEP.
2. **Evaluate**
   - This mode runs if the Assembly Evaluation Sample Sheet (`.tsv`) is provided.

If you already have meryl k-mer databases (see the meryl GitHub at https://github.com/marbl/meryl for more details) for the genomes you wish to evaluate, you can skip Build mode; a minimal sketch of building one by hand follows below.
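For illustration only (file names and the k value are placeholders; the meryl documentation is the authority here), a hand-built database for an Illumina PE sample might look like:

```bash
# Count 21-mers in each read file into its own meryl database
meryl k=21 count output sample_R1.meryl sample_R1.fastq.gz
meryl k=21 count output sample_R2.meryl sample_R2.fastq.gz

# Merge the per-file databases into a single database for the sample
meryl union-sum output sample.meryl sample_R1.meryl sample_R2.meryl
```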
### Illumina Sample Sheet (`.tsv`) - for Build mode

See the example at `GEP/configuration/exampleSampleSheets/build_illumina_example.tsv`.
| sample | Library_R1 | Library_R2 | meryl_kmer_size | trim10X | trimAdapters | fastQC |
|---|---|---|---|---|---|---|
| Your preferred identifier (results will include this sample name in output files) | Full path to the forward (R1) read of the PE library in fastq format; can be `.gz`ipped | Full path to the reverse (R2) read of the PE library in fastq format; can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads; 21 is recommended for Illumina reads | Remove the first 23 bp from R1 of the library; only needed if your reads were sequenced on the 10X sequencing platform; possible options are `True` or `False` | Check for and remove any other sequencing adapters that may still be present in the reads; possible options are `True` or `False` | Run FastQC on the library pair provided in `Library_R1` and `Library_R2`; possible options are `True` or `False` |
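For illustration, a sheet for one sample with two PE libraries could look like this (tab-separated; paths and sample name are hypothetical, and the header row is shown for readability - check the provided example sheet for the exact expected format):

```
sample	Library_R1	Library_R2	meryl_kmer_size	trim10X	trimAdapters	fastQC
speciesA	/data/speciesA_lib1_R1.fastq.gz	/data/speciesA_lib1_R2.fastq.gz	21	False	True	True
speciesA	/data/speciesA_lib2_R1.fastq.gz	/data/speciesA_lib2_R2.fastq.gz	21	False	True	True
```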
Additional info in case of multiple PE libraries for a single sample:

- **sample**: Provide the same sample ID for each pair belonging to a sample (the final output will be a single `.meryl` database consisting of k-mers from all PE libraries given for that sample, with this identifier as a prefix). Every line with the same unique sample ID is considered as coming from the same sample.
- **Library_R1 and Library_R2**: Each library pair is provided as one line in the tsv. If you have three PE libraries, then you will have three lines in the tsv for this sample.
- **meryl_kmer_size**: This should be consistent for libraries that share the same sample ID.
- **trim10X**: This does not need to be consistent. If you wish to build a database from a combination of 10X and non-10X barcoded libraries, you can do so. Only provide `True` if you definitely want to trim the `Library_R1` provided on that same line.
- **trimAdapters**: Similarly, you may select `True` or `False` for each library independently, whether or not the libraries belong to the same sample.
- **fastQC**: If any library from a sample has `True` in this column, then all libraries with the identical sample ID will have their quality checked with FastQC, even if those other libraries have `False` in this column.

If you are a little uncertain about which short-read libraries you should use, see *How to choose your Illumina libraries* further down the page.
### PacBio HiFi Sample Sheet (`.tsv`) - for Build mode

See the HiFi example sheet in `GEP/configuration/exampleSampleSheets/`.
| sample | hifi_reads | meryl_kmer_size | trimSMRTbell | fastQC |
|---|---|---|---|---|
| Your preferred identifier (results will include this sample name in output files) | Full path to the HiFi library in fastq format; can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads; 31 is recommended for PacBio HiFi | Check for and remove any SMRTbell adapter sequences; possible options are `True` or `False` | Run FastQC on the library provided in `hifi_reads`; possible options are `True` or `False` |
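Again for illustration only (tab-separated; paths and sample name are hypothetical, header row shown for readability):

```
sample	hifi_reads	meryl_kmer_size	trimSMRTbell	fastQC
speciesA	/data/speciesA_hifi_cell1.fastq.gz	31	True	True
speciesA	/data/speciesA_hifi_cell2.fastq.gz	31	True	True
```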
Additional info in case of multiple HiFi libraries for a single sample:

- **sample**: Provide the same sample ID for each HiFi library of a sample (the final output will be a single `.meryl` database consisting of k-mers from all HiFi libraries given for that sample, with this identifier as a prefix). Every line with the same unique sample ID is considered as coming from the same sample.
- **hifi_reads**: Each library is provided as one line in the tsv. If you have three HiFi libraries, then you will have three lines in the tsv for this sample.
- **meryl_kmer_size**: This should be consistent for libraries that share the same sample ID.
- **trimSMRTbell**: This does not need to be consistent. Only provide `True` if you definitely want to trim the `hifi_reads` provided on that same line.
- **fastQC**: If any library from a sample has `True` in this column, then all libraries with the identical sample ID will have their quality checked with FastQC, even if those other libraries have `False` in this column.
You may also wish to concatenate the libraries of a sample together prior to building the database. In this case, you will only need to provide one line per sample in the respective sample sheet. However, execution run-time will suffer, as the pipeline is designed to process multiple libraries in parallel.
### Assembly Evaluation Sample Sheet (`.tsv`) - for Evaluate mode

See `GEP/configuration/exampleSampleSheets/runEval_example.tsv` for an idea of how to set up the evaluation sample sheet.
| ID | PRI_asm | ALT_asm | merylDB | merylDB_kmer | genomeSize | HiC_R1 | HiC_R2 |
|---|---|---|---|---|---|---|---|
| Identifier for results and reporting | Full path to the primary assembly you wish to evaluate, in fasta format; can be `.gz`ipped | Full path to the alternate assembly (haplotype), in fasta format; can be `.gz`ipped; if you do not have one, write `None` | Full path to the `.meryl` database | The k-mer size used to build your provided `.meryl` db | A size estimate (in bp) for the corresponding assembly/species; if you do not have one, write `auto` | Full path to the forward (R1) read of the HiC library in fastq format; can be `.gz`ipped | Full path to the reverse (R2) read of the HiC library in fastq format; can be `.gz`ipped |
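An illustrative row (tab-separated; all paths and names are hypothetical, header shown for readability):

```
ID	PRI_asm	ALT_asm	merylDB	merylDB_kmer	genomeSize	HiC_R1	HiC_R2
speciesA_v1	/data/speciesA_pri.fasta.gz	None	/results/speciesA.meryl	21	auto	/data/speciesA_hic_R1.fastq.gz	/data/speciesA_hic_R2.fastq.gz
```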
Please concatenate all HiC libraries for a sample into a single library pair (one file for R1 and one file for R2), and provide this concatenated pair in the sample sheet; a sketch of how to do this follows below.
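For example (file names are hypothetical; gzipped fastq files can be concatenated directly):

```bash
# Concatenate all forward reads into one R1 file, and all reverse reads into one R2 file
cat lib1_hic_R1.fastq.gz lib2_hic_R1.fastq.gz > speciesA_hic_R1.fastq.gz
cat lib1_hic_R2.fastq.gz lib2_hic_R2.fastq.gz > speciesA_hic_R2.fastq.gz
```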
## Configure `config.yaml`

For both Build and Evaluate modes, you will need to configure the `config.yaml` found in the `GEP/configuration` folder:

```yaml
Results: "<path>"
samplesTSV: "<path>"
busco5Lineage: "<nameofbuscoDB>"
resources: "configuration/define_resources.yaml"
```
**Results**: Provide a path where all results for the run will be stored. It does not have to be inside the project folder (it can be any location accessible by the user). You also do not need to create this folder yourself; snakemake will create it if it does not yet exist (together with any non-existent parent folders).

**samplesTSV**: This is where you provide the full path to one of the sample sheets. Depending on which kind of sample sheet you provide, GEP will run the corresponding mode (Build - Illumina, Build - HiFi, or Evaluate).
**busco5Lineage** (only required for Evaluate mode): BUSCO needs a lineage database in order to run, and here you have a couple of options. Have a look at all the BUSCO databases at https://busco-data.ezlab.org/v5/data/lineages/.

You can either manually download and unpack your desired BUSCO lineage (or use a BUSCO database you have already downloaded) and provide the full path to it, for example:

```yaml
busco5Lineage: "/<PathTo>/manuallyDownloaded/vertebrata_odb10"
```

Or you can simply provide the taxonomy name you wish to use, and the latest database matching that name will be downloaded automatically prior to execution:

```yaml
busco5Lineage: "vertebrata"
```

If a BUSCO database matching the given name already exists in the `GEP/buscoLineage/` folder, either from a manual download or from a previous automatic download (from previously executed runs), the pipeline will skip re-downloading.
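A manual download could look like the following sketch (the archive name includes a release date, shown here as a placeholder):

```bash
# Hypothetical manual download of a lineage; replace <date> with the release date in the real filename
wget https://busco-data.ezlab.org/v5/data/lineages/vertebrata_odb10.<date>.tar.gz
tar -xzf vertebrata_odb10.<date>.tar.gz -C /<PathTo>/manuallyDownloaded/
```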
## Running GEP

Make sure your GEP environment is activated and you are inside the main GEP folder.

For both local and HPC (slurm) execution, it is good practice to start with a dry run before any full execution. This gives you an idea of whether the sample sheet and `configuration/config.yaml` are filled out correctly (by checking that the paths exist), and whether there is anything wrong with your installation:

```bash
snakemake --cores 1 --dry-run
```

If you are going to run Evaluate mode, the above dry run will also download and unpack the given BUSCO lineage database. Expect a lot of terminal output from this.
## Configure run (some more `.yaml`'s)

### LOCAL

Modify the local snakemake parameters in the file located at `GEP/SUBMIT_CONFIG/local/config.yaml`. Here you can provide a key-value pair for any snakemake parameter (run `snakemake --help` for options):

```yaml
#Example
cores: 64
dry-run: False
use-conda: True
latency-wait: 60
rerun-incomplete: True
printshellcmds: False
keep-going: False
#shadow-prefix: /srv/public/user/<user>/tempfileshere
#use-singularity: True
```

**LOCAL EXECUTION**

```bash
snakemake --profile SUBMIT_CONFIG/local/
```
### HPC (SLURM)

Modify the slurm snakemake parameters in the file located at `GEP/SUBMIT_CONFIG/slurm/config.yaml`. You can set your desired `partition` and `qos` under `default-resources:`, as well as any snakemake parameters.

**SLURM EXECUTION**

```bash
snakemake --profile SUBMIT_CONFIG/slurm/
```
## Running using a singularity container (only on HPC)

GEP has been containerised using docker. Within this container, snakemake installs the necessary software stacks as contained conda environments (separate from those installed during normal execution). All you have to do is make sure singularity is installed (it is often provided by your HPC IT team) and in your `PATH`, e.g.:

```bash
module load Singularity
```

Then execute by adding `--use-singularity` and `--singularity-args "--bind <path1>,<path2> --workdir <userTempDirectory> --contain"`.

The `--bind` paths are any location/s on your system that singularity needs to be able to access recursively, commonly your /scratch/username/ directory where both your raw data and results are/will be stored. Multiple paths can be given, separated by a comma.

The `--workdir` is the singularity tmp directory. On HPC systems this is often set automatically to `/tmp/...`, but that can cause issues during the initial installation of the container, as this default `tmp` directory may be too small. Instead, provide a path to a new directory to be used as the `tmp` for snakemake + singularity.

The `--contain` flag ensures that the container is kept completely separate from the host system's files, so that the packages being installed are not shared with those that already exist (or are in the host system's conda cache).

**EXAMPLE SINGULARITY EXECUTION**

```bash
snakemake --profile SUBMIT_CONFIG/slurm/ --use-singularity --singularity-args "--bind /scratch/james94,/scratch2/james94 --workdir /scratch/james94/tmp --contain"
```
## Run in background

You can run GEP with `nohup` and `&` added to the command (for both local and slurm execution, with or without singularity). `nohup` runs the snakemake command in a way that cannot be interrupted when you lose connection to the server/cluster, while the trailing `&` runs the command in the background of your current terminal, allowing you to freely use the terminal instance in the meantime.

```bash
nohup snakemake --profile SUBMIT_CONFIG/slurm/ &
```
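By default `nohup` writes output to a file called `nohup.out`; if you prefer a named log file, you can redirect the output (the log file name here is just an example):

```bash
# Send stdout and stderr to a chosen log file instead of nohup.out
nohup snakemake --profile SUBMIT_CONFIG/slurm/ > gep_run.log 2>&1 &
```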
## `cores:`, resources, and parallelisation

All steps have a memory (in MB), wall-time, and number of threads defined in the file `GEP/configuration/define_resources.yaml`. This defines the resources that will be requested for each individual step, for each individual sample. If running locally, wall-time will be ignored.
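As an illustration only (the actual rule names and keys in `define_resources.yaml` may differ), an entry for one step might look like:

```yaml
# Hypothetical per-step entry; consult define_resources.yaml for the real structure
busco5:
  mem_mb: 64000      # memory in MB
  time: "08:00:00"   # wall-time, ignored for local execution
  threads: 16        # threads for this step
```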
The parameter `cores` found in `SUBMIT_CONFIG/<local_OR_slurm>/config.yaml` is the maximum number of cores (synonymous with threads in this case) you want to utilise at any given time. For example, during Evaluate mode, if a single busco job uses 16 threads (as is currently set as default) and you set the maximum number of cores to 64, GEP can run up to 4 (= 64/16) busco jobs in parallel. For other steps that require fewer than 16 threads, the number of jobs will be scaled up to utilise as many of the total requested cores as possible without exceeding the maximum.

Similarly, you can provide a maximum number of jobs (e.g. `jobs: 10`). GEP will never exceed either the maximum number of cores or the maximum number of jobs, whichever is reached first. In the above example, if you were to provide `cores: 64` AND `jobs: 2`, then GEP will only ever run two jobs in parallel, regardless of whether the maximum number of cores has been utilised.
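For instance, caps like those discussed above would sit in the submit config like this (values are illustrative):

```yaml
# In SUBMIT_CONFIG/<local_OR_slurm>/config.yaml
cores: 64   # at most 64 threads in use at any one time
jobs: 10    # at most 10 jobs running in parallel
```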
Try to consider how many samples you have (and therefore how many instances of a single step will be run), as well as the resources you have available.
## More tips

You can directly override any of the snakemake parameters (and paths) set in the respective config files through the command line when executing GEP. Some examples:

To override values in the sample configuration `GEP/configuration/config.yaml`, such as the path to the results, the path to the samplesTSV, or busco5Lineage, without having to edit the document itself:

```bash
snakemake --profile SUBMIT_CONFIG/local/ --config Results=/my/New/Results/Folder busco5Lineage=insecta
```

To override values in the submission configuration found in `SUBMIT_CONFIG/<local_OR_slurm>/config.yaml` without having to edit the document itself:

```bash
snakemake --profile SUBMIT_CONFIG/slurm/ --cores 144 --jobs 20 --printshellcmds
```
## Results

For Build mode, the meryl databases can be found at the following path (this is the path that you provide in the evaluation sample sheet under the `merylDB` column):

```
/<pathToResults>/0_buildDatabases/{sample}/<illuminaReads OR hifiReads>/03_merylDb/complete_{illuminaORhifi}_{sample}_dB.{smrtTrimmed}.{kmer}.meryl
```

For Evaluate mode, individual results can be found at:

```
/<pathToResults>/1_evaluation/{asmID}/
```

### PDF Reports

- Full Report: `/<pathToResults>/1_evaluation/finalResults/FULL_Report_PDF.pdf`
- Full Table: `/<pathToResults>/1_evaluation/finalResults/FULL_TABLE_PDF.pdf`

### Aggregated Results Table

The above table can also be found in TSV format (for easy import into plotting software such as R) and in markdown (for copy-and-pasting into your preferred markdown document):

```
/<pathToResults>/1_evaluation/finalResults/Combined_Results_FULLTABLE.tsv
/<pathToResults>/1_evaluation/finalResults/FullTableMarkdown.md
```
## Citations

The software/tools used as part of our genome evaluation are as follows:

Pre-processing (to obtain the least-biased short-read dataset available):

- Trimmomatic (Bolger, A. M., Lohse, M., & Usadel, B. (2014). http://www.usadellab.org/cms/?page=trimmomatic)
- Trim Galore (Felix Krueger. https://www.bioinformatics.babraham.ac.uk)
- FastQC (Simon Andrews. https://github.com/s-andrews/FastQC)
- MultiQC (Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). https://doi.org/10.1093/bioinformatics/btw354)

Reference-free genome profiling:

- GenomeScope2 (Ranallo-Benavidez, T. R., Jaron, K. S., & Schatz, M. C. (2020). https://github.com/tbenavi1/genomescope2.0)

K-mer distribution (copy-number spectra) analysis:

- meryl (Rhie, A., Walenz, B. P., Koren, S., et al. (2020). https://doi.org/10.1186/s13059-020-02134-9)
- Merqury (Rhie, A., Walenz, B. P., Koren, S., et al. (2020). https://doi.org/10.1186/s13059-020-02134-9)

Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs):

- BUSCO v5 (Seppey, M., Manni, M., & Zdobnov, E. M. (2019). https://busco.ezlab.org/)

Scaffold/contig statistics (N#/L# stats, scaffold metrics, sequence counts, GC content, estimated genome size):

- Python scripts (Mike Trizna. assembly_stats 0.1.4. Zenodo. (2020). http://doi.org/10.5281/zenodo.3968775)
## How to choose your Illumina libraries

Variations in sequencing methods/protocols can increase bias in the corresponding raw sequencing libraries. Sequencing of a biological sample often includes both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert sizes of 180, 250, 500, 800 bp) paired-end libraries.

Usually you can deduce the insert sizes and library types from the metadata found within NCBI or an SRA archive. In order to keep bias as low as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.