Skip to content
Snippets Groups Projects
james94's avatar
james94 authored
f1831026
History

README

Genome Evaluation Pipeline (GEP)


  • User-friendly and all-in-one quality control and evaluation pipeline for genome assemblies

  • Run multiple genome evaluations in one parallel (as many as you want!)

  • Scales to server, HPC environments

  • Required software stack automatically deployed using conda


The Workflow - 2 Steps

  1. Build meryl k-mer databases

    • Requires: WGS sequencing libraries (Pacbio HiFi or Illumina PE short-insert)
    • Output: (.meryl) k-mer database
  2. Assembly evaluation

    • Requires: (.meryl) k-mer database and corresponding genome assembly you wish to evaluate
    • Output: Evaluation results and report


Downloading the workflow

To clone the repository, use the following command:

git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git


Installing Conda

  • Conda (v4.11+) but may work on older versions

If you already have conda installed on your system, please skip to Creating our Snakemake conda environment

Download the linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html

Run the miniconda3 installation and check if it worked:

bash /<your_path_to>/Miniconda3-latest-Linux-x86_64.sh
##Follow miniconda3 installation instructions##

source ~/.bashrc

conda update conda

If conda command not found please close and re-open your terminal for conda installation to take effect, and then update.


Creating our Snakemake conda environment

The pipeline requires the following software to run:

  • snakemake (v6.6.1)
  • python (v3.9.10)
  • tabulate (v0.8.7)
  • beautifulsoup4 (v4.9)
  • mamba (v0.15.2) [Newest version causes error]
  • pandoc (v2.15)
  • tectonic (v0.8.2)

The easiest method to install this software stack is to create a GEP conda environment with the provided installGEP.yaml (see *Note)

conda env create -f /<your_path_to>/GEP/installGEP.yaml

conda activate GEP

##check snakemake installed correctly

snakemake --version

*Note If you already have a snakemake (or suitable Python) installation and would like to avoid installing again, ensure that all of the above software are in your PATH. If you do this instead of installing from the provided GEP environment (installGEP.yaml), you will still need at least the base conda installed/activated - as it's required to handle software dependencies of all the tools used within the workflow itself


Configure Sample Sheet .tsvs

For the moment, GEP can run with only one of the below sample sheets at a given time.

Depending on which sample sheet is provided, GEP will automatically run in one of the following two modes:

  1. Build

  2. Evaluate

If you already have meryl k-mer databases (see the meryl github for more details) for the genomes you wish to evaluate, you can skip Build mode.



Illumina Sample Sheet .tsv - for Build mode

(see example GEP/configuration/exampleSampleSheets/build_illumina_example.tsv)

sample Library_R1 Library_R2 meryl_kmer_size trim10X trimAdapters fastQC
Your preferred identifier (Results will include this sample name in output files) Full path to Forward (R1) read of PE library/s in fastq format. Can be .gzipped Full path to Reverse (R2) read of PE library/s in fastq format. Can be .gzipped Your choice of k-mer size used to count k-mers in the reads. Recommended for illumina reads is 21 Remove first 23bp from R1 of the library. This is only if your reads were sequenced using 10X sequencing platform. Possible options are True or False Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are True or False Run FastQC on the library pair provided in hifi_reads. Possible options are True or False

Additional Info in case of multiple PE libraries for a single sample:

  • sample: Provide the same sample ID for each of the pairs of a sample (the final output will be a single .meryl database consisting of k-mers from all PE libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample.
  • Library_R1 and Library_R2: Each library pair is provided as one line in the tsv. If you have three PE libraries, then you will have three lines in the tsv for this sample.
  • meryl_kmer_size: This should be consistent for libraries that have the same sample ID.
  • trim10X: This does not need to be consistent. If you wish to build a database using a combination of 10x and non-10x barcoded libraries, you can do so. Only provide True option if you definitely want to trim the Library_R1 provided in that same line.
  • trimAdapters: Similarly, you may select True or False for each library independent of whether they are part of the same sample or not.
  • fastQC: If any library from a sample has True in this column, then all libraries with the identical sample ID will their quality checked with fastQC, even if these other libraries have False in this column.

If you are a little uncertain about which short-read libraries you should use, see How to choose your illumina libraries further down the page.



PacBio HiFi Sample Sheet .tsv

(see example GEP/configuration/exampleSampleSheets/runEval_example.tsv)

sample hifi_reads meryl_kmer_size trimSMRTbell fastQC
Your preferred identifier (Results will include this sample name in output files) Full path to hifi library/s in fastq format. Can be .gzipped Your choice of k-mer size used to count k-mers in the reads. Recommended for PacBio Hifi is 31 Check for and remove any SMRT-bell adapter sequences. Possible options are True or False Run FastQC on the library provided in hifi_reads. Possible options are True or False

Additional Info in case of multiple HiFi libraries for a single sample:

  • sample: Provide the same sample ID for each Hifi library of a sample (the final output will be a single .meryl database consisting of k-mers from all PE libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample.
  • Library_R1 and Library_R2: Each library pair is provided as one line in the tsv. If you have three PE libraries, then you will have three lines in the tsv for this sample.
  • meryl_kmer_size: This should be consistent for libraries that have the same sample ID.
  • trim10X: This does not need to be consistent. If you wish to build a database using a combination of 10x and non-10x barcoded libraries, you can do so. Only provide True option if you definitely want to trim the Library_R1 provided in that same line.
  • trimAdapters: Similarly, you may select True or False for each library independent of whether they are part of the same sample or not.
  • fastQC: If any library from a sample has True in this column, then all libraries with the identical sample ID will their quality checked with fastQC, even if these other libraries have False in this column.


You may also wish to concatenate the libraries of a sample together prior to building the database. In this case, you will only need to provide one line per sample in the respective sample sheets. However, the execution run-time will be hindered as the pipeline is designed to run on multiple libraries in parallel.


Assembly Evaluation Sample Sheet .tsv

See GEP/configuration/exampleSampleSheets/runEval_example.tsv for an idea of how to set up the evaluation sample sheet.

ID PRI_asm ALT_asm merylDB merylDB_kmer genomeSize HiC_R1 HiC_R2
Identifier for results and reporting Full path to primary assembly you wish to evaluate infasta format. Can be .gzipped Full path to alternate assembly (haplotype) infasta format. Can be .gzipped. If you do not have one, write None Full path to .meryl database The k-mer size used to build your provided .meryl db Provide a size estimate (in bp) for the corresponding assembly/species. If you do not have one, write auto Full path to Forward (R1) read of HiC library in fastq format. Can be .gzipped Full path to Forward (R1) read of HiC library in fastq format. Can be .gzipped

4.3 Configuration .yaml

Note There are multiple config.yaml files found inside the GEP project directories. Unlike the above sample sheets - which can be saved in any location and with any filename you wish - these config files must always preside in their existing locations and the name should not be changed.

First you must provide some run-specific information in GEP/configuration/config.yaml

Results:                # e.g. "/srv/public/users/james94/insecta_results_05_11_2021"

samplesTSV:             # e.g. "/srv/public/users/james94/GEP/configuration/buildPRI.tsv"

busco5Lineage:          # e.g. "insecta"

Once you have a sample sheet ready, you need to configure your GEP run.

Modify the

Running the workflow

Build meryl k-mer databases

Make sure your GEP environment is activated.

First you should run GEP in drymode:

snakemake -n

Which will check to see if some of your parameters/paths have been modified incorrectly. Further, it will install all the necessary environments to be utilised by the workflow, as well as download the busco5 database if it doesn't already exist. Unfortunaly when downloading the busco5 database, there will be lots of output in the terminal - a product of the limitations of the wget command used for downloading.

After the dry-run and downloading has complete, you can simply run the full pipeline with:

Where --cores # is the maximum number of cores (synonomous with threads in this case) you want to utilise.

For example if you run snakemake with the command:

Citations

#######################################################################

The software/tools used as part of our genome evaluation are as follows:

Pre-Processing (least biased short-read dataset available):

Reference-free Genome Profiling

K-mer distribution (copy-number spectra) analysis

Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs)

Scaffold/contig statistics: N# and L# stats, scaffold metrics, sequence counts, GC content, Estimated genome size

#######################################################################

How to choose your Illumina libraries

Variations in sequencing methods/protocols can lead to an increase in bias in the corresponding raw sequencing libraries. Sequencing a biological sample may often consist of both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert-sizes 180, 250, 500, 800bp) paired-end libraries, respectively.

Usually you can deduce the insert sizes and library types from the metadata found within NCBI or and SRA archive. In order to maintain a little bias as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.

Reporting