james94 authored
dc1d4023

Genome Evaluation Pipeline (GEP)

  • User-friendly and all-in-one quality control and evaluation pipeline for genome assemblies

  • Run multiple genome evaluations in one go (as many as you want!)

  • Seamlessly scaled to server, cluster, grid and cloud environments

  • Required software stack automatically deployed to any execution environment using snakemake and conda

Getting Started

Step 1. Downloading the workflow

To clone the repository, use the following command:

git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git

Step 2. Conda management

  • Conda (v4.10.3), though older versions may also work

If you already have conda installed on your system, please skip to Step 3.

Download the Linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html

Run the miniconda3 installation and check if it worked:

bash /<your_path_to>/Miniconda3-latest-Linux-x86_64.sh
##Follow miniconda3 installation instructions##

source ~/.bashrc

conda update conda

If the conda command is not found, close and re-open your terminal so that the conda installation takes effect, and then run the update.


Step 3. Creating our Snakemake conda environment

The pipeline requires the following software to run:

  • snakemake (6.6.1+)
  • python (3.9.1+)
  • tabulate (0.8.7+)
  • beautifulsoup4 (4.9+)
  • mamba (0.15.2)

The easiest method to install this software stack is to create a GEP conda environment with the provided installGEP.yaml (see the note at the end of this step):

conda env create -f /<your_path_to>/GEP/installGEP.yaml

conda activate GEP

##check snakemake installed correctly

snakemake --version

Note: If you already have a snakemake installation and would like to avoid installing it again, ensure that all of the above software is in your PATH and that conda is installed and activated.
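If you are relying on an existing installation, a quick check along the following lines can confirm that the software listed above is reachable. This is only a sketch: check_tools is a hypothetical helper written for this example and is not part of GEP, and it verifies presence only, not the minimum versions.

```shell
# check_tools is a hypothetical helper (not part of GEP).
# It reports every command that cannot be found in PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Command-line tools from the list above.
check_tools conda mamba snakemake python || echo "Some tools are not in your PATH"

# tabulate and beautifulsoup4 are Python packages, so they are checked by
# importing them (beautifulsoup4 is imported under the name bs4).
python -c "import tabulate, bs4" 2>/dev/null || echo "tabulate and/or beautifulsoup4 not importable"
```

Versions can then be compared by hand against the list above, for example with snakemake --version and python --version.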

Step 4. Set up sample sheet and configuration file

BELOW NOT COMPLETE

GEP can be run in two modes:

  1. Create meryl database

  2. Run evaluation

    • Input: Sample sheet outlining k-mer database (.meryl) and corresponding assembly (example)
    • Output: Evaluation results and report

Save your sample sheet and provide its location to the samplesTSV key in the config.yaml, which can be found in the configuration folder.

GEP will run in either meryl building mode or evaluation mode, depending on which sample sheet you provide in the config.yaml.
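Before running, it can help to confirm that the samplesTSV entry actually points to an existing file. The sketch below assumes that the configuration file is at configuration/config.yaml relative to the main GEP folder and that the key is written on a single line as samplesTSV: <path>; both are assumptions about the layout, and yaml_value is a hypothetical helper, not part of GEP.

```shell
# yaml_value is a hypothetical helper (not part of GEP).
# It prints the value of a simple top-level "key: value" line, without quotes.
yaml_value() {
    sed -n "s/^$1:[[:space:]]*//p" "$2" | tr -d "\"'" | head -n 1
}

# Assumed location of the configuration file; adjust if yours differs.
config="configuration/config.yaml"

if [ -f "$config" ]; then
    sheet=$(yaml_value samplesTSV "$config")
    if [ -f "$sheet" ]; then
        echo "Sample sheet found: $sheet"
    else
        echo "Sample sheet not found: $sheet"
    fi
fi
```

The helper only handles plain single-line entries; a value followed by a trailing comment or spread over several lines would need a real YAML parser.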

Step 5. Running the workflow

You should be inside the main GEP folder where the Snakefile is directly accessible.

Make sure your GEP environment is activated.

First, run GEP in dry-run mode:

snakemake -n

This will check whether any of your parameters/paths have been modified incorrectly. It will also install all the necessary environments to be utilised by the workflow, and download the busco5 database if it doesn't already exist. Unfortunately, downloading the busco5 database produces a lot of output in the terminal, which is a limitation of the wget command used for the download.

After the dry run and downloads have completed, you can run the full pipeline with:

Where --cores # is the maximum number of cores (synonymous with threads in this case) you want to utilise.
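As a rough sketch of how a value for --cores might be chosen, the snippet below caps a requested core count at the number of cores the machine reports. pick_cores is a hypothetical helper and the value 16 is only an illustration; --cores is a standard Snakemake option, but any further options that GEP's run command requires are not shown here.

```shell
# pick_cores is a hypothetical helper (not part of GEP).
# It prints the requested core count, capped at the number of available cores.
pick_cores() {
    requested=$1
    available=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

cores=$(pick_cores 16)   # 16 is an illustrative request
echo "Suggested option: --cores $cores"
```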

For example, if you run snakemake with the command:

Citations

#######################################################################

The software/tools used as part of our genome evaluation are as follows:

Pre-Processing (least biased short-read dataset available):

Reference-free Genome Profiling

K-mer distribution (copy-number spectra) analysis

Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs)

Scaffold/contig statistics: N# and L# stats, scaffold metrics, sequence counts, GC content, Estimated genome size

#######################################################################

How to choose your Illumina libraries

Variations in sequencing methods/protocols can lead to an increase in bias in the corresponding raw sequencing libraries. Sequencing of a biological sample often produces both mate-pair/long-insert libraries (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert paired-end libraries (e.g. insert sizes of 180, 250, 500, 800 bp). Usually you can deduce the insert sizes and library types from the metadata found within NCBI or an SRA archive. In order to keep bias as low as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.

If your libraries were sequenced using 10x barcodes (10X Genomics), you should assign a value of True to the trim10x column in the relevant sample sheet.

Reporting