Genome Evaluation Pipeline
This pipeline allows users to produce a wide range of commonly used evaluation metrics for genome assemblies, regardless of their level of command-line experience.
By harnessing the capabilities of snakemake, we present a workflow that incorporates a number of command-line tools and can be run on multiple independent genome assemblies in parallel. A streamlined user experience is paramount to the development process of this pipeline, as we strive for three key user-oriented components:
- Automation
  - Beginning to end with just a few clicks! User interaction is significantly reduced compared to running the individual tools by themselves.
- Scalability
  - Seamlessly scales to server, cluster, grid and cloud environments without the need to modify the workflow definition.
- Portability
  - Workflows entail a description of required software, which will be automatically deployed to any execution environment.
Snakemake will use conda to install and manage our software packages and required tools. This avoids the software dependency conflicts that would otherwise stop the analysis from being simple to use and easily applied to different hardware. It also means that you, the user, do not have to be concerned with this at all - it is done for you!
#######################################################################
The software/tools used as part of our genome evaluation are as follows:
Pre-processing (to obtain the least-biased short-read dataset available):
- Trimmomatic (Bolger, A. M., Lohse, M., & Usadel, B. (2014). http://www.usadellab.org/cms/?page=trimmomatic)
- Trim_galore (Felix Krueger bioinformatics.babraham.ac.uk)
- Fastqc (Simon Andrews https://github.com/s-andrews/FastQC)
- Multiqc (Ewels, P., Magnusson, M., Lundin, S., Käller, M. (2016). https://doi.org/10.1093/bioinformatics/btw354)
Reference-free Genome Profiling
- GenomeScope2 (Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. (2020) https://github.com/tbenavi1/genomescope2.0)
K-mer distribution (copy-number spectra) analysis
- meryl (Rhie, A., Walenz, B.P., Koren, S. et al. (2020). https://doi.org/10.1186/s13059-020-02134-9)
- merqury (Rhie, A., Walenz, B.P., Koren, S. et al. (2020). https://doi.org/10.1186/s13059-020-02134-9)
Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs)
- BUSCOv4 (Seppey M., Manni M., Zdobnov E.M. (2019) https://busco.ezlab.org/ )
Scaffold/contig statistics: N# and L# stats, scaffold metrics, sequence counts, GC content, Estimated genome size
- Python scripts (Mike Trizna. assembly_stats 0.1.4 (Version 0.1.4). Zenodo. (2020). http://doi.org/10.5281/zenodo.3968775 )
#######################################################################
How to choose your Illumina libraries
Variations in sequencing methods/protocols can introduce bias into the corresponding raw sequencing libraries. Sequencing of a biological sample often comprises both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert sizes of 180, 250, 500, 800 bp) paired-end libraries. Usually you can deduce the insert sizes and library types from the metadata found within the NCBI or SRA archives. In order to maintain as little bias as possible whilst keeping decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.
If your library/s was sequenced using 10x barcodes (10X Genomics), you should remove the first 25-30 bp of the forward read (R1) only; this will remove all barcode content. For now, use Trimmomatic to do this manually - this step will be incorporated into the pipeline automatically shortly.
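Until then, a minimal sketch of how this could be done with Trimmomatic is shown below. Because Trimmomatic's paired-end mode applies each step to both reads, the sketch runs single-end mode on the R1 file alone; the file names and thread count are illustrative:
trimmomatic SE -threads 4 AssemblyX_R1_library1.fastq AssemblyX_R1_library1.headcrop.fastq HEADCROP:30
# HEADCROP:30 removes the first 30 bp (the barcode content) from every read.
# The trimmed R1 can then be paired with the untouched R2 in samples.tsv.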
Using the pipeline
Step 1. Downloading the workflow
First things first, we want to download/install this workflow. The easiest way to do this is to clone the git repository, provided you have the git command-line tool installed (for instructions on how to install it, see https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
To clone the repository, use the following command:
git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git
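Then move into the newly created project folder:
cd GEP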
(A tree illustration of the project's folder structure will be added here.)
Step 2. Conda management
If you already have conda installed on your system, please skip to step 3
We will use a minimal version of conda - Miniconda3 - as it has everything we need.
Please download the Linux Miniconda3 installer from the following URL: https://docs.conda.io/en/latest/miniconda.html
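If you are working on a remote machine without a browser, the same installer can be fetched directly on the command line (URL correct at the time of writing):
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh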
Run the miniconda3 installation using:
bash /<your_path_to>/Miniconda3-latest-Linux-x86_64.sh
Hold ENTER for the licensing agreement to print in its entirety, and agree to it by typing yes and then pressing ENTER once.
Choose the location you wish to install miniconda3, or use the default-determined location by simply hitting ENTER.
It is a good idea to update conda to its newest available version:
conda update conda
If you see something like conda: command not found, please either close and re-open your terminal for the conda installation to take effect, or run the following command (on Linux):
source ~/.bashrc
Step 3. Creating our Snakemake conda environment
Inside the main project folder will be a file called installGEP.yaml
i.e.
/<your_path_to>/GEP/installGEP.yaml
If you have followed the process of installing miniconda3 correctly (along with closing and re-opening your terminal or sourcing your bash profile), installing your GEP environment is very simple.
To create/install the GEP environment:
conda env create -f installGEP.yaml
The environment should install on its own; press ENTER if prompted to install the list of packages.
Your environment is now created and installed - we want to activate it by running the command:
conda activate GEP
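To quickly confirm that the environment exists and that snakemake is available inside it, you can run:
conda env list        # GEP should appear in the list, marked with '*' while active
snakemake --version   # prints the installed snakemake version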
Step 4. Modifying our configuration files
There are only two files (config.yaml and samples.tsv) that you as the user are required to modify. These are found in the configuration folder.
Firstly, we will modify samples.tsv, which consists of the paths to your data files. The data required for the workflow to work as intended are the actual genome assembly we wish to evaluate (in .fasta, .fa, or .fna format) and the set of Illumina short-insert paired-end raw sequencing libraries used to assemble the genome in question (in .fastq or .fq format; these may be gzipped, e.g. .fastq.gz).
An example of this samples.tsv file is as follows:
assemblyName Library_R1 Library_R2 assembly_fasta EstimatedSize(bps)
AssemblyX path/to/AssemblyX_R1_library1.fastq path/to/AssemblyX_R2_library1.fastq path/to/AssemblyX_genome.fasta 2136623189
AssemblyX path/to/AssemblyX_R1_library2.fastq path/to/AssemblyX_R2_library2.fastq path/to/AssemblyX_genome.fasta
AssemblyY path/to/AssemblyY_R1_library1.fastq path/to/AssemblyY_R2_library1.fastq path/to/AssemblyY_genome.fasta
AssemblyZ path/to/AssemblyZ_R1_library1.fastq path/to/AssemblyZ_R2_library1.fastq path/to/AssemblyZ_genome.fasta 1983101299
AssemblyZ path/to/AssemblyZ_R1_library2.fastq path/to/AssemblyZ_R2_library2.fastq path/to/AssemblyZ_genome.fasta 1983101299
This is a tab- (or just white-space-) separated document where you will fill out the paths to the relevant files. Usually you will not have more than a handful (sometimes only one pair) of raw Illumina libraries, so this document will most of the time be rather 'clean'.
Column 1 = Assembly name. You can put whatever you want here to identify the results by; for example, if you are running the analysis on Homo sapiens, you could set the name to HomoSapiens and find your results within the HomoSapiens folder. **
Columns 2 and 3 = paths to one pair of raw Illumina libraries: R1 (column 2) and its paired R2 (column 3). These can be gzipped.
Column 4 = path to the actual genome/assembly that you want to evaluate. (NOTE: currently must be uncompressed; .fa, .fna, .fasta) **
Column 5 = the estimated/expected genome size of the species whose assembly you wish to evaluate, if known. **
** Important to note: As you can see in the above example file, users can have multiple rows for the SAME assemblyName and the SAME assembly. This is often the case when multiple Illumina libraries were used to build the assembly. Here, the assemblyName (column 1) and assembly_fasta (column 4) remain the same, but the paths to the raw Illumina libraries (columns 2 and 3) differ, as each row points to one library from a possible set of multiple libraries.
For the estimated genome size (column 5), you can leave this blank, in which case the pipeline will infer it from GenomeScope2. Further, if you have multiple rows (libraries) for the same assembly, you can simply provide the estimated genome size in one of the rows (as for AssemblyX in the above example), leaving the other rows for the same assembly blank. This is something I will add the ability to do for columns 1 and 4 as well.
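Before launching the pipeline, it can save time to verify that every path listed in samples.tsv actually exists. A minimal sketch of such a check (not part of the pipeline; it assumes the layout above, with a header row and paths in columns 2-4):
awk 'NR>1 {print $2; print $3; print $4}' configuration/samples.tsv | sort -u |
while read -r f; do
    [ -e "$f" ] || echo "Missing file: $f"
done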
Secondly, we will modify config.yaml, which will look like this:
##Configuration files pathways
Results: "/<PathTo>/desiredDestination/Results_AssemblyEval"
samplesTSV: "/<PathTo>/GEP/configuration/samples.tsv"
busco4Lineage: "vertebrata"
buscoMode: "genome"
- Results: This is where all results for the run are stored. It does not have to be inside the project folder (it can be any location accessible by the user). You also do not need to create this folder yourself; it will be handled by snakemake.
- samplesTSV: This is the path to the aforementioned samples.tsv that was created/modified just above. For now, please keep this file inside the configuration folder, together with this config.yaml.
- busco4Lineage: Busco needs a lineage database to be able to run. Here you have a couple of different options.
- Manually download and unpack your desired database from https://busco-data.ezlab.org/v4/data/lineages/ (a command-line sketch is given at the end of this step). In this case (or if you already have the database downloaded to a specific location), you can provide the full path:
busco4Lineage: "/<PathTo>/manuallyDownloaded/vertebrata_odb10"
- Alternatively, you can just provide the taxonomy name that you wish to use. In this case, the latest database matching the provided name will be automatically downloaded prior to execution, if it doesn't already exist inside the buscoLineage directory. If it already exists there - whether from a manual download or from an automatic download during a previously executed run - the pipeline will skip re-downloading.
busco4Lineage: "vertebrata"
- You can change the busco mode, but considering the scope of this evaluation in its current state, this option is rather redundant and will be removed/hidden.
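For the manual-download route mentioned above, the commands would look roughly like the sketch below. The tarball names on the BUSCO server include a release date, so browse https://busco-data.ezlab.org/v4/data/lineages/ for the actual filename; <date> is a placeholder:
wget https://busco-data.ezlab.org/v4/data/lineages/vertebrata_odb10.<date>.tar.gz
tar -xzf vertebrata_odb10.<date>.tar.gz   # unpacks into vertebrata_odb10/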
Step 5. Running the workflow
If everything is set up correctly, we can run the pipeline very simply.
For now (though it should be a simple fix!), you must run the pipeline from inside the project folder. In other words, you must be inside the folder where the file Snakefile is directly accessible.
Make sure your GEP environment is activated.
First, you should run GEP in dry-run mode:
snakemake -n
This will check whether any of your parameters/paths have been modified incorrectly. Further, it will install all the necessary environments to be utilised by the workflow, as well as download the busco4 database if it doesn't already exist. Unfortunately, when downloading the busco4 database there will be a lot of output in the terminal - a product of the limitations of the wget command used for downloading.
After the dry-run and downloading have completed, you can simply run the full pipeline with:
snakemake --cores # --use-conda && snakemake --report
Where --cores # is the maximum number of cores (synonymous with threads in this case) you want to utilise.
For example if you run snakemake with the command:
snakemake --cores 32 --use-conda && snakemake --report
Your pipeline will be executed with at most 32 threads in total across jobs/processes. Each job defined in the pipeline uses a percentage of these 32 threads, allowing execution to scale automatically. For example, if two jobs are ready to be executed and each is defined as requiring 50% of the total available cores, they can both be run in parallel.
Never use more threads than are feasibly available on the architecture used to run the pipeline. Soon I will incorporate further modifications to provide even more scalability and portability (cluster/cloud execution, job-scheduler compatibility, and more!)
If you want to see the options you can use for execution, please see the snakemake help with snakemake -h
For example, you can cap the total amount of memory (e.g. 100GB) the pipeline is allowed to use by modifying the execution command to:
snakemake --cores 32 --use-conda --resources mem_mb=100000 && snakemake --report
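A full evaluation can run for a long time, so on a remote machine you may want the run to survive a dropped connection. One generic way to do this (tmux or screen work equally well) is nohup; the log-file name here is arbitrary:
nohup snakemake --cores 32 --use-conda > gep_run.log 2>&1 &
tail -f gep_run.log   # follow progress; Ctrl+C stops tail, not the run
# once the run has finished, generate the report as usual:
snakemake --report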
Reporting
Instead of (or as well as) retrieving the result files directly from the locations specified in the Results section (Step 6), the && snakemake --report part of the command above will create an interactive .html report upon completion. This .html document will contain all the relevant key files, among other things such as the Directed Acyclic Graph (DAG) that snakemake uses to drive the order of execution, the run-times of each individual step, and more (work in progress).
The report will be created in the main project directory - the same location as the Snakefile, from where you executed the pipeline.
Step 6. Results
ALL results can be found in the results directory you defined in config.yaml. Within this results folder, you will have a directory for each of the assemblies (assemblyName) you defined in samples.tsv. The pipeline produces a large number of files at each step and for each tool, respectively. Taking this into consideration, the results can be considered in three tiers according to their importance and ease of viewing.
Tier 3
The full results from all tools, including every file created during the execution of each tool, respectively. These results can be navigated at will, and are separated by their respective intended purposes (i.e. QVstats_merylAndMerqury, assemblystats, etc.)
Tier 2
The key result files (key plots, statistics, etc.) are aggregated and copied into a separate folder within each assembly folder, e.g. /path/to/Results/SpeciesX/keyResults/.
The key result files are:
Multiqc report
- assemblyName_multiqc_report.html
QV Score
- assemblyName_merq.qv
Copy-Number Spectra plots
- assemblyName_merq.spectra-cn.fl.png
- assemblyName_merq.spectra-cn.st.png
Genomescope2 profile plots and summary
- assemblyName_gScope_log_plot.png
- assemblyName_gScope_linear_plot.png
- assemblyName_gScope_summary.txt
Assembly statistics (N#, NG#, L#, LG#, sequence counts, total bps, etc.)
- assemblyName_scaffold_stats.tsv
- assemblyName_contig_stats.tsv
BUSCOv4 results
- short_summary.specific.vertebrata_odb10.assemblyName.txt
Key values from all of the above files pulled into an aggregated table (useful for a quick glance)
- assemblyName_aggregatedResults.tsv
Tier 1
A separate folder is created within the main results directory (i.e. /path/to/Results/allAssemblies_keyResults).
Within this folder you will find a combined aggregate file (/path/to/Results/allAssemblies_keyResults/key_results.tsv), a tsv that combines the aforementioned key values from each evaluated assembly into one single file. This is useful for plotting the key values across multiple assemblies.
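For a quick look at this combined table directly in the terminal, generic shell tools are enough (adjust the path to your own Results location):
column -t -s$'\t' /path/to/Results/allAssemblies_keyResults/key_results.tsv | less -S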