diff --git a/README.md b/README.md index c8543a3589cbd95968aca03d38efd2be0a04141e..a4f0df0c8debe90555ea963268827e9e803b589d 100644 --- a/README.md +++ b/README.md @@ -4,58 +4,83 @@ # Genome Evaluation Pipeline (GEP) <br> -* User-friendly and **all-in-one quality control and evaluation** pipeline for genome assemblies +* User-friendly and **all-in-one quality control and evaluation** Snakemake pipeline for genome assemblies * Run **multiple genome evaluations** in parallel (as many as you want!) -* **Scales to server, HPC environments** +* **Scales to server and HPC environments** -* Required **software stack automatically deployed** using conda +* Required **software stack automatically deployed** using Conda -*Links to other parts of the readme currently not working. Links to other files in the gitlab do work!* --- -## General Workflow of Genome evaluation - +# The GEP Workflow - Two *Modes* -## The GEP Workflow - Two *Modes* +### 1. **Build Mode**: -1. [**Build** meryl k-mer databases](#build-meryl-k-mer-databases) -<br> +Builds meryl k-mer databases from sequencing reads. If you've already generated these databases, skip to **Evaluate Mode**. - +--- +#### Requires: + - Raw Illumina short-insert whole-genome sequencing reads -- Requires: WGS sequencing libraries (Pacbio HiFi or Illumina PE short-insert `.fastq`'s) -- Output: (`.meryl`) k-mer database +and/or + - Raw PacBio HiFi sequencing reads +--- +#### *Process* and **tools**: + - *(Optionally) trim raw sequencing reads* + - **Trim Galore** (remove Illumina adapters) and **Trimmomatic** (remove 10x barcodes if present) for Illumina reads + - **Cutadapt** (remove SMRTbell adaptors) for HiFi reads + - *(Optionally) perform quality checks of raw sequencing reads* + - **FastQC** and **MultiQC** + - *Builds k-mer databases from raw sequencing reads to be used for evaluation* + - **Meryl** <br> + + + -2. 
[Assembly **Evaluation**](#assembly-evaluation) <br> - +--- -- Requires: (`.meryl`) k-mer databases and corresponding genome assemblies (`.fasta`'s) you wish to evaluate. As well as HiC Libraries as optional input. -- Output: - - Assembly Stats (L#, N#, Scaffold, contig, etc.) - - BUSCO results - - QV Score - - K-mer Completeness - - genomescope profile - - Pretext Map (optional) - - **PDF Report aggregating all of the above** +### 2. **Evaluation Mode:** Runs numerous whole-genome assembly evaluation tools, generating a user-friendly report for easy viewing and distribution (as well as individual outputs as normally generated by the tools in a well-defined folder structure). - +--- +Requires: + - Meryl k-mer databases (as produced by **Build Mode**) + - Genome assembly(ies) to evaluate - including (optionally) a phased alternate assembly + - (Optionally) Hi-C libraries + - (Optionally) Estimated genome sizes +--- +*Process* and **tools**: + - *K-mer completeness/correctness, QV Score, Spectra-CN (k-mer multiplicity) plots* + - **Merqury** + - *Genome profiling based on raw sequencing reads* + - **GenomeScope2** + - *Extensive contiguity statistics such as N# and L# stats, Scaffold and contig counts, GC Content, etc.* + - **Python script** + - *Gene completeness/correctness (i.e. recovered complete, duplicated, missing BUSCOs)* + - **BUSCO5** + - *(Optionally) Hi-C contact maps* + - **BWA-MEM2** and **samtools** for aligning Hi-C data + - **Pretext** for generating contact maps <br> + + + + --- @@ -66,7 +91,6 @@ To clone the repository, use the following command: ``` git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git ``` -<br> --- @@ -93,24 +117,23 @@ conda update conda ``` If you see `conda: command not found`, close and re-open your terminal for the conda installation to take effect, and then update. 
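The contiguity statistics listed under Evaluation Mode (N# and L# stats) follow the standard definitions: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover half the assembly, and L50 is the number of contigs in that set. A minimal illustrative sketch in Python (a hypothetical helper for intuition, not GEP's bundled script):

```python
def nl_stats(lengths, fraction=0.5):
    """Return (N, L) for the given fraction: N50/L50 when fraction=0.5.

    N is the length of the shortest contig in the smallest set of longest
    contigs whose summed length reaches `fraction` of the total assembly
    length; L is how many contigs that set contains.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= total * fraction:
            return length, count

# Toy assembly of five contigs totalling 400 bp:
# sorted lengths 150, 100, ... reach 200 bp after two contigs.
print(nl_stats([100, 150, 50, 60, 40]))  # -> (100, 2)
```

The same function yields N90/L90 with `fraction=0.9`, which is why GEP-style reports can list a whole family of N#/L# values from a single pass over the contig lengths.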
-<br> --- <div id="creating-our-snakemake-conda-environment"></div> -## Creating our Snakemake conda environment +## Creating our GEP (Snakemake) Conda environment The pipeline requires the following software to run (some versions may not need to match exactly): -- snakemake (v7.6.2) +- snakemake (v7.8.5) - beautifulsoup4 (v4.9.3) - pandas (v1.4.2) - numpy (v1.22.3) -- python (v3.10.4) +- python (v3.10) +- mamba (v0.25) The easiest method to install this software stack is to create a GEP conda environment with the provided `installGEP.yaml` (see **Note** below) -You may need to close and reopen your terminal after the `conda env create ...` command, or use `source ~/.bashrc` depending on your system. ``` conda env create -f /<your_path_to>/GEP/installGEP.yaml @@ -122,13 +145,9 @@ conda activate GEP snakemake --version ``` -Installing mamba within your GEP environment is recommended, as it makes conda package managing/installing quicker. To do so, run the following command after the GEP environment is installed and activated: +You may need to close and reopen your terminal after the `conda env create ...` command, or use `source ~/.bashrc` depending on your system. -``` -conda install -c conda-forge mamba -``` -If you do not install mamba, you will need to make sure the following flag is included in your snakemake command at the time of execution: `--conda-frontend=conda` **Note:** *If you already have a snakemake (or suitable Python) installation and would like to avoid installing again, ensure that all of the above software is in your `PATH`. 
If you do this instead of installing from the provided GEP environment (`installGEP.yaml`), you will still need at least the base conda installed/activated - as it's required to handle software dependencies of all the tools used within the workflow itself* <br> @@ -141,7 +160,7 @@ If you do not install mamba, you will need to make sure the following flag is in ## **Configure Sample Sheet `.tsv`s** -<span style="color:darkred">For the moment, GEP can run with *only one* of the below sample sheets at a given time. </span> +**IMPORTANT:** GEP can run with *only one* of the below sample sheets at a given time. A sample sheet *can* however include multiple different samples, datasets, and assemblies for a given run. **Depending on which sample sheet is provided, GEP will automatically run in one of the following two modes:** @@ -171,7 +190,7 @@ If you already have meryl k-mer databases [(see the meryl github for more detail |sample|Library_R1|Library_R2|meryl_kmer_size| trim10X| trimAdapters| fastQC| |:----|:---|:---|:---|:---|:----|:---| -|Your preferred identifier (Results will include this sample name in output files) | Full path to Forward (R1) read of PE library/s in `fastq` format. Can be `.gz`ipped | Full path to Reverse (R2) read of PE library/s in `fastq` format. Can be `.gz`ipped| Your choice of k-mer size used to count k-mers in the reads. Recommended for illumina reads is `21`| Remove first 23bp from R1 of the library. This is only if your reads were sequenced using 10X sequencing platform. Possible options are `True` or `False` | Check for and remove any other sequencing adapters that may still be present in the reads. Possible options are `True` or `False` |Run FastQC on the library pair provided in `hifi_reads`. Possible options are `True` or `False`| +|Your preferred identifier (Results will include this sample name in output files) <br /><br />**`string`** | **Full path** to Forward (R1) read of PE library/s **in `fastq` format. 
Can be `.gz`ipped** | **Full path** to Reverse (R2) read of PE library/s **in `fastq` format. Can be `.gz`ipped**| Your choice of k-mer size used to count k-mers in the reads. <br /><br /> **Recommended for Illumina reads is `21`**| Remove first 23bp from R1 of the library. This is only if your reads were sequenced using the 10X sequencing platform.<br /><br /> **Possible options are `True` or `False`** | Check for and remove any other sequencing adapters that may still be present in the reads.<br /><br /> **Possible options are `True` or `False`** |Run FastQC on the library pair provided in `Library_R1` and `Library_R2` columns. <br /><br /> **Possible options are `True` or `False`**| <br> @@ -179,9 +198,6 @@ Additional Info in case of multiple PE libraries for a single sample: - **sample**: Provide the same sample ID for each of the pairs of a sample (the final output will be a single `.meryl` database consisting of k-mers from all PE libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample. - **Library_R1** and **Library_R2**: Each library pair is provided as one line in the tsv. If you have three PE libraries, then you will have three lines in the tsv for this sample. - **meryl_kmer_size**: This should be consistent for libraries that have the same sample ID. - - **trim10X**: This does not need to be consistent. If you wish to build a database using a combination of 10x and non-10x barcoded libraries, you can do so. Only provide `True` option if you definitely want to trim the `Library_R1` provided in that same line. - - **trimAdapters**: Similarly, you may select `True` or `False` for each library independent of whether they are part of the same sample or not. 
- - **fastQC**: If any library from a sample has `True` in this column, then all libraries with the identical sample ID will their quality checked with fastQC, even if these other libraries have `False` in this column. If you are a little uncertain about which short-read libraries you should use, see [How to choose your illumina libraries](#how-to-choose-your-illumina-libraries) further down the page. @@ -192,11 +208,11 @@ If you are a little uncertain about which short-read libraries you should use, s <div id="hifi-sample"></div> #### PacBio HiFi Sample Sheet `.tsv` - for **Build** mode -(see example [GEP/configuration/exampleSampleSheets/runEval_example.tsv](configuration/exampleSampleSheets/runEval_example.tsv)) +(see example [GEP/configuration/exampleSampleSheets/build_hifi_example.tsv](configuration/exampleSampleSheets/build_hifi_example.tsv)) |sample|hifi_reads|meryl_kmer_size| trimSMRTbell| fastQC| |:----|:---|:---|:----|:---| -|Your preferred identifier (Results will include this sample name in output files) | Full path to hifi library/s in `fastq` format. Can be `.gz`ipped | Your choice of k-mer size used to count k-mers in the reads. Recommended for PacBio Hifi is `31`| Check for and remove any SMRT-bell adapter sequences. Possible options are `True` or `False` | Run FastQC on the library provided in `hifi_reads`. Possible options are `True` or `False` | +|Your preferred identifier (Results will include this sample name in output files) <br /><br />**`string`** | **Full path** to PacBio HiFi library **in `fastq` format. Can be `.gz`ipped** | Your choice of k-mer size used to count k-mers in the reads. <br /><br /> **Recommended for PacBio HiFi is `31`**| Check for and remove any SMRT-bell adapter sequences. <br /><br /> **Possible options are `True` or `False`** | Run FastQC on the library provided in `hifi_reads` column. 
<br /><br /> **Possible options are `True` or `False`** | @@ -204,15 +220,13 @@ If you are a little uncertain about which short-read libraries you should use, s Additional Info in case of multiple HiFi libraries for a single sample: - **sample**: Provide the same sample ID for each HiFi library of a sample (the final output will be a single `.meryl` database consisting of k-mers from all HiFi libraries given for a sample, with this identifier as a prefix). Every line with the same unique sample ID will be considered as coming from the same sample. - - **Library_R1** and **Library_R2**: Each library pair is provided as one line in the tsv. If you have three PE libraries, then you will have three lines in the tsv for this sample. + - **hifi_reads**: Each library is provided as one line in the tsv. If you have three HiFi libraries for the same sample, then you will have three lines in the tsv for this sample. - **meryl_kmer_size**: This should be consistent for libraries that have the same sample ID. - - **trim10X**: This does not need to be consistent. If you wish to build a database using a combination of 10x and non-10x barcoded libraries, you can do so. Only provide `True` option if you definitely want to trim the `Library_R1` provided in that same line. - - **trimAdapters**: Similarly, you may select `True` or `False` for each library independent of whether they are part of the same sample or not. - - **fastQC**: If any library from a sample has `True` in this column, then all libraries with the identical sample ID will their quality checked with fastQC, even if these other libraries have `False` in this column. + <br> --- -<span style="color:darkred">You may also wish to concatenate the libraries of a sample together prior to building the database. In this case, you will only need to provide one line per sample in the respective sample sheets. However, the execution run-time will be hindered as the pipeline is designed to run on multiple libraries in parallel. 
</span> +**Note:** You may also wish to concatenate the libraries of a sample together prior to building the database. In this case, you will only need to provide one line per sample in the respective sample sheets. However, the overall run-time will be longer, as the pipeline is designed to process multiple libraries in parallel. --- @@ -223,13 +237,16 @@ Additional Info in case of multiple HiFi libraries for a single sample: See [GEP/configuration/exampleSampleSheets/runEval_example.tsv](configuration/exampleSampleSheets/runEval_example.tsv) for an idea of how to set up the evaluation sample sheet. -|ID|PRI_asm|ALT_asm| merylDB| merylDB_kmer |genomeSize|HiC_R1|HiC_R2| -|:----|:---|:---|:----|:---|:---|:---|:------------------| -|Identifier for results and reporting| Full path to primary assembly you wish to evaluate in`fasta` format. Can be `.gz`ipped | Full path to alternate assembly (haplotype) in`fasta` format. Can be `.gz`ipped. **If you do not have one, write** `None` | Full path to `.meryl` database | The k-mer size used to build your provided `.meryl` db | Provide a size estimate (in bp) for the corresponding assembly/species. **If you do not have one, write** `auto` |Full path to Forward (R1) read of HiC library in `fastq` format. Can be `.gz`ipped |Full path to Forward (R1) read of HiC library in `fastq` format. Can be `.gz`ipped| +|ID|ASM_LEVEL | PRI_asm|ALT_asm| merylDB| merylDB_kmer |genomeSize|HiC_R1|HiC_R2| +|:----|:---|:---|:---|:----|:---|:---|:---|:------------------| +|Identifier for results and reporting| Level to which the genome is assembled. <br /><br /> **Possible options are `chr` (chromosome), `scaff` (scaffold), or `cont` (contig)** |Full path to primary assembly you wish to evaluate in `fasta` format (any extension allowed). <br /><br /> **Can be `.gz`ipped** | Full path to alternate assembly (haplotype) in `fasta` format (any extension allowed). 
<br /><br /> **Can be `.gz`ipped.** <br /><br /> **If you do not have one, write** `None` | Full path to `.meryl` database. <br /><br /> **Must have `.meryl` extension**| The k-mer size used to build the provided corresponding `.meryl` database. | Provide a size estimate (in bp and as an `integer`) for the corresponding assembly/species. <br /><br /> **If you do not have one, write** `auto` |Full path to Forward (R1) reads of HiC library* in `fastq` format. <br /><br /> **Can be `.gz`ipped** |Full path to Reverse (R2) read of HiC library* in `fastq` format. <br /><br /> **Can be `.gz`ipped**| + +*Please concatenate all Hi-C libraries for a sample into a single library pair (one file for R1 and one file for R2), providing this concatenated pair to the sample sheet.* + +#### Difference between `sample` column in Build Mode sample sheets and `ID` column in Evaluate Mode sample sheet +In practice, `sample` from Build Mode sample sheets and `ID` from Evaluate Mode sample sheets may often be synonymous; however, they are defined differently for cases where multiple assemblies (defined in the sample sheet with unique `ID`s) are evaluated using the same k-mer database from a single `sample`, and so the results need to be differentiated. ---- -<span style="color:darkred">Please concatenate all HiC Libraries for a sample into a single library pair (one file for R1 and one file for R2), providing this concatenated pair to the sample sheet </span> --- @@ -250,58 +267,47 @@ resources: "configuration/define_resources.yaml" ``` ##### `Results:` -Provide a path where all your results are stored for the run. 
It does not have to be inside the project folder (can be any location accessible by the user). You also do not need to create this folder yourself, Snakemake will create this folder if it does not yet exist (together with any non-existent parent folders). ##### `samplesTSV:` -This is where you provide the full path for **one** of the sample sheets. Depending on which kind of sample sheet you provide, GEP will run the corresponding mode (**Build** - Illumina, **Build** - HiFi, or **Evaluate**) +This is where you provide the full path for **one** of the sample sheets after you have filled out the necessary information. Depending on which kind of sample sheet you provide, GEP will run the corresponding mode (**Build** - Illumina, **Build** - HiFi, or **Evaluate**) ##### `busco5Lineage` - Only required for **Evaluate** mode BUSCO needs a database to be able to run. Here you have a couple of different options. Have a look at all the BUSCO databases here https://busco-data.ezlab.org/v5/data/lineages/. +**You can simply provide the taxonomy name that you wish to use, and the latest database matching that name will be automatically downloaded prior to execution**. -You can either manually download and unpack your desired busco lineage (or if you already have a busco database downloaded),and provide the full path to this, for example: ``` -busco5Lineage: "/<PathTo>/manuallyDownloaded/vertebrata_odb10" +busco5Lineage: "vertebrata" ``` -**Or you can simply provide the taxonomy name that you wish to use and latest database matching the name provided will be automatically downloaded prior to execution**. - +**OR** you can manually download and unpack your desired BUSCO lineage and provide the full path to this. The same applies if you already have the necessary BUSCO database on your system and do not wish for GEP to re-download it. 
In this case, provide the path to the already existing BUSCO db: ``` -busco5Lineage: "vertebrata" +busco5Lineage: "/<PathTo>/manuallyDownloaded/vertebrata_odb10" ``` -*If a busco database matching the given name already exists in the `GEP/buscoLineage/` folder, either from manual download or from previously automatic download (from previously executed runs), then the pipeline will skip re-downloading.* - --- - -# Running GEP - - - -*Make sure your GEP environment is activated and you are inside the main GEP folder* - -For both local and HPC (slurm) execution, it is good practice to run GEP as a *dry run* first **before any full execution**. This is give you an idea of whether the sample sheet and `configuration/config.yaml` are filled out correctly (by checking if the paths exists), and if there is anything wrong with your installation. +*If a BUSCO database matching the given name already exists in the `GEP/buscoLineage/` folder, either from a manual download or from a previous automatic download (from previously executed runs), then the pipeline will skip re-downloading.* -``` -snakemake --cores 1 --dry-run ``` +--- -If you are going to run Evaluate mode, the above dry run will also download and unpack the given BUSCO lineage database. Expect a lot of output to the terminal from this. ### Configure run (some more `.yaml`'s) +The user will also need to configure a separate `config.yaml` file pertaining to the execution of Snakemake, depending on whether or not GEP will be executed on HPC infrastructure. + +#### Default (non-HPC) configuration +Modify the Snakemake parameters in the file located at [GEP/SUBMIT_CONFIG/**default**/config.yaml](GEP/SUBMIT_CONFIG/default/config.yaml). -#### LOCAL +Here the user can provide a key-value pair for any Snakemake parameter, such as the maximum number of threads or jobs to utilise at any given time for the entire execution. 
Some commonly used parameters are already included in the provided file, but the user can add any Snakemake parameter to this file (run `snakemake --help` for possible options). -Modify the local snakemake parameters in the file located at [GEP/SUBMIT_CONFIG/**local**/config.yaml](GEP/SUBMIT_CONFIG/local/config.yaml). Here you can provide a key-value pair for any snakemake parameter (run `snakemake --help` for options) ``` #Example @@ -312,167 +318,201 @@ latency-wait: 60 rerun-incomplete: True printshellcmds: False keep-going: False - -#shadow-prefix: /srv/public/user/<user>/tempfileshere -#use-singularity: True ``` -##### LOCAL EXECUTION + + +#### HPC (SLURM) configuration + +Modify the Snakemake parameters in the file located at [GEP/SUBMIT_CONFIG/**slurm**/config.yaml](GEP/SUBMIT_CONFIG/slurm/config.yaml). Similar to the default execution, any Snakemake-specific parameter can be configured here. Additionally, the user can provide default resources to be used by every job submitted to the Slurm scheduler, such as `--partition` and `--qos`. + +--- + +## `cores`, `jobs`, `resources`, and parallelisation + +Memory (in MB), wall-time, and number of threads for non-locally executed jobs are defined in the file [GEP/configuration/define_resources.yaml](GEP/configuration/define_resources.yaml). The current values are likely overestimates and should allow for executing GEP without any issues. As required, these values should be increased or decreased accordingly. **If running on non-HPC infrastructure the wall-time will be ignored.** + +The Snakemake parameter `cores` (as found in `SUBMIT_CONFIG/<default_OR_slurm>/config.yaml` ) is the maximum number of cores (synonymous with threads in this case) GEP will utilise at any time during execution. 
**For example, during Evaluation Mode, if a single BUSCO job uses 16 threads (as currently set in define_resources.yaml), then by setting the maximum number of cores to 64, GEP will run up to (64/16 =) 4 BUSCO jobs in parallel. For other steps that require less than 16 threads, the number of jobs will be scaled up to utilise as many of the total requested cores as possible without exceeding the maximum.** + +Similarly, the user can provide a maximum number of jobs (e.g. `jobs: 10`). **GEP will never exceed either the maximum number of cores or the maximum number of jobs, whichever is reached first.** In the above example, **if the user were to provide `cores: 64` AND `jobs: 2`, then GEP will only ever run two jobs in parallel regardless of whether the maximum number of cores has been reached.** The user needs to consider how many samples there are (and therefore how many instances of a single step will be run), as well as how many resources are available. + +--- +# Running GEP + +Checklist before running: + +- [x] Installed and activated the GEP Conda environment +- [x] Filled out the respective sample sheet correctly +- [x] Filled out the run configuration (i.e. providing a results folder and the path to the sample sheet) in [GEP/configuration/config.yaml](configuration/config.yaml) +- [x] Changed the maximum number of `cores` and `jobs`, as well as any other Snakemake-specific parameters you wish to use, in the file [GEP/SUBMIT_CONFIG/**default**/config.yaml](GEP/SUBMIT_CONFIG/default/config.yaml), or for executing on an HPC (Slurm) infrastructure the file [GEP/SUBMIT_CONFIG/**slurm**/config.yaml](GEP/SUBMIT_CONFIG/slurm/config.yaml) +- [x] Modified the per-job resources (i.e. 
threads, memory, and wall-time) in the file [GEP/configuration/define_resources.yaml](GEP/configuration/define_resources.yaml) + +*For simplicity, make sure you are inside the main GEP folder before running* + +For both local and HPC (slurm) execution, it is good practice to run GEP as a *dry run* first **before any full execution**. This will give you an idea of whether the sample sheet and `configuration/config.yaml` are filled out correctly (by checking if the paths exist), and whether there is anything wrong with your installation. + + + ``` -snakemake --profile SUBMIT_CONFIG/local/ +snakemake --cores 1 --dry-run ``` -#### HPC (SLURM) - -Modify the slurm snakemake parameters in the file located at [GEP/SUBMIT_CONFIG/**slurm**/config.yaml](GEP/SUBMIT_CONFIG/slurm/config.yaml). You can set your desired `partition` and `qos` under `default-resources:`, as well as any snakemake parameters. -##### SLURM EXECUTION +### Default Execution ``` -snakemake --profile SUBMIT_CONFIG/slurm/ +snakemake --profile SUBMIT_CONFIG/default/ ``` -##### Running using singularity container (only on HPC) +### HPC (Slurm) Execution -GEP has been containerized using docker. Within this container, snakemake will install the necessary software stacks as contained conda environments (separate to those installed during *normal* execution). All you have to do is make sure singularity is installed (often installed by your respective HPC IT team), and in your PATH: +``` +snakemake --profile SUBMIT_CONFIG/slurm/ +``` -e.g. +### Run in background +For those not aware, you can run GEP with `nohup` and `&` added to the command (both default and slurm). `nohup` will run the Snakemake command in a way that cannot be interrupted when you lose connection to the server/cluster. The trailing `&` will simply run the command in the background of your current terminal, allowing you to freely use the terminal instance in the meantime. 
``` -module load Singularity +nohup snakemake --profile SUBMIT_CONFIG/slurm/ & ``` -And executed by adding `--use-singularity` and `--singularity-args "--bind <path1>,<path2> --workdir <userTempDirectory>" --contain` +--- -Where the `--bind` paths are any location/s in your system that you require singularity to be able to ***recursively*** access, commonly your /scratch/username/ directory where both your raw data and results are/will be stored. Multiple paths can be given separated by a comma. -The `--workdir` is the singularity tmp directory. Often on HPC systems this will be automatically set as `/tmp/...`, but this can lead to issues during initial installation of the container as this default `tmp` directory may be too small. Instead, provide a path to a new directory to be used as the `tmp` for snakemake + singularity. +## More Useful Tips -The `--contain` flag ensures that the container is completely kept separate from the host systems files so that the packages being installed are not shared with those that already exist (or are in the host system's conda cache). +The user can directly override any of the Snakemake parameters (and paths) set in the respective configuration files through the command line when executing GEP. 
For example, to override the values in the sample configuration [GEP/configuration/config.yaml](configuration/config.yaml) for the path to the main results folder without having to edit the document itself: -#### EXAMPLE SINGULARITY EXECUTION +``` +snakemake --profile SUBMIT_CONFIG/default/ --config [Results=/Alternative/ResultsFolder] +``` +Or to override the BUSCO lineage for example: +``` +snakemake --profile SUBMIT_CONFIG/default/ --config [busco5Lineage=vertebrata] +``` + + +To override values for maximum number of cores and jobs, as well as setting the flag for printing the shell commands (a Snakemake-specific parameter) in the submission configuration as found in [GEP/SUBMIT_CONFIG/default/config.yaml](GEP/SUBMIT_CONFIG/default/config.yaml): ``` -snakemake --profile SUBMIT_CONFIG/slurm/ --use-singularity --singularity-args "--bind /scratch/james94,/scratch2/james94 --workdir /scratch/james94/tmp --contain" +snakemake --profile SUBMIT_CONFIG/default/ --cores 144 --jobs 20 --printshellcmds ``` -### Run in background -For those not aware, you can run GEP with `nohup` and `&` added to the command (both local and slurm, and with or without singularity). `nohup` will run the snakemake command in a way that cannot be interrupted when you lose connection to the server/cluster. The trailing `&` will simply run the command in the background of your current terminal, allowing you to freely use the terminal instance in the meantime. +The user can also run GEP only up to and including specific rules in the pipeline if they do not wish to run the full evaluation: + ``` -nohup snakemake --profile SUBMIT_CONFIG/slurm/ & +snakemake --profile SUBMIT_CONFIG/default/ --until <ruleName> ``` +In such cases all files that this rule depends on will be run, but final PDF reports will not be created. 
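The cores/jobs interplay described earlier can be expressed as simple arithmetic: the number of jobs of a given step that can run concurrently is the smaller of the `jobs` cap and the core budget divided by the step's thread count. A hedged sketch (a hypothetical helper for illustration, not part of GEP):

```python
def max_parallel_jobs(cores, jobs, threads_per_job):
    """Upper bound on concurrently running jobs of one step: limited by
    the global core budget (cores // threads_per_job) and by the `jobs`
    cap, whichever is reached first."""
    return min(jobs, cores // threads_per_job)

# cores: 64 with 16-thread BUSCO jobs -> up to 4 run in parallel
print(max_parallel_jobs(64, jobs=10, threads_per_job=16))  # -> 4
# ...but capping jobs: 2 lowers that to 2, regardless of spare cores
print(max_parallel_jobs(64, jobs=2, threads_per_job=16))   # -> 2
```

This is why lowering `jobs` can throttle a run even when plenty of cores remain unused.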
+ --- -## `cores:`, Resources, and parallelisation +# Results +## Folder Structure of results -All steps have a memory (in mb), wall-time, and number of threads defined in the file [GEP/configuration/define_resources.yaml](GEP/configuration/define_resources.yaml). This defines the resources that will be requested for each individual step, for each individual sample. **If running locally, wall-time will be ignored.** +### **Build Mode** + -The parameter `cores` as found in the `SUBMIT_CONFIG/<local_OR_slurm>/config.yaml` is the **maximum number of cores** (synonomous with threads in this case) you want to utilise at any given time. +### **Evaluate Mode** + -For example during **Evaluation** mode, if a single busco job used 16 threads (as is currently set as default), then by setting the maximum number of cores to `64`, GEP run can up to 4 (=64/16) busco jobs in parallel. For other steps that require less than 16 threads, the number of jobs will be scaled up to utlise as many of the total requested cores as possible without exceeding the maximum. -Similarly you can provide a maximum number of jobs (e.g. `jobs: 10`). GEP will never exceed either the maximum number of cores, or the maximum number of jobs, whichever is reached first. In the above example, if you were to provide `cores: 64` AND `jobs: 2`, then GEP will only ever run two jobs in parallel regardless of whether the maximum number of cores has been utilised. -Try to consider how many samples you have (and therefore how many instances of a single step will be run), as well as how much resources you have available. +## PDF Report ---- -## More tips +**One page per-assembly for key results** -You can directly override any of the snakemake parameters (and paths) set in the respective config files through the command line when executing GEP. 
+ -Some examples: -*To override the values in the sample configuration [GEP/configuration/config.yaml](configuration/config.yaml) such as path to results, path to samplesTSV, busco5Lineage* without having to edit the document itself: -``` -snakemake --profile SUBMIT_CONFIG/local/ --config [Results=/my/New/Results/Folder busco5Lineage=insecta] -``` +- **A:** Name of the assembly and k-mer size used for databases. +- **B:** Key contiguity statistics. Full contiguity statistics can be viewed in the respective output files for that step. +- **C:** GenomeScope profiles (Untransformed Linear and Log versions). +- **D:** QV Score and K-mer Completeness tables. +- **E:** BUSCO results including name of BUSCO clade. +- **F:** K-mer multiplicity plots for primary assembly only (Top) and both primary and alternate assemblies together (Bottom). If no alternate assembly is provided, the same plot will be present in both positions. -To override values in the submission configuration as found in `SUBMIT_CONFIG/<local_OR_slurm>/config.yaml` without having to edit the document itself: +--- +Followed by **Up to six (2 by 3) Hi-C contact maps per page** -``` -snakemake --profile SUBMIT_CONFIG/slurm/ --cores 144 --jobs 20 --printshellcmds -``` + -## Results + --- -For **Build** mode, meryl databases can be found at: -(This is the path that you provide to the evaluation sample sheet under the `merylDB` column) -``` -/<pathToResults>/0_buildDatabases/{sample}/<illuminaReads OR hifiReads>/03_merylDb/complete_{illuminaORhifi}_{sample}_dB.{smrtTrimmed}.{kmer}.meryl -``` -For **Evaluate** mode, individual results can be found at: -``` -/<pathToResults>/1_evaluation/{asmID}/ -``` -#### PDF Reports +Followed by a page with **an aggregated table (coloured as a heatmap) for all assemblies evaluated in a given run** -``` -/<pathToResults>/1_evaluation/finalResults/FULL_Report_PDF.pdf -/<pathToResults>/1_evaluation/finalResults/FULL_TABLE_PDF.pdf -``` + -**Full Report** +--- Followed by a page with 
**an aggregated table coloured according to tiers, also for all assemblies evaluated in a given run**. The threshold values for the different tiers were loosely based on the Earth Biogenome Project's [TABLE OF PROPOSED EBP METRICS](https://www.earthbiogenome.org/assembly-standards. - + --- - + +#### Tables in different formats + + +*The above tables can also be found in the results folder: `/<pathToResults>/1evaluation/finalResults/tables/` in various formats (i.e. `.tsv`, `.md`, `.html`).* --- -**Full Table** +#### All result files - +All results files as output by each evaluation tool can also be found in their respective folders. -#### Aggregated Results Table +--- +# How to choose your Illumina libraries -The above table can alsobe found in both TSV format (for easy import to plotting software such as R), and markdown (.rst) for copy-and-pasting into your preferred markdown. -``` -/<pathToResults>/1_evaluation/finalResults/Combined_Results_FULLTABLE.tsv -/<pathToResults>/1_evaluation/finalResults/FullTableMarkdown.md -``` +Variations in sequencing methods/protocols can lead to an increase in bias in the corresponding raw sequencing libraries. Sequencing a biological sample may often consist of both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert-sizes 180, 250, 500, 800bp) paired-end libraries, respectively. +Usually you can deduce the insert sizes and library types from the metadata found within NCBI or and SRA archive. In order to maintain a little bias as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline. + +--- ## Citations ####################################################################### -The software/tools used as part of our genome evaluation are as follows: +The software/tools used as part of our genome evaluation are as follows. 
+Please see [GEP/envs](envs) for the exact versions of tools, as they may be updated frequently:
-#### Pre-Processing (least biased short-read dataset available):
-* Trimmomatic (*Bolger, A. M., Lohse, M., & Usadel, B. (2014)*. http://www.usadellab.org/cms/?page=trimmomatic)
-* Trim_galore (*Felix Krueger* bioinformatics.babraham.ac.uk)
-* Fastqc (*Simon Andrews* https://github.com/s-andrews/FastQC
-* Multiqc (*Ewels, P., Magnusson, M., Lundin, S., Käller, M. (2016)*. https://doi.org/10.1093/bioinformatics/btw354)
+#### Pre-Processing:
+* **Trimmomatic** (*Bolger, A. M., Lohse, M., & Usadel, B. (2014)*. http://www.usadellab.org/cms/?page=trimmomatic)
+* **Trim Galore** (*Felix Krueger* bioinformatics.babraham.ac.uk)
+* **Cutadapt** (*Martin, M. (2011)*. https://doi.org/10.14806/ej.17.1.200)
+* **FastQC** (*Simon Andrews* https://github.com/s-andrews/FastQC)
+* **MultiQC** (*Ewels, P., Magnusson, M., Lundin, S., Käller, M. (2016)*. https://doi.org/10.1093/bioinformatics/btw354)
 #### Reference-free Genome Profiling
-* GenomeScope2 (*Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. (2020)* https://github.com/tbenavi1/genomescope2.0)
+* **GenomeScope2** (*Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. (2020)* https://github.com/tbenavi1/genomescope2.0)
+
+#### K-mer distribution analysis
+* **meryl** and **merqury** (*Rhie, A., Walenz, B.P., Koren, S. et al. (2020)*. https://doi.org/10.1186/s13059-020-02134-9)
-#### K-mer distribution (copy-number spectra) analysis
-* meryl (*Rhie, A., Walenz, B.P., Koren, S. et al. (2020)*. https://doi.org/10.1186/s13059-020-02134-9)
-* merqury (*Rhie, A., Walenz, B.P., Koren, S. et al. (2020)*. https://doi.org/10.1186/s13059-020-02134-9)
 #### Assessing quality and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCOs)
-* BUSCOv5 (*Seppey M., Manni M., Zdobnov E.M. (2019)* https://busco.ezlab.org/ )
+* **BUSCOv5** (*Seppey M., Manni M., Zdobnov E.M. (2019)* https://busco.ezlab.org/)
+#### Hi-C Alignment and Contact Maps
+* **samtools** (*Danecek, P., et al. (2021)* https://doi.org/10.1093/gigascience/giab008)
+* **BWA** (*Li H. and Durbin R. (2009)* https://doi.org/10.1093/bioinformatics/btp324)
+* **Pretext** (https://github.com/wtsi-hpag/PretextMap)
+
+#### Generate PDF reports from markdown files
+* **Pandoc** (https://pandoc.org/MANUAL.html)
 #### Scaffold/contig statistics: N# and L# stats, scaffold metrics, sequence counts, GC content, Estimated genome size
-* Python scripts (*Mike Trizna. assembly_stats 0.1.4 (Version 0.1.4). Zenodo. (2020)*. http://doi.org/10.5281/zenodo.3968775 )
+* **Python scripts** (*Mike Trizna. assembly_stats 0.1.4 (Version 0.1.4). Zenodo. (2020)*. http://doi.org/10.5281/zenodo.3968775) (this script was used as a foundation and was modified and improved upon for inclusion in GEP)
 #######################################################################
-# How to choose your Illumina libraries
-
-
-Variations in sequencing methods/protocols can lead to an increase in bias in the corresponding raw sequencing libraries. Sequencing a biological sample may often consist of both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp, etc.) and short-insert (e.g. insert-sizes 180, 250, 500, 800bp) paired-end libraries, respectively.
-
-Usually you can deduce the insert sizes and library types from the metadata found within NCBI or and SRA archive. In order to maintain a little bias as possible whilst maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.
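When weighing which short-insert paired-end libraries keep coverage "decent", it can help to estimate the combined sequencing depth before building the meryl databases. The following sketch is not part of GEP itself; the function name and all numbers are illustrative, and it simply applies depth = total sequenced bases / genome size:

```python
def estimated_coverage(read_pairs: int, read_length: int, genome_size: int) -> float:
    """Approximate sequencing depth: total sequenced bases divided by genome size."""
    total_bases = read_pairs * 2 * read_length  # paired-end: two reads per pair
    return total_bases / genome_size

# Illustrative example: three short-insert PE libraries for a 1.2 Gbp genome
libraries = [
    (150_000_000, 150),  # (read pairs, read length in bp)
    (120_000_000, 150),
    (80_000_000, 100),
]
genome_size = 1_200_000_000  # estimated genome size in bp

coverage = sum(estimated_coverage(pairs, length, genome_size)
               for pairs, length in libraries)
print(f"Combined coverage: ~{coverage:.0f}x")  # → Combined coverage: ~81x
```

If a subset of short-insert libraries already yields sufficient depth, preferring those over mate-pair/long-insert libraries keeps the k-mer databases as unbiased as possible.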