**Note** If you already have a snakemake installation and would like to avoid installing it again, ensure that the above software is all available in your `PATH` and that you have conda installed/activated.
**Step 4. Set up sample sheet and configuration file**
-
GEP can be run in two modes:
1. Create meryl database
   - Input: Sample sheet outlining either [Illumina PE](configuration/exampleSampleSheets/build_illumina_example.tsv) or [PacBio HiFi](configuration/exampleSampleSheets/build_hifi_example.tsv) reads
2. Evaluate genome assemblies
   - Input: Sample sheet outlining the assemblies to be evaluated (see the example sample sheets in the `configuration/exampleSampleSheets` folder)
Save your sample sheet and provide its location to the `samplesTSV` key in the [config.yaml](configuration/config.yaml), which can be found in the `configuration` folder.
Secondly, we will modify the `config.yaml` itself. It contains the following parameters:
1. Path to your desired results folder. This will be where all results for the run are stored. It does not have to be inside the project folder (it can be any location accessible by the user). You also do not need to create this folder yourself; it will be handled by snakemake.
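For example (the `Results` key name shown here is illustrative; use whichever key is already present in the shipped `config.yaml`):
```
Results: "/path/to/Results"
```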
2. Path to your `samplesTSV`.
This is the path to the aforementioned `samples.tsv` that was created/modified just above. For now, please keep this file inside the `configuration` folder, together with this `config.yaml`.
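A minimal sketch, assuming the sample sheet is kept in the `configuration` folder as suggested:
```
samplesTSV: "configuration/samples.tsv"
```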
3. `busco5Lineage`. BUSCO needs a lineage database in order to run. Here you have a couple of different options:
- Manually download and unpack your desired database from https://busco-data.ezlab.org/v5/data/lineages/. In this case (or if you already have the database downloaded to a specific location), you can provide the full path (see the second example below).
- Alternatively, you can simply provide the taxonomy name that you wish to use. In this case, the latest database matching the provided name will be automatically downloaded prior to execution, unless it already exists inside the `buscoLineage` directory (whether from a manual download or from a previously executed run), in which case the pipeline will skip re-downloading:
```
busco5Lineage: "vertebrata"
```
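For the manual/full-path option above, the value would instead be the path to the unpacked lineage directory; a sketch with an illustrative path and lineage name:
```
busco5Lineage: "/path/to/buscoLineage/vertebrata_odb10"
```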
4. You can change the BUSCO mode, but considering the scope of this evaluation in its current state, this option is rather redundant and will be removed/hidden.
GEP will run in either meryl-building mode or evaluation mode, depending on which sample sheet you provide to the `config.yaml`.
**Step 5. Running the workflow**
-
If everything is set up correctly, we can run the pipeline very simply.
For now (though it should be a simple fix!), you must run the pipeline from inside the project folder. In other words, you should be inside the main GEP folder where the `Snakefile` is directly accessible.
*Make sure your GEP environment is activated.*
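If you created the environment with conda, activating it might look like this (assuming the environment is named `GEP`):
```
conda activate GEP
```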
First, you should run GEP in dry-run mode (snakemake's `-n` flag); for example:
```
snakemake --cores 32 --use-conda -n
```
This will check whether any of your parameters/paths have been modified incorrectly.
After the dry-run and downloading have completed, you can simply run the full pipeline; for example:
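```
# --use-conda lets snakemake install/manage the tools; adjust --cores to your machine
snakemake --cores 32 --use-conda
```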
Your pipeline will be executed with at most 32 threads per job/process. All jobs defined in the pipeline use a percentage of these 32 threads, allowing execution to scale automatically. For example, if two jobs are ready to be executed and both are defined as requiring 50% of the total available cores, they can be run in parallel.
Never use more threads than are actually available on the machine running the pipeline. Soon I will incorporate further modifications to provide even more scalability and portability (cluster/cloud execution, job-scheduler compatibility, and more!)
If you want to see some of the options you can use for execution, please see the snakemake help with `snakemake -h`.
For example, you can cap the total amount of memory (e.g. 100GB) the pipeline may use by modifying the execution command; one way to do this is snakemake's `--resources` mechanism:
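```
# assumes the workflow's rules declare a mem_mb resource; 100GB = 100000 MB
snakemake --cores 32 --use-conda --resources mem_mb=100000
```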
This pipeline allows users to produce a wide range of commonly used evaluation metrics for genome assemblies, no matter their level of command-line experience.
By harnessing the capabilities of snakemake, we present a workflow that incorporates a number of command-line tools and can be run on multiple independent genome assemblies in parallel. A streamlined user experience is paramount to the development process of this pipeline, as we strive for three key user-oriented components:
Snakemake will use conda to both install and manage our software packages and required tools. This helps to avoid software dependency conflicts, which would otherwise keep the analysis from being simple to use and easily applied to different hardware. It also means that you, the user, do not have to be concerned with any of this - it is done for you!
The software/tools used as part of our genome evaluation are as follows:
Variations in sequencing methods/protocols can lead to increased bias in the corresponding raw sequencing libraries. Sequencing a biological sample may often produce both mate-pair/long-insert (e.g. insert sizes of 5k, 10k, 20k bp) and short-insert (e.g. insert sizes of 180, 250, 500, 800bp) paired-end libraries. You can usually deduce the insert sizes and library types from the metadata found within NCBI or SRA archives. In order to maintain as little bias as possible whilst still maintaining decent coverage, you should ideally use only short-insert paired-end libraries for this evaluation pipeline.
If your libraries were sequenced using 10x barcodes (10X Genomics), you should remove the first 25-30bp of the forward read (R1) only. This will remove all barcode content.
**Use trimmomatic**
*Will be incorporated automatically shortly*
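In the meantime, a manual sketch using trimmomatic (filenames and thread count are illustrative; R1 is processed on its own in single-end mode so that R2 is left untouched, assuming all reads are longer than the cropped length):
```
# HEADCROP removes the first N bases from the start of each read
trimmomatic SE -threads 4 sampleX_R1.fastq.gz sampleX_R1.trimmed.fastq.gz HEADCROP:25
```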
If your libraries were sequenced using 10x barcodes (10X Genomics), you should assign a value of `True` to the `trim10x` column in the relevant sample sheet.
# Reporting
Instead of, or in addition to, retrieving the result files directly from the locations specified in the Results section (Step 6), appending `&& snakemake --report` to your execution command will create an interactive html report upon completion. This .html document will consist of all the relevant key files, among other things such as the Directed Acyclic Graph (DAG) that snakemake uses to drive the order of execution, run-times of each individual step, and more (work in progress).
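For example, appended to the main execution command (snakemake writes the report to `report.html` by default):
```
snakemake --cores 32 --use-conda && snakemake --report
```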
The report will be created in the **main** project directory, the same location as the Snakefile, where you executed the pipeline from.
**Step 6. Results**
-
ALL results can be found in the results directory you defined in the `config.yaml`. Within this results folder, you will have a directory for each of the assemblies (`assemblyName`) you defined in the `samples.tsv`. The pipeline produces a large number of files at each step and for each tool. With this in mind, the results can be considered in three tiers according to their *importance* or ease of viewing.
***Tier 3***
The full results from all tools, including every file created during the execution of each tool. These results can be navigated at will, and are separated by their respective intended purposes (e.g. QVstats_merylAndMerqury, assemblystats, etc.)
***Tier 2***
The key result files (key plots, statistics, etc.) are aggregated and copied into a separate folder within each assembly folder, e.g. `/path/to/Results/SpeciesX/keyResults/`.
**Key values from all of the above files are pulled into an aggregated table, useful for a quick glance:**
- assemblyName_aggregatedResults.tsv
***Tier 1***
There is a separately created folder within the main results directory (i.e. `/path/to/Results/allAssemblies_keyResults`).
Within this folder you will find a combined aggregate file (`/path/to/Results/allAssemblies_keyResults/key_results.tsv`), a tsv that combines the aforementioned key values from each evaluated assembly into one single file. This is useful for plotting the key values across multiple assemblies.