Commit 56ef020f authored by james94

update busco from v4 to v5

parent 6480bd41
@@ -8,7 +8,7 @@ By harnessing the capabilities of snakemake, we present a workflow that incorpor
- Reduce user interaction significantly when compared to running the individual tools by themselves.
2. **Scalability**
- Seamlessly scaled to server, cluster, grid and cloud environments without the need to modify the workflow definition.
3. **Portability**
- Workflows entail a description of required software, which will be automatically deployed to any execution environment.
@@ -22,7 +22,7 @@ The software/tools used as part of our genome evaluation are as follows:
#### Pre-Processing (least biased short-read dataset available):
* Trimmomatic (*Bolger, A. M., Lohse, M., & Usadel, B. (2014)*. http://www.usadellab.org/cms/?page=trimmomatic)
* Trim_galore (*Felix Krueger* bioinformatics.babraham.ac.uk)
* Fastqc (*Simon Andrews* https://github.com/s-andrews/FastQC)
* Multiqc (*Ewels, P., Magnusson, M., Lundin, S., Käller, M. (2016)*. https://doi.org/10.1093/bioinformatics/btw354)
#### Reference-free Genome Profiling
@@ -46,17 +46,17 @@ Variations in sequencing methods/protocols can lead to an increase in bias in th
If your library (or libraries) was sequenced using 10x barcodes (10X Genomics), you should remove the first 25-30bp of the forward read (R1) only. This will remove all barcode content.
**Use trimmomatic**
*Will be incorporated automatically shortly*
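Until this is automated, the R1-only head trim can be sketched in plain Python (equivalent in effect to Trimmomatic's `HEADCROP` step; the 25 bp crop length and file names below are illustrative assumptions):

```python
# Sketch: remove the first 25 bp (barcode content) from every forward read.
# Assumes plain 4-line FASTQ records; paths and crop length are placeholders.
def headcrop_fastq(in_path, out_path, crop=25):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for i, line in enumerate(fin):
            line = line.rstrip("\n")
            # lines 1 (sequence) and 3 (quality) of each record get cropped
            if i % 4 in (1, 3):
                line = line[crop:]
            fout.write(line + "\n")
```

In practice you would run this only on the R1 file of each 10x library, leaving R2 untouched.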
# Using the pipeline
**Step 1. Downloading the workflow**
-
First things first, we want to download/install this workflow. The easiest way to do this would be to clone the git repository, provided you have the git command line tool installed (for instructions on how to do this: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
To clone the repository, use the following command:
@@ -68,7 +68,7 @@ git clone https://git.imp.fu-berlin.de/cmazzoni/GEP.git
**Step 2. Conda management**
-
If you already have conda installed on your system, please skip to step 3
We will use a minimal version of conda, miniconda3; it has everything we need.
@@ -83,7 +83,7 @@ bash /<your_path_to>/Miniconda3-latest-Linux-x86_64.sh
```
Hold ENTER for the licensing agreement to print in its entirety and agree to it by typing yes, then pressing ENTER once.
Choose the location you wish to install miniconda3, or use the default-determined location by simply hitting ENTER.
@@ -102,7 +102,7 @@ source ~/.bashrc
**Step 3. Creating our Snakemake conda environment**
-
Inside the main project folder will be a file called `installGEP.yaml`
i.e.
@@ -126,8 +126,8 @@ conda activate GEP
**Step 4. Modifying our configuration files**
-
There are only two files (`config.yaml` and `samples.tsv`) that you as the user are required to *modify*. These are found in the `configuration` folder.
Firstly, we will modify the `samples.tsv`, which consists of the paths to your data files.
@@ -145,7 +145,7 @@ AssemblyZ path/to/AssemblyZ_R1_library1.fastq path/to/AssemblyZ_R2_libra
AssemblyZ path/to/AssemblyZ_R1_library2.fastq path/to/AssemblyZ_R2_library2.fastq path/to/AssemblyZ_genome.fasta 1983101299
```
This is a tab (or just whitespace) separated document where you will fill out the paths to the relevant files. Usually you will not have more than a handful (or sometimes only one pair) of raw Illumina reads, so this document will most of the time be rather 'clean'.
Column1= Assembly name
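As a quick sanity check before running, the whitespace-separated layout can be parsed with a short Python sketch (the dictionary keys are descriptive assumptions, not identifiers used by the workflow itself):

```python
# Sketch: parse the whitespace-separated samples.tsv into a list of records.
# Assumes five columns per row, as in the example above; no header row.
def read_samples(path):
    rows = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()  # tabs or plain whitespace both work
            if not fields:
                continue  # skip blank lines
            name, r1, r2, fasta, size = fields
            rows.append({"assembly": name, "R1": r1, "R2": r2,
                         "fasta": fasta, "size_estimate": int(size)})
    return rows
```

Note that the same assembly name may appear on several rows, one per read-pair library, exactly as in the AssemblyZ example.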
@@ -171,7 +171,7 @@ Results: "/<PathTo>/desiredDestination/Results_AssemblyEval"
samplesTSV: "/<PathTo>/GEP/configuration/samples.tsv"
-busco4Lineage: "vertebrata"
+busco5Lineage: "vertebrata"
buscoMode: "genome"
@@ -183,25 +183,25 @@ This will be where all your results are stored for the run. It does not have to
2. Path to your `samplesTSV`.
This is the path to the aforementioned `samples.tsv` that was created/modified just above. For now, please keep this file inside the `configuration` folder, together with this `config.yaml`
-3. `busco4Lineage` Busco needs a database to be able to run. Here you have a couple of different options.
-- Manually download and unpack your desired database from https://busco-data.ezlab.org/v4/data/lineages/ . In this case (or if you already have the database downloaded to a specific location), you can provide the full path:
+3. `busco5Lineage` Busco needs a database to be able to run. Here you have a couple of different options.
+- Manually download and unpack your desired database from https://busco-data.ezlab.org/v5/data/lineages/ . In this case (or if you already have the database downloaded to a specific location), you can provide the full path:
```
-busco4Lineage: "/<PathTo>/manuallyDownloaded/vertebrata_odb10"
+busco5Lineage: "/<PathTo>/manuallyDownloaded/vertebrata_odb10"
```
- Alternatively, you can just provide the taxonomy name that you wish to use. In this case, the latest database matching the name provided will be automatically downloaded prior to execution, if it doesn't already exist inside the `buscoLineage` directory. If it already exists there, whether from a manual download or from a previous automatic download (from previously executed runs), then the pipeline will skip re-downloading.
```
-busco4Lineage: "vertebrata"
+busco5Lineage: "vertebrata"
```
4. You can change the busco mode, but considering the scope of this evaluation in its current state, this option is rather redundant and will be removed/hidden.
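The two lineage options above boil down to a simple decision, sketched here in Python (the `buscoLineage` directory name follows the description above; the `_odb10` suffix convention is an assumption based on the example paths):

```python
# Sketch: resolve the configured lineage value to a local directory,
# flagging whether a download would be needed (mirrors the described logic).
import os

def resolve_lineage(value, lineage_dir="buscoLineage"):
    if os.path.isdir(value):               # a full path was given and exists
        return value, False
    local = os.path.join(lineage_dir, value + "_odb10")
    if os.path.isdir(local):               # already downloaded previously
        return local, False
    return local, True                     # taxonomy name; needs downloading
```

With `busco5Lineage: "vertebrata"`, this would look for `buscoLineage/vertebrata_odb10` and only trigger a download when it is absent.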
**Step 5. Running the workflow**
-
If everything is set up correctly, we can run the pipeline very simply.
For now (though it should be a simple fix!), you must run the pipeline while inside the project folder. In other words, you must be inside the folder where the file `Snakefile` is directly accessible.
*Make sure your GEP environment is activated.*
@@ -211,7 +211,7 @@ First you should run GEP in drymode:
snakemake -n
```
-Which will check to see if some of your parameters/paths have been modified incorrectly. Further, it will install all the necessary environments to be utilised by the workflow, as well as download the busco4 database if it doesn't already exist. Unfortunately, when downloading the busco4 database there will be lots of output in the terminal - a product of the limitations of the `wget` command used for downloading.
+Which will check to see if some of your parameters/paths have been modified incorrectly. Further, it will install all the necessary environments to be utilised by the workflow, as well as download the busco5 database if it doesn't already exist. Unfortunately, when downloading the busco5 database there will be lots of output in the terminal - a product of the limitations of the `wget` command used for downloading.
After the dry-run and downloading have completed, you can simply run the full pipeline with:
@@ -219,7 +219,7 @@ After the dry-run and downloading has complete, you can simply run the full pipe
snakemake --cores # --use-conda && snakemake --report
```
Where `--cores #` is the maximum number of cores (synonymous with threads in this case) you want to utilise.
For example if you run snakemake with the command:
@@ -245,7 +245,7 @@ Instead of or as well as retrieving the result files directly from the locations
The report will be created in the **main** project directory, the same location as the Snakefile, where you executed the pipeline from.
**Step 6. Results**
-
ALL results can be found in the results directory defined by you in the `config.yaml`. Within this results folder, you will have a directory for each of the assemblies (`assemblyName`) you defined in the `samples.tsv`. The pipeline will produce a large number of files at each step or for each tool, respectively. Taking this into consideration, the results can be considered in three tiers relative to their *importance* or ease of viewing.
***Tier 3***
@@ -260,9 +260,9 @@ The key result files are:
**Multiqc report**
- assemblyName_multiqc_report.html
**QV Score**
- assemblyName_merq.qv
**Copy-Number Spectra plots**
- assemblyName_merq.spectra-cn.fl.png
@@ -270,11 +270,11 @@ The key result files are:
**Genomescope2 profile plots and summary**
- assemblyName_gScope_log_plot.png
- assemblyName_gScope_linear_plot.png
- assemblyName_gScope_summary.txt
**Assembly statistics (N#, NG#, L#, LG#, sequence counts, total bps, etc.)**
- assemblyName_scaffold_stats.tsv
- assemblyName_contig_stats.tsv
**BUSCOv4 results**
@@ -288,5 +288,3 @@ The key result files are:
There is a separately created folder within the main results directory (i.e. `/path/to/Results/allAssemblies_keyResults` )
Within this folder you will find a combined aggregate file (`/path/to/Results/allAssemblies_keyResults/key_results.tsv`), a TSV that combines the aforementioned key values from each assembly evaluated, respectively, into one single file. This is useful for plotting the key values across multiple assemblies.
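For downstream plotting, the aggregate file can be loaded with the standard library alone. This sketch assumes the layout produced by the workflow's `paste` step: metric names in the first column, one further column per assembly, tab-separated:

```python
# Sketch: load key_results.tsv into {metric: [value_per_assembly, ...]}.
# Layout assumed: first column holds metric names (Assembly, qv_score, ...),
# remaining columns hold one value per evaluated assembly.
import csv

def load_key_results(path):
    table = {}
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if row:
                table[row[0]] = row[1:]
    return table
```

From there, `table["qv_score"]` (for example) gives the QV values across all assemblies, ready for plotting.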
@@ -59,7 +59,7 @@ elif os.path.isdir(args_l) is False and os.path.isdir(checkLineagePath) is True:
    print("Database already in buscoLineage directory, basename is:", buscoDataBaseName)
else:
    print("Database will be downloaded")
-    url = "https://busco-data.ezlab.org/v4/data/lineages/"
+    url = "https://busco-data.ezlab.org/v5/data/lineages/"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")
    # kill all script and style elements
@@ -4,6 +4,6 @@ Results: "/path/to/desiredDestination/Results_Test"
samplesTSV: "/path/to/configuration/samplesTest.tsv"
-busco4Lineage: "vertebrata"
+busco5Lineage: "vertebrata"
buscoMode: "genome"
@@ -3,6 +3,6 @@ channels:
- conda-forge
- anaconda
dependencies:
-- python=3.7
-- busco=4.0.6
+- python=3.8.8
+- busco=5.0.0
- biopython=1.76
this is a test for multiqc
@@ -12,8 +12,8 @@
rule trimAdapters:
input:
config['samplesTSV'],
-# os.path.join(config['Results'], "{sample}" + "/busco4/short_summary.specific." + config['busco4Lineage'] + "_odb10." + "{sample}" + ".txt")
-# check=os.path.join(config['Results'], "{sample}" + "/busco4/short_summary.specific." + config['busco4Lineage'] + "_odb10." + "{sample}" + ".txt")
+# os.path.join(config['Results'], "{sample}" + "/busco5/short_summary.specific." + config['busco5Lineage'] + "_odb10." + "{sample}" + ".txt")
+# check=os.path.join(config['Results'], "{sample}" + "/busco5/short_summary.specific." + config['busco5Lineage'] + "_odb10." + "{sample}" + ".txt")
params:
file=lambda wildcards: samples.at[wildcards.sample, 'combined'] \
if wildcards.sample in samples.index else '',
@@ -88,7 +88,7 @@ rule combine_illuminaReads_R2:
# mv {input.scaffold} {output.scaffold}
# """
-rule busco4:
+rule busco5:
input:
assembly=lambda wildcards: samples.at[wildcards.sample, 'assembly_fasta'] \
if wildcards.sample in samples.index else '',
@@ -96,19 +96,19 @@ rule busco4:
params:
mode = config['buscoMode'],
assemblyName = "{sample}",
-chngDir = os.path.join(config['Results'], "{sample}" + "/1_busco4")
+chngDir = os.path.join(config['Results'], "{sample}" + "/1_busco5")
threads:
workflow.cores * 0.5
output:
-# report(os.path.join(config['Results'], "{sample}" + "/busco4/" + "{sample}" + "/short_summary.specific." + config['busco4Lineage'] + "_odb10." + "{sample}" + ".txt"), caption="../report/busco.rst", category="Benchmark Universal Single Copy Orthologs", subcategory="{sample}")
-os.path.join(config['Results'], "{sample}" + "/1_busco4/" + "{sample}" + "/short_summary.specific." + buscoDataBaseName + "_odb10." + "{sample}" + ".txt"),
+# report(os.path.join(config['Results'], "{sample}" + "/busco5/" + "{sample}" + "/short_summary.specific." + config['busco5Lineage'] + "_odb10." + "{sample}" + ".txt"), caption="../report/busco.rst", category="Benchmark Universal Single Copy Orthologs", subcategory="{sample}")
+os.path.join(config['Results'], "{sample}" + "/1_busco5/" + "{sample}" + "/short_summary.specific." + buscoDataBaseName + "_odb10." + "{sample}" + ".txt"),
# symlink = os.path.join(config['Results'], "{sample}" + "/{sample}.fasta")
-# mvRunBusco= directory(os.path.join(config['Results'], "{sample}" + "/busco4/" + "{sample}" + "/run_" + config['busco4Lineage'] + "_odb10")),
-# blastDB= directory(os.path.join(config['Results'], "{sample}" + "/busco4/" + "{sample}" + "/blast_db"))
+# mvRunBusco= directory(os.path.join(config['Results'], "{sample}" + "/busco5/" + "{sample}" + "/run_" + config['busco5Lineage'] + "_odb10")),
+# blastDB= directory(os.path.join(config['Results'], "{sample}" + "/busco5/" + "{sample}" + "/blast_db"))
conda:
"../envs/busco_and_assembly.yaml"
log:
-os.path.join(config['Results'], "logs/1_busco4/{sample}_busco4.log")
+os.path.join(config['Results'], "logs/1_busco5/{sample}_busco5.log")
priority:
20
shell:
@@ -120,20 +120,20 @@ rule busco4:
rule moveBuscoOutputs:
input:
-buscoResult=os.path.join(config['Results'], "{sample}" + "/1_busco4/" + "{sample}" + "/short_summary.specific." + buscoDataBaseName + "_odb10." + "{sample}" + ".txt"),
-# mvRunBusco= os.path.join(config['Results'], "{sample}" + "/busco4/" + "{sample}" + "/run_" + config['busco4Lineage'] + "_odb10"),
-# blastDB= os.path.join(config['Results'], "{sample}" + "/busco4/" + "{sample}" + "/blast_db")
+buscoResult=os.path.join(config['Results'], "{sample}" + "/1_busco5/" + "{sample}" + "/short_summary.specific." + buscoDataBaseName + "_odb10." + "{sample}" + ".txt"),
+# mvRunBusco= os.path.join(config['Results'], "{sample}" + "/busco5/" + "{sample}" + "/run_" + config['busco5Lineage'] + "_odb10"),
+# blastDB= os.path.join(config['Results'], "{sample}" + "/busco5/" + "{sample}" + "/blast_db")
params:
-rmDir= os.path.join(config['Results'], "{sample}" + "/1_busco4/" + "{sample}"),
-# logs = os.path.join(config['Results'], "{sample}" + "/busco4/" + "{sample}" + "/logs/*"),
-# mvLogs= os.path.join(config['Results'], "{sample}" + "/busco4/logs/"),
-mvRunBusco= os.path.join(config['Results'], "{sample}" + "/1_busco4/" + "{sample}" + "/run_" + buscoDataBaseName + "_odb10"),
-mvRunBuscoDest= os.path.join(config['Results'], "{sample}" + "/1_busco4"),
-destination= os.path.join(config['Results'], "{sample}" + "/1_busco4/"),
-blastDB= os.path.join(config['Results'], "{sample}" + "/1_busco4/" + "{sample}" + "/blast_db")
-# blastDB= os.path.join(config['Results'], "{sample}" + "/busco4/blast_db")
+rmDir= os.path.join(config['Results'], "{sample}" + "/1_busco5/" + "{sample}"),
+# logs = os.path.join(config['Results'], "{sample}" + "/busco5/" + "{sample}" + "/logs/*"),
+# mvLogs= os.path.join(config['Results'], "{sample}" + "/busco5/logs/"),
+mvRunBusco= os.path.join(config['Results'], "{sample}" + "/1_busco5/" + "{sample}" + "/run_" + buscoDataBaseName + "_odb10"),
+mvRunBuscoDest= os.path.join(config['Results'], "{sample}" + "/1_busco5"),
+destination= os.path.join(config['Results'], "{sample}" + "/1_busco5/"),
+blastDB= os.path.join(config['Results'], "{sample}" + "/1_busco5/" + "{sample}" + "/blast_db")
+# blastDB= os.path.join(config['Results'], "{sample}" + "/busco5/blast_db")
output:
-file = report(os.path.join(config['Results'], "{sample}" + "/1_busco4/short_summary.specific." + buscoDataBaseName + "_odb10." + "{sample}" + ".txt"), caption="../report/busco.rst", category="Benchmark Universal Single Copy Orthologs", subcategory="{sample}")
+file = report(os.path.join(config['Results'], "{sample}" + "/1_busco5/short_summary.specific." + buscoDataBaseName + "_odb10." + "{sample}" + ".txt"), caption="../report/busco.rst", category="Benchmark Universal Single Copy Orthologs", subcategory="{sample}")
shell:
"""
mv -t {params.mvRunBuscoDest} {params.mvRunBusco}
@@ -145,7 +145,7 @@ rule moveBuscoOutputs:
rule meryl_R1:
input:
read1= rules.combine_illuminaReads_R1.output
-# check = os.path.join(config['Results'], "{sample}" + "/busco4/short_summary.specific." + config['busco4Lineage'] + "_odb10." + "{sample}" + ".txt")
+# check = os.path.join(config['Results'], "{sample}" + "/busco5/short_summary.specific." + config['busco5Lineage'] + "_odb10." + "{sample}" + ".txt")
params:
script = os.path.join(workflow.basedir, "programs/meryl-1.0/Linux-amd64/bin/meryl"),
kmer = 21,
@@ -169,7 +169,7 @@ rule meryl_R1:
rule meryl_R2:
input:
read2= rules.combine_illuminaReads_R2.output
-# check = os.path.join(config['Results'], "{sample}" + "/busco4/short_summary.specific." + config['busco4Lineage'] + "_odb10." + "{sample}" + ".txt")
+# check = os.path.join(config['Results'], "{sample}" + "/busco5/short_summary.specific." + config['busco5Lineage'] + "_odb10." + "{sample}" + ".txt")
params:
script = os.path.join(workflow.basedir, "programs/meryl-1.0/Linux-amd64/bin/meryl"),
kmer = 21,
@@ -194,7 +194,7 @@ rule meryl:
input:
read1=os.path.join(config['Results'], "{sample}" +"/2_QVstats_merylAndMerqury/" + "{sample}" + "_R1.21.meryl"),
read2=os.path.join(config['Results'], "{sample}" +"/2_QVstats_merylAndMerqury/" + "{sample}" + "_R2.21.meryl")
-# check = os.path.join(config['Results'], "{sample}" + "/busco4/short_summary.specific." + config['busco4Lineage'] + "_odb10." + "{sample}" + ".txt")
+# check = os.path.join(config['Results'], "{sample}" + "/busco5/short_summary.specific." + config['busco5Lineage'] + "_odb10." + "{sample}" + ".txt")
params:
script = os.path.join(workflow.basedir, "programs/meryl-1.0/Linux-amd64/bin/meryl"),
kmer = 21,
@@ -341,7 +341,7 @@ rule saveConfiguration_and_getKeyValues:
gscopeSum=os.path.join(config['Results'], "{sample}" +"/3_genomescopeProfile/" + "{sample}" + "_summary.txt"),
gscopeLog=os.path.join(config['Results'], "{sample}" +"/3_genomescopeProfile/" + "{sample}" + "_log_plot.png"),
gscopeLin=os.path.join(config['Results'], "{sample}" +"/3_genomescopeProfile/" + "{sample}" + "_linear_plot.png"),
-busco=os.path.join(config['Results'], "{sample}" + "/1_busco4/short_summary.specific." + buscoDataBaseName + "_odb10." + "{sample}" + ".txt"),
+busco=os.path.join(config['Results'], "{sample}" + "/1_busco5/short_summary.specific." + buscoDataBaseName + "_odb10." + "{sample}" + ".txt"),
qv=os.path.join(config['Results'],"{sample}" + "/2_QVstats_merylAndMerqury/" + "{sample}" + "_merq.qv"),
spectraStacked=os.path.join(config['Results'],"{sample}" + "/2_QVstats_merylAndMerqury/" + "{sample}" + "_merq.spectra-cn.st.png"),
spectraFlat=os.path.join(config['Results'],"{sample}" + "/2_QVstats_merylAndMerqury/" + "{sample}" + "_merq.spectra-cn.fl.png"),
@@ -406,7 +406,7 @@ rule saveConfiguration_and_getKeyValues:
echo "$(grep 'sequence_count' {input.scaffStats} | awk {{'print $2'}})" >> {output.keyValues}
dos2unix {output.keyValues}
echo -e "Assembly\nsize_estimate\nmax_heterozygosity\nqv_score\nN50_length\nNG50_length\nN95_length\nNG95_length\nN100_length\nNG100_length\ntotal_num_bases\ntotal_num_scaffolds" > {output.rowNames}
-paste -d'\t' {output.rowNames} {output.keyValues} > {output.keyValuesWithRowNames}
+paste -d'\t' {output.rowNames} {output.keyValues} | column -t > {output.keyValuesWithRowNames}
printf "Evaluation complete for {wildcards.sample}, please find these results in {params.resultsPath} folder"
"""