Project code for the masters course applied sequence analysis.
## Run pipeline
To run the pipeline first change into the project1 directory and then define either the `samples.tsv` containing the samplenames and paths to the different sequence files, or the directory containing the sequence files, either via the config file (`project1/config/config.yaml`) or as a command line argument using the respective flags `input_directory` and `samples`. You then also have to provide a reference genome file for the mapping rule:
## Project 1
This project contains the basics of Snakemake and was part of the Snakemake tutorial of the course.
This project contains a workflow to process (viral) NGS data including quality control, trimming, decontamination, denovo assembly, assembly polishing, scaffolding, clustering of the assemblies and per base variance analysis.
@@ -5,7 +5,7 @@ Project code for the masters course applied sequence analysis.
## Test data
Test data (SARS-CoV2 sequencing data and human reference genome) can be found [here](https://box.fu-berlin.de/s/dt2d5MbwaxjfWtZ).
The SARS-CoV2 reference genome can be found in the resources directory in `ref.fasta`.
The SARS-CoV2 reference genomes (different variants for optimal reference selection in the scaffolding process) can be found under `resources/references` in the respective `.fasta` files.
## Run pipeline
To run the pipeline define either the `samples.tsv` containing the samplenames and paths to the different sequence files, or the directory containing the sequence files, either via the config file (`project1/config/config.yaml`) or as a command line argument using the respective flags `input_directory` and `samples`. You then also have to provide a reference genome file of your target organism for the scaffolding rule and a kraken2 database for the screening (please provide a host sequence if you want to run the decontamination step):