Merge branch 'master' of https://git.imp.fu-berlin.de/mdriller/mastertool

4b62102a · max · 0991b1a5 · bc3bf8c7 · 4b62102a
Commit 4b62102a authored 5 years ago by max
--- a/README.md
+++ b/README.md
-# masterTool
+# REAPRLong : A Tool to Scaffold and Quality Control genome assemblies using (low coverage) long reads
-## Tool to Scaffold and Quality Control genome assemblies
 <p align="center"> 
-<img src="graphics/simple_workflow.svg">
+<img src="figures/simple_workflow.pdf">
 </p>
 ### Dependencies/Prerequisites:
-1. Python3 (3.6.6)  
+1. Python3 →  version >3.6 necessary  
-→ script is executable (default python3 on machine)  
+2. Python library networkx →  can be installed using pip (pip install networkx)  
-→ should be downward compatible (Python2.6)  
+3. minimap2 →  provided as a submodule within the REAPRLong git repository (it must stay at its path, scripts/minimap2/)  
-2. Biopython needs to be set up for used python version  
-→ pip/pip3 install biopython  
-3. NCBI-BLAST+ package (blastn) needs to be set up GLOBALLY  
-4. minimap2 set up within the script folder  
-→ Build on Ubuntu 18.04 might need to be compiled again depending on the machine (“make”) 
 ### Setup:
-Clone and include minimap2  
+REAPRLong is publicly available to download and use from the git repository at <https://git.imp.fu-berlin.de/mdriller/mastertool>.  
+The minimap2 repository is added as a submodule within REAPRLong; to include it in the download, please run:   
 git clone --recursive https://github.com/mdriller/masterTool.git  
-Move into minimap2 directoy and build it  
+Then move into the minimap2 directory and build it   
-cd masterTool/scripts/minimap2/  
+cd masterTool/scripts/minimap2/   
 make  
 ### Usage:
-If python3 is in path the script can be called like: "./main.py" otherwise just "python main.py"  
+The main script is executable and can be called directly as "./main.py" if python3 is found on the PATH (its shebang line is "#!/usr/bin/env python3").   
-The help function can be accessed via: ./main.py -h|--help  
+Otherwise, the script can be run using python directly: "python main.py ..."   
-To run the script needs a genome assembly in fasta format and long reads (e.g. PacBio or ONT) in fastq or fasta format.   
+The help function can be accessed via: ./main.py -h|--help. It provides a general overview of how to use the tool and which parameters can be set.   
->usage: main.py [-h] -ge GENOME -fq FASTQ -out OUTDIR [-m MODE] [-t THREADS]
+REAPRLong requires three mandatory inputs: a genome assembly in fasta format, long reads (e.g. PacBio or ONT) in fastq or fasta format, and a path to the directory where output files will be written. Additional parameters can be set, but the default values are tested and generally provide the best results.  
-               [-mo MINOVERLAP] [-mi MINIDENT] [-it ITERATIONS] [-s SIZE]
+REAPRLong can be used as follows:   
-               [-fa]
+**usage:** main.py [-h] -ge GENOME -fq FASTQ -out OUTDIR [-m MODE] [-t THREADS]
+[-ml MINLINKS] [-mo MINOVERLAP] [-mi MINIDENT] [-it ITERATIONS]
+[-s SIZE] [-fa]   
 >>optional arguments:
  -h, --help            show this help message and exit  
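
As a minimal sketch of the usage line above, the following Python helper assembles a call to main.py from the three mandatory inputs. Only the flags `-ge`, `-fq`, `-out` and `-t` are taken from the usage text; the helper name `build_command` and all file names are hypothetical and serve only as an illustration.

```python
import shlex
import sys


def build_command(genome, reads, outdir, threads=None, script="main.py"):
    """Assemble an argument list for main.py from the three mandatory inputs."""
    cmd = [sys.executable, script, "-ge", genome, "-fq", reads, "-out", outdir]
    if threads is not None:
        # -t THREADS is optional in the usage line; leaving it out keeps the tool's default.
        cmd += ["-t", str(threads)]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    cmd = build_command("assembly.fasta", "reads.fastq", "reaprlong_out", threads=4)
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually launch the tool, the returned list could be passed to `subprocess.run(cmd, check=True)` from the directory that contains main.py.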
@@ -60,14 +60,20 @@ To run the script needs a genome assembly in fasta format and long reads (e.g. P
 ### Output files:
-1. scaffolds.fasta → final scaffolded assembly
+REAPRLong generates multiple output files in the specified output directory.
-2. haplotigs.fasta → removed “haplotigs” or (partially) duplicated contigs
-3. adjusted_contigs.fa → set of adjusted contigs (if broken in first iteration)
-4. conts_scafs.ids → Map contigIDs back to new scaffoldIDs
-5. scaffolds_QC_iti.gff - gff file with the GOODNESS for each assembly i representing the reference used for the current iteration.
+1. scaffolds.fasta - the generated scaffolds in fasta format  
+2. scaffolds.stats - statistics generated for scaffolds.fasta (total base pairs in the assembly, number of scaffolds, longest scaffold, average length and N10/20/30/40/50/60/70/80/90/100 values)  
+3. scaffolds.gff - gff3 file describing the regions of each new scaffold. Regions can either come from previous contigs or from reads if a gap was filled.  
+4. duplicates.fasta - fasta file containing contigs that were fully part of another contig and thus removed from the assembly.  
+5. adjusted\_contigs\_it\*.fa - fasta file containing adjusted contigs, if the QC identified misassemblies and broke the previous input. The \* is an integer value indicating the iteration of QC, starting with 0.   
+6. coverage\_map\_it\*.gff - a "coverage" map for the input assembly of each iteration. Regions are summarised by a start position, an end position and the support for the region. The support is the number of reads mapping continuously in the region minus the number of reads mapping discontinuously. Negative numbers indicate misassemblies.  
+7. deletions\_it\*.txt - identified deletions (within the genome compared to the reads). The \* is an integer value indicating the iteration of QC, starting with 0 which represents the original assembly. Every subsequent number relates to the adjusted\_contigs\_it\*.fa of the previous iteration.  
+8. insertions\_it\*.txt - identified insertions (within the genome compared to the reads). The \* is an integer value indicating the iteration of QC, starting with 0 which represents the original assembly. Every subsequent number relates to the adjusted\_contigs\_it\*.fa of the previous iteration.  
+9. inversions\_it\*.txt - identified inversions (within the genome compared to the reads). The \* is an integer value indicating the iteration of QC, starting with 0 which represents the original assembly. Every subsequent number relates to the adjusted\_contigs\_it\*.fa of the previous iteration.  
+10. misjoins\_it\*.txt - identified misjoins (within the genome compared to the reads). The \* is an integer value indicating the iteration of QC, starting with 0 which represents the original assembly. Every subsequent number relates to the adjusted\_contigs\_it\*.fa of the previous iteration.  
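
The support value described for coverage\_map\_it\*.gff can be illustrated with a short sketch. It assumes the two read counts per region are already known; the `Region` type, the function names and the example numbers are hypothetical and are not taken from the REAPRLong code.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    continuous: int     # reads mapping continuously across the region
    discontinuous: int  # reads mapping discontinuously in the region


def support(region):
    """Support = continuously mapping reads minus discontinuously mapping reads."""
    return region.continuous - region.discontinuous


def suspicious_regions(regions):
    """Return the regions with negative support, i.e. candidate misassemblies."""
    return [r for r in regions if support(r) < 0]


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    regions = [Region(1, 5000, 12, 3), Region(5001, 5400, 2, 9)]
    for r in suspicious_regions(regions):
        print(r.start, r.end, support(r))
```

With these made-up counts, only the second region has negative support (2 - 9 = -7) and would be reported as a candidate misassembly.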
 ### Workflow
 <p align="center"> 
-<img src="graphics/workflow.svg">
+<img src="figures/workflow_noOptional.pdf">
 </p>
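
The N10 to N100 values listed for scaffolds.stats can be sketched as follows, assuming the conventional definition of Nx: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least x% of all bases. Whether REAPRLong computes the values in exactly this way is an assumption; the function name `nx` and the example lengths are hypothetical.

```python
def nx(lengths, x):
    """Return the Nx value of a list of scaffold lengths for 0 < x <= 100."""
    if not lengths or not 0 < x <= 100:
        raise ValueError("need at least one length and 0 < x <= 100")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    raise AssertionError("unreachable: the full set always covers 100%")


if __name__ == "__main__":
    # Made-up scaffold lengths, for illustration only.
    scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
    for x in range(10, 101, 10):
        print(f"N{x}: {nx(scaffold_lengths, x)}")
```

Under this definition, N100 is always the length of the shortest scaffold, and the values can only stay equal or decrease as x grows.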