 |
|
Module: | GapSpanner |
Description: | Genome assembly gap long read support and reassembly tool |
Version: | 0.1.1 |
Last Edit: | 09/09/21 |
|
Copyright © 2021 Richard J. Edwards - See source code for GNU License Notice
Imported modules:
diploidocus
rje
rje_obj
rje_rmd
See SLiMSuite Blog for further documentation. See rje
for general commands.
Function
GapSpanner is a wrapper for the gap-spanning and reassembly methods introduced by Diploidocus v0.12.0 and v0.13.0.
GapSpanner needs a genome assembly (fasta format, seqin=FILE
) and a set of long read (ONT, PacBio or HiFi) data
for the assembly (reads=FILELIST
and readtype=LIST
).
First, all gaps in the assembly at least 10bp (mingap=INT
) are identified and saved in a table
($SEQBASE.gaps.tdt
). Long read data is mapped onto the assembly using [Minimap2](https://github.com/lh3/minimap2)
and reads spanning each gap are identified based on their positions and the target start and end positions in the PAF
file. In addition to absolute spanning of regions, reads spanning a region +/- distances set by checkflanks=LIST
(default 100bp, 1kb and 5kb) will also be calculated. (If the end of a sequence is reached before the end of the read,
this will also count as flanking. Such regions can be identified using the MaxFlank5
and MaxFlank3
fields, which
identify the maximum distance 5' and 3' that a read can span due to sequence length constraints.) Gap spanning output
will be saved to *.checkpos.tdt
. IDs for reads spanning each gap will also be saved to $BASEFILE_spanid/
, with
each gap will be named: seqname.start-end
.
The gapass
and gapfill
run modes will then attempt to re-assemble any gaps spanned by 2+ (using mingapspan=INT
)
the [Flye](https://github.com/fenderglass/Flye) assembler. First, long reads are re-mapped to generate BAM output and
then [samtools](http://www.htslib.org/) (samtools view
and samtools fasta
) used to extract the reads that span
the gap into a file for re-assembly. Gap assemblies will then be farmed out, running forks=X
Flye processes at a time, with each Flye assembly using subforks=INT
threads. This can take some time, and
GapSpanner will terminate all running Flye assemblies if none finish within a 10 hour window (killforks=INT
). This
can be switched off with killforks=0
. Assemblies are performed in $BASEFILE__gapassemble/
.
NOTE: Not all gaps will be successfully re-assembled at this point. See the log file for details.
With gapfill
run mode, re-assembled gap regions are compiled into *.assembledgaps.fasta
single file and then
mapped back on the original assembly using Minimap2, with tabulated hit output into $BASEFILE__gapfill/
. Local hits
must be at least 500bp. This stringency can be controlled with minlocid=PERC
and minloclen=INT
. Local hits
are reduced to unique coverage of the assembly sequences and saved to *.gapfiller.tdt
. Gaps are filled if one of
the two conditions are met:
1. A single local alignment spans an entire gap.
2. A pair of concordant local alignments from the same re-assembly contig (same orientation) flank an entire gap.
In the case of a single spanning local alignment, the entire assembly region is replaced by the corresponding
re-assembly contig region. For a pair of hits, the region between the two hits is replaced. The updated sequences are
output to *.fillcheck.fasta
with modified sequences given a -Xfix
suffix.
Long reads are then mapped on to the updated assembly and the [Diploidocus](https://github.com/slimsuite/diploidocus)
regcheck
mode is run a second time to check for mapped reads that span the filled gaps. If the single copy read
depth is given with scdepth=X
then the estimated copy number of the replaced region will also be calculated.
If no reads span the replaced region, the gap-filling will be reversed. (NOTE: this may cause issues when the full
assembly is inserted, and future releases may instead use [DepthCharge](https://github.com/slimsuite/depthcharge) to
identify bad gap-filling.) Such sequences will have an additional -Xrev
suffix.
The final gap-filled assembly is output to *.gapfill.fasta
.
Output will be saved to files with a prefix set by basefile=X
(default named after the seqin=FILE
prefix).
For more details, see the Diploidocus documentation.
GapSpanner run modes
Assembly gap read-spanning analysis [runmode=gapspan
]
This mode first identifies all the gaps in an assembly (seqin=FILE
) (using SeqList gapstats
or $SEQBASE.gaps.tdt
if pre-
existing) and then runs the Diploidocus read spanning analysis (runmode=regcheck
) with regcnv=F
. Long read data, given
with the reads=FILELIST
and readtype=LIST
options, are mapped onto the assembly using minimap2 to generate a PAF file.
This is then parsed and reads spanning each gap are identified based on their positions and the target start and end positions in the PAF file.
In addition to absolute spanning of regions, reads spanning a region +/- distances set by checkflanks=LIST
will also be calculated. If the end of a
sequence is reached before the end of the read, this will also count as flanking. Such regions can be identified
using the MaxFlank5
and MaxFlank3
fields, which identify the maximum distance 5' and 3' that a read can span
due to sequence length constraints.
Spanning spanid
output is also generated for each gap and saved in $BASEFILE_spanid
. Each gap will be named:
seqname.start-end
.
---
In addition to the gapspan
analysis, reads identified as spanning each gap are extracted and assembled using flye
in a $BASEFILE__gapassemble/
output directory.
---
In addition to the gapspan
and gapass
outputs, re-assembled gap regions are compiled into a single file and then
mapped back on the original assembly using Minimap2, with tabulated hit output into $BASEFILE__gapfill/
. Local hits
are reduced to unique coverage of the assembly sequences. Gaps are filled if one of the two conditions are met:
1. A single local alignment spans an entire gap.
2. A pair of concordant local alignments from the same re-assembly contig (same orientation) flank an entire gap.
In the case of a single spanning local alignment, the entire assembly region is replaced by the corresponding
re-assembly contig region. For a pair of hits, the region between the two hits is replaced.
---
## Dependencies
[minimap2](https://github.com/lh3/minimap2) must be installed and either added to the environment $PATH
or given to
GapSpanner with the minimap2=PROG
setting. For gapass
or gapfill
run modes, [samtools](http://www.htslib.org/)
and the [Flye](https://github.com/fenderglass/Flye) assembler need to be installed.
To generate documentation with dochtml
, R will need to be installed and a pandoc environment variable must be set, e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
For GapSpanner documentation, run with dochtml=T
and read the *.docs.html
file generated.
Commandline
Main GapSpanner run options
seqin=FILE
: Input sequence assembly [None
]
runmode=X
: GapSpanner run mode (gapspan/gapass/gapfill) [gapspan
]
basefile=FILE
: Root of output file names [gapspanner or $SEQIN basefile
]
summarise=T/F
: Whether to generate and output summary statistics sequence data before and after processing [True
]
genomesize=INT
: Haploid genome size (bp) [0
]
paf=FILE
: PAF file of long reads mapped onto assembly [$BASEFILE.paf
]
bam=FILE
: BAM file of long reads mapped onto assembly [$BASEFILE.bam
]
reads=FILELIST
: List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST
: List of ont/pb/hifi file types matching reads for minimap2 mapping [ont
]
dochtml=T/F
: Generate HTML GapSpanner documentation (*.docs.html) instead of main run [False
]
tmpdir=PATH
: Path for temporary output files during forking (not all modes) [./tmpdir/
]
Gaps spanning and reassembly options
checkflanks=LIST
: List of lengths flanking check regions that must also be spanned by reads [0,100,1000,5000
]
subforks=INT
: Number of forks for assembly subproccesses during gapfill and gapass modes [1
]
mingapspan=INT
: Minimum number of reads spanning a gap in order to re-assemble [2
]
minlocid=PERC
: Minimum percentage identity for aligned chunk to be kept (local %identity) [0
]
minloclen=INT
: Minimum length for aligned chunk to be kept (local hit length in bp) [500
]
scdepth=NUM
: Single copy ("diploid") read depth. If zero, will not calculate CNV for filled gaps [0
]
Re-assembly forking options
forks=X
: Number of parallel sequences to process at once [0
]
killforks=X
: Number of seconds of no activity before killing all remaining forks. [36000
]
killmain=T/F
: Whether to kill main thread rather than individual forks when killforks reached. [False
]
logfork=T/F
: Whether to log forking in main log [False
]
History Module Version History
# 0.0.0 - Initial Compilation. Wrapper for Diploidocus v0.14.0 gapspan modes.
# 0.1.0 - Modified sequence summary output.