Function
SynBad is a tool for comparing two related genome assemblies and identify putative translocations and inversions
between the two that correspond to gap positions. These positions could indicate misplaced scaffolding.
Synbad will use or create:
1. A table of gap positions for each assembly (seqname, start, end). This can optionally have long reads mapped and
spanning coverage calculated for each gap using Diploidocus. Gaps without spanning long reads are more likely to
correspond to misassemblies.
2. The qryunique and hitunique local hits tables from a GABLAM run using Minimap2.
Pairwise hits between the genomes are filtered according to the minlocid=PERC
and minloclen=INT
criteria, which
by default limits hits to be at least 1kb and 50% identity. Note that this is applied after GABLAM has run, so it
should be possible to re-run with more relaxed constraints and re-use the GABLAM tables.
Next, all gap positions are read in along with the local hits tables. For each genome, the local hit tables are
sorted and QryGap
and SbjGap
fields added. Any local alignments with flanking hits are then flagged in these
new fields with 5', 3' or Both.
The gap tables will also be updated with GapSpan
and SynSpan
fields that have the distance between the
corresponding local hits on the Qry and Sbj genomes. If there is also an inversion, SynSpan
will be negative.
If the local hits are against two different sequences from the other genome, the two sequence names will be
entered in the SynSpan
field instead. If the gap is in the middle of local hit (likely to be true only for
small gaps), SynSpan
or GapSpan
will have a value of zero.
Gaps will then be classified according to the associated GapSpan
and SynSpan
values, and subsequent local hit
analysis to fix/update ratings:
Aln
= Aligned
= Gap is found in the middle of a local alignment to the Hit
Brk
= Breakpoint
= Difference between positive SynSpan
and GapSpan
is bigger than the maxsynspan=INT
distance.
Div
= Divergence
= Translocation or Breakpoint gap with flanking collinear hits.
Dup
= Duplication
= Overlapping flanking hits on the same strand.
Frag
= Fragmentation
= SynSpan
indicates matches are on different scaffolds, 1+ of which is not a chromosome scaffold.
Ins
= Insertion
= Achieved Syntenic
rating by skipping upto maxsynskip=INT
local alignments and max maxsynspan=INT
bp in both Qry and Hit.
Inv
= Inversion
= Flanking hits are on alternative strands.
InvBrk
= Inversion Breakpoint
= Reclassified Inversion that appears to be out of place.
InvDupFix
= Fixed Inversion Duplication
= Inversion that becomes a Duplication when inverted. (See *.corrections.tdt
)
InvFix
= Fixed Inversion
= Inversion that becomes collinear when inverted. (See *.corrections.tdt
)
Long
= Long Syntenic
= Gap flanked by collinear Qry/Hit pairs but distance is greater than maxsynspan=INT
(default 25 kb).
Null
= No mapping between genomes for that gap.
Span
= Spanned
= Any gaps without Aligned or Syntenic rating that are spanned by at least synreadspan=INT
reads.
Syn
= Syntenic
= Difference between positive SynSpan
and GapSpan
is maxsynspan=INT
or less (default 25 kb).
Tran
= Translocation
= SynSpan
indicates matches are on different scaffolds.
Term
= Terminal
= Gap is between a local alignment and the end of the query sequence.
Dependencies
SynBad needs Minimap2, samtools and kat installed for full functionality.
For gapass
gap mode, Flye also needs to be installed.
To generate documentation with dochtml
, R will need to be installed and a pandoc environment variable must be set, e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
For full documentation of the SynBad workflow, run with dochtml=T
and read the *.docs.html
file generated.
Commandline
Main SynBad run options
genome1=FILE
: Genome assembly used as the query in the GABLAM/Mashmap searches []
genome2=FILE
: Genome assembly used as the searchdb in the GABLAM/Mashmap searches []
basefile=X
: Prefix for output files [synbad
]
gablam=X
: Optional prefix for GABLAM/Mashmap search [defaults to $BASEFILE.map
]
mapper=X
: Whether to use minimap2, busco or mashmap (dev only) for all-by-all mapping [minimap2
]
gapmode=X
: Diploidocus gap run mode (gapspan/gapass) [gapspan
]
minloclen=INT
: Minimum length for aligned chunk to be kept (local hit length in bp) [1000
]
minlocid=PERC
: Minimum percentage identity for aligned chunk to be kept (local %identity) [50
]
maxsynskip=INT
: Maximum number of local alignments to skip for SynTrans classification [4
]
maxsynspan=INT
: Maximum distance (bp) between syntenic local alignments to count as syntenic [25000
]
synreadspan=INT
: Minimum number of reads spanning a gap to change the rating to "Spanned" [5
]
spannedflank=INT
: Required flanking distance for synreadspan "Spanned" rating [0
]
maxoverlap=INT
: Maximum overlap (bp) of adjacent local hits to allow compression [500
]
chr1=X
: PAFScaff-style chromosome prefix for Genome 1 to distinguish Translocation from Fragmentation []
chr2=X
: PAFScaff-style chromosome prefix for Genome 2 to distinguish Translocation from Fragmentation []
Correction and Fragmentation options
correct=LIST
: List of edit types to try to fix in the assembly (invert/extract/relocate/break/join; [T/True=all]{cmd:T/True}
) [True
]
fragment=T/F
: Whether to fragment the assembly at gaps marked as non-syntenic if no corrections made [False
]
fragtypes=LIST
: List of SynBad ratings to trigger fragmentation [Brk,Inv,InvBrk,Frag,Tran
]
minreadspan=INT
: Min number of Span0 reads in gaps table to prevent fragmentation [1
]
minctglen=INT
: Extract any contigs below a minimum length threshold [500
]
minbadctg=INT
: Extract any contigs with bad flanking ratings below a minimum length threshold [5000
]
minscafflen=INT
: Remove any scaffolds (inc. detached/extracted contigs) below minimum length threshold [500
]
gapsize=INT
: Size of gaps to add when relocating assembling chunks [500
]
rejoin=T/F
: Whether to rejoin original gaps that end up split into termini but not too short [True
]
Additional input options
bam1=FILE
: Optional BAM file of long reads mapped onto assembly 1 [$BASEFILE1.bam
]
paf1=FILE
: Optional PAF file of long reads mapped onto assembly 1 [$BASEFILE1.paf
]
reads1=FILELIST
: List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype1=LIST
: List of ont/pb/hifi file types matching reads for minimap2 mapping [ont
]
busco1=FILE
: Optional BUSCO full results file for genome 1 []
genomesize1=INT
: Haploid genome 1 size (bp) [0
]
scdepth1=NUM
: Single copy ("diploid") read depth for genome 1. If zero, will use SC BUSCO mode [0
]
bam2=FILE
: Optional BAM file of long reads mapped onto assembly 2 [$BASEFILE2.bam
]
paf2=FILE
: Optional PAF file of long reads mapped onto assembly 2 [$BASEFILE2.paf
]
reads2=FILELIST
: List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype2=LIST
: List of ont/pb/hifi file types matching reads for minimap2 mapping [ont
]
busco2=FILE
: Optional BUSCO full results file for genome 2 []
genomesize2=INT
: Haploid genome 2 size (bp) [0
]
scdepth2=NUM
: Single copy ("diploid") read depth for genome 2. If zero, will use SC BUSCO mode [0
]
mapflanks1=FILE
: Flanks fasta file from previous SynBad run for mapping genome 1 flank identifiers []
mapflanks2=FILE
: Flanks fasta file from previous SynBad run for mapping genome 2 flank identifiers []
fullmap=T/F
: Whether to abort if not all flanks can be mapped [True
]
HiC Gap Flank options
hicbam1=FILE
: Optional BAM file of HiC reads mapped onto assembly 1 [$BASEFILE1.HiC.bam
]
hicbam2=FILE
: Optional BAM file of HiC reads mapped onto assembly 2 [$BASEFILE2.HiC.bam
]
gapflanks=INT
: Size of gap flank regions to output for HiC pairing analysis (0=off
) [10000
]
pureflanks=T/F
: Whether to restrict gap flanks to pure contig sequence (True) or include good gaps (False) [True
]
hicscore=X
: HiC scoring mode (pairs/score/wtscore) [wtscore
]
hicmin=X
: Min. number of HiC read pairs for a "best" HiC pairing ruling [3
]
hicmode=X
: Pairwise HiC assessment scoring strategy (synbad/pure/rand/full) [synbad
]
hicdir1=PATH
: Path to HiC read ID lists for genome 1 [$BASEFILE.qryflanks/
]
hicdir2=PATH
: Path to HiC read ID lists for genome 1 [$BASEFILE.hitflanks/
]
Additional output options
newacc1=X
: Scaffold name prefix for updated Genome 1 output [None
]
newacc2=X
: Scaffold name prefix for updated Genome 2 output [None
]
bestpair=T/F
: Whether to restrict the paired output to the top scaffold pairs [False
]
update=T/F
: Whether to reload compressed qry and hit tables but re-run additional compression [False
]
hidegaps=LIST
: List of SynBad gap types to "hide" in final outputs. Will need to be revealed again later []
force=T/F
: Whether to force regeneration of SynBad results tables [False
]
dochtml=T/F
: Generate HTML Diploidocus documentation (*.docs.html) instead of main run [False
]
History Module Version History
# 0.0.0 - Initial Compilation.
# 0.1.0 - Initial working version without fragment=T implementation.
# 0.1.1 - Added minlocid=PERC and minloclen=INT.
# 0.1.2 - Added additional translocation skipping for SynTrans rating.
# 0.1.3 - Modified the SynBad classification text.
# 0.1.4 - Modified code to be able to run without long read mapping. Added dochtml output.
# 0.2.0 - Added fragment=T output.
# 0.3.0 - Added chromosome scaffold Translocation restriction.
# 0.4.0 - Added an Duplication rating in place of Breakpoint for overlapping flanking hits; added top sequence pairs.
# 0.5.0 - Added HiFi read type. Changed default to gapmode=gapspan.
# 0.5.1 - Added some extra bug-fixes for running Diploidocus checkpos.
# 0.6.0 - Added additional gap-spanning classes and qry-hit pair compression. Added dev reworking on main compression.
# 0.6.1 - Tidy up of code and transition of reworked code to main run (no longer dev=T only).
# 0.7.0 - Added code for fixing Inversions and well-supported translocations. Updated defaults.
# 0.8.0 - Added hicbam=FILE inputs and initial HiC assessment. Fixed inversion bug.
# 0.8.1 - Adding hicbest and summary table output. Fixed some calculation and re-run bugs.
# 0.8.2 - Separated and tidied HiC processing from contig flank/end processing. Fixed summary. Added correct=T/F option.
# 0.8.3 - Fixed HIC flank mapping error.
# 0.8.4 - Added simple duplicity analysis with KAT kmers and Diploidocus CNV. Added extract and relocate edits.
# 0.8.5 - Replace swap edits with break and join edits -> only join if pair are both scaffold ends. Add rejoin=T/F. Fixed major gap update bug.
# 0.8.6 - Small bug fixes for partial input.
# 0.9.0 - Added hidegaps=LIST option for hiding gaps. Add MashMap in place of GABLAM (dev=True). Fixed naming clashes.
# 0.10.0- Added mapper=busco option to use BUSCO genes in place of GABLAM.
# 0.10.1- Fixed end of sequence gap bug for contig/flank generation.
# 0.10.2- Fixed bug with new filenaming for Diploidocus wrapping of DepthKopy. (May need better fix.)
synbad REST Output formats
Run with
&rest=docs
for program documentation and options. A plain text version is accessed with
&rest=help
.
&rest=OUTFMT
can be used to retrieve individual parts of the output, matching the tabs in the default
(
&rest=format
) output. Individual
OUTFMT
elements can also be parsed from the full (
&rest=full
) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
Available REST Outputs
There is currently no specific help available on REST output for this program.