Module:	synbad
Description:	Synteny-based scaffolding assessment and adjustment
Version:	0.10.2
Last Edit:	31/03/22
GitHub:	https://github.com/slimsuite/synbad

Imported modules: rje rje_obj rje_db rje_lrbridge rje_mashmap rje_rmd rje_seqlist rje_sequence diploidocus gablam

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

SynBad is a tool for comparing two related genome assemblies and identify putative translocations and inversions between the two that correspond to gap positions. These positions could indicate misplaced scaffolding.

Synbad will use or create:

1. A table of gap positions for each assembly (seqname, start, end). This can optionally have long reads mapped and spanning coverage calculated for each gap using Diploidocus. Gaps without spanning long reads are more likely to correspond to misassemblies.

2. The qryunique and hitunique local hits tables from a GABLAM run using Minimap2.

Pairwise hits between the genomes are filtered according to the minlocid=PERC and minloclen=INT criteria, which by default limits hits to be at least 1kb and 50% identity. Note that this is applied after GABLAM has run, so it should be possible to re-run with more relaxed constraints and re-use the GABLAM tables.

Next, all gap positions are read in along with the local hits tables. For each genome, the local hit tables are sorted and QryGap and SbjGap fields added. Any local alignments with flanking hits are then flagged in these new fields with 5', 3' or Both.

The gap tables will also be updated with GapSpan and SynSpan fields that have the distance between the corresponding local hits on the Qry and Sbj genomes. If there is also an inversion, SynSpan will be negative. If the local hits are against two different sequences from the other genome, the two sequence names will be entered in the SynSpan field instead. If the gap is in the middle of local hit (likely to be true only for small gaps), SynSpan or GapSpan will have a value of zero.

Gaps will then be classified according to the associated GapSpan and SynSpan values, and subsequent local hit analysis to fix/update ratings:

Aln = Aligned = Gap is found in the middle of a local alignment to the Hit
Brk = Breakpoint = Difference between positive SynSpan and GapSpan is bigger than the maxsynspan=INT distance.
Div = Divergence = Translocation or Breakpoint gap with flanking collinear hits.
Dup = Duplication = Overlapping flanking hits on the same strand.
Frag = Fragmentation = SynSpan indicates matches are on different scaffolds, 1+ of which is not a chromosome scaffold.
Ins = Insertion = Achieved Syntenic rating by skipping upto maxsynskip=INT local alignments and max maxsynspan=INT bp in both Qry and Hit.
Inv = Inversion = Flanking hits are on alternative strands.
InvBrk = Inversion Breakpoint = Reclassified Inversion that appears to be out of place.
InvDupFix = Fixed Inversion Duplication = Inversion that becomes a Duplication when inverted. (See *.corrections.tdt)
InvFix = Fixed Inversion = Inversion that becomes collinear when inverted. (See *.corrections.tdt)
Long = Long Syntenic = Gap flanked by collinear Qry/Hit pairs but distance is greater than maxsynspan=INT (default 25 kb).
Null = No mapping between genomes for that gap.
Span = Spanned = Any gaps without Aligned or Syntenic rating that are spanned by at least synreadspan=INT reads.
Syn = Syntenic = Difference between positive SynSpan and GapSpan is maxsynspan=INT or less (default 25 kb).
Tran = Translocation = SynSpan indicates matches are on different scaffolds.
Term = Terminal = Gap is between a local alignment and the end of the query sequence.

Dependencies

SynBad needs Minimap2, samtools and kat installed for full functionality.
For gapass gap mode, Flye also needs to be installed.
To generate documentation with dochtml, R will need to be installed and a pandoc environment variable must be set, e.g.

export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc

For full documentation of the SynBad workflow, run with dochtml=T and read the *.docs.html file generated.

Commandline

Main SynBad run options

genome1=FILE : Genome assembly used as the query in the GABLAM/Mashmap searches []
genome2=FILE : Genome assembly used as the searchdb in the GABLAM/Mashmap searches []
basefile=X : Prefix for output files [synbad]
gablam=X : Optional prefix for GABLAM/Mashmap search [defaults to $BASEFILE.map]
mapper=X : Whether to use minimap2, busco or mashmap (dev only) for all-by-all mapping [minimap2]
gapmode=X : Diploidocus gap run mode (gapspan/gapass) [gapspan]
minloclen=INT : Minimum length for aligned chunk to be kept (local hit length in bp) [1000]
minlocid=PERC : Minimum percentage identity for aligned chunk to be kept (local %identity) [50]
maxsynskip=INT : Maximum number of local alignments to skip for SynTrans classification [4]
maxsynspan=INT : Maximum distance (bp) between syntenic local alignments to count as syntenic [25000]
synreadspan=INT : Minimum number of reads spanning a gap to change the rating to "Spanned" [5]
spannedflank=INT: Required flanking distance for synreadspan "Spanned" rating [0]
maxoverlap=INT : Maximum overlap (bp) of adjacent local hits to allow compression [500]
chr1=X : PAFScaff-style chromosome prefix for Genome 1 to distinguish Translocation from Fragmentation []
chr2=X : PAFScaff-style chromosome prefix for Genome 2 to distinguish Translocation from Fragmentation []

Correction and Fragmentation options

correct=LIST : List of edit types to try to fix in the assembly (invert/extract/relocate/break/join; [T/True=all]{cmd:T/True}) [True]
fragment=T/F : Whether to fragment the assembly at gaps marked as non-syntenic if no corrections made [False]
fragtypes=LIST : List of SynBad ratings to trigger fragmentation [Brk,Inv,InvBrk,Frag,Tran]
minreadspan=INT : Min number of Span0 reads in gaps table to prevent fragmentation [1]
minctglen=INT : Extract any contigs below a minimum length threshold [500]
minbadctg=INT : Extract any contigs with bad flanking ratings below a minimum length threshold [5000]
minscafflen=INT : Remove any scaffolds (inc. detached/extracted contigs) below minimum length threshold [500]
gapsize=INT : Size of gaps to add when relocating assembling chunks [500]
rejoin=T/F : Whether to rejoin original gaps that end up split into termini but not too short [True]

Additional input options

bam1=FILE : Optional BAM file of long reads mapped onto assembly 1 [$BASEFILE1.bam]
paf1=FILE : Optional PAF file of long reads mapped onto assembly 1 [$BASEFILE1.paf]
reads1=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype1=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
busco1=FILE : Optional BUSCO full results file for genome 1 []
genomesize1=INT : Haploid genome 1 size (bp) [0]
scdepth1=NUM : Single copy ("diploid") read depth for genome 1. If zero, will use SC BUSCO mode [0]
bam2=FILE : Optional BAM file of long reads mapped onto assembly 2 [$BASEFILE2.bam]
paf2=FILE : Optional PAF file of long reads mapped onto assembly 2 [$BASEFILE2.paf]
reads2=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype2=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
busco2=FILE : Optional BUSCO full results file for genome 2 []
genomesize2=INT : Haploid genome 2 size (bp) [0]
scdepth2=NUM : Single copy ("diploid") read depth for genome 2. If zero, will use SC BUSCO mode [0]
mapflanks1=FILE : Flanks fasta file from previous SynBad run for mapping genome 1 flank identifiers []
mapflanks2=FILE : Flanks fasta file from previous SynBad run for mapping genome 2 flank identifiers []
fullmap=T/F : Whether to abort if not all flanks can be mapped [True]

HiC Gap Flank options

hicbam1=FILE : Optional BAM file of HiC reads mapped onto assembly 1 [$BASEFILE1.HiC.bam]
hicbam2=FILE : Optional BAM file of HiC reads mapped onto assembly 2 [$BASEFILE2.HiC.bam]
gapflanks=INT : Size of gap flank regions to output for HiC pairing analysis (0=off) [10000]
pureflanks=T/F : Whether to restrict gap flanks to pure contig sequence (True) or include good gaps (False) [True]
hicscore=X : HiC scoring mode (pairs/score/wtscore) [wtscore]
hicmin=X : Min. number of HiC read pairs for a "best" HiC pairing ruling [3]
hicmode=X : Pairwise HiC assessment scoring strategy (synbad/pure/rand/full) [synbad]
hicdir1=PATH : Path to HiC read ID lists for genome 1 [$BASEFILE.qryflanks/]
hicdir2=PATH : Path to HiC read ID lists for genome 1 [$BASEFILE.hitflanks/]

Additional output options

newacc1=X : Scaffold name prefix for updated Genome 1 output [None]
newacc2=X : Scaffold name prefix for updated Genome 2 output [None]
bestpair=T/F : Whether to restrict the paired output to the top scaffold pairs [False]
update=T/F : Whether to reload compressed qry and hit tables but re-run additional compression [False]
hidegaps=LIST : List of SynBad gap types to "hide" in final outputs. Will need to be revealed again later []
force=T/F : Whether to force regeneration of SynBad results tables [False]
dochtml=T/F : Generate HTML Diploidocus documentation (*.docs.html) instead of main run [False]

History Module Version History

# 0.0.0 - Initial Compilation.
# 0.1.0 - Initial working version without fragment=T implementation.
# 0.1.1 - Added minlocid=PERC and minloclen=INT.
# 0.1.2 - Added additional translocation skipping for SynTrans rating.
# 0.1.3 - Modified the SynBad classification text.
# 0.1.4 - Modified code to be able to run without long read mapping. Added dochtml output.
# 0.2.0 - Added fragment=T output.
# 0.3.0 - Added chromosome scaffold Translocation restriction.
# 0.4.0 - Added an Duplication rating in place of Breakpoint for overlapping flanking hits; added top sequence pairs.
# 0.5.0 - Added HiFi read type. Changed default to gapmode=gapspan.
# 0.5.1 - Added some extra bug-fixes for running Diploidocus checkpos.
# 0.6.0 - Added additional gap-spanning classes and qry-hit pair compression. Added dev reworking on main compression.
# 0.6.1 - Tidy up of code and transition of reworked code to main run (no longer dev=T only).
# 0.7.0 - Added code for fixing Inversions and well-supported translocations. Updated defaults.
# 0.8.0 - Added hicbam=FILE inputs and initial HiC assessment. Fixed inversion bug.
# 0.8.1 - Adding hicbest and summary table output. Fixed some calculation and re-run bugs.
# 0.8.2 - Separated and tidied HiC processing from contig flank/end processing. Fixed summary. Added correct=T/F option.
# 0.8.3 - Fixed HIC flank mapping error.
# 0.8.4 - Added simple duplicity analysis with KAT kmers and Diploidocus CNV. Added extract and relocate edits.
# 0.8.5 - Replace swap edits with break and join edits -> only join if pair are both scaffold ends. Add rejoin=T/F. Fixed major gap update bug.
# 0.8.6 - Small bug fixes for partial input.
# 0.9.0 - Added hidegaps=LIST option for hiding gaps. Add MashMap in place of GABLAM (dev=True). Fixed naming clashes.
# 0.10.0- Added mapper=busco option to use BUSCO genes in place of GABLAM.
# 0.10.1- Fixed end of sequence gap bug for contig/flank generation.
# 0.10.2- Fixed bug with new filenaming for Diploidocus wrapping of DepthKopy. (May need better fix.)

synbad REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...

Available REST Outputs

There is currently no specific help available on REST output for this program.

SLiMSuite REST Server

synbad V0.10.2

Synteny-based scaffolding assessment and adjustment