Module:	PAGSAT
Description:	Pairwise Assembled Genome Sequence Analysis Tool
Version:	2.8.1
Last Edit:	24/03/21

Imported modules: rje rje_db rje_genbank rje_html rje_menu rje_obj rje_paf rje_samtools rje_seqlist rje_sequence rje_synteny rje_tree rje_tree_group rje_xref rje_blast_V2 rje_dismatrix_V3 gablam snapper

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module is for the assessment of an assembled genome versus a suitable reference. For optimal results, the reference genome will be close to identical to that which should be assembled. However, comparative analyses should still be useful when different assemblies are run against a related genome - although there will not be the same expectation for 100% coverage and accuracy, inaccuracies would still be expected to make an assembly less similar to the reference.

Input

Main input for PAGSAT is an assembled genome in fasta format (assembly=FILE) and a reference genome in fasta format
(refgenome=FILE or reference=FILE) with corresponding *.gb, *.gbk or *.gbff genbank download for feature
extraction. For full function, a features table plus protein and gene sequences should be provided. (These will be
parsed from a Genbank reference file.) Basic contig-reference mapping and plotting will still be performed with a
pure sequence reference that lacks features or gene sequences.

Reference Sequence Naming

PAGSAT expects a particular naming format for assembly sequences, which is a bit more constrained that most programs.
This is to enable the full suite of visualisation with clear unambiguous labelling of contigs. Input sequence names
must be in the form: ctgXX_SPCODE__ACCBASE.YY, where both ctgXX and YY are *unique* for each contig.
(Generally XX and YY will match but this is not a requirement.) ACCBASE could be the same for all sequences, or
it could be sequence-specific. It can also include . characters: only the final . element must be unique. YY
must end with numbers. (These are expected to be contig numbers, possibly with a prefix.)

Sequence mapping

Version 2.6 introduced a mapper=minimap option for using minimap2 in place of BLAST+ for mapping the assembly and
reference against each other (and themselves). This should be considered a "quick and dirty" version of PAGSAT, with
much faster run times but missing a lot of small repeat sequences and more divergent matches. It is primarily for
initial assessment and triage of different assemblies before picking a cleaner/tidier one for further analysis.

Output

Main output is a number of delimited text files and PNG graphics made with R. Details to follow.

NOTE: Snapper is now used for the underlying Reference vs Assembly GABLAM searches (unless snapper=F). For speed,
the SNP mapping functions are switched off. To get the full set of Snapper outputs, use makesnp=T.

MapFas Output

The mapfas=T function generates a new copy of the assembly (in the main PAGSAT output directory) in which contigs
have been reorientated to be the same strand as the reference chromosomes where possible. This is performed on the
basis of the reference chromosome with maximum unique coverage for each contig, e.g the contig will be matched to the
reference chromosome for which it has the most bases that ONLY map to that chromosome. (Clearly this will only work
if the reference is haploid!) Where the majority of matching bases are on the negative strand, the contig will be
reverse complemented and the accnum updated such that XXXX.YY becomes XXXX.RYY.

If the assembly has *.coverage.tdt and *.depthplot.tdt data (generated by
SeqSuite.SAMTools or PAGSAT if a *.sam file is provided), these files will be converted to *.map.*.tdt that
match the *.map.fasta contigs.

Commandline

Input/Setup Options

assembly=FILE : Fasta file of assembled contigs to assess [None]
refgenome=FILE : Fasta file of reference genome for assessment (also *.gb for full functionality) [None]
spcode=X : Species code for reference genome (if not already processed by rje_genbank) [None]
minqv=X : Minimum mean QV score for assembly contigs (read from *.qv.csv) [20]
mincontiglen=X : Minimum contig length to retain in assembly (QV filtering only) [1000]
casefilter=T/F : Whether to filter leading/trailing lower case (low QV) sequences [True]

Reference vs Assembly Options

minlocid=X : Minimum percentage identity for local hits mapping to chromosome coverage [95.0]
minloclen=X : Mininum length for local hits mapping to chromosome coverage [250]
genesummary=T/F : Whether to include reference gene searches in summary data [True]
protsummary=T/F : Whether to include reference protein searches in summary data [True]
tophitbuffer=X : Percentage identity difference to keep best hits for reference genes/proteins. [1.0]
diploid=T/F : Whether to treat assembly as a diploid [False]
minunique=X : Minimum number of "Unique-mapping" nucleotides to retain a contig-chromosome link [250]
snapper=T/F : Run Snapper to generate "best" unique mapping of assembly contigs to Reference [True]
makesnp=T : Generate the full set of SNP outputs for Snapper [False]
mapper=X : Program to use for mapping files against each other (blast/minimap) [blast]
mapopt=CDICT : Dictionary of updated minimap2 options [N:250,p:0.0001,x:asm20]

Output Options

basefile=X : Basename for output files and directories. [assembly+ref]
rgraphics=T/F : Whether to generate PNG graphics using R. (Needs R installed and setup) [True]
dotplots=T/F : Whether to use gablam.r to output dotplots for all ref vs assembly. [False]
assessment=T/F : Whether to perform full reference versus assembly assessment [False]
report=T/F : Whether to generate HTML report. Also sets assessment=T (default function). [False]
genetar=T/F : Whether to tar and zip the GeneHits/ and ProtHits/ folders (if generated & Mac/Linux) [True]
chromalign=T/F : [Discontinued] Whether to perform crude chromosome-contig alignment [False]
orderedfas=T/F : Whether to generate crude ordered contig output for e.g. Progressive Mauve [False]
treeformats=LIST: Output formats for chromosome versus contig %identify UPGMA tree [nwk,png]
dismatrix=T/F : Whether to generate distance matrix of chromosome vs contig identities [False]

Comparison Options

compare=FILES : Compare assemblies selected using a list of *.Summary.tdt files (wildcards allowed). []
fragcov=LIST : List of coverage thresholds to count min. local BLAST hits (checks integrity) [50,90,95,99]
chromcov=LIST : Report no. of chromosomes covered by a single contig at different %globID (GABLAM table) [95,98,99]
compile=FILES : Compile reference chromosome comparisons for a set of *.report.html files []

Assembly Tidy/Edit Options

mapfas=T/F : Output assembly *.map.fasta file with RevComp contigs based on initial (automatic) mapping [False]
tidy=T/F : Execute semi-automated assembly tidy/edit mode to complete draft assembly [False]
newacc=X : New base for edited contig accession numbers (None will keep old accnum) [None]
newchr=X : Code to replace "chr" in new sequence names for additional PAGSAT compatibility [ctg]
keepchr=T/F : Keep the existing chromosome assigments during tidy if found [False]
spcode=X : Species code for renaming assembly sequences [PAGSAT]
refchr=X : Code used in place of "chr" for reference sequence names [chr]
orphans=T/F : Whether to include and process orphan contigs [True]
joinsort=X : Whether to sort potential chromosome joins by Length or Identity [Identity]
joinmerge=X : Merging mode for joining chromosomes (mid/start/end/longest) [mid]
joinmargin=X : Number of extra bases allowed to still be considered an end local BLAST hit [10]

History Module Version History

# 1.0.0 - Initial working version for based on rje_pacbio assessment=T.
# 1.1.0 - Fixed bug with gene and protein summary data. Removed gene/protein reciprocal searches. Added compare mode.
# 1.1.1 - Added PAGSAT output directory for tidiness!
# 1.1.2 - Renamed the PacBio class PAGSAT.
# 1.2.0 - Tidied up output directories. Added QV filter and Top Gene/Protein hits output.
# 1.2.1 - Added casefilter=T/F : Whether to filter leading/trailing lower case (low QV) sequences [True]
# 1.3.0 - Added tophitbuffer=X and initial synteny analysis for keeping best reference hits.
# 1.4.0 - Added chrom-v-contig alignment files along with *.ordered.fas.
# 1.4.1 - Made default chromalign=T.
# 1.4.2 - Fixed casefilter=F.
# 1.5.0 - diploid=T/F : Whether to treat assembly as a diploid [False]
# 1.6.0 - mincontiglen=X : Minimum contig length to retain in assembly [1000]
# 1.6.1 - Added diploid=T/F to R PNG call.
# 1.7.0 - Added tidy=T/F option. (Development)
# 1.7.1 - Updated tidy=T/F to include initial assembly.
# 1.7.2 - Fixed some bugs introduced by changing gablam fragment output.
# 1.7.3 - Added circularise sequence generation.
# 1.8.0 - Added orphan processing and non-chr naming of Reference.
# 1.9.0 - Modified the join sorting and merging. Added better tracking of positions when trimming.
# 1.9.1 - Added joinmargin=X : Number of extra bases allowed to still be considered an end local BLAST hit [10]
# 1.10.0 - Added weighted tree output and removed report warning.
# 1.10.1 - Fixed issue related to having Description in GABLAM HitSum tables.
# 1.10.2 - Tweaked haploid core output.
# 1.10.3 - Fixed tidy bug for RevComp contigs and switched joinsort default to Identity. (Needs testing.)
# 1.10.4 - Added genetar option to tidy out genesummary and protsummary output. Incorporated rje_synteny.
# 1.10.5 - Set gablamfrag=1 for gene/protein hits.
# 1.11.0 - Consolidated automated tidy mode and cleaned up some excess code.
# 1.11.1 - Added option for running self-PAGSAT of ctidX contigs versus haploid set. Replaced ctid "X" with "N".
# 1.11.2 - Fixed Snapper run choice bug.
# 1.11.3 - Added reference=FILE as alias for refgenome=FILE. Fixed orphan delete bug.
# 1.12.0 - Tidying up and documenting outputs. Changed default minloclen=250 and minlocid=95. (LTR identification.)
# 2.0.0 - Major overhaul of outputs to improve consistency and clarity. Added Snapper to main run.
# 2.1.0 - Added localSAM output.
# 2.1.1 - Fixed the case of some output files.
# 2.1.2 - Fixed some issues with reverse hits in Snapper and application of minlocid.
# 2.2.0 - Added mapout=T, which is recommended for first run if going to subsequently tidy. (Run tidy on mapfile.)
# 2.2.1 - Tried to fix covplot bug in compare=FILES mode.
# 2.2.2 - Cleaned up *.map.* output for SAMPhaser output files. Added tidy/mapfas option selection.
# 2.2.3 - Added #NOTE to tidy and fixed makesnp=T bug.
# 2.2.4 - Fixed `fragrevcomp=F` bug for Gene and Protein TopHits.
# 2.2.5 - Hopefully really fixed makesnp=T bug now!
# 2.2.6 - Fixed Haploid tidy sequence output naming bug.
# 2.2.7 - Fixed Compare File path bug & dropped some empty outputs.
# 2.3.0 - Minor bug fixes and extra tidy options (join gaps and multi-deletes).
# 2.3.1 - Minor bug fixes.
# 2.3.2 - Updated the synteny mappings to be m::n instead of m:n for Excel compatibility.
# 2.3.3 - Fixed bad assembly sequence name bug.
# 2.3.4 - Fixed full.fas request bug.
# 2.4.0 - Added PAGSAT compile mode to generate comparisons of reference chromosomes across assemblies.
# 2.5.0 - Reduced the executed code when mapfas=T assessment=F. (Recommended first run.) Added renaming.
# 2.5.1 - Added recognition of *.gbff for genbank files.
# 2.6.0 - Added mapper=X : Program to use for mapping files against each other (blast/minimap) [blast]
# 2.6.1 - Switch failure to find key report files to a long warning, not program exit.
# 2.6.2 - Fixed bugs with mapper=minimap mode and started adding more internal documentation.
# 2.6.3 - Fixed default behaviour to run report=T mode.
# 2.6.4 - Fixed summary table merge bug.
# 2.6.5 - Fixed compile path bug.
# 2.6.6 - Fixed BLAST LocalIDCut error for GABLAM and QAssemble stat filtering.
# 2.6.7 - Generalised compile path bug fix.
# 2.6.8 - Added ChromXcov fields to PAGSAT Compare.
# 2.6.9 - Fixed renamed assembly bug when basefile not set.
# 2.7.0 - Added BAM generation for assembly if reads given.
# 2.7.1 - Fixed bug that caused assembly PNGs to disappear.
# 2.8.0 - Added keepchr=T/F : Keep the existing chromosome assigments during tidy if found [False]
# 2.8.1 - Fixed bug that caused too many assembly PNGs to disappear!

PAGSAT REST Output formats

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT for:

coverage = main results table

SLiMSuite REST Server

PAGSAT V2.8.1

Pairwise Assembled Genome Sequence Analysis Tool