This module assesses an assembled genome against a suitable reference. For optimal results, the reference genome
should be close to identical to the genome being assembled. However, comparative analyses should still be useful
when different assemblies are run against a related genome: although there will not be the same expectation of
100% coverage and accuracy, inaccuracies would still be expected to make an assembly less similar to the
reference.
Main input for PAGSAT is an assembled genome in fasta format (assembly=FILE) and a reference genome in fasta format
(reference=FILE) with corresponding *.gbff genbank download for feature extraction. For full function, a features
table plus protein and gene sequences should be provided. (These will be parsed from a Genbank reference file.)
Basic contig-reference mapping and plotting will still be performed with a pure sequence reference that lacks
features or gene sequences.
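
As a rough illustration of the inputs described above, the following Python sketch renders options in the `option=value` form used throughout this documentation. The helper name `pagsat_args` and the file names are hypothetical; only the option names come from the text, and writing booleans as T/F follows the option listings.

```python
def pagsat_args(**options):
    """Render keyword options as option=value strings (hypothetical helper)."""
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Boolean options are written as T/F in the option listings.
            value = "T" if value else "F"
        args.append(f"{key}={value}")
    return args


# Placeholder file names, not files shipped with PAGSAT.
example = pagsat_args(assembly="my_assembly.fasta",
                      reference="my_reference.fasta",
                      rgraphics=False)
print(" ".join(example))
```

The resulting list could be appended to whatever command is used to launch PAGSAT locally; the launch command itself is deliberately left out because it depends on the installation.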
Reference Sequence Naming
PAGSAT expects a particular naming format for assembly sequences, which is a bit more constrained than most programs.
This is to enable the full suite of visualisation with clear, unambiguous labelling of contigs. Input sequence names
must be in the form ctgXX_SPCODE__ACCBASE.YY, where both XX and YY are *unique* for each contig. (XX and YY may
match, but this is not a requirement.) ACCBASE could be the same for all sequences, or it could be
sequence-specific. It can also include . characters: only the final . element must be unique.
must end with numbers. (These are expected to be contig numbers, possibly with a prefix.)
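
The naming rules above can be checked before a run with a small script. This is a minimal sketch, not PAGSAT's own validation: the regular expression, the function name `check_names` and the example names are assumptions, and it reads the "must end with numbers" rule as applying to the leading ctgXX element.

```python
import re

# Hypothetical pattern for ctgXX_SPCODE__ACCBASE.YY (an assumption, not PAGSAT code).
NAME_RE = re.compile(
    r"(?P<ctg>[A-Za-z0-9]*[0-9])"   # contig element, assumed to end with a digit
    r"_(?P<spcode>[A-Za-z0-9]+)"    # species code
    r"__(?P<accbase>\S+)"           # accession base, may itself contain '.'
    r"\.(?P<yy>[^.\s]+)"            # final '.'-delimited element
)


def check_names(names):
    """Return (name, problem) tuples; an empty list means no problems were found."""
    problems = []
    seen_ctg, seen_yy = set(), set()
    for name in names:
        match = NAME_RE.fullmatch(name)
        if match is None:
            problems.append((name, "bad format"))
            continue
        checks = (("contig element", match["ctg"], seen_ctg),
                  ("final element", match["yy"], seen_yy))
        for label, value, seen in checks:
            if value in seen:
                problems.append((name, f"duplicate {label}: {value}"))
            seen.add(value)
    return problems


print(check_names(["ctg01_YEAST__ABC.1.01", "ctg02_YEAST__ABC.1.02"]))
```

Only the sequence name (the first word of each fasta header) should be passed in; descriptions are not handled by this sketch.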
Main output is a number of delimited text files and PNG graphics made with R. Details to follow.
NOTE: Snapper is now used for the underlying Reference vs Assembly GABLAM searches (unless snapper=F). For speed,
the SNP mapping functions are switched off. To get the full set of Snapper outputs, use makesnp=T.
The mapfas=T function generates a new copy of the assembly (in the main PAGSAT output directory) in which contigs
have been reorientated to be the same strand as the reference chromosomes where possible. This is performed on the
basis of the reference chromosome with maximum unique coverage for each contig, i.e. the contig will be matched to
the reference chromosome for which it has the most bases that ONLY map to that chromosome. (Clearly this will only
work if the reference is haploid!) Where the majority of matching bases are on the negative strand, the contig will
be reverse complemented and the accnum updated such that
If the assembly has *.depthplot.tdt data (generated by PAGSAT if a *.sam file is provided), these files will be
converted to
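
The mapfas=T reorientation logic described earlier in this section (best chromosome by unique coverage, then a strand majority vote) can be sketched as follows. This is an illustrative outline only: the input format of `(chromosome, strand, bases)` tuples and the function names are assumptions, not PAGSAT's internal data structures.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Reverse complement a DNA sequence, preserving case."""
    return seq.translate(COMPLEMENT)[::-1]


def orient_contig(unique_hits):
    """Pick the chromosome with most uniquely mapped bases and decide on RevComp.

    unique_hits: iterable of (chromosome, strand, bases) tuples (hypothetical format),
    counting only bases that map to that chromosome alone.
    Returns (best_chromosome, needs_revcomp); (None, False) if there are no hits.
    """
    per_chrom = defaultdict(lambda: {"+": 0, "-": 0})
    for chrom, strand, bases in unique_hits:
        per_chrom[chrom][strand] += bases
    if not per_chrom:
        return None, False
    # Ties are resolved in favour of the first chromosome encountered.
    best = max(per_chrom, key=lambda c: sum(per_chrom[c].values()))
    strands = per_chrom[best]
    return best, strands["-"] > strands["+"]


hits = [("chrI", "+", 500), ("chrI", "-", 2000), ("chrII", "+", 1200)]
chrom, flip = orient_contig(hits)
sequence = "AACG"
if flip:
    sequence = reverse_complement(sequence)
print(chrom, flip, sequence)
```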
assembly=FILE : Fasta file of assembled contigs to assess [
refgenome=FILE : Fasta file of reference genome for assessment (also *.gb for full functionality) [
spcode=X : Species code for reference genome (if not already processed by rje_genbank) [
minqv=X : Minimum mean QV score for assembly contigs (read from *.qv.csv) [
mincontiglen=X : Minimum contig length to retain in assembly (QV filtering only) [
casefilter=T/F : Whether to filter leading/trailing lower case (low QV) sequences [
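
A minimal sketch of how the minqv, mincontiglen and casefilter options might combine is given below. The order of operations (trim first, then filter), the treatment of contigs without a QV score, and all function names are assumptions made for illustration; the mean QV values are passed in as a plain dictionary rather than parsed from *.qv.csv.

```python
import re


def trim_lowercase_ends(seq):
    """Strip leading/trailing lower case bases, keeping internal lower case."""
    match = re.search(r"[A-Z].*[A-Z]|[A-Z]", seq)
    return match.group(0) if match else ""


def filter_contigs(contigs, mean_qv, minqv, mincontiglen, casefilter=True):
    """Return contigs passing the QV and length thresholds (hypothetical helper).

    contigs: dict of name -> sequence; mean_qv: dict of name -> mean QV score.
    Contigs missing from mean_qv are treated as QV 0 (an assumption).
    """
    kept = {}
    for name, seq in contigs.items():
        if casefilter:
            seq = trim_lowercase_ends(seq)
        if len(seq) < mincontiglen:
            continue
        if mean_qv.get(name, 0.0) < minqv:
            continue
        kept[name] = seq
    return kept


contigs = {"c1": "acGTACGTtt", "c2": "ACGT"}
print(filter_contigs(contigs, {"c1": 40.0, "c2": 10.0}, minqv=20, mincontiglen=5))
```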
Reference vs Assembly Options
minlocid=X : Minimum percentage identity for local hits mapping to chromosome coverage [
minloclen=X : Minimum length for local hits mapping to chromosome coverage [
genesummary=T/F : Whether to include reference gene searches in summary data [
protsummary=T/F : Whether to include reference protein searches in summary data [
tophitbuffer=X : Percentage identity difference to keep best hits for reference genes/proteins. [
diploid=T/F : Whether to treat assembly as a diploid [
minunique=X : Minimum number of "Unique-mapping" nucleotides to retain a contig-chromosome link [
snapper=T/F : Run Snapper to generate "best" unique mapping of assembly contigs to Reference [
makesnp=T : Generate the full set of SNP outputs for Snapper [
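
To make the minlocid, minloclen and tophitbuffer thresholds concrete, here is a hedged Python sketch. The hit dictionaries, field names and both function names are hypothetical, and reading tophitbuffer as "keep hits within X percentage points of the best identity" is an interpretation of the option description rather than confirmed behaviour. The example thresholds (95 and 250) are the values mentioned in the version history.

```python
def filter_local_hits(hits, minlocid, minloclen):
    """Keep local hits that meet both the identity and length thresholds.

    hits: list of dicts with 'length' (aligned bases) and 'identity'
    (identical bases); both keys are assumptions for this sketch.
    """
    kept = []
    for hit in hits:
        if hit["length"] <= 0:
            continue
        pct_id = 100.0 * hit["identity"] / hit["length"]
        if hit["length"] >= minloclen and pct_id >= minlocid:
            kept.append(hit)
    return kept


def within_top_hit_buffer(pct_ids, tophitbuffer):
    """Return names of hits within tophitbuffer percentage points of the best hit."""
    if not pct_ids:
        return []
    best = max(pct_ids.values())
    return sorted(name for name, pct in pct_ids.items() if best - pct <= tophitbuffer)


hits = [
    {"length": 300, "identity": 297},   # 99% identity, long enough
    {"length": 200, "identity": 200},   # too short
    {"length": 400, "identity": 360},   # 90% identity, too divergent
]
print(filter_local_hits(hits, minlocid=95, minloclen=250))
print(within_top_hit_buffer({"a": 99.5, "b": 99.0, "c": 97.0}, tophitbuffer=1.0))
```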
basefile=X : Basename for output files and directories. [
rgraphics=T/F : Whether to generate PNG graphics using R. (Needs R installed and setup) [
dotplots=T/F : Whether to use gablam.r to output dotplots for all ref vs assembly. [
assessment=T/F : Whether to perform full reference versus assembly assessment [
report=T/F : Whether to generate HTML report. Also sets assessment=T (default function). [
genetar=T/F : Whether to tar and zip the GeneHits/ and ProtHits/ folders (if generated & Mac/Linux) [
chromalign=T/F : [Discontinued] Whether to perform crude chromosome-contig alignment [
orderedfas=T/F : Whether to generate crude ordered contig output for e.g. Progressive Mauve [
treeformats=LIST: Output formats for chromosome versus contig %identity UPGMA tree [
dismatrix=T/F : Whether to generate distance matrix of chromosome vs contig identities [
compare=FILES : Compare assemblies selected using a list of *.Summary.tdt files (wildcards allowed). 
fragcov=LIST : List of coverage thresholds to count min. local BLAST hits (checks integrity) [
chromcov=LIST : Report no. of chromosomes covered by a single contig at different %globID (GABLAM table) [
compile=FILES : Compile reference chromosome comparisons for a set of *.report.html files 
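
For downstream work on the files that compare=FILES operates on, the wildcard expansion can be mimicked in Python. This sketch assumes that *.Summary.tdt files are tab-delimited text with a header row; the function name `load_summaries` and the example pattern are hypothetical, and no particular column names are assumed.

```python
import csv
import glob


def load_summaries(patterns):
    """Expand wildcard patterns and read each tab-delimited file into row dicts.

    Returns a dict of path -> list of rows, where each row maps header names
    to string values. Assumes a header row is present (an assumption).
    """
    tables = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="") as handle:
                tables[path] = list(csv.DictReader(handle, delimiter="\t"))
    return tables


# Placeholder pattern; adjust to wherever the summary files were written.
for path, rows in load_summaries(["*.Summary.tdt"]).items():
    print(path, len(rows))
```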
Assembly Tidy/Edit Options
mapfas=T/F : Output assembly *.map.fasta file with RevComp contigs based on initial (automatic) mapping [
tidy=T/F : Execute semi-automated assembly tidy/edit mode to complete draft assembly [
newacc=X : New base for edited contig accession numbers (None will keep old accnum) [
newchr=X : Code to replace "chr" in new sequence names for additional PAGSAT compatibility [
spcode=X : Species code for renaming assembly sequences [
refchr=X : Code used in place of "chr" for reference sequence names [
orphans=T/F : Whether to include and process orphan contigs [
joinsort=X : Whether to sort potential chromosome joins by
joinmerge=X : Merging mode for joining chromosomes (mid/start/end/longest) [
joinmargin=X : Number of extra bases allowed to still be considered an end local BLAST hit [
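
The joinmargin option can be illustrated with a short check of whether a local hit reaches a contig end. This is a sketch under stated assumptions: positions are taken to be 1-based and inclusive, and the function name `end_hit` is hypothetical rather than part of PAGSAT.

```python
def end_hit(hit_start, hit_end, contig_len, joinmargin=0):
    """Report which contig end(s) a local hit reaches, allowing joinmargin extra bases.

    Positions are assumed to be 1-based and inclusive.
    Returns a list containing "start", "end", both, or neither.
    """
    ends = []
    if hit_start <= 1 + joinmargin:
        ends.append("start")
    if hit_end >= contig_len - joinmargin:
        ends.append("end")
    return ends


print(end_hit(5, 800, 1000, joinmargin=10))
print(end_hit(200, 995, 1000, joinmargin=10))
print(end_hit(200, 800, 1000, joinmargin=10))
```

With joinmargin=0 only hits that touch the first or last base exactly would count, which is why a small margin is useful when contig ends are trimmed or of lower quality.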
Module Version History
# 1.0.0 - Initial working version based on rje_pacbio assessment=T.
# 1.1.0 - Fixed bug with gene and protein summary data. Removed gene/protein reciprocal searches. Added compare mode.
# 1.1.1 - Added PAGSAT output directory for tidiness!
# 1.1.2 - Renamed the PacBio class PAGSAT.
# 1.2.0 - Tidied up output directories. Added QV filter and Top Gene/Protein hits output.
# 1.2.1 - Added casefilter=T/F : Whether to filter leading/trailing lower case (low QV) sequences [True]
# 1.3.0 - Added tophitbuffer=X and initial synteny analysis for keeping best reference hits.
# 1.4.0 - Added chrom-v-contig alignment files along with *.ordered.fas.
# 1.4.1 - Made default chromalign=T.
# 1.4.2 - Fixed casefilter=F.
# 1.5.0 - diploid=T/F : Whether to treat assembly as a diploid [False]
# 1.6.0 - mincontiglen=X : Minimum contig length to retain in assembly 
# 1.6.1 - Added diploid=T/F to R PNG call.
# 1.7.0 - Added tidy=T/F option. (Development)
# 1.7.1 - Updated tidy=T/F to include initial assembly.
# 1.7.2 - Fixed some bugs introduced by changing gablam fragment output.
# 1.7.3 - Added circularise sequence generation.
# 1.8.0 - Added orphan processing and non-chr naming of Reference.
# 1.9.0 - Modified the join sorting and merging. Added better tracking of positions when trimming.
# 1.9.1 - Added joinmargin=X : Number of extra bases allowed to still be considered an end local BLAST hit 
# 1.10.0 - Added weighted tree output and removed report warning.
# 1.10.1 - Fixed issue related to having Description in GABLAM HitSum tables.
# 1.10.2 - Tweaked haploid core output.
# 1.10.3 - Fixed tidy bug for RevComp contigs and switched joinsort default to Identity. (Needs testing.)
# 1.10.4 - Added genetar option to tidy out genesummary and protsummary output. Incorporated rje_synteny.
# 1.10.5 - Set gablamfrag=1 for gene/protein hits.
# 1.11.0 - Consolidated automated tidy mode and cleaned up some excess code.
# 1.11.1 - Added option for running self-PAGSAT of ctidX contigs versus haploid set. Replaced ctid "X" with "N".
# 1.11.2 - Fixed Snapper run choice bug.
# 1.11.3 - Added reference=FILE as alias for refgenome=FILE. Fixed orphan delete bug.
# 1.12.0 - Tidying up and documenting outputs. Changed default minloclen=250 and minlocid=95. (LTR identification.)
# 2.0.0 - Major overhaul of outputs to improve consistency and clarity. Added Snapper to main run.
# 2.1.0 - Added localSAM output.
# 2.1.1 - Fixed the case of some output files.
# 2.1.2 - Fixed some issues with reverse hits in Snapper and application of minlocid.
# 2.2.0 - Added mapout=T, which is recommended for first run if going to subsequently tidy. (Run tidy on mapfile.)
# 2.2.1 - Tried to fix covplot bug in compare=FILES mode.
# 2.2.2 - Cleaned up *.map.* output for SAMPhaser output files. Added tidy/mapfas option selection.
# 2.2.3 - Added #NOTE to tidy and fixed makesnp=T bug.
# 2.2.4 - Fixed `fragrevcomp=F` bug for Gene and Protein TopHits.
# 2.2.5 - Hopefully really fixed makesnp=T bug now!
# 2.2.6 - Fixed Haploid tidy sequence output naming bug.
# 2.2.7 - Fixed Compare File path bug & dropped some empty outputs.
# 2.3.0 - Minor bug fixes and extra tidy options (join gaps and multi-deletes).
# 2.3.1 - Minor bug fixes.
# 2.3.2 - Updated the synteny mappings to be m::n instead of m:n for Excel compatibility.
# 2.3.3 - Fixed bad assembly sequence name bug.
# 2.3.4 - Fixed full.fas request bug.
# 2.4.0 - Added PAGSAT compile mode to generate comparisons of reference chromosomes across assemblies.
# 2.5.0 - Reduced the executed code when mapfas=T assessment=F. (Recommended first run.) Added renaming.
# 2.5.1 - Added recognition of *.gbff for genbank files.
PAGSAT REST Output formats
Run with &rest=help for general options. Run with
&rest=full to get full server output as text or
for more user-friendly formatted output. Individual outputs can be identified/parsed using
coverage = main results table