## Function
PAFScaff is designed for mapping genome assembly scaffolds to a closely-related chromosome-level reference genome
assembly. It uses (or runs) [Minimap2](https://github.com/lh3/minimap2) to perform an efficient (if rough)
all-against-all mapping, then parses the output to assign assembly scaffolds to reference chromosomes.

Mapping is based on minimap2-aligned assembly scaffold ("Query") coverage against the reference chromosomes.
Scaffolds are "placed" on the reference scaffold with the most coverage. Any scaffolds failing to map onto any
chromosome are rated as "Unplaced". For each reference chromosome, PAFScaff then "anchors" placed assembly
scaffolds, starting with the longest assembly scaffold. Each remaining placed scaffold is then assessed in order of
decreasing scaffold length. Any scaffolds that do not overlap with already anchored scaffolds in terms of the
Reference chromosome positions they map onto are also considered "Anchored". If `newprefix=X` is set, scaffolds
are renamed with the Reference chromosome they match onto. The original scaffold name and mapping details are
included in the description. Unplaced scaffolds are not renamed.
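
The placement and anchoring steps described above can be sketched in Python. This is a minimal illustration, not PAFScaff's own code: the function names are hypothetical, only the first twelve standard PAF columns are read, coverage is taken as the merged length of aligned query intervals, and a scaffold's reference position is assumed to be the span from its first to its last aligned base on the chosen chromosome. The `minmap` and `minpurity` filters are not modelled.

```python
from collections import defaultdict

def parse_paf(lines):
    """Yield (query, qlen, qstart, qend, target, tstart, tend) from PAF lines."""
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # skip blank or truncated lines
        yield f[0], int(f[1]), int(f[2]), int(f[3]), f[5], int(f[7]), int(f[8])

def merged_length(intervals):
    """Total length covered by (start, end) intervals, counting overlaps once."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def place_scaffolds(paf_lines):
    """Place each query on the target with the most query coverage.

    Returns {query: (target, query_length, ref_start, ref_end)}.
    """
    hits = defaultdict(lambda: defaultdict(list))  # query -> target -> query intervals
    ref_span = {}                                  # (query, target) -> [min start, max end]
    qlens = {}
    for q, qlen, qs, qe, t, ts, te in parse_paf(paf_lines):
        qlens[q] = qlen
        hits[q][t].append((qs, qe))
        span = ref_span.setdefault((q, t), [ts, te])
        span[0] = min(span[0], ts)
        span[1] = max(span[1], te)
    placed = {}
    for q, targets in hits.items():
        best = max(targets, key=lambda t: merged_length(targets[t]))
        placed[q] = (best, qlens[q], *ref_span[(q, best)])
    return placed

def anchor_scaffolds(placed):
    """Greedy anchoring: longest scaffold first, keeping only scaffolds whose
    reference span does not overlap an already anchored scaffold."""
    anchored = defaultdict(list)  # target -> [(ref_start, ref_end, query)]
    by_length = sorted(placed.items(), key=lambda kv: (-kv[1][1], kv[0]))
    for q, (t, _qlen, start, end) in by_length:
        if all(end <= s or start >= e for s, e, _ in anchored[t]):
            anchored[t].append((start, end, q))
    return {t: sorted(v) for t, v in anchored.items()}
```

Scaffolds returned by `place_scaffolds()` but absent from the `anchor_scaffolds()` result correspond to "placed" but not "Anchored" scaffolds; queries with no PAF lines at all would be "Unplaced".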

Finally, Anchored scaffolds are super-scaffolded by inserting gaps of `NnNnNnNnNn` sequence between anchored
scaffolds. The lengths of these gaps are determined by the space between the reference positions, modified by
overhanging query scaffold regions (min. length 10). The alternating case of these gaps makes them easy to
identify later.
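
As a rough illustration of the gap insertion, the sketch below joins anchored sequences with alternating-case gaps. It is an assumption-laden simplification: the function names are hypothetical, the gap length is taken directly from the distance between reference positions, and the adjustment for overhanging query regions is omitted because its exact rule is not given here. Only the alternating `Nn` pattern and the minimum length of 10 come from the description above.

```python
def gap_sequence(length, min_length=10):
    """Return an alternating-case gap (NnNn...) of at least min_length bases."""
    length = max(int(length), min_length)
    return ("Nn" * ((length + 1) // 2))[:length]

def super_scaffold(anchored):
    """Join anchored sequences with gaps sized by their reference spacing.

    anchored: list of (ref_start, ref_end, sequence), sorted by ref_start.
    Overhang adjustment is NOT modelled in this sketch.
    """
    parts = []
    prev_end = None
    for ref_start, ref_end, seq in anchored:
        if prev_end is not None:
            parts.append(gap_sequence(ref_start - prev_end))
        parts.append(seq)
        prev_end = ref_end
    return "".join(parts)
```

Because real sequence gaps are normally written as runs of a single case, searching the output for `Nn` is a simple way to find the joins introduced by scaffolding.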
## Output
PAFScaff outputs renamed, sorted and reoriented scaffolds in fasta format, along with mapping details:

* `*.anchored.fasta`, `*.placed.fasta` and `*.unplaced.fasta` contain the relevant subsets of assembly scaffolds, renamed and/or reverse-complemented if appropriate.
* `*.scaffolds.fasta` contains the super-scaffolded anchored scaffolds.
* `*.scaffolds.tdt` contains the details of the PAFScaff mapping of scaffolds to chromosomes.
* `*.log` contains run details, including any warnings or errors encountered.

NOTE: The precise ordering, orientation and naming of the output scaffolds depends on the settings for
`refprefix=X`, `newprefix=X`, `sorted=X` and `revcomp=T/F`.

For full documentation of the PAFScaff workflow, run with `dochtml=T` and read the `*.docs.html` file generated.

## Commandline

### Input/Output options

* `pafin=PAFFILE` : PAF generated from $REFERENCE $ASSEMBLY mapping; or `minimap2` to run minimap2, or `busco` to use BUSCO tables [`minimap2`]
* `basefile=STR` : Base for file outputs [`PAFIN basefile`]
* `seqin=FASFILE` : Input genome assembly to map/scaffold onto $REFERENCE (minimap2 $ASSEMBLY) []
* `reference=FILE` : Fasta (with accession numbers matching Locus IDs) ($REFERENCE) []
* `assembly=FASFILE` : Same as `seqin=FASFILE`
* `busco=TSVFILE` : BUSCO v5 full table (`pafin=busco`) [`full_table_$BASEFILE.busco.tsv`]
* `refbusco=TSVFILE` : Reference BUSCO v5 full table [`full_table_$REFBASE.busco.tsv`]
* `refprefix=X` : Reference chromosome prefix. If None, will use all $REFERENCE scaffolds [`None`]
* `newprefix=X` : Assembly chromosome prefix. If None, will not rename $ASSEMBLY scaffolds [`None`]
* `unplaced=X` : Unplaced scaffold prefix. If None, will not rename unplaced $ASSEMBLY scaffolds [`None`]
* `ctgprefix=X` : Unplaced contig prefix. Replaces `unplaced=X` when 0 gaps. [`None`]
* `sorted=X` : Criterion for $ASSEMBLY scaffold sorting (QryLen/Coverage/RefStart/None) [`QryLen`]
* `minmap=PERC` : Minimum percentage mapping to a chromosome for assignment [`0.0`]
* `minpurity=PERC` : Minimum percentage "purity" for assignment to Ref chromosome [`50.0`]
* `revcomp=T/F` : Whether to reverse complement relevant scaffolds to maximise concordance [`True`]
* `scaffold=T/F` : Whether to "anchor" non-overlapping scaffolds by Coverage and then scaffold [`True`]
* `dochtml=T/F` : Generate HTML PAFScaff documentation (*.info.html) instead of main run [`False`]
* `pagsat=T/F` : Whether to output sequence names in special PAGSAT-compatible format [`False`]
* `newchr=X` : Prefix for short PAGSAT sequence identifiers [`ctg`]
* `spcode=X` : Species code for renaming assembly sequences in PAGSAT mode [`PAFSCAFF`]
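
All of the options above follow a `key=value` pattern with a listed default. The sketch below shows one way such arguments could be read into a settings dictionary. It is illustrative only and is not PAFScaff's own argument handling: `parse_options` and `DEFAULTS` are hypothetical names, only a handful of options are included, and accepting `T`/`TRUE` (case-insensitive) as true is an assumption.

```python
# Subset of the defaults listed above; names and values are taken from the option list.
DEFAULTS = {
    "sorted": "QryLen",
    "minmap": 0.0,
    "minpurity": 50.0,
    "revcomp": True,
    "scaffold": True,
    "newprefix": None,
}

def parse_options(args, defaults=DEFAULTS):
    """Turn ['key=value', ...] into a dict, typed according to the default."""
    opts = dict(defaults)
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep:
            continue  # ignore anything that is not key=value
        key = key.lower()
        default = defaults.get(key)
        if isinstance(default, bool):
            opts[key] = value.upper() in ("T", "TRUE")  # assumed T/F convention
        elif isinstance(default, float):
            opts[key] = float(value)
        else:
            opts[key] = None if value == "None" else value
    return opts
```

For example, `parse_options(["minpurity=75", "revcomp=F"])` keeps every other default and changes only the two named settings.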

### Mapping/Classification options

* `minimap2=PROG` : Full path to run minimap2 [`minimap2`]
* `mmsecnum=INT` : Max. number of secondary alignments to keep (minimap2 -N) [`0`]
* `mmpcut=NUM` : Minimum secondary-to-primary score ratio for minimap2 to output secondary mappings (minimap2 -p) [`0`]
* `mapopt=CDICT` : Dictionary of additional minimap2 options to apply (caution: overrides conflicting settings) []
* `purebusco=T/F` : Whether to keep BUSCO genes separate rather than generating synteny blocks [`False`]
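
To make the relationship between these settings and the minimap2 call concrete, here is a hedged sketch that assembles a minimap2 command line. The `-N` and `-p` flags and the target-then-query argument order are standard minimap2 usage; the function name, the form of `mapopt` (a dictionary keyed by full flag, such as `"-x"`), and the absence of any other flags are assumptions, so the command PAFScaff actually runs may differ.

```python
def minimap2_command(reference, assembly, minimap2="minimap2",
                     mmsecnum=0, mmpcut=0, mapopt=None):
    """Build a minimap2 argument list mapping assembly (query) to reference (target)."""
    options = {"-N": str(mmsecnum), "-p": str(mmpcut)}
    # Extra options replace conflicting settings, as the mapopt caution describes.
    for flag, value in (mapopt or {}).items():
        options[flag] = "" if value is None else str(value)
    cmd = [minimap2]
    for flag, value in options.items():
        cmd.append(flag)
        if value:
            cmd.append(value)
    cmd += [reference, assembly]  # minimap2 expects target first, then query
    return cmd
```

The returned list can be passed to `subprocess.run()` with standard output redirected to a file; minimap2 writes PAF to standard output by default, and that file could then be supplied as `pafin=PAFFILE`.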

### Processing options

* `forks=X` : Number of parallel sequences to process at once [`0`]
* `killforks=X` : Number of seconds of no activity before killing all remaining forks. [`36000`]
* `forksleep=X` : Sleep time (seconds) between cycles of forking out more processes [`0`]

## History

Module version history:

* 0.0.0 - Initial Compilation.
* 0.1.0 - Initial working version with basic documentation. Added `scaffold=T/F` as an option.
* 0.2.0 - Added `sorted=X` : Criterion for $ASSEMBLY scaffold sorting (QryLen/Coverage/RefStart/None) [QryLen]
* 0.2.1 - Added documentation and fixed setting of Minimap2 N and p.
* 0.3.0 - Added `pagsat=T/F` : Whether to output sequence names in special PAGSAT-compatible format [False]
* 0.4.0 - Added purity criteria for more stringent assignment to chromosomes.
* 0.4.1 - Fixed some issues with ambiguous scaffold output.
* 0.4.2 - Unplaced scaffold output bug fix for GitHub issue#2.
* 0.4.3 - Fixed the descriptions for Unplaced scaffolds in the summary table.
* 0.5.0 - Added `ctgprefix=X` : Unplaced contig prefix. Replaces `unplaced=X` when 0 gaps. [None]
* 0.6.0 - Added `busco=TSV` and `refbusco=TSV` as alternative to minimap2 linkages.
* 0.6.1 - Upgraded PAFScaff BUSCO mode to use Synteny blocks and not simply BUSCO genes.