SLiMSuite REST Server



PAFScaff V0.6.0

Pairwise mApping Format reference-based scaffold anchoring and super-scaffolding.

Module: PAFScaff
Description: Pairwise mApping Format reference-based scaffold anchoring and super-scaffolding.
Version: 0.6.0
Last Edit: 02/02/22
Citation: Field et al. (2020), GigaScience 9(4):giaa027.
GitHub: https://github.com/slimsuite/pafscaff

Copyright © 2019 Richard J. Edwards - See source code for GNU License Notice


Imported modules: rje rje_db rje_obj rje_paf rje_seqlist rje_sequence rje_rmd


See SLiMSuite Blog for further documentation. See rje for general commands.

Function

PAFScaff is designed for mapping genome assembly scaffolds to a closely-related chromosome-level reference genome assembly. It uses (or runs) [Minimap2](https://github.com/lh3/minimap2) to perform an efficient (if rough) all-against-all mapping, then parses the output to assign assembly scaffolds to reference chromosomes.
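As an illustration of the parsing step, the following minimal sketch sums aligned query bases per scaffold/chromosome pair from PAF lines. It relies only on the standard PAF column order (query name, length, start, end, strand, target name, ...). The function name `paf_coverage` and the example records are hypothetical, and overlapping alignments are simply summed here, which may differ from how PAFScaff itself calculates coverage.

```python
from collections import defaultdict


def paf_coverage(paf_lines):
    """Sum aligned query bases per (query, reference) pair from PAF lines.

    Illustrative assumption: overlapping hits are not merged, so a region
    aligned twice is counted twice.
    """
    coverage = defaultdict(int)
    for line in paf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        # PAF columns 1, 3, 4 and 6: query name, query start, query end, target name.
        qname, qstart, qend, tname = fields[0], int(fields[2]), int(fields[3]), fields[5]
        coverage[(qname, tname)] += qend - qstart
    return dict(coverage)


# Made-up PAF records for a single assembly scaffold hitting two chromosomes.
example = [
    "scafA\t1000\t0\t600\t+\tchr1\t5000\t100\t700\t580\t600\t60",
    "scafA\t1000\t700\t900\t+\tchr2\t4000\t50\t250\t190\t200\t60",
    "scafA\t1000\t600\t700\t+\tchr1\t5000\t700\t800\t95\t100\t60",
]
print(paf_coverage(example))
```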

Mapping is based on minimap2-aligned assembly scaffold ("Query") coverage against the reference chromosomes. Scaffolds are "placed" on the reference scaffold with the most coverage. Any scaffolds failing to map onto any chromosome are rated as "Unplaced". For each reference chromosome, PAFScaff then "anchors" placed assembly scaffolds, starting with the longest assembly scaffold. Each remaining placed scaffold is then assessed in order of decreasing scaffold length. Any scaffolds that do not overlap with already anchored scaffolds, in terms of the Reference chromosome positions they map onto, are also considered "Anchored". If newprefix=X is set, scaffolds are renamed with the Reference chromosome they map onto. The original scaffold name and mapping details are included in the description. Unplaced scaffolds are not renamed.
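The placement and anchoring logic described above can be sketched roughly as follows. This is a simplified illustration, not PAFScaff's implementation: the function names, the tuple layout and the half-open reference coordinates are assumptions, and the minmap/minpurity filters are ignored.

```python
def place_scaffolds(coverage):
    """Place each query scaffold on the reference with the most coverage.

    coverage: dict mapping (query, reference) -> covered bases (hypothetical layout).
    """
    best = {}
    for (qname, tname), bases in coverage.items():
        if qname not in best or bases > best[qname][1]:
            best[qname] = (tname, bases)
    return {qname: tname for qname, (tname, _) in best.items()}


def anchor_scaffolds(placed):
    """Greedily anchor scaffolds placed on one reference chromosome.

    placed: list of (name, length, ref_start, ref_end) tuples, with
    half-open reference coordinates (an assumption of this sketch).
    """
    anchored = []
    # Assess scaffolds from longest to shortest.
    for hit in sorted(placed, key=lambda h: h[1], reverse=True):
        _, _, start, end = hit
        # Keep the scaffold only if it overlaps nothing anchored so far.
        if all(end <= a[2] or start >= a[3] for a in anchored):
            anchored.append(hit)
    # Report anchored scaffolds in reference order.
    return sorted(anchored, key=lambda h: h[2])


placement = place_scaffolds({("scafA", "chr1"): 700, ("scafA", "chr2"): 200, ("scafB", "chr2"): 50})
chr1_hits = [("s1", 900, 100, 1000), ("s2", 500, 800, 1300), ("s3", 300, 1400, 1700)]
anchored_names = [hit[0] for hit in anchor_scaffolds(chr1_hits)]
print(placement, anchored_names)
```

In this made-up example, `s2` overlaps the longer `s1` on the reference, so it remains placed but is not anchored, while `s3` is anchored.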

Finally, Anchored scaffolds are super-scaffolded by inserting gaps of NnNnNnNnNn sequence between anchored scaffolds. The lengths of these gaps are determined by the space between the reference positions, modified by overhanging query scaffold regions (min. length 10). The alternating case of these gaps makes them easy to identify later.
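A rough sketch of the gap insertion follows. The source does not give the exact formula, so subtracting the overhangs from the reference spacing is an assumption made for illustration only; the function names and the tuple layout are likewise hypothetical. The minimum gap length of 10 and the alternating NnNn pattern come from the description above.

```python
def gap_sequence(length):
    """Return an alternating-case gap (NnNn...) of exactly `length` characters."""
    return ("Nn" * ((length + 1) // 2))[:length]


def gap_length(prev_ref_end, next_ref_start, prev_overhang=0, next_overhang=0, min_gap=10):
    """Estimate a gap length from reference spacing, never shorter than min_gap.

    Assumption: unaligned scaffold ends (overhangs) are subtracted from the
    reference spacing. PAFScaff's actual adjustment may differ.
    """
    raw = next_ref_start - prev_ref_end - prev_overhang - next_overhang
    return max(min_gap, raw)


def super_scaffold(anchored):
    """Join anchored scaffolds with gaps.

    anchored: non-empty list of (sequence, ref_start, ref_end), sorted by ref_start.
    """
    pieces = [anchored[0][0]]
    for (_, _, prev_end), (seq, start, _) in zip(anchored, anchored[1:]):
        pieces.append(gap_sequence(gap_length(prev_end, start)))
        pieces.append(seq)
    return "".join(pieces)


joined = super_scaffold([("ACGT", 0, 4), ("TTGA", 16, 20)])
print(joined)
```

Because the gaps alternate case, they can later be distinguished from ordinary runs of N in the assembly, for example with a search for `Nn`.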

Output

PAFScaff outputs renamed, sorted and reoriented scaffolds in fasta format, along with mapping details:

  • *.anchored.fasta, *.placed.fasta and *.unplaced.fasta contain the relevant subsets of assembly scaffolds, renamed and/or reverse-complemented if appropriate.
  • *.scaffolds.fasta contains the super-scaffolded anchored scaffolds.
  • *.scaffolds.tdt contains the details of the PAFScaff mapping of scaffolds to chromosomes.
  • *.log contains run details, including any warnings or errors encountered.

NOTE: The precise ordering, orientation and naming of the output scaffolds depends on the settings for: refprefix=X newprefix=X sorted=T/F revcomp=T/F.

For full documentation of the PAFScaff workflow, run with dochtml=T and read the *.docs.html file generated.

Commandline

Input/Output options

pafin=PAFFILE : PAF generated from $REFERENCE $ASSEMBLY mapping; or run minimap2, or use busco [minimap2]
basefile=STR : Base for file outputs [PAFIN basefile]
seqin=FASFILE : Input genome assembly to map/scaffold onto $REFERENCE (minimap2 $ASSEMBLY) []
reference=FILE : Fasta (with accession numbers matching Locus IDs) ($REFERENCE) []
assembly=FASFILE : As seqin=FASFILE
busco=TSVFILE : BUSCO v5 full table (pafin=busco) [full_table_$BASEFILE.busco.tsv]
refbusco=TSVFILE : Reference BUSCO v5 full table [full_table_$REFBASE.busco.tsv]
refprefix=X : Reference chromosome prefix. If None, will use all $REFERENCE scaffolds [None]
newprefix=X : Assembly chromosome prefix. If None, will not rename $ASSEMBLY scaffolds [None]
unplaced=X : Unplaced scaffold prefix. If None, will not rename unplaced $ASSEMBLY scaffolds [None]
ctgprefix=X : Unplaced contig prefix. Replaces unplaced=X when 0 gaps. [None]
sorted=X : Criterion for $ASSEMBLY scaffold sorting (QryLen/Coverage/RefStart/None) [QryLen]
minmap=PERC : Minimum percentage mapping to a chromosome for assignment [0.0]
minpurity=PERC : Minimum percentage "purity" for assignment to Ref chromosome [50.0]
revcomp=T/F : Whether to reverse complement relevant scaffolds to maximise concordance [True]
scaffold=T/F : Whether to "anchor" non-overlapping scaffolds by Coverage and then scaffold [True]
dochtml=T/F : Generate HTML PAFScaff documentation (*.info.html) instead of main run [False]
pagsat=T/F : Whether to output sequence names in special PAGSAT-compatible format [False]
newchr=X : Prefix for short PAGSAT sequence identifiers [ctg]
spcode=X : Species code for renaming assembly sequences in PAGSAT mode [PAFSCAFF]

Mapping/Classification options

minimap2=PROG : Full path to run minimap2 [minimap2]
mmsecnum=INT : Max. number of secondary alignments to keep (minimap2 -N) [0]
mmpcut=NUM : Minimap2 Minimal secondary-to-primary score ratio to output secondary mappings (minimap2 -p) [0]
mapopt=CDICT : Dictionary of additional minimap2 options to apply (caution: over-rides conflicting settings) []
purebusco=T/F : Whether to keep BUSCO genes separate rather than generating synteny blocks [False]
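
To show how these settings relate to a minimap2 call, here is a minimal sketch that assembles a command line from them. The -N and -p flags are the ones named in the option descriptions above; the function name, the dictionary format used for mapopt, and the exact command that PAFScaff constructs internally are assumptions. The command is only built, not executed.

```python
def minimap2_command(reference, assembly, minimap2="minimap2",
                     mmsecnum=0, mmpcut=0, mapopt=None):
    """Build (but do not run) a minimap2 command as a list of arguments.

    Hypothetical helper: mapopt is taken to be a dict of flag letter -> value,
    and its entries override mmsecnum/mmpcut when they conflict.
    """
    options = {"N": mmsecnum, "p": mmpcut}
    options.update(mapopt or {})
    cmd = [minimap2]
    for flag, value in options.items():
        cmd += ["-" + flag, str(value)]
    # minimap2 takes the target (reference) first, then the query (assembly),
    # and writes PAF to standard output by default.
    cmd += [reference, assembly]
    return cmd


default_cmd = minimap2_command("ref.fasta", "assembly.fasta")
custom_cmd = minimap2_command("ref.fasta", "assembly.fasta", mapopt={"x": "asm5", "N": 5})
print(" ".join(default_cmd))
print(" ".join(custom_cmd))
```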

Processing options

forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more processes [0]


History

Module Version History

    # 0.0.0 - Initial Compilation.
    # 0.1.0 - Initial working version with basic documentation. Added scaffold=T/F as an option.
    # 0.2.0 - Added sorted=X : Criterion for $ASSEMBLY scaffold sorting (QryLen/Coverage/RefStart/None) [QryLen]
    # 0.2.1 - Added documentation and fixed setting of Minimap2 N and p.
    # 0.3.0 - Added pagsat=T/F : Whether to output sequence names in special PAGSAT-compatible format [False]
    # 0.4.0 - Added purity criteria for more stringent assignment to chromosomes.
    # 0.4.1 - Fixed some issues with ambiguous scaffold output.
    # 0.4.2 - Unplaced scaffold output bug fix for GitHub issue#2.
    # 0.4.3 - Fixed the descriptions for Unplaced scaffolds in the summary table.
    # 0.5.0 - Added ctgprefix=X : Unplaced contig prefix. Replaces unplaced=X when 0 gaps. [None]
    # 0.6.0 - Added busco=TSV and refbusco=TSV as alternative to minimap2 linkages
    # 0.6.1 - Upgraded PAFScaff BUSCO mode to use Synteny blocks and not simply BUSCO genes.

© 2015 RJ Edwards. Contact: richard.edwards@unsw.edu.au.