Function
This module is designed to use Minimap alignments in PAF format (with the --cs flag) to replace blastn for all-by-all
genome assembly comparisons for PAGSAT etc.
Input is a PAF file with --cs flag, produced by minimap:
minimap2 --cs -N 100 -p 0.0001 -x asm20 $REFERENCE $ASSEMBLY > $PAFILE
Setting pafin=minimap
will run minimap2 and generate $BASEFILE.paf output, which will then be processed.
Output is a series of tables that emulate GABLAM output:
hitsum.tdt
= summary of hits per query sequence.
gablam.tdt
= summary stats for each query-hit pair.
local.tdt
= local hits table. Will include alignments if localaln=T
qryunique.tdt
= local hits table reduced to unique coverage of query ($ASSEMBLY) sequences (qryunique=T
).
hitunique.tdt
= local hits table reduced to unique coverage of hit ($REFERENCE) sequences (hitunique=T
).
NOTE: When Snapper runs rje_paf it inverts the reference and assembly so that the query is the reference.
The default functionality is to emulate GABLAM (and thereby replace blast+ in GABLAM with the mapper=minimap
flag).
However, excess fields can be reduced with mockblast=T
. Additional fields are output either way.
Commandline
Input/Output options
pafin=PAFFILE
: PAF generated from $REFERENCE $ASSEMBLY mapping []
basefile=STR
: Base for file outputs [PAFIN basefile
]
seqin=FASFILE
: Input genome to identify variants in ($ASSEMBLY) []
reference=FILE
: Fasta (with accession numbers matching Locus IDs) ($REFERENCE) []
assembly=FASFILE
: As seqin=FASFILE
uniqueout=T/F
: Whether to output *.qryunique.tdt and *.hitunique.tdt tables of unique coverage [True
]
uniquehit=T/F
: Option to use *.hitunique.tdt table of unique coverage for GABLAM coverage stats [False
]
alnseq=T/F
: Whether to use alnseq-based processing (True) or CS-Gstring processing [False
]
localaln=T/F
: Whether to output local alignments in Local Table [False
]
mockblast=T/F
: Whether to output mock BLAST headers even when not appropriate [True
]
Minimap2 run/mapping options
minlocid=PERC
: Minimum percentage identity for aligned chunk to be kept (local %identity) [0
]
minloclen=INT
: Minimum length for aligned chunk to be kept (local hit length in bp) [0
]
endextend=X
: Extend minimap2 hits to end of sequence if query region with X bp of end [0
]
minimap2=PROG
: Full path to run minimap2 [minimap2
]
mapopt=CDICT
: Dictionary of minimap2 options [N:100,p:0.0001,x:asm20
]
mapsplice=T/F
: Switch default minimap2 options to -x splice -uf -C5
[False
]
reads=FILELIST
: List of fasta/fastq read files for minimap2 mapping to BAM. Wildcard allowed. Can be gzipped. []
readtype=LIST
: List of ont/pb/hifi file types matching reads for minimap2 mapping [ont
]
Coverage checking mode (dev only)
checkpos=TDTFILE
: File of Locus, Start, End positions for read coverage checking [None
]
checkfields=LIST
: Fields in checkpos file to give Locus, Start and End for checking [Locus,Start,End
]
checkflanks=LIST
: List of lengths flanking check regions that must also be spanned by reads [0,100,1000,5000
]
spanid=X
: Generate sets of read IDs that span checkpos regions, grouped by values of field X []
Variant mode (dev only)
snptableout=T/F
: Generated output of filtered variants to SNP Table [False
]
qcut=X
: Min. quality score for a mapped read to be included [0
]
minqn=X
: Min. number of non-N reads meeting qcut for output (after indel filtering) [10
]
rid=T/F
: Whether to include Read ID (number) lists for each allele [True
]
readnames=T/F
: Output the read names to the RID file [False
]
indels=T/F
: Whether to include indels in "SNP" parsing [True
]
skiploci=LIST
: List of loci to exclude from pileup parsing (e.g. mitochondria) []
mincut=X
: Minimum read count for minor allele (proportion if <1) [0.05
]
absmincut=X
: Absolute minimum read count for minor allele (used if mincut<1) [2
]
biallelic=T/F
: Whether to restrict SNPs to pure biallelic SNPs (two alleles meeting mincut) [False
]
ignoren=T/F
: Whether to exclude "N" calls from alleles [True
]
ignoreref=T/F
: If False will always keep Reference allele and call fixed change as SNP [True
]
Processing options
forks=X
: Number of parallel sequences to process at once [0
]
killforks=X
: Number of seconds of no activity before killing all remaining forks. [36000
]
forksleep=X
: Sleep time (seconds) between cycles of forking out more process [0
]
History Module Version History
# 0.0.0 - Initial Compilation.
# 0.1.0 - Initial working version. Compatible with GABLAM v2.30.0 and Snapper v1.7.0.
# 0.2.0 - Added endextend=X : Extend minimap2 hits to end of sequence if with X bp [10]
# 0.3.0 - Added mapsplice mode for dealing with transcript mapping.
# 0.3.1 - Correct PAF splicing bug.
# 0.4.0 - Added TmpDir and forking for GABLAM conversion.
# 0.5.0 - Added uniquehit=T/F : Option to use *.hitunique.tdt table of unique coverage for GABLAM coverage stats [False]
# 0.6.0 - Added CS alignment manipulation methods.
# 0.6.1 - Added additional error-handling for CS parsing errors.
# 0.7.0 - Added alnseq=T/F : Whether to use alnseq-based processing (True) or CS-Gstring processing (dev only) [False]
# 0.7.1 - Disabled endextend due to bug.
# 0.7.2 - Fixed alnlen bug - minimap2 ignore Ns for alignment length calculation. Fixed endextend bug.
# 0.7.3 - Added minlocid and minloclen filtering to PAF parsing to speed up processing. Set default alnseq=F.
# 0.8.0 - Added developmental variant calling options.
# 0.9.0 - Added long-read mapping to BAM option.
# 0.10.0 - Added longreadMinimapPAF() and checkpos=TDTFILE options.
# 0.10.1 - Added spanid=X: Generate sets of read IDs that span checkpos regions, based on values of field X []
# 0.10.2 - Fixed formatting for Python 2.6 back compatibility for servers.
# 0.10.3 - Fixing issues of PAF files not being generated.
# 0.11.0 - Added HiFi read type.
# 0.12.0 - Added readnames=T/F : Output the read names to the RID file [False]
# 0.13.0 - Updated default -x to asm20 (5% divergence). Use minimap2 map-hifi setting for hifi read mapping.
rje_paf REST Output formats
Run with
&rest=docs
for program documentation and options. A plain text version is accessed with
&rest=help
.
&rest=OUTFMT
can be used to retrieve individual parts of the output, matching the tabs in the default
(
&rest=format
) output. Individual
OUTFMT
elements can also be parsed from the full (
&rest=full
) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
Available REST Outputs
There is currently no specific help available on REST output for this program.