Function
This module is for taking one or two sequence datasets, peforming an intensive All by All BLAST and then tabulating
the results as a series of pairwise comparisons (*.gablam.*
):
Qry
= Query Short Name (or AccNum)
Hit
= Hit Short Name
Rank
= Rank of that Hit vs Query (based on Score)
Score
= BLAST Score (one-line)
E-Value
= BLAST E-value
QryLen
= Length of Query Sequence
HitLen
= Length of Hit Sequence
Qry_AlnLen
= Total length of local BLAST alignment fragments in Query (Unordered)
Qry_AlnID
= Number of Identical residues of Query aligned against Hit in local BLAST alignments (Unordered)
Qry_AlnSim
= Number of Similar residues of Query aligned against Hit in local BLAST alignments (Unordered)
Qry_OrderedAlnLen
= Total length of local BLAST alignment fragments in Query (Ordered)
Qry_OrderedAlnID
= Number of Identical residues of Query aligned against Hit in local BLAST alignments (Ordered)
Qry_OrderedAlnSim
= Number of Similar residues of Query aligned against Hit in local BLAST alignments (Ordered)
Hit_AlnLen
= Total length of local BLAST alignment fragments in Hit (Unordered)
Hit_AlnID
= Number of Identical residues of Hit aligned against Query in local BLAST alignments (Unordered)
Hit_AlnSim
= Number of Similar residues of Hit aligned against Query in local BLAST alignments (Unordered)
Hit_OrderedAlnLen
= Total length of local BLAST alignment fragments in Hit (Ordered)
Hit_OrderedAlnID
= Number of Identical residues of Hit aligned against Query in local BLAST alignments (Ordered)
Hit_OrderedAlnSim
= Number of Similar residues of Hit aligned against Query in local BLAST alignments (Unordered)
ALIGN_ID
= Number of Identical residues as determined by pairwise ALIGN
ALIGN_Len
= Length of pairwise ALIGN
When searchdb=FILE
is DNA (blastprog=blastn/tblastn/tblastx
), there will be additional Dirn
fields that indicate
whether the hits are on the forward strand (Fwd
), reverse strand (Bwd
) or a combination (Both
).
By default, all BLAST hits will return alignments. (blastv=N
blastb=N
, where N
is the size of searchdb
.) This can
be over-ridden by the blastv=X
and blastb=X
options to limit results to the top X
hits.
GABLAM will also produce a single table of summary statistics for all non-self hits (*.hitsum.*
) (self hits included
if selfhit=T
selfsum=T
):
Qry
= Query Short Name (or AccNum)
Hits
= Number of Hits
MaxScore
= Max non-self BLAST Score (one-line)
E-Value
= BLAST E-value for max score
If keepblast=T
then the blast results files themselves will not be deleted and will also be available as output -
or for re-running GABLAM faster on the same input. This can be performed even when the output basefile has changed
by giving the BLAST results file with blastres=FILE
.
Versions/Updates
V2.8: All-by-all search output
Version 2.8 onwards features explicit extra functionality for all-by-all searches, where the QueryDB (seqin=FILE
) and
SearchDB (searchdb=FILE
) are the same. (Failing to give a searchdb will run in this mode.)
V2.16: FullBLAST mode
Version 2.16.x introduces a new "fullblast" mode, which performs a full BLAST search (using forks=X
to set the number
of processors for the BLAST search) followed by the blastres=FILE
multiGABLAM processing. This should be faster for
large datasets but precludes any appending of results files. This is incompatible with the missing=LIST
advanced
update option. (missing=LIST
should only be required for aborted fullblast=F
runs.) By default, the blast file will
be named the same as the other output but the blastres=FILE
command can be used to read in results from a file with
a different name. If this file was not created with sequences in QueryDB and SearchDB, it will create errors.
V2.20: SNP Table & LocalUnique output
Version 2.20.x added a snptable=T/F
output to generate a SNP table (similar to MUMmer NUCmer output) with the
following fields: Locus
, Pos
, REF
, ALT
, AltLocus
, AltPos
. The localunique=T/F
controls whether hit
regions can be covered multiple times (False
) or (default:True
) reduced to unique "best" hits. Local hits are
sorted according to localsort=X
(default:Identity
). NOTE: Output is restricted to regions of overlap between
query and hit sequences. Regions of each that are not covered will be output in *.nocoverage.tdt
. See *.local.tdt
or *.unique.tdt
output for the regions covered. Normally, the reference genome will be seqin=FILE
and the genome
to compare against the reference will be searchdb=FILE
. NOTE: localmin
and localidmin
are applied to unique
regions as well as the original local hits table.
V2.26: Fragment fasta output
Version 2.26.x tidied up the fragfas=T
and combinedfas=T
output, which are now only available in combination when
fullblast=T
. The gablamfrag=X
parameter controls fragmentation within a given Query-Hit pair, i.e. local hits
within gablamfrag=X
positions of each other in the Hit will be merged. fragmerge=X
applies to different hit regions
when they are combined during combinedfas=T
output, and will merge regions with fragmerge=X
positions
irrespective of whether they are hits from the same query or different ones. Fragments are strand-specific and hits
to the reverse strand will be reverse-complemented by default. If fragrevcomp=F
then all local hits will be mapped
onto the forward strand, regardless of their original direction.
Default values are now gablamfrag=1
fragmerge=0
. This means that hits from the same query will merge if directly
adjacent and on the same strand, whereas hits from different queries will only merge if overlapping and on the same
strand. To restrict merging to overlaps of a specific minimum length, set the relevant parameter to a negative
number, e.g. fragmerge=-10
will require a 10 bp/aa overlap before merging. gablamfrag
can only be set to values
below 1 when used in combination with fullblast=T
.
The defaults for when GABLAM is run from the commandline have been set to fullblast=T
keepblast=T
. When GABLAM is
called from within another program, the fullblast=T/F
and keepblast=T/F
default status is set by that program.
Version 2.29: SAM and GFF3 output
Version 2.29.x tided SAM output and added GFF3 output options for the *.local.tdt
and *.unique.tdt
tables. A
special cdsgff=T/F
option was added for the latter, which will treat the input sequences as CDS sequences for the
unique table GFF3 output. This will activate the qryunique=T
option, which reduced local BLAST hits to non-
overlapping query regions rather than non-overlapping hit regions. This mode is only possible for blastn
searches.
Version 2.30: Minimap2 PAF parsing
Version 2.30.x introduces a developmental option to replace BLAST with Minimap2 for generating the main GABLAM
tables: HitSum, Local and GABLAM. Other GABLAM outputs are currently unavailable with this option.
Commandline
Input/Search Options
seqin=FILE
: Query dataset file [infile.fas
]
searchdb=FILE
: Database to search. [By default, same as seqin
]
blastres=FILE
: BLAST results file for input (over-rides seqin and searchdb) [None
]
fullblast=T/F
: Whether to perform full BLAST followed by blastres analysis [True
]
blastp=X
: Type of BLAST search to perform (blastx for DNA vs prot; tblastn for Prot vs DNA) [blastp
]
gablamcut=X
: Min. percentage value for a GABLAM stat to report hit (GABLAM from FASTA only) [0.0
]
cutstat=X
: Stat for gablamcut (eg. AlnLen or OrderedAlnSim. See Function docs for full list) [OrderedAlnID
]
cutfocus=X
: Focus for gablamcut. Can be Query/Hit/Either/Both. [Either
]
localcut=X
: Cut-off length for local alignments contributing to global GABLAM stats [0
]
localidcut=PERC
: Cut-off local %identity for local alignments contributing to global GABLAM stats [0.0
]
mapper=X
: Program to use for mapping files against each other (blast/minimap) [blast
]
mapopt=CDICT
: Dictionary of updated minimap2 options [N:250,p:0.0001,x:asm20
]
General Output Options
append=T/F
: Whether to append to output file or not. (Not available for blastres=FILE
or fullblast=F
) [False
]
fullres=T/F
: Whether to output full GABLAM results table [True
]
hitsum=T/F
: Whether to output the BLAST Hit Summary table [True
]
local=T/F
: Whether to output local alignment summary stats table [True
]
localaln=T/F
: Whether to output local alignments in Local Table [False
]
localsam=T/F
: Save local (and unique) hits data as SAM files in addition to TDT [False
]
localgff=T/F
: Save local (and unique) hits data as GFF3 files in addition to TDT [False
]
cdsgff=T/F
: Whether the query should be treated as the (in-frame) CDS of a gene (also sets qryunique=T
) [False
]
reftype=X
: Whether to map SAM/GFF3 hits onto the Qry, Hit, Both or Combined [Hit
]
qassemble=T/F
: Whether to fully assemble query stats from all hits in HitSum [False
]
localmin=X
: Minimum length of local alignment to output to local stats table [0
]
localidmin=PERC
: Minimum local %identity of local alignment to output to local stats table [0.0
]
localunique=T/F
: Reduce local hits to unique non-overlapping regions (*.unique.tdt) [snptable=T/F
]
qryunique=T/F
: Whether to reduce local hits to the query rather than the hit [False
]
localsort=X
: Local hit field used to sort local alignments for localunique reduction [Identity
]
snptable=T/F
: Generate a SNP table (similar to MUMmer NUCmer output) for query/hit overlap (fullblast=T
) [False
]
selfhit=T/F
: Whether to include self hits in the fullres output [True] * See also selfsum=T/F
*
selfsum=T/F
: Whether to also include self hits in hitsum output [False] * selfhit must also be T *
qryacc=T/F
: Whether to use the Accession Number rather than the short name for the Query [True
]
keepblast=T/F
: Whether to keep the blast results files rather than delete them [True
]
blastdir=PATH
: Path for blast results file (best used with keepblast=T
) [./
]
percres=T/F
: Whether output is a percentage figures (2d.p.) or absolute numbers [True
]
- Note that enough data is output to convert one into the other in other packages (for short sequences)
reduced=LIST
: List of terms that must be included in reduced output headers (e.g. Hit or Qry_Ordered) []
All-by-all Output Options
maxall=X
: Maximum number of sequences for all-by-all outputs [100
]
dismat=T/F
: Whether to output compiled distance matrix [True
]
diskey=X
: GABLAM Output Key to be used for distance matrix ['Qry_AlnID'
]
distrees=T/F
: Whether to generate UPGMA tree summaries of all-by-all distances [True
]
treeformats=LIST
: List of output formats for generated trees (see rje_tree.py) [nwk,text,png
]
disgraph=T/F
: Whether to output a graph representation of the distance matrix (edges = homology) [False
]
graphtypes=LIST
: Formats for graph outputs (svg/xgmml/png/html) [xgmml,png
]
clusters=T/F
: Whether to output a list of clusters based on shared BLAST homology [True
]
bycluster=X
: Generate separate trees and distance matrix for clusters of X+ sequences [0
]
clustersplit=X
: Threshold at which clusters will be split (e.g. must be < distance to cluster) [1.0
]
singletons=T/F
: Whether to include singleton in main tree and distance matrix [False
]
saveupc=T/F
: Whether to output a UPC file for SLiMSuite compatibility [False
]
Sequence output options
localalnfas=T/F
: Whether to output local alignments to *.local.fas fasta file (if local=T
) [False
]
fasout=T/F
: Output a fasta file per input sequence "ACCNUM.DBASE.fas" [False] (GABLAM from FASTA only)
fasdir=PATH
: Directory in which to save fasta files [BLASTFAS/
]
fragfas=T/F
: Whether to output fragmented Hits based on local alignments [False
]
fragrevcomp=T/F
: Whether to reverse-complement DNA fragments that are on reverse strand to query [True
]
gablamfrag=X
: Length of gaps between mapped residues for fragmenting local hits [0
]
fragmerge=X
: Max Length of gaps between fragmented local hits to merge [0
]
addflanks=X
: Add flanking regions of length X to fragmented hits [0
]
combinedfas=T/F
: Whether to generate a combined fasta file [False
]
dotplots=T/F
: Whether to use gablam.r to output dotplots. (Needs R installed and setup) [False
]
dotlocalmin=X
: Minimum length of local alignment to output to local hit dot plots [1000
]
Advanced/Obsolete Search/Output Options
mysql=T/F
: Whether to output column headers for mysql table build [False
]
missing=LIST
: This will go through and add missing results for AccNums in FILE (or list of AccNums X,Y,..) [None
]
startfrom=X
: Accession number to start from [None
]
alnstats=T/F
: Whether to output GABLAM stats or limit to one-line stats (blastb=0
) [True
]
posinfo=T/F
: Output the Start/End limits of the BLAST Hits [True
]
outstats=X
: Whether to output just GABLAM, GABLAMO or All [All
]
GABLAM Non-redundancy options. NOTE: These are different to rje_seq NR options.
nrseq=T/F
: Make sequences Non-Redundant following all-by-all. [False
]
nrcut=X
: Cut-off for non-redundancy filter, uses nrstat=X
for either query or hit [100.0
]
nrstat=X
: Stat for nrcut (eg. AlnLen or OrderedAlnSim. See above for full list) [OrderedAlnID
]
nrchoice=LIST
: Order of decisions for choosing NR sequence to keep. Otherwise keeps first sequence. (swiss/nonx/length/spec/name/acc/manual) [swiss,nonx,length
]
nrsamespec=T/F
: Non-Redundancy within same species only. [False
]
nrspec=LIST
: List of species codes in order of preference (good to bad) []
BLAST Options
blastpath+=PATH
: path for blast+ files [c:/bioware/blast+/] *Use fwd slashes
blastpath=PATH
: path for blast files [c:/bioware/blast/] *Use fwd slashes
blaste=X
: E-Value cut-off for BLAST searches (BLAST -e X) [1e-4
]
blastv=X
: Number of one-line hits per query (BLAST -v X) [500
]
blastb=X
: Number of hit alignments per query (BLAST -b X) [500
]
blastf=T/F
: Complexity Filter (BLAST -F X) [False
]
checktype=T/F
: Whether to check sequence types and BLAST program selection [True
]
Additional ALIGN Global Identity
globid=T/F
: Whether to output Global %ID using ALIGN [False
]
rankaln=X
: Perform ALIGN pairwise global alignment for top X hits [0
]
evalaln=X
: Perform ALIGN pairwise global alignment for all hits with e <= X [1000
]
alncut=X
: Perform ALIGN pairwise global alignment until < X %ID reached [0
]
Forking Options
noforks=T/F
: Whether to avoid forks [False
]
forks=X
: Number of parallel sequences to process at once [0
]
killforks=X
: Number of seconds of no activity before killing all remaining forks. [36000
]
History Module Version History
# 0.0 - Initial Compilation.
# 1.0 - First working version based on BAM 1.4
# 1.1 - Added blastres=FILE option
# 1.2 - Added percres=T/F option
# 1.3 - Added option to use a GABLAM stat as a cut-off
# 1.4 - Added more output options for webserver
# 2.0 - Major tidy up of module.
# 2.1 - Added DNA GABLAM
# 2.2 - Added reduced=LIST : List of terms that must be included in reduced output headers (e.g. Hit or Qry_Ordered)
# 2.3 - Added local alignment stat output.
# 2.4 - Added distance matrix output and visualisation.
# 2.5 - Miscellaneous cleanup and bug fixes. Updated output file names to use basefile.
# 2.6 - Added full name output for NSF tree.
# 2.7 - Added cluster output.
# 2.8 - Added graph output and replaced dispng and disnsf with distrees. Replaced dismatrix PNG with just tree.
# 2.9 - Added LocalMin and LocalCut for controlling how local alignments are output and/or contribute to GABLAM.
# 2.10- Added dot plot PNG output using gablam.r. Added sequence checking with rje_seqlist -> correct BLAST type.
# 2.11- Altered to use BLAST+ and rje_blast_V2.
# 2.12- Consolidated use of BLAST V2.
# 2.13- Fixed Protein vs DNA GABLAM. Modified sequence extraction to handle larger sequences. Add blastdir=PATH/.
# 2.14- Added checktype=T/F option to check sequence/BLAST type.
# 2.15.0 - Added seqnr function. Add run() method.
# 2.16.0 - Added fullblast=T/F : Whether to perform full BLAST followed by blastres analysis [False]
# 2.16.1 - Fixed a bug where the fullblast option was failing to return scores and evalues.
# 2.17.0 - Added localalnfas=T/F : Whether to output local alignments to *.local.fas fasta file (if local=T) [False]
# 2.17.1 - Fixed bug where query and hit lengths were not being output for fullblast.
# 2.18.0 - Added blaste filtering to be applied to existing BLAST results.
# 2.19.0 - Added maxall=X limits to all-by-all analyses. Added qassemble=T.
# 2.19.1 - Fixed handling of basefile and results generation for blastres=FILE.
# 2.19.2 - Modified output to be in rank order.
# 2.20.0 - Added SNP Table output.
# 2.21.0 - Added nocoverage Table output of regions missing from pairwise SNP Table.
# 2.21.1 - Added fragrevcomp=T/F : Whether to reverse-complement DNA fragments that are on reverse strand to query [True]
# 2.22.0 - Added description to HitSum table.
# 2.22.1 - Added localaln=T/F to keep local alignment sequences in the BLAST local Table.
# 2.22.2 - Fixed local output error. (Query/Qry issue - need to fix this and make consistent!)
# 2.22.3 - Fixed blastv and blastb error: limit also applies to individual pairwise hits!
# 2.23.0 - Divided GablamFrag and FragMerge.
# 2.23.1 - Added tuplekeys=T to cmd_list as default. (Can still be over-ridden if it breaks things!)
# 2.24.0 - Added localidmin and and localidcut as %identity versions of localmin and localcut. (Use for PAGSAT.)
# 2.25.0 - Added localsAM=T/F : Save local (and unique) hits data as SAM files in addition to TDT [False]
# 2.26.0 - Fixed fragfas output and clarified fullblast=T/F, localmin=X and localcut=X. Set fullblast=T keepblast=T.
# 2.26.1 - Fixed keepblast error.
# 2.26.2 - Fixed gablamcut fragfas filtering bug.
# 2.26.3 - Fixed nrseq=T to use Query OR Hit stat for NR filtering.
# 2.26.4 - Minor bug fix to nrchoice command parsing.
# 2.27.0 - Fragmerge no longer removes flanks and can be negative for enforced overlap!
# 2.28.0 - Added localidmin=PERC to localUnique (and thus Snapper).
# 2.28.1 - Fixed missing combinedfas when using existing blastres.
# 2.28.2 - Minor bug fix for NRSeq manual choice when i=-1.
# 2.28.3 - Fixed NRSeq query sorting bug.
# 2.29.0 - Added localGFF=T/F output.
# 2.30.0 - Added mapper=X : Program to use for mapping files against each other (blast/minimap) [blast]
# 2.30.1 - Fixed BLAST LocalIDCut error for GABLAM and QAssemble stat filtering.
# 2.30.2 - Stopped BLAST program checks and FormatDB when mapper isn't BLAST.
# 2.30.3 - Minor tweak to NRSeq removal to work through GABLAM hits in reverse query size order.
# 2.30.4 - Removed duplication error messages when combining fasout results.
# 2.30.5 - Fixed missing BLAST database for sequence extraction.
GABLAM REST Output formats
Run with
&rest=docs
for program documentation and options. A plain text version is accessed with
&rest=help
.
&rest=OUTFMT
can be used to retrieve individual parts of the output, matching the tabs in the default
(
&rest=format
) output. Individual
OUTFMT
elements can also be parsed from the full (
&rest=full
) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
Available REST Outputs
seqin
=
searchdb
=
blastres
=
gablam
=
hitsim
=
local
=
unique
=
upc
=
fragfas
=
hitfas
=
nrseq
= Non-redundant sequence output [fas]
gff
=
sam
=