Function
HAQESAC is a tool designed for processing a dataset of potential homologues into a trustworthy gobal alignment and well
bootstrap-supported phylogeny. By default, an intial dataset consisting of homologues (detected by BLAST, for example)
is processed into a dataset consisting of the query protein and its orthologues in other species plus paralogous
subfamilies in the same gene family.
Individual sequences are therefore screened fulfil two criteria:
1. They must be homologous to (and alignable with) the query sequence of interest.
2. They must be a member of a subfamily within the gene family to which the query sequence belongs.
HAQ. The first stage of data cleanup is therefore to remove rogue sequences that either do not fit in the gene family at
all or are too distantly related to the query protein for a decent alignment that can be used for useful further
analysis. This is achieved firstly by a simple identity cut-off, determined by pairwise alignments of sequences, and
then by a more complex procedure of removing sequences for whom the overall alignability is poor. During this procedure,
sequences that have too many gaps are also removed as too many gapped residues can cause problems for downstream
evolutionary analyses. Further screening is achieved based on phylogenetic information.
ES. Once the dataset has been 'cleaned up' (and, indeed, during processing), HAQESAC can be used to assign sequences to
subgroups or subfamilies, if such information is needed for downstream analyses.
AC. The final step that HAQESAC is able to perform is ancestral sequence prediction using the GASP (Gapped Ancestral
Sequence Prediction) algorithm (Edwards & Shields 2005)
By default, HAQESAC will perform all these operations. However, it is possible to turn one or more off and only, for
example, reject individually badly aligned sequences, if desired. Details can be found in the Manual:
https://github.com/slimsuite/SLiMSuite/blob/master/docs/manuals/HAQESAC%20Manual.pdf.
HAQESAC outputs can be controlled by d=X
:
0 - *.fas, *.grp, *.nwk, *.tree.txt, *.png, *.anc.*
1 - *.bak (seqin), *.clean.fas, *.haq.fas, *.degap.fas
2 - *.saq.*.fas, *.paq.*.fas, *.scanseq.fas
3 - *.saqx.*.fas
Commandline Options
# General Dataset Input/Output Options #
seqin=FILE
: Loads sequences from FILE (fasta, phylip, aln, uniprot format, or list of fastacmd names for fasdb option)
query=X
: Selects query sequence by name (or part of name, e.g. Accession Number)
acclist=X
: Extract only AccNums in list. X can be FILE or list of AccNums X,Y,.. [None
]
fasdb=FILE
: Fasta format database to extract sequences from [None
]
basefile=X
: Basic 'root' for all files X.* [By default will use 'root' of seqin=FILE
if given or haq_AccNum if qblast]
v=X
: Sets verbosity (-1 for silent) [0
]
i=X
: Sets interactivity (-1 for full auto) [0
]
d=X
: Data output level (0-3, see docstring) [1
]
resdir=PATH
: Output directory for d>0 outputs [./HAQESAC/
]
log=FILE
: Redirect log to FILE [Default = calling_program.log or basefile.log
]
newlog=T/F
: Create new log file. [Default = False: append log file
]
multihaq=T/F
: If pickle present, will load and continue, else will part run then pickle and stop [False
]
# Pre-HAQESAC Data selection #
qseqfile=FILE
: Sequence file from which to extract query (query=X
) and peform BLAST [None
]
qblast=FILE
: BLAST database against which to BLAST query (see rje_blast.py options) [None] ['blaste=1e-7','blastv=1000,blastb=0
']
qfastacmd=X
: Extract query X from qblast database using fastacmd (may also need query=X
) [None
]
# Pre-HAQESAC Sequence Filtering #
gnspacc=T/F
: Convert sequences into gene_SPECIES__AccNum format wherever possible. [True]
backup=T/F
: Whether to backup initial fasta file. Will overwrite existing *.fas.bak. [True
]
accnr=T/F
: Check for redundant Accession Numbers/Names on loading sequences. [False
]
#!# filterseq=FILE
: Filters out sequences in given file (Log, Fasta or list of names)
#!# filterspec=FILE
: Filters out sequences according to species (codes) listed in given file
unkspec=T/F
: Whether sequences of unknown species are allowed [True
]
9spec=T/F
: Whether to treat 9XXXX species codes as actual species (generally higher taxa) [False
]
dblist=X,Y,..,Z
: List of databases in order of preference (good to bad)
dbonly=T/F
: Whether to only allow sequences from listed databases [False
]
#!# keepdesc=FILE
: Only keeps sequences with 1+ of text listed in given file in sequence description
minlen=X
: Minimum length of sequences [0
]
maxlen=X
: Maximum length of sequences (<=0 = No maximum) [0
]
goodX=LIST
: Filters where only sequences meeting the requirement of LIST are kept.
- LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms that, at least one of which must be in description line
badX=LIST
: As goodX but excludes rather than retains filtered sequences
# Pre-HAQ Data Cleanup #
cleanup=T/F
: Initial data cleanup [True
]
seqnr=T/F
: Make sequence Non-Redundant [True
]
specnr=T/F
: Non-Redundancy within same species only [True
]
nrid=X
: %Identity cut-off for Non-Redundancy [99.0
]
blastcut=X
: Maximum number of sequences to have in dataset (BLAST query against NR dataset.) [0
]
blaste=X
: E-Value cut-off for BLAST searches (BLAST -e X) [1e-4
]
blastv=X
: Number of one-line hits per query (BLAST -v X) [500
]
blastf=T/F
: Complexity Filter (BLAST -F X) [True
]
qcover=X
: Min. % (ordered) BLAST coverage of Query vs Hit or Hit vs Query [60.0
]
gablam=T/F
: Whether to use GABLAMO Global Alignment from BLAST Local Alignment Matrix (Ordered) rather than ALIGN [True
]
gabsim=T/F
: Whether to use %Similarity for GABLAMO comparisons, rather than %Identity [True
]
pairid=X
: Fasta Alignment ID cut-off for any pair of sequences [0.0
]
qryid=X
: Fasta Alignment ID with Query cut-off [40.0
]
maxgap=X
: Maximum proportion of sequence that may be gaps compared to nearest neighbour (<=0 = No maximum) [0.5
]
# General HAQ Options #
qregion=X,Y
: Concentrate on the region of the query from (and including) residue X to residue Y [1,-1
]
haq=T/F
: Homologue Alignment Quality [True
]
noquery=T/F
: No Query for SAQ, Random Query for PAQ (else query=X
or first sequence) [False
]
keep=T/F
: Keep all sequences (saqkl=0
, paqkl=0
) [False
]
usealn=T/F
: Use current alignment (do not realign, degap=F
) [False
]
cwcut=X
: Total number of residues above which to use ClustalW for alignment in place of alnprog=X
[0
]
haqmatrix=FILE
: File of AA vs AA scores used in SAQ and PAQ. [None
]
# Single Sequence Alignment Quality (SAQ) Options #
saq=T/F
: Single Sequence AQ [True
]
saqc=X
: Min score for a residue in SAQ (Default matrix: no. seqs to share residue). [2
]
saqb=X
: SAQ Block length. [10
]
saqm=X
: No. residues to match in SAQ Block. [7
]
saqks=X
: Relative Weighting of keeping Sequences in SAQ. [3
]
saqkl=X
: Relative Weighting of keeping Length in SAQ. [1
]
mansaq=T/F
: Manual over-ride of sequence rejection decisions in SAQ [False
]
# Pairwise Alignment Quality (PAQ) Options #
prepaq=T/F
: PrePAQ tree grouping [True
]
paq=T/F
: Pairwise AQ [True
]
paqb=X
: PAQ Block length. [7
]
paqm=X
: Min score in PAQ Block (Default matrix: No. residues to match). [3
]
paqks=X
: Relative Weighting of keeping Sequences in PAQ. [3
]
paqkl=X
: Relative Weighting of keeping Length in PAQ. [1
]
manpaq=T/F
: Manual over-ride of sequence rejection decisions in PAQ [False
]
# Establishment of Subfamilies (and PrePAQ) Options #
es=T/F
: Establishment of Subfamilies [True
]
root=X
: Rooting of tree (rje_tree.py), where X is:
- mid = midpoint root tree. [Default]
- ran = random branch.
- ranwt = random branch, weighted by branch lengths.
- man = always ask for rooting options (unless i<0).
- FILE = with seqs in FILE as outgroup. (Any option other than above)
bootcut=X
: cut-off percentage of tree bootstraps for grouping.
mfs=X
: minimum family size [3
]
fam=X
: minimum number of families (If 0, no subfam grouping) [0
]
orphan=T/F
: Whether orphans sequences (not in subfam) allowed. [True
]
allowvar=T/F
: Allow variants of same species within a group. [False
]
qryvar=T/F
: Keep variants of query species within a group (over-rides allowvar=F
). [False
]
keepvar=LIST
: Automatically keep variants for listed species (over-rides allowvar=F
). []
groupspec=X
: Species for duplication grouping [None
]
qspec=T/F
: Whether to highlight query species in PNG tree files [True
]
specdup=X
: Minimum number of different species in clade to be identified as a duplication [2
]
group=X
: Grouping of tree
- man = manual grouping (unless i<0).
- dup = duplication (all species unless groupspec specified).
- qry = duplication with species of Query sequence (or Sequence 1) of treeseq
- one = all sequences in one group
- None = no group (case sensitive)
- FILE = load groups from file
phyoptions=FILE
: File containing extra Phylip tree-making options ('batch running') to use [None
]
protdist=FILE
: File containing extra Phylip PROTDIST ('batch running') to use [None
]
maketree=X
: Program for making tree [None
]
- None = Do not make tree from sequences
- clustalw = ClustalW NJ method
- neighbor = PHYLIP NJ method (NB. Bootstraps not yet supported)
- upgma = PHYLIP UPGMA (neighbor) method (NB. Bootstraps not yet supported)
- fitch = PHYLIP Fitch method (NB. Bootstraps not yet supported)
- kitsch = PHYLIP Kitsch (clock) method (NB. Bootstraps not yet supported)
- protpars = PHYLIP MP method (NB. Bootstraps not yet supported)
- proml = PHYLIP ML method (NB. Bootstraps not yet supported)
- PATH = Alternatively, a path to a different tree program/script can be given. This should accept ClustalW parameters.
# Ancestor Construction (GASP) Options
ac=T/F
: Ancestor Construction (GASP) [True
]
pamfile=FILE
: Sets PAM1 input file [jones.pam
]
pammax=X
: Initial maximum PAM matrix to generate [100
]
pamcut=X
: Absolute maximum PAM matrix [1000
]
fixpam=X
: PAM distance fixed to X [0].
rarecut=X
: Rare aa cut-off [0.05].
fixup=T/F
: Fix AAs on way up (keep probabilities) [True].
fixdown=T/F
: Fix AAs on initial pass down tree [False].
ordered=T/F
: Order ancestral sequence output by node number [False].
pamtree=T/F
: Calculate and output ancestral tree with PAM distances [True].
desconly=T/F
: Limits ancestral AAs to those found in descendants [True].
xpass=X
: How many extra passes to make down & up tree after initial GASP [1].
# System Info Options #
See rje_seq and general commandline options for external program system path settings.
History Module Version History
# 0.0 - Initial Compilation.
# 0.1 - Added clustalw for early stages
# 1.0 - Fully working version. (Improvements still to be made!) Added BLAST to front-end.
# 1.1 - Added backup.
# 1.2 - Added basename and interactive menu to start if nothing given
# 1.3 - Added specification of query subportion. Changed scoring to use sub matrix. Added GASP.
# 1.4 - Added option to use BLAST coverage rather than ALIGN
# 1.5 - Added PHYLIP for tree drawing (rje_tree.py 1.6)
# 1.6 - Added multiHAQ for more rapid treatment of multiple consecutive datasets (and "pickup")
# 1.7 - Added qryvar=T/F : Allow variants of query species within a group (over-rides allowvar=F). [False]
# |---- Fixed minor typo.
# 1.8 - Added qspec tree option
# 1.9 - Added rje_blast_V2 implementation and BLAST+. Use oldblast=T for old BLAST.
# 1.10- Added exceptions for BLAST failure.
# 1.10.1 - Tweaked QryVar interactivity.
# 1.10.2 - Corrected typos and disabled buggy post-HAQESAC data reduction.
# 1.10.3 - Added catching of bad query when i=-1.
# 1.11.0 - Added resdir=PATH [./HAQESAC/] for d>0 outputs.
# 1.12.0 - 9spec=T/F : Whether to treat 9XXXX species codes as actual species (generally higher taxa) [False]
# 1.13.0 - Modified qregion=X,Y to be 1-L numbering.
# 1.14.0 - Added keepvar=LIST to enable auto-retention of vairants for multiple species
HAQESAC REST Output formats
Run with
&rest=docs
for program documentation and options. A plain text version is accessed with
&rest=help
.
&rest=OUTFMT
can be used to retrieve individual parts of the output, matching the tabs in the default
(
&rest=format
) output. Individual
OUTFMT
elements can also be parsed from the full (
&rest=full
) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
Available REST Outputs
There is currently no specific help available on REST output for this program.