Function
This script is designed to take in two sequences files and generate datasets of orthologous sequence alignments.
The first [seqin] sequence set is the 'queries' around which orthologous datasets are to be assembled. This is now
optimised for a dataset consisting of one protein per protein-coding gene, although splice variants should be dealt
with OK and treated as paralogues. This will only cause problems if the postdup=T
option is used, which restricts
orthologues returned to be within the last post-duplication clade for the sequence.
The second [orthdb] is the list of proteins from which the orthologues will be extracted. The seqin sequences are
then BLASTed against the orthdb and processed (see below) to retain putative orthologues using an estimation of the
phylogenetic relationships based on pairwise sequences similarities.
NB. As of version 2.0, gopher=FILE
has been replaced with seqin=FILE
for greater rje python consistency. The allqry
option has been removed. Please cleanup the input data into a desired non-redundant dataset before running GOPHER.
(In many ways, GOPHER's strength is it's capacity to be run for a single sequence of interest rather than a whole
genome, and it is this functionality that has been concentrated on for use with PRESTO and SLiM Pickings etc.) The
output of statistics for each GOPHER run has also been discontinued for now but may be reintroduced with future
versions. The phosalign command (to produce a table of potential phosphorylation sites (e.g. S,T,Y) across
orthologues for special conservation of phosphorylation prediction analyses) has also been discontinued for now.
Version 2.1 has tightened up on the use of rje_seq parameters that were causing trouble otherwise. It is now the
responsibility of the user to make sure that the orthologue database meets the desired criteria. Duplicate accession
numbers will not be tolerated by GOPHER and (arbitrary) duplicates will be deleted if the sequences are the same, or
renamed otherwise. Renaming may cause problems later. It is highly desirable not to have two proteins with the same
accession number but different amino acid sequences. The following commands are added to the rje_seq object when input
is read: accnr=T
unkspec=F
specnr=F
gnspacc=T
. Note that unknown species are also not permitted.
The process for dataset assembly is as follows for each protein :
1. BLAST against orthdb [orthblast]
> BLASTs saved in BLAST/AccNum.blast
2. Work through BLAST hits, indentifying paralogues (query species duplicates) and the closest homologue from each
other species. This involves a second BLAST of the query versus original BLAST hits ([e-value=10]{cmd:e-value}
, no complexity
filter). The best sequence from each species is kept, i.e. the one with the best similarity to the query and not part
of a clade with any paralogue that excludes the query. (If postdup=T
, the hit must be in the query's post duplication
clade.) In addition hits: [orthfas]
* Must have minimum identity level with Query
* Must be one of the 'good species' [goodspec=LIST
]
> Save reduced sequences as ORTH/AccNum.orth.fas
> Save paralogues identified (and meeting minsim settings) in PARA/AccNum.para.fas
3. Align sequences with MUSCLE [orthalign]
> ALN/AccNum.orthaln.fas
4. Generate an unrooted tree with (ClustalW or PHYLIP) [orthtree]
> TREE/AccNum.orth.nsf
Optional paralogue/subfamily output: (These are best not used with Force=T
or FullForce=T
)
2a. Alignment of query protein and any paralogues >minsim threshold (paralign=T/F
). The parasplice=T/F
controls
whether splice variants are in these paralogue alignments (where identified using AccNum-X notation).
> PARALN/AccNum.paraln.fas
2b. Pairwise combinations of paralogues and their orthologues aligned, with "common" orthologues removed from the
dataset, with a rooted tree and group data for BADASP analysis etc. (parafam=T
)
> PARAFAM/AccNum+ParaAccNum.parafam.fas
> PARAFAM/AccNum+ParaAccNum.parafam.nsf
> PARAFAM/AccNum+ParaAccNum.parafam.grp
2c. Combined protein families consisting of a protein, all the paralogues > minsim and all orthologues for each in a
single dataset. Unaligned. (gopherfam=T
)
> SUBFAM/AccNum.subfam.fas
*NB.* The subfamily outputs involve Gopher calling itself to ensure the paralogues have gone through the Gopher
process themselves. This could potentially cause conflict if forking is used.
Commandline
### Basic Input/Output ###
seqin=FILE
: Fasta file of 'query' sequences for orthology discovery []
orthdb=FILE
: Fasta file with pool of sequences for orthology discovery []. Should contain query sequences.
startfrom=X
: Accession Number / ID to start from. (Enables restart after crash.) [None
]
dna=T/F
: Whether to analyse DNA sequences (not optimised) [False
]
### GOPHER run control parameters ###
orthblast : Run to blasting versus orthdb (Stage 1).
orthfas : Run to output of orthologues (Stage 2).
orthalign : Run to alignment of orthologues (Stage 3).
orthtree : Run to tree-generation (Stage 4). [default!]
### GOPHER Orthologue identifcation Parameters ###
postdup=T/F
: Whether to align only post-duplication sequences [False
]
reciprocal=T/F
: Use Reciprocal Best Hit method instead of standard GOPHER approach [False
]
minsim=X
: Minimum %similarity of Query for each "orthologue" [40.0
]
simfocus=X
: Style of similarity comparison used for MinSim and "Best" sequence identification [query
]
- query = %query must > minsim (Best if query is ultimate focus and maximises closeness of returned orthologues)
- hit = %hit must > minsim (Best if lots of sequence fragments are in searchdb and should be retained)
- either = %query > minsim OR %hit > minsim (Best if both above conditions are true)
- both = %query > minsim AND %hit > minsim (Most stringent setting)
gablamo=X
: GABLAMO measure to use for similarity measures [Sim
]
- ID = %Identity (from BLAST)
- Sim = %Similarity (from BLAST)
- Len = %Coverage (from BLAST)
- Score = BLAST BitScore (Reciprocal Match only)
goodX=LIST
: Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms that, at least one of which must be in description line
badX=LIST
: As goodX but excludes rather than retains filtered sequences
### Additional run control options ###
repair=T/F
: Repair mode - replace previous files if date mismatches or files missing.
(Skip missing files if False) [True]
force=T/F
: Whether to force execution at current level even if results are new enough [False
]
fullforce=T/F
: Whether to force current and previous execution even if results are new enough [False
]
dropout=T/F
: Whether to "drop out" at earlier phases, or continue with single sequence [False
]
ignoredate=T/F
: Ignores the age of files and only replaces if missing [False
]
savespace=T/F
: Save space by deleting intermediate blast files during orthfas [True
]
maxpara=X
: Maximum number of paralogues to consider (large gene families can cause problems) [50
]
### Additional Output Options ###
runpath=PATH
: Specify parent directory in which to output files [./
]
paralign=T/F
: Whether to produce paralogue alignments (>minsim) in PARALN/ (assuming run to orthfas+) [False
]
parasplice=T/F
: Whether splice variants (where identified) are counted as paralogues [False
]
parafam=T/F
: Whether to paralogue paired subfamily alignments (>minsim) (assuming run to orthfas+) [False
]
gopherfam=T/F
: Whether to combined paralogous gopher orthologues into protein families (>minsim) (assuming run to orthfas+) [False
]
sticky=T/F
: Switch on "Sticky Orthologous Group generation" [False] * Experimental *
stiggid=X
: Base for Stigg ID numbers [STIGG
]
History Module Version History
# 0.0 - Initial Working Compilation.
# 1.0 - Preliminary working version up to alignment stage
# 2.0 - Complete reworking of main _orthFas() method. See archived GOPHER 1.9 for history and obselete options.
# 2.1 - Updated the parafam etc. methods to work with the new ensloci data
# 2.2 - Added dna=T option.
# 2.3 - Added soaplab=T settings.
# 2.4 - Tidied up a bit.
# 2.5 - Added soaplab file cleanup.
# 2.6 - Replaced Query sequence with input to avoid shared ID problems.
# 2.7 - Added sticky Orthologue method and maxpara=X.
# 2.8 - Made dropout an option.
# 2.9 - Deleted oldStigg() method. Added simple Reciprocal Best Hit orthology prediction.
GOPHER REST Output formats
Run with
&rest=help
for general options. Run with
&rest=full
to get full server output as text or
&rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using
&rest=OUTFMT
for:
OrthBLAST outputs
qry
= Query file
blast
= BLAST results file
blastid
= BLAST hit ID file
Orthologue Identification
orthfas
= Unaligned fasta file of GOPHER orthologues
orthid
= ID list of GOPHER orthologues
Orthologue alignment
alnfas
= Aligned fasta file of GOPHER orthologues
Orthologue tree
nwk
= Tree of GOPHER orthologues in Newick format
tree
= Tree of GOPHER orthologues in plain text format
png
= Tree of GOPHER orthologues in PNG format
Paralogues
parafas
= Unaligned fasta file of GOPHER paralogues
paraid
= ID list of GOPHER paralogues
paralnfas
= Aligned fasta file of GOPHER paralogues
Special GOPHER REST call
Note that individual outputs can be requested for single proteins using the special REST call:
http://rest.slimsuite.unsw.edu.au/gopher&acc=X&spcode=X&rest=X
If this protein has already been processed by GOPHER, the relevant output will be returned directly as plain
text. If not, a jobid will be returned, which will have the desired output once run.