This script is designed to take in two sequences files and generate datasets of orthologous sequence alignments.
seqin sequence set is the 'queries' around which orthologous datasets are to be assembled. This is now
optimised for a dataset consisting of one protein per protein-coding gene, although splice variants should be dealt
with OK and treated as paralogues. This will only cause problems if the
postdup=T option is used, which restricts
orthologues returned to be within the last post-duplication clade for the sequence.
orthdb is the list of proteins from which the orthologues will be extracted. The
seqin sequences are
then BLASTed against the orthdb and processed (see below) to retain putative orthologues using an estimation of the
phylogenetic relationships based on pairwise sequences similarities.
NB. As of version 2.0,
gopher=FILE has been replaced with
seqin=FILE for greater rje python consistency. The
option has been removed. Please cleanup the input data into a desired non-redundant dataset before running GOPHER.
(In many ways, GOPHER's strength is it's capacity to be run for a single sequence of interest rather than a whole
genome, and it is this functionality that has been concentrated on for use with PRESTO and SLiM Pickings etc.) The
output of statistics for each GOPHER run has also been discontinued for now but may be reintroduced with future
phosalign command (to produce a table of potential phosphorylation sites (e.g. S,T,Y) across
orthologues for special conservation of phosphorylation prediction analyses) has also been discontinued for now.
Version 2.1 has tightened up on the use of rje_seq parameters that were causing trouble otherwise. It is now the
responsibility of the user to make sure that the orthologue database meets the desired criteria. Duplicate accession
numbers will not be tolerated by GOPHER and (arbitrary) duplicates will be deleted if the sequences are the same, or
renamed otherwise. Renaming may cause problems later. It is highly desirable not to have two proteins with the same
accession number but different amino acid sequences. The following commands are added to the rje_seq object when input
gnspacc=T. Note that unknown species are also not permitted.
Version 3.0 has improved directory organisation for multi-species and multi-orthdb GOPHER runs on the same system, in
line with the bioware.ucd.ie webserver. Additional data cleanup has been added too. (NB. Experimental Sticky GOPHER
runs will not work with
Organise=T.) The output directory is now set at the highest level by
organise=T then a subdirectory will be created within this directory named after the
orthdb (path and extension
stripped) and each species will have its own subdirectory within this in turn. By default, these are named using the
UniProt species codes, which are read from the input sequences. To use TaxIDs, the
unispec=FILE option must be used.
This should point to a file that has one line per UniProt species code, starting with the species code and ending
with :TaxID: where TaxID is a number. The file should be sorted by species code.
The process for dataset assembly is as follows for each protein :
1. BLAST against orthdb [
- * BLASTs saved in BLAST/AccNum.blast
2. Work through BLAST hits, indentifying paralogues (query species duplicates) and the closest homologue from each
other species. This involves a second BLAST of the query versus original BLAST hits (
blaste=10, no complexity
filter). The best sequence from each species is kept, i.e. the one with the best similarity to the query and not part
of a clade with any paralogue that excludes the query. (If
postdup=T, the hit must be in the query's post duplication
clade.) In addition hits: [
- * Must have minimum identity level with Query
- * Must be one of the 'good species' [
- * Save reduced sequences as ORTH/AccNum.orth.fas
- * Save paralogues identified (and meeting minsim settings) in PARA/AccNum.para.fas
3. Align sequences with MUSCLE [
4. Generate an unrooted tree with (ClustalW or PHYLIP) [
Optional paralogue/subfamily output: (These are best not used with
2a. Alignment of query protein and any paralogues >minsim threshold (
whether splice variants are in these paralogue alignments (where identified using AccNum-X notation).
- * PARALN/AccNum.paraln.fas
2b. Pairwise combinations of paralogues and their orthologues aligned, with "common" orthologues removed from the
dataset, with a rooted tree and group data for BADASP analysis etc. (
- * PARAFAM/AccNum+ParaAccNum.parafam.fas
- * PARAFAM/AccNum+ParaAccNum.parafam.nsf
- * PARAFAM/AccNum+ParaAccNum.parafam.grp
2c. Combined protein families consisting of a protein, all the paralogues > minsim and all orthologues for each in a
single dataset. Unaligned. (
- * SUBFAM/AccNum.subfam.fas
NB. The subfamily outputs involve Gopher calling itself to ensure the paralogues have gone through the Gopher
process themselves. This could potentially cause conflict if forking is used.
### Basic Input/Output ###
seqin=FILE : Fasta file of 'query' sequences for orthology discovery (over-rides
uniprotid=LIST : Extract IDs/AccNums in list from Uniprot into BASEFILE.fas and use as
orthdb=FILE : Fasta file with pool of sequences for orthology discovery . Should contain query sequences.
startfrom=X : Accession Number / ID to start from. (Enables restart after crash.) [
dna=T/F : Whether to analyse DNA sequences (not optimised) [
### GOPHER run control parameters ###
orthblast : Run to blasting versus orthdb (Stage 1).
orthfas : Run to output of orthologues (Stage 2).
orthalign : Run to alignment of orthologues (Stage 3).
orthtree : Run to tree-generation (Stage 4). [default!]
### GOPHER Orthologue identifcation Parameters ###
postdup=T/F : Whether to align only post-duplication sequences [
reciprocal=T/F : Use Reciprocal Best Hit method instead of standard GOPHER approach [
fullrbh=T/F : Whether RBH method should run BLAST searches for all potential RBH orthologues [
compfilter=T/F : Whether to use complexity filter and composition statistics for *initial* BLAST [
minsim=X : Minimum %similarity of Query for each "orthologue" [
simfocus=X : Style of similarity comparison used for MinSim and "Best" sequence identification [
- query = %query must > minsim (Best if query is ultimate focus and maximises closeness of returned orthologues)
- hit = %hit must > minsim (Best if lots of sequence fragments are in searchdb and should be retained)
- either = %query > minsim OR %hit > minsim (Best if both above conditions are true)
- both = %query > minsim AND %hit > minsim (Most stringent setting)
gablamo=X : GABLAMO measure to use for similarity measures [
- ID = %Identity (from BLAST)
- Sim = %Similarity (from BLAST)
- Len = %Coverage (from BLAST)
- Score = BLAST BitScore (Reciprocal Match only)
goodX=LIST : Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms that, at least one of which must be in description line
badX=LIST : As goodX but excludes rather than retains filtered sequences
### Additional run control options ###
repair=T/F : Repair mode - replace previous files if date mismatches or files missing.
(Skip missing files if False) [True]
force=T/F : Whether to force execution at current level even if results are new enough [
fullforce=T/F : Whether to force current and previous execution even if results are new enough [
dropout=T/F : Whether to "drop out" at earlier phases, or continue with single sequence [
ignoredate=T/F : Ignores the age of files and only replaces if missing [
savespace=T/F : Save space by deleting intermediate blast files during orthfas [
maxpara=X : Maximum number of paralogues to consider (large gene families can cause problems) [
oldblast=T/F : Run with old BLAST rather than BLAST+ [
### Additional Output Options ###
runpath=PATH : Directory from which to run GOPHER. (NB. Will look for input here unless full paths given) [
gopherdir=PATH : Parent directory for output of files [
organise=T/F : Output files according to orthdb an species (code or TaxaID - need conversion) [
unispec=FILE : UniProt species file for TaxaID conversion. Will use TaxaID instead of Species Code if given [
alnprog=X : Choice of alignment program to use (clustalw/clustalo/muscle/mafft/fsa) [
orthid=X : File identifier (Lower case) for orthology files [
paralign=T/F : Whether to produce paralogue alignments (>minsim) in PARALN/ (assuming run to orthfas+) [
parasplice=T/F : Whether splice variants (where identified) are counted as paralogues [
parafam=T/F : Whether to paralogue paired subfamily alignments (>minsim) (assuming run to orthfas+) [
gopherfam=T/F : Whether to combined paralogous gopher orthologues into protein families (>minsim) (assuming run to orthfas+) [
sticky=T/F : Switch on "Sticky Orthologous Group generation" [False] *** Experimental ***
stiggid=X : Base for Stigg ID numbers [
History Module Version History
# 3.0 - See archived GOPHER 1.9 and gopher_V2 2.9 for history and obselete options.
# 3.0 - Added organise=T/F and gopherdir=PATH for improved file organisation. Tightened savespace.
# 3.0 - Added compfilter=T/F for improved complexity filter and composition statistics control for *initial* BLAST.
# 3.0 - Changed default tree extension to *.nwk for compatibility with MEGA. Deleted _phosAlign() method.
# 3.0 - Added orthology ID option and alignment program to customise output further.
# 3.1 - Added full reciprocal best hit method. (fullrbh=T/F)
# 3.2 - Minor tweak to prevent unwanted directory generation for programs using existing GOPHER alignments.
# 3.3 - Added rje_blast_V2 to use BLAST+. Run with oldblast=T to stick with old NCBI BLAST.
# 3.4 - Fixed FullRBH paralogue duplication issue.
# 3.4.1 - Fixed stripXGap issue. (Why was this being implemented anyway?). Added REST output.
# 3.4.2 - Removed GOPHER System Exit on IOError to prevent breaking of REST server.
# 3.4.3 - Added checking and warning if no bootstraps for orthtree.
GOPHER REST Output formats
for general options. Run with
to get full server output as text or
for more user-friendly formatted output. Individual outputs can be identified/parsed using
## ~ OrthBLAST outputs ~ ##
= Query file
= BLAST results file
= BLAST hit ID file
## ~ Orthologue Identification ~ ##
= Unaligned fasta file of GOPHER orthologues
= ID list of GOPHER orthologues
## ~ Orthologue alignment ~ ##
= Aligned fasta file of GOPHER orthologues
## ~ Orthologue tree ~ ##
= Tree of GOPHER orthologues in Newick format
= Tree of GOPHER orthologues in plain text format
= Tree of GOPHER orthologues in PNG format
## ~ Paralogues ~ ##
= Unaligned fasta file of GOPHER paralogues
= ID list of GOPHER paralogues
= Aligned fasta file of GOPHER paralogues
Note that individual outputs can be requested for single proteins using the special REST call:
If this protein has already been processed by GOPHER, the relevant output will be returned directly as plain
text. If not, a jobid will be returned, which will have the desired output once run.