Program:	GOPHER
Description:	Generation of Orthologous Proteins from Homology-based Estimation of Relationships
Version:	3.5.4
Last Edit:	21/12/20
Citation:	Davey, Edwards & Shields (2007), Nucleic Acids Res. 35(Web Server issue):W455-9.
Manual:	http://bit.ly/GOPHERManual

Imported modules: rje rje_seq rje_sequence rje_tree rje_uniprot rje_blast_V2 rje_dismatrix_V2

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This script is designed to take in two sequences files and generate datasets of orthologous sequence alignments. The first seqin sequence set is the 'queries' around which orthologous datasets are to be assembled. This is now optimised for a dataset consisting of one protein per protein-coding gene, although splice variants should be dealt with OK and treated as paralogues. This will only cause problems if the postdup=T option is used, which restricts orthologues returned to be within the last post-duplication clade for the sequence.

The second orthdb is the list of proteins from which the orthologues will be extracted. The seqin sequences are then BLASTed against the orthdb and processed (see below) to retain putative orthologues using an estimation of the phylogenetic relationships based on pairwise sequences similarities.

NB. As of version 2.0, gopher=FILE has been replaced with seqin=FILE for greater rje python consistency. The allqry option has been removed. Please cleanup the input data into a desired non-redundant dataset before running GOPHER. (In many ways, GOPHER's strength is it's capacity to be run for a single sequence of interest rather than a whole genome, and it is this functionality that has been concentrated on for use with PRESTO and SLiM Pickings etc.) The output of statistics for each GOPHER run has also been discontinued for now but may be reintroduced with future versions. The phosalign command (to produce a table of potential phosphorylation sites (e.g. S,T,Y) across orthologues for special conservation of phosphorylation prediction analyses) has also been discontinued for now.

Version 2.1 has tightened up on the use of rje_seq parameters that were causing trouble otherwise. It is now the responsibility of the user to make sure that the orthologue database meets the desired criteria. Duplicate accession numbers will not be tolerated by GOPHER and (arbitrary) duplicates will be deleted if the sequences are the same, or renamed otherwise. Renaming may cause problems later. It is highly desirable not to have two proteins with the same accession number but different amino acid sequences. The following commands are added to the rje_seq object when input is read: accnr=T unkspec=F specnr=F gnspacc=T. Note that unknown species are also not permitted.

Version 3.0 has improved directory organisation for multi-species and multi-orthdb GOPHER runs on the same system, in line with the bioware.ucd.ie webserver. Additional data cleanup has been added too. (NB. Experimental Sticky GOPHER runs will not work with Organise=T.) The output directory is now set at the highest level by gopherdir=PATH. If organise=T then a subdirectory will be created within this directory named after the orthdb (path and extension stripped) and each species will have its own subdirectory within this in turn. By default, these are named using the UniProt species codes, which are read from the input sequences. To use TaxIDs, the unispec=FILE option must be used. This should point to a file that has one line per UniProt species code, starting with the species code and ending with :TaxID: where TaxID is a number. The file should be sorted by species code.

Processing

The process for dataset assembly is as follows for each protein :

1. BLAST against orthdb [orthblast]

BLASTs saved in BLAST/AccNum.blast

2. Work through BLAST hits, indentifying paralogues (query species duplicates) and the closest homologue from each
other species. This involves a second BLAST of the query versus original BLAST hits (blaste=10, no complexity
filter). The best sequence from each species is kept, i.e. the one with the best similarity to the query and not part
of a clade with any paralogue that excludes the query. (If postdup=T, the hit must be in the query's post duplication
clade.) In addition hits: [orthfas]

Must have minimum identity level with Query
Must be one of the 'good species' [goodspec=LIST]
Save reduced sequences as ORTH/AccNum.orth.fas
Save paralogues identified (and meeting minsim settings) in PARA/AccNum.para.fas

3. Align sequences with MUSCLE [orthalign]

ALN/AccNum.orthaln.fas

4. Generate an unrooted tree with (ClustalW or PHYLIP) [orthtree]

TREE/AccNum.orth.nsf

Optional paralogue/subfamily output: (These are best not used with Force=T or FullForce=T)
2a. Alignment of query protein and any paralogues >minsim threshold (paralign=T/F). The parasplice=T/F controls
whether splice variants are in these paralogue alignments (where identified using AccNum-X notation).

PARALN/AccNum.paraln.fas

2b. Pairwise combinations of paralogues and their orthologues aligned, with "common" orthologues removed from the
dataset, with a rooted tree and group data for BADASP analysis etc. (parafam=T)

PARAFAM/AccNum+ParaAccNum.parafam.fas
PARAFAM/AccNum+ParaAccNum.parafam.nsf
PARAFAM/AccNum+ParaAccNum.parafam.grp

2c. Combined protein families consisting of a protein, all the paralogues > minsim and all orthologues for each in a
single dataset. Unaligned. (gopherfam=T)

SUBFAM/AccNum.subfam.fas

NB. The subfamily outputs involve Gopher calling itself to ensure the paralogues have gone through the Gopher
process themselves. This could potentially cause conflict if forking is used.

Commandline

Basic Input/Output

seqin=FILE : Fasta file of 'query' sequences for orthology discovery (over-rides uniprotid=LIST) []
uniprotid=LIST : Extract IDs/AccNums in list from Uniprot into BASEFILE.fas and use as seqin=FILE. []
acc=LIST : Synonym for uniprotid=LIST []
orthdb=FILE : Fasta file with pool of sequences for orthology discovery []. Should contain query sequences.
startfrom=X : Accession Number / ID to start from. (Enables restart after crash.) [None]
dna=T/F : Whether to analyse DNA sequences (not optimised) [False]

GOPHER run control parameters

orthblast : Run to blasting versus orthdb (Stage 1).
orthfas : Run to output of orthologues (Stage 2).
orthalign : Run to alignment of orthologues (Stage 3).
orthtree : Run to tree-generation (Stage 4). [default!]

GOPHER Orthologue identifcation Parameters

postdup=T/F : Whether to align only post-duplication sequences [False]
reciprocal=T/F : Use Reciprocal Best Hit method instead of standard GOPHER approach [False]
fullrbh=T/F : Whether RBH method should run BLAST searches for all potential RBH orthologues [False]
compfilter=T/F : Whether to use complexity filter and composition statistics for *initial* BLAST [True]
minsim=X : Minimum %similarity of Query for each "orthologue" [40.0]
simfocus=X : Style of similarity comparison used for MinSim and "Best" sequence identification [query]
- query = %query must > minsim (Best if query is ultimate focus and maximises closeness of returned orthologues)
- hit = %hit must > minsim (Best if lots of sequence fragments are in searchdb and should be retained)
- either = %query > minsim OR %hit > minsim (Best if both above conditions are true)
- both = %query > minsim AND %hit > minsim (Most stringent setting)
gablamo=X : GABLAMO measure to use for similarity measures [Sim]
- ID = %Identity (from BLAST)
- Sim = %Similarity (from BLAST)
- Len = %Coverage (from BLAST)
- Score = BLAST BitScore (Reciprocal Match only)
goodX=LIST : Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms that, at least one of which must be in description line
badX=LIST : As goodX but excludes rather than retains filtered sequences

Additional run control options

repair=T/F : Repair mode - replace previous files if date mismatches or files missing.
(Skip missing files if False) [True]
force=T/F : Whether to force execution at current level even if results are new enough [False]
fullforce=T/F : Whether to force current and previous execution even if results are new enough [False]
dropout=T/F : Whether to "drop out" at earlier phases, or continue with single sequence [False]
ignoredate=T/F : Ignores the age of files and only replaces if missing [False]
savespace=T/F : Save space by deleting intermediate blast files during orthfas [True]
maxpara=X : Maximum number of paralogues to consider (large gene families can cause problems) [50]
oldblast=T/F : Run with old BLAST rather than BLAST+ [False]

Additional Output Options

runpath=PATH : Directory from which to run GOPHER. (NB. Will look for input here unless full paths given) [./]
gopherdir=PATH : Parent directory for output of files [./]
organise=T/F : Output files according to orthdb an species (code or TaxaID - need conversion) [True]
unispec=FILE : UniProt species file for TaxaID conversion. Will use TaxaID instead of Species Code if given [None]
alnprog=X : Choice of alignment program to use (clustalw/clustalo/muscle/mafft/fsa) [clustalo]
orthid=X : File identifier (Lower case) for orthology files [orth]
paralign=T/F : Whether to produce paralogue alignments (>minsim) in PARALN/ (assuming run to orthfas+) [False]
parasplice=T/F : Whether splice variants (where identified) are counted as paralogues [False]
parafam=T/F : Whether to paralogue paired subfamily alignments (>minsim) (assuming run to orthfas+) [False]
gopherfam=T/F : Whether to combined paralogous gopher orthologues into protein families (>minsim) (assuming run to orthfas+) [False]
sticky=T/F : Switch on "Sticky Orthologous Group generation" [False] * Experimental *
stiggid=X : Base for Stigg ID numbers [STIGG]

History Module Version History

# 3.0 - See archived GOPHER 1.9 and gopher_V2 2.9 for history and obselete options.
# 3.0 - Added organise=T/F and gopherdir=PATH for improved file organisation. Tightened savespace.
# 3.0 - Added compfilter=T/F for improved complexity filter and composition statistics control for *initial* BLAST.
# 3.0 - Changed default tree extension to *.nwk for compatibility with MEGA. Deleted _phosAlign() method.
# 3.0 - Added orthology ID option and alignment program to customise output further.
# 3.1 - Added full reciprocal best hit method. (fullrbh=T/F)
# 3.2 - Minor tweak to prevent unwanted directory generation for programs using existing GOPHER alignments.
# 3.3 - Added rje_blast_V2 to use BLAST+. Run with oldblast=T to stick with old NCBI BLAST.
# 3.4 - Fixed FullRBH paralogue duplication issue.
# 3.4.1 - Fixed stripXGap issue. (Why was this being implemented anyway?). Added REST output.
# 3.4.2 - Removed GOPHER System Exit on IOError to prevent breaking of REST server.
# 3.4.3 - Added checking and warning if no bootstraps for orthtree.
# 3.5.0 - Added separate outputs for trees with different alignment programs.
# 3.5.1 - Added capacity to run DNA GOPHER with tblastx. (Not tested!)
# 3.5.2 - Added acc=LIST as alias for uniprotid=LIST and updated docstring for REST to make it clear that rest=X needed.
# 3.5.3 - Removed bootstrap warning in wrong place.
# 3.5.4 - Added a try/except to catch some errors Norman was getting.

GOPHER REST Output formats

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT for:

OrthBLAST outputs

qry = Query file
blast = BLAST results file
blastid = BLAST hit ID file

Orthologue Identification

orthfas = Unaligned fasta file of GOPHER orthologues
orthid = ID list of GOPHER orthologues

Orthologue alignment

alnfas = Aligned fasta file of GOPHER orthologues

Orthologue tree

nwk = Tree of GOPHER orthologues in Newick format
tree = Tree of GOPHER orthologues in plain text format
png = Tree of GOPHER orthologues in PNG format

Paralogues

parafas = Unaligned fasta file of GOPHER paralogues
paraid = ID list of GOPHER paralogues
paralnfas = Aligned fasta file of GOPHER paralogues

Special GOPHER REST call

Note that individual outputs can be requested for single proteins using the special REST call:

http://rest.slimsuite.unsw.edu.au/gopher&acc=X&spcode=X&rest=X

If this protein has already been processed by GOPHER, the relevant output will be returned directly as plain
text. If not, a jobid will be returned, which will have the desired output once run.

SLiMSuite REST Server

GOPHER V3.5.4

Generation of Orthologous Proteins from Homology-based Estimation of Relationships