Program:	GOPHER
Description:	Generation of Orthologous Proteins from High-Throughput Estimation of Relationships
Version:	2.9
Last Edit:	28/11/12
Citation:	Davey, Edwards & Shields (2007), Nucleic Acids Res. 35(Web Server issue):W455-9.

Imported modules: rje rje_seq rje_tree rje_blast_V1 rje_dismatrix_V2

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This script is designed to take in two sequences files and generate datasets of orthologous sequence alignments. The first [seqin] sequence set is the 'queries' around which orthologous datasets are to be assembled. This is now optimised for a dataset consisting of one protein per protein-coding gene, although splice variants should be dealt with OK and treated as paralogues. This will only cause problems if the postdup=T option is used, which restricts orthologues returned to be within the last post-duplication clade for the sequence.

The second [orthdb] is the list of proteins from which the orthologues will be extracted. The seqin sequences are then BLASTed against the orthdb and processed (see below) to retain putative orthologues using an estimation of the phylogenetic relationships based on pairwise sequences similarities.

NB. As of version 2.0, gopher=FILE has been replaced with seqin=FILE for greater rje python consistency. The allqry option has been removed. Please cleanup the input data into a desired non-redundant dataset before running GOPHER. (In many ways, GOPHER's strength is it's capacity to be run for a single sequence of interest rather than a whole genome, and it is this functionality that has been concentrated on for use with PRESTO and SLiM Pickings etc.) The output of statistics for each GOPHER run has also been discontinued for now but may be reintroduced with future versions. The phosalign command (to produce a table of potential phosphorylation sites (e.g. S,T,Y) across orthologues for special conservation of phosphorylation prediction analyses) has also been discontinued for now.

Version 2.1 has tightened up on the use of rje_seq parameters that were causing trouble otherwise. It is now the responsibility of the user to make sure that the orthologue database meets the desired criteria. Duplicate accession numbers will not be tolerated by GOPHER and (arbitrary) duplicates will be deleted if the sequences are the same, or renamed otherwise. Renaming may cause problems later. It is highly desirable not to have two proteins with the same accession number but different amino acid sequences. The following commands are added to the rje_seq object when input is read: accnr=T unkspec=F specnr=F gnspacc=T. Note that unknown species are also not permitted.

The process for dataset assembly is as follows for each protein :

1. BLAST against orthdb [orthblast]
> BLASTs saved in BLAST/AccNum.blast 2. Work through BLAST hits, indentifying paralogues (query species duplicates) and the closest homologue from each other species. This involves a second BLAST of the query versus original BLAST hits ([e-value=10]{cmd:e-value}, no complexity filter). The best sequence from each species is kept, i.e. the one with the best similarity to the query and not part of a clade with any paralogue that excludes the query. (If postdup=T, the hit must be in the query's post duplication clade.) In addition hits: [orthfas]
* Must have minimum identity level with Query * Must be one of the 'good species' [goodspec=LIST]
> Save reduced sequences as ORTH/AccNum.orth.fas > Save paralogues identified (and meeting minsim settings) in PARA/AccNum.para.fas 3. Align sequences with MUSCLE [orthalign]
> ALN/AccNum.orthaln.fas 4. Generate an unrooted tree with (ClustalW or PHYLIP) [orthtree]
> TREE/AccNum.orth.nsf

Optional paralogue/subfamily output: (These are best not used with Force=T or FullForce=T) 2a. Alignment of query protein and any paralogues >minsim threshold (paralign=T/F). The parasplice=T/F controls whether splice variants are in these paralogue alignments (where identified using AccNum-X notation). > PARALN/AccNum.paraln.fas 2b. Pairwise combinations of paralogues and their orthologues aligned, with "common" orthologues removed from the dataset, with a rooted tree and group data for BADASP analysis etc. (parafam=T) > PARAFAM/AccNum+ParaAccNum.parafam.fas > PARAFAM/AccNum+ParaAccNum.parafam.nsf > PARAFAM/AccNum+ParaAccNum.parafam.grp 2c. Combined protein families consisting of a protein, all the paralogues > minsim and all orthologues for each in a single dataset. Unaligned. (gopherfam=T) > SUBFAM/AccNum.subfam.fas *NB.* The subfamily outputs involve Gopher calling itself to ensure the paralogues have gone through the Gopher process themselves. This could potentially cause conflict if forking is used.

Commandline

### Basic Input/Output ###
seqin=FILE : Fasta file of 'query' sequences for orthology discovery []
orthdb=FILE : Fasta file with pool of sequences for orthology discovery []. Should contain query sequences.
startfrom=X : Accession Number / ID to start from. (Enables restart after crash.) [None]
dna=T/F : Whether to analyse DNA sequences (not optimised) [False]
### GOPHER run control parameters ###
orthblast : Run to blasting versus orthdb (Stage 1).
orthfas : Run to output of orthologues (Stage 2).
orthalign : Run to alignment of orthologues (Stage 3).
orthtree : Run to tree-generation (Stage 4). [default!]
### GOPHER Orthologue identifcation Parameters ###
postdup=T/F : Whether to align only post-duplication sequences [False]
reciprocal=T/F : Use Reciprocal Best Hit method instead of standard GOPHER approach [False]
minsim=X : Minimum %similarity of Query for each "orthologue" [40.0]
simfocus=X : Style of similarity comparison used for MinSim and "Best" sequence identification [query]
- query = %query must > minsim (Best if query is ultimate focus and maximises closeness of returned orthologues)
- hit = %hit must > minsim (Best if lots of sequence fragments are in searchdb and should be retained)
- either = %query > minsim OR %hit > minsim (Best if both above conditions are true)
- both = %query > minsim AND %hit > minsim (Most stringent setting)
gablamo=X : GABLAMO measure to use for similarity measures [Sim]
- ID = %Identity (from BLAST)
- Sim = %Similarity (from BLAST)
- Len = %Coverage (from BLAST)
- Score = BLAST BitScore (Reciprocal Match only)
goodX=LIST : Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms that, at least one of which must be in description line
badX=LIST : As goodX but excludes rather than retains filtered sequences

### Additional run control options ###
repair=T/F : Repair mode - replace previous files if date mismatches or files missing.
(Skip missing files if False) [True]
force=T/F : Whether to force execution at current level even if results are new enough [False]
fullforce=T/F : Whether to force current and previous execution even if results are new enough [False]
dropout=T/F : Whether to "drop out" at earlier phases, or continue with single sequence [False]
ignoredate=T/F : Ignores the age of files and only replaces if missing [False]
savespace=T/F : Save space by deleting intermediate blast files during orthfas [True]
maxpara=X : Maximum number of paralogues to consider (large gene families can cause problems) [50]
### Additional Output Options ###
runpath=PATH : Specify parent directory in which to output files [./]
paralign=T/F : Whether to produce paralogue alignments (>minsim) in PARALN/ (assuming run to orthfas+) [False]
parasplice=T/F : Whether splice variants (where identified) are counted as paralogues [False]
parafam=T/F : Whether to paralogue paired subfamily alignments (>minsim) (assuming run to orthfas+) [False]
gopherfam=T/F : Whether to combined paralogous gopher orthologues into protein families (>minsim) (assuming run to orthfas+) [False]
sticky=T/F : Switch on "Sticky Orthologous Group generation" [False] * Experimental *
stiggid=X : Base for Stigg ID numbers [STIGG]

History Module Version History

    # 0.0 - Initial Working Compilation.
    # 1.0 - Preliminary working version up to alignment stage
    # 2.0 - Complete reworking of main _orthFas() method. See archived GOPHER 1.9 for history and obselete options.
    # 2.1 - Updated the parafam etc. methods to work with the new ensloci data
    # 2.2 - Added dna=T option.
    # 2.3 - Added soaplab=T settings.
    # 2.4 - Tidied up a bit.
    # 2.5 - Added soaplab file cleanup.
    # 2.6 - Replaced Query sequence with input to avoid shared ID problems.
    # 2.7 - Added sticky Orthologue method and maxpara=X.
    # 2.8 - Made dropout an option.
    # 2.9 - Deleted oldStigg() method. Added simple Reciprocal Best Hit orthology prediction.

GOPHER REST Output formats

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT for:

OrthBLAST outputs

qry = Query file
blast = BLAST results file
blastid = BLAST hit ID file

Orthologue Identification

orthfas = Unaligned fasta file of GOPHER orthologues
orthid = ID list of GOPHER orthologues

Orthologue alignment

alnfas = Aligned fasta file of GOPHER orthologues

Orthologue tree

nwk = Tree of GOPHER orthologues in Newick format
tree = Tree of GOPHER orthologues in plain text format
png = Tree of GOPHER orthologues in PNG format

Paralogues

parafas = Unaligned fasta file of GOPHER paralogues
paraid = ID list of GOPHER paralogues
paralnfas = Aligned fasta file of GOPHER paralogues

Special GOPHER REST call

Note that individual outputs can be requested for single proteins using the special REST call:

http://rest.slimsuite.unsw.edu.au/gopher&acc=X&spcode=X&rest=X

If this protein has already been processed by GOPHER, the relevant output will be returned directly as plain
text. If not, a jobid will be returned, which will have the desired output once run.

SLiMSuite REST Server

GOPHER V2.9

Generation of Orthologous Proteins from High-Throughput Estimation of Relationships