Program:	SLiMFinder
Description:	Short Linear Motif Finder
Version:	5.4.0
Last Edit:	24/05/19
Citation:	Edwards RJ, Davey NE & Shields DC (2007), PLoS ONE 2(10): e967.
ConsMask Citation:	Davey NE, Shields DC & Edwards RJ (2009), Bioinformatics 25(4): 443-50.
SigV/SigPrime Citation:	Davey NE, Edwards RJ & Shields DC (2010), BMC Bioinformatics 11: 14.
SLiMScape/REST Citation:	Olorin E, O'Brien KT, Palopoli N, Perez-Bercoff A, Shields DC & Edwards RJ (2015), F1000Research 4:477.
SLiMMaker Citation:	Palopoli N, Lythgow KT & Edwards RJ (2015), Bioinformatics 31(14): 2284-2293.
Webserver:	http://www.slimsuite.unsw.edu.au/servers/slimfinder.php
Manual:	http://bit.ly/SFManual

Imported modules: rje rje_seq rje_sequence rje_scoring rje_xgmml rje_slim rje_slimcalc rje_slimcore rje_slimlist slimmaker rje_motif_V3 comparimotif_V3 ned_rankbydistribution

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

SLiMFinder is an integrated SLiM discovery program building on the principles of the SLiMDisc software for accounting for evolutionary relationships [Davey NE, Shields DC & Edwards RJ (2006): Nucleic Acids Res. 34(12):3546-54]. SLiMFinder comprises two algorithms:

1. SLiMBuild identifies convergently evolved, short motifs in a dataset. Motifs with fixed amino acid positions are identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. Unlike programs such as TEIRESIAS, which return all shared patterns, SLiMBuild accelerates the process and reduces returned motifs by explicitly screening out motifs that do not occur in enough unrelated proteins. For this, SLiMBuild uses the "Unrelated Proteins" (UP) algorithm of SLiMDisc in which BLAST is used to identify pairwise relationships. Proteins are then clustered according to these relationships into "Unrelated Protein Clusters" (UPC), which are defined such that no protein in a UPC has a BLAST-detectable relationship with a protein in another UPC. If desired, SLiMBuild can be used as a replacement for TEIRESIAS in other software (teiresias=T slimchance=F).

2. SLiMChance estimates the probability of these motifs arising by chance, correcting for the size and composition of the dataset, and assigns a significance value to each motif. Motif occurrence probabilities are calculated independently for each UPC, adjusted for the size of a UPC using the Minimum Spanning Tree algorithm from SLiMDisc. These individual occurrence probabilities are then converted into the total probability of seeing the observed motifs the observed number of (unrelated) times. These probabilities assume that the motif is known before the search. In reality, only over-represented motifs from the dataset are looked at, so these probabilities are adjusted for the size of motif-space searched to give a significance value. The returned corrected probability is an estimate of the probability of seeing ANY motif with that significance (or greater) from the dataset (i.e. an estimate of the probability of seeing that motif, *or another one like it*). These values are calculated separately for each length of motif.
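The UPC definition in step 1 (no detectable relationship between proteins in different clusters) amounts to grouping proteins into connected components of the pairwise relationship graph. The following is a minimal Python sketch of that idea only. The function name and the input format (a list of related protein pairs, e.g. parsed from BLAST results) are assumptions for illustration and are not taken from the SLiMFinder code.

```python
from collections import defaultdict

def cluster_upc(proteins, related_pairs):
    """Group proteins so that no related pair is split across two clusters."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in related_pairs:
        # A detectable relationship merges the two clusters.
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for p in proteins:
        clusters[find(p)].append(p)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical input: A-B and B-C are related; D is unrelated to everything.
upc = cluster_upc(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
print(upc)
```

Note that A and C end up in the same cluster even though they are only linked through B, which is what the UPC definition requires.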

SLiMFinder version 4.0 introduced a more precise (but more computationally intensive) statistical model, which can be switched on using sigprime=T. Likewise, the more precise (but more computationally intensive) correction to the mean UPC probability heuristic can be switched on using sigv=T. (Note that the other SLiMChance options may not work with either of these options.) The allsig=T option will output all four scores. In this case, SigPrimeV will be used for ranking etc. unless probscore=X is used.
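As a rough illustration of the basic SLiMChance logic in step 2, the sketch below computes a binomial probability of seeing a motif in k or more of N UPC and then adjusts it for the number of motifs searched. This is a simplified sketch under stated assumptions, not the SLiMFinder implementation: it uses a single mean per-UPC occurrence probability, takes the size of the motif space as a given number, and ignores the MST size adjustment and the sigv/sigprime refinements. All function names and numbers are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def corrected_significance(prob, motif_space):
    """Probability that at least one of motif_space candidate motifs looks this extreme."""
    return 1.0 - (1.0 - prob) ** motif_space

# Hypothetical numbers: motif seen in 5 of 20 UPC, mean per-UPC chance of 0.01,
# and 1000 candidate motifs of this length in the search space.
p_raw = binomial_tail(5, 20, 0.01)
sig = corrected_significance(p_raw, 1000)
print(p_raw, sig)
```

The corrected value is always at least as large as the raw probability, reflecting the point made above that the motif was not known before the search.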

Clouds and Statistics

Where significant motifs are returned, SLiMFinder will group them into Motif "Clouds", which consist of physically overlapping motifs (2+ non-wildcard positions are the same in the same sequence). This provides an easy indication of which motifs may actually be variants of a larger SLiM and should therefore be considered together. From version V4.7, *.cloud.txt output includes a SLiMMaker summary Regex for the whole cloud. NOTE: This may not necessarily match all occurrences in the cloud.
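The cloud grouping can be pictured as linking any two motifs that share 2+ non-wildcard positions in the same sequence and then collecting linked motifs together. The sketch below assumes each motif is described by the set of (sequence, position) pairs covered by its non-wildcard positions. This data layout, the function name and the example motifs are illustrative assumptions rather than the internal SLiMFinder representation.

```python
from collections import Counter
from itertools import combinations

def build_clouds(fixed_sites, min_shared=2):
    """fixed_sites maps motif -> set of (sequence, position) for non-wildcard positions."""
    motifs = list(fixed_sites)
    cloud_of = {m: {m} for m in motifs}
    for a, b in combinations(motifs, 2):
        # Count shared non-wildcard positions separately for each sequence.
        shared = Counter(seq for seq, _pos in fixed_sites[a] & fixed_sites[b])
        overlapping = any(n >= min_shared for n in shared.values())
        if overlapping and cloud_of[a] is not cloud_of[b]:
            merged = cloud_of[a] | cloud_of[b]
            for m in merged:
                cloud_of[m] = merged
    unique = {id(c): c for c in cloud_of.values()}
    return sorted(sorted(c) for c in unique.values())

# Hypothetical occurrences: two proline motifs overlap in seq1; DLL is separate.
sites = {
    "P..P": {("seq1", 10), ("seq1", 13)},
    "P..P.[KR]": {("seq1", 10), ("seq1", 13), ("seq1", 15)},
    "DLL": {("seq2", 4), ("seq2", 5), ("seq2", 6)},
}
print(build_clouds(sites))
```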

Additional Motif Occurrence Statistics, such as motif conservation, are handled by the rje_slimlist and rje_slimcalc modules. Please see the documentation for these modules for a full list of commandline options. These options have not been fully tested in SLiMFinder, so please report issues and/or request desired functions. Note that occfilter=LIST *does* affect the motifs returned by SLiMBuild and thus the TEIRESIAS output (as does min. IC and min. Support) but the overall Motif slimfilter=LIST *only* affects SLiMFinder output following SLiMChance calculations.

Secondary Functions

The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.

The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.

Where pre-known motifs are also of interest, these can be given with the slimcheck=MOTIFS option and will be added to the output. In general, it is better to use SLiMProb to look for enrichment (or depletion) of pre-defined motifs.

Input/Output

SLiMFinder Input

The main input for SLiMFinder is the seqin=SEQFILE file of protein sequences, which can be Uniprot plain text
(DATFILE) or fasta (FASFILE) format. A batch of files (incorporating wildcards) can be given using
batch=FILELIST. The alternative primary input is uniprotid=LIST, which requires an active internet connection to
retrieve the corresponding Uniprot entries.
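Options are given on the commandline as key=value pairs, with T/F for switches and comma-separated values for lists (as in the batch=FILELIST default). The helper below is a small sketch that assembles such a commandline from Python values. The script name slimfinder.py and its location are assumptions, the helper itself is not part of SLiMSuite, and the two accession numbers are only example UniProt IDs.

```python
import shlex

def slimfinder_command(script="slimfinder.py", **options):
    """Build an argument list of key=value options (script path is an assumption)."""
    args = ["python", script]
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"  # switches are written as T/F
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)  # lists are comma-separated
        args.append(f"{key}={value}")
    return args

cmd = slimfinder_command(uniprotid=["P04637", "Q00987"], dismask=True, probcut=0.1)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run() once the script path has been adjusted to the local installation.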

SLiMFinder Output

Please see Manual for details.

Commandline

Basic Input/Output Options

seqin=SEQFILE : Sequence file to search. Over-rules batch=FILELIST and uniprotid=LIST [None]
batch=FILELIST : List of files to search, wildcards allowed. (Over-ruled by seqin=SEQFILE.) [*.dat,*.fas]
uniprotid=LIST : Extract IDs/AccNums in list from Uniprot into BASEFILE.dat and use as seqin=SEQFILE. []
maxseq=X : Maximum number of sequences to process [500]
maxupc=X : Maximum UPC size of dataset to process [0]
sizesort=X : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
walltime=X : Time in hours before program will abort search and exit [1.0]
resfile=FILE : Main SLiMFinder results table [slimfinder.csv]
resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [SLiMFinder/]
buildpath=PATH : Alternative path to look for existing intermediate files [SLiMFinder/]
force=T/F : Force re-running of BLAST, UPC generation and SLiMBuild [False]
pickup=T/F : Pick-up from aborted batch run by identifying datasets in resfile [False]
pickid=T/F : Whether to use RunID to identify run datasets when using pickup [True]
pickall=T/F : Whether to skip aborted runs (True) or only those datasets that ran to completion (False) [True]
dna=T/F : Whether the sequence files are DNA rather than protein [False]
alphabet=LIST : List of characters to include in search (e.g. AAs or NTs) [default AA or NT codes]
megaslim=FILE : Make/use precomputed results for a proteome (FILE) in fasta format [None]
megablam=T/F : Whether to create and use all-by-all GABLAM results for (gablamdis) UPC generation [False]
ptmlist=LIST : List of PTM letters to add to alphabet for analysis and restrict PTM data []
ptmdata=DSVFILE : File containing PTM data, including AccNum, ModType, ModPos, ModAA, ModCode

SLiMBuild

: Whether to use evolutionary filter [True]
: Use BLAST Complexity filter when determining relationships [True]
: BLAST e-value threshold for determining relationships [1e-4]
: Alternative all by all distance matrix for relationships [None]
gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
: Max number of homologues to allow (to reduce large multi-domain families) [0]
newupc=PATH : Look for alternative UPC file and calculate Significance using new clusters [None]
: Master control switch to turn off all masking if False [True]
dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [False]
consmask=T/F : Whether to use relative conservation masking [False]
ftmask=LIST : UniProt features to mask out (True=EM,DOMAIN,TRANSMEM) []
: UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
: Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
: Mask Upper or Lower case [None]
motifmask=X : List (or file) of motifs to mask from input sequences []
: Masks the N-terminal M (can be useful if termini=T) [True]
: Masks list of position-specific aas, where list = pos1:aas,pos2:aas [2:A]
: Masks list of AAs from all sequences (reduces alphabet) []
qregion=X,Y : Mask all but the region of the query from (and including) residue X to residue Y [1,-1]
termini=T/F : Whether to add termini characters (^ & $) to search sequences [True]
minwild=X : Minimum number of consecutive wildcard positions to allow [0]
maxwild=X : Maximum number of consecutive wildcard positions to allow [2]
slimlen=X : Maximum length of SLiMs to return (no. non-wildcard positions) [5]
minocc=X : Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if < 1) [0.05]
: Used if minocc<1 to define absolute min. UP occ [3]
alphahelix=T/F : Special i, i+3/4, i+7 motif discovery [False]
fixlen=T/F : If true, will use maxwild and slimlen to define a fixed total motif length [False]
palindrome=T/F : Special DNA mode that will search for palindromic sequences only [False]
: (preamb=T/F) Whether to search for ambiguous motifs during motif discovery [True]
ambocc=X : Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or > minocc) [0.05]
: Used if ambocc<1 to define absolute min. UP occ [2]
: List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
: Whether to allow variable length wildcards [True]
: Whether to search for combined amino acid degeneracy and variable wildcards [False]
altupc=PATH : Look for alternative UPC file and filter based on minocc [None]
musthave=LIST : Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
query=LIST : Return only SLiMs that occur in 1+ Query sequences (Name/AccNum) []
: FILE containing focal groups for SLiM return (see Manual for details) [None]
: Motif must appear in X+ focus groups (0 = all) [0]

See also rje_slimcalc options for occurrence-based calculations and filtering.
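The interaction between a proportional minocc setting and its absolute minimum (default 3) can be sketched as below. This is an illustration under assumptions: the function and parameter names are hypothetical, and rounding the proportion up to the next whole number is a guess rather than the documented behaviour, so check the Manual before relying on exact thresholds.

```python
import math

def effective_minocc(minocc, upc_count, abs_min=3):
    """Convert a minocc setting into an absolute number of unrelated occurrences."""
    if minocc >= 1:
        # Values of 1 or more are already absolute counts.
        return int(minocc)
    # Values below 1 are a proportion of the unrelated proteins (UP),
    # but never less than the absolute minimum. Rounding up is an assumption.
    return max(abs_min, math.ceil(minocc * upc_count))

print(effective_minocc(0.1, 25))   # proportion gives 2.5, so the minimum of 3 applies
print(effective_minocc(0.5, 21))   # proportion gives 10.5, rounded up here
print(effective_minocc(4, 100))    # absolute setting used as given
```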

SLiMChance

cloudfix=T/F : Restrict output to clouds with 1+ fixed motif (recommended) [False]
slimchance=T/F : Execute main SLiMFinder probability method and outputs [True]
sigprime=T/F : Calculate more precise (but more computationally intensive) statistical model [False]
sigv=T/F : Use the more precise (but more computationally intensive) fix to mean UPC probability [False]
dimfreq=T/F : Whether to use dimer masking pattern to adjust number of possible sites for motif [True]
probcut=X : Probability cut-off for returned motifs (sigcut=X also recognised) [0.1]
maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [True]
aafreq=AAFILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
aadimerfreq=FILE : Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [None]
negatives=SEQFILE : Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [None]
smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
seqocc=T/F : Whether to upweight for multiple occurrences in same sequence (heuristic) [False]
probscore=X : Score to be used for probability cut-off and ranking (Prob/Sig/S/R) [Sig]

Advanced

Advanced Masking Options I (Conservation Masking)

usegopher=T/F : Use GOPHER to generate orthologue alignments missing from alndir - see gopher.py options [False]
fullforce=T/F : Whether to force regeneration of alignments using GOPHER [False]
orthdb=FILE : File to use as source of orthologues for GOPHER []

See also rje_slimcalc options for more conservation calculation options.

Advanced Output Options I (Output data)

clouds=X : Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc / -1=off) [2]
runid=X : Run ID for resfile (allows multiple runs on same data) [DATE]
logmask=T/F : Whether to log the masking of individual sequences [True]
slimcheck=MOTIFS : Motif file/list to add to resfile output []

Advanced Output Options II (Output formats)

teiresias=T/F : Replace TEIRESIAS, making *.out and *.mask.fasta files [False]
slimdisc=T/F : Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [False]
extras=X : Whether to generate additional output files (alignments etc.) [1]
- -1 = No output beyond main results file
- 0 = Generate occurrence file
- 1 = Generate occurrence file, alignments and cloud file
- 2 = Generate all additional SLiMFinder outputs
- 3 = Generate SLiMDisc emulation too (equiv extras=2 slimdisc=T)
targz=T/F : Whether to tar and zip dataset result files (UNIX only) [False]
savespace=X : Delete "unnecessary" files following run (best used with targz) [0]
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle
- 2 = Delete all bar *.upc (pickle added to tar)
- 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)

Advanced Output Options III (Additional Motif Filtering)

topranks=X : Will only output top X motifs meeting probcut [1000]
oldscores=T/F : Whether to also output old SLiMDisc score (S) and SLiMPickings score (R) [False]
allsig=T/F : Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [False]
minic=X : Minimum information content for returned motifs [2.1]

See also rje_slimcalc options for occurrence-based calculations and filtering.

Additional Functions I (MotifSeq)

motifseq=LIST : Outputs fasta files for a list of X:Y, where X is the pattern and Y is the output file []
slimbuild=T/F : Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [True]

Additional Functions II (Randomised datasets)

randomise=T/F : Randomise UPC within batch files and output new datasets [False]
randir=PATH : Output path for creation of randomised datasets [Random/]
randbase=X : Base for random dataset name [rand]

Module Version History

# 0.0 - Initial Compilation.
# 1.0 - Preliminary working version with Poisson probabilities
# 1.1 - Binomial probabilities, Bonferroni corrections and complexity masking
# 1.2 - Added musthave=LIST option and denferroni correction.
# 1.3 - Added resfile=FILE output
# 1.4 - Added option for termini
# 1.5 - Reworked slim mechanics to be ai-x-aj strings for future ambiguity (split on '-' to make list)
# 1.6 - Added basic ambiguity and flexible wildcards plus MST weighting for UP clusters
# 1.7 - Added counting of generic dimer frequencies for improved Bonferroni and probability calculation (No blockmask.)
# - Added topranks=X and query=X
# 1.8 - Added *.upc rather than *.self.blast. Added basic randomiser function.
# 1.9 - Added MotifList object to handle extra calculations and occurrence filtering.
# 2.0 - Tidied up and standardised output. Implemented extra filtering and scoring options.
# 2.1 - Changed defaults. Removed Poisson as option and other obsolete functions.
# 2.2 - Tidied and reorganised code using SLiMBuild/SLiMChance subdivision of labour. Removed rerun=T/F (just Force.)
# 2.3 - Added AAFreq "smear" and "better" p1+ calculation. Added extra cloud summary output.
# 2.4 - Minor bug fixes and tidying. Removed power output. (Rubbish anyway!) Can read UPC from distance matrix.
# 3.0 - Dumped useless stats and calculations. Simplified output. Improved ambiguity & clouds.
# 3.1 - Added minwild and alphahelix options. (Partial aadimerfreq & negatives)
# 3.2 - Tidied up with SLiMCore, replaced old Motif objects with SLiM objects and SLiMCalc.
# 3.3 - Added XGMML output. Added webserver option with additional output.
# 3.4 - Added consmask relative conservation masking.
# 3.5 - Standardised masking options. Add motifmask and motifcull.
# 3.6 - Added aamasking and alphabet.
# 3.7 - Added option to switch off dimfreq and better handling of given aafreq
# 3.8 - Added SLiMDisc & SLiMPickings scores and options to rank on them.
# 3.9 - Added clouding consensus information. [Aborted due to technical challenges.]
# 3.10 - Added differentiation of methods for pickling and tarring.
# 4.0 - Added SigPrime and SigV calculation from Norman. Added graded extras output.
# 4.1 - Added SizeSort, AltUPC and NewUPC options. Added #END output for webserver.
# 4.2 - Added fixlen option and improved Alphahelix option
# 4.3 - Updated the output for Max/Min filtering and the pickup options. Removed TempMaxSetting.
# 4.4 - Modified to work with GOPHER V3.0.
# 4.5 - Minor modifications to fix sigV and sigPrime bugs. Modified extras setting. Added palindrome setting for DNA motifs.
# 4.6 - Minor modification to seqocc=T function. !Experimental! Added main occurrence output and modified savespace.
# 4.7 - Added SLiMMaker generation to motif clouds. Added Q and Occ to Chance column.
# 4.8 - Modified cloud generation to avoid issues with flexible-length wildcards.
# 4.9 - Preparation for SLiMFinder V5.0 & SLiMCore V2.0 using newer RJE_Object.
# 5.0 - Converted to use rje_obj.RJE_Object as base. Version 4.9 moved to legacy/.
# 5.1 - Modified SLiMChance slightly to catch missing aafreq.
# 5.1.1 - Modified alphabet handling and fixed musthave bug.
# 5.2.0 - Added PTMList and PTMData modes (dev only).
# 5.2.1 - Fixed ambocc<1 and minocc<1 issue. (Using integers rather than floats.) Fixed OccRes Sig output format.
# 5.2.2 - Added warnings for ambocc and minocc that exceed the absolute minima. Updated docstring.
# 5.2.3 - Switched feature masking OFF by default to give consistent Uniprot versus FASTA behaviour. Fixed FTMask=T/F bug.
# 5.3.0 - Added map and failed outputs for uniprotid=LIST input.
# 5.3.1 - Modified placement of disorder masking warning.
# 5.3.2 - Tweaked REST output format presentation.
# 5.3.3 - Updated resfile to be set by basefile if no resfile=X setting given
# 5.3.4 - Fixed terminal (^/$) musthave bug.
# 5.3.5 - Fixed slimcheck and advanced stats models bug.
# 5.4.0 - Modified qregion=X,Y to be 1-L numbering.

SLiMFinder REST Output formats

SLiMs and SLiMFinder

Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological
systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few
as two sites may be important for activity. SLiMFinder is a SLiM discovery program building on the principles of
the SLiMDisc software for accounting for evolutionary relationships between input proteins. This stops results
being dominated by motifs shared for reasons of history, rather than function. SLiMFinder runs in two phases:
(1) SLiMBuild constructs the motif search space based on number of defined positions, maximum length of "wildcard
spacers" and allowed amino acid ambiguities; (2) SLiMChance assesses the over-representation of all motifs,
correcting for the size of the SLiMBuild search space. This gives SLiMFinder high specificity.

Protein sequences can be masked prior to SLiMBuild. Disorder masking (using IUPred predictions) is highly
recommended. Other masking options are described in the manual and/or literature.

Running SLiMFinder

The standard REST server call for SLiMFinder is in the form:
slimfinder&uniprotid=LIST&dismask=T/F&consmask=T/F

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
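Based on the layout shown above, the full server output could be split into its OUTFMT sections with a short parser like the one below. This is a sketch under assumptions: it treats any line of the form ###~~~...~~~### followed by a "# NAME:" line as the start of a section, and the exact separator length and any text after the colon in real server output are not confirmed here. The function name and the sample contents are hypothetical.

```python
import re

SEPARATOR = re.compile(r"^###~+###$")
HEADER = re.compile(r"^# (\S+):\s*$")

def split_rest_output(text):
    """Return {OUTFMT: contents} for each section of the full server output."""
    sections = {}
    current = None
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        is_start = (SEPARATOR.match(lines[i]) and i + 1 < len(lines)
                    and HEADER.match(lines[i + 1]))
        if is_start:
            # Separator plus header line opens a new section.
            current = HEADER.match(lines[i + 1]).group(1)
            sections[current] = []
            i += 2
            continue
        if current is not None:
            sections[current].append(lines[i])
        i += 1
    return {name: "\n".join(body).strip() for name, body in sections.items()}

# Hypothetical sample in the documented layout (contents are placeholders).
sample = "\n".join([
    "###" + "~" * 70 + "###",
    "# main:",
    "placeholder main table",
    "###" + "~" * 70 + "###",
    "# occ:",
    "placeholder occurrence table",
])
print(split_rest_output(sample))
```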

More options are available through the SLiMFinder server: http://www.slimsuite.unsw.edu.au/servers/slimfinder.php

After running, click on the main tab to see overall SLiM predictions. If any SLiMs have been predicted, the
occ tab will have details of which proteins (and where) they occur.

If no SLiMs are returned: [1] Try altering the masking settings. (Disorder masking is recommended. Conservation
masking can sometimes help but it depends on the dataset.) [2] Try relaxing the probability cutoff. Set
probcut=1.0 to see the best motifs, regardless of significance. (You may also want to reduce the topranks=X
setting.)

Available REST Outputs

main = Main results table of predicted SLiM patterns (if any) [extras=-1]
occ = Occurrence table showing individual SLiM occurrences in input proteins [extras=0]
upc = List of Unrelated Protein Clusters (UPC) used for evolutionary corrections [extras=0]
cloud = Predicted SLiM "cloud" output, which groups overlapping motifs [extras=1]
seqin = Input sequence data [extras=-1]
slimdb = Parsed input sequences in fasta format, used for UPC generation etc. [extras=0]
masked = Masked input sequences (masked residues marked with X) [extras=1]
mapping = Fasta format with positions of SLiM occurrences aligned [extras=1]
motifaln = Fasta format of individual SLiM alignments (unmasked sequences) [extras=1]
maskaln = Fasta format of individual SLiM alignments (masked sequences) [extras=1]

Additional REST Outputs [extras>1]

To get additional REST outputs, set &extras=2 or &extras=3. This may increase run times noticeably,
depending on the number of SLiMs returned.

motifs = SLiM predictions reformatted in plain motif format for CompariMotif [extras=2]
compare = Results of all-by-all CompariMotif search of predicted SLiMs [extras=2]
xgmml = SLiMs, occurrences and motif relationships in a Cytoscape-compatible network [extras=2]
dismatrix = Input sequence distance matrix [extras=3]
rank = Main table in SLiMDisc output format [extras=3]
dat.rank = Occurrence table in SLiMDisc output format [extras=3]
teiresias = Motif prediction output in TEIRESIAS format [extras=3 teiresias=T]
teiresias.fasta = TEIRESIAS masked fasta output [extras=3 teiresias=T]
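The two tables above can be encoded as a simple lookup to check which REST outputs to expect for a given extras setting. The sketch below reads each bracketed value as the minimum extras level needed for that output, which is an interpretation of the tables rather than a documented rule. The dictionary and function names are hypothetical.

```python
# Minimum extras level for each REST output, copied from the tables above.
REST_OUTPUTS = {
    "main": -1, "seqin": -1,
    "occ": 0, "upc": 0, "slimdb": 0,
    "cloud": 1, "masked": 1, "mapping": 1, "motifaln": 1, "maskaln": 1,
    "motifs": 2, "compare": 2, "xgmml": 2,
    "dismatrix": 3, "rank": 3, "dat.rank": 3,
    "teiresias": 3, "teiresias.fasta": 3,
}
# Outputs that are also listed with teiresias=T.
TEIRESIAS_ONLY = {"teiresias", "teiresias.fasta"}

def available_outputs(extras=1, teiresias=False):
    """List the outputs expected for the given extras and teiresias settings."""
    return sorted(
        name for name, level in REST_OUTPUTS.items()
        if level <= extras and (teiresias or name not in TEIRESIAS_ONLY)
    )

print(available_outputs(extras=0))
```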
