Program:	SLiMFinder
Description:	Short Linear Motif Finder
Version:	4.9
Last Edit:	11/07/14
Citation:	Edwards, Davey & Shields (2007), PLoS ONE 2(10): e967.
ConsMask Citation:	Davey NE, Shields DC & Edwards RJ (2009), Bioinformatics 25(4): 443-50.
SigV/SigPrime Citation:	Davey NE, Edwards RJ & Shields DC (2010), BMC Bioinformatics 11: 14.

Imported modules: rje rje_seq rje_sequence rje_scoring rje_xgmml rje_slim rje_slimcalc rje_slimlist slimmaker rje_motif_V3 comparimotif_V3 ned_rankbydistribution

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular, it can be very difficult to distinguish a randomly recurring "motif" from a truly over-represented one. Incorporating ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates the matter.

SLiMFinder is an integrated SLiM discovery program that builds on the principles of the SLiMDisc software to account for evolutionary relationships [Davey NE, Shields DC & Edwards RJ (2006): Nucleic Acids Res. 34(12):3546-54]. SLiMFinder comprises two algorithms:

SLiMBuild identifies convergently evolved, short motifs in a dataset. Motifs with fixed amino acid positions are identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. Unlike programs such as TEIRESIAS, which return all shared patterns, SLiMBuild accelerates the process and reduces returned motifs by explicitly screening out motifs that do not occur in enough unrelated proteins. For this, SLiMBuild uses the "Unrelated Proteins" (UP) algorithm of SLiMDisc in which BLAST is used to identify pairwise relationships. Proteins are then clustered according to these relationships into "Unrelated Protein Clusters" (UPCs), which are defined such that no protein in a UPC has a BLAST-detectable relationship with a protein in another UPC. If desired, SLiMBuild can be used as a replacement for TEIRESIAS in other software (teiresias=T slimchance=F).
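
The UPC definition above (no protein in one UPC has a detectable relationship with a protein in another) amounts to single-linkage clustering of the pairwise relationships. The following is a minimal sketch of that idea only, not the SLiMFinder code: the function name `cluster_upc` and the input format (a list of protein names plus a list of related pairs, standing in for parsed BLAST hits) are assumptions made for illustration.

```python
from collections import defaultdict

def cluster_upc(proteins, related_pairs):
    """Group proteins so that no related pair is split across two clusters.

    proteins: iterable of protein names (hypothetical input format).
    related_pairs: iterable of (name, name) tuples, e.g. BLAST hits that
    passed the e-value threshold (assumed to be parsed elsewhere).
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative, halving the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in related_pairs:
        # Any detectable relationship merges the two clusters (single linkage).
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for p in proteins:
        clusters[find(p)].append(p)
    return sorted(sorted(members) for members in clusters.values())

upcs = cluster_upc(["A", "B", "C", "D", "E"], [("A", "B"), ("B", "C")])
print(upcs)  # [['A', 'B', 'C'], ['D'], ['E']]
```

Here A and C end up in the same cluster even though they are not directly related, because both are related to B. The support of a motif is then counted in clusters rather than in individual proteins.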

SLiMChance estimates the probability of these motifs arising by chance, correcting for the size and composition of the dataset, and assigns a significance value to each motif. Motif occurrence probabilities are calculated independently for each UPC, adjusted for the size of the UPC using the Minimum Spanning Tree algorithm from SLiMDisc. These individual occurrence probabilities are then converted into the total probability of seeing the observed motif the observed number of (unrelated) times. These probabilities assume that the motif is known before the search. In reality, only over-represented motifs from the dataset are looked at, so these probabilities are adjusted for the size of motif-space searched to give a significance value. This is an estimate of the probability of seeing that motif, or another one like it. These values are calculated separately for each length of motif. Where pre-known motifs are also of interest, these can be given with the slimcheck=MOTIFS option and will be added to the output. SLiMFinder version 4.0 introduced a more precise (but more computationally intensive) statistical model, which can be switched on using sigprime=T. Likewise, the more precise (but more computationally intensive) correction to the mean UPC probability heuristic can be switched on using sigv=T. (Note that the other SLiMChance options may not work with either of these options.) The allsig=T option will output all four scores (Sig, SigV, SigPrime and SigPrimeV). In this case, SigPrimeV will be used for ranking etc. unless probscore=X is used.
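
The two-step logic described above (a per-motif probability, then a correction for the motif-space searched) can be sketched as follows. This is an illustration under stated assumptions rather than the SLiMChance implementation: it assumes a binomial model over the mean per-UPC occurrence probability (the version history mentions binomial probabilities) and a correction of the form 1 - (1 - p)^B, where B is the number of motifs in the searched motif-space. The function names and arguments are hypothetical, and the MST size adjustment, SigV and SigPrime are not modelled.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def motif_significance(upc_probs, observed_support, motif_space):
    """Return (probability, significance) for one motif.

    upc_probs: per-UPC probability of 1+ occurrences (assumed precomputed).
    observed_support: number of UPCs in which the motif was observed.
    motif_space: number of possible motifs of this length (assumed known).
    """
    # Assumption: use the mean per-UPC probability in a binomial model.
    mean_p = sum(upc_probs) / len(upc_probs)
    p_motif = binomial_tail(observed_support, len(upc_probs), mean_p)
    # Assumption: correct for the chance that any of the motif_space
    # motifs reaches this probability.
    sig = 1 - (1 - p_motif) ** motif_space
    return p_motif, sig

p, sig = motif_significance([0.01, 0.02, 0.015, 0.01], observed_support=3, motif_space=1000)
print(p, sig)
```

The correction matters because a probability that looks small for a single pre-specified motif can be unremarkable once every motif of that length has been given the chance to reach it.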

Where significant motifs are returned, SLiMFinder will group them into Motif "Clouds", which consist of physically overlapping motifs (2+ non-wildcard positions are the same in the same sequence). This provides an easy indication of which motifs may actually be variants of a larger SLiM and should therefore be considered together.
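
The cloud rule above (two motifs belong together if 2+ of their non-wildcard positions coincide in the same sequence) can be sketched as below. This is a minimal illustration, not the SLiMFinder code. The data layout is an assumption: each motif maps to a dictionary from sequence name to the set of residue positions matched by its fixed (non-wildcard) positions, and only one occurrence per sequence is handled.

```python
from itertools import combinations

def motifs_overlap(occ_a, occ_b, min_shared=2):
    """True if the motifs share min_shared+ fixed positions in any one sequence.

    occ_a, occ_b: {sequence name: set of fixed residue positions} (assumed layout).
    """
    return any(
        len(occ_a[seq] & occ_b[seq]) >= min_shared
        for seq in occ_a.keys() & occ_b.keys()
    )

def build_clouds(occurrences):
    """Merge overlapping motifs into clouds; returns sorted lists of patterns."""
    cloud_of = {motif: {motif} for motif in occurrences}
    for a, b in combinations(occurrences, 2):
        if cloud_of[a] is not cloud_of[b] and motifs_overlap(occurrences[a], occurrences[b]):
            merged = cloud_of[a] | cloud_of[b]
            for motif in merged:
                # Point every member at the same merged cloud.
                cloud_of[motif] = merged
    unique = {id(cloud): cloud for cloud in cloud_of.values()}
    return sorted(sorted(cloud) for cloud in unique.values())

occurrences = {
    "R.L.F": {"seq1": {10, 12, 14}},
    "L.F": {"seq1": {12, 14}},
    "P..P": {"seq2": {3, 6}},
}
print(build_clouds(occurrences))  # [['L.F', 'R.L.F'], ['P..P']]
```

In this example R.L.F and L.F share positions 12 and 14 in seq1 and so form one cloud, while P..P stays on its own.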

Additional Motif Occurrence Statistics, such as motif conservation, are handled by the rje_slimlist module. Please see the documentation for this module for a full list of commandline options. These options are currently under development for SLiMFinder and are not fully supported. See the SLiMFinder Manual for further details. Note that the OccFilter *does* affect the motifs returned by SLiMBuild and thus the TEIRESIAS output (as does min. IC and min. Support) but the overall Motif StatFilter *only* affects SLiMFinder output following SLiMChance calculations.

Secondary Functions

The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.
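
As a rough illustration of the X:Y format, the sketch below splits each entry into a pattern and an output file name, then builds fasta text for the sequences containing the pattern. It is not the SLiMFinder code. It assumes that the output holds the sequences in which the pattern occurs and that the pattern can be read as a regular expression, which holds for simple SLiM patterns such as R.L.[FY]. The function names are hypothetical and no files are written.

```python
import re

def parse_motifseq(entries):
    """Split 'X:Y' strings into (pattern, output file) pairs.

    entries: list of 'pattern:outfile' strings (hypothetical input format).
    """
    pairs = []
    for entry in entries:
        # Split on the last colon so that the pattern is kept intact.
        pattern, outfile = entry.rsplit(":", 1)
        pairs.append((pattern, outfile))
    return pairs

def motif_fasta(pattern, sequences):
    """Return fasta text for the sequences that contain the pattern.

    Assumption: the pattern is treated as a Python regular expression.
    """
    regex = re.compile(pattern)
    return "".join(
        f">{name}\n{seq}\n" for name, seq in sequences.items() if regex.search(seq)
    )

sequences = {"p1": "MKRSLAFT", "p2": "MKTAYIAK"}
for pattern, outfile in parse_motifseq(["R.L.[FY]:rxl.fas"]):
    print(outfile)
    print(motif_fasta(pattern, sequences), end="")
```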

The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets
by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final
datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.
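
The shuffling step can be sketched as follows. This is an illustration only, not the SLiMFinder code: it assumes each dataset is already represented as a list of UPCs (each UPC a list of sequence names), and the function name and `seed` argument are hypothetical.

```python
import random

def randomise_datasets(datasets, seed=None):
    """Redistribute UPCs at random, keeping the UPC count of each dataset.

    datasets: list of datasets, each a list of UPCs (assumed layout).
    """
    rng = random.Random(seed)
    # Pool every UPC from every dataset, then shuffle the pool.
    pool = [upc for dataset in datasets for upc in dataset]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for dataset in datasets:
        # Deal out as many UPCs as the original dataset contained.
        shuffled.append(pool[start:start + len(dataset)])
        start += len(dataset)
    return shuffled

original = [[["a1", "a2"], ["b1"]], [["c1"], ["d1"], ["e1", "e2"]]]
print(randomise_datasets(original, seed=1))
```

The sketch does not re-check relationships between UPCs drawn from different datasets. That is the crudeness noted above: two UPCs that share a sequence or are related to each other would collapse into one when the new dataset is clustered, leaving it with fewer UPC than intended.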

Basic Input/Output Options

seqin=FILE : Sequence file to search [None]
batch=LIST : List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
maxseq=X : Maximum number of sequences to process [500]
maxupc=X : Maximum UPC size of dataset to process [0]
sizesort=X : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
walltime=X : Time in hours before program will abort search and exit [1.0]
resfile=FILE : Main SLiMFinder results table [slimfinder.csv]
resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [SLiMFinder/]
buildpath=PATH : Alternative path to look for existing intermediate files [SLiMFinder/]
force=T/F : Force re-running of BLAST, UPC generation and SLiMBuild [False]
pickup=T/F : Pick-up from aborted batch run by identifying datasets in resfile [False]
pickid=T/F : Whether to use RunID to identify run datasets when using pickup [True]
pickall=T/F : Whether to skip aborted runs (True) or only those datasets that ran to completion (False) [True]
dna=T/F : Whether the sequence files are DNA rather than protein [False]
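
Options are given as key=value pairs, with T/F for switches and comma-separated values for lists. A small helper such as the one below could assemble these pairs from a dictionary when runs are scripted. It is a sketch: the function name `build_args` is hypothetical, and the interpreter and script path are left out because they depend on the local installation.

```python
def build_args(options):
    """Turn a dict of option values into key=value command-line arguments."""
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Switches are written as T or F.
            value = "T" if value else "F"
        elif isinstance(value, (list, tuple)):
            # List options are comma-separated.
            value = ",".join(str(v) for v in value)
        args.append(f"{key}={value}")
    return args

args = build_args({
    "batch": ["*.dat", "*.fas"],
    "walltime": 2.0,
    "resfile": "slimfinder.csv",
    "pickup": True,
})
print(args)  # ['batch=*.dat,*.fas', 'walltime=2.0', 'resfile=slimfinder.csv', 'pickup=T']
```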

SLiMBuild Options

See also rje_slimcalc options for occurrence-based calculations and filtering.

: Whether to use evolutionary filter [True]
: Use BLAST Complexity filter when determining relationships [True]
: BLAST e-value threshold for determining relationships [1e-4]
: Alternative all by all distance matrix for relationships [None]
: Alternative GABLAM results file [None] (!!!Experimental feature!!!)
: Max number of homologues to allow (to reduce large multi-domain families) [0]
: Look for alternative UPC file and calculate Significance using new clusters [None]
: Master control switch to turn off all masking if False [True]
: Whether to mask ordered regions (see rje_disorder for options) [False]
: Whether to use relative conservation masking [False]
: UniProt features to mask out [EM]
: UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
: Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
: Mask Upper or Lower case [None]
: List (or file) of motifs to mask from input sequences []
: Masks the N-terminal M (can be useful if termini=T) [True]
: Masks list of position-specific aas, where list = pos1:aas,pos2:aas [2:A]
: Masks list of AAs from all sequences (reduces alphabet) []
: Mask all but the region of the query from (and including) residue X to residue Y [0,-1]
: Whether to add termini characters (^ & $) to search sequences [True]
: Minimum number of consecutive wildcard positions to allow [0]
: Maximum number of consecutive wildcard positions to allow [2]
: Maximum length of SLiMs to return (no. non-wildcard positions) [5]
: Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if < 1) [0.05]
: Used if minocc<1 to define absolute min. UP occ [3]
: Special i, i+3/4, i+7 motif discovery [False]
: If true, will use maxwild and slimlen to define a fixed total motif length [False]
: Special DNA mode that will search for palindromic sequences only [False]
: (preamb=T/F) Whether to search for ambiguous motifs during motif discovery [True]
: Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or > minocc) [0.05]
: Used if ambocc<1 to define absolute min. UP occ [2]
: List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
: Whether to allow variable length wildcards [True]
: Whether to search for combined amino acid degeneracy and variable wildcards [False]
: Look for alternative UPC file and filter based on minocc [None]
: Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
: Return only SLiMs that occur in 1+ Query sequences (Name/AccNum) []
: FILE containing focal groups for SLiM return (see Manual for details) [None]
: Motif must appear in X+ focus groups (0 = all) [0]
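
The low complexity masking rule listed above (same AA in X+ of Y consecutive aas, default 5,8) can be sketched as a sliding window. This is a minimal illustration, not the SLiMFinder code. Two details are assumptions: the over-represented residue is replaced at each of its positions in the offending window, and the mask character is X. The function and parameter names are hypothetical.

```python
from collections import Counter

def mask_low_complexity(sequence, min_same=5, window=8, mask_char="X"):
    """Mask residues that occur min_same+ times within any window of the sequence.

    Counts are taken from the unmasked input, so earlier masking does not
    affect later windows.
    """
    masked = list(sequence)
    for start in range(max(1, len(sequence) - window + 1)):
        segment = sequence[start:start + window]
        for residue, count in Counter(segment).items():
            if count >= min_same and residue != mask_char:
                # Assumption: mask every copy of the repeated residue in this window.
                for offset, aa in enumerate(segment):
                    if aa == residue:
                        masked[start + offset] = mask_char
    return "".join(masked)

print(mask_low_complexity("MKQQQQQAQLT"))  # MKXXXXXAXLT
print(mask_low_complexity("MKTAYIAKQR"))   # MKTAYIAKQR
```

Masking of this kind stops a homopolymer-rich stretch from generating many trivially recurring motifs during the search.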

SLiMChance Options

cloudfix=T/F : Restrict output to clouds with 1+ fixed motif (recommended) [False]
slimchance=T/F : Execute main SLiMFinder probability method and outputs [True]
sigprime=T/F : Calculate more precise (but more computationally intensive) statistical model [False]
sigv=T/F : Use the more precise (but more computationally intensive) fix to mean UPC probability [False]
dimfreq=T/F : Whether to use dimer masking pattern to adjust number of possible sites for motif [True]
probcut=X : Probability cut-off for returned motifs (sigcut=X also recognised) [0.1]
maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [True]
aafreq=FILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
aadimerfreq=FILE : Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [None]
negatives=FILE : Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [None]
smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
seqocc=T/F : Whether to upweight for multiple occurrences in same sequence (heuristic) [False]
probscore=X : Score to be used for probability cut-off and ranking (Prob/Sig/S/R) [Sig]

Advanced Output Options

Advanced Output Options I

clouds=X : Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc / -1=off) [2]
runid=X : Run ID for resfile (allows multiple runs on same data) [DATE]
logmask=T/F : Whether to log the masking of individual sequences [True]
slimcheck=FILE : Motif file/list to add to resfile output []

Advanced Output Options II

teiresias=T/F : Replace TEIRESIAS, making *.out and *.mask.fasta files [False]
slimdisc=T/F : Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [False]
extras=X : Whether to generate additional output files (alignments etc.) [1]
- -1 = No output beyond main results file
- 0 = Generate occurrence file
- 1 = Generate occurrence file, alignments and cloud file
- 2 = Generate all additional SLiMFinder outputs
- 3 = Generate SLiMDisc emulation too (equiv extras=2 slimdisc=T)
targz=T/F : Whether to tar and zip dataset result files (UNIX only) [False]
savespace=X : Delete "unnecessary" files following run (best used with targz) [0]
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle
- 2 = Delete all bar *.upc (pickle added to tar)
- 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)

Advanced Output Options III

topranks=X : Will only output top X motifs meeting probcut [1000]
oldscores=T/F : Whether to also output old SLiMDisc score (S) and SLiMPickings score (R) [False]
allsig=T/F : Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [False]
minic=X : Minimum information content for returned motifs [2.1]

See also rje_slimcalc options for occurrence-based calculations and filtering.

Additional Functions

Additional Functions I

motifseq=LIST : Outputs fasta files for a list of X:Y, where X is the pattern and Y is the output file []
slimbuild=T/F : Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [True]

Additional Functions II

randomise=T/F : Randomise UPC within batch files and output new datasets [False]
randir=PATH : Output path for creation of randomised datasets [Random/]
randbase=X : Base for random dataset name [rand]

History

Module Version History

# 0.0 - Initial Compilation.
# 1.0 - Preliminary working version with Poisson probabilities
# 1.1 - Binomial probabilities, Bonferroni corrections and complexity masking
# 1.2 - Added musthave=LIST option and denferroni correction.
# 1.3 - Added resfile=FILE output
# 1.4 - Added option for termini
# 1.5 - Reworked slim mechanics to be ai-x-aj strings for future ambiguity (split on '-' to make list)
# 1.6 - Added basic ambiguity and flexible wildcards plus MST weighting for UP clusters
# 1.7 - Added counting of generic dimer frequencies for improved Bonferroni and probability calculation (No blockmask.)
# - Added topranks=X and query=X
# 1.8 - Added *.upc rather than *.self.blast. Added basic randomiser function.
# 1.9 - Added MotifList object to handle extra calculations and occurrence filtering.
# 2.0 - Tidied up and standardised output. Implemented extra filtering and scoring options.
# 2.1 - Changed defaults. Removed Poisson as option and other obsolete functions.
# 2.2 - Tidied and reorganised code using SLiMBuild/SLiMChance subdivision of labour. Removed rerun=T/F (just Force.)
# 2.3 - Added AAFreq "smear" and "better" p1+ calculation. Added extra cloud summary output.
# 2.4 - Minor bug fixes and tidying. Removed power output. (Rubbish anyway!) Can read UPC from distance matrix.
# 3.0 - Dumped useless stats and calculations. Simplified output. Improved ambiguity & clouds.
# 3.1 - Added minwild and alphahelix options. (Partial aadimerfreq & negatives)
# 3.2 - Tidied up with SLiMCore, replaced old Motif objects with SLiM objects and SLiMCalc.
# 3.3 - Added XGMML output. Added webserver option with additional output.
# 3.4 - Added consmask relative conservation masking.
# 3.5 - Standardised masking options. Add motifmask and motifcull.
# 3.6 - Added aamasking and alphabet.
# 3.7 - Added option to switch off dimfreq and better handling of given aafreq
# 3.8 - Added SLiMDisc & SLiMPickings scores and options to rank on them.
# 3.9 - Added clouding consensus information. [Aborted due to technical challenges.]
# 3.10 - Added differentiation of methods for pickling and tarring.
# 4.0 - Added SigPrime and SigV calculation from Norman. Added graded extras output.
# 4.1 - Added SizeSort, AltUPC and NewUPC options. Added #END output for webserver.
# 4.2 - Added fixlen option and improved Alphahelix option
# 4.3 - Updated the output for Max/Min filtering and the pickup options. Removed TempMaxSetting.
# 4.4 - Modified to work with GOPHER V3.0.
# 4.5 - Minor modifications to fix sigV and sigPrime bugs. Modified extras setting. Added palindrome setting for DNA motifs.
# 4.6 - Minor modification to seqocc=T function. !Experimental! Added main occurrence output and modified savespace.
# 4.7 - Added SLiMMaker generation to motif clouds. Added Q and Occ to Chance column.
# 4.8 - Modified cloud generation to avoid issues with flexible-length wildcards.
# 4.9 - Preparation for SLiMFinder V5.0 & SLiMCore V2.0 using newer RJE_Object.
