|Description:|| Short Linear Motif Finder|
|Last Edit:|| 07/12/15|
|Citation:|| Edwards RJ, Davey NE & Shields DC (2007), PLoS ONE 2(10): e967.|
|ConsMask Citation:|| Davey NE, Shields DC & Edwards RJ (2009), Bioinformatics 25(4): 443-50.|
|SigV/SigPrime Citation:|| Davey NE, Edwards RJ & Shields DC (2010), BMC Bioinformatics 11: 14.|
|SLiMScape/REST Citation:|| Olorin E, O'Brien KT, Palopoli N, Perez-Bercoff A, Shields DC & Edwards RJ (2015), F1000Research 4:477.|
|SLiMMaker Citation:|| Palopoli N, Lythgow KT & Edwards RJ (2015), Bioinformatics 31(14): 2284-2293.|
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice
See SLiMSuite Blog for further documentation.
Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological
systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few
as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular,
it can be very difficult to distinguish a randomly recurring "motif" from a truly over-represented one. Incorporating
ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates
the matter.
SLiMFinder is an integrated SLiM discovery program building on the principles of the SLiMDisc software for accounting
for evolutionary relationships [Davey NE, Shields DC & Edwards RJ (2006): Nucleic Acids Res. 34(12):3546-54].
SLiMFinder comprises two algorithms:
SLiMBuild identifies convergently evolved, short motifs in a dataset. Motifs with fixed amino acid positions are
identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. Unlike
programs such as TEIRESIAS, which return all shared patterns, SLiMBuild accelerates the process and reduces returned
motifs by explicitly screening out motifs that do not occur in enough unrelated proteins. For this, SLiMBuild uses
the "Unrelated Proteins" (UP) algorithm of SLiMDisc in which BLAST is used to identify pairwise relationships.
Proteins are then clustered according to these relationships into "Unrelated Protein Clusters" (UPC), which are
defined such that no protein in a UPC has a BLAST-detectable relationship with a protein in another UPC. If desired,
SLiMBuild can be used as a replacement for TEIRESIAS in other software (
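The UPC definition above (no detectable relationship between proteins in different clusters) amounts to taking connected components of the pairwise relationship graph. The following is a minimal sketch of that idea only, not the SLiMDisc/SLiMFinder code: the function name `cluster_upc`, the protein names and the list of related pairs (standing in for BLAST hits) are all hypothetical.

```python
from collections import defaultdict

def cluster_upc(proteins, related_pairs):
    """Group proteins so that no related pair is split across two clusters.

    related_pairs stands in for BLAST-detected pairwise relationships
    (an assumption for illustration; no BLAST is run here).
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in related_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for p in proteins:
        clusters[find(p)].append(p)
    return sorted(sorted(members) for members in clusters.values())

proteins = ["P1", "P2", "P3", "P4", "P5"]
hits = [("P1", "P2"), ("P2", "P3")]  # hypothetical pairwise relationships
upc = cluster_upc(proteins, hits)
```

Note that P1 and P3 end up in the same cluster even though they have no direct hit, because both are related to P2.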
SLiMChance estimates the probability of these motifs arising by chance, correcting for the size and composition
of the dataset, and assigns a significance value to each motif. Motif occurrence probabilities are calculated
independently for each UPC, adjusted for the size of a UPC using the Minimum Spanning Tree algorithm from SLiMDisc.
These individual occurrence probabilities are then converted into the total probability of seeing the observed
motifs the observed number of (unrelated) times. These probabilities assume that the motif is known before the
search. In reality, only over-represented motifs from the dataset are looked at, so these probabilities are adjusted
for the size of motif-space searched to give a significance value. The returned corrected probability is an estimate
of the probability of seeing ANY motif with that significance (or greater) from the dataset (i.e. an estimate of the
probability of seeing that motif, *or another one like it*). These values are calculated separately for each length
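The two steps described above can be illustrated with a small numerical sketch. It is a simplification under stated assumptions and not the SLiMChance implementation: it assumes a binomial model using the mean per-UPC occurrence probability, and a correction of the form 1 - (1 - p)^B, where B is a hypothetical size of the motif space searched. The function names and all input values are made up for illustration.

```python
from math import comb

def prob_k_or_more(k, n, p):
    """Binomial probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def slimchance_sketch(upc_probs, observed, motif_space):
    """Return (motif probability, corrected significance).

    upc_probs   : per-UPC probability of 1+ occurrence (assumed given)
    observed    : number of UPC in which the motif was seen
    motif_space : number of motifs that could have been returned (assumption)
    """
    # Assumption: summarise the per-UPC probabilities by their mean.
    p_mean = sum(upc_probs) / len(upc_probs)
    # Probability of this specific, pre-defined motif occurring this often.
    p_motif = prob_k_or_more(observed, len(upc_probs), p_mean)
    # Probability of seeing ANY motif from the searched space with this score.
    sig = 1 - (1 - p_motif) ** motif_space
    return p_motif, sig

p_motif, sig = slimchance_sketch([0.01] * 10, observed=3, motif_space=1000)
```

The corrected value is never smaller than the uncorrected one, which reflects the point made above: a motif picked because it is over-represented is less surprising than one specified in advance.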
SLiMFinder version 4.0 introduced a more precise (but more computationally intensive) statistical model, which can
be switched on using sigprime=T. Likewise, the more precise (but more computationally intensive) correction to the
mean UPC probability heuristic can be switched on using sigv=T. (Note that the other SLiMChance options may not
work with either of these options.) The allsig=T option will output all four scores. In this case, SigPrimeV will be
used for ranking etc. unless probscore=X is used.
Clouds and Statistics
Where significant motifs are returned, SLiMFinder will group them into Motif "Clouds", which consist of physically
overlapping motifs (2+ non-wildcard positions are the same in the same sequence). This provides an easy indication
of which motifs may actually be variants of a larger SLiM and should therefore be considered together. From version
4.7, the *.cloud.txt output includes a SLiMMaker summary Regex for the whole cloud. NOTE: This may not necessarily
match all occurrences in the cloud.
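The cloud rule (motifs that share 2+ non-wildcard positions in the same sequence belong together) can be sketched as follows. This is an illustration only and not SLiMFinder's own cloud code: the data layout (motif -> sequence -> set of absolute positions of defined residues), the function names and the example motifs are all hypothetical.

```python
def share_two_positions(occ_a, occ_b):
    """True if two motifs share 2+ defined positions in any one sequence."""
    common_seqs = occ_a.keys() & occ_b.keys()
    return any(len(occ_a[s] & occ_b[s]) >= 2 for s in common_seqs)

def build_clouds(motif_occ):
    """Group motifs into clouds by chaining pairwise overlaps."""
    unassigned = set(motif_occ)
    clouds = []
    for motif in sorted(motif_occ):
        if motif not in unassigned:
            continue
        unassigned.discard(motif)
        cloud, stack = [motif], [motif]
        while stack:
            current = stack.pop()
            # sorted() makes a copy, so removing from the set here is safe.
            for other in sorted(unassigned):
                if share_two_positions(motif_occ[current], motif_occ[other]):
                    unassigned.discard(other)
                    cloud.append(other)
                    stack.append(other)
        clouds.append(sorted(cloud))
    return clouds

# Hypothetical occurrences: positions of the non-wildcard residues only.
motif_occ = {
    "P..P.K": {"seqA": {10, 13, 15}},
    "P.KR":   {"seqA": {13, 15, 16}},
    "L.DL":   {"seqB": {40, 42, 43}},
}
clouds = build_clouds(motif_occ)
```

Here the two proline-containing motifs overlap at positions 13 and 15 of seqA and form one cloud, while the third motif stays on its own.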
Additional Motif Occurrence Statistics, such as motif conservation, are handled by the rje_slimlist and
rje_slimcalc modules. Please see the documentation for these modules for a full list of commandline options. These
options have not been fully tested in SLiMFinder, so please report issues and/or request desired functions. Note that
occfilter=LIST *does* affect the motifs returned by SLiMBuild and thus the TEIRESIAS output (as do min. IC and min.
Support) but the overall Motif slimfilter=LIST *only* affects SLiMFinder output following SLiMChance calculations.
The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.
The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets
by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final
datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.
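As a rough picture of what shuffling UPC among datasets means, the sketch below pools every UPC from all datasets, shuffles the pool and deals the clusters back out so that each dataset keeps its original UPC count. It is an assumption-laden illustration, not the SLiMFinder randomiser: the function name, the dataset names and the `seed` parameter are hypothetical, and no re-clustering is done, so the caveat above about relationships between clusters from different datasets still applies.

```python
import random

def randomise_upc(datasets, seed=None):
    """Shuffle UPC among datasets, preserving the UPC count of each dataset.

    datasets: dict of dataset name -> list of UPC (each UPC a list of proteins).
    """
    rng = random.Random(seed)  # local generator so results can be reproduced
    pool = [upc for upc_list in datasets.values() for upc in upc_list]
    rng.shuffle(pool)
    shuffled, start = {}, 0
    for name, upc_list in datasets.items():
        shuffled[name] = pool[start:start + len(upc_list)]
        start += len(upc_list)
    return shuffled

datasets = {
    "setA": [["P1", "P2"], ["P3"]],
    "setB": [["P4"], ["P5", "P6"], ["P7"]],
}
shuffled = randomise_upc(datasets, seed=1)
```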
Where pre-known motifs are also of interest, these can be given with the slimcheck=MOTIFS option and will be added to
the output. In general, it is better to use SLiMProb to look for enrichment (or depletion) of pre-defined motifs.
The main input for SLiMFinder is the seqin=SEQFILE file of protein sequences, which can be Uniprot plain text
(DATFILE) or fasta (FASFILE) format. A batch of files (incorporating wildcards) can be given using batch=FILELIST.
An alternative primary input is uniprotid=LIST. This requires an active internet connection to retrieve
the corresponding Uniprot entries.
Please see Manual for details.
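Options are given in the option=value form used throughout this page, with T/F for switches and commas within lists. The helper below is a small sketch for assembling such an argument list from Python. It only builds the list and does not run anything; the script name `slimfinder.py`, the helper name `build_args` and the input file name are assumptions for illustration.

```python
def build_args(script="slimfinder.py", **options):
    """Build an argument list of option=value strings (nothing is executed)."""
    args = ["python", script]
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"      # switches are given as T/F
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)  # lists are comma-separated
        args.append(f"{key}={value}")
    return args

# Hypothetical input file; option names are taken from the lists on this page.
cmd = build_args(seqin="example.fas", sigprime=True, allsig=False)
```

The resulting list could be passed to `subprocess.run` once the real script location is known.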
Basic Input/Output Options
seqin=SEQFILE : Sequence file to search. Over-rules
batch=FILELIST : List of files to search, wildcards allowed. (Over-ruled by
uniprotid=LIST : Extract IDs/AccNums in list from Uniprot into BASEFILE.dat and use as
maxseq=X : Maximum number of sequences to process [
maxupc=X : Maximum UPC size of dataset to process [
sizesort=X : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [
walltime=X : Time in hours before program will abort search and exit [
resfile=FILE : Main SLiMFinder results table [
resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [
buildpath=PATH : Alternative path to look for existing intermediate files [
force=T/F : Force re-running of BLAST, UPC generation and SLiMBuild [
pickup=T/F : Pick-up from aborted batch run by identifying datasets in resfile [
pickid=T/F : Whether to use RunID to identify run datasets when using pickup [
pickall=T/F : Whether to skip aborted runs (True) or only those datasets that ran to completion (False) [
dna=T/F : Whether the sequences files are DNA rather than protein [
alphabet=LIST : List of characters to include in search (e.g. AAs or NTs) [default AA or NT codes]
megaslim=FILE : Make/use precomputed results for a proteome (FILE) in fasta format [
megablam=T/F : Whether to create and use all-by-all GABLAM results for (gablamdis) UPC generation [
ptmlist=LIST : List of PTM letters to add to alphabet for analysis and restrict PTM data 
ptmdata=DSVFILE : File containing PTM data, including AccNum, ModType, ModPos, ModAA, ModCode
SLiMBuild Options I (Evolutionary Filtering)
efilter=T/F : Whether to use evolutionary filter [
blastf=T/F : Use BLAST Complexity filter when determining relationships [
blaste=X : BLAST e-value threshold for determining relationships [
altdis=DSVFILE : Alternative all by all distance matrix for relationships [
gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
homcut=X : Max number of homologues to allow (to reduce large multi-domain families) [
newupc=PATH : Look for alternative UPC file and calculate Significance using new clusters [
SLiMBuild Options II (Input Masking)
masking=T/F : Master control switch to turn off all masking if False [
dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [
consmask=T/F : Whether to use relative conservation masking [
ftmask=LIST : UniProt features to mask out (
imask=LIST : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) 
compmask=X,Y : Mask low complexity regions (same AA in X+ of Y consecutive aas) [
casemask=X : Mask Upper or Lower case [
motifmask=X : List (or file) of motifs to mask from input sequences 
metmask=T/F : Masks the N-terminal M (can be useful if
posmask=LIST : Masks list of position-specific aas, where list = pos1:aas,pos2:aas [
aamask=LIST : Masks list of AAs from all sequences (reduces alphabet) 
qregion=X,Y : Mask all but the region of the query from (and including) residue X to residue Y [
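To make the compmask=X,Y rule concrete (same AA in X+ of Y consecutive aas), here is a minimal sliding-window sketch. It is not the SLiMFinder masking code. In particular, masking every residue of an offending window and using "X" as the mask character are assumptions made for illustration, and the function name is hypothetical.

```python
def mask_low_complexity(seq, x, y, mask_char="X"):
    """Mask windows of y residues in which one amino acid occurs x+ times.

    Assumption: the whole offending window is masked; the real program
    may mask a narrower region.
    """
    masked = list(seq)
    for start in range(len(seq) - y + 1):
        window = seq[start:start + y]  # always taken from the unmasked input
        if any(window.count(aa) >= x for aa in set(window)):
            for i in range(start, start + y):
                masked[i] = mask_char
    return "".join(masked)

# Hypothetical sequences: a serine-rich stretch and an unremarkable one.
low = mask_low_complexity("MKSSSSSPLDE", 5, 8)
plain = mask_low_complexity("MKTAYIAKQR", 5, 8)
```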
SLiMBuild Options III (Basic Motif Construction)
termini=T/F : Whether to add termini characters (^ & $) to search sequences [
minwild=X : Minimum number of consecutive wildcard positions to allow [
maxwild=X : Maximum number of consecutive wildcard positions to allow [
slimlen=X : Maximum length of SLiMs to return (no. non-wildcard positions) [
minocc=X : Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if < 1) [
absmin=X : Used if minocc<1 to define absolute min. UP occ [
alphahelix=T/F : Special i, i+3/4, i+7 motif discovery [
fixlen=T/F : If true, will use maxwild and slimlen to define a fixed total motif length [
palindrome=T/F : Special DNA mode that will search for palindromic sequences only [
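The interaction between minocc and absmin can be sketched as below: a minocc value under 1 is read as a proportion of the unrelated proteins (UP), with absmin acting as a floor. This is only an illustration of the option descriptions above. Rounding the proportion up to a whole number is an assumption, and the function name is hypothetical.

```python
import math

def min_support(minocc, upc_count, absmin):
    """Return the minimum number of unrelated occurrences required.

    Assumption: proportions are rounded up to the next whole number.
    """
    if minocc < 1:
        # Proportion of UP, but never below the absolute minimum.
        return max(absmin, math.ceil(minocc * upc_count))
    return int(minocc)

support_large = min_support(0.25, 20, 3)  # proportion exceeds absmin
support_small = min_support(0.25, 8, 3)   # absmin takes over
support_fixed = min_support(4, 20, 3)     # absolute value given directly
```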
SLiMBuild Options IV (Ambiguity)
ambiguity=T/F : (preamb=T/F) Whether to search for ambiguous motifs during motif discovery [
ambocc=X : Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or > minocc) [
absminamb=X : Used if ambocc<1 to define absolute min. UP occ [
equiv=LIST : List (or file) of TEIRESIAS-style ambiguities to use [
wildvar=T/F : Whether to allow variable length wildcards [
combamb=T/F : Whether to search for combined amino acid degeneracy and variable wildcards [
SLiMBuild Options V (Advanced Motif Filtering)
altupc=PATH : Look for alternative UPC file and filter based on minocc [
musthave=LIST : Returned motifs must contain one or more of the AAs in LIST (reduces search space) 
query=LIST : Return only SLiMs that occur in 1+ Query sequences (Name/AccNum) 
focus=FILE : FILE containing focal groups for SLiM return (see Manual for details) [
focusocc=X : Motif must appear in X+ focus groups (0 = all) [
- * See also rje_slimcalc options for occurrence-based calculations and filtering *
cloudfix=T/F : Restrict output to clouds with 1+ fixed motif (recommended) [
slimchance=T/F : Execute main SLiMFinder probability method and outputs [
sigprime=T/F : Calculate more precise (but more computationally intensive) statistical model [
sigv=T/F : Use the more precise (but more computationally intensive) fix to mean UPC probability [
dimfreq=T/F : Whether to use dimer masking pattern to adjust number of possible sites for motif [
probcut=X : Probability cut-off for returned motifs (sigcut=X also recognised) [
maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [
aafreq=AAFILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [
aadimerfreq=FILE: Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [
negatives=SEQFILE : Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [
smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [
seqocc=T/F : Whether to upweight for multiple occurrences in same sequence (heuristic) [
probscore=X : Score to be used for probability cut-off and ranking (Prob/Sig/S/R) [
Advanced Output Options I (Output data)
clouds=X : Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (
0=minocc / -
runid=X : Run ID for resfile (allows multiple runs on same data) [
logmask=T/F : Whether to log the masking of individual sequences [
slimcheck=MOTIFS : Motif file/list to add to resfile output 
Advanced Output Options II (Output formats)
teiresias=T/F : Replace TEIRESIAS, making *.out and *.mask.fasta files [
slimdisc=T/F : Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [
extras=X : Whether to generate additional output files (alignments etc.) [
- -1 = No output beyond main results file
- 0 = Generate occurrence file
- 1 = Generate occurrence file, alignments and cloud file
- 2 = Generate all additional SLiMFinder outputs
- 3 = Generate SLiMDisc emulation too (equiv
targz=T/F : Whether to tar and zip dataset result files (UNIX only) [
savespace=0 : Delete "unnecessary" files following run (best used with targz): [
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle
- 2 = Delete all bar *.upc (pickle added to tar)
- 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
Advanced Output Options III (Additional Motif Filtering)
topranks=X : Will only output top X motifs meeting probcut [
oldscores=T/F : Whether to also output old SLiMDisc score (S) and SLiMPickings score (R) [
allsig=T/F : Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [
minic=X : Minimum information content for returned motifs [
- * See also rje_slimcalc options for occurrence-based calculations and filtering *
Additional Functions I (MotifSeq)
motifseq=LIST : Outputs fasta files for a list of X:Y, where X is the pattern and Y is the output file 
slimbuild=T/F : Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [
Additional Functions II (Randomised datasets)
randomise=T/F : Randomise UPC within batch files and output new datasets [
randir=PATH : Output path for creation of randomised datasets [
randbase=X : Base for random dataset name [
Module Version History
# 0.0 - Initial Compilation.
# 1.0 - Preliminary working version with Poisson probabilities
# 1.1 - Binomial probabilities, Bonferroni corrections and complexity masking
# 1.2 - Added musthave=LIST option and denferroni correction.
# 1.3 - Added resfile=FILE output
# 1.4 - Added option for termini
# 1.5 - Reworked slim mechanics to be ai-x-aj strings for future ambiguity (split on '-' to make list)
# 1.6 - Added basic ambiguity and flexible wildcards plus MST weighting for UP clusters
# 1.7 - Added counting of generic dimer frequencies for improved Bonferroni and probability calculation (No blockmask.)
# - Added topranks=X and query=X
# 1.8 - Added *.upc rather than *.self.blast. Added basic randomiser function.
# 1.9 - Added MotifList object to handle extra calculations and occurrence filtering.
# 2.0 - Tidied up and standardised output. Implemented extra filtering and scoring options.
# 2.1 - Changed defaults. Removed Poisson as option and other obsolete functions.
# 2.2 - Tidied and reorganised code using SLiMBuild/SLiMChance subdivision of labour. Removed rerun=T/F (just Force.)
# 2.3 - Added AAFreq "smear" and "better" p1+ calculation. Added extra cloud summary output.
# 2.4 - Minor bug fixes and tidying. Removed power output. (Rubbish anyway!) Can read UPC from distance matrix.
# 3.0 - Dumped useless stats and calculations. Simplified output. Improved ambiguity & clouds.
# 3.1 - Added minwild and alphahelix options. (Partial aadimerfreq & negatives)
# 3.2 - Tidied up with SLiMCore, replaced old Motif objects with SLiM objects and SLiMCalc.
# 3.3 - Added XGMML output. Added webserver option with additional output.
# 3.4 - Added consmask relative conservation masking.
# 3.5 - Standardised masking options. Add motifmask and motifcull.
# 3.6 - Added aamasking and alphabet.
# 3.7 - Added option to switch off dimfreq and better handling of given aafreq
# 3.8 - Added SLiMDisc & SLiMPickings scores and options to rank on them.
# 3.9 - Added clouding consensus information. [Aborted due to technical challenges.]
# 3.10- Added differentiation of methods for pickling and tarring.
# 4.0 - Added SigPrime and SigV calculation from Norman. Added graded extras output.
# 4.1 - Added SizeSort, AltUPC and NewUPC options. Added #END output for webserver.
# 4.2 - Added fixlen option and improved Alphahelix option
# 4.3 - Updated the output for Max/Min filtering and the pickup options. Removed TempMaxSetting.
# 4.4 - Modified to work with GOPHER V3.0.
# 4.5 - Minor modifications to fix sigV and sigPrime bugs. Modified extras setting. Added palindrome setting for DNA motifs.
# 4.6 - Minor modification to seqocc=T function. !Experimental! Added main occurrence output and modified savespace.
# 4.7 - Added SLiMMaker generation to motif clouds. Added Q and Occ to Chance column.
# 4.8 - Modified cloud generation to avoid issues with flexible-length wildcards.
# 4.9 - Preparation for SLiMFinder V5.0 & SLiMCore V2.0 using newer RJE_Object.
# 5.0 - Converted to use rje_obj.RJE_Object as base. Version 4.9 moved to legacy/.
# 5.1 - Modified SLiMChance slightly to catch missing aafreq.
# 5.1.1 - Modified alphabet handling and fixed musthave bug.
# 5.2.0 - Added PTMList and PTMData modes (dev only).
# 5.2.1 - Fixed ambocc<1 and minocc<1 issue. (Using integers rather than floats.) Fixed OccRes Sig output format.
# 5.2.2 - Added warnings for ambocc and minocc that exceed the absolute minima. Updated docstring.
# 5.2.3 - Switched feature masking OFF by default to give consistent Uniprot versus FASTA behaviour. Fixed FTMask=T/F bug.
SLiMFinder REST Output formats
The standard REST server call for SLiMFinder is in the form:
Different sources of input can also be given with:
for general options. Run with
to get full server output as plain text. Otherwise,
individual outputs are parsed and presented in different tabs:
### Outputs available:
= main results file (
= Input file (
= occurrence file (
= UPC file (
= Fasta file used for UPC generation etc. (
= cloud.txt (
= masked.fas (
= mapping.fas file (
= motif alignments (
= masked motif alignments (
= motifs file for CompariMotif (
= CM compare.tdt file (
= XGMML file (
= *.dis.tdt file (
= optional SLiMDisc output (
= optional SLiMDisc output (
can then be used to retrieve individual parts of the output in future.