Program: | SLiMFinder |
Description: | Short Linear Motif Finder |
Version: | 5.4.0 |
Last Edit: | 24/05/19 |
Citation: | Edwards RJ, Davey NE & Shields DC (2007), PLoS ONE 2(10): e967. |
ConsMask Citation: | Davey NE, Shields DC & Edwards RJ (2009), Bioinformatics 25(4): 443-50. |
SigV/SigPrime Citation: | Davey NE, Edwards RJ & Shields DC (2010), BMC Bioinformatics 11: 14. |
SLiMScape/REST Citation: | Olorin E, O'Brien KT, Palopoli N, Perez-Bercoff A, Shields DC & Edwards RJ (2015), F1000Research 4:477. |
SLiMMaker Citation: | Palopoli N, Lythgow KT & Edwards RJ (2015), Bioinformatics 31(14): 2284-2293. |
Webserver: | http://www.slimsuite.unsw.edu.au/servers/slimfinder.php |
Manual: | http://bit.ly/SFManual |
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice
Imported modules:
rje
rje_seq
rje_sequence
rje_scoring
rje_xgmml
rje_slim
rje_slimcalc
rje_slimcore
rje_slimlist
slimmaker
rje_motif_V3
comparimotif_V3
ned_rankbydistribution
See SLiMSuite Blog for further documentation. See rje
for general commands.
Function
Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological
systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few
as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular,
it can be very difficult to distinguish a randomly recurring "motif" from a truly over-represented one. Incorporating
ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates
the matter.
SLiMFinder is an integrated SLiM discovery program building on the principles of the SLiMDisc software for accounting
for evolutionary relationships [Davey NE, Shields DC & Edwards RJ (2006): Nucleic Acids Res. 34(12):3546-54].
SLiMFinder comprises two algorithms:
1. SLiMBuild
identifies convergently evolved, short motifs in a dataset. Motifs with fixed amino acid positions are
identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. Unlike
programs such as TEIRESIAS, which return all shared patterns, SLiMBuild accelerates the process and reduces returned
motifs by explicitly screening out motifs that do not occur in enough unrelated proteins. For this, SLiMBuild uses
the "Unrelated Proteins" (UP) algorithm of SLiMDisc in which BLAST is used to identify pairwise relationships.
Proteins are then clustered according to these relationships into "Unrelated Protein Clusters" (UPC), which are
defined such that no protein in a UPC has a BLAST-detectable relationship with a protein in another UPC. If desired,
SLiMBuild
can be used as a replacement for TEIRESIAS in other software (teiresias=T
slimchance=F
).
2. SLiMChance
estimates the probability of these motifs arising by chance, correcting for the size and composition
of the dataset, and assigns a significance value to each motif. Motif occurrence probabilities are calculated
independently for each UPC, adjusted for the size of a UPC using the Minimum Spanning Tree algorithm from SLiMDisc.
These individual occurrence probabilities are then converted into the total probability of seeing the observed
motifs the observed number of (unrelated) times. These probabilities assume that the motif is known before the
search. In reality, only over-represented motifs from the dataset are looked at, so these probabilities are adjusted
for the size of motif-space searched to give a significance value. The returned corrected probability is an estimate
of the probability of seeing ANY motif with that significance (or greater) from the dataset (i.e. an estimate of the
probability of seeing that motif, *or another one like it*). These values are calculated separately for each length
of motif.
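The UPC definition used by SLiMBuild (no protein in one cluster has a detectable relationship with a protein in another) amounts to finding connected components in a graph of pairwise relationships. The following is a minimal sketch of that idea only, not SLiMFinder's own code: the function name `cluster_upc`, the protein names and the list of related pairs are hypothetical, and in a real run the pairs would come from BLAST results.

```python
def cluster_upc(proteins, related_pairs):
    """Group proteins into clusters with no relationships between clusters.

    Illustrative only: related_pairs stands in for BLAST-detected
    pairwise relationships, which this sketch does not compute.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in related_pairs:
        # Any detectable relationship merges the two clusters.
        parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    upc = cluster_upc(["A", "B", "C", "D", "E"], [("A", "B"), ("B", "C")])
    print(upc)
```

Note that A and C end up in the same cluster even though only A-B and B-C are related, which is what the UPC definition requires.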
SLiMFinder version 4.0 introduced a more precise (but more computationally intensive) statistical model, which can
be switched on using sigprime=T
. Likewise, the more precise (but more computationally intensive) correction to the
mean UPC probability heuristic can be switched on using sigv=T
. (Note that the other SLiMChance
options may not
work with either of these options.) The allsig=T
option will output all four scores. In this case, SigPrimeV will be
used for ranking etc. unless probscore=X
is used.
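The two-step logic of SLiMChance (a probability for a motif assumed to be known in advance, then a correction for the size of the motif space searched) can be illustrated with a deliberately simplified sketch. It assumes every UPC has the same occurrence probability and uses a generic multiple-testing form for the correction; the real program calculates per-UPC probabilities with MST size adjustment, and the SigV/SigPrime models differ again. The function names and numbers are hypothetical.

```python
from math import comb


def prob_k_or_more(n_upc, k, p_occ):
    """Binomial probability of a motif occurring in k or more of n_upc clusters.

    Simplifying assumption: one shared per-cluster probability p_occ.
    """
    return sum(
        comb(n_upc, i) * p_occ ** i * (1.0 - p_occ) ** (n_upc - i)
        for i in range(k, n_upc + 1)
    )


def corrected_significance(p_motif, n_motifs):
    """Probability that at least one of n_motifs possible motifs reaches p_motif.

    Assumed form of the motif-space correction, for illustration only.
    """
    return 1.0 - (1.0 - p_motif) ** n_motifs


if __name__ == "__main__":
    p = prob_k_or_more(n_upc=20, k=5, p_occ=0.02)
    print(p, corrected_significance(p, n_motifs=10000))
```

The point of the sketch is the gap between the two numbers: a motif that looks very unlikely on its own can still be unremarkable once the number of motifs examined is taken into account.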
Clouds and Statistics
Where significant motifs are returned, SLiMFinder will group them into Motif "Clouds", which consist of physically
overlapping motifs (2+ non-wildcard positions are the same in the same sequence). This provides an easy indication
of which motifs may actually be variants of a larger SLiM and should therefore be considered together. From version
V4.7, *.cloud.txt
output includes a SLiMMaker
summary Regex for the whole cloud. NOTE: This may not necessarily
match all occurrences in the cloud.
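A small sketch may help make the cloud definition concrete. It assumes each motif is represented by the set of (sequence, position) pairs covered by its non-wildcard positions, links two motifs when they share 2+ such positions in enough sequences, and merges linked motifs into clouds. The data structure, function names and example motifs are hypothetical, and SLiMFinder's own handling (for example of flexible-length wildcards) is more involved.

```python
from collections import defaultdict


def overlapping(occ_a, occ_b, min_pos=2, min_seqs=2):
    """True if two motifs share min_pos+ defined positions in min_seqs+ sequences."""
    shared = defaultdict(int)
    for seq, _pos in occ_a & occ_b:
        shared[seq] += 1
    return sum(1 for n in shared.values() if n >= min_pos) >= min_seqs


def build_clouds(motif_occ, min_seqs=2):
    """Group motifs into clouds of directly or indirectly overlapping motifs."""
    clouds = []
    for motif, occ in motif_occ.items():
        linked = [
            cloud for cloud in clouds
            if any(overlapping(occ, motif_occ[m], min_seqs=min_seqs) for m in cloud)
        ]
        merged = {motif}
        for cloud in linked:
            merged |= cloud
            clouds.remove(cloud)
        clouds.append(merged)
    return sorted(sorted(cloud) for cloud in clouds)


if __name__ == "__main__":
    example = {
        "P.LP": {("s1", 10), ("s1", 12), ("s1", 13), ("s2", 5), ("s2", 7), ("s2", 8)},
        "LP.K": {("s1", 12), ("s1", 13), ("s1", 15), ("s2", 7), ("s2", 8), ("s2", 10)},
        "DE.F": {("s3", 1), ("s3", 2), ("s3", 4)},
    }
    print(build_clouds(example))
```

Here the `min_seqs` argument plays the role of the clouds=X setting described under Advanced Output Options.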
Additional Motif Occurrence Statistics, such as motif conservation, are handled by the rje_slimlist
and
rje_slimcalc
modules. Please see the documentation for these modules for a full list of commandline options. These
options have not been fully tested in SLiMFinder, so please report issues and/or request desired functions. Note that
occfilter=LIST
*does* affect the motifs returned by SLiMBuild and thus the TEIRESIAS output (as does min. IC and min.
Support) but the overall Motif slimfilter=LIST
*only* affects SLiMFinder output following SLiMChance calculations.
Secondary Functions
The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.
The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets
by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final
datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.
Where pre-known motifs are also of interest, these can be given with the slimcheck=MOTIFS
option and will be added to
the output. In general, it is better to use SLiMProb
to look for enrichment (or depletion) of pre-defined motifs.
Commandline
Basic Input/Output Options
seqin=SEQFILE
: Sequence file to search. Over-rules batch=FILE
and uniprotid=LIST
[None
]
batch=FILELIST
: List of files to search, wildcards allowed. (Over-ruled by seqin=FILE
.) [*.dat,*.fas
]
uniprotid=LIST
: Extract IDs/AccNums in list from Uniprot into BASEFILE.dat and use as seqin=FILE
. []
maxseq=X
: Maximum number of sequences to process [500
]
maxupc=X
: Maximum UPC size of dataset to process [0
]
sizesort=X
: Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0
]
walltime=X
: Time in hours before program will abort search and exit [1.0
]
resfile=FILE
: Main SLiMFinder results table [slimfinder.csv
]
resdir=PATH
: Redirect individual output files to specified directory (and look for intermediates) [SLiMFinder/
]
buildpath=PATH
: Alternative path to look for existing intermediate files [SLiMFinder/
]
force=T/F
: Force re-running of BLAST, UPC generation and SLiMBuild [False
]
pickup=T/F
: Pick-up from aborted batch run by identifying datasets in resfile [False
]
pickid=T/F
: Whether to use RunID to identify run datasets when using pickup [True
]
pickall=T/F
: Whether to skip aborted runs (True) or only those datasets that ran to completion (False) [True
]
dna=T/F
: Whether the sequence files are DNA rather than protein [False
]
alphabet=LIST
: List of characters to include in search (e.g. AAs or NTs) [default AA or NT codes
]
megaslim=FILE
: Make/use precomputed results for a proteome (FILE) in fasta format [None
]
megablam=T/F
: Whether to create and use all-by-all GABLAM results for (gablamdis) UPC generation [False
]
ptmlist=LIST
: List of PTM letters to add to alphabet for analysis and restrict PTM data []
ptmdata=DSVFILE
: File containing PTM data, including AccNum, ModType, ModPos, ModAA, ModCode
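All options in this section take the form option=value, with T/F for switches and comma-separated values for lists. As a sketch of how a run might be assembled from a script, the helper below turns Python values into that form. The script name `slimfinder.py`, the `python` launcher and the input file name are assumptions for illustration; only the option names come from this documentation.

```python
def build_command(script="slimfinder.py", **options):
    """Build an argument list of option=value settings (hypothetical helper)."""
    args = ["python", script]
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"      # switches are given as T/F
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)  # LIST options are comma-separated
        args.append(f"{key}={value}")
    return args


if __name__ == "__main__":
    cmd = build_command(seqin="proteins.fas", dismask=True, walltime=2.0,
                        resfile="slimfinder.csv")
    print(" ".join(cmd))
    # The list could then be passed to subprocess.run(cmd, check=True).
```

Keeping the command as a list rather than a single string avoids shell quoting problems if file names contain spaces.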
SLiMBuild
SLiMBuild Options I (Evolutionary Filtering)
efilter=T/F
: Whether to use evolutionary filter [True
]
blastf=T/F
: Use BLAST Complexity filter when determining relationships [True
]
blaste=X
: BLAST e-value threshold for determining relationships [1e-4
]
altdis=DSVFILE
: Alternative all by all distance matrix for relationships [None
]
gablamdis=FILE
: Alternative GABLAM results file [None] (!!!Experimental feature!!!)
homcut=X
: Max number of homologues to allow (to reduce large multi-domain families) [0
]
newupc=PATH
: Look for alternative UPC file and calculate Significance using new clusters [None
]
SLiMBuild Options II (Input Masking)
masking=T/F
: Master control switch to turn off all masking if False [True
]
dismask=T/F
: Whether to mask ordered regions (see rje_disorder for options) [False
]
consmask=T/F
: Whether to use relative conservation masking [False
]
ftmask=LIST
: UniProt features to mask out (True=EM,DOMAIN,TRANSMEM
) []
imask=LIST
: UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
compmask=X,Y
: Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8
]
casemask=X
: Mask Upper or Lower case [None
]
motifmask=X
: List (or file) of motifs to mask from input sequences []
metmask=T/F
: Masks the N-terminal M (can be useful if termini=T
) [True
]
posmask=LIST
: Masks list of position-specific aas, where list = pos1:aas,pos2:aas [2:A
]
aamask=LIST
: Masks list of AAs from all sequences (reduces alphabet) []
qregion=X,Y
: Mask all but the region of the query from (and including) residue X to residue Y [1,-1
]
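To illustrate what a compmask=X,Y rule of "same AA in X+ of Y consecutive aas" can look like in practice, here is a minimal sketch. It anchors a window of Y residues on each position and, if the anchoring residue appears X or more times in that window, masks from the anchor to its X-th copy. Exactly which residues SLiMFinder masks is not stated in this section, so that choice, the function name and the example sequences are assumptions.

```python
def mask_low_complexity(seq, x=5, y=8, mask_char="X"):
    """Mask runs where one residue makes up x or more of y consecutive positions.

    Illustrative rule only: masks from the first to the x-th copy of the
    residue that starts each qualifying window.
    """
    masked = list(seq)
    for start in range(len(seq)):
        window = seq[start:start + y]
        residue = window[0]
        hits = [i for i, aa in enumerate(window) if aa == residue]
        if residue != mask_char and len(hits) >= x:
            end = start + hits[x - 1]
            for i in range(start, end + 1):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("MKAAAAASPL"))         # default 5,8
    print(mask_low_complexity("MKASASASAPL", x=4))   # interrupted repeat
```

Windows are always read from the original sequence, so earlier masking does not change which later windows qualify.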
SLiMBuild Options III (Basic Motif Construction)
termini=T/F
: Whether to add termini characters (^ & $) to search sequences [True
]
minwild=X
: Minimum number of consecutive wildcard positions to allow [0
]
maxwild=X
: Maximum number of consecutive wildcard positions to allow [2
]
slimlen=X
: Maximum length of SLiMs to return (no. non-wildcard positions) [5
]
minocc=X
: Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if < 1) [0.05
]
absmin=X
: Used if minocc<1 to define absolute min. UP occ [3
]
alphahelix=T/F
: Special i, i+3/4, i+7 motif discovery [False
]
fixlen=T/F
: If true, will use maxwild and slimlen to define a fixed total motif length [False
]
palindrome=T/F
: Special DNA mode that will search for palindromic sequences only [False
]
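The interaction between minocc=X and absmin=X can be sketched as follows: values of 1 or more are used directly as a count, while values below 1 are treated as a proportion of the unrelated proteins (UP) and then held to the absolute minimum. Rounding the proportion up is an assumption made for this sketch, as is the function name; the same pattern would apply to ambocc=X and absminamb=X.

```python
import math


def effective_minocc(minocc, absmin, n_upc):
    """Convert a minocc setting into a minimum number of unrelated occurrences."""
    if minocc >= 1:
        return int(minocc)                    # already an absolute count
    proportional = math.ceil(minocc * n_upc)  # assumed rounding: up
    return max(absmin, proportional)          # never below the absolute minimum


if __name__ == "__main__":
    for n_upc in (20, 40, 200):
        print(n_upc, effective_minocc(0.05, 3, n_upc))
```

With the default minocc=0.05 and absmin=3, small datasets are governed by absmin and the proportional threshold only takes over once the dataset is large enough.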
SLiMBuild Options IV (Ambiguity)
ambiguity=T/F
: (preamb=T/F
) Whether to search for ambiguous motifs during motif discovery [True
]
ambocc=X
: Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or > minocc) [0.05
]
absminamb=X
: Used if ambocc<1 to define absolute min. UP occ [2
]
equiv=LIST
: List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST
]
wildvar=T/F
: Whether to allow variable length wildcards [True
]
combamb=T/F
: Whether to search for combined amino acid degeneracy and variable wildcards [False
]
SLiMBuild Options V (Advanced Motif Filtering)
altupc=PATH
: Look for alternative UPC file and filter based on minocc [None
]
musthave=LIST
: Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
query=LIST
: Return only SLiMs that occur in 1+ Query sequences (Name/AccNum) []
focus=FILE
: FILE containing focal groups for SLiM return (see Manual for details) [None
]
focusocc=X
: Motif must appear in X+ focus groups (0 = all) [0
]
- See also rje_slimcalc options for occurrence-based calculations and filtering *
SLiMChance
cloudfix=T/F
: Restrict output to clouds with 1+ fixed motif (recommended) [False
]
slimchance=T/F
: Execute main SLiMFinder probability method and outputs [True
]
sigprime=T/F
: Calculate more precise (but more computationally intensive) statistical model [False
]
sigv=T/F
: Use the more precise (but more computationally intensive) fix to mean UPC probability [False
]
dimfreq=T/F
: Whether to use dimer masking pattern to adjust number of possible sites for motif [True
]
probcut=X
: Probability cut-off for returned motifs (sigcut=X
also recognised) [0.1
]
maskfreq=T/F
: Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [True
]
aafreq=AAFILE
: Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None
]
aadimerfreq=FILE
: Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [None
]
negatives=SEQFILE
: Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [None
]
smearfreq=T/F
: Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False
]
seqocc=T/F
: Whether to upweight for multiple occurrences in same sequence (heuristic) [False
]
probscore=X
: Score to be used for probability cut-off and ranking (Prob/Sig/S/R) [Sig
]
Advanced
Advanced Masking Options I (Conservation Masking)
usegopher=T/F
: Use GOPHER to generate orthologue alignments missing from alndir - see gopher.py options [False
]
fullforce=T/F
: Whether to force regeneration of alignments using GOPHER [False
]
orthdb=FILE
: File to use as source of orthologues for GOPHER []
- See also rje_slimcalc options for more conservation calculation options *
Advanced Output Options I (Output data)
clouds=X
: Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc
/ -1=off
) [2
]
runid=X
: Run ID for resfile (allows multiple runs on same data) [DATE
]
logmask=T/F
: Whether to log the masking of individual sequences [True
]
slimcheck=MOTIFS
: Motif file/list to add to resfile output []
Advanced Output Options II (Output formats)
teiresias=T/F
: Replace TEIRESIAS, making *.out and *.mask.fasta files [False
]
slimdisc=T/F
: Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [False
]
extras=X
: Whether to generate additional output files (alignments etc.) [1
]
- -1 = No output beyond main results file
- 0 = Generate occurrence file
- 1 = Generate occurrence file, alignments and cloud file
- 2 = Generate all additional SLiMFinder outputs
- 3 = Generate SLiMDisc emulation too (equiv extras=2
slimdisc=T
)
targz=T/F
: Whether to tar and zip dataset result files (UNIX only) [False
]
savespace=X
: Delete "unnecessary" files following run (best used with targz): [0
]
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle
- 2 = Delete all bar *.upc (pickle added to tar)
- 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
Advanced Output Options III (Additional Motif Filtering)
topranks=X
: Will only output top X motifs meeting probcut [1000
]
oldscores=T/F
: Whether to also output old SLiMDisc score (S) and SLiMPickings score (R) [False
]
allsig=T/F
: Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [False
]
minic=X
: Minimum information content for returned motifs [2.1
]
- See also rje_slimcalc options for occurrence-based calculations and filtering *
Additional Functions I (MotifSeq)
motifseq=LIST
: Outputs fasta files for a list of X:Y, where X is the pattern and Y is the output file []
slimbuild=T/F
: Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [True
]
Additional Functions II (Randomised datasets)
randomise=T/F
: Randomise UPC within batch files and output new datasets [False
]
randir=PATH
: Output path for creation of randomised datasets [Random/
]
randbase=X
: Base for random dataset name [rand
]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
History Module Version History
# 0.0 - Initial Compilation.
# 1.0 - Preliminary working version with Poisson probabilities
# 1.1 - Binomial probabilities, bonferroni corrections and complexity masking
# 1.2 - Added musthave=LIST option and denferroni correction.
# 1.3 - Added resfile=FILE output
# 1.4 - Added option for termini
# 1.5 - Reworked slim mechanics to be ai-x-aj strings for future ambiguity (split on '-' to make list)
# 1.6 - Added basic ambiguity and flexible wildcards plus MST weighting for UP clusters
# 1.7 - Added counting of generic dimer frequencies for improved Bonferroni and probability calculation (No blockmask.)
# - Added topranks=X and query=X
# 1.8 - Added *.upc rather than *.self.blast. Added basic randomiser function.
# 1.9 - Added MotifList object to handle extra calculations and occurrence filtering.
# 2.0 - Tidied up and standardised output. Implemented extra filtering and scoring options.
# 2.1 - Changed defaults. Removed poisson as option and other obsolete functions.
# 2.2 - Tidied and reorganised code using SLiMBuild/SLiMChance subdivision of labour. Removed rerun=T/F (just Force.)
# 2.3 - Added AAFreq "smear" and "better" p1+ calculation. Added extra cloud summary output.
# 2.4 - Minor bug fixes and tidying. Removed power output. (Rubbish anyway!) Can read UPC from distance matrix.
# 3.0 - Dumped useless stats and calculations. Simplified output. Improved ambiguity & clouds.
# 3.1 - Added minwild and alphahelix options. (Partial aadimerfreq & negatives)
# 3.2 - Tidied up with SLiMCore, replaced old Motif objects with SLiM objects and SLiMCalc.
# 3.3 - Added XGMML output. Added webserver option with additional output.
# 3.4 - Added consmask relative conservation masking.
# 3.5 - Standardised masking options. Add motifmask and motifcull.
# 3.6 - Added aamasking and alphabet.
# 3.7 - Added option to switch off dimfreq and better handling of given aafreq
# 3.8 - Added SLiMDisc & SLiMPickings scores and options to rank on them.
# 3.9 - Added clouding consensus information. [Aborted due to technical challenges.]
# 3.10- Added differentiation of methods for pickling and tarring.
# 4.0 - Added SigPrime and SigV calculation from Norman. Added graded extras output.
# 4.1 - Added SizeSort, AltUPC and NewUPC options. Added #END output for webserver.
# 4.2 - Added fixlen option and improved Alphahelix option
# 4.3 - Updated the output for Max/Min filtering and the pickup options. Removed TempMaxSetting.
# 4.4 - Modified to work with GOPHER V3.0.
# 4.5 - Minor modifications to fix sigV and sigPrime bugs. Modified extras setting. Added palindrome setting for DNA motifs.
# 4.6 - Minor modification to seqocc=T function. !Experimental! Added main occurrence output and modified savespace.
# 4.7 - Added SLiMMaker generation to motif clouds. Added Q and Occ to Chance column.
# 4.8 - Modified cloud generation to avoid issues with flexible-length wildcards.
# 4.9 - Preparation for SLiMFinder V5.0 & SLiMCore V2.0 using newer RJE_Object.
# 5.0 - Converted to use rje_obj.RJE_Object as base. Version 4.9 moved to legacy/.
# 5.1 - Modified SLiMChance slightly to catch missing aafreq.
# 5.1.1 - Modified alphabet handling and fixed musthave bug.
# 5.2.0 - Added PTMList and PTMData modes (dev only).
# 5.2.1 - Fixed ambocc<1 and minocc<1 issue. (Using integers rather than floats.) Fixed OccRes Sig output format.
# 5.2.2 - Added warnings for ambocc and minocc that exceed the absolute minima. Updated docstring.
# 5.2.3 - Switched feature masking OFF by default to give consistent Uniprot versus FASTA behaviour. Fixed FTMask=T/F bug.
# 5.3.0 - Added map and failed outputs for uniprotid=LIST input.
# 5.3.1 - Modified placement of disorder masking warning.
# 5.3.2 - Tweaked REST output format presentation.
# 5.3.3 - Updated resfile to be set by basefile if no resfile=X setting given
# 5.3.4 - Fixed terminal (^/$) musthave bug.
# 5.3.5 - Fixed slimcheck and advanced stats models bug.
# 5.4.0 - Modified qregion=X,Y to be 1-L numbering.
SLiMFinder REST Output formats
SLiMs and SLiMFinder
Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological
systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few
as two sites may be important for activity. SLiMFinder is a SLiM discovery program building on the principles of
the SLiMDisc software for accounting for evolutionary relationships between input proteins. This stops results
being dominated by motifs shared for reasons of history, rather than function. SLiMFinder runs in two phases:
(1) SLiMBuild constructs the motif search space based on number of defined positions, maximum length of "wildcard
spacers" and allowed amino acid ambiguities; (2) SLiMChance assesses the over-representation of all motifs,
correcting for the size of the SLiMBuild search space. This gives SLiMFinder high specificity.
Protein sequences can be masked prior to SLiMBuild. Disorder masking (using IUPred predictions) is highly
recommended. Other masking options are described in the manual and/or literature.
Running SLiMFinder
The standard REST server call for SLiMFinder is in the form:
slimfinder&uniprotid=LIST&dismask=T/F&consmask=T/F
Run with
&rest=docs
for program documentation and options. A plain text version is accessed with
&rest=help
.
&rest=OUTFMT
can be used to retrieve individual parts of the output, matching the tabs in the default
(
&rest=format
) output. Individual
OUTFMT
elements can also be parsed from the full (
&rest=full
) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
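Given the layout above, the full output can be split into its OUTFMT sections with a few lines of code. This is a minimal sketch that assumes each section starts with a `###~...~###` separator line followed by a `# OUTFMT:` header line, exactly as shown; the function name and the example table contents are hypothetical, and real server output may contain details this parser does not handle.

```python
def parse_rest_full(text):
    """Split full REST output into a dict of {OUTFMT: section text}."""
    sections = {}
    current = None
    expect_header = False
    for line in text.splitlines():
        if line.startswith("###~") and line.endswith("~###"):
            expect_header = True       # the next line should name the section
            continue
        if expect_header and line.startswith("# ") and line.rstrip().endswith(":"):
            current = line.rstrip()[2:-1].strip()
            sections[current] = []
            expect_header = False
            continue
        expect_header = False
        if current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}


if __name__ == "__main__":
    demo = "\n".join([
        "###~~~~~~~~###", "# main:", "Dataset,Pattern", "demo,P.LP",
        "###~~~~~~~~###", "# occ:", "Dataset,Seq,Pos",
    ])
    for name, body in parse_rest_full(demo).items():
        print(name, "->", body.splitlines())
```

Any text before the first separator is ignored, so the same function can be applied to output that begins with a banner or log lines.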
More options are available through the SLiMFinder server:
http://www.slimsuite.unsw.edu.au/servers/slimfinder.php
After running, click on the
main
tab to see overall SLiM predictions. If any SLiMs have been predicted, the
occ
tab will have details of which proteins (and where) they occur.
If no SLiMs are returned: [1] Try altering the masking settings. (Disorder masking is recommended. Conservation
masking can sometimes help but it depends on the dataset.) [2] Try relaxing the probability cutoff. Set
probcut=1.0
to see the best motifs, regardless of significance. (You may also want to reduce the
topranks=X
setting.)
Available REST Outputs
main
= Main results table of predicted SLiM patterns (if any) [
extras=-1
]
occ
= Occurrence table showing individual SLiM occurrences in input proteins [
extras=0
]
upc
= List of Unrelated Protein Clusters (UPC) used for evolutionary corrections [
extras=0
]
cloud
= Predicted SLiM "cloud" output, which groups overlapping motifs [
extras=1
]
seqin
= Input sequence data [
extras=-1
]
slimdb
= Parsed input sequences in fasta format, used for UPC generation etc. [
extras=0
]
masked
= Masked input sequences (masked residues marked with
X
) [
extras=1
]
mapping
= Fasta format with positions of SLiM occurrences aligned [
extras=1
]
motifaln
= Fasta format of individual SLiM alignments (unmasked sequences) [
extras=1
]
maskaln
= Fasta format of individual SLiM alignments (masked sequences) [
extras=1
]
Additional REST Outputs [extras>1]
To get additional REST outputs, set
&extras=2
or
&extras=3
. This may increase run times noticeably,
depending on the number of SLiMs returned.
motifs
= SLiM predictions reformatted in plain motif format for CompariMotif [
extras=2
]
compare
= Results of all-by-all CompariMotif search of predicted SLiMs [
extras=2
]
xgmml
= SLiMs, occurrences and motif relationships in a Cytoscape-compatible network [
extras=2
]
dismatrix
= Input sequence distance matrix [
extras=3
]
rank
= Main table in SLiMDisc output format [
extras=3
]
dat.rank
= Occurrence table in SLiMDisc output format [
extras=3
]
teiresias
= Motif prediction output in TEIRESIAS format [
extras=3
teiresias=T
]
teiresias.fasta
= TEIRESIAS masked fasta output [
extras=3
teiresias=T
]
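The extras values listed against each REST output above can be collected into a simple lookup, which is handy when deciding what extras=X setting a given analysis needs. The sketch below reads each bracketed value as the lowest extras level at which that output is produced, which is an interpretation of the lists rather than something stated explicitly; the names `REST_OUTPUTS` and `available_outputs` are hypothetical.

```python
# Lowest extras level for each REST output, transcribed from the lists above.
REST_OUTPUTS = {
    "main": -1, "seqin": -1,
    "occ": 0, "upc": 0, "slimdb": 0,
    "cloud": 1, "masked": 1, "mapping": 1, "motifaln": 1, "maskaln": 1,
    "motifs": 2, "compare": 2, "xgmml": 2,
    "dismatrix": 3, "rank": 3, "dat.rank": 3,
    "teiresias": 3, "teiresias.fasta": 3,
}
# These two outputs are also marked as requiring teiresias=T.
NEEDS_TEIRESIAS = {"teiresias", "teiresias.fasta"}


def available_outputs(extras, teiresias=False):
    """List REST outputs expected for a given extras level."""
    return sorted(
        name for name, level in REST_OUTPUTS.items()
        if level <= extras and (teiresias or name not in NEEDS_TEIRESIAS)
    )


if __name__ == "__main__":
    for level in (-1, 0, 1, 2, 3):
        print(level, available_outputs(level))
```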