Program:	QSLiMFinder
Description:	Query Short Linear Motif Finder
Version:	2.3.0
Last Edit:	24/05/19
Citation:	Palopoli N, Lythgow KT & Edwards RJ. Bioinformatics 2015; doi: 10.1093/bioinformatics/btv155
SLiMFinder:	Edwards, Davey & Shields (2007), PLoS ONE 2(10): e967. [PMID: 17912346]
Webserver:	http://www.slimsuite.unsw.edu.au/servers/qslimfinder.php
Manual:	http://bit.ly/SFManual

Imported modules: rje rje_seq slimfinder rje_slim rje_slimcalc rje_slimcore rje_slimlist

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

QSLiMFinder is a modification of the basic SLiMFinder tool to specifically look for SLiMs shared by a query sequence and one or more additional sequences. To do this, SLiMBuild first identifies all motifs that are present in the query sequence before removing it (and its UPC) from the dataset. The rest of the search and statistics then use the remainder of the dataset, but only consider motifs found in the query. The final correction for multiple testing is made using a motif space defined by the original query sequence, rather than the full potential motif space used by the original SLiMFinder. Removing the query sequence reduces motif support, which increases the probability of the observed support values; this is offset by the smaller motif space, so SLiMs can potentially still be identified with increased significance.

Note that minocc and ambocc values *include* the query sequence, e.g. minocc=2 specifies the query and ONE other UPC.
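
The query restriction described above can be illustrated with a minimal Python sketch. This is not the QSLiMFinder implementation: it only considers two-position motifs (dimers) with up to maxwild wildcards, treats every non-query sequence as its own UPC, and the function names and example sequences are hypothetical.

```python
def dimers(seq, maxwild=2):
    """Return the set of (a, gap, b) dimers in seq with 0..maxwild wildcards."""
    found = set()
    for i, a in enumerate(seq):
        for gap in range(maxwild + 1):
            j = i + gap + 1
            if j < len(seq):
                found.add((a, gap, seq[j]))
    return found


def query_restricted_support(query, others, maxwild=2, minocc=2):
    """Count support only for motifs that occur in the query.

    Support includes the query itself, mirroring the note that minocc=2
    means the query plus ONE other UPC. Assumption: each sequence in
    `others` is an unrelated protein cluster of its own.
    """
    query_motifs = dimers(query, maxwild)
    other_motifs = [dimers(seq, maxwild) for seq in others]
    support = {}
    for motif in query_motifs:
        n = 1 + sum(1 for found in other_motifs if motif in found)
        if n >= minocc:
            support[motif] = n
    return support


def as_pattern(motif):
    """Format a dimer tuple as a pattern string, e.g. ('P', 1, 'L') -> 'P.L'."""
    a, gap, b = motif
    return a + "." * gap + b


# Made-up sequences for illustration only.
query = "MPKLTSPL"
others = ["AAPKLTAA", "GGTSPLGG", "CCCCCCCC"]
hits = query_restricted_support(query, others)
for motif, n in sorted(hits.items()):
    print(as_pattern(motif), n)
```

Motifs that occur only outside the query (such as CC in the third sequence) are never counted, which is what shrinks the motif space used for the multiple testing correction.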

Commandline

Basic Input/Output Options

seqin=FILE : Sequence file to search [None]
batch=LIST : List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
query=LIST : Return only SLiMs that occur in 1+ Query sequences (Name/AccNum/Seq Number) [1]
addquery=FILE : Adds query sequence(s) to batch jobs from FILE [None]
maxseq=X : Maximum number of sequences to process [500]
maxupc=X : Maximum UPC size of dataset to process [0]
sizesort=X : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
walltime=X : Time in hours before program will abort search and exit [1.0]
resfile=FILE : Main QSLiMFinder results table [qslimfinder.csv]
resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [QSLiMFinder/]
buildpath=PATH : Alternative path to look for existing intermediate files [SLiMFinder/]
force=T/F : Force re-running of BLAST, UPC generation and SLiMBuild [False]
pickup=T/F : Pick-up from aborted batch run by identifying datasets in resfile using RunID [False]
dna=T/F : Whether the sequences files are DNA rather than protein [False]
alphabet=LIST : List of characters to include in search (e.g. AAs or NTs) [default AA or NT codes]
megaslim=FILE : Make/use precomputed results for a proteome (FILE) in fasta format [None]
megablam=T/F : Whether to create and use all-by-all GABLAM results for (gablamdis) UPC generation [False]
ptmlist=LIST : List of PTM letters to add to alphabet for analysis and restrict PTM data []
ptmdata=DSVFILE : File containing PTM data, including AccNum, ModType, ModPos, ModAA, ModCode
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

SLiMBuild Options I - Evolutionary Filtering

efilter=T/F : Whether to use evolutionary filter [True]
blastf=T/F : Use BLAST Complexity filter when determining relationships [True]
blaste=X : BLAST e-value threshold for determining relationships [1e-4]
altdis=FILE : Alternative all by all distance matrix for relationships [None]
gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
homcut=X : Max number of homologues to allow (to reduce large multi-domain families) [0]

SLiMBuild Options II - Input Masking

masking=T/F : Master control switch to turn off all masking if False [True]
dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [False]
consmask=T/F : Whether to use relative conservation masking [False]
ftmask=LIST : UniProt features to mask out (True=EM,DOMAIN,TRANSMEM) []
imask=LIST : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
compmask=X,Y : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
casemask=X : Mask Upper or Lower case [None]
motifmask=X : List (or file) of motifs to mask from input sequences []
metmask=T/F : Masks the N-terminal M (can be useful if termini=T) [True]
posmask=LIST : Masks list of position-specific aas, where list = pos1:aas,pos2:aas [2:A]
aamask=LIST : Masks list of AAs from all sequences (reduces alphabet) []
qregion=X,Y : Mask all but the region of the query from (and including) residue X to residue Y (0<L numbering) [1,-1]
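
As an illustration of the compmask=X,Y rule (same AA in X+ of Y consecutive aas), the following Python sketch slides a window of Y residues along the sequence and masks a residue type wherever it fills at least X positions of a window. Which residues are actually replaced by the real masking code is not stated here, so masking every copy of the repeated residue inside the window is an assumption, as are the function name and the use of "X" as the mask character.

```python
from collections import Counter


def mask_low_complexity(seq, x=5, y=8, mask_char="X"):
    """Mask residues that occur x+ times within any window of y residues.

    Assumption: every copy of the over-represented residue inside the
    offending window is replaced by mask_char.
    """
    masked = list(seq)
    # Windows are always read from the original sequence, so earlier
    # masking never changes later residue counts.
    for start in range(0, max(1, len(seq) - y + 1)):
        window = seq[start:start + y]
        for aa, count in Counter(window).items():
            if count >= x and aa != mask_char:
                for offset, residue in enumerate(window):
                    if residue == aa:
                        masked[start + offset] = mask_char
    return "".join(masked)


# Made-up sequence: five Q residues fall within one window of eight.
print(mask_low_complexity("MKQQQQQALT"))
```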

SLiMBuild Options III - Basic Motif Construction

termini=T/F : Whether to add termini characters (^ & $) to search sequences [True]
minwild=X : Minimum number of consecutive wildcard positions to allow [0]
maxwild=X : Maximum number of consecutive wildcard positions to allow [2]
slimlen=X : Maximum length of SLiMs to return (no. non-wildcard positions) [5]
minocc=X : Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if < 1) [0.05]
absmin=X : Used if minocc<1 to define absolute min. UP occ [3]
alphahelix=T/F : Special i, i+3/4, i+7 motif discovery [False]
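
One plausible reading of how minocc and absmin combine is sketched below in Python. The rounding rule (rounding the proportion up to a whole number of UPC) and the function name are assumptions made for illustration; the program itself may round differently.

```python
import math


def effective_minocc(minocc, absmin, n_upc):
    """Resolve minocc to an absolute number of unrelated occurrences.

    Values of 1 or more are taken as absolute counts. Values below 1 are
    treated as a proportion of the n_upc unrelated proteins, never
    dropping below absmin. Rounding up is an assumption of this sketch.
    """
    if minocc >= 1:
        return int(minocc)
    return max(absmin, math.ceil(minocc * n_upc))


print(effective_minocc(0.05, 3, 40))  # small dataset: absmin sets the floor
print(effective_minocc(0.25, 3, 40))  # the proportion exceeds absmin
```

Because minocc includes the query in QSLiMFinder, a resolved value of 3 would correspond to the query plus two other UPC.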

SLiMBuild Options IV - Ambiguity

ambiguity=T/F : (preamb=T/F) Whether to search for ambiguous motifs during motif discovery [True]
ambocc=X : Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or > minocc) [0.05]
absminamb=X : Used if ambocc<1 to define absolute min. UP occ [2]
equiv=LIST : List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
wildvar=T/F : Whether to allow variable length wildcards [True]
combamb=T/F : Whether to search for combined amino acid degeneracy and variable wildcards [False]

SLiMBuild Options V - Advanced Motif Filtering

musthave=LIST : Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
focus=FILE : FILE containing focal groups for SLiM return (see Manual for details) [None]
focusocc=X : Motif must appear in X+ focus groups (0 = all) [0]

See also rje_slimcalc options for occurrence-based calculations and filtering.

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

SLiMChance Options

cloudfix=T/F : Restrict output to clouds with 1+ fixed motif (recommended) [False]
slimchance=T/F : Execute main QSLiMFinder probability method and outputs [True]
sigprime=T/F : Calculate more precise (but more computationally intensive) statistical model [False]
sigv=T/F : Use the more precise (but more computationally intensive) fix to mean UPC probability [False]
qexact=T/F : Calculate exact Query motif space (True) or over-estimate from dimers (False) (quicker) [True]
probcut=X : Probability cut-off for returned motifs [0.1]
maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [False]
aafreq=FILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
aadimerfreq=FILE : Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [None]
negatives=FILE : Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [None]
smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
seqocc=T/F : Whether to upweight for multiple occurrences in same sequence (heuristic) [False]
probscore=X : Score to be used for probability cut-off and ranking (Prob/Sig) [Sig]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

Advanced Output Options I - Output data

clouds=X : Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc / -1=off) [2]
runid=X : Run ID for resfile (allows multiple runs on same data) [DATE:TIME]
logmask=T/F : Whether to log the masking of individual sequences [True]
slimcheck=FILE : Motif file/list to add to resfile output []

Advanced Output Options II - Output formats

teiresias=T/F : Replace TEIRESIAS, making *.out and *.mask.fasta files [False]
slimdisc=T/F : Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [False]
extras=X : Whether to generate additional output files (alignments etc.) [1]
- -1 = No output beyond main results file
- 0 = Generate occurrence file and cloud file
- 1 = Generate occurrence file, alignments and cloud file
- 2 = Generate all additional QSLiMFinder outputs
- 3 = Generate SLiMDisc emulation too (equiv extras=2 slimdisc=T)
targz=T/F : Whether to tar and zip dataset result files (UNIX only) [False]
savespace=X : Delete "unnecessary" files following run (best used with targz): [0]
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle
- 2 = Delete all bar *.upc (pickle added to tar)
- 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)

Advanced Output Options III - Additional Motif Filtering

topranks=X : Will only output top X motifs meeting probcut [1000]
minic=X : Minimum information content for returned motifs [2.1]
allsig=T/F : Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [False]

See also rje_slimcalc options for occurrence-based calculations and filtering.

History

Module Version History

    # 0.0 - Initial Compilation based on SLiMFinder 3.5.
    # 1.0 - Test & Modified to include AA masking.
    # 1.1 - Added sizesort.
    # 1.2 - Added the addquery function.
    # 1.3 - Updated the output for Max/Min filtering and the pickup options.
    # 1.4 - Added additional dictionary and list to store Query dimers and SLiMs for motif space calculations.
    # 1.4 - Added qexact=T/F option for calculating Exact Query motif space (True) or estimating from dimers (False).
    # 1.5 - Implemented SigV calculation. Modified extras setting.
    # 1.6 - Removed excess module imports.
    # 1.7 - Fixed "MustHave=LIST" correction of motif space.
    # 1.8 - Added cloudfix=T/F Restrict output to clouds with 1+ fixed motif (recommended) [False]. Consolidating output.
    # 1.9 - Preparation for QSLiMFinder V2.0 & SLiMCore V2.0 using newer RJE_Object.
    # 2.0 - Converted to use rje_obj.RJE_Object as base. Version 1.9 moved to legacy/.
    # 2.1.0 - Added PTMData and PTMList options.
    # 2.1.1 - Switched feature masking OFF by default to give consistent Uniprot versus FASTA behaviour.
    # 2.2.0 - Added map and failed outputs for uniprotid=LIST input.
    # 2.3.0 - Modified qregion=X,Y to be 1-L numbering.

QSLiMFinder REST Output formats

SLiMs and SLiMFinder

Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological
systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few
as two sites may be important for activity. SLiMFinder is a SLiM discovery program building on the principles of
the SLiMDisc software for accounting for evolutionary relationships between input proteins. This stops results
being dominated by motifs shared for reasons of history, rather than function. SLiMFinder runs in two phases:
(1) SLiMBuild constructs the motif search space based on number of defined positions, maximum length of "wildcard
spacers" and allowed amino acid ambiguities; (2) SLiMChance assesses the over-representation of all motifs,
correcting for the size of the SLiMBuild search space. This gives SLiMFinder high specificity.

Protein sequences can be masked prior to SLiMBuild. Disorder masking (using IUPred predictions) is highly
recommended. Other masking options are described in the manual and/or literature.

Running SLiMFinder

The standard REST server call for SLiMFinder is in the form:
slimfinder&uniprotid=LIST&dismask=T/F&consmask=T/F

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
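
The full output layout shown above can be split into its OUTFMT sections with a short Python sketch such as the following. It assumes each section starts with a "###~~~...~~~###" divider line followed by a "# OUTFMT:" header line; the function name and the example text (including its column names) are made up for illustration and are not real server output.

```python
import re

DIVIDER = re.compile(r"^###~+###\s*$")
HEADER = re.compile(r"^#\s*(\S+):\s*$")


def parse_rest_full(text):
    """Split &rest=full style text into a dict of {OUTFMT: contents}."""
    sections = {}
    current = None
    after_divider = False
    for line in text.splitlines():
        if DIVIDER.match(line):
            after_divider = True
            continue
        # A header only counts when it directly follows a divider line.
        header = HEADER.match(line) if after_divider else None
        after_divider = False
        if header:
            current = header.group(1)
            sections[current] = []
            continue
        if current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


# Made-up example text, not real server output.
example = (
    "###~~~~###\n# main:\nPattern,Sig\nP.L,0.01\n"
    "###~~~~###\n# occ:\nseq1,P.L,3\n"
)
parsed = parse_rest_full(example)
print(sorted(parsed))
```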

More options are available through the SLiMFinder server: http://www.slimsuite.unsw.edu.au/servers/slimfinder.php

After running, click on the main tab to see overall SLiM predictions. If any SLiMs have been predicted, the
occ tab will have details of which proteins (and where) they occur.

If no SLiMs are returned: [1] Try altering the masking settings. (Disorder masking is recommended. Conservation
masking can sometimes help but it depends on the dataset.) [2] Try relaxing the probability cutoff. Set
probcut=1.0 to see the best motifs, regardless of significance. (You may also want to reduce the topranks=X
setting.)

Available REST Outputs

main = Main results table of predicted SLiM patterns (if any) [extras=-1]
occ = Occurrence table showing individual SLiM occurrences in input proteins [extras=0]
upc = List of Unrelated Protein Clusters (UPC) used for evolutionary corrections [extras=0]
cloud = Predicted SLiM "cloud" output, which groups overlapping motifs [extras=1]
seqin = Input sequence data [extras=-1]
slimdb = Parsed input sequences in fasta format, used for UPC generation etc. [extras=0]
masked = Masked input sequences (masked residues marked with X) [extras=1]
mapping = Fasta format with positions of SLiM occurrences aligned [extras=1]
motifaln = Fasta format of individual SLiM alignments (unmasked sequences) [extras=1]
maskaln = Fasta format of individual SLiM alignments (masked sequences) [extras=1]

Additional REST Outputs [extras>1]

To get additional REST outputs, set &extras=2 or &extras=3. This may increase run times noticeably,
depending on the number of SLiMs returned.

motifs = SLiM predictions reformatted in plain motif format for CompariMotif [extras=2]
compare = Results of all-by-all CompariMotif search of predicted SLiMs [extras=2]
xgmml = SLiMs, occurrences and motif relationships in a Cytoscape-compatible network [extras=2]
dismatrix = Input sequence distance matrix [extras=3]
rank = Main table in SLiMDisc output format [extras=3]
dat.rank = Occurrence table in SLiMDisc output format [extras=3]
teiresias = Motif prediction output in TEIRESIAS format [extras=3 teiresias=T]
teiresias.fasta = TEIRESIAS masked fasta output [extras=3 teiresias=T]
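
The two tables above can be summarised as a lookup from extras level to the REST outputs it should make available. The Python sketch below transcribes the bracketed [extras=N] values and reads each one as the minimum extras setting needed for that output, which is an interpretation rather than something stated explicitly; the names REST_OUTPUTS, TEIRESIAS_ONLY and available_outputs are hypothetical.

```python
# Minimum extras level per REST output, transcribed from the tables above.
REST_OUTPUTS = {
    "main": -1, "seqin": -1,
    "occ": 0, "upc": 0, "slimdb": 0,
    "cloud": 1, "masked": 1, "mapping": 1, "motifaln": 1, "maskaln": 1,
    "motifs": 2, "compare": 2, "xgmml": 2,
    "dismatrix": 3, "rank": 3, "dat.rank": 3,
    "teiresias": 3, "teiresias.fasta": 3,
}

# Outputs that are also listed as needing teiresias=T.
TEIRESIAS_ONLY = {"teiresias", "teiresias.fasta"}


def available_outputs(extras, teiresias=False):
    """List REST outputs expected at a given extras level (assumed minimum)."""
    return sorted(
        name
        for name, level in REST_OUTPUTS.items()
        if level <= extras and (teiresias or name not in TEIRESIAS_ONLY)
    )


print(available_outputs(0))
```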
