Function
SLiMSearch is a tool for finding pre-defined SLiMs (Short Linear Motifs) in a protein sequence database. SLiMSearch
can make use of corrections for evolutionary relationships and a variation of the SLiMChance algorithm from
SLiMFinder to assess motifs for statistical over- and under-representation. SLiMSearch is a replacement for PRESTO
and uses many of the same underlying modules.
Benefits of SLiMSearch over many existing tools include:
- searching with mismatches rather than restricting hits to perfect matches.
- optional equivalency files for searching with specific allowed mismatches (e.g. charge conservation).
- generation or reading of alignment files from which to calculate conservation statistics for motif occurrences.
- additional statistics, including protein disorder, surface accessibility and hydrophobicity predictions
- recognition of "n of m" motif elements in the form <X:n:m>, where X is one or more amino acids that must occur n+
times across m positions. E.g. <IL:3:5> must have 3+ Is and/or Ls in a 5aa stretch.
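The "n of m" element can be illustrated with a minimal Python sketch. This is not SLiMSearch's own implementation; the function names `parse_n_of_m` and `find_n_of_m` are hypothetical, and the sketch assumes a simple sliding window over an unmasked sequence.

```python
import re


def parse_n_of_m(element):
    """Parse an '<X:n:m>' element into (set of residues, n, m)."""
    match = re.fullmatch(r"<([A-Z]+):(\d+):(\d+)>", element)
    if match is None:
        raise ValueError(f"Not an n-of-m element: {element!r}")
    residues = set(match.group(1))
    n, m = int(match.group(2)), int(match.group(3))
    if n > m:
        raise ValueError("n cannot exceed m")
    return residues, n, m


def find_n_of_m(sequence, element):
    """Return 0-based starts of every m-residue window holding n+ of the residues."""
    residues, n, m = parse_n_of_m(element)
    hits = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        # Count window positions occupied by any of the permitted residues.
        if sum(aa in residues for aa in window) >= n:
            hits.append(start)
    return hits


print(find_n_of_m("AILKLAA", "<IL:3:5>"))
```

Here the windows starting at positions 0 (AILKL) and 1 (ILKLA) each contain three I/L residues, so both are reported; the window LKLAA contains only two and is not.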
The main output of SLiMSearch is a delimited file of motif/peptide occurrences, but the motifaln=T and proteinaln=T
options also allow output of alignments of motifs and their occurrences. The primary outputs are named *.csv for the
occurrence data and *.summary.csv for the summary data for each motif/dataset pair.
NOTE: SLiMSearch has now been largely superseded by SLiMProb for motif statistics.
Commandline
### Basic Input/Output Options ###
motifs=FILE : File of input motifs/peptides [None]
Single line per motif format = 'Name Sequence #Comments' (Comments are optional and ignored)
Alternative formats include fasta, SLiMDisc output and raw motif lists.
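As a rough illustration of the single-line motif format, the following Python sketch splits one 'Name Sequence #Comments' line into its parts. The function name `parse_motif_line` and the example motif are made up for illustration; the alternative formats (fasta, SLiMDisc output, raw motif lists) are not handled here, and SLiMSearch's own parser may behave differently.

```python
def parse_motif_line(line):
    """Split one 'Name Sequence #Comments' line into (name, sequence).

    Returns None for blank or comment-only lines.
    """
    content = line.split("#", 1)[0].strip()  # comments are optional and ignored
    if not content:
        return None
    fields = content.split()
    if len(fields) != 2:
        raise ValueError(f"Expected 'Name Sequence', got: {line!r}")
    name, sequence = fields
    return name, sequence


# Hypothetical motif line, used only to show the expected shape of the input.
print(parse_motif_line("MotifA R.L.[ST] # illustrative comment"))
```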
seqin=FILE : Sequence file to search [None]
batch=LIST : List of sequence files for batch input (wildcard * permitted) []
maxseq=X : Maximum number of sequences to process [0]
maxsize=X : Maximum dataset size to process in AA (or NT) [100,000]
maxocc=X : Filter out Motifs with more than maximum number of occurrences [0]
walltime=X : Time in hours before program will abort search and exit [1.0]
resfile=FILE : Main SLiMSearch results table [slimsearch.csv]
resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [SLiMSearch/]
buildpath=PATH : Alternative path to look for existing intermediate files [SLiMSearch/]
force=T/F : Force re-running of BLAST, UPC generation and search [False]
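Options are given as name=value pairs, so a run can be assembled programmatically. The sketch below only builds the argument list and does not execute anything. The script name `slimsearch.py`, the input file names and the helper `build_args` are assumptions for illustration; only the option names come from the list above.

```python
def build_args(script, options):
    """Turn a dict of options into a 'name=value' argument list."""
    args = ["python", script]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"  # T/F switches, as in force=T/F
        args.append(f"{name}={value}")
    return args


# Hypothetical script and file names; adjust to the local installation.
args = build_args(
    "slimsearch.py",
    {"motifs": "motifs.txt", "seqin": "proteins.fas", "walltime": 2.0, "force": True},
)
print(" ".join(args))
```

The resulting list could be passed to `subprocess.run(args)` once the script path is confirmed.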
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SearchDB Options I ###
masking=T/F : Master control switch to turn off all masking if False [False]
dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [False]
consmask=T/F : Whether to use relative conservation masking [False]
ftmask=LIST : UniProt features to mask out [EM,DOMAIN,TRANSMEM]
imask=LIST : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
compmask=X,Y : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
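The compmask rule (same AA in X+ of Y consecutive aas) can be sketched as a sliding-window count. This is an illustrative reading only: the function name `mask_low_complexity`, the use of 'X' as the masking character, and the choice to mask just the repeated residue within each qualifying window are all assumptions, and the actual masking performed by SLiMSearch may differ in detail.

```python
from collections import Counter


def mask_low_complexity(sequence, x=5, y=8, mask_char="X"):
    """Mask any residue that fills x or more positions of a y-residue window."""
    masked = list(sequence)
    for start in range(len(sequence) - y + 1):
        window = sequence[start:start + y]
        # Counts use the original sequence, so earlier masking never hides a repeat.
        for aa, count in Counter(window).items():
            if count >= x:
                for offset, residue in enumerate(window):
                    if residue == aa:
                        masked[start + offset] = mask_char
    return "".join(masked)


print(mask_low_complexity("MKQQQQAQQLT"))
```

With the default 5,8 setting, the Q-rich stretch in this example is masked while the flanking residues and the single A are left intact.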
casemask=X : Mask Upper or Lower case [None]
motifmask=X : List (or file) of motifs to mask from input sequences []
metmask=T/F : Masks the N-terminal M [False]
posmask=LIST : Masks list of position-specific aas, where list = pos1:aas,pos2:aas [2:A]
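A minimal sketch of how the posmask list (pos1:aas,pos2:aas) and metmask might act on a sequence is given below. The function names are hypothetical, and the sketch assumes 1-based positions and 'X' as the masking character; neither is stated in the option descriptions above.

```python
def parse_posmask(spec):
    """Parse 'pos1:aas,pos2:aas' into {position: set of residues}."""
    rules = {}
    for item in spec.split(","):
        if not item.strip():
            continue  # tolerate an empty list
        pos, aas = item.split(":")
        rules[int(pos)] = set(aas)
    return rules


def apply_posmask(sequence, spec="2:A", metmask=False, mask_char="X"):
    """Mask listed residues at listed positions (assumed 1-based)."""
    masked = list(sequence)
    if metmask and masked and masked[0] == "M":
        masked[0] = mask_char  # mask the N-terminal methionine
    for pos, aas in parse_posmask(spec).items():
        index = pos - 1
        if 0 <= index < len(masked) and sequence[index] in aas:
            masked[index] = mask_char
    return "".join(masked)


print(apply_posmask("MASKL"))                # default 2:A rule only
print(apply_posmask("MASKL", metmask=True))  # also mask the N-terminal M
```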
aamask=LIST : Masks list of AAs from all sequences (reduces alphabet) []
### SearchDB Options II ###
efilter=T/F : Whether to use evolutionary filter [False]
blastf=T/F : Use BLAST Complexity filter when determining relationships [True]
blaste=X : BLAST e-value threshold for determining relationships [1e-4]
altdis=FILE : Alternative all by all distance matrix for relationships [None]
gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
occupc=T/F : Whether to output the UPC ID number in the occurrence output file [False]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SLiMChance Options ###
maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [True]
aafreq=FILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
aadimerfreq=FILE : Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) [None]
negatives=FILE : Multiply raw probabilities by under-representation in FILE [None]
background=FILE : Use observed support in background file for over-representation calculations [None]
smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
seqocc=X : Restrict to sequences with X+ occurrences (adjust for high frequency SLiMs) [1]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Output Options ###
extras=X : Whether to generate additional output files (alignments etc.) [1]
- 0 = No output beyond main results file
- 1 = Generate additional outputs (alignments etc.)
pickle=T/F : Whether to save/use pickles [True]
targz=T/F : Whether to tar and zip dataset result files (UNIX only) [False]
savespace=0 : Delete "unnecessary" files following run (best used with targz): [0]
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle files
- 2 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
See also rje_slimcalc options for occurrence-based calculations and filtering.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
History
Module Version History
# 0.0 - Initial Compilation.
# 1.0 - Standardised masking options. Still not fully tested.
# 1.1 - Added background=FILE option for determining mean(p1+) for SLiMs based on background file.
# 1.2 - Added maxsize option.
# 1.3 - Add aamask option (and alphabet)
# 1.4 - Fixed zero-size UPC bug.
# 1.5 - Add MaxOcc setting.
# 1.6 - Minor tweaks to Log output. Add option for UPC number in occ output.
# 1.7 - Modified to work with GOPHER V3.0.
# 1.7.1 - Minor modification to docstring. Preparation for update to SLiMSearch 2.0 optimised for proteome searches.