This program reads in a set of sequences and generates a regular expression motif
from them. It is designed with protein sequences in mind but should work for DNA sequences too. Input sequences can
be in fasta format or just plain text (with no sequence headers) and should be aligned already. If varlength=F,
gapped positions will be ignored (treated as Xs) and variable-length wildcards are not returned. If varlength=T,
gapped positions will be assessed based on the ungapped peptides at that position and a variable-length position inserted.
This variable-length position may be a wildcard or it may be a defined position if there is sufficient signal in the
peptides with amino acids at that position.
SLiMMaker considers each column of the input in turn and compresses it into a regular expression element according to
some simple rules, screening out rare amino acids and converting particularly degenerate positions into wildcards.
Each amino acid in the column that occurs at least X times (as defined by minseq=X) is considered for the regular
expression definition for that position. The full set of amino acids meeting this criterion is then assessed for
whether to keep it as a defined position, or convert into a wildcard. First, if the number of different amino acids
meeting this criterion is zero or above a second threshold (maxaa=X), the position is defined as a wildcard. Second,
the proportion of input sequences matching the amino acid set is compared to a minimum frequency criterion
(minfreq=X). Failing to meet this minimum frequency will again result in a wildcard. Otherwise, the amino acid set is
added to the SLiM definition as either a fixed position (if only one amino acid met the minseq criterion) or as a
degenerate position. Finally, leading and trailing wildcards are removed.
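The column rules above can be sketched in a few lines of Python. This is a minimal illustration and not the SLiMMaker source: the function names column_element and make_motif are hypothetical, gaps are simply ignored (no variable-length handling), and the frequency is assumed to be calculated over all input sequences.

```python
from collections import Counter

def column_element(column, minseq=3, minfreq=0.75, maxaa=5, ignore="X-"):
    """Compress one alignment column into a regular expression element (sketch)."""
    counts = Counter(aa for aa in column if aa not in ignore)
    # Keep only amino acids seen in at least minseq sequences.
    kept = sorted(aa for aa, n in counts.items() if n >= minseq)
    # No amino acids kept, or too many different ones: wildcard.
    if not kept or len(kept) > maxaa:
        return "."
    # Assumption: frequency is the share of ALL input sequences matching the kept set.
    covered = sum(counts[aa] for aa in kept) / len(column)
    if covered < minfreq:
        return "."
    # One amino acid gives a fixed position, several give a degenerate position.
    return kept[0] if len(kept) == 1 else "[" + "".join(kept) + "]"

def make_motif(peptides, **kwargs):
    """Build a motif from equal-length aligned peptides (sketch)."""
    elements = [column_element(col, **kwargs) for col in zip(*peptides)]
    # Remove leading and trailing wildcards.
    while elements and elements[0] == ".":
        elements.pop(0)
    while elements and elements[-1] == ".":
        elements.pop()
    return "".join(elements)

peptides = ["AQKTLS", "GQRTLS", "SQKSLT", "PQRTIS"]
print(make_motif(peptides))            # default thresholds
print(make_motif(peptides, minseq=2))  # lower minseq admits the K/R column
```

With only four peptides, lowering minseq from 3 to 2 is what allows the third column (two K, two R) to become a degenerate position rather than a wildcard.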
By default, each defined position in a motif will contain amino acids that (a) occur in at least three sequences
each, (b) have a combined frequency of >=75%, and (c) have 5 or fewer different amino acids (that occur in 3+
sequences). The same minseq=X threshold is also used to determine whether flexible length *defined* positions are
generated (varlength=T), i.e. to have a flexible-length non-wildcard position, at least minseq sequences must
have a gap at that position. This does not apply to flexible-length wildcards.
Note. Unless the "iterate" function is used, the final motif only contains defined positions that match a given
frequency of the input (75% by default). Because positions are considered independently, however, the final motif
might occur in fewer than 75% of the input sequences. SLiMSearch can be used to check the occurrence stats.
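The effect described in this note can be checked directly with a regular expression search. The snippet below is a small sketch using Python's standard re module; the helper name motif_coverage and the toy peptides are made up for illustration, and it does not replace the occurrence statistics SLiMSearch provides.

```python
import re

def motif_coverage(motif, peptides):
    """Return the fraction of peptides containing the motif, plus the matching peptides."""
    pattern = re.compile(motif)
    hits = [pep for pep in peptides if pattern.search(pep)]
    return len(hits) / len(peptides), hits

# Each column is 75% conserved (Q, T and L each occur in 3 of 4 peptides),
# but the mismatches fall in different peptides.
peptides = ["ATL", "QAL", "QTA", "QTL"]
coverage, hits = motif_coverage("QTL", peptides)
print(coverage, hits)
```

Here every position individually meets a 75% frequency threshold, yet only one of the four peptides matches the combined motif.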
Version 1.5.0 incorporates a new peptide alignment mode to deal with unaligned peptides. This is controlled by the
peptalign=T/F/X option, which is set to True by default. If given a regular expression, this will be used to guide
the alignment. Otherwise, the longest peptides will be used as a guide and the minimum number of gaps added to
shorter peptides. PeptCluster peptide distance measures are used to assess different variants, starting with simple
sequence identity, then amino acid properties (if ties) and finally PAM distances. One of the latter can be set as
the priority using peptdis=X. Peptide alignment assumes that peptides have termini (^ & $) or flanking wildcards
added. If not, set
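As a rough illustration of the regex-free mode, the sketch below pads a shorter peptide with the minimum number of gaps needed to reach the guide length and keeps the variant with the highest sequence identity to the guide. It is an assumption-laden simplification, not the PeptCluster implementation: the function names are hypothetical, only simple identity is scored (no amino acid property or PAM tie-breaking), a single guide peptide is used, and termini or flanking wildcards are not modelled.

```python
from itertools import combinations

def identity(variant, guide):
    """Count positions where the gapped variant and the guide share a residue."""
    return sum(1 for a, b in zip(variant, guide) if a == b and a != "-")

def gapped_variants(peptide, length):
    """Yield every way of padding peptide to the given length with '-' characters."""
    n_gaps = length - len(peptide)  # assumes the peptide is no longer than the guide
    for gap_positions in combinations(range(length), n_gaps):
        residues = iter(peptide)
        yield "".join("-" if i in gap_positions else next(residues)
                      for i in range(length))

def align_to_guide(peptide, guide):
    """Pick the gapped variant with the highest identity to the guide (first wins ties)."""
    return max(gapped_variants(peptide, len(guide)),
               key=lambda variant: identity(variant, guide))

print(align_to_guide("QTLS", "QKTLS"))
```

Enumerating every gap placement is only practical for short peptides with few gaps, which is the case this sketch assumes.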
Version 1.6.0 added the option to incorporate amino acid equivalencies to extend motif sites beyond the top X% of
amino acids. This works by identifying a degenerate set of amino acids as normal using minseq=X and then checking
whether these form a subset of an equivalence group prior to the minfreq=X filter. If so, it will try extending the
degenerate position to incorporate additional members of the equivalence group. For example, IL could incorporate
MVF amino acids of an FILMV group. Only amino acids represented in the peptides will be added. Single
amino acids will also be extended, e.g. S could be extended to
ST. This mode is switched on with extendaa=T. The equiv=LIST option sets the equivalence groups.
If two or more equivalence groups could be extended, the one with the most members will be chosen. If tied, the one
with fewest possible amino acids (from equiv=LIST) will be chosen. If still tied, the first group in the list will
be chosen.
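The extension and tie-breaking rules can be sketched as follows. This is an illustrative reading of the description, not the SLiMMaker code: the function name extend_position is hypothetical, "most members" is assumed to mean the largest extended set actually observed in the column, and "fewest possible amino acids" is assumed to mean the smallest equivalence group. The group list is the default given in the version history.

```python
DEFAULT_EQUIV = ["AGS", "ILMVF", "FYW", "FYH", "KRH", "DE", "ST"]

def extend_position(kept, column, equiv=DEFAULT_EQUIV):
    """Extend the amino acids kept by minseq using equivalence groups (sketch)."""
    kept = set(kept)
    observed = set(column)
    best_key, best_set = None, kept
    for group in equiv:
        members = set(group)
        # The kept amino acids must form a subset of the group.
        if not kept <= members:
            continue
        # Only amino acids represented in the peptides are added.
        extended = kept | (members & observed)
        if extended == kept:
            continue
        # Prefer most members, then smallest group; strict '<' keeps the first group on ties.
        key = (-len(extended), len(group))
        if best_key is None or key < best_key:
            best_key, best_set = key, extended
    return "".join(sorted(best_set))

print(extend_position("IL", "ILILMVLI"))  # IL extended within ILMVF
print(extend_position("S", "SSSTTA"))     # AGS and ST tie on members; ST is the smaller group
```

The extended set would then go through the minfreq=X filter as usual, which this sketch leaves out.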
peptides=LIST : These can be entered as a list or a file. If a file, lines following '#' or '>' are ignored
peptalign=T/F/X : Align peptides. If given a regular expression, it is used as the guide, else T/True for regex-free alignment. [True]
minseq=X : Min. no. of sequences for an aa to be included [3]
minfreq=X : Min. combined freq of accepted aa to avoid wildcard [0.75]
maxaa=X : Max. no. different amino acids for one position [5]
ignore=X : Amino acid(s) to ignore. (If nucleotide, would be N-)
dna=T/F : Whether "peptides" are actually DNA fragments
iterate=T/F : Whether to perform iterative SLiMMaker, re-running on matched peptides with each iteration
varlength=T/F : Whether to identify gaps in aligned peptides and generate variable length motif
extendaa=T/F : Whether to extend ambiguous aa using equivalence list
equiv=LIST : List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
See also rje.py generic commandline options.
Module Version History
# 0.0 - Initial Compilation.
# 1.0 - Initial Working Version. Some minor modifications for SLiMBench including iterative SLiMMaker.
# 1.1 - Modified to work with end of line characters.
# 1.2.0 - Modified to work with REST servers.
# 1.3.0 - Added varlength option to identify gaps in aligned peptides and generate variable length motif.
# 1.3.1 - Fixed varlength option to work with end of peptide gaps. (Gaps ignored completely - should not be there!)
# 1.4.0 - Add iteration REST output.
# 1.4.1 - Add unmatched peptides REST output.
# 1.4.2 - Fixed bug with variable length wildcards at start of sequence.
# 1.5.0 - Added peptalign=X functionality, using PeptCluster peptide alignment.
# 1.6.0 - Added equiv=LIST : List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
# 1.6.1 - Fixed peptide case bug.
Enter sequences and click "Make SLiM". Sequences can be raw sequences or fasta format.
(Example sequences are LIG_PCNA_PIPBox_1 ELM occurrences.)
In place of peptides, an ELM Class can also be entered into the box. See the REST aliases page for more details.