SLiMSuite REST Server



PRESTO V5.0

Protein Regular Expression Search Tool

Program: PRESTO
Description: Protein Regular Expression Search Tool
Version: 5.0
Last Edit: 22/01/07
Note: This program has been superseded in most functions by SLiMSearch.

Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice


Imported modules: rje rje_motif_V3 rje_motif_stats rje_scoring rje_seq rje_sequence rje_motiflist


See SLiMSuite Blog for further documentation. See rje for general commands.

Function

PRESTO is what the acronym suggests: a search tool for searching proteins with peptide sequences or motifs using an algorithm based on Regular Expressions. The simple input and output formats and ease of use on local databases make PRESTO a useful alternative to web resources for high throughput studies.

PRESTO offers several features that many comparable tools lack:

  • Conservation statistics for motif occurrences, calculated from supplied alignment files.
  • Searching with mismatches rather than restricting hits to perfect matches.
  • Additional statistics, including protein disorder, surface accessibility and hydrophobicity predictions.
  • Production of separate fasta files containing the proteins hit by each motif.
  • Production of both UniProt format results and delimited text results for easy incorporation into other applications.
  • Inbuilt tandem Mass Spec ambiguities.
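
As a rough illustration of mismatch searching for a fixed peptide, the sketch below slides the peptide along a sequence and keeps windows with no more than a set number of mismatches. The function name and the (position, mismatches) return format are hypothetical and are not taken from the PRESTO source, which works with regular expressions and also handles degenerate motifs.

```python
def mismatch_hits(peptide, sequence, max_mismatch):
    """Return (1-based start, mismatch count) for each window within the mismatch limit."""
    hits = []
    for start in range(len(sequence) - len(peptide) + 1):
        window = sequence[start:start + len(peptide)]
        # Count positions where the window differs from the peptide.
        mismatches = sum(a != b for a, b in zip(peptide, window))
        if mismatches <= max_mismatch:
            hits.append((start + 1, mismatches))
    return hits


if __name__ == "__main__":
    print(mismatch_hits("PKLP", "AAPKIPAA", 1))
```

Allowing one mismatch lets PKLP match the PKIP stretch in this toy sequence; with max_mismatch=0 the same search returns no hits.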

PRESTO recognises "n of m" motif elements in the form <X:n:m>, where X is one or more amino acids that must occur n+ times across m positions. E.g. <IL:3:5> must have 3+ Is and/or Ls in a 5aa stretch.
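
The "n of m" rule can be sketched in a few lines of Python. This is a minimal illustration of the rule as described above, not PRESTO's own implementation; the helper names parse_element and n_of_m_matches are hypothetical.

```python
import re


def parse_element(element):
    """Split an element such as '<IL:3:5>' into (residues, n, m)."""
    match = re.fullmatch(r"<([A-Z]+):(\d+):(\d+)>", element)
    if match is None:
        raise ValueError(f"Not an n-of-m element: {element}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def n_of_m_matches(sequence, residues, n, m):
    """Return 0-based starts of m-residue windows containing n+ of the given residues."""
    allowed = set(residues)
    hits = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        if sum(aa in allowed for aa in window) >= n:
            hits.append(start)
    return hits


if __name__ == "__main__":
    residues, n, m = parse_element("<IL:3:5>")
    print(n_of_m_matches("AILKLAAAA", residues, n, m))
```

In the toy sequence AILKLAAAA, the windows AILKL and ILKLA each hold three I/L residues, so both satisfy <IL:3:5>.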

Main output for PRESTO is a delimited file of motif/peptide occurrences, but the motifaln=T and proteinaln=T options also allow output of alignments of motifs and their occurrences. PRESTO has an additional motinfo=FILE output, which produces a summary table of the input motifs, including Expected values if searchdb is given and information content if motifIC=T. Hit proteins can also be output in fasta format (fasout=T) or UniProt format with occurrences as features (uniprot=T).

Release Notes

Expectation scores have been modified since PRESTO Version 1.x. In addition to the expectation score for the number
of occurrences of a given motif (given the number of mismatches) in the entire dataset ("EXPECT"), there is now an
estimation of the probability of the observed number of occurrences, derived from a Poisson distribution, which is
output in the log file ("#PROB"). Furthermore, these values are now also calculated per sequence individually
("SEQ_EXP" and "SEQ_PROB").
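
The relationship between an expectation and a Poisson-derived probability can be sketched as follows. The notes above do not state which Poisson probability is used, so this sketch assumes the upper tail, i.e. the probability of seeing at least the observed number of occurrences given the expected number. The function name is hypothetical.

```python
import math


def poisson_prob_at_least(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected); an assumed reading of the probability."""
    if observed <= 0:
        return 1.0
    # Sum P(X = k) for k below the observed count, then take the complement.
    cumulative = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - cumulative)


if __name__ == "__main__":
    # Two observed occurrences where 0.5 were expected.
    print(poisson_prob_at_least(2, 0.5))
```

The same function would apply to dataset-wide values or to per-sequence values, provided the expectation passed in matches the scope of the observed count.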

Note on MS-MS mode: The old Perl version of PRESTO had a handy MS-MS mode for searching peptides sequenced from tandem
mass-spec data. (In this mode [msms=T], amino acids of equal mass (Leu-Ile [LI], Gln-Lys [QK], MetO-Phe [MF]) are
automatically placed as possible variants and additional output columns give information on predicted tryptic fragment
masses etc.) Implementation of MS-MS mode was started in this version but discontinued due to lack of demand. As a
result, extra tryptic fragment data is not produced. If you would like to use it, contact me at richard.edwards@ucd.ie
and I will finish implementing it.
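
The equal-mass substitution described in this note can be illustrated by expanding a peptide into a regular expression. This is a sketch under assumptions: the mapping uses only the three pairs listed above, peptides are assumed to be plain upper-case amino acid letters, and the function name is hypothetical rather than part of PRESTO.

```python
import re

# Equal-mass groups from the note above; MetO-Phe is written [MF] there.
EQUAL_MASS = {"L": "LI", "I": "LI", "Q": "QK", "K": "QK", "M": "MF", "F": "MF"}


def msms_pattern(peptide):
    """Replace each ambiguous residue with a character class of its equal-mass variants."""
    return "".join(
        f"[{EQUAL_MASS[aa]}]" if aa in EQUAL_MASS else aa for aa in peptide
    )


if __name__ == "__main__":
    pattern = msms_pattern("ALQ")
    print(pattern, bool(re.search(pattern, "GGAIKGG")))
```

Here the sequenced peptide ALQ becomes A[LI][QK], which also matches AIK in the target sequence.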

Note for compare=T mode: This is still fully functional but main documentation has been moved to comparimotif.py.

!!!NEW!!! for version 3.7, PRESTO has an additional domfilter=FILE option. This is quite crude and will read in domains
to be filtered from the FILE given. This file MUST be tab-delimited and must have at least three columns, with headers
'Name', 'Start' and 'Stop', where 'Name' matches the short name of the Hit and 'Start' and 'Stop' are the positions of the
domain 1-N. This will output two additional columns, plus a further two if iupred=T:

  • DOM_MASK = Gives the motif a score of the length of the domain if it would be masked out by masking domains or 0 if not

  • DOM_PROP = Gives the proportion of motif positions in a domain

  • DOM_DIS = Gives the motif the mean disorder score for the *domain* if in the domain, else 1.0 if not

  • DOM_COMB = Gives positions in the domain the mean disorder score for the domain, else they keep their own scores
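
As an illustration of the domain file layout and the DOM_PROP column, the sketch below reads a tab-delimited table and computes the proportion of motif positions that fall inside a domain. It assumes the position headers are 'Start' and 'Stop' and that positions are 1-based and inclusive; the function names and the example data are hypothetical.

```python
import csv
import io


def read_domains(handle):
    """Read tab-delimited rows into {name: [(start, stop), ...]}."""
    domains = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        domains.setdefault(row["Name"], []).append(
            (int(row["Start"]), int(row["Stop"]))
        )
    return domains


def dom_prop(hit_name, motif_start, motif_end, domains):
    """Proportion of motif positions (1-based, inclusive) inside any listed domain."""
    positions = range(motif_start, motif_end + 1)
    covered = sum(
        any(start <= pos <= stop for start, stop in domains.get(hit_name, []))
        for pos in positions
    )
    return covered / len(positions)


# Made-up example: one domain at positions 10-20 of a hit called protA.
example = io.StringIO("Name\tStart\tStop\nprotA\t10\t20\n")
domains = read_domains(example)

if __name__ == "__main__":
    print(dom_prop("protA", 18, 23, domains))
```

A motif spanning positions 18-23 overlaps this domain at three of its six positions, giving a proportion of 0.5.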

!!!NEW!!! for version 4.0, PRESTO has a Peptide design mode (peptides=T), using winsize=X to set size of peptides around
occurrences. This will output peptide sequences into a fasta file and additional columns to the main PRESTO output file:

  • PEP_SEQ = Sequence of peptide

  • PEP_DESIGN = Peptide design comments. "OK" if all looking good, else warnings about bad AA combos (DP, DC, DG, NG, NS or PP)
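
A minimal sketch of the peptide design check follows. It assumes that winsize is the number of flanking residues added on each side of the occurrence, that positions are 1-based and inclusive, and that warnings simply list the offending pairs; the real PEP_DESIGN text and windowing may differ. Both function names are hypothetical.

```python
BAD_PAIRS = ("DP", "DC", "DG", "NG", "NS", "PP")


def peptide_window(sequence, start, end, winsize):
    """Return the occurrence (1-based, inclusive) plus winsize flanking residues per side."""
    return sequence[max(0, start - 1 - winsize):end + winsize]


def peptide_design(peptide):
    """Return 'OK' or a warning naming each problematic residue pair found."""
    found = [pair for pair in BAD_PAIRS if pair in peptide]
    return "OK" if not found else "Warning: " + ",".join(found)


if __name__ == "__main__":
    peptide = peptide_window("ABCDEFGHIJ", 4, 5, 2)
    print(peptide, peptide_design(peptide))
```

Flanks are truncated at the sequence ends, so occurrences near a terminus yield shorter peptides in this sketch.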

Development Notes

Main output is now determined by outfile=X and/or basefile=X, which will set the self.info['Basefile'] attribute,
using standard rje module commands. If it is not set (i.e. is '' or 'None'), it will be generated using the motif and
searchdb files as with the old PRESTO. Main search output will use this file leader and add the appropriate extension based
on the output type and delimiter:

  • resfile = Main PRESTO search = *.presto.tdt

  • motifaln = Produce fasta files of local motif alignments *.motifaln.fas

  • uniprot = Output of hits as a uniprot format file = *.uniprot.presto

  • motinfo = Motif summary table = *.motinfo.tdt

  • ftout = Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [*.features.tdt]

  • peptides = Peptides designed around motifs = *.peptides.fas

Other special outputs generate their names from protein and/or motif names, using the root PATH of basefile
(e.g. the PATH will be stripped and ProteinAln/ or HitFas/ directories made for output):

  • proteinaln=T/F : Search for alignments of proteins containing motifs and produce new file containing motifs [False]

  • fasout=T/F : Whether to output hit sequences as a fasta format file motif.fas [False]

Reformatting and outputting motifs requires a file name to be given:

  • motifout=FILE : Filename for output of reformatted (and filtered?) motifs in PRESTO format [None]

PRESTO Commands

## Basic Input Parameters ##
motifs=FILE : File of input motifs/peptides [None]
Single line per motif format = 'Name Sequence #Comments' (Comments are optional and ignored)
Alternative formats include fasta, SLiMDisc output and raw motif lists.
minpep=X : Min length of motif/peptide X aa [2]
minfix=X : Min number of fixed positions for a motif to contain [0]
minic=X : Min information content for a motif (1 fixed position = 1.0) [2.0]
trimx=T/F : Trims Xs from the ends of a motif [False]
nrmotif=T/F : Whether to remove redundancy in input motifs [False]
searchdb=FILE : Protein Fasta file to search (or second motif file to compare) [None]
xpad=X : Adds X additional Xs to the flanks of the motif (after trimx if trimx=T) [0]
xpaddb=X : Adds X additional Xs to the flanks of the search database sequences (will mess up alignments) [0]
minimotif=T/F : Input file is in minimotif format and will be reformatted (PRESTO File format only) [False]
goodmotif=LIST : List of text to match in Motif names to keep (can have wildcards) []
## Basic Output Parameters ##
outfile=X : Base name of results files, e.g. X.presto.tdt. [motifsFILE-searchdbFILE.presto.tdt]
expect=T/F : Whether to give crude expect values based on AA frequencies [True]
nohits=T/F : Save list of sequence IDs without motif hits to *.nohits.txt. [False]
useres=T/F : Whether to append existing results to *.presto.txt and *.nohits.txt (continuing after last sequence)
and/or use existing results to search for conservation in alignments if usealn=T. [False]
mysql=T/F : Output results in mySQL format - lower case headers and no spaces [False]
hitname=X : Format for Hit Name: full/short/accnum [short]
fasout=T/F : Whether to output hit sequences as a fasta format file motif.fas [False]
datout=T/F : Whether to output hits as a uniprot format file *.uniprot.presto [False]
motinfo=T/F : Whether to output motif summary table *.motinfo.tdt [None]
motifout=FILE : Filename for output of reformatted (and filtered?) motifs in PRESTO format [None]
## Advanced Output Options ##
winsa=X : Number of aa to extend Surface Accessibility calculation either side of motif [0]
winhyd=X : Number of aa to extend Eisenberg Hydrophobicity calculation either side of motif [0]
windis=X : Extend disorder statistic X aa either side of motif (use flanks *only* if negative) [0]
winchg=X : Extend charge calculations (if any) to X aa either side of motif [0]
winsize=X : Sets all of the above window sizes (use flanks *only* if negative) [0]
slimchg=T/F : Calculate Absolute, Net and Balance charge statistics (above) for occurrences [False]
iupred=T/F : Run IUPred disorder prediction [False]
foldindex=T/F : Run FoldIndex disorder prediction [False]
iucut=X : Cut-off for IUPred results (0.0 will report mean IUPred score) [0.0]
iumethod=X : IUPred method to use (long/short) [short]
iupath=PATH : The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
domfilter=FILE : Use the DomFilter options, reading domains from FILE [None]
runid=X : Adds an additional Run_ID column identifying the run (for multiple appended runs) [None]
restrict=LIST : List of files containing instances (hit,start,end) to output (only) []
exclude=LIST : List of files containing instances (hit,start,end) to exclude []
peptides=T/F : Peptide design mode, using winsize=X to set size of peptides around motif [False]
newscore=LIST : Lists of X:Y, create a new statistic X, where Y is the formula of the score. []
## Basic Search Parameters ##
mismatch=X,Y : Peptide must be >= Y aa for X mismatches
ambcut=X : Cut-off for max number of choices in ambiguous position to be shown as variant [10]
expcut=X : The maximum number of expected occurrences allowed to still search with motif [0] (if -ve, per seq)
alphabet=X,Y,.. : List of letters in alphabet of interest [AAs]
reverse=T/F : Reverse the motifs - good for generating a test comparison data set [False]
*** No longer outputs *.rev.txt - use motifout=X instead! ***

msms=T/F : Whether searching Tandem Mass Spec peptides [False]
ranking=T/F : Whether to rank hits by their rating in MSMS mode [False]
memsaver=T/F : Whether to store all results in Objects (False) or clear as search proceeds (True) [True]
startfrom=X : Accession Number / ID to start from. (Enables restart after crash.) [None]
## Conservation Parameters ##
usealn=T/F : Whether to search for and use alignments where present. [False]
gopher=T/F : Use GOPHER to generate missing orthologue alignments in alndir - see gopher.py options [False]
fullforce=T/F : Force GOPHER to re-run even if alignment exists [False]
alndir=PATH : Path to alignments of proteins containing motifs [./] * Use forward slashes (/)
alnext=X : File extension of alignment files, accnum.X [aln.fas]
alngap=T/F : Whether to count proteins in alignments that have 100% gaps over motif (True) or (False) ignore
as putative sequence fragments [False] (NB. All X regions are ignored as sequence errors.)
conspec=LIST : List of species codes for conservation analysis. Can be name of file containing list. [None]
conscore=X : Type of conservation score used: [pos]
- abs = absolute conservation of motif using RegExp over matched region
- pos = positional conservation: each position treated independently
- prop = conservation of amino acid properties
- all = all three methods for comparison purposes
consamb=T/F : Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
consinfo=T/F : Weight positions by information content (does nothing for conscore=abs) [True]
consweight=X : Weight given to global percentage identity for conservation, giving more weight to closer sequences [0]
- 0 gives equal weighting to all. Negative values will upweight distant sequences.
posmatrix=FILE : Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) [None]
aaprop=FILE : Amino Acid property matrix file. [aaprop.txt]
consout=T/F : Outputs an additional result field containing information on the conservation score used [False]
## Additional Output for Extracted Motifs ##
motific=T/F : Output Information Content for motifs [False]
motifaln=T/F : Produce fasta files of local motif alignments [False]
proteinaln=T/F : Search for alignments of proteins containing motifs and produce new file containing motifs [False]
protalndir=PATH : Directory name for output of protein alignments [ProteinAln/]
flanksize=X : Size of sequence flanks for motifs [30]
xdivide=X : Size of dividing Xs between motifs [10]
ftout=T/F : Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [*.features.tdt]
unipaths=LIST : List of additional paths containing uniprot.index files from which to look for and extract features ['']
statfilter=LIST : List of stats to filter (*discard* occurrences) on, consisting of X*Y where:
- X is an output stat (the column header),
- * is an operator in the list >, >=, !=, =, <=, < !!! Remember to enclose in "quotes" for <> !!!
- Y is a value that X must have, assessed using *.
This filtering is crude and may behave strangely if X is not a numerical stat!
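
The X*Y filter format can be illustrated with a small parser. This is only a sketch of the format described above, not the SLiMSuite code: it handles numeric values only, the operator set (including <=) is an assumption, and it treats an occurrence as discarded when any filter condition is true, in line with the "*discard* occurrences" wording. The function names and example stat names are hypothetical.

```python
import operator
import re

# Two-character operators are listed first so that '>=' is not read as '>'.
OPERATORS = {
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
}


def parse_filter(text):
    """Split a filter such as 'SEQ_EXP>=0.5' into (stat, comparison function, value)."""
    match = re.fullmatch(
        r"\s*(\w+?)\s*(>=|<=|!=|>|<|=)\s*(-?\d+(?:\.\d+)?)\s*", text
    )
    if match is None:
        raise ValueError(f"Cannot parse filter: {text}")
    stat, symbol, value = match.groups()
    return stat, OPERATORS[symbol], float(value)


def discard(occurrence, filters):
    """True if any filter condition holds for this occurrence (a dict of column values)."""
    return any(
        compare(float(occurrence[stat]), value) for stat, compare, value in filters
    )


if __name__ == "__main__":
    filters = [parse_filter("SEQ_EXP>=0.5")]
    print(discard({"SEQ_EXP": "0.7"}, filters), discard({"SEQ_EXP": "0.1"}, filters))
```

Converting every value with float() is what makes non-numerical stats fail here, which mirrors the warning above about filtering on non-numerical columns.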

## Motif Comparison Parameters ##
compare=T/F : Compare the motifs from the motifs FILE with the searchdb FILE (or self if None) [False]
minshare=X : Min. number of non-wildcard positions for motifs to share [2]
matchfix=X : If >0 must exactly match *all* fixed positions in the motifs from: [0]
- 1: input (motifs=FILE) motifs
- 2: searchdb motifs
- 3: *both* input and searchdb motifs
matchic=T/F : Use (and output) information content of matched regions to assess motif matches [True]
motdesc=X : Sets which motifs have description outputs (0-3 as matchfix option) [3]
outstyle=X : Sets the output style for the resfile [normal]
- normal = all standard stats are output
- multi = designed for multiple appended runs. File names are also output
- single = designed for searches of a single motif vs a database. Only motif2 stats are output
- normalsplit/multisplit = as normal/multi but stats are grouped by motif rather than by type


© 2015 RJ Edwards. Contact: richard.edwards@unsw.edu.au.