Module:	slimpickings
Description:	SLiMDisc results compilation and extraction
Version:	3.0
Last Edit:	18/01/07

Imported modules: gopher rje rje_disorder rje_scoring rje_seq rje_sequence rje_uniprot

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This is a basic results compiler for multiple SLiMDisc motif discovery datasets. There are currently the following functional elements to the module:

1. Basic compilation of results from multiple datasets into a single file. This will search through the current directory and any subdirectories (unless subdir=F) and pull out results into a single comma-separated file (slimdisc_results.csv or outfile=FILE). With the basic run, the following statistics are output: ['Dataset','SeqNum','TotalAA','Rank','Score','Pattern','Occ','IC','Norm','Sim']
This file can then be imported into other applications for analysis. (E.g. rje_mysql.py can be run on the file to construct a BUILD statement for MySQL, or StatTransfer can convert the file for STATA analysis etc.)
!!! NB: If multiple datasets (e.g. in subdirectories) have the same name, slim_pickings will become confused and may generate erroneous data later. Please ensure that all datasets are uniquely named. !!!

2. Additional optional stats based on the motif sequences themselves to help rank and filter interesting results. These are:
- AbsChg = Number of charged positions [KRDE]
- NetChg = Net charge of motif [KR] - [DE]
- BalChg = Balance of charge = Net charge in first half of motif - Net charge in second half
- AILMV = Whether all positions in the motif are A,I,L,M or V.
- Aromatic = Count of F+Y+W
- Phos = Potential phosphorylated residues X (none) or [S][T][Y], whichever are present
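The pattern-based statistics listed above can be sketched in a few lines of Python. This is only an illustration, not the module's own code: the function name motif_stats is hypothetical, the motif is assumed to be a plain string of fixed residues with "." wildcards (ambiguous positions such as [ST] are not handled), the central position of an odd-length motif is assumed to belong to neither half, and Phos is assumed to be reported as the concatenated residues present.

```python
def motif_stats(motif):
    """Return simple composition stats for a motif of fixed residues and '.' wildcards.

    Hypothetical helper for illustration; not part of slimpickings.
    """
    def net(seq):
        # Net charge = count of [KR] minus count of [DE]
        return sum(seq.count(aa) for aa in "KR") - sum(seq.count(aa) for aa in "DE")

    half = len(motif) // 2
    # Assumption: for odd lengths the central position belongs to neither half
    first, second = motif[:half], motif[len(motif) - half:]
    # Assumption: report the S/T/Y residues present as one string, or "X" if none
    phos = "".join(aa for aa in "STY" if aa in motif) or "X"
    return {
        "AbsChg": sum(motif.count(aa) for aa in "KRDE"),
        "NetChg": net(motif),
        "BalChg": net(first) - net(second),
        # Wildcards count as positions here, so any '.' makes this False
        "AILMV": all(aa in "AILMV" for aa in motif),
        "Aromatic": sum(motif.count(aa) for aa in "FYW"),
        "Phos": phos,
    }


stats = motif_stats("KRLDE")
print(stats["AbsChg"], stats["NetChg"], stats["BalChg"])
```

For "KRLDE" the positive first half and negative second half cancel in NetChg but give a large BalChg, which is the kind of asymmetry the balance statistic is meant to expose.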

3. Calculation of additional statistics from the input sequences, using PRESTO. These are:
- Mean IUPred/FoldIndex Protein Disorder around the motif occurrences (including an extended window either side)
- Mean Surface Accessibility around the motif occurrences (including an extended window either side)
- Mean Eisenberg Hydrophobicity around the motif occurrences (including an extended window either side)
- SLiM conservation across orthologous proteins. (This calculation needs improving.)
The mean for all occurrences of a motif will be output. In addition, percentile steps can be used to assess motifs according to selected threshold criteria (in another package). This will return the threshold at a given percentile, e.g. SA_pc75=2.0 would mean that 75% of occurrences have a mean Surface Accessibility value of 2.0 or greater. (For hydrophobicity, Hyd_pc50=0.3 would mean that 50% of occurrences have a mean Hydrophobicity of 0.3 or *less*. This is because low hydrophobicity is good for a (non-structural) functional motif.)
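One way to read the percentile thresholds described above is sketched below. The function name percentile_threshold and the low_is_good flag are hypothetical, and the exact rounding used by the module is not documented here, so this assumes the threshold is the value reached by the first ceil(pc% of N) occurrences when sorted from best to worst.

```python
import math


def percentile_threshold(values, pc, low_is_good=False):
    """Return the threshold met by at least pc% of the values.

    Hypothetical helper: by default "met" means value >= threshold (as for SA);
    with low_is_good=True it means value <= threshold (as for hydrophobicity).
    """
    if not values or not 0 < pc <= 100:
        raise ValueError("need at least one value and 0 < pc <= 100")
    # Sort best-first: descending when high is good, ascending when low is good
    ordered = sorted(values, reverse=not low_is_good)
    # Number of occurrences that must meet the threshold
    k = math.ceil(pc * len(ordered) / 100)
    return ordered[k - 1]


sa_values = [1.0, 2.0, 3.0, 4.0]
hyd_values = [0.1, 0.3, 0.5, 0.7]
print(percentile_threshold(sa_values, 75))                     # SA_pc75
print(percentile_threshold(hyd_values, 50, low_is_good=True))  # Hyd_pc50
```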

4. Collation and extraction of key data for specific results. These may be by any combination of protein, motif or dataset. If a list of datasets is not given, then all datasets will be considered. (Likewise proteins and motifs.) To be very specific, all three lists may be specified (slimlist=LIST protlist=LIST datalist=LIST). Information is pulled out in a two-step process:
(1) The slimdisc.*.index files are consulted for the appropriate list of datasets. If missing, these will be regenerated. (slimdisc.motif.index and slimdisc.protein.index both point to dataset names. slimdisc.dataset.index points these names to the full path of the results.) Only datasets returned by all appropriate lists will be analysed for data extraction.
(2) The appropriate data on the motifs will be extracted into a directory as determined by outdir=PATH. Depending on the options selected, the following (by default all) data is returned:
- *.motifaln.fas = customised fasta file with motifs aligned in different sequences, ready for dotplots and manual inspection for homology not detected by BLAST.
- *.dat = UniProt DAT file for as many parents as possible.
These files will be saved in the directory set by outdir=PATH.
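The dataset selection in step (1) amounts to a set intersection over whichever lists were supplied. The sketch below assumes the index files have already been read into plain dictionaries mapping a motif or protein to a set of dataset names; the function name select_datasets and that in-memory layout are illustrative assumptions, since the index file format itself is not described here.

```python
def select_datasets(all_datasets, motif_index, protein_index,
                    slimlist=(), protlist=(), datalist=()):
    """Return the datasets returned by every list that was actually given.

    Hypothetical helper: motif_index and protein_index map a name to a set of
    dataset names. An empty list places no restriction on the selection.
    """
    selected = set(all_datasets)
    if slimlist:
        # Datasets containing any of the requested motifs
        selected &= set().union(*(motif_index.get(m, set()) for m in slimlist))
    if protlist:
        # Datasets containing any of the requested proteins
        selected &= set().union(*(protein_index.get(p, set()) for p in protlist))
    if datalist:
        selected &= set(datalist)
    return selected


datasets = {"d1", "d2", "d3"}
motifs = {"RGD": {"d1", "d2"}}
proteins = {"P12345": {"d2", "d3"}}
print(select_datasets(datasets, motifs, proteins, slimlist=["RGD"], protlist=["P12345"]))
```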

5. Re-ranking of results. rerank=X will now re-rank the results for each dataset according to the statistic set by rankstat=X, and output the top X results only. By default, this is the "R-score" = ic * norm * occ / exp. The output "Rank" will be replaced with the new rank and a new column "OldRank" added to the output. zscore=T/F turns on and off a simple Z-score calculation based on the slimranks read in. Version 2.5 added a new option for a crude length correction of the RScore, dividing by 20 to the power of the motif IC (as calculated by SLiM Pickings on a scale of 1.0 per fixed position). This is controlled by the lencorrect=T/F option. By default this is False (for backwards compatibility) but with future versions this may become the default as it is assumed (by me) that it will improve performance. However, there is currently no justification for this, so use with caution!
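The R-score and the re-ranking step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the module's implementation: the function names are hypothetical, each result is a dictionary, the expectation is assumed to live in a column called "Exp", and a higher R-score is assumed to rank first.

```python
def r_score(ic, norm, occ, exp, lencorrect=False):
    """R-score = ic * norm * occ / exp, optionally divided by 20 ** ic."""
    score = ic * norm * occ / exp
    if lencorrect:
        # Crude length correction: divide by 20 to the power of the motif IC
        score /= 20 ** ic
    return score


def rerank(results, top=0, lencorrect=False):
    """Re-rank results by R-score, keeping the previous rank as OldRank.

    Hypothetical helper: each result is a dict with Rank, IC, Norm, Occ and
    Exp keys ("Exp" is an assumed column name). top=0 keeps every result.
    """
    scored = [dict(r, RScore=r_score(r["IC"], r["Norm"], r["Occ"], r["Exp"], lencorrect))
              for r in results]
    # Assumption: a higher R-score is better
    scored.sort(key=lambda r: r["RScore"], reverse=True)
    for new_rank, r in enumerate(scored, start=1):
        r["OldRank"] = r["Rank"]
        r["Rank"] = new_rank
    return scored[:top] if top > 0 else scored


results = [
    {"Pattern": "A", "Rank": 1, "IC": 2.0, "Norm": 1.0, "Occ": 2, "Exp": 4.0},
    {"Pattern": "B", "Rank": 2, "IC": 3.0, "Norm": 1.0, "Occ": 4, "Exp": 2.0},
]
for row in rerank(results):
    print(row["Pattern"], row["Rank"], row["OldRank"], row["RScore"])
```

Because the correction divides by 20 ** IC, it penalises longer (higher IC) motifs very heavily, which is worth keeping in mind when comparing corrected and uncorrected rankings.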

6. Filtering of results using the statfilter=LIST option, allowing results to be filtered according to a set of rules: LIST should be (a file containing) a comma-separated list of stats to filter on, consisting of X*Y where X is an output stat (the column header); * is an operator in the list >, >=, =, =<, <; and Y is a value that X must have, assessed using *. This filtering is crude and may behave strangely if X is not a numerical stat (although Python does seem to assess these alphabetically, so it may be OK)! This filtering is performed before the re-ranking of the motifs if rerank=X is used. This can make run times quite long, as many more motifs need stats calculations. (If rerank=X is used without statfilter, re-ranking is done earlier to save time.) See the manual for details.
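A rule list of this X*Y form could be parsed and applied along the following lines. This is a sketch, not the module's parser: the names parse_statfilter and matches are hypothetical, "<=" is accepted alongside "=<" as a convenience, "!=" is included because the statfilter entry in the Commandline section lists it, and a row is assumed to match if it satisfies any one rule. Whether matching rows are kept or removed is left to the caller.

```python
import operator
import re

# Longer operators come first in the pattern so ">=" is not read as ">" then "="
OPS = {">=": operator.ge, "=<": operator.le, "<=": operator.le, "!=": operator.ne,
       ">": operator.gt, "<": operator.lt, "=": operator.eq}
RULE = re.compile(r"^\s*(\w+)\s*(>=|=<|<=|!=|>|<|=)\s*(.+?)\s*$")


def parse_statfilter(text):
    """Split 'X*Y,X*Y' into (stat, comparison function, value) rules."""
    rules = []
    for item in text.split(","):
        found = RULE.match(item)
        if not found:
            raise ValueError("cannot parse filter rule: %r" % item)
        stat, op, value = found.groups()
        rules.append((stat, OPS[op], value))
    return rules


def matches(row, rules):
    """True if the row satisfies at least one rule (assumed semantics)."""
    for stat, compare, value in rules:
        try:
            left, right = float(row[stat]), float(value)
        except ValueError:
            # Non-numeric stats fall back to plain string comparison
            left, right = str(row[stat]), value
        if compare(left, right):
            return True
    return False


rules = parse_statfilter("UDif>1,IC=<2")
rows = [{"UDif": "0.5", "IC": "3.0"}, {"UDif": "2", "IC": "3.0"}]
print([matches(row, rules) for row in rows])
```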

7. !!!NEW!!! with version 3.0, customised scores can be created using the newscore=LIST option, where LIST is in the form X:Y,X:Y, where in turn X is the name for the new score (a column with this name will be produced) and Y is the formula of the score. This formula may contain any output column names, numbers and the operators +-*/^ (^ is "to the power of"), using brackets to set the order of calculation. Without brackets, a strict left to right hierarchy is observed. e.g. newscore=Eg:3+2*6 will generate a column called "Eg" containing the value 30.0. Custom scores can feature previously defined custom scores in the command options, so a second newscore call could be newscore=Eg:3+2*6,Eg2:Eg^2 (= Eg squared = 900.0). This can be used in conjunction with statfilter, e.g. newscore=UDif:UHS-UP statfilter=UDif>1.
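The strict left-to-right rule is easy to get wrong if normal operator precedence is assumed, so a small evaluator may help to check formulas before a long run. The sketch below is an independent illustration, not the module's parser: the function names are hypothetical, column values are looked up in a dictionary, and unary minus is not supported.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+\.?\d*|[-+*/^()])")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "^": lambda a, b: a ** b,
}


def tokenize(formula):
    """Split a formula into column names, numbers, operators and brackets."""
    formula = formula.strip()
    tokens, pos = [], 0
    while pos < len(formula):
        found = TOKEN.match(formula, pos)
        if not found:
            raise ValueError("bad character in formula at position %d" % pos)
        tokens.append(found.group(1))
        pos = found.end()
    return tokens


def _operand(tokens, pos, row):
    """Read one number, column value or bracketed sub-formula."""
    token = tokens[pos]
    if token == "(":
        value, pos = _expression(tokens, pos + 1, row)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing bracket")
        return value, pos + 1
    if token in row:
        return float(row[token]), pos + 1
    return float(token), pos + 1


def _expression(tokens, pos, row):
    """Apply operators strictly left to right, with no precedence."""
    value, pos = _operand(tokens, pos, row)
    while pos < len(tokens) and tokens[pos] in OPS:
        op = tokens[pos]
        right, pos = _operand(tokens, pos + 1, row)
        value = OPS[op](value, right)
    return value, pos


def evaluate(formula, row):
    """Evaluate a formula against a dict of column values."""
    tokens = tokenize(formula)
    value, pos = _expression(tokens, 0, row)
    if pos != len(tokens):
        raise ValueError("unexpected token: %r" % tokens[pos])
    return value


def add_scores(row, newscore):
    """Add each X:Y score in order, so later formulas can use earlier ones."""
    row = dict(row)
    for item in newscore.split(","):
        name, formula = item.split(":", 1)
        row[name] = evaluate(formula, row)
    return row


print(add_scores({}, "Eg:3+2*6,Eg2:Eg^2"))  # {'Eg': 30.0, 'Eg2': 900.0}
```

Brackets restore conventional ordering where it is wanted: evaluate("3+(2*6)", {}) gives 15.0 rather than 30.0.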

Commandline

## Basic compilation options ##
outfile=FILE : Name of output file. [slimdisc_results.csv]
dirlist=LIST : List of directories from which to extract files (wildcards OK) [./]
compile=T/F : Compile motifs from SLiMDisc rank files into output file. (False=index only) [True]
append=T/F : Append file rather than over-writing [False]
slimranks=X : Maximum number of SLiMDisc ranks to extract from any given dataset [5000]
rerank=X : Re-ranks according to RScore (if expect=T) and only outputs top X new ranks (if > 0) [5000]
rankstat=X : Stat to use to re-rank data [RScore]
motific=T/F : Recalculate IC using PRESTO. Used for re-ranking. OldIC also output. [False]
lencorrect=T/F : Implements crude length correction in RScore [False]
delimit=X : Change delimiter to X [,]
## Advanced compilation options ##
subdir=T/F : Whether to search subdirectories for rank files [True]
webid=LIST : List of SLiMDisc webserver IDs to compile. (Works only on bioware!) []
slimversion=X : SLiMDisc results version for compiled output [1.4]
## Additional statistics ##
abschg=T/F : Whether to output number of charged positions (KRDE) [True]
netchg=T/F : Whether to output net charge of motif (KR) - (DE) [True]
balchg=T/F : Whether to output the *balance* of charge (netNT - netCT) [True]
ailmv=T/F : Whether to output if all positions in the motif are A,I,L,M or V. [True]
aromatic=T/F : Whether to output count of F+Y+W [True]
phos=T/F : Whether to output potential phosphorylated residues X (none) or [S][T][Y], if present [True]
expect=T/F : Calculate min. expected occurrence of motif in search dataset [True]
zscore=T/F : Calculate z-scores for each motif using the entire dataset (<=slimranks) [True]
newscore=LIST : Lists of X:Y, create a new statistic X, where Y is the formula of the score. []
custom=LIST : Calculate Custom score as a product of stats in LIST []

Additional calculations to make

slimsa=T/F : Calculate SA information for SLiMDisc Results [True]
winsa=X : Number of aa to extend Surface Accessibility calculation either side of motif [0]
slimhyd=T/F : Calculate Eisenberg Hydrophobicity for SLiMDisc Results [True]
winhyd=X : Number of aa to extend Eisenberg Hydrophobicity calculation either side of motif [0]
slimcons=T/F : Calculate Conservation stats for SLiMDisc results [False]
- See PRESTO conservation options. (NB. consamb does nothing.)
slimchg=T/F : Calculate selected charge statistics (above) for occurrences in addition to pattern [False]
slimfold=T/F : Calculate disorder using FoldIndex over the internet [False]
slimiup=T/F : Calculate disorder using local IUPred [True]
windis=X : Number of aa to extend disorder prediction each side of occurrence [0]
iucut=X : Cut-off for IUPred results [0.2]
iumethod=X : IUPred method to use (long/short) [short]
iupath=PATH : The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
percentile=X : Percentile steps to return in addition to mean [0]
## Collation and Extraction of specific results ##
index=T/F : Whether to create index files (slimpicks.*.index) for proteins, motifs and datasets [True]
bigindex=T/F : Whether to use the special makeBigIndexFiles() method [False]
fullpath=T/F : Whether to use full path (else relative) for dataset index [True]
slimpath=PATH : Path to place (or find) index files. *Cannot be used for extraction if fullpath=F* [./]
slimlist=LIST : List (A,B,C) or FILE containing list of SLiMs (motifs) to extract []
protlist=LIST : List (A,B,C) or FILE containing list of proteins for which to extract results []
datalist=LIST : List (A,B,C) or FILE containing list of datasets for which to extract results []
strict=T/F : Only extract protein/occurrence details for those proteins in protlist [False]
(False = extract details for all proteins in datalist datasets containing slimlist motifs)
outdir=PATH : Directory into which extracted data will be placed. [./]
picksid=X : Outputs an extra 'PicksID' column containing the identifier X []
inputext=LIST : List of file extensions for original input files. [dat,fas,fasta,faa]
(Should be in same dir as *.rank, or one dir above)
indexre=LIST : List of alternative regular expression patterns to try for index retrieval []
- ipi : 'ipi_HUMAN__(\S+)-*\d*=(\S.+)', # IPI Human sequence
- ipi_sv : '^ipi_HUMAN__([A-Za-z0-9]+)-*\d*=(\S.+)', # IPI Human UniProt splice variant
- ft : '^(\S+)_HUMAN=(\S+)', # SLiMDisc FullText (UniProt format) retrieval
- ft_sv : '^([A-Za-z0-9]+)-*\d*_HUMAN=(\S+)' # SLiMDisc FullText (UniProt format) splice variant

## Additional Output for Extracted Motifs ##
occres=FILE : Output individual occurrence data in FILE [None]
extract=T/F : Extract additional data for motifs [True if datasets/SLiMs/accnums given, else False]
motifaln=T/F : Produce fasta files of local motif alignments [True]
flanksize=X : Size of sequence flanks for motifs [30]
xdivide=X : Size of dividing Xs between motifs [10]
datout=FILE : Extract UniProt entries from parent proteins where possible into FILE [uniprot_extract.dat]
unitab=T/F : Make tables of UniProt data using rje_uniprot.py [True]
ftout=FILE : Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [None]
unipaths=LIST : List of additional paths containing uniprot.index files from which to look for and extract features ['']
peptides=T/F : Peptide design around discovered motifs [False]
## Additional Output for Proteins ##
proteinaln=T/F : Search for alignments of proteins containing motifs and produce new file containing motifs [True]
gopher=T/F : Use GOPHER to generate missing orthologue alignments in alndir - see gopher.py options [False]
alndir=PATH : Path to alignments of proteins containing motifs [./] * Use forward slashes (/) [Gopher/ALN/]
alnext=X : File extension of alignment files, accnum.X [orthaln.fas]
## Advanced Filtering Options ##
statfilter=LIST : List of stats to filter (remove matching motifs) on, consisting of X*Y where:
- X is an output stat (the column header),
- * is an operator in the list >, >=, !=, =, =<, < !!! Remember to enclose in "quotes" for <> !!!
- Y is a value that X must have, assessed using *.
This filtering is crude and may behave strangely if X is not a numerical stat!
zfilter=T/F : Calculate the Z-score on the filtered dataset (True) or the whole dataset (False) [False]
rankfilter=T/F : Re-ranks the filtered dataset (True) rather than the whole (pre-filtered) dataset (False) [True]
- NB. If zfilter=T then rankfilter=T.

## Old/obsolete options ##
advprob=T/F : Calculate advanced probability based on actual sequences containing motifs [False] #!# Not right yet!! #!#
advmax=X : Max number of sequences to use computationally intensive advanced probability [35]
* See RJE_UNIPROT options for UniProt settings *
