This module contains the core dataset processing methods that both SLiMFinder and SLiMProb inherit and use. These
consist mainly of the options and methods for masking datasets and generating UPC. The module can therefore be run
in standalone mode to generate UPC files for SLiMFinder or SLiMProb. It can also be used to generate "MegaSLiM"
files of precomputed scores that can be reused for subsequent subdatasets.
In addition, the secondary MotifSeq, Randomise and EquivMaker functions are handled here.
The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.
The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets
by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final
datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.
The "EquivMaker" function will read in a BLOSUM matrix and, for a particular score cut-off, generate all equivalence
groups for which all members have pairwise BLOSUM scores that equal or exceed the cut-off. This equivalence file can
then be used as input for TEIRESIAS or SLiMFinder.
### Basic Input/Output Options ###
seqin=FILE : Sequence file to search. Over-rules batch mode (and
batch=LIST : List of files to search, wildcards allowed. [
uniprotid=LIST : Extract IDs/AccNums in list from Uniprot into BASEFILE.dat and use as
seqfilter=T/F : Whether to apply sequence filtering options (goodX, badX etc.) to input [
maxseq=X : Maximum number of sequences to process [
maxupc=X : Maximum UPC size of dataset to process [
sizesort=X : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [
walltime=X : Time in hours before program will abort search and exit [
resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [
buildpath=PATH : Alternative path to look for existing intermediate files [
force=T/F : Force re-running of BLAST, UPC generation and SLiMBuild [
dna=T/F : Whether the sequences files are DNA rather than protein [
alphabet=LIST : List of characters to include in search (e.g. AAs or NTs) [
default AA or NT codes]
megaslim=FILE : Make/use precomputed results for a proteome (FILE) in fasta format [
megablam=T/F : Whether to create and use all-by-all GABLAM results for (gablamdis) UPC generation [
megaslimfix=T/F : Whether to run megaslim in "fix" mode to tidy/repair existing files [
ptmlist=LIST : List of PTM letters to add to alphabet for analysis and restrict PTM data 
ptmdata=DSVFILE : File containing PTM data, including AccNum, ModType, ModPos, ModAA, ModCode
### Evolutionary Filtering Options ###
efilter=T/F : Whether to use evolutionary filter [
blastf=T/F : Use BLAST Complexity filter when determining relationships [
blaste=X : BLAST e-value threshold for determining relationships [
altdis=FILE : Alternative all by all distance matrix for relationships [
gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
fupc=T/F : Whether to use experimental "Fragment UPC" approach for UPC of large proteomes [
domtable=FILE : Domain table containing domain ("Type") and sequence ("Name") pairings for additional UPC [
homcut=X : Max number of homologues to allow (to reduce large multi-domain families) [
extras=T/F : Whether to generate additional output files (distance matrices etc.) [
### Input Masking and AA Frequency Options ###
masking=T/F : Master control switch to turn off all masking if False [
dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [
consmask=T/F : Whether to use relative conservation masking [
ftmask=LIST : UniProt features to mask out (
imask=LIST : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) 
fakemask=T/F : Whether to invoke UniFake to generate additional features for masking [
compmask=X,Y : Mask low complexity regions (same AA in X+ of Y consecutive aas) [
casemask=X : Mask Upper or Lower case [
metmask=T/F : Masks the N-terminal M (can be useful if SLiMFinder
posmask=LIST : Masks list of position-specific aas, where list = pos1:aas,pos2:aas *Also
aamask=LIST : Masks list of AAs from all sequences (reduces alphabet) 
motifmask=X : List (or file) of motifs to mask from input sequences 
logmask=T/F : Whether to output the log messages for masking of individual sequences to screen [
masktext=X : Text ID to over-ride automated masking text and identify specific masking settings [
maskpickle=T/F : Whether to save/load pickle of masked input data, independent of main pickling [
maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [
aafreq=FILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [
qregion=X,Y : Mask all but the region of the query from (and including) residue X to residue Y [
megaslimdp=X : Accuracy (d.p.) for MegaSLiM masking tool raw scores [
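
The compmask=X,Y rule (same AA in X+ of Y consecutive aas) can be illustrated with a minimal sketch. This is not
SLiMCore's own implementation: the function name `complexity_mask`, the "X" mask character, and the choice to mask
only the repeated residue inside each offending window are assumptions made for illustration.

```python
def complexity_mask(seq, x, y, mask="X"):
    """Mask any residue type that fills x or more of y consecutive positions.

    Hypothetical helper: masks only the over-represented residue in each window.
    """
    masked = list(seq)
    for start in range(len(seq)):
        window = seq[start:start + y]          # windows near the C-terminus are shorter
        for aa in set(window):
            if aa != mask and window.count(aa) >= x:
                for offset, res in enumerate(window):
                    if res == aa:
                        masked[start + offset] = mask
    return "".join(masked)


if __name__ == "__main__":
    print(complexity_mask("ACDQQQQQEFG", x=5, y=8))
```

Windows are always read from the unmasked input, so masking one window does not change how later windows are scored.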
### Advanced Output Options ###
targz=T/F : Whether to tar and zip dataset result files (UNIX only) [
pickle=T/F : Whether to save/use pickles [
savespace=0 : Delete "unnecessary" files following run (best used with targz): [
- 0 = Delete no files
- 1 = Delete all bar *.upc, *.pickle (Pickle excluded from tar.gz with this setting)
- 2 = Delete all bar *.upc files (Pickle included in tar.gz with this setting)
- 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
iuscoredir=PATH : Path in which to save protein acc.iupred.txt score files for megaslim analysis
protscores=T/F : Whether to save individual protein rlc.txt files in alignment directory [
### Additional Functions I ###
motifseq=LIST : Outputs fasta files for a list of X:Y, where X is the pattern and Y is the output file 
slimbuild=T/F : Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [
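
The X:Y pairing used by motifseq=LIST can be sketched as below. This is a rough illustration only: the helper names
`split_motifseq` and `fasta_for_motif` are hypothetical, the pattern is assumed to be usable as a plain regular
expression, and the output is assumed to be the input sequences that contain the pattern.

```python
import re


def split_motifseq(items):
    """Split each 'X:Y' item into (pattern, outfile) at the first colon."""
    pairs = []
    for item in items:
        pattern, sep, outfile = item.partition(":")
        if not sep or not pattern or not outfile:
            raise ValueError("Expected X:Y, got %r" % item)
        pairs.append((pattern, outfile))
    return pairs


def fasta_for_motif(pattern, sequences):
    """Return fasta text for the sequences (name -> sequence) containing pattern."""
    regex = re.compile(pattern)
    lines = []
    for name, seq in sequences.items():
        if regex.search(seq):
            lines.append(">%s\n%s\n" % (name, seq))
    return "".join(lines)


if __name__ == "__main__":
    seqs = {"seq1": "MKRGDS", "seq2": "MKAAAS"}
    for pattern, outfile in split_motifseq(["R.D:rgd.fas"]):
        print(outfile)
        print(fasta_for_motif(pattern, seqs))
```

The items are taken as a list rather than one comma-separated string because motif patterns can themselves contain
commas, for example in a `.{0,2}` wildcard.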
### Additional Functions II ###
randomise=T/F : Randomise UPC within batch files and output new datasets [
randir=PATH : Output path for creation of randomised datasets [
randbase=X : Base for random dataset name [
randsource=FILE : Source for new sequences for random datasets (replaces UPCs) [
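
The idea behind randomise=T/F, shuffling UPC among datasets, can be sketched as follows. This is a simplified
illustration, not the module's own method: `randomise_upc` is a hypothetical name, each UPC is treated as an opaque
item, and every new dataset is assumed to keep the UPC count of the dataset it replaces.

```python
import random


def randomise_upc(datasets, seed=None):
    """Pool the UPC from all datasets, shuffle, and deal them back out.

    datasets: dict of dataset name -> list of UPC (each UPC is any object).
    Returns a new dict with the same names and the same number of UPC per dataset.
    """
    rng = random.Random(seed)
    pool = [upc for upcs in datasets.values() for upc in upcs]
    rng.shuffle(pool)
    shuffled, start = {}, 0
    for name, upcs in datasets.items():
        shuffled[name] = pool[start:start + len(upcs)]
        start += len(upcs)
    return shuffled


if __name__ == "__main__":
    batch = {"d1": ["upc1", "upc2"], "d2": ["upc3"], "d3": ["upc4", "upc5", "upc6"]}
    print(randomise_upc(batch, seed=1))
```

This sketch does not check for sequences shared between UPC from different source datasets, which is why real
randomised datasets may end up with fewer UPC than the originals.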
### Additional Functions III ###
blosumfile=FILE : BLOSUM file from which to make equivalence file
equivout=FILE : File for equivalence list output [
equivcut=X : BLOSUM score cut-off for equivalence groups [
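
The EquivMaker rule (every pair of group members scores at or above the cut-off) amounts to finding cliques in a
graph of residues. The sketch below uses a basic Bron-Kerbosch search and is only an illustration: `equiv_groups` is
a hypothetical name, only maximal groups of two or more residues are reported (the module may report others), and the
scores are toy values rather than a real BLOSUM matrix.

```python
def equiv_groups(scores, cutoff):
    """Return maximal groups (size 2+) whose members all pairwise score >= cutoff.

    scores: dict of (aa1, aa2) -> substitution score; missing pairs never qualify.
    """
    letters = sorted({aa for pair in scores for aa in pair})

    def score(a, b):
        return scores.get((a, b), scores.get((b, a)))

    # Residues are linked when their pairwise score meets the cut-off.
    linked = {a: {b for b in letters
                  if b != a and score(a, b) is not None and score(a, b) >= cutoff}
              for a in letters}
    groups = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            if len(clique) > 1:
                groups.append("".join(sorted(clique)))
            return
        for aa in sorted(candidates):
            expand(clique | {aa}, candidates & linked[aa], excluded & linked[aa])
            candidates = candidates - {aa}
            excluded = excluded | {aa}

    expand(set(), set(letters), set())
    return sorted(groups)


if __name__ == "__main__":
    # Toy scores for illustration only, not taken from a real matrix file.
    toy = {("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1, ("D", "E"): 2, ("K", "R"): 2}
    print(equiv_groups(toy, cutoff=2))
```

Raising the cut-off splits larger groups into smaller ones, because a single low-scoring pair is enough to exclude a
residue from a group.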
### Module Version History ###
# 0.0 - Initial Compilation based on SLiMFinder 3.1.
# 0.1 - Tidied with respect to SLiMFinder and SLiMSearch.
# 0.2 - Added DNA mode.
# 0.3 - Added relative conservation masking.
# 0.4 - Altered TarGZ to *not* include *.pickle.gz
# 1.0 - Standardised masking options. Added motifmask and motifcull.
# 1.1 - Checked/updated Randomise option.
# 1.2 - Added generation of UniFake file to maskInput() method. (fakemask=T)
# 1.3 - Added aamask effort
# 1.4 - Added DomTable option
# 1.5 - Added masktext and maskpickle options to accelerate runs with large masked datasets.
# 1.6 - Fixed occurrence table bugs.
# 1.7 - Added SizeSort and NewUPC. Add server #END statements.
# 1.8 - Added BLOSUM2equiv method for making equivalent lists from a BLOSUM matrix.
# 1.9 - Minor modifications to Log output. Updated motifSeq() function to output unmasked sequences.
# 1.10 - Bypass UPC generation for single sequences.
# 1.11 - Tidied some of the module imports.
# 1.12 - Upgraded BLAST to BLAST+. Can use old BLAST with oldblast=T.
# 1.13 - Modified the savespace settings to reduce numbers of files. targz file now uses RunID not Build Info.
# 1.14 - Started adding code for Fragmented UPC (FUPC) clustering.
# 1.15 - Added pre-running GOPHER if no alndir and usegopher=T. Updated dataset() to use Input not Basefile.
# 1.16 - Preparation for SLiMCore V2.0 using newer RJE_Object.
# 2.0 - Converted to use rje_obj.RJE_Object as base. Version 1.16 moved to legacy/.
# 2.1 - Added megaslim=FILE option to make/use precomputed results for a proteome. Upgraded MotifSeq method.
# 2.2 - Modified aa frequency calculations to use alphabet to generate 0.0 frequencies (rather than missing aa).
# 2.3 - Docstring edits. Minor tweak to walltime() to close open files.
# 2.4 - Added megaslimfix=T/F : Whether to run megaslim in "fix" mode to tidy/repair existing files [False]
# 2.5 - Added (hidden) slimmutant=T/F : Whether to ignore '.p.\D\d+\D' at end of accnum. Made default append=True.
# 2.6.0 - Added uniprotid=LIST : Extract IDs/AccNums in list from Uniprot into BASEFILE.dat and use as seqin=FILE. 
# 2.6.1 - Removed the maxseq default setting.
# 2.7.0 - Updating MegaSLiM function to work with REST server. Allow megaslim=seqin. Added iuscoredir=PATH and protscores=T/F.
# 2.7.1 - Modified iuscoredir=PATH and protscores=T/F to work without megaslim. Fixed UPC/SLiMdb issue for GOPHER.
# 2.7.2 - Fixed iuscoredir=PATH to stop raising errors when file not previously made.
# 2.7.3 - Fixed serverend message error.
# 2.7.4 - Fixed walltime server bug.
# 2.7.5 - Fixed feature masking.
# 2.7.6 - Added feature masking log info or warning.
# 2.7.7 - Switched feature masking OFF by default to give consistent Uniprot versus FASTA behaviour.
# 2.7.8 - Fixed batch=FILE error for single input files.
# 2.8.0 - Added map and failed output to REST servers and standalone uniprotid=LIST input runs.
# 2.8.1 - Updated resfile to be set by basefile if no resfile=X setting given
# 2.9.0 - Added separate IUPred long suffix for reusing predictions
# 2.10.0 - Added seqfilter=T/F : Whether to apply sequence filtering options (goodX, badX etc.) to input [False]
### SLiMCore REST Output formats ###
for general options. Run with
to get full server output as text or
for more user-friendly formatted output. Individual outputs can be identified/parsed using
= List of predicted disorder scores for proteins (if
or using special disorder rest call)
= List of RLC scores for proteins (if
or using special rlc rest call)
= Groupings of unrelated proteins (if
Note that SLiMCore can either be run as http://rest.slimsuite.unsw.edu.au/slimcore, or special runs can be used to
try to directly access RLC or Disorder scores for an individual protein:
If these data already exist, they will be returned directly as plain text. If not, a jobid will be returned; that
job will provide the desired output once it has run.