Module:	rje_seqgen
Description:	Random Sequence Generator Module
Version:	1.7
Last Edit:	17/01/13

Imported modules: rje rje_markov rje_seq rje_sequence rje_zen rje_blast_V1

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module is designed to generate a number of random sequences based on input AA or Xmer frequencies and the desired order of markov chain from which to draw the amino acid (or nucleotide) probabilities.
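As a rough illustration of drawing residues from Xmer frequencies, the sketch below counts Xmers in some training sequences and extends a new sequence one residue at a time. It is not the rje_seqgen implementation: the function names `xmer_counts` and `markov_sequence` are hypothetical, `order` is assumed to be the Xmer length (so `order=1` means plain single-residue frequencies), and the dead-end fallback is an assumption made for the example.

```python
import random
from collections import defaultdict

def xmer_counts(seqs, order):
    """Count every Xmer of length `order` in the training sequences."""
    counts = defaultdict(int)
    for seq in seqs:
        for i in range(len(seq) - order + 1):
            counts[seq[i:i + order]] += 1
    return dict(counts)

def markov_sequence(counts, order, length, rng=random):
    """Build one sequence; each new residue depends on the previous order-1 residues."""
    xmers = sorted(counts)
    # Seed the sequence with one whole Xmer, weighted by its count.
    seq = rng.choices(xmers, weights=[counts[x] for x in xmers])[0]
    while len(seq) < length:
        prefix = seq[len(seq) - (order - 1):]  # empty string when order == 1
        options = [x for x in xmers if x.startswith(prefix)]
        if not options:
            options = xmers  # assumption: on a dead end, fall back to all Xmers
        picked = rng.choices(options, weights=[counts[x] for x in options])[0]
        seq += picked[-1]  # only the final residue of the chosen Xmer is new
    return seq[:length]

counts = xmer_counts(["ACACACAC"], order=2)
print(markov_sequence(counts, order=2, length=10))
```

With the training sequence above only the 2mers AC and CA are ever observed, so the generated sequence strictly alternates A and C.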

If poolgen=T, then the amino acid frequencies will be used to make a finite pool of amino acids from which sequences will be built. This will ensure that the total dataset has the correct amino acid frequencies. Because of the potential of this to get 'stuck' in impossible sequence space - especially if screenx > 0 - an additional parameter poolcyc=X determines how many times to retry the generation of sequences. If poolgen=F, then generation of sequences will be faster but the resulting dataset may have Xmer frequencies that differ greatly from the input frequencies, depending on how many (and which) redundant and/or screened Xmer-containing sequences are removed. (If seqin contains only one peptide then each random peptide will be a scramble of that peptide.)
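The finite-pool idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's code: `build_pool` and `pool_sequences` are hypothetical names, the whole pool is shuffled and cut into equal-length sequences, and a cycle is abandoned as soon as any sequence is redundant or contains a screened Xmer. The `poolcyc` argument mirrors the retry count described above.

```python
import random

def build_pool(freqs, total):
    """Turn residue frequencies into a finite list of about `total` residues."""
    pool = []
    for aa, freq in sorted(freqs.items()):
        pool.extend(aa * round(freq * total))
    return pool

def pool_sequences(freqs, seqnum, seqlen, screen=(), poolcyc=1, rng=random):
    """Try up to `poolcyc` times to cut a shuffled pool into acceptable sequences."""
    total = seqnum * seqlen
    for _ in range(poolcyc):
        pool = build_pool(freqs, total)
        if len(pool) < total:
            raise ValueError("rounded frequencies do not fill the pool")
        rng.shuffle(pool)
        seqs = ["".join(pool[i * seqlen:(i + 1) * seqlen]) for i in range(seqnum)]
        redundant = len(set(seqs)) < seqnum
        screened = any(xmer in seq for seq in seqs for xmer in screen)
        if not redundant and not screened:
            return seqs
    return None  # stuck: no acceptable dataset within poolcyc attempts

freqs = {"A": 0.5, "C": 0.25, "G": 0.25}
print(pool_sequences(freqs, seqnum=4, seqlen=8, poolcyc=50))
```

Because every residue comes from the same fixed pool, the dataset as a whole reproduces the input frequencies, at the cost of possibly failing when the screen cannot be satisfied.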

!!!NEW!!! Version 1.3 has a new scramble function, which takes in a list of peptides and tries to construct scrambled versions of them. In this case, screenx=X sets the length of common Xmers between the scrambled peptide and the original peptide at which a scrambled peptide will be rejected. This should be set > 1, else all peptides will be rejected. (If left at the default of zero, no peptides will be rejected.) In this mode, outfile=FILE will set the name of a delimited output file containing two columns: peptide & scramble. (Default filename = scramble.tdt)
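A minimal sketch of the scramble-and-screen logic is given below. The helper names `shared_xmer` and `scramble` are hypothetical and the shuffling strategy is an assumption; only the roles of `screenx` and `scramblecyc` follow the description above.

```python
import random

def shared_xmer(a, b, x):
    """True if sequences a and b share any substring of length x."""
    xmers = {a[i:i + x] for i in range(len(a) - x + 1)}
    return any(b[i:i + x] in xmers for i in range(len(b) - x + 1))

def scramble(peptide, screenx=0, scramblecyc=10000, rng=random):
    """Shuffle `peptide` until no Xmer of length `screenx` is shared with the original."""
    residues = list(peptide)
    for _ in range(scramblecyc):
        rng.shuffle(residues)
        candidate = "".join(residues)
        if screenx > 0 and shared_xmer(peptide, candidate, screenx):
            continue  # rejected: shares an Xmer with the original peptide
        return candidate
    return None  # gave up after scramblecyc attempts

print(scramble("ACDEFGHIKL", screenx=3))
```

With `screenx=1` every scramble shares single residues with the original, so every attempt is rejected, which is why the value should be greater than 1.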

!!!NEW!!! Version 1.5 has a new BLAST-centred method for making a random dataset from an input dataset, retaining the approximate evolutionary relationships as defined by BLAST homology, which should result in similar GABLAM statistics for the randomised dataset. For this, a random sequence is created first. Any BLAST hits between this and other sequences are then mapped, keeping the required percentage identity (and using different amino acids drawn from the frequency pool for the rest). The next sequence is taken, completed and then the same process followed, until all sequences have been made. Improvements to make: (a) incorporate similarity too; (b) adjust aa frequencies after BLAST mapping. This method is activated by the blastgen=T option and has limited options as yet. NB. The input dataset will *not* be subject to rje_seq filtering.
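The mapping step for a single hit might look something like the sketch below. It is a heavily simplified illustration, not the BLASTGen code: it assumes an ungapped hit, takes the already generated region as `template`, and uses a hypothetical function `map_hit` with an `identity` fraction in place of real BLAST output.

```python
import random

def map_hit(template, identity, freqs, rng=random):
    """Copy `template` at a fraction `identity` of positions; mismatch the rest."""
    n_keep = round(identity * len(template))
    keep = set(rng.sample(range(len(template)), n_keep))
    residues = sorted(freqs)
    region = []
    for i, aa in enumerate(template):
        if i in keep:
            region.append(aa)  # identical position
        else:
            # Draw a *different* residue, weighted by the input frequencies.
            others = [r for r in residues if r != aa]
            weights = [freqs[r] for r in others]
            region.append(rng.choices(others, weights=weights)[0])
    return "".join(region)

aa_freqs = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}  # uniform, for illustration
print(map_hit("ACDEFGHIKLACDEFGHIKL", identity=0.6, freqs=aa_freqs))
```

Forcing a different residue at every non-kept position keeps the identity of the mapped region at the requested value rather than drifting above it by chance.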

!!!NEW!!! Version 1.6 has an EST randomiser. This will go through each sequence in turn and generate a new sequence of the same length using the NT frequencies (or markov chain frequencies) of just that sequence. Updated in V1.7 to make this work for proteins too.
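The per-sequence randomisation can be sketched as below, using single-residue frequencies only. The function name `randomise_each` is hypothetical, and the markov chain variant mentioned above is not covered.

```python
import random
from collections import Counter

def randomise_each(seqs, rng=random):
    """Replace each sequence with a random one of equal length and similar composition."""
    randomised = []
    for seq in seqs:
        if not seq:
            randomised.append("")  # nothing to sample from
            continue
        counts = Counter(seq)  # frequencies of this sequence alone
        letters = sorted(counts)
        weights = [counts[c] for c in letters]
        randomised.append("".join(rng.choices(letters, weights=weights, k=len(seq))))
    return randomised

print(randomise_each(["ACGTACGTTTGA", "MKVLAAGIVG"]))
```

Because each sequence supplies its own frequencies, the same function works for nucleotide and protein input alike.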

Commandline

## Generation options ##
seqnum=X : Number of random sequences to generate [24]
seqlen=X,Y : Range of lengths for random sequences [10]
markovx=X : Order of markov chain to use for sequence construction [1]
aafreq=FILE : File from which to read AA Freqs [None]
xmerfile=FILE : File from which to read Xmer frequencies for sequence generation [None]
xmerseq=FILE : Sequence file from which to calculate Xmer frequencies [None]

* xmerseq is overridden by xmerfile and aafreq. aafreq only works if markovx=1 and is overridden by xmerfile. *

nrgen=T/F : [True]
poolgen=T/F : [False]
poolcyc=X : [1]
maxhyd=X : [10]
outfile=FILE : [randseq.fas]
randname=X : [randseq]
randdesc=T/F : [True]
idmin=X : [1]
idmax=X : [0]
append=T/F : [False]
screenfile=FILE : [None]
xmerocc=T/F : [True]
screenx=X : [0]
screenrev=T/F : [False]


scramble=T/F : Run peptide scrambler [False]
fullscramble=T/F : Generate all possible scrambles for each peptide in TDT [False]
scramblecyc=X : Number of attempts to try each scramble before giving up [10000]
seqin=FILE : Sequence file containing peptides to scramble [None]
peptides=LIST : Alternative peptide sequence input for scrambling []
outfile=FILE : Output delimited file of scrambled peptides or peptide and scrambled sequence. [scramble.tdt]
teiresias=X : Length of patterns to be screened by additional TEIRESIAS search on scrambled vs original [0]
teiresiaspath=PATH : Path to TEIRESIAS ['c:/bioware/Teiresias/teiresias_char.exe'] * Use forward slashes (/)


  
### BLAST-based dataset randomiser (uses some of the Output options listed) ###
blastgen=T/F : Activate the BLASTGen method [False]
seqin=FILE : Input sequence file to randomise [None]
keepnames=T/F : Whether to keep same input names in outfile [False]

### EST Randomiser ###
estgen=T/F : Whether to run EST randomiser method [False]


  




History Module Version History
    # 0.0 - Initial Compilation.
    # 1.0 - Initial Working version
    # 1.1 - Added max hydrophobicity
    # 1.3 - Added peptide scrambler
    # 1.4 - Separated Xmer screen and Teiresias pattern screen
    # 1.5 - Added BLASTGen Method
    # 1.6 - Checked function with DNA. Added EST randomiser function.
    # 1.7 - Modified/fixed ESTgen function to work for protein sequences.

SLiMSuite REST Server
