Function
This module is designed to generate a number of random sequences based on input AA or Xmer frequencies and the desired
order of Markov chain from which to draw the amino acid (or nucleotide) probabilities.
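The idea of drawing residues from Xmer frequencies can be sketched as follows. This is a minimal illustration, not the module's own code: it assumes that markovx=X means each residue is drawn conditional on the preceding X-1 residues (so markovx=1 reduces to plain residue frequencies), and the function names `xmer_counts` and `random_sequence` are hypothetical.

```python
import random
from collections import Counter

def xmer_counts(seqs, x):
    """Count every overlapping Xmer of length x in the input sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - x + 1):
            counts[seq[i:i + x]] += 1
    return counts

def random_sequence(counts, x, length, rng=random):
    """Grow one sequence, choosing each new residue from the Xmers that
    start with the current (x-1)-residue suffix, weighted by their counts."""
    xmers = list(counts)
    # Seed with a whole Xmer drawn in proportion to its count.
    seq = rng.choices(xmers, weights=[counts[k] for k in xmers])[0]
    while len(seq) < length:
        suffix = seq[len(seq) - (x - 1):] if x > 1 else ""
        options = [k for k in xmers if k.startswith(suffix)]
        if not options:
            # Assumption: on a dead end, fall back to the overall Xmer frequencies.
            options = xmers
        pick = rng.choices(options, weights=[counts[k] for k in options])[0]
        seq += pick[-1]
    return seq[:length]

counts = xmer_counts(["ACDEACDE"], 2)
print(random_sequence(counts, 2, 12))
```

Higher values of x reproduce more of the local composition of the input, at the cost of needing more input data to estimate the Xmer counts.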
If poolgen=T, then the amino acid frequencies will be used to make a finite pool of amino acids from which sequences
will be built. This will ensure that the total dataset has the correct amino acid frequencies. Because of the
potential of this to get 'stuck' in impossible sequence space - especially if screenx > 0 - an additional parameter,
poolcyc=X, determines how many times to retry the generation of sequences. If poolgen=F, then generation of sequences
will be faster but the resulting dataset may have Xmer frequencies that differ greatly from the input frequencies,
depending on how many (and which) redundant and/or screened Xmer-containing sequences are removed. (If seqin contains
only one peptide then each random peptide will be a scramble of that peptide.)
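The difference between a finite pool and plain probabilities can be illustrated with a short sketch. It is a simplification under stated assumptions: the whole pool is shuffled and cut into sequences in one step, poolcyc is treated as the total number of attempts, and `build_pool` and `pool_generate` are hypothetical names rather than functions of the module.

```python
import random

def build_pool(freqs, size):
    """Make a list of `size` residues whose composition follows freqs
    (assumed to sum to 1)."""
    pool = []
    for aa in sorted(freqs):
        pool.extend(aa * int(freqs[aa] * size))
    # Rounding down can leave the pool short; top up with the commonest residues.
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    i = 0
    while len(pool) < size:
        pool.append(ranked[i % len(ranked)])
        i += 1
    return pool

def pool_generate(freqs, seqnum, seqlen, poolcyc=1, banned=(), rng=random):
    """Return seqnum sequences that use up the pool exactly, or None if
    every attempt breaks a rule (redundancy or a banned Xmer)."""
    for _ in range(poolcyc):
        pool = build_pool(freqs, seqnum * seqlen)
        rng.shuffle(pool)
        seqs = ["".join(pool[i:i + seqlen]) for i in range(0, len(pool), seqlen)]
        redundant = len(set(seqs)) < seqnum
        screened = any(b in s for s in seqs for b in banned)
        if not redundant and not screened:
            return seqs
    return None

print(pool_generate({"A": 0.5, "C": 0.25, "D": 0.25}, seqnum=4, seqlen=5, poolcyc=50))
```

Because every residue in the pool is used exactly once, the combined output matches the input composition, which is the property poolgen=T is described as guaranteeing. Returning None mirrors the 'stuck' case that poolcyc is meant to mitigate.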
!!!NEW!!! Version 1.3 has a new scramble function, which takes in a list of peptides and tries to construct scrambled
versions of them. In this case, screenx=X sets the length of common Xmers between the scrambled peptide and the
original peptide at which a scrambled peptide will be rejected. This should be set > 1, else all peptides will be
rejected. (If left at the default of zero, no peptides will be rejected.) In this mode, outfile=FILE will set the
name of a delimited output file containing two columns: peptide & scramble. (Default filename = scramble.tdt)
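The rejection rule for scrambles can be sketched as below. This is an illustrative reading of the description, not the module's implementation: `shared_xmer` and `scramble` are hypothetical names, the retry limit borrows the scramblecyc default, and rejecting a scramble identical to the input is an added assumption.

```python
import random

def shared_xmer(a, b, x):
    """True if a and b share any substring of length x."""
    xmers = {a[i:i + x] for i in range(len(a) - x + 1)}
    return any(b[i:i + x] in xmers for i in range(len(b) - x + 1))

def scramble(peptide, screenx=0, scramblecyc=10000, rng=random):
    """Return a shuffled peptide sharing no Xmer of length screenx with the
    original, or None if no acceptable scramble is found in scramblecyc tries."""
    residues = list(peptide)
    for _ in range(scramblecyc):
        rng.shuffle(residues)
        candidate = "".join(residues)
        if candidate == peptide:
            continue  # assumption: an unchanged peptide is not a useful scramble
        if screenx > 0 and shared_xmer(peptide, candidate, screenx):
            continue
        return candidate
    return None

print(scramble("ACDEFGHIK", screenx=3))
print(scramble("ACDEFGHIK", screenx=1, scramblecyc=20))  # None: every residue is shared
```

The second call shows why screenx=1 rejects everything: a scramble always contains the same residues as the original, so every 1mer is shared.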
!!!NEW!!! Version 1.5 has a new BLAST-centred method for making a random dataset from an input dataset, retaining the
approximate evolutionary relationships as defined by BLAST homology, which should result in similar GABLAM statistics
for the randomised dataset. For this, a random sequence is created first. Any BLAST hits between this and other
sequences are then mapped, keeping the required percentage identity (and using different amino acids drawn from the
frequency pool for the rest). The next sequence is taken, completed and then the same process followed, until all
sequences have been made. Improvements to make: (a) incorporate similarity too; (b) adjust aa frequencies after BLAST
mapping. This method is activated by the blastgen=T option and has limited options as yet.
NB. The input dataset will *not* be subject to rje_seq filtering.
!!!NEW!!! Version 1.6 has an EST randomiser. This will go through each sequence in turn and generate a new sequence of
the same length using the NT frequencies (or markov chain frequencies) of just that sequence. Updated in V1.7 to make
this work for proteins too.
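The per-sequence randomisation described for the EST randomiser can be sketched in a few lines. This is a minimal version that uses only single-residue frequencies (no markov chain), and `randomise_each` is a hypothetical name, not part of the module.

```python
import random
from collections import Counter

def randomise_each(seqs, rng=random):
    """Replace each sequence with a random one of the same length, drawn
    from the residue frequencies of that sequence alone."""
    out = []
    for seq in seqs:
        if not seq:
            out.append("")  # nothing to sample from
            continue
        counts = Counter(seq)
        letters = sorted(counts)
        weights = [counts[c] for c in letters]
        out.append("".join(rng.choices(letters, weights=weights, k=len(seq))))
    return out

print(randomise_each(["ACGTTTGA", "MKKLLPT"]))
```

Because the alphabet is taken from each input sequence, the same function works for nucleotide and protein input.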
Commandline
## Generation options ##
seqnum=X : Number of random sequences to generate [24]
seqlen=X,Y : Range of lengths for random sequences [10]
markovx=X : Order of markov chain to use for sequence construction [1]
aafreq=FILE : File from which to read AA Freqs [None]
xmerfile=FILE : File from which to read Xmer frequencies for sequence generation [None]
xmerseq=FILE : Sequence file from which to calculate Xmer frequencies [None]
* xmerseq is overridden by xmerfile and aafreq. aafreq only works if markovx=1 and is overridden by xmerfile *
nrgen=T/F : Whether to generate a non-redundant sequence list (whole-sequence redundancies only) [True]
poolgen=T/F : Whether to build sequences using a fixed AA pool (exact freqs) or probabilities only [False]
poolcyc=X : Number of times to retry making sequences if rules are broken [1]
maxhyd=X : Maximum mean hydrophobicity score [10]
## Output & Naming ##
outfile=FILE : Output file name [randseq.fas]
randname=X : Name 'leader' for output fasta file [randseq]
randdesc=T/F : Whether to include construction details in description line of output file [True]
idmin=X : Starting numerical ID for randseq (allows appending) [1]
idmax=X : Max number for randseq ID. If < seqnum, will use seqnum. If <0, no zero-prefixing of IDs. [0]
append=T/F : Whether to append to outfile [False]
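One possible reading of how randname, idmin and idmax combine into sequence names is sketched below. The underscore separator and the function name `rand_names` are assumptions made for illustration; the padding width follows the option text (the larger of idmax and seqnum, or no padding when idmax is negative).

```python
def rand_names(randname="randseq", seqnum=24, idmin=1, idmax=0):
    """Return seqnum names numbered from idmin, zero-padded to a common width."""
    if idmax < 0:
        width = 0  # negative idmax: no zero-prefixing
    else:
        width = len(str(max(idmax, seqnum)))
    return ["%s_%s" % (randname, str(i).zfill(width))
            for i in range(idmin, idmin + seqnum)]

print(rand_names(seqnum=3, idmax=100))  # three-digit IDs
print(rand_names(seqnum=3, idmax=-1))   # unpadded IDs
```

Setting idmax above seqnum keeps ID widths consistent when a later run appends more sequences starting from a higher idmin.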
## Other Xmers of Interest ##
screenfile=FILE : File of Xmers to screen in generated sequences [None]
xmerocc=T/F : Whether to output occurrences of screened Xmers [True]
screenx=X : Reject generated sequences containing screened Xmers >= X [0]
screenrev=T/F : Whether to screen reverse Xmers too [False]
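The screening options can be read as the following check. This sketch rests on two assumptions that the option text does not settle: that ">= X" refers to the length of the screened Xmer, and that screenx=0 switches rejection off. The function name `screen_hits` is hypothetical.

```python
def screen_hits(seq, xmers, screenx=0, screenrev=False):
    """Return the screened Xmers of length >= screenx that occur in seq."""
    if screenx <= 0:
        return []  # assumption: zero disables rejection
    targets = {x for x in xmers if len(x) >= screenx}
    if screenrev:
        # Also screen each Xmer read backwards.
        targets |= {x[::-1] for x in targets}
    return sorted(x for x in targets if x in seq)

print(screen_hits("MKRGDSL", ["RGD", "KR"], screenx=3))
print(screen_hits("MKDGRSL", ["RGD"], screenx=3, screenrev=True))
```

A generated sequence would be rejected whenever the returned list is non-empty.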
## Peptide Scrambling Parameters ##
Uses seqnum=X, randdesc, idmin/max=X and randname=X
scramble=T/F : Run peptide scrambler [False]
fullscramble=T/F : Generate all possible scrambles for each peptide in TDT [False]
scramblecyc=X : Number of attempts to try each scramble before giving up [10000]
seqin=FILE : Sequence file containing peptides to scramble [None]
peptides=LIST : Alternative peptide sequence input for scrambling []
outfile=FILE : Output delimited file of scrambled peptides or peptide and scrambled sequence. [scramble.tdt]
teiresias=X : Length of patterns to be screened by additional TEIRESIAS search on scrambled vs original [0]
teiresiaspath=PATH : Path to TEIRESIAS ['c:/bioware/Teiresias/teiresias_char.exe'] * Use forward slashes (/)
### BLAST-based dataset randomiser (uses some of the Output options listed) ###
blastgen=T/F : Activate the BLASTGen method [False]
seqin=FILE : Input sequence file to randomise [None]
keepnames=T/F : Whether to keep same input names in outfile [False]
### EST Randomiser ###
estgen=T/F : Whether to run EST randomiser method [False]
History
# 0.0 - Initial Compilation.
# 1.0 - Initial Working version
# 1.1 - Added max hydrophobicity
# 1.3 - Added peptide scrambler
# 1.4 - Separated Xmer screen and Teiresias pattern screen
# 1.5 - Added BLASTGen Method
# 1.6 - Checked function with DNA. Added EST randomiser function.
# 1.7 - Modified/fixed ESTgen function to work for protein sequences.