Contains Classes and methods for sets of DNA and protein sequences.
Sequence Input/Output Options
seqin=FILE : Loads sequences from FILE (fasta,phylip,aln,uniprot or fastacmd names from fasdb) [
query=X : Selects query sequence by name [
acclist=LIST : Extract only AccNums in list. LIST can be FILE or list of AccNums X,Y,.. [
fasdb=FILE : Fasta format database to extract sequences from [
mapseq=FILE : Maps sequences from FILE to sequences of same name [
mapdna=FILE : Map DNA sequences from FILE onto sequences of same name in protein alignment [
seqout=FILE : Saves 'tidied' sequences to FILE after loading and manipulations [
reformat=X : Outputs sequence in a particular format, where X is:
- fasta/fas/phylip/scanseq/acclist/speclist/acc/idlist/fastacmd/teiresias/mysql/nexus/3rf/6rf/est6rf [None]
- if no seqout=FILE given, will use input file name as base and add appropriate extension.
reformat=X may not be fully implemented. Report erroneous behaviour! #!#
logrem=T/F : Whether to log removed sequences [True] - suggest False with filtering of large files!
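The options above all follow a key=value pattern, with T/F for boolean switches. A minimal sketch of how such arguments could be parsed is given below. The function name parse_options and the file name example.fas are hypothetical, and this is not the module's own command-line handling.

```python
def parse_options(argv):
    """Split 'key=value' arguments into a dict; T/F values become booleans.

    Illustrative only: not the module's real command-line parser.
    """
    options = {}
    for arg in argv:
        if '=' not in arg:
            continue  # arguments without '=' are skipped in this sketch
        key, value = arg.split('=', 1)
        if value.upper() in ('T', 'TRUE'):
            options[key] = True
        elif value.upper() in ('F', 'FALSE'):
            options[key] = False
        else:
            options[key] = value  # FILE, X and LIST values are kept as text
    return options


# Hypothetical input file name; option names are taken from the list above.
opts = parse_options(['seqin=example.fas', 'reformat=phylip', 'logrem=F'])
print(opts)
```

Note that this sketch would also turn a file literally named T or F into a boolean, which a real parser would need to guard against.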
Sequence Loading/Formatting Options
gnspacc=T/F : Convert sequence names into gene_SPECIES__AccNum format wherever possible. [
alphabet=LIST : Alphabet allowed in sequences [standard 1 letter AA codes]
replacechar=T/F : Whether to remove numbers and replace characters not found in the given alphabet with 'X' [
autofilter=T/F : Whether to automatically apply sequence filters etc. upon loading sequence [
autoload=T/F : Whether to automatically load sequences upon initiating object [
memsaver=T/F : Minimise memory usage. Input sequences must be fasta. [
degap=T/F : Degaps each sequence [
tidygap=T/F : Removes any columns from alignments that are 100% gap [
ntrim=X : Trims off regions >= X proportion N bases (X residues for protein) [
seqtype=X : Force program to read as DNA, RNA, Protein or Mixed (case insensitive; read=Will work it out) [
dna=T/F : Alternative identification of sequences as DNA [
mixed=T/F : Whether to allow auto-identification of mixed sequences types (else uses first seq only) [
align=T/F : Whether the sequences should be aligned. Will align if unaligned. [
rna2dna=T/F : Converts RNA to DNA [
trunc=X : Truncates each sequence to the first X aa. (Last X aa if -ve) (Useful for webservers like SignalP.) [
usecase=T/F : Whether to output sequences in mixed case rather than converting all to upper case [
case=LIST : List of positions to switch case, starting with first lower case (e.g. case=20,-20 will have ends UC)
countspec=T/F : Generate counts of different species and output in log [
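Two of the loading options above, trunc=X and tidygap=T/F, can be illustrated with a short sketch. The function names truncate and tidy_gaps are hypothetical, and this is not the module's implementation. It assumes that X = 0 leaves a sequence unchanged, that the gap character is '-', and that all aligned sequences have the same length.

```python
def truncate(sequence, x):
    """Keep the first x residues if x > 0, or the last |x| residues if x < 0.

    Assumption: x == 0 means no truncation.
    """
    if x > 0:
        return sequence[:x]
    if x < 0:
        return sequence[x:]
    return sequence


def tidy_gaps(alignment, gap='-'):
    """Remove alignment columns that are a gap in every sequence.

    Assumption: all sequences in the alignment have equal length.
    """
    if not alignment:
        return []
    # Keep a column if at least one sequence has a non-gap character there.
    keep = [i for i in range(len(alignment[0]))
            if any(seq[i] != gap for seq in alignment)]
    return [''.join(seq[i] for i in keep) for seq in alignment]


print(truncate('MKTAYIAK', 3))
print(truncate('MKTAYIAK', -3))
print(tidy_gaps(['A-C-', 'A--G']))
```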
Sequence Filtering Options
filterout=FILE : Saves filtered sequences (as fasta) into FILE. *NOTE: File is appended if
minlen=X : Minimum length of sequences [
maxlen=X : Maximum length of sequences (<=0 = No maximum) [
maxgap=X : Maximum proportion of sequence that may be gaps (<=0 = No maximum) [
maxx=X : Maximum proportion of sequence that may be Xs (<=0 = No maximum; >=1 = Absolute no.) [
maxglob=X : Maximum proportion of sequence predicted to be ordered (<=0 = None; >=1 = Absolute) [
minorf=X : Minimum ORF length for a DNA/EST translation (reformatting only) [
minpoly=X : Minimum length of poly-A tail for 3rf / 6rf EST translation (reformatting only) [
gapfilter=T/F : Whether to filter gappy sequences upon loading [
nosplice=T/F : If nosplice=T, UniProt splice variants will be filtered out [
dblist=LIST : List of databases in order of preference (good to bad)
dbonly=T/F : Whether to only allow sequences from listed databases [
unkspec=T/F : Whether sequences of unknown species are allowed [
accnr=T/F : Check for redundant Accession Numbers/Names on loading sequences. [
seqnr=T/F : Make sequences Non-Redundant [
nrid=X : %Identity cut-off for Non-Redundancy (GABLAMO) [
nrsim=X : %Similarity cut-off for Non-Redundancy (GABLAMO) [None]
nralign=T/F : Use ALIGN for non-redundancy calculations rather than GABLAMO [
specnr=T/F : Non-Redundancy within same species only [
querynr=T/F : Perform Non-Redundancy on Query species (True) or limit to non-Query species (False) [
nrkeepann=T/F : Append annotation of redundant sequences onto NR sequences [
goodX=LIST : Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms, at least one of which must be in the description line
badX=LIST : As goodX but excludes rather than retains filtered sequences
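The threshold convention used by maxx=X (<=0 for no maximum, a proportion below 1, an absolute count at 1 or above) and the goodX/badX retain/exclude logic can be sketched as follows. The function names passes_maxx and keep_by_lists are hypothetical, and this is not the module's code. Whether the real cut-offs are inclusive is an assumption here.

```python
def passes_maxx(sequence, maxx):
    """Return True if the number of X characters is within the maxx limit.

    maxx <= 0: no maximum; 0 < maxx < 1: proportion of sequence length;
    maxx >= 1: absolute count. Inclusive thresholds are an assumption.
    """
    if maxx <= 0 or not sequence:
        return True
    x_count = sequence.upper().count('X')
    if maxx >= 1:
        return x_count <= maxx
    return x_count / len(sequence) <= maxx


def keep_by_lists(value, good=None, bad=None):
    """Apply goodX-style (retain) and badX-style (exclude) lists to one value."""
    if good and value not in good:
        return False  # a good list is given and the value is not on it
    if bad and value in bad:
        return False  # the value is explicitly excluded
    return True


print(passes_maxx('MKXXA', 0.5))
print(passes_maxx('MKXXA', 1))
print(keep_by_lists('HUMAN', good=['HUMAN', 'MOUSE']))
```

The same three-way threshold pattern applies to maxglob=X, while maxlen=X and maxgap=X only document the <=0 "no maximum" case.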
System Info Options
- * Use forward slashes for paths (/)
blastpath=PATH : Path to BLAST programs ['']
blast+path=PATH : Path to BLAST+ programs ['']
fastapath=PATH : Path to FASTA programs ['']
clustalw=PATH : Path to CLUSTALW program [
clustalo=PATH : Path to CLUSTAL Omega alignment program [
mafft=PATH : Path to MAFFT alignment program [
muscle=PATH : Path to MUSCLE alignment program [
fsa=PATH : Path to FSA alignment program ['fsa']
pagan=PATH : Path to PAGAN alignment program ['pagan']
win32=T/F : Run in Win32 Mode [
alnprog=X : Choice of alignment program to use (clustalw/clustalo/muscle/mafft/fsa/pagan) [
Sequence Manipulation/Function Options
pamdis : Makes an all by all PAM distance matrix
split=X : Splits file into numbered files of X sequences. (Useful for webservers like TMHMM.)
relcons=FILE : Returns a file containing Pos AbsCons RelCons [
relconwin=X : Window size for relative conservation scoring [
makepng=T/F : Whether to make RelCons PNG files [
seqname=X : Output sequence names for PNG files etc. (short/Name/Number/AccNum/ID) [
outmatrix=X : Type for output matrix - text / mysql / phylip
blast2fas=FILE1,FILE2,...,FILEn : Will blast sequences against list of databases and compile a fasta file of results per query
- use options from rje_blast.py for each individual blast (blastd=FILE will be over-ridden)
- saves results in AccNum.blast.fas and will append to existing files!
keepblast=T/F : Whether to keep BLAST results files for blast2fas searches [
haqbat=FILE : Generate a batch file (FILE) to run HAQESAC on generated BLAST files, with seqin as queries [
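The split=X option above divides the input into numbered files of X sequences each. The chunking step could look roughly like the sketch below. The function name split_records is hypothetical, the records are stand-in strings, and how the module actually names or writes the numbered files is not shown.

```python
def split_records(records, x):
    """Divide a list of sequence records into consecutive chunks of x records.

    The final chunk may be shorter. File naming and writing are left out
    because the module's own scheme is not described here.
    """
    if x < 1:
        raise ValueError('x must be a positive integer')
    return [records[i:i + x] for i in range(0, len(records), x)]


# Stand-in records; real input would be parsed sequences.
chunks = split_records(['seq1', 'seq2', 'seq3', 'seq4', 'seq5'], 2)
for number, chunk in enumerate(chunks, start=1):
    print(number, chunk)
```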
- Sequence List Class. Holds a list of Sequence Objects and has methods for manipulation etc.
- Individual Sequence Class.
- Sequence Distance Matrix Class.
Module Version History
# 0.0 - Initial Compilation.
# 0.1 - Renamed major attributes
# 0.2 - New implementation on more generic OO approach. Non-Redundancy Output
# 0.3 - No Out Object in Objects
# 1.0 - Better Documentation to go with GASP V:1.2
# 1.1 - Better DNA stuff
# 1.2 - Added ClustalW align
# 1.3 - Separated Sequence object into rje_sequence.py
# 1.4 - Add rudimentary gnspacc=T/F
# 1.5 - Changed pwAln to use popen()
# 1.6 - Fixed nrdic problem in Redundancy check and added user-definition of database list
# 1.8 - Added UniProt input and acclist reading
# 1.9 - Added 'reformat=scanseq' option but not properly implemented. Added align=T/F.
# 2.0 - Major reworking of commandline options and introduction of self.list dictionary (rje v3.0)
# 2.1 - Added reformat of UniProt with memsaver=T.
# 2.2 - Added GABLAM non-redundancy
# 2.3 - Added NR in memsaver mode
# 2.4 - Changed some of the log output (REM and redundancy) to look better.
# 2.5 - Added nr_qry to makeNR()
# 2.6 - Added mysql reformat output: fastacmd, protein_id, acc_num, spec_code, description (delimited)
# 2.7 - Added SeqCount() method. Incorporated reading of sequence case.
# 2.8 - Added NEXUS output for MrBayes compatibility
# 2.9 - Added setupSubDict(masking=True) for use in probabilistic conservation scores
# 3.0 - Start of improvements for DNA sequences with dna=T.
# 3.1 - Added relative conservation calculations for a whole alignment.
# 3.2 - Added output of sequences for making alignments in R.
# 3.3 - Added 6RF reformatting for DNA sequences.
# 3.4 - Added HAQBAT option
# 3.5 - Added extra alignment program, MAFFT
# 3.6 - Added stripGap() method. Replaced self.seq with self.seqs() for reading. (Replace with list at some point.)
# 3.7 - Added raw option for single sequence load.
# 3.8 - Added maxGlob setting for screening out globular proteins.
# 3.9 - Added reading of mafft format when not producing standard fasta.
# 3.10- Added ntrim=X : Trims off regions >= X proportion N bases (X residues for protein) [0.5]
# 3.11- Added mapdna=FILE option to map DNA sequences onto protein alignment
# 3.12- Added countspec=T/F : Generate counts of different species and output in log [False]
# 3.13- Updated sequence type checking for use with GABLAM 2.10.
# 3.14- Added CLUSTAL Omega alignment program ['clustalo']
# 3.15- Added PAGAN alignment program ['pagan'] and (hopefully) fixed minor Windows fastacmd bug.
# 3.16- Added BLAST+ path and seqFromBlastDBCmd()
# 3.17- Updated to use BLAST+ and rje_blast_V2
# 3.18- Minor BLAST+ bug fixes. Added exceptions to readBLAST failure.
# 3.19- Fixed BLAST+ sequence extraction name truncation error.
# 3.20- Added run() method for SeqSuite.
# 3.21.0 - Added extraction of uniprot IDs for seqin.
# 3.22.0 - Added loading sequences from provided sequence files contents directly, bypassing file reading.
# 3.22.1 - Fixed problem if seqin is blank, triggering odd Uniprot download.
# 3.23.0 - Add speclist to reformat options.
# 3.24.0 - Added REST seqout output.
RJE_SEQ REST Output formats
There is currently no specific help available on REST output for this program. Run with
options. Run with
to get full server output. Individual outputs can be identified/parsed:
can then be used to retrieve individual parts of the output in future.