Contains Classes and methods for sets of DNA and protein sequences.
Sequence Input/Output Options
seqin=FILE : Loads sequences from FILE (fasta,phylip,aln,uniprot or fastacmd names from fasdb) [
query=X : Selects query sequence by name [
acclist=LIST : Extract only AccNums in list. LIST can be FILE or list of AccNums X,Y,.. [
fasdb=FILE : Fasta format database to extract sequences from [
mapseq=FILE : Maps sequences from FILE to sequences of same name [
mapdna=FILE : Map DNA sequences from FILE onto sequences of same name in protein alignment [
seqout=FILE : Saves 'tidied' sequences to FILE after loading and manipulations [
reformat=X : Outputs sequence in a particular format, where X is:
- fasta/fas/phylip/scanseq/acclist/speclist/acc/idlist/fastacmd/teiresias/mysql/nexus/3rf/6rf/est6rf [None]
- if no seqout=FILE given, will use input file name as base and add appropriate extension.
reformat=X may not be fully implemented. Report erroneous behaviour! #!#
logrem=T/F : Whether to log removed sequences [True] - suggest False with filtering of large files!
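
The options above are all written as name=value pairs, with T/F for switches. As a minimal sketch of how such an option set could be assembled from Python, the helper below builds an argument list. The helper name, the `python rje_seq.py` invocation and the example file names are assumptions for illustration, not part of the module itself.

```python
def build_rje_seq_command(options, script="rje_seq.py", python="python"):
    """Return an argument list of name=value options (hypothetical helper)."""
    args = [python, script]  # assumed invocation; adjust to the local install
    for name, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"  # switches are written as T/F
        args.append(f"{name}={value}")
    return args


# Example: load a fasta file, degap it and request phylip output.
cmd = build_rje_seq_command({
    "seqin": "proteins.fas",   # hypothetical input file
    "reformat": "phylip",
    "degap": True,
    "logrem": False,
})
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once the script location is known.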
Sequence Loading/Formatting Options
gnspacc=T/F : Convert sequence names into gene_SPECIES__AccNum format wherever possible. [
alphabet=LIST : Alphabet allowed in sequences [standard 1 letter AA codes]
replacechar=T/F : Whether to remove numbers and replace characters not found in the given alphabet with 'X' [
autofilter=T/F : Whether to automatically apply sequence filters etc. upon loading sequence [
autoload=T/F : Whether to automatically load sequences upon initiating object [
memsaver=T/F : Minimise memory usage. Input sequences must be fasta. [
degap=T/F : Degaps each sequence [
tidygap=T/F : Removes any columns from alignments that are 100% gap [
ntrim=X : Trims off regions >= X proportion N bases (X residues for protein) [
seqtype=X : Force program to read as DNA, RNA, Protein or Mixed (case insensitive; read=Will work it out) [
dna=T/F : Alternative identification of sequences as DNA [
mixed=T/F : Whether to allow auto-identification of mixed sequences types (else uses first seq only) [
align=T/F : Whether the sequences should be aligned. Will align if unaligned. [
rna2dna=T/F : Converts RNA to DNA [
trunc=X : Truncates each sequence to the first X aa. (Last X aa if -ve) (Useful for webservers like SignalP.) [
usecase=T/F : Whether to output sequences in mixed case rather than converting all to upper case [
case=LIST : List of positions to switch case, starting with first lower case (e.g. case=20,-20 will have ends UC)
countspec=T/F : Generate counts of different species and output in log [
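
The behaviour described for trunc=X, degap=T/F and tidygap=T/F can be illustrated with a short sketch. This is not the module's own code: the function names are hypothetical, '-' is assumed to be the gap character, and trunc=0 is assumed to mean no truncation.

```python
def truncate(seq, x):
    """First x residues if x > 0, last |x| residues if x < 0 (0 assumed to mean no change)."""
    if x > 0:
        return seq[:x]
    if x < 0:
        return seq[x:]
    return seq


def degap(seq, gap="-"):
    """Remove every gap character from a single sequence."""
    return seq.replace(gap, "")


def tidy_gaps(aligned, gap="-"):
    """Drop alignment columns in which every sequence has a gap."""
    if not aligned:
        return []
    keep = [i for i in range(len(aligned[0]))
            if any(seq[i] != gap for seq in aligned)]
    return ["".join(seq[i] for i in keep) for seq in aligned]


print(truncate("MKTAYIAK", 3))      # first three residues
print(truncate("MKTAYIAK", -3))     # last three residues
print(tidy_gaps(["MK-A", "M--A"]))  # third column is 100% gap and is removed
```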
Sequence Filtering Options
filterout=FILE : Saves filtered sequences (as fasta) into FILE. *NOTE: File is appended if
minlen=X : Minimum length of sequences [
maxlen=X : Maximum length of sequences (<=0 = No maximum) [
maxgap=X : Maximum proportion of sequence that may be gaps (<=0 = No maximum) [
maxx=X : Maximum proportion of sequence that may be Xs (<=0 = No maximum; >=1 = Absolute no.) [
maxglob=X : Maximum proportion of sequence predicted to be ordered (<=0 = None; >=1 = Absolute) [
minorf=X : Minimum ORF length for a DNA/EST translation (reformatting only) [
minpoly=X : Minimum length of poly-A tail for 3rf / 6rf EST translation (reformatting only) [
gapfilter=T/F : Whether to filter gappy sequences upon loading [
nosplice=T/F : If nosplice=T, UniProt splice variants will be filtered out [
dblist=LIST : List of databases in order of preference (good to bad)
dbonly=T/F : Whether to only allow sequences from listed databases [
unkspec=T/F : Whether sequences of unknown species are allowed [
9spec=T/F : Whether to treat 9XXXX species codes as actual species (generally higher taxa) [
accnr=T/F : Check for redundant Accession Numbers/Names on loading sequences. [
seqnr=T/F : Make sequences Non-Redundant [
nrid=X : %Identity cut-off for Non-Redundancy (GABLAMO) [
nrsim=X : %Similarity cut-off for Non-Redundancy (GABLAMO) [None]
nralign=T/F : Use ALIGN for non-redundancy calculations rather than GABLAMO [
specnr=T/F : Non-Redundancy within same species only [
querynr=T/F : Perform Non-Redundancy on Query species (True) or limit to non-Query species (False) [
nrkeepann=T/F : Append annotation of redundant sequences onto NR sequences [
goodX=LIST : Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms, at least one of which must be in description line
badX=LIST : As goodX but excludes rather than retains filtered sequences
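
Two of the filtering rules above lend themselves to a small worked sketch: the three-way reading of maxx=X (<=0 no maximum, >=1 absolute number, otherwise a proportion) and the goodX/badX lists. The function names and example records are hypothetical, and two details are assumptions: a sequence fails only when it strictly exceeds the maximum, and the proportion is taken over the full sequence length.

```python
def exceeds_max_x(seq, maxx):
    """True if seq contains more Xs than maxx permits."""
    if maxx <= 0 or not seq:
        return False                      # <=0 = no maximum
    x_count = seq.upper().count("X")
    if maxx >= 1:
        return x_count > maxx             # >=1 = absolute number of Xs
    return x_count / len(seq) > maxx      # otherwise a proportion of the sequence


def passes_lists(value, good=None, bad=None):
    """goodX keeps only listed values; badX excludes listed values."""
    if good and value not in good:
        return False
    if bad and value in bad:
        return False
    return True


# Hypothetical records: (accession number, species code, sequence)
records = [
    ("P12345", "HUMAN", "MKTAXXA"),
    ("Q99999", "MOUSE", "MKTAYIA"),
    ("O00001", "YEAST", "MKXXXXA"),
]
kept = [acc for acc, spec, seq in records
        if passes_lists(spec, good={"HUMAN", "MOUSE"})
        and not exceeds_max_x(seq, 0.3)]
print(kept)
```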
System Info Options
- Use forward slashes for paths (/)
blastpath=PATH : Path to BLAST programs ['']
blast+path=PATH : Path to BLAST+ programs ['']
fastapath=PATH : Path to FASTA programs ['']
clustalw=PATH : Path to CLUSTALW program [
clustalo=PATH : Path to CLUSTAL Omega alignment program [
mafft=PATH : Path to MAFFT alignment program [
muscle=PATH : Path to MUSCLE alignment program [
fsa=PATH : Path to FSA alignment program ['fsa']
pagan=PATH : Path to PAGAN alignment program ['pagan']
win32=T/F : Run in Win32 Mode [
alnprog=X : Choice of alignment program to use (clustalw/clustalo/muscle/mafft/fsa/pagan) [
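
Since paths are expected with forward slashes, a Windows-style path may need converting before it is passed as an option. A minimal sketch, with a hypothetical helper name and example path:

```python
def to_forward_slashes(path):
    """Replace backslashes with forward slashes, leaving the rest of the path untouched."""
    return path.replace("\\", "/")


blast_option = "blastpath=" + to_forward_slashes("C:\\bioware\\blast\\")
print(blast_option)
```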
Sequence Manipulation/Function Options
pamdis : Makes an all by all PAM distance matrix
split=X : Splits file into numbered files of X sequences. (Useful for webservers like TMHMM.)
relcons=FILE : Returns a file containing Pos AbsCons RelCons [
relconwin=X : Window size for relative conservation scoring [
makepng=T/F : Whether to make RelCons PNG files [
seqname=X : Output sequence names for PNG files etc. (short/Name/Number/AccNum/ID) [
outmatrix=X : Type for output matrix - text / mysql / phylip
blast2fas=FILE1,FILE2,...,FILEn : Will blast sequences against list of databases and compile a fasta file of results per query
- use options from rje_blast.py for each individual blast (blastd=FILE will be over-ridden)
- saves results in AccNum.blast.fas and will append existing files!
keepblast=T/F : Whether to keep BLAST results files for blast2fas searches [
haqbat=FILE : Generate a batch file (FILE) to run HAQESAC on generated BLAST files, with seqin as queries [
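
The batching behind split=X can be sketched as follows. This only shows how a list of sequences divides into consecutive groups of X; the function name is hypothetical and the naming of the numbered output files is not covered here.

```python
def split_batches(records, size):
    """Divide records into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return [records[i:i + size] for i in range(0, len(records), size)]


names = ["seq1", "seq2", "seq3", "seq4", "seq5"]
batches = split_batches(names, 2)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

The final batch holds the remainder when the number of sequences is not a multiple of X.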
- Sequence List Class. Holds a list of Sequence Objects and has methods for manipulation etc.
- Individual Sequence Class.
- Sequence Distance Matrix Class.
Module Version History
# 0.0 - Initial Compilation.
# 0.1 - Renamed major attributes
# 0.2 - New implementation on more generic OO approach. Non-Redundancy Output
# 0.3 - No Out Object in Objects
# 1.0 - Better Documentation to go with GASP V:1.2
# 1.1 - Better DNA stuff
# 1.2 - Added ClustalW align
# 1.3 - Separated Sequence object into rje_sequence.py
# 1.4 - Add rudimentary gnspacc=T/F
# 1.5 - Changed pwAln to use popen()
# 1.6 - Fixed nrdic problem in Redundancy check and added user-definition of database list
# 1.8 - Added UniProt input and acclist reading
# 1.9 - Added 'reformat=scanseq' option but not properly implemented. Added align=T/F.
# 2.0 - Major reworking of commandline options and introduction of self.list dictionary (rje v3.0)
# 2.1 - Added reformat of UniProt with memsaver=T.
# 2.2 - Added GABLAM non-redundancy
# 2.3 - Added NR in memsaver mode
# 2.4 - Changed some of the log output (REM and redundancy) to look better.
# 2.5 - Added nr_qry to makeNR()
# 2.6 - Added mysql reformat output: fastacmd, protein_id, acc_num, spec_code, description (delimited)
# 2.7 - Added SeqCount() method. Incorporated reading of sequence case.
# 2.8 - Added NEXUS output for MrBayes compatibility
# 2.9 - Added setupSubDict(masking=True) for use in probabilistic conservation scores
# 3.0 - Start of improvements for DNA sequences with dna=T.
# 3.1 - Added relative conservation calculations for a whole alignment.
# 3.2 - Added output of sequences for making alignments in R.
# 3.3 - Added 6RF reformatting for DNA sequences.
# 3.4 - Added HAQBAT option
# 3.5 - Added extra alignment program, MAFFT
# 3.6 - Added stripGap() method. Replaced self.seq with self.seqs() for reading. (Replace with list at some point.)
# 3.7 - Added raw option for single sequence load.
# 3.8 - Added maxGlob setting for screening out globular proteins.
# 3.9 - Added reading of mafft format when not producing standard fasta.
# 3.10- Added ntrim=X : Trims off regions >= X proportion N bases (X residues for protein) [0.5]
# 3.11- Added mapdna=FILE option to map DNA sequences onto protein alignment
# 3.12- Added countspec=T/F : Generate counts of different species and output in log [False]
# 3.13- Updated sequence type checking for use with GABLAM 2.10.
# 3.14- Added CLUSTAL Omega alignment program ['clustalo']
# 3.15- Added PAGAN alignment program ['pagan'] and (hopefully) fixed minor Windows fastacmd bug.
# 3.16- Added BLAST+ path and seqFromBlastDBCmd()
# 3.17- Updated to use BLAST+ and rje_blast_V2
# 3.18- Minor BLAST+ bug fixes. Added exceptions to readBLAST failure.
# 3.19- Fixed BLAST+ sequence extraction name truncation error.
# 3.20- Added run() method for SeqSuite.
# 3.21.0 - Added extraction of uniprot IDs for seqin.
# 3.22.0 - Added loading sequences from provided sequence files contents directly, bypassing file reading.
# 3.22.1 - Fixed problem if seqin is blank, triggering odd Uniprot download.
# 3.23.0 - Add speclist to reformat options.
# 3.24.0 - Added REST seqout output.
# 3.25.0 - 9spec=T/F : Whether to treat 9XXXX species codes as actual species (generally higher taxa) [False]
# 3.25.1 - Fixed -long_seqids retrieval bug.
# 3.25.2 - Fixed 9spec filtering bug.
RJE_SEQ REST Output formats
for program documentation and options. A plain text version is accessed with
can be used to retrieve individual parts of the output, matching the tabs in the default
) output. Individual
elements can also be parsed from the full (
) server output,
which is formatted as follows:
... contents for OUTFMT section ...
Available REST Outputs
There is currently no specific help available on REST output for this program.