Program:	RJE_SEQ
Description:	DNA/Protein sequence list module
Version:	3.25.3
Last Edit:	21/12/20

Imported modules: rje rje_blast_V1 rje_blast_V2 rje_dismatrix_V2 rje_sequence rje_uniprot

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

Contains Classes and methods for sets of DNA and protein sequences.

Sequence Input/Output Options

seqin=FILE : Loads sequences from FILE (fasta,phylip,aln,uniprot or fastacmd names from fasdb) [None]
query=X : Selects query sequence by name [None]
acclist=LIST : Extract only AccNums in list. LIST can be FILE or list of AccNums X,Y,.. [None]
fasdb=FILE : Fasta format database to extract sequences from [None]
mapseq=FILE : Maps sequences from FILE to sequences of same name [None]
mapdna=FILE : Map DNA sequences from FILE onto sequences of same name in protein alignment [None]
seqout=FILE : Saves 'tidied' sequences to FILE after loading and manipulations [None]
reformat=X : Outputs sequence in a particular format, where X is:
- fasta/fas/phylip/scanseq/acclist/speclist/acc/idlist/fastacmd/teiresias/mysql/nexus/3rf/6rf/est6rf [None]
- if no seqout=FILE given, will use input file name as base and add appropriate extension.
#!# reformat=X may not be fully implemented. Report erroneous behaviour! #!#
logrem=T/F : Whether to log removed sequences [True] - suggest False with filtering of large files!
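
All of the options above are given in option=value form. As a minimal sketch, a command line could be assembled as below. The script name rje_seq.py, the helper build_args, the file name and the accession numbers are illustrative assumptions and are not taken from this documentation.

```python
def build_args(options):
    """Format a dict of options as option=value arguments.

    Booleans become T/F and lists become comma-separated values,
    matching the notation used in the option listing.
    """
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        args.append(f"{key}={value}")
    return args


# Hypothetical file name and accession numbers, for illustration only.
options = {
    "seqin": "input.fas",
    "reformat": "fasta",
    "acclist": ["P12345", "Q67890"],
    "logrem": False,
}
# Assumption: the module is run as a script called rje_seq.py.
command = ["python", "rje_seq.py"] + build_args(options)
print(" ".join(command))
```

Keeping the options in a dictionary makes it easy to switch a single setting (for example logrem) without rebuilding the whole command string by hand.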

Sequence Loading/Formatting Options

gnspacc=T/F : Convert sequence names into gene_SPECIES__AccNum format wherever possible. [False]
alphabet=LIST : Alphabet allowed in sequences [standard 1 letter AA codes]
replacechar=T/F : Whether to remove numbers and replace characters not found in the given alphabet with 'X' [True]
autofilter=T/F : Whether to automatically apply sequence filters etc. upon loading sequence [True]
autoload=T/F : Whether to automatically load sequences upon initiating object [True]
memsaver=T/F : Minimise memory usage. Input sequences must be fasta. [False]
degap=T/F : Degaps each sequence [False]
tidygap=T/F : Removes any columns from alignments that are 100% gap [True]
ntrim=X : Trims off regions >= X proportion N bases (X residues for protein) [0.0]
seqtype=X : Force program to read as DNA, RNA, Protein or Mixed (case insensitive; read=Will work it out) [None]
dna=T/F : Alternative identification of sequences as DNA [False]
mixed=T/F : Whether to allow auto-identification of mixed sequence types (else uses first seq only) [False]
align=T/F : Whether the sequences should be aligned. Will align if unaligned. [False]
rna2dna=T/F : Converts RNA to DNA [False]
trunc=X : Truncates each sequence to the first X aa. (Last X aa if -ve) (Useful for webservers like SignalP.) [0]
usecase=T/F : Whether to output sequences in mixed case rather than converting all to upper case [False]
case=LIST : List of positions to switch case, starting with first lower case (e.g. case=20,-20 will have ends UC) []
countspec=T/F : Generate counts of different species and output in log [False]
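
The behaviour described for trunc=X and tidygap=T/F can be sketched as follows. This is an illustration of the documented semantics rather than the module's own code: the function names are hypothetical, '-' is assumed to be the gap character, trunc=0 is assumed to mean "no truncation", and the aligned sequences are assumed to be of equal length.

```python
def truncate(seq, x):
    """Keep the first x residues, or the last |x| residues if x is negative.

    Assumption: x == 0 (the documented default) leaves the sequence unchanged.
    """
    if x > 0:
        return seq[:x]
    if x < 0:
        return seq[x:]
    return seq


def tidy_gaps(alignment, gap="-"):
    """Drop alignment columns in which every sequence has a gap."""
    if not alignment:
        return []
    keep = [
        i for i in range(len(alignment[0]))
        if any(seq[i] != gap for seq in alignment)
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


print(truncate("MKTAYIAKQR", 4))    # first four residues
print(truncate("MKTAYIAKQR", -3))   # last three residues
print(tidy_gaps(["MK--T", "M---T", "MA--S"]))
```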

Sequence Filtering Options

filterout=FILE : Saves filtered sequences (as fasta) into FILE. *NOTE: File is appended if append=T* [None]
minlen=X : Minimum length of sequences [0]
maxlen=X : Maximum length of sequences (<=0 = No maximum) [0]
maxgap=X : Maximum proportion of sequence that may be gaps (<=0 = No maximum) [0]
maxx=X : Maximum proportion of sequence that may be Xs (<=0 = No maximum; >=1 = Absolute no.) [0]
maxglob=X : Maximum proportion of sequence predicted to be ordered (<=0 = None; >=1 = Absolute) [0]
minorf=X : Minimum ORF length for a DNA/EST translation (reformatting only) [0]
minpoly=X : Minimum length of poly-A tail for 3rf / 6rf EST translation (reformatting only) [20]
gapfilter=T/F : Whether to filter gappy sequences upon loading [True]
nosplice=T/F : If nosplice=T, UniProt splice variants will be filtered out [False]
dblist=LIST : List of databases in order of preference (good to bad)
[sprot,ipi,uniprot,trembl,ens_known,ens_novel,ens_scan]
dbonly=T/F : Whether to only allow sequences from listed databases [False]
unkspec=T/F : Whether sequences of unknown species are allowed [True]
9spec=T/F : Whether to treat 9XXXX species codes as actual species (generally higher taxa) [False]
accnr=T/F : Check for redundant Accession Numbers/Names on loading sequences. [True]
seqnr=T/F : Make sequences Non-Redundant [False]
nrid=X : %Identity cut-off for Non-Redundancy (GABLAMO) [100.0]
nrsim=X : %Similarity cut-off for Non-Redundancy (GABLAMO) [None]
nralign=T/F : Use ALIGN for non-redundancy calculations rather than GABLAMO [False]
specnr=T/F : Non-Redundancy within same species only [False]
querynr=T/F : Perform Non-Redundancy on Query species (True) or limit to non-Query species (False) [True]
nrkeepann=T/F : Append annotation of redundant sequences onto NR sequences [False]
goodX=LIST : Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms, at least one of which must be in the description line
badX=LIST : As goodX but excludes rather than retains filtered sequences
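
The threshold conventions used by minlen, maxlen and maxx (values <= 0 switch a maximum off; maxx >= 1 is an absolute count rather than a proportion) can be illustrated with a short sketch. The function name is hypothetical, the sequence is assumed to be ungapped, and treating a value exactly equal to the limit as passing is an assumption.

```python
def passes_filters(seq, minlen=0, maxlen=0, maxx=0.0):
    """Return True if an ungapped sequence passes the length and X-content filters."""
    length = len(seq)
    if length < minlen:
        return False
    if maxlen > 0 and length > maxlen:      # maxlen <= 0 means no maximum
        return False
    x_count = seq.upper().count("X")
    if maxx >= 1:                           # absolute number of Xs allowed
        return x_count <= maxx
    if maxx > 0 and length > 0:             # proportion of the sequence
        return x_count / length <= maxx
    return True                             # maxx <= 0 means no maximum


for limit in (0, 0.3, 0.5, 1, 2):
    print(limit, passes_filters("MKXXT", maxx=limit))
```

The same pattern (off at <= 0, proportion below 1, absolute at >= 1) is what the option listing describes for maxglob; maxgap is documented as a proportion only.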

System Info Options

Use forward slashes for paths (/)

blastpath=PATH
blast+path=PATH
fastapath=PATH
clustalw=PATH ['clustalw']
clustalo=PATH ['clustalo']
mafft=PATH ['mafft']
muscle=PATH ['muscle']
fsa=PATH
pagan=PATH
win32=T/F [False]
alnprog=X [clustalo]

Sequence Manipulation/Function Options

pamdis : Makes an all by all PAM distance matrix
split=X : Splits file into numbered files of X sequences. (Useful for webservers like TMHMM.)
relcons=FILE : Returns a file containing Pos AbsCons RelCons [None]
relconwin=X : Window size for relative conservation scoring [30]
makepng=T/F : Whether to make RelCons PNG files [False]
seqname=X : Output sequence names for PNG files etc. (short/Name/Number/AccNum/ID) [short]
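
As an illustration of what split=X describes, the sketch below divides a list of sequence records into numbered batches of at most X sequences. The function name and the (name, sequence) record layout are assumptions; how the module names the resulting files is not stated here, so the sketch only numbers the batches.

```python
def split_records(records, x):
    """Yield consecutive batches of at most x records."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    for start in range(0, len(records), x):
        yield records[start:start + x]


# Hypothetical records as (name, sequence) pairs.
records = [
    ("seq1", "MKT"), ("seq2", "MAS"), ("seq3", "MTT"),
    ("seq4", "MKK"), ("seq5", "MLL"),
]
for number, batch in enumerate(split_records(records, 2), start=1):
    print(number, [name for name, _ in batch])
```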

DisMatrix Options

outmatrix=X : Type for output matrix - text / mysql / phylip

Special Options

blast2fas=FILE1,FILE2,...,FILEn : Will blast sequences against list of databases and compile a fasta file of results per query
- use options from rje_blast.py for each individual blast (blastd=FILE will be over-ridden)
- saves results in AccNum.blast.fas and will append existing files!
keepblast=T/F : Whether to keep BLAST results files for blast2fas searches [True]
haqbat=FILE : Generate a batch file (FILE) to run HAQESAC on generated BLAST files, with seqin as queries [None]

Classes

SeqList(rje.RJE_Object):
- Sequence List Class. Holds a list of Sequence Objects and has methods for manipulation etc.
Sequence(rje_sequence.Sequence):
- Individual Sequence Class.
DisMatrix(rje_dismatrix.DisMatrix):
- Sequence Distance Matrix Class.

Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Renamed major attributes
    # 0.2 - New implementation on more generic OO approach. Non-Redundancy Output
    # 0.3 - No Out Object in Objects
    # 1.0 - Better Documentation to go with GASP V:1.2
    # 1.1 - Better DNA stuff
    # 1.2 - Added ClustalW align
    # 1.3 - Separated Sequence object into rje_sequence.py
    # 1.4 - Add rudimentary gnspacc=T/F
    # 1.5 - Changed pwAln to use popen()
    # 1.6 - Fixed nrdic problem in Redundancy check and added user-definition of database list
    # 1.8 - Added UniProt input and acclist reading
    # 1.9 - Added 'reformat=scanseq' option but not properly implemented. Added align=T/F.
    # 2.0 - Major reworking of commandline options and introduction of self.list dictionary (rje v3.0)
    # 2.1 - Added reformat of UniProt with memsaver=T.
    # 2.2 - Added GABLAM non-redundancy
    # 2.3 - Added NR in memsaver mode
    # 2.4 - Changed some of the log output (REM and redundancy) to look better.
    # 2.5 - Added nr_qry to makeNR()
    # 2.6 - Added mysql reformat output: fastacmd, protein_id, acc_num, spec_code, description (delimited)
    # 2.7 - Added SeqCount() method. Incorporated reading of sequence case.
    # 2.8 - Added NEXUS output for MrBayes compatibility
    # 2.9 - Added setupSubDict(masking=True) for use in probabilistic conservation scores
    # 3.0 - Start of improvements for DNA sequences with dna=T.
    # 3.1 - Added relative conservation calculations for a whole alignment.
    # 3.2 - Added output of sequences for making alignments in R.
    # 3.3 - Added 6RF reformatting for DNA sequences.
    # 3.4 - Added HAQBAT option
    # 3.5 - Added extra alignment program, MAFFT
    # 3.6 - Added stripGap() method. Replaced self.seq with self.seqs() for reading. (Replace with list at some point.)
    # 3.7 - Added raw option for single sequence load.
    # 3.8 - Added maxGlob setting for screening out globular proteins.
    # 3.9 - Added reading of mafft format when not producing standard fasta.
    # 3.10- Added ntrim=X : Trims off regions >= X proportion N bases (X residues for protein) [0.5]
    # 3.11- Added mapdna=FILE option to map DNA sequences onto protein alignment
    # 3.12- Added countspec=T/F   : Generate counts of different species and output in log [False]
    # 3.13- Updated sequence type checking for use with GABLAM 2.10.
    # 3.14- Added CLUSTAL Omega alignment program ['clustalo']
    # 3.15- Added PAGAN alignment program ['pagan'] and (hopefully) fixed minor Windows fastacmd bug.
    # 3.16- Added BLAST+ path and seqFromBlastDBCmd()
    # 3.17- Updated to use BLAST+ and rje_blast_V2
    # 3.18- Minor BLAST+ bug fixes. Added exceptions to readBLAST failure.
    # 3.19- Fixed BLAST+ sequence extraction name truncation error.
    # 3.20- Added run() method for SeqSuite.
    # 3.21.0 - Added extraction of uniprot IDs for seqin.
    # 3.22.0 - Added loading sequences from provided sequence files contents directly, bypassing file reading.
    # 3.22.1 - Fixed problem if seqin is blank, triggering odd Uniprot download.
    # 3.23.0 - Add speclist to reformat options.
    # 3.24.0 - Added REST seqout output.
    # 3.25.0 - 9spec=T/F   : Whether to treat 9XXXX species codes as actual species (generally higher taxa) [False]
    # 3.25.1 - Fixed -long_seqids retrieval bug.
    # 3.25.2 - Fixed 9spec filtering bug.
    # 3.25.3 - Added some bug fixes from Norman that were giving him errors.

RJE_SEQ REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
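
A minimal sketch of parsing the full output into its OUTFMT sections is given below. It assumes that each section starts with a ###~...~### separator line immediately followed by a "# OUTFMT:" header line, and that the number of tildes does not matter. The function name and the section names in the sample (log, seqout) are used for illustration only.

```python
import re

SEPARATOR = re.compile(r"^###~+###\s*$")
HEADER = re.compile(r"^# (\S+):\s*$")


def parse_rest_output(text):
    """Split full REST output into a dict of {OUTFMT: section text}."""
    sections = {}
    current = None
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        # A separator followed by a header line opens a new section.
        if SEPARATOR.match(lines[i]) and i + 1 < len(lines):
            header = HEADER.match(lines[i + 1])
            if header:
                current = header.group(1)
                sections[current] = []
                i += 2
                continue
        # Any other line belongs to the section currently open (if any).
        if current is not None:
            sections[current].append(lines[i])
        i += 1
    return {name: "\n".join(body) for name, body in sections.items()}


# Illustrative sample; real server output may contain other sections.
sample = "\n".join([
    "###~~~~~~###",
    "# log:",
    "first log line",
    "###~~~~~~###",
    "# seqout:",
    ">seq1",
    "MKT",
])
sections = parse_rest_output(sample)
print(sorted(sections))
```

Lines that appear before the first section header are ignored, which keeps any preamble in the server response out of the parsed sections.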

Available REST Outputs

There is currently no specific help available on REST output for this program.
