SLiMSuite REST Server


RJE_SEQ V3.25.3

DNA/Protein sequence list module

Program: RJE_SEQ
Description: DNA/Protein sequence list module
Version: 3.25.3
Last Edit: 21/12/20

Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice


Imported modules: rje rje_blast_V1 rje_blast_V2 rje_dismatrix_V2 rje_sequence rje_uniprot


See SLiMSuite Blog for further documentation. See rje for general commands.

Function

Contains classes and methods for handling sets of DNA and protein sequences.

Sequence Input/Output Options

seqin=FILE : Loads sequences from FILE (fasta,phylip,aln,uniprot or fastacmd names from fasdb) [None]
query=X : Selects query sequence by name [None]
acclist=LIST : Extract only AccNums in list. LIST can be FILE or list of AccNums X,Y,.. [None]
fasdb=FILE : Fasta format database to extract sequences from [None]
mapseq=FILE : Maps sequences from FILE to sequences of same name [None]
mapdna=FILE : Map DNA sequences from FILE onto sequences of same name in protein alignment [None]
seqout=FILE : Saves 'tidied' sequences to FILE after loading and manipulations [None]
reformat=X : Outputs sequence in a particular format, where X is:
- fasta/fas/phylip/scanseq/acclist/speclist/acc/idlist/fastacmd/teiresias/mysql/nexus/3rf/6rf/est6rf [None]
- if no seqout=FILE is given, will use the input file name as base and add the appropriate extension.
#!# reformat=X may not be fully implemented. Report erroneous behaviour! #!#
logrem=T/F : Whether to log removed sequences [True] - suggest False with filtering of large files!
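
As an illustration of how the options above combine, the following minimal sketch assembles an option=value argument list in Python. It assumes the module is launched as `python rje_seq.py` followed by option=value arguments; the helper name `build_rje_seq_args` and the file names `input.fas` and `tidy.fas` are hypothetical and not part of RJE_SEQ.

```python
import shlex

def build_rje_seq_args(options, script="rje_seq.py"):
    """Turn a dict of option names and values into 'name=value' arguments.

    Assumption: the script is run as 'python rje_seq.py name=value ...'.
    """
    args = ["python", script]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"              # T/F switches as documented
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)    # LIST options are comma-separated
        args.append(f"{name}={value}")
    return args

# Hypothetical run: load a fasta file, reformat it and do not log removed sequences.
args = build_rje_seq_args({
    "seqin": "input.fas",
    "reformat": "fasta",
    "seqout": "tidy.fas",
    "logrem": False,
})
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids shell quoting problems with file names.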

Sequence Loading/Formatting Options

gnspacc=T/F : Convert sequence names into gene_SPECIES__AccNum format wherever possible. [False]
alphabet=LIST : Alphabet allowed in sequences [standard 1 letter AA codes]
replacechar=T/F : Whether to remove numbers and replace characters not found in the given alphabet with 'X' [True]
autofilter=T/F : Whether to automatically apply sequence filters etc. upon loading sequence [True]
autoload=T/F : Whether to automatically load sequences upon initiating object [True]
memsaver=T/F : Minimise memory usage. Input sequences must be fasta. [False]
degap=T/F : Degaps each sequence [False]
tidygap=T/F : Removes any columns from alignments that are 100% gap [True]
ntrim=X : Trims off regions >= X proportion N bases (X residues for protein) [0.0]
seqtype=X : Force program to read as DNA, RNA, Protein or Mixed (case insensitive; read=Will work it out) [None]
dna=T/F : Alternative identification of sequences as DNA [False]
mixed=T/F : Whether to allow auto-identification of mixed sequences types (else uses first seq only) [False]
align=T/F : Whether the sequences should be aligned. Will align if unaligned. [False]
rna2dna=T/F : Converts RNA to DNA [False]
trunc=X : Truncates each sequence to the first X aa. (Last X aa if -ve) (Useful for webservers like SignalP.) [0]
usecase=T/F : Whether to output sequences in mixed case rather than converting all to upper case [False]
case=LIST : List of positions to switch case, starting with first lower case (e.g. case=20,-20 will have ends UC) []
countspec=T/F : Generate counts of different species and output in log [False]
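
The behaviour described for trunc=X and tidygap=T/F can be sketched in a few lines of Python. This is an illustration of the documented semantics only, not the RJE_SEQ implementation; the function names `truncate` and `tidy_gaps` are hypothetical, and the alignment is assumed to be a list of equal-length strings with '-' as the gap character.

```python
def truncate(sequence, x):
    """Keep the first x residues if x > 0, the last |x| if x < 0, all if x == 0."""
    if x > 0:
        return sequence[:x]
    if x < 0:
        return sequence[x:]
    return sequence

def tidy_gaps(alignment, gap="-"):
    """Drop alignment columns in which every sequence has a gap."""
    if not alignment:
        return []
    # Keep a column if at least one sequence has a non-gap character there.
    keep = [i for i in range(len(alignment[0]))
            if any(seq[i] != gap for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]

print(truncate("MKTAYIAK", 3))
print(truncate("MKTAYIAK", -3))
print(tidy_gaps(["MK-TA", "M--TA", "MK-T-"]))
```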

Sequence Filtering Options

filterout=FILE : Saves filtered sequences (as fasta) into FILE. *NOTE: File is appended if append=T* [None]
minlen=X : Minimum length of sequences [0]
maxlen=X : Maximum length of sequences (<=0 = No maximum) [0]
maxgap=X : Maximum proportion of sequence that may be gaps (<=0 = No maximum) [0]
maxx=X : Maximum proportion of sequence that may be Xs (<=0 = No maximum; >=1 = Absolute no.) [0]
maxglob=X : Maximum proportion of sequence predicted to be ordered (<=0 = None; >=1 = Absolute) [0]
minorf=X : Minimum ORF length for a DNA/EST translation (reformatting only) [0]
minpoly=X : Minimum length of poly-A tail for 3rf / 6rf EST translation (reformatting only) [20]
gapfilter=T/F : Whether to filter gappy sequences upon loading [True]
nosplice=T/F : If nosplice=T, UniProt splice variants will be filtered out [False]
dblist=LIST : List of databases in order of preference (good to bad)
[sprot,ipi,uniprot,trembl,ens_known,ens_novel,ens_scan]
dbonly=T/F : Whether to only allow sequences from listed databases [False]
unkspec=T/F : Whether sequences of unknown species are allowed [True]
9spec=T/F : Whether to treat 9XXXX species codes as actual species (generally higher taxa) [False]
accnr=T/F : Check for redundant Accession Numbers/Names on loading sequences. [True]
seqnr=T/F : Make sequences Non-Redundant [False]
nrid=X : %Identity cut-off for Non-Redundancy (GABLAMO) [100.0]
nrsim=X : %Similarity cut-off for Non-Redundancy (GABLAMO) [None]
nralign=T/F : Use ALIGN for non-redundancy calculations rather than GABLAMO [False]
specnr=T/F : Non-Redundancy within same species only [False]
querynr=T/F : Perform Non-Redundancy on Query species (True) or limit to non-Query species (False) [True]
nrkeepann=T/F : Append annotation of redundant sequences onto NR sequences [False]
goodX=LIST : Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms, at least one of which must be in the description line
badX=LIST : As goodX but excludes rather than retains filtered sequences
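
Several of the cut-offs above share a convention: <=0 means no maximum and, for maxx, values >=1 are absolute counts rather than proportions. The sketch below shows one way to read that convention for minlen, maxlen and maxx. It is an assumption-laden illustration rather than the RJE_SEQ code: the names `exceeds` and `passes_filters` are hypothetical, length is assumed to be counted on ungapped residues, and a sequence is assumed to fail only when it is strictly above a cut-off.

```python
def exceeds(count, length, cutoff):
    """True if count breaks the cutoff.

    cutoff <= 0 : no maximum
    0 < cutoff < 1 : maximum proportion of the sequence
    cutoff >= 1 : maximum absolute number
    """
    if cutoff <= 0 or length == 0:
        return False
    if cutoff >= 1:
        return count > cutoff
    return count / length > cutoff

def passes_filters(sequence, minlen=0, maxlen=0, maxx=0):
    """Apply minlen, maxlen and maxx to one sequence (gaps ignored for length)."""
    residues = sequence.replace("-", "")
    if len(residues) < minlen:
        return False
    if maxlen > 0 and len(residues) > maxlen:
        return False
    if exceeds(residues.upper().count("X"), len(residues), maxx):
        return False
    return True

print(passes_filters("MKXXXIAK", maxx=0.25))  # 3 of 8 residues are X
print(passes_filters("MKXXXIAK", maxx=3))     # absolute limit of 3 Xs
```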

System Info Options

Note: Use forward slashes for paths (/)
blastpath=PATH : Path to BLAST programs ['']
blast+path=PATH : Path to BLAST+ programs ['']
fastapath=PATH : Path to FASTA programs ['']
clustalw=PATH : Path to CLUSTALW program ['clustalw']
clustalo=PATH : Path to CLUSTAL Omega alignment program ['clustalo']
mafft=PATH : Path to MAFFT alignment program ['mafft']
muscle=PATH : Path to MUSCLE alignment program ['muscle']
fsa=PATH : Path to FSA alignment program ['fsa']
pagan=PATH : Path to PAGAN alignment program ['pagan']
win32=T/F : Run in Win32 Mode [False]
alnprog=X : Choice of alignment program to use (clustalw/clustalo/muscle/mafft/fsa/pagan) [clustalo]

Sequence Manipulation/Function Options

pamdis : Makes an all by all PAM distance matrix
split=X : Splits file into numbered files of X sequences. (Useful for webservers like TMHMM.)
relcons=FILE : Returns a file containing Pos AbsCons RelCons [None]
relconwin=X : Window size for relative conservation scoring [30]
makepng=T/F : Whether to make RelCons PNG files [False]
seqname=X : Output sequence names for PNG files etc. (short/Name/Number/AccNum/ID) [short]
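
The split=X option divides the input into numbered files of X sequences each. A minimal sketch of that chunking logic follows; it is not taken from RJE_SEQ, and the function names and the `base.N.fas` naming pattern are hypothetical.

```python
def chunk_records(records, size):
    """Split a list of sequence records into consecutive chunks of at most `size`."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return [records[i:i + size] for i in range(0, len(records), size)]

def numbered_names(base, n_chunks, ext="fas"):
    """Hypothetical naming scheme for the numbered output files."""
    return [f"{base}.{i}.{ext}" for i in range(1, n_chunks + 1)]

records = ["seq1", "seq2", "seq3", "seq4", "seq5"]
chunks = chunk_records(records, 2)
names = numbered_names("input", len(chunks))
for name, chunk in zip(names, chunks):
    print(name, chunk)
```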

DisMatrix Options

outmatrix=X : Type for output matrix - text / mysql / phylip

Special Options

blast2fas=FILE1,FILE2,...,FILEn : Will blast sequences against list of databases and compile a fasta file of results per query
- use options from rje_blast.py for each individual blast (blastd=FILE will be over-ridden)
- saves results in AccNum.blast.fas and will append to existing files!
keepblast=T/F : Whether to keep BLAST results files for blast2fas searches [True]
haqbat=FILE : Generate a batch file (FILE) to run HAQESAC on generated BLAST files, with seqin as queries [None]

Classes

SeqList(rje.RJE_Object):
- Sequence List Class. Holds a list of Sequence Objects and has methods for manipulation etc.
Sequence(rje_sequence.Sequence):
- Individual Sequence Class.
DisMatrix(rje_dismatrix.DisMatrix):
- Sequence Distance Matrix Class.

Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Renamed major attributes
    # 0.2 - New implementation on more generic OO approach. Non-Redundancy Output
    # 0.3 - No Out Object in Objects
    # 1.0 - Better Documentation to go with GASP V:1.2
    # 1.1 - Better DNA stuff
    # 1.2 - Added ClustalW align
    # 1.3 - Separated Sequence object into rje_sequence.py
    # 1.4 - Add rudimentary gnspacc=T/F
    # 1.5 - Changed pwAln to use popen()
    # 1.6 - Fixed nrdic problem in Redundancy check and added user-definition of database list
    # 1.8 - Added UniProt input and acclist reading
    # 1.9 - Added 'reformat=scanseq' option but not properly implemented. Added align=T/F.
    # 2.0 - Major reworking of commandline options and introduction of self.list dictionary (rje v3.0)
    # 2.1 - Added reformat of UniProt with memsaver=T.
    # 2.2 - Added GABLAM non-redundancy
    # 2.3 - Added NR in memsaver mode
    # 2.4 - Changed some of the log output (REM and redundancy) to look better.
    # 2.5 - Added nr_qry to makeNR()
    # 2.6 - Added mysql reformat output: fastacmd, protein_id, acc_num, spec_code, description (delimited)
    # 2.7 - Added SeqCount() method. Incorporated reading of sequence case.
    # 2.8 - Added NEXUS output for MrBayes compatibility
    # 2.9 - Added setupSubDict(masking=True) for use in probabilistic conservation scores
    # 3.0 - Start of improvements for DNA sequences with dna=T.
    # 3.1 - Added relative conservation calculations for a whole alignment.
    # 3.2 - Added output of sequences for making alignments in R.
    # 3.3 - Added 6RF reformatting for DNA sequences.
    # 3.4 - Added HAQBAT option
    # 3.5 - Added extra alignment program, MAFFT
    # 3.6 - Added stripGap() method. Replaced self.seq with self.seqs() for reading. (Replace with list at some point.)
    # 3.7 - Added raw option for single sequence load.
    # 3.8 - Added maxGlob setting for screening out globular proteins.
    # 3.9 - Added reading of mafft format when not producing standard fasta.
    # 3.10- Added ntrim=X : Trims of regions >= X proportion N bases (X residues for protein) [0.5]
    # 3.11- Added mapdna=FILE option to map DNA sequences onto protein alignment
    # 3.12- Added countspec=T/F   : Generate counts of different species and output in log [False]
    # 3.13- Updated sequence type checking for use with GABLAM 2.10.
    # 3.14- Added CLUSTAL Omega alignment program ['clustalo']
    # 3.15- Added PAGAN alignment program ['pagan'] and (hopefully) fixed minor Windows fastacmd bug.
    # 3.16- Added BLAST+ path and seqFromBlastDBCmd()
    # 3.17- Updated to use BLAST+ and rje_blast_V2
    # 3.18- Minor BLAST+ bug fixes. Added exceptions to readBLAST failure.
    # 3.19- Fixed BLAST+ sequence extraction name truncation error.
    # 3.20- Added run() method for SeqSuite.
    # 3.21.0 - Added extraction of uniprot IDs for seqin.
    # 3.22.0 - Added loading sequences from provided sequence files contents directly, bypassing file reading.
    # 3.22.1 - Fixed problem if seqin is blank, triggering odd Uniprot download.
    # 3.23.0 - Add speclist to reformat options.
    # 3.24.0 - Added REST seqout output.
    # 3.25.0 - 9spec=T/F   : Whether to treat 9XXXX species codes as actual species (generally higher taxa) [False]
    # 3.25.1 - Fixed -long_seqids retrieval bug.
    # 3.25.2 - Fixed 9spec filtering bug.
    # 3.25.3 - Added some bug fixes from Norman that were giving him errors.

RJE_SEQ REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
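
Given the layout above, the full (&rest=full) output can be split into its OUTFMT sections with a short parser. The sketch below assumes each section starts with a `###~...~###` divider line followed by a `# OUTFMT:` header line; the function name `parse_rest_full` and the section names in the example (`log`, `seqout`) are hypothetical.

```python
import re

DIVIDER = re.compile(r"^###~+###$")     # section divider line
HEADER = re.compile(r"^# (\w+):\s*$")   # '# OUTFMT:' line naming the section

def parse_rest_full(text):
    """Split full REST output into a {OUTFMT: content} dictionary."""
    sections = {}
    current = None
    after_divider = False
    for line in text.splitlines():
        if DIVIDER.match(line):
            after_divider = True
            continue
        # Only a line directly after a divider can start a new section.
        header = HEADER.match(line) if after_divider else None
        after_divider = False
        if header:
            current = header.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}

example = "\n".join([
    "###~~~~~~~~###",
    "# log:",
    "run finished",
    "###~~~~~~~~###",
    "# seqout:",
    ">seq1",
    "MKT",
])
parsed = parse_rest_full(example)
print(parsed["seqout"])
```

Requiring the header to follow a divider directly keeps ordinary comment lines inside a section (for example log lines starting with '#') from being mistaken for new sections.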

Available REST Outputs

There is currently no specific help available on REST output for this program.

© 2015 RJ Edwards. Contact: richard.edwards@unsw.edu.au.