SLiMSuite REST Server



rje_seqlist V1.25.0

RJE Nucleotide and Protein Sequence List Object (Revised)

Module: rje_seqlist
Description: RJE Nucleotide and Protein Sequence List Object (Revised)
Version: 1.25.0
Last Edit: 08/05/17

Copyright © 2011 Richard J. Edwards - See source code for GNU License Notice


Imported modules: rje rje_db rje_menu rje_obj rje_sequence rje_uniprot rje_zen


See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module is designed to replace rje_seq. The scale of projects has grown substantially, and rje_seq cannot deal well with large datasets. An important feature of rje_seqlist.SeqList objects, therefore, is to offer different sequence modes for different applications. To simplify matters, rje_seqlist will now only cope with single format sequences, which includes a single naming format.

This version of the SeqList object therefore has several distinct modes that determine how the sequences are stored:
- full = Full loading into Sequence Objects.
- list = Lists of (name,sequence) tuples only.
- file = List of file positions.
- index = No loading of sequences. Use index file to find sequences on the fly.
- db = Store sequence data in database object.
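
To illustrate the idea behind the file mode (storing file positions rather than sequences), here is a minimal sketch. It is not the rje_seqlist implementation: the function names index_fasta and read_record are hypothetical, and it assumes a plain fasta file opened in binary mode.

```python
import io

def index_fasta(handle):
    """Return the byte offset of every '>' header line (hypothetical helper)."""
    positions = []
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            positions.append(pos)
    return positions

def read_record(handle, pos):
    """Seek to a stored offset and read one (name, sequence) tuple on demand."""
    handle.seek(pos)
    name = handle.readline()[1:].strip().decode()
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break
        chunks.append(line.strip().decode())
    return name, "".join(chunks)

# In-memory stand-in for a fasta file on disk.
fasta = io.BytesIO(b">a desc\nACGT\nAC\n>b\nTTTT\n")
positions = index_fasta(fasta)
second = read_record(fasta, positions[1])
```

Only the list of offsets stays in memory, which is the trade-off the file and index modes appear to make against full loading.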

SeqShuffle

Version 1.2 introduced the seqshuffle function for randomising input sequences. This generates a set of biologically
unrealistic sequences by randomly shuffling each input sequence without replacement, such that the output sequences
have the same primary monomer composition as the input but any dimer/trimer biases etc. are removed. This is executed
by the shuffleSeq() method, which can also generate sequences shuffled with replacement, i.e. based on frequencies.
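
The difference between the two shuffling strategies can be sketched as follows. This is an illustration only, not the shuffleSeq() code: the function names are hypothetical and the example sequence is arbitrary.

```python
import random
from collections import Counter

def shuffle_without_replacement(sequence, rng):
    """Permute the residues: monomer composition is preserved exactly."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

def shuffle_with_replacement(sequence, rng):
    """Draw each position independently from the observed residue frequencies."""
    if not sequence:
        return ""
    return "".join(rng.choices(sequence, k=len(sequence)))

rng = random.Random(1)
original = "MKTAYIAKQRQISFVKSHFSRQ"
permuted = shuffle_without_replacement(original, rng)
resampled = shuffle_with_replacement(original, rng)

# Composition is identical for the permutation, only approximate for resampling.
same_composition = Counter(permuted) == Counter(original)
```

Shuffling with replacement keeps the expected residue frequencies but not the exact counts, so two resampled sequences will generally differ in composition.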

Sampler

Version 1.5 introduced a sequence sampling function for pulling out a random selection of input sequences into one or
more output files. This is controlled by sampler=N(,X) where the X setting is optional. Random selections of N
sequences will be output into a file named according to the seqout=FILE option (or the input file appended with
.nN if none given). X defines the number of replicate datasets to generate and will be set to 1 if not given.
If X>1 then the output filenames will be appended with .rx for each replicate, where x is 1 to X. If 0.0 < N < 1.0
then a proportion of the input sequences (rounding to the nearest integer) will be selected.
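
The interpretation of N and X can be sketched as below. This is a hedged illustration, not the module's code: sample_size and sample_replicates are hypothetical names, capping N at the number of input sequences is an assumption, and the tie-breaking rule used when rounding a proportion is not stated in the documentation.

```python
import random

def sample_size(n, total):
    """Interpret N: a proportion if 0 < N < 1, otherwise a sequence count."""
    if 0.0 < n < 1.0:
        # Python's round() sends exact .5 ties to the even integer;
        # the module's own rounding rule may differ.
        return int(round(n * total))
    return min(int(n), total)  # capping at the input size is an assumption

def sample_replicates(names, n, replicates=1, seed=None):
    """Return X independent random selections, each without repeats."""
    rng = random.Random(seed)
    size = sample_size(n, len(names))
    return [rng.sample(names, size) for _ in range(replicates)]

names = ["seq%d" % i for i in range(1, 9)]
replicate_sets = sample_replicates(names, 0.5, replicates=3, seed=0)
```

Each inner list corresponds to one output file; how the .nN and .rx suffixes are attached to the file name is left to the module.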

SortSeq

In Version 1.8, the sizesort=T/F function is replaced with sortseq=X (or seqsort=X), where X is a choice of:
- size = Sort sequences by size small -> big
- accnum = Alphabetical by accession number
- name = Alphabetical by name
- seq[X] = Alphabetical by sequence with option to use first X aa/nt only (to save memory)
- species = Alphabetical by species code
- desc = Alphabetical by description
- invsize = Sort by size big -> small, re-output prior to loading/filtering (the old sizesort behaviour; sizesort still sets sortseq)
- invX / revX (Note adding inv or rev in front of any selection will reverse sort.)
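
The sort choices above map naturally onto key functions. The sketch below is an assumption-laden illustration rather than the module's code: records are represented as plain dictionaries, the names parse_sort_mode and sort_records are hypothetical, and seq[X] is assumed to be written as seq followed by digits (e.g. seq10).

```python
import re

def parse_sort_mode(mode):
    """Return (key function, reverse flag) for a sortseq-style string."""
    reverse = False
    for prefix in ("inv", "rev"):
        if mode.startswith(prefix):
            reverse, mode = True, mode[len(prefix):]
            break
    match = re.fullmatch(r"seq(\d*)", mode)
    if match:
        # Optional length limit: compare only the first X residues.
        limit = int(match.group(1)) if match.group(1) else None
        return (lambda rec: rec["seq"][:limit]), reverse
    if mode == "size":
        return (lambda rec: len(rec["seq"])), reverse
    if mode in ("accnum", "name", "species", "desc"):
        return (lambda rec: rec[mode]), reverse
    raise ValueError("Unknown sort mode: %s" % mode)

def sort_records(records, mode):
    key, reverse = parse_sort_mode(mode)
    return sorted(records, key=key, reverse=reverse)
```

For example, sort_records(records, "invsize") would return the longest sequence first, and sort_records(records, "seq10") would order by the first ten residues only.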

Edit

Version 1.16 introduced an interactive edit mode (edit=T) that gives users options to rearrange, copy, delete,
split, truncate, rename, join and merge (as consensus) sequences, among others. Please contact the author for more details.

From Version 1.20, a delimited text file can also be given
as edit=FILE, which should contain: Locus, Pos, Edit, Details. Edit is the type of change (INS/DEL/SUB) and Details
contains the nature of the change (ins/sub sequence or del length). Edits are made in reverse order per locus to
avoid position conflicts and overlapping edits should be avoided. WARNING: These will not be checked for! An optional
Notes field will be used if present for annotating changes in the log file.
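
Why reverse order avoids position conflicts can be seen in a small sketch. This is not the module's edit code: apply_edits is a hypothetical name, and the position semantics are assumptions (1-based positions, INS inserting immediately before Pos, DEL and SUB starting at Pos).

```python
def apply_edits(sequence, edits):
    """Apply (pos, kind, details) edits for one locus, highest position first.

    Working from the end of the sequence means an insertion or deletion never
    shifts the coordinates of edits that are still waiting to be applied.
    Overlapping edits are not checked for, mirroring the warning above.
    """
    for pos, kind, details in sorted(edits, key=lambda e: e[0], reverse=True):
        i = pos - 1  # assumed 1-based positions
        if kind == "INS":
            sequence = sequence[:i] + details + sequence[i:]
        elif kind == "DEL":
            sequence = sequence[:i] + sequence[i + int(details):]
        elif kind == "SUB":
            sequence = sequence[:i] + details + sequence[i + len(details):]
        else:
            raise ValueError("Unknown edit type: %s" % kind)
    return sequence

edited = apply_edits("ACGTACGT", [(2, "SUB", "T"), (5, "DEL", "2"), (8, "INS", "NN")])
```

Had the same edits been applied in ascending order, the deletion at position 5 would have moved the base originally at position 8, and the insertion would have landed in the wrong place.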

From Version 1.22, a delimited file can be given in place of Start,End for region=X. This file should contain Locus,
Start, End and NewAcc fields. If no NewAcc field is present, the new accession number will be the previous accnum
(extracted from the sequence name) with '.X-Y' appended.
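
The default accession naming for extracted regions can be sketched as follows. The function name extract_region is hypothetical; treating a negative End as counting back from the sequence end (so that -1 is the last position, matching the region=X,Y default of 1,-1) and using the resolved coordinates in the suffix are both assumptions.

```python
def extract_region(accnum, sequence, start, end, new_acc=None):
    """Return (accession, subsequence) for a 1-based inclusive region."""
    if end < 0:
        # Assumption: -1 means the last position, -2 the one before, etc.
        end = len(sequence) + end + 1
    region = sequence[start - 1:end]
    if new_acc is None:
        # No NewAcc supplied: append '.X-Y' to the previous accession number.
        new_acc = "%s.%d-%d" % (accnum, start, end)
    return new_acc, region

named = extract_region("P12345", "MKTAYIAKQR", 3, 6)
whole = extract_region("P12345", "MKTAYIAKQR", 1, -1)
```

The accession P12345 here is only a placeholder for whatever is extracted from the sequence name.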

Commandline

INPUT OPTIONS

seqin=FILE : Sequence input file name. [None]
seqmode=X : Sequence mode, determining method of sequence storage (full/list/file/index/db/filedb). [file]
seqdb=FILE : Sequence file from which to extract sequences (fastacmd/index formats) [None]
seqindex=T/F : Whether to save (and load) sequence index file in file mode. [True]
seqformat=X : Expected format of sequence file [None]
seqtype=X : Sequence type (prot(ein)/dna/rna/mix(ed)) [None]
mixed=T/F : Whether to allow auto-identification of mixed sequences types (else uses first seq only) [False]
dna=T/F : Alternative option to indicate dealing with nucleotide sequences [False]
autoload=T/F : Whether to automatically load sequences upon initialisation. [True]
autofilter=T/F : Whether to automatically apply sequence filtering. [True]

SEQUENCE FORMATTING

reformat=X : Output format for sequence files (fasta/short/acc/acclist/speclist/index/dna2prot/peptides/(q)region/revcomp) [fasta]
rename=T/F : Whether to rename sequences [False]
spcode=X : Species code for non-gnspacc format sequences [None]
newacc=X : New base for sequence accession numbers - will rename sequences [None]
newgene=X : New gene for renamed sequences (if blank will use newacc or 'seq' if none read) [None]
concatenate=T : Concatenate sequences into single output sequence named after file [False]
split=X : String to be inserted between each concatenated sequence [''].
seqshuffle=T/F : Randomly shuffle each sequence without replacement (maintains monomer composition) [False]
region=X,Y : Alignment/Query region to use for reformat=peptides/(q)region reformatting of fasta alignment (1-L) [1,-1]
edit=T/F/FILE : Enter sequence edit mode upon loading (will switch seqmode=list) (see above) [False]

DNA TRANSLATIONS (reformat=dna2prot)

minorf=X : Min. ORF length for translated sequences output. -1 for single translation inc stop codons [-1]
terminorf=X : Min. length for terminal ORFs, only if no minorf=X ORFs found (good for short sequences) [-1]
orfmet=T/F : Whether ORFs must start with a methionine (before minorf cutoff) [True]
rftran=X : No. reading frames (RF) into which to translate (1,3,6) [1]
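
A rough sketch of how minorf and orfmet might interact is given below. It is a simplification under stated assumptions, not the module's logic: orfs_from_translation is a hypothetical name, it starts from an already translated frame with '*' marking stop codons, and it ignores terminorf and rftran entirely.

```python
def orfs_from_translation(protein, minorf=-1, orfmet=True):
    """Split one translated reading frame into ORFs meeting the length cutoff."""
    if minorf < 0:
        # -1: keep the single translation, stop codons included.
        return [protein]
    orfs = []
    for fragment in protein.split("*"):
        if orfmet:
            # Trim to the first methionine before applying the length cutoff.
            start = fragment.find("M")
            if start < 0:
                continue
            fragment = fragment[start:]
        if fragment and len(fragment) >= minorf:
            orfs.append(fragment)
    return orfs

with_met = orfs_from_translation("AAMKT*MA*QQMPPPP", minorf=3, orfmet=True)
without_met = orfs_from_translation("AAMKT*MA*QQMPPPP", minorf=3, orfmet=False)
```

Applying the methionine trim first means an ORF that is long enough overall can still be rejected if too little remains after its first M.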

FILTERING OPTIONS

seqnr=T/F : Whether to check for redundancy on loading. (Will remove, save and reload if found) [False]
revcompnr=T/F : Whether to check reverse complement for redundancy too [False]
goodX=LIST : Inclusive filtering, only retaining sequences matching list []
badX=LIST : Exclusive filtering, removing sequences matching list []
- where X is 'Acc', Accession number; 'Seq', Sequence name; 'Spec', Species code; 'Desc', part of name.
minlen=X : Minimum sequence length [0]
maxlen=X : Maximum length of sequences (<=0 = No maximum) [0]
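
The combined effect of the length and good/bad filters can be sketched as a single predicate. This is an illustration under assumptions: records are plain dictionaries, passes_filters is a hypothetical name, only the Acc and Spec variants are shown, matching is exact, and applying inclusive and exclusive lists together is assumed rather than documented.

```python
def passes_filters(record, goodacc=(), badacc=(), goodspec=(), badspec=(),
                   minlen=0, maxlen=0):
    """Return True if a record survives length and good/bad list filtering."""
    length = len(record["seq"])
    if length < minlen:
        return False
    if maxlen > 0 and length > maxlen:  # maxlen <= 0 means no maximum
        return False
    # Inclusive lists only apply when non-empty.
    if goodacc and record["accnum"] not in goodacc:
        return False
    if goodspec and record["species"] not in goodspec:
        return False
    # Exclusive lists remove any match.
    if record["accnum"] in badacc or record["species"] in badspec:
        return False
    return True

records = [
    {"accnum": "P1", "species": "HUMAN", "seq": "MKTAYIAK"},
    {"accnum": "P2", "species": "MOUSE", "seq": "MK"},
    {"accnum": "P3", "species": "HUMAN", "seq": "MKTAYIAKQRQISF"},
]
kept = [r["accnum"] for r in records
        if passes_filters(r, goodspec={"HUMAN"}, minlen=3, maxlen=10)]
```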

OUTPUT OPTIONS

seqout=FILE : Whether to output sequences to new file after loading and filtering [None]
usecase=T/F : Whether to return sequences in the same case as input (True), or convert to Upper (False) [False]
sortseq=X : Whether to sort sequences prior to output (size/invsize/accnum/name/seq/species/desc) [None]
sampler=N(,X) : Generate (X) file(s) sampling a random N sequences from input into seqout.N.X.fas [0]
summarise=T/F : Generate some summary statistics in log file for sequence data after loading [False]
splitseq=X : Split output sequence file according to X (gene/species) [None]

See also rje.py generic commandline options.

History Module Version History

    # 0.0 - Initial Compilation. Based on rje_seq 3.10.
    # 0.1 - Added basic species filtering and sequence output.
    # 0.2 - Added upper case filtering.
    # 0.3 - Added accnum filtering and sequence renaming.
    # 0.4 - Added sequence redundancy filtering.
    # 0.5 - Added newgene=X for sequence renaming (newgene_spcode__newaccXXX). NewAcc no longer fixed Upper Case.
    # 1.0 - Upgraded to "ready" Version 1.0. Added concatenate=T and split=X options for sequence concatenation.
    # 1.0 - Added reading of sequence type from rje_seq.py and mixed=T/F.
    # 1.1 - Added shortName() and modified SeqDict.
    # 1.2 - Added seqshuffle option for randomising sequences.
    # 1.3 - Modified use of index file (appends, not replaces, file extension)
    # 1.4 - Added dna2prot reformat function.
    # 1.5 - Added sampler=N(,X)   : Generate (X) file(s) sampling a random N sequences from input into seqout.N.X.fas [0]
    # 1.6 - Modified currSeq() and nextSeq() slightly to fix index mode breakage. Look out for other programs breaking.
    # 1.6 - Add sequence fragment extraction.
    # 1.7 - Added code to create rje_sequence.Sequence objects.
    # 1.8 - Added sortseq=X : Whether to sort sequences prior to output (size/invsize/accnum/name/seq/species/desc) [None]
    # 1.9.0 - Added extra functions for returning sequence AccNum, ID or Species code.
    # 1.10.0 - Added extraction of uniprot IDs for seqin.
    # 1.11.0 - Added more dna2prot reformatting options.
    # 1.12.0 - Added peptides/qregion reformatting and region=X,Y.
    # 1.13.0 - Added summarise=T option for generating some summary statistics for sequence data. Added minlen & maxlen.
    # 1.14.0 - Added splitseq=X split output sequence file according to X (gene/species) [None]
    # 1.15.0 - Added names() method.
    # 1.15.1 - Fixed bug with storage and return of summary stats.
    # 1.15.2 - Fixed dna2prot reformatting.
    # 1.15.3 - Fixed summarise bug (n=1).
    # 1.15.4 - Fixed REST server output bug.
    # 1.15.5 - Fixed reformat=fasta default issue introduced from fixing REST output bug.
    # 1.16.0 - Added edit=T sequence edit mode upon loading (will switch seqmode=list).
    # 1.17.0 - Added additional summarise=T output for seqmode=db.
    # 1.18.0 - Added revcomp to reformat options.
    # 1.19.0 - Added option log description for deleting sequence during edit.
    # 1.20.0 - Added option to give a file of changes for edit mode.
    # 1.20.1 - Fixed edit=FILE deletion bug.
    # 1.21.0 - Added capacity to add/update database object from self.summarise() even if not seqmode=db. Added filedb mode.
    # 1.22.0 - Added geneDic() method.
    # 1.23.0 - Added seqSequence() method.
    # 1.24.0 - Add NNN gaps option and "delete rest of sequences" to edit().
    # 1.24.1 - Minor edit bug fix and DNA toggle option.
    # 1.25.0 - Added loading of FASTQ files in seqmode=file mode.

rje_seqlist REST Output formats

The seqlist server is primarily for simple reformatting and sequence manipulation tasks:
- fasta = standard gene_SPECIES__AccNum Description fasta format
- short = fasta format without any description
- acc = fasta format with accession numbers (only) as sequence names
- acclist = plain text list of accession numbers of sequences
- speclist = plain text list of Uniprot species codes for sequences
- dna2prot/rna2prot/translate/nt2prot = translation of DNA (or RNA) sequence into protein
- peptides = plain list (without names) of protein sequences
- qregion = fasta alignment restricted to the columns incorporating the given sequence region of the query (sequence 1)
- region = fasta alignment restricted to the given columns of the alignment
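
The relationship between the fasta, short and acc name formats can be sketched by parsing the gene_SPECIES__AccNum Description layout. The helper names are hypothetical and the example header is a made-up placeholder, not a real database entry.

```python
def parse_name(header):
    """Split a 'gene_SPECIES__AccNum Description' header into its parts."""
    name, _, desc = header.lstrip(">").partition(" ")
    gene_species, sep, accnum = name.partition("__")
    if not sep:
        raise ValueError("Not in gene_SPECIES__AccNum format: %s" % name)
    gene, _, species = gene_species.rpartition("_")
    return {"gene": gene, "species": species, "accnum": accnum, "desc": desc}

def short_name(header):
    """'short' style: the sequence name without any description."""
    return header.lstrip(">").split(" ", 1)[0]

def acc_name(header):
    """'acc' style: the accession number only."""
    return parse_name(header)["accnum"]

header = ">GENE1_HUMAN__P12345 Example protein"
parts = parse_name(header)
```

The acclist and speclist outputs would then correspond to collecting the accnum and species fields across all sequences.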

If &rest=X, where X is in the above list, the relevant reformatting will be triggered and the resulting text
output returned. Otherwise, output is &rest=seqout.

Additional sequence filtering (degapping etc.) can be performed with the related seq server, which has
&rest=seqout output only. (See: http://rest.slimsuite.unsw.edu.au/seq for commandline options.)

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT.
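
A REST request string using the &rest=X convention could be assembled as in the sketch below. Several things here are assumptions: the /seqlist path is inferred by analogy with the /seq URL above, build_rest_url is a hypothetical helper, and the way input sequences are supplied to the server is not covered. The sketch only builds the string and makes no network call.

```python
from urllib.parse import quote

# Host taken from the /seq URL above; the server path is supplied by the caller.
REST_BASE = "http://rest.slimsuite.unsw.edu.au"

def build_rest_url(server, options, rest="seqout"):
    """Join commandline-style options and the &rest=X output selector."""
    parts = ["%s/%s" % (REST_BASE, server)]
    for key, value in options.items():
        # Keep commas unescaped so list-style values such as region=X,Y survive.
        parts.append("%s=%s" % (key, quote(str(value), safe=",")))
    parts.append("rest=%s" % rest)
    return "&".join(parts)

url = build_rest_url("seqlist", {"sortseq": "size"}, rest="acclist")
```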

© 2015 RJ Edwards. Contact: richard.edwards@unsw.edu.au.