|Description:|| RJE Nucleotide and Protein Sequence List Object (Revised)|
|Last Edit:|| 08/07/20|
Copyright © 2011 Richard J. Edwards - See source code for GNU License Notice
See SLiMSuite Blog for further documentation. See
rje for general commands.
This module is designed to replace rje_seq. The scale of projects has grown substantially, and rje_seq cannot deal
well with large datasets. An important feature of rje_seqlist.SeqList objects, therefore, is to offer different
sequence modes for different applications. To simplify matters, rje_seqlist will now only cope with single format
sequences, which includes a single naming format.
This version of the SeqList object therefore has several distinct modes that determine how the sequences are stored.
- full = Full loading into Sequence Objects.
- list = Lists of (name,sequence) tuples only.
- file = List of file positions.
- index = No loading of sequences. Use index file to find sequences on the fly.
- db = Store sequence data in database object.
Version 1.2 introduced the seqshuffle function for randomising input sequences. This generates a set of biologically
unrealistic sequences by randomly shuffling each input sequence without replacement, such that the output sequences
have the same primary monomer composition as the input but any dimer/trimer biases etc. are removed. This is executed
by the shuffleSeq() method, which can also generate sequences shuffled with replacement, i.e. based on frequencies.
Version 1.5 introduced a sequence sampling function for pulling out a random selection of input sequences into one or
more output files. This is controlled by
where the X setting is optional. Random selections of N
sequences will be output into a file named according to the
seqout=FILE option (or the input file appended with
.nN if none given). X defines the number of replicate datasets to generate and will be set to 1 if not given.
If X>1 then the output filenames will be appended with
.rx for each replicate, where x is 1 to X. If 0.0 < N < 1.0
then a proportion of the input sequences (rounding to the nearest integer) will be selected.
In Version 1.8, the
sizesort=T/F function is replaced with
seqsort=X), where X is a choice of:
- size = Sort sequences by size small -> big
- accnum = Alphabetical by accession number
- name = Alphabetical by name
- seq[X] = Alphabetical by sequence with option to use first X aa/nt only (to save memory)
- species = Alphabetical by species code
- desc = Alphabetical by description
- invsize = Sort by size big -> small re-output prior to loading/filtering (old sizesort - still sets sortseq)
- invX / revX (Note adding
rev in front of any selection will reverse sort.)
Version 1.16 introduced an interactive edit mode (
edit=T) that gives users the options to rearrange, copy, delete,
split, truncate, rename, join, merge (as consensus) etc. Please contact the author for more details.
From Version 1.20, a delimited text file can also be given
edit=FILE, which should contain: Locus, Pos, Edit, Details. Edit is the type of change (INS/DEL/SUB) and Details
contains the nature of the change (ins/sub sequence or del length). Edits are made in reverse order per locus to
avoid position conflicts and overlapping edits should be avoided. WARNING: These will not be checked for! An optional
Notes field will be used if present for annotating changes in the log file.
From Version 1.22, a delimited file can be given in place of Start,End for
region=X. This file should contain Locus,
Start, End and NewAcc fields. If No NewAcc field is present, the new accession number will be the previous accnum
(extracted from the sequence name) with '.X-Y' appended.
seqin=FILE : Sequence input file name. [
seqmode=X : Sequence mode, determining method of sequence storage (full/list/file/index/db/filedb). [
seqdb=FILE : Sequence file from which to extract sequences (fastacmd/index formats) [
seqindex=T/F : Whether to save (and load) sequence index file in file mode. [
seqformat=X : Expected format of sequence file [
seqtype=X : Sequence type (prot(ein)/dna/rna/mix(ed)) [
mixed=T/F : Whether to allow auto-identification of mixed sequences types (else uses first seq only) [
dna=T/F : Alternative option to indicate dealing with nucleotide sequences [
autoload=T/F : Whether to automatically load sequences upon initialisation. [
autofilter=T/F : Whether to automatically apply sequence filtering. [
duperr=T/F : Whether identification of duplicate sequence names should raise an error [
reformat=X : Output format for sequence files (fasta/short/acc/acclist/accdesc/speclist/index/dna2prot/peptides/(q)region/revcomp/descaffold) [
rename=T/F : Whether to rename sequences [
spcode=X : Species code for non-gnspacc format sequences [
newacc=X : New base for sequence accession numbers - will rename sequences [
newgene=X : New gene for renamed sequences (if blank will use newacc or 'seq' if none read) [
genecounter=T/F : Whether new gene have a numbered suffix (will match newacc numbering) [
newdesc=FILE : File of new names for sequences (over-rules other naming). First word should match input [
keepname=T/F : Whether to keep the original name (first word) when mapping with
concatenate=T : Concatenate sequences into single output sequence named after file [
split=X : String to be inserted between each concatenated sequence [''].
seqshuffle=T/F : Randomly shuffle each sequence without replacement (maintains monomer composition) [
region=X,Y : Alignment/Query region to use for
reformat=peptides/(q)region reformatting of fasta alignment (1-L) [
edit=T/F/FILE : Enter sequence edit mode upon loading (will switch
seqmode=list) (see above) [
gnspacc=T/F : Whether to automatically try to enforce SLiMSuite gene_SPCODE__AccNum format [
minorf=X # Min. ORF length for translated sequences output. -1 for single translation inc stop codons [
terminorf=X # Min. length for terminal ORFs, only if no
minorf=X ORFs found (good for short sequences) [
orfmet=T/F # Whether ORFs must start with a methionine (before minorf cutoff) [
rftran=X # No. reading frames (RF) into which to translate (1,3,6) [
seqnr=T/F : Whether to check for redundancy on loading. (Will remove, save and reload if found) [
grepnr=T/F : Whether to use grep based forking NR mode (needs sized-sorted one-line-per-sequence fasta) [
twopass=T/F : Whether to perform second pass looking for redundancy of earlier sequences within later ones [
revcompnr=T/F : Whether to check reverse complement for redundancy too [
goodX=LIST : Inclusive filtering, only retaining sequences matching list 
badX=LIST : Exclusive filtering, removing sequences matching list 
- where X is 'Acc', Accession number; 'Seq', Sequence name; 'Spec', Species code; 'Desc', part of name;
minlen=X : Minimum sequence length [
maxlen=X : Maximum length of sequences (<=0 = No maximum) [
maskseq=TDTFILE : File of Locus, Start, End positions for masking [
grabseq=TDTFILE : File of Locus, Start, End positions for region extraction [
posfields=LIST : Fields in checkpos file to give Locus, Start and End for checking [
addflanks=INT : Length of flanking sequence to also extract/mask [
SEQUENCE TILING OPTIONS
tile=INT : Tile sequences into INT bp chunks (<1 for no tiling) [
mintile=X : Min. length for tile, else appended to previous (<1 for proportion of
tilestep=0 : Gap between end of one tile and start of next. Can be negative [
tilename=STR : Tile naming strategy (pos, start, num, purepos, purestart, purenum) [
seqout=FILE : Whether to output sequences to new file after loading and filtering [
usecase=T/F : Whether to return sequences in the same case as input (True), or convert to Upper (False) [
sortseq=X : Whether to sort sequences prior to output (size/invsize/accnum/name/seq/species/desc) [
sampler=N(,X) : Generate (X) file(s) sampling a random N sequences from input into seqout.N.X.fas [
summarise=T/F : Generate some summary statistics in log file for sequence data after loading [
genomesize=X : Genome size for NG50 and LG50 summary output (if >0) [
gapstats=T/F : Output a summary of assembly gap sizes and positions [
mingap=INT : Minimum length of a stretch of N bases to count as a gap [
gapfix=X:Y(,X:Y): List of gap lengths X to convert to different lengths Y 
maker=T/F : Whether to extract MAKER2 statistics (AED, eAED, QI) from sequence names [
splitseq=X : Split output sequence file according to X (gene/species) [
tmpdir=PATH : Directory used for temporary files [
See also rje.py generic commandline options.
History Module Version History
# 0.0 - Initial Compilation. Based on rje_seq 3.10.
# 0.1 - Added basic species filtering and sequence output.
# 0.2 - Added upper case filtering.
# 0.3 - Added accnum filtering and sequence renaming.
# 0.4 - Added sequence redundancy filtering.
# 0.5 - Added newgene=X for sequence renaming (newgene_spcode__newaccXXX). NewAcc no longer fixed Upper Case.
# 1.0 - Upgraded to "ready" Version 1.0. Added concatenate=T and split=X options for sequence concatenation.
# 1.0 - Added reading of sequence type from rje_seq.py and mixed=T/F.
# 1.1 - Added shortName() and modified SeqDict.
# 1.2 - Added seqshuffle option for randomising sequences.
# 1.3 - Modified use of index file (appends, not replaces, file extension)
# 1.4 - Added dna2prot reformat function.
# 1.5 - Added sampler=N(,X) : Generate (X) file(s) sampling a random N sequences from input into seqout.N.X.fas 
# 1.6 - Modified currSeq() and nextSeq() slightly to fix index mode breakage. Look out for other programs breaking.
# 1.6 - Add sequence fragment extraction.
# 1.7 - Added code to create rje_sequence.Sequence objects.
# 1.8 - Added sortseq=X : Whether to sort sequences prior to output (size/invsize/accnum/name/seq/species/desc) [None]
# 1.9.0 - Added extra functions for returning sequence AccNum, ID or Species code.
# 1.10.0 - Added extraction of uniprot IDs for seqin.
# 1.11.0 - Added more dna2prot reformatting options.
# 1.12.0 - Added peptides/qregion reformatting and region=X,Y.
# 1.13.0 - Added summarise=T option for generating some summary statistics for sequence data. Added minlen & maxlen.
# 1.14.0 - Added splitseq=X split output sequence file according to X (gene/species) [None]
# 1.15.0 - Added names() method.
# 1.15.1 - Fixed bug with storage and return of summary stats.
# 1.15.2 - Fixed dna2prot reformatting.
# 1.15.3 - Fixed summarise bug (n=1).
# 1.15.4 - Fixed REST server output bug.
# 1.15.5 - Fixed reformat=fasta default issue introduced from fixing REST output bug.
# 1.16.0 - Added edit=T sequence edit mode upon loading (will switch seqmode=list).
# 1.17.0 - Added additional summarise=T output for seqmode=db.
# 1.18.0 - Added revcomp to reformat options.
# 1.19.0 - Added option log description for deleting sequence during edit.
# 1.20.0 - Added option to give a file of changes for edit mode.
# 1.20.1 - Fixed edit=FILE deletion bug.
# 1.21.0 - Added capacity to add/update database object from self.summarise() even if not seqmode=db. Added filedb mode.
# 1.22.0 - Added geneDic() method.
# 1.23.0 - Added seqSequence() method.
# 1.24.0 - Add NNN gaps option and "delete rest of sequences" to edit().
# 1.24.1 - Minor edit bug fix and DNA toggle option.
# 1.25.0 - Added loading of FASTQ files in seqmode=file mode.
# 1.26.0 - Updated sequence statistics and fixed N50 underestimation bug.
# 1.26.1 - Fixed median length overestimation bug.
# 1.26.2 - Fixed sizesort bug. (Now big to small as advertised.)
# 1.27.0 - Added grepNR() method (dev only). Switched default to RevCompNR=T.
# 1.28.0 - Fixed second pass NR naming bug and added option to switch off altogether.
# 1.29.0 - Added maker=T/F : Whether to extract MAKER2 statistics (AED, eAED, QI) from sequence names [False]
# 1.30.0 - Updated and improved DNA2Protein.
# 1.31.0 - Added genecounter to rename option for use with other programs, e.g. PAGSAT.
# 1.31.1 - Fixed edit bug when not in DNA mode.
# 1.32.0 - Added genomesize and NG50/LG50 to DNA summarise.
# 1.32.1 - Fixed LG50/L50 bug.
# 1.32.2 - Added reformat=accdesc to generate output without gene and species code.
# 1.32.3 - Added checkNames() to check for duplicate sequence names and/or lack of gnspacc format.
# 1.32.3 - Added duperr=T/F : Whether identification of duplicate sequence names should raise an error [True]
# 1.33.0 - Added newdesc=FILE : File of new names for sequences (over-rules other naming). First word should match input [None]
# 1.33.1 - Fixed bug with appending sequences with gap insertion.
# 1.34.0 - Added genecounter=T/F : Whether new gene have a numbered suffix (will match newacc numbering) [False]
# 1.35.0 - Added initial extraction of sequences from BLASTDB from rje_seq.
# 1.36.0 - Added bpFromStr(seqlen)
# 1.36.1 - Changed default duplicate suffix to X2.
# 1.37.0 - Added masking and extraction from loaded table of positions.
# 1.38.0 - Added assembly gap summary and manipulation fundtions.
# 1.39.0 - Added descaffolding, tiling output and gnspacc=T/F to control edit renaming.
# 1.40.0 - Added keepname=T/F : Whether to keep the original name (first word) when mapping with newdesc=FILE [True]
rje_seqlist REST Output formats
server is primarily for simple reformatting and sequence manipulation tasks:
= fasta format without any description
= fasta format with accession numbers (only) as sequence names
= plain text list of accession numbers of sequences
= plain text list of Uniprot species codes for sequences
= translation of DNA (or RNA) sequence into protein
= plain list (without names) of protein sequences
= fasta alignment restricted to the columns incorporating the given sequence region of the query (sequence 1)
= fasta alignment restricted to the given columns of the alignment
is in the above list, the relevant reformatting will be triggered and the resulting text
output returned. Otherwise, output is
Additional sequence filtering (degapping etc.) can be performed with the related
server, which has
output only. (See: http://rest.slimsuite.unsw.edu.au/seq
for commandline options.)
for general options. Run with
to get full server output as text or
for more user-friendly formatted output. Individual outputs can be identified/parsed using