Module:	rje_seqlist
Description:	RJE Nucleotide and Protein Sequence List Object (Revised)
Version:	1.48.2
Last Edit:	26/03/22

Imported modules: rje rje_db rje_menu rje_obj rje_sequence rje_zen rje_uniprot

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module is designed to replace rje_seq. The scale of projects has grown substantially, and rje_seq cannot deal well with large datasets. An important feature of rje_seqlist.SeqList objects, therefore, is to offer different sequence modes for different applications. To simplify matters, rje_seqlist will now only cope with single format sequences, which includes a single naming format.

This version of the SeqList object therefore has several distinct modes that determine how the sequences are stored:
- full = Full loading into Sequence Objects.
- list = Lists of (name,sequence) tuples only.
- file = List of file positions.
- index = No loading of sequences. Use index file to find sequences on the fly.
- db = Store sequence data in database object.

SeqShuffle

Version 1.2 introduced the seqshuffle function for randomising input sequences. This generates a set of biologically
unrealistic sequences by randomly shuffling each input sequence without replacement, such that the output sequences
have the same primary monomer composition as the input but any dimer/trimer biases etc. are removed. This is executed
by the shuffleSeq() method, which can also generate sequences shuffled with replacement, i.e. based on frequencies.
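
The difference between the two shuffling modes can be illustrated with a minimal sketch. This is not the shuffleSeq() implementation itself; the function name shuffle_sequence and its parameters are hypothetical and only show the principle of shuffling without replacement (exact composition kept) versus with replacement (composition sampled from observed frequencies).

```python
import random


def shuffle_sequence(sequence, replacement=False, rng=None):
    """Return a shuffled copy of `sequence` (hypothetical helper, not rje_seqlist code).

    replacement=False: permute the residues, so monomer composition is preserved exactly.
    replacement=True: draw each position independently from the observed residue frequencies.
    """
    rng = rng or random.Random()
    residues = list(sequence)
    if not residues:
        return ""
    if replacement:
        # Sampling uniformly from the residue list is equivalent to sampling by frequency.
        return "".join(rng.choices(residues, k=len(residues)))
    rng.shuffle(residues)
    return "".join(residues)
```

Without replacement the output is always a permutation of the input; with replacement only the length is guaranteed, and the composition will vary around the input frequencies.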

Sampler

Version 1.5 introduced a sequence sampling function for pulling out a random selection of input sequences into one or
more output files. This is controlled by sampler=N(,X) where the X setting is optional. Random selections of N
sequences will be output into a file named according to the seqout=FILE option (or the input file appended with
.nN if none given). X defines the number of replicate datasets to generate and will be set to 1 if not given.
If X>1 then the output filenames will be appended with .rx for each replicate, where x is 1 to X. If 0.0 < N < 1.0
then a proportion of the input sequences (rounding to the nearest integer) will be selected.
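
As a rough sketch of how a sampler=N(,X) value could be interpreted, the following example parses the setting and draws the replicate samples. The function names are hypothetical, and two details are assumptions rather than documented behaviour: ties in rounding are rounded up, and N is capped at the number of input sequences.

```python
import random


def parse_sampler(value):
    """Split 'N' or 'N,X' into (N as float, X as int); X defaults to 1."""
    parts = value.split(",")
    n = float(parts[0])
    replicates = int(parts[1]) if len(parts) > 1 else 1
    return n, replicates


def sample_size(n, total):
    """Convert N to a sequence count: a proportion if 0 < N < 1, else a count."""
    if 0.0 < n < 1.0:
        return int(total * n + 0.5)  # assumption: round half up
    return min(int(n), total)        # assumption: cap at the input size


def sample_replicates(names, value, seed=None):
    """Return X lists, each a random selection of sequence names without repeats."""
    rng = random.Random(seed)
    n, replicates = parse_sampler(value)
    k = sample_size(n, len(names))
    return [rng.sample(names, k) for _ in range(replicates)]
```

For example, sample_replicates(names, "0.5,2") would return two independent selections, each containing half of the input names.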

SortSeq

In Version 1.8, the sizesort=T/F function is replaced with sortseq=X (or seqsort=X), where X is a choice of:
- size = Sort sequences by size small -> big
- accnum = Alphabetical by accession number
- name = Alphabetical by name
- seq[X] = Alphabetical by sequence with option to use first X aa/nt only (to save memory)
- species = Alphabetical by species code
- desc = Alphabetical by description
- invsize = Sort by size big -> small, re-output prior to loading/filtering (old sizesort, which still sets sortseq)
- invX / revX = Adding inv or rev in front of any selection will reverse the sort.
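
The sortseq=X choices map naturally onto a sort key plus a reverse flag. The sketch below is illustrative only: the record layout (dictionaries with accnum, name, seq, species and desc keys) and both function names are assumptions, not the SeqList internals.

```python
import re


def parse_sortseq(value):
    """Return (field, prefix_length, reverse) for a sortseq=X value."""
    value = value.lower()
    reverse = False
    if value.startswith(("inv", "rev")):
        reverse = True
        value = value[3:]
    match = re.fullmatch(r"seq(\d+)", value)
    if match:
        # seq[X]: compare only the first X residues.
        return "seq", int(match.group(1)), reverse
    return value, None, reverse


def sort_records(records, sortseq):
    """Sort a list of record dictionaries (hypothetical layout) by a sortseq=X value."""
    field, prefix, reverse = parse_sortseq(sortseq)
    if field == "size":
        key = lambda rec: len(rec["seq"])
    elif field == "seq":
        key = lambda rec: rec["seq"][:prefix]  # prefix=None uses the whole sequence
    else:
        key = lambda rec: rec[field]
    return sorted(records, key=key, reverse=reverse)
```

Truncating the key for seq[X] means that only the first X characters of each sequence are compared, which is where the memory saving would come from.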

Edit

Version 1.16 introduced an interactive edit mode (edit=T) that gives users the options to rearrange, copy, delete,
split, truncate, rename, join, merge (as consensus) etc. Please contact the author for more details.

From Version 1.20, a delimited text file can also be given
as edit=FILE, which should contain: Locus, Pos, Edit, Details. Edit is the type of change (INS/DEL/SUB) and Details
contains the nature of the change (ins/sub sequence or del length). Edits are made in reverse order per locus to
avoid position conflicts and overlapping edits should be avoided. WARNING: These will not be checked for! An optional
Notes field will be used if present for annotating changes in the log file.
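
The reason for applying edits in reverse order can be shown with a short sketch. It is not the module's edit code: the function name is hypothetical, and the coordinate conventions are assumptions (Pos is 1-based, INS inserts before Pos, DEL removes the given number of positions starting at Pos, SUB overwrites from Pos). As in the module, overlapping edits are not checked for.

```python
from collections import defaultdict


def apply_edits(sequences, edits):
    """Apply INS/DEL/SUB edits to a dict of locus -> sequence string.

    `edits` is a list of dicts with Locus, Pos, Edit and Details keys.
    """
    by_locus = defaultdict(list)
    for edit in edits:
        by_locus[edit["Locus"]].append(edit)
    edited = dict(sequences)
    for locus, locus_edits in by_locus.items():
        seq = edited[locus]
        # Highest position first, so the coordinates of earlier positions stay valid.
        for edit in sorted(locus_edits, key=lambda e: int(e["Pos"]), reverse=True):
            i = int(edit["Pos"]) - 1
            kind = edit["Edit"].upper()
            details = edit["Details"]
            if kind == "INS":
                seq = seq[:i] + details + seq[i:]
            elif kind == "DEL":
                seq = seq[:i] + seq[i + int(details):]
            elif kind == "SUB":
                seq = seq[:i] + details + seq[i + len(details):]
            else:
                raise ValueError("Unknown edit type: %s" % edit["Edit"])
        edited[locus] = seq
    return edited
```

Because each edit only changes the sequence at or after its own position, working from the end of the locus towards the start means that no edit shifts the coordinates of one still to be applied.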

From Version 1.22, a delimited file can be given in place of Start,End for region=X. This file should contain Locus,
Start, End and NewAcc fields. If no NewAcc field is present, the new accession number will be the previous accnum
(extracted from the sequence name) with '.X-Y' appended.
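
The default naming rule for extracted regions can be sketched as follows. The function name is hypothetical, and the example assumes that Start and End are 1-based and inclusive and that X and Y in the '.X-Y' suffix are the Start and End values.

```python
def extract_region(accnum, sequence, start, end, newacc=None):
    """Return (accession, subsequence) for a 1-based inclusive Start-End region.

    If no NewAcc is supplied, the accession is the original accnum with '.Start-End' appended.
    """
    region = sequence[start - 1:end]
    accession = newacc if newacc else "%s.%d-%d" % (accnum, start, end)
    return accession, region
```

With these assumptions, extracting positions 3 to 6 of a sequence with accession P12345 would give the new accession P12345.3-6.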

Commandline

INPUT OPTIONS

seqin=FILE : Sequence input file name. [None]
seqmode=X : Sequence mode, determining method of sequence storage (full/list/file/index/db/filedb). [file]
seqdb=FILE : Sequence file from which to extract sequences (fastacmd/index formats) [None]
seqindex=T/F : Whether to save (and load) sequence index file in file mode. [True]
seqformat=X : Expected format of sequence file [None]
seqtype=X : Sequence type (prot(ein)/dna/rna/mix(ed)) [None]
mixed=T/F : Whether to allow auto-identification of mixed sequence types (else uses first seq only) [False]
dna=T/F : Alternative option to indicate dealing with nucleotide sequences [False]
autoload=T/F : Whether to automatically load sequences upon initialisation. [True]
autofilter=T/F : Whether to automatically apply sequence filtering. [True]
duperr=T/F : Whether identification of duplicate sequence names should raise an error [True]

SEQUENCE FORMATTING

reformat=X : Output format for sequence files (fasta/short/acc/acclist/accdesc/speclist/index/dna2prot/dna2orfs/peptides/(q)region/revcomp/reverse/descaffold/degap) [fasta]
rename=T/F : Whether to rename sequences [False]
spcode=X : Species code for non-gnspacc format sequences [None]
newacc=X : New base for sequence accession numbers - will rename sequences [None]
newgene=X : New gene for renamed sequences (if blank will use newacc or 'seq' if none read) [None]
genecounter=T/F : Whether new gene names have a numbered suffix (will match newacc numbering) [False]
newdesc=FILE : File of new names for sequences (over-rules other naming). First word should match input [None]
keepname=T/F : Whether to keep the original name (first word) when mapping with newdesc=FILE [True]
concatenate=T/F : Concatenate sequences into single output sequence named after file [False]
split=X : String to be inserted between each concatenated sequence [''].
seqshuffle=T/F : Randomly shuffle each sequence without replacement (maintains monomer composition) [False]
region=X,Y : Alignment/Query region to use for reformat=peptides/(q)region reformatting of fasta alignment (1-L) [1,-1]
edit=T/F/FILE : Enter sequence edit mode upon loading (will switch seqmode=list) (see above) [False]
gnspacc=T/F : Whether to automatically try to enforce SLiMSuite gene_SPCODE__AccNum format [True]

DNA TRANSLATIONS (reformat=dna2prot)

minorf=X : Min. ORF length for translated sequences output. -1 for single translation inc stop codons [-1]
terminorf=X : Min. length for terminal ORFs, only if no minorf=X ORFs found (good for short sequences) [-1]
orfmet=T/F : Whether ORFs must start with a methionine (before minorf cutoff) [True]
rftran=X : No. reading frames (RF) into which to translate (1,3,6) [1]
orfgaps=T/F : Whether to allow assembly gaps (Ns) in ORFs (Xs) or (False) truncate as stop codons [True]

FILTERING OPTIONS

seqnr=T/F : Whether to check for redundancy on loading. (Will remove, save and reload if found) [False]
grepnr=T/F : Whether to use grep based forking NR mode (needs size-sorted one-line-per-sequence fasta) [True]
twopass=T/F : Whether to perform second pass looking for redundancy of earlier sequences within later ones [True]
revcompnr=T/F : Whether to check reverse complement for redundancy too [True]
goodX=LIST : Inclusive filtering, only retaining sequences matching list []
badX=LIST : Exclusive filtering, removing sequences matching list []
- where X is 'Acc', Accession number; 'Seq', Sequence name; 'Spec', Species code; 'Desc', part of name.
minlen=X : Minimum sequence length [0]
maxlen=X : Maximum length of sequences (<=0 = No maximum) [0]

EXTRACT/MASK OPTIONS

maskseq=TDTFILE : File of Locus, Start, End positions for masking [None]
grabseq=TDTFILE : File of Locus, Start, End positions for region extraction [None]
posfields=LIST : Fields in checkpos file to give Locus, Start and End for checking [Locus,Start,End]
addflanks=INT : Length of flanking sequence to also extract/mask [0]

SEQUENCE TILING OPTIONS

tile=INT : Tile sequences into INT bp chunks (<1 for no tiling) [0]
mintile=X : Min. length for tile, else appended to previous (<1 for proportion of tile=INT) [0.1]
tilestep=INT : Gap between end of one tile and start of next. Can be negative [0]
tilename=STR : Tile naming strategy (pos, start, num, purepos, purestart, purenum) [pos]

OUTPUT OPTIONS

seqout=FILE : Whether to output sequences to new file after loading and filtering [None]
usecase=T/F : Whether to return sequences in the same case as input (True), or convert to Upper (False) [False]
sortseq=X : Whether to sort sequences prior to output (size/invsize/accnum/name/seq/species/desc) [None]
sampler=N(,X) : Generate (X) file(s) sampling a random N sequences from input into seqout.N.X.fas [0]
summarise=T/F : Generate some summary statistics in log file for sequence data after loading [False]
genomesize=X : Genome size for NG50 and LG50 summary output (if >0) [0]
raw=T/F : Adjust summary statistics for raw sequencing data [False]
fracstats=T/F : Output a table of N(G)XX and L(G)XX statistics for a range of XX [False]
fracstep=INT : Step size for NXX and LXX fractions (1/2/5/10/25) [5]
lenstats=LIST : List of min sequence lengths to output stats for (raw=T) []
gapstats=T/F : Output summary tables of contigs, assembly gap sizes and positions (also contigs=T/F) [False]
mingap=INT : Minimum length of a stretch of N bases to count as a gap (0=None unless gapstats=T) [10]
gapfix=X:Y(,X:Y) : List of gap lengths X to convert to different lengths Y []
maker=T/F : Whether to extract MAKER2 statistics (AED, eAED, QI) from sequence names [False]
splitseq=X : Split output sequence file according to X (gene/species) [None]
tmpdir=PATH : Directory used for temporary files ['./tmp/']

See also rje.py generic commandline options.
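
To illustrate how the tile=INT, mintile=X and tilestep=INT settings might interact, the following sketch computes tile coordinates for a single sequence. It is an interpretation of the option descriptions, not the module's own tiling code; the function name and the 1-based inclusive coordinates are assumptions.

```python
def tile_positions(seqlen, tile, mintile=0.1, tilestep=0):
    """Return 1-based inclusive (start, end) tiles for a sequence of length seqlen.

    tile: tile length in bp (<1 returns the full sequence as a single tile).
    mintile: minimum tile length; values <1 are treated as a proportion of tile.
    tilestep: gap between the end of one tile and the start of the next (negative = overlap).
    """
    if tile < 1:
        return [(1, seqlen)] if seqlen else []
    if tile + tilestep < 1:
        raise ValueError("tilestep would stop the tiles from advancing")
    minlen = mintile * tile if mintile < 1 else mintile
    tiles = []
    start = 1
    while start <= seqlen:
        end = min(start + tile - 1, seqlen)
        if tiles and (end - start + 1) < minlen:
            # Tile too short: extend the previous tile to the end of the sequence instead.
            tiles[-1] = (tiles[-1][0], seqlen)
        else:
            tiles.append((start, end))
        if end == seqlen:
            break
        start = end + 1 + tilestep
    return tiles
```

Under these assumptions, a 21 bp sequence with tile=10 and mintile=0.5 gives two tiles, because the single leftover base is appended to the second tile rather than output on its own.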

Module Version History

    # 0.0 - Initial Compilation. Based on rje_seq 3.10.
    # 0.1 - Added basic species filtering and sequence output.
    # 0.2 - Added upper case filtering.
    # 0.3 - Added accnum filtering and sequence renaming.
    # 0.4 - Added sequence redundancy filtering.
    # 0.5 - Added newgene=X for sequence renaming (newgene_spcode__newaccXXX). NewAcc no longer fixed Upper Case.
    # 1.0 - Upgraded to "ready" Version 1.0. Added concatenate=T and split=X options for sequence concatenation.
    # 1.0 - Added reading of sequence type from rje_seq.py and mixed=T/F.
    # 1.1 - Added shortName() and modified SeqDict.
    # 1.2 - Added seqshuffle option for randomising sequences.
    # 1.3 - Modified use of index file (appends, not replaces, file extension)
    # 1.4 - Added dna2prot reformat function.
    # 1.5 - Added sampler=N(,X)   : Generate (X) file(s) sampling a random N sequences from input into seqout.N.X.fas [0]
    # 1.6 - Modified currSeq() and nextSeq() slightly to fix index mode breakage. Look out for other programs breaking.
    # 1.6 - Add sequence fragment extraction.
    # 1.7 - Added code to create rje_sequence.Sequence objects.
    # 1.8 - Added sortseq=X : Whether to sort sequences prior to output (size/invsize/accnum/name/seq/species/desc) [None]
    # 1.9.0 - Added extra functions for returning sequence AccNum, ID or Species code.
    # 1.10.0 - Added extraction of uniprot IDs for seqin.
    # 1.11.0 - Added more dna2prot reformatting options.
    # 1.12.0 - Added peptides/qregion reformatting and region=X,Y.
    # 1.13.0 - Added summarise=T option for generating some summary statistics for sequence data. Added minlen & maxlen.
    # 1.14.0 - Added splitseq=X split output sequence file according to X (gene/species) [None]
    # 1.15.0 - Added names() method.
    # 1.15.1 - Fixed bug with storage and return of summary stats.
    # 1.15.2 - Fixed dna2prot reformatting.
    # 1.15.3 - Fixed summarise bug (n=1).
    # 1.15.4 - Fixed REST server output bug.
    # 1.15.5 - Fixed reformat=fasta default issue introduced from fixing REST output bug.
    # 1.16.0 - Added edit=T sequence edit mode upon loading (will switch seqmode=list).
    # 1.17.0 - Added additional summarise=T output for seqmode=db.
    # 1.18.0 - Added revcomp to reformat options.
    # 1.19.0 - Added optional log description for deleting sequence during edit.
    # 1.20.0 - Added option to give a file of changes for edit mode.
    # 1.20.1 - Fixed edit=FILE deletion bug.
    # 1.21.0 - Added capacity to add/update database object from self.summarise() even if not seqmode=db. Added filedb mode.
    # 1.22.0 - Added geneDic() method.
    # 1.23.0 - Added seqSequence() method.
    # 1.24.0 - Add NNN gaps option and "delete rest of sequences" to edit().
    # 1.24.1 - Minor edit bug fix and DNA toggle option.
    # 1.25.0 - Added loading of FASTQ files in seqmode=file mode.
    # 1.26.0 - Updated sequence statistics and fixed N50 underestimation bug.
    # 1.26.1 - Fixed median length overestimation bug.
    # 1.26.2 - Fixed sizesort bug. (Now big to small as advertised.)
    # 1.27.0 - Added grepNR() method (dev only). Switched default to RevCompNR=T.
    # 1.28.0 - Fixed second pass NR naming bug and added option to switch off altogether.
    # 1.29.0 - Added maker=T/F : Whether to extract MAKER2 statistics (AED, eAED, QI) from sequence names [False]
    # 1.30.0 - Updated and improved DNA2Protein.
    # 1.31.0 - Added genecounter to rename option for use with other programs, e.g. PAGSAT.
    # 1.31.1 - Fixed edit bug when not in DNA mode.
    # 1.32.0 - Added genomesize and NG50/LG50 to DNA summarise.
    # 1.32.1 - Fixed LG50/L50 bug.
    # 1.32.2 - Added reformat=accdesc to generate output without gene and species code.
    # 1.32.3 - Added checkNames() to check for duplicate sequence names and/or lack of gnspacc format.
    # 1.32.3 - Added duperr=T/F : Whether identification of duplicate sequence names should raise an error [True]
    # 1.33.0 - Added newdesc=FILE : File of new names for sequences (over-rules other naming). First word should match input [None]
    # 1.33.1 - Fixed bug with appending sequences with gap insertion.
    # 1.34.0 - Added genecounter=T/F : Whether new gene have a numbered suffix (will match newacc numbering) [False]
    # 1.35.0 - Added initial extraction of sequences from BLASTDB from rje_seq.
    # 1.36.0 - Added bpFromStr(seqlen)
    # 1.36.1 - Changed default duplicate suffix to X2.
    # 1.37.0 - Added masking and extraction from loaded table of positions.
    # 1.38.0 - Added assembly gap summary and manipulation functions.
    # 1.39.0 - Added descaffolding, tiling output and gnspacc=T/F to control edit renaming.
    # 1.40.0 - Added keepname=T/F : Whether to keep the original name (first word) when mapping with newdesc=FILE [True]
    # 1.41.0 - Added contig N50 and L50 output. Tweaked tiling output to leave off name suffix when full length sequence.
    # 1.41.1 - Fixed contig N50 and L50 output. (Previously not sorted!)
    # 1.42.0 - Added tabular summary output for different L/N(G) values.
    # 1.42.1 - Switched mingap=INT to 0=None unless gapstats=T.
    # 1.43.0 - Added raw=T/F and lenstats=LIST to adjust summary statistics for raw sequencing data
    # 1.43.1 - Added sequence reversal (not complemented) to reformat and edit
    # 1.44.0 - Added some additional parsing of common sequence formats from rje_sequence: need to expand.
    # 1.45.0 - Modified the newDesc() method for updating descriptions.
    # 1.45.1 - Added CtgNum to output stats.
    # 1.45.2 - Slight increase of gap extraction speed.
    # 1.45.3 - Fixed bug for summarising masked assemblies.
    # 1.46.0 - Added dna2orfs reformatting options.
    # 1.46.1 - Tweaked the batchSummarise method.
    # 1.46.2 - Added orfgaps=T/F. Partial implementation of GFF output for dna2orfs reformatting. Need completion.
    # 1.47.0 - Added reformat=degap option for removing alignment gaps from input sequences.
    # 1.48.0 - Output a table of contigs during summarise (sets gapstats=T) [False]. Removed some dependencies.
    # 1.48.1 - Switched contigs=TRUE as the default.
    # 1.48.2 - Made contigs=T/F and gapstats=T/F synonymous.

rje_seqlist REST Output formats

The seqlist server is primarily for simple reformatting and sequence manipulation tasks:
- fasta = standard gene_SPECIES__AccNum Description fasta format
- short = fasta format without any description
- acc = fasta format with accession numbers (only) as sequence names
- acclist = plain text list of accession numbers of sequences
- speclist = plain text list of Uniprot species codes for sequences
- dna2prot/rna2prot/translate/nt2prot = translation of DNA (or RNA) sequence into protein
- peptides = plain list (without names) of protein sequences
- qregion = fasta alignment restricted to the columns incorporating the given sequence region of the query (sequence 1)
- region = fasta alignment restricted to the given columns of the alignment

If &rest=X, where X is in the above list, the relevant reformatting will be triggered and the resulting text
output returned. Otherwise, output is &rest=seqout.

Additional sequence filtering (degapping etc.) can be performed with the related seq server, which has
&rest=seqout output only. (See: http://rest.slimsuite.unsw.edu.au/seq for commandline options.)

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT.
