Module:	ExTATIC
Description:	Extensions and Truncations from Alternative Translation Initiation Codons
Version:	0.2.0
Last Edit:	28/01/15

Imported modules: rje rje_db rje_obj rje_seqlist rje_sequence rje_slim

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

ExTATIC predicts alternative initiation sites in cDNA (mRNA) sequences based on a set of rules regarding start codon efficiency. Input is in the form of two sequence files:

1. A file of cDNA sequences (cdna=FASFILE).

2. A file of CDS sequences (cds=FASFILE).

For full functionality, Biomart Ensembl downloads are recommended. Sequence names should be in the format:

>TranscriptID|GeneID[|GeneSymbol][|Description]
The GeneID will be used to match different transcripts from the same gene. GeneSymbol is used purely for output and could be any external database xref of choice. Likewise, Description is purely for interpretation of results and is not used by the program. If fewer than four fields are found by splitting on '|', the third will be assumed to be the description and GeneID will be duplicated for GeneSymbol. If fewer than three, GeneID will also be duplicated for Description. If fewer than two, the first word of each sequence name will be used for both TranscriptID and GeneID (and GeneSymbol) with the remaining name forming the Description. If the Description contains "Gene:X" and no GeneID or GeneSymbol is given, X will be used for these values. Additional naming formats can be added on request.
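
The fallback rules above can be sketched in Python as follows. This is an illustrative reading of the rules, not the code ExTATIC itself uses: the function name parse_seq_name is hypothetical, and two details are assumptions (a '|' inside the Description is kept as part of the Description, and "Gene:X" is only consulted when the name contains no '|' at all).

```python
import re

def parse_seq_name(name):
    """Return (transcript, gene, symbol, description) for a FASTA name.

    Hypothetical helper that mirrors the documented fallback rules.
    """
    name = name.lstrip(">").strip()
    if not name:
        raise ValueError("empty sequence name")
    # Assumption: split at most three times so the description stays whole.
    fields = name.split("|", 3)
    if len(fields) == 4:
        transcript, gene, symbol, desc = fields
    elif len(fields) == 3:
        # Third field is the description; GeneID doubles as GeneSymbol.
        transcript, gene, desc = fields
        symbol = gene
    elif len(fields) == 2:
        # GeneID also doubles as the description.
        transcript, gene = fields
        symbol = desc = gene
    else:
        # No '|': first word is the ID, the rest is the description.
        words = name.split(None, 1)
        transcript = gene = symbol = words[0]
        desc = words[1] if len(words) > 1 else ""
        # Assumed reading of the "Gene:X" rule.
        match = re.search(r"Gene:(\S+)", desc)
        if match:
            gene = symbol = match.group(1)
    return transcript, gene, symbol, desc
```

For example, parse_seq_name("T1|G1|some description") gives G1 for both GeneID and GeneSymbol (T1 and G1 are placeholder identifiers).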

For stringency, 5' flanking regions will NOT be used to expand UTRs where missing/short: this should be performed by an upstream program/analysis if necessary.

The context=DICT option provides IUPAC contexts for AUG start codons and their strengths. Each context must contain XXX, which marks the start codon position. DNA or RNA notation can be used but contexts must not clash, i.e. each codon must uniquely match one context. If no contexts are provided, default values will be used:

XXXG = Strong
RnnXXXH = Mid ([AG]nnXXX[ACU])
CnnXXXU = Mid
UnnXXXC = Mid
CnnXXXM = Weak (CnnXXX[AC])
UnnXXXW = Weak (UnnXXX[AU])
^XXXH = Unknown (XXX[ACU] at start of sequence)
^nXXXH = Unknown (nXXX[ACU] at start of sequence)
^nnXXXH = Unknown (nnXXX[ACU] at start of sequence)
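
As an illustration of how these default contexts partition start codons, the sketch below converts each context to a regular expression and rates an AUG at a given position. It is a minimal sketch and not ExTATIC's own implementation: the names IUPAC, DEFAULT_CONTEXTS, context_regex and codon_strength are hypothetical, positions are 0-based, and a codon with no matching context (for example, one with no following base) simply returns None.

```python
import re

# IUPAC nucleotide codes in RNA notation (T is treated as U).
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
         "R": "AG", "Y": "CU", "M": "AC", "K": "GU", "S": "CG", "W": "AU",
         "H": "ACU", "B": "CGU", "V": "ACG", "D": "AGU", "N": "ACGU"}

# The default contexts listed above.
DEFAULT_CONTEXTS = [
    ("XXXG", "Strong"), ("RnnXXXH", "Mid"), ("CnnXXXU", "Mid"),
    ("UnnXXXC", "Mid"), ("CnnXXXM", "Weak"), ("UnnXXXW", "Weak"),
    ("^XXXH", "Unknown"), ("^nXXXH", "Unknown"), ("^nnXXXH", "Unknown"),
]

def context_regex(context, codon="AUG"):
    """Return (compiled regex, upstream length, anchored?) for a context."""
    anchored = context.startswith("^")
    before, after = context.lstrip("^").split("XXX")
    to_class = lambda s: "".join("[%s]" % IUPAC[c.upper()] for c in s)
    rna_codon = codon.upper().replace("T", "U")
    return re.compile(to_class(before) + rna_codon + to_class(after)), len(before), anchored

def codon_strength(seq, pos, contexts=DEFAULT_CONTEXTS, codon="AUG"):
    """Rate the codon starting at 0-based index pos; None if no context fits."""
    rna = seq.upper().replace("T", "U")
    for context, strength in contexts:
        regex, upstream, anchored = context_regex(context, codon)
        start = pos - upstream
        # Skip contexts needing more 5' sequence than exists, and '^'
        # contexts that would not begin at the start of the sequence.
        if start < 0 or (anchored and start != 0):
            continue
        if regex.match(rna, start):
            return strength
    return None
```

Because the contexts do not clash, the order in which they are tested does not change the result.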

Alternative codons, given by altcodons=LIST, can match any contexts given by altcontext=LIST. By default, only "Strong" contexts are considered, for CTG and GTG codons. Annotated start sites that do not meet any allowed context (i.e. are non-canonical codons not in altcodons=LIST) will be given a context that is just the start codon.

ORFs and ORFTypes

The main output from ExTATIC is a series of tables identifying possible start codons and the corresponding Open Reading Frames (ORFs). A fasta file of predicted ORFs is also output. In this file, protein sequences are in lower case up to the first "Strong" ATG start codon. Upstream predicted start codon positions are also given in upper case.

Alternative initiation codons (AIC) are classified according to their position and reading frame relative to the annotated start codon (if CDS are given). The orftypes=LIST option sets which set of AIC will be returned. ORFs are classified according to their most 5' AIC.

uORF = Upstream ORF that terminates upstream of the annotated start.
eORF = Extended annotated ORF. (Upstream in-frame start site with no stop codon before the annotated start.)
oORF = Upstream ORF that overlaps the annotated start.
aORF = Annotated ORF start site.
tORF = Truncated annotated ORF. (Downstream in-frame start site before stop codon or "Strong" ATG.)
dORF = Downstream ORF that starts before a stop codon or "Strong" ATG in annotated ORF.
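
The positional logic behind these classes can be sketched as below. This is a simplified illustration under stated assumptions, not the program's actual code: orf_end and orf_type are hypothetical names, positions are 0-based indices into the cDNA, an ORF with no in-frame stop is taken to run to the end of the sequence, and the extra tORF/dORF requirement (starting before a stop codon or "Strong" ATG in the annotated ORF) is not checked.

```python
STOPS = {"TAA", "TAG", "TGA"}

def orf_end(seq, start):
    """Index just past the first in-frame stop codon, or len(seq) if none."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i + 3
    return len(seq)

def orf_type(seq, aic, annotated):
    """Classify a start site at index aic relative to the annotated start."""
    seq = seq.upper().replace("U", "T")
    in_frame = (aic - annotated) % 3 == 0
    if aic == annotated:
        return "aORF"
    if aic > annotated:
        # Downstream sites: in-frame truncation or out-of-frame ORF.
        return "tORF" if in_frame else "dORF"
    if orf_end(seq, aic) <= annotated:
        # Terminates before the annotated start, whatever the frame.
        return "uORF"
    # Reads through the annotated start: extension if in frame, else overlap.
    return "eORF" if in_frame else "oORF"
```

Note that an upstream in-frame site with an intervening stop codon falls out as a uORF, since its ORF ends before the annotated start.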

Single Sequence Analysis

If singleseq=X is given then ExTATIC will analyse a single sequence only. SeqIn and CDSIn should be sequence strings rather than fasta files. The sequence name will be taken from singleseq=X. Use append=T to add results to previous singleseq analyses and set basefile=X. (Default basefile for singleseq analysis is the first word from singleseq.)

REST Output

For details of available REST output for ExTATIC, see: http://rest.slimsuite.unsw.edu.au/extatic&rest=outfmt

Commandline

INPUT OPTIONS

seqin=FASFILE : cDNA input sequence file (*.cdna.fas). See docs for naming advice. [ExTATIC.cdna.fas]
cdna=FASFILE : Alternative cDNA input sequence file option.
cds=FASFILE : CDS input sequence file. (or cdsin=FASFILE) [*.cds.fas based on cdna or singleseq]
singleseq=X : Analyse a single sequence only (named X). cDNA and CDS should be sequence strings. [None]

PROCESSING OPTIONS

orftypes=LIST : List of alternative ORF types to generate (eORF/tORF/uORF/oORF/dORF) [e,t,u,o]
cdsonly=T/F : Whether to restrict analysis to cDNA with matching CDS [True]
context=DICT : Dictionary file of codon context *XXX* and AUG strength rating (Context:Strength) [see docs]
altcodons=LIST : List of acceptable non-AUG start codons (RNA or DNA) [CTG,GTG]
altcontext=LIST : List of contexts to be considered for altcodons [XXXG]
nrflanks=X : Flanking sequence length (added 5' & 3') for analysing for redundancy within a gene [10]
fullnr=T/F : Perform NR Flank analysis across all genes [False]
minorf=X : Minimum ORF lengths to be considered [0]
minutr=X : Minimum 5' UTR lengths to be included in analysis [1]
mincds=X : Minimum CDS lengths to be included in analysis [0]
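
To make the nrflanks=X and fullnr=T/F options concrete, the sketch below groups start sites that share the same codon plus flanking sequence. It is only an assumed interpretation of the redundancy analysis: nr_groups is a hypothetical name, sites are given as (gene, transcript, cdna, pos) tuples with 0-based positions, and windows truncated by the sequence ends are kept distinct by storing the codon offset in the grouping key.

```python
from collections import defaultdict

def nr_groups(sites, nrflanks=10, fullnr=False):
    """Group (gene, transcript, cdna, pos) sites by codon plus flanks.

    Returns a dict mapping each grouping key to a list of
    (transcript, pos) pairs that share it.
    """
    groups = defaultdict(list)
    for gene, transcript, cdna, pos in sites:
        left = max(0, pos - nrflanks)
        # Codon (3 nt) plus up to nrflanks bases on each side.
        window = cdna[left:pos + 3 + nrflanks].upper()
        offset = pos - left  # where the codon sits inside the window
        key = (window, offset) if fullnr else (gene, window, offset)
        groups[key].append((transcript, pos))
    return dict(groups)
```

With fullnr=False, identical windows from different genes remain in separate groups; with fullnr=True they are merged.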

OUTPUT OPTIONS

basefile=X : Root name for output files. Path will be stripped and resdir used. [* based on cdna or singleseq]
resdir=PATH : Path for results output [./]

History

Module Version History

    # 0.1.0 - Initial Compilation based on PATIS V0.3.
    # 0.2.0 - Added tabular and fasta output. Basic REST service output but not tested.

ExTATIC REST Output formats

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT for:

context = Table of codon contexts and strength ratings. [tdt]
altcodons = List of alternative (non-AUG) codons hopefully.
orfs = output table of predicted open reading frames. [tdt]
aic = output table of predicted alternative initiation sites. [tdt]
nr = output of non-redundant AIC, grouped by codons plus nrflanks=X flanking sequence. (By gene unless fullnr=T) [tdt]
transcripts = output table summarising ORFs and AIC for each transcript. [tdt]
fas = ORF fasta sequences with upper case AIC and lower case sequence until the first Strong ATG codon. [fas]
name = Sequence name (&singleseq=X).
cdna = Input cDNA sequence(s). [fas]
cds = [Optional] Input coding sequence(s). [fas]
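
A request URL for one of these outputs can be assembled as in the sketch below. It assumes that commandline options are appended to the server address as &option=value pairs, in the same style as the &rest=OUTFMT and &singleseq=X forms shown above; rest_url is a hypothetical helper, and the sketch only builds the string without contacting the server.

```python
from urllib.parse import quote

# Server address taken from the REST Output section of this document.
REST_BASE = "http://rest.slimsuite.unsw.edu.au/extatic"

def rest_url(outfmt="full", **options):
    """Build an ExTATIC REST URL with the given options and &rest=outfmt."""
    parts = ["%s=%s" % (key, quote(str(value), safe=""))
             for key, value in options.items()]
    parts.append("rest=%s" % outfmt)
    return REST_BASE + "&" + "&".join(parts)
```

For instance, rest_url("orfs", singleseq="test") requests the orfs table for a single-sequence run named "test" (a placeholder name).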
