Function
ExTATIC predicts alternative initiation sites in cDNA (mRNA) sequences based on a set of rules regarding start codon
efficiency. Input is in the form of two sequence files:
1. A file of cDNA sequences (cdna=FASFILE).
2. A file of CDS sequences (cds=FASFILE).
For full functionality, Biomart Ensembl downloads are recommended. Sequence names should be in the format:
>TranscriptID|GeneID[|GeneSymbol][|Description]
The GeneID will be used to match different transcripts from the same gene. GeneSymbol is used purely for output and
could be any external database xref of choice. Likewise, Description is purely for interpretation of results and is
not used by the program. If fewer than four fields are found by splitting on '|', the third will be assumed to be the
description and GeneID will be duplicated for GeneSymbol. If fewer than three, GeneID will also be duplicated for
Description. If fewer than two, the first word of each sequence name will be used for both TranscriptID and GeneID
(and GeneSymbol) with the remaining name forming the Description. If the Description contains "Gene:X" and no GeneID
or GeneSymbol is given, X will be used for these values. Additional naming formats can be added on request.
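The fallback rules for sequence names can be sketched in Python as follows. This is a minimal illustration of the rules described above, not the ExTATIC source: the function name `parse_seq_name` and the `SeqName` tuple are hypothetical, and folding any extra '|' fields into the Description is an assumption.

```python
import re
from typing import NamedTuple


class SeqName(NamedTuple):
    transcript_id: str
    gene_id: str
    gene_symbol: str
    description: str


def parse_seq_name(name: str) -> SeqName:
    """Split a fasta name (without the leading '>') using the documented fallbacks."""
    name = name.strip()
    fields = name.split("|")
    if len(fields) >= 4:
        # Assumption: any extra '|' characters belong to the description.
        return SeqName(fields[0], fields[1], fields[2], "|".join(fields[3:]))
    if len(fields) == 3:
        # Third field is the description; GeneID doubles as GeneSymbol.
        return SeqName(fields[0], fields[1], fields[1], fields[2])
    if len(fields) == 2:
        # GeneID is reused for both GeneSymbol and Description.
        return SeqName(fields[0], fields[1], fields[1], fields[1])
    # No '|' at all: first word is the ID, the rest is the description.
    words = name.split(None, 1)
    first = words[0] if words else ""
    description = words[1] if len(words) > 1 else ""
    gene = first
    match = re.search(r"Gene:(\S+)", description)
    if match:
        # "Gene:X" in the description supplies GeneID and GeneSymbol.
        gene = match.group(1)
    return SeqName(first, gene, gene, description)
```

For example, `parse_seq_name("ENST1|ENSG1|Some description")` gives a GeneSymbol equal to the GeneID, because only three fields are present.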
For stringency, 5' flanking regions will NOT be used to expand UTRs where missing/short: this should be performed by
an upstream program/analysis if necessary.
The context=DICT option provides IUPAC contexts for AUG start codons and their strengths. Each context must contain XXX that
indicates the start codon position. DNA or RNA notation can be used but contexts must not clash, i.e. each codon must
uniquely match one context. If no contexts are provided, default values will be used:
XXXG = Strong
RnnXXXH = Mid ([AG]nnXXX[ACU])
CnnXXXU = Mid
UnnXXXC = Mid
CnnXXXM = Weak (CnnXXX[AC])
UnnXXXW = Weak (UnnXXX[AU])
^XXXH = Unknown (XXX[ACU] at start of sequence)
^nXXXH = Unknown (nXXX[ACU] at start of sequence)
^nnXXXH = Unknown (nnXXX[ACU] at start of sequence)
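To make the matching concrete, the sketch below rates a codon position against the default contexts listed above. It is an illustration only, not the ExTATIC implementation: `context_strength` and `_matches` are hypothetical names, positions are 0-based, and a leading "^" is taken to mean that the context must begin at the first base of the sequence. XXX is treated as a placeholder, so the codon itself is not checked here.

```python
# IUPAC nucleotide codes in RNA notation (T is treated as U).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}

DEFAULT_CONTEXTS = {
    "XXXG": "Strong",
    "RnnXXXH": "Mid", "CnnXXXU": "Mid", "UnnXXXC": "Mid",
    "CnnXXXM": "Weak", "UnnXXXW": "Weak",
    "^XXXH": "Unknown", "^nXXXH": "Unknown", "^nnXXXH": "Unknown",
}


def _matches(pattern: str, bases: str) -> bool:
    """True if every base is allowed by the IUPAC code at the same position."""
    return len(pattern) == len(bases) and all(
        base in IUPAC[code] for code, base in zip(pattern, bases)
    )


def context_strength(seq: str, pos: int, contexts=DEFAULT_CONTEXTS):
    """Return (context, strength) for the codon starting at 0-based pos, or None."""
    rna = seq.upper().replace("T", "U")
    for context, strength in contexts.items():
        anchored = context.startswith("^")
        pattern = context.lstrip("^").upper().replace("T", "U")
        before, after = pattern.split("XXX")
        if anchored and pos != len(before):
            continue  # '^' contexts only apply at the very start of the sequence
        if pos < len(before):
            continue  # not enough upstream sequence for this context
        upstream = rna[pos - len(before):pos]
        downstream = rna[pos + 3:pos + 3 + len(after)]
        if _matches(before, upstream) and _matches(after, downstream):
            return context, strength
    return None
```

Because the default contexts do not clash, the order in which they are tested does not matter; a user-supplied dictionary with overlapping contexts would make the first match win in this sketch.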
Alternative codons, given by altcodons=LIST, can match any contexts given by altcontext=LIST. By default, only
"Strong" contexts are considered, for CTG and GTG codons. Annotated start sites that do not meet any allowed context
(i.e. are non-canonical codons not in altcodons=LIST) will be given a context that is just the start codon.
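As a rough illustration of the default behaviour, the sketch below scans a sequence for AUG codons in any context and for alternative codons only when followed by G (the XXXG context). The function name `find_start_codons` is hypothetical, the XXXG rule is hard-coded rather than read from altcontext=LIST, and positions are 0-based.

```python
def find_start_codons(seq, altcodons=("CTG", "GTG")):
    """Yield (pos, codon) for AUG codons and for altcodons in the XXXG context."""
    rna = seq.upper().replace("T", "U")
    alts = {codon.upper().replace("T", "U") for codon in altcodons}
    for pos in range(len(rna) - 2):
        codon = rna[pos:pos + 3]
        if codon == "AUG":
            yield pos, codon
        elif codon in alts and rna[pos + 3:pos + 4] == "G":
            # Default altcontext is XXXG: the base after the codon must be G.
            yield pos, codon
```

In the sequence `CCCTGGAAATGACTGA`, the first CTG is followed by G and is reported, while the second CTG is followed by A and is skipped.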
ORFs and ORFTypes
The main output from ExTATIC is a series of tables identifying possible start codons and the corresponding Open
Reading Frames (ORFs). A fasta file of predicted ORFs is also output. In this file, protein sequences are in lower
case up to the first "Strong" ATG start codon. Upstream predicted start codon positions are also given in upper case.
AIC are classified according to their position and reading frame relative to the annotated start codon (if CDS are
given). The orftypes=LIST option sets which set of AIC will be returned. ORFs are classified according to their most
5' AIC.
uORF = Upstream ORF that terminates upstream of the annotated start.
eORF = Extended annotated ORF. (Upstream in-frame start site with no stop codon before the annotated start.)
oORF = Upstream ORF that overlaps the annotated start.
aORF = Annotated ORF start site.
tORF = Truncated annotated ORF. (Downstream in-frame start site before stop codon or "Strong" ATG.)
dORF = Downstream ORF that starts before a stop codon or "Strong" ATG in annotated ORF.
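The classification above can be approximated from a candidate start position, the annotated start position and the reading frame. The sketch below is a simplified reading of these definitions, not the ExTATIC code: `orf_end` and `orf_type` are hypothetical names, positions are 0-based, downstream out-of-frame starts are assumed to be dORFs, and the "before a stop codon or Strong ATG" condition on tORF/dORF is deliberately not checked.

```python
STOPS = {"TAA", "TAG", "TGA"}


def orf_end(cdna: str, start: int) -> int:
    """Return the index just past the first in-frame stop codon, or len(cdna)."""
    seq = cdna.upper().replace("U", "T")
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i + 3
    return len(seq)


def orf_type(cdna: str, start: int, annotated: int) -> str:
    """Classify a start site relative to the annotated start (simplified)."""
    in_frame = (annotated - start) % 3 == 0
    if start == annotated:
        return "aORF"
    if start < annotated:
        if orf_end(cdna, start) <= annotated:
            return "uORF"  # terminates upstream of the annotated start
        # Reaches the annotated start: extension if in frame, overlap otherwise.
        return "eORF" if in_frame else "oORF"
    # Downstream of the annotated start (stop/Strong-ATG limit not applied here).
    return "tORF" if in_frame else "dORF"
```

A full implementation would also need the context ratings to decide where scanning for tORF and dORF candidates stops.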
Single Sequence Analysis
If singleseq=X is given then ExTATIC will analyse a single sequence only. SeqIn and CDSIn should be sequence strings
rather than fasta files. The sequence name will be taken from singleseq=X. Use append=T to add results to previous
singleseq analyses and set basefile=X. (Default basefile for singleseq analysis is the first word from singleseq.)
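A single-sequence run can be assembled as a list of key=value arguments, as in the sketch below. The helper `singleseq_args` is hypothetical and only uses options named in this documentation; it makes the documented basefile default explicit by taking the first word of the sequence name.

```python
def singleseq_args(name, cdna, cds=None, append=False, basefile=None):
    """Build a key=value argument list for a single-sequence run."""
    args = [f"singleseq={name}", f"seqin={cdna}"]
    if cds:
        args.append(f"cds={cds}")
    if append:
        args.append("append=T")
    # Documented default: basefile is the first word of the singleseq name.
    words = name.split()
    args.append(f"basefile={basefile or (words[0] if words else '')}")
    return args
```

The list could then be passed to the program, for instance with `subprocess.run(["python", "extatic.py", *args])`, where the script path is an assumption that depends on the local installation.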
REST Output
For details of available REST output for ExTATIC, see: http://rest.slimsuite.unsw.edu.au/extatic&rest=outfmt
Commandline
INPUT OPTIONS
seqin=FASFILE : cDNA input sequence file (*.cdna.fas). See docs for naming advice. [ExTATIC.cdna.fas]
cdna=FASFILE : Alternative cDNA input sequence file option.
cds=FASFILE : CDS input sequence file. (or cdsin=FASFILE) [*.cds.fas based on cdna or singleseq]
singleseq=X : Analyse a single sequence only (named X). cDNA and CDS should be sequence strings. [None]
PROCESSING OPTIONS
orftypes=LIST : List of alternative ORF types to generate (eORF/tORF/uORF/oORF/dORF) [e,t,u,o]
cdsonly=T/F : Whether to restrict analysis to cDNA with matching CDS [True]
context=DICT : Dictionary file of codon context *XXX* and AUG strength rating (Context:Strength) [see docs]
altcodons=LIST : List of acceptable non-AUG start codons (RNA or DNA) [CTG,GTG]
altcontext=LIST : List of contexts to be considered for altcodons [XXXG]
nrflanks=X : Flanking sequence length (added 5' & 3') for analysing redundancy within a gene [10]
fullnr=T/F : Perform NR Flank analysis across all genes [False]
minorf=X : Minimum ORF lengths to be considered [0]
minutr=X : Minimum 5' UTR lengths to be included in analysis [1]
mincds=X : Minimum CDS lengths to be included in analysis [0]
OUTPUT OPTIONS
basefile=X : Root name for output files. Path will be stripped and resdir used. [* based on cdna or singleseq]
resdir=PATH : Path for results output [./]
History Module Version History
# 0.1.0 - Initial Compilation based on PATIS V0.3.
# 0.2.0 - Added tabular and fasta output. Basic REST service output but not tested.
ExTATIC REST Output formats
Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format for
more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT for:
context = Table of codon contexts and strength ratings. [tdt]
altcodons = List of alternative (non-AUG) codons hopefully.
orfs = output table of predicted open reading frames. [tdt]
aic = output table of predicted alternative initiation sites. [tdt]
nr = output of non-redundant AIC, grouped by codons plus nrflanks=X flanking sequence. (By gene unless fullnr=T) [tdt]
transcripts = output table summarising ORFs and AIC for each transcript. [tdt]
fas = ORF fasta sequences with upper case AIC and lower case sequence until the first Strong ATG codon. [fas]
name = Sequence name (&singleseq=X).
cdna = Input cDNA sequence(s). [fas]
cds = [Optional] Input coding sequence(s). [fas]
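A request URL for one of these outputs can be put together as in the sketch below. The base address is taken from the link given in this documentation; the helper `rest_url` is hypothetical, and appending other options as &key=value pairs before &rest=OUTFMT is an assumption based on the &singleseq=X and &rest=OUTFMT notation used above. No request is sent by this code.

```python
from urllib.parse import quote

REST_BASE = "http://rest.slimsuite.unsw.edu.au/extatic"


def rest_url(outfmt="format", **options):
    """Build a REST URL string with percent-encoded option values."""
    parts = [f"{key}={quote(str(value), safe='')}" for key, value in options.items()]
    parts.append(f"rest={outfmt}")
    return REST_BASE + "&" + "&".join(parts)
```

For example, `rest_url("orfs", singleseq="test")` returns a URL ending in `extatic&singleseq=test&rest=orfs`.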