ExTATIC predicts alternative initiation sites in cDNA (mRNA) sequences based on a set of rules regarding start codon
efficiency. Input is in the form of two sequence files:
1. A file of cDNA sequences (
2. A file of CDS sequences (
For full functionality, Biomart Ensembl downloads are recommended. Sequence names should be in the format:
The GeneID will be used to match different transcripts from the same gene. GeneSymbol is used purely for output and
could be any external database xref of choice. Likewise, Description is purely for interpretation of results and is
not used by the program. If fewer than four fields are found by splitting on '|', the third will be assumed to be the
description and GeneID will be duplicated for GeneSymbol. If fewer than three, GeneID will also be duplicated for
Description. If fewer than two, the first word of each sequence name will be used for both TranscriptID and GeneID
(and GeneSymbol) with the remaining name forming the Description. If the Description contains "Gene:X" and no GeneID
or GeneSymbol is given, X will be used for these values. Additional naming formats can be added on request.
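The fallback rules above can be sketched in Python. This is an illustration only, not ExTATIC's own parser: the function name parse_seq_name is hypothetical, and the field order TranscriptID|GeneID|GeneSymbol|Description is an assumption inferred from the fallback rules, since the format line itself is missing here. The "Gene:X" match is taken literally (case-sensitive), which is also an assumption.

```python
import re

def parse_seq_name(name):
    """Return (transcript_id, gene_id, gene_symbol, description).

    Hypothetical helper; assumes fields are ordered
    TranscriptID|GeneID|GeneSymbol|Description.
    """
    name = name.lstrip(">").strip()
    fields = [f.strip() for f in name.split("|")]
    if len(fields) >= 4:
        transcript, gene, symbol = fields[:3]
        description = "|".join(fields[3:])   # keep any extra '|' in the description
    elif len(fields) == 3:
        # Third field is the description; GeneID doubles as GeneSymbol.
        transcript, gene, description = fields
        symbol = gene
    elif len(fields) == 2:
        # GeneID is also duplicated for the description.
        transcript, gene = fields
        symbol = description = gene
    else:
        # No '|' at all: first word names everything, the rest is description.
        words = name.split(None, 1)
        transcript = gene = symbol = words[0] if words else ""
        description = words[1] if len(words) > 1 else ""
        # "Gene:X" in the description supplies GeneID and GeneSymbol.
        match = re.search(r"Gene:(\S+)", description)
        if match:
            gene = symbol = match.group(1)
    return transcript, gene, symbol, description

print(parse_seq_name("ENST1|ENSG1|tumor protein"))
```

The identifiers in the usage line are made up for the example.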
For stringency, 5' flanking regions will NOT be used to expand UTRs where missing/short: this should be performed by
an upstream program/analysis if necessary.
context=DICT provides IUPAC contexts for AUG start codons and their strengths. Each context must contain XXX that
indicates the start codon position. DNA or RNA notation can be used but contexts must not clash, i.e. each codon must
uniquely match one context. If no contexts are provided, default values will be used:
XXXG = Strong
RnnXXXH = Mid ([AG]nnXXX[ACU])
CnnXXXU = Mid
UnnXXXC = Mid
CnnXXXM = Weak (CnnXXX[AC])
UnnXXXW = Weak (UnnXXX[AU])
^XXXH = Unknown (XXX[ACU] at start of sequence)
^nXXXH = Unknown (nXXX[ACU] at start of sequence)
^nnXXXH = Unknown (nnXXX[ACU] at start of sequence)
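As a rough illustration of how such contexts might be applied, the sketch below translates each IUPAC context into a regular expression and rates a single start codon. It is not the ExTATIC implementation: classify_start, to_regex and DEFAULT_CONTEXTS are hypothetical names, positions are assumed to be 0-based, and sequences are converted to DNA notation before matching. The context table is copied from the defaults listed above.

```python
import re

# IUPAC nucleotide codes as DNA regex fragments (U is treated as T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

DEFAULT_CONTEXTS = {
    "XXXG": "Strong", "RnnXXXH": "Mid", "CnnXXXU": "Mid", "UnnXXXC": "Mid",
    "CnnXXXM": "Weak", "UnnXXXW": "Weak",
    "^XXXH": "Unknown", "^nXXXH": "Unknown", "^nnXXXH": "Unknown",
}

def to_regex(iupac):
    """Convert a string of IUPAC codes into a regex fragment."""
    return "".join(IUPAC[c.upper()] for c in iupac)

def classify_start(seq, pos, contexts=DEFAULT_CONTEXTS, codon="ATG"):
    """Return (context, strength) for the codon at 0-based pos, or None."""
    seq = seq.upper().replace("U", "T")
    codon = codon.upper().replace("U", "T")
    if seq[pos:pos + 3] != codon:
        return None
    hits = []
    for context, strength in contexts.items():
        anchored = context.startswith("^")      # ^ = context at start of sequence
        up, down = context.lstrip("^").split("XXX")
        start = pos - len(up)
        if start < 0 or (anchored and start != 0):
            continue
        window = seq[start:pos + 3 + len(down)]
        if re.fullmatch(to_regex(up) + codon + to_regex(down), window):
            hits.append((context, strength))
    if len(hits) > 1:
        # Contexts must not clash: each codon should match one context only.
        raise ValueError("Clashing contexts: %s" % hits)
    return hits[0] if hits else None

print(classify_start("GCCACCATGGCG", 6))
```

Raising an error on multiple hits mirrors the requirement that each codon must uniquely match one context; a real implementation might instead validate the dictionary once when it is loaded.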
Alternative codons, given by altcodons=LIST, can match any contexts given by altcontext=LIST. By default, only
"Strong" contexts are considered, for CTG and GTG codons. Annotated start sites that do not meet any allowed context
(i.e. are non-canonical codons not in altcodons=LIST) will be given a context that is just the start codon.
ORFs and ORFTypes
The main output from ExTATIC is a series of tables identifying possible start codons and the corresponding Open
Reading Frames (ORFs). A fasta file of predicted ORFs is also output. In this file, protein sequences are in lower
case up to the first "Strong" ATG start codon. Upstream predicted start codon positions are also given in upper case.
AIC are classified according to their position and reading frame relative to the annotated start codon (if CDS are
orftypes=LIST option sets which set of AIC will be returned. ORFs are classified according to their most
uORF = Upstream ORF that terminates upstream of the annotated start.
eORF = Extended annotated ORF. (Upstream in-frame start site with no stop codon before the annotated start.)
oORF = Upstream ORF that overlaps the annotated start.
aORF = Annotated ORF start site.
tORF = Truncated annotated ORF. (Downstream in-frame start site before stop codon or "Strong" ATG.)
dORF = Downstream ORF that starts before a stop codon or "Strong" ATG in annotated ORF.
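The ORF types above can be read as a small decision tree. The following sketch is one possible interpretation, not ExTATIC's own logic: classify_orf is a hypothetical function, all positions are assumed to be 0-based positions of the first base of a codon in the cDNA, and dORF is assumed to mean a downstream start that is out of frame with the annotated ORF. It does not check the "before a stop codon or Strong ATG" condition for tORF and dORF, which would need to be applied separately.

```python
def classify_orf(start, stop, annotated_start):
    """Classify a candidate start codon relative to the annotated start.

    start           : position of the candidate start codon
    stop            : position of the first in-frame stop codon after start
    annotated_start : position of the annotated start codon
    (Hypothetical helper; all positions 0-based.)
    """
    if start == annotated_start:
        return "aORF"
    in_frame = (start - annotated_start) % 3 == 0
    if start < annotated_start:
        if stop < annotated_start:
            return "uORF"          # terminates upstream of the annotated start
        # Reaches the annotated start: in frame extends it, otherwise overlaps it.
        return "eORF" if in_frame else "oORF"
    # Candidate lies downstream of the annotated start.
    return "tORF" if in_frame else "dORF"

for args in [(10, 40, 100), (70, 400, 100), (71, 200, 100), (130, 400, 100)]:
    print(args, classify_orf(*args))
```

How a stop codon that itself straddles the annotated start should be treated is not specified in the text above; this sketch counts it as terminating upstream.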
Single Sequence Analysis
If singleseq=X is given then ExTATIC will analyse a single sequence only. SeqIn and CDSIn should be sequence strings
rather than fasta files. The sequence name will be taken from singleseq=X. Use append=T to add results to previous
singleseq analyses and set basefile=X. (Default basefile for singleseq analysis is the first word from singleseq.)
For details of available REST output for ExTATIC, see: http://rest.slimsuite.unsw.edu.au/extatic&rest=outfmt
seqin=FASFILE : cDNA input sequence file (*.cdna.fas). See docs for naming advice. [
cdna=FASFILE : Alternative cDNA input sequence file option.
cds=FASFILE : CDS input sequence file. (or
*.cds.fas based on cdna or singleseq]
singleseq=X : Analyse a single sequence only (named X). cDNA and CDS should be sequence strings. [
orftypes=LIST : List of alternative ORF types to generate (eORF/tORF/uORF/oORF/dORF) [
cdsonly=T/F : Whether to restrict analysis to cDNA with matching CDS [
context=DICT : Dictionary file of codon context *XXX* and AUG strength rating (Context:Strength) [
altcodons=LIST : List of acceptable non-AUG start codons (RNA or DNA) [
altcontext=LIST : List of contexts to be considered for altcodons [
nrflanks=X : Flanking sequence length (added 5' & 3') for analysing for redundancy within a gene [
fullnr=T/F : Perform NR Flank analysis across all genes [
minorf=X : Minimum ORF lengths to be considered [
minutr=X : Minimum 5' UTR lengths to be included in analysis [
mincds=X : Minimum CDS lengths to be included in analysis [
basefile=X : Root name for output files. Path will be stripped and resdir used. [
* based on cdna or singleseq]
resdir=PATH : Path for results output [
Module Version History
# 0.1.0 - Initial Compilation based on PATIS V0.3.
# 0.2.0 - Added tabular and fasta output. Basic REST service output but not tested.
ExTATIC REST Output formats
for general options. Run with
to get full server output as text or
for more user-friendly formatted output. Individual outputs can be identified/parsed using
= Table of codon contexts and strength ratings. [tdt]
= List of alternative (non-AUG) codons hopefully.
= output table of predicted open reading frames. [tdt]
= output table of predicted alternative initiation sites. [tdt]
= output of non-redundant AIC, grouped by codons plus
flanking sequence. (By gene unless
= output table summarising ORFs and AIC for each transcript. [tdt]
= ORF fasta sequences with upper case AIC and lower case sequence until the first Strong ATG codon. [fas]
= Sequence name (
= Input cDNA sequence(s). [fas]
= [Optional] Input coding sequence(s). [fas]