Module:	rje_aic
Description:	Alternative Initiation Codon Module
Version:	0.8
Last Edit:	30/01/14

Imported modules: rje, rje_db, rje_seq, rje_seqlist, rje_sequence, rje_uniprot, rje_zen, rje_blast_V1

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module contains miscellaneous methods for exploring Alternative Initiation Codons (AIC). The analysis performed is set by mode=X. The available modes are described in more detail below.

Aug11

This mode is an initial experiment with the new rje_seqlist module, testing whether flanking regions can be extracted
directly from chromosomal DNA data.

Feb10

This mode constitutes a revised Kozak-style analysis of human TIC (Translation Initiation Codon). Data is read in
from four sources:
1. Full length EnsEMBL cDNAs (reference only)
2. 1000nt 5' of TSS. This should not include 5' UTR where annotation is good.
3. 5'UTR of transcripts (where available).
4. Coding sequences of each transcript.

Sequences are compiled and cross-checked with each other. Sequence <3> should match the start of sequence <1> for a
given transcript and be followed by sequence <4>. Where <3> is missing, the full length cDNA and CDS are used and the
TIC is assumed to be the first codon of the CDS. In terms of TIC context, this gives different types of sequence
('etype'):
- AL. 5'UTR annotated and longer than kozwin.
- AS. 5'UTR annotated but shorter than kozwin.
- US. 5'UTR inferred from cDNA but shorter than kozwin.
- These are also combined for additional combos (XL,XS,AX,XX) for calculations
- NL. 5'UTR annotated and longer than kozwin but non-AUG TIC. Not used for calculations (but still rated).
- NS. 5'UTR annotated but shorter than kozwin, plus non-AUG TIC. Not used for calculations (but still rated).
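
The eType rules above can be sketched as a small function. This is only an illustration, not the module's own code: the function name and arguments are hypothetical, "longer" is assumed to be a strict comparison (a UTR of exactly kozwin nt is treated as short), and combinations the text does not name return None.

```python
def classify_etype(annotated, utr_len, tic_codon, kozwin=21):
    """Return the eType code for one transcript, or None if the text defines none.

    annotated : True if the 5'UTR was annotated, False if inferred from cDNA.
    utr_len   : length of the 5'UTR in nucleotides.
    tic_codon : the annotated initiation codon (DNA or RNA alphabet).
    """
    # Assumption: strict comparison; a UTR of exactly kozwin nt counts as short.
    length = "L" if utr_len > kozwin else "S"
    is_aug = tic_codon.upper().replace("U", "T") == "ATG"
    if not is_aug:
        # NL/NS are only described for annotated UTRs.
        return "N" + length if annotated else None
    if annotated:
        return "A" + length
    # Only the short inferred case (US) is described in the text.
    return "US" if length == "S" else None
```

For example, classify_etype(True, 50, "AUG") gives "AL" and classify_etype(False, 5, "ATG") gives "US".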

For each transcript, the following sequences are then extracted:
- The "TICwin" window around TIC (as set by kozwin). [*.kozwin.fas]
- A control "Conwin" window around the first AUG encountered in seq <2>. [*.conwin.fas]
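
A minimal sketch of the two window extractions, assuming sequences are plain DNA strings and that windows running off the end of a sequence are padded with "-". The padding and all function names are assumptions made for illustration; the text does not say how short windows are handled.

```python
def window_around(seq, codon_start, kozwin=21, pad="-"):
    """Return kozwin nt + 3 nt codon + kozwin nt, padded where seq runs out."""
    left = seq[max(0, codon_start - kozwin):codon_start]
    right = seq[codon_start:codon_start + 3 + kozwin]
    return left.rjust(kozwin, pad) + right.ljust(3 + kozwin, pad)


def ticwin(utr5, cds, kozwin=21):
    """Window around the TIC, taken as the first codon of the CDS."""
    return window_around(utr5 + cds, len(utr5), kozwin)


def conwin(flank, kozwin=21):
    """Control window around the first ATG in the 5' flank, or None if absent."""
    pos = flank.upper().find("ATG")
    return None if pos < 0 else window_around(flank, pos, kozwin)
```

With kozwin=3, ticwin("GCCGCC", "ATGGCTAAA", kozwin=3) returns "GCCATGGCT".
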
Each transcript is then classified according to its core sequence:
- Good = "best" Kozak gccaugg [set by GoodIC list]
- augG = Not "Good" but has G at +4
- Mid (25-70% activity) where the -3/+4 combination is G/A, A/A, G/C, A/C, U/C, G/U, A/U, C/U
- Weak (<25% activity) where -3/+4 combination is C/C, C/A, U/A, U/U
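
The classification above amounts to a lookup on the -3 and +4 positions of the 7 nt core (XXXAUGX). The sketch below uses hypothetical names and assumes the GoodIC list holds only gccaugg. Cores with ambiguous bases at -3 or +4 return None, which is also an assumption.

```python
GOOD_IC = {"GCCAUGG"}  # assumption: GoodIC list holds only the "best" Kozak core
MID_PAIRS = {"GA", "AA", "GC", "AC", "UC", "GU", "AU", "CU"}
WEAK_PAIRS = {"CC", "CA", "UA", "UU"}


def core_type(core):
    """Classify a 7 nt core (-3 to +4) as Good, augG, Mid or Weak."""
    core = core.upper().replace("T", "U")
    if len(core) != 7:
        raise ValueError("core must span -3 to +4 (7 nt)")
    if core in GOOD_IC:
        return "Good"
    if core[6] == "G":  # G at +4 but not in the Good list
        return "augG"
    pair = core[0] + core[6]  # the -3/+4 combination
    if pair in MID_PAIRS:
        return "Mid"
    if pair in WEAK_PAIRS:
        return "Weak"
    return None  # ambiguous base at -3 or +4
```

The four +4 G combinations, eight Mid pairs and four Weak pairs together cover all sixteen -3/+4 combinations.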

For each gene, transcripts are checked for redundancy and only one representative of each different TICwin sequence
is kept. Regional GC content is estimated from the first 500nt of seq <2>.
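
As a rough sketch of these two steps: the first function keeps one transcript per distinct TICwin within each gene and records the Redundancy and AIC counts described in the per-transcript output; the second counts bases in the first 500 nt of the flank. Keeping the first transcript encountered as the representative is an assumption, and all names are illustrative.

```python
from collections import Counter


def filter_redundant(transcripts):
    """transcripts: iterable of (gene_id, transcript_id, ticwin) tuples."""
    kept = {}               # (gene, ticwin) -> representative transcript ID
    redundancy = Counter()  # (gene, ticwin) -> number of filtered duplicates
    for gene, tid, win in transcripts:
        key = (gene, win.upper())
        if key in kept:
            redundancy[key] += 1
        else:
            kept[key] = tid  # assumption: first transcript seen is kept
    aic = Counter(gene for gene, _ in kept)  # distinct windows per gene
    return [
        {"Gene": g, "Transcript": t, "TICwin": w,
         "Redundancy": redundancy[(g, w)], "AIC": aic[g]}
        for (g, w), t in kept.items()
    ]


def regional_counts(flank, n=500):
    """Base counts (regG, regC, regA, regT) for the first n nt of the flank."""
    region = flank[:n].upper()
    return {"reg" + base: region.count(base) for base in "GCAT"}
```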

The following calculations are then performed for both TICwin and Conwin sequences:
- Nucleotide counts for each position of the window, within each eType
- Frequency and Ranking for "core" sequences (-3 to +4) within each eType
- Frequency "f" scores for each window using sums of observed frequencies at each position
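
One way to read the position counts and "f" scores is sketched below: count each base at each window position, then score a window by summing the observed frequency of its own base at every position. Whether the module normalises or weights this sum is not stated, so treat the scoring as an assumption. The skip argument is a hypothetical convenience for leaving out core positions.

```python
from collections import Counter


def position_counts(windows):
    """Return one Counter of G/A/T/C counts per window position."""
    counts = [Counter() for _ in range(len(windows[0]))]
    for win in windows:
        for i, base in enumerate(win.upper()):
            if base in "GATC":  # padding and ambiguous bases are ignored
                counts[i][base] += 1
    return counts


def f_score(window, counts, skip=()):
    """Sum, over positions, the observed frequency of the window's base."""
    score = 0.0
    for i, base in enumerate(window.upper()):
        total = sum(counts[i].values())
        if i in skip or not total:
            continue
        score += counts[i][base] / total
    return score
```

Computing counts separately for each eType and CoreSplit subset would give the per-subset scores described above.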

These calculations are performed for all sequences and then repeated for each CoreType independently. This is output
in the "CoreSplit" field of each output (All/Good/augG/Rxx/Poor).

The following data is recorded for each transcript:
- CoreSplit = All/Good/augG/Rxx/Poor.
- Transcript ID
- Gene ID
- TICwin = Window around TIC (as set by kozwin)
- Core = Core Kozak window -3 to +4 (XXXAUGX)
- CoreType = Good/augG/Rxx/Poor
- WegCore = Wegrzyn Core sequence XXnnXnnAUGX
- Conwin = Control window around 5' AUG (as set by kozwin)
- Redundancy = No. filtered identical TIC windows for this gene
- AIC = No. of *different* TIC windows for this gene.
- e5utr = UTR length from input seq <3>
- c5utr = UTR length from input seq <1>
- eType = AL/AS/US (see above)
- regG = G count from first 500nt of seq <2>
- regC = C count from first 500nt of seq <2>
- regA = A count from first 500nt of seq <2>
- regT = T count from first 500nt of seq <2>
- tfX = TICwin Frequency scores, where X is AL/AS/US/XL/XS/AX/XX.
- cfX = Conwin Frequency scores, where X is AL/AS/US/XL/XS/AX/XX.
- trX = TICwin core ranking scores, where X is AL/AS/US/XL/XS/AX/XX.
- crX = Conwin core ranking scores, where X is AL/AS/US/XL/XS/AX/XX.
- twX = TICwin WegCore ranking scores, where X is AL/AS/US/XL/XS/AX/XX.
- cwX = Conwin WegCore ranking scores, where X is AL/AS/US/XL/XS/AX/XX.

The following data is recorded for each Core:
- CoreSplit = All/Good/augG/Rxx/Poor.
- Core = Core sequence
- tfX = Mean frequency scores (excluding Core) for TICwin with this core
- tiX = Information content (excluding Core) for TICwin with this core
- trX = Rank for TICwin with this core
- cfX = Mean frequency scores (excluding Core) for Conwin with this core
- ciX = Information content (excluding Core) for Conwin with this core
- crX = Rank for Conwin with this core

The following data is recorded for each position in the window:
- eType = AL/AS/US/XL/XS/AX/XX.
- CoreSplit = All/Good/augG/Rxx/Poor.
- G = Count of G
- A = Count of A
- T = Count of T
- C = Count of C
- Info = Information Content of position.
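
The text does not say which information measure is used. A common choice for nucleotide positions is 2 bits minus the Shannon entropy of the base frequencies, and the sketch below assumes that definition (with no small-sample correction). The function name is illustrative.

```python
import math


def information_content(counts):
    """Information content (bits) of one position from its G/A/T/C counts.

    Assumption: IC = 2 - H, where H is the Shannon entropy in bits.
    """
    total = sum(counts.get(base, 0) for base in "GATC")
    if not total:
        return 0.0
    entropy = 0.0
    for base in "GATC":
        p = counts.get(base, 0) / total
        if p > 0:
            entropy -= p * math.log2(p)
    return 2.0 - entropy
```

Under this definition a fully conserved position scores 2 bits and a position with equal base counts scores 0.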

BIOL3050
This method analyses the genome of choice for good (and bad) candidates for BIOL3050 project genes. Initiation codons
are divided into:
- Weak = -3[UC]NNXXX[UCA]
- Strong = -3[AG]NNXXXG, where XXX is AUG/CUG/ACG
- Mid = All others
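
The three patterns can be expressed as regular expressions over the 7 nt context from -3 to +4. In this sketch the codon restriction (AUG/CUG/ACG) is assumed to apply to every class, and contexts with any other codon return None. Both choices are assumptions, as is the function name.

```python
import re

IC_CODONS = ("AUG", "CUG", "ACG")
WEAK_RE = re.compile(r"[UC]..[ACGU]{3}[UCA]")
STRONG_RE = re.compile(r"[AG]..[ACGU]{3}G")


def ic_strength(context):
    """Classify a 7 nt context (-3 to +4) as Weak, Strong or Mid."""
    ctx = context.upper().replace("T", "U")
    if len(ctx) != 7 or ctx[3:6] not in IC_CODONS:
        return None  # assumption: only the listed codons are rated
    if WEAK_RE.fullmatch(ctx):
        return "Weak"
    if STRONG_RE.fullmatch(ctx):
        return "Strong"
    return "Mid"
```
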
Genes are then classified according to the following criteria for experimentation:
- Good = Weak annotated IC and Strong eORF or tORF, 40aa <= eLen <= 200aa. No RECuts in 5'UTR to IC+300nt.
- Cut = Weak annotated IC and Strong eORF or tORF, 40aa <= eLen <= 200aa but RECuts in 5'UTR to IC+300nt.
- Len = Weak annotated IC and Strong eORF or tORF, but eLen < 40aa or eLen > 200aa.
- Poor = All others.
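
The gene classification can be sketched as a short decision function. The argument names are hypothetical: alt_strengths stands for the strengths of the eORF/tORF candidates, and region for the sequence from the 5'UTR to IC+300nt. Only the forward strand is searched, which is sufficient for the two default recut sites because both are palindromic.

```python
RECUT = ("CTCGAG", "GCTAGC")  # default recut=LIST values


def classify_gene(annotated_ic, alt_strengths, elen, region, recut=RECUT):
    """Return Good, Cut, Len or Poor for one gene.

    annotated_ic  : strength of the annotated IC (Weak/Mid/Strong).
    alt_strengths : strengths of the candidate eORF/tORF initiation codons.
    elen          : eLen in amino acids.
    region        : DNA from the 5'UTR to IC+300nt, searched for cut sites.
    """
    if annotated_ic != "Weak" or "Strong" not in alt_strengths:
        return "Poor"
    if not 40 <= elen <= 200:
        return "Len"
    region = region.upper()
    if any(site in region for site in recut):
        return "Cut"
    return "Good"
```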

PRIDE
This is a very basic parser of PRIDE data that extracts the peptides identified from each XML file.

Commandline

GENERAL

mode=LIST : Run mode (refcheck/kozak/feb10/pride/jan14) [pride]
track=LIST : List of genes or IDs to track through analysis []
enspath=PATH : Path to EnsEMBL files [./EnsEMBL/]

REFCHECK

refseq=FILE : File containing RefSeq download in GenBank format []
biomart=FILE : File containing BioMart download []
shortutr=X : 5' UTR <X bp will be marked as "Short" [10]

KOZAK

kozak=X : Basename for Kozak analysis input ['Homo_sapiens.GRCh37.56']
kozwin=X : No. of nucleotides either side of start codon [21]
noutr=T/F : Whether to include sequences without a 5' UTR [True]
flank=X : Length for 5' flanking sequence [1000]
output=LIST : List of outputs to generate with this run [all]
rep=X : Number of random replicates [1000]
coreal=LIST : List of Sequence eTypes for CoreAL analysis ['AL']
nrgene=X : Which gene type to use for redundancy removal (Gene/EnsG) ['Gene']
nonaug=LIST : List of non-AUG codons to consider in good context [CTG,GTG]

BIOL3050

recut=LIST : List of recognition sequences for restriction enzymes ['CTCGAG','GCTAGC']

PRIDE

pridefile=FILE : Delimited file containing PRIDE IDs and Protein IDs []
pridepath=PATH : Path to XML downloads from PRIDE ['./PRIDE_XML/']

See also rje.py generic commandline options.

Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Added run modes and Kozak analysis.
    # 0.2 - Added Feb '10 Kozak+ analysis.
    # 0.3 - Changed to use new downloads.
    # 0.4 - Added *very* basic PRIDE mode.
    # 0.5 - Added BIOL3050 experimental design mode.
    # 0.6 - Added Aug11 mode for clean analysis.
    # 0.7 - Updated the Weak and Mid definitions in the light of more recent experimental data. Added nonaug=LIST.
    # 0.8 - Added jan14 mode for specific AIC paper analysis. In general need of a remake and tidy!
