Program:	BUDAPEST
Description:	Bioinformatics Utility for Data Analysis of Proteomics on ESTs
Version:	2.3
Last Edit:	31/07/13
Citation:	Jones, Edwards et al. (2011), Marine Biotechnology 13(3): 496-504.

Imported modules: fiesta haqesac rje rje_db rje_mascot rje_menu rje_seq rje_seqlist rje_sequence rje_tree rje_zen rje_blast_V2

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

Proteomic analysis of EST data presents a bioinformatics challenge that is absent from standard protein-sequence based identification. EST sequences are translated in all six Reading Frames (RF), most of which will not be biologically relevant. In addition to increasing the search space for the MS search engines, there is also the added challenge of removing redundancy from results (due to the inherent redundancy of the EST database), removing spurious identifications (due to the translation of incorrect reading frames), and identifying the true protein hits through homology to known proteins.

BUDAPEST (Bioinformatics Utility for Data Analysis of Proteomics on ESTs) aims to overcome some of these problems by post-processing results to remove redundancy and assign putative homology-based identifications to translated RFs that have been "hit" during a MASCOT search of MS data against an EST database. Peptides assigned to "incorrect" RFs are eliminated and EST translations are combined into consensus sequences using FIESTA (Fasta Input EST Analysis). These consensus hits are optionally filtered on the number of MASCOT peptides they contain before being re-annotated using BLAST searches against a reference database. Finally, HAQESAC can be used for automated or semi-automated phylogenetic analysis for improved sequence annotation.
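
To make the six-reading-frame search space concrete, the following is a minimal Python sketch of translating one EST in all six RFs using the standard genetic code. It is an illustration only, not BUDAPEST's own code: the function names and the frame labels (+1 to +3 forward, -1 to -3 reverse complement) are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base; a trailing partial codon is dropped
    and any codon containing an ambiguous base becomes 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(est):
    """Return {frame: protein} for frames +1..+3 and -1..-3 (hypothetical labels)."""
    est = est.upper()
    reverse = est.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(est[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
```

Only one of the six translations is normally expected to correspond to the real protein, which is why peptides matching the other frames need to be cleaned up.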

Input

BUDAPEST takes three main files as input:

A MASCOT results file, specified by mascot=FILENAME.
The EST sequences used in (or, at least, hit by) the MASCOT search, in fasta format, specified by seqin=FILENAME.
A protein database for BLAST-based annotation in fasta format, specified by searchdb=FILENAME.
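
As a rough illustration of how the three inputs map onto option=value arguments, the sketch below assembles a command line in Python. The script name budapest.py and all file names are assumptions for this example; only the option names (mascot, seqin, searchdb, basefile, haqesac) come from this documentation.

```python
def budapest_args(mascot, seqin, searchdb, **extra):
    """Build option=value arguments; booleans are written as T/F."""
    options = {"mascot": mascot, "seqin": seqin, "searchdb": searchdb}
    options.update(extra)
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"
        args.append("{}={}".format(key, value))
    return args


# Hypothetical file names; replace with real paths.
args = budapest_args("results.csv", "ests.fas", "proteins.fas",
                     basefile="run1", haqesac=False)
print(" ".join(["python", "budapest.py"] + args))
```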

Output

BUDAPEST produces the following main output files, where X is set by basefile=X:

X.budapest.tdt = main output table of results
X.budapest.fas = BLAST-annotated clustered consensus EST translations using FIESTA
X.summary.txt = summary of results from BUDAPEST pipeline
X.details.txt = full details of processing for each original MASCOT hit.

Further information can be obtained from the additional sequence files:

X.est.fas = subset of EST sequences from EST database that have 1+ hits in MASCOT results.
X.translations.fas = fasta format of translated RF Hits that are retained after BUDAPEST cleanup.
X.fiesta.fas = BLAST-annotated consensus EST translations using FIESTA (pre min. peptide filtering)
X_HAQESAC/X.* = HAQESAC results files for annotating translated ESTs (haqesac=T only)
X_seqfiles/X.cluster*.fas = fasta files of translations and BLAST hits in NR clusters (clusterfas=T only)

Lastly, reformatted MASCOT files are produced, named after the original input file (Y):

Y.mascot.txt = header information from the MASCOT file.
Y.mascot.csv = the delimited data portion of the MASCOT file.
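
The main results table can be inspected with a few lines of Python. This sketch assumes that X.budapest.tdt is tab-delimited text with a single header row; the column names are not listed here, so the code makes no assumptions about them and simply returns each row keyed by whatever headers the file contains.

```python
import csv


def read_tdt(path):
    """Load a tab-delimited table as a list of {header: value} rows."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Example (hypothetical basefile "run1"):
#   rows = read_tdt("run1.budapest.tdt")
#   print(len(rows), "rows; columns:", list(rows[0]) if rows else [])
```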

Commandline

INPUT OPTIONS

mascot=FILE : Name of MASCOT csv file [None]
seqin=FILE : Name of EST fasta file used for search [None]
searchdb=FILE : Fasta file for GABLAM search of EST translations [None]
partial=T/F : Whether partial EST data is acceptable (True) or all MASCOT hits must be found (False) [True]
itraq=T/F : Whether data is from an iTRAQ experiment [False]
empai=T/F : Whether emPAI data is present in MASCOT file [True]
samples=LIST : List of X:Y, where X is an iTRAQ isotag and Y is a sample []

PROCESSING OPTIONS

minpolyat=X : Min length of poly-A/T to be counted. (-1 = ignore all) [10]
fwdonly=T/F : Whether to treat EST/cDNA sequences as coding strands (False = search all 6RF) [False]
minorf=X : Min length of ORFs to be considered [10]
topblast=X : Report the top X BLAST results [10]
minaln=X : Min length of shared region for FIESTA consensus assembly [20]
minid=X : Min identity of shared region for FIESTA consensus assembly [95.0]
minpep=X : Minimum number of different peptides mapped to final translation/consensus [2]
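
The effect of minpep can be sketched as a simple filter on the number of different peptides mapped to each translation or consensus. This is an illustration of the idea under assumed data structures, not BUDAPEST's implementation: the function name, the dictionary layout and the accession numbers are hypothetical.

```python
def filter_by_min_peptides(peptides_by_sequence, minpep=2):
    """Keep sequences with at least minpep *different* mapped peptides.

    peptides_by_sequence maps a sequence name to the list of peptide
    strings mapped to it; repeated peptides are counted once.
    """
    return {name: peptides
            for name, peptides in peptides_by_sequence.items()
            if len(set(peptides)) >= minpep}


hits = {
    "BUD001": ["AAK", "AAK", "LLR"],  # two different peptides: kept
    "BUD002": ["AAK", "AAK"],         # one peptide seen twice: removed
}
kept = filter_by_min_peptides(hits, minpep=2)
```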

SEQUENCE FORMATTING

newacc=X : New base for sequence accession numbers ['BUD']
gnspacc=T/F : Convert sequences into gene_SPECIES__AccNum format wherever possible. [True]
spcode=X : Species code for EST sequences [None]

OUTPUT OPTIONS

basefile=X : "Base" name for all results files, e.g. X.budapest.tdt [MASCOT file basename]
hitdata=LIST : List of hit data to add to main budapest table [prot_mass,prot_pi]
seqcluster=T/F : Perform additional sequence (BLAST/GABLAM) clustering [True]
clusterfas=T/F : Generate fasta files of translations and BLAST hits in NR clusters [False]
clustertree=LIST : List of formats for cluster tree output (3+ seqs only) [text,nsf,png]
fiestacons=T/F : Use FIESTA to auto-construct consensi from BUDAPEST RF translations [True]
haqesac=T/F : HAQESAC analysis of identified EST translations [True]
blastcut=X : Reduce the number of sequences in HAQESAC runs to X (0 = no reduction) [50]
multihaq=T/F : Whether to run HAQESAC in two phases, with a second, manual phase [False]
cleanhaq=T/F : Delete excessive HAQESAC results files [True]
haqdb=FILE : Optional additional search database for MultiHAQ analysis [None]

History

Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Reworked the pipeline in the light of discoveries made from version 0.0 runs.
    # 1.0 - Working version for basic analysis.
    # 1.1 - Modified to work with new MASCOT column headers.
    # 1.2 - Added tracking of MASCOT data, results tables and division of EST-RFs.
    # 1.3 - Split clustering into two levels: peptide and sequence clustering
    # 1.4 - Added FIESTA auto-construction of consensi from BUDAPEST RF translations [True]
    # 1.5 - Added MinPep filtering.
    # 1.6 - Improved tracking of peptides to final consensus sequences and output details.
    # 1.7 - Added menu and extra control of interactivity. Removed rfhits=F option.
    # 1.8 - Added preliminary iTRAQ handling.
    # 1.9 - Bug fixed for new MASCOT output.
    # 2.0 - Revised version using rje_mascot object for loading.
    # 2.1 - Improved handling of iTRAQ data using rje_mascot V1.2.
    # 2.2 - Removed unrequired rje_dismatrix import.
    # 2.3 - Updated to use rje_blast_V2. Needs further updates for BLAST+. Deleted obsolete OLDreadMascot() method.
