SLiMSuite REST Server



REVERT V0.9.1

Retrovirus and Endogenous Viral Element Reconstruction Tool

Module: REVERT
Description: Retrovirus and Endogenous Viral Element Reconstruction Tool
Version: 0.9.1
Last Edit: 08/09/16

Copyright © 2014 Richard J. Edwards - See source code for GNU License Notice


Imported modules: rje rje_blast_V2 rje_db rje_forker rje_genbank rje_menu rje_obj rje_ppi rje_seqlist rje_taxonomy rje_tree rje_dismatrix_V2 gablam fiesta


See SLiMSuite Blog for further documentation.

Function

REVERT (the Retrovirus and Endogenous Viral Element Reconstruction Tool) is an automated utility for discovering candidate viruses and endogenous viral element (EVE) sequences in host genomes. (Transcriptome data, such as EST libraries, could also be used as input, in which case REVERT should also be able to retrieve expressed contemporary viruses.)

NOTE: Version 0.7.0 has reworked the pipeline slightly. Updated docs coming soon. (REST service output might also be temporarily affected.)

REVERT Analysis Pipeline

1. A set of viral genomes is (downloaded to and) loaded from a GenBank file, as determined by virusgb=FILE/LIST.
This is then used to create a *.full.fas file of full-length genomes (for later mapping) and *.prot.fas file
of proteins for initial searching against a fasta-format host genome, set by genome=FILE. The output basefile
defaults to virusgb.genome but can be over-ruled by basefile=FILE. NB. A set of files can be set up for a
batch run using vbatch=FILELIST and gbatch=FILELIST (wildcards allowed).

2. GABLAM tblastn search of viral proteins against host genome DNA. This will produce *.gablam.tdt, *.hitsum.tdt
and *.local.tdt summary files, and a fasta file per protein in a directory set by fasdir=PATH. The fixed
settings for GABLAM are fasout=T fragfas=T combined=T. Other GABLAM settings can be altered on the commandline.
By default, GABLAM will use the BLAST complexity filter and composition-based statistics (blastf=T blastcf=T).
This GABLAM search uses the fullblast=T switch by default. (Use fullblast=F to switch off.) Only Ordered GABLAM
output is returned (outstats=GABLAMO) with a default cut-off of 50% viral protein (gablamcut=0.5
cutstat=OrderedAlnLen cutfocus=Query).

3. The combined hits (if any) from the GABLAM search are mapped back onto the viral proteins and genome using a
second BLAST search without the complexity filter: (1) a blastn search, aligning genome hits against the viral
genome; (2) a blastx search aligning the viral proteins against the genome hits; (3) a tblastn search, aligning
the translated genome hits against the original viral proteins. The latter is converted into a set of fasta
alignments. The assemblies use the BLAST+ flat query-anchored output and can include overlapping hits.
NOTE: The second BLAST may identify additional protein hits between the viral proteins and the genome regions that
were missed by the original BLAST! Such examples might have HitNum of 0 but FragNum > 0.

4. The protein alignments are summarised for each virus-genome comparison and converted into % coverage and %
identity (of the viral proteins), which are summed up across all proteins in the viral proteome. If running in
batch mode, these will be compiled into a single protein table and single virus summary table:
- *.revert.details.tdt = each viral protein against each genome
- *.revert.tdt = each virus against each genome

5. Compiled results for the batch file run will then be generated and used to make additional outputs. For the virus
and genome tables, an extra Alias field will be added to use for visualisation outputs. If aliasfile=FILE is
given then aliases will be read from this file. Where missing, commandline prompts for aliases will be given. If
no alias is given (or aliases=F) then the full virus or genome name will be used. Outputs are:
- *.revert.virus.tdt = a single line for each virus, summarising the total number of genomes hit.
- *.revert.protein.tdt = a single line for each viral protein, summarising the total number of genomes hit.
- *.revert.genome.tdt = a single line for each genome, summarising the total number of viruses hit.

6. TO BE ADDED:
A final compilation mode will GABLAM all (non-redundant) genome fragments against hit proteins, establish the best
hits and apply stricter cut-offs for both protein and virus hits.
mincov is also applied to the initial GABLAM.
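
The summary arithmetic in step 4 can be sketched as follows. This is a minimal illustration and not REVERT's own code: the `ProteinHit` class, its field names and the example numbers are hypothetical. It also assumes that both percentages are taken relative to viral protein length and that the virus-level totals are length-weighted across the proteome, which is one reading of "summed up across all proteins".

```python
from dataclasses import dataclass

@dataclass
class ProteinHit:
    name: str
    length: int     # viral protein length in residues
    covered: int    # residues covered by aligned genome fragments
    identical: int  # covered residues identical to the viral protein

def pct(part, whole):
    """Percentage of whole, returning 0.0 for an empty denominator."""
    return 100.0 * part / whole if whole else 0.0

def summarise_virus(hits):
    """Return per-protein (name, %coverage, %identity) rows and proteome-wide totals."""
    details = [(h.name, pct(h.covered, h.length), pct(h.identical, h.length))
               for h in hits]
    total = sum(h.length for h in hits)
    virus = (pct(sum(h.covered for h in hits), total),
             pct(sum(h.identical for h in hits), total))
    return details, virus

# Invented example values for two proteins of one virus.
hits = [ProteinHit("protA", 200, 100, 50), ProteinHit("protB", 300, 300, 150)]
details, virus = summarise_virus(hits)
for row in details:
    print(row)
print(virus)
```

The per-protein rows correspond in spirit to *.revert.details.tdt and the proteome-wide pair to *.revert.tdt, although the real tables carry more fields than shown here.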

Commandline

Input Options

virusgb=FILE/LIST # Either a genbank download (must end *.gb) or a list of GenBank viral nuccore UIDs [virus.gb]
vbatch=FILELIST # Run REVERT on a list of virusgb files (over-rides virusgb=FILE). Wildcards allowed. []
genome=FILE # Fasta file containing contigs or chromosomes of host genome [genome.fas]
gbatch=FILELIST # Run REVERT on a list of genome files (over-rides genome=FILE). Wildcards allowed. []
searchdb=FILE # Fasta file of viral (& host?) proteins for assembly annotation. Use viral proteins if None. []
taxdir=PATH/ # Will look in this directory for taxonomy input files if not found ['SourceData/']
virusdir=PATH/ # Will look in this directory for single viral gb files ['VirusGB/']
sourcedate=DATE # Source file date (YYYY-MM-DD) for taxonomy files to preferentially use [None]
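
As a rough illustration of how the input options combine, the sketch below assembles an argument list in the option=value style used above. The script name `revert.py`, the helper names, and the way the default basefile is derived from the two file stems are assumptions made for the example; only the option names come from this page.

```python
from pathlib import Path

def default_basefile(virusgb, genome):
    """Assumed reading of the 'virusgb.genome' default: join the two file stems."""
    return f"{Path(virusgb).stem}.{Path(genome).stem}"

def build_revert_command(virusgb, genome, basefile="", script="revert.py"):
    """Assemble an argument list in SLiMSuite's option=value style."""
    args = ["python", script, f"virusgb={virusgb}", f"genome={genome}"]
    if basefile:  # only pass basefile when over-ruling the default
        args.append(f"basefile={basefile}")
    return args

cmd = build_revert_command("virus.gb", "genome.fas")
print(" ".join(cmd))
print(default_basefile("virus.gb", "genome.fas"))
```

The list form can be handed to `subprocess.run` unchanged, which avoids shell quoting problems with paths that contain spaces.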

Basic Output Options

basefile=FILE # Root of output files. Defaults to virusgb.genome if blank/None [None]
revertdir=PATH/ # Directory to output run files into. (Prefixes basefile for single runs.) [REVERT/]
fasdir=PATH # Directory in which to save fasta files [RevertDir/GABLAMFAS/genome/]
blastdir=PATH # BLAST directory for GABLAM BLAST files if keepblast=T [RevertDir/BLAST/]
keepblast=T/F # Whether to keep GABLAM BLAST files for reference/reuse [True]
blaste=X # BLAST evalue cut-off [1e-10]

Results Compilation/Filtering Options

revertnr=T/F # Compile batch run results with non-redundancy and quality filtering [True]
gablamfrag=X # Length of gaps between mapped residues for fragmenting local hits [100]
minpcov=X # Min. %coverage for viral proteins during NR reduction [40.0]
minplocid=X # Min. local %identity for viral proteins during NR reduction [30.0]
minvcov=X # Min. %coverage for viral genomes during NR reduction [40.0]
minvlocid=X # Min. local %identity for viral genomes during NR reduction [30.0]
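
The effect of the protein-level thresholds can be illustrated with a small filter. This is a hedged sketch only: the function name, the dictionary keys and the example values are hypothetical, and treating the cut-offs as inclusive (>=) is an assumption. The defaults are the minpcov and minplocid values listed above.

```python
def passes_nr_filter(coverage, local_identity, mincov=40.0, minlocid=30.0):
    """True if a hit meets both the %coverage and local %identity minimums."""
    return coverage >= mincov and local_identity >= minlocid

# Invented example hits; the keys are illustrative, not REVERT field names.
proteins = [
    {"Protein": "protA", "Coverage": 85.0, "LocalID": 42.5},
    {"Protein": "protB", "Coverage": 35.0, "LocalID": 60.0},
    {"Protein": "protC", "Coverage": 55.0, "LocalID": 25.0},
]
kept = [p["Protein"] for p in proteins
        if passes_nr_filter(p["Coverage"], p["LocalID"])]
print(kept)
```

The same function could be reused for the viral genome thresholds by passing the minvcov and minvlocid values as `mincov` and `minlocid`.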

Advanced Output Options

aliases=T/F # Whether to use aliases in additional outputs [True]
aliasfile=FILE # Delimited file of 'Name','Alias' to use in place of batch summary files [*.alias.tdt]
vspcode=T/F # Whether to use viral species codes (even if invented) for VirusID aliases [True]
gspcode=T/F # Whether to use genome species codes for Genome field if able to parse name [False]
vgablam=T/F # Whether to compile viral genomes/proteomes and conduct all-by-all GABLAM for graphs [True]
graphformats=LIST # Formats for virus-genome graph outputs (svg/xgmml/png/html) (dev=T only) [xgmml,png]
treeformats=LIST # List of output formats for generated trees (see rje_tree.py) (dev=T only) [nwk,png]
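
The alias behaviour (aliases=T/F with aliasfile=FILE) amounts to a lookup that falls back to the full name. The sketch below assumes a tab-delimited file with a header row of Name and Alias; the function names and example entries are hypothetical, and the interactive prompts for missing aliases are not modelled.

```python
import csv
import io

def load_aliases(handle):
    """Read Name/Alias pairs, skipping rows whose Alias is blank."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: row["Alias"] for row in reader if row["Alias"]}

def display_name(name, aliases, use_aliases=True):
    """Return the alias if enabled and known, otherwise the full name."""
    if not use_aliases:
        return name
    return aliases.get(name, name)

# Invented alias table: the second entry has no alias.
alias_text = "Name\tAlias\nExample virus A\tEVA\nExample virus B\t\n"
aliases = load_aliases(io.StringIO(alias_text))
print(display_name("Example virus A", aliases))
print(display_name("Example virus B", aliases))
print(display_name("Example virus A", aliases, use_aliases=False))
```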

Advanced Run Options

farmgablam=X # Whether to run a pre-REVERT farming of batch BLAST searches using X forks each [0]

Special Run Options

vgbparse=T/F # Whether to parse virus IDs from vbatch tables and output BASEFILE.HOST.acc files [False]
vhost=LIST # List of viral hosts to output new files for. (Blank for all) []


See also rje.py generic commandline options.

History

Module Version History

    # 0.0.0 - Initial Compilation.
    # 0.1.0 - Improved pickup of existing results.
    # 0.2.0 - Additional output for visualisation of results.
    # 0.3.0 - Added extra mode for generating viral accnum lists from http://www.ncbi.nlm.nih.gov/genome/viruses/.
    # 0.4.0 - Altered *.protein.tdt output to *.details.tdt output and added new *.protein.tdt file.
    # 0.4.1 - Modified to use the new FullBlast GABLAM mode for speed.
    # 0.4.2 - Modified to farm out GABLAM BLAST searches as forks.
    # 0.4.3 - Temp fix for QSub runs with large wall-times being used for PPI Spring Layout walltime.
    # 0.4.4 - Removed limitation of farmgablam being < 1/2 forks.
    # 0.5.0 - Altered defaults and added some extra default GABLAM filtering.
    # 0.6.0 - Addition of revertnr=T/F method to remove redundancy across searches. Made default.
    # 0.7.0 - Reworking and tidying to make use of virus directory the default.
    # 0.7.1 - Minor bug fixing for REVERT REST Server.
    # 0.7.2 - Minor bug fix for updated RJE_DB module.
    # 0.8.0 - Added Local Table output to REST server and fixed virus parse bug.
    # 0.8.1 - Fixed virus genbank from ID file bug. (Need to re-check REST server.)
    # 0.9.0 - Major reworking of Alias generation/use. Deletion of some OLD methods.
    # 0.9.1 - Fixed up a few bugs and outputs. Ready for REST testing.
    # 0.9.2 - Added gablamfrag=100 default to GABLAM object call to hopefully counter GABLAM default switch to 1.

REVERT REST Output formats

REVERT is a high-throughput pipeline for finding evidence of endogenous viral elements (EVEs) within a genome.
Input is one or more GenBank IDs corresponding to annotated viral genomes (e.g. NC_001542), and a host genome.
Available genomes can be accessed at http://rest.slimsuite.unsw.edu.au/alias&genome. Please contact us to add
more genomes. (Alternatively, the standalone version of REVERT can be run on multiple bespoke genomes.)

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT:

nr = Main REVERT summary table of non-redundant (NR) hits between viruses and the search genome. [tdt]
nr.details = Breakdown by protein of main REVERT results. [tdt]
consensus = Reconstructed consensus sequences from genome fragments for each NR protein [fas]
revert = Full REVERT summary table before NR/QC filtering [tdt]
details = Breakdown of full summary table by protein [tdt]
local = Details of local alignment positions in genome and gfrag sequences [tdt]
gfrag = Extracted genomic regions with putative EVEs [fas]
virusgb = Genbank format download of input viruses [gbk]
virus.locus = Summary table of viruses extracted from genbank [tdt]
virus.feature = Table of annotated viral features. (Protein/CDS needed for REVERT analysis) [tdt]
virus.prot = Fasta file of viral proteins used in analysis [fas]
virus.full = Full genome sequences of viruses used in analysis [fas]
genome = Genome sequence used for analysis. (May be too big to download) [fas]
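
A request URL for one of these outputs might be composed as sketched below. The `&option=value` pattern follows the server URLs quoted on this page, but using `virusgb` and `genome` as REST parameter names (mirroring the commandline options) is an assumption, and `GENOME_ALIAS` is a placeholder for a genome alias from the server's alias list. The sketch only builds the string and does not contact the server.

```python
REST_BASE = "http://rest.slimsuite.unsw.edu.au"

# Output identifiers listed above, plus the general &rest= modes.
REVERT_OUTPUTS = {
    "nr", "nr.details", "consensus", "revert", "details", "local", "gfrag",
    "virusgb", "virus.locus", "virus.feature", "virus.prot", "virus.full",
    "genome", "help", "full", "format",
}

def revert_rest_url(virusgb, genome, outfmt="format"):
    """Build a REVERT REST URL; parameter names are assumed, not confirmed."""
    if outfmt not in REVERT_OUTPUTS:
        raise ValueError(f"Unknown REVERT output format: {outfmt}")
    return f"{REST_BASE}/revert&virusgb={virusgb}&genome={genome}&rest={outfmt}"

url = revert_rest_url("NC_001542", "GENOME_ALIAS", "nr")
print(url)
```

Validating the output name before building the URL catches typos locally rather than relying on the server's response.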

More details of outputs can be found in the [REVERT Help](http://rest.slimsuite.unsw.edu.au/revert) and/or
[SLiMSuite blog](http://slimsuite.blogspot.com.au/). Please get in touch if anything is not clear.

© 2015 RJ Edwards. Contact: richard.edwards@unsw.edu.au.