Module:	SAAGA
Description:	Summarise, Annotate & Assess Genome Annotations
Version:	0.7.7
Last Edit:	25/11/21
Citation:	Edwards RJ et al. (2021), BMC Genomics 22, 188 https://doi.org/10.1186/s12864-021-07493-6
Assess Citation:	Stuart KC et al. (2021), bioRxiv https://doi.org/10.1101/2021.04.07.438753
GitHub:	http://github.com/slimsuite/saaga

Imported modules: rje rje_db rje_gff rje_obj rje_rmd rje_seqlist

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

SAAGA is a tool for summarising, annotating and assessing genome annotations, with a particular focus on annotations generated by GeMoMa. The core of SAAGA is reciprocal MMseqs2 searches of the annotation and reference proteomes. These are used to identify the best hits for protein product identification and to assess annotations based on query and hit coverage. SAAGA will also generate annotation summary statistics, and extract the longest protein from each gene for a representative non-redundant proteome (e.g. for BUSCO analysis).
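
The longest-protein-per-gene idea described above can be sketched as follows. This is a minimal illustration and not SAAGA's own code: the function name, the two input dictionaries and the tie-breaking rule are all assumptions made for the example.

```python
def longest_per_gene(proteins, gene_of):
    """Return {gene: protein_id}, keeping the longest protein of each gene.

    proteins: dict of protein_id -> amino acid sequence (hypothetical input)
    gene_of:  dict of protein_id -> parent gene identifier (hypothetical input)
    Ties go to the protein_id that sorts first, so the result is deterministic.
    """
    best = {}
    for pid in sorted(proteins):
        gene = gene_of[pid]
        seq_len = len(proteins[pid].rstrip("*"))  # ignore a trailing stop symbol
        if gene not in best or seq_len > best[gene][1]:
            best[gene] = (pid, seq_len)
    return {gene: pid for gene, (pid, _) in best.items()}


proteins = {"g1.t1": "MKT", "g1.t2": "MKTAYIAK", "g2.t1": "MSLQ*"}
gene_of = {"g1.t1": "g1", "g1.t2": "g1", "g2.t1": "g2"}
print(longest_per_gene(proteins, gene_of))
```

In a real run the gene-to-protein mapping would come from the GFF parent/child relationships rather than a hand-written dictionary.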

Run modes

assess = Assess annotation using reference annotation (e.g. a reference organism proteome)
annotate = Rename annotation using reference annotation (could be Swissprot)
longest = Extract the longest protein per gene
mmseqs = Run the MMseqs2 steps in preparation for further analysis
summarise = Summarise annotation from GFF file
taxonomy = Summarise taxonomic assignments for contamination assessments (Taxolotl)

Commandline

Input/Output options

seqin=FILE : Protein annotation file to assess [annotation.faa]
gffin=FILE : Protein annotation GFF file [annotation.gff]
cdsin=FILE : Optional transcript annotation file for renaming and/or longest isoform extraction [annotation.fna]
assembly=FILE : Optional genome fasta file (required for some outputs) [None]
refprot=FILE : Reference proteome for mapping data onto [refproteome.fasta]
refdb=FILE : Reference proteome MMseqs2 database (over-rules mmseqdb path) []
mmseqdb=PATH : Directory in which to find/create MMseqs2 databases [./mmseqdb/]
mmsearch=PATH : Directory in which to find/create MMseqs2 search results [./mmsearch/]
basefile=X : Prefix for output files [$SEQBASE.$REFBASE]
gffgene=X : Label for GFF gene feature type ['gene']
gffcds=X : Label for GFF CDS feature type ['CDS']
gffmrna=X : Label for GFF mRNA feature type ['mRNA']
gffdesc=X : GFF output field label for annotated proteins (e.g. note, product) [product]
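
As an illustration of the default output prefix [$SEQBASE.$REFBASE], the sketch below derives it from the seqin and refprot file names. It assumes that each base is the file name with its directory and final extension removed; that rule is an assumption for the example and may differ from what SAAGA does internally. The function name is hypothetical.

```python
import os


def default_basefile(seqin, refprot):
    """Build a $SEQBASE.$REFBASE style prefix (assumed rule: strip path and extension)."""
    seqbase = os.path.splitext(os.path.basename(seqin))[0]
    refbase = os.path.splitext(os.path.basename(refprot))[0]
    return f"{seqbase}.{refbase}"


print(default_basefile("data/annotation.faa", "ref/refproteome.fasta"))
```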

Run mode options

annotate=T/F : Rename annotation using reference annotation (could be Swissprot) [False]
assess=T/F : Assess annotation using reference annotation [False]
longest=T/F : Extract longest protein per gene into *.longest.faa [False]
mmseqs=T/F : Run the MMseqs2 steps in preparation for further analysis [True]
summarise=T/F : Summarise annotation from GFF file [True]
taxonomy=T/F : Summarise taxonomic assignments for contamination assessments (Taxolotl) [False]
dochtml=T/F : Generate HTML SAAGA documentation (*.docs.html) instead of main run [False]
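
The options above are given as key=value pairs, with T/F for switches. A small helper for assembling such a command line from Python might look like the sketch below. The script name saaga.py and the helper itself are assumptions for illustration; check your own installation for the actual script path.

```python
import shlex


def build_saaga_command(script="saaga.py", **options):
    """Assemble a key=value style argument list (script name is an assumption)."""
    args = ["python", script]
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"  # switches are written as T/F
        args.append(f"{key}={value}")
    return args


cmd = build_saaga_command(
    seqin="annotation.faa",
    gffin="annotation.gff",
    refprot="refproteome.fasta",
    assess=True,
    longest=True,
)
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with unusual file names.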

Search and filter options

tophits=INT : Restrict mmseqs hits to the top X hits [250]
minglobid=PERC : Minimum global query percentage identity for a hit to be kept [40.0]
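
One way to picture how tophits and minglobid interact is the sketch below. It is not SAAGA's implementation: the hit dictionary keys ('query', 'hit', 'score', 'globid') are hypothetical, and applying the top-hit cut before the identity filter is an assumption.

```python
from collections import defaultdict


def filter_hits(hits, tophits=250, minglobid=40.0):
    """Keep up to `tophits` best-scoring hits per query, then drop weak ones.

    Each hit is a dict with hypothetical keys 'query', 'hit', 'score' and
    'globid' (global query percentage identity).
    """
    by_query = defaultdict(list)
    for hit in hits:
        by_query[hit["query"]].append(hit)
    kept = []
    for query in sorted(by_query):
        ranked = sorted(by_query[query], key=lambda h: h["score"], reverse=True)
        kept.extend(h for h in ranked[:tophits] if h["globid"] >= minglobid)
    return kept


hits = [
    {"query": "q1", "hit": "A", "score": 200, "globid": 85.0},
    {"query": "q1", "hit": "B", "score": 150, "globid": 35.0},
    {"query": "q1", "hit": "C", "score": 90, "globid": 60.0},
]
print([h["hit"] for h in filter_hits(hits)])
print([h["hit"] for h in filter_hits(hits, tophits=2)])
```

With the defaults, hit B is dropped for falling below 40% global identity. With tophits=2, hit C is also lost because it ranks third by score.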

Precomputed MMseqs2 options

mmqrymap=TSV : Tab-delimited output for query versus reference search (see docs) [$SEQBASE.$REFBASE.mmseq.tsv]
mmhitmap=TSV : Tab-delimited output for reference versus query search (see docs) [$REFBASE.$SEQBASE.mmseq.tsv]

Batch Run options

batchseq=FILELIST : List of seqin=FILE annotation proteomes for comparison
batchref=FILELIST : List of refprot=FILE reference proteomes for comparison

Taxonomy options

taxdb=FILE : MMseqs2 taxonomy database for taxonomy assignment [seqTaxDB]
taxbase=X : Output prefix for taxonomy output [$SEQBASE.$TAXADB]
taxorfs=T/F : Whether to generate ORFs from assembly if no seqin=FILE given [True]
taxbyseq=T/F : Whether to parse and generate taxonomy output for each assembly (GFF) sequence [True]
taxbycontig=T/F : Whether to generate taxonomy output for each contig if the assembly is loaded [True]
taxbyseqfull=T/F : Whether to generate full easy taxonomy report outputs for each assembly (GFF) sequence [False]
taxsubsets=FILELIST : Files (fasta/id) with sets of assembly input sequences (matching GFF) to summarise []
taxlevels=LIST : List of taxonomic levels to report (* for superkingdom and below) ['*']
taxwarnrank=X : Taxonomic rank (and above) at which to warn of deviation from the consensus [family]
bestlineage=T/F : Whether to enforce a single lineage for best taxa ratings [True]
mintaxnum=INT : Minimum gene count in main dataset to keep taxon, else merge with higher level [2]
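
The mintaxnum behaviour (merging rare taxa with a higher level) can be pictured with the sketch below. It is an illustration only, assuming a simple child-to-parent lookup; the function name and data structures are hypothetical and the real Taxolotl logic may differ.

```python
def merge_rare_taxa(counts, parent, mintaxnum=2):
    """Fold taxa with fewer than `mintaxnum` genes into their parent taxon.

    counts: dict of taxon -> gene count assigned directly to that taxon
    parent: dict of taxon -> parent taxon (top-level taxa are absent)
    """
    def depth(taxon):
        steps = 0
        while parent.get(taxon) is not None:
            taxon = parent[taxon]
            steps += 1
        return steps

    merged = dict(counts)
    changed = True
    while changed:
        changed = False
        # Visit the deepest taxa first so that counts cascade upwards.
        for taxon in sorted(merged, key=depth, reverse=True):
            up = parent.get(taxon)
            if merged[taxon] < mintaxnum and up is not None:
                merged[up] = merged.get(up, 0) + merged.pop(taxon)
                changed = True
                break  # restart with the updated counts
    return merged


counts = {"Homo": 10, "Pan": 1, "Gorilla": 1}
parent = {"Homo": "Hominidae", "Pan": "Hominidae",
          "Gorilla": "Hominidae", "Hominidae": "Primates"}
print(merge_rare_taxa(counts, parent))
```

Here the two single-gene genera are pooled into their shared family, which then meets the threshold and is kept. A taxon with no parent in the lookup is left as-is even if it is below the threshold.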

System options

forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more processes [0]
tmpdir=PATH : Temporary directory path for running mmseqs2 [./tmp/]

Module Version History

# 0.0.0 - Initial Compilation.
# 0.1.0 - Initial working version. Needs improved documentation.
# 0.2.0 - Added extra annotation/longest output for CDS and GFF.
# 0.2.1 - Renamed to SAAGA and tidied some documentation.
# 0.3.0 - Added some additional hit info to annotation and reworked to allow multiple query-hit pairs.
# 0.3.1 - Fixed assess bug and sped up GFF parsing.
# 0.4.0 - Added tophits=X [250] and minglobid=X [40.0] options, plus globid and hitnum to output.
# 0.5.0 - Added definitions for gffgene=X, gffcds=X and gffmrna=X. Modified output.
# 0.5.1 - Tidied some of the code and added some identifier checks for GFF and Fasta input.
# 0.5.2 - Fixed issue with swapped transcript and exon feature identifiers following v0.5.1 tidying.
# 0.5.3 - Added pident compatibility with updated MMseqs2. Updated documentation. Modified some stats calculations.
# 0.5.4 - Added restricted feature parsing from GFF. Fixed GFF type input bug.
# 0.6.0 - Added more graceful failure if no sequences loaded. Added GFF renaming output field options. Fixed GFF output bug.
# 0.7.0 - Added taxonomy mode for taxonomic summaries and contamination checks.
# 0.7.1 - Added taxorfs setting to generate ORFs in absence of GFF or protein file.
# 0.7.2 - Updated docstring. Added rating to lca_genes. Added batchrun for matching seqin/gffin pairs. Added GFF output.
# 0.7.3 - Fixed lca_genes rating and added taxbycontig=T/F taxonomy output for each contig if the assembly is loaded.
# 0.7.4 - Updated some of the outputs to Taxolotl rather than SAAGA.
# 0.7.5 - Added bestlineage=T/F : Whether to enforce a single lineage for best taxa ratings [True]
# 0.7.6 - Fixed GFF output.
# 0.7.7 - Fixed contig output for Taxolotl.

SAAGA REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
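
Based on the layout shown above, the full server output could be split into its OUTFMT sections along the lines of the sketch below. It assumes each section starts with a ###~...~### divider followed by a "# OUTFMT:" line; the section names used in the example ("status", "log") are hypothetical.

```python
def parse_rest_full(text):
    """Split full REST output into {OUTFMT: contents} sections."""
    sections = {}
    current = None
    expect_header = False
    for line in text.splitlines():
        if line.startswith("###~") and line.endswith("~###"):
            expect_header = True  # the next line should name the section
            continue
        if expect_header and line.startswith("# ") and line.rstrip().endswith(":"):
            current = line.rstrip()[2:-1]
            sections[current] = []
            expect_header = False
            continue
        expect_header = False
        if current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


example = (
    "###~~~~~~~~###\n# status:\nrun finished\n"
    "###~~~~~~~~###\n# log:\nline 1\nline 2\n"
)
print(parse_rest_full(example))
```

Any text before the first divider is ignored, which seems a reasonable default for a server response that may carry a preamble.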

Available REST Outputs

There is currently no specific help available on REST output for this program.
