SAAGA is a tool for summarising, annotating and assessing genome annotations, with a particular focus on annotations
generated by GeMoMa. The core of SAAGA is reciprocal MMseqs2 searches of the annotation and reference proteomes. These
are used to identify the best hits for protein product identification and to assess annotations based on query and
hit coverage. SAAGA will also generate annotation summary statistics, and extract the longest protein from each gene
for a representative non-redundant proteome (e.g. for BUSCO analysis).
Run modes
assess = Assess annotation using reference annotation (e.g. a reference organism proteome)
annotate = Rename annotation using reference annotation (could be Swissprot)
longest = Extract the longest protein per gene
mmseqs = Run the MMseqs2 steps in preparation for further analysis
summarise = Summarise annotation from GFF file
taxonomy = Summarise taxonomic assignments for contamination assessments (Taxolotl)
Input/Output options
seqin=FILE : Protein annotation file to assess [
gffin=FILE : Protein annotation GFF file [
cdsin=FILE : Optional transcript annotation file for renaming and/or longest isoform extraction [
assembly=FILE : Optional genome fasta file (required for some outputs) [
refprot=FILE : Reference proteome for mapping data onto [
refdb=FILE : Reference proteome MMseqs2 database (over-rules mmseqdb path) 
mmseqdb=PATH : Directory in which to find/create MMseqs2 databases [
mmsearch=PATH : Directory in which to find/create MMseqs2 search results [
basefile=X : Prefix for output files [
gffgene=X : Label for GFF gene feature type [
gffcds=X : Label for GFF CDS feature type [
gffmrna=X : Label for GFF mRNA feature type [
gffdesc=X : GFF output field label for annotated proteins (e.g. note, product) [
Run mode options
annotate=T/F : Rename annotation using reference annotation (could be Swissprot) [
assess=T/F : Assess annotation using reference annotation [
longest=T/F : Extract longest protein per gene into *.longest.faa [
mmseqs=T/F : Run the MMseqs2 steps in preparation for further analysis [
summarise=T/F : Summarise annotation from GFF file [
taxonomy=T/F : Summarise taxonomic assignments for contamination assessments (Taxolotl) [
dochtml=T/F : Generate HTML SAAGA documentation (*.docs.html) instead of main run [
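
The longest=T/F mode keeps one representative protein per gene. A minimal sketch of that selection is given below. It is not SAAGA's own code: the function name, the protein-to-gene mapping and the tie-breaking rule are assumptions made for illustration.

```python
def longest_per_gene(proteins, gene_of):
    """Return {gene: protein_id}, keeping the longest protein for each gene.

    proteins: dict of protein ID -> amino acid sequence
    gene_of:  dict of protein ID -> parent gene ID (hypothetical mapping,
              e.g. derived from mRNA Parent attributes in a GFF file)
    """
    best = {}
    for prot_id in sorted(proteins):  # sorted so that ties resolve deterministically
        gene = gene_of[prot_id]
        length = len(proteins[prot_id].rstrip("*"))  # ignore a trailing stop symbol
        if gene not in best or length > len(proteins[best[gene]].rstrip("*")):
            best[gene] = prot_id
    return best


# Toy data: gene g1 has two isoforms, gene g2 has one.
proteins = {"g1.t1": "MKT", "g1.t2": "MKTAYL*", "g2.t1": "MSS"}
gene_of = {"g1.t1": "g1", "g1.t2": "g1", "g2.t1": "g2"}
representative = longest_per_gene(proteins, gene_of)
```

With equal-length isoforms, this sketch keeps the one whose ID sorts first. The real tool may break ties differently.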
Search and filter options
tophits=INT : Restrict mmseqs hits to the top X hits [
minglobid=PERC : Minimum global query percentage identity for a hit to be kept [
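
The two filters above can be illustrated with a short sketch. It assumes that global query identity is the local percentage identity scaled by the fraction of the query covered by the alignment. That definition, the field names and the ranking by global identity are assumptions for illustration and are not taken from the SAAGA source. The 40.0 default for minglobid follows the version history, and tophits is passed explicitly because its default is not given here.

```python
def filter_hits(hits, tophits, minglobid=40.0):
    """Keep at most `tophits` hits per query with global identity >= minglobid.

    hits: list of dicts with (hypothetical) keys
          query, hit, pident (0-100), alnlen, qlen
    """
    by_query = {}
    for h in hits:
        # Assumed definition: percentage identity scaled by query coverage.
        globid = h["pident"] * h["alnlen"] / h["qlen"]
        if globid >= minglobid:
            by_query.setdefault(h["query"], []).append(dict(h, globid=globid))
    kept = []
    for query in sorted(by_query):
        ranked = sorted(by_query[query], key=lambda h: h["globid"], reverse=True)
        kept.extend(ranked[:tophits])
    return kept


hits = [
    {"query": "q1", "hit": "r1", "pident": 90.0, "alnlen": 100, "qlen": 100},
    {"query": "q1", "hit": "r2", "pident": 90.0, "alnlen": 50, "qlen": 100},
    {"query": "q1", "hit": "r3", "pident": 80.0, "alnlen": 40, "qlen": 100},
    {"query": "q2", "hit": "r1", "pident": 50.0, "alnlen": 100, "qlen": 100},
]
top1 = filter_hits(hits, tophits=1)
top5 = filter_hits(hits, tophits=5)
```

In this toy data the q1/r3 pair falls below the 40% threshold and is dropped whatever the value of tophits.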
Precomputed MMseqs2 options
mmqrymap=TSV : Tab-delimited output for query versus reference search (see docs) [
mmhitmap=TSV : Tab-delimited output for reference versus query search (see docs) [
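
The two search directions (query versus reference, and reference versus query) can be combined in several ways. The sketch below shows one generic technique, reciprocal best hits, on plain (query, hit, score) tuples. It makes no assumption about the column layout of the TSV files and does not claim to reproduce how SAAGA uses the two searches. All names are hypothetical.

```python
def best_hits(pairs):
    """pairs: iterable of (query, hit, score). Returns {query: best-scoring hit}."""
    best = {}
    for query, hit, score in pairs:
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _score) in best.items()}


def reciprocal_best_hits(qry_vs_ref, ref_vs_qry):
    """Return {query: reference} pairs that are each other's best hit."""
    forward = best_hits(qry_vs_ref)
    reverse = best_hits(ref_vs_qry)
    return {q: r for q, r in forward.items() if reverse.get(r) == q}


# Toy data: p1 and p2 both hit R1 best, but R1's best hit is p1.
qry_vs_ref = [("p1", "R1", 200), ("p1", "R2", 150), ("p2", "R1", 180)]
ref_vs_qry = [("R1", "p1", 200), ("R1", "p2", 180), ("R2", "p1", 150)]
rbh = reciprocal_best_hits(qry_vs_ref, ref_vs_qry)
```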
Batch Run options
batchseq=FILELIST : List of seqin=FILE annotation proteomes for comparison
batchref=FILELIST : List of refprot=FILE reference proteomes for comparison
Taxonomy options
taxdb=FILE : MMseqs2 taxonomy database for taxonomy assignment [
taxbase=X : Output prefix for taxonomy output [
taxorfs=T/F : Whether to generate ORFs from assembly if no seqin=FILE given [
taxbyseq=T/F : Whether to parse and generate taxonomy output for each assembly (GFF) sequence [
taxbycontig=T/F : Whether to generate taxonomy output for each contig if the assembly is loaded [
taxbyseqfull=T/F : Whether to generate full easy taxonomy report outputs for each assembly (GFF) sequence [
taxsubsets=FILELIST : Files (fasta/id) with sets of assembly input sequences (matching GFF) to summarise 
taxlevels=LIST : List of taxonomic levels to report (* for superkingdom and below) [
taxwarnrank=X : Taxonomic rank (and above) at which to warn when deviating from consensus [
bestlineage=T/F : Whether to enforce a single lineage for best taxa ratings [
mintaxnum=INT : Minimum gene count in main dataset to keep taxon, else merge with higher level [
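
The mintaxnum=INT rule (merge a taxon with its higher level when its gene count is too low) can be sketched as follows. This is an illustration under stated assumptions, not Taxolotl's actual implementation: the parent and rank lookups, the rank ordering and the function name are all hypothetical. Ranks are processed from lowest to highest, so a parent taxon is only tested after all of its children have been merged into it.

```python
def merge_small_taxa(counts, parent, rank, ranks, mintaxnum):
    """Merge taxa with fewer than `mintaxnum` genes into their parent taxon.

    counts: taxon -> gene count
    parent: taxon -> next higher taxon (absent for the top level)
    rank:   taxon -> rank name
    ranks:  rank names ordered from lowest to highest
    """
    merged = dict(counts)
    for level in ranks:  # lowest rank first
        for taxon in sorted(t for t in merged if rank[t] == level):
            if merged[taxon] < mintaxnum and taxon in parent:
                higher = parent[taxon]
                merged[higher] = merged.get(higher, 0) + merged.pop(taxon)
    return merged


counts = {"Homo": 3, "Pan": 3, "Mus": 10, "Rattus": 1}
parent = {"Homo": "Hominidae", "Pan": "Hominidae", "Mus": "Muridae",
          "Rattus": "Muridae", "Hominidae": "Primates", "Muridae": "Rodentia"}
rank = {"Homo": "genus", "Pan": "genus", "Mus": "genus", "Rattus": "genus",
        "Hominidae": "family", "Muridae": "family",
        "Primates": "order", "Rodentia": "order"}
merged = merge_small_taxa(counts, parent, rank, ["genus", "family", "order"], mintaxnum=5)
```

In this toy data Homo and Pan pool into Hominidae, which then passes the threshold, while the single Rattus gene is pushed up until no higher level is available.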
System options
forks=X : Number of parallel sequences to process at once [
killforks=X : Number of seconds of no activity before killing all remaining forks. [
forksleep=X : Sleep time (seconds) between cycles of forking out more processes [
tmpdir=PATH : Temporary directory path for running mmseqs2 [
Module Version History
# 0.0.0 - Initial Compilation.
# 0.1.0 - Initial working version. Needs improved documentation.
# 0.2.0 - Added extra annotation/longest output for CDS and GFF.
# 0.2.1 - Renamed to SAAGA and tidied some documentation.
# 0.3.0 - Added some additional hit info to annotation and reworked to allow multiple query-hit pairs.
# 0.3.1 - Fixed assess bug and sped up GFF parsing.
# 0.4.0 - Added tophits=X and minglobid=X [40.0] options, plus globid and hitnum to output.
# 0.5.0 - Added definitions for gffgene=X, gffcds=X and gffmrna=X. Modified output.
# 0.5.1 - Tidied some of the code and added some identifier checks for GFF and Fasta input.
# 0.5.2 - Fixed issue with swapped transcript and exon feature identifiers following v0.5.1 tidying.
# 0.5.3 - Added pident compatibility with updated MMseqs2. Updated documentation. Modified some stats calculations.
# 0.5.4 - Added restricted feature parsing from GFF. Fixed GFF type input bug.
# 0.6.0 - Added more graceful failure if no sequences loaded. Added GFF renaming output field options. Fixed GFF output bug.
# 0.7.0 - Added taxonomy mode for taxonomic summaries and contamination checks.
# 0.7.1 - Added taxorfs setting to generate ORFs in absence of GFF or protein file.
# 0.7.2 - Updated docstring. Added rating to lca_genes. Added batchrun for matching seqin/gffin pairs. Added GFF output.
# 0.7.3 - Fixed lca_genes rating and added taxbycontig=T/F taxonomy output for each contig if the assembly is loaded.
# 0.7.4 - Updated some of the outputs to Taxolotl rather than SAAGA.
# 0.7.5 - Added bestlineage=T/F : Whether to enforce a single lineage for best taxa ratings [True]
# 0.7.6 - Fixed GFF output.
# 0.7.7 - Fixed contig output for Taxolotl.
SAAGA REST Output formats
Available REST Outputs
There is currently no specific help available on REST output for this program.