Module:	BUSCOMP
Description:	BUSCO Compilation and Comparison tool
Version:	1.0.1
Last Edit:	10/01/22
Citation:	Stuart KC, Edwards RJ et al. (preprint). bioRxiv 2021.04.07.438753 (doi: 10.1101/2021.04.07.438753)
See Also:	Edwards RJ (2019). F1000Research 8:995 (slides) (doi: 10.7490/f1000research.1116972.1)

Imported modules: rje rje_obj rje_db rje_menu rje_paf rje_rmd rje_seqlist rje_busco

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

BUSCOMP is designed to overcome some of the non-deterministic limitations of BUSCO to:

1. compile a non-redundant maximal set of complete BUSCOs from a set of assemblies, and
2. use this set to provide a "true" comparison of completeness between different assemblies of the same genome with predictable behaviour.

For each BUSCO gene, BUSCOMP will extract the best "Single Complete" sequence from those available, using the full_table_*.tsv results table and single_copy_busco_sequences/ directory of hit sequences. BUSCOMP ranks all the hits across all assemblies by Score and keeps the top-ranking hits. Ties are then resolved by Length, keeping the longest sequence. Ties for Score and Length will keep an arbitrary entry as the winner. Single complete hits are given preference over Duplicate hits, even if they have a lower score, because only Single hits have their sequences saved by BUSCO in the single_copy_busco_sequences/ directory. This set of predicted gene sequences forms the "BUSCOMPSeq" gene set.
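
The ranking described above can be sketched in Python as follows. This is a minimal illustration, not the BUSCOMP source code: the Hit record, its field names and the pick_best_hits() function are hypothetical, and the status labels are assumed to follow the BUSCO full table ("Complete" for single-copy hits, "Duplicated" for duplicates).

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical record for one BUSCO full-table row (names are illustrative).
    busco_id: str
    assembly: str
    status: str   # assumed labels: "Complete" (single copy) or "Duplicated"
    score: float
    length: int


def pick_best_hits(hits):
    """Return {busco_id: best Hit} across all assemblies.

    Rank order: Single Complete before Duplicated, then Score, then Length.
    A full tie keeps whichever hit was seen first (an arbitrary winner).
    """
    best = {}
    for hit in hits:
        if hit.status not in ("Complete", "Duplicated"):
            continue  # Fragmented/Missing rows are not candidates
        key = (hit.status == "Complete", hit.score, hit.length)
        current = best.get(hit.busco_id)
        if current is None or key > current[0]:
            best[hit.busco_id] = (key, hit)
    return {busco_id: hit for busco_id, (key, hit) in best.items()}


example_hits = [
    Hit("B1", "asmA", "Duplicated", 900.0, 1200),
    Hit("B1", "asmB", "Complete", 500.0, 1000),
    Hit("B1", "asmC", "Complete", 500.0, 1100),
    Hit("B2", "asmA", "Fragmented", 100.0, 300),
]
best_hits = pick_best_hits(example_hits)
```

In this example the asmC hit wins for B1: it is a Single Complete hit (beating the higher-scoring Duplicated hit) and is longer than the equally scored asmB hit.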

NOTE: If BUSCO v5 has been run in MetaEuk mode, the nucleotide sequences will not have been generated. BUSCOMP will use the full table and metaeuk_output/*_results/*.codon.fas files to extract the nucleotide sequences where possible.

BUSCOMP uses minimap2 to map BUSCOMPSeq predicted CDS sequences onto genome/transcriptome assemblies, including those not included in the original BUSCO compilation. This way, the compiled set of species-specific BUSCO sequences can also be used to generate a quick-and-dirty assessment of completeness for a new genome assembly. Hits are converted into percentage coverage stats, which are then used to reclassify the BUSCO gene on the basis of coverage and identity. BUSCOMP ratings are designed to mimic the original BUSCO ratings but have different definitions. In addition, two extra classes of low quality hit have been added: "Partial" and "Ghost".

Complete: 95%+ Coverage in a single contig/scaffold. (Note: accuracy/identity is not considered.)
Duplicated: 95%+ Coverage in 2+ contigs/scaffolds.
Fragmented: 95%+ combined coverage but not in any single contig/scaffold.
Partial: 40-95% combined coverage.
Ghost: Hits meeting local cutoff but <40% combined coverage.
Missing: No hits meeting local cutoff.
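
The rating definitions above can be expressed as a short Python sketch. It is an illustration under stated assumptions rather than the BUSCOMP implementation: merged_coverage() and rate_gene() are hypothetical names, hits are assumed to be half-open (start, end) intervals on the BUSCOMPSeq query that have already passed the local cutoffs, and the 40% boundary is assumed to fall in "Partial".

```python
def merged_coverage(intervals):
    """Count query positions covered by half-open (start, end) intervals,
    counting overlapping regions only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def rate_gene(query_len, hits_by_contig):
    """Rate one gene from {contig: [(qstart, qend), ...]} local hits."""
    if not any(hits_by_contig.values()):
        return "Missing"
    per_contig = [
        100.0 * merged_coverage(ivs) / query_len
        for ivs in hits_by_contig.values()
    ]
    n_full = sum(1 for pc in per_contig if pc >= 95.0)
    if n_full >= 2:
        return "Duplicated"
    if n_full == 1:
        return "Complete"
    all_hits = [iv for ivs in hits_by_contig.values() for iv in ivs]
    combined = 100.0 * merged_coverage(all_hits) / query_len
    if combined >= 95.0:
        return "Fragmented"
    if combined >= 40.0:
        return "Partial"
    return "Ghost"
```

For example, a 1000 bp gene covered at positions 0-500 on one contig and 480-1000 on another has 100% combined coverage but no single contig reaching 95%, so it is rated "Fragmented".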

In addition to individual assembly stats, BUSCO and BUSCOMP ratings are compiled across user-defined groups of assemblies with various outputs to give insight into how different assemblies complement each other. Ratings are also combined with traditional genome assembly statistics (NG50 and LG50) based on a given genomesize=X to help identify the "best" assemblies. Details of settings, key results, tables and plots are output to an HTML report using Rmarkdown.
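
For reference, NG50 and LG50 can be computed from a list of sequence lengths and a genome size as sketched below. This uses the standard definitions (NG50 is the length of the sequence at which the cumulative length of the longest sequences first reaches half the genome size; LG50 is the number of sequences needed to get there). The function name ng50_lg50() and the (0, 0) return value for short assemblies are assumptions of this sketch, not BUSCOMP behaviour.

```python
def ng50_lg50(seq_lengths, genome_size):
    """Return (NG50, LG50) for the given sequence lengths and genome size.

    Returns (0, 0) if the assembly totals less than half the genome size.
    """
    target = genome_size / 2.0
    running = 0
    # Walk from the longest sequence down, accumulating total length.
    for count, length in enumerate(sorted(seq_lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return 0, 0


ng50, lg50 = ng50_lg50([50, 40, 30, 20, 10], genome_size=200)
```

Because the target is set by genomesize=X rather than by the assembly length, assemblies of different total size are compared against the same yardstick.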

NOTE: For HTML output, R must be installed and a pandoc environment variable must be set, e.g.

export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc

NOTE: BUSCOMPSeq sequences can be provided with buscofas=FILE in place of compilation. This option has not been tested and might give some unexpected behaviours, as some of the quoted figures will still be based on the calculated BUSCOMPSeq data. Please report any unexpected behaviour.

For full documentation of the BUSCOMP workflow, run with dochtml=T and read the *.docs.html file generated, or visit https://slimsuite.github.io/buscomp/.

Commandline

Input/Output options

runs=DIRLIST : List of BUSCO run directories (wildcards allowed) [run_*]
fastadir=DIRLIST: List of directories containing genome fasta files (wildcards allowed) [./]
fastaext=LIST : List of accepted fasta file extensions that will be checked for in fastadir [fasta,fas,fsa,fna,fa]
genomes=FILE : File of Prefix and Genome fields for generating user-friendly output [*.genomes.tdt if found]
restrict=T/F : Restrict analysis to genomes with a loaded alias [False]
runsort=X : Output sorting order for genomes and groups (or "Genome","Prefix","Complete","Group") [Group]
stripnum=T/F : Whether to strip numbers ("XX_*") at start of Genome alias in output [True]
groups=FILE : File of Genome and Group fields to define Groups for compilation [*.groups.tdt]
buscofas=FASFILE: Fasta file of BUSCO DNA sequences. Will combine and make (NR) if not given [None]
metaeukfna=T/F : Perform v5 metaeuk nucleotide busco sequences extraction if missing [True]
buscomp=T/F : Whether to run BUSCO compilation across full results tables [True]
dupbest=T/F : Whether to rate "Duplicated" above "Complete" when compiling "best" BUSCOs across Groups [False]
buscompseq=T/F : Whether to run full BUSCO comparison using buscofas and minimap2 [True]
ratefas=FILELIST: Additional fasta files of assemblies to rate with BUSCOMPSeq (No BUSCO run) (wildcards allowed) []
rmdreport=T/F : Generate Rmarkdown report and knit into HTML [True]
ggplot=T/F : Whether to use ggplot code for plotting [True]
fullreport=T/F : Generate full Rmarkdown report including detailed tables of all ratings [True]
missing=T/F : Generate summary tables for sets of Missing genes for each assembly/group [True]
dochtml=T/F : Generate HTML BUSCOMP documentation (*.docs.html) instead of main run [False]
summarise=T/F : Include summaries of genomes in main *.genomes.tdt output [True]
loadsummary=T/F : Use existing genome summaries including NG50 from *.genomes.tdt, if present [True]
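
As an illustration of the genomes=FILE and groups=FILE inputs, the sketch below builds the text of two small tables with the field names given above. It assumes that *.tdt files are tab-delimited text with a header row; the format_tdt() helper, the prefixes and the genome and group names are all hypothetical.

```python
import csv
import io


def format_tdt(fields, rows):
    """Return tab-delimited text with a header row (assumed *.tdt layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical assembly prefixes mapped to user-friendly Genome aliases.
genomes_text = format_tdt(
    ["Prefix", "Genome"],
    [("asm_v1", "Draft"), ("asm_v2", "Polished")],
)

# Hypothetical grouping of those Genome aliases for compilation.
groups_text = format_tdt(
    ["Genome", "Group"],
    [("Draft", "AllVersions"), ("Polished", "AllVersions")],
)
```

Each string could then be written to a file and supplied via genomes=FILE or groups=FILE.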

Mapping/Classification options

minimap2=PROG : Full path to run minimap2 [minimap2]
endextend=X : Extend minimap2 hits to end of sequence if query region within X bp of end [0]
minlocid=INT : Minimum percentage identity for aligned chunk to be kept (local %identity) [0]
minloclen=INT : Minimum length for aligned chunk to be kept (local hit length in bp) [20]
uniquehit=T/F : Option to use *.hitunique.tdt table of unique coverage for GABLAM coverage stats [True]
mmsecnum=INT : Max. number of secondary alignments to keep (minimap2 -N) [3]
mmpcut=NUM : Minimap2 Minimal secondary-to-primary score ratio to output secondary mappings (minimap2 -p) [0]
mapopt=CDICT : Dictionary of additional minimap2 options to apply (caution: over-rides conflicting settings) []
alnseq=T/F : Whether to use alnseq-based processing (True) or (False) faster CS-Gstring processing [False]
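
The local cutoffs minlocid and minloclen can be pictured as a filter on individual alignment chunks, as in the sketch below. It reads the standard PAF columns (column 10 is the number of matching residues, column 11 is the alignment block length) and assumes that local identity is matches divided by alignment length. BUSCOMP's own calculation may differ, and the function names and example line are hypothetical.

```python
def parse_paf_hit(line):
    """Extract the fields used here from one tab-separated PAF line."""
    cols = line.rstrip("\n").split("\t")
    return {
        "query": cols[0],
        "qstart": int(cols[2]),
        "qend": int(cols[3]),
        "target": cols[5],
        "matches": int(cols[9]),      # PAF column 10: matching residues
        "aln_length": int(cols[10]),  # PAF column 11: alignment block length
    }


def keep_local_hit(hit, minlocid=0.0, minloclen=20):
    """Apply the local length and percentage identity cutoffs to one hit."""
    if hit["aln_length"] <= 0 or hit["aln_length"] < minloclen:
        return False
    identity = 100.0 * hit["matches"] / hit["aln_length"]
    return identity >= minlocid


example_line = "B1\t1000\t0\t960\t+\tctg1\t50000\t100\t1060\t950\t960\t60"
example_hit = parse_paf_hit(example_line)
```

With the default settings shown in the option list (minlocid=0, minloclen=20) the example hit is kept; raising minlocid to 99 would drop it, since 950/960 is just under 99% identity.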

Processing options

forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more processes [0]

History Module Version History

# 0.0.0 - Initial Compilation.
# 0.1.0 - Basic working version.
# 0.2.0 - Functional version with basic RMarkdown HTML output.
# 0.3.0 - Added ratefas=FILELIST: Additional fasta files of assemblies to rate with BUSCOMPSeq (No BUSCO run) [].
# 0.4.0 - Implemented forking and tidied up output a little.
# 0.5.0 - Updated genome stats and RMarkdown HTML output. Reorganised assembly loading and processing. Added menus.
# 0.5.1 - Reorganised code for clearer flow and documentation. Unique and missing BUSCO output added.
# 0.5.2 - Dropped paircomp method and added Rmarkdown control methods. Updated Rmarkdown descriptions. Updated log output.
# 0.5.3 - Tweaked log output and fixed a few minor bugs.
# 0.5.4 - Deleted some excess code and tweaked BUSCO percentage plot outputs.
# 0.5.5 - Fixed minlocid bug and cleared up minimap temp directories. Added LnnIDxx to BUSCOMP outputs.
# 0.5.6 - Added uniquehit=T/F : Option to use *.hitunique.tdt table of unique coverage for GABLAM coverage stats [False]
# 0.6.0 - Added more minimap options, changed defaults and added dev generation of a table of changes in ratings from BUSCO to BUSCOMP.
# 0.6.1 - Fixed bug that was including Duplicated sequences in the buscomp.fasta file. Added option to exclude from BUSCOMPSeq compilation.
# 0.6.2 - Fixed bug introduced that had broken manual group review/editing.
# 0.7.0 - Updated the defaults in the light of test analyses. Tweaked Rmd report.
# 0.7.1 - Fixed unique group count bug when some genomes are not in a group. Fixed running with non-standard options.
# 0.7.2 - Added loadsummary=T/F option to regenerate summaries and fixed bugs running without BUSCO results.
# 0.7.3 - Fixed bugs calculating Complete BUSCO scores in a couple of places. Added text summaries to plots.
# 0.7.4 - Added ggplot option. Added group plots to full reports.
# 0.7.5 - Reinstated BUSCOMP contribution reports when re-running.
# 0.7.6 - Added additional error-handling for CS parsing errors.
# 0.7.7 - Fixed problems with buscompseq=F. Fixed stripnum and Rmd bugs. Added sequence name checking for duplicates.
# 0.7.8 - Fixed a bug where BUSCOMP was not being compiled for assemblies without BUSCO data.
# 0.7.9 - Added listing of numbers to BUSCOMP Missing charts.
# 0.8.0 - Added alnseq=F as default PAF parsing mode for improved efficiency.
# 0.8.1 - Set endextend=0 due to bug.
# 0.8.2 - Fixed full RMD chart labelling bug. Fixed endextend bug and reinstated endextend=10 default.
# 0.8.3 - Fixed Unique rating bug with no groups.
# 0.8.4 - Set endextend=0 due to another bug.
# 0.8.5 - Fixed BUSCO table loading bug introduced by Diploidocus. Added error catching for logbinomial bug.
# 0.8.6 - Tweaked code to handle BUSCO v4 files, but not (yet) file organisation.
# 0.8.7 - Fixing issues with prefix parsing from BUSCO directories and files.
# 0.9.0 - Updated parsing of single_copy_busco_sequences/ to enable multiple directories with "$PREFIX" suffixes.
# 0.9.1 - Updated parsing to enable BUSCO v4 results recognition. (run with -o $GENOME.busco)
# 0.9.2 - Fixed some bugs when files missing.
# 0.9.3 - Minor fixes to output and clearer error messages. Fixed formatting for Python 2.6 back compatibility for servers.
# 0.9.4 - Added contig statistics and fixed group description loading bug.
# 0.9.5 - Fixed Group BUSCOMP plot output bug.
# 0.9.6 - Added CtgNum: Number of contigs (`SeqNum`+`GapCount`).
# 0.9.7 - Fixed some Rmd bugs to fix output after summary table changes.
# 0.10.0 - Added Complete BUSCOMP gene table output for Diploidocus BUSCO table alternative.
# 0.10.1 - Changed BUSCOMP to be BUSCO Compilation and Comparison Tool.
# 0.11.0 - Updated for BUSCO v5.
# 0.12.0 - Added parsing for v5 proteome and transcriptome modes.
# 0.12.1 - Fixed group deletion bug.
# 0.13.0 - Added generation of missing MetaEuk *.fna files using rje_busco module.
# 1.0.0 - Added citation to main documentation and switched to version 1.x for release with publication.
# 1.0.1 - Fixed parsing of MetaEuk sequences that have extra letters to BuscoID in full table.

BUSCOMP REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
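
A minimal Python sketch for splitting the full server output into its OUTFMT sections, based only on the layout shown above (a ###~...~### divider line, then a "# OUTFMT:" line, then the section contents). The function name split_rest_output() and the section names in the example ("log", "ratings") are hypothetical.

```python
import re

DIVIDER = re.compile(r"^###~+###\s*$")   # e.g. ###~~~~~~###
HEADER = re.compile(r"^# (\S+):\s*$")    # e.g. "# OUTFMT:"


def split_rest_output(text):
    """Return {OUTFMT: contents} parsed from &rest=full style output."""
    sections = {}
    name = None
    expect_header = False
    for line in text.splitlines():
        if DIVIDER.match(line):
            expect_header = True
            continue
        if expect_header:
            expect_header = False
            match = HEADER.match(line)
            if match:
                name = match.group(1)
                sections[name] = []
                continue
        if name is not None:
            sections[name].append(line)
    # Join each section and trim leading/trailing blank lines.
    return {key: "\n".join(lines).strip("\n") for key, lines in sections.items()}


example_text = (
    "###~~~~~~###\n# log:\nline one\nline two\n\n"
    "###~~~~~~###\n# ratings:\nBuscoID\tRating\n"
)
example_sections = split_rest_output(example_text)
```

Lines that appear before the first divider are ignored, and a divider that is not followed by a "# OUTFMT:" line is treated as part of the current section's surroundings rather than starting a new one.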

Available REST Outputs

There is currently no specific help available on REST output for this program.
