|
BUSCOMP V1.0.1
BUSCO Compilation and Comparison tool
Copyright © 2019 Richard J. Edwards - See source code for GNU License Notice
Imported modules:
See SLiMSuite Blog for further documentation. See

Function

BUSCOMP is designed to overcome some of the non-deterministic limitations of BUSCO to:

1. compile a non-redundant maximal set of complete BUSCOs from a set of assemblies, and
2. use this set to provide a "true" comparison of completeness between different assemblies of the same genome with predictable behaviour.

For each BUSCO gene, BUSCOMP will extract the best "Single Complete" sequence from those available, using the
NOTE: If BUSCO v5 has been run in MetaEuk mode, the nucleotide sequences will not have been generated. BUSCOMP will use the full table and

BUSCOMP uses minimap2 to map BUSCOSeq predicted CDS sequences onto genome/transcriptome assemblies, including those not included in the original BUSCO compilation. This way, the compiled set of species-specific BUSCO sequences can also be used to generate a quick-and-dirty assessment of completeness for a new genome assembly. Hits are converted into percentage coverage stats, which are then used to reclassify the BUSCO gene on the basis of coverage and identity. BUSCOMP ratings are designed to mimic the original BUSCO ratings but have different definitions. In addition, two extra classes of low quality hit have been added: "Partial" and "Ghost".
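
The coverage-based reclassification described above can be pictured with a minimal Python sketch. This is an illustration only, not BUSCOMP's own code: the function name `rate_gene`, its inputs, and both cut-off values are hypothetical placeholders, and the rules for each class are assumptions chosen to show the general shape of the logic. Identity filtering of hits is assumed to have happened before this step. Consult the full BUSCOMP documentation for the real rating definitions.

```python
from typing import Dict


def rate_gene(contig_cov: Dict[str, float], combined_cov: float,
              complete_cut: float = 95.0, partial_cut: float = 40.0) -> str:
    """Return an illustrative rating for one BUSCO gene.

    contig_cov   -- percentage of the gene covered by hits on each contig
                    (hypothetical input; an empty dict means no hits at all)
    combined_cov -- percentage of the gene covered by hits across all contigs
    complete_cut, partial_cut -- placeholder thresholds, NOT BUSCOMP's values
    """
    if not contig_cov:
        return "Missing"
    # Contigs that cover the gene (almost) end to end on their own.
    full = [name for name, cov in contig_cov.items() if cov >= complete_cut]
    if len(full) > 1:
        return "Duplicated"
    if len(full) == 1:
        return "Complete"
    # No single contig is sufficient, so fall back on combined coverage.
    if combined_cov >= complete_cut:
        return "Fragmented"
    if combined_cov >= partial_cut:
        return "Partial"
    return "Ghost"
```

Under these assumed rules, a gene with hits spread over two contigs that together cover most of its length would be rated "Fragmented", whereas weak hits that cover only a small part of the gene fall into "Ghost" rather than "Missing".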
In addition to individual assembly stats, BUSCO and BUSCOMP ratings are compiled across user-defined groups of
assemblies with various outputs to give insight into how different assemblies complement each other. Ratings are
also combined with traditional genome assembly statistics (NG50 and LG50) based on a given

NOTE: For HTML output, R must be installed and a pandoc environment variable must be set, e.g. export

NOTE: BUSCOMPSeq sequences can be provided with

For full documentation of the BUSCOMP workflow, run with

Commandline

Input/Output options
Mapping/Classification options
Processing options
History

Module Version History

# 0.0.0 - Initial Compilation.
# 0.1.0 - Basic working version.
# 0.2.0 - Functional version with basic RMarkdown HTML output.
# 0.3.0 - Added ratefas=FILELIST: Additional fasta files of assemblies to rate with BUSCOMPSeq (No BUSCO run) [].
# 0.4.0 - Implemented forking and tidied up output a little.
# 0.5.0 - Updated genome stats and RMarkdown HTML output. Reorganised assembly loading and processing. Added menus.
# 0.5.1 - Reorganised code for clearer flow and documentation. Unique and missing BUSCO output added.
# 0.5.2 - Dropped paircomp method and added Rmarkdown control methods. Updated Rmarkdown descriptions. Updated log output.
# 0.5.3 - Tweaked log output and fixed a few minor bugs.
# 0.5.4 - Deleted some excess code and tweaked BUSCO percentage plot outputs.
# 0.5.5 - Fixed minlocid bug and cleared up minimap temp directories. Added LnnIDxx to BUSCOMP outputs.
# 0.5.6 - Added uniquehit=T/F : Option to use *.hitunique.tdt table of unique coverage for GABLAM coverage stats [False]
# 0.6.0 - Added more minimap options, changed defaults and dev generation of a table of changes in ratings from BUSCO to BUSCOMP.
# 0.6.1 - Fixed bug that was including Duplicated sequences in the buscomp.fasta file. Added option to exclude from BUSCOMPSeq compilation.
# 0.6.2 - Fixed bug introduced that had broken manual group review/editing.
# 0.7.0 - Updated the defaults in the light of test analyses. Tweaked Rmd report.
# 0.7.1 - Fixed unique group count bug when some genomes are not in a group. Fixed running with non-standard options.
# 0.7.2 - Added loadsummary=T/F option to regenerate summaries and fixed bugs running without BUSCO results.
# 0.7.3 - Fixed bugs calculating Complete BUSCO scores in a couple of places. Added text summaries to plots.
# 0.7.4 - Added ggplot option. Added group plots to full reports.
# 0.7.5 - Reinstated BUSCOMP contribution reports when re-running.
# 0.7.6 - Added additional error-handling for CS parsing errors.
# 0.7.7 - Fixed problems with buscompseq=F. Fixed stripnum and Rmd bugs. Added sequence name checking for duplicates.
# 0.7.8 - Fixed a bug where BUSCOMP was not being compiled for assemblies without BUSCO data.
# 0.7.9 - Added listing of numbers to BUSCOMP Missing charts.
# 0.8.0 - Added alnseq=F as default PAF parsing mode for improved efficiency.
# 0.8.1 - Set endextend=0 due to bug.
# 0.8.2 - Fixed full RMD chart labelling bug. Fixed endextend bug and reinstated endextend=10 default.
# 0.8.3 - Fixed Unique rating bug with no groups.
# 0.8.4 - Set endextend=0 due to another bug.
# 0.8.5 - Fixed BUSCO table loading bug introduced by Diploidocus. Added error catching for logbinomial bug.
# 0.8.6 - Tweaked code to handle BUSCO v4 files, but not (yet) file organisation.
# 0.8.7 - Fixing issues with prefix parsing from BUSCO directories and files.
# 0.9.0 - Updated parsing of single_copy_busco_sequences/ to enable multiple directories with "$PREFIX" suffixes.
# 0.9.1 - Updated parsing to enable BUSCO v4 results recognition. (run with -o $GENOME.busco)
# 0.9.2 - Fixed some bugs when files missing.
# 0.9.3 - Minor fixes to output and clearer error messages. Fixed formatting for Python 2.6 back compatibility for servers.
# 0.9.4 - Added contig statistics and fixed group description loading bug.
# 0.9.5 - Fixed Group BUSCOMP plot output bug.
# 0.9.6 - Added CtgNum: Number of contigs (`SeqNum`+`GapCount`).
# 0.9.7 - Fixed some Rmd bugs to fix output after summary table changes.
# 0.10.0 - Added Complete BUSCOMP gene table output for Diploidocus BUSCO table alternative.
# 0.10.1 - Changed BUSCOMP to be BUSCO Compilation and Comparison Tool.
# 0.11.0 - Updated for BUSCO v5.
# 0.12.0 - Added parsing for v5 proteome and transcriptome modes.
# 0.12.1 - Fixed group deletion bug.
# 0.13.0 - Added generation of missing MetaEuk *.fna files using rje_busco module.
# 1.0.0 - Added citation to main documentation and switched to version 1.x for release with publication.
# 1.0.1 - Fixed parsing of MetaEuk sequences that have extra letters to BuscoID in full table.

BUSCOMP REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help. &rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default (&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output, which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...

Available REST Outputs

There is currently no specific help available on REST output for this program.

© 2015 RJ Edwards. Contact: richard.edwards@unsw.edu.au. |
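
As a rough illustration of how the full (&rest=full) server output described above might be split into its OUTFMT sections, here is a short Python sketch. It assumes that each section starts with a `###~~~...~~~###` separator line followed by a `# OUTFMT:` header line on the next line; the function name `split_rest_output` and the section names `status` and `log` in the sample text are made up for this example and are not taken from the BUSCOMP server.

```python
import re
from typing import Dict

# Separator line of tildes between ### markers, then a "# NAME:" header line.
# The layout is an assumption based on the format shown in this document.
SECTION_RE = re.compile(r"^###~+###[ \t]*\n# (\w+):[^\n]*\n", re.MULTILINE)


def split_rest_output(text: str) -> Dict[str, str]:
    """Map each OUTFMT section name to the text that follows its header."""
    sections: Dict[str, str] = {}
    matches = list(SECTION_RE.finditer(text))
    for i, match in enumerate(matches):
        # A section runs until the next separator, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip("\n")
    return sections


# Hypothetical sample input, invented for illustration only.
sample = (
    "###~~~~~~~~###\n# status:\nrunning\n"
    "###~~~~~~~~###\n# log:\nline 1\nline 2\n"
)
parsed = split_rest_output(sample)
```

Any text before the first separator is ignored by this sketch, and a repeated section name would overwrite the earlier entry; both choices are simplifications rather than documented server behaviour.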