Module:	depthsizer
Description:	Read-depth based genome size prediction
Version:	1.6.3
Last Edit:	18/03/22
Citation:	Chen SH et al. & Edwards RJ (2022): Mol. Ecol. Res. (doi: 10.1111/1755-0998.13574)

Imported modules: rje rje_obj rje_rmd rje_readcore rje_seqlist

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

DepthSizer is a modified wrapper for the genome size estimate methods of Diploidocus. DepthSizer needs a genome assembly (fasta format, seqin=FILE), a set of long read (ONT, PacBio or HiFi) data for the assembly (reads=FILELIST and readtype=LIST) (or readbp=INT), and a BUSCO full table of results (busco=TSVFILE).

DepthSizer works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid read depth) regions along with some poor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc. are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the region is actually diploid coverage, the modal read depth is expected to represent the actual single copy read depth.

DepthSizer uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth. This is converted into an estimated single copy read depth using a density plot of BUSCO single copy genes. Legacy mode will calculate modal read depth for each BUSCO gene along with the overall modal read depth for all gene regions. Genome size is then estimated based on a crude calculation using the total combined sequencing length. This will be caculated from reads=FILELIST unless provided with readbp=INT.

BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default full_table_$BASEFILE.busco.tsv). This can be replaced with any table with the fields: ['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status = Complete and the Contig, Start and End fields are used to define the regions that should be predominantly single copy. Output from BUSCOMP is also compatible with DepthSizer.

NOTE: The current unadjusted genome size prediction appears to be an over-estimate. Please see documentation for more details of attempts to correct for contamination, read mapping and/or imbalanced insertion:deletion ratios etc.

---

Dependencies

Unless bam=FILE is given, [minimap2](https://github.com/lh3/minimap2) must be installed and either added to the
environment $PATH or given to DepthSizer with the minimap2=PROG setting, and [samtools](http://www.htslib.org/)
needs to be installed. Unless legacy=T depdensity=F, R will also need be installed.

To generate documentation with dochtml, R will need to be installed and a pandoc environment variable must be set, e.g.

export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc

For DepthSizer documentation, run with dochtml=T and read the *.docs.html file generated.

Commandline

Main DepthSizer run options

seqin=FILE : Input sequence assembly [None]
basefile=FILE : Root of output file names [gapspanner or $SEQIN basefile]
summarise=T/F : Whether to generate and output summary statistics sequence data before and after processing [True]
genomesize=INT : Haploid genome size (bp) [0]
bam=FILE : BAM file of long reads mapped onto assembly [$BASEFILE.bam]
reads=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
dochtml=T/F : Generate HTML DepthSizer documentation (*.docs.html) instead of main run [False]
tmpdir=PATH : Path for temporary output files during forking (not all modes) [./tmpdir/]

Genome size prediction options

busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv]
readbp=INT : Total combined read length for depth calculations (over-rides reads=FILELIST) []
adjustmode=X : Map adjustment method to apply (None/CovBases/IndelRatio/MapBases/MapAdjust/MapRatio/OldAdjust/OldCovBases) [IndelRatio]
quickdepth=T/F : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False]
covbases=T/F : Whether to calculate predicted minimum genome size based on mapped reads only [True]
mapadjust=T/F : Whether to calculate mapadjust predicted genome size based on read length:mapping ratio [False]
benchmark=T/F : Activate benchmarking mode and also output the assembly size and mean depth estimate [False]
legacy=T/F : Whether to perform Legacy v1.0.0 (Diploidocus) calculations [False]
depdensity=T/F : Whether to use the BUSCO depth density profile in place of modal depth in legacy mode [True]
depadjust=INT : Advanced R density bandwidth adjustment parameter [12]
seqstats=T/F : Whether to output CN and depth data for full sequences as well as BUSCO genes [False]

Forking options

forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
killmain=T/F : Whether to kill main thread rather than individual forks when killforks reached. [False]
logfork=T/F : Whether to log forking in main log [False]

History Module Version History

    # 0.0.0 - Initial Compilation.
    # 1.0.0 - Initial working version of DepthSizer based on Diploidocus v0.16.2.
    # 1.1.0 - Updated to use updated ReadCore and DepthCopy code, plus new Indel ratio calculation (dev=T).
    # 1.2.0 - Added CovBases lower depthsizer estimate output based solely on mapped reads. Fixed unmapped read bug.
    # 1.2.1 - Fixed major flaw in indelratio calculation.
    # 1.3.0 - Added adjust string and benchmark=T option for additional calculations. IndelRatio no longer dev only.
    # 1.3.1 - Tweaked some input checks and log output. Replaced indelratio sort -u with uniq for speed and memory.
    # 1.4.0 - Added seqstats=T/F : Whether to output CN and depth data for full sequences as well as BUSCO genes [False]
    # 1.4.1 - Added citation and fixed minor output typo.
    # 1.4.2 - Fixed bug that causes clashes with v5 full_table.bed files.
    # 1.5.0 - Add additional map adjustment variants:
    #       - MapAdjust2 = allbases, not covbases
    #       - MapBases = Use map bases, not covbases for min read volumne
    #       - MapRatio = Use mapbases adjusted by indelratio
    # 1.6.0 - Disable legacy mode using Diploidocus.
    # 1.6.1 - Bug fixes to underlying R script and related core codebase.
    # 1.6.2 - Updated citation to Mol Ecol Res paper.
    # 1.6.3 - Fixed R code bug. Added bamcsi=T/F to use CSI indexing.

depthsizer REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...

Available REST Outputs

There is currently no specific help available on REST output for this program.

SLiMSuite REST Server

depthsizer V1.6.3

Read-depth based genome size prediction