Module:	DepthKopy
Description:	Single-copy read-depth and kmer based copy number analysis
Version:	1.0.3
Last Edit:	06/04/22
Citation:	Chen SH et al. & Edwards RJ (2022): Mol. Ecol. Res. (doi: 10.1111/1755-0998.13574)

Imported modules: rje rje_db rje_kat rje_obj rje_readcore rje_rmd

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

DepthKopy is an updated version of the regcnv methods of [Diploidocus](https://github.com/slimsuite/diploidocus), using the same single-copy read depth estimation from BUSCO Complete genes as [DepthSizer](https://github.com/slimsuite/depthsizer). DepthKopy needs a genome assembly (fasta format, seqin=FILE), a set of long read (ONT, PacBio or HiFi) data for the assembly (reads=FILELIST and readtype=LIST) or mapped BAM file (bam=FILE), and a BUSCO/BUSCOMP full table of results (busco=TSVFILE) or pre-calculated single-copy read depth (scdepth=NUM). For kmer analysis, reads also need to be provided with kmerreads=FILELIST (and 10xtrim=T if barcoded 10x linked reads).

Optionally, it can then take one or more additional sources of assembly regions for which stats will be calculated. Delimited text files or GFF files can be provided using regfile=FILE for a single file, or regfile=CDICT for multiple files. Multiple files are given in the form Name1:File1,Name2:File2,...,NameN:FileN, from which each unique Name will be plotted as a separate violin (see output). GFF files will be parsed to extract any features matching the types given by gfftype=LIST (default = gene). Delimited files will need to have headers that match those provided by checkfields=LIST (default = SeqName,Start,End). By default, 100 kb non-overlapping windows will also be output (winsize=INT winstep=NUM). Unless seqstats=F, statistics will also be calculated per assembly scaffold. Subsets of scaffolds can be given separate window output plots using chromcheck=LIST (scaffold names) or chromcheck=INT (min size).

DepthKopy works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid read depth) regions along with some poor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc. are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the region is actually diploid coverage, the modal read depth is expected to represent the actual single copy read depth.

DepthKopy uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth. This is converted into an estimated single copy read depth using a smoothed density plot of BUSCO single copy genes. BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default full_table_$BASEFILE.busco.tsv). This can be replaced with any table matching the BUSCO fields: ['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status = Complete and the Contig, Start and End fields are used to define the regions that should be predominantly single copy. Output from BUSCOMP is also compatible with DepthKopy. DepthKopy has been tested with outputs from BUSCO v3 and v5.

Output is a table of depth statistics and predicted copy number for each input dataset (BUSCO Complete, BUSCO Duplicated, regfile Regions, assembly scaffolds (seqstat=T) and sliding windows). Data visualisations are also output for each region set using [ggstatsplot](https://indrajeetpatil.github.io/ggstatsplot/).

Commandline

Main DepthKopy run options

seqin=FILE : Input sequence assembly. (Must be *.fa or *.fasta for kmerself=T.) [None]
basefile=FILE : Root of output file names [diploidocus or $SEQIN basefile]
scdepth=NUM : Single copy ("diploid") read depth. If zero, will use SC BUSCO mode [0]
bam=FILE : BAM file of long reads mapped onto assembly [$BASEFILE.bam]
reads=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
dochtml=T/F : Generate HTML DepthKopy documentation (*.docs.html) instead of main run [False]

Depth and Copy Number options

busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv]
quickdepth=T/F : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False]
depfile=FILE : Precomputed depth file (*.fastdep or *.fastmp) to use [None]
homfile=FILE : Precomputed homology depth file (*.fasthom) to use (false=off) [None]
regfile=CDICT : List of Name:Files (or single FILE) of SeqName, Start, End positions (or GFF) for CN checking [None]
checkfields=LIST: Fields in checkpos file to give Locus, Start and End for checking [SeqName,Start,End]
gfftype=LIST : Optional feature types to use if performing regcheck on GFF file (e.g. gene) ['gene']
winsize=INT : Generate additional window-based depth and CN profiles of INT bp (0 to switch off) [100000]
winstep=NUM : Generate window every NUM bp (or fraction of winsize=INT if <=1) [1]
chromcheck=LIST : Output separate window analysis violin plots for listed sequences (or min size) + 'Other' []
seqstats=T/F : Whether to output CN and depth data for full sequences as well as BUSCO genes [True]

KAT kmer options

kmerself=T/F : Whether to perform additional assembly kmer analysis [True]
kmerreads=FILELIST : File of high quality reads for KAT kmer analysis []
10xtrim=T/F : Whether to trim 16bp 10x barcodes from Read 1 of Kmer Reads data for KAT analysis [False]

System options

forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more process [0]
tmpdir=PATH : Path for temporary output files during forking [./tmpdir/]

History Module Version History

    # 0.0.0 - Initial Compilation.
    # 0.1.0 - Added KAT kmers and self-homology. Renamed DepthKopy.
    # 0.1.1 - Fixed chromcheck setting bug.
    # 0.1.2 - Implemented kmerself=F toggle.
    # 0.2.0 - Added seqstats=T/F : Whether to output CN and depth data for full sequences as well as BUSCO genes [False]
    # 0.2.1 - Fixed occasional R BUSCO bug and renamed pngplots directory after basefile.
    # 0.2.2 - Fixed ignoredate bug.
    # 0.3.0 - Added support for multiple regfiles. Added maxcn=INT option. Fixed end of sequence window size.
    # 0.3.1 - Fixed reghead bug.
    # 0.4.0 - Updated to use the seqin file to restrict sequences under analysis from regfile/BUSCO etc. Updated docs.
    # 1.0.0 - Added over-ride of BUSCO calculation when scdepth=X is provided. First true release. Added to SeqSuite.
    # 1.0.1 - Added passing on of gfftype=LIST option to Rscript.
    # 1.0.2 - Updated citation to Mol Ecol Res paper.
    # 1.0.3 : Fixed problem with only a single density point.

DepthKopy REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...

Available REST Outputs

There is currently no specific help available on REST output for this program.

SLiMSuite REST Server

DepthKopy V1.0.3

Single-copy read-depth and kmer based copy number analysis