|
|
Module: | DepthKopy |
Description: | Single-copy read-depth and kmer based copy number analysis |
Version: | 1.0.3 |
Last Edit: | 06/04/22 |
Citation: | Chen SH et al. & Edwards RJ (2022): Mol. Ecol. Res. (doi: 10.1111/1755-0998.13574) |
|
Copyright © 2021 Richard J. Edwards - See source code for GNU License Notice
Imported modules:
rje
rje_db
rje_kat
rje_obj
rje_readcore
rje_rmd
See SLiMSuite Blog for further documentation. See rje
for general commands.
Function
DepthKopy is an updated version of the regcnv methods of [Diploidocus](https://github.com/slimsuite/diploidocus),
using the same single-copy read depth estimation from BUSCO Complete genes as [DepthSizer](https://github.com/slimsuite/depthsizer).
DepthKopy needs a genome assembly (fasta format, seqin=FILE
), a set of long read (ONT, PacBio or HiFi) data for the assembly
(reads=FILELIST
and readtype=LIST
) or mapped BAM file (bam=FILE
), and a BUSCO/BUSCOMP full table of results (busco=TSVFILE
)
or pre-calculated single-copy read depth (scdepth=NUM
). For kmer analysis, reads also need to be provided with
kmerreads=FILELIST
(and 10xtrim=T
if barcoded 10x linked reads).
Optionally, it can then take one or more additional sources of assembly regions for which stats will be calculated.
Delimited text files or GFF files can be provided using regfile=FILE
for a single file, or regfile=CDICT
for multiple
files. Multiple files are given in the form Name1:File1,Name2:File2,...,NameN:FileN
, from which each unique Name
will
be plotted as a separate violin (see output). GFF files will be parsed to extract any features matching the types given
by gfftype=LIST
(default = gene
). Delimited files will need to have headers that match those provided by
checkfields=LIST
(default = SeqName,Start,End
). By default, 100 kb non-overlapping windows will also be output
(winsize=INT
winstep=NUM
). Unless seqstats=F
, statistics will also be calculated per assembly scaffold. Subsets of
scaffolds can be given separate window output plots using chromcheck=LIST
(scaffold names) or chromcheck=INT
(min
size).
DepthKopy works on the principle that Complete
BUSCO genes should represent predominantly single copy (diploid
read depth) regions along with some poor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc.
are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the
region is actually diploid coverage, the modal read depth is expected to represent the actual single copy
read depth.
DepthKopy uses samtools mpileup
(or samtools depth
if quickdepth=T
) to calculate the per-base read depth.
This is converted into an estimated single copy read depth using a smoothed density plot of BUSCO single copy genes.
BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE
(default
full_table_$BASEFILE.busco.tsv
). This can be replaced with any table matching the BUSCO fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status
= Complete
and the Contig
, Start
and End
fields are used to define the regions that should be predominantly single copy.
Output from BUSCOMP is also compatible with DepthKopy. DepthKopy has been tested with outputs from BUSCO v3 and v5.
Output is a table of depth statistics and predicted copy number for each input dataset (BUSCO Complete, BUSCO Duplicated,
regfile
Regions, assembly scaffolds (seqstat=T
) and sliding windows). Data visualisations are also output for each
region set using [ggstatsplot](https://indrajeetpatil.github.io/ggstatsplot/).
Commandline
Main DepthKopy run options
seqin=FILE
: Input sequence assembly. (Must be *.fa or *.fasta for kmerself=T
.) [None
]
basefile=FILE
: Root of output file names [diploidocus or $SEQIN basefile
]
scdepth=NUM
: Single copy ("diploid") read depth. If zero, will use SC BUSCO mode [0
]
bam=FILE
: BAM file of long reads mapped onto assembly [$BASEFILE.bam
]
reads=FILELIST
: List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST
: List of ont/pb/hifi file types matching reads for minimap2 mapping [ont
]
dochtml=T/F
: Generate HTML DepthKopy documentation (*.docs.html) instead of main run [False
]
Depth and Copy Number options
busco=TSVFILE
: BUSCO full table [full_table_$BASEFILE.busco.tsv
]
quickdepth=T/F
: Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False
]
depfile=FILE
: Precomputed depth file (*.fastdep or *.fastmp) to use [None
]
homfile=FILE
: Precomputed homology depth file (*.fasthom) to use (false=off
) [None
]
regfile=CDICT
: List of Name:Files (or single FILE) of SeqName, Start, End positions (or GFF) for CN checking [None
]
checkfields=LIST
: Fields in checkpos file to give Locus, Start and End for checking [SeqName,Start,End
]
gfftype=LIST
: Optional feature types to use if performing regcheck on GFF file (e.g. gene) ['gene'
]
winsize=INT
: Generate additional window-based depth and CN profiles of INT bp (0 to switch off) [100000
]
winstep=NUM
: Generate window every NUM bp (or fraction of winsize=INT
if <=1) [1
]
chromcheck=LIST
: Output separate window analysis violin plots for listed sequences (or min size) + 'Other' []
seqstats=T/F
: Whether to output CN and depth data for full sequences as well as BUSCO genes [True
]
KAT kmer options
kmerself=T/F
: Whether to perform additional assembly kmer analysis [True
]
kmerreads=FILELIST
: File of high quality reads for KAT kmer analysis []
10xtrim=T/F
: Whether to trim 16bp 10x barcodes from Read 1 of Kmer Reads data for KAT analysis [False
]
System options
forks=X
: Number of parallel sequences to process at once [0
]
killforks=X
: Number of seconds of no activity before killing all remaining forks. [36000
]
forksleep=X
: Sleep time (seconds) between cycles of forking out more process [0
]
tmpdir=PATH
: Path for temporary output files during forking [./tmpdir/
]
History Module Version History
# 0.0.0 - Initial Compilation.
# 0.1.0 - Added KAT kmers and self-homology. Renamed DepthKopy.
# 0.1.1 - Fixed chromcheck setting bug.
# 0.1.2 - Implemented kmerself=F toggle.
# 0.2.0 - Added seqstats=T/F : Whether to output CN and depth data for full sequences as well as BUSCO genes [False]
# 0.2.1 - Fixed occasional R BUSCO bug and renamed pngplots directory after basefile.
# 0.2.2 - Fixed ignoredate bug.
# 0.3.0 - Added support for multiple regfiles. Added maxcn=INT option. Fixed end of sequence window size.
# 0.3.1 - Fixed reghead bug.
# 0.4.0 - Updated to use the seqin file to restrict sequences under analysis from regfile/BUSCO etc. Updated docs.
# 1.0.0 - Added over-ride of BUSCO calculation when scdepth=X is provided. First true release. Added to SeqSuite.
# 1.0.1 - Added passing on of gfftype=LIST option to Rscript.
# 1.0.2 - Updated citation to Mol Ecol Res paper.
# 1.0.3 : Fixed problem with only a single density point.
DepthKopy REST Output formats
Run with
&rest=docs
for program documentation and options. A plain text version is accessed with
&rest=help
.
&rest=OUTFMT
can be used to retrieve individual parts of the output, matching the tabs in the default
(
&rest=format
) output. Individual
OUTFMT
elements can also be parsed from the full (
&rest=full
) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
Available REST Outputs
There is currently no specific help available on REST output for this program.