|
|
Module: | depthsizer |
Description: | Read-depth based genome size prediction |
Version: | 1.6.3 |
Last Edit: | 18/03/22 |
Citation: | Chen SH et al. & Edwards RJ (2022): Mol. Ecol. Res. (doi: 10.1111/1755-0998.13574) |
|
Copyright © 2021 Richard J. Edwards - See source code for GNU License Notice
Imported modules:
rje
rje_obj
rje_rmd
rje_readcore
rje_seqlist
See SLiMSuite Blog for further documentation. See rje
for general commands.
Function
DepthSizer is a modified wrapper for the genome size estimate methods of Diploidocus. DepthSizer needs a genome assembly
(fasta format, seqin=FILE
), a set of long read (ONT, PacBio or HiFi) data for the assembly (reads=FILELIST
and
readtype=LIST
) (or readbp=INT
), and a BUSCO full table of results (busco=TSVFILE
).
DepthSizer works on the principle that Complete
BUSCO genes should represent predominantly single copy (diploid
read depth) regions along with some poor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc.
are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the
region is actually diploid coverage, the modal read depth is expected to represent the actual single copy
read depth.
DepthSizer uses samtools mpileup
(or samtools depth
if quickdepth=T
) to calculate the per-base read depth.
This is converted into an estimated single copy read depth using a density plot of BUSCO single copy genes.
Legacy mode will calculate modal read depth for each BUSCO gene along with the overall modal read depth for all gene
regions. Genome size is then estimated based on a crude calculation using the total combined sequencing length.
This will be caculated from reads=FILELIST
unless provided with readbp=INT
.
BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE
(default
full_table_$BASEFILE.busco.tsv
). This can be replaced with any table with the fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status
= Complete
and the Contig
, Start
and End
fields are used to define the regions that should be predominantly single copy.
Output from BUSCOMP is also compatible with DepthSizer.
NOTE: The current unadjusted genome size prediction appears to be an over-estimate. Please see documentation for
more details of attempts to correct for contamination, read mapping and/or imbalanced insertion:deletion ratios
etc.
---
Commandline
Main DepthSizer run options
seqin=FILE
: Input sequence assembly [None
]
basefile=FILE
: Root of output file names [gapspanner or $SEQIN basefile
]
summarise=T/F
: Whether to generate and output summary statistics sequence data before and after processing [True
]
genomesize=INT
: Haploid genome size (bp) [0
]
bam=FILE
: BAM file of long reads mapped onto assembly [$BASEFILE.bam
]
reads=FILELIST
: List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST
: List of ont/pb/hifi file types matching reads for minimap2 mapping [ont
]
dochtml=T/F
: Generate HTML DepthSizer documentation (*.docs.html) instead of main run [False
]
tmpdir=PATH
: Path for temporary output files during forking (not all modes) [./tmpdir/
]
Genome size prediction options
busco=TSVFILE
: BUSCO full table [full_table_$BASEFILE.busco.tsv
]
readbp=INT
: Total combined read length for depth calculations (over-rides reads=FILELIST
) []
adjustmode=X
: Map adjustment method to apply (None/CovBases/IndelRatio/MapBases/MapAdjust/MapRatio/OldAdjust/OldCovBases) [IndelRatio
]
quickdepth=T/F
: Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False
]
covbases=T/F
: Whether to calculate predicted minimum genome size based on mapped reads only [True
]
mapadjust=T/F
: Whether to calculate mapadjust predicted genome size based on read length:mapping ratio [False
]
benchmark=T/F
: Activate benchmarking mode and also output the assembly size and mean depth estimate [False
]
legacy=T/F
: Whether to perform Legacy v1.0.0 (Diploidocus) calculations [False
]
depdensity=T/F
: Whether to use the BUSCO depth density profile in place of modal depth in legacy mode [True
]
depadjust=INT
: Advanced R density bandwidth adjustment parameter [12
]
seqstats=T/F
: Whether to output CN and depth data for full sequences as well as BUSCO genes [False
]
Forking options
forks=X
: Number of parallel sequences to process at once [0
]
killforks=X
: Number of seconds of no activity before killing all remaining forks. [36000
]
killmain=T/F
: Whether to kill main thread rather than individual forks when killforks reached. [False
]
logfork=T/F
: Whether to log forking in main log [False
]
History Module Version History
# 0.0.0 - Initial Compilation.
# 1.0.0 - Initial working version of DepthSizer based on Diploidocus v0.16.2.
# 1.1.0 - Updated to use updated ReadCore and DepthCopy code, plus new Indel ratio calculation (dev=T).
# 1.2.0 - Added CovBases lower depthsizer estimate output based solely on mapped reads. Fixed unmapped read bug.
# 1.2.1 - Fixed major flaw in indelratio calculation.
# 1.3.0 - Added adjust string and benchmark=T option for additional calculations. IndelRatio no longer dev only.
# 1.3.1 - Tweaked some input checks and log output. Replaced indelratio sort -u with uniq for speed and memory.
# 1.4.0 - Added seqstats=T/F : Whether to output CN and depth data for full sequences as well as BUSCO genes [False]
# 1.4.1 - Added citation and fixed minor output typo.
# 1.4.2 - Fixed bug that causes clashes with v5 full_table.bed files.
# 1.5.0 - Add additional map adjustment variants:
# - MapAdjust2 = allbases, not covbases
# - MapBases = Use map bases, not covbases for min read volumne
# - MapRatio = Use mapbases adjusted by indelratio
# 1.6.0 - Disable legacy mode using Diploidocus.
# 1.6.1 - Bug fixes to underlying R script and related core codebase.
# 1.6.2 - Updated citation to Mol Ecol Res paper.
# 1.6.3 - Fixed R code bug. Added bamcsi=T/F to use CSI indexing.
depthsizer REST Output formats
Run with
&rest=docs
for program documentation and options. A plain text version is accessed with
&rest=help
.
&rest=OUTFMT
can be used to retrieve individual parts of the output, matching the tabs in the default
(
&rest=format
) output. Individual
OUTFMT
elements can also be parsed from the full (
&rest=full
) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
Available REST Outputs
There is currently no specific help available on REST output for this program.