SLiMSuite REST Server


Links
REST Home
EdwardsLab Homepage
EdwardsLab Blog
SLiMSuite Blog
SLiMSuite
Webservers
Genomes
REST Pages
REST Status
REST Help
REST Tools
REST Alias Data
REST API
REST News
REST Sitemap

Diploidocus V1.2.0

Diploid genome assembly analysis toolkit

Module: Diploidocus
Description: Diploid genome assembly analysis toolkit
Version: 1.2.0
Last Edit: 22/03/22
Nala Citation: Edwards RJ et al. (2021), BMC Genomics
DipNR Citation: Stuart KC, Edwards RJ et al. (preprint), bioRxiv 2021.04.07.438753; [doi: 10.1101/2021.04.07.438753]
Tidy Citation: Chen SH et al. & Edwards RJ (2022): Mol. Ecol. Res. [doi: 10.1111/1755-0998.13574]
GitHub: https://github.com/slimsuite/diploidocus

Copyright © 2020 Richard J. Edwards - See source code for GNU License Notice


Imported modules: rje rje_db rje_forker rje_obj rje_rmd rje_seqlist rje_sequence rje_paf rje_genomics rje_kat rje_readcore depthkopy depthsizer rje_blast_V2 smrtscape slimfarmer


See SLiMSuite Blog for further documentation. See rje for general commands.

Function

Diploidocus is a sequence analysis toolkit for a number of different analyses related to diploid genome assembly. The main suite of analyses combines long read depth profiles, short read kmer analysis, assembly kmer analysis, BUSCO gene prediction and contaminant screening for a number of assembly tasks including contamination identification, haplotig identification/removal and low quality contig/scaffold trimming/filtering.

In addition, Diploidocus will use mapped long reads and BUSCO single copy read depths for genome size prediction (gensize), and coverage (regcheck) or copy number estimation (regcnv) for user-defined regions. Diploidocus also has functions for removing redundancy (sortnr), generating a non-redundant pseudo-diploid assembly with primary and secondary scaffolds from 10x pseudohap output (diphap), and creating an in silico diploid set of long reads from two haploid parents (for testing phasing etc.) (insilico).

Please note that Diploidocus is still in development and documentation is currently a bit sparse. Like its (almost) namesake, Diploidocus has developed into a large, lumbering beast and is in the process of being split up into a number of specialised tools. Core mechanics will still be performed by Diploidocus but some of the documentation will be moved to other programs and sites.

The different run modes are set using runmode=X:

  • diploidocus default run mode will run gensize, telomeres, vecscreen, deptrim and purgehap analysis
  • dipcycle runs iterative cycles of diploidocus mode until convergence (no more filtering) is reached
  • gensize uses BUSCO results, a BAM file and read file(s) to predict the genome size of the organism
  • purgehap filters scaffolds based on post-processing of purge_haplotigs
  • telomeres performs a regex telomere search based on method of https://github.com/JanaSperschneider/FindTelomeres
  • vecscreen searches for contaminants and flags/masks/removes identified scaffolds
  • deptrim trims sequence termini of at least mintrim=INT bp with less than deptrim=INT read depth
  • regcheck checks reads spanning given regions and also calculates mean depth and estimated copy number (if regcnv=T)
  • regcnv calculates mean depth and estimated copy number for regcheck regions from BAM file (no read spanning analysis)
  • gapspan is a specialised regcheck analysis that checks for reads spanning genome assembly gaps
  • gapass extends the gapspan mode to extract the reads spanning a gap and re-assemble them using flye
  • gapfill extends the gapass mode to map re-assembled gaps back onto the assembly and replace gaps spanned by the new contigs
  • sortnr performs an all-by-all mapping with minimap2 and then removes redundancy
  • diphap splits a pseudodiploid assembly into primary and alternative scaffolds
  • diphapnr runs sortnr followed by diphap
  • insilico generates balanced diploid combined reads from two sequenced haploid parents

See <https://slimsuite.github.io/diploidocus/> for details of each mode. General SLiMSuite run documentation can be found at <https://github.com/slimsuite/SLiMSuite>.

Diploidocus is available as part of SLiMSuite, or via a standalone GitHub repo at <https://github.com/slimsuite/diploidocus>.

Run Modes

Main Diploidocus filtering [runmode=diploidocus]

Diploidocus builds on the PurgeHaplotigs classifications to use the depth bins, KAT assembly kmer frequencies and
BUSCO results to reclassify scaffolds and partition them into:

  • *.diploidocus.fasta = the scaffolds kept for the next round of PurgeHap
  • *.core.fasta = the same set of scaffolds, minus repeats
  • *.repeats.fasta = the repeat scaffolds excluded from core
  • *.junk.fasta = low coverage scaffolds, removed as junk
  • *.quarantine.fasta = putative haplotigs, removed from the assembly but retained for reference.

NOTE: PurgeHaplotigs was not using zero coverage bases in its percentages. This is now fixed by Diploidocus.

See main docs for details.

---

Cycled Diploidocus filtering [runmode=dipcycle]

Diploidocus can be automated to cycle through repeated rounds of the main purge_haplotigs/filtering method until
no further scaffolds are removed. Each cycle will run Diploidocus with a numerical suffix, e.g. $BASEFILE.1.*,
using the *.diploidocus.fasta output from the previous cycle as input. Cycling will continue until no further
scaffolds are filtered into either *.quarantine.fasta or *.junk.fasta.

Output for each cycle will be initially generated in the run directory but then moved to a dipcycle_$BASEFILE
directory upon completion.

Final outputs from the final cycle will then be compiled under the original $BASEFILE prefix:

  • $BASEFILE.diploidocus.tdt = Final ratings for the input scaffolds. This is like the single cycle output with an additional Cycle field containing the last cycle this scaffold was processed in.
  • $BASEFILE.ratings.tdt = Reduced final ratings output for the input scaffolds (SeqName,SeqLen,ScreenPerc,Class,Rating,Cycle).
  • $BASEFILE.diploidocus.fasta = the scaffolds kept from the final Diploidocus cycle
  • $BASEFILE.core.fasta = the same set of scaffolds, minus repeats
  • $BASEFILE.repeats.fasta = the repeat scaffolds excluded from core
  • $BASEFILE.quarantine.fasta = concatenated purged scaffolds from all Diploidocus cycles.
  • $BASEFILE.junk.fasta = concatenated low coverage and low quality scaffolds, removed as junk, from all cycles.

NOTE: Contents for these four *.fasta files are summarised in the main log. Individual purge cycles have their own
log files in the dipcycle_$BASEFILE directory.

See runmode=diploidocus documentation for more details.

---

Genome size prediction [runmode=gensize]

This works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid read
depth) regions along with some ooor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc.
are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the
region is actually diploid coverage, the modal read depth is expected to represent the actual single copy
read depth.

Diploidocus uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth
and extracts the modal read depth for each BUSCO gene along with the overall modal read depth for all gene
regions. Genome size is then estimated based on a crude calculation using the total combined sequencing length.
This will be caculated from reads=FILELIST unless provided with readbp=INT.

BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default
full_table_$BASEFILE.busco.tsv). This can be replaced with any table with the fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status = Complete
and the Contig, Start and End fields are used to define the regions that should be predominantly single copy.

NOTE: The current genome size prediction appears to be an over-estimate. There is currently no adjustment for
contamination. The mapadjust option attemtps to correct for read mapping and imbalanced insertion:deletion ratios
etc. but has not been extensively tested.

---

Running Purge_haplotigs using BUSCO-guided cutoffs [runmode=purgehap]

This runs just the Purge_haplotigs part of the main Diploidocus workflow. See main docs for details.

---

Telomere finding [runmode=telomere]

Diploidocus performs a regex-based search for Telomeres, based on [FindTelomeres](https://github.com/JanaSperschneider/FindTelomeres).
By default, this looks for a canonical telomere motif of TTAGGG/CCCTAA, allowing for some variation. (See main docs
to change telomere sequence.) For each sequence, Diploidocus trims off any trailing Ns and then searches for
telomere-like sequences at sequence ends. For each sequence, the presence/absence and length of trimming are reported
for the 5' end (tel5 and trim5) and 3' end (tel3 and trim3), along with the total percentage telomeric sequence (TelPerc).

By default, Diploidocus searches for a forward telomere regex sequence of C{2,4}T{1,2}A{1,3} at the 5' end, and a
reverse sequence at the 3' end of T{1,3}A{1,2}G{2,4}. These can be set with telofwd=X and telorev=X. Telomeres are
marked if at least 50% (teloperc=PERC) of the terminal 50 bp (telosize=INT) matches the appropriate regex. If either
end contains a telomere, the total percentage of the sequence matching either regex is calculated as TelPerc. Note
that this number neither restricts matches to the termini, not includes sequences within predicted telomeres that do
not match the regex. By default, only sequences with telomeres are output to the *.telomeres.tdt output, but
switching telonull=T will output all sequences.

---

Vector/contamination screening [runmode=vecscreen]

See main docs

---

Depth trimming [runmode=deptrim]

Depth trimming (deptrim) mode trims sequence termini of at least mintrim=INT bp with less than deptrim=INT
read depth. First, samtools mpileup or depth (depmethod=X) is used to find the first and last positions that
exceed deptrim=INT. If no positions meet this criterio, the entire sequence will removed. Otherwise, either
terminus that exceeds mintrim=INT base pairs of insufficent read depth are trimmed.

---

Region checking [runmode=regcheck]

Region checking, whether for read spanning analysis (runmode=regcheck) or copy number analysis
(runmode=regcnv or runmode=regcheck regcnv=T), analyses regions extracted from a delimited file given by:
regcheck=FILE. This can be a GFF file, in which case gfftype=LIST will restrict analysis to a specific feature
type, or regions can be defined by checkfields=LIST, which defines the locus, start and end positions. These
fields default to SeqName, Start and End fields. If these fields cannot be found, the first three fields
of the regcheck=FILE file will be used.

Long read data, given with the reads=FILELIST and readtype=LIST options, are then mapped onto the assembly
(seqin=FILE) using minimap2 to generate a PAF file. This is then parsed and reads spanning each feature based
on their positions and the target start and end positions in the PAF file. In addition to absolute spanning of
regions, reads spanning a region +/- distances set by checkflanks=LIST will also be calculated. If the end of a
sequence is reached before the end of the read, this will also count as flanking. Such regions can be identified
using the MaxFlank5 and MaxFlank3 fields, which identify the maximum distance 5' and 3' that a read can span
due to sequence length constraints.

If regcnv=T then the region copy number analysis (below) will also be performed.

NOTE: Depth calculations are performed in parallel in the directory set with tmpdir=PATH. The number of
parallel calculations is set by forks=INT.

---

Region copy number analysis [runmode=regcnv]

Copy number analysis uses the same single copy depth profile analysis as the runmode=gensize genome size
prediction. In short, the modal read depth of BUSCO single copy Complete genes is calculated using samtools
mpileup (or samtools depth if quickdepth=T) and used to defined "single copy read depth". BUSCO single-copy
genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default
full_table_$BASEFILE.busco.tsv). This can be replaced with any table with the fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with
Status = Complete and the Contig, Start and End fields are used to define the regions that should be
predominantly single copy.

Single-copy read depth can also be set using scdepth=NUM to re-used or over-ride previously calculated values.

Long read data, given with the reads=FILELIST and readtype=LIST options, are then mapped onto the assembly
(seqin=FILE) using minimap2. This can be skipped by providing a BAM file with bam=FILE. For each regcheck
feature, the same samtools depth calculation is performed as for the BUSCO data. The mean depth across the region
is then divided by the single copy depth to estimate total copy number across the region. Note that unlike the
single copy depth estimation itself, this might be biased by repeat sequences etc. that have a different depth
profile to the main region. One way to control for this might be to restrict analysis to a subset of reads that
meet a certain minimum length cutoff, e.g. 10kb.

Query-based CNV analysis. If the regcheck=FILE file has additional Qry, QryLen, QryStart and QryEnd
fields, the copy number analysi will have an additional query-focus. In this case, each region mapping onto a
specific query is summed up, adjusted for the proportion of the query covered by that region. For example, 3.5X
mean depth of a 100% length copy and 3.0X coverage of a 50% length copy would sum to (3.5x1.0 + 3x0.5 = 5 total
copies). If these fields are not present, each region will be analysed independently.

NOTE: Depth calculations are performed in parallel in the directory set with tmpdir=PATH. The number of
parallel calculations is set by forks=INT.

---

Assembly gap read-spanning analysis [runmode=gapspan]

This mode first identifies all the gaps in an assembly (seqin=FILE) (using SeqList gapstats or $SEQBASE.gaps.tdt if pre-
existing) and then runs the read spanning analysis (runmode=regcheck) with regcnv=F. Long read data, given
with the reads=FILELIST and readtype=LIST options, are mapped onto the assembly using minimap2 to generate a PAF file.
This is then parsed and reads spanning each gap are identified based on their positions and the target start and end positions in the PAF file.
In addition to absolute spanning of regions, reads spanning a region +/- distances set by checkflanks=LIST will also be calculated. If the end of a
sequence is reached before the end of the read, this will also count as flanking. Such regions can be identified
using the MaxFlank5 and MaxFlank3 fields, which identify the maximum distance 5' and 3' that a read can span
due to sequence length constraints.

Spanning spanid output is also generated for each gap and saved in $BASEFILE_spanid. Each gap will be named:
seqname.start-end.

This mode is now primarily documented and updated through GapSpanner.

---

Assembly gap re-assembly [runmode=gapass]

In addition to the gapspan analysis, reads identified as spanning each gap are extracted and assembled using flye
in a $BASEFILE__gapassemble/ output directory.

This mode is now primarily documented and updated through GapSpanner.

---

Re-assembled gap-filling [runmode=gapfill]

In addition to the gapspan and gapass outputs, re-assembled gap regions are compiled into a single file and then
mapped back on the original assembly using Minimap2, with tabulated hit output into $BASEFILE__gapfill/. Local hits
are reduced to unique coverage of the assembly sequences. Gaps are filled if one of the two conditions are met:

1. A single local alignment spans an entire gap.
2. A pair of concordant local alignments from the same re-assembly contig (same orientation) flank an entire gap.

In the case of a single spanning local alignment, the entire assembly region is replaced by the corresponding
re-assembly contig region. For a pair of hits, the region between the two hits is replaced.

This mode is now primarily documented and updated through GapSpanner.

---

Sorted non-redundant assembly cleanup [runmode=sortnr]

The sorted non-redundant assembly cleanup mode (runmode=sortnr) screens out any sequences that are 100% gap,
then removes any sequences that are 100% redundant with other sequences in the input. This includes full and
partial matches, i.e. if sequence X is wholly contained within sequence Y then X will be removed.

First, sequences are loaded from the file given with seqin=FILE and any [rje_seqlist](http://rest.slimsuite.unsw.edu.au/seqlist)
filters and sequence sorting are applied to the input. Sequences that are 100% Ns are removed and any gaps
exceeding 10 nt are reduced to 10 Ns (NNNNNNNNNN) to prevent minimap2 from splitting sequences on long gaps.
These gap-reduced sequences are output to $BASEFILE.tmp.fasta and used for an all-by-all minimap2 search.

By default, minimap2 is run with the options to generate a $BASEFILE.tmp.paf file:

--cs -p 0.0001 -t 4 -x asm20 -N 250

To modify minimap2 search settings, please see the [rje_paf](http://rest.slimsuite.unsw.edu.au/rje_paf)
documentation.

NOTE: These run options can probably be made more stringent to speed up minimap2 without loss of function.
Future releases may alter defaults accordingly.

Minimap2 output is parsed to identify scaffold-scaffold matches. Self-hits are ignored.
The minimum (gap-reduced) sequence length is used as a rapid parsing filter: any minimap2 matches that are less
than 95% of the query sequence (Length+nn fields) or less that 100% identity (Identity+nn)/(Length+nn)
are filtered during parsing.

NOTE: Future releases may feature an option to reduce the global percentage identity cut-off. Please contact
the author if you wish to see this implemented.

Minimap2 hits are then processed reverse-sorted by reference sequence size (e.g. scaffold length). Any hits
where either sequence has already been filtered are skipped. Otherwise, if the match (as determined by the
length of : regions in the CS string) matches the query length, the Query sequence will be flagged for
remove as "identical to" or "contained within" the Hit. (Mutually partial overlapping exact matches are NOT
filtered.) Filtered IDs and their matches are output to $BASEFILE.redundant.txt.

Once all sequences have been filtered, the remaining sequences are output to: $BASEFILE.nr.fasta.

NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order,
add the sortseq=invsize command to the Diploidocus run command.

Finally, the input and output files are summarised (unless summarise=F) and statistics output to:
$BASEFILE.summarise.tdt.

Temporary gap-reduced and minimap2 PAF files are deleted unless running in debug or
dev modes.

---

Pseudodiploid to primary and alternative haploptigs [runmode=diphap(nr)]

This protocol is based on 10x assemblies made for multiple organisms with supernova v2.0.0 and supernova v2.1.1.
In each case, some redundancy in output was discovered (a) within pseudohap output, and (b) in terms of fully
homozygous (identical) scaffolds shared by both haplotigs. It was also not entirely clear on what basis a
particular haplotype was assigned to pseudohap1 or pseudohap2.

The general workflow therefore sought to remove redundancy, generate a set of primary scaffolds based on scaffold
length, and generate a non-redundant set of alternative scaffolds where heterozygosity exists. If diphapnr mode
is used, the full workflow is implemented by first running the sortnr workflow described above. In the reduced
diphap mode, redundancy is not removed first.

Sequences are loaded and matching haplotigs identified based on their names. Sequence names MUST end HAP(\d+),
where (\d+) indicates an integer that matches up haplotigs (as produced by supernova pseudohap2 output, for
example). This is not a pipeline to identify haplotig pairs, it is purely for splitting identified
haplotigs into primary and alternative assemblies.

Processing itself is quite simple. Haplotig pairs are identified based on matching HAP(\d+) numbers. Where a
single haplotig is found, it is assigned as diploid, under the assumption that the two haplotigs were identical
and one was removed. (It is possible that only one parent had this scaffold, e.g. sex chromosomes, so some post-
processing of descriptions may be required.) If two haplotigs with the same number are identified, the longest
is assigned to haploidA and the shorter haploidB.

The Primary Assemmbly is then compiled from all haploidA and diploid sequences. These are given pri
prefixes and output to $BASEFILE.pri.fasta. The Alternative comprised of all haploidB sequences is output
to $BASEFILE.alt.fasta. If redundancy has been removed, this will likely be a subset of the full assembly. The
combined set of all primary and alternative sequences is output to $BASEFILE.dipnr.fasta.

NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order,
add the sortseq=invsize command to the Diploidocus run command.

Finally, the input and output files are summarised (unless summarise=F) and statistics output to
$BASEFILE.summarise.tdt:

  • $BASEFILE.dipnr.fasta = Combined pseudodiploid with haploidA, haploidB and diploid annotation.
  • $BASEFILE.pri.fasta = Primary assembly with haploidA and diploid sequences.
  • $BASEFILE.alt.fasta = Alternative assembly with haploidB sequences.

---

In silico diploid generator [runmode=insilico]

This module generates balanced "in silico diploid" PacBio subread data from two sequenced haploid parents. Each
parent must first be run through SMRTSCAPE to generate subread summary data. (This will be performed if missing. Each
parent needs a *.fofn file of subread file names, *.unique.tdt unique subreads table and *.smrt.tdt SMRT cell
identifier table.)

A new set of subreads is then generated from the combined set of parent subreads. This is done by first ranking the
unique subreads from each parent by length. First, the longest subread from each parent are compared and the shortest
selected to be the first subread of the diploid. (The shortest is taken to minimise length differences between the
two parents.) Next, the longest subread from the next parent that is no longer than the previous subread is added.
This cycles, picking a read from the the parent with fewest cumulative bases each cycle. The longest subread that is
no longer than the previous subread is selected. This continues until one parent runs out of subreads. Additional
subreads will be added from the other parent if they reduce the difference in cumulative output for each parent, or
until lenfilter=X is reached.

Final output will be a *.LXXXRQXX.fasta file in which each parent has a similar total sequence content and for
which the subread length distributions should also be similar. This is to overcome biases in resulting diploid
assemblies, where one parent has higher quality data than the other.

NOTE: If performing downstream filtering by Read Quality (RQ), this might reintroduce a bias if one parent has much
higher RQ values than the other. The rqfilter=X setting can therefore be used to restrict output to reads with a
minimum RQ value. By default this is 0.84. If you do not get enough sequence output, this setting may need to be
relaxed. Similarly, only sequences above lenfilter=X in length will be output. These are the figures given in the
LXXXRQXX part of the output file, e.g. defaults of RQ>=0.84 and Len>=500 generates *.L500RQ84.fas.

Commandline

Main Diploidocus run options

seqin=FILE : Input sequence assembly [None]
runmode=X : Diploidocus run mode (diploidocus/dipcycle/gensize/purgehap/telomeres/vecscreen/regcheck/regcnv/deptrim/sortnr/diphap/diphapnr/insilico) [diploidocus]
basefile=FILE : Root of output file names [diploidocus or $SEQIN basefile]
summarise=T/F : Whether to generate and output summary statistics sequence data before and after processing [True]
genomesize=INT : Haploid genome size (bp) [0]
scdepth=NUM : Single copy ("diploid") read depth. If zero, will use SC BUSCO mode [0]
bam=FILE : BAM file of long reads mapped onto assembly [$BASEFILE.bam]
paf=FILE : PAF file of long reads mapped onto assembly [$BASEFILE.paf]
reads=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
dochtml=T/F : Generate HTML Diploidocus documentation (*.docs.html) instead of main run [False]
tmpdir=PATH : Path for temporary output files during forking (not all modes) [./tmpdir/]

Genome size prediction options

busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv]
readbp=INT : Total combined read length for depth calculations (over-rides reads=FILELIST) []
quickdepth=T/F : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False]
depdensity=T/F : Whether to use the BUSCO depth density profile in place of modal depth [True]
mapadjust=T/F : Whether to adjust predicted genome size based on read length:mapping ratio [False]

DiploidocusHocusPocus and Purge haplotigs options

kmerreads=FILELIST : File of high quality reads for KAT kmer analysis []
10xtrim=T/F : Whether to trim 16bp 10x barcodes from Read 1 of Kmer Reads data for KAT analysis [False]
minmedian=INT : Minimum median depth coverage to avoid low coverage filter [3]
minlen=INT : Minimum scaffold length to avoid low quality filter [500]
phlow=INT : Low depth cutoff for purge_haplotigs (-l X). Will use SCDepth/4 if zero. [0]
phmid=INT : Middle depth for purge_haplotigs (-m X). Will derive from SCDepth if zero. [0]
phhigh=INT : High depth cutoff for purge_haplotigs (-h X). Will use SCDepth x 2 if zero. [0]
zeroadjust=T/F : Add zero coverage bases to purge_haplotigs LowPerc and adjust total [True]
includegaps=T/F : Whether to include gaps in the zero coverage bases for adjustment (see docs) [False]
mingap=INT : Minimum length of a stretch of N bases to count as a gap for exclusion [10]
purgemode=X : Rules used for purgehap analysis (simple/complex/nala) [complex]
diploidify=T/F : Whether to generate alternative diploid output with duplicated diploid contigs and no hpurge [False]
pretrim=T/F : Run vectrim/vecmask and deptrim trimming prior to diploidocus run [False]
maxcycle=INT : Restrict run to maximum of INT cycles (0=No limit) [0]
purgecyc=INT : Minimum number of purged sequences to trigger next round of dipcycle [2]

Depth Trim options

deptrim=INT : Trim termini with <X depth [1]
mintrim=INT : Min length of terminal depth trimming [1000]

Telomere options

telofwd=X : Regex for 5' telomere sequence search [C{2,4}T{1,2}A{1,3}]
telorev=X : Regex for 5' telomere sequence search [T{1,3}A{1,2}G{2,4}]
telosize=INT : Size of terminal regions (bp) to scan for telomeric repeats [50]
teloperc=PERC : Percentage of telomeric region matching telomeric repeat to call as telomere [50]
telonull=T/F : Whether to output sequences without telomeres to telomere table [False]

VecScreen options

screendb=FILE : File of vectors/contaminants to screen out using blastn and VecScreen rules []
screenmode=X : Action to take following vecscreen searching (report/purge) [report]
minvechit=INT : Minimum length for a non-identical screendb match [50]
minidhit=INT : Minimum length for an identical screendb match (based on NCBI filtering) [27]
efdr=NUM : Expected FDR threshold for VecScreen queries (0 is no filter) [1.0]
vecpurge=PERC : Remove any scaffolds with >= PERC % vector coverage [50.0]
vectrim=INT : Trim any vector hits (after any vecpurge) within INT bp of the nearest end of a scaffold [1000]
vecmask=INT : Mask any vector hits of INT bp or greater (after vecpurge and vectrim) [900]
maskmode=X : Whether to perform full (all bases) or partial (every other base) masking of hits [partial]
keepnames=T/F : Whether to keep names unchanged for edited sequences or append 'X' [False]
veccheck=T/F : Check coverage of filtered contaminant hits using reads=FILELIST data [False]

Region checking options

regcheck=FILE : File of SeqName, Start, End positions for read coverage checking [None]
checkfields=LIST: Fields in checkpos file to give Locus, Start and End for checking [Hit,SbjStart,SbjEnd]
checkflanks=LIST: List of lengths flanking check regions that must also be spanned by reads [0,100,1000,5000]
spanid=X : Generate sets of read IDs that span veccheck/regcheck regions, grouped by values of field X []
regcnv=T/F : Whether to calculate mean depth and predicted CNV of regcheck regions based on SCdepth [True]
gfftype=LIST : Optional feature types to use if performing regcheck on GFF file (e.g. gene) ['gene']
subforks=INT : Number of forks for assembly subproccesses during gapfill and gapass modes [1]
mingapspan=INT : Minimum number of reads spanning a gap in order to re-assemble [2]
minlocid=PERC : Minimum percentage identity for aligned chunk to be kept (local %identity) [0]
minloclen=INT : Minimum length for aligned chunk to be kept (local hit length in bp) [500]

SortNR filtering/output options

checkcov=PERC : Percentage coverage for double-checking partial exact matches [95]
seqout=FILE : Output sequence assembly [$BASEFILE.nr.fasta]

In silico diploid input/output options

rqfilter=X : Minimum RQ for output subreads [0]
lenfilter=X : Min read length for filtered subreads [500]
parent1=FOFN : File of file names for subreads fasta files on Parent 1. []
parent2=FOFN : File of file names for subreads fasta files on Parent 2. []

See also SMRTSCAPE summarise=T options if *.unique.tdt/*.smrt.tdt have not been pre-generated with SMRTSCAPE.

Advanced/Developmental options

memperthread=INT: Number of Gb per thread to allocate to samtools etc. [6]
useqsub=T/F : Whether to use qsub to queue up system calls [False]
qsubvmem=INT : Memory setting (Gb) when queuing with qsub [126]
qsubwall=INT : Walltime setting (hours) when queuing with qsub [12]
modules=LIST : List of modules that needs to be loaded for running with qsub []
legacy=T/F : Run Legacy modes of updated methods that have been farmed out to other programs [False]


History Module Version History

    # 0.0.0 - Initial Compilation.
    # 0.1.0 - Fixed bugs with parent basefile, genome size default and Sequel data parsing.
    # 0.2.0 - Added sortnr minimap2 run mode, and diphap primary/alternative assembly splitting.
    # 0.3.0 - Added summarising of sequences post-run. Improved documents. Moved to tools/.
    # 0.4.0 - Added VecScreen mode.
    # 0.5.0 - Added Telomere finding mode based on https://github.com/JanaSperschneider/FindTelomeres.
    # 0.6.0 - Added GenomeSize option.
    # 0.7.0 - Added PurgeHaplotigs, minimum vescreen hit length, eFDR and vecscreen coverage. [GitHub]
    # 0.7.1 - Code tidying and documentation improvements. Added dipcycle subdirectory and minlen=INT.
    #       - Fixed purge_haplotigs temp directory issues by running in subdirectory.
    #       - Added diploidify=T/F : Whether to generate alternative diploid output with duplicated diploid contigs and no hpurge [False]
    # 0.8.0 - Added deptrim=INT mode for trimming the ends of contigs based on low depth.
    # 0.8.1 - Fixed failure to screen contaminants in nala, simple and crude purgemode. Add dev minimap2 fix.
    # 0.8.2 - Shifted LOWQUAL to quarantine rather than junk. Fixed eFDR calculation error (inverse ranking).
    #       - Added veccheck=T/F : Check coverage of filtered contaminant hits using reads=FILELIST data [False]
    # 0.9.0 - Added pretrim, vecpurge, vecmask and vectrim options. Reduced screenmode to report/purge.
    # 0.9.1 - Fixed partial vecpurge when setting parameters to zero.
    # 0.9.2 - Added keepnames=T/F : Whether to keep names unchanged for edited sequences or append 'X' [False]
    # 0.9.3 - Fixed termination of program when BUSCO gene mpileup goes wrong (unless debug=T).
    # 0.9.4 - Added capacity to pick up an aborted dipcycle run. Added maxcycle=INT setting to limit run times.
    # 0.9.5 - Added use of qsub for system calls and memperthread=INT to control max memory.
    # 0.9.6 - Dropped default vecmask to 900bp as 1kb length matches picked up by NCBI are sometimes shorter on scaffold.
    # 0.9.7 - Check for | characters in the input sequences. (Will break the program.)
    # 0.9.8 - Added additional Mode of Modes genomes size prediction to log.
    # 0.10.0 - Added regcheck=TDTFILE and regcnv=T/F functions (runmode=regcheck)
    # 0.10.1 - Added gfftype=X optional feature type to use if performing regcheck on GFF file (e.g. gene) []
    # 0.10.2 - Fixed GFF regcheck bug. Add some documentation. Made gfftype=LIST not X.
    # 0.10.3 - Added BUSCO Complete Copy Number calculation and CI estimate to GenomeSize (and RegCNV). Made default gfftype=gene
    # 0.10.4 - Added improved SCdepth calculation option using BUSCO depth density profile.
    # 0.10.5 - Added 95% CI estimates to CNV calculations if BUSCO depth profiles are present.
    # 0.10.6 - Fixed GenomeSize bug with zero coverage BUSCO genes (e.g. using different long reads from assembly)
    # 0.10.7 - Updated the depmode.R script to limit analysis to one depmethod.
    # 0.11.0 - Updated defaults to depdensity=T for genome size prediction.
    # 0.12.0 - Added gapspan function (regcheck but first makes the gaps table, then loads this in, and outputs reads per gap.)
    # 0.13.0 - Added full gap-filling to gapfill function.
    # 0.14.0 - Added HiFi read type.
    # 0.14.1 - Added GapSpanner sequence summary to end.
    # 0.14.2 - Fixed scdepth re-use printing bug.
    # 0.15.0 - Added mapadjust=T/F   : Whether to adjust predicted genome size based on read length:mapping ratio [False]
    # 0.16.0 - Updated vecscreen mode to have two-tier length filter, and partial masking to avoid splitting contigs.
    # 0.16.1 - Added *.repeats.fasta output of the non-score diploidocus sequences.
    # 0.16.2 - Minor bug fix catching Class key error. Added logo to docs.
    # 0.16.3 - Updated some of the documentation.
    # 0.16.4 - Fixed a bug where pretrim of vecscreen results will cause BUSCO genes to be missed during classification.
    # 0.17.0 - Added purgecyc=INT : Minimum number of purged sequences to trigger next round of dipcycle [2]
    # 0.17.1 - Minor tweaks to log output.
    # 0.17.2 - Stopped CSI indexing from crashing Diploidocus but not compatible with PurgeHaplotigs.
    # 0.18.0 - Implementation of density-based CNV estimation (dev=T).
    # 1.0.0 - Updated to use rje_kat and rje_readcore and made v1.0.0 in line with Stuart et al. publication.
    # 1.0.1 - Bug fixes for GFF checkpos.
    # 1.1.0 - Fixed TeloRev sequence for finding 3' telomeres. Add reporting of telomere lengths.
    # 1.1.1 - Fixed DepthSizer object bug for DipCycle.
    # 1.1.2 - Updated Tidy citation to Mol Ecol Res paper.
    # 1.1.3 - Fixed Rscript finding for standalone repo.
    # 1.1.4 - Minor bug fixes for read mapping and depth analyses.
    # 1.2.0 - Altered telomere output table to use SeqName not Name, for ChromSyn compatibility. Added telonull=T/F.

Diploidocus REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...

Available REST Outputs

There is currently no specific help available on REST output for this program.

© 2015 RJ Edwards. Contact: richard.edwards@unsw.edu.au.