Run Modes
Main Diploidocus filtering [`runmode=diploidocus`]
Diploidocus builds on the PurgeHaplotigs classifications to use the depth bins, KAT assembly kmer frequencies and
BUSCO results to reclassify scaffolds and partition them into:

* `*.diploidocus.fasta` = the scaffolds kept for the next round of PurgeHaplotigs
* `*.core.fasta` = the same set of scaffolds, minus repeats
* `*.repeats.fasta` = the repeat scaffolds excluded from core
* `*.junk.fasta` = low coverage scaffolds, removed as junk
* `*.quarantine.fasta` = putative haplotigs, removed from the assembly but retained for reference

NOTE: PurgeHaplotigs was not using zero coverage bases in its percentages. This is now fixed by Diploidocus.
See the main docs for details.
---
Cycled Diploidocus filtering [`runmode=dipcycle`]
Diploidocus can be automated to cycle through repeated rounds of the main purge_haplotigs/filtering method until
no further scaffolds are removed. Each cycle will run Diploidocus with a numerical suffix, e.g. `$BASEFILE.1.*`,
using the `*.diploidocus.fasta` output from the previous cycle as input. Cycling will continue until no further
scaffolds are filtered into either `*.quarantine.fasta` or `*.junk.fasta`.

Output for each cycle will initially be generated in the run directory but then moved to a `dipcycle_$BASEFILE`
directory upon completion. Final outputs from the final cycle will then be compiled under the original `$BASEFILE`
prefix:
* `$BASEFILE.diploidocus.tdt` = Final ratings for the input scaffolds. This is like the single-cycle output with an
additional `Cycle` field containing the last cycle in which each scaffold was processed.
* `$BASEFILE.ratings.tdt` = Reduced final ratings output for the input scaffolds
(`SeqName`, `SeqLen`, `ScreenPerc`, `Class`, `Rating`, `Cycle`).
* `$BASEFILE.diploidocus.fasta` = the scaffolds kept from the final Diploidocus cycle
* `$BASEFILE.core.fasta` = the same set of scaffolds, minus repeats
* `$BASEFILE.repeats.fasta` = the repeat scaffolds excluded from core
* `$BASEFILE.quarantine.fasta` = concatenated purged scaffolds from all Diploidocus cycles.
* `$BASEFILE.junk.fasta` = concatenated low coverage and low quality scaffolds, removed as junk, from all cycles.
NOTE: Contents for these four `*.fasta` files are summarised in the main log. Individual purge cycles have their own
log files in the `dipcycle_$BASEFILE` directory.
See the `runmode=diploidocus` documentation for more details.
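
The cycling logic amounts to a simple convergence loop. Below is a minimal sketch in Python, assuming a
hypothetical `run_diploidocus(seqin, prefix)` wrapper that performs one filtering cycle (this is not the actual
Diploidocus API; the real pipeline also handles logging and moving outputs into `dipcycle_$BASEFILE`):

```python
from pathlib import Path

def dipcycle(basefile, run_diploidocus):
    """Cycle Diploidocus until no scaffolds are quarantined or junked.

    run_diploidocus(seqin, prefix) is assumed to run one filtering cycle
    and write prefix.diploidocus/quarantine/junk.fasta output files.
    """
    cycle = 0
    seqin = f"{basefile}.fasta"
    while True:
        cycle += 1
        prefix = f"{basefile}.{cycle}"                    # e.g. $BASEFILE.1.*
        run_diploidocus(seqin=seqin, prefix=prefix)
        removed = sum(Path(f"{prefix}.{ext}.fasta").stat().st_size
                      for ext in ("quarantine", "junk")
                      if Path(f"{prefix}.{ext}.fasta").exists())
        if removed == 0:                                  # converged: stop cycling
            return cycle
        seqin = f"{prefix}.diploidocus.fasta"             # kept scaffolds feed the next cycle
```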
---
Genome size prediction [`runmode=gensize`]
This works on the principle that `Complete` BUSCO genes should represent predominantly single copy (diploid read
depth) regions, along with some poor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc.
are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the
region is actually diploid coverage, the modal read depth is expected to represent the actual single copy
read depth.
Diploidocus uses `samtools mpileup` (or `samtools depth` if `quickdepth=T`) to calculate the per-base read depth
and extracts the modal read depth for each BUSCO gene along with the overall modal read depth for all gene
regions. Genome size is then estimated based on a crude calculation using the total combined sequencing length.
This will be calculated from `reads=FILELIST` unless provided with `readbp=INT`.
BUSCO single-copy genes are parsed from a BUSCO full results table, given by `busco=TSVFILE` (default
`full_table_$BASEFILE.busco.tsv`). This can be replaced with any table with the fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with `Status` =
`Complete`, and the `Contig`, `Start` and `End` fields are used to define the regions that should be
predominantly single copy.
NOTE: The current genome size prediction appears to be an over-estimate. There is currently no adjustment for
contamination. The `mapadjust` option attempts to correct for read mapping and imbalanced insertion:deletion ratios
etc. but has not been extensively tested.
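
As a rough sketch of this calculation, assuming an indexed BAM file and the `samtools depth` route used when
`quickdepth=T` (the `regions` list here stands in for the parsed BUSCO `Complete` gene coordinates, and `readbp`
for the total from `readbp=INT` or summed `reads=FILELIST`):

```python
from collections import Counter
import subprocess

def modal_depth(bam, regions):
    """Modal per-base read depth across (contig, start, end) BUSCO regions,
    via `samtools depth` (the quickdepth=T route; mpileup is the default)."""
    counts = Counter()
    for contig, start, end in regions:
        out = subprocess.run(
            ["samtools", "depth", "-r", f"{contig}:{start}-{end}", bam],
            capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            counts[int(line.split("\t")[2])] += 1   # third column is depth
    return counts.most_common(1)[0][0]              # modal depth = single copy

def genome_size_estimate(readbp, scdepth):
    """Crude estimate: total sequencing bases / single-copy read depth."""
    return readbp / scdepth
```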
---
Running Purge_haplotigs using BUSCO-guided cutoffs [`runmode=purgehap`]
This runs just the Purge_haplotigs part of the main Diploidocus workflow. See main docs for details.
---
Telomere finding [`runmode=telomeres`]
Diploidocus performs a regex-based search for telomeres, based on [FindTelomeres](https://github.com/JanaSperschneider/FindTelomeres).
By default, this looks for a canonical telomere motif of TTAGGG/CCCTAA, allowing for some variation. (See the main docs
for how to change the telomere sequence.) For each sequence, Diploidocus trims off any trailing Ns and then searches for
telomere-like sequences at the sequence ends. For each sequence, the presence/absence and length of trimming are reported
for the 5' end (`tel5` and `trim5`) and 3' end (`tel3` and `trim3`), along with the total percentage of telomeric sequence (`TelPerc`).

By default, Diploidocus searches for a forward telomere regex sequence of `C{2,4}T{1,2}A{1,3}` at the 5' end, and a
reverse sequence of `T{1,3}A{1,2}G{2,4}` at the 3' end. These can be set with `telofwd=X` and `telorev=X`. Telomeres are
marked if at least 50% (`teloperc=PERC`) of the terminal 50 bp (`telosize=INT`) matches the appropriate regex. If either
end contains a telomere, the total percentage of the sequence matching either regex is calculated as `TelPerc`. Note
that this number neither restricts matches to the termini, nor includes sequences within predicted telomeres that do
not match the regex. By default, only sequences with telomeres are output to the `*.telomeres.tdt` output, but
switching `telonull=T` will output all sequences.
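
A minimal sketch of the end-matching test, using the default regexes and thresholds described above (the real
implementation also reports the trim lengths and `TelPerc`):

```python
import re

TELOFWD = re.compile(r"C{2,4}T{1,2}A{1,3}")   # telofwd=X default (5' end)
TELOREV = re.compile(r"T{1,3}A{1,2}G{2,4}")   # telorev=X default (3' end)

def has_telomere(seq, telosize=50, teloperc=50.0):
    """Check both ends of a sequence for telomeric repeats: an end is marked
    if >= teloperc% of the terminal telosize bp matches the regex."""
    seq = seq.strip("Nn")                     # trim trailing Ns first
    fwd = sum(len(m.group()) for m in TELOFWD.finditer(seq[:telosize]))
    rev = sum(len(m.group()) for m in TELOREV.finditer(seq[-telosize:]))
    tel5 = 100.0 * fwd / telosize >= teloperc
    tel3 = 100.0 * rev / telosize >= teloperc
    return tel5, tel3
```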
---
See main docs
---
Depth trimming [`runmode=deptrim`]
Depth trimming (`deptrim`) mode trims sequence termini of at least `mintrim=INT` bp with less than `deptrim=INT`
read depth. First, `samtools mpileup` or `depth` (`depmethod=X`) is used to find the first and last positions that
exceed `deptrim=INT`. If no positions meet this criterion, the entire sequence will be removed. Otherwise, any
terminus with more than `mintrim=INT` bp of insufficient read depth is trimmed.
---
Assembly region read-spanning analysis [`runmode=regcheck`]
Region checking, whether for read spanning analysis (`runmode=regcheck`) or copy number analysis
(`runmode=regcnv` or `runmode=regcheck regcnv=T`), analyses regions extracted from a delimited file given by
`regcheck=FILE`. This can be a GFF file, in which case `gfftype=LIST` will restrict analysis to a specific feature
type, or regions can be defined by `checkfields=LIST`, which defines the locus, start and end positions. These
fields default to `SeqName`, `Start` and `End`. If these fields cannot be found, the first three fields
of the `regcheck=FILE` file will be used.
Long read data, given with the `reads=FILELIST` and `readtype=LIST` options, are then mapped onto the assembly
(`seqin=FILE`) using minimap2 to generate a PAF file. This is then parsed, and reads spanning each feature are
identified based on their positions and the target start and end positions in the PAF file. In addition to absolute
spanning of regions, reads spanning a region +/- distances set by `checkflanks=LIST` will also be calculated. If the
end of a sequence is reached before the end of the read, this will also count as flanking. Such regions can be
identified using the `MaxFlank5` and `MaxFlank3` fields, which give the maximum distances 5' and 3' that a read can
span due to sequence length constraints.
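
A sketch of the spanning test for one PAF alignment line, assuming standard 0-based, half-open PAF target
coordinates (the real parser also aggregates counts per flank distance):

```python
def spans(paf_line, seqname, start, end, flank=0):
    """True if a single PAF alignment line spans [start, end] +/- flank bp
    on seqname. Flank windows are clamped at the sequence ends, which is
    what the MaxFlank5/MaxFlank3 fields record."""
    fields = paf_line.rstrip("\n").split("\t")
    tname, tlen = fields[5], int(fields[6])
    tstart, tend = int(fields[7]), int(fields[8])   # 0-based, half-open
    if tname != seqname:
        return False
    left = max(start - flank, 0)       # cannot extend past the 5' end
    right = min(end + flank, tlen)     # cannot extend past the 3' end
    return tstart <= left and tend >= right
```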
If `regcnv=T` then the region copy number analysis (below) will also be performed.

NOTE: Depth calculations are performed in parallel in the directory set with `tmpdir=PATH`. The number of
parallel calculations is set by `forks=INT`.
---
Region copy number analysis [`runmode=regcnv`]
Copy number analysis uses the same single-copy depth profile analysis as the `runmode=gensize` genome size
prediction. In short, the modal read depth of BUSCO single-copy `Complete` genes is calculated using `samtools
mpileup` (or `samtools depth` if `quickdepth=T`) and used to define the "single copy read depth". BUSCO single-copy
genes are parsed from a BUSCO full results table, given by `busco=TSVFILE` (default
`full_table_$BASEFILE.busco.tsv`). This can be replaced with any table with the fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with `Status` =
`Complete`, and the `Contig`, `Start` and `End` fields are used to define the regions that should be
predominantly single copy.
Single-copy read depth can also be set using `scdepth=NUM` to re-use or override previously calculated values.
Long read data, given with the `reads=FILELIST` and `readtype=LIST` options, are then mapped onto the assembly
(`seqin=FILE`) using minimap2. This can be skipped by providing a BAM file with `bam=FILE`. For each `regcheck`
feature, the same samtools depth calculation is performed as for the BUSCO data. The mean depth across the region
is then divided by the single copy depth to estimate total copy number across the region. Note that unlike the
single copy depth estimation itself, this might be biased by repeat sequences etc. that have a different depth
profile to the main region. One way to control for this might be to restrict analysis to a subset of reads that
meet a certain minimum length cutoff, e.g. 10kb.
Query-based CNV analysis: if the `regcheck=FILE` file has additional `Qry`, `QryLen`, `QryStart` and `QryEnd`
fields, the copy number analysis will have an additional query focus. In this case, each region mapping onto a
specific query is summed up, adjusted for the proportion of the query covered by that region. For example, 3.5X
mean depth of a 100% length copy and 3.0X coverage of a 50% length copy would sum to (3.5x1.0 + 3.0x0.5 =) 5 total
copies. If these fields are not present, each region will be analysed independently.
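
The query-weighted summation is simple enough to sketch directly; region copy numbers are assumed to be
pre-computed as mean depth divided by single-copy depth:

```python
def query_copy_number(regions):
    """Sum region copy numbers onto their query, weighting each region's CN
    by the fraction of the query it covers. `regions` holds
    (region_cn, qrylen, qrystart, qryend) tuples for one Qry (1-based)."""
    return sum(cn * (qend - qstart + 1) / qlen
               for cn, qlen, qstart, qend in regions)

# Worked example from the text: a full-length copy at 3.5X and a half-length
# copy at 3.0X single-copy-adjusted depth sum to 3.5*1.0 + 3.0*0.5 = 5 copies.
assert query_copy_number([(3.5, 100, 1, 100), (3.0, 100, 1, 50)]) == 5.0
```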
NOTE: Depth calculations are performed in parallel in the directory set with `tmpdir=PATH`. The number of
parallel calculations is set by `forks=INT`.
---
Assembly gap read-spanning analysis [`runmode=gapspan`]
This mode first identifies all the gaps in an assembly (`seqin=FILE`) (using SeqList `gapstats`, or `$SEQBASE.gaps.tdt`
if pre-existing) and then runs the read-spanning analysis (`runmode=regcheck`) with `regcnv=F`. Long read data, given
with the `reads=FILELIST` and `readtype=LIST` options, are mapped onto the assembly using minimap2 to generate a PAF
file. This is then parsed, and reads spanning each gap are identified based on their positions and the target start
and end positions in the PAF file. In addition to absolute spanning of regions, reads spanning a region +/- distances
set by `checkflanks=LIST` will also be calculated. If the end of a sequence is reached before the end of the read,
this will also count as flanking. Such regions can be identified using the `MaxFlank5` and `MaxFlank3` fields, which
give the maximum distances 5' and 3' that a read can span due to sequence length constraints.
Spanning `spanid` output is also generated for each gap and saved in `$BASEFILE_spanid`. Each gap will be named
`seqname.start-end`.
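
Gap identification itself reduces to locating runs of Ns; a minimal sketch, assuming (illustratively) a minimum
gap length of 10 Ns:

```python
import re

def find_gaps(seqname, sequence, mingap=10):
    """Locate runs of Ns (assembly gaps) and name each gap in the
    seqname.start-end style used for the spanid output (1-based)."""
    return [f"{seqname}.{m.start() + 1}-{m.end()}"
            for m in re.finditer("N{%d,}" % mingap, sequence.upper())]
```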
This mode is now primarily documented and updated through GapSpanner.
---
Assembly gap re-assembly [`runmode=gapass`]
In addition to the `gapspan` analysis, reads identified as spanning each gap are extracted and assembled using `flye`
in a `$BASEFILE__gapassemble/` output directory.
This mode is now primarily documented and updated through GapSpanner.
---
Assembly gap filling [`runmode=gapfill`]
In addition to the `gapspan` and `gapass` outputs, re-assembled gap regions are compiled into a single file and then
mapped back onto the original assembly using minimap2, with tabulated hit output in `$BASEFILE__gapfill/`. Local hits
are reduced to unique coverage of the assembly sequences. Gaps are filled if one of two conditions is met:

1. A single local alignment spans an entire gap.
2. A pair of concordant local alignments from the same re-assembly contig (same orientation) flank an entire gap.

In the case of a single spanning local alignment, the entire assembly region is replaced by the corresponding
re-assembly contig region. For a pair of hits, the region between the two hits is replaced.
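
A sketch of the two fill conditions, using simplified hit tuples (contig, target start, target end, strand)
sorted by target start; this is an illustration, not the actual data structures:

```python
def fill_action(hits, gap_start, gap_end):
    """Decide how one gap could be filled from unique-coverage local hits
    on the gapped assembly sequence. Returns the applicable rule, if any."""
    for contig, tstart, tend, strand in hits:
        if tstart < gap_start and tend > gap_end:
            return ("single-span", contig)     # condition 1: replace whole region
    for (c1, s1, e1, d1), (c2, s2, e2, d2) in zip(hits, hits[1:]):
        if c1 == c2 and d1 == d2 and e1 <= gap_start and s2 >= gap_end:
            return ("flanking-pair", c1)       # condition 2: replace between hits
    return None                                # gap left unfilled
```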
This mode is now primarily documented and updated through GapSpanner.
---
Sorted non-redundant assembly cleanup [`runmode=sortnr`]
The sorted non-redundant assembly cleanup mode (`runmode=sortnr`) screens out any sequences that are 100% gap,
then removes any sequences that are 100% redundant with other sequences in the input. This includes full and
partial matches, i.e. if sequence X is wholly contained within sequence Y then X will be removed.
First, sequences are loaded from the file given with `seqin=FILE` and any [rje_seqlist](http://rest.slimsuite.unsw.edu.au/seqlist)
filters and sequence sorting are applied to the input. Sequences that are 100% Ns are removed and any gaps
exceeding 10 nt are reduced to 10 `N`s (`NNNNNNNNNN`) to prevent minimap2 from splitting sequences on long gaps,
as in the sketch below. These gap-reduced sequences are output to `$BASEFILE.tmp.fasta` and used for an all-by-all
minimap2 search.
By default, minimap2 is run with the following options to generate a `$BASEFILE.tmp.paf` file:

    --cs -p 0.0001 -t 4 -x asm20 -N 250
To modify minimap2 search settings, please see the [rje_paf](http://rest.slimsuite.unsw.edu.au/rje_paf)
documentation.
NOTE: These run options can probably be made more stringent to speed up minimap2 without loss of function.
Future releases may alter defaults accordingly.
Minimap2 output is parsed to identify scaffold-scaffold matches. Self-hits are ignored.
The minimum (gap-reduced) sequence length is used as a rapid parsing filter: any minimap2 matches that are less
than 95% of the query sequence (`Length`+`nn` fields) or less than 100% identity ((`Identity`+`nn`)/(`Length`+`nn`))
are filtered during parsing.
NOTE: Future releases may feature an option to reduce the global percentage identity cut-off. Please contact
the author if you wish to see this implemented.
Minimap2 hits are then processed in reverse order of reference sequence size (e.g. scaffold length). Any hits
where either sequence has already been filtered are skipped. Otherwise, if the match (as determined by the total
length of `:` regions in the CS string) matches the query length, the query sequence will be flagged for
removal as "identical to" or "contained within" the hit. (Mutually partial overlapping exact matches are NOT
filtered.) Filtered IDs and their matches are output to `$BASEFILE.redundant.txt`.
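
A sketch of the CS-string test, using the short-form `--cs` output in which `:N` encodes an exact-match block of
N bases:

```python
import re

def exact_match_length(cs):
    """Total ':' (exact match) block length in a short-form minimap2 CS
    string, e.g. 'cs:Z::1024*ag:512' -> 1536."""
    return sum(int(n) for n in re.findall(r":(\d+)", cs))

def is_redundant(cs, qrylen):
    """Flag the query for removal when its exact matches cover its whole
    (gap-reduced) length: identical to, or contained within, the hit."""
    return exact_match_length(cs) == qrylen
```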
Once all sequences have been filtered, the remaining sequences are output to `$BASEFILE.nr.fasta`.
NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order,
add the `sortseq=invsize` command to the Diploidocus run command.
Finally, the input and output files are summarised (unless `summarise=F`) and statistics are output to
`$BASEFILE.summarise.tdt`. Temporary gap-reduced and minimap2 PAF files are deleted unless running in `debug` or
`dev` modes.
---
Pseudodiploid to primary and alternative haplotigs [`runmode=diphap(nr)`]
This protocol is based on 10x assemblies made for multiple organisms with supernova v2.0.0 and supernova v2.1.1.
In each case, some redundancy in output was discovered (a) within pseudohap output, and (b) in terms of fully
homozygous (identical) scaffolds shared by both haplotigs. It was also not entirely clear on what basis a
particular haplotype was assigned to pseudohap1 or pseudohap2.
The general workflow therefore sought to remove redundancy, generate a set of primary scaffolds based on scaffold
length, and generate a non-redundant set of alternative scaffolds where heterozygosity exists. If `diphapnr` mode
is used, the full workflow is implemented by first running the `sortnr` workflow described above. In the reduced
`diphap` mode, redundancy is not removed first.
Sequences are loaded and matching haplotigs are identified based on their names. Sequence names MUST end `HAP(\d+)`,
where `(\d+)` indicates an integer that matches up haplotigs (as produced by supernova pseudohap2 output, for
example). This is not a pipeline to identify haplotig pairs; it is purely for splitting identified
haplotigs into primary and alternative assemblies.
Processing itself is quite simple. Haplotig pairs are identified based on matching `HAP(\d+)` numbers. Where a
single haplotig is found, it is assigned as `diploid`, under the assumption that the two haplotigs were identical
and one was removed. (It is possible that only one parent had this scaffold, e.g. sex chromosomes, so some post-
processing of descriptions may be required.) If two haplotigs with the same number are identified, the longest
is assigned to `haploidA` and the shorter to `haploidB`.
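
A sketch of this pairing logic, mapping sequence names to lengths and returning the classification for each
haplotig:

```python
import re

def assign_haplotigs(seqlens):
    """Classify sequences whose names end HAP<number>. seqlens maps sequence
    name -> length. Lone haplotigs are 'diploid'; within a pair, the longest
    is 'haploidA' and any shorter one 'haploidB'."""
    groups = {}
    for name, seqlen in seqlens.items():
        match = re.search(r"HAP(\d+)$", name)
        if match:
            groups.setdefault(match.group(1), []).append((seqlen, name))
    rating = {}
    for haplotigs in groups.values():
        haplotigs.sort(reverse=True)                 # longest first
        if len(haplotigs) == 1:
            rating[haplotigs[0][1]] = "diploid"      # partner assumed removed
        else:
            rating[haplotigs[0][1]] = "haploidA"
            for _, name in haplotigs[1:]:
                rating[name] = "haploidB"
    return rating
```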
The primary assembly is then compiled from all `haploidA` and `diploid` sequences. These are given `pri`
prefixes and output to `$BASEFILE.pri.fasta`. The alternative assembly, comprising all `haploidB` sequences, is
output to `$BASEFILE.alt.fasta`. If redundancy has been removed, this will likely be a subset of the full assembly.
The combined set of all primary and alternative sequences is output to `$BASEFILE.dipnr.fasta`.
NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order,
add the `sortseq=invsize` command to the Diploidocus run command.
Finally, the input and output files are summarised (unless `summarise=F`) and statistics are output to
`$BASEFILE.summarise.tdt`:

* `$BASEFILE.dipnr.fasta` = Combined pseudodiploid with `haploidA`, `haploidB` and `diploid` annotation.
* `$BASEFILE.pri.fasta` = Primary assembly with `haploidA` and `diploid` sequences.
* `$BASEFILE.alt.fasta` = Alternative assembly with `haploidB` sequences.
---
In silico diploid generation [`runmode=insilico`]
This module generates balanced "in silico diploid" PacBio subread data from two sequenced haploid parents. Each
parent must first be run through SMRTSCAPE to generate subread summary data. (This will be performed if missing. Each
parent needs a `*.fofn` file of subread file names, a `*.unique.tdt` unique subreads table and a `*.smrt.tdt` SMRT
cell identifier table.)
A new set of subreads is then generated from the combined set of parent subreads. This is done by first ranking the
unique subreads from each parent by length. First, the longest subreads from each parent are compared and the shorter
selected to be the first subread of the diploid. (The shorter is taken to minimise length differences between the
two parents.) Next, the longest subread from the other parent that is no longer than the previous subread is added.
This cycles, picking a read from the parent with fewest cumulative bases each cycle: the longest subread that is
no longer than the previous subread is selected. This continues until one parent runs out of subreads. Additional
subreads will be added from the other parent if they reduce the difference in cumulative output for each parent, or
until `lenfilter=X` is reached.
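
A simplified sketch of the balancing cycle, working on subread lengths only (the final top-up step from the
unexhausted parent is noted but not implemented):

```python
def balance_subreads(parent1, parent2, lenfilter=500):
    """Interleave two subread-length pools into a balanced diploid set:
    each cycle takes, from the parent with fewer cumulative bases, the
    longest remaining subread no longer than the previous pick."""
    pools = [sorted(parent1, reverse=True), sorted(parent2, reverse=True)]
    totals, picked = [0, 0], []
    # First pick: the shorter of the two parents' longest subreads.
    p = 0 if pools[0][0] <= pools[1][0] else 1
    last = pools[p].pop(0)
    picked.append((p, last)); totals[p] += last
    while pools[0] and pools[1]:
        p = 0 if totals[0] <= totals[1] else 1        # lagging parent picks next
        while pools[p] and pools[p][0] > last:
            pools[p].pop(0)                           # longer than last pick: skip
        if not pools[p] or pools[p][0] < lenfilter:
            break                                     # pool exhausted or too short
        last = pools[p].pop(0)
        picked.append((p, last)); totals[p] += last
    # (The full protocol then adds further subreads from the remaining parent
    # while they reduce the difference in cumulative output.)
    return picked
```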
Final output will be a `*.LXXXRQXX.fasta` file in which each parent has a similar total sequence content and for
which the subread length distributions should also be similar. This is to overcome biases in resulting diploid
assemblies, where one parent has higher quality data than the other.
NOTE: If performing downstream filtering by Read Quality (RQ), this might reintroduce a bias if one parent has much
higher RQ values than the other. The `rqfilter=X` setting can therefore be used to restrict output to reads with a
minimum RQ value. By default this is 0.84. If you do not get enough sequence output, this setting may need to be
relaxed. Similarly, only sequences above `lenfilter=X` in length will be output. These are the figures given in the
`LXXXRQXX` part of the output file name, e.g. defaults of RQ>=0.84 and Len>=500 generate `*.L500RQ84.fas`.