|Description:|| SMRT Subread Coverage & Assembly Parameter Estimator|
|Last Edit:|| 14/07/17|
Copyright © 2017 Richard J. Edwards - See source code for GNU License Notice
See SLiMSuite Blog for further documentation. See
rje for general commands.
SMRTSCAPE (SMRT Subread Coverage & Assembly Parameter Estimator) is a tool for analysis of PacBio raw sequencing
data to assist the design and execution of PacBio genomics projects. It has a number of functions concerned with
predicting and/or assessing the quantity and quality of useable data produced:
1. **Genome Coverage (
coverage=T).** This method tries to predict genome coverage and accuracy for different depths of
PacBio sequencing. This is useful for estimating genome coverage and/or required numbers of SMRT cells from predicted
read outputs or emprical (average) SMRT cell data if the
BASEFILE.unique.tdt output (generated by
summary=T, below) is
found. NOTE: Default settings for SMRT cell output are not reliable and you should speak to your sequencing provider
for their up-to-date figures. By default, output for this mode is incremented by
XCoverage but this can be switched
to numbers of SMRT cells with
coverage=T mode can be run from the EdwardsLab server at:
2. **Summarise subreads (
summarise=T).** This function summarises subread data from a given
file, or a set of subread fasta files given with
batch=FILELIST (or listed in
*.fofn). This produces sequence
summary data (read lengths, N50 etc.) for each sequence file, SMRT cell and the combined dataset (
In addition, tables are generated that summarise each read individually, which can then be used for further read
filtering or calculations. Summaries are produced for _all_ data (
*.zmw.tdt) and just the **Unique** subread data,
which is the _longest_ read from each ZMW (
*.unique.tdt). FALCON only uses unique read data, and so it is these
data that are used for the rest of SMRTSCAPE functions. A summary of **Read Quality (RQ)** data is also output
3. **Calculate length cutoffs (
calculate=T).** Calculates length cutoffs for different XCoverage and RQ
combinations from subread summary data. Generates
4. **Optimise (
optimise=T).** This function attempts to generate predicted optimum assembly settings from the
calculate=T table data. NOTE: In
V1.x, this option was
5. **Filter subreads (
readfilter=T).** This filters *unique* subreads into a new fasta file (
based on min. read quality (
rqfilter=X) and min. read length (
lenfilter=X). NOTE: These filters are not
available in FALCON, so sequence input must be pre-filtered in this way.
6. **Preassembly fragmentation analysis (
preassembly=FILE).** Processes a Preassembly fasta file to assess/correct
over-fragmentation. Corrected preassembly reads are output to
*.smrtscape.fas and summary data output to
7. **Feature coverage (
ftxcov=INTLIST).** Calculates full-length coverage for a list of feature lengths, as well as
their probability of detection (in raw subread data) if present at different population frequencies
seqin=FILE is given, this will be used directly unless
optimise=T. Otherwise, unique reads from (
*.unique.tdt) will be used.
These are explored in more detail in the **Details** section.
NOTE: The following functions/settings have been retired in
V2.x as HGAP3-specific options. To use, please use
SMRTSCAPE_V1 or contact the author if you wish to have them reinstated for updated versions of SMRTPipeline/SMRTLink:
parseparam=FILES: Parse parameter settings from 1+ assembly runs 
paramlist=LIST : List of parameters to retain for parseparam output (file or comma separated,
Main input for
SMRTSCAPE is a set of subreads fasta files (
batch=FILELIST). This is not required
coverage=T. Input for Preassembly Fragmentation analysis is a single preassembly fasta file, given with
Main outputs are named using
basefile=X as the file name root. If
basefile=X is not set (or =
seqin=FILE will set the basefile name, trimming the file extension. If
seqin=FILE is not given
will be used. If neither
preassembly are given and
batch=FILELIST points to a
*.fofn "file of file
names", this file will be used to set
Re-running a particular mode will regenerate the relevant output (with options to back up unless
existing data from previous stages will be read and reused if found. If
force=T then earlier stages other than
summarise=T output (e.g.
*.unique.tdt). These will also be regenerated if
fullforce=T. (The only
output that is dependent on input parameters are
XCoverage fields, which use
genomesize=INT, and the
fields that also use
All modes will produce a
*.log file. Other outputs produced depend on the run mode selected:
*.coverage.tdt = Predict genome coverage and accuracy for different depths of PacBio sequencing.
*.fofn = File of input subread filenames.
*.summary.tdt = Sequence summary data (read lengths, N50 etc.) for each sequence file, SMRT cell and the
*.zmw.tdt = Individual summary data for all subreads.
*.unique.tdt = Individual summary data for **Unique** subreads (longest per ZMW).
*.rq.tdt= A summary of **Read Quality (RQ)** data.
*.cutoffs.tdt = Table of different read length and RQ cutoffs and the predicted depth and quality of
*.accuracy.tdt = Summary table of predicted accuracies at different RQ/Xdepth combinations.
optimise=T: No additional output. Optimal parameter suggestions are output in log file.
*.LXXXRQXXX.fasta = reads filteres on min. read quality (
RQXXX) and min. read length
*.LXXXRQXXX.fasta.index = index for fasta file, used by
smrtscape_ftfreq.ftxcov.tdt = Calculated full-length coverage for list of feature lengths, as well as their
probability of detection (in raw subread data) if present at different
population frequencies (
*.smrtscape.fas = fragmentation-corrected preassembly reads.
*.fragment.tdt = summary fragmentation data.
Details are provided in the **Details** Section.
How SMRTSCAPE works
SMRTSCAPE is built on two ways of calculating "
X" depth of coverage: (1) a simple calculation of mean X depth;
(2) Empirical modelling of read depth distributions. These are combined as appropriate with estimations of the
required depth of coverage to achieve the desired target genome accuracy. These calculations are described below.
See descriptions of individual
SMRTSCAPE functions for more information on how they are used in a given setting.
As a result,
SMRTSCAPE behaviour is largely controlled by four parameters:
targetcov=PERC : Target percentage coverage for final genome [
targeterr=NUM : Target errors per base for preassembly [
targetxcov=INT : Target 100% X Coverage for pre-assembly (e.g. error-corrected seed reads) [
**NOTE:** If you are getting messages about insufficient data, you might want to try reducing the target error rate
targeterr=NUM) and/or the target "complete" coverage (
**WARNING:** Changing these settings between different runs with the same
basefile=X setting may give some unusual
behaviour. It is safest to use a new
basefile=X if changing any of these settings. (Future
SMRTSCAPE versions may
enforce this.) The exception should be the
*.zmw.tdt outputs, which should be robust to changes
in these settings. (These files are the slowest to generate, and so copying/reusing them could be useful. See
summarise=T documentation for details.
Simple calculation of mean X depth
This is a pure prediction based on average data. This is quite simpy the total amount of sequence (bp) divided by the
genome size, given by
genomesize=INT and equates to the traditional
XCoverage value for genome sequence.
Empirical modelling of read depth distributions
The heart of SMRTSCAPE is an empirical calculation based modelling read distributions, assuming random sampling of
reads from across the genome. This calculates the total (summed) read lengths required to generate the desired genome
targetcov=X) at different
X depths, *e.g.* what total Xdepth is required to cover 99.999% of the genome
at 3X (or more). Note that, with the exception of preassembly-based calculations, the _square root_ of the
targetcov=X value is used; the assembly process involves *two* layers of genome coverage: (1) coverage of seed
reads by anchor reads to generate the error-corrected preassembly; (2) coverage of the preassembly.
XCovLimit" data are calculated by incrementing total summed read lengths in 100 kb increments (adjusted with
xsteplen=X). At each incremment,
genomesize=INT is used to calculate the total
X coverage, which is the mean
depth at any given position. The probability of the target
X coverage (starting at
1X) given the total X coverage
is then calculated using a Poisson distribution. If this probability exceeds the target genome coverage
targetcov=X), the current summed length is set as the
X and the target
X increased by 1. The
total summed read length is then incremented by
xsteplen and the process repeated until the summed length reaches
the total length of all subreads of at least the size set by
minreadlen=X (default 500 bp).
Target Genome Accuracy
The final piece of the puzzle is the depth of coverage required to achieve a particular level of accuracy. By
SMRTSCAPE aims at perfection, which is less than 1 error per genome. (NOTE: For large genomes, this will
require an unrealistic depth of coverage.)
Accuracy is based on a majority reads covering a particular base with the correct call, assuming random calls at the
other positions (*i.e.* the correct bases have to exceed 33% of the incorrect positions). For a given depth of
coverage *for a given base*, the majority cutoff
N is first calculated. Assuming random base calls for errors, this
must exceed 1/4 of the read depth (rounded up):
N = int(0.25X) + 2
Accuracy is then calculated as the probability of achieving
N correct calls, given the depth
X and the
1 = ErrPerBase) using a binomial distribution with an even (random) distribution of errors. When using mean SMRT
cell outputs, the
errperbase=NUM parameter is used for this calculation. When using empirical data, the lowest read
RQ) value is used. (Calculations are made for different
**NOTE:** Read Quality (
RQ) values are *accuracy* measures, whereas
errperbase=NUM sets an error rate. These are
simply related, where
Accuracy = 1.0 - ErrPerBase.
To be added.
basefile=FILE : Sets the root filename for all output, including the log [
genomesize=INT : Genome size (bp) [
targetcov=PERC : Target percentage coverage for final genome [
targeterr=NUM : Target errors per base for preassembly [
targetxcov=X : Target 100% X Coverage for pre-assembly (e.g. error-corrected seed reads) [
minanchorx=X : Minimum X coverage for anchor (preassembly error-correcting) subreads [
xmargin=X : "Safety margin" inflation of desired minimum X coverage [
xsteplen=X : [Adv.] Size (bp) of increasing coverage steps for calculating required depths of coverage [
force=T/F : [Adv.] Whether to force regeneration of existing data, except
zmw tables [
fullforce=T/F : [Adv.] Whether to force regeneration of existing data including
zmw tables [
Genome Coverage Options
coverage=T/F : Whether to generate coverage report [
avread=X : Average read length (bp) [
smrtreads=X : Average assemble output of a SMRT cell [
smrtunits=X : Units for
smrtreads=X (reads/Gb/Mb) [
errperbase=X : Error-rate per base [
maxcov=X : Maximum X coverage to calculate for
coverage=T analysis [
bysmrt=T/F : Whether to output estimated coverage by SMRT cell rather than X coverage [
xnlist=LIST : Additional columns giving % sites with coverage >= Xn [
SubRead Summary Options
summarise=T/F : Generate subread summary statistics including ZMW summary data [
batch=FILELIST : Batch input of multiple subread fasta files (wildcards allowed) if
seqin=FILE : Subread sequence file for analysis (over-rides batch command) [
Assembly Parameter Options
calculate=T/F : Calculate X coverage and target X coverage for given seed, anchor + RQ combinations [
optimise=T/F : Whether to output predicted "best" set of optimised parameters [
minreadlen=X : Absolute minimum read length for calculations (use
minlen=X to affect summary also) [
mapefficiency=X : Efficiency of mapping anchor subreads onto seed reads for correction [
rq=X,Y : Minimum (X) and maximum (Y) values for read quality cutoffs [
rqstep=X : Size of RQ jumps for calculation (min 0.001) [
calcx=NUMLIST : Add calculate entries for given raw Xdepths [
calclen=INTLIST : Add calculate entries for given read length cutoffs [
Subread Filtering Options
readfilter=T/F : Output filtered subreads into a new fasta file *.LXXXRQXXX.fasta
rqfilter=X : Min read quality for filtered subreads [
lenfilter=X : Min read length for filtered subreads [
Feature Coverage Calculation Option
ftxcov=INTLIST : List of feature lengths for which to predict full-length coverage 
ftxfreq=NUMLIST : List of feature frequencies for which to calculate probability of detection [
Preassembly Fragmentation analysis Options
preassembly=FILE: Preassembly fasta file for which to assess/correct over-fragmentation [
History Module Version History
# 0.0.0 - Initial Compilation.
# 1.0.0 - Initial working version for server.
# 1.1.0 - Added xnlist=LIST : Additional columns giving % sites with coverage >= Xn [10,25,50,100].
# 1.2.0 - Added assessment -> now PAGSAT.
# 1.3.0 - Added seed and anchor read coverage generator (calculate=T).
# 1.3.1 - Deleted assessment function. (Now handled by PAGSAT.)
# 1.4.0 - Added new coverage=T function that incorporates seed and anchor subreads.
# 1.5.0 - Added parseparam=FILES with paramlist=LIST to parse restricted sets of parameters.
# 1.6.0 - New SMRTSCAPE program building on PacBio v1.5.0. Added predict=T/F option.
# 1.6.1 - Updated parameters=T to incorporate that the seed read counts as X=1.
# 1.7.0 - Added *.summary.tdt output from subread summary analysis. Added minreadlen.
# 1.8.0 - preassembly=FILE: Preassembly fasta file to assess/correct over-fragmentation (use seqin=FILE for subreads)
# 1.9.0 - Updated empirical preassembly mapefficiency calculation.
# 1.10.0 - Added batch processing of subread files.
# 1.10.1 - Fixed bug in batch processing.
# 2.0.0 - Updated Version 2.x. Anchor read depth calculations no longer subtract seed coverage.
# 2.0.1 - Fixed a bug with the preassembly MinX calculation and enabled seqin=FILE ftcovx analysis.
# 2.0.2 - SeqIn does not generate *.fofn output.
# 2.1.0 - Added MapLen and MapX to optimise.tdt output.
# 2.1.1 - Fixed basefile bug.
# 2.2.0 - Updated to work with fasta files generated from Sequel BAM files. (No RQ values.)
# 2.2.1 - Fixed printLog typo bug.
SMRTSCAPE REST Output formats
&rest=help for general options. Run with
&rest=full to get full server output as text or
for more user-friendly formatted output. Individual outputs can be identified/parsed using
coverage = main results table