Module:	SMRTSCAPE
Description:	SMRT Subread Coverage & Assembly Parameter Estimator
Version:	2.2.3
Last Edit:	11/07/18

Imported modules: rje rje_db rje_obj rje_seqlist

See SLiMSuite Blog for further documentation. See rje for general commands.

Summary

Function

SMRTSCAPE (SMRT Subread Coverage & Assembly Parameter Estimator) is a tool for analysis of PacBio raw sequencing data to assist the design and execution of PacBio genomics projects. It has a number of functions concerned with predicting and/or assessing the quantity and quality of useable data produced:

1. **Genome Coverage (coverage=T).** This method tries to predict genome coverage and accuracy for different depths of PacBio sequencing. This is useful for estimating genome coverage and/or required numbers of SMRT cells from predicted read outputs or emprical (average) SMRT cell data if the BASEFILE.unique.tdt output (generated by summary=T, below) is found. NOTE: Default settings for SMRT cell output are not reliable and you should speak to your sequencing provider for their up-to-date figures. By default, output for this mode is incremented by XCoverage but this can be switched to numbers of SMRT cells with bysmrt=T.

2. **Summarise subreads (summarise=T).** This function summarises subread data from a given seqin=FILE fasta file, or a set of subread fasta files given with batch=FILELIST (or listed in *.fofn). This produces sequence summary data (read lengths, N50 etc.) for each sequence file, SMRT cell and the combined dataset (*.summary.tdt). In addition, tables are generated that summarise each read individually, which can then be used for further read filtering or calculations. Summaries are produced for all data (*.zmw.tdt) and just the Unique subread data, which is the longest read from each ZMW (*.unique.tdt). FALCON only uses unique read data, and so it is these data that are used for the rest of SMRTSCAPE functions. A summary of Read Quality (RQ) data is also output (*.rq.tdt).

3. **Calculate length cutoffs (calculate=T).** Calculates length cutoffs for different XCoverage and RQ combinations from subread summary data. Generates *.cutoffs.tdt.

4. **Optimise (optimise=T).** This function attempts to generate predicted optimum assembly settings from the summary=T and calculate=T table data. NOTE: In V1.x, this option was parameters=T.

5. **Filter subreads (readfilter=T).** This filters *unique* subreads into a new fasta file (*.LXXXRQXXX.fasta) based on min. read quality (rqfilter=X) and min. read length (lenfilter=X). NOTE: These filters are not available in FALCON, so sequence input must be pre-filtered in this way.

6. **Preassembly fragmentation analysis (preassembly=FILE).** Processes a Preassembly fasta file to assess/correct over-fragmentation. Corrected preassembly reads are output to *.smrtscape.fas and summary data output to *.fragment.tdt.

7. **Feature coverage (ftxcov=INTLIST).** Calculates full-length coverage for a list of feature lengths, as well as their probability of detection (in raw subread data) if present at different population frequencies (ftxfreq=NUMLIST). If seqin=FILE is given, this will be used directly unless summarise=T, calculate=T or optimise=T. Otherwise, unique reads from (*.unique.tdt) will be used.

These are explored in more detail in the Details section.

Retired Functions

NOTE: The following functions/settings have been retired in V2.x as HGAP3-specific options. To use, please use SMRTSCAPE_V1 or contact the author if you wish to have them reinstated for updated versions of SMRTPipeline/SMRTLink:

parseparam=FILES: Parse parameter settings from 1+ assembly runs []
paramlist=LIST : List of parameters to retain for parseparam output (file or comma separated, blank=all) []

Input/Output

Main input for SMRTSCAPE is a set of subreads fasta files (seqin=FILE or batch=FILELIST). This is not required for coverage=T. Input for Preassembly Fragmentation analysis is a single preassembly fasta file, given with preassembly=FILE.

Main outputs are named using basefile=X as the file name root. If basefile=X is not set (or =''/None), seqin=FILE will set the basefile name, trimming the file extension. If seqin=FILE is not given preassembly=FILE will be used. If neither seqin nor preassembly are given and batch=FILELIST points to a *.fofn "file of file names", this file will be used to set basefile. Otherwise, basefile=smrtscape.

Re-running a particular mode will regenerate the relevant output (with options to back up unless backups=F) but existing data from previous stages will be read and reused if found. If force=T then earlier stages other than summarise=T output (e.g. *.unique.tdt). These will also be regenerated if fullforce=T. (The only summarise=T output that is dependent on input parameters are XCoverage fields, which use genomesize=INT, and the Xerr fields that also use targeterr=X.)

All modes will produce a *.log file. Other outputs produced depend on the run mode selected:

1. coverage=T: * *.coverage.tdt = Predict genome coverage and accuracy for different depths of PacBio sequencing.

2. summarise=T: * *.fofn = File of input subread filenames. * *.summary.tdt = Sequence summary data (read lengths, N50 etc.) for each sequence file, SMRT cell and the combined dataset. * *.zmw.tdt = Individual summary data for all subreads. * *.unique.tdt = Individual summary data for Unique subreads (longest per ZMW). * *.rq.tdt= A summary of Read Quality (RQ) data.

3. calculate=T: * *.cutoffs.tdt = Table of different read length and RQ cutoffs and the predicted depth and quality of resulting assemblies. * *.accuracy.tdt = Summary table of predicted accuracies at different RQ/Xdepth combinations.

4. optimise=T: No additional output. Optimal parameter suggestions are output in log file.

5. readfilter=T: * *.LXXXRQXXX.fasta = reads filteres on min. read quality (rqfilter=X = RQXXX) and min. read length (lenfilter=X = LXXX). * *.LXXXRQXXX.fasta.index = index for fasta file, used by SeqList.

6. ftxcov=INTLIST): * smrtscape_ftfreq.ftxcov.tdt = Calculated full-length coverage for list of feature lengths, as well as their probability of detection (in raw subread data) if present at different population frequencies (ftxfreq=NUMLIST).

7. preassembly=FILE: * *.smrtscape.fas = fragmentation-corrected preassembly reads. * *.fragment.tdt = summary fragmentation data.

Details are provided in the Details Section.

Details

How SMRTSCAPE works

SMRTSCAPE is built on two ways of calculating "X" depth of coverage: (1) a simple calculation of mean X depth;
(2) Empirical modelling of read depth distributions. These are combined as appropriate with estimations of the
required depth of coverage to achieve the desired target genome accuracy. These calculations are described below.
See descriptions of individual SMRTSCAPE functions for more information on how they are used in a given setting.

As a result, SMRTSCAPE behaviour is largely controlled by four parameters:

genomesize=INT : Genome size (bp) [0]
targetcov=PERC : Target percentage coverage for final genome [99.999]
targeterr=NUM : Target errors per base for preassembly [1/genome size]
targetxcov=INT : Target 100% X Coverage for pre-assembly (e.g. error-corrected seed reads) [3]

NOTE: If you are getting messages about insufficient data, you might want to try reducing the target error rate
(targeterr=NUM) and/or the target "complete" coverage (targetcov=PERC).

WARNING: Changing these settings between different runs with the same basefile=X setting may give some unusual
behaviour. It is safest to use a new basefile=X if changing any of these settings. (Future SMRTSCAPE versions may
enforce this.) The exception should be the *.unique.tdt and *.zmw.tdt outputs, which should be robust to changes
in these settings. (These files are the slowest to generate, and so copying/reusing them could be useful. See
summarise=T documentation for details.

Simple calculation of mean X depth

This is a pure prediction based on average data. This is quite simpy the total amount of sequence (bp) divided by the
genome size, given by genomesize=INT and equates to the traditional XCoverage value for genome sequence.

Empirical modelling of read depth distributions

The heart of SMRTSCAPE is an empirical calculation based modelling read distributions, assuming random sampling of
reads from across the genome. This calculates the total (summed) read lengths required to generate the desired genome
coverage (targetcov=X) at different X depths, *e.g.* what total Xdepth is required to cover 99.999% of the genome
at 3X (or more). Note that, with the exception of preassembly-based calculations, the square root of the
targetcov=X value is used; the assembly process involves *two* layers of genome coverage: (1) coverage of seed
reads by anchor reads to generate the error-corrected preassembly; (2) coverage of the preassembly.

These "XCovLimit" data are calculated by incrementing total summed read lengths in 100 kb increments (adjusted with
xsteplen=X). At each incremment, genomesize=INT is used to calculate the total X coverage, which is the mean
depth at any given position. The probability of the target X coverage (starting at 1X) given the total X coverage
is then calculated using a Poisson distribution. If this probability exceeds the target genome coverage
(targetcov=X), the current summed length is set as the XCovLimit for X and the target X increased by 1. The
total summed read length is then incremented by xsteplen and the process repeated until the summed length reaches
the total length of all subreads of at least the size set by minreadlen=X (default 500 bp).

Target Genome Accuracy

The final piece of the puzzle is the depth of coverage required to achieve a particular level of accuracy. By
default, SMRTSCAPE aims at perfection, which is less than 1 error per genome. (NOTE: For large genomes, this will
require an unrealistic depth of coverage.)

Accuracy is based on a majority reads covering a particular base with the correct call, assuming random calls at the
other positions (*i.e.* the correct bases have to exceed 33% of the incorrect positions). For a given depth of
coverage *for a given base*, the majority cutoff N is first calculated. Assuming random base calls for errors, this
must exceed 1/4 of the read depth (rounded up):

N = int(0.25X) + 2

Accuracy is then calculated as the probability of achieving N correct calls, given the depth X and the RQ (or
1 = ErrPerBase) using a binomial distribution with an even (random) distribution of errors. When using mean SMRT
cell outputs, the errperbase=NUM parameter is used for this calculation. When using empirical data, the lowest read
quality (RQ) value is used. (Calculations are made for different RQ cutoffs.)

NOTE: Read Quality (RQ) values are *accuracy* measures, whereas errperbase=NUM sets an error rate. These are
simply related, where Accuracy = 1.0 - ErrPerBase.

Example runs

To be added.

Commandline

General Options

basefile=FILE : Sets the root filename for all output, including the log [smrtscape]
genomesize=INT : Genome size (bp) [0]
targetcov=PERC : Target percentage coverage for final genome [99.999]
targeterr=NUM : Target errors per base for preassembly [1/genome size]
targetxcov=X : Target 100% X Coverage for pre-assembly (e.g. error-corrected seed reads) [3]
minanchorx=X : Minimum X coverage for anchor (preassembly error-correcting) subreads [6]
xmargin=X : "Safety margin" inflation of desired minimum X coverage [1]
xsteplen=X : [Adv.] Size (bp) of increasing coverage steps for calculating required depths of coverage [1e5]
force=T/F : [Adv.] Whether to force regeneration of existing data, except unique and zmw tables [False]
fullforce=T/F : [Adv.] Whether to force regeneration of existing data including unique and zmw tables [False]

Genome Coverage Options

coverage=T/F : Whether to generate coverage report [False]
avread=X : Average read length (bp) [20000]
smrtreads=X : Average assemble output of a SMRT cell [50000]
smrtunits=X : Units for smrtreads=X (reads/Gb/Mb) [reads]
errperbase=X : Error-rate per base [0.14]
maxcov=X : Maximum X coverage to calculate for coverage=T analysis [100]
bysmrt=T/F : Whether to output estimated coverage by SMRT cell rather than X coverage [False]
xnlist=LIST : Additional columns giving % sites with coverage >= Xn [minanchorx->targetxcov+1]

SubRead Summary Options

summarise=T/F : Generate subread summary statistics including ZMW summary data [False]
batch=FILELIST : Batch input of multiple subread fasta files (wildcards allowed) if seqin=None [basefile.fofn]
seqin=FILE : Subread sequence file for analysis (over-rides batch command) [None]

Assembly Parameter Options

calculate=T/F : Calculate X coverage and target X coverage for given seed, anchor + RQ combinations [False]
optimise=T/F : Whether to output predicted "best" set of optimised parameters [False]
minreadlen=X : Absolute minimum read length for calculations (use minlen=X to affect summary also) [500]
mapefficiency=X : Efficiency of mapping anchor subreads onto seed reads for correction [0.8]
rq=X,Y : Minimum (X) and maximum (Y) values for read quality cutoffs [0.8,0.9]
rqstep=X : Size of RQ jumps for calculation (min 0.001) [0.01]
calcx=NUMLIST : Add calculate entries for given raw Xdepths [25,30]
calclen=INTLIST : Add calculate entries for given read length cutoffs [12000,15000,18000]

Subread Filtering Options

readfilter=T/F : Output filtered subreads into a new fasta file *.LXXXRQXXX.fasta
rqfilter=X : Min read quality for filtered subreads [0.85]
lenfilter=X : Min read length for filtered subreads [5000]

Feature Coverage Calculation Option

ftxcov=INTLIST : List of feature lengths for which to predict full-length coverage []
ftxfreq=NUMLIST : List of feature frequencies for which to calculate probability of detection [0.01,0.05,1.0]

Preassembly Fragmentation analysis Options

preassembly=FILE: Preassembly fasta file for which to assess/correct over-fragmentation [None]

History Module Version History

    # 0.0.0 - Initial Compilation.
    # 1.0.0 - Initial working version for server.
    # 1.1.0 - Added xnlist=LIST : Additional columns giving % sites with coverage >= Xn [10,25,50,100].
    # 1.2.0 - Added assessment -> now PAGSAT.
    # 1.3.0 - Added seed and anchor read coverage generator (calculate=T).
    # 1.3.1 - Deleted assessment function. (Now handled by PAGSAT.)
    # 1.4.0 - Added new coverage=T function that incorporates seed and anchor subreads.
    # 1.5.0 - Added parseparam=FILES with paramlist=LIST to parse restricted sets of parameters.
    # 1.6.0 - New SMRTSCAPE program building on PacBio v1.5.0. Added predict=T/F option.
    # 1.6.1 - Updated parameters=T to incorporate that the seed read counts as X=1.
    # 1.7.0 - Added *.summary.tdt output from subread summary analysis. Added minreadlen.
    # 1.8.0 - preassembly=FILE: Preassembly fasta file to assess/correct over-fragmentation (use seqin=FILE for subreads)
    # 1.9.0 - Updated empirical preassembly mapefficiency calculation.
    # 1.10.0 - Added batch processing of subread files.
    # 1.10.1 - Fixed bug in batch processing.
    # 2.0.0 - Updated Version 2.x. Anchor read depth calculations no longer subtract seed coverage.
    # 2.0.1 - Fixed a bug with the preassembly MinX calculation and enabled seqin=FILE ftcovx analysis.
    # 2.0.2 - SeqIn does not generate *.fofn output.
    # 2.1.0 - Added MapLen and MapX to optimise.tdt output.
    # 2.1.1 - Fixed basefile bug.
    # 2.2.0 - Updated to work with fasta files generated from Sequel BAM files. (No RQ values.)
    # 2.2.1 - Fixed printLog typo bug.
    # 2.2.2 - Added dna=T to all SeqList object generation.
    # 2.2.3 - Fixed bug where SMRT subreads are not returned by seqlist in correct order. Fixed RQ=0 bug.

SMRTSCAPE REST Output formats

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT for:

coverage = main results table

SLiMSuite REST Server

SMRTSCAPE V2.2.3

SMRT Subread Coverage & Assembly Parameter Estimator