Module:	SAMPhaser
Description:	Diploid chromosome phasing from SAMTools Pileup format.
Version:	0.11.0
Last Edit:	24/03/21
Citation:	Song, Thomas & Edwards (2019), Marine Genomics 48:100687.
Documentation:	https://github.com/slimsuite/SLiMSuite/wiki/SAMPhaser

Imported modules: rje rje_html rje_obj rje_db rje_paf rje_samtools rje_seqlist

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

SAMPhaser is a tool designed to take an input of long read (e.g. PacBio) data mapped onto a genome assembly and phase the data into haplotype blocks before "unzipping" the assembly into phased "haplotigs". Unphased regions are also output as single "collapsed" haplotigs. This is designed for phasing PacBio assemblies of diploid organisms. By default, only SNPs are used for phasing, with indel polymorphisms being ignored. This is because indels are more likely to be errors. In particular, mononucleotide repeats could have indels that look like false well-supported polymorphisms.

SAMPhaser overview

SAMPhaser first identifies variants from a pileup file generated using [SAMtools](https://github.com/samtools/samtools) from a BAM file of mapped long reads. SNPs and indels are called for all positions where the minor allele is supported by at least 10% of the reads (mincut=X), with an absolute minimum of two reads (absmincut=X). The subset of biallelic SNPs with the minor variant supported by at least five reads (absphasecut=X) at a frequency of at least 25% (phasecut=X) are used for phasing. Indels, and any SNPs not meeting these criteria, are used for sequence correction, but not phasing.

Phasing is performed by iteratively assigning alleles and reads to haplotypes. Initially, each read is given an equal probability of being in haplotype "A" or "B". The reference allele of the first SNP then defines haplotype A. For each SNP, SAMPhaser iteratively calculates (1) the probability that each allele is in haplotype A given the haplotype A probabilities for reads containing that allele, and then (2) the probability that each read is in haplotype A given the haplotype A probabilities for that read's alleles at the last ten SNPs (snpcalc=X). This is performed by modelling a SNP call error rate (snperr=X set at 5%) and then calculating the relative likelihood of seeing the observed data if a read or allele is really in haplotype A versus haplotype B.

This progresses until all SNPs have been processed. If at any point, all reads with processed SNP positions reach their ends before another SNP is reached, a new phasing block is started. Draft phase blocks are then resolved into the final haplotype blocks by assigning reads and SNPs where the probability of assignment of a read to one haplotype exceeds 95% (trackprob=X). Ambiguous reads and SNPs are ignored.

The final step is to "unzip" the reference sequence into "haplotigs". SAMPhaser unzips phase blocks with at least five SNPs (minsnp=X). Regions that are not unzipped are output as "collapsed" haplotigs. First, phased reads are assigned to the appropriate haplotig. Regions of 100+ base pairs without coverage (splitzero=X) are removed as putative structural variants, and the haplotig split at this point. Haplotigs with an average depth of coverage below 5X (minhapx=X) are removed. Note that this can result in "orphan" haplotigs, where the minor haplotig did not have sufficient coverage for retention. Haplotigs ending within 10 bp (endmargin=X) of the end of the reference sequence are extended. Next, collapsed blocks are established by identifying reads that (a) have not been assigned to a haplotype, and (b) are not wholly overlapping a phased block.

Finally, unzipped blocks have their sequences corrected. This is performed by starting with the reference sequence and then identifying the dominant haplotype allele (or consensus for collapsed blocks) at all variant positions (not just those used for phasing) providing the variant has at least 10% (min. three) reads supporting it (unzipcut=X absunzipcut=X). The final haplotig sequence is the original reference sequence with any assigned non-reference alleles substituted in at the appropriate positions. Single base deletions are cut out of the sequence and so it may end up shorter than the original contig. Insertions and longer deletions are not currently handled and are ignored; for this reason, it is important to re-map reads and correct the final haplotig sequences.

Commandline

Input/Output options

seqin=FASFILE : Input genome to phase variants in []
pileup=FILE : Pileup file of reads against input genome. Will make from reads=FILELIST if missing. []
basefile=FILE : Root of output file names (same as Pileup input file by default) []
mincut=X : Minimum read count for minor allele (proportion of QN if <1) for pileup parsing [0.1]
absmincut=X : Absolute minimum read count for minor allele (used if mincut<1) [2]
indels=T/F : Whether to include indels in "SNP" parsing [True]
snptableout=T/F : Output filtered alleles to SNP Table [False]
skiploci=LIST : Optional list of loci (full names or accnum) to skip phasing []
phaseloci=LIST : Optional list of loci (full names or accnum) to phase (will skip rest) []
reads=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
pafphase=FILE : PAF file for experimental phasing (dev=T only) [None]
readnames=T/F : Output the read names to the RID file (dev PAFPhase mode only) [False]

Phasing options

phasecut=X : Minimum read count for minor allele for phasing (proportion of QN if <1) [0.25]
absphasecut=X : Absolute minimum read count for phasecut (used if phasecut<1) [5]
phaseindels=T/F : Whether to include indels in "SNP" phasing [False]
snperr=X : Probability of an incorrect (biallelic) SNP call for individual read nucleotides [0.05]
snpcalc=X : Max number of SNPs to use for read probability calculations (fewer = quicker) [10]
trackprob=X : Min probability for assigning a read/SNP to Track A/B [0.95]

Unzipping options

minsnp=X : Min number of SNPs per phased haplotype block [5]
endmargin=X : Extend block ends within X nucleotides of sequence ends [10]
unzipcut=X : Minimum read count for allele for unzipping haplotigs (proportion of QN if <1) [0.1]
absunzipcut=X : Absolute minimum read count for unzipcut (used if unzipcut<1) [3]
minhapx=X : Minimum mean coverage for haplotig [5]
halfhap=T/F : Whether to allow "half haplotigs" where one halpotig in a pair is removed by minhapx [True]
splitzero=X : Whether to split haplotigs at zero-coverage regions of X+ bp (-1 = no split) [100]
rgraphics=T/F : Whether to generate PNG graphics using R. (Needs R installed and setup) [True]
poordepth=T/F : Whether to include reads with poor track probability in haplotig depth plots (random track) [False]
pagsat=T/F : Use PAGSAT-style naming for split sequences (Rn in place of .n) [False]

History Module Version History

    # 0.0.0 - Initial Compilation.
    # 0.1.0 - Updated SAMPhaser to be more memory efficient.
    # 0.2.0 - Added reading of sequence and generation of SNP-altered haplotype blocks.
    # 0.2.1 - Fixed bug in which zero-phasing sequences were being excluded from blocks output.
    # 0.3.0 - Made a new unzip process.
    # 0.4.0 - Added RGraphics for unzip.
    # 0.4.1 - Fixed MeanX bug in devUnzip.
    # 0.4.2 - Made phaseindels=F by default: mononucleotide indel errors will probably add phasing noise. Fixed basefile R bug.
    # 0.4.3 - Fixed bug introduced by adding depthplot code. Fixed phaseindels bug. (Wasn't working!)
    # 0.4.4 - Modified mincut=X to adjust for samtools V1.12.0.
    # 0.4.5 - Updated for modified RJE_SAMTools output.
    # 0.4.6 - splitzero=X : Whether to split haplotigs at zero-coverage regions of X+ bp (-1 = no split) [100]
    # 0.5.0 - snptable=T/F    : Output filtered alleles to SNP Table [False]
    # 0.6.0 - Converted haplotig naming to be consistent for PAGSAT generation. Updated for rje_samtools v1.21.1.
    # 0.7.0 - Added skiploci=LIST and phaseloci=LIST  : Optional list of loci to skip phasing []
    # 0.8.0 - poordepth=T/F   : Whether to include reads with poor track probability in haplotig depth plots (random track) [False]
    # 0.9.0 - Added generation of mpileup file.
    # 0.9.1 - Tweaked naming for PAGSAT.
    # 0.10.0 - Added HiFi read type.
    # 0.11.0 - Added pafphase mode (dev=T) and readnames=T/F.

SAMPhaser REST Output formats

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT.

SLiMSuite REST Server

SAMPhaser V0.11.0

Diploid chromosome phasing from SAMTools Pileup format.