Diploid chromosome phasing from SAMTools Pileup format.
Copyright © 2016 Richard J. Edwards - See source code for GNU License Notice
SAMPhaser takes long-read (e.g. PacBio) data mapped onto a genome assembly, phases the reads into haplotype blocks, and then "unzips" the assembly into phased "haplotigs". Unphased regions are also output as single "collapsed" haplotigs. It is designed for phasing PacBio assemblies of diploid organisms. By default, only SNPs are used for phasing and indel polymorphisms are ignored. This is because indels are more likely to be sequencing errors; in particular, erroneous indels in mononucleotide repeats can look like well-supported polymorphisms.
SAMPhaser first identifies variants from a pileup file generated using [SAMtools](https://github.com/samtools/samtools)
from a BAM file of mapped long reads. SNPs and indels are called for all positions where the minor allele is
supported by at least 10% of the reads.
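The frequency filter described above can be sketched in Python as follows. This is a minimal illustration, not SAMPhaser's own code: the function name `call_variant`, its `min_freq` parameter, and the input format (a dict of allele counts at one position) are assumptions made for the example, and parsing of the pileup file itself is left out.

```python
def call_variant(allele_counts, min_freq=0.10):
    """Return the alleles at one position if it qualifies as a variant.

    allele_counts: hypothetical input, e.g. {"A": 17, "G": 3}.
    A position is called only if at least two alleles reach min_freq,
    i.e. the minor allele is supported by at least 10% of the reads.
    """
    total = sum(allele_counts.values())
    if total == 0:
        return []
    kept = [a for a, n in allele_counts.items() if n / total >= min_freq]
    # Most common allele first; ties broken alphabetically
    kept.sort(key=lambda a: (-allele_counts[a], a))
    return kept if len(kept) >= 2 else []


print(call_variant({"A": 17, "G": 3}))   # ['A', 'G']  (minor allele at 15%)
print(call_variant({"A": 19, "G": 1}))   # []          (minor allele at 5%)
```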
Phasing is performed by iteratively assigning alleles and reads to haplotypes. Initially, each read is given an equal
probability of being in haplotype "A" or "B". The reference allele of the first SNP then defines haplotype A.
For each SNP, SAMPhaser iteratively calculates (1) the probability that each allele is in haplotype A given the
haplotype A probabilities for reads containing that allele, and then (2) the probability that each read is in
haplotype A given the haplotype A probabilities for that read's alleles at the last ten SNPs.
This progresses until all SNPs have been processed. If at any point, all reads with processed SNP positions reach
their ends before another SNP is reached, a new phasing block is started. Draft phase blocks are then resolved into
the final haplotype blocks by assigning reads and SNPs where the probability of assignment of a read to one haplotype
exceeds 95%.
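The two-step update can be illustrated with a short Python sketch. It is not SAMPhaser's implementation: the text does not give the exact formula, so plain averaging is used here as an assumption, with a single pass per SNP. The names `phase_reads`, `window` and `cutoff`, and the input layout (one `{read_id: allele}` dict per SNP, in position order), are likewise hypothetical. Splitting into separate phasing blocks is not shown.

```python
from collections import defaultdict


def phase_reads(snps, ref_allele, window=10, cutoff=0.95):
    """snps: list of {read_id: allele} dicts in position order (assumed layout).
    ref_allele: reference allele of the first SNP, which defines haplotype A.
    """
    read_prob = defaultdict(lambda: 0.5)   # P(read is in haplotype A)
    history = defaultdict(list)            # allele probabilities at recent SNPs
    for i, calls in enumerate(snps):
        # (1) Allele probability: mean haplotype A probability of its reads
        by_allele = defaultdict(list)
        for read, allele in calls.items():
            by_allele[allele].append(read_prob[read])
        allele_prob = {a: sum(p) / len(p) for a, p in by_allele.items()}
        if i == 0:
            # The reference allele of the first SNP anchors haplotype A
            allele_prob = {a: 1.0 if a == ref_allele else 0.0 for a in by_allele}
        # (2) Read probability: mean over its alleles at the last `window` SNPs
        for read, allele in calls.items():
            history[read] = (history[read] + [allele_prob[allele]])[-window:]
            read_prob[read] = sum(history[read]) / len(history[read])
    # Resolve: assign a read only if one haplotype exceeds the cutoff
    assigned = {
        read: "A" if p > cutoff else "B" if p < 1 - cutoff else None
        for read, p in read_prob.items()
    }
    return dict(read_prob), assigned


probs, assigned = phase_reads(
    [{"r1": "A", "r2": "G"}, {"r1": "C", "r2": "T"}], ref_allele="A"
)
print(assigned)   # {'r1': 'A', 'r2': 'B'}
```

With simple averaging, a read that first appears partway through a block starts at 0.5 and pulls the probability of its alleles toward 0.5, so such reads may remain unassigned in this sketch.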
The final step is to "unzip" the reference sequence into "haplotigs". SAMPhaser unzips phase blocks with at least
five SNPs.
Finally, unzipped blocks have their sequences corrected. This is performed by starting with the reference sequence
and then identifying the dominant haplotype allele (or consensus for collapsed blocks) at all variant positions
(not just those used for phasing), provided the variant is supported by at least 10% of reads (minimum three).
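As a rough Python illustration of this correction step, the sketch below handles substitutions only, because indels would shift coordinates. The function name `correct_sequence`, its parameters and the `(position, allele_counts, depth)` input layout are assumptions for the example. So is the choice to measure the 10% against the total read depth at the position; the text does not say what the percentage is relative to.

```python
def correct_sequence(reference, variants, min_freq=0.10, min_reads=3):
    """Apply the dominant allele at each variant position to the reference.

    variants: iterable of (pos, allele_counts, depth) tuples (assumed layout):
      pos           0-based position in `reference`
      allele_counts read counts per allele for this haplotig's reads
      depth         total reads covering the position
    """
    seq = list(reference)
    for pos, allele_counts, depth in variants:
        # Dominant allele among this haplotig's reads
        allele, n = max(allele_counts.items(), key=lambda kv: kv[1])
        # Require both the absolute and the proportional support thresholds
        if n >= min_reads and n >= min_freq * depth:
            seq[pos] = allele
    return "".join(seq)


variants = [
    (1, {"C": 1, "T": 6}, 20),   # T dominant, 6 reads, 30% of depth: applied
    (4, {"G": 2}, 30),           # only 2 supporting reads: skipped
    (6, {"A": 3, "G": 1}, 40),   # 3 reads is under 10% of depth 40: skipped
]
print(correct_sequence("ACGTACGT", variants))   # ATGTACGT
```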
Module Version History
# 0.0.0 - Initial Compilation.
# 0.1.0 - Updated SAMPhaser to be more memory efficient.
# 0.2.0 - Added reading of sequence and generation of SNP-altered haplotype blocks.
# 0.2.1 - Fixed bug in which zero-phasing sequences were being excluded from blocks output.
# 0.3.0 - Made a new unzip process.
# 0.4.0 - Added RGraphics for unzip.
# 0.4.1 - Fixed MeanX bug in devUnzip.
# 0.4.2 - Made phaseindels=F by default: mononucleotide indel errors will probably add phasing noise. Fixed basefile R bug.
# 0.4.3 - Fixed bug introduced by adding depthplot code. Fixed phaseindels bug. (Wasn't working!)
# 0.4.4 - Modified mincut=X to adjust for samtools V1.12.0.
# 0.4.5 - Updated for modified RJE_SAMTools output.
# 0.4.6 - splitzero=X : Whether to split haplotigs at zero-coverage regions of X+ bp (-1 = no split)
# 0.5.0 - snptable=T/F : Output filtered alleles to SNP Table [False]
# 0.6.0 - Converted haplotig naming to be consistent for PAGSAT generation. Updated for rje_samtools v1.21.1.
# 0.7.0 - Added skiploci=LIST and phaseloci=LIST : Optional list of loci to skip phasing
# 0.8.0 - poordepth=T/F : Whether to include reads with poor track probability in haplotig depth plots (random track) [False]
# 0.9.0 - Added generation of mpileup file.
# 0.9.1 - Tweaked naming for PAGSAT.
# 0.10.0 - Added HiFi read type.
# 0.11.0 - Added pafphase mode (dev=T) and readnames=T/F.
SAMPhaser REST Output formats
Run with
for more user-friendly formatted output. Individual outputs can be identified/parsed using
© 2015 RJ Edwards. Contact: email@example.com.