Module:	rje_samtools
Description:	RJE SAMtools parser and processor
Version:	1.21.0
Last Edit:	24/03/21

Imported modules: rje rje_db rje_obj rje_seqlist rje_sequence rje_zen

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

The initial function of this program is for calling/assessing genetic changes following MPileup mapping of multiple short read datasets against the same reference genome (provided as BAM files). The MPileup files should be generated by piping the output of the following into a file (*.mpileup):

samtools mpileup -BQ0 -d10000000 -f <Ref Genome Fasta> <BAM file>

Initial parsing and filtering converts the MPileup format into a delimited text file with quality score and read depth filtering, converting mapped read data into an allele list in the form: allele1:count|allele2:count|... These will be sorted by frequency, so that allele1 is the "major allele". Output of the *.QX.tdt file will have the fields:

Locus = Reference locus (contig/chromosome)
Pos = Position in reference (1-L)
Ref = Reference sequence
N = Total read depth at that position
QN = Read depth after quality filtering
Seq = Mapped (filtered) allele list in the form: allele1:count|allele2:count|...

SNP Frequency Calculations

A second function of this tool is to compare the SNP frequencies of two populations/datasets and identify major
alleles in the "Treatment" that have significantly increased in frequency compared to the "Control". This mode takes
two pileup files as control=FILE and treatment=FILE. These file names (minus extension) will be output fields. For
something more user-friendly, use labels=X,Y to give them better labels (where X is control and Y is treatment).
These file names will also be used to set the output files CONTROL.vs.TREATMENT unless basefile=X is set.

Parsed pileup files (see above) are read in and combined. Only locus positions with entries in both files (i.e. both
meet the qcut=X and minqn=X criteria) are kept and base calls combined. Any alleles failing to meet the minimum
count criteria (mincut=X) are removed. If mincut < 1.0, it is treated as a proportion of all reads, with a minimum
value set by absmincut=X. The exception to this allele removal is the Reference allele, unless ignoreref=T. By
default, N alleles are removed and do not contribute to overall read counts for allele frequency calculations. This
can be changed by setting ignoren=F. Filtered SNPs are output to *.snp.tdt. Unless ignoreref=T, fixed
differences to the Reference (i.e. 100% frequency in both control and treatment) are also output.

The focus of analysis is the "major allele" for the treatment population, which is the allele with highest frequency.
(When tied, alleles will be arbitrarily selected in alphabetical order). The frequency of the major allele in the
treatment (MajFreq) is output along with the difference in frequency of this allele relative to the control
(MajDiff). A positive value indicates that the allele is at higher frequency in the treatment, whereas a negative
value means higher frequency in the control. Finally, the probability of the treatment frequency is calculated using
a binomial distribution, where: p is probability of a read being the major allele based on the control frequency (if
zero, this is conservatively set to 1/(N+1), i.e. assuming that the next control read would be that allele); k is the
number of major allele treatment reads and n is the total number of treatment reads.

The next stage is to filter these SNPs and calculate FDR statistics. By default, only positions with non-reference
major treatment alleles are considered. This can be switched off with majmut=F or reversed with majmut=F and
majref=T in combination. Following this filtering, the total
number of SNPs is used for an FDR correction of the MajProb probabilities, using the Benjamini & Hochberg approach,
where FDR = pN/n. To speed this up and reduce output, a probability cutoff can be applied with sigcut=X (0.05 by
default). To only keep positions where the treatment major allele is different to the control major allele, use
majdif=T. majcut=X will add an additional min. frequency criterion. These are applied after FDR correction.
Output is then further reduced to those SNPs where MajDiff > 0 and filtered with a final FDR cutoff (fdrcut=X).
Remaining SNPs are output to the *.fdr.tdt table.

Optionally, a table of SNP annotation (snptable=FILE) can be merged with the *.fdr.tdt output. This table must have
Locus and Pos fields. If it has other fields that are required to identify unique entries, these should be set
with snptabkeys=LIST.

Read Coverage Analysis

Version 1.8.0 added read coverage analysis, which will calculate average depth of coverage for each Locus based on a
SAM or mpileup file, or a Read ID file (*.rid.tdt) previously generated by rje_samtools. This file is given by
readcheck=FILE. If the source sequence file is also given using seqin=FASFILE then the true sequence lengths are
used for the calculation. Otherwise, the last position covered by a read in the readcheck file is used. Read depths
are output to *.coverage.tdt.

A table of regions to be checked can be provided using checkpos=FILE. This should have Locus, Start and End
fields for the regions to be checked. Reads from readcheck will be scanned to (a) identify reads completely
spanning the region, and (b) calculate the mean depth of coverage for that region. Reads spanning the region must
also cover a number of nucleotides flanking the region, as set by checkflanks=LIST. Each flanking distance in the
list will have its own SpanX output field. For example, the default settings check 0 flanks (just the region), plus
flanks of 100, 500 and 1000 nt. This generates output fields Span0, Span100, Span500 and Span1000.

Alternative Locus, Start and End fields for checkpos=FILE can be given with checkfields=LIST. NOTE: This is
case-sensitive and needs all three elements, even if only one or two fields are being changed.

If depthplot=T then read depth will be calculated across each locus and additional depth statistics (MinX, MaxX,
MedianX and Coverage) will be added to the coverage file. If rgraphics=T (default) and R is installed, plots per
locus will be generated. If readlen=T then additional outputs will be generated of the maximum read length at each
point in the genome.

If dirlen=X > 0 (default=500), the maximum 5' and 3' read linkage will be calculated every dirlen=X nucleotides
(plus the end position). The examines the reads spanning that position and reports the maximum distance any read
extends in the 5' (Len5) or 3' (Len3) direction from that position. This can identify misassemblies resulting in
breaks in read coverage. Note that reads in this context are contiguous SAM/pileup hits and so sequencing reads with
multiple or split mapping will be treated as multiple reads.

Commandline

See also rje.py : List of MPileup/SAM files to be parsed and filtered (e.g. *.pileup) []
: Min. quality score for a call to include [30]
: Min. number of reads meeting qcut (QN) for output [10]
: Whether to include Read ID (number) lists for each allele [True]
: Whether to restrict parsing output to SNP positions (will use mincut settings below) [False]
: Whether to include indels in "SNP" parsing [True]
: List of loci to exclude from pileup parsing (e.g. mitochondria) []
: Output filtered alleles to SNP Table [False]
: Output the read names to the RID file (SAM parsing only) [False]
: MPileup or processed TDT file to be used for control SNP frequencies []
: MPileup or processed TDT file to be used for treatment SNP frequencies []
: Optional labels for Control and Treatment fields in output (other file basename used) []
: Minimum read count for minor allele (proportion if <1) [1]
: Absolute minimum read count for minor allele (used if mincut<1) [2]
: Whether to restrict SNPs to pure biallelic SNPs (two alleles meeting mincut) [False]
: Whether to exclude "N" calls from alleles [True]
: If False will always keep Reference allele and call fixed change as SNP [False]
: Basename for frequency comparison output [<CONTROLBASE>.v.<TREATMENTBASE>]
: Whether the focus is on Major Alleles (True) or Mutant/Reference Alleles (False) [True]
: Whether to restrict output and stats to positions with Major Allele differences in sample [False]
: Whether to restrict output and stats to positions with non-reference Major Allele [True]
: Whether to restrict output and stats to positions with reference Major Allele (if majmut=F) [False]
: Frequency cutoff for Major allele [0.0]
: Significance cutoff for enriched treatment SNPs [0.05]
: Additional FDR cutoff for enriched treatment SNPs [1.0]
: Table of SNPs of cross-reference with FDR SNP output []
: Fields that make unique key entries for snptable (with Locus, Pos) []
: Optional SNPTable fields to replace for mapping onto FDR Locus,Pos fields (Locus,Pos) []
: Whether to generate snpfreq multichromosome plots [True]
: MPileup or processed TDT file to be used for control SNP frequencies in Alt genome []
: MPileup or processed TDT file to be used for treatment SNP frequencies in Alt genome []
: BAM/SAM/Pileup/RID file with read mappings [None]
: Sequence file for loci in MPileup/SAM files (e.g. matching all relevant RID files) [None]
: Whether to generate Xdepth plot data for the readcheck FILE. (May be slow!) [False]
: Proportion of read to be mapped to count as full-length [0.9]
: Include read length data for the readcheck file (if depthplot=T) [True]
: Include directional read length data at X bp intervals (readlen=T; 0=OFF) [500]
: If set, will generate *.clipped.* output for reads with terminal soft clipping above INT [0]
: If set, will generate *.noclip.* output for reads with less terminal soft clipping than INT [0]
: Minimum read length to include in depth plots (all reads will be in RID file) [0]
: Smooth out any read plateaus < X nucleotides in length [200]
: Smooth out Xcoverage peaks < X depth difference to flanks (<1 = %Median) [0.05]
: Whether to generate PNG graphics using R. (Needs R installed and setup) [True]
: File of Locus, Start, End positions for read coverage checking [None]
: Fields in checkpos file to give Locus, Start and End for checking [Locus,Start,End]
: List of lengths flanking check regions that must also be spanned by reads [0,100,500,1000]
generic commandline options.

History Module Version History

    # 0.0 - Initial Compilation.
    # 0.1.0 - Modified version to handle multiple loci per file. (Original was for single bacterial chromosomes.)
    # 0.2.0 - Added majmut=T/F : Whether to restrict output and stats to positions with non-reference Major Allele [False]
    # 1.0.0 - Major reworking. Old version frozen as rje_samtools_V0.
    # 1.1.0 - Added snptabmap=X,Y alternative SNPTable mapping and read_depth statistics []. Added majref=T/F.
    # 1.2.0 - Added developmental combining of read mapping onto two different genomes.
    # 1.3.0 - Major debugging and code clean up.
    # 1.4.0 - Added parsing of read number (to link SNPs) and fixed deletion error at same time. Added rid=T/F and snponly=T/F.
    # 1.5.0 - Added biallelic=T/F   : Whether to restrict SNPs to pure biallelic SNPs (two alleles meeting mincut) [False]
    # 1.5.1 - Fixed REF/Ref ALT/Alt bug.
    # 1.6.0 - Added majfocus=T/F : Whether the focus is on Major Alleles (True) or Mutant/Reference Alleles (False) [True]
    # 1.7.0 - Added parsing of *.sam files for generating RID table.
    # 1.8.0 - Added read coverage summary/checks.
    # 1.8.1 - Fixed issue when RID file not generated by pileup parsing. Set RID=True by default to avoid issues.
    # 1.9.0 - Added depthplot data generation. (Will need to add R function for plot itself.)
    # 1.9.1 - Changed mincut default to 0.1.
    # 1.10.0 - Added readlen output, which is like the depth plot but uses max read length (kb) instead of depth.
    # 1.11.0 - Added dirnlen=X : Include directional read length data at X bp intervals (depthplot=T; 0=OFF) [500]
    # 1.11.1 - Minor tweaks to try and speed up pileup parsing.
    # 1.12.0 - Updated the snpfreq run code to make clearer and check for parsing issues. Set mincut=1 default.
    # 1.13.0 - Added skiploci=LIST - need to screen out mitochondrion from Illumina Pileup parsing!
    # 1.14.0 - Added forking of pileup parsing for SNPFreq analysis.
    # 1.14.1 - Fixed SNPFreq rerunning bug.
    # 1.15.0 - Added rgraphics=T/F   : Whether to generate snpfreq multichromosome plots [True]
    # 1.16.0 - Add coverage calculation per locus to depth plot table output (depthplot=T).
    # 1.16.1 - Added reporting of existing files for parsing Pileup.
    # 1.17.0 - Added parsing of lengths from SAM files to RID file.
    # 1.18.0 - Updated processing of Treatment and Control without Alt to still limit to SNPTable. Fixed SNPFreq filters.
    # 1.19.0 - snptableout=T/F    : Output filtered alleles to SNP Table [False]
    # 1.19.1 - Fixed AltLocus SNP table bug.
    # 1.19.2 - Updated forker parsing to hopefully fix bug.
    # 1.20.0 - Added parsing of BAM file - needs samtools on system. Added minsoftclip=X, maxsoftclip=X and minreadlen=X.
    # 1.20.1 - Fixed mlen bug. Added catching of unmapped reads in SAM file. Fixed RLen bug. Changed softclip defaults.
    # 1.20.2 - Fixed readlen coverage bug and acut bug.
    # 1.20.3 - Fixed RLen bug.
    # 1.21.0 - Added readnames=T/F : Output the read names to the RID file (SAM parsing only) [False]

SLiMSuite REST Server

rje_samtools V1.21.0

RJE SAMtools parser and processor

Function

SNP Frequency Calculations

Read Coverage Analysis

Commandline

MPileup Parsing Options

SNP Frequency Options

Double Genome Analysis

Read Coverage Analysis

History Module Version History