SLiMSuite REST Server
RJE SAMtools parser and processor
The initial function of this program is for calling/assessing genetic changes following MPileup mapping of multiple
short read datasets against the same reference genome (provided as BAM files). The MPileup files should be generated
by piping the output of the following into a file (*.mpileup):
samtools mpileup -BQ0 -d10000000 -f <Ref Genome Fasta> <BAM file>
Initial parsing and filtering converts the MPileup format into a delimited text file with quality score and read
depth filtering, converting mapped read data into an allele list in the form: allele1:count|allele2:count|... These
will be sorted by frequency, so that allele1 is the "major allele". Output of the *.QX.tdt file will have the fields:
Locus = Reference locus (contig/chromosome)
Pos = Position in reference (1-L)
Ref = Reference sequence
N = Total read depth at that position
QN = Read depth after quality filtering
Seq = Mapped (filtered) allele list in the form: allele1:count|allele2:count|...
SNP Frequency Calculations
A second function of this tool is to compare the SNP frequencies of two populations/datasets and identify major
alleles in the "Treatment" that have significantly increased in frequency compared to the "Control". This mode takes
two pileup files as
treatment=FILE. These file names (minus extension) will be output fields. For
something more user-friendly, use
labels=X,Y to give them better labels (where X is control and Y is treatment).
These file names will also be used to set the output files
basefile=X is set.
Parsed pileup files (see above) are read in and combined. Only locus positions with entries in both files (i.e. both
minqn=X criteria) are kept and base calls combined. Any alleles failing to meet the minimum
count criteria (
mincut=X) are removed. If
mincut < 1.0, it is treated as a proportion of all reads, with a minimum
value set by
absmincut=X. The exception to this allele removal is the Reference allele, unless
N alleles are removed and do not contribute to overall read counts for allele frequency calculations. This
can be changed by setting
ignoren=F. Filtered SNPs are output to
differences to the Reference (i.e. 100% frequency in both control and treatment) are also output.
The focus of analysis is the "major allele" for the treatment population, which is the allele with highest frequency.
(When tied, alleles will be arbitrarily selected in alphabetical order). The frequency of the major allele in the
MajFreq) is output along with the difference in frequency of this allele relative to the control
MajDiff). A positive value indicates that the allele is at higher frequency in the treatment, whereas a negative
value means higher frequency in the control. Finally, the probability of the treatment frequency is calculated using
a binomial distribution, where: p is probability of a read being the major allele based on the control frequency (if
zero, this is conservatively set to 1/(N+1), i.e. assuming that the next control read would be that allele); k is the
number of major allele treatment reads and n is the total number of treatment reads.
The next stage is to filter these SNPs and calculate FDR statistics. By default, only positions with non-reference
major treatment alleles are considered. This can be switched off with
majmut=F or reversed with
majref=T in combination. Following this filtering, the total
number of SNPs is used for an FDR correction of the
MajProb probabilities, using the Benjamini & Hochberg approach,
where FDR = pN/n. To speed this up and reduce output, a probability cutoff can be applied with
sigcut=X (0.05 by
default). To only keep positions where the treatment major allele is different to the control major allele, use
majcut=X will add an additional min. frequency criterion. These are applied after FDR correction.
Output is then further reduced to those SNPs where
MajDiff > 0 and filtered with a final FDR cutoff (
Remaining SNPs are output to the
Optionally, a table of SNP annotation (
snptable=FILE) can be merged with the
*.fdr.tdt output. This table must have
Pos fields. If it has other fields that are required to identify unique entries, these should be set
Read Coverage Analysis
Version 1.8.0 added read coverage analysis, which will calculate average depth of coverage for each Locus based on a
SAM or mpileup file, or a Read ID file (
*.rid.tdt) previously generated by
rje_samtools. This file is given by
readcheck=FILE. If the source sequence file is also given using
seqin=FASFILE then the true sequence lengths are
used for the calculation. Otherwise, the last position covered by a read in the
readcheck file is used. Read depths
are output to
A table of regions to be checked can be provided using
checkpos=FILE. This should have
fields for the regions to be checked. Reads from
readcheck will be scanned to (a) identify reads completely
spanning the region, and (b) calculate the mean depth of coverage for that region. Reads spanning the region must
also cover a number of nucleotides flanking the region, as set by
checkflanks=LIST. Each flanking distance in the
list will have its own
SpanX output field. For example, the default settings check 0 flanks (just the region), plus
flanks of 100, 500 and 1000 nt. This generates output fields
End fields for
checkpos=FILE can be given with
checkfields=LIST. NOTE: This is
case-sensitive and needs all three elements, even if only one or two fields are being changed.
depthplot=T then read depth will be calculated across each locus and additional depth statistics (MinX, MaxX,
MedianX and Coverage) will be added to the coverage file. If
rgraphics=T (default) and R is installed, plots per
locus will be generated. If
readlen=T then additional outputs will be generated of the maximum read length at each
point in the genome.
dirlen=X > 0 (
default=500), the maximum 5' and 3' read linkage will be calculated every
(plus the end position). The examines the reads spanning that position and reports the maximum distance any read
extends in the 5' (Len5) or 3' (Len3) direction from that position. This can identify misassemblies resulting in
breaks in read coverage. Note that reads in this context are contiguous SAM/pileup hits and so sequencing reads with
multiple or split mapping will be treated as multiple reads.
MPileup Parsing Options
batch=FILELIST : List of MPileup/SAM files to be parsed and filtered (e.g. *.pileup) 
qcut=X : Min. quality score for a call to include [
minqn=X : Min. number of reads meeting qcut (QN) for output [
rid=T/F : Whether to include Read ID (number) lists for each allele [
snponly=T/F : Whether to restrict parsing output to SNP positions (will use mincut settings below) [
indels=T/F : Whether to include indels in "SNP" parsing [
skiploci=LIST : List of loci to exclude from pileup parsing (e.g. mitochondria) 
snptableout=T/F : Output filtered alleles to SNP Table [
SNP Frequency Options
control=FILE : MPileup or processed TDT file to be used for control SNP frequencies 
treatment=FILE : MPileup or processed TDT file to be used for treatment SNP frequencies 
labels=X,Y : Optional labels for Control and Treatment fields in output (other file basename used) 
mincut=X : Minimum read count for minor allele (proportion if <1) [
absmincut=X : Absolute minimum read count for minor allele (used if mincut<1) [
biallelic=T/F : Whether to restrict SNPs to pure biallelic SNPs (two alleles meeting mincut) [
ignoren=T/F : Whether to exclude "N" calls from alleles [
ignoreref=T/F : If False will always keep Reference allele and call fixed change as SNP [
basefile=X : Basename for frequency comparison output [
majfocus=T/F : Whether the focus is on Major Alleles (True) or Mutant/Reference Alleles (False) [
majdif=T/F : Whether to restrict output and stats to positions with Major Allele differences in sample [
majmut=T/F : Whether to restrict output and stats to positions with non-reference Major Allele [
majref=T/F : Whether to restrict output and stats to positions with reference Major Allele (if
majcut=X : Frequency cutoff for Major allele [
sigcut=X : Significance cutoff for enriched treatment SNPs [
fdrcut=X : Additional FDR cutoff for enriched treatment SNPs [
snptable=FILE : Table of SNPs of cross-reference with FDR SNP output 
snptabkeys=LIST : Fields that make unique key entries for snptable (with Locus, Pos) 
snptabmap=X,Y : Optional SNPTable fields to replace for mapping onto FDR Locus,Pos fields (Locus,Pos) 
rgraphics=T/F : Whether to generate snpfreq multichromosome plots [
Double Genome Analysis
altcontrol=FILE : MPileup or processed TDT file to be used for control SNP frequencies in Alt genome 
alttreatment=FILE : MPileup or processed TDT file to be used for treatment SNP frequencies in Alt genome 
Read Coverage Analysis
readcheck=FILE : BAM/SAM/Pileup/RID file with read mappings [
seqin=FASFILE : Sequence file for loci in MPileup/SAM files (e.g. matching all relevant RID files) [
depthplot=T/F : Whether to generate Xdepth plot data for the readcheck FILE. (May be slow!) [
fullcut=X : Proportion of read to be mapped to count as full-length [
readlen=T/F : Include read length data for the readcheck file (if
dirnlen=X : Include directional read length data at X bp intervals (
minsoftclip=INT : If set, will generate *.clipped.* output for reads with terminal soft clipping above INT [
maxsoftclip=INT : If set, will generate *.noclip.* output for reads with less terminal soft clipping than INT [
minreadlen=INT : Minimum read length to include in depth plots (all reads will be in RID file) [
depthsmooth=X : Smooth out any read plateaus < X nucleotides in length [
peaksmooth=X : Smooth out Xcoverage peaks < X depth difference to flanks (<1 = %Median) [
rgraphics=T/F : Whether to generate PNG graphics using R. (Needs R installed and setup) [
checkpos=TDTFILE : File of Locus, Start, End positions for read coverage checking [
checkfields=LIST : Fields in checkpos file to give Locus, Start and End for checking [
checkflanks=LIST : List of lengths flanking check regions that must also be spanned by reads [
See also rje.py generic commandline options.
History Module Version History
# 0.0 - Initial Compilation.
# 0.1.0 - Modified version to handle multiple loci per file. (Original was for single bacterial chromosomes.)
# 0.2.0 - Added majmut=T/F : Whether to restrict output and stats to positions with non-reference Major Allele [False]
# 1.0.0 - Major reworking. Old version frozen as rje_samtools_V0.
# 1.1.0 - Added snptabmap=X,Y alternative SNPTable mapping and read_depth statistics . Added majref=T/F.
# 1.2.0 - Added developmental combining of read mapping onto two different genomes.
# 1.3.0 - Major debugging and code clean up.
# 1.4.0 - Added parsing of read number (to link SNPs) and fixed deletion error at same time. Added rid=T/F and snponly=T/F.
# 1.5.0 - Added biallelic=T/F : Whether to restrict SNPs to pure biallelic SNPs (two alleles meeting mincut) [False]
# 1.5.1 - Fixed REF/Ref ALT/Alt bug.
# 1.6.0 - Added majfocus=T/F : Whether the focus is on Major Alleles (True) or Mutant/Reference Alleles (False) [True]
# 1.7.0 - Added parsing of *.sam files for generating RID table.
# 1.8.0 - Added read coverage summary/checks.
# 1.8.1 - Fixed issue when RID file not generated by pileup parsing. Set RID=True by default to avoid issues.
# 1.9.0 - Added depthplot data generation. (Will need to add R function for plot itself.)
# 1.9.1 - Changed mincut default to 0.1.
# 1.10.0 - Added readlen output, which is like the depth plot but uses max read length (kb) instead of depth.
# 1.11.0 - Added dirnlen=X : Include directional read length data at X bp intervals (depthplot=T; 0=OFF) 
# 1.11.1 - Minor tweaks to try and speed up pileup parsing.
# 1.12.0 - Updated the snpfreq run code to make clearer and check for parsing issues. Set mincut=1 default.
# 1.13.0 - Added skiploci=LIST - need to screen out mitochondrion from Illumina Pileup parsing!
# 1.14.0 - Added forking of pileup parsing for SNPFreq analysis.
# 1.14.1 - Fixed SNPFreq rerunning bug.
# 1.15.0 - Added rgraphics=T/F : Whether to generate snpfreq multichromosome plots [True]
# 1.16.0 - Add coverage calculation per locus to depth plot table output (depthplot=T).
# 1.16.1 - Added reporting of existing files for parsing Pileup.
# 1.17.0 - Added parsing of lengths from SAM files to RID file.
# 1.18.0 - Updated processing of Treatment and Control without Alt to still limit to SNPTable. Fixed SNPFreq filters.
# 1.19.0 - snptableout=T/F : Output filtered alleles to SNP Table [False]
# 1.19.1 - Fixed AltLocus SNP table bug.
# 1.19.2 - Updated forker parsing to hopefully fix bug.
# 1.20.0 - Added parsing of BAM file - needs samtools on system. Added minsoftclip=X, maxsoftclip=X and minreadlen=X.
# 1.20.1 - Fixed mlen bug. Added catching of unmapped reads in SAM file. Fixed RLen bug. Changed softclip defaults.
# 1.20.2 - Fixed readlen coverage bug and acut bug.
© 2015 RJ Edwards. Contact: firstname.lastname@example.org.