Trim Function
TRex sniffs out the carcasses of fragmented Ty elements (or other repeats) littering the end of preassembly reads and
removes them in the hope of improving the subsequent assembly. It is assumed that there is sufficient depth of
coverage for full-length elements to be captured by longer reads, for which assembly of the ends will not be
ambiguous. SMRTSCAPE "ftxcov" mode can be used to estimate this probability from the trimmed preassembly data.
First, GABLAM is used to identify elements that fall near the end of sequences. This is restricted to hits above
length minlength=X [default=250]. In theory, this can be made as long as the required overlap for assembly. There
might be a trade-off where unnecessary trimming of repeat fragments breaks up assembly of structural variants, whilst
insufficient trimming leads to mid-repeat assembly graph nodes. Defaults err on the side of over-trimming: it is
hoped that phasing tools using original subreads will overcome any reduction in differentiation of the sequences
flanking structural variants.
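The hit filtering step can be sketched as follows. This is a minimal illustration, not TRex's own code: the dictionary keys `length` and `identical` are hypothetical stand-ins for whatever GABLAM reports, the thresholds are assumed to be inclusive, and minid is the %identity cut-off listed under the command-line options.

```python
def passes_thresholds(hit, minlength=250, minid=60.0):
    """Return True if a local hit is long and similar enough to consider.

    `hit` is a hypothetical dict with the aligned length and the number of
    identical positions; the real GABLAM field names may differ.
    """
    if hit["length"] < minlength:
        return False
    percent_identity = 100.0 * hit["identical"] / hit["length"]
    return percent_identity >= minid


# Made-up hits purely for illustration
hits = [
    {"length": 300, "identical": 270},  # long enough, 90% identity
    {"length": 120, "identical": 118},  # shorter than minlength
    {"length": 400, "identical": 200},  # only 50% identity
]
kept = [hit for hit in hits if passes_thresholds(hit)]
```

Only the first hit survives the defaults; raising minlength towards the assembler's required overlap would discard more of the short fragments.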
Identified repeats that fall within endbuffer=X [default 100 bp] of the end of a read will be removed and the
truncated read saved (along with unaffected reads) in a *.trex.fasta file. In addition to the region matching the
repeat, the end of the read and an additional flanking region specified by trimflank=X [default 200 bp] will be
removed. If trimflank<0, the hit region itself will be truncated prior to read trimming, i.e. a portion of the
repeat will be left on. For example, trimflank=-250 will leave 250 bp of the repeat on the end of the sequence,
which will generate ends similar to those ignored under the default minlength=250. Only hits within endbuffer=X
of the end will be trimmed. This may need to be changed if there is a danger that a longer stretch in the middle of a
repeat sequence lacks sufficient homology between the seqin repeat sequences and the occurrences in the read.
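One reading of the coordinate arithmetic described above, for a single hit, is sketched here. The function name, the 1-based inclusive coordinates and the inclusive endbuffer comparison are assumptions made for illustration; the actual TRex code may treat boundaries differently.

```python
def trimmed_coordinates(length, hit_start, hit_end, endbuffer=100, trimflank=200):
    """Return (start, end) of the retained read, or None if nothing is left.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    start, end = 1, length
    # Hit lies within endbuffer of the 5' end: drop the read start, the hit
    # and trimflank extra bases (or keep part of the hit if trimflank < 0).
    if hit_start - 1 <= endbuffer:
        start = max(start, hit_end + 1 + trimflank)
    # Hit lies within endbuffer of the 3' end: same logic, mirrored.
    if length - hit_end <= endbuffer:
        end = min(end, hit_start - 1 - trimflank)
    return (start, end) if start <= end else None


print(trimmed_coordinates(10000, 9200, 9950))                  # (1, 8999)
print(trimmed_coordinates(10000, 9200, 9950, trimflank=-250))  # (1, 9449)
print(trimmed_coordinates(10000, 5000, 5600))                  # (1, 10000)
```

In the second call the retained read ends at position 9449, so 250 bp of the repeat (9200 to 9449) stay on the 3' end, matching the trimflank=-250 example in the text. The third call shows an internal hit, which is left alone because it is not within endbuffer of either end.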
TRex will be tested and optimised for yeast Ty elements.
To avoid long tandem repeats disappearing completely from the assembly, trimmed reads are re-scanned for additional
terminal repeats after a single repeat is snipped off either or both ends. By default (trmode=keep), reads that
have a terminal repeat element post-trimming are retained in the preassembly at full length. Alternatively, TRex can
keep trimming back terminal repeats until there are none by setting trmode=iterate. The third alternative is
trmode=trim, which will only trim the end fragment and leave internal repeats untrimmed; it is not clear under what
circumstances this would directly be useful, but running TRex several times with trmode=trim will iteratively remove
tandem repeats of elements with higher resolution tracking of the reads affected. (The end product should be
identical to running with trmode=iterate.)
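The three trmode behaviours can be compared with a small sketch. It is an interpretation of the description above rather than TRex's implementation: both function names are hypothetical, hits are (start, end) pairs in original read coordinates (1-based, inclusive), and the re-scan is approximated by re-using the same hit list clipped to the current region.

```python
def terminal_trim(region, hits, endbuffer=100, trimflank=200):
    """One trimming pass over the retained region (start, end)."""
    start, end = region
    new_start, new_end = start, end
    for hit_start, hit_end in hits:
        # Clip the hit to the current region; skip hits already trimmed away
        hit_start, hit_end = max(hit_start, start), min(hit_end, end)
        if hit_start > hit_end:
            continue
        if hit_start - start <= endbuffer:  # terminal at the 5' end
            new_start = max(new_start, hit_end + 1 + trimflank)
        if end - hit_end <= endbuffer:      # terminal at the 3' end
            new_end = min(new_end, hit_start - 1 - trimflank)
    return new_start, new_end


def trim_with_mode(length, hits, trmode="keep", endbuffer=100, trimflank=200):
    """Return the retained (start, end); start > end means nothing is left."""
    full = (1, length)
    region = terminal_trim(full, hits, endbuffer, trimflank)
    if trmode == "trim" or region == full:
        return region  # single pass only, or nothing to trim
    while region[0] <= region[1]:
        rescanned = terminal_trim(region, hits, endbuffer, trimflank)
        if rescanned == region:
            return region  # no terminal repeat remains
        if trmode == "keep":
            return full    # still repeat-terminated: keep read at full length
        region = rescanned  # trmode=iterate: keep trimming back
    return region


# Made-up tandem array of three repeat copies at the 3' end of a 10 kb read
tandem = [(9000, 9950), (8000, 8900), (7000, 7900)]
print(trim_with_mode(10000, tandem, "trim"))     # (1, 8799)
print(trim_with_mode(10000, tandem, "keep"))     # (1, 10000)
print(trim_with_mode(10000, tandem, "iterate"))  # (1, 6799)
```

With trmode=trim only the outermost copy is removed, so repeating the call on the trimmed read would peel off one copy per run, which is the iterative behaviour the text describes.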
In addition to the main *.trex.fasta output, a *.trex.tdt table will be generated of the sequences trimmed from
the original preassembly reads:
- Read = Preassembly read being trimmed.
- Length = The untrimmed length of the read.
- Start = The start of the trimmed read relative to the original read (1-L).
- End = The end of the trimmed read relative to the original read (1-L).
- BegRpt = Repeat element used for trimming 5' end. (The one with the longest terminal hit will be used.)
- EndRpt = Repeat element used for trimming 3' end. (The one with the longest terminal hit will be used.)
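A table with these fields could be summarised along the following lines. The sketch assumes the *.trex.tdt file is tab-delimited with a header row using exactly the field names listed above; the rows shown are invented for illustration and are not real TRex output.

```python
import csv
import io


def trimmed_bases(handle):
    """Map each read name to the number of bases removed by trimming."""
    removed = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        kept = int(row["End"]) - int(row["Start"]) + 1
        removed[row["Read"]] = int(row["Length"]) - kept
    return removed


# Invented example rows; a real file would be opened with open(path, newline="")
example = (
    "Read\tLength\tStart\tEnd\tBegRpt\tEndRpt\n"
    "read1\t10000\t1\t8999\t\tTy1\n"
    "read2\t8000\t801\t8000\tTy3\t\n"
)
print(trimmed_bases(io.StringIO(example)))  # {'read1': 1001, 'read2': 800}
```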
By default, there will also be *.blast and *.local.tdt files generated by GABLAM as part of the search. Keeping
these will enable more rapid re-running of TRex with different trim settings. NOTE: to re-run with more relaxed
minlength/minid settings, the *.local.tdt should first be removed.
Commandline
Input/Output options
seqin=FASFILE : Fasta file of representative repeat sequences for removal []
basefile=FILE : Root of output file names [preassembly or searchdb basefile]
keepblast=T/F : Whether to keep the BLAST output of the repeats versus reads [True]
keeplocal=T/F : Whether to keep the GABLAM local BLAST hit table of repeats versus reads [True]
cleandb=T/F : Whether to clean up (delete) files generated during makedb formatting [False]
Repeat Element Identification options
trex=X : TRex run mode: trim/hunt/strip/mask/extract/repoint [hunt]
minlength=INT : Minimum length of a fragment (local BLAST hit) worthy of consideration [250]
minid=PERC : Minimum %identity of a fragment (local BLAST hit) worthy of consideration [60.0]
Read Trimming options
preassembly=FASFILE : Fasta file of preassembly reads []
endbuffer=X : Max distance from end of sequence to flag repeat for trimming [100]
trimflank=X : Additional flanking nucleotides to trim off the end of the sequence [200]
trmode=X : How to handle terminal tandem repeats (keep/trim/iterate) [keep]
Repeat Hunting/Stripping/Masking options
searchdb=FASFILE : Genome or assembly in which to find and classify TEs []
refgenome=FASFILE : Reference Genome to use for establishing flanking positions []
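As a rough illustration of how the option=value settings above combine for a trimming run, the helper below assembles an argument list. The script name `trex.py` and the input file names are placeholders (assumptions, not taken from this documentation); only option names listed above are used. Because the default run mode is hunt, trex=trim is set explicitly.

```python
def build_trex_command(script="trex.py", **options):
    """Build an argument list of option=value pairs for a TRex run.

    `script` is a placeholder path; substitute the real location of TRex.
    """
    command = ["python", script]
    for name, value in options.items():
        command.append("{0}={1}".format(name, value))
    return command


command = build_trex_command(
    seqin="ty_elements.fas",          # hypothetical repeat library
    preassembly="preassembly.fasta",  # hypothetical preassembly reads
    trex="trim",                      # default mode is hunt
    minlength=250,
    endbuffer=100,
    trimflank=200,
    trmode="iterate",
)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` once the placeholder paths point at real files.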
History Module Version History
# 0.0.0 - Initial Compilation.
# 0.1.0 - Added trmode=X : How to handle tandem repeats (keep/trim/iterate) [keep]
# 0.1.1 - Fixed IOEror typo.
# 0.2.0 - Added TR Hunt mode based on processing of GABLAM unique searches.