Function
This module maps one set of protein sequences onto a different sequence database, using Accession Numbers etc.
where possible and then using GABLAM when no direct match is possible. The program gives the following outputs:
- *.*.mapped.fas = Fasta file of successfully mapped sequences
- *.*.missing.fas = Fasta file of sequences that could not be mapped
- *.*.mapping.tdt = Delimited file giving details of mapping (Seq, MapSeq, Method)
If combine=T then the *.missing.fas file will not be created and unmapped sequences will be output in *.mapped.fas.
Note that the possible mappings are all identified through BLAST and so a protein with matching IDs etc. but not
hitting with BLAST will NOT be mapped. Currently only mapping of protein or nucleotides onto a protein database is
supported.
Unless the interactivity setting is set to 2 or more (i=2), sequences that are mapped using Name, AccNum, Sequence
(100% identical sequences), ID or DescAcc will be mapped onto the first appropriate sequence. If automap > 0, then
the best sequence according to the mapstat will be mapped automatically. If two sequences tie, the other two possible
stats will also be used to rank the hits. If still tied and mapfocus is not "both" then the sequences will be ranked
using both query and hit stats. If still tied, the first sequence will be selected.
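The tie-breaking order described above can be sketched as a sort key. This is an illustration only: the `Hit` record, its field names and the order in which the "other two" stats are applied are assumptions, not SeqMapper's internal data structures. The further step of ranking on both query and hit stats is left out.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """Hypothetical record of GABLAM percentages for one candidate mapping."""
    name: str
    ID: float
    Sim: float
    Len: float

STATS = ("ID", "Sim", "Len")

def best_hit(hits, mapstat="ID"):
    """Pick the best hit by mapstat, then by the other two stats."""
    # Primary stat first, remaining stats as tie-breakers (order assumed).
    order = [mapstat] + [s for s in STATS if s != mapstat]

    def key(hit):
        return tuple(getattr(hit, s) for s in order)

    # max() returns the first maximal element, so the first hit wins exact ties.
    return max(hits, key=key)

candidates = [
    Hit("a", 99.6, 99.7, 100.0),
    Hit("b", 99.6, 99.9, 100.0),
    Hit("c", 99.6, 99.9, 100.0),
]
print(best_hit(candidates).name)
```

Here "a" and "b" tie on ID, "b" wins on Sim, and "b" is kept over the identical "c" because it comes first.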
Any sequences that fall below automap (or i>1) but meet the minmap criteria will be ranked according to their BLAST
rankings and then presented for a user decision. Presentation will be in reverse order, so that in the case of many
possible mappings, the best options remain clear and on screen. The default choice (selected by hitting ENTER) will
be the best ranked according to GABLAM stats, which will have been moved to position 1 if not already there. (BLAST
rankings and GABLAM rankings will not always agree.)
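The presentation order for user decisions can be sketched as follows. The function name and the use of plain strings for hits are hypothetical. The sketch only shows the reordering: the best GABLAM hit is moved to position 1, and the numbered list is then reversed so that option 1 is printed last.

```python
def presentation_order(blast_ranked, gablam_best):
    """Return (option number, hit) pairs in the order they would be printed."""
    ranked = list(blast_ranked)
    # Move the best GABLAM hit to position 1 (raises ValueError if it is absent).
    ranked.remove(gablam_best)
    ranked.insert(0, gablam_best)
    # Number from 1, then reverse so the default choice is printed last.
    numbered = list(enumerate(ranked, start=1))
    return numbered[::-1]

for number, hit in presentation_order(["h1", "h2", "h3"], "h2"):
    print(number, hit)
```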
SeqMapper will enter a user menu if i>1 or seqin and/or mapdb are missing. If i=0 and one of these is missing, a
simple prompt will ask for the missing files. If i<0 and one of these is missing, the program will exit.
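One possible reading of these start-up rules is sketched below. The function and its return labels are hypothetical, and the sketch assumes that the i=0 and i<0 cases take precedence over the general "menu if files are missing" rule. It is not taken from the SeqMapper source.

```python
def startup_action(i, seqin=None, mapdb=None):
    """Return 'menu', 'prompt', 'exit' or 'run' for a given interactivity level."""
    missing = not seqin or not mapdb
    if i < 0:
        # Full auto: nothing can be asked, so missing input is fatal.
        return "exit" if missing else "run"
    if i == 0:
        # No menu: fall back to a simple prompt for the missing files.
        return "prompt" if missing else "run"
    # i >= 1: menu if highly interactive or if input files are missing.
    if i > 1 or missing:
        return "menu"
    return "run"

print(startup_action(0, seqin="query.fas"))
```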
Commandline
### Input Options ###
seqin=FILE : File of sequences to be mapped [None]
mapdb=FILE : File of sequences to map sequences onto [None]
startfrom=X : Shortname or AccNum of seqin file to startfrom (will append results) (memsaver=T only) [None]
### Output Options ###
resfile=FILE : Base of output filenames (*.mapped.fas, *.missing.fas & *.mapping.tdt) [seqin.mapdb]
combine=T/F : Combine both fasta files in one (e.g. include unmapped sequences in *.mapped.fas) [False]
gablamout=T/F : Output GABLAM statistics for mapped sequences, including "straight" matches [True]
append=T/F : Append rather than overwrite results files [False]
delimit=X : Delimiter for *.mapping.* file (will set extension) [tab]
basefile=FILE : Set resfile=FILE and log=FILE at the same time []
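The output naming described in this section can be sketched as a small helper. Several details are assumptions made for illustration: that the default seqin.mapdb base is built from the two file names with directory and extension removed, and that delimiters other than tab map to the extensions shown. Only the tab/.tdt pairing comes from the text above.

```python
import os

def _stem(path):
    """File name without directory or final extension (assumed convention)."""
    return os.path.splitext(os.path.basename(path))[0]

def output_files(seqin, mapdb, resfile=None, combine=False, delimit="tab"):
    """Return the expected output file names keyed by type."""
    base = resfile or "%s.%s" % (_stem(seqin), _stem(mapdb))
    # Only tab -> tdt is documented; the other extensions are guesses.
    ext = {"tab": "tdt", "comma": "csv"}.get(delimit, "txt")
    files = {
        "mapped": base + ".mapped.fas",
        "mapping": base + ".mapping." + ext,
    }
    if not combine:
        # With combine=T, unmapped sequences go into *.mapped.fas instead.
        files["missing"] = base + ".missing.fas"
    return files

print(output_files("data/query.fas", "db/uniprot.fas"))
```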
### Mapping Options ###
i=X : Set interactivity. i=-1 full auto. i=0 no menu. i=1 interactive menu. [1]
mapspec=X : Maps sequences onto given species code. "Self" = same species as query. "None" = any. [None]
mapping=LIST : Possible ways of mapping sequences (in pref order) [Name,AccNum,Sequence,ID,DescAcc,GABLAM,grep]
- Name = First word of sequence name
- Sequence = Identical sequence
- grep = grep-based searching of sequence if no hits
- ID = SwissProt style ID of GENE_SPECIES (note that the species may be changed according to mapspec)
- AccNum = Primary Accession Number
- DescAcc = Accession Number featured in description line in form "\WAccNum\W", where \W is a non-word character
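The DescAcc pattern can be illustrated with Python's `re` module. The helper name is hypothetical and the sketch applies the "\WAccNum\W" form literally. It is not the SeqMapper implementation.

```python
import re

def desc_has_accnum(description, accnum):
    """True if accnum appears in description flanked by non-word characters."""
    pattern = r"\W" + re.escape(accnum) + r"\W"
    return re.search(pattern, description) is not None

print(desc_has_accnum("Similar to kinase (P12345) from mouse", "P12345"))
print(desc_has_accnum("Similar to P123456 kinase", "P12345"))
```

Applied literally, the pattern needs a flanking character on both sides, so an accession number at the very start or end of the description would not match unless the text is padded first.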
skipgene=LIST : List of "genes" in protein IDs to ignore [ens,nvl,ref,p,hyp,frag]
mapstat=X : GABLAM Stat to use for mapping assessment (if GABLAM in mapping list) (ID/Sim/Len) [ID]
minmap=X : Minimum value of mapstat for any mapping to occur [90.0]
automap=X : Minimum value of mapstat for automatic mapping to occur (if i<1) [99.5]
ordered=T/F : Whether to use GABLAMO rather than GABLAM stat [True]
mapfocus=X : Focus for mapping statistic, i.e. which sequence must meet requirements [query]
- query = Best if query is ultimate focus and maximises closeness of mapped sequence
- hit = Best if lots of sequence fragments are in mapdb and should be allowed as mappings
- either = Best if both above conditions are true
- both = Gets most similar sequences in terms of length but can be quite strict where length errors exist
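How mapfocus might combine with the minmap and automap thresholds is sketched below. The function names, the inclusive comparisons and the reading of "either" and "both" as logical OR and AND over the query and hit values are assumptions. The effect of the interactivity setting is ignored.

```python
def meets(query_stat, hit_stat, threshold, mapfocus="query"):
    """Check the mapstat threshold against the sequence(s) selected by mapfocus."""
    if mapfocus == "query":
        return query_stat >= threshold
    if mapfocus == "hit":
        return hit_stat >= threshold
    if mapfocus == "either":
        return query_stat >= threshold or hit_stat >= threshold
    if mapfocus == "both":
        return query_stat >= threshold and hit_stat >= threshold
    raise ValueError("Unknown mapfocus: %r" % mapfocus)

def classify(query_stat, hit_stat, mapfocus="query", minmap=90.0, automap=99.5):
    """Return 'auto' (map automatically), 'ask' (user decision) or 'none'."""
    if meets(query_stat, hit_stat, automap, mapfocus):
        return "auto"
    if meets(query_stat, hit_stat, minmap, mapfocus):
        return "ask"
    return "none"

# A short fragment in mapdb: fully covered itself, but covering 60% of the query.
print(classify(60.0, 100.0, mapfocus="query"))
print(classify(60.0, 100.0, mapfocus="hit"))
```

Under these assumptions the fragment is rejected with mapfocus=query but mapped automatically with mapfocus=hit.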
### Advanced BLAST Options ###
blaste=X : E-Value cut-off for BLAST searches (BLAST -e X) [1e-4]
blastv=X : Number of BLAST hits to return per query (BLAST -v X) [20]
blastf=T/F : Complexity Filter (BLAST -F X) [False]
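The correspondence between these options and the BLAST flags quoted in their descriptions can be sketched as follows. The helper is hypothetical, and the T/F value passed to -F follows the legacy BLAST convention, which is an assumption here. How SeqMapper's own BLAST module actually passes these settings is not shown in this documentation.

```python
def blast_flags(blaste=1e-4, blastv=20, blastf=False):
    """Build the -e/-v/-F flag list from SeqMapper-style option values."""
    return [
        "-e", str(blaste),             # E-value cut-off
        "-v", str(blastv),             # number of hits per query
        "-F", "T" if blastf else "F",  # complexity filter on/off
    ]

print(" ".join(blast_flags()))
```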
History
Module Version History
# 0.0 - Initial Compilation.
# 1.0 - Basic working version for protein databases.
# 1.1 - Modified run() method to be called from other programs
# 1.2 - Added grep method
# 2.0 - Reworked with new Object format, new BLAST(+) module and new seqlist module.
# 2.1 - Added catching of failure to read input sequences. Removed 'Run' from GABLAM table.
# 2.2.0 - Updated basefile to set resfile.
# 2.3.0 - Added GABLAM-free method.
SeqMapper REST Output formats
Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
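A minimal sketch of splitting the full output into its OUTFMT sections is given below. It assumes only the layout shown above (a `###~~~...~~~###` separator line followed by a `# OUTFMT:` header line). The section names in the sample text are made up for illustration.

```python
import re

SEPARATOR = re.compile(r"^###~+###\s*$")
HEADER = re.compile(r"^# (\S+):\s*$")

def parse_rest_full(text):
    """Split &rest=full output into a dict of {OUTFMT: section text}."""
    sections = {}
    current = None
    lines = text.splitlines()
    idx = 0
    while idx < len(lines):
        line = lines[idx]
        # A separator immediately followed by "# NAME:" starts a new section.
        if SEPARATOR.match(line) and idx + 1 < len(lines):
            header = HEADER.match(lines[idx + 1])
            if header:
                current = header.group(1)
                sections[current] = []
                idx += 2
                continue
        if current is not None:
            sections[current].append(line)
        idx += 1
    return {name: "\n".join(body) for name, body in sections.items()}

sample = (
    "###~~~~~~~~###\n# log:\nline one\nline two\n"
    "###~~~~~~~~###\n# mapping:\nSeq\tMapSeq\tMethod\n"
)
print(sorted(parse_rest_full(sample)))
```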
Available REST Outputs
There is currently no specific help available on REST output for this program.