SLiMSuite REST Server


SeqMapper V2.1

Sequence Mapping Program

Module: SeqMapper
Description: Sequence Mapping Program
Version: 2.1
Last Edit: 16/04/14

Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice


Imported modules: rje rje_menu rje_obj rje_seq rje_seqlist rje_zen rje_blast_V2 rje_sequence


See SLiMSuite Blog for further documentation.

Function

This module is for mapping one set of protein sequences onto a different sequence database, using Accession Numbers etc. where possible and then using GABLAM when no direct match is possible. The program gives the following outputs:
- *.*.mapped.fas = Fasta file of successfully mapped sequences
- *.*.missing.fas = Fasta file of sequences that could not be mapped
- *.*.mapping.tdt = Delimited file giving details of mapping (Seq, MapSeq, Method)

If combine=T then the *.missing.fas file will not be created and unmapped sequences will be output in *.mapped.fas. Note that the possible mappings are all identified through BLAST and so a protein with matching IDs etc. but not hitting with BLAST will NOT be mapped. Currently only mapping of protein or nucleotides onto a protein database is supported.
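The *.mapping.tdt output described above can be summarised with a few lines of Python. The sketch below is illustrative only: it assumes the file has a header row with the field names Seq, MapSeq and Method, and the rows shown are invented placeholders rather than real SeqMapper output. With gablamout=T the real file may carry additional columns.

```python
import csv
import io
from collections import Counter

def read_mapping(handle, delimiter="\t"):
    """Return a list of dicts, one per row of a delimited mapping table."""
    return list(csv.DictReader(handle, delimiter=delimiter))

# Placeholder rows for illustration; real identifiers and methods will differ.
example = io.StringIO(
    "Seq\tMapSeq\tMethod\n"
    "seqA\ttargetA\tAccNum\n"
    "seqB\ttargetB\tGABLAM\n"
    "seqC\ttargetC\tAccNum\n"
)
rows = read_mapping(example)

# Count how many sequences were mapped by each method.
method_counts = Counter(row["Method"] for row in rows)
print(method_counts)
```

For a real run, replace the StringIO object with an open file handle and pass delimiter="," if delimit was set to produce a comma-separated file.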

Unless the interactivity setting is set to 2 or more (i=2), sequences that are mapped using Name, AccNum, Sequence (100% identical sequences), ID or DescAcc will be mapped onto the first appropriate sequence. If automap > 0, then the best sequence according to the mapstat will be mapped automatically. If two sequences tie, the other two possible stats will also be used to rank the hits. If still tied and mapfocus is not "both" then the sequences will be ranked using both query and hit stats. If still tied, the first sequence will be selected.
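The tie-breaking order described above can be sketched as a sort key. This is not SeqMapper's own code: the hit dictionaries and their field names are hypothetical, the sketch only covers mapfocus=query and mapfocus=hit, and taking the lower of the query and hit values as the combined "both" tie-breaker is an assumption.

```python
STATS = ("ID", "Sim", "Len")  # the three mapstat choices

def rank_key(hit, mapstat="ID", mapfocus="query"):
    """Sort key: mapstat, then the other two stats, then a combined value."""
    if mapfocus not in ("query", "hit"):
        raise ValueError("this sketch only covers mapfocus=query or mapfocus=hit")
    focus = hit[mapfocus]
    others = [s for s in STATS if s != mapstat]
    # Assumption: "both query and hit stats" is modelled as the lower value.
    combined = min(hit["query"][mapstat], hit["hit"][mapstat])
    return (focus[mapstat], focus[others[0]], focus[others[1]], combined)

def best_hit(hits, mapstat="ID", mapfocus="query"):
    """Return the top-ranked hit; max() keeps the first of any exact ties."""
    return max(hits, key=lambda h: rank_key(h, mapstat, mapfocus))

# Hypothetical hits that tie on all three query stats.
hits = [
    {"name": "hit1",
     "query": {"ID": 99.0, "Sim": 99.0, "Len": 100.0},
     "hit": {"ID": 80.0, "Sim": 80.0, "Len": 85.0}},
    {"name": "hit2",
     "query": {"ID": 99.0, "Sim": 99.0, "Len": 100.0},
     "hit": {"ID": 95.0, "Sim": 95.0, "Len": 96.0}},
]
print(best_hit(hits)["name"])
```

Here the two hits tie on every query statistic, so the combined query/hit value decides in favour of hit2. If that value were also tied, the first hit in the list would be returned.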

Any sequences that fall below automap (or i>1) but meet the minmap criteria will be ranked according to their BLAST rankings and then presented for a user decision. Presentation will be in reverse order, so that in the case of many possible mappings, the best options remain clear and on screen. The default choice (selected by hitting ENTER) will be the best ranked according to GABLAM stats, which will have been moved to position 1 if not already there. (BLAST rankings and GABLAM rankings will not always agree.)
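The presentation order described above can be illustrated with a short sketch. The function name and the use of plain strings for hits are hypothetical. The idea is simply that the best GABLAM hit is moved to position 1, the remaining hits keep their BLAST order, and the numbered list is printed last-to-first so that option 1 ends up nearest the prompt.

```python
def menu_order(blast_ranked, best_gablam):
    """Return (number, hit) pairs in the order they would be printed."""
    # Move the best GABLAM hit to position 1, keeping BLAST order otherwise.
    ordered = [best_gablam] + [h for h in blast_ranked if h != best_gablam]
    numbered = list(enumerate(ordered, start=1))
    # Print in reverse so that the best options stay on screen.
    return list(reversed(numbered))

for number, hit in menu_order(["A", "B", "C"], best_gablam="B"):
    print(number, hit)
```

In this example hit B is second by BLAST rank but best by GABLAM, so it becomes option 1 and is printed last, directly above where the prompt would appear.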

SeqMapper will enter a user menu if i>1 or seqin and/or mapdb are missing. If i=0 and one of these is missing, a simple prompt will ask for the missing files. If i<0 and one of these is missing, the program will exit.
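One reading of the start-up rules above is sketched below. The function and its return labels are hypothetical. It assumes that i>1 always opens the menu, and that i=1 with a missing file also opens the menu, since the text only gives separate behaviour for i=0 and i<0.

```python
def startup_action(i, seqin=None, mapdb=None):
    """Return 'menu', 'prompt', 'exit' or 'run' for a given interactivity level."""
    missing = not seqin or not mapdb
    if i > 1:
        return "menu"      # full interactivity always opens the menu
    if not missing:
        return "run"       # both files supplied, nothing to ask
    if i < 0:
        return "exit"      # silent mode cannot ask for files
    if i == 0:
        return "prompt"    # simple prompt for the missing file(s)
    return "menu"          # i=1 with a missing file (assumed)

print(startup_action(0, seqin="queries.fas"))
```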

Commandline

### Input Options ###
seqin=FILE : File of sequences to be mapped [None]
mapdb=FILE : File of sequences to map sequences onto [None]
startfrom=X : Shortname or AccNum in seqin file to start from (will append results) (memsaver=T only) [None]
### Output Options ###
resfile=FILE : Base of output filenames (*.mapped.fas, *.missing.fas & *.mapping.tdt) [seqin.mapdb]
combine=T/F : Combine both fasta files in one (i.e. include unmapped sequences in *.mapped.fas) [False]
gablamout=T/F : Output GABLAM statistics for mapped sequences, including "straight" matches [True]
append=T/F : Append rather than overwrite results files [False]
delimit=X : Delimiter for *.mapping.* file (will set extension) [tab]
### Mapping Options ###
mapspec=X : Maps sequences onto given species code. "Self" = same species as query. "None" = any. [None]
mapping=LIST : Possible ways of mapping sequences (in pref order) [Name,AccNum,Sequence,ID,DescAcc,GABLAM,grep]
- Name = First word of sequence name
- Sequence = Identical sequence
- grep = grep-based searching of sequence if no hits
- ID = SwissProt style ID of GENE_SPECIES (note that the species may be changed according to mapspec)
- AccNum = Primary Accession Number
- DescAcc = Accession Number featured in description line in form "\WAccNum\W", where \W is non-
skipgene=LIST : List of "genes" in protein IDs to ignore [ens,nvl,ref,p,hyp,frag]
mapstat=X : GABLAM Stat to use for mapping assessment (if GABLAM in mapping list) (ID/Sim/Len) [ID]
minmap=X : Minimum value of mapstat for any mapping to occur [90.0]
automap=X : Minimum value of mapstat for automatic mapping to occur (if i<1) [99.5]
ordered=T/F : Whether to use GABLAMO rather than GABLAM stat [True]
mapfocus=X : Focus for mapping statistic, i.e. which sequence must meet requirements [query]
- query = Best if query is ultimate focus (maximises closeness of mapped sequence)
- hit = Best if lots of sequence fragments are in mapdb and should be allowed as mappings
- either = Best if both above conditions are true
- both = Gets most similar sequences in terms of length but can be quite strict where length errors exist
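How mapfocus interacts with the minmap and automap cut-offs can be sketched as follows. This is an interpretation rather than SeqMapper's own logic: the function names and return labels are hypothetical, and it assumes "either" means at least one of the query and hit values must reach the cut-off while "both" means each must.

```python
def meets_cutoff(query_val, hit_val, cutoff, mapfocus="query"):
    """Check a mapstat cut-off against the sequence(s) named by mapfocus."""
    checks = {
        "query": query_val >= cutoff,
        "hit": hit_val >= cutoff,
        "either": query_val >= cutoff or hit_val >= cutoff,
        "both": query_val >= cutoff and hit_val >= cutoff,
    }
    return checks[mapfocus]

def classify(query_val, hit_val, mapfocus="query", minmap=90.0, automap=99.5):
    """Return 'auto', 'ask' or 'reject' for one candidate mapping."""
    if automap > 0 and meets_cutoff(query_val, hit_val, automap, mapfocus):
        return "auto"
    if meets_cutoff(query_val, hit_val, minmap, mapfocus):
        return "ask"
    return "reject"

# A mapdb fragment covering half of the query but matching it perfectly:
# 50% from the query's point of view, 100% from the hit's.
for focus in ("query", "hit", "either", "both"):
    print(focus, classify(50.0, 100.0, mapfocus=focus))
```

The fragment example shows why mapfocus=hit suits databases containing many sequence fragments: the same pair is rejected under query or both, but accepted under hit or either.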

### Advanced BLAST Options ###
blaste=X : E-Value cut-off for BLAST searches (BLAST -e X) [1e-4]
blastv=X : Number of BLAST hits to return per query (BLAST -v X) [20]
blastf=T/F : Complexity Filter (BLAST -F X) [False]
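The options above are given as key=value arguments. A small helper for assembling such an argument list from Python might look like the sketch below. The helper, the file names and the script path seqmapper.py are assumptions for illustration; only the option names and the T/F and comma-separated list conventions come from the listing above.

```python
def seqmapper_args(**options):
    """Turn keyword options into key=value command-line arguments."""
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"              # T/F flags
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)    # LIST options
        args.append("{}={}".format(key, value))
    return args

args = seqmapper_args(
    seqin="queries.fas",        # hypothetical input file
    mapdb="target_db.fas",      # hypothetical database file
    mapping=["Name", "AccNum", "GABLAM"],
    minmap=90.0,
    combine=True,
    i=-1,                       # no interactivity
)
command = ["python", "seqmapper.py"] + args  # script location is an assumption
print(" ".join(command))
```

The resulting list can be passed to subprocess.run() once the script path matches the local SLiMSuite installation.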

History

Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Basic working version for protein databases.
    # 1.1 - Modified run() method to be called from other programs
    # 1.2 - Added grep method
    # 2.0 - Reworked with new Object format, new BLAST(+) module and new seqlist module.
    # 2.1 - Added catching of failure to read input sequences. Removed 'Run' from GABLAM table.

© 2015 RJ Edwards. Contact: richard.edwards@unsw.edu.au.