Module: rje_xref
Description: Generic identifier cross-referencing
Version: 1.8.2
Last Edit: 08/01/16
Copyright © 2014 Richard J. Edwards - See source code for GNU License Notice
Imported modules:
rje
rje_db
rje_obj
See SLiMSuite Blog for further documentation. See rje for general commands.
Function
This module is primarily for use with other programs/modules to handle database cross-referencing. Initially, it is
designed to use the output from rje_genemap, but it might eventually replace that module by generating the
xref data from raw data in the first place. Although it is designed and documented with aliases in mind, e.g. mapping
UniProt IDs to HGNC, it can also be used for more general mapping of, for example, Pfam domains and genes.
This module is designed to work with a main xrefdata table (db table XRef) that contains 1:1:1 values for each
identifier. Where there are multiple values for a given field, these will be joined with '|' characters (or
splitchar=X) and likewise split on '|' to generate aliases from the xrefdata table. (These will be sorted if
sortxref=T. This may slow the program down a little.) Split fields will have whitespace removed. To avoid a field
being split, use splitskip=LIST.
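The splitting behaviour described above can be sketched as follows. This is a minimal illustration, not the module's own code: the function name split_aliases is hypothetical, and it assumes that "whitespace removed" means all whitespace within each alias and that duplicate aliases are collapsed.

```python
def split_aliases(value, splitchar="|", splitcsv=True, sortxref=True):
    """Split a joined field value into a list of aliases (illustrative only)."""
    parts = value.split(splitchar)
    if splitcsv:
        # Optionally split each part on commas as well (cf. splitcsv=T/F)
        parts = [piece for part in parts for piece in part.split(",")]
    unique = []
    for part in parts:
        alias = "".join(part.split())  # assumption: remove ALL whitespace
        if alias and alias not in unique:  # assumption: drop empties/duplicates
            unique.append(alias)
    # Sorting mirrors sortxref=T; otherwise input order is kept
    return sorted(unique) if sortxref else unique


print(split_aliases("P12345 | Q67890,A11111"))
```

A field listed in splitskip=LIST would simply bypass a function like this and keep its raw value.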
If multiple xrefdata files are provided, these will be combined into a single table. The first file becomes the
master XRef table and any additional input will be added to this table provided that (a) it has a KeyID or AltKeys
field, or (b) it can be mapped onto an existing KeyID via a given MapField.
NOTE. If newheaders=LIST is used, it will apply to the FIRST xrefdata table only and the new fields will be used for
all subsequent processing. All other options should therefore use these new headers. If multiple files need new
headers, the program may need to be run several times before combining them. If newheaders replaces KeyID then KeyID
will be updated.
If fullmap=T then ALL MapFields will be used for mapping, even if this produces multiple mappings. Otherwise, the
first successful mapping will be used. The KeyID and AltKeys fields are automatically added to the
front of any MapField list (and thus will be used first for mapping). If an ID List is provided (idlist=LIST) then
the final XRef data are restricted to the KeyIDs in this list, once all mapping has been done.
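The difference between fullmap=T and fullmap=F can be sketched as below. This is a simplified illustration under stated assumptions rather than the module's implementation: map_identifier and the example field names (Symbol, EnsG) are hypothetical, and the XRef table is modelled as a plain list of dictionaries.

```python
def map_identifier(query, xref_rows, keyid="Gene", altkeys=(), mapfields=(),
                   fullmap=False, splitchar="|"):
    """Return the KeyIDs that a query identifier maps onto (illustrative only)."""
    # KeyID and AltKeys are placed at the front of the MapField list
    fields = [keyid] + list(altkeys)
    fields += [f for f in mapfields if f not in fields]
    hits = []
    for field in fields:
        for row in xref_rows:
            values = row.get(field, "").split(splitchar)
            if query in values and row[keyid] not in hits:
                hits.append(row[keyid])
        # Without fullmap, stop at the first field that gives a mapping
        if hits and not fullmap:
            break
    return hits


rows = [
    {"Gene": "A", "Symbol": "X1", "EnsG": "E1"},
    {"Gene": "B", "Symbol": "X2", "EnsG": "X1"},
]
print(map_identifier("X1", rows, altkeys=["Symbol"], mapfields=["EnsG"]))
print(map_identifier("X1", rows, altkeys=["Symbol"], mapfields=["EnsG"], fullmap=True))
```

In this toy table the query matches one row through an AltKeys field and another through a later MapField, so only fullmap=True returns both KeyIDs.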
If a FileXRef file is given (filexref=FILE) then the XRef data will be mapped onto this file (via MapFields) and
subsequent output will correspond to the mappings and IDs in this combined file. XRef fields to be added to the file
can be limited with xrefs=LIST. Sorted (unique) lists of mapped fields can be produced by xreflist=LIST. If no
filexref file is given then output is for the entire XRef data table. Note that the default is to produce this table.
Some database identifiers are prefixed with the database and a colon, e.g. Entrez gene 10840 may be written
ENTREZ:10840. This will be recognised and standardised by either removing the prefix or, if the field is in
dbprefix=LIST, by enforcing the DB:ID format. The database is always taken from the XRef field, so this must match,
although the case will be switched to uppercase unless in keepcase=LIST.
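A rough sketch of this prefix standardisation is given below. The function name standardise_id is hypothetical, and the exact case handling (uppercasing both the prefix and the identifier unless the field is in keepcase) is an assumption based on the description above, not a copy of the module's logic.

```python
def standardise_id(identifier, field, dbprefix=(),
                   keepcase=("Desc", "Description", "Name")):
    """Strip or enforce a DB: prefix on an identifier (illustrative only)."""
    # Assumption: both prefix and ID are uppercased unless field is in keepcase
    keep = field in keepcase
    db = field if keep else field.upper()
    if not keep:
        identifier = identifier.upper()
    prefix = db + ":"
    # The database name is taken from the field, so it must match the prefix
    if identifier.upper().startswith(prefix.upper()):
        identifier = identifier[len(prefix):]
    # Fields listed in dbprefix get the DB:ID format enforced
    return prefix + identifier if field in dbprefix else identifier


print(standardise_id("ENTREZ:10840", "Entrez"))
print(standardise_id("10840", "Entrez", dbprefix=["Entrez"]))
```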
Commandline
Input/Field Options
xrefdata=FILES : List of files with delimited data of identifier cross-referencing (wildcards allowed) []
newheaders=LIST : List of new Field headers for XRefData (will replace old - must be complete) []
keepcase=LIST : Any fields matching keepcase will retain mixed case, otherwise be converted to upper case ['Desc','Description','Name']
dbprefix=LIST : List of fields that should have the field added to the ID as a prefix, e.g. HGNC:0001 []
stripvar=CDICT : Remove variants using Field:Char list, e.g. Uniprot:-,GenPept:. []
compress=LIST : Compress listed fields into lists (using splitchar) to allow 1:many mapping in xrefdata. []
splitchar=X : Character on which to split fields for multiple alias processing ['|']
splitcsv=T/F : Whether to also split fields based on comma separation [True]
splitskip=LIST : List of fields to bypass for field splitting ['Desc','Description','Name']
sortxref=T/F : Whether to sort multiple xref data alphabetically [True]
keyid=X : Key field header to be used in main Data dictionary - aliases map to this ['Gene']
comments=LIST : List of comment line prefixes marking lines to ignore (throughout file) ['//','%']
xreformat=T/F : Whether to apply field reformatting to input xrefdata (True) or just xrefs to map (False) [False]
yeastxref=T/F : First xrefdata file is a yeast.txt file to convert. (http://www.uniprot.org/docs/yeast.txt)
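As an illustration of the stripvar=CDICT option listed above, the sketch below parses a Field:Char list and trims variant suffixes. The helper names are hypothetical, and cutting at the first occurrence of the character is an assumption; the module may handle this differently.

```python
def parse_stripvar(spec):
    """Parse a Field:Char list such as 'Uniprot:-,GenPept:.' into a dict."""
    stripvar = {}
    for item in spec.split(","):
        field, char = item.split(":", 1)
        stripvar[field] = char
    return stripvar


def strip_variant(identifier, field, stripvar):
    """Remove a variant suffix from an identifier (illustrative only)."""
    char = stripvar.get(field)
    if not char:
        return identifier  # field has no variant character defined
    # Assumption: everything from the first occurrence of char is the variant
    return identifier.split(char, 1)[0]


rules = parse_stripvar("Uniprot:-,GenPept:.")
print(strip_variant("P04637-2", "Uniprot", rules))
print(strip_variant("NP_000537.3", "GenPept", rules))
```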
XRef/Processing Options
altkeys=LIST : Alternative fields to look for in Alias Data ['Symbol','HGNC symbol']
onetomany=T/F : Whether to keep potential one-to-many altkeys IDs [False]
mapfields=X : Fields to be used for Alias mapping plus KeyID. (Must be in XRef). []
maptomany=T/F : Whether to keep potential one-to-many mapfield IDs [True]
fullmap=T/F : Whether to map onto ALL map fields or stop at first hit [False]
uniquexref=T/F : Whether to restrict analysis to unique XRef IDs [False]
mapxref=LIST : List of identifiers to map to KeyIDs using mapfields []
filexref=FILE : File to XRef and expand with xrefs before re-saving []
badid=LIST : List of XRef IDs to ignore ['!FAILED!','None','N/A','-']
aliases=LIST : Combine XRef fields into single 'Aliases' field (and remove KeyID if found)
Join Method Options
join=LIST : Run in join mode for list of FILE:key1|...|keyN:JoinField []
naturaljoin=T/F : Whether to only output entries that join to all tables [False]
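The join options above might be interpreted along the following lines. This is only a sketch: parse_join and joined_ids are hypothetical helpers, the example file and field names are invented, and the reading of naturaljoin as "keep only IDs present in every table" follows the option description rather than the source code.

```python
def parse_join(spec):
    """Split a FILE:key1|...|keyN:JoinField spec into its three parts."""
    # rsplit keeps any colons inside the file path intact
    filename, keys, joinfield = spec.rsplit(":", 2)
    return filename, keys.split("|"), joinfield


def joined_ids(tables, naturaljoin=False):
    """Return the join-field values to output, given one ID set per table."""
    id_sets = [set(ids) for ids in tables]
    if not id_sets:
        return []
    # naturaljoin=T: only entries that join to all tables; otherwise keep all
    combined = set.intersection(*id_sets) if naturaljoin else set.union(*id_sets)
    return sorted(combined)


print(parse_join("hgnc.tdt:Symbol|Entrez:Gene"))
print(joined_ids([{"A", "B"}, {"B", "C"}], naturaljoin=True))
```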
Output Options
basefile=X : Basefile for output files [Default: filexref or first xrefdata input file w/o path]
savexref=T/F : Save the xrefdata table (*.xref.tdt) following compilation of data [True]
idlist=LIST : Subset of key IDs to map onto. (All if blank) []
xrefs=LIST : List of XRef (or join) fields to keep (blank/* for all) []
xreflist=LIST : List of XRef fields to output as sorted (unique) lists (*.*.txt) (* for all) []
See also rje.py generic commandline options.
History
Module Version History
# 0.0 - Initial Compilation.
# 1.0 - Added xfrom and xto fields and xMap() function for mapping from one ID set to another.
# 1.1 - Added output of ID lists to text files. Major reworking. Tested with HPRD and HGNC.
# 1.2 - Added join=LIST Run in join mode for list of FILE:key1|...|keyN:JoinField [] and naturaljoin=T/F.
# 1.3.0 - Added compress=LIST to handle 1:many input data. []
# 1.3.1 - Fixed xref list bug.
# 1.4.0 - Added optional Mapping dictionary for speeding up recurring mapping (should avoid if memsaver=F).
# 1.5.0 - Added stripvar=CDICT removal of variants using Field:Char list, e.g. Uniprot:-,GenPept:. []
# 1.6.0 - Added mapxref=LIST List of identifiers to map to KeyIDs using mapfields []
# 1.7.0 - Added comments=LIST List of comment line prefixes marking lines to ignore (throughout file) ['//','%']
# 1.7.1 - Added xreformat=T/F : Whether to apply field reformatting to input xrefdata (True) or just xrefs to map (False) [False]
# 1.8.0 - Added recognition and parsing of yeast.txt XRef file from Uniprot (http://www.uniprot.org/docs/yeast.txt)
# 1.8.1 - Added rest run mode to avoid XRef table output if no gene ID list is given. Added `genes` and `genelist` as `idlist=LIST` synonym.
# 1.8.2 - Catching self.dict['Mapping'] error for REST server.
rje_xref REST Output formats
The XRef server is designed to take a set of input gene (or other) identifiers and extract database
identifier cross-references for them. Genes are given using &idlist=LIST. (&genes=LIST or &genelist=LIST
should also work.)
Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT:
tab = main table of identified elements. [tdt]
mapped = pairs of provided identifiers and the primary ID mapped onto. [tdt]
failed = list of identifiers that failed to map. [list]
In addition, there will be a tab per field of the XRef file listing the sorted unique identifiers mapped.