Module:	rje_xref
Description:	Generic identifier cross-referencing
Version:	1.8.2
Last Edit:	08/01/16

Imported modules: rje rje_db rje_obj

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module is primarily for use with other programs/modules to handle database cross-referencing. Initially, it is designed to use the output from rje_genemap, but it might eventually replace rje_genemap by generating the xref data directly from raw data. Although it is designed and documented with aliases in mind, e.g. mapping UniProt IDs to HGNC, it can also be used for more general mapping of, for example, Pfam domains and genes.

This module is designed to work with a main xrefdata table (db table XRef) that contains 1:1:1 values for each identifier. Where there are multiple values for a given field, these will be joined with '|' characters (or splitchar=X) and likewise split on '|' to generate aliases from the xrefdata table. (These will be sorted if sortxref=T. This may slow the program down a little.) Split fields will have whitespace removed. To avoid a field being split, use splitskip=LIST.
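The splitting and joining behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the module's own code: the function names split_xref, join_xref and split_row are hypothetical, and it assumes that "whitespace removed" means all whitespace within each split value is dropped and that empty values are discarded.

```python
def split_xref(value, splitchar="|", splitcsv=True, sortxref=True):
    """Split one field value into a list of whitespace-free identifiers."""
    if splitcsv:
        # splitcsv=T: treat commas as an additional separator
        value = value.replace(",", splitchar)
    # Remove all whitespace from each part (assumption) and drop empty parts
    parts = ["".join(part.split()) for part in value.split(splitchar)]
    parts = [part for part in parts if part]
    return sorted(parts) if sortxref else parts


def join_xref(values, splitchar="|", sortxref=True):
    """Join multiple identifiers into a single field value."""
    values = sorted(values) if sortxref else list(values)
    return splitchar.join(values)


def split_row(row, splitskip=("Desc", "Description", "Name"), **kwargs):
    """Split every field of a row except those listed in splitskip."""
    return {
        field: [value] if field in splitskip else split_xref(value, **kwargs)
        for field, value in row.items()
    }
```

For example, split_row({"Gene": "TP53", "Desc": "Tumor protein, p53"}) leaves the Desc value intact because Desc is in the default splitskip list.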

If multiple xrefdata files are provided, these will be combined into a single table. The first file becomes the master XRef table and any additional input will be added to this table provided that (a) it has a KeyID or AltKeys field, or (b) it can be mapped onto an existing KeyID via a given MapField.

NOTE. If newheaders=LIST is used, it will apply to the FIRST xrefdata table only and the new fields will be used for all subsequent processing. All other options should therefore use these new headers. If multiple files need new headers, the program may need to be run several times before combining them. If newheaders replaces KeyID then KeyID will be updated.

If fullmap=T then ALL MapFields will be used for mapping, even if this produces multiple mappings. Otherwise, the first successful mapping will be used. The KeyID and AltKeys fields are automatically added to the front of any MapField list (and thus will be used first for mapping). If an ID List is provided (idlist=LIST) then these final XRef data are restricted to the KeyIDs in this list, once all mapping has been done.
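The difference between fullmap=T and fullmap=F can be illustrated with a small sketch. It is not the module's implementation: the function map_id, the in-memory table layout (KeyID mapped to a dictionary of field lists) and the identifiers are all hypothetical, and "first successful mapping" is interpreted here as stopping at the first field that yields any hit.

```python
def map_id(query, xref, keyid="Gene", altkeys=(), mapfields=(), fullmap=False):
    """Return the KeyIDs that `query` maps onto.

    xref is assumed to be {KeyID: {field: [identifiers]}}.
    """
    # KeyID and AltKeys go to the front of the MapField list
    fields = [keyid]
    for field in list(altkeys) + list(mapfields):
        if field not in fields:
            fields.append(field)
    hits = []
    for field in fields:
        for key, entry in xref.items():
            values = [key] if field == keyid else entry.get(field, [])
            if query in values and key not in hits:
                hits.append(key)
        if hits and not fullmap:
            break  # fullmap=F: stop at the first field that gives a mapping
    return hits


# Hypothetical data: X1 is a Uniprot ID of GENE_A and an Alias of GENE_B
xref = {
    "GENE_A": {"Uniprot": ["X1"], "Alias": ["AAA"]},
    "GENE_B": {"Uniprot": ["X2"], "Alias": ["X1"]},
}
first_hit = map_id("X1", xref, mapfields=["Uniprot", "Alias"])
all_hits = map_id("X1", xref, mapfields=["Uniprot", "Alias"], fullmap=True)
```

With this data, first_hit contains only GENE_A, whereas all_hits contains both GENE_A and GENE_B.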

If a FileXRef file is given (filexref=FILE) then the XRef data will be mapped onto this file (via MapFields) and subsequent output will correspond to the mappings and IDs in this combined file. XRef fields to be added to the file can be limited with xrefs=LIST. Sorted (unique) lists of the mapped identifiers in given fields can be produced with xreflist=LIST. If no filexref file is given then output is for the entire XRef data table. Note that the default is to produce this table.

Some database identifiers are prefixed with the database and a colon. e.g. Entrez gene 10840 may be written ENTREZ:10840. This will be recognised and standardised by either removing the prefix or, if the field is in dbprefix=LIST, by enforcing the DB:ID format. The database is always taken from the XRef field, so this must match, although the case will be switched to uppercase unless in keepcase=LIST.
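A minimal sketch of this prefix handling is given below. The function standardise_id is hypothetical and simply follows the description above; it assumes that the prefix is compared with the field name case-insensitively and that a prefix belonging to a different database is left untouched.

```python
def standardise_id(identifier, field, dbprefix=(), keepcase=()):
    """Standardise one identifier for the given XRef field."""
    # The database name is taken from the field, upper case unless in keepcase
    db = field if field in keepcase else field.upper()
    if field not in keepcase:
        identifier = identifier.upper()
    # Remove an existing DB: prefix if it matches the field name (assumption:
    # the comparison ignores case)
    head, sep, tail = identifier.partition(":")
    if sep and head.upper() == field.upper():
        identifier = tail
    # Enforce DB:ID format only for fields listed in dbprefix
    return f"{db}:{identifier}" if field in dbprefix else identifier
```

For example, standardise_id("ENTREZ:10840", "Entrez") gives "10840", while standardise_id("10840", "Entrez", dbprefix=["Entrez"]) gives "ENTREZ:10840".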

Commandline

Input/Field Options

xrefdata=FILES : List of files with delimited data of identifier cross-referencing (wildcards allowed) []
newheaders=LIST : List of new Field headers for XRefData (will replace old - must be complete) []
keepcase=LIST : Any fields matching keepcase will retain mixed case, otherwise be converted to upper case ['Desc','Description','Name']
dbprefix=LIST : List of fields that should have the field name added to the ID as a prefix, e.g. HGNC:0001 []
stripvar=CDICT : Remove variants using Field:Char list, e.g. Uniprot:-,GenPept:. []
compress=LIST : Compress listed fields into lists (using splitchar) to allow 1:many mapping in xrefdata. []
splitchar=X : Character on which to split fields for multiple alias processing ['|']
splitcsv=T/F : Whether to also split fields based on comma separation [True]
splitskip=LIST : List of fields to bypass for field splitting ['Desc','Description','Name']
sortxref=T/F : Whether to sort multiple xref data alphabetically [True]
keyid=X : Key field header to be used in main Data dictionary - aliases map to this ['Gene']
comments=LIST : List of comment line prefixes marking lines to ignore (throughout file) ['//','%']
xreformat=T/F : Whether to apply field reformatting to input xrefdata (True) or just xrefs to map (False) [False]
yeastxref=T/F : First xrefdata file is a yeast.txt file to convert. (http://www.uniprot.org/docs/yeast.txt)
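
As an illustration of the stripvar=CDICT option, the sketch below parses a Field:Char list and strips variant suffixes. The functions parse_cdict and strip_variant are hypothetical, and the sketch assumes that "removing variants" means truncating the identifier at the first occurrence of the given character (e.g. an isoform or version suffix).

```python
def parse_cdict(text):
    """Parse 'Uniprot:-,GenPept:.' into {'Uniprot': '-', 'GenPept': '.'}."""
    cdict = {}
    for item in text.split(","):
        if item:
            field, _, char = item.partition(":")
            cdict[field] = char
    return cdict


def strip_variant(identifier, field, stripvar):
    """Truncate identifier at the variant character set for its field."""
    char = stripvar.get(field)
    # Fields without a stripvar entry are returned unchanged
    return identifier.split(char, 1)[0] if char else identifier


stripvar = parse_cdict("Uniprot:-,GenPept:.")
```

Under these assumptions, strip_variant("P04637-2", "Uniprot", stripvar) gives "P04637", and identifiers in fields not named in stripvar are left as they are.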

XRef/Processing Options

altkeys=LIST : Alternative fields to look for in Alias Data ['Symbol','HGNC symbol']
onetomany=T/F : Whether to keep potential one-to-many altkeys IDs [False]
mapfields=X : Fields to be used for Alias mapping plus KeyID. (Must be in XRef). []
maptomany=T/F : Whether to keep potential one-to-many mapfield IDs [True]
fullmap=T/F : Whether to map onto ALL map fields or stop at first hit [False]
uniquexref=T/F : Whether to restrict analysis to unique XRef IDs [False]
mapxref=LIST : List of identifiers to map to KeyIDs using mapfields []
filexref=FILE : File to XRef and expand with xrefs before re-saving []
badid=LIST : List of XRef IDs to ignore ['!FAILED!','None','N/A','-']
aliases=LIST : Combine XRef fields into single 'Aliases' field (and remove KeyID if found)

Join Method Options

join=LIST : Run in join mode for list of FILE:key1|...|keyN:JoinField []
naturaljoin=T/F : Whether to only output entries that join to all tables [False]
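
The join specification and the effect of naturaljoin can be sketched as follows. This is a hedged illustration rather than the module's own logic: parse_join and join_tables are hypothetical names, each table is assumed to be already loaded as a dictionary keyed by its join value, and the way join values are built from the key fields is not covered.

```python
def parse_join(spec):
    """Split 'FILE:key1|...|keyN:JoinField' into its three parts."""
    # Split from the right so that a colon inside the file path is preserved
    filename, keys, joinfield = spec.rsplit(":", 2)
    return {"file": filename, "keys": keys.split("|"), "joinfield": joinfield}


def join_tables(tables, naturaljoin=False):
    """Join tables given as {join value: {field: value}} dictionaries."""
    ids = set().union(*tables)
    if naturaljoin:
        # naturaljoin=T: keep only entries present in every table
        ids = {i for i in ids if all(i in table for table in tables)}
    joined = {}
    for i in sorted(ids):
        row = {}
        for table in tables:
            row.update(table.get(i, {}))
        joined[i] = row
    return joined


# Hypothetical file and field names
spec = parse_join("hgnc.tdt:Symbol|Entrez:Gene")
t1 = {"A": {"x": 1}, "B": {"x": 2}}
t2 = {"A": {"y": 3}}
outer = join_tables([t1, t2])
natural = join_tables([t1, t2], naturaljoin=True)
```

Here outer keeps both A and B, whereas natural keeps only A, the one entry found in both tables.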

Output Options

basefile=X : Basefile for output files [Default: filexref or first xrefdata input file w/o path]
savexref=T/F : Save the xrefdata table (*.xref.tdt) following compilation of data [True]
idlist=LIST : Subset of key IDs to map onto. (All if blank) []
xrefs=LIST : List of XRef (or join) fields to keep (blank/* for all) []
xreflist=LIST : List of XRef fields to output as sorted (unique) lists (*.*.txt) (* for all) []

See also rje.py generic commandline options.

Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Added xfrom and xto fields and xMap() function for mapping from one ID set to another.
    # 1.1 - Added output of ID lists to text files. Major reworking. Tested with HPRD and HGNC.
    # 1.2 - Added join=LIST Run in join mode for list of FILE:key1|...|keyN:JoinField [] and naturaljoin=T/F.
    # 1.3.0 - Added compress=LIST to handle 1:many input data. []
    # 1.3.1 - Fixed xref list bug.
    # 1.4.0 - Added optional Mapping dictionary for speeding up recurring mapping (should avoid if memsaver=F).
    # 1.5.0 - Added stripvar=CDICT removal of variants using Field:Char list, e.g. Uniprot:-,GenPept:. []
    # 1.6.0 - Added mapxref=LIST List of identifiers to map to KeyIDs using mapfields []
    # 1.7.0 - Added comments=LIST List of comment line prefixes marking lines to ignore (throughout file) ['//','%']
    # 1.7.1 - Added xreformat=T/F : Whether to apply field reformatting to input xrefdata (True) or just xrefs to map (False) [False]
    # 1.8.0 - Added recognition and parsing of yeast.txt XRef file from Uniprot (http://www.uniprot.org/docs/yeast.txt)
    # 1.8.1 - Added rest run mode to avoid XRef table output if no gene ID list is given. Added `genes` and `genelist` as `idlist=LIST` synonyms.
    # 1.8.2 - Catching self.dict['Mapping'] error for REST server.

rje_xref REST Output formats

The XRef server is designed to take a set of input gene (or other) identifiers and extract database
identifier cross-references for them. Genes are given using &idlist=LIST. (&genes=LIST or &genelist=LIST
should also work.)

Run with &rest=help for general options. Run with &rest=full to get full server output as text or &rest=format
for more user-friendly formatted output. Individual outputs can be identified/parsed using &rest=OUTFMT:

tab = main table of identified elements. [tdt]
mapped = pairs of provided identifiers and the primary ID mapped onto. [tdt]
failed = list of identifiers that failed to map. [list]

In addition, there will be a tab per field of the XRef file listing the sorted unique identifiers mapped.
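
A request for one of these outputs could be assembled as in the sketch below. It builds only the query string: the server address is not given here, so none is assumed. The helper rest_query is hypothetical, and it assumes that LIST values are passed as comma-separated identifiers.

```python
from urllib.parse import urlencode


def rest_query(ids, outfmt="full"):
    """Build the idlist/rest part of a REST query string."""
    params = {"idlist": ",".join(ids), "rest": outfmt}
    # Keep commas unescaped so the identifier list stays readable
    return urlencode(params, safe=",")


query = rest_query(["TP53", "BRCA1"], outfmt="mapped")
```

The resulting string would be appended to the server URL for rje_xref along with any other options.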
