Module:	rje_genemap
Description:	RJE Gene & Database ID Mapping Module
Version:	1.5
Last Edit:	16/12/13

Imported modules: rje rje_db rje_seq rje_zen

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module is designed to replace rje_genecards, which has become a bit unwieldy since its conception. Some of the original functions of rje_genecards will still be maintained, but (hopefully) in a simplified format. Some of the additional mapping functions of PINGU will be added to this module for easier implementation across packages.

The main functions of this module are:
1. To map, store and retrieve database cross-references for a key dataset of gene IDs, usually HGNC symbols.
2. To store a number of aliases for the key gene IDs, including old versions of accession numbers etc.
3. To retrieve sequences from given datasets for stored aliases/genes.

The main processing pipeline is as follows:
1. Read in key data and generate data structure OR load pickle.
2. Repeat 1 until all data/pickles integrated.
3. Save pickle and output new data flat files, if desired.

This is the limit of the standalone functionality of the program. However, the GeneMap class will have a number of additional methods for data retrieval by other programs that use it.

GeneMap Class

The GeneMap Class stores two main data dictionaries:
1. A dictionary for the key Gene IDs that contains mappings to other databases.
2. A dictionary that maps aliases onto other Gene IDs.

In addition, sequence files may be loaded and used to map IDs onto Sequence objects.
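
As an illustration only, the two dictionaries might be pictured as below. The variable names, the lookup() helper and the example entries are assumptions made for this sketch and are not taken from the GeneMap class itself.

```python
# Hypothetical sketch: names and layout are assumptions, not the GeneMap internals.

# 1. Key gene IDs mapped to their cross-references in other databases.
data = {
    "TP53": {"Symbol": "TP53", "EnsEMBL": "ENSG00000141510", "UniProt": "P04637"},
}

# 2. Aliases mapped onto gene IDs. Lists are used because one alias may
#    point at several genes.
alias = {
    "P53": ["TP53"],
    "ENSG00000141510": ["TP53"],
    "P04637": ["TP53"],
}


def lookup(identifier):
    """Return the stored data for every key gene ID that `identifier` maps to."""
    if identifier in data:
        return [data[identifier]]
    return [data[gene] for gene in alias.get(identifier, []) if gene in data]
```

Here lookup("P53") would return the TP53 entry, while an unknown identifier returns an empty list.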

Key Input

There are four primary input files that are processed into the mapping:
1. Designed for human data, an HGNC download file is one of the key input files. Headers will be converted into those
from the original rje_genecards files, which are now replaced by sourcedata=FILE.
2. Source Data files are delimited text files containing mapping to various databases. In each case, the first column
should be unique for each line. This will be treated as an Alias. If a Symbol column is found, this will be treated
as a key identifier (unless keyid=X has been changed).
3. Alias files containing simple lists of ID:Alias to populate Alias dictionary. The first column can have any header
but must be the identifier to map *to*. Another column must have the header "Aliases" and be a comma-separated list
of aliases.
4. Pickle data containing a pickled GeneMap object.
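
The alias file layout described in point 3 could be read along the following lines. This is a minimal sketch that assumes a tab-delimited file; read_alias_file() and the example rows are hypothetical and are not part of rje_genemap.

```python
import csv
import io


def read_alias_file(handle, delimiter="\t"):
    """Return {alias: [ids]} from a delimited file with an "Aliases" column.

    The first column (any header) is taken as the identifier to map *to*.
    The tab delimiter is an assumption of this sketch.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    id_field = reader.fieldnames[0]
    alias_map = {}
    for row in reader:
        target = row[id_field]
        # "Aliases" holds a comma-separated list; tolerate blank or missing cells.
        for name in (row.get("Aliases") or "").split(","):
            name = name.strip()
            if name and target not in alias_map.setdefault(name, []):
                alias_map[name].append(target)
    return alias_map


# Illustrative input only.
example = io.StringIO("Gene\tAliases\nTP53\tP53, LFS1\nCDKN1A\tP21,WAF1\n")
alias_map = read_alias_file(example)
```

Each alias maps to a list of identifiers so that an alias shared by more than one gene is not silently overwritten.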

In addition, sequence files may be loaded that have additional links and can be used to map to sequences. These are:
1. EnsLoci = This is used to add additional mapping of genes to proteins and to EnsLoci protein IDs.

Input Processing

As data is loaded, either from a pickle or a text file, its data is integrated. NOTE: These commands can be repeated
several times and, unlike normal, subsequent commands will not replace earlier ones but simply add to the list. If
there is danger of additional unwanted commands in the command argument list, then the loadData() method should be
called with a specified list of commandline options rather than using the default system arguments.

If a given set of data has a "Symbol" (KeyID) Header then this is added to the main Data dictionary as a key and all
column headers as stored data. (These column headers are stored in the "Header" list.) The Alias - the original key
of the dictionary - is added to self.dict['Alias']. Note that each Alias can be involved in many-to-many and circular
referencing, which will need to be dealt with by the class when mapping. Any headers that are in the "XRef" list of
database cross-references will also be added to the Alias dictionary. If there is no KeyID, the data is stored in a
"TempData" dictionary, and XRef headers are aliased to the Alias rather than the KeyID.

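The routing of a single row described above can be sketched as follows. The function names and the way a row is passed in are assumptions for illustration, and the sketch ignores the case where an entry already exists.

```python
# Sketch only: function names and row layout are assumptions, not the rje_genemap API.
XREF = ["EnsEMBL", "Entrez", "HGNC", "HPRD", "UniProt"]  # default of xref=LIST


def add_alias(alias, name, target):
    """Record that `name` maps to `target`, skipping blanks and self-mappings."""
    if not name or name == target:
        return
    targets = alias.setdefault(name, [])
    if target not in targets:
        targets.append(target)


def file_row(first_col, row, data, temp_data, alias, key_id="Symbol"):
    """Store one row under its KeyID (Data) or under its first column (TempData)."""
    key = row.get(key_id, "")
    if key:
        data[key] = dict(row)
        add_alias(alias, first_col, key)   # the original key becomes an alias
        target = key
    else:
        temp_data[first_col] = dict(row)
        target = first_col
    for header in XREF:                    # XRef values alias to the chosen target
        add_alias(alias, row.get(header, ""), target)
    return target


# Illustrative rows only; "ID_A" and "0000" are made-up values.
data, temp_data, alias = {}, {}, {}
file_row("P04637", {"Symbol": "TP53", "UniProt": "P04637", "Entrez": "7157"}, data, temp_data, alias)
file_row("ID_A", {"Symbol": "", "Entrez": "0000"}, data, temp_data, alias)
```

The first row lands in the Data dictionary with its XRef values aliased to the KeyID. The second has no KeyID, so it is held in TempData and its XRef value is aliased to the first-column identifier instead.
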
If a KeyID already exists in the Data dictionary, then any blank entries will be overwritten but data loaded from a
previous file will not be. All aliases will be mapped to the KeyID, however, even if they do not end up in the Data
dictionary itself. If a KeyID is missing from the Data dictionary but present in the TempData dictionary, it will be
overwritten in the same way and moved to the Data dictionary. If an Alias without a KeyID is already present in the
TempData dictionary, the same will happen without any transferral.
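
The overwrite rule above (blank entries are filled, earlier data is kept) can be sketched as a small merge helper. merge_entry() is a hypothetical name used for illustration, and the conflicting Entrez value is made up.

```python
def merge_entry(existing, new):
    """Fill blank or missing fields of `existing` from `new`; keep earlier values."""
    for header, value in new.items():
        if value and not existing.get(header):
            existing[header] = value
    return existing


# Data from an earlier file takes preference over later, conflicting data.
earlier = {"Symbol": "TP53", "UniProt": "", "Entrez": "7157"}
later = {"UniProt": "P04637", "Entrez": "0000", "HPRD": ""}
merged = merge_entry(earlier, later)
```

Only the blank UniProt field is filled here. The Entrez value loaded first is retained, which is why the order of sourcedata files matters.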

Sequence data will be processed according to the specifics of the type of sequence file it is. EnsLoci sequences
will be converted into an EnsLoci dictionary of {ID:Sequence} but also key gene-protein mappings will be extracted.
The protein to gene aliases will be added to the Alias dictionary, while the EnsLoci ID will be added as an XRef to
the appropriate Data or TempData dictionary element.

Once all data has been read in, each TempData entry will be assessed using the Alias mappings to see if it, or any of
its XRef entries, is an alias for a KeyID. If so, its data will be combined with that in Data and it will be removed
from TempData. If not, it will be assessed for being an alias of another TempData entry and will be combined if so.
After this final stage of processing, any entries still in TempData will be promoted to KeyIDs and added to the main
dictionary, though they will not appear in the "KeyID" list.
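
The final TempData pass could be sketched roughly as below. This is not the module's implementation: the helper names are hypothetical, and taking the first matching target when an alias maps to several IDs is a simplifying assumption.

```python
# Sketch only: helper names and the "first hit wins" rule are assumptions.
XREF = ["EnsEMBL", "Entrez", "HGNC", "HPRD", "UniProt"]


def fill_blanks(existing, new):
    """Copy values from `new` into blank or missing fields of `existing`."""
    for header, value in new.items():
        if value and not existing.get(header):
            existing[header] = value


def targets_for(name, entry, alias, pool):
    """IDs in `pool` that `name`, or any of its XRef values, is an alias for."""
    names = [name] + [entry.get(header, "") for header in XREF]
    return [t for n in names if n for t in alias.get(n, []) if t in pool and t != name]


def resolve_temp_data(data, temp_data, alias):
    """Merge TempData into Data where possible, then promote whatever is left."""
    for pool in (data, temp_data):          # KeyIDs first, then other TempData
        for name in list(temp_data):
            hits = targets_for(name, temp_data[name], alias, pool)
            if hits:
                fill_blanks(pool[hits[0]], temp_data.pop(name))
    promoted = sorted(temp_data)
    for name in promoted:                   # leftovers become keys of Data
        data[name] = temp_data.pop(name)
    return promoted


# Illustrative values only; X1, Y1, Y2, "42" and "h7" are made up.
data = {"TP53": {"Symbol": "TP53", "UniProt": ""}}
temp_data = {
    "X1": {"UniProt": "P04637"},
    "Y1": {"Entrez": "42"},
    "Y2": {"Entrez": "42", "HPRD": "h7"},
}
alias = {"X1": ["TP53"], "42": ["Y1"]}
promoted = resolve_temp_data(data, temp_data, alias)
```

In this example X1 is folded into TP53, Y2 is folded into Y1, and Y1 is then promoted to the main dictionary. The returned list makes it easy to keep promoted entries out of a separate "KeyID" list.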

Commandline

Input Options

hgncdata=FILE : Download file containing HGNC data. []
mgidata=FILE : Download file containing MGI data (ftp://ftp.informatics.jax.org/pub/reports/MGI_MouseHumanSequence.rpt) []
sourcedata=FILE : File containing data in order of preference regarding conflicting data. []
aliases=FILE : Files containing aliases only. []
pickledata=FILE : Genemap pickle to import and use. []
ensloci=FILE : File of EnsLoci genome to incorporate [None]
genepickle=FILE : Use pickle of GeneMap data without additional loading/processing etc. [None]
pfamdata=FILE : Delimited files containing domain organisation of sequences [None]
approved=LIST : Approved HGNC gene symbols to avoid over-zealous alias mapping (symbols read from HGNC data will be added to this list) []

Processing Options

keyid=X : Key field header to be used in main Data dictionary - aliases map to this [Symbol]
xref=LIST : Headers in Data dictionaries that are used for aliases [EnsEMBL,Entrez,HGNC,HPRD,UniProt]
useweb=T/F : Whether to try and extract missing data from GeneCards website [False]
skiplist=LIST : Skip genes matching LIST when using GeneCards website (e.g. XP_*) ['HPRD*']

Output Options

basefile=X : Root for output files [genemap]
flatout=T/F : Whether to output flatfiles (*.data.tdt & *.aliases.tdt) [False]
pickleout=T/F : Whether to output pickle (*.pickle.gz) [False]

Module Version History

    # 0.0 - Initial Compilation based loosely on rje_genecards V0.4 (28-Mar-08).
    # 1.0 - Standalone working version with basic functions.
    # 1.1 - Added bestMap() function for better compatibility with PINGU (V3.0)
    # 1.2 - Fixed bug of mapping current Approved Gene Symbols to other Gene Symbols due to redundant Aliases.
    # 1.3 - Add reduction of data to gene list.
    # 1.4 - Modified to read in MOUSE data.
    # 1.5 - Minor tweak of expected HGNC input following change to downloads.
