Module: rje_genemap
Description: RJE Gene & Database ID Mapping Module
Version: 1.5
Last Edit: 16/12/13
Copyright © 2008 Richard J. Edwards - See source code for GNU License Notice
Imported modules:
rje
rje_db
rje_seq
rje_zen
See SLiMSuite Blog for further documentation. See rje for general commands.
Function
This module is designed to replace rje_genecards, which has become a bit unwieldy since its conception. Some of the
original functions of rje_genecards will still be maintained, but (hopefully) in a simplified format. Some of the
additional mapping functions of Pingu will be added to this module for easier implementation across packages.
The main functions of this module are:
1. To map, store and retrieve database cross-references for a key dataset of gene IDs, usually HGNC symbols.
2. To store a number of aliases for the key gene IDs, including old versions of accession numbers etc.
3. To retrieve sequences from given datasets for stored aliases/genes.
The main processing pipeline is as follows:
1. Read in key data and generate data structure OR load pickle.
2. Repeat 1 until all data/pickles integrated.
3. Save pickle and output new data flat files, if desired.
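The load-or-pickle step can be sketched as below. This is a minimal illustration in modern Python, not the module's own code: the helper names save_pickle() and load_pickle() are hypothetical, and only the *.pickle.gz suffix is taken from the pickleout option described under Output Options.

```python
import gzip
import pickle


def save_pickle(obj, basefile):
    """Write obj to <basefile>.pickle.gz and return the path (hypothetical helper)."""
    path = basefile + ".pickle.gz"
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_pickle(path):
    """Read back an object written by save_pickle() (hypothetical helper)."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)
```

Loading a pickle in place of re-parsing the flat files is what lets step 2 mix both kinds of input.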
This is the limit of the standalone functionality of the program. However, the GeneMap class will have a number of
additional methods for data retrieval by other programs that use it.
GeneMap Class
The GeneMap Class stores two main data dictionaries:
1. A dictionary for the key Gene IDs that contains mappings to other databases.
2. A dictionary that maps aliases onto other Gene IDs.
In addition, sequence files may be loaded and used to map IDs onto Sequence objects.
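As a rough picture of the two dictionaries, the sketch below uses plain Python dicts with made-up identifiers. The exact layout inside the GeneMap class is not documented here, so the field names, the list-valued aliases and the lookup() helper are all assumptions for illustration.

```python
# Illustrative stand-ins only; all identifiers below are invented.
data = {
    "GENE1": {"Symbol": "GENE1", "EnsEMBL": "ENSG_EXAMPLE1", "Entrez": "111"},
}

# Each alias maps to a list because one alias may point at several gene IDs.
alias = {
    "OLDNAME1": ["GENE1"],
    "ENSG_EXAMPLE1": ["GENE1"],
    "111": ["GENE1"],
}


def lookup(identifier):
    """Return the Data entries reachable from a key gene ID or one of its aliases."""
    if identifier in data:
        return [data[identifier]]
    return [data[gene] for gene in alias.get(identifier, []) if gene in data]
```

With this layout, lookup("OLDNAME1") and lookup("GENE1") reach the same entry, while an unknown identifier yields an empty list.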
Input Processing
As data is loaded, from either a pickle or a text file, it is integrated with the data already held. NOTE: The input
commands can be repeated several times and, unlike normal commands, later instances do not replace earlier ones but
simply add to the list. If there is a danger of unwanted extra commands in the command argument list, then the
loadData() method should be called with a specified list of commandline options rather than the default system arguments.
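The accumulate-rather-than-replace behaviour might look like the sketch below. It does not reproduce loadData() itself: collect_inputs() is a hypothetical helper, and the set of options treated as repeatable is an assumption drawn from the Input Options list.

```python
# Assumed set of repeatable input options (taken from the Input Options list).
INPUT_OPTIONS = ("hgncdata", "mgidata", "sourcedata", "aliases", "pickledata")


def collect_inputs(cmd_list):
    """Gather every occurrence of each input option, in the order given."""
    inputs = []
    for cmd in cmd_list:
        name, sep, value = cmd.partition("=")
        name = name.lower()
        # Later occurrences are appended; nothing earlier is overwritten.
        if sep and value and name in INPUT_OPTIONS:
            inputs.append((name, value))
    return inputs


# Passing an explicit list avoids picking up stray system arguments.
load_order = collect_inputs(["hgncdata=a.txt", "aliases=b.txt", "hgncdata=c.txt", "keyid=Symbol"])
```

Both hgncdata files survive in load_order, in the order they were given, which is what makes the order of commands significant.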
If a given set of data has a "Symbol" (KeyID) Header then this is added to the main Data dictionary as a key and all
column headers as stored data. (These column headers are stored in the "Header" list.) The Alias - the original key
of the dictionary - is added to self.dict['Alias']. Note that each Alias can be involved in many-to-many and circular
referencing, which will need to be dealt with by the class when mapping. Any headers that are in the "XRef" list of
database cross-references will also be added to the Alias dictionary. If there is no KeyID, the data is stored in a
"TempData" dictionary, and XRef headers are aliased to the Alias rather than the KeyID.
If a KeyID already exists in the Data dictionary, then any blank entries will be overwritten but data loaded from a
previous file will not be. All aliases will be mapped to the KeyID, however, even if they do not end up in the Data
dictionary itself. If a KeyID is missing from the Data dictionary but present in the TempData dictionary, it will be
overwritten in the same way and moved to the Data dictionary. If an Alias without a KeyID is already present in the
TempData dictionary, the same will happen without any transferral.
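The routing and merging rules above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual method: integrate(), add_alias() and fill_blanks() are hypothetical names, the dictionaries are plain dicts, and the identifiers are invented. Only the default keyid and xref values come from the Processing Options.

```python
KEYID = "Symbol"                                         # documented default for keyid
XREF = ["EnsEMBL", "Entrez", "HGNC", "HPRD", "UniProt"]  # documented default for xref


def add_alias(alias_map, alias, target):
    """Record alias -> target, skipping blanks, self-references and duplicates."""
    if alias and alias != target and target not in alias_map.setdefault(alias, []):
        alias_map[alias].append(target)


def fill_blanks(entry, row):
    """Copy values into blank or missing fields; earlier data is never replaced."""
    for header, value in row.items():
        if value and not entry.get(header):
            entry[header] = value


def integrate(record_key, row, data, temp_data, alias_map):
    """File one parsed row under its KeyID if present, otherwise under TempData."""
    key_id = row.get(KEYID, "")
    if key_id:
        # An entry filed earlier in TempData under this ID moves across to Data.
        if key_id not in data and key_id in temp_data:
            data[key_id] = temp_data.pop(key_id)
        target, store = key_id, data
        add_alias(alias_map, record_key, key_id)
    else:
        target, store = record_key, temp_data
    fill_blanks(store.setdefault(target, {}), row)
    # XRef values become aliases even when they lose out to earlier data.
    for header in XREF:
        add_alias(alias_map, row.get(header, ""), target)
    return target


data, temp_data, alias_map = {}, {}, {}
integrate("GENE1_OLD", {"Symbol": "GENE1", "Entrez": "111", "UniProt": ""}, data, temp_data, alias_map)
integrate("GENE1", {"Symbol": "GENE1", "Entrez": "999", "UniProt": "P00001"}, data, temp_data, alias_map)
integrate("ORPHAN1", {"Symbol": "", "HPRD": "00001"}, data, temp_data, alias_map)
```

After the second call the Entrez value from the first file is kept and the blank UniProt field is filled, yet "999" is still registered as an alias of GENE1. The third row has no KeyID, so it waits in temp_data with its HPRD value aliased to ORPHAN1.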
Sequence data will be processed according to the specifics of the type of sequence file it is. EnsLoci sequences
will be converted into an EnsLoci dictionary of {ID:Sequence} but also key gene-protein mappings will be extracted.
The protein to gene aliases will be added to the Alias dictionary, while the EnsLoci ID will be added as an XRef to
the appropriate Data or TempData dictionary element.
Once all data has been read in, each TempData entry will be assessed using the Alias mappings to see if it, or any of
its XRef entries, is an alias for a KeyID. If so, its data will be combined with that in Data and it will be removed
from TempData. If not, it will be assessed for being an alias of another TempData entry and will be combined if so.
After this final stage of processing, any entries still in TempData will be promoted to KeyIDs and added to the main
dictionary, though they will not appear in the "KeyID" list.
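The final resolution stage can be sketched as three passes over TempData. As before, this is a hypothetical illustration with plain dicts and invented identifiers; resolve_temp() is not a method of the real class, and how the module breaks ties between several matching aliases is not documented here (this sketch simply takes the first match).

```python
def resolve_temp(data, temp_data, alias_map, xref):
    """Fold TempData into Data via aliases, then promote whatever is left."""

    def merge(target, source):
        # Fill blank or missing fields only; existing values are kept.
        for header, value in source.items():
            if value and not target.get(header):
                target[header] = value

    def candidates(name, entry):
        # Alias targets of the entry's own name and of each of its XRef values.
        for identifier in [name] + [entry[h] for h in xref if entry.get(h)]:
            for target in alias_map.get(identifier, []):
                yield target

    # Pass 1: the entry, or one of its XRefs, is an alias for a KeyID.
    for name in list(temp_data):
        hit = next((t for t in candidates(name, temp_data[name]) if t in data), None)
        if hit is not None:
            merge(data[hit], temp_data.pop(name))

    # Pass 2: the entry is an alias of another TempData entry.
    for name in list(temp_data):
        if name not in temp_data:
            continue
        hit = next(
            (t for t in candidates(name, temp_data[name]) if t in temp_data and t != name),
            None,
        )
        if hit is not None:
            merge(temp_data[hit], temp_data.pop(name))

    # Pass 3: promote the remainder; these are not added to any KeyID list.
    promoted = sorted(temp_data)
    for name in promoted:
        data[name] = temp_data.pop(name)
    return promoted
```

Returning the promoted names separately mirrors the note that such entries join the main dictionary without appearing in the "KeyID" list.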
Commandline
Input Options
hgncdata=FILE
: Download file containing HGNC data. []
mgidata=FILE
: Download file containing MGI data (ftp://ftp.informatics.jax.org/pub/reports/MGI_MouseHumanSequence.rpt) []
sourcedata=FILE
: File containing data in order of preference regarding conflicting data. []
aliases=FILE
: Files containing aliases only. []
pickledata=FILE
: Genemap pickle to import and use. []
ensloci=FILE
: File of EnsLoci genome to incorporate [None]
genepickle=FILE
: Use pickle of GeneMap data without additional loading/processing etc. [None]
pfamdata=FILE
: Delimited files containing domain organisation of sequences [None]
approved=LIST
: Approved HGNC gene symbols to avoid over-zealous alias mapping (symbols read from HGNC data are added to this list) []
Processing Options
keyid=X
: Key field header to be used in main Data dictionary - aliases map to this [Symbol]
xref=LIST
: Headers in Data dictionaries that are used for aliases [EnsEMBL,Entrez,HGNC,HPRD,UniProt]
useweb=T/F
: Whether to try and extract missing data from GeneCards website [False]
skiplist=LIST
: Skip genes matching LIST when using GeneCards website (e.g. XP_*) ['HPRD*']
Output Options
basefile=X
: Root for output files [genemap]
flatout=T/F
: Whether to output flatfiles (*.data.tdt & *.aliases.tdt) [False]
pickleout=T/F
: Whether to output pickle (*.pickle.gz) [False]
History
Module Version History
# 0.0 - Initial Compilation based loosely on rje_genecards V0.4 (28-Mar-08).
# 1.0 - Standalone working version with basic functions.
# 1.1 - Added bestMap() function for better compatibility with PINGU (V3.0)
# 1.2 - Fixed bug of mapping current Approved Gene Symbols to other Gene Symbols due to redundant Aliases.
# 1.3 - Add reduction of data to gene list.
# 1.4 - Modified to read in MOUSE data.
# 1.5 - Minor tweak of expected HGNC input following change to downloads.