|
rje_uniprot V3.25.3
RJE Module to Handle Uniprot Files
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Imported modules:
See SLiMSuite Blog for further documentation. See [...]

Function

This module contains methods for handling UniProt files, primarily in other rje modules but also with some standalone functionality. To get the most out of the module with big UniProt files (such as the downloads from EBI), first index the UniProt data using the rje_dbase module.

This module can be used to extract a list of UniProt entries from a larger database and/or to produce summary tables from UniProt flat files. Version 3.14 introduced direct querying from the UniProt website if [...]

In addition to the methods associated with the classes of this module, there are a number of methods that are called (primarily) from the rje_dbase module to download and process the UniProt sequence database.

Version 3.19 saw an overhaul of the dbxref extraction.

Input/Output
Input Options
Output Options
UniProt Conversion Options
Specialist Options
Parsing Options (Programming Only)
General Options
UniProt Download Processing Options
History
Module Version History
# 0.0 - Initial Compilation.
# 1.0 - Initial working version for interaction_motifs.py
# 1.1 - Minor tidying and modification
# 2.0 - Moved functions to rje_dbase. Added option to extract using index files.
# 2.1 - Added possibility to extract splice variants
# 2.2 - Added output of feature table for the entries in memory (not compatible with memsaver mode)
# 2.3 - Added ID to tabout and also added accShortName() method to extract dictionary of {acc:ID__PrimaryAcc}
# 2.4 - Add method for converting Sequence object and dictionary into UniProt objects... and saving
# 2.5 - Added cc2ft Extra whole-length features added for TISSUE and LOCATION [False] and ftout=FILE
# 2.6 - Added features based on case of sequence. (Uses seq.dict['Case'])
# 2.7 - Added masking of features - Entry.maskFT(type='EM',inverse=False)
# 2.8 - Added making of Taxa-specific databases using a list of UniProt Species codes
# 2.9 - Added extraction of EnsEMBL, HGNC, UniProt and EntrezGene from IPI DAT file.
# 3.0 - Added some module-level methods for use with rje_dbase.
# 3.1 - Added extra linking of databases from UniProt entries
# 3.2 - Added feature masking and TM conversion.
# 3.3 - Added DBase processing options.
# 3.4 - Made modifications to allow extended EMBL functionality as part of rje_embl.
# 3.5 - Added SplitOut to go with rje_embl V0.1
# 3.6 - Added longlink=T/F : Whether link table is to be "long" (acc,db,dbacc) or "wide" (acc, dblinks) [True]
# 3.7 - Added cleardata=T/F : Whether to clear unprocessed Entry data or retain in Entry & Sequence objects [True]
# 3.8 - Added extraction of NCBI Taxa ID.
# 3.9 - Added grepdat=T/F : Whether to use GREP in attempt to speed up processing [False]
# 3.10- Added forking for speeding up of processing.
# 3.11- Added storing of Reference information in UniProt entries.
# 3.12- Added addition accdict extraction method for all entries read in.
# 3.13- Minor bug fix for link table output.
# 3.14- Added direct retrieval of UniProt entries from URL, including full proteomes. Updated output file naming.
# 3.14- Added dblist=LIST and dbsplit=T/F for additional DB link output control. Set unipath default to url.
# 3.15- Added extraction of taxonomic groups. Add UniFormat to improve pure downloads.
# 3.16- Added WBGene ID's from WormBase as one of the recognised DB XRef to parse.
# 3.17- Efficiency tweak to URL-based extraction of acclist.
# 3.18- Minor modification to database parsing.
# 3.19- Updated and consolidated dbxref table generation (formerly linkout) using rje_db. Changed acc_num to accnum.
#     - Added gotable=T generation of GO table. Fixed makeindex to use a single fork if needed.
# 3.20- Updated dbsplit=T output and checked function with Pfam. Probably needs work for other databases.
# 3.20.1 - Added uniprotid=LIST as an alias to acclist=LIST and extract=LIST.
# 3.20.2 - Added extra sequence return methods to UniprotEntry. Added fasta REST output.
# 3.20.3 - Fixed bug if new uniprot extraction method fails.
# 3.20.4 - Fixed bug introduced by REST access modifications.
# 3.20.5 - Improved handling of downloads for uniprot IDs that have been updated (i.e. no direct mapping).
# 3.20.6 - Improved handling of zero accession numbers for extraction.
# 3.20.7 - Fixed uniformat default error.
# 3.21.0 - Added uparse=LIST option to try and accelerate parsing of large datasets for limited information.
# 3.21.1 - FullText is no longer stored in Uniprot object. Will need special handling if required.
# 3.21.2 - Fixed single uniprot extraction bug.
# 3.21.3 - Added REST datout to proteomes extraction.
# 3.21.4 - Fixed Feature masking. Should this be switched off by default?
# 3.22.0 - Tweaked REST table output.
# 3.23.0 - Added accnum map table output. Fixed REST output bug when bad IDs given. Added version and about output.
# 3.24.0 - Added pfam out and changed map table headers.
# 3.24.1 - Fixed process Uniprot error when uniprot=FILE given.
# 3.24.2 - Updated HTTP to HTTPS. Having some download issues with server failures.
# 3.25.0 - Fixed new Uniprot batch query URL. Added onebyone=T/F : Whether to download one entry at a time. Slower but should maintain order [False].
# 3.25.1 - Fixed proteome download bug following Uniprot changes.
# 3.25.2 - Fixed Uniprot protein extraction issues by using curl. (May not be a robust fix!)
# 3.25.3 - Fixed some problems with new Uniprot feature format.

rje_uniprot REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help. &rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default (&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output, which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...

Available REST Outputs

dat = main Uniprot flat file. [dat]
map = table of uniprotid elements given and the accession numbers they mapped to. [tdt]
failed = list of identifiers that failed to map to uniprot. (Missing if none.) [list]
tab = tabular summary of uniprot data (&tabout=T). [tdt]
ft = table of parsed uniprot features (&ftout=T). [tdt]
dom = table of parsed DOMAIN features and their positions (&domtable=T). [tdt]
pfam = table of parsed Pfam domains and their counts for each protein (&pfamout=T). [tdt]
xref = table of extracted database cross-references (&xrefout=T). [tdt]
go = table of GO categories extracted for each protein (&gotable=T). [tdt]

© 2015 RJ Edwards. Contact: richard.edwards@unsw.edu.au.
|
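The full (&rest=full) output layout described under "REST Output formats" could be split into its OUTFMT sections with something like the sketch below. It is not part of rje_uniprot. The function name `split_rest_output` is hypothetical, and the sketch assumes that each section starts with a `###~~~...~~~###` divider line followed by a `# OUTFMT:` header line on its own, as in the layout shown above. The real server output may differ in detail.

```python
import re

# Assumption: a divider is a whole line of the form ###~~~...~~~###
# (the number of tildes is not checked).
DIVIDER = re.compile(r"^###~+###\s*$")
# Assumption: the line after a divider names the section as "# NAME:".
HEADER = re.compile(r"^#\s*(\w+):\s*$")


def split_rest_output(text):
    """Split full REST output text into a dict of {outfmt: contents}.

    Hypothetical helper: lines before the first recognised section
    header are ignored; a divider that is not followed by a header
    is dropped and the following lines stay in the current section.
    """
    sections = {}
    current = None
    expect_header = False
    for line in text.splitlines():
        if DIVIDER.match(line):
            expect_header = True
            continue
        if expect_header:
            expect_header = False
            match = HEADER.match(line)
            if match:
                current = match.group(1)
                sections[current] = []
                continue
        if current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip("\n")
            for name, lines in sections.items()}


if __name__ == "__main__":
    # Made-up miniature input for illustration, not real server output.
    demo = "\n".join([
        "###~~~~~~~~###",
        "# failed:",
        "NOT_A_REAL_ID",
    ])
    for name, contents in split_rest_output(demo).items():
        print(name, "->", repr(contents))
```

Keying the result by OUTFMT name means a single section (for example `map` or `failed`) can be looked up directly, and a missing key signals that the section was absent, as the list above notes can happen for `failed`.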