Module:	rje_uniprot
Description:	RJE Module to Handle Uniprot Files
Version:	3.25.3
Last Edit:	21/04/20

Imported modules: rje rje_db rje_sequence

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module contains methods for handling UniProt files, primarily in other rje modules but also with some standalone functionality. To get the most out of the module with big UniProt files (such as the downloads from EBI), first index the UniProt data using the rje_dbase module.

This module can be used to extract a list of UniProt entries from a larger database and/or to produce summary tables from UniProt flat files. Version 3.14 introduced direct querying from the UniProt website if unipath=None or unipath=URL.

In addition to the methods associated with the classes of this module, there are a number of module-level methods that are called (primarily from the rje_dbase module) to download and process the UniProt sequence database.

Version 3.19 overhauled the dbxref extraction. dblist=LIST and dbparse=LIST can now be used largely synonymously. Rather than extracting all database cross-references by default, there is now a default list of databases to parse: ['UniProtKB/Swiss-Prot','ensembl','REFSEQ','HGNC','Entrez Gene','FlyBase','Pfam','GO','MGI','ZFIN'].
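
The selection rule for the database list (use the default list unless told otherwise; keep everything if the list is empty or contains 'all') can be sketched as follows. This is an illustration only: select_databases is a hypothetical helper, not a method of rje_uniprot, and the case-insensitive comparison is an assumption.

```python
# Default list of databases to parse, as given in the text above.
DEFAULT_DBLIST = ['UniProtKB/Swiss-Prot', 'ensembl', 'REFSEQ', 'HGNC', 'Entrez Gene',
                  'FlyBase', 'Pfam', 'GO', 'MGI', 'ZFIN']

def select_databases(found, dblist=None):
    """Return the database names in `found` that should be kept.

    dblist=None            -> fall back to DEFAULT_DBLIST
    dblist empty or 'all'  -> keep every database
    (Hypothetical helper; case-insensitive matching is an assumption.)
    """
    if dblist is None:
        dblist = DEFAULT_DBLIST
    wanted = set(db.lower() for db in dblist)
    if not wanted or 'all' in wanted:
        return list(found)
    return [db for db in found if db.lower() in wanted]

print(select_databases(['Pfam', 'PDB', 'GO']))      # default list applied
print(select_databases(['Pfam', 'PDB'], ['all']))   # everything kept
```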

Input/Output

Input Options

unipath=PATH : Path to UniProt Datafile (looks here for DB Index file; url = use Web downloads) [url]
dbindex=FILE : Database index file [uniprot.index]
uniprot=FILE : Name of UniProt file [None]
extract=LIST : Extract IDs/AccNums in list. LIST can be FILE or list of IDs/AccNums X,Y,.. []
acclist=LIST : As extract=LIST.
uniprotid=LIST : As extract=LIST.
acctable=FILE : Delimited text file from which to retrieve a list of accession numbers [None]
accfield=X : Accession number field for acctable=FILE extraction [UniProt]
specdat=LIST : Make a UniProt DAT file of the listed species from the index (over-rules extract=LIST) []
proteome=LIST : Extract complete proteomes for listed Taxa ID (e.g. 9606 for human) []
taxonomy=LIST : Extract all entries for listed Taxa ID (e.g. 4751 for fungi) []
usebeta=T/F : Whether to use beta.uniprot.org rather than www.uniprot.org [False]
splicevar=T/F : Whether to search for AccNum allowing for splice variants (AccNum-X) [True]
tmconvert=T/F : Whether to convert TOPO_DOM features, using first description word as Type [False]
reviewed=T/F : Whether to restrict input to reviewed entries only [False]
complete=T/F : Whether to restrict proteome downloads to "complete proteome" sets [False]
uniformat=X : Desired UniProt format for proteome download. Append gz to compress. [txt]
- html | tab | xls | fasta | gff | txt | xml | rdf | list | rss
onebyone=T/F : Whether to download one entry at a time. Slower but should maintain order [False]
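
Options are supplied as name=value pairs. As a rough illustration, the snippet below assembles such a command line in Python. The interpreter name, the script location (rje_uniprot.py in the current directory), the build_command helper and the example accession numbers are all assumptions made for this sketch; the command is only built and printed, not executed.

```python
def build_command(options, script='rje_uniprot.py', python='python'):
    """Turn (name, value) pairs into an argv list of name=value arguments.

    Hypothetical helper: booleans become T/F and lists become X,Y,..
    to match the option styles listed above.
    """
    argv = [python, script]
    for name, value in options:
        if isinstance(value, bool):
            value = 'T' if value else 'F'               # T/F options
        elif isinstance(value, (list, tuple)):
            value = ','.join(str(v) for v in value)     # LIST options
        argv.append('%s=%s' % (name, value))
    return argv

cmd = build_command([('extract', ['P04637', 'Q00987']),   # example accession numbers
                     ('splicevar', False),
                     ('uniformat', 'txt')])
print(' '.join(cmd))
```

The resulting list could be passed to subprocess.call() if the script is available at that path.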

Output Options

basefile=X : If set, can use "T" or "True" for other *out options. (Will default to datout if given) [None]
datout=T/F/FILE : Name of new (reduced) UniProt DAT file of extracted sequences [None]
tabout=T/F/FILE : Table of extracted UniProt details [None]
ftout=T/F/FILE : Outputs table of features into FILE [None]
pfamout=T/F/FILE: Whether to output a "long" table of (accnum, pfam, name, num) [False]
domtable=T/F : Makes a table of domains from uniprot file [False]
gotable=T/F : Makes a table of AccNum:GO mapping [False]
cc2ft=T/F : Extra whole-length features added for TISSUE and LOCATION (not in datout) [False]
xrefout=T/F/FILE: Table of extracted Database xref (Formerly linkout=FILE) [None]
longlink=T/F : Whether link table is to be "long" (acc,db,dbacc) or "wide" (acc, dblinks) [True]
dblist=LIST : List of databases to extract (extract all if empty or contains 'all') [see above]
dbsplit=T/F : Whether to generate a table per dblist database (basefile.dbase.tdt) [False]
dbdetails=T/F : Whether to extract full details of DR line rather than parsing DB xref only [False]
append=T/F : Append to results files rather than overwrite [False]
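
The difference between the "long" (acc,db,dbacc) and "wide" (acc, dblinks) layouts selected by longlink=T/F can be illustrated with a small conversion. This sketch is not the module's own code: long_to_wide, the db:dbacc pair format and the "|" separator are assumptions chosen for the example, and the identifiers are placeholders.

```python
from collections import OrderedDict

def long_to_wide(rows, sep='|'):
    """Collapse long (acc, db, dbacc) rows into one (acc, dblinks) row per accession.

    Hypothetical helper; the 'db:dbacc' join format is an assumption.
    """
    wide = OrderedDict()
    for acc, db, dbacc in rows:
        wide.setdefault(acc, []).append('%s:%s' % (db, dbacc))
    return [(acc, sep.join(links)) for acc, links in wide.items()]

# Placeholder identifiers, for illustration only.
long_rows = [('P00001', 'Pfam', 'PF00001'),
             ('P00001', 'HGNC', '1234'),
             ('P00002', 'Pfam', 'PF00002')]

for acc, dblinks in long_to_wide(long_rows):
    print('%s\t%s' % (acc, dblinks))
```

The long form has one row per cross-reference, which is easier to filter or join on; the wide form has one row per protein.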

UniProt Conversion Options

ucft=X : Feature to add for UpperCase portions of sequence []
lcft=X : Feature to add for LowerCase portions of sequence []
maskft=LIST : List of Features to mask out []
invmask=T/F : Whether to invert the masking and only retain maskft features [False]
caseft=LIST : List of Features to make upper case with rest of sequence lower case []
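
ucft=X and lcft=X add a feature for the upper- or lower-case portions of a sequence. One way to locate such stretches is sketched below. case_features is a hypothetical helper, and the 1-based inclusive coordinates and tuple layout are assumptions rather than the module's internal representation.

```python
from itertools import groupby

def case_features(sequence, ucft='', lcft=''):
    """Return (feature_type, start, end) for each run of same-case characters.

    Positions are 1-based and inclusive (an assumption). A run is skipped if
    its feature type is empty. Characters that are not upper-case letters
    (including gaps) are grouped with the lower-case runs.
    """
    features = []
    pos = 0
    for is_upper, run in groupby(sequence, key=str.isupper):
        length = len(list(run))
        ftype = ucft if is_upper else lcft
        if ftype:
            features.append((ftype, pos + 1, pos + length))
        pos += length
    return features

print(case_features('MKTaylLLPq', ucft='UC', lcft='LC'))
```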

Specialist Options

Parsing Options (Programming Only)

fullref=T/F : Whether to store full Reference information in UniProt Entry objects [False]
dbparse=LIST : List of databases to parse from DR lines in UniProtEntry object [see code]
uparse=LIST : Restricted lines to parse to accelerate parsing of large datasets for limited information []
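
The idea behind uparse=LIST, parsing only selected line types, can be sketched using the two-letter line codes that begin each line of a UniProt flat file. This assumes that the list holds such line codes; restrict_lines is a hypothetical helper, the example entry is an abbreviated placeholder, and the way rje_uniprot applies uparse internally is not shown here.

```python
def restrict_lines(lines, uparse):
    """Yield only the flat-file lines whose two-letter code is in uparse.

    An empty uparse list keeps every line. The '//' entry terminator is
    always kept so that entry boundaries survive the filtering.
    """
    keep = set(code.upper() for code in uparse)
    for line in lines:
        code = line[:2]
        if not keep or code in keep or code == '//':
            yield line

# Abbreviated placeholder entry, for illustration only.
entry = [
    'ID   EXAMPLE_HUMAN           Reviewed;         100 AA.',
    'AC   P00001;',
    'DE   RecName: Full=Example protein;',
    'DR   Pfam; PF00001; 7tm_1; 1.',
    '//',
]

for line in restrict_lines(entry, ['AC', 'DR']):
    print(line)
```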

General Options

memsaver=T/F : Reduce memory usage by not retaining entries in the UniProt object [False]
cleardata=T/F : Whether to clear unprocessed Entry data (True) or (False) retain in Entry & Sequence objects [True]
specsleep=X : Sleep for X seconds between species downloads [60]

UniProt Download Processing Options

makeindex=T/F : Generate UniProt index files [False]
makespec=T/F : Generate species table [False]
makefas=T/F : Generate fasta files [False]
grepdat=T/F : Whether to use GREP in attempt to speed up processing [False]

Module Version History

# 0.0 - Initial Compilation.
# 1.0 - Initial working version for interaction_motifs.py
# 1.1 - Minor tidying and modification
# 2.0 - Moved functions to rje_dbase. Added option to extract using index files.
# 2.1 - Added possibility to extract splice variants
# 2.2 - Added output of feature table for the entries in memory (not compatible with memsaver mode)
# 2.3 - Added ID to tabout and also added accShortName() method to extract dictionary of {acc:ID__PrimaryAcc}
# 2.4 - Add method for converting Sequence object and dictionary into UniProt objects... and saving
# 2.5 - Added cc2ft Extra whole-length features added for TISSUE and LOCATION [False] and ftout=FILE
# 2.6 - Added features based on case of sequence. (Uses seq.dict['Case'])
# 2.7 - Added masking of features - Entry.maskFT(type='EM',inverse=False)
# 2.8 - Added making of Taxa-specific databases using a list of UniProt Species codes
# 2.9 - Added extraction of EnsEMBL, HGNC, UniProt and EntrezGene from IPI DAT file.
# 3.0 - Added some module-level methods for use with rje_dbase.
# 3.1 - Added extra linking of databases from UniProt entries
# 3.2 - Added feature masking and TM conversion.
# 3.3 - Added DBase processing options.
# 3.4 - Made modifications to allow extended EMBL functionality as part of rje_embl.
# 3.5 - Added SplitOut to go with rje_embl V0.1
# 3.6 - Added longlink=T/F : Whether link table is to be "long" (acc,db,dbacc) or "wide" (acc, dblinks) [True]
# 3.7 - Added cleardata=T/F : Whether to clear unprocessed Entry data or retain in Entry & Sequence objects [True]
# 3.8 - Added extraction of NCBI Taxa ID.
# 3.9 - Added grepdat=T/F : Whether to use GREP in attempt to speed up processing [False]
# 3.10- Added forking for speeding up of processing.
# 3.11- Added storing of Reference information in UniProt entries.
# 3.12- Added additional accdict extraction method for all entries read in.
# 3.13- Minor bug fix for link table output.
# 3.14- Added direct retrieval of UniProt entries from URL, including full proteomes. Updated output file naming.
# 3.14- Added dblist=LIST and dbsplit=T/F for additional DB link output control. Set unipath default to url.
# 3.15- Added extraction of taxonomic groups. Add UniFormat to improve pure downloads.
# 3.16- Added WBGene ID's from WormBase as one of the recognised DB XRef to parse.
# 3.17- Efficiency tweak to URL-based extraction of acclist.
# 3.18- Minor modification to database parsing.
# 3.19- Updated and consolidated dbxref table generation (formerly linkout) using rje_db. Changed acc_num to accnum.
# - Added gotable=T generation of GO table. Fixed makeindex to use a single fork if needed.
# 3.20- Updated dbsplit=T output and checked function with Pfam. Probably needs work for other databases.
# 3.20.1 - Added uniprotid=LIST as an alias to acclist=LIST and extract=LIST.
# 3.20.2 - Added extra sequence return methods to UniprotEntry. Added fasta REST output.
# 3.20.3 - Fixed bug if new uniprot extraction method fails.
# 3.20.4 - Fixed bug introduced by REST access modifications.
# 3.20.5 - Improved handling of downloads for uniprot IDs that have been updated (i.e. no direct mapping).
# 3.20.6 - Improved handling of zero accession numbers for extraction.
# 3.20.7 - Fixed uniformat default error.
# 3.21.0 - Added uparse=LIST option to try and accelerate parsing of large datasets for limited information.
# 3.21.1 - FullText is no longer stored in Uniprot object. Will need special handling if required.
# 3.21.2 - Fixed single uniprot extraction bug.
# 3.21.3 - Added REST datout to proteomes extraction.
# 3.21.4 - Fixed Feature masking. Should this be switched off by default?
# 3.22.0 - Tweaked REST table output.
# 3.23.0 - Added accnum map table output. Fixed REST output bug when bad IDs given. Added version and about output.
# 3.24.0 - Added pfam out and changed map table headers.
# 3.24.1 - Fixed process Uniprot error when uniprot=FILE given.
# 3.24.2 - Updated HTTP to HTTPS. Having some download issues with server failures.
# 3.25.0 - Fixed new Uniprot batch query URL. Added onebyone=T/F : Whether to download one entry at a time. Slower but should maintain order [False].
# 3.25.1 - Fixed proteome download bug following Uniprot changes.
# 3.25.2 - Fixed Uniprot protein extraction issues by using curl. (May not be a robust fix!)
# 3.25.3 - Fixed some problems with new Uniprot feature format.

rje_uniprot REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
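
The layout above (a divider line, a "# OUTFMT:" header, then the section contents) can be split into individual outputs with a few lines of Python. This is a minimal sketch based only on that description: split_rest_output and both regular expressions are assumptions, and real server output may contain details not covered here.

```python
import re

DIVIDER = re.compile(r'^###~+###\s*$')     # e.g. ###~~~~~~###
HEADER = re.compile(r'^#\s*(\w+):\s*$')    # e.g. "# map:"

def split_rest_output(text):
    """Split full REST output into a {OUTFMT: contents} dictionary."""
    sections = {}
    current = None
    after_divider = False
    for line in text.splitlines():
        if DIVIDER.match(line):
            after_divider = True
            continue
        if after_divider:
            after_divider = False
            match = HEADER.match(line)
            if match:                       # start a new OUTFMT section
                current = match.group(1)
                sections[current] = []
                continue
        if current is not None:
            sections[current].append(line)
    return dict((name, '\n'.join(body)) for name, body in sections.items())

# Tiny made-up example with two sections.
example = '###~~~~###\n# map:\nA\tB\n###~~~~###\n# failed:\nXYZ\n'
parsed = split_rest_output(example)
print(sorted(parsed))
```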

Available REST Outputs

dat = main Uniprot flat file. [dat]
map = table of uniprotid elements given and the accession numbers they mapped to. [tdt]
failed = list of identifiers that failed to map to uniprot. (Missing if none.) [list]
tab = tabular summary of uniprot data (&tabout=T). [tdt]
ft = table of parsed uniprot features (&ftout=T). [tdt]
dom = table of parsed DOMAIN features and their positions (&domtable=T). [tdt]
pfam = table of parsed Pfam domains and their counts for each protein (&pfamout=T). [tdt]
xref = table of extracted database cross-references (&xrefout=T). [tdt]
go = table of GO categories extracted for each protein (&gotable=T). [tdt]
