Module:	rje_taxonomy
Description:	Downloads, reads and converts Uniprot species codes and NCBI Taxa IDs
Version:	1.3.0
Last Edit:	22/06/18

Imported modules: rje rje_db rje_obj

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module downloads and interconverts NCBI Taxa IDs, Uniprot species codes and species names. It uses two main files: speclist.txt from Uniprot and nodes.dmp from NCBI taxonomy. If missing, these will be downloaded when download=T; otherwise, they can be downloaded manually from:

- http://www.uniprot.org/docs/speclist.txt
- ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

These will be saved in the directory given by taxdir=PATH/ (default ./SourceData/) and will have the download date inserted. This enables several versions to be stored together if desired and selected using the sourcedate=DATE option, where DATE is in the form YYYY-MM-DD.

Alternatively, specfile=FILE, taxmap=FILE and namemap=FILE can be used to give other files of the same format. The SpecFile should follow the key format features of the organism codes file but does not need the headers. The TaxMap file simply uses the first three columns of the nodes.dmp file (separated by "\t|\t"), which correspond to (1) the Taxa ID, (2) the parent Taxa ID, and (3) the rank of the entry. These ranks are used to determine which taxa are output if rankonly=T. By default, these are "species" and "subspecies". NameMap uses all four fields.
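As an illustration of the TaxMap format described above, the following is a minimal Python sketch of reading the first three "\t|\t"-separated columns. The function names (parse_nodes_line, load_taxmap) and the sample lines are hypothetical and are not part of rje_taxonomy itself.

```python
def parse_nodes_line(line):
    """Return (taxid, parent_taxid, rank) from one nodes.dmp-style line."""
    line = line.rstrip("\n")
    # Lines normally end with a trailing "\t|" terminator; drop it if present.
    if line.endswith("\t|"):
        line = line[:-2]
    fields = line.split("\t|\t")
    return fields[0], fields[1], fields[2]


def load_taxmap(lines):
    """Build parent and rank lookups from an iterable of nodes.dmp-style lines."""
    parent_of, rank_of = {}, {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        taxid, parent, rank = parse_nodes_line(line)
        parent_of[taxid] = parent
        rank_of[taxid] = rank
    return parent_of, rank_of


# Toy lines with made-up IDs (assumption: illustrative data only).
toy_lines = [
    "10\t|\t1\t|\tgenus\t|\n",
    "11\t|\t10\t|\tspecies\t|\textra\t|\n",
]
parent_of, rank_of = load_taxmap(toy_lines)
```

Only the first three fields are kept, so a file holding just those columns and a full nodes.dmp line are read the same way.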

To extract and/or convert a set of Taxa IDs, a list of taxa should be given using taxin=LIST, where LIST is either a comma separated list of taxa, or a file containing one taxon per line. Taxa may be a mix of NCBI Taxa IDs, Uniprot species codes and case-insensitive (but exact match) species names. By default, all taxa will be combined but if batchmode=T then each TaxIn element will be processed individually. When batchmode=T, individual list elements can be files containing taxa. (This will not work in batchmode=F, unless only a single file is given.)
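The handling of taxin=LIST described above might be sketched as follows. This is not the module's own code: read_taxin and classify_taxon are hypothetical helpers, and the classification rules (all digits = Taxa ID, upper case without spaces = species code, anything else = name) are an assumption based on the option descriptions.

```python
import os


def read_taxin(taxin):
    """Return a list of taxa from a file (one per line) or a comma-separated string."""
    if os.path.isfile(taxin):
        with open(taxin) as handle:
            return [line.strip() for line in handle if line.strip()]
    return [item.strip() for item in taxin.split(",") if item.strip()]


def classify_taxon(taxon):
    """Guess the type of a taxon identifier (assumed heuristic, see lead-in)."""
    if taxon.isdigit():
        return "taxid"      # e.g. 9606
    if taxon.isupper() and " " not in taxon:
        return "spcode"     # e.g. HUMAN
    return "name"           # e.g. Homo sapiens (matched case-insensitively)


taxa = read_taxin("9606,DROME,Homo sapiens")
kinds = [classify_taxon(taxon) for taxon in taxa]
```

Names would still need a case-insensitive exact match against the species names, for example by lower-casing both sides before comparison.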

Taxa are first mapped on to NCBI Taxa IDs. Unless nodeonly=T, taxa will also be mapped on to all of their child taxa, as defined in nodes.dmp. If rankonly=T then only those taxa with a rank matching ranktypes=LIST will be retained. IDs can be further restricted by supplying a list with restrictid=LIST, which will limit mapped IDs to those within the list given. (Note that this list could itself be created by a previous run of rje_taxonomy and given as a file.) Taxa IDs are then mapped on to the Uniprot species codes and species names using the SpecFile data. If missing from this file, scientific names will be pulled out of the NCBI NameMap file instead. NOTE: Uniprot is used first because NCBI has more redundant taxonomy assignments.
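The mapping steps above (child expansion, rank filtering, ID restriction) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the module's implementation: expand_taxa and children_index are hypothetical names, the parent and rank dictionaries are assumed to have been read from nodes.dmp, and the toy IDs are made up.

```python
from collections import defaultdict


def children_index(parent_of):
    """Invert a child -> parent mapping into parent -> [children]."""
    kids = defaultdict(list)
    for taxid, parent in parent_of.items():
        if taxid != parent:  # the root node lists itself as its parent
            kids[parent].append(taxid)
    return kids


def expand_taxa(taxids, parent_of, rank_of, nodeonly=False, rankonly=False,
                ranktypes=("species", "subspecies"), restrictid=None):
    """Apply the nodeonly, rankonly/ranktypes and restrictid steps in order."""
    selected = set(taxids)
    if not nodeonly:
        kids = children_index(parent_of)
        stack = list(taxids)
        while stack:  # walk down the tree, collecting all descendants
            for child in kids.get(stack.pop(), []):
                if child not in selected:
                    selected.add(child)
                    stack.append(child)
    if rankonly:
        selected = {t for t in selected if rank_of.get(t) in ranktypes}
    if restrictid:  # an empty list means no restriction
        selected &= set(restrictid)
    return sorted(selected, key=int)


# Toy taxonomy with made-up IDs (assumption: illustrative data only).
parent_of = {"1": "1", "10": "1", "11": "10", "12": "11", "13": "10"}
rank_of = {"1": "no rank", "10": "genus", "11": "species",
           "12": "subspecies", "13": "species"}
all_ids = expand_taxa(["10"], parent_of, rank_of)
species_ids = expand_taxa(["10"], parent_of, rank_of, rankonly=True)
```

With this toy data, the genus node "10" expands to itself plus three descendants, and rankonly=True then drops the genus node itself.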

Output is determined by the taxout=LIST option, which is set by default to 'taxid'. Four possible output types are permitted:

- taxid = NCBI Taxa IDs (e.g. 9606 or 7227)
- spcode = Uniprot species codes (e.g. HUMAN or DROME)
- name = Scientific name (e.g. Homo sapiens or Drosophila melanogaster)
- common = Common name (e.g. Human or Fruit fly)

These will be output as lists to BASE.TYPE.txt files, where BASE is set by basefile=X (using the first taxin=LIST element if missing) and TYPE is the taxout type. If batchmode=T then a separate set of files will be made for each element of the TaxIn list, using BASE.TAXIN.TYPE.txt file naming.
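The file naming scheme above can be illustrated with a short Python sketch. The helper name output_paths is hypothetical, and how the module derives the TAXIN part of the name when a batch element is itself a file path is not covered here.

```python
def output_paths(basefile, taxout, taxin_element=None):
    """Return output file names for each taxout type.

    Without taxin_element: BASE.TYPE.txt
    With taxin_element (batchmode=T): BASE.TAXIN.TYPE.txt
    """
    if taxin_element is None:
        return ["%s.%s.txt" % (basefile, outtype) for outtype in taxout]
    return ["%s.%s.%s.txt" % (basefile, taxin_element, outtype)
            for outtype in taxout]


single = output_paths("mammals", ["taxid", "spcode"])
batch = output_paths("mammals", ["taxid"], taxin_element="9606")
```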

Commandline

SOURCE DATA OPTIONS

specfile=FILE : Uniprot species code download. [speclist.txt]
taxmap=FILE : NCBI Node Dump File [nodes.dmp]
namemap=FILE : NCBI Name mapping file [names.dmp]
taxdir=PATH/ : Will look in this directory for input files if not found ['./SourceData/']
sourcedate=DATE : Source file date (YYYY-MM-DD) to preferentially use [None]
download=T/F : Whether to download files directly from websites where possible if missing [True]

TAXONOMY CONVERSION OPTIONS

taxin=LIST : List of Taxa IDs, Uniprot species codes (upper case) and/or common/scientific names []
batchmode=T/F : Treat each element of taxin as a separate run (will be used for output basefile) [False]
taxout=LIST : List of output formats (taxid/spcode/name/common/all) [taxid]
nodeonly=T/F : Whether to limit output to the matched nodes (i.e. no children) [False]
rankonly=T/F : Whether to limit output to species-level taxonomic codes [False]
ranktypes=LIST : List of Taxon types to include if rankonly=True [species,subspecies,no rank]
restrictid=LIST : List of Taxa IDs to restrict output to (i.e. output overlaps with taxin) []
basefile=X : Results file prefix. Will use first taxin=LIST term if missing [None]
taxtable=T/F : Whether to output results in a table rather than text lists [False]

See also rje.py generic commandline options.

History

Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Initial working version with rje_ensembl.
    # 1.0 - Fully functional version with modified viral species code creation.
    # 1.1.0 - Added parsing of yeast strains.
    # 1.2.0 - Added storage of Parents.
    # 1.3.0 - Added taxtable=T/F option to output results in a table rather than text lists.

rje_taxonomy REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
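For parsing individual OUTFMT elements from the full server output, a minimal Python sketch is given below. It assumes each section starts with a "###~...~###" separator line followed by a "# OUTFMT:" header line, as in the layout shown above. The function name split_rest_output and the section names in the example text are hypothetical.

```python
import re

SEPARATOR = re.compile(r"^###~+###$")   # e.g. ###~~~~~~###
HEADER = re.compile(r"^# (\S+):\s*$")   # e.g. "# OUTFMT:"


def split_rest_output(text):
    """Split full REST output into a {OUTFMT: contents} dictionary."""
    sections = {}
    current = None
    expect_header = False
    for line in text.splitlines():
        if SEPARATOR.match(line):
            expect_header = True  # the next line should name the section
            continue
        if expect_header:
            expect_header = False
            match = HEADER.match(line)
            if match:
                current = match.group(1)
                sections[current] = []
                continue
        if current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


# Made-up example text (assumption: section names are illustrative only).
example = (
    "###~~~~~~###\n# taxid:\n9606\n7227\n"
    "###~~~~~~###\n# spcode:\nHUMAN\nDROME\n"
)
parsed = split_rest_output(example)
```

Any text before the first separator is ignored in this sketch, and a separator not followed by a header line leaves the current section open.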

Available REST Outputs

There is currently no specific help available on REST output for this program.
