This module is designed to download and interconvert between NCBI Taxa IDs and Uniprot species codes and species
names. It uses two main files: speclist.txt from Uniprot and nodes.dmp from NCBI taxonomy. These will be downloaded
if missing and download=T; otherwise, they can be downloaded manually from the Uniprot and NCBI taxonomy websites.
Downloaded files will be saved in the directory given by taxdir=PATH/ (default ./SourceData/) and will have the
download date inserted. This enables several versions to be stored together if desired and selected using the
sourcedate=DATE option, where DATE is in the form YYYY-MM-DD.
The specfile=FILE, taxmap=FILE and namemap=FILE options can be used to give other files of the same format. The
SpecFile should follow the key format features of the organism codes file but does not need the headers. The TaxMap
file simply uses the first three columns of the nodes.dmp file (separated by "\t|\t"), which correspond to (1) the
Taxa ID, (2) the parent Taxa ID, and (3) the rank of the entry. These ranks are used to determine which taxa are
retained if rankonly=T. By default, the retained ranks are "species" and "subspecies". NameMap uses all four fields.
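The TaxMap layout described above can be read with a few lines of Python. This is a minimal sketch rather than the module's own parser: the function names (parse_taxmap_line, load_taxmap) are hypothetical, and the two example lines are illustrative text in the nodes.dmp layout, not copied from a real dump.

```python
def parse_taxmap_line(line):
    """Return (taxid, parent_taxid, rank) from one nodes.dmp-style line."""
    # Fields are separated by "\t|\t"; only the first three are needed here.
    fields = line.rstrip("\n").split("\t|\t")
    taxid, parent, rank = fields[:3]
    return taxid, parent, rank


def load_taxmap(lines):
    """Build parent and rank lookups (keyed by Taxa ID) from TaxMap lines."""
    parents, ranks = {}, {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        taxid, parent, rank = parse_taxmap_line(line)
        parents[taxid] = parent
        ranks[taxid] = rank
    return parents, ranks


# Illustrative lines only (hypothetical trailing fields).
example_lines = [
    "9606\t|\t9605\t|\tspecies\t|\tHS\t|\n",
    "9605\t|\t207598\t|\tgenus\t|\t\t|\n",
]
parents, ranks = load_taxmap(example_lines)
```

Taxa IDs are kept as strings here so that they can be compared directly with values read from the command line.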
To extract and/or convert a set of Taxa IDs, a list of taxa should be given using taxin=LIST, where LIST is either a
comma-separated list of taxa, or a file containing one taxon per line. Taxa may be a mix of NCBI Taxa IDs, Uniprot
species codes and case-insensitive (but exact match) species names. By default, all taxa will be combined but if
batchmode=T then each TaxIn element will be processed individually. When batchmode=T, individual list elements can be
files containing taxa. (This will not work in batchmode=F, unless only a single file is given.)
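As an illustration of the two taxin=LIST forms, the sketch below reads either a comma-separated string or a one-taxon-per-line file and then makes a rough guess at the type of each entry. Both functions are hypothetical, and the type guess is an assumption made for this example (digits for Taxa IDs, short upper-case alphanumeric strings for species codes); rje_taxonomy itself may resolve taxa differently, for example by looking them up in the source data.

```python
import os


def read_taxin(taxin):
    """Return a list of taxa from a file path or a comma-separated string."""
    if os.path.isfile(taxin):
        # File input: one taxon per line, blank lines ignored.
        with open(taxin) as handle:
            return [line.strip() for line in handle if line.strip()]
    return [item.strip() for item in taxin.split(",") if item.strip()]


def guess_taxon_type(taxon):
    """Heuristic guess only (an assumption of this sketch)."""
    if taxon.isdigit():
        return "taxid"
    if taxon.isalnum() and taxon.isupper() and len(taxon) <= 5:
        return "spcode"
    return "name"


taxa = read_taxin("9606,DROME,Homo sapiens")
kinds = [guess_taxon_type(taxon) for taxon in taxa]
```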
Taxa are first mapped on to NCBI Taxa IDs. Unless nodeonly=T, taxa will also be mapped on to all of their child taxa,
as defined in nodes.dmp. If rankonly=T then only those taxa with a rank matching ranktypes=LIST will be retained. IDs
can be further restricted by supplying a list with restrictid=LIST, which will limit mapped IDs to those within the
list given. (Note that this list could itself be created by a previous run of rje_taxonomy and given as a file.)
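The mapping steps above (child expansion, rank filtering, ID restriction) can be sketched as follows. This is an illustration under stated assumptions, not the module's code: expand_taxa is a hypothetical function whose keyword names simply mirror the command-line options, and the small tree uses made-up Taxa IDs.

```python
from collections import defaultdict


def expand_taxa(taxids, parents, ranks, nodeonly=False, rankonly=False,
                ranktypes=("species", "subspecies"), restrictid=None):
    """Map Taxa IDs on to their children, then filter by rank and ID list."""
    # Invert the child -> parent map; a root may list itself as its parent.
    children = defaultdict(list)
    for child, parent in parents.items():
        if child != parent:
            children[parent].append(child)
    found = set()
    stack = list(taxids)
    while stack:
        taxid = stack.pop()
        if taxid in found:
            continue
        found.add(taxid)
        if not nodeonly:
            stack.extend(children[taxid])  # descend to all child taxa
    if rankonly:
        found = {t for t in found if ranks.get(t) in ranktypes}
    if restrictid is not None:
        found &= set(restrictid)
    return sorted(found)


# Made-up toy tree: 1 is the root, 10 a genus with two species and a subspecies.
parents = {"1": "1", "10": "1", "11": "10", "12": "10", "13": "11"}
ranks = {"1": "no rank", "10": "genus", "11": "species",
         "12": "species", "13": "subspecies"}

all_ids = expand_taxa(["10"], parents, ranks)
species_level = expand_taxa(["10"], parents, ranks, rankonly=True)
```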
Taxa IDs are then mapped on to the Uniprot species codes and species names using the SpecFile data. If missing from this
file, scientific names will be pulled out of the NCBI NameMap file instead. NOTE: Uniprot is used first because NCBI
has more redundant taxonomy assignments.
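The lookup order described above (SpecFile first, NameMap as a fallback) might look like the sketch below. The function name and the dictionary layouts are hypothetical and chosen only for this example; they do not reflect how rje_taxonomy stores its data internally.

```python
def describe_taxid(taxid, spec_data, ncbi_names):
    """Prefer Uniprot SpecFile data; fall back to the NCBI NameMap name."""
    entry = spec_data.get(taxid)
    if entry is not None:
        return {"taxid": taxid, "spcode": entry["spcode"],
                "name": entry["name"], "source": "specfile"}
    # Not in the SpecFile data: only a scientific name can be recovered.
    name = ncbi_names.get(taxid)
    return {"taxid": taxid, "spcode": None, "name": name,
            "source": "namemap" if name else None}


# Hypothetical in-memory versions of the two data sources.
spec_data = {"9606": {"spcode": "HUMAN", "name": "Homo sapiens"}}
ncbi_names = {"9606": "Homo sapiens", "7227": "Drosophila melanogaster"}

human = describe_taxid("9606", spec_data, ncbi_names)
fly = describe_taxid("7227", spec_data, ncbi_names)
```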
Output is determined by the taxout=LIST option, which is set by default to 'taxid'. Four possible output types are:
- taxid = NCBI Taxa IDs (e.g. 9606 or 7227)
- spcode = Uniprot species codes (e.g. HUMAN or DROME)
- name = Scientific name (e.g. Homo sapiens or Drosophila melanogaster)
- common = Common name (e.g. Human or Fruit fly)
These will be output as lists to BASE.TYPE.txt files, where BASE is set by basefile=X (using the first taxin
element if missing) and TYPE is the taxout type. If batchmode=T then a separate set of files will be made for each
element of the TaxIn list, using BASE.TAXIN.TYPE.txt file naming.
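The file naming scheme above can be expressed as a small helper. This is a sketch only: output_filename is a hypothetical function, and any clean-up the program applies to TaxIn elements before using them in file names is not covered here.

```python
def output_filename(basefile, taxout_type, taxin_element=None):
    """Build BASE.TYPE.txt, or BASE.TAXIN.TYPE.txt for a batchmode element."""
    parts = [basefile]
    if taxin_element is not None:
        parts.append(taxin_element)  # batchmode=T: one file set per element
    parts.extend([taxout_type, "txt"])
    return ".".join(parts)


combined = output_filename("primates", "spcode")
batch = [output_filename("run", "taxid", element) for element in ("9606", "DROME")]
```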
SOURCE DATA OPTIONS
specfile=FILE : Uniprot species code download. [
taxmap=FILE : NCBI Node Dump File [
namemap=FILE : NCBI Name mapping file [
taxdir=PATH/ : Will look in this directory for input files if not found [./SourceData/]
sourcedate=DATE : Source file date (YYYY-MM-DD) to preferentially use [
download=T/F : Whether to download files directly from websites where possible if missing [
TAXONOMY CONVERSION OPTIONS
taxin=LIST : List of Taxa IDs, Uniprot species codes (upper case) and/or common/scientific names 
batchmode=T/F : Treat each element of taxin as a separate run (will be used for output basefile) [
taxout=LIST : List of output formats (taxid/spcode/name/common/all) [taxid]
nodeonly=T/F : Whether to limit output to the matched nodes (i.e. no children) [
rankonly=T/F : Whether to limit output to species-level taxonomic codes [
ranktypes=LIST : List of Taxon types to include if rankonly=T [species,subspecies]
restrictid=LIST : List of Taxa IDs to restrict output to (i.e. output overlaps with taxin) 
basefile=X : Results file prefix. Will use first taxin=LIST term if missing [
taxtable=T/F : Whether to output results in a table rather than text lists [False]
See also rje.py generic commandline options.
Module Version History
# 0.0 - Initial Compilation.
# 0.1 - Initial working version with rje_ensembl.
# 1.0 - Fully functional version with modified viral species code creation.
# 1.1.0 - Added parsing of yeast strains.
# 1.2.0 - Added storage of Parents.
# 1.3.0 - taxtable=T/F : Whether to output results in a table rather than text lists [False]
rje_taxonomy REST Output formats
Program documentation and options are available through the REST service, and a plain text version can also be
accessed. Individual parts of the output, matching the tabs in the default output, can be retrieved separately.
Individual elements can also be parsed from the full server output, which is formatted as follows:
... contents for OUTFMT section ...
Available REST Outputs
There is currently no specific help available on REST output for this program.