Function
This module is designed to download and interconvert between NCBI Taxa IDs and Uniprot species codes and species
names. It uses two main files: speclist.txt from Uniprot and nodes.dmp from NCBI taxonomy. These will be downloaded
if missing and download=T, else can be manually downloaded from:
- http://www.uniprot.org/docs/speclist.txt
- ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
These will be saved in the directory given by taxdir=PATH/ (default ./SourceData/) and will have the download
date inserted. This enables several versions to be stored together if desired and selected using the sourcedate=DATE
option, where DATE is in the form YYYY-MM-DD.
Alternatively, specfile=FILE, taxmap=FILE and namemap=FILE can be used to give other files of the same format. The
SpecFile should follow the key format features of the organism codes but does not need the headers. The TaxMap
file simply uses the first three columns of the nodes.dmp file (separated by "\t|\t"), which correspond to (1) the
Taxa ID, (2) the parent Taxa ID, and (3) the rank of the entry. These ranks are used to determine which taxa are
output if rankonly=T. By default, these are "species" and "subspecies" (see ranktypes=LIST under Commandline for the
full default list). NameMap uses all four fields.
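The TaxMap layout described above can be sketched as follows. This is a minimal illustration, not the module's own parser: the function name parse_taxmap and the returned dictionary shape are assumptions made for the example.

```python
def parse_taxmap(lines):
    """Return {taxid: (parent_taxid, rank)} from nodes.dmp-style lines.

    Hypothetical helper: only the first three "\t|\t"-separated columns
    are read, as described for the TaxMap file.
    """
    nodes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t|\t")
        if len(fields) < 3:
            continue  # not a usable TaxMap row
        taxid = fields[0].strip()
        parent = fields[1].strip()
        # A three-column file may leave the "\t|" row terminator on the rank
        rank = fields[2].rstrip("|\t ").strip()
        nodes[taxid] = (parent, rank)
    return nodes
```

Taxa IDs are kept as strings here so that they can be compared directly with values read from taxin=LIST.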
To extract and/or convert a set of Taxa IDs, a list of taxa should be given using taxin=LIST, where LIST is either a
comma separated list of taxa, or a file containing one taxon per line. Taxa may be a mix of NCBI Taxa IDs, Uniprot
species codes and case-insensitive (but exact match) species names. By default, all taxa will be combined but if
batchmode=T then each TaxIn element will be processed individually. When batchmode=T, individual list elements can be
files containing taxa. (This will not work in batchmode=F, unless only a single file is given.)
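As a rough sketch of how a taxin=LIST value might be expanded and its elements sorted into the three input types, assuming a simple heuristic (all digits = Taxa ID; up to five upper case alphanumeric characters = species code; anything else = name). The function names and the heuristic are illustrative assumptions, not the module's actual logic.

```python
import os


def read_taxin(taxin):
    """Expand a taxin value into a list of taxa (hypothetical helper).

    If the value is an existing file, read one taxon per line;
    otherwise treat it as a comma separated list.
    """
    if os.path.isfile(taxin):
        with open(taxin) as handle:
            return [line.strip() for line in handle if line.strip()]
    return [item.strip() for item in taxin.split(",") if item.strip()]


def classify_taxon(taxon):
    """Guess the input type of a taxon: 'taxid', 'spcode' or 'name'.

    Assumed heuristic only; an upper case species name of five or fewer
    characters would be misread as a species code.
    """
    if taxon.isdigit():
        return "taxid"
    if taxon.isalnum() and taxon == taxon.upper() and len(taxon) <= 5:
        return "spcode"
    return "name"
```

For example, read_taxin("9606,HUMAN,Homo sapiens") yields three elements that classify_taxon labels as a Taxa ID, a species code and a name respectively.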
Taxa are first mapped on to NCBI Taxa IDs. Unless nodeonly=T, taxa will also be mapped on to all of their child taxa,
as defined in nodes.dmp. If rankonly=T then only those taxa with a rank matching ranktypes=LIST will be retained. IDs
can be further restricted by supplying a list with restrictid=LIST, which will limit mapped IDs to those within the
list given. (Note that this list could itself be created by a previous run of rje_taxonomy and given as a file.)
Taxa IDs are then mapped on to the Uniprot species codes and species names using the SpecFile data. If missing from this
file, scientific names will be pulled out of the NCBI NameMap file instead. NOTE: Uniprot is used first because NCBI
has more redundant taxonomy assignments.
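The mapping steps above (child expansion, rank filtering, ID restriction, then name lookup with Uniprot taking priority) can be sketched as below. This assumes the node data is held as a {taxid: (parent_taxid, rank)} dictionary of strings; the function names, argument names and data shapes are hypothetical and chosen to mirror the commandline options.

```python
def map_taxa(taxids, nodes, nodeonly=False, rankonly=False,
             ranktypes=("species", "subspecies"), restrictid=None):
    """Return the sorted Taxa IDs selected by the options (illustrative only)."""
    # Keep only IDs present in the node data; unknown IDs are silently dropped
    mapped = {taxid for taxid in taxids if taxid in nodes}
    if not nodeonly:
        # Build a parent -> children lookup, skipping self-parented roots
        children = {}
        for taxid, (parent, _rank) in nodes.items():
            if parent != taxid:
                children.setdefault(parent, set()).add(taxid)
        # Walk down the tree to collect every descendant
        stack = list(mapped)
        while stack:
            for child in children.get(stack.pop(), ()):
                if child not in mapped:
                    mapped.add(child)
                    stack.append(child)
    if rankonly:
        mapped = {taxid for taxid in mapped if nodes[taxid][1] in ranktypes}
    if restrictid is not None:
        mapped &= set(restrictid)
    return sorted(mapped, key=int)


def scientific_name(taxid, spec_names, ncbi_names):
    """Prefer the Uniprot SpecFile name, falling back to the NCBI NameMap."""
    return spec_names.get(taxid) or ncbi_names.get(taxid)
```

Both name lookups are plain dictionaries keyed by Taxa ID in this sketch; scientific_name returns None if neither source has an entry.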
Output is determined by the taxout=LIST option, which is set by default to 'taxid'. Four possible output types are
permitted:
- taxid = NCBI Taxa IDs (e.g. 9606 or 7227)
- spcode = Uniprot species codes (e.g. HUMAN or DROME)
- name = Scientific name (e.g. Homo sapiens or Drosophila melanogaster)
- common = Common name (e.g. Human or Fruit fly)
These will be output as lists to BASE.TYPE.txt files, where BASE is set by basefile=X (using the first taxin=LIST
element if missing) and TYPE is the taxout type. If batchmode=T then a separate set of files will be made for each
element of the TaxIn list, using BASE.TAXIN.TYPE.txt file naming.
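The file naming scheme can be illustrated with a short sketch. The helper names are hypothetical, and treating taxout=all as shorthand for the four listed types is an assumption based on the option description rather than confirmed behaviour.

```python
OUTPUT_TYPES = ["taxid", "spcode", "name", "common"]


def expand_taxout(taxout):
    """Expand 'all' into the four output types (assumed meaning of 'all')."""
    if "all" in taxout:
        return list(OUTPUT_TYPES)
    return [out_type for out_type in taxout if out_type in OUTPUT_TYPES]


def output_name(base, out_type, taxin_element=None):
    """Build BASE.TYPE.txt, or BASE.TAXIN.TYPE.txt when a TaxIn element is given."""
    parts = [base] if taxin_element is None else [base, taxin_element]
    return ".".join(parts + [out_type, "txt"])
```

With these helpers, a basefile of primates and taxout=spcode gives primates.spcode.txt, while a batch run over the element HUMAN gives primates.HUMAN.spcode.txt.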
Commandline
SOURCE DATA OPTIONS
specfile=FILE : Uniprot species code download. [speclist.txt]
taxmap=FILE : NCBI Node Dump File [nodes.dmp]
namemap=FILE : NCBI Name mapping file [names.dmp]
taxdir=PATH/ : Will look in this directory for input files if not found ['./SourceData/']
sourcedate=DATE : Source file date (YYYY-MM-DD) to preferentially use [None]
download=T/F : Whether to download files directly from websites where possible if missing [True]
TAXONOMY CONVERSION OPTIONS
taxin=LIST : List of Taxa IDs, Uniprot species codes (upper case) and/or common/scientific names []
batchmode=T/F : Treat each element of taxin as a separate run (will be used for output basefile) [False]
taxout=LIST : List of output formats (taxid/spcode/name/common/all) [taxid]
nodeonly=T/F : Whether to limit output to the matched nodes (i.e. no children) [False]
rankonly=T/F : Whether to limit output to species-level taxonomic codes [False]
ranktypes=LIST : List of Taxon types to include if rankonly=True [species,subspecies,no rank]
restrictid=LIST : List of Taxa IDs to restrict output to (i.e. output overlaps with taxin) []
basefile=X : Results file prefix. Will use first taxin=LIST term if missing [None]
taxtable=T/F : Whether to output results in a table rather than text lists [False]
See also rje.py generic commandline options.
History
Module Version History
# 0.0 - Initial Compilation.
# 0.1 - Initial working version with rje_ensembl.
# 1.0 - Fully functional version with modified viral species code creation.
# 1.1.0 - Added parsing of yeast strains.
# 1.2.0 - Added storage of Parents.
# 1.3.0 - taxtable=T/F : Whether to output results in a table rather than text lists [False]
rje_taxonomy REST Output formats
Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
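A minimal sketch of splitting &rest=full output into its OUTFMT sections, assuming each section starts with a "###~...~###" separator line followed by a "# OUTFMT:" header line. The function name is hypothetical, and any text after the colon on the header line is ignored here, which is an assumption about the format.

```python
import re

SEPARATOR = re.compile(r"^###~+###$")
HEADER = re.compile(r"^#\s*(\w+):")


def parse_rest_full(text):
    """Return {OUTFMT: contents} parsed from full REST output (illustrative)."""
    sections = {}
    current = None
    after_separator = False
    for line in text.splitlines():
        if SEPARATOR.match(line.strip()):
            after_separator = True
            continue
        if after_separator:
            after_separator = False
            header = HEADER.match(line)
            if header:
                # Start a new section named after the OUTFMT element
                current = header.group(1)
                sections[current] = []
                continue
        if current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}
```

Lines before the first recognised header are discarded, and a separator that is not followed by a header line is treated as ordinary section content boundaries being absent, so the following line is appended to the current section.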
Available REST Outputs
There is currently no specific help available on REST output for this program.