Function
This module controls the generic database manipulations that I routinely use to generate customised
databases:
1. Download commonly used databases, primarily UniProt, EnsEMBL, PFam and PPI databases
2. Reformat and index crucial UniProt data from uniprot.dat and trembl.dat for ease of extraction. (UNIX platform)
3. Generate taxa-specific databases from input databases
4. Generate non-redundant species-specific databases using EnsEMBL gene locus information
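The indexing in step 2 can be pictured as recording where each entry starts in the flat file, so that single entries can be read without scanning the whole file. The following is a minimal sketch, not the RJE_DBASE implementation: it assumes only the standard UniProt flat-file layout, in which each entry begins with an `ID` line and ends with a `//` line. The function names `index_dat` and `fetch_entry`, and the in-memory dictionary index, are hypothetical.

```python
import io


def index_dat(handle):
    """Map each entry name to the byte offset of its ID line.

    `handle` must be a binary file object (hypothetical helper, for illustration).
    """
    index = {}
    offset = handle.tell()
    line = handle.readline()
    while line:
        if line.startswith(b"ID   "):
            # The entry name is the second whitespace-separated field.
            entry_name = line.split()[1].decode("utf-8", errors="replace")
            index[entry_name] = offset
        offset = handle.tell()
        line = handle.readline()
    return index


def fetch_entry(handle, offset):
    """Read one entry starting at `offset`, up to and including its '//' line."""
    handle.seek(offset)
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line.decode("utf-8", errors="replace"))
        if line.startswith(b"//"):
            break
    return "".join(lines)


# Tiny in-memory stand-in for a DAT file (made-up entries).
EXAMPLE_DAT = (
    b"ID   AAA_HUMAN  Reviewed; 10 AA.\nAC   P00001;\n//\n"
    b"ID   BBB_MOUSE  Reviewed; 12 AA.\nAC   P00002;\n//\n"
)
example_handle = io.BytesIO(EXAMPLE_DAT)
example_index = index_dat(example_handle)
```

With a real file, the handle would come from `open(path, "rb")`, and the index would normally be written to disk rather than held in memory.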
By default, database paths are relative. To perform an update, it is advised that a new directory is created and
RJE_DBASE run in this directory with the dbdownload=LIST, dbprocess=LIST and taxadb=FILE options. Once download and
formatting is complete, the new files can be copied over the old files.
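The update procedure above can be sketched in two steps: assemble the option=value arguments for a run in a fresh directory, then copy the finished files over the old ones. This is an illustration only. The helper names `build_update_args` and `copy_over`, the file name `taxadb.txt`, and the use of commas to join LIST values are all assumptions, and the script itself is not invoked here.

```python
import shutil
from pathlib import Path


def build_update_args(dbdownload, dbprocess, taxadb):
    """Assemble option=value arguments for an update run.

    Assumption: LIST options take comma-separated values.
    """
    return [
        "dbdownload=" + ",".join(dbdownload),
        "dbprocess=" + ",".join(dbprocess),
        "taxadb=" + taxadb,
    ]


def copy_over(new_dir, old_dir):
    """Copy every file under new_dir over its counterpart in old_dir."""
    new_dir, old_dir = Path(new_dir), Path(old_dir)
    copied = []
    for src in sorted(new_dir.rglob("*")):
        if src.is_file():
            dest = old_dir / src.relative_to(new_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 also preserves timestamps
            copied.append(dest)
    return copied
```

Because the paths are relative by default, running in a new directory keeps the existing databases usable until `copy_over` is called.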
Database download is controlled in two ways. UniProt and EnsEMBL are managed by their own respective modules. Other
databases are currently read from a file, which is in (an attempt at) XML format of the basic form:
<dbxml>
  <database name="EnsEMBL" ftproot="ftp://ftp.ensembl.org/pub/" outdir="EnsEMBL/Current-release">
    <file path="current_aedes_aegypti/data/fasta/pep/*.gz">Yellow Fever Mosquito</file>
  </database>
</dbxml>
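To show how a file of this form maps onto download jobs, here is a hedged sketch that reads it with Python's standard `xml.etree.ElementTree`. It assumes the file is strictly well-formed XML, which the description above does not guarantee, and it is not the parser used by RJE_DBASE. The function name `list_downloads` and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

EXAMPLE = """<dbxml>
<database name="EnsEMBL" ftproot="ftp://ftp.ensembl.org/pub/" outdir="EnsEMBL/Current-release">
<file path="current_aedes_aegypti/data/fasta/pep/*.gz">Yellow Fever Mosquito</file>
</database>
</dbxml>"""


def list_downloads(xml_text):
    """Return one record per <file> element, joined to its parent <database>."""
    root = ET.fromstring(xml_text)
    downloads = []
    for database in root.iter("database"):
        ftproot = database.get("ftproot", "").rstrip("/")
        for entry in database.iter("file"):
            downloads.append({
                "database": database.get("name"),
                # Wildcards in the path are kept as-is; expanding them is
                # left to whatever performs the actual transfer.
                "source": ftproot + "/" + entry.get("path", "").lstrip("/"),
                "outdir": database.get("outdir"),
                "description": (entry.text or "").strip(),
            })
    return downloads
```

Each record pairs the FTP root with a file path, so one `<database>` element can describe many files that share a server and an output directory.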
The pre-version 1.2 options for making IPI-centred datasets can still be called using the makedb=FILE option along
with its associated options: screenipi=T screenens=F ensloci=F.
Commandline
### Primary Database download and processing options ###
dbdownload=LIST : List of EnsEMBL/UniProt/XML files containing details of databases to download []
dbprocess=LIST : List of EnsEMBL/UniProt/IPI database types to process []
datindex=T/F : Create an index file for the UniProt DAT files in unipath if UniProt in dbprocess [True]
spectable=T/F : Makes a table of species codes, taxonomy and taxon_id from DAT files if dbprocess UniProt [True]
taxadb=FILE : File containing details of taxonomic sub-databases to make [None]
formatdb=T/F : Whether to BLAST format database after making [True]
force=T/F : Whether to force regeneration of existing files [False]
ignoredate=T/F : Whether to ignore the relative timestamps of files when assessing whether to regenerate [False]
ensloci=T/F : Reduce EnsEMBL to a single protein per locus, mapping UniProt where possible [True]
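The interaction of force=T/F and ignoredate=T/F can be read as a small decision rule. The sketch below is one plausible interpretation, not the module's actual logic: it assumes that force always regenerates, that a missing file is always made, that ignoredate skips the timestamp comparison, and that otherwise a file is regenerated when any of its source files is newer. The function name `needs_regeneration` is hypothetical.

```python
import os


def needs_regeneration(target, sources, force=False, ignoredate=False):
    """Decide whether `target` should be rebuilt from `sources`."""
    if force or not os.path.exists(target):
        return True
    if ignoredate:
        # Target exists and timestamps are not to be consulted.
        return False
    target_time = os.path.getmtime(target)
    # Rebuild if any existing source was modified after the target.
    return any(
        os.path.exists(src) and os.path.getmtime(src) > target_time
        for src in sources
    )
```

For example, an index file that is older than the DAT file it was built from would be rebuilt by default, but left alone with ignoredate=T.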
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Database Path Details ###
unipath=PATH : Path to UniProt files [UniProt/]
ipipath=PATH : Path to IPI files [IPI/]
enspath=PATH : Path to EnsEMBL file [EnsEMBL/]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Database and sub-database manufacture ###
dbformat=LIST : Reformats UniProt, IPI or EnsEMBL databases using RJE_SEQ []
makedb=FILE : Makes a database from combined databases [None]
- Note that rje_seq commandline options will be applied to this database with the addition of a
goodspec=X filter applied from the taxalist=LIST option.
useX=T/F : Whether to use certain aspects of databases,
where X is: uniprot/sprot/trembl/ensembl/known/novel/abinitio/ipi [All but ipi True]
taxalist=LIST : List of taxonomic groups to extract spec_codes for reduced database (else all) [None]
speconly=T/F : Will simply output a list of SPECIES codes to the makedb file, rather than making dbase [False]
inversedb=T/F : TaxaList is a list of taxonomic groups *NOT* to be in database [False]
screenipi=T/F : Species represented by IPI databases will be screened out of UniProt and EnsEMBL. [False]
screenens=T/F : Species represented by EnsEMBL will be screened out of UniProt [True]
seqfilter=T/F : Use rje_seq to filter sequences (True) or simply filter on Species Codes (False) [False]
ensfilter=T/F : Run EnsEMBL genomes through rje_seq to apply filters, rather than just concatenating [False]
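Filtering on species codes, with the inversion described for inversedb=T/F, can be illustrated as follows. This is a simplified sketch rather than the module's code: it starts from a set of species codes that has already been extracted for the chosen taxonomic groups, and it assumes UniProt-style entry names of the form NAME_SPECIES. The helper names `species_code` and `keep_entry` are hypothetical.

```python
def species_code(entry_name):
    """Return the species code, i.e. the text after the last underscore."""
    return entry_name.rsplit("_", 1)[-1].upper()


def keep_entry(entry_name, spec_codes, inversedb=False):
    """Decide whether an entry belongs in the reduced database.

    An empty `spec_codes` means no taxa were listed, so everything is kept.
    """
    if not spec_codes:
        return True
    in_list = species_code(entry_name) in spec_codes
    # With inversedb, the listed species are the ones to exclude.
    return not in_list if inversedb else in_list
```

For instance, with species codes HUMAN and MOUSE, an entry named P53_HUMAN is kept by default and dropped when `inversedb=True`.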
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
History
Module Version History
# 1.0 - Initial working version for use with rje_uniprot V2.0
# 1.1 - Added automated downloading of databases from file
# 1.2 - Incorporated RJE_ENSEMBL for handling EnsEMBL genomes and making a one-protein-per-gene proteome.
# 1.3 - Tidied code a little and improved comments/docstrings.
# 2.0 - Heavily reorganised and modified module.
# 2.1 - Added seqfilter=T/F to speed up TaxaDB manufacture.
# 2.2 - Added use of rje_seqlist for TaxaDB manufacture.
# 2.3 - Added construction of EnsEMBL TaxaDB sets during TaxaDB construction.
# 2.3.1 - Updated the dbdownload function to recognise individual files and wildcard file lists.