Module:	rje_ensembl
Description:	EnsEMBL Processing/Manipulation Module
Version:	2.15.2
Last Edit:	20/04/15

Imported modules: rje rje_db rje_forker rje_seq rje_seqlist rje_taxonomy rje_tm rje_uniprot seqmapper rje_zen rje_hmm_V1

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

This module processes EnsEMBL data for the rje_dbase module. The main class is an EnsEMBL class, which stores information on EnsEMBL proteins in terms of their gene IDs, loci and descriptions. It generates the "EnsLoci" dataset for each genome, consisting of the "best" peptide for a given locus. For known genes, UniProt accession numbers are used in place of the EnsEMBL accession number. If the EnsEMBL sequence maps to a SwissProt sequence but is of very low quality (20+ consecutive Xs with fewer non-X residues than the SwissProt sequence), the SwissProt sequence itself replaces the EnsEMBL sequence. This is the only case in which the relationship between EnsEMBL peptide ID and sequence breaks down.
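
The low-quality replacement rule above can be sketched as follows. This is a minimal illustration, not the module's own code: the function names are hypothetical, and it assumes the rule means "a run of at least 20 X characters AND fewer non-X residues than the SwissProt sequence".

```python
import re

# Assumption: "20+ consecutive Xs" means an unbroken run of at least 20 'X' characters.
LOW_QUALITY_RUN = re.compile(r"X{20,}")


def count_non_x(sequence: str) -> int:
    """Count residues that are not the unknown-residue symbol X."""
    return sum(1 for residue in sequence.upper() if residue != "X")


def use_swissprot_sequence(ens_seq: str, sprot_seq: str) -> bool:
    """Return True if the SwissProt sequence should replace the EnsEMBL one.

    Hypothetical helper: both conditions from the description must hold.
    """
    ens = ens_seq.upper()
    has_long_x_run = LOW_QUALITY_RUN.search(ens) is not None
    return has_long_x_run and count_non_x(ens) < count_non_x(sprot_seq)
```

Under these assumptions, an EnsEMBL peptide with a long X run is still kept if it has at least as many informative residues as its SwissProt match.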

Version 1.7 introduced a new "EnsGO" function for making GO datasets for the species codes listed. For each SPECIES, this mode needs the EnsLoci file enspath/ens_SPECIES.loci.fas, the GO mapping enspath/ens_SPECIES.GO.tdt and the GO ID file [GO.terms_ids_obs]. GO mapping files can be created for the relevant species using EnsEMBL's BioMart tool (http://www.ensembl.org/biomart/martview/), while the ID file can be downloaded from GO (http://www.geneontology.org/doc/GO.terms_ids_obs). From BioMart, the following columns should be downloaded: "Ensembl Gene ID", "Ensembl Transcript ID", "Ensembl Peptide ID", "GO ID", "GO description", "GO evidence code", "EntrezGene ID", "HGNC Symbol". Other fields can also be downloaded if desired. This function was further updated in versions 1.8 and 1.9. From Version 2.8, the columns should be: "Ensembl Gene ID", "Ensembl Transcript ID", "Ensembl Protein ID", "GO Term Accession", "GO Term Evidence Code", "EntrezGene ID", "HGNC symbol".
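
As a rough illustration of how a GO mapping file with the Version 2.8 columns could be grouped into GO categories, the sketch below reads a tab-delimited export and collects gene IDs per GO term. It is not the module's implementation: the function name is hypothetical, it assumes the file is tab-delimited with a header row, and its evidence and min_genes arguments only mimic the documented goevidence and mingo options.

```python
import csv
from collections import defaultdict


def genes_by_go(handle, evidence=(), min_genes=0):
    """Group Ensembl gene IDs by GO accession (hypothetical helper).

    handle    : open text file with the Version 2.8 BioMart column headers
    evidence  : acceptable GO evidence codes; all codes are used if empty
    min_genes : drop GO terms with fewer genes than this
    """
    allowed = {code.upper() for code in evidence}
    groups = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        go_id = (row.get("GO Term Accession") or "").strip()
        if not go_id:
            continue  # skip rows for genes without any GO annotation
        code = (row.get("GO Term Evidence Code") or "").strip().upper()
        if allowed and code not in allowed:
            continue  # evidence code not in the accepted list
        groups[go_id].add(row["Ensembl Gene ID"])
    return {go: genes for go, genes in groups.items() if len(genes) >= min_genes}
```

A call such as genes_by_go(open("EnsEMBL/ens_HUMAN.GO.tdt"), evidence=["IDA"], min_genes=5) would then return only experimentally supported terms with at least five genes; the path here simply combines the default enspath with the documented file name pattern.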

Version 2.0 introduced a new "EnsDat" function for generating fake UniProt format entries for EnsLoci data using PFam HMM domain prediction, TMHMM transmembrane topology prediction, SIGNALP signal peptide prediction and IUPRED disorder prediction. Assumes that the EnsLoci files have been created. (Use download=T ensloci=T if not!) Sequences should be extracted from the file created by this method using Accession Numbers only.

Version 2.11 is the start of a major reworking in preparation for V3.0. Species codes are now read in automatically and only the Ensembl species are downloaded from Uniprot for EnsLoci processing. (This can be quite slow, depending on connection etc.) This avoids the need to pre-process Uniprot in order to make EnsLoci sequences. Modified Uniprot downloads and data extraction are used for db xref mapping in place of manual biomart tables. Species data is now split into subsets according to Ensembl sets (main, metazoa, protists etc.) and EnsLoci files are similarly split within an ensloci/ subdirectory of enspath/.

Commandline

### Primary Module Functions ###
download=T/F : Download EnsEMBL databases [False]
makeuniprot=T/F : Whether to generate an Ensembl.dat file of UniProt entries for species [False]
ensloci=T/F : Create EnsEMBL datasets "reduced by loci" [False]
enspep=T/F : Create full gnspacc EnsEMBL peptide datasets [False]
hgncmap=FILE : File to be used for HGNC ID mapping []
resume=X : Species or species code to pickup run from [None]
sections=LIST : List of Ensembl sections to use for run (else All) []
speclist=LIST : List of species to use for run (else All) []
chromspec=LIST : List of species codes to download chromosomes for [HUMAN,DROME,CAEEL,YEAST,MOUSE,DANRE,CHICK,XENTR]
speedskip=T/F : Whether to assume download is fine if pep.all/cdna.all/dna.toplevel file found [True]
### Advanced UniProt Mapping Options ###
mapstat=X : GABLAM Stat to use for mapping assessment (ID/Sim/Len) [ID]
automap=X : Minimum value of mapstat for mapping to occur [80.0]
unispec=FILE : Alternative UniProt species file [None]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### EnsGO Options ###
ensgo=LIST : List of species codes to make EnsGO Datasets for []
mingo=X : Minimum number of genes to output GO category [0]
obsgo=T/F : Whether to include obsolete terms [False]
splicego=T/F : Whether to include all splice variants (EnsEMBL peptides) in GO datasets [False]
goids=FILE : File containing GO IDs [GO.terms_ids_obs]
goevidence=LIST : List of acceptable GO evidence codes. (Will use all if blank.) []
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### EnsDat Options ###
ensdat=LIST : Perform EnsDat construction of predicted UniProt data for the species listed []
tmhmm=FILE : Path to TMHMM program [/home/richard/Bioware/TMHMM2.0c/bin/tmhmm]
signalp=FILE : Path to SIGNALP program [/home/richard/Bioware/signalp-3.0/signalp]
hmmerpath=PATH : Path for hmmer files [/home/richard/Bioware/hmmer-2.3.2/src/]
pfam=FILE : Path to PFam LS file [/home/richard/Databases/PFam/Pfam_ls]
datpickup=FILE : Text file containing names of proteins already processed (skip and append) [ensdat.txt]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### System Parameters ###
enspath=PATH : Path to EnsEMBL file [EnsEMBL/]
unipath=PATH : Path to UniProt files [enspath=PATH/uniprot/]
specsleep=X : Sleep for X seconds between species downloads [60]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
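
The options above all follow a key=value pattern, with T/F for switches and comma-separated values for LIST options (as in the chromspec default). A small sketch of assembling such arguments from Python is given here for illustration only; format_options is a hypothetical helper and the script name rje_ensembl.py is an assumption, not something confirmed by this documentation.

```python
def format_options(options: dict) -> list:
    """Render settings as key=value arguments (hypothetical helper).

    Booleans become T/F and lists become comma-separated values,
    following the option styles listed in the Commandline section.
    """
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        args.append(f"{key}={value}")
    return args


# Example: download data and build EnsLoci, with chromosomes for two species codes.
args = format_options({
    "download": True,
    "ensloci": True,
    "chromspec": ["HUMAN", "YEAST"],
    "enspath": "EnsEMBL/",
})
# Assumption: the module is run as a script called rje_ensembl.py.
print(" ".join(["python", "rje_ensembl.py"] + args))
```

Keeping the settings in a dictionary makes it easy to log or reuse the exact options for a run before passing them to the program.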

History Module Version History

    # 0.0 - Initial compilation.
    # 1.0 - Initial working version with download and EnsLoci functions.
    # 1.1 - Added reformatDB() method to be called by rje_dbase
    # 1.2 - Fixed known-ccds bug
    # 1.3 - SwissProt bug still remained. Fixed and added speclist=LIST
    # 1.4 - Improved EnsLoci mapping to ignore Xs and use SeqMapper for better match to SwissProt
    # 1.5 - Fixed bug that results in multiple occurrences of some sequence names
    # 1.6 - Added crap-sequence catching
    # 1.7 - Added GO dataset partitioning
    # 1.8 - Modification to the GO dataset generation, and GO adaptation for use with PINGU.
    # 1.9 - Added the possibility to restrict GO data to certain evidence codes
    # 2.0 - Added new EnsDat functionality. (Now part of "UniFake" module.)
    # 2.1 - Made changes in line with new EnsEMBL setup.
    # 2.2 - Made changes in line with new EnsEMBL setup. Again. Grrrr. Stop changing things, EnsEMBL!
    # 2.3 - Added new species, metlist and option to download chromosome sequences.
    # 2.4 - Modified to allow HGNC evidence for Human EnsEMBL.
    # 2.5 - Modified to allow additional species-specific evidence in *.map.tdt file. (Mouse, Yeast, Zebrafish)
    # 2.6 - Added additional EnsEMBL sites to metazoa: fungi, plants, protists
    # 2.7 - Updated for new EnsEMBL format with ID mapping in gene.txt file.
    # 2.8 - Bug fixes for updated EnsEMBL release.
    # 2.9 - Reduced DNA chromosome downloads. Updated some species data. Added "known_by_projection" handling.
    # 2.10- Miscellaneous fixes.
    # 2.11- Added rje_taxonomy and makeuniprot=T/F. Removed metlist. Moved release and species data extraction.
    # 2.12- Changed chromspec to enable downloads of all species but also download toplevel files, not chromosomes.
    # 2.13- Added speedskip=T/F [True] that will skip when pep.all, cdna.all and dna.toplevel are found.
    # 2.14- Added enspep=T/F: Create full gnspacc EnsEMBL peptide datasets [False]
    # 2.15.0 - Added capacity to download/process a section of Ensembl with speclist=LIST.
    # 2.15.1 - Improved error handling for too many FTP connections: still need to fix problem!
    # 2.15.2 - Trying to improve speed of Uniprot parsing for EnsLoci.

rje_ensembl REST Output formats

Run with &rest=docs for program documentation and options. A plain text version is accessed with &rest=help.
&rest=OUTFMT can be used to retrieve individual parts of the output, matching the tabs in the default
(&rest=format) output. Individual OUTFMT elements can also be parsed from the full (&rest=full) server output,
which is formatted as follows:

###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
# OUTFMT:
... contents for OUTFMT section ...
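
The layout shown above could be split into its OUTFMT sections with a short parser along these lines. This is a sketch under stated assumptions rather than an official client: parse_rest_full is a hypothetical name, and it assumes every section starts with a "###~~~...~~~###" rule immediately followed by a "# NAME:" line.

```python
import re

# Assumption: a section header is a "###~~~###" rule line, then "# NAME:" on the next line.
SECTION_HEADER = re.compile(r"^###~+###[ \t]*\n# (\w+):[ \t]*\n", re.MULTILINE)


def parse_rest_full(text: str) -> dict:
    """Split full REST server output into {OUTFMT name: section contents}."""
    sections = {}
    matches = list(SECTION_HEADER.finditer(text))
    for index, match in enumerate(matches):
        # A section runs until the next header, or to the end of the text.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip("\n")
    return sections
```

Any text before the first header is ignored, and a repeated section name overwrites the earlier one; both behaviours are choices of this sketch.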

Available REST Outputs

There is currently no specific help available on REST output for this program.

SLiMSuite REST Server
