EnsEMBL Processing/Manipulation Module
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice
This module is for processing EnsEMBL data for the rje_dbase module. The main class is an EnsEMBL class, which stores information on EnsEMBL proteins in terms of their gene IDs, loci and descriptions. This generates the "EnsLoci" dataset for each genome, consisting of the "best" peptide for a given locus. For known genes, UniProt accession numbers are used in place of the EnsEMBL accession number. If the EnsEMBL sequence maps to a SwissProt sequence but is of very low quality (a run of 20+ consecutive Xs and fewer non-X residues than the SwissProt sequence), the SwissProt sequence itself replaces the EnsEMBL sequence. This is the only case in which the relationship between EnsEMBL peptide ID and sequence breaks down.
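The low-quality replacement rule described above can be sketched as follows. This is a minimal illustration, not the module's own code: the names `non_x_count` and `use_swissprot` are hypothetical, and the sketch assumes plain one-letter peptide strings with X marking unknown residues.

```python
import re

# Hypothetical helpers (not part of rje_ensembl) illustrating the rule:
# replace the EnsEMBL peptide with the mapped SwissProt sequence only when
# it has a run of 20+ consecutive Xs AND fewer non-X residues than SwissProt.
LONG_X_RUN = re.compile(r"X{20,}")


def non_x_count(sequence):
    """Count residues that are not X (case-insensitive)."""
    return sum(1 for residue in sequence.upper() if residue != "X")


def use_swissprot(ens_seq, sprot_seq):
    """Return True if the SwissProt sequence should replace the EnsEMBL one."""
    has_long_x_run = LONG_X_RUN.search(ens_seq.upper()) is not None
    return has_long_x_run and non_x_count(ens_seq) < non_x_count(sprot_seq)
```

Both conditions must hold, so a sequence with a long run of Xs that still has at least as many defined residues as its SwissProt match is kept.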
Version 1.7 introduced a new "EnsGO" function for making GO datasets for the species codes listed. For each SPECIES, this mode needs the EnsLoci file enspath/ens_SPECIES.loci.fas, the GO mapping enspath/ens_SPECIES.GO.tdt and the GO ID file [GO.terms_ids_obs]. GO mapping files can be created for the relevant species using EnsEMBL's BioMart tool (http://www.ensembl.org/biomart/martview/), while the ID file can be downloaded from GO (http://www.geneontology.org/doc/GO.terms_ids_obs). From BioMart, the following columns should be downloaded: "Ensembl Gene ID", "Ensembl Transcript ID", "Ensembl Peptide ID", "GO ID", "GO description", "GO evidence code", "EntrezGene ID", "HGNC Symbol". Other fields can also be downloaded if desired. This function was further updated in versions 1.8 and 1.9. From Version 2.8, the columns should be: "Ensembl Gene ID", "Ensembl Transcript ID", "Ensembl Protein ID", "GO Term Accession", "GO Term Evidence Code", "EntrezGene ID", "HGNC symbol".
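As an illustration of how a GO mapping table with the Version 2.8 columns might be read, here is a minimal sketch. It is not the module's own parser: `load_go_map` is a hypothetical name, the file is assumed to be tab-delimited with a header row, and the Ensembl identifiers in the example are placeholders.

```python
import csv
import io
from collections import defaultdict


def load_go_map(handle, evidence=None):
    """Map each Ensembl protein ID to its set of GO accessions.

    handle   -- open text handle on a tab-delimited table with a header row
    evidence -- optional set of evidence codes to keep (None keeps all)
    """
    go_by_protein = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        protein = (row.get("Ensembl Protein ID") or "").strip()
        go_id = (row.get("GO Term Accession") or "").strip()
        code = (row.get("GO Term Evidence Code") or "").strip()
        if not protein or not go_id:
            continue  # skip rows without a peptide or GO annotation
        if evidence and code not in evidence:
            continue  # restrict to the requested evidence codes
        go_by_protein[protein].add(go_id)
    return dict(go_by_protein)


# Small in-memory example with placeholder Ensembl identifiers.
COLUMNS = ["Ensembl Gene ID", "Ensembl Transcript ID", "Ensembl Protein ID",
           "GO Term Accession", "GO Term Evidence Code", "EntrezGene ID",
           "HGNC symbol"]
ROWS = [
    ["ENSG_A", "ENST_A", "ENSP_A", "GO:0005515", "IPI", "", ""],
    ["ENSG_A", "ENST_A", "ENSP_A", "GO:0005634", "IEA", "", ""],
    ["ENSG_B", "ENST_B", "ENSP_B", "GO:0005515", "IEA", "", ""],
]
EXAMPLE = "\n".join("\t".join(fields) for fields in [COLUMNS] + ROWS) + "\n"

all_go = load_go_map(io.StringIO(EXAMPLE))
ipi_only = load_go_map(io.StringIO(EXAMPLE), evidence={"IPI"})
```

The optional `evidence` argument mirrors the idea, noted for version 1.9, of restricting GO data to certain evidence codes; how the module itself exposes that option is not shown here.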
Version 2.0 introduced a new "EnsDat" function for generating fake UniProt format entries for EnsLoci data using
PFam HMM domain prediction, TMHMM transmembrane topology prediction, SIGNALP signal peptide prediction and IUPRED
disorder prediction. Assumes that the EnsLoci files have been created. (Use
Version 2.11 is the start of a major reworking in preparation for V3.0. Species codes are now read in automatically
and only Ensembl species are downloaded from UniProt for EnsLoci processing. (This can be quite slow, depending on
the connection.) This avoids the need to pre-process UniProt in order to make EnsLoci sequences. Modified UniProt
downloads and data extraction are used for db xref mapping in place of manual BioMart tables. Species data is now
split into subsets according to Ensembl sets (main, metazoa, protists etc.) and EnsLoci files are similarly split.
### Primary Module Functions ###
History
Module Version History
# 0.0 - Initial compilation.
# 1.0 - Initial working version with download and EnsLoci functions.
# 1.1 - Added reformatDB() method to be called by rje_dbase
# 1.2 - Fixed known-ccds bug
# 1.3 - SwissProt bug still remained. Fixed and added speclist=LIST
# 1.4 - Improved EnsLoci mapping to ignore Xs and use SeqMapper for better match to SwissProt
# 1.5 - Fixed bug that results in multiple occurrences of some sequence names
# 1.6 - Added crap-sequence catching
# 1.7 - Added GO dataset partitioning
# 1.8 - Modification to the GO dataset generation, and GO adaptation for use with PINGU.
# 1.9 - Added the possibility to restrict GO data to certain evidence codes
# 2.0 - Added new EnsDat functionality. (Now part of "UniFake" module.)
# 2.1 - Made changes in line with new EnsEMBL setup.
# 2.2 - Made changes in line with new EnsEMBL setup. Again. Grrrr. Stop changing things, EnsEMBL!
# 2.3 - Added new species, metlist and option to download chromosome sequences.
# 2.4 - Modified to allow HGNC evidence for Human EnsEMBL.
# 2.5 - Modified to allow additional species-specific evidence in *.map.tdt file. (Mouse, Yeast, Zebrafish)
# 2.6 - Added additional EnsEMBL sites to metazoa: fungi, plants, protists
# 2.7 - Updated for new EnsEMBL format with ID mapping in gene.txt file.
# 2.8 - Bug fixes for updated EnsEMBL release.
# 2.9 - Reduced DNA chromosome downloads. Updated some species data. Added "known_by_projection" handling.
# 2.10 - Miscellaneous fixes.
# 2.11 - Added rje_taxonomy and makeuniprot=T/F. Removed metlist. Moved release and species data extraction.
# 2.12 - Changed chromspec to enable downloads of all species but also download toplevel files, not chromosomes.
# 2.13 - Added speedskip=T/F [True] that will skip when pep.all, cdna.all and dna.toplevel are found.
# 2.14 - Add enspep=T/F : Create full gnspacc EnsEMBL peptide datasets [False]
# 2.15.0 - Added capacity to download/process a section of Ensembl with speclist=LIST.
# 2.15.1 - Improved error handling for too many FTP connections: still need to fix problem!
# 2.15.2 - Trying to improve speed of Uniprot parsing for EnsLoci.
rje_ensembl REST Output formats
Run with
which is formatted as follows:
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~### # OUTFMT: ... contents for OUTFMT section ...
Available REST Outputs
There is currently no specific help available on REST output for this program.
© 2015 RJ Edwards. Contact: firstname.lastname@example.org.