More on Ensembl data in UCSC

5 views
Skip to first unread message

Marek Szuba

unread,
Dec 14, 2009, 9:26:41 PM12/14/09
to pygr...@googlegroups.com
Hello everyone,

Now that I've got a working prototype for this (which you can find in
my 'ucsc_ensembl' GitHub branch), Chris believes it is time to discuss
the final form - or, in other words, the schema. Here is some
information about the data in question and what needs to be done to
obtain different types of annotations.


The Data

1. All the Ensembl data in UCSC is *transcript* data. You can get
transcript, gene, protein and/or exon annotations from the database but
in the end it all grinds down to appropriately manipulating transcript
data;

2. UCSC stores stable Ensembl identifiers of transcripts, genes and
proteins - but not exons. The latter can only be referred to by
combining the relevant stable Ensembl transcript ID with the exon's
offset (or, in Ensembl terminology, rank) within that transcript.

3. All Ensembl data in the UCSC database is stored in the following
three tables:
- ensGene - contains transcript information, including blobs
containing exon data. Each row is identified by either stable Ensembl
transcript ID (uniquely it seems, even though it's not guaranteed by
the database) or stable Ensembl gene ID (not always uniquely);
- ensPep - contains sequence data. Each row is identified by stable
Ensembl transcript ID (as primary key);
- ensGtp - contains mappings between stable Ensembl transcript, gene
and protein IDs. Transcript IDs are unique, the other two - not
necessarily;

4. We are interested in the following tables in the native Ensembl
database:
- exon_stable_id - maps between stable and internal Ensembl exon IDs;
- exon_transcript - maps internal Ensembl exon IDs to "internal
Ensembl transcript ID + rank" pairs;
- transcript_stable_id - maps between stable and internal Ensembl
transcript IDs;


The Schema

1. Transcript annotations: this one seems easy, an SQLTable connected
to ensGene and using the transcript-ID column as the primary key;

2. Gene annotation: at the first glance, as above but using the gene-ID
column as the primary key. One issue to consider here is alternative
splicing - with this implementation it's not possible to query genes
with multiple transcripts;

3. Protein annotation: get the SQLTable used for transcript
annotations, then connect it with MapView to another one attached to
ensGtp, with the protein-ID column as the primary key. For each protein
ID specified by the user use the MapView to obtain the corresponding
transcript ID, then return appropriate transcript information. One
question: shall we hide from the user that protein data (s)he gets is
actually transcript data (i.e come up with some sort of a wrapper) or
just let him/her have it as it is?

4. Exon annotations: this is where things get a bit more complicated:
a. First, get the stable Ensembl transcript ID corresponding to the
provided stable Ensembl exon ID. This requires a three-table join but
since it's still just a one-to-one mapping, MapView could be used (e.g.
from Ensembl's exon_stable_id to UCSC's ensGene);
b. Next, extract exon blobs from the relevant row of ensGene and parse
them so that information pertaining to individual exons can be accessed;
c. Afterwards, query the Ensembl database again but in the opposite
direction - get all stable Ensembl exon IDs for the relevant
transcript, preserving rank information somehow (either explicitly or
via appropriate ordering). This would most likely require a GraphView
from ensGene to exon_stable_id;
d. Finally, assemble exon tuples from data from both databases and
put them to a local dictionary. You can now use the original stable
Ensembl exon ID provided by the user to query that local dictionary,
moreover it's trivial to implement caching of already-assembled exons.


That's it. Please let me know what you think.

Cheers,
--
MS

C. Titus Brown

unread,
Dec 15, 2009, 9:37:44 AM12/15/09
to pygr...@googlegroups.com
On Mon, Dec 14, 2009 at 06:26:41PM -0800, Marek Szuba wrote:
> Now that I've got a working prototype for this (which you can find in
> my 'ucsc_ensembl' GitHub branch), Chris believes it is time to discuss
> the final form - or, in other words, the schema. Here is some
> information about the data in question and what needs to be done to
> obtain different types of annotations.

Hi Marek,

could you post a script or two showing how you might work with this
schema in practice? Thanks! Uncommented test scripts are just fine :)

> 3. Protein annotation: get the SQLTable used for transcript
> annotations, then connect it with MapView to another one attached to
> ensGtp, with the protein-ID column as the primary key. For each protein
> ID specified by the user use the MapView to obtain the corresponding
> transcript ID, then return appropriate transcript information. One
> question: shall we hide from the user that protein data (s)he gets is
> actually transcript data (i.e come up with some sort of a wrapper) or
> just let him/her have it as it is?

I don't understand this question... is the problem that the "transcript IDs"
would actually be protein IDs? If only for traceability, I would suggest a
wrapper.

> That's it. Please let me know what you think.

Seems good in abstract but I'm kind of code-focused so I'd be interested in
how actually working with this data looks.

cheers,
--titus
--
C. Titus Brown, c...@msu.edu

Marek Szuba

unread,
Dec 16, 2009, 9:57:18 PM12/16/09
to pygr...@googlegroups.com
On Tue, 15 Dec 2009 06:37:44 -0800
"C. Titus Brown" <c...@msu.edu> wrote:


> could you post a script or two showing how you might work with this
> schema in practice? Thanks! Uncommented test scripts are just
> fine :)

I suggest you have a look at pygr/apps/ucsc_ensembl_annot.py in my
'ucsc_ensembl' GitHub branch. Note that I keep changing this script on
a regular basis; that said, in this post I shall use

http://github.com/mkszuba/pygr/commit/b50f1b8b792c58180cac10d778ba22ab263272d0

as reference. Don't mind the ugly code, it is as I mentioned earlier a
prototype and will be reorganised before the release.

> > One question: shall we hide from the user that protein data (s)he
> > gets is actually transcript data (i.e come up with some sort of a
> > wrapper) or just let him/her have it as it is?
> I don't understand this question... is the problem that the
> "transcript IDs" would actually be protein IDs? If only for
> traceability, I would suggest a wrapper.

See lines 137-140 in my script, which produce the following output:

ENSP00000372525 annotENST00000383052[0:47226] chrY[2863321:2910547]

What I meant by hiding from the user was creating a wrapper that would
use data from the transcript database but with protein identifiers;
right now protein-to-transcript-ID mapping is done explicitly and the
database is still just a transcript database.

--
MS

Reply all
Reply to author
Forward
0 new messages