ensemble database schemas

Jenny Qing Qian

unread,

May 12, 2008, 2:12:11 AM5/12/08

to pygr-dev

Hi pygr Masters,

I've just uploaded the most recent ensembl database schemas (kindly
provided by Glenn). Please take a look at them if interested.

With the help of two files in the sql/ directory of the ensembl CVS
module (table.sql and foreign_keys.sql, also uploaded for your
convenience), I am gradually making sense of these diagrams: each
oriented line starts with a dark circle and ends with a white
diamond. Each line refers to a foreign key relationship. Although
the cardinality ratio (one-to-one, one-to-many, many-to-many) between
two entities is not explicitly displayed in the diagrams, it can be
inferred from the key constraints.

A few symbols and color codes are still mystery to me:

1) Within an entity, what do the orange and turquoise diamonds, under
the yellow-folder, represent, respectively?

2) Between two entities, what does a diamond half filled and half open
represent? What are the meanings of the various orientations of this
type of diamonds?

If you know the answer, can you share with me?

I am overwhelmed by the amount of information contained in these
diagrams...... I heard many good reports about how powerful pygr can
be in managing large datasets. Hopefully, pygr is going to help me
tame this monster data-warehouse, hence, spare me from being slaved,
or worse, devoured by it.

--jenny

Christopher Lee

unread,

May 12, 2008, 1:55:30 PM5/12/08

to pygr...@googlegroups.com

Hi Jenny,
the best way to deal with monsterously big problems is divide and
conquer: chop off one small tentacle of this monster and just deal
with that. I suggest you start with something that in principle ought
to conform to pygr's annotation model. For example let's take the
exon table (which I see in table.sql). The goals for this first trial
would be:

- be able to hook up Ensembl's sequence database table (dnac) as a
pygr sequence database (using pygr.sqlgraph.SQLTable and
pygr.seqdb.DNASQLSequence). That means that if seqDB is your SQLTable
object, you can do things like
seq1 = seqDB[1] # gets DNA sequence with ID 1
print str(seq1[100:200]) # print nucleotide sequence for an interval
in this sequence

- be able to hook up Ensembl's exon database table (exon) as a pygr
SQLTable database. That means if exonDB is your SQLTable object, you
can do things like
e1 = exon[1] # get the exon with ID 1
print e1.seq_region_start, e1.seq_region_strand # print some
coordinate info

- be able to plug both of those into a pygr.seqdb.AnnotationDB
database and get exon annotations out. The main thing is giving it a
sliceAttrDict dictionary that maps its default attribute names to what
Ensembl actually uses in their exon table. This would be something like
aDB = AnnotationDB(exonDB, seqDB,
sliceAttrDict=dict(id='seq_region_id', start='seq_region_start',
stop='seq_region_end', orientation='seq_region_strand'))
Note: I am not sure whether Ensembl's seq_region_strand values are
compatible with what pygr expects (1 for positive orientation, -1 for
negative orientation). It looks like they probably are but I haven't
checked the actual data... If not we'll have to provide some kind of
method to handle this.

Then you should be able to do things like:
e2 = aDB[2] # get annotated exon 2
print str(e2.sequence) # print its actual sequence

- be able to save these to pygr.Data. Then you should be able to pull
them back in to a new Python session as easily as:
import pygr.Data
aDB = pygr.Data.Bio.Annotation.Ensembl.hg18.exon() # assuming you gave
it this name when you saved it...
print str(aDB[2].sequence) # print exon 2's sequence...

- be able to build the inverse mapping from the genome to the
annotation (the annotation itself provides the forward mapping from
the annotation to the genome). To do so, just use the
NLMSA.add_annotation() method; see the pygr cookbook recipes for
examples. Then if exonmap is this NLMSA mapping, you should be able
to do things like

region = seqDB[4][1000000:1100000] # get 100kb of chromosome
for exon in exonmap[region]: # find all exons overlapping this region
print 'exon:', repr(exon.sequence)

- be able to save this inverse mapping to pygr.Data and in
pygr.Data.schema bind it to a genome attribute e.g. "exons". See the
pygr cookbook recipes for examples. Then you should be able to use it
in a new Python session as easily as:
import pygr.Data
seqDB = pygr.Data.Bio.Annotation.Ensemble.hg18()
for exon in seqDB[4][1000000:1100000].exons: # find exons overlapping
this region
print 'exon:', repr(exon.sequence)

To do this, you'll have to learn about the classes mentioned above,
pygr.Data, etc., but I bet you would be able to get this working
pretty fast, and will get a feeling for how easy / difficult it will
be to work with the Ensembl data in pygr...

Yours,

Chris

On May 11, 2008, at 11:12 PM, Jenny Qing Qian wrote:

>
> Hi pygr Masters,

Istvan Albert

unread,

May 12, 2008, 3:07:14 PM5/12/08

to pygr-dev

On May 12, 2:12 am, Jenny Qing Qian <jqian....@gmail.com> wrote:

> I've just uploaded the most recent ensembl database schemas (kindly
> provided by Glenn). Please take a look at

Are you planning on using some sort of object relational mapper (for
example SqlAlchemy) to connect and query the database?

Because if you choose to use such a mapper your job might become a lot
simpler. You could generate your models (semi-automatically) directly
from the schemas, plus it'll take care of a lot of query building for
you. Of course even in this case starting small as Chris said is the
right way to go about it.

best,

Istvan

Christopher Lee

unread,

May 12, 2008, 4:30:30 PM5/12/08

to pygr...@googlegroups.com

On May 12, 2008, at 12:07 PM, Istvan Albert wrote:
>
> Are you planning on using some sort of object relational mapper (for
> example SqlAlchemy) to connect and query the database?
>
> Because if you choose to use such a mapper your job might become a lot
> simpler. You could generate your models (semi-automatically) directly
> from the schemas, plus it'll take care of a lot of query building for
> you. Of course even in this case starting small as Chris said is the
> right way to go about it.

Yes, you could use any ORM that represents the database table as a
dict-like object (whose keys are primary key ID values, and whose
associated values are "row objects" that map the database columns to
attributes of the same name). However, we probably don't need
anything more sophisticated than pygr.sqlgraph.SQLTable, which does
this in just about the simplest possible way:

db = SQLTable('dbname.tablename')
If you don't provide a cursor, it automatically gets a MySQLdb
connection using your ~/.my.cnf file. It analyzes the schema of the
table automatically. It also "plays nice" with pygr.Data and other
aspects of pygr... I also understand the performance tuning options
(e.g. caching; grouping sets of rows to load as a group using a
cluster key; etc.) for this much better than for SqlAlchemy / other
ORMs, since we've had extensive experience using this with very large
tables...

RE: working with ORMs, pygr has a few conventions that would need to
be followed for the ORM to work completely transparently with every
aspect of pygr:

- database objects should have a dict-like interface as described above.

- row objects should have a "db" attribute (pointing to the database
object), and an "id" attribute (primary key value).

- to work with pygr.Data, the database object must be picklable (this
is easy, but I bet many ORMs have not addressed persistence in this
Pythonic and truly general way). Also, if the database provides an
__invert__() method, pygr will automatically use it to provide the
reverse mapping, for purposes of schema binding...

- database objects should also have a writable attribute "itemClass"
that allows the user to specify what class should be used for row
objects. This allows row objects to model any desired complex
datatype (e.g. a genome sequence; a fancy kind of annotation etc.).
If the itemClass is itself sliceable, it should use the db attribute
"itemSliceClass" as the class for the results of any such slicing
operation. E.g. slicing a sequence object should yield a sequence-
interval object... itemClass and itemSliceClass are also important
for pygr.Data, which needs to be able to automatically bind schema
attributes (as Python descriptors) to these classes in a transparent
way...

Note that by default, SQLTable uses the sqlgraph.TupleO class as the
itemClass, to avoid actually storing a __dict__ for every row object
(saves lots of space when loading lots of rows). TupleO stores each
row as a Python tuple, but provides an attribute interface based on
the table schema automatically analyzed by SQLTable.

-- Chris

Jenny Qing Qian

unread,

May 12, 2008, 4:51:42 PM5/12/08

to pygr...@googlegroups.com

Note: I am not sure whether Ensembl's seq_region_strand values are
compatible with what pygr expects (1 for positive orientation, -1 for
negative orientation). It looks like they probably are but I haven't
checked the actual data... If not we'll have to provide some kind of
method to handle this.

Just checked the actual ensembl exon table records. Ensembl also uses +1 for forward stand and -1 for reverse strand.

One problem less :)

Namshin Kim

unread,

May 12, 2008, 10:18:28 PM5/12/08

to pygr...@googlegroups.com

Just for curiosity, is the ensembl coordinate 0-based or 1-based?

Namshin Kim

Jenny Qing Qian

unread,

May 12, 2008, 10:54:09 PM5/12/08

to pygr...@googlegroups.com

Hi Namshin Kim,

I am pretty sure that the ensembl coordinate starts from 1. I have tried to select records where "seq_region_start = 0" from three different tables (exon, transcript and gene). All gave me empty sets. When "seq_region_start" was set to 1, one record was returned from each of the above three tables.

Best Regards,

Jenny

Rob Kirkpatrick

unread,

May 13, 2008, 5:02:17 PM5/13/08

to pygr...@googlegroups.com

A few symbols and color codes are still mystery to me:

1) Within an entity, what do the orange and turquoise diamonds, under
the yellow-folder, represent, respectively?

Looks like those are unique (orange) and simple (tuquoise) indexes. The folders just indicate the index (.idx) files.

2) Between two entities, what does a diamond half filled and half open
represent? What are the meanings of the various orientations of this
type of diamonds?

Not sure here but it looks like some diamonds are just cut off by text boxes... Some overlaps between the diamonds and filled circles also make it look like there are new shapes but I think there are just the two: filled circles, open diamonds

Rob Kirkpatrick

unread,

May 13, 2008, 5:35:50 PM5/13/08

to pygr...@googlegroups.com

Here's some of the basic workings with the Ensembl db queried by Pygr (two of heads on our monster). Note that I had to use the dna table as the dnac tables are all empty:

from pygr.seqdb import BlastDB
from pygr.sqlgraph import SQLTable
from pygr import seqdb
from pygr import cnestedlist
import MySQLdb
conn = MySQLdb.connect(host="ensembl01", user="ensembl", passwd="ensembl")
c = conn.cursor()

#get the dna sequence object
dna_sliceDB = SQLTable('homo_sapiens_core_49_36k.dna', c)
dna1 = dna_sliceDB[143909]
# dna1.data[0] is the record id - 143909 in this case
# dna1.data[1] is the sequence itself
# dna1.db is 'homo_sapiens_core_49_36k.dna'

dna_seq = dna1.data[1]
#print dna_seq[0:5] would just return the first bases of the sequence - CACCC

#get the exon (annotation) object
exon_sliceDB = SQLTable('homo_sapiens_core_49_36k.exon', c)
exon1 = exon_sliceDB[1]
#print str(exon1.seq_region_start) would give 19397
#print str(exon1.seq_region_end) would give 19669
#print str(exon1.seq_region_strand) would give -1

exon1_seq = dna_seq[19397:19669]
exon1_seq_2 = dna_seq[exon1.seq_region_start:exon1.seq_region_end]

# both exon1_seq and exon1_seq_2 would return the same thing: TCTTATAATTGTTTTTGTTCCTGTAAAATTTTTAGTAATATCCTTTTTATTTCTGATTTTAGTAATTTGAATCTTCTCTTTTTTCTTAGCCAATCTAGCTAAAAGTTTGTCAATTTGGTTGATCTTTTTAAAGAACCAACTTTTGGTTTCATTGAAGGCTTTTTTTCCTATGATTTTTTAAAATTCCTTGTTTTATTTATTTCTATTGTAACCTTTATTATGTCCTCCCTTTTGGTAGCTTTGGGTTTAGTTTTCTCTTCTTTTTCTAGTTC

Chris, just out of curiosity, dna1 in the above code is a pygr.sqlgraph.foo object. Is that a temp thing where we'll get a better name for it sometime or is that a placeholder for strange object type?

I'm feeling a little Pygr love now and I think we like each other.

Christopher Lee

unread,

May 13, 2008, 6:24:34 PM5/13/08

to pygr...@googlegroups.com

Very cool, Rob. Now let's bump it up a notch... comments below:

On May 13, 2008, at 2:35 PM, Rob Kirkpatrick wrote:
> Here's some of the basic workings with the Ensembl db queried by
> Pygr (two of heads on our monster). Note that I had to use the dna
> table as the dnac tables are all empty:
>
>
> from pygr.seqdb import BlastDB
> from pygr.sqlgraph import SQLTable
> from pygr import seqdb
> from pygr import cnestedlist
> import MySQLdb
> conn = MySQLdb.connect(host="ensembl01", user="ensembl",
> passwd="ensembl")
> c = conn.cursor()
>
> #get the dna sequence object
> dna_sliceDB = SQLTable('homo_sapiens_core_49_36k.dna', c)
> dna1 = dna_sliceDB[143909]
> # dna1.data[0] is the record id - 143909 in this case
> # dna1.data[1] is the sequence itself
> # dna1.db is 'homo_sapiens_core_49_36k.dna'
>

Instead of accessing the data as generic row objects, let's make them
into real sequence objects:

from pygr.seqdb import DNASQLSequence
class EnsemblDNA(DNASQLSequence): # this is just a performance
optimization...
def __len__(self):
return self._select('length(sequence)') # no need to
download whole sequence to get its length

dnaDB = SQLTable('homo_sapiens_core_49_36k.dna', c,
itemClass=EnsemblDNA, attrAlias=dict(seq='sequence'))
dna1 = dnaDB[143909] # not just a tuple, but a EnsemblDNA sequence
object
print len(dna1) # note: this doesn't actually retrieve the sequence
string from the db, just uses SQL to get integer length
ival = dna1[20:50] # not just a string, but a sequence interval object
print repr(ival), str(ival)
rc = -ival # get sequence object representing its opposite strand
(reverse complement)
rc_bit = rc[-20:] # of course any sequence interval is sliceable...
yielding another sequence interval object
print repr(rc_bit), str(rc_bit), len(rc_bit)

> dna_seq = dna1.data[1]
> #print dna_seq[0:5] would just return the first bases of the
> sequence - CACCC
>
> #get the exon (annotation) object
> exon_sliceDB = SQLTable('homo_sapiens_core_49_36k.exon', c)
> exon1 = exon_sliceDB[1]
> #print str(exon1.seq_region_start) would give 19397
> #print str(exon1.seq_region_end) would give 19669
> #print str(exon1.seq_region_strand) would give -1
>
>
> exon1_seq = dna_seq[19397:19669]
> exon1_seq_2 = dna_seq[exon1.seq_region_start:exon1.seq_region_end]
>
> # both exon1_seq and exon1_seq_2 would return the same thing:
> TCTTATAATTGTTTTTGTTCCTGTAAAATTTTTAGTAATATCCTTTTTATTTCTGATTTTAGTAATTTGAATCTTCTCTTTTTTCTTAGCCAATCTAGCTAAAAGTTTGTCAATTTGGTTGATCTTTTTAAAGAACCAACTTTTGGTTTCATTGAAGGCTTTTTTTCCTATGATTTTTTAAAATTCCTTGTTTTATTTATTTCTATTGTAACCTTTATTATGTCCTCCCTTTTGGTAGCTTTGGGTTTAGTTTTCTCTTCTTTTTCTAGTTC
>

OK, let's turn this into a real annotation database in the Pygr sense
of the word:
from pygr.seqdb import AnnotationDB
annoDB = AnnotationDB(exon_sliceDB, dnaDB,

sliceAttrDict=dict(id='seq_region_id', start='seq_region_start',
stop='seq_region_end', orientation='seq_region_strand'))

exon1 = annoDB[1] # get annotation object
ival = exon1.sequence # get associated sequence object
print repr(ival), str(ival) # print actual TCTTA... string

> Chris, just out of curiosity, dna1 in the above code is a
> pygr.sqlgraph.foo object. Is that a temp thing where we'll get a
> better name for it sometime or is that a placeholder for strange
> object type?

SQLTable automatically subclasses the itemClass (by default, TupleO)
to bind its attribute dictionary at the **class** level (that is how
we avoid having to store a __dict__ for each row instance in TupleO,
for example). Because I was lazy, that automatically generated
subclass is just getting called "foo" by default. We can change that!

Rob Kirkpatrick

unread,

May 13, 2008, 6:49:50 PM5/13/08

to pygr...@googlegroups.com

This might be some of the non-trivial mapping that Glenn mentioned (or I'm just doing something wrong). When I run that code on my end I get:

KeyError: 'homo_sapiens_core_49_36k.dna[226034].length(sequence) not found, or not unique'

The exon table has a number of entries that have seq_region_id=226034 but the dna table has no entries with that id. Either the mapping is done by routing through other tables or...

I presume pygr is sniffing out the id colums and using them to do joins? Would that explain the error above?

Christopher Lee

unread,

May 13, 2008, 7:15:18 PM5/13/08

to pygr...@googlegroups.com

On May 13, 2008, at 3:49 PM, Rob Kirkpatrick wrote:

> This might be some of the non-trivial mapping that Glenn mentioned
> (or I'm just doing something wrong). When I run that code on my end
> I get:
>
> KeyError: 'homo_sapiens_core_49_36k.dna[226034].length(sequence) not
> found, or not unique'
>
> The exon table has a number of entries that have
> seq_region_id=226034 but the dna table has no entries with that id.
> Either the mapping is done by routing through other tables or...
>

Yikes, foreign key values that don't exist as primary keys! Are there
large numbers of seq_region_id values in Ensembl's exon table that
don't exist in the dna table, or is this an infrequent problem? I'm
afraid I never looked at the ER diagram; does it truly indicate
exon.seq_region_id is a foreign key with a many-to-one relation to
dna.seq_region_id, as you would expect?

Are you absolutely sure that the data uploaded into your database is
complete? It might be worthwhile to search for the same seq_region_id
in Ensembl's public MySQL server, just to check... If 226034 is also
missing in Ensembl's database, that really makes me question Ensembl's
schema: either this missing key is a bug, or there must be some wacky
intermediate mapping table that maps "seq_region_id" to
"seq_region_id" (!) I guess we can bug Glenn if we can't figure this
out easily...

Meanwhile, does the AnnotationDB example code I sent you work for
those seq_region_id rows that DO exist?

> I presume pygr is sniffing out the id colums and using them to do
> joins? Would that explain the error above?

Yes, SQLTable automatically analyzes the column indexing for the
table, to identify the primary key. Check the dnaDB.primary_key
attribute to see if it is correct. Also check that a simple query
like dnaDB[2] gets the right row as confirmed by your own SQL
queries... I've never had a problem with this.

The error message simply indicates either that no row with that ID was
found, or that multiple rows matched (which according to the Ensembl
schema is impossible, right? dna.seq_region_id is a primary key, I
think).

Jenny Qing Qian

unread,

May 13, 2008, 7:15:50 PM5/13/08

to pygr...@googlegroups.com

Shall we create a Wiki for PygrOnEnsembl as well? Or, is it too early for that?

Jenny

Christopher Lee

unread,

May 13, 2008, 7:18:59 PM5/13/08

to pygr...@googlegroups.com

That seems like a fine idea!
-- Chris

Rob Kirkpatrick

unread,

May 13, 2008, 7:48:35 PM5/13/08

to pygr...@googlegroups.com

I might have been a bit naive there.... I was looking at the schema diagrams and it's virtually impossible to interpret the relationships properly. The foreign_keys.sql file is a bit more helpful. It has the following entries:

ALTER table dna ADD FOREIGN KEY (seq_region_id) REFERENCES seq_region(seq_region_id);
ALTER table exon ADD FOREIGN KEY (seq_region_id) REFERENCES seq_region(seq_region_id);
ALTER table seq_region ADD FOREIGN KEY (coord_system_id) REFERENCES coord_system(coord_system_id);

This should lead us somewhere, but this implies the seq_region table is the linker. Given that, the dna table still doesn't have an entry for seq_region_id=226034. I'm wondering if the dna table actually stores the sequence that the rest of the db references...

I obviously need more time in the schema to figure it out and check to ensure that our mirror is complete...

mysql> select max(seq_region_id) from dna;
+--------------------+
| max(seq_region_id) |
+--------------------+
|             171002 |
+--------------------+

mysql> select max(seq_region_id) from exon;
+--------------------+
| max(seq_region_id) |
+--------------------+
|             226065 |
+--------------------+

mysql> select max(seq_region_id) from seq_region;
+--------------------+
| max(seq_region_id) |
+--------------------+
|            1966001 |
+--------------------+

Jenny Qing Qian

unread,

May 13, 2008, 9:49:25 PM5/13/08

to pygr...@googlegroups.com

Done.

http://code.google.com/p/pygr/wiki/PygrOnEnsembl

---jenny

Jenny Qing Qian

unread,

May 15, 2008, 7:10:09 AM5/15/08

to pygr...@googlegroups.com

Hi Chris,

Meanwhile, does the AnnotationDB example code I sent you work for

those seq_region_id rows that DO exist?

I checked the code you sent to Rob. In my code, I connect to the real ensembl dbs, not a local mirror. It works for seq_region_id up to 171002. This is the maximum existing value for dna.seq_region_id.

Yikes, foreign key values that don't exist as primary keys! Are there large numbers of seq_region_id values in Ensembl's exon table that don't exist in the dna table, or is this an infrequent problem?

Basically, any seq_region_id bigger than 171002 does not exist in the dna table. In other words, there are 55063 seq_region_ids in the exon table that can not be found in the dna table.

I'm afraid I never looked at the ER diagram; does it truly indicate exon.seq_region_id is a foreign key with a many-to-one relation to dna.seq_region_id, as you would expect?

The exon and the dna tables do not have any foreign key relationship. I checked the schema diagram again, the dna.seq_region_id refers to seq_region.seq_region_id. Since the seq_region_id is the primary key in both tables, the relationship between the dna table and the seq_region table is one to one. However, there are more records in the seq_region table than those in the dna table. But that alone does not violate foreign key constraint.

The relationship between the exon table and the seq_region table is many to one since there is a foreign key relationship from the exon table to the seq_region table. In addition, the exon.seq_region_id is not a primary key.

So, just as Rob pointed out in the following e-mail, the dna table and the exon tables are only related via the seq_region table. There seems to be many exon.seq_region_ids that are missing from the dna table, even though they do refer uniquely to certain seq_region.seq_region_ids,

Jenny

Rob Kirkpatrick

unread,

May 15, 2008, 2:02:03 PM5/15/08

to pygr...@googlegroups.com

It looks like some of the relationship mapping passes though the meta_coord and coord_system tables. As Glenn noted in one of his examples:

"the API will look in meta_coord to see where prediction_transcripts (contigs),
project the chromosomal slice onto that coord_system"

Those two tables seem to break down the seq_region id's according to virtual or non-virtual sequence entities. So to make the right query, we need to know what coord system we are trying to work with, then route the SQL through the appropriate tables.

The entirety of Glenn's example:

#
# retrieve feature stored in contig coordinates from a chromosome slice
#
my $slice1 = $slice_adaptor->fetch_by_region('chromosome', 18);
my @prediction_transcripts = @{ $slice1->get_all_PreditionTranscripts };

the API will look in meta_coord to see where prediction_transcripts (contigs),
project the chromosomal slice onto that coord_system, fetch prediction
transcripts for each contig and then transform coordinates back into chromosomal
coordinates before returning the features.

e.g. GENSCAN00000036948 is stored on contig AC116447.10.1.105108 48849-66632,
but returned on chromosome 18 31140100-31141800 (approx.)

Christopher Lee

unread,

May 15, 2008, 5:41:49 PM5/15/08

to pygr...@googlegroups.com

Hi Jenny, Rob,
I was actually going to ask whether Ensembl provides a mapping (e.g.
for the exon annotations) to genome coordinates (i.e. on hg18 or hg17,
presumably). Rob's comments below seem to imply that we could do
that. Genomic coordinates would seem like the most robust place to
attach annotations and escape from these "foreign keys" that don't
seem to be reliable identifiers in the usual sense of the word.

Jenny, it would be really helpful if you could dig up answers to the
following questions:

- what version of the genome is this Ensembl annotation dataset
attached to (e.g. hg18)?
- what Ensembl tables store such genomic coordinates?
- what additional coordinate systems does Ensembl use? (I guess
seq_region is an example...) Do we understand what Ensembl uses them
for, instead of just expressing everything in terms of genomic
coordinates?
- are all these coordinate systems mappable to genomic coordinates?
If so, how?

Thanks!

-- Chris

Rob Kirkpatrick

unread,

May 15, 2008, 6:18:40 PM5/15/08

to pygr...@googlegroups.com

I had sent this previously to Jenny but I thought I would send it on to pygr-dev. There was a dude that put together a Ruby Ensembl API last year or so and he forged a path that we can possibly follow (http://saaientist.blogspot.com/2007/08/ruby-api-to-ensembl-database.html). Shortly after that, another dude wrote some online docs for that API which shed some light on the mappings a bit more. See my message to Jenny below.

---------- Forwarded message ----------
From: Rob Kirkpatrick <robert.d.k...@gmail.com>
Date: Thu, May 15, 2008 at 1:45 PM
Subject: Re: Ruby API for Ensembl
To: Jenny Qian <jqia...@gmail.com>

Another dude has set up some docs for the ruby ensembl api and is helping shed some light on the seq_region_id thing.

Check out the docs here: http://bioruby-annex.rubyforge.org/
Specifically, look at the Ensembl::Core::CoordSystem and Ensembl::Core::Dna classes.

If we look at the coord_system table in ensembl we see:

mysql> select * from coord_system;
+-----------------+-------------+---------+------+--------------------------------+
| coord_system_id | name        | version | rank | attrib                         |
+-----------------+-------------+---------+------+--------------------------------+
|              17 | chromosome | NCBI36 |    1 | default_version                |
|              15 | supercontig | NULL    |    2 | default_version                |
|               4 | contig      | NULL    |    4 | default_version,sequence_level |
|              11 | clone       | NULL    |    3 | default_version                |
|             101 | chromosome | NCBI35 | 101 |                                |
+-----------------+-------------+---------+------+--------------------------------+

The dna table only holds the seq_region_ids for the contig-level coord_system. The contig level is also the only one that has "real" sequence associated with it. All other coord_system levels are virtual and have to stitch together the contig sequences to achieve their sequence.

I'm guessing the the exons must be mapped onto a different coord_system then contigs and I'm guessing the dna table is only for contigs.

I think that makes sense...

This is from the ruby ensembl api docs for the dna table:

"The Dna class contains the actual DNA sequence for the sequence regions that belong to the seq_level coordinate system.

This class uses ActiveRecord to access data in the Ensembl database. See the general documentation of the Ensembl module for more information on what this means and what methods are available."

and this is from the coord_system docs:

"The CoordSystem class describes the coordinate system to which a given SeqRegion belongs. It is an interface to the coord_system table of the Ensembl mysql database.

Two virtual coordinate systems exist for every species:

toplevel: the coordinate system with rank 1
seqlevel: the coordinate system that contains the seq_regions with the sequence

This class uses ActiveRecord to access data in the Ensembl database. See the general documentation of the Ensembl module for more information on what this means and what methods are available."

Let me know if that makes sense to you.

Namshin Kim

unread,

May 15, 2008, 10:30:30 PM5/15/08

to pygr...@googlegroups.com

Hi Jenny,

I hope attached document help you understand Ensembl database schema.

Yours,

Namshin Kim

EnsSchema.pdf

Christopher Lee

unread,

May 15, 2008, 10:49:36 PM5/15/08

to pygr...@googlegroups.com

Thanks, Rob! This was very useful. I now understand the basic idea:
Ensembl seq_region table maps all of their seq_region_id to one of
several "coordinate systems" (sequence databases) listed in the table
coord_system. The bottom line is almost all of the data is mapped on
the NCBI36 human genome (coord_system_id=17), which I believe is
equivalent to hg18 (somebody please check this), with a much smaller
fraction mapping to the "supercontig" (coord_system_id=15). The
counts of mappings on all the coordinate systems are summarized in the
meta_coord table:

mysql> select * from meta_coord;
+--------------------------+-----------------+------------+
| table_name | coord_system_id | max_length |
+--------------------------+-----------------+------------+
| assembly_exception | 17 | 4731698 |
| oligo_feature | 17 | 25 |
| ditag_feature | 15 | 25 |
| transcript | 15 | 209805 |
| dna_align_feature | 15 | 209805 |
| protein_align_feature | 15 | 207164 |
| exon | 15 | 4484 |
| gene | 15 | 209805 |
| transcript | 17 | 2304118 |
| protein_align_feature | 17 | 2298737 |
| dna_align_feature | 17 | 2298740 |
| exon | 17 | 18172 |
| gene | 17 | 2304118 |
| dna_align_feature | 4 | 21324 |
| protein_align_feature | 4 | 21324 |
| marker_feature | 15 | 1549 |
| marker_feature | 17 | 5750 |
| repeat_feature | 15 | 2717 |
| repeat_feature | 17 | 50000 |
| simple_feature | 4 | 44058 |
| prediction_exon | 4 | 21324 |
| prediction_transcript | 4 | 329127 |
| repeat_feature | 4 | 160602 |
| oligo_feature | 15 | 25 |
| ditag_feature | 17 | 25 |
| misc_feature | 17 | 100530253 |
| regulatory_feature | 17 | 131013 |
| regulatory_search_region | 17 | 4998 |
| density_feature | 17 | 1648332 |
| density_feature | 15 | 8702 |
| karyotype | 17 | 30639485 |
+--------------------------+-----------------+------------+
31 rows in set (0.17 sec)

The dna table turns out to only store entries for coord_system_id=4
("contig"). The only tables that seem to have annotations on contigs
(coord_system_id=4) are simple_feature, prediction_transcript,
prediction_exon, and dna_align_feature.

So, to work with Ensembl annotations (e.g. exon table), all I had to
do was create a SeqRegion class that will map seq_region_id to the
proper sequence database based on its coord_system_id:

seq_region =
sqlgraph.SQLTable('homo_sapiens_core_47_36i.seq_region',
cursor)
import pygr.Data
hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18() # human genome
srdb = SeqRegion(seq_region, {17:hg18}) # just map
coord_system_id 17 to hg18

Now it will return the right sequence object for any seq_region_id (as
long as it's in NCBI36):
chr1 = srdb[226034] # this is chromosome 1
print len(chr1) # 247249719

Next create an interface to the exon table:
exonSliceDB = sqlgraph.SQLTable('homo_sapiens_core_47_36i.exon',
cursor, itemClass=EnsemblRow)
e = exonSliceDB[73777]
print e.start # 444865
EnsemblRow only does one thing: convert the interval coords to Python-
style (zero-offset) coordinates.

Finally, let's create an annotation database interface for exons:
from pygr.seqdb import AnnotationDB
annoDB = AnnotationDB(exonSliceDB, srdb, sliceAttrDict=

dict(id='seq_region_id',stop='seq_region_end',
orientation='seq_region_strand'))
e = annoDB[73777] # get an exon annotation
print e.phase, e.end_phase # 1 0
s = e.sequence
print str(s)
#'GCAAGCTGTGGACAAGAAGTATGAAGGTCGCTTACAGCATTCTACACAAATTAGGCACAAAGCAGGAACCCATGGTCCGGCCTGGAGATAGG
'
print str(s.before()[-10:]) # 'GTATTCATAG' last 2 nt is splice-site
print str(s.after()[:10]) # 'GTAAGTGCAA' first 2 nt is splice-site

I'll upload the Python file that contains the SeqRegion and EnsemblRow
classes, and this test code, to the Groups page.

For just a few tables (simple_feature, prediction_transcript,
prediction_exon, and dna_align_feature), we'll have to convert from
the coordinate system (contig) to NCBI36 genomic coords via the
assembly table.

-- Chris

On May 15, 2008, at 3:18 PM, Rob Kirkpatrick wrote:
> I had sent this previously to Jenny but I thought I would send it on
> to pygr-dev. There was a dude that put together a Ruby Ensembl API
> last year or so and he forged a path that we can possibly follow (http://saaientist.blogspot.com/2007/08/ruby-api-to-ensembl-database.html

> ). Shortly after that, another dude wrote some online docs for that

Namshin Kim

unread,

May 15, 2008, 10:57:41 PM5/15/08

to pygr...@googlegroups.com

Hi Chris,

NCBI36 (Oct 2005) = Ensembl genome assembly (17)

NCBI36.1 (March 2006) = hg18

NCBI Latest Build : 36.3

http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html

I think NCBI36 is hg18 basically. But, small version number may mean there is a slight change.

Yours,

Namshin Kim

Christopher Lee

unread,

May 16, 2008, 2:00:01 AM5/16/08

to pygr...@googlegroups.com

Hi,
I added new functionality to the seqregion.py module, to do automatic
mapping via the mysql database (instead of needing to create an NLMSA
locally), called EnsemblMapper. It's quite easy to use:

mapper = EnsemblMapper(annoDB, srdb) # exon annotation DB , seq
region DB
ival = chr1[:100000] # get a piece of hg18
print mapper[ival] # find exons in this interval
print 'neg:', mapper[-ival] # forces annotations to opposite ori...

This will take care of our mapping needs for everything except the 4
tables that are stored in contig coordinates, which will require an
intermediate step using the assembly table as I mentioned in my
previous email.

I uploaded the new version to the Google Groups site.

-- Chris

Christopher Lee

unread,

May 16, 2008, 2:03:45 AM5/16/08

to pygr...@googlegroups.com

Jenny should check whether Ensembl "NCBI36" means the Oct2005 version
or a later version. When I tried examining chr1 exons mapped onto
hg18, I got the definite impression that something was awry, because
in many cases there were no AG...GT splice sites around the putative
exons. This would make sense if I was using the wrong genome version
(hg18 = NCBI36.1, according to you).

-- Chris

Jenny Qing Qian

unread,

May 16, 2008, 2:59:57 AM5/16/08

to pygr...@googlegroups.com

Hi Chris,

Double checked. "NCBI36" does mean the Oct2005 version.
http://ensembl.genomics.org.cn/Homo_sapiens/index.html

--jenny

Namshin Kim

unread,

May 16, 2008, 3:02:11 AM5/16/08

to pygr...@googlegroups.com

Hmm... What about the chromosome size? Are they different or same? If they are different, we may have to run blastz....

Namshin Kim

Jenny Qing Qian

unread,

May 16, 2008, 3:17:26 AM5/16/08

to pygr...@googlegroups.com

Hi Namshin,

Chromosome sizes of the NCBI36 and NCBI36.1 versions are the same.

Jenny

Jenny Qing Qian

unread,

May 16, 2008, 3:33:08 AM5/16/08

to pygr...@googlegroups.com

Hi Namshin,

Thank you for the EnsSchema.pdf. It is really helpful!

Jenny

Christopher Lee

unread,

May 16, 2008, 1:00:55 PM5/16/08

to pygr...@googlegroups.com

I think we need to find out where we can download a copy of the Oct05
NCBI36 genome from. My tests with about six exons annotated on chr1
suggest that the Ensembl exon annotations don't map correctly onto hg18.

-- Chris

Jenny Qing Qian

unread,

May 16, 2008, 3:56:02 PM5/16/08

to pygr...@googlegroups.com

Hi Chris,

I went to the Ensembl FTP site http://www.ensembl.org/info/downloads/ftp_site.html. Do you think that we should use the data from here? ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/.
In the Ensembl Archive site http://ensembl.genomics.org.cn/Homo_sapiens/index.html, there is no place to download NCBI36, Oct 2005.

I also checked UCSC current http://genome.ucsc.edu/cgi-bin/hgGateway and archive http://genome-archive.cse.ucsc.edu/index.html?org=Human&db=hg13&hgsid=6203
genome download sites and no copy of NCBI36, Oct 2005 can be found.
Do you think it is worthwhile to try hg17?

Rob also checked GSC in-house Ensembl human genome downloads and he only found a copy for 36.1 (Mar 2006).

Jenny

Christopher Lee

unread,

May 16, 2008, 5:17:48 PM5/16/08

to pygr...@googlegroups.com

Hi Jenny,
Maybe we should check with Glenn at Ensembl whether the
homo_sapiens_core_47_36i annotations are on NCBI36 (as it says in the
coord_system table, coord_system_id=17) or on NCBI36.1 (i.e. hg18).
If the latter, then all is well. But if it's the former, then
somebody needs to tell us where we can actually get the NCBI36 genome!

Note: my tests with a few exons on hg18 chr1 are not definitive, but
certainly indicate a need to check this question carefully. Another
check we could do is to get random Ensembl exon sequences directly
from an official source (presumably Ensembl's website lets you do
this), and to compare with what my seqregion.py returns as sequence
(from hg18) for the same exon IDs. Sigh. I guess yet another test
would be to map the NCBI36 intervals back to intervals in the contig
coordinate system, and then look up the sequence strings for those
contig intervals in the dna table.

You might as well also ask Glenn what's the standard Ensembl way of
getting sequence for a given genomic slice (interval). I was assuming
that one would just get that interval from a standard genome release
file (e.g. hg18) like UCSC and everyone else does, but now I am
starting to wonder... Maybe Ensembl always has to map their genomic
coordinates back to contig coordinates, and then get the sequence
string from their dna table using contig coordinates? This would make
sense if their genomic coordinates are on a non-standard genome
assembly that no one else uses!

-- Chris

On May 16, 2008, at 12:56 PM, Jenny Qing Qian wrote:
> Hi Chris,
>
> I went to the Ensembl FTP site http://www.ensembl.org/info/downloads/ftp_site.html

> . Do you think that we should use the data from here? ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/

C. Titus Brown

unread,

May 16, 2008, 5:42:49 PM5/16/08

to pygr...@googlegroups.com

-> Note: my tests with a few exons on hg18 chr1 are not definitive, but
-> certainly indicate a need to check this question carefully. Another
-> check we could do is to get random Ensembl exon sequences directly
-> from an official source (presumably Ensembl's website lets you do
-> this), and to compare with what my seqregion.py returns as sequence
-> (from hg18) for the same exon IDs. Sigh. I guess yet another test
-> would be to map the NCBI36 intervals back to intervals in the contig
-> coordinate system, and then look up the sequence strings for those
-> contig intervals in the dna table.

It's easy to use EnsMart to extract data in a more obvious form -- could
just do that comparison.

--t

Namshin Kim

unread,

May 16, 2008, 7:37:53 PM5/16/08

to pygr...@googlegroups.com

I think this ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/ is NCBI36 as it says in file names.

Or depending on version, we could search DNAs like this.

ftp://ftp.ensembl.org/pub/release-47/fasta/homo_sapiens/dna/

ftp://ftp.ensembl.org/pub/release-48/fasta/homo_sapiens/dna/

ftp://ftp.ensembl.org/pub/release-49/fasta/homo_sapiens/dna/

Are we using release 49? Then, use last one

Namshin Kim

unread,

May 16, 2008, 7:43:16 PM5/16/08

to pygr...@googlegroups.com

Hmmmm.... Why homo_sapiens_core_47_36i? I think latest one is homo_sapiens_core_49_36k ftp://ftp.ensembl.org/pub/release-49/mysql/homo_sapiens_core_49_36k/

Namshin Kim

Jenny Qing Qian

unread,

May 16, 2008, 7:49:06 PM5/16/08

to pygr...@googlegroups.com

Because in the actual ensembl database system, the latest human core database is homo_sapiens_core_47_36i.

Jenny

Namshin Kim

unread,

May 16, 2008, 7:51:16 PM5/16/08

to pygr...@googlegroups.com

I see. Then, isn't ftp://ftp.ensembl.org/pub/release-47/fasta/homo_sapiens/dna/ the right assembly? It says release-47.

Jenny Qing Qian

unread,

May 16, 2008, 7:59:06 PM5/16/08

to pygr...@googlegroups.com

Not sure. Shouldn't NCBI36 be released earlier than NCBI36.1 (Mar, 2006)?

Namshin Kim

unread,

May 16, 2008, 8:06:03 PM5/16/08

to pygr...@googlegroups.com

Then, what about tracking down the file sizes?

ftp://ftp.ensembl.org/pub/release-45/homo_sapiens_45_36g/data/fasta/dna/ => 05/31/2007

ftp://ftp.ensembl.org/pub/release-46/homo_sapiens_46_36h/data/fasta/dna/ => 08/03/2008

ftp://ftp.ensembl.org/pub/release-47/fasta/homo_sapiens/dna/ => 10/16/2007

I think they are same assembly.

Jenny Qing Qian

unread,

May 16, 2008, 8:10:13 PM5/16/08

to pygr...@googlegroups.com

Is there a Oct, 2005 assembly?

Jenny Qing Qian

unread,

May 16, 2008, 8:18:27 PM5/16/08

to pygr...@googlegroups.com

Both Rob and I think that Ensembl has pruned their archive genome download websites! That's why I can't find a place to download a copy of NCBI36 (Oct, 2005) from this website http://ensembl.genomics.org.cn/Homo_sapiens/index.html.

Jenny

Namshin Kim

unread,

May 16, 2008, 8:23:10 PM5/16/08

to pygr...@googlegroups.com

You are using China mirror site. **ALL** the files are in ftp://ftp.ensembl.org

Jenny Qing Qian

unread,

May 16, 2008, 8:35:57 PM5/16/08

to pygr...@googlegroups.com

Really? How weird! Both Rob and I are in Vancouver, Canada though. Good to know that there is a China mirror site. I thought the Chinese Firewall would be so powerful and it would block everything from the Western world. Google for instance...

Jenny

Namshin Kim

unread,

May 16, 2008, 8:38:52 PM5/16/08

to pygr...@googlegroups.com

Hmm.... I think you are using Chinese version of Web browser.

Jenny Qing Qian

unread,

May 16, 2008, 8:41:53 PM5/16/08

to pygr...@googlegroups.com

Is Iceweasel Chinese version of Web browser?

Rob Kirkpatrick

unread,

Jun 7, 2008, 2:24:19 PM6/7/08

to pygr...@googlegroups.com

So, to work with Ensembl annotations (e.g. exon table), all I had to
do was create a SeqRegion class that will map seq_region_id to the
proper sequence database based on its coord_system_id:

seq_region =
sqlgraph.SQLTable('homo_sapiens_core_47_36i.seq_region',
cursor)
import pygr.Data
hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18() # human genome
srdb = SeqRegion(seq_region, {17:hg18}) # just map
coord_system_id 17 to hg18

Hey Chris,

Just to come back to this for a moment, if we use a class like SeqRegion to map between coord systems, will we need to have a sequence set for each coord_system in pygr.Data? i.e. for supercontigs, we would have something like:

seq_region = sqlgraph.SQLTable('homo_sapiens_core_47_36i.seq_region', cursor)
import pygr.Data

supersource = pygr.Data.Bio.Seq.Genome.HUMAN.somesortofsupercontigsource()
srdb = SeqRegion(seq_region, {15:supersource})

Is that what you had envisioned?

Reply all

Reply to author

Forward