- be able to hook up Ensembl's sequence database table (dnac) as a
pygr sequence database (using pygr.sqlgraph.SQLTable and
pygr.seqdb.DNASQLSequence). That means that if seqDB is your SQLTable
object, you can do things like
seq1 = seqDB[1] # gets DNA sequence with ID 1
print str(seq1[100:200]) # print nucleotide sequence for an interval
in this sequence
- be able to hook up Ensembl's exon database table (exon) as a pygr
SQLTable database. That means if exonDB is your SQLTable object, you
can do things like
e1 = exon[1] # get the exon with ID 1
print e1.seq_region_start, e1.seq_region_strand # print some
coordinate info
- be able to plug both of those into a pygr.seqdb.AnnotationDB
database and get exon annotations out. The main thing is giving it a
sliceAttrDict dictionary that maps its default attribute names to what
Ensembl actually uses in their exon table. This would be something like
aDB = AnnotationDB(exonDB, seqDB,
sliceAttrDict=dict(id='seq_region_id', start='seq_region_start',
stop='seq_region_end', orientation='seq_region_strand'))
Note: I am not sure whether Ensembl's seq_region_strand values are
compatible with what pygr expects (1 for positive orientation, -1 for
negative orientation). It looks like they probably are but I haven't
checked the actual data... If not we'll have to provide some kind of
method to handle this.
Then you should be able to do things like:
e2 = aDB[2] # get annotated exon 2
print str(e2.sequence) # print its actual sequence
- be able to save these to pygr.Data. Then you should be able to pull
them back in to a new Python session as easily as:
import pygr.Data
aDB = pygr.Data.Bio.Annotation.Ensembl.hg18.exon() # assuming you gave
it this name when you saved it...
print str(aDB[2].sequence) # print exon 2's sequence...
- be able to build the inverse mapping from the genome to the
annotation (the annotation itself provides the forward mapping from
the annotation to the genome). To do so, just use the
NLMSA.add_annotation() method; see the pygr cookbook recipes for
examples. Then if exonmap is this NLMSA mapping, you should be able
to do things like
region = seqDB[4][1000000:1100000] # get 100kb of chromosome
for exon in exonmap[region]: # find all exons overlapping this region
print 'exon:', repr(exon.sequence)
- be able to save this inverse mapping to pygr.Data and in
pygr.Data.schema bind it to a genome attribute e.g. "exons". See the
pygr cookbook recipes for examples. Then you should be able to use it
in a new Python session as easily as:
import pygr.Data
seqDB = pygr.Data.Bio.Annotation.Ensemble.hg18()
for exon in seqDB[4][1000000:1100000].exons: # find exons overlapping
this region
print 'exon:', repr(exon.sequence)
To do this, you'll have to learn about the classes mentioned above,
pygr.Data, etc., but I bet you would be able to get this working
pretty fast, and will get a feeling for how easy / difficult it will
be to work with the Ensembl data in pygr...
Yours,
Chris
On May 11, 2008, at 11:12 PM, Jenny Qing Qian wrote:
>
> Hi pygr Masters,
Yes, you could use any ORM that represents the database table as a
dict-like object (whose keys are primary key ID values, and whose
associated values are "row objects" that map the database columns to
attributes of the same name). However, we probably don't need
anything more sophisticated than pygr.sqlgraph.SQLTable, which does
this in just about the simplest possible way:
db = SQLTable('dbname.tablename')
If you don't provide a cursor, it automatically gets a MySQLdb
connection using your ~/.my.cnf file. It analyzes the schema of the
table automatically. It also "plays nice" with pygr.Data and other
aspects of pygr... I also understand the performance tuning options
(e.g. caching; grouping sets of rows to load as a group using a
cluster key; etc.) for this much better than for SqlAlchemy / other
ORMs, since we've had extensive experience using this with very large
tables...
RE: working with ORMs, pygr has a few conventions that would need to
be followed for the ORM to work completely transparently with every
aspect of pygr:
- database objects should have a dict-like interface as described above.
- row objects should have a "db" attribute (pointing to the database
object), and an "id" attribute (primary key value).
- to work with pygr.Data, the database object must be picklable (this
is easy, but I bet many ORMs have not addressed persistence in this
Pythonic and truly general way). Also, if the database provides an
__invert__() method, pygr will automatically use it to provide the
reverse mapping, for purposes of schema binding...
- database objects should also have a writable attribute "itemClass"
that allows the user to specify what class should be used for row
objects. This allows row objects to model any desired complex
datatype (e.g. a genome sequence; a fancy kind of annotation etc.).
If the itemClass is itself sliceable, it should use the db attribute
"itemSliceClass" as the class for the results of any such slicing
operation. E.g. slicing a sequence object should yield a sequence-
interval object... itemClass and itemSliceClass are also important
for pygr.Data, which needs to be able to automatically bind schema
attributes (as Python descriptors) to these classes in a transparent
way...
Note that by default, SQLTable uses the sqlgraph.TupleO class as the
itemClass, to avoid actually storing a __dict__ for every row object
(saves lots of space when loading lots of rows). TupleO stores each
row as a Python tuple, but provides an attribute interface based on
the table schema automatically analyzed by SQLTable.
-- Chris
Note: I am not sure whether Ensembl's seq_region_strand values are
compatible with what pygr expects (1 for positive orientation, -1 for
negative orientation). It looks like they probably are but I haven't
checked the actual data... If not we'll have to provide some kind of
method to handle this.
A few symbols and color codes are still mystery to me:
1) Within an entity, what do the orange and turquoise diamonds, under
the yellow-folder, represent, respectively?
2) Between two entities, what does a diamond half filled and half open
represent? What are the meanings of the various orientations of this
type of diamonds?
On May 13, 2008, at 2:35 PM, Rob Kirkpatrick wrote:
> Here's some of the basic workings with the Ensembl db queried by
> Pygr (two of heads on our monster). Note that I had to use the dna
> table as the dnac tables are all empty:
>
>
> from pygr.seqdb import BlastDB
> from pygr.sqlgraph import SQLTable
> from pygr import seqdb
> from pygr import cnestedlist
> import MySQLdb
> conn = MySQLdb.connect(host="ensembl01", user="ensembl",
> passwd="ensembl")
> c = conn.cursor()
>
> #get the dna sequence object
> dna_sliceDB = SQLTable('homo_sapiens_core_49_36k.dna', c)
> dna1 = dna_sliceDB[143909]
> # dna1.data[0] is the record id - 143909 in this case
> # dna1.data[1] is the sequence itself
> # dna1.db is 'homo_sapiens_core_49_36k.dna'
>
Instead of accessing the data as generic row objects, let's make them
into real sequence objects:
from pygr.seqdb import DNASQLSequence
class EnsemblDNA(DNASQLSequence): # this is just a performance
optimization...
def __len__(self):
return self._select('length(sequence)') # no need to
download whole sequence to get its length
dnaDB = SQLTable('homo_sapiens_core_49_36k.dna', c,
itemClass=EnsemblDNA, attrAlias=dict(seq='sequence'))
dna1 = dnaDB[143909] # not just a tuple, but a EnsemblDNA sequence
object
print len(dna1) # note: this doesn't actually retrieve the sequence
string from the db, just uses SQL to get integer length
ival = dna1[20:50] # not just a string, but a sequence interval object
print repr(ival), str(ival)
rc = -ival # get sequence object representing its opposite strand
(reverse complement)
rc_bit = rc[-20:] # of course any sequence interval is sliceable...
yielding another sequence interval object
print repr(rc_bit), str(rc_bit), len(rc_bit)
> dna_seq = dna1.data[1]
> #print dna_seq[0:5] would just return the first bases of the
> sequence - CACCC
>
> #get the exon (annotation) object
> exon_sliceDB = SQLTable('homo_sapiens_core_49_36k.exon', c)
> exon1 = exon_sliceDB[1]
> #print str(exon1.seq_region_start) would give 19397
> #print str(exon1.seq_region_end) would give 19669
> #print str(exon1.seq_region_strand) would give -1
>
>
> exon1_seq = dna_seq[19397:19669]
> exon1_seq_2 = dna_seq[exon1.seq_region_start:exon1.seq_region_end]
>
> # both exon1_seq and exon1_seq_2 would return the same thing:
> TCTTATAATTGTTTTTGTTCCTGTAAAATTTTTAGTAATATCCTTTTTATTTCTGATTTTAGTAATTTGAATCTTCTCTTTTTTCTTAGCCAATCTAGCTAAAAGTTTGTCAATTTGGTTGATCTTTTTAAAGAACCAACTTTTGGTTTCATTGAAGGCTTTTTTTCCTATGATTTTTTAAAATTCCTTGTTTTATTTATTTCTATTGTAACCTTTATTATGTCCTCCCTTTTGGTAGCTTTGGGTTTAGTTTTCTCTTCTTTTTCTAGTTC
>
OK, let's turn this into a real annotation database in the Pygr sense
of the word:
from pygr.seqdb import AnnotationDB
annoDB = AnnotationDB(exon_sliceDB, dnaDB,
sliceAttrDict=dict(id='seq_region_id', start='seq_region_start',
stop='seq_region_end', orientation='seq_region_strand'))
exon1 = annoDB[1] # get annotation object
ival = exon1.sequence # get associated sequence object
print repr(ival), str(ival) # print actual TCTTA... string
> Chris, just out of curiosity, dna1 in the above code is a
> pygr.sqlgraph.foo object. Is that a temp thing where we'll get a
> better name for it sometime or is that a placeholder for strange
> object type?
SQLTable automatically subclasses the itemClass (by default, TupleO)
to bind its attribute dictionary at the **class** level (that is how
we avoid having to store a __dict__ for each row instance in TupleO,
for example). Because I was lazy, that automatically generated
subclass is just getting called "foo" by default. We can change that!
> This might be some of the non-trivial mapping that Glenn mentioned
> (or I'm just doing something wrong). When I run that code on my end
> I get:
>
> KeyError: 'homo_sapiens_core_49_36k.dna[226034].length(sequence) not
> found, or not unique'
>
> The exon table has a number of entries that have
> seq_region_id=226034 but the dna table has no entries with that id.
> Either the mapping is done by routing through other tables or...
>
Yikes, foreign key values that don't exist as primary keys! Are there
large numbers of seq_region_id values in Ensembl's exon table that
don't exist in the dna table, or is this an infrequent problem? I'm
afraid I never looked at the ER diagram; does it truly indicate
exon.seq_region_id is a foreign key with a many-to-one relation to
dna.seq_region_id, as you would expect?
Are you absolutely sure that the data uploaded into your database is
complete? It might be worthwhile to search for the same seq_region_id
in Ensembl's public MySQL server, just to check... If 226034 is also
missing in Ensembl's database, that really makes me question Ensembl's
schema: either this missing key is a bug, or there must be some wacky
intermediate mapping table that maps "seq_region_id" to
"seq_region_id" (!) I guess we can bug Glenn if we can't figure this
out easily...
Meanwhile, does the AnnotationDB example code I sent you work for
those seq_region_id rows that DO exist?
> I presume pygr is sniffing out the id colums and using them to do
> joins? Would that explain the error above?
Yes, SQLTable automatically analyzes the column indexing for the
table, to identify the primary key. Check the dnaDB.primary_key
attribute to see if it is correct. Also check that a simple query
like dnaDB[2] gets the right row as confirmed by your own SQL
queries... I've never had a problem with this.
The error message simply indicates either that no row with that ID was
found, or that multiple rows matched (which according to the Ensembl
schema is impossible, right? dna.seq_region_id is a primary key, I
think).
ALTER table dna ADD FOREIGN KEY (seq_region_id) REFERENCES seq_region(seq_region_id);This should lead us somewhere, but this implies the seq_region table is the linker. Given that, the dna table still doesn't have an entry for seq_region_id=226034. I'm wondering if the dna table actually stores the sequence that the rest of the db references...
ALTER table exon ADD FOREIGN KEY (seq_region_id) REFERENCES seq_region(seq_region_id);
ALTER table seq_region ADD FOREIGN KEY (coord_system_id) REFERENCES coord_system(coord_system_id);
Jenny, it would be really helpful if you could dig up answers to the
following questions:
- what version of the genome is this Ensembl annotation dataset
attached to (e.g. hg18)?
- what Ensembl tables store such genomic coordinates?
- what additional coordinate systems does Ensembl use? (I guess
seq_region is an example...) Do we understand what Ensembl uses them
for, instead of just expressing everything in terms of genomic
coordinates?
- are all these coordinate systems mappable to genomic coordinates?
If so, how?
Thanks!
-- Chris
This class uses ActiveRecord to access data in the Ensembl database. See the general documentation of the Ensembl module for more information on what this means and what methods are available."
Two virtual coordinate systems exist for every species:
This class uses ActiveRecord to access data in the Ensembl database. See the general documentation of the Ensembl module for more information on what this means and what methods are available."
Let me know if that makes sense to you.
mysql> select * from meta_coord;
+--------------------------+-----------------+------------+
| table_name | coord_system_id | max_length |
+--------------------------+-----------------+------------+
| assembly_exception | 17 | 4731698 |
| oligo_feature | 17 | 25 |
| ditag_feature | 15 | 25 |
| transcript | 15 | 209805 |
| dna_align_feature | 15 | 209805 |
| protein_align_feature | 15 | 207164 |
| exon | 15 | 4484 |
| gene | 15 | 209805 |
| transcript | 17 | 2304118 |
| protein_align_feature | 17 | 2298737 |
| dna_align_feature | 17 | 2298740 |
| exon | 17 | 18172 |
| gene | 17 | 2304118 |
| dna_align_feature | 4 | 21324 |
| protein_align_feature | 4 | 21324 |
| marker_feature | 15 | 1549 |
| marker_feature | 17 | 5750 |
| repeat_feature | 15 | 2717 |
| repeat_feature | 17 | 50000 |
| simple_feature | 4 | 44058 |
| prediction_exon | 4 | 21324 |
| prediction_transcript | 4 | 329127 |
| repeat_feature | 4 | 160602 |
| oligo_feature | 15 | 25 |
| ditag_feature | 17 | 25 |
| misc_feature | 17 | 100530253 |
| regulatory_feature | 17 | 131013 |
| regulatory_search_region | 17 | 4998 |
| density_feature | 17 | 1648332 |
| density_feature | 15 | 8702 |
| karyotype | 17 | 30639485 |
+--------------------------+-----------------+------------+
31 rows in set (0.17 sec)
The dna table turns out to only store entries for coord_system_id=4
("contig"). The only tables that seem to have annotations on contigs
(coord_system_id=4) are simple_feature, prediction_transcript,
prediction_exon, and dna_align_feature.
So, to work with Ensembl annotations (e.g. exon table), all I had to
do was create a SeqRegion class that will map seq_region_id to the
proper sequence database based on its coord_system_id:
seq_region =
sqlgraph.SQLTable('homo_sapiens_core_47_36i.seq_region',
cursor)
import pygr.Data
hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18() # human genome
srdb = SeqRegion(seq_region, {17:hg18}) # just map
coord_system_id 17 to hg18
Now it will return the right sequence object for any seq_region_id (as
long as it's in NCBI36):
chr1 = srdb[226034] # this is chromosome 1
print len(chr1) # 247249719
Next create an interface to the exon table:
exonSliceDB = sqlgraph.SQLTable('homo_sapiens_core_47_36i.exon',
cursor, itemClass=EnsemblRow)
e = exonSliceDB[73777]
print e.start # 444865
EnsemblRow only does one thing: convert the interval coords to Python-
style (zero-offset) coordinates.
Finally, let's create an annotation database interface for exons:
from pygr.seqdb import AnnotationDB
annoDB = AnnotationDB(exonSliceDB, srdb, sliceAttrDict=
dict(id='seq_region_id',stop='seq_region_end',
orientation='seq_region_strand'))
e = annoDB[73777] # get an exon annotation
print e.phase, e.end_phase # 1 0
s = e.sequence
print str(s)
#'GCAAGCTGTGGACAAGAAGTATGAAGGTCGCTTACAGCATTCTACACAAATTAGGCACAAAGCAGGAACCCATGGTCCGGCCTGGAGATAGG
'
print str(s.before()[-10:]) # 'GTATTCATAG' last 2 nt is splice-site
print str(s.after()[:10]) # 'GTAAGTGCAA' first 2 nt is splice-site
I'll upload the Python file that contains the SeqRegion and EnsemblRow
classes, and this test code, to the Groups page.
For just a few tables (simple_feature, prediction_transcript,
prediction_exon, and dna_align_feature), we'll have to convert from
the coordinate system (contig) to NCBI36 genomic coords via the
assembly table.
-- Chris
On May 15, 2008, at 3:18 PM, Rob Kirkpatrick wrote:
> I had sent this previously to Jenny but I thought I would send it on
> to pygr-dev. There was a dude that put together a Ruby Ensembl API
> last year or so and he forged a path that we can possibly follow (http://saaientist.blogspot.com/2007/08/ruby-api-to-ensembl-database.html
> ). Shortly after that, another dude wrote some online docs for that
mapper = EnsemblMapper(annoDB, srdb) # exon annotation DB , seq
region DB
ival = chr1[:100000] # get a piece of hg18
print mapper[ival] # find exons in this interval
print 'neg:', mapper[-ival] # forces annotations to opposite ori...
This will take care of our mapping needs for everything except the 4
tables that are stored in contig coordinates, which will require an
intermediate step using the assembly table as I mentioned in my
previous email.
I uploaded the new version to the Google Groups site.
-- Chris
-- Chris
-- Chris
Note: my tests with a few exons on hg18 chr1 are not definitive, but
certainly indicate a need to check this question carefully. Another
check we could do is to get random Ensembl exon sequences directly
from an official source (presumably Ensembl's website lets you do
this), and to compare with what my seqregion.py returns as sequence
(from hg18) for the same exon IDs. Sigh. I guess yet another test
would be to map the NCBI36 intervals back to intervals in the contig
coordinate system, and then look up the sequence strings for those
contig intervals in the dna table.
You might as well also ask Glenn what's the standard Ensembl way of
getting sequence for a given genomic slice (interval). I was assuming
that one would just get that interval from a standard genome release
file (e.g. hg18) like UCSC and everyone else does, but now I am
starting to wonder... Maybe Ensembl always has to map their genomic
coordinates back to contig coordinates, and then get the sequence
string from their dna table using contig coordinates? This would make
sense if their genomic coordinates are on a non-standard genome
assembly that no one else uses!
-- Chris
On May 16, 2008, at 12:56 PM, Jenny Qing Qian wrote:
> Hi Chris,
>
> I went to the Ensembl FTP site http://www.ensembl.org/info/downloads/ftp_site.html
> . Do you think that we should use the data from here? ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
It's easy to use EnsMart to extract data in a more obvious form -- could
just do that comparison.
--t
So, to work with Ensembl annotations (e.g. exon table), all I had to
do was create a SeqRegion class that will map seq_region_id to the
proper sequence database based on its coord_system_id:
seq_region =
sqlgraph.SQLTable('homo_sapiens_core_47_36i.seq_region',
cursor)
import pygr.Data
hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18() # human genome
srdb = SeqRegion(seq_region, {17:hg18}) # just map
coord_system_id 17 to hg18