Issue: Move from shelve to sqlite3 storage?


Istvan Albert

Dec 13, 2008, 2:46:27 PM
to pygr-dev
Hello,

regarding issue

http://code.google.com/p/pygr/issues/detail?id=52

I'd like to comment here, as this forum is probably better suited for
a back-and-forth discussion. When you mention the poor scalability of
bsddb, is there some test code that is demonstrably slow (unscalable)
with the bsddb shelve? The data could be automatically generated, not
necessarily a real dataset.

I would be willing to take such a piece of code and test it out with
an sqlite backend. That way we could verify that sqlite is indeed a
better solution and that it won't produce the same type of problems
(though it may well produce others, of course).

Istvan

Christopher Lee

Dec 15, 2008, 12:31:10 PM
to pygr...@googlegroups.com, n...@rna.kr
Hi Istvan,
thanks for your interest in this! Namshin found that the default
bsddb index used by shelve (hash) became slow on large datasets
(millions of records). He reported that the btree index had much
better scalability, so we switched over most usages of shelve in Pygr
to use btree by default instead of hash (annoying side-effect: you
can't open these Pygr files with the usual shelve.open() command).
Here are a couple of Namshin's emails, giving some performance numbers
and test code. His SNP dataset had 10 million records and took about
an hour. I believe his second dataset (the U87 fasta file example
that he uses below) contained about 95 million records, each with a
key (ID) consisting of a string of about 20-30 characters, and a value
consisting of a string of about 36 characters (Solexa sequence data).
You could simulate all this with random strings, I think. You could
reduce the size of the test to ten million records to make the test
run faster...
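
(For anyone who wants to try this, here is a minimal sketch of such a
synthetic test, assuming random keys of roughly 25 characters and
36-character values as described above; the record count, key format and
file name are placeholders rather than Namshin's actual data.)

#!/usr/bin/env python
# Sketch only: build a synthetic btree-backed test file from random strings.
import bsddb, random, string

NUM_RECORDS = 10000000            # reduce this to make the test run faster
VALUE_LEN = 36

def random_string(n):
    return ''.join(random.choice(string.letters) for i in xrange(n))

db = bsddb.btopen('simulated.btree', 'c')   # btree index, not the default hash
for i in xrange(NUM_RECORDS):
    key = '%s_%d' % (random_string(18), i)  # the counter keeps keys unique
    db[key] = random_string(VALUE_LEN)
db.close()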

-- Chris

On Nov. 16, 2007, Namshin Kim wrote:
Hi Chris,

Here is a simple comparison of building times, Hash vs. B-Tree.

Data: 10 million SNP 126 Database
Building Pygr Annotation Database

1. Hash: 16 hours (960 min). It was running on lln4, but was killed by a
"Bus error" just after building the shelve. The NLMSA was not built.
2. B-Tree: 1 hour 20 min (80 min).

Yours,
Namshin Kim


On Nov. 6, 2007, Namshin Kim wrote:
B-Tree is faster than the Hash algorithm, and MySQL (MyISAM engine) uses
B-Tree by default. It took 10 hours to finish building.

#!/usr/bin/env python

import sys, os, string, bsddb

args = sys.argv
if len(args) != 2:
    print 'Usage:', args[0], 'dummy'
    sys.exit()

U87 = bsddb.btopen('U87.btree', 'c')   # btree index, not the default hash

fahead = ''
for lines in open('U87', 'r').xreadlines():
    if lines[0] == '>':
        fahead = lines[1:].strip()
    else:
        U87[fahead] = lines.strip()

U87.close()

Yours,
Namshin Kim

Istvan Albert

Dec 16, 2008, 9:44:30 AM
to pygr-dev


On Dec 15, 12:31 pm, Christopher Lee <l...@chem.ucla.edu> wrote:

> Here are a couple of Namshin's emails, giving some performance numbers  
> and test code.  His SNP dataset had 10 million records and took about  

Hi Chris,

thanks for the emails. I'll code up some examples with sqlite as the
shelve backend and we'll see how it works.

Istvan

Istvan Albert

Dec 16, 2008, 3:30:43 PM
to pygr-dev

> You could simulate all this with random strings, I think. You could
> reduce the size of the test to ten million records to make the test
> run faster...

I have now run comparisons of several implementations of file-based
dictionary storage; see the stats below. btopen and hashopen are the
functions that ship with bsddb; sq_dict_open comes from an sqlite-based
shelve implementation by Josiah Carlson (see
http://bugs.python.org/file11470/sq_dict.py ), while indexed_open is my
own implementation, which differs from the previous one by having
explicit index management: one needs to call a create_index method
after loading the data. I expected that not having an index would speed
up initial data loading. I have also experimented with a different
loading strategy for this latter option (see fastload); it loads data
from iterators via the executemany() method. In total each test loads
1 million 100-character strings.
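
(To make the setup concrete, here is a rough sketch of what an
indexed_open-style backend could look like; the class, table and column
names are my own illustration, not the actual benchmark code, which is
linked later in this thread.)

import sqlite3

class IndexedDict(object):
    """Sketch of an sqlite-backed string dictionary with deferred indexing."""

    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute('CREATE TABLE IF NOT EXISTS shelf (key TEXT, value TEXT)')

    def fastload(self, items):
        # bulk-insert (key, value) pairs straight from an iterator
        self.conn.executemany('INSERT INTO shelf (key, value) VALUES (?, ?)', items)
        self.conn.commit()

    def create_index(self):
        # called explicitly after loading, so inserts are not slowed by index updates
        self.conn.execute('CREATE UNIQUE INDEX IF NOT EXISTS shelf_key ON shelf (key)')
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute('SELECT value FROM shelf WHERE key=?',
                                (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # relies on the unique index above so that REPLACE overwrites existing keys
        self.conn.execute('REPLACE INTO shelf (key, value) VALUES (?, ?)',
                          (key, value))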

In a nutshell, it does not seem likely that the btree-based bsddb can
be beaten at any task. Explicit index management and fast loading can
speed up data insertion for sqlite to 30 sec instead of 86, but that is
still only about the same as the 31 sec with btopen.

The picture is less promising for the 'reverse_iter' and 'update'
functions, which are 2 to 10 times slower. These functions visit all
keys in the database and access each one individually. The
'forward_iter' function walks the keys in the order the database
returns them, whereas 'reverse_iter' iterates over them in reverse
order. Access times for random items should fall somewhere between the
limits set by the 'forward_iter' and 'reverse_iter' methods.
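
(A rough reconstruction of what the iteration tests might look like; this
is my own illustration, not the benchmark code itself.)

def forward_iter(db):
    # visit every key in the order the backend returns them
    for key in db.keys():
        value = db[key]

def reverse_iter(db):
    # visit every key in reverse order, a less favourable access pattern
    for key in reversed(db.keys()):
        value = db[key]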

I believe that the sqlite-based implementation could be sped up
substantially whenever many records need to be updated or retrieved at
once. It is not clear whether it could be made faster than the
btree-based bsddb, though.

Results:

elapsed= 14.2s, test=loading, func=btopen
elapsed= 43.9s, test=loading, func=hashopen
elapsed= 32.4s, test=loading, func=sq_dict_open
elapsed= 15.4s, test=loading, func=indexed_open
----------
elapsed= 15.6s, test=fastloading, func=btopen
elapsed= 48.0s, test=fastloading, func=hashopen
elapsed= 33.5s, test=fastloading, func=sq_dict_open
elapsed= 9.1s, test=fastloading, func=indexed_open
----------
elapsed= 0.0s, test=indexing, func=btopen
elapsed= 0.0s, test=indexing, func=hashopen
elapsed= 0.0s, test=indexing, func=sq_dict_open
elapsed= 20.4s, test=indexing, func=indexed_open
----------
elapsed= 22.0s, test=forward_iter, func=btopen
elapsed= 26.4s, test=forward_iter, func=hashopen
elapsed= 42.2s, test=forward_iter, func=sq_dict_open
elapsed= 34.2s, test=forward_iter, func=indexed_open
----------
elapsed= 13.8s, test=reverse_iter, func=btopen
elapsed= 35.0s, test=reverse_iter, func=hashopen
elapsed=180.3s, test=reverse_iter, func=sq_dict_open
elapsed=171.3s, test=reverse_iter, func=indexed_open
----------
elapsed= 38.2s, test=update, func=btopen
elapsed= 76.5s, test=update, func=hashopen
elapsed=104.2s, test=update, func=sq_dict_open
elapsed= 91.9s, test=update, func=indexed_open
----------

Christopher Lee

Dec 16, 2008, 5:26:50 PM
to pygr...@googlegroups.com
Thanks, Istvan! A few questions:
- could you post your test code so we can understand exactly what each
test means?

- what's the total time for loading and initializing via
indexed_open(): is it 15.4 s for standard loading, vs. 9.1 + 20.4 s
for fastloading? Or does standard loading also require the indexing
step, for a total of 15.4 + 20.4 s?

- what exactly does your update test do? 1 million random updates on
existing keys? Add new keys?

- by sq_dict_open, do you mean sq_dict.open()? There is no
"sq_dict_open" function in the sq_dict.py file that you linked to...

Thanks again for doing all these tests!

-- Chris

Namshin Kim

Dec 17, 2008, 1:43:57 AM
to pygr...@googlegroups.com
FYI.
 
test_python252.shelve: Berkeley DB (Hash, version 8, native byte-order) # PYTHON 2.5.2
test_python30.shelve:  GNU dbm 1.x or ndbm database, little endian # PYTHON 3.0
I just checked Python 3.0 and found out that bsddb has been retired there and replaced with GNU dbm as the default backend. We may need to test GNU dbm.
 
Namshin Kim

Istvan Albert

Dec 17, 2008, 10:28:02 AM
to pygr-dev


On Dec 16, 5:26 pm, Christopher Lee <l...@chem.ucla.edu> wrote:
> Thanks, Istvan!  A few questions:
> - could you post your test code so we can understand exactly what each  
> test means?

I have added the code here:

http://pygr.googlecode.com/svn/contrib/benchmark/

Based on James Taylor's suggestion I have added another implementation
using

http://pypi.python.org/pypi/python-cdb/0.32

see the results.txt there for a run on Linux; the scores look a lot
closer on Linux, so I think the Windows sqlite implementation must be
slower than the Linux one.

> - what's the total time for loading and initializing via  
> indexed_open(): is it 15.4 s for standard loading, vs. 9.1 + 20.4 s  
> for fastloading?  Or does standard loading also require the indexing  
> step, for a total of 15.4 + 20.4 s?

As for your questions:

some of those tests were a little confusing because I made them up as
I went. There is now only one loading test, which uses the faster
loading path when possible. I also renamed some functions, because the
timer extracts the function name automatically; this way we always know
what is being tested.
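
(For illustration, a timer along these lines would produce the labelled
output shown earlier; this is a sketch of the idea, not the actual helper
used in the benchmark.)

import time

def timed(test, func, *args):
    # run func(*args), report the elapsed time, and label the result with
    # the function's own name so the output is self-describing
    start = time.time()
    func(*args)
    elapsed = time.time() - start
    print 'elapsed=%6.1fs, test=%s, func=%s' % (elapsed, test, func.__name__)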

Istvan

Christopher Lee

Dec 18, 2008, 5:16:32 PM
to pygr...@googlegroups.com
Hi Istvan,
Namshin suggested re-running your tests with 10 million records,
because he observed poor scalability for hash indexes going from 1
million to 10 million records... It would be interesting to see how
sqlite3 holds up at this scale.

Yours,

Chris

Istvan Albert

Dec 19, 2008, 7:42:42 PM
to pygr-dev
I have had some connectivity problems with the server that I can
liberally hammer away at, so I will do this once those are sorted out.
I have added gdbm (GNU dbm) to the list of methods being tested, so
there will be one more data point.

Istvan

Istvan Albert

Dec 23, 2008, 3:40:08 PM
to pygr-dev
Here is a test run for 10 million rows; I've also added a GNU dbm test
(called dbm_open).

http://pygr.googlecode.com/svn/contrib/benchmark/results.txt

Istvan

Istvan Albert

Dec 23, 2008, 3:44:59 PM
to pygr-dev
And yes, I forgot to mention: as Namshin observed, the hash indices
show atrocious scaling for both hashopen and GNU dbm.

Marek Szuba

Dec 29, 2008, 6:11:36 PM
to pygr...@googlegroups.com
On Tue, 23 Dec 2008 12:40:08 -0800 (PST)
Istvan Albert <istvan...@gmail.com> wrote:

> Here is a test run for 10 million rows, I've also added a gnu dbm test
> (called dbm_open).

Huh, I had completely forgotten about CDB (it's *the* CDB from Dan
Bernstein, right?) despite actually using it on my system... From these
results, it would seem this is the way to go - not only is it faster
than even btree-based bsddb, its performance is actually consistent
across the different benchmarks (which sadly isn't the case for sqlite
and gdbm... Some of these results look AWFUL).

Would you be able to see how python-cdb works with Python 3.0, Istvan?

--
MS

Istvan Albert

Dec 30, 2008, 7:49:32 AM
to pygr-dev


On Dec 29, 6:11 pm, Marek Szuba <mare...@gmail.com> wrote:
>
> gdbm... Some of these results look AWFUL).
>
> Would you be able to see how python-cdb works with Python 3.0, Istvan?

Disappointingly, Python 3.0 'vanilla' IO read/write is orders(!) of
magnitude slower than that of the Python 2.* branch. IMO this makes
Python 3.0 unsuited to *any* demanding application (data processing or
web), and it will remain so until the entire IO layer is rewritten.

http://bugs.python.org/issue4565

While I have no insight into the development priorities, it might be a
long time (years?) until they get all these issues ironed out.

Istvan

C. Titus Brown

Dec 30, 2008, 11:04:18 PM
to pygr...@googlegroups.com
On Tue, Dec 30, 2008 at 04:49:32AM -0800, Istvan Albert wrote:
-> On Dec 29, 6:11 pm, Marek Szuba <mare...@gmail.com> wrote:
-> >
-> > gdbm... Some of these results look AWFUL).
-> >
-> > Would you be able to see how python-cdb works with Python 3.0, Istvan?
->
-> Disappointingly, Python 3.0 'vanilla' IO read/write is orders(!) of
-> magnitude slower than that of the Python 2.* branch. IMO this makes
-> Python 3.0 unsuited to *any* demanding application (data processing
-> or web), and it will remain so until the entire IO layer is
-> rewritten.
->
-> http://bugs.python.org/issue4565
->
-> While I have no insight into the development priorities, it might be
-> a long time (years?) until they get all these issues ironed out.

Bet on months...

Seriously, though, I think it's premature to think much about Python
3.0.

First, Python 2.x isn't going to go out of business anytime soon. So
there's no reason to move on.

Second, neither Pyrex nor Cython has 3.x support yet. So we couldn't
do it even if we wanted to.

Third, there's no compelling reason to transition to 3.x. No
performance plusses, no big development community, etc.

Fourth, the tools for transitioning from Python 2.x to 3.x will be
improving dramatically over the next year. So we might as well wait.

Fifth, the suggested "path" to transitioning to 3.x is to engineer
complete code coverage, then use an automated tool to rewrite the source
code, then manually adjust. We are very far from complete code
coverage!

--

Even apart from those issues re the 3.0 transition, using something like
python-cdb raises a completely different set of issues from the
bsddb/sqlite discussion. Because it's a 3rd-party package that's not
included with Python, we could not rely on it being available; we'd have
to distribute binaries with pygr if we wanted people to be able to use
pygr "out of the box". This is a significant investment of time and
resources for a project the size of pygr!

On the flip side, using a 3rd-party package alleviates the problems
associated with python-dev's unwillingness to ship a scalable data store
with Python 2.x or 3.x!

These issues completely drown out the problem of making python-cdb py3k
compatible, which would take one of us less than a day to do...

cheers,
--titus
--
C. Titus Brown, c...@msu.edu

Russell

Dec 31, 2008, 6:32:45 AM
to pygr-dev

> Even apart from those issues re the 3.0 transition, using something like
> python-cdb raises a completely different set of issues from the
> bsddb/sqlite discussion.  Because it's a 3rd-party package that's not
> included with Python, we could not rely on it being available; we'd have
> to distribute binaries with pygr if we wanted people to be able to use
> pygr "out of the box".  This is a significant investment of time and
> resources for a project the size of pygr!

I can't imagine how a project like pygr could possibly avoid having
dependencies on external modules. Maybe Debian has spoiled me, but
that doesn't seem like a big deal. Certainly, one should try to keep
the number of dependencies under control.

I'm not sure who would demand that pygr work out-of-the-box. Surely
anyone who is about to do a bunch of genome analysis can be expected
to spend a few minutes installing a few sensible module dependencies?

Russell

Istvan Albert

Dec 31, 2008, 8:42:52 AM
to pygr-dev
Hello Russell,

> I'm not sure who would demand that pygr work out-of-the-box. Surely
> anyone who is about to do a bunch of genome analysis can be expected
> to spend a few minutes installing a few sensible module dependencies?

I think the issue is not so much working out of the box as working
reliably across multiple platforms. For example, python-cdb would not
work out of the box with Python 2.5: there was an API change that
needed to be accommodated (I found the solution in a blog post). It
also will not work on Windows.

What we could do instead is make the back end swappable; the interface
for an on-disk dictionary is fairly simple, so it does not seem like a
big undertaking to support multiple options (see the sketch below).
This would allow people to champion their favorite backend without
locking out others.
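
(For illustration, a minimal version of such a swappable interface might
look like the following; the class and method names are mine, not pygr's.)

class DiskDictBackend(object):
    """Minimal interface an on-disk dictionary backend would need to provide.
    Concrete subclasses could wrap bsddb.btopen, sqlite3, gdbm, cdb, ..."""
    def __getitem__(self, key):
        raise NotImplementedError
    def __setitem__(self, key, value):
        raise NotImplementedError
    def __iter__(self):
        raise NotImplementedError
    def close(self):
        raise NotImplementedError

class BtreeBackend(DiskDictBackend):
    def __init__(self, filename):
        import bsddb
        self.db = bsddb.btopen(filename, 'c')
    def __getitem__(self, key):
        return self.db[key]
    def __setitem__(self, key, value):
        self.db[key] = value
    def __iter__(self):
        return iter(self.db.keys())
    def close(self):
        self.db.close()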

Istvan

C. Titus Brown

Dec 31, 2008, 1:07:24 PM
to pygr...@googlegroups.com
On Wed, Dec 31, 2008 at 05:42:52AM -0800, Istvan Albert wrote:
-> Hello Russell,
->
-> > I'm not sure who would demand that pygr work out-of-the-box. Surely
-> > anyone who is about to do a bunch of genome analysis can be expected
-> > to spend a few minutes installing a few sensible module dependencies?
->
-> I think the issue is not so much working out of the box as working
-> reliably across multiple platforms. For example, python-cdb would not
-> work out of the box with Python 2.5: there was an API change that
-> needed to be accommodated (I found the solution in a blog post). It
-> also will not work on Windows.

Cross-platform functionality is definitely one big issue. I do think,
though, that working "out of the box" is an important goal. From
working support for many years, I can tell you that if you want people
to even *look* at your software, you need to make it as simple and easy
to install as possible. Every additional install step loses you a big
chunk of potential users.

We can expect people to have a working compiler on UNIX machines, but
that's about it; we really need to have Windows binaries available too.

Istvan, are you saying that cdb doesn't work on Windows??

-> What we could do instead is make the back end swappable; the interface
-> for an on-disk dictionary is fairly simple, so it does not seem like a
-> big undertaking to support multiple options. This would allow people
-> to champion their favorite backend without locking out others.

Right! Chris has regularly pointed out that backwards compatibility is
important to him and other users: we want to be able to load our
databases created with Python 2.5 into pygr running under Python 2.6,
and maybe even 3.x. That should be possible with a "third party" db like
cdb.

Istvan Albert

Jan 1, 2009, 1:57:07 PM
to pygr-dev


On Dec 31 2008, 1:07 pm, "C. Titus Brown" <c...@msu.edu> wrote:

> Istvan, are you saying that cdb doesn't work on Windows??

That's what the installation page says.

http://cr.yp.to/cdb/install.html

It is not clear whether this is because the original author was not
interested in porting it to Windows or because it was built on
Unix-specific functionality.

Istvan

James Taylor

Jan 1, 2009, 3:02:26 PM
to pygr...@googlegroups.com
The Python module should work fine on Windows -- it doesn't require
the original code -- and there are also other implementations
(tinycdb) that work on Windows.

CDB is just a simple file format for on-disk hash tables:

http://cr.yp.to/cdb/cdb.txt

It is not too hard to implement, and the existing code is public
domain, so if you want to avoid dependency problems, just internalize
it.
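
(For readers who haven't used it, a small usage sketch, assuming the
cdbmake/init interface of the python-cdb package linked earlier; exact
method names may differ between versions.)

import cdb

# Build a constant database in one pass; cdb files are written once, then read-only.
maker = cdb.cdbmake('records.cdb', 'records.cdb.tmp')
maker.add('chr1:12345', 'A/G')
maker.add('chr2:67890', 'C/T')
maker.finish()

# Read it back with ordinary dictionary-style lookups.
db = cdb.init('records.cdb')
print db['chr1:12345']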

Istvan Albert

Jan 1, 2009, 5:28:30 PM
to pygr-dev


On Jan 1, 3:02 pm, James Taylor <ja...@jamestaylor.org> wrote:

> The python module should work fine on windows -- it doesn't require  
> the original code

I agree that it should; alas, it won't, because it was not written in
portable C. The code that python-cdb includes is a simplified version
of the original that depends on several Unix-specific headers:

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <unistd.h>

Istvan

James Taylor

Jan 1, 2009, 7:02:06 PM
to pygr...@googlegroups.com
Oi... Windows. It looks like mmap is the biggest problem, but the code
has fallbacks for when mmap fails, so an "#ifdef HAVE_MMAP" in the
right places should be enough. Other than that I think it is just
large-file support (lseek, off_t); is there a portable way to do that
(other than POSIX, of course, which is what this code already uses)?
Worst case, some more preprocessor stuff.

Anyway... for someone with a few hours and a Windows box and compiler,
this should be easy enough to make work on both Windows and POSIX.
Whether it is worth the trouble is another question entirely...

C. Titus Brown

Jan 6, 2009, 8:15:51 PM
to pygr...@googlegroups.com
To add to the discussion, I got some interesting comments on this blog
post:

http://ivory.idyll.org/blog/jan-09/lazyweb-worm-options-in-python.html

I think we need to do some more benchmarking; Alex, is this something
you'd be interested in doing, perhaps by expanding Istvan's benchmarking
code?

One interesting option is to move to an interface like 'shove', which
supports many different backends:

http://pypi.python.org/pypi/shove
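
(For context, shove exposes an ordinary dictionary interface over a
backend chosen by URI, roughly along these lines; the 'file://' URI is
an assumption about which backends are available.)

from shove import Shove

# the backend is picked by URI; the dict-like interface stays the same regardless
store = Shove('file://benchmark_store')
store['chr1:12345'] = 'A/G'
print store['chr1:12345']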

cheers,
--titus

Alex Nolley

Jan 6, 2009, 9:43:25 PM
to pygr...@googlegroups.com
Sure, I could certainly do that. Do you mean expanding the diversity of
tests or testing more on-disk storage techniques, or both? I'll post some
benchmarks soon showing how seqdb2 measures up to the other techniques
Istvan tested earlier.

-Alex

C. Titus Brown

Jan 6, 2009, 9:51:01 PM
to pygr...@googlegroups.com
On Tue, Jan 06, 2009 at 09:43:25PM -0500, Alex Nolley wrote:
->
-> Sure, I could certainly do that. Do you mean expanding the diversity of
-> tests or testing more on-disk storage techniques, or both? I'll post some
-> benchmarks soon showing how seqdb2 measures up to the other techniques
-> Istvan tested earlier.

Testing more on-disk storage techniques would be the first thing to do
-- just modifying Istvan's code to do more.

tnx,
--t

Istvan Albert

Jan 7, 2009, 9:49:10 AM
to pygr-dev


On Jan 6, 8:15 pm, "C. Titus Brown" <c...@msu.edu> wrote:

> One interesting option is to move to an interface like 'shove', which
> supports many different backends:
>
>        http://pypi.python.org/pypi/shove

This shove is interesting; I'm curious to see how it works out too.

I made a metakit backend and committed it recently; it does not seem
to be fast at all. I think pushing long strings into it is probably
not what they had in mind when they developed it (maybe I'm misusing
it; the docs are not the easiest to read).

I don't think HDF can be used for arbitrary-length strings. I'm using
HDF myself and it is *extremely* performant; probably nothing would
outperform HDF when it comes to reading out, say, 1 million integers
from a database. But the columns have to have a fixed size, say
StringCol(100), and I think shorter strings will be padded to this
size. Thus every entry needs to be made as big as the longest one.
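
(To illustrate the fixed-width point with PyTables, the HDF5 wrapper that
provides StringCol; API names are from the PyTables 2.x series, and the
column sizes are just placeholders.)

import tables

class SolexaRead(tables.IsDescription):
    key = tables.StringCol(30)      # every key slot is 30 bytes, padded if shorter
    seq = tables.StringCol(100)     # likewise for the value column

h5 = tables.openFile('reads.h5', 'w')
table = h5.createTable('/', 'reads', SolexaRead)
row = table.row
row['key'] = 'chr1:12345'
row['seq'] = 'ACGT' * 9             # stored in a fixed 100-byte slot regardless
row.append()
table.flush()
h5.close()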

C. Titus Brown

Jan 7, 2009, 3:18:50 PM
to pygr...@googlegroups.com
On Wed, Jan 07, 2009 at 06:49:10AM -0800, Istvan Albert wrote:
->
->
->
-> On Jan 6, 8:15 pm, "C. Titus Brown" <c...@msu.edu> wrote:
->
-> > One interesting option is to move to an interface like 'shove', which
-> > supports many different backends:
-> >
-> >        http://pypi.python.org/pypi/shove
->
-> This shove is interesting; I'm curious to see how it works out too.
->
-> I made a metakit backend and committed it recently; it does not seem
-> to be fast at all. I think pushing long strings into it is probably
-> not what they had in mind when they developed it (maybe I'm misusing
-> it; the docs are not the easiest to read).

No, from what little I know, shoving long strings into databases is not
generally the intended use case :). I was puzzled by some of the people
who responded to my blog post with comments like,

Use ZODB!

for that very reason. But I guess I didn't describe our issues very
well.

-> I don't think HDF can be used for arbitrary-length strings. I'm using
-> HDF myself and it is *extremely* performant; probably nothing would
-> outperform HDF when it comes to reading out, say, 1 million integers
-> from a database. But the columns have to have a fixed size, say
-> StringCol(100), and I think shorter strings will be padded to this
-> size. Thus every entry needs to be made as big as the longest one.

I wonder how PyTables deals with that?

http://www.pytables.org/moin

Paul Rigor

Apr 28, 2009, 4:04:55 AM
to C. Titus Brown, pygr...@googlegroups.com
Has there been any further discussion on creating an extensible
on-disk back-end?

Thanks,
Paul

C. Titus Brown

Apr 28, 2009, 9:49:29 AM
to pygr...@googlegroups.com
On Tue, Apr 28, 2009 at 01:04:55AM -0700, Paul Rigor wrote:
-> Has there been any further discussion on creating an extensible
-> on-disk back-end?

No, not so far. Want to pick up the torch? ;)

--titus

Christopher Lee

Apr 28, 2009, 3:26:02 PM
to pygr...@googlegroups.com
Hi Paul,
can you explain a bit more what you mean by "an extensible on-disk
back-end"? To store what kind of data? Extensible how?

-- Chris

Paul Rigor (gmail)

Apr 28, 2009, 3:43:30 PM
to pygr...@googlegroups.com
Hi,

The support for relational databases seems to be tightly coupled to MySQL (please correct me if I'm wrong). In sqlgraph.py, for example, the database connection defaults to MySQL, with an explicit import of MySQLdb, a search for MySQL configuration file variants, etc.

Wouldn't a package like SQLAlchemy provide a more transparent way of accessing a relational database backend? This package can handle several database 'engines' and provide a common interface to work with.

I'm still a bit unsure whether pygr actually uses the db URI stored in the PYGRDATAPATH environment variable to obtain the connection string. A metabase object is certainly instantiated with the correct 'dbpath', but that object doesn't seem to parse the dbpath parameter. Also, in loading the 'metabases', there is no way to explicitly specify connection parameters (user, host, password, port).

I guess I'm still a little confused, and I'm learning by placing print statements all over the place...

Paul
--
Paul Rigor
Graduate Student
Institute for Genomics and Bioinformatics
Donald Bren School of Information and Computer Sciences
University of California in Irvine
248 ICS2 Bldg.
+1 (760) 536 - 6767 (skype)

C. Titus Brown

Apr 28, 2009, 4:24:08 PM
to pygr...@googlegroups.com
On Tue, Apr 28, 2009 at 12:43:30PM -0700, Paul Rigor (gmail) wrote:
-> Wouldn't a package like SQLalchemy provide a more transparent way of
-> accessing a relational database backend? This package can handle several
-> database 'engines' and provide a common interface to work with.

Yes, this was suggested, but I think that it would be a fair amount of
work to do. Good for down the road, but right now the code is working.

--t

Christopher Lee

Apr 28, 2009, 4:30:20 PM
to pygr...@googlegroups.com

On Apr 28, 2009, at 12:43 PM, Paul Rigor (gmail) wrote:

> Hi,
>
> The support for relational databases seems to be tightly coupled to
> mysql (Please correct me if I'm wrong). In sqlgraph.py for example,
> database connection defaults to mysql with an explicit import of
> MySQLdb, search for mysql configuration file variants, etc.

In Pygr 0.7 and before, only MySQL was supported. 0.8 provides a
mechanism for handling the SQL variations of different database
servers, and adds sqlite support as a first example of that. The old
mechanisms for connecting to the database server are gradually being
deprecated in favor of the new ServerInfo mechanism (a pickleable
database connection object), which is totally general.


>
> Wouldn't a package like SQLalchemy provide a more transparent way of
> accessing a relational database backend? This package can handle
> several database 'engines' and provide a common interface to work
> with.

Jenny Qian did an analysis of this a while back. You can see that
meaty discussion here:
http://groups.google.com/group/pygr-dev/browse_thread/thread/a251addd179531e5/1cade214145543de?lnk=gst&q=sqlalchemy#1cade214145543de

Pygr's goal is to be agnostic about back ends (i.e. work with any back-
end without prejudice). So adding support for working with SQLAlchemy
would be great. The question is whether we can *eliminate our own
support* for SQL back-ends by switching to SQLAlchemy. AFAIK, the
main problem is a major difference in philosophies.

- Pygr follows a "pickling" model, i.e. to access an existing
resource, *no user code whatsoever* should be required: the user simply
"says the name" of the desired resource, and it is automatically
loaded, with all its relations to other data resources intact. I don't
think SQLAlchemy database objects are pickleable in this way.

- there are a bunch of other issues surrounding the general Pygr
philosophy of providing a consistent graph interface to all data, with
automatic schema relations etc.

I would be very interested if you see a way to make a SQLAlchemy
interface pickleable, so that it could be saved to pygr.Data.


>
> I'm still a bit unsure whether pygr actually uses the db uri stored
> in the PYGRDATAPATH environment variable to obtain the connection
> string. For sure a metabase object is instantiated with the correct
> 'dbpath', but that object doesn't seem to parse that dbpath
> parameter. Also, In loading the 'metabases', there is no way to
> explicitly specify connection parameters (user, host, password, port).

Actually, you can specify host, user, and password in the PYGRDATAPATH.
From the docs for SQLTable: "If cursor is None, it will attempt to
connect to a MySQL server using authentication information either from
your the name string (treated as a whitespace separated list in the
form tablename host user passwd; at least tablename and host must be
present)..." metabase uses the same syntax. We could add port easily
to this list.

Sometime soon metabase will switch to using SQLTable explicitly, which
will make the (current) MySQLMetabase backend compatible with sqlite
(and other database servers if we add SQLTable support for them).
This would get rid of the annoying path syntaxes (such as that above)
and employ a really clean mechanism based on the general concept of
ServerInfo (i.e. a pickleable reference to a database). In other
words, a connection to a database would itself be a resource saved in
pygr.Data.

Officially, this kind of focus on development of pygr.Data is supposed
to happen in release 1.0. But if users like you push for it, it could
be done soon.

>
> I guess I'm still a little confused and I'm learning by placing
> print statements all over the place...

Sorry! Updating the docs for 0.8 is high on my list of to-do's...

-- Chris

Paul Rigor (gmail)

Apr 28, 2009, 11:19:33 PM
to pygr...@googlegroups.com
Hi gang,

Thanks for the link to the discussion with Jenny! That was uber informative.

Chris, thanks for the SQLiteServerInfo class. I've written a more generic ServerInfo class that uses SQLAlchemy, based on the SQLiteServerInfo class. Also, I don't think we need to pickle any SQLAlchemy objects directly: using the ServerInfo classes, we can easily recreate connections to whichever db backend we need. This generic class is pickleable, so it can be saved within pygr.Data.

I've been looking into how to obtain the primary key of a table as well (I'll include that in a future attachment and/or a diff to sqlgraph.py). No hacks are needed, since SQLAlchemy provides (albeit indirect) methods to obtain the primary key of a table.

Aside from restoring the connection and obtaining the primary key, it looks like the other SQL statements in sqlgraph can be executed without any changes by using the cursor returned by the generic ServerInfo object.

As I've only really started looking into the code in the past couple of days, please let me know if I've overlooked other functionality that could break. In any case, we'll be doing more tests tomorrow.

Thanks so much for all of your advice =)
Paul



Christopher Lee

Apr 29, 2009, 12:24:24 AM
to pygr...@googlegroups.com

On Apr 28, 2009, at 8:19 PM, Paul Rigor (gmail) wrote:

> Hi gang,
>
> Thanks for the link to the discussion with Jenny! That was uber
> informative.
>
> Chris, thanks for the SQLiteServerInfo class. I've written a more
> generic serverinfo that uses sqlalchemy based on the
> sqliteserverinfo class. Also, I don't think we need to pickle any
> sqlalchemy objects directly. Using the serverinfo classes, we can
> easily recreate connections to whichever db-backend. This generic
> class is pickleable so it can be saved within pygr.Data.

Could you send some examples of your code? This would help educate me
about how we could try to work with sqlalchemy as a back-end...

>
> I've been looking into how to obtain the primary key of a table as
> well (i'll include that in a future attachment and/or a diff to
> sqlgraph.py). No need for hacks since sqlalchemy provides (albeit,
> indirect) methods to obtain the primary key for a table.
>
> Aside from restoring the connection and obtaining the primary key,
> it looks like other sql statements in sqlgraph can be performed
> without any changes by using the cursor returned by the generic
> serverinfo object.

Just to make sure I understand, are you talking about sqlite, or about
sqlalchemy? Pygr's current code analyzes the sqlite schema to
determine the primary key. See sqlgraph.sqlite_table_schema() as an
example of such a schema analysis function.

SQLTable has a general mechanism for plugging in new functionality to
analyze the schema. sqlgraph._schemaModuleDict is a dictionary that
maps a module name (derived from the cursor object) to a schema
analysis function. You can add your own functions to this mapping,
and then just pass in a cursor object to SQLTable, or better yet, pass
in a pickleable serverInfo object (which can return a usable cursor,
to again be automatically handled by the above system...).

>
> As I've only really started looking into the code in the past couple
> of days, please let me know if I've overlooked other functionality
> that could break. In any case, we'll be doing more tests tomorrow.

It sounds like you are using SQLTable, which will then query your back-
end using SQL. In that case, you may want to be able to specify the
SQL variations of your back-end, so that SQLTable will use the proper
variant for your back-end. For examples of this, see the
_sqliteMacros and _mysqlMacros in sqlgraph.py.

If you create your own *_table_schema() function for your back-end, it
should create an appropriate SQLFormatDict with the paramstyle and
macro dict for that back-end. Again, see the mysql_table_schema() and
sqlite_table_schema() code for examples.


-- Chris

Paul Rigor (gmail)

Apr 29, 2009, 4:02:42 PM
to pygr...@googlegroups.com
Hi Chris,

Attached is a file containing the generic db ServerInfo class along with a method (which could be split out as a stand-alone function) that obtains the primary key of a table regardless of database type. The SQLAlchemy group was very helpful with this latter functionality.

sqlite = GenericDBServerInfo("sqlite:////home/prigor/projects/genomics/hts/src/sql/pygrdata/test_sqlite.db")
sqlite.get_primary_key()
Out[16]: u'name'

mysql = GenericDBServerInfo("mysql://genomics:geno...@motifmap.ics.uci.edu:5000/hts")
mysql.get_primary_key()
Out[17]: u'name'
GenericDBServerInfo.py