-- Chris
On Nov. 16, 2007, Namshin Kim wrote:
Hi Chris,
Here is a simple comparison of build times, Hash vs. B-Tree.
Data: 10 million SNPs, snp126 database
Building the Pygr annotation database:
1. Hash: 16 hours (960 min). It was running on lln4, but was killed by a
"Bus error" just after building the shelve; the NLMSA was not built.
2. B-Tree: 1 hour 20 min (80 min).
Yours,
Namshin Kim
On Nov. 6, 2007, Namshin Kim wrote:
B-Tree is faster than the hash algorithm, and MySQL (the MyISAM engine) uses
B-Tree by default. It took 10 hours to finish building.
#!/usr/bin/env python
# Build a B-tree-backed bsddb database mapping FASTA headers to sequences.
import sys, bsddb

args = sys.argv
if len(args) != 2:
    print 'Usage:', args[0], 'dummy'
    sys.exit()

U87 = bsddb.btopen('U87.btree', 'c')   # 'c': create the file if needed
fahead = ''
for lines in open('U87', 'r').xreadlines():
    if lines[0] == '>':
        fahead = lines[1:].strip()     # FASTA header (minus '>') becomes the key
    else:
        U87[fahead] = lines.strip()    # assumes each sequence fits on one line
U87.close()
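For comparison, a hash-backed version of the same script would differ only in the open call; a minimal sketch (the timed run above apparently built a shelve, so this is illustrative only, with a made-up output file name):

#!/usr/bin/env python
# Hash-access variant of the same build, for timing comparison only.
import bsddb

U87 = bsddb.hashopen('U87.hash', 'c')   # hash access method instead of btopen
fahead = ''
for lines in open('U87', 'r').xreadlines():
    if lines[0] == '>':
        fahead = lines[1:].strip()
    else:
        U87[fahead] = lines.strip()
U87.close()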
Yours,
Namshin Kim
- what's the total time for loading and initializing via
indexed_open(): is it 15.4 s for standard loading, vs. 9.1 + 20.4 s
for fastloading? Or does standard loading also require the indexing
step, for a total of 15.4 + 20.4 s?
- what exactly does your update test do? 1 million random updates on
existing keys? Add new keys?
- by sq_dict_open, do you mean sq_dict.open()? There is no
"sq_dict_open" function in the sq_dict.py file that you linked to...
Thanks again for doing all these tests!
-- Chris
> Here is a test run for 10 million rows; I've also added a GNU dbm test
> (called dbm_open).
Huh, I had completely forgotten about CDB (it's *the* CDB from Dan
Bernstein, right?) despite actually using it on my system... From these
results, it would seem this is the way to go - not only is it faster
than even btree-based bsddb, its performance is actually consistent
over different benchmarks (which sadly isn't the case for sqlite and
gdbm... Some of these results look AWFUL).
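For reference, the kind of timing harness being discussed might look roughly like this, using only stdlib back-ends (bsddb B-tree and gdbm); the row count and key/value shapes here are purely illustrative:

import time, bsddb, gdbm

N = 1000000   # illustrative row count, not the 10-million-row run above

def bench(name, opener, filename):
    db = opener(filename, 'n')           # 'n': always start from an empty file
    t0 = time.time()
    for i in xrange(N):
        db['key%09d' % i] = 'value%d' % i
    db.close()
    print name, time.time() - t0, 'seconds'

bench('bsddb btree', bsddb.btopen, 'test.btree')
bench('gdbm', gdbm.open, 'test.gdbm')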
Would you be able to see how python-cdb works with Python 3.0, Istvan?
--
MS
Bet on months...
Seriously, though, I think it's premature to think much about Python
3.0.
First, Python 2.x isn't going to go out of business anytime soon. So
there's no reason to move on.
Second, neither Pyrex nor Cython has 3.x support yet. So we couldn't
do it even if we wanted to.
Third, there's no compelling reason to transition to 3.x. No
performance plusses, no big development community, etc.
Fourth, the tools for transitioning from Python 2.x to 3.x will be
improving dramatically over the next year. So we might as well wait.
Fifth, the suggested "path" to transitioning to 3.x is to engineer
complete code coverage, then use an automated tool to rewrite the source
code, then manually adjust. We are very far from complete code
coverage!
--
Even apart from those issues re the 3.0 transition, using something like
python-cdb raises a completely different set of issues from the
bsddb/sqlite discussion. Because it's a 3rd-party package that's not
included with Python, we could not rely on it being available; we'd have
to distribute binaries with pygr if we wanted people to be able to use
pygr "out of the box". This is a significant investment of time and
resources for a project the size of pygr!
On the flip side, using a 3rd-party package alleviates the problems
associated with python-dev's unwillingness to ship a scalable data store
with Python 2.x or 3.x!
These issues completely drown out the problem of making python-cdb py3k
compatible, which would take one of us less than a day to do...
cheers,
--titus
--
C. Titus Brown, c...@msu.edu
Cross-platform functionality is definitely one big issue. I do think,
though, that working "out of the box" is an important goal. From
working in support for many years, I can tell you that if you want people
to even *look* at your software, you need to make it as simple and easy
to install as possible. Every additional install step loses you a big
chunk of potential users.
We can expect people to have a working compiler on UNIX machines, but
that's about it; we really need to have Windows binaries available too.
Istvan, are you saying that cdb doesn't work on Windows??
-> What we could do instead is make the back end swappable; the interface
-> for an on-disk dictionary is fairly simple, so it does not seem like a
-> big undertaking to have multiple options. This would allow people to
-> champion their favorite backend, without having to lock out others.
Right! Chris has regularly pointed out that backwards compatibility is
important to him and other users: we want to be able to load our
databases created with Python 2.5 into pygr running under Python 2.6,
and maybe even 3.x. That should be possible with a "third party" db like
cdb.
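The swappable back-end idea quoted above might look roughly like this; a minimal sketch with hypothetical names, not existing pygr code:

import shelve

class DiskDictBackend(object):
    """Hypothetical minimal interface a pygr on-disk-dictionary back-end would need."""
    def __getitem__(self, key): raise NotImplementedError
    def __setitem__(self, key, value): raise NotImplementedError
    def __iter__(self): raise NotImplementedError
    def close(self): raise NotImplementedError

class ShelveBackend(DiskDictBackend):
    """One possible implementation, wrapping the stdlib shelve module."""
    def __init__(self, path):
        self.d = shelve.open(path)
    def __getitem__(self, key): return self.d[key]
    def __setitem__(self, key, value): self.d[key] = value
    def __iter__(self): return iter(self.d.keys())
    def close(self): self.d.close()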
http://ivory.idyll.org/blog/jan-09/lazyweb-worm-options-in-python.html
I think we need to do some more benchmarking; Alex, is this something
you'd be interested in doing, perhaps by expanding Istvan's benchmarking
code?
One interesting option is to move to an interface like 'shove', which
supports many different backends:
http://pypi.python.org/pypi/shove
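As far as I recall from the shove docs, usage is dict-like with the back-end chosen by a URI; treat the URI scheme and method names below as assumptions to double-check against the package itself:

from shove import Shove

store = Shove('file://teststore')        # back-end selected via a URI (assumed syntax)
store['chr1:1000-2000'] = 'some annotation string'
print store['chr1:1000-2000']
store.close()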
cheers,
--titus
Testing more on-disk storage techniques would be the first thing to do
-- just modifying Istvan's code to do more.
tnx,
--t
No, from what little I know, shoving long strings into databases is not
generally the intended use case :). I was puzzled by some of the people
who responded to my blog post with comments like,
Use ZODB!
for that very reason. But I guess I didn't describe our issues very
well.
-> I don't think HDF can be used for arbitrary length strings. I'm using
-> HDF myself and it is *extremely* performant. Probably nothing would
-> outperform HDF when it comes to reading out say 1 million integers
-> from a database. But the columns have to have a fixed size: say
-> StringCol(100) and I think shorter strings will be padded to this
-> size. Thus every entry needs to be made as big as the longest one.
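For concreteness, the fixed-width behaviour described above looks roughly like this in PyTables (call names written from memory; they differ between PyTables versions):

import tables

class Record(tables.IsDescription):
    name = tables.StringCol(16)     # every value padded/truncated to 16 bytes
    seq = tables.StringCol(100)     # the longest expected string sets the column width

h5 = tables.open_file('test.h5', mode='w')
table = h5.create_table('/', 'seqs', Record)
row = table.row
row['name'] = 'chr1'
row['seq'] = 'ACGT'                 # stored in a 100-byte slot regardless of length
row.append()
table.flush()
h5.close()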
I wonder how PyTables deals with that?
Thanks,
Paul
No, not so far. Want to pick up the torch? ;)
--titus
-- Chris
Yes, this was suggested, but I think that it would be a fair amount of
work to do. Good for down the road, but right now the code is working.
--t
> Hi,
>
> The support for relational databases seems to be tightly coupled to
> MySQL (please correct me if I'm wrong). In sqlgraph.py, for example,
> the database connection defaults to MySQL, with an explicit import of
> MySQLdb, a search for MySQL configuration file variants, etc.
In Pygr 0.7 and before, only MySQL was supported. 0.8 provides a
mechanism for handling the SQL variations of different database
servers, and adds sqlite support as a first example of that. The old
mechanisms for connecting to the database server are gradually being
deprecated in favor of the new ServerInfo mechanism (a pickleable
database connection object), which is totally general.
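As a rough sketch of that ServerInfo style (the constructor argument and keyword name here are my assumptions about the 0.8 API, not verified against the release):

from pygr import sqlgraph

serverInfo = sqlgraph.SQLiteServerInfo('annotations.db')    # pickleable connection object
snps = sqlgraph.SQLTable('snp126', serverInfo=serverInfo)   # table accessed through it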
>
> Wouldn't a package like SQLalchemy provide a more transparent way of
> accessing a relational database backend? This package can handle
> several database 'engines' and provide a common interface to work
> with.
Jenny Qian did an analysis of this a while back. You can see that
meaty discussion here:
http://groups.google.com/group/pygr-dev/browse_thread/thread/a251addd179531e5/1cade214145543de?lnk=gst&q=sqlalchemy#1cade214145543de
Pygr's goal is to be agnostic about back ends (i.e. work with any back-
end without prejudice). So adding support for working with SQLAlchemy
would be great. The question is whether we can *eliminate our own
support* for SQL back-ends by switching to SQLAlchemy. AFAIK, the
main problem is a major difference in philosophies.
- Pygr follows a "pickling" model: to access an existing
resource, *no user code whatsoever* should be required. By simply
"saying the name" of the desired resource, the user should have it
automatically loaded, with all its relations to other data resources
intact (see the sketch below). I don't think SQLAlchemy database
objects are pickleable in this way.
- there are a bunch of other issues surrounding the general Pygr
philosophy of providing a consistent graph interface to all data, with
automatic schema relations, etc.
I would be very interested if you see a way to make a SQLAlchemy
interface pickleable, so that it could be saved to pygr.Data.
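To illustrate the "just say the name" model, roughly (the resource path below is hypothetical, written in the style of the pygr.Data tutorial):

import pygr.Data

hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18()   # load a saved resource by its name
# hg18 now behaves like the original object, with its schema relations intact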
>
> I'm still a bit unsure whether pygr actually uses the db uri stored
> in the PYGRDATAPATH environment variable to obtain the connection
> string. For sure a metabase object is instantiated with the correct
> 'dbpath', but that object doesn't seem to parse that dbpath
> parameter. Also, in loading the 'metabases', there is no way to
> explicitly specify connection parameters (user, host, password, port).
Actually, you can specify host, user, and password in the PYGRDATAPATH.
From the docs for SQLTable: "If cursor is None, it will attempt to
connect to a MySQL server using authentication information either from
the name string (treated as a whitespace-separated list in the
form tablename host user passwd; at least tablename and host must be
present)..." metabase uses the same syntax. We could add port easily
to this list.
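So, roughly (the host/user/password values here are placeholders):

from pygr import sqlgraph

# name string parsed as: tablename host user passwd
t = sqlgraph.SQLTable('mydb.snp126 myhost myuser mypasswd')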
Sometime soon metabase will switch to using SQLTable explicitly, which
will make the (current) MySQLMetabase backend compatible with sqlite
(and other database servers if we add SQLTable support for them).
This would get rid of the annoying path syntaxes (such as that above)
and employ a really clean mechanism based on the general concept of
ServerInfo (i.e. a pickleable reference to a database). In other
words, a connection to a database would itself be a resource saved in
pygr.Data.
Officially, this kind of focus on development of pygr.Data is supposed
to happen in release 1.0. But if users like you push for it, it could
be done soon.
>
> I guess I'm still a little confused and I'm learning by placing
> print statements all over the place...
Sorry! Updating the docs for 0.8 is high on my list of to-do's...
-- Chris
> Hi gang,
>
> Thanks for the link to the discussion with Jenny! That was uber
> informative.
>
> Chris, thanks for the SQLiteServerInfo class. I've written a more
> generic ServerInfo that uses SQLAlchemy, based on the
> SQLiteServerInfo class. Also, I don't think we need to pickle any
> SQLAlchemy objects directly. Using the ServerInfo classes, we can
> easily recreate connections to whichever DB back-end. This generic
> class is pickleable, so it can be saved within pygr.Data.
Could you send some examples of your code? This would help educate me
about how we could try to work with sqlalchemy as a back-end...
>
> I've been looking into how to obtain the primary key of a table as
> well (i'll include that in a future attachment and/or a diff to
> sqlgraph.py). No need for hacks since sqlalchemy provides (albeit,
> indirect) methods to obtain the primary key for a table.
>
> Aside from restoring the connection and obtaining the primary key,
> it looks like other sql statements in sqlgraph can be performed
> without any changes by using the cursor returned by the generic
> serverinfo object.
Just to make sure I understand, are you talking about sqlite, or about
sqlalchemy? Pygr's current code analyzes the sqlite schema to
determine the primary key. See sqlgraph.sqlite_table_schema() as an
example of such a schema analysis function.
SQLTable has a general mechanism for plugging in new functionality to
analyze the schema. sqlgraph._schemaModuleDict is a dictionary that
maps a module name (derived from the cursor object) to a schema
analysis function. You can add your own functions to this mapping,
and then just pass in a cursor object to SQLTable, or better yet, pass
in a pickleable serverInfo object (which can return a usable cursor,
to again be automatically handled by the above system...).
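Roughly like this (the dictionary key and the schema function's signature below are my assumptions, not pygr's actual values):

from pygr import sqlgraph

def mydb_table_schema(db):
    # inspect the back-end's system tables here to find the primary key, etc.
    pass

sqlgraph._schemaModuleDict['mydbapi'] = mydb_table_schema   # keyed by DB-API module name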
>
> As I've only truly started looking into the code in the past couple of
> days, please let me know if I've overlooked other functionality
> that could break. In any case, we'll be doing more tests tomorrow.
It sounds like you are using SQLTable, which will then query your back-
end using SQL. In that case, you may want to be able to specify the
SQL variations of your back-end, so that SQLTable will use the proper
variant for your back-end. For examples of this, see the
_sqliteMacros and _mysqlMacros in sqlgraph.py.
If you create your own *_table_schema() function for your back-end, it
should create an appropriate SQLFormatDict with the paramstyle and
macro dict for that back-end. Again, see the mysql_table_schema() and
sqlite_table_schema() code for examples.
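In rough outline (the macro key and the SQLFormatDict argument order below are assumptions based on the description above, not pygr's actual values):

from pygr import sqlgraph

_mydbMacros = {'IGNORE': 'ignore'}   # back-end-specific SQL fragments

def mydb_table_schema(db):
    # ... analyze the schema as above, then return the format info ...
    return sqlgraph.SQLFormatDict('qmark', _mydbMacros)   # paramstyle + macro dict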
-- Chris