Predicate pushdowns and indexes


vada

Jun 13, 2012, 2:22:56 PM
to python-u...@googlegroups.com
Cool stuff,

Looking at the implementation, Underverse is cool and pretty lightweight. I think the whole storage aspect could be easily abstracted. Staying focused on the SQLite implementation, though, it would be pretty trivial to add additional indexes on the underlying table. The only thing left would be converting the predicate input into something the index covers; for example, it might be a subset of the operands referenced in the original query. At any rate it's a little complicated, but I don't think it would totally blow out the system, and the performance benefits would be huge. The current system does a table scan for every query! That makes it unusable for real-world use cases over 100K records.
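
Roughly what I have in mind, as a minimal sqlite3 sketch (the table and column names here are made up): keep the serialized document in one column, and materialize the attributes you want indexed as real columns.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        doc TEXT,          -- serialized document, as stored today
        active INTEGER,    -- copied out of the document on insert
        number INTEGER
    )
""")
conn.execute("CREATE INDEX idx_accounts_active ON accounts (active)")
conn.execute("CREATE INDEX idx_accounts_number ON accounts (number)")

# An indexed predicate can now be answered without the full table scan.
rows = conn.execute(
    "SELECT doc FROM accounts WHERE active = 1 AND number > ?", (1000,)
).fetchall()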

Thoughts?

-Travell

Max

Jun 13, 2012, 3:05:11 PM
to python-u...@googlegroups.com
Yeah, you're right, it would be trivial to add table indexes. SQLite also has an extension for fast full-text search, though I'm not sure whether it's included in the Python build.
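
For what it's worth, here's a quick way to probe whether full-text search is compiled into a given sqlite3 build (a throwaway sketch using the FTS4 module):

import sqlite3

conn = sqlite3.connect(":memory:")
try:
    conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body)")
    conn.execute("INSERT INTO docs VALUES ('predicate pushdowns and indexes')")
    print(conn.execute("SELECT body FROM docs WHERE body MATCH 'pushdowns'").fetchall())
except sqlite3.OperationalError:
    print("full-text search is not available in this build")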

As for the performance, I know what you mean. That's kinda why I included the 'glob' function, to keep the searching in SQLite as much as I could. Besides the indexes, there is at least one other place that impacts performance heavily: using jsonpickle for de/serialization. I've recently been rethinking that decision and considering another module. The issue is that everything going into and out of SQLite has to go through jsonpickle, and the module is rather slow because it has to recurse through the object. Some comparisons I've done recently suggest it's 10-100x slower than json, marshal, and msgpack. The drawback of those modules is that they don't support native Python objects, so switching the de/serialization to one of them would yield a tremendous speedup, but you'd lose the ability to store Python class objects. I was considering moving to either json or msgpack because they can be parsed from other languages. Changing modules could also simplify Underverse by a good bit.
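
The kind of comparison I mean looks roughly like this (the document and iteration count are just illustrative):

import json
import timeit

import jsonpickle  # third-party: pip install jsonpickle

doc = {"name": "test", "values": list(range(100)), "nested": {"a": 1, "b": 2}}

t_json = timeit.timeit(lambda: json.loads(json.dumps(doc)), number=10000)
t_jp = timeit.timeit(lambda: jsonpickle.decode(jsonpickle.encode(doc)), number=10000)

print("json round-trip:       %.3fs" % t_json)
print("jsonpickle round-trip: %.3fs" % t_jp)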

I've hardly had any feedback about the module, so I wasn't sure if anyone was really using it. I hadn't completely convinced myself to switch, but maybe with your input we can do something about the performance.

Max

P.S. - sorry if the code is messy.

vada

Jun 14, 2012, 1:33:02 PM
to python-u...@googlegroups.com
Hey Max,

I think a good starting place would be to switch from sqlite3 to sqlalchemy for the actual storage layer. Having Underverse as a schemaless interface to any relational database is pretty powerful. Based on where the package is today, there's no need for the sqlite dependency. Another thought is to leverage boto for access to DynamoDB.

So basically the StorageEngine interface should support plug and play and generically support predicate pushdown. We could start out with an RDBMSStorageEngine to work out the kinks and then graduate to a DynamoDBStorageEngine. Something like:
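
Sketching the interface I'm imagining (all the names here are hypothetical):

class StorageEngine(object):
    """Plug-and-play base class; each engine decides what it can push down."""

    def save(self, collection, doc):
        """Persist a document in the named collection."""
        raise NotImplementedError

    def find(self, collection, pushdown_predicates, residual_filter):
        """Answer pushdown_predicates natively (e.g. a SQL WHERE clause or a
        DynamoDB query), then apply residual_filter in Python to what's left."""
        raise NotImplementedError

class RDBMSStorageEngine(StorageEngine):
    pass  # translates pushdowns into WHERE clauses over indexed columns

class DynamoDBStorageEngine(StorageEngine):
    pass  # translates pushdowns into key conditions via boto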

# Create an Underverse collection with indexes

accounts = uv.new("accounts", {"active":bool, "number":int, "name":str, "address.zipcode":str})

What do you think of that?

-Travell

P.S.

Max

Jun 14, 2012, 2:52:43 PM
to python-u...@googlegroups.com
I like the idea of abstracting the storage engine. It would allow people to write their own for whatever they need. There's also authentication and authorization code that would be needed, depending on the storage engine. I simply skipped that because I was using SQLite.

I also like the proposed syntax for the indexes. Simple and easy. Just to make sure I'm understanding your thoughts on the indexes: the code you posted would create additional columns in the underlying table, with indexes, for the given attributes, correct? Also, how do you envision the "push-downs" working?

Max

vada

Jun 15, 2012, 10:30:37 AM
to python-u...@googlegroups.com
I would suspect that the ideal solution would be able to analyse the query and extract a portion to run against an indexed column.  

Max

Jun 15, 2012, 10:57:04 AM
to python-u...@googlegroups.com
Right, sounds easy enough. You'd query all the indexed columns first, then iterate over the results to apply any non-indexed predicates, and return what's left.
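
In code, that plan might look something like this (table and helper names are made up):

import json

def find(conn, indexed_where, params, residual_filter):
    # Phase 1: indexed predicates run inside SQLite as a WHERE clause.
    candidates = conn.execute(
        "SELECT doc FROM accounts WHERE " + indexed_where, params)
    # Phase 2: everything else runs as a plain Python filter.
    docs = (json.loads(row[0]) for row in candidates)
    return [d for d in docs if residual_filter(d)]

# e.g. find(conn, "active = ?", (1,), lambda d: d["medianincome"] > 30000)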

vada

Jun 15, 2012, 3:49:13 PM
to python-u...@googlegroups.com
Yeah, promote a predicate to the pushdown stage if an index covers it. The trick is how to push down more than one predicate to form a proper subquery; I guess this involves writing a basic query optimizer :) In the case of MySQL, it just picks the best index for the query, which is why covering (multi-column) indexes are important.
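
For reference, a multi-column index on the materialized columns would just be:

# One index that can satisfy predicates on both columns at once,
# instead of making the optimizer pick between two single-column indexes.
conn.execute(
    "CREATE INDEX idx_accounts_active_zip ON accounts (active, zipcode)")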

-Travell

Max

Jun 15, 2012, 5:00:12 PM
to python-u...@googlegroups.com
I'd say let each storage engine handle it appropriately. For RDBMS engines, it wouldn't be hard to construct a simple WHERE clause for the indexed columns and then query the underlying relational table.
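
Something like this, maybe (the predicate representation is just a stand-in):

from collections import namedtuple

Predicate = namedtuple("Predicate", "attribute op value")  # illustrative only

def split_predicates(predicates, indexed_columns):
    """Split into a parameterized WHERE clause and a residual Python list."""
    pushdown = [p for p in predicates if p.attribute in indexed_columns]
    residual = [p for p in predicates if p.attribute not in indexed_columns]
    where = " AND ".join("%s %s ?" % (p.attribute, p.op) for p in pushdown)
    params = tuple(p.value for p in pushdown)
    return where, params, residual

# split_predicates([Predicate("active", "=", 1),
#                   Predicate("medianincome", ">", 30000)],
#                  {"active", "number"})
# -> ("active = ?", (1,), [Predicate("medianincome", ">", 30000)])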

vada

Jun 15, 2012, 6:18:52 PM
to python-u...@googlegroups.com
Cool, so:

query.pushdown("address.zip = '21001'").filter(lambda x: x.medianincome > 30000)

I also think it would be cool to switch to simple list-comprehension-style filtering instead of the query language. Just pass filter a function that takes a Document.

Thoughts?

Max

Jun 17, 2012, 12:13:27 AM
to python-u...@googlegroups.com
Actually, Underverse already has that functionality via UDFs (http://packages.python.org/underverse/#underverse.model.Document.udf). I was also thinking that the pushdowns would be hidden from the user and would happen automatically, depending on the query.

# Creates a new accounts table if it doesn't already exist
# and adds indexes if they don't already exist.
accounts = uv.new("accounts", indexes={"active": bool, "number": int, "name": str, "address.zipcode": str})

# Query for active accounts where the zipcode is '21001' and the median income is > $30k.
active_accounts = accounts.find(D.address.zip == "21001", D.active == True, D.udf(lambda doc: doc.medianincome > 30000))


So basically the 'active' and 'address.zip' columns would be used to query the underlying indexed columns, and then the results would be iterated over and filtered further using the UDF on the 'medianincome' attribute.

However, instead of using a lambda function for the above query, you can just use:

D.medianincome > 30000

But the capability is there for more complex filtering, if you need it, through UDFs and UDPs.

Also, I believe I can abstract the SQLAlchemy connection interface so that the user can simply specify a connection string to access the storage engine. Like so:

uv = Underverse('sqlite:///file.db')

or MySQL:

uv = Underverse('mysql://scott:tiger@localhost/test')


This would allow Underverse access to all the databases that SQLAlchemy can connect to and not just SQLite.
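
Under the hood this would just be create_engine, so the same URLs shown above should pass straight through:

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///file.db")  # or the mysql:// URL above
conn = engine.connect()
print(conn.execute(text("SELECT 1")).fetchall())  # identical call on any backend
conn.close()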

I'd like to avoid changing the API as much as possible while still adding the new functionality. I've spent some time today getting familiar with some of the lower-level SQLAlchemy classes, and I think I can make it all work together without rewriting too much code.

- Max

vada

Jun 19, 2012, 11:39:05 AM
to python-u...@googlegroups.com
Awesome Max,

This is really cool.  I can't wait to kick the tires.  Have you figured out how limits and counts will work (efficiently)? 

-T

Max

Jun 20, 2012, 9:42:45 PM
to python-u...@googlegroups.com
Well, as for limits, I think it depends on the query. If all the attributes being queried are indexed, then you can add a LIMIT clause to the underlying RDBMS query. However, if any queried attribute is not indexed, then I think the most accurate way to do limits is in Python. There may also be some cases where we can use the RDBMS and Python together to still guarantee the correct number of results and get a speed increase as well. As for counts, that already queries the SQLite database.
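
Roughly (names are placeholders, continuing the earlier two-phase sketch):

import json

def find_with_limit(conn, where, params, residual_filter, limit):
    if residual_filter is None:
        # Fully indexed query: let the database enforce the limit.
        sql = "SELECT doc FROM accounts WHERE %s LIMIT %d" % (where, limit)
        return [json.loads(row[0]) for row in conn.execute(sql, params)]
    # Otherwise stream candidates and stop once the Python-side filter
    # has accepted enough documents.
    results = []
    for row in conn.execute("SELECT doc FROM accounts WHERE " + where, params):
        doc = json.loads(row[0])
        if residual_filter(doc):
            results.append(doc)
            if len(results) == limit:
                break
    return results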

Do you have any other thoughts on the matter? I'm going to start a new thread with all the proposed changes going into v0.5. Let me know if I've missed something.

 - Max