Extending pycassa with sqlalchemy-like cql query generation

313 views
Skip to first unread message

Samuel Christie

unread,
Mar 29, 2013, 4:49:08 PM3/29/13
to pycass...@googlegroups.com
Hey all,

Right now both of the two major cassandra libraries for python (pycassa and cassandra-dbapi2) are both a subset of the features I need. Pycassa has better connection pooling support and the beginnings of an object mapper, while cql makes cql3 queries more accessible. I've been looking into extending one of them to encompass the desired feature set, and at this point it looks like adding a cql3 query system to pycassa will be a lot easier, and more valuable in the long run given pycassa's better maintenance record.

I thought I would mention it here so that you could give your input on the idea. The changes are largely intended for internal use, but it would be best if they could be merged in with the main project at the end.

My current plan is to extend the ColumnFamily class with a few extra methods, such as select(), that would return a Query object that could then be further restricted by where(), limit(), etc. Each method would return a modified Query object so that they could be chained, and the query would be executed by calling all(), first(), one(), just like in sqlalchemy. I think it would be best if the ColumnFamily class could take a new kwarg 'cql_version' that would let it query for a more detailed schema with column names and types during initialization. The column types could then be extended with operator overloading to allow simpler expression based where clauses, again á la sqlalchemy. There would also be basic enhancements to the plain cql query execution system, to get nicer output and a simpler interface than the existing low-level execute_cql3_query method.

Anyway, that's just a preliminary concept, and I'm interested to learn from you what your existing plans are on the subject, what changes you would like to see along the same lines, and what design decisions you wouldn't accept for inclusion into the pycassa library. Most of the enhancements would be enabled with kwargs or be adding additional methods and classes, so I don't think it should interfere with existing code, but it's still worth asking.

Thanks,

-Samuel Christie

Tyler Hobbs

unread,
Mar 29, 2013, 5:36:15 PM3/29/13
to pycass...@googlegroups.com
Hi Samuel,

First off, thanks for your interest in adding some useful features! There are a few things worth mentioning:

The first is that I'm currently laying the groundwork for a new python driver for Cassandra that's similar to the new java driver (https://github.com/datastax/java-driver).  It uses the new native protocol instead of Thrift and only presents the cql3 API; additionally, it will have some good improvements in connection pooling, request pipelining, node discovery, and failure/retry policies over what pycassa offers.  I'm actually planning to include a query-builder interface that is similar to sqlalchemy's just like you mention, so I'm glad to see there is interest in having that.  I expect to have a beta-quality version of the new driver available on GitHub in roughly 1.5 or 2 months. (I'm not working on it full-time, at the moment.)

The second is that I am definitely interested in getting cql3 support into pycassa regardless of developments on the new driver, primarily because it provides a nice upgrade path for those who are currently using pycassa but are looking to use cql3.  At a minimum level, proper cql3 support includes:
* Being able to execute queries through a ConnectionPool with proper retries, etc
* Encoding and decoding cql3 types

The first point shouldn't be very difficult at all. It primarily depends on deciding what the API looks like (i.e. whether it's a method on the ConnectionPool, on a ColumnFamily, or comes from a different module); I don't necessarily have a strong preference there yet.

The second point can also be pretty easily accomplished by copying code from the dbapi2 driver.

I would be quite happy to have just those two things in pycassa and will gladly accept pull requests. Anything extra is a bonus.

Regarding a mapper and query-builder interface, I feel like extending the ColumnFamily class is the wrong way to go simply because the two APIs are essentially mutually incompatible.  I haven't thought too deeply about the topic, but my gut says a combination of a subset of the sqlalchemy core expression api (http://docs.sqlalchemy.org/en/rel_0_8/core/expression_api.html) and declarative mappings (similar to http://docs.sqlalchemy.org/en/rel_0_8/orm/tutorial.html#declare-a-mapping) where execute() and query() are methods on a ConnectionPool would be a good start.

I'm definitely interested in discussing the API further.  Any work done for this in pycassa will likely be closely translated to the new python driver (and perhaps vice-versa).



--
You received this message because you are subscribed to the Google Groups "pycassa-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pycassa-deve...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Tyler Hobbs
DataStax

Samuel Christie

unread,
Mar 29, 2013, 6:34:21 PM3/29/13
to pycass...@googlegroups.com
Good to know we're at least thinking in the same general direction. In fact, I added a simple form of peer detection earlier today. How much progress have you already made towards those features that you mentioned? It would be best not to waste effort duplicating features if possible.

I am currently trying to work on just improving the basic cql3 support, so that sounds like a good place to start the discussion.

I've started with adding a query() method to ConnectionPool, that will be a wrapper for retries, etc. around calling query() on a single connection, since execute() is already taken. I was planning on having the results decoded into a list/iterator of ordered dictionaries, but haven't gotten that far along yet.

In the end, I was thinking the api would end up very similar to the cql dbapi2 interface, with prepared queries etc., and most of the names/code can be borrowed directly. Thoughts?

As for copying the decoding from dbapi2, which pieces aren't covered by the current type unpacking in pycassa? There might be a few cql3 specific types that aren't covered, but I think that for the most part it's just the data structure mapping from the results to column types that needs to be added, which could probably be done better than the current design in dbapi2 anyway. Partly I just don't like the fact that dbapi2 returns a flat list of values instead of dicts.

Regarding extending the ColumnFamily class for query generation I was thinking that it would really just be responsible for providing the schema and executing the query, while a separate module/Query class would be responsible for the actual query generation. Towards that end I thought it was the closest analog to the table classes in sqlalchemy, in that it encapsulates a schema loaded from the database. The current interface for the orm in sqlalchemy is something like session.query(Table).get(id), or session.query(Table).filter(Table.col == val).all(), which I suppose would imply using the query method on the pool instead of the column family but the api in pycassa is slightly different, in that get(id) is called directly on the CF instance, and the CF contains a reference to the pool.

If the CF wasn't used as a starting point for queries already, I would probably suggest leabing the query method on the ConnectionPool (like the db session), and have it take the CF as an argument to provide the schema information. Then the only extensions to the CF would be support for getting the cql3 version of the schema, or defining the schema as part of the init args. In the latter case, the CF object could be used to initialize a new column family in the database with create() or similar. Since the CF already has support for querying via its own reference to the pool, I'm not sure that I see a problem with being able to call CF.select().where().first() instead of just CF.get().
You received this message because you are subscribed to a topic in the Google Groups "pycassa-devel" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/pycassa-devel/tPQBCc71Ck0/unsubscribe?hl=en.
To unsubscribe from this group and all its topics, send an email to pycassa-deve...@googlegroups.com.

Tyler Hobbs

unread,
Apr 1, 2013, 2:15:14 PM4/1/13
to pycass...@googlegroups.com
On Fri, Mar 29, 2013 at 5:34 PM, Samuel Christie <emai...@gmail.com> wrote:
Good to know we're at least thinking in the same general direction. In fact, I added a simple form of peer detection earlier today. How much progress have you already made towards those features that you mentioned? It would be best not to waste effort duplicating features if possible.

Node discovery is done pretty differently with the new native protocol (e.g. Cassandra can push notifications about ring changes and the ring state is stored in system tables) so there's not much overlap.

I'm sure you will, but if you end up doing much work here, just keep it separate from the cql3 support so that we can work on and commit the changes separately.
 


I am currently trying to work on just improving the basic cql3 support, so that sounds like a good place to start the discussion.

Agreed.
 

I've started with adding a query() method to ConnectionPool, that will be a wrapper for retries, etc. around calling query() on a single connection, since execute() is already taken. I was planning on having the results decoded into a list/iterator of ordered dictionaries, but haven't gotten that far along yet.

In the end, I was thinking the api would end up very similar to the cql dbapi2 interface, with prepared queries etc., and most of the names/code can be borrowed directly. Thoughts?

I'm not 100% clear on what you're suggesting.  It sounds like you're interested in having a prepare_query() function somewhere and the equivalent of the Cursor class's execute() and execute_prepared() methods with an (implicit?) fetchall(), meaning the execute() methods return results directly.  Is that accurate?

If so, I'm agreeable to that.

I'm also a fan of returning a list or generator of OrderedDicts.
 

As for copying the decoding from dbapi2, which pieces aren't covered by the current type unpacking in pycassa? There might be a few cql3 specific types that aren't covered, but I think that for the most part it's just the data structure mapping from the results to column types that needs to be added, which could probably be done better than the current design in dbapi2 anyway.

That sounds correct; I think you've got a good picture of what's missing.  Most of the cql3 types match something in pycassa.marshal.  Off the top of my head, the cq3 collection types are the only new data types that need to be handled.

The current dbapi2 code supports all of cql2, cql3, Thrift, and the native protocol, so a version that only handles cql3 over Thrift should indeed be simpler.
 
Partly I just don't like the fact that dbapi2 returns a flat list of values instead of dicts.

Agreed. It would be nice to be able to toggle that behavior, but just returning OrderedDicts as a start is fine by me.
 

Regarding extending the ColumnFamily class for query generation I was thinking that it would really just be responsible for providing the schema and executing the query, while a separate module/Query class would be responsible for the actual query generation. Towards that end I thought it was the closest analog to the table classes in sqlalchemy, in that it encapsulates a schema loaded from the database. The current interface for the orm in sqlalchemy is something like session.query(Table).get(id), or session.query(Table).filter(Table.col == val).all(), which I suppose would imply using the query method on the pool instead of the column family but the api in pycassa is slightly different, in that get(id) is called directly on the CF instance, and the CF contains a reference to the pool.

If the CF wasn't used as a starting point for queries already, I would probably suggest leabing the query method on the ConnectionPool (like the db session), and have it take the CF as an argument to provide the schema information. Then the only extensions to the CF would be support for getting the cql3 version of the schema, or defining the schema as part of the init args. In the latter case, the CF object could be used to initialize a new column family in the database with create() or similar. Since the CF already has support for querying via its own reference to the pool, I'm not sure that I see a problem with being able to call CF.select().where().first() instead of just CF.get().

I'm still against reusing the ColumnFamily class for mapping or query behavior.

cql3 schemas aren't always compatible with Thrift schemas.  Only tables that are created WITH COMPACT STORAGE are accessible through the Thrift API.  All Thrift-defined column families are accessible through cql3 (except for maybe super column familes? I haven't looked at that recently).  Some column family properties don't apply to non-COMPACT STORAGE tables. It's a pretty confusing situation already, and I'm worried about increasing confusion by mixing the two APIs.

Additionally, cq3 results include the data types of the returned data, so you don't need to determine the schema elsewhere.  So the ColumnFamily object wouldn't and shouldn't provide the db schema.  The "table" classes in sqlalchemy are more of a mapping from the db schema to python types and python behavior.

I would be in favor of creating a new class that mirrors sqlalchemy's Base class, where the API involves subclassing, declaring mappings from columns to attributes on subclass objects, etc.  If we want to support automatic persistence, this would require an equivalent to sqlalchemy's Session class, because mutations across multiple objects need to be tracked.  If those classes are in place, I think the proper API looks just like sqlalchemy: session.query(Table).filter(...).

I think the primary goal of this work is to allow pycassa users to easily execute cql3 queries with the same connection pool that they're already using.  Reusing familiar classes like ColumnFamily for another purpose or matching other aspects of pycassa's API aren't nearly as valuable, in my opinion, especially given the incompatibilities.

As I mentioned before, if the mapper work turns out to be too involved, don't worry.  Just being able to execute queries without a mapper is a good start.

Thanks again for your work so far.

--
Tyler Hobbs
DataStax

Blake Eggleston

unread,
May 16, 2013, 3:18:54 PM5/16/13
to pycass...@googlegroups.com
Hi Tyler and Samuel,

Regarding a CQL3 mapper for Python, you may want to check out my project cqlengine:


We've been using it in production on 2 websites for some time now.

It uses the cql library for talking to cassandra. We are very interested in developing a more robust connection manager for it because the current one is not especially fault tolerant.

Tyler, we'd really like to look at the work you've been doing, and would like to get involved if possible.

Thanks,

Blake
Reply all
Reply to author
Forward
0 new messages