Cross-shard queries

189 views
Skip to first unread message

George V. Reilly

unread,
Mar 11, 2012, 8:03:03 PM3/11/12
to sqlal...@googlegroups.com
We ran into a nasty problem with sharding a while back. We came up with an effective workaround, but we don't fully understand the problem. Perhaps someone here can provide more insight.

We shard requests to our production MySQL databases using sqlalchemy.ext.horizontal_shard.ShardedSession. Our query_chooser uses  a consistent hashing scheme on our account_ids to compute shard_ids. In almost all cases, we know the account_id before performing a query, therefore a query goes directly to the correct shard and only to that shard. In the other cases, we're typically trying to find the account_id(s) given some other key.

The nasty problem arose thus. We made a cross-shard query, which apparently opened a database connection to more than one of the sharded databases. The query returned an account_id and our associated SQLAlchemy Account object. We then added data to the account and committed it — all on the same scoped_session. Result: the thread (process?) blocks on the connections to the other databases.

The workaround is to call commit() on the session after performing the cross-shard query, and then call scoped_session() to make a fresh session before adding data to the account.

Anyone understand what the underlying problem was? Is it in SQLAlchemy, MySQL-python, or some lower level?

The pattern that we're moving towards with these cross-shard queries is to perform them in a distinct session using a contextmanager which commits the session. Account_ids and other keys are returned from the contextmanager, but SQLAlchemy objects are not. One (or more) of our SQLAlchemy Account objects are then created in the main session. I'm worried about the >1 case, since those accounts may be distributed across several shards.

Is this safe?

Michael Bayer

unread,
Mar 12, 2012, 1:56:44 AM3/12/12
to sqlal...@googlegroups.com


the Session works like this: each time you execute() on a particular Connection, that Connection is added to the state of the SessionTransaction, and remains open in a transaction until the Session itself is committed, rolled back, or closed. It's in this way that a Session coordinates multiple operations across many databases, and if twophase=True it will also call prepare() on all those connections before the commit.

So when using ShardedSession, if you hit three databases, they're all in that session's state.

Let's say you wanted to close out the first two. That may be fine, but how do you know that those connections aren't in the Session's state due to previous operations on those databases ? So it's something of a tricky problem, to say that you'd like to scan multiple database nodes with a certain SELECT statement, but then once you find the node you care about, close out the previous nodes. You'd have to check first that they weren't already there.

This is all doable though is not public API, to optimize this operation would mean you'd need to peek inside of Session.transaction._connections before and after the operation, then rollback() those connections you know you don't want anymore and remove them from Session.transaction._connections.

So that's what's going on, it's just I think if you try yo experiment with closing out individual connections from Session.transaction._connections it may become apparent that this is a bit of an intricate case.

Overall, I'd likely be using distinct Session objects explicitly in any case within a horizontal sharding situation; assuming I can make sure that the scope of a certain request is going to be focused entirely on just one of those shards, not unlike the solution you came up with.


George V. Reilly

unread,
Mar 21, 2012, 2:20:16 PM3/21/12
to sqlal...@googlegroups.com
Thanks, Michael. 
Reply all
Reply to author
Forward
0 new messages