re: shards

8 views

Skip to first unread message

Max Ross

unread,

Sep 17, 2008, 12:16:21 AM9/17/08

to Danny Antonetti, hibernate-...@googlegroups.com

Hi Danny,

I'm answering your question using my personal gmail and cc-ing the dev group so that others can benefit from the discussion. Hope you don't mind.

Great question! Writing to a master and reading from a slave is a common technique that definitely increases the capacity of a db-backed system. I've seen it done with Hibernate with good results a number of times. Making it very easy to accomplish this with Hibernate sounds worthwhile, but my initial feeling is that Shards isn't necessarily the right place for this. Shards' focus is on increasing scalability using horizontal partitioning, and the feature you're proposing is more about increasing scalability using replication. Shards assumes that data only lives on a single shard. This feature would assume that all data lives on all shards. If you implemented this it could certainly be used within Shards, and no doubt some projects would benefit from it. Whether you'd add the logic at the SessionFactory, the Session, or even the Connection layer...I'm not sure. But no matter which one you chose it would still plug in to Shards pretty seamlessly.

That said, we do need to think about replication within Shards because of The Static Data Problem. This has been eating at me for awhile, and if you're interested in working on this I'd love to hear your thoughts. Basically, it's pretty common for an application to have what I call static data: enums, country codes, etc. Tables that contain slow-changing data that is basically there to provide referential integrity. These tables pose a problem for Shards because they either need to live on a single shard (no referential integrity), or they need to live on all shards (violates our assumption that data lives on only one shard). I think referential integrity trumps our design assumptions, so we need to figure out how to deal with slow-changing data that lives on all shards.

One thought I had was to allow users to annotate the mapping for a class as "replicated." This would tell us that instances of this class live on all shards instead of just one. There are a number of implementation details that need to be worked out:
Should we prevent cascades from non-replicated to replicated classes? If so, how? If not, how do we make sure the cascade is applied to an instance associated with the same shard as the one on which the upstream save is taking place?
When someone looks up an instance of a replicated class by id, which shard do we hit?
And so on.

Does this sound interesting to you?

Max
-----------------------------------------------------------------------------------------------------------------------
Danny wrote:

I had a question.

I was thinking about the following feature and I was curious if it had been considered, or if there were other fundamental problems with it.

In most production systems the database is setup as a master/slave, so each shard would probably have a master and one or more slaves.
So would it be possible to add a feature where the master is used for writing and the slave is used for reading.

The only problem that I can think of would be if the slave was out of date for a given object as it was written to the master, but this would addressed by requiring (or highly suggesting) that every object has to have a Version field, and or querying the slave to verify that it is up to date.

I feel like I pretty much understand most of the code.
I think I am ready to take a small task.
I am not by any means an expert at sharded databases, so I may need some help working on these tasks.

Thanks

Danny

Danny Antonetti

unread,

Sep 17, 2008, 10:55:14 AM9/17/08

to Max Ross, hibernate-...@googlegroups.com

On Tue, Sep 16, 2008 at 9:16 PM, Max Ross <max....@gmail.com> wrote:

Hi Danny,

I'm answering your question using my personal gmail and cc-ing the dev group so that others can benefit from the discussion. Hope you don't mind.

Great question! Writing to a master and reading from a slave is a common technique that definitely increases the capacity of a db-backed system. I've seen it done with Hibernate with good results a number of times. Making it very easy to accomplish this with Hibernate sounds worthwhile, but my initial feeling is that Shards isn't necessarily the right place for this. Shards' focus is on increasing scalability using horizontal partitioning, and the feature you're proposing is more about increasing scalability using replication. Shards assumes that data only lives on a single shard. This feature would assume that all data lives on all shards. If you implemented this it could certainly be used within Shards, and no doubt some projects would benefit from it. Whether you'd add the logic at the SessionFactory, the Session, or even the Connection layer...I'm not sure. But no matter which one you chose it would still plug in to Shards pretty seamlessly.

I was thinking that you could have N shards, each with 1 master and M slaves, for a total of N * (1+M) databases.
But I see what you mean, this might be something that might be more appropriate at a different level.

That said, we do need to think about replication within Shards because of The Static Data Problem. This has been eating at me for awhile, and if you're interested in working on this I'd love to hear your thoughts. Basically, it's pretty common for an application to have what I call static data: enums, country codes, etc. Tables that contain slow-changing data that is basically there to provide referential integrity. These tables pose a problem for Shards because they either need to live on a single shard (no referential integrity), or they need to live on all shards (violates our assumption that data lives on only one shard). I think referential integrity trumps our design assumptions, so we need to figure out how to deal with slow-changing data that lives on all shards.

One thought I had was to allow users to annotate the mapping for a class as "replicated." This would tell us that instances of this class live on all shards instead of just one. There are a number of implementation details that need to be worked out:
Should we prevent cascades from non-replicated to replicated classes? If so, how? If not, how do we make sure the cascade is applied to an instance associated with the same shard as the one on which the upstream save is taking place?
When someone looks up an instance of a replicated class by id, which shard do we hit?
And so on.

I would be interested in working on this, it sounds like a Meaty problem :)

It seems like we should start out preventing the cascade, and then make it configurable as people come up with usecases.
There are 11 different kinds of cascades in hibernate, we would not necessarily need to handle all of them right now.
http://www.hibernate.org/hib_docs/annotations/api/org/hibernate/annotations/CascadeType.html

What happens if we perform an update and one of the shards has an error?
The Replicated Data would be in an inconsistent state across all of the shards.

I will look at it more tonight.

Thanks

Danny

Reply all

Reply to author

Forward

0 new messages