Client API and adding/removing servers


Leon Mergen

Feb 12, 2009, 3:34:29 AM
to project-voldemort
Hello,

With all the distributed key/value store projects out there nowadays,
it's hard to see the forest for the trees, but it looks like Project
Voldemort is the most suitable for my needs (a plain ol' distributed
key/value store, reliable, with "unlimited" scalability).

I was wondering whether I was correct on these two issues, since I
couldn't find them explicitly addressed in the docs:


1. The client API is currently Java-only, and if you're using it from
any language other than Java, you will have to roll your own
integration solution. If that is so, my next question is: how much
effort do you think it would be to integrate something like
Protocol Buffers or Thrift into Voldemort? It seems only natural to me
to support such a solution, and as far as I can see it would be
somewhat straightforward to implement (although versioning might
require a bit of delicacy to support, and that is an absolute
requirement for me to prevent race conditions). Needless to say, I
would be willing to look into this myself too.

2. Is it correct that adding or removing servers in an existing pool
of running servers requires shutting down the entire pool, updating
the cluster configuration on all servers, and bringing them back
online? Would it make sense to integrate some Paxos-style
configuration synchronization here (for example, by integrating
ZooKeeper, http://hadoop.apache.org/zookeeper/, into Voldemort)?

Other than that, Voldemort seems pretty neat! Are there any examples
of the scale of deployment, for example within LinkedIn? Are any
companies other than LinkedIn using Voldemort in a large-scale
deployment?

Regards,

Leon Mergen

ijuma

Feb 12, 2009, 5:33:04 AM
to project-voldemort
Hi Leon,

On Feb 12, 8:34 am, Leon Mergen <l...@solatis.com> wrote:
> Hello,
>
> With all the distributed key/value store projects out there nowadays,
> it's hard to see the forest for the trees, but it looks like Project
> Voldemort is the most suitable for my needs (a plain ol' distributed
> key/value store, reliable, with "unlimited" scalability).
>
> I was wondering whether I was correct on these two issues, since I
> couldn't find them explicitly addressed in the docs:
>
> 1. The client API is currently Java-only, and if you're using it from
> any language other than Java, you will have to roll your own
> integration solution. If that is so, my next question is: how much
> effort do you think it would be to integrate something like
> Protocol Buffers or Thrift into Voldemort? It seems only natural to me
> to support such a solution, and as far as I can see it would be
> somewhat straightforward to implement (although versioning might
> require a bit of delicacy to support, and that is an absolute
> requirement for me to prevent race conditions). Needless to say, I
> would be willing to look into this myself too.

Jay is currently working on this; see the protobuf branch. From
another mailing list message, I think he's interested in working with
people who want to write clients in other languages to make sure the
solution works well for them.

I will let Jay handle the other question, although it has come up
before, so it might be worth checking the archives.

Best,
Ismael

Leon Mergen

Feb 12, 2009, 6:03:39 AM
to project-voldemort
Hello Ismael,

On Feb 12, 11:33 am, ijuma <ism...@juma.me.uk> wrote:
> > 1. The client API is currently Java-only, and if you're using it from
> > any language other than Java, you will have to roll your own
> > integration solution. If that is so, my next question is: how much
> > effort do you think it would be to integrate something like
> > Protocol Buffers or Thrift into Voldemort?
>
> Jay is currently working on this, see the protobuf branch. From
> another mailing list message, I think he's interested in working with
> people who want to write clients in other languages to make sure the
> solution works well for them.

Thanks for your reply, and I'll see if I have something to say about
that issue.

Regards,

Leon Mergen

Bob Ippolito

Feb 12, 2009, 12:04:58 PM
to project-...@googlegroups.com
On Thu, Feb 12, 2009 at 12:34 AM, Leon Mergen <le...@solatis.com> wrote:
>
> Hello,
>
> With all the distributed key/value store projects out there nowadays,
> it's hard to see the forest for the trees, but it looks like Project
> Voldemort is the most suitable for my needs (a plain ol' distributed
> key/value store, reliable, with "unlimited" scalability).
>
> I was wondering whether I was correct on these two issues, since I
> couldn't find them explicitly addressed in the docs:
>
>
> 1. The client API is currently Java-only, and if you're using it from
> any language other than Java, you will have to roll your own
> integration solution. If that is so, my next question is: how much
> effort do you think it would be to integrate something like
> Protocol Buffers or Thrift into Voldemort? It seems only natural to me
> to support such a solution, and as far as I can see it would be
> somewhat straightforward to implement (although versioning might
> require a bit of delicacy to support, and that is an absolute
> requirement for me to prevent race conditions). Needless to say, I
> would be willing to look into this myself too.

I haven't looked too deeply yet, but it seems that all of the logic
for handling replication, resolving read inconsistencies, and doing
read-repair is in the client... at least by default (there may be some
other option). So a saner protocol alone would be an insignificant
change if you still have to write all of the code that handles vector
clocks, speaks to several servers in parallel efficiently, etc. None
of this is too hard to build, but it's too much of a hurdle to get
started, and there's no way you could do it in something like PHP
without writing another daemon for it to talk to, because a good
implementation needs to handle concurrency well.

The protocol should definitely change. There seem to be a lot of
protocols, and all of them are pretty dumb serializations; e.g. the
"json" protocol (which is actually nothing like JSON) has an arbitrary
32 KB limit for string data since it uses a signed 16-bit integer
length prefix(!!). I can't figure out why you would want so many
different serializations beyond one reasonable way to store raw
bytes. I don't think the servers actually have any code that does
anything with the data beyond storage and retrieval, so all of the
serialization and schema options don't make any sense to me. If you
had a schema for the way the bytes were formatted, I don't see any
reason why you couldn't just store it as a key and let the clients
sort it out rather than putting all of the limitations in the server.
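The 32 KB ceiling described above falls straight out of that framing: a
signed 16-bit length prefix can only express lengths up to 32767. A
minimal Python sketch of such a wire format (the function names and
big-endian byte order here are illustrative assumptions, not
Voldemort's actual code):

```python
import struct

def encode_string(s: str) -> bytes:
    # Hypothetical framing: signed 16-bit big-endian length, then UTF-8 bytes.
    data = s.encode("utf-8")
    if len(data) > 0x7FFF:  # 32767, the largest signed 16-bit value
        raise ValueError("datum exceeds the signed 16-bit length prefix")
    return struct.pack(">h", len(data)) + data

def decode_string(buf: bytes) -> str:
    # Read the 2-byte length, then slice out exactly that many bytes.
    (length,) = struct.unpack(">h", buf[:2])
    return buf[2:2 + length].decode("utf-8")
```

Anything past 32767 bytes simply cannot be framed at all, which is the
arbitrary limit being criticized here.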

> 2. Is it correct that adding or removing servers in an existing pool
> of running servers requires shutting down the entire pool, updating
> the cluster configuration on all servers, and bringing them back
> online? Would it make sense to integrate some Paxos-style
> configuration synchronization here (for example, by integrating
> ZooKeeper, http://hadoop.apache.org/zookeeper/, into Voldemort)?

I have not seen any code in Voldemort that handles adding or removing
servers. There is a little bit of code that will rebalance your
cluster, which is something you would need in this kind of event after
you've managed to update the configuration everywhere, but that code
is commented out and not referenced from anywhere.

> Other than that, Voldemort seems pretty neat! Are there any examples
> of the scale of deployment, for example within LinkedIn? Are any
> companies other than LinkedIn using Voldemort in a large-scale
> deployment?

I've been looking at it for Mochi, but honestly I will probably choose
another solution. I would be surprised if anyone is using Voldemort in
production for data that isn't transient. I built a proof-of-concept
client in Python to play with, but it was a fair amount of work, and
it would take a while longer to polish it up, write some proper tests,
and get the concurrency right (all network I/O is currently
serialized; it only speaks to one server at a time).

-bob

Cliff Moon

Feb 12, 2009, 12:31:47 PM
to project-...@googlegroups.com
Have you looked at Dynomite? I'm the author, and I can tell you that
it currently supports Thrift clients and dynamic adding of nodes, and
it keeps all of the read repair, replication, and concurrency in the
server, keeping the client code as simple as possible. I don't want to
be that guy who shills for his own project on a competing project's
mailing list, but it really seems like it might fit your requirements
better.

Bob Ippolito

Feb 12, 2009, 12:35:56 PM
to project-...@googlegroups.com
I haven't yet looked at Dynomite. It's on my list of Erlang
implementations to look closely at along with Ringo and Kai.

Jay Kreps

Feb 12, 2009, 12:54:11 PM
to project-...@googlegroups.com
Hi Bob,

The logic for routing, reconciliation, failure detection, etc. does
not make assumptions about whether it is on the client or the server.
You are correct that the current wiring-up of layers is for client
side routing, but there is no limitation that says this must be so.
There is a very real trade-off of efficiency vs. client simplicity,
so it is good to support both models (this is what Dynamo does as
well). I completely agree that non-JVM clients should not try to do
any of this; they should just send serialized requests to servers for
routing. This should not require any handling of concurrency on the
client.

If you don't like the json format, that is fine; you need not use it.
I agree the name is not perfect; if you have a better idea then maybe
we can change it. I do think the system should maintain data about
the format of the bytes; my experience is that free-form bytes create
a lot more problems than they solve. That said, there is an identity
serialization type that just gives you byte arrays with no guidance,
if that is what you are looking for. I agree that arbitrary
limitations on sizes are pretty lame, but that is true of many
database types (not an excuse, just a fact).

As for whether you should use it in production, I would think hard
about any new storage system that isn't MySQL, Postgres, Oracle, etc.
There are trade-offs of performance and scalability vs. stability and
track record. If you think other solutions are better, that may be
true. My personal opinion is that storage systems are developed on a
much longer time horizon than things like web frameworks, and
absolutely every free entry in the distributed storage space is very
alpha. LinkedIn is doing a number of production things using
Voldemort, but we are very careful about risk exposure. The current
uses are for problems that have high scalability requirements but
lower data-criticality requirements.

The front page of the website says "It is still a new system which has
rough edges, bad error messages, and probably plenty of uncaught
bugs", and that is true. The focus for us is moving it up the ladder
of trustworthiness and making it a better system, usable from more
languages, with fewer arbitrary limitations and bugs. If it isn't
where you need it to be now, then that is completely understandable.

Cheers,

-Jay

Bob Ippolito

Feb 12, 2009, 2:48:08 PM
to project-...@googlegroups.com
On Thu, Feb 12, 2009 at 9:54 AM, Jay Kreps <jay....@gmail.com> wrote:
>
> Hi Bob,
>
> The logic for routing, reconciliation, failure detection, etc. does
> not make assumptions about whether it is on the client or the server.
> You are correct that the current wiring-up of layers is for client
> side routing, but there is no limitation that says this must be so.
> There is a very real trade-off of efficiency vs. client simplicity,
> so it is good to support both models (this is what Dynamo does as
> well). I completely agree that non-JVM clients should not try to do
> any of this; they should just send serialized requests to servers for
> routing. This should not require any handling of concurrency on the
> client.

It would be helpful to include an example configuration that does
server routing, to make it easier for people evaluating the project.
I've done all of my research and my Python implementation by talking
to the single_node_cluster config and reading source code from the
client all the way up the stack. It seems to me that, given a client
that does routing, the server doesn't need to do much at all, which is
pretty cool... just difficult to implement correctly :)

> If you don't like the json format, that is fine; you need not use it.
> I agree the name is not perfect; if you have a better idea then maybe
> we can change it. I do think the system should maintain data about
> the format of the bytes; my experience is that free-form bytes create
> a lot more problems than they solve. That said, there is an identity
> serialization type that just gives you byte arrays with no guidance,
> if that is what you are looking for. I agree that arbitrary
> limitations on sizes are pretty lame, but that is true of many
> database types (not an excuse, just a fact).

I did see that the identity serialization does the right thing and is
used by the metadata store. The test configuration uses the "json"
encoding with a "string" schema, which does not make Voldemort look
good because of the arbitrary limitations and the use of a length
prefix on a single datum that is already delimited by other means.

I would highly recommend using a name that has Voldemort in it, since
it's a serialization that was made for this project and is not
otherwise in common use (at least in open source). Perhaps "Voldemort
Schema" or something like that? I would probably just drop it
altogether for real JSON and/or something like Protocol Buffers or
Thrift. As far as serializations go, I don't see any real size benefit
to using Voldemort's JSON serialization over actual JSON unless you
have strings that contain lots of natural backslash characters...
e.g. actual JSON has a two-byte overhead for strings but no size
limitation, although I concede that it's most likely slower to parse
(but is that a real bottleneck?). I guess there is the fact that
Voldemort defines a serialization for datetimes, but you could do the
same as an extension to actual JSON.

> As for whether you should use it in production, I would think hard
> about any new storage system that isn't MySQL, Postgres, Oracle, etc.
> There are trade-offs of performance and scalability vs. stability and
> track record. If you think other solutions are better, that may be
> true. My personal opinion is that storage systems are developed on a
> much longer time horizon than things like web frameworks, and
> absolutely every free entry in the distributed storage space is very
> alpha. LinkedIn is doing a number of production things using
> Voldemort, but we are very careful about risk exposure. The current
> uses are for problems that have high scalability requirements but
> lower data-criticality requirements.

I agree, all of this stuff is new. I'm just saying that I would be
surprised if someone was using Voldemort as-is for long-term
persistence, because it doesn't appear to have features like adding
or removing nodes from the cluster online... and it just seems like
you would use it for data that expires, because it has options to do
that.

Jay Kreps

Feb 12, 2009, 3:50:12 PM
to project-...@googlegroups.com
I think your main point is that we have not done a good job of
documenting what is done, what is almost done, and what is not
planned. This gives the impression that something works when it
doesn't, or that something is possible when it is not. This is a very
good criticism. The minimum documentation for a viable open source
project is much, much higher than for an in-house project (since you
can't just swing by my desk and ask questions), and this is something
we need to correct.

I think you may have some confusion about the json format; if that is
the case, it is just due to our bad documentation. The advantage of
our voldemort/json serialization over real JSON is compactness, type
checking, and a full mapping from string <=> bytes. Consider a real
JSON value:

{
  "member_id": 12345,
  "first_name": "jay",
  "last_name": "kreps",
  "join_date": "",
  "group_ids": [4324434, 234335, 4545454, 234676, 454567, 3434676]
}

If you were to store data like this using real JSON blobs, you could
do that by just storing the JSON bytes (say, using the string
serialization type in Voldemort, which has no length limits). The
problem is that the following skeleton is repeated in every row:
{
  "member_id": ,
  "first_name": ,
  "last_name": ,
  "join_date": ,
  "group_ids": [, , , , , ]
}

Problems with this:
1. Your data is between 1/2 and 2/3 overhead, and your numbers are
being stored as N-byte character strings instead of fixed-byte
representations.
2. You can easily bork your data by misspelling a field.
3. Parsing is more expensive (probably not a huge deal).

We solve this by checking the data supplied by the user against the
schema stored with the table, and then storing only the data itself,
not the field names.

Looking at real tables we have, real JSON or XML is not a great
solution because of the abundance of ids. Using these formats, the
data blow-up is quite large, since we store the field name over and
over and explode the ids into character strings. To fix this you will
be tempted to compress all your field names into little one- or two-character
abbreviations, or to gzip the whole thing, but this kills
the readability that was the selling point of a text format in the
first place.

We solve this by separating the JSON into a schema, which stores all
the overhead and is specified only once for a given table, and the
data, which is stored in every row.

Schema (stored only once per store):
{
  "member_id": "int32",
  "first_name": "string",
  "last_name": "string",
  "join_date": "date",
  "group_ids": ["int32"]
}

In addition, all of these fields are parsed into proper types for you
on the client side (which is nice; parsing dates sucks).
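The space saving described above can be made concrete with a toy codec
(my own illustrative code, not Voldemort's actual serializer; the
field order and byte layout are assumptions). Field names live only in
the schema, so each row carries just fixed-width values and short
length prefixes:

```python
import json
import struct

# Hypothetical schema, mirroring the example above; stored once per store.
SCHEMA = {"member_id": "int32", "first_name": "string",
          "last_name": "string", "group_ids": ["int32"]}

def pack_row(schema, row):
    """Serialize values in schema order: no field names per row,
    int32s as fixed 4-byte values, strings with a 2-byte length prefix."""
    out = b""
    for field, ftype in schema.items():
        value = row[field]
        if ftype == "int32":
            out += struct.pack(">i", value)
        elif ftype == "string":
            data = value.encode("utf-8")
            out += struct.pack(">H", len(data)) + data
        elif isinstance(ftype, list):  # homogeneous list, e.g. ["int32"]
            out += struct.pack(">H", len(value))
            out += b"".join(struct.pack(">i", v) for v in value)
    return out

row = {"member_id": 12345, "first_name": "jay", "last_name": "kreps",
       "group_ids": [4324434, 234335, 4545454, 234676]}
packed = pack_row(SCHEMA, row)
as_json = json.dumps(row).encode("utf-8")
# packed is a small fraction of the size of the full JSON text,
# because the field names and digit strings are gone.
```

A real implementation would also need the type checking and date
handling mentioned above; the point is only that the per-row bytes
carry no schema.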

So it is true that a given field (e.g. first_name) is limited to 16k,
but this is not unique to the database world. We could have expanded
this limit (or probably even gotten rid of it), but I don't know that
it is quite the crippling limitation that you think it is. Certainly
if you are trying to store large media blobs then this is not the way
to go for you.

Code-generation approaches like Protocol Buffers or Thrift are also
excellent and, thanks to user contributions, are also supported. One
downside of these approaches is that you cannot easily have a command-line
client or query page that supports updating your data, since
there is no mapping from a string representation to a Protocol
Buffers object (though there may be such a thing for Thrift?).

-Jay

Bob Ippolito

Feb 12, 2009, 4:27:11 PM
to project-...@googlegroups.com
Yeah, that certainly applies if you're serializing JSON objects, due
to the property overhead... To be fair, you could do the same thing
to create a mapping from an object-free subset of JSON
(array|number|string|boolean|null) to friendly JSON objects.

We do that internally in a couple of places to turn arrays of
homogeneous objects (e.g. database result sets) into a compact array
of arrays, with an array that lists the properties that should be
zipped with the data to make objects (if the client chooses to do
so). We actually do it only if the client sends an X-header saying
they want the data formatted compactly. A store could do something
similar, such that it will accept and return either verbose or
compact representations of the data depending on what the client asks
for... this would allow for a client that doesn't need to read the
metadata store at all.