socket client access with python

hb

Jan 21, 2009, 8:33:54 AM
to project-voldemort
I am trying to access the Voldemort server over its socket interface from
Python, but I couldn't find any information about the data protocol. I
looked at StreamStoreRequestHandler.java to figure out the request
format, but frankly I haven't had any success making a call from Python.
Are there any pointers on the protocol format?

By the way, thanks a lot for your contribution.
Haldun.

Jay Kreps

Jan 21, 2009, 10:20:59 AM
to project-...@googlegroups.com
Hi Haldun,

The protocol is not very well documented right now, but I do want to
support clients that aren't in java. Rather than document the protocol
and try to maintain n clients in sync as we evolve the protocol I was
thinking that the best approach would be to change the on-wire
serialization format for things like the vector-clock versions to use
protocol buffers. I don't really know the performance implications of
this, but a coworker of mine did some evaluation and it didn't look at
all unreasonable
(http://eishay.blogspot.com/2008/11/protobuf-with-option-optimize-for-speed.html).

The default for the java client is to handle a lot of the routing and
failure responsibility, but an initial python client need not be so
ambitious. An initial version could do no client side routing and use
the HTTP-based protocol we have. So a client would look like this:

1. Connect to one of a list of "bootstrap servers" and do an
HTTP-based voldemort request to fetch the cluster and store metadata,
which gives the list of nodes and stores
2. Use protocol buffers code to deserialize the vector clock version
that comes with values and handle the fact that many versions might be
returned
3. Round robin requests over servers and let the server handle correct
routing and version reconciliation

The responsibility of the minimum viable client would be constructing
the request/response objects and issuing HTTP GET, POST, and DELETE
operations for the voldemort get, put, and delete operations. This
would be a fairly achievable implementation effort and would still let
us evolve the protocol (since protocol buffers is fairly good at
compatibility).
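
Just to make the shape of that concrete, here is a rough sketch in Python. The URL layout, the use of the request body for the serialized payload, and URL-safe string keys are all just assumptions for illustration, not the actual HTTP format we use today:

# Rough sketch only: a "dumb" client that does no client-side routing and
# just round-robins HTTP requests over the cluster nodes. The URL scheme
# and the use of the request body for the serialized payload are
# assumptions, not the real Voldemort HTTP protocol.
import itertools
from http.client import HTTPConnection

class DumbVoldemortClient:
    def __init__(self, nodes, store_name):
        # nodes: list of (host, port) pairs taken from the bootstrap
        # metadata described in step 1
        self.nodes = itertools.cycle(nodes)
        self.store_name = store_name

    def _request(self, method, path, body=None):
        host, port = next(self.nodes)       # step 3: round robin
        conn = HTTPConnection(host, port)
        conn.request(method, path, body)
        response = conn.getresponse()
        data = response.read()
        conn.close()
        return response.status, data

    def get(self, key):
        # the response body would carry protocol-buffers-encoded versioned
        # values that the caller still has to deserialize (step 2);
        # assumes keys are URL-safe strings for this sketch
        return self._request("GET", "/%s/%s" % (self.store_name, key))

    def put(self, key, serialized_versioned_value):
        return self._request("POST", "/%s/%s" % (self.store_name, key),
                             serialized_versioned_value)

    def delete(self, key):
        return self._request("DELETE", "/%s/%s" % (self.store_name, key))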

I have put a little thought into this, but didn't want to start until
I had a potential non-java client to try out. It sounds like you are
digging around and ready to write some code. If I created a branch in
which I swapped out the current protocol for protocol buffers objects
would you be interested in taking a shot at a python client based on
that?

-Jay

Here would be a very rough attempt at a protocol buffers protocol:

package voldemort;

option java_package = "voldemort.serialization.version";

message ClockEntry {
  required int32 node_id = 1;
  required int64 version = 2;
}

message VectorClock {
  repeated ClockEntry entries = 1;
  optional int64 timestamp = 2;
}

message Versioned {
  required bytes value = 1;
  required VectorClock version = 2;
}

message Error {
  required int32 error_code = 1;
  required string error_message = 2;
}

message KeyedVersions {
  required bytes key = 1;
  repeated Versioned versions = 2;
}

message GetRequest {
  optional int32 from_node = 1 [default = -1];
  optional bool should_route = 2 [default = false];
  repeated bytes ids = 3;
}

message GetResponse {
  repeated KeyedVersions items = 1;
  optional Error error = 2;
}

message PutRequest {
  optional int32 from_node = 1 [default = -1];
  optional int32 to_node = 2 [default = -1];
  optional bool should_route = 3 [default = false];
  optional bytes key = 4;
  optional Versioned versioned = 5;
}

message PutResponse {
  optional Error error = 1;
}

message DeleteRequest {
  optional int32 from_node = 1 [default = -1];
  optional int32 to_node = 2 [default = -1];
  optional bool should_route = 3 [default = false];
  optional bytes key = 4;
  optional VectorClock version = 5;
}

message DeleteResponse {
  optional bool success = 1;
  optional Error error = 2;
}

enum RequestType {
  GET = 0;
  PUT = 1;
  DELETE = 2;
}

message VoldemortRequest {
  required RequestType type = 1;
  optional GetRequest get = 2;
  optional PutRequest put = 3;
  optional DeleteRequest delete = 4;
}
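
If that schema were compiled with protoc --python_out, the Python side could build and decode messages roughly like this (the voldemort_pb2 module name and the framing of the bytes on the wire are just placeholders):

# Sketch only: assumes the .proto above was compiled to a voldemort_pb2
# module with "protoc --python_out=. voldemort.proto". How the serialized
# bytes are framed on the wire (HTTP body, length-prefixed socket message,
# ...) is a separate question.
import voldemort_pb2

def build_get_request(key):
    request = voldemort_pb2.VoldemortRequest()
    request.type = voldemort_pb2.GET
    request.get.ids.append(key)          # keys are raw bytes
    return request.SerializeToString()

def decode_get_response(payload):
    response = voldemort_pb2.GetResponse()
    response.ParseFromString(payload)
    results = {}
    for item in response.items:
        # many versions may come back; reconciliation is up to the caller
        results[item.key] = [(v.value, [(e.node_id, e.version)
                                        for e in v.version.entries])
                             for v in item.versions]
    return results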

Dan Diephouse

Jan 21, 2009, 2:22:47 PM
to project-...@googlegroups.com
Hi Jay,

Great work on Voldemort. I was able to get it up and running no problem.

Regarding serialization, I wanted to point you to Jackson: http://www.cowtowncoder.com/hatchery/jackson/

It's a fast, streaming JSON implementation for Java. It probably can't compare to Protocol Buffers ultimately, but it has a different purpose. It could make your life in voldemort.serialization.json easier too. If nothing else it deals with Maps well, whereas I noticed I had to switch to the java serialization mechanism in my stores.xml to be able to use Maps as a value type.

Dan

Jay Kreps

Jan 21, 2009, 6:11:42 PM
to project-...@googlegroups.com
Hmm that seems very nice. For network communication I want to stick
with a binary protocol even though that is somewhat frowned upon by
the world. The rationale is that in most systems network is going to
be 100% of the bottleneck and so adding overhead here just does not
make sense. (see, e.g., how memcached is having to make this switch:
http://lists.danga.com/pipermail/memcached/2007-July/004636.html).

For serializing application data this could definitely be quite
useful. Indeed I was so frustrated with the various JSON parsers I
tried that I ended up writing my own, which is really not code I
necessarily want to maintain.

That said, could you tell me what problem you found with
serializing maps? We have a bunch of unit tests on serializing
maps and are using that in production without any issues, so if you
have found a bug I definitely want to fix it.

-jay

Dan Diephouse

Jan 22, 2009, 12:01:17 AM
to project-...@googlegroups.com
I had tried something like:

<store>
...
    <value-serializer>
      <type>json</type>
      <schema-info>"map"</schema-info>
    </value-serializer>
  </store>

After that first naive attempt didn't work I went and looked at the sources and didn't see anything for JSON map serialization. I didn't look overly hard though. Instead I just switched to the java-serialization mechanism.

Suggestions?

Dan

Jay Kreps

Jan 22, 2009, 12:44:45 AM
to project-...@googlegroups.com
So the format for specifying maps is {}. E.g. the following would
specify the type of a map that contains user information:

{
  "member_id":"int64",
  "fname":"string",
  "lname":"string",
  "joined":"date"
}

The format for lists is [], so a list of integers is

["int32"]

Note that this is not free-form JSON. This has pluses and minuses. The
primary plus is that it is much, much more compact, since field names
are stored once per store and integers are stored as 4 bytes rather
than as long character strings (so your data shrinks by almost an order
of magnitude). It also keeps you from corrupting your data if you push
code with a typo somewhere.
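
So the value-serializer from your earlier message would look something like this for that user-information map (a sketch based on the snippet you posted; I'm assuming whitespace inside schema-info is fine):

<value-serializer>
  <type>json</type>
  <schema-info>
    {
      "member_id":"int64",
      "fname":"string",
      "lname":"string",
      "joined":"date"
    }
  </schema-info>
</value-serializer>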

There is a little more documentation buried in the design doc on the
site, but really it needs to be broken into its own page so it is
easier to find. Sorry if that was confusing.

-Jay

Elias Torres

Jan 30, 2009, 10:10:42 PM
to project-voldemort
Jay,

I guess we never heard back from Haldun. I'm interested in using
Voldemort too from a non-java client (Erlang specifically), but
without a socket specification I'm stuck for now. Is there another
way I can get started?

I also read your section on batching with Hadoop. I'm currently doing
something very similar: batch process our data in Hadoop, store it in S3
and cache it using varnish. I wanted to go with something like memcached,
couchdb, thriftdb, etc., but nothing really seemed suitable for the big
batch load with persistence too. Since you have been doing it already
with Voldemort I'd be willing to give it a try once you open-source that
part too.

Elias

On Jan 21, 10:20 am, Jay Kreps <jay.kr...@gmail.com> wrote:
> Hi Haldun,
>
> The protocol is not very well documented right now, but I do want to
> support clients that aren't in java. Rather than document the protocol
> and try to maintain n clients in sync as we evolve the protocol I was
> thinking that the best approach would be to change the on-wire
> serialization format for things like the vector-clock versions to use
> protocol buffers. I don't really know the performance implications of
> this, but a coworker of mine did some evaluation and it didn't look at
> all unreasonable
> (http://eishay.blogspot.com/2008/11/protobuf-with-option-optimize-for-...).

bhupesh bansal

Jan 30, 2009, 10:47:41 PM
to project-...@googlegroups.com
Hey Elias,

We wrote a read-only, random-access-file-based store implementation for Voldemort.
We are building the files for each partition in Hadoop and scp'ing them to Voldemort nodes for online queries.

We saw numbers of about 18,000 - 20,000 reads per second with a JVM of 2G and a data size of about 8G (35 million keys).
As the data size increases we will be bounded primarily by disk seeks.

We are in the process of doing some performance evaluations and will make it available to the community in the near future.

Best
Bhupesh

Elias Torres

Jan 30, 2009, 10:55:50 PM
to project-...@googlegroups.com
Hi Bhupesh,

Sounds like a plan. I'm also storing about 35M keys, about 1K each in
value, in S3 in chunks, so I only do around 16K writes.
I'm assuming swapping read-only files is seamless in Voldemort.

Looking forward to it, except my front end to that is PHP, so a
socket-based protocol (binary is fine) would be great.

-Elias

Jay Kreps

Jan 31, 2009, 1:16:24 PM
to project-...@googlegroups.com
Hey Elias,

We didn't ever hear from Haldun, but there is an engineer at LinkedIn
who does a lot of our Ruby work who wants to take a shot at a Ruby
client. I think things are currently blocked on my converting the
network protocol to protocol buffers or thrift (which is fairly easy
to do, it just requires testing), and then it should be reasonably
easy to generate starter code for doing a new client which either does
an HTTP post with the serialized payload or uses sockets to do the
same.

Do you know the state of erlang support in thrift and protocol
buffers? I see something for Thrift about Erlang but nothing for
protocol buffers.

My inclination was to go with protocol buffers for the network
protocol since I have used it and it has a very nice api and actual
documentation. However I have been told the support for php and ruby
is not that good and I have no idea about erlang.

-Jay

Jay Kreps

Jan 31, 2009, 1:38:57 PM
to project-...@googlegroups.com
To answer your question about the read only store here is some more
info. There are a few problems we are trying to solve:
1. Building a b-tree can be very expensive, and much of the
"re-balancing" ability a b-tree provides is not that helpful if your
data is read-only
2. Doing the building on the live servers (e.g. by doing millions of
inserts) will really impact the performance of the database for users
3. Often the amount of time required to create the data offline
is fairly large (say 8 hours), so if you push out data that is bad in
some way (say due to a bug in your backend job), then it will take you
8 hours to recreate good data once you have fixed the bug.

Here is the solution we are working on:

1. You build a foo.data and foo.index file for each node containing
the data in store "foo" that the given node is responsible for. The
format is that .index is a file of sorted key MD5 hashes along with
the position of the data in the .data file; the .data is a file
containing the corresponding values at each position. This seems to be
fairly efficient for lookups, especially for large values. The real
advantage is that external sort seems to be faster than b-tree
building, and easy to do in a hadoop cluster.
2. The live voldemort system has N versions of the data, so for the
index it has foo.index, foo.index.1, ..., foo.index.N. These are your
backups. In the simple case there is only one backup.
3. You rsync the data out to the appropriate voldemort nodes,
overwriting foo.index.N.
4. You call a swap method on voldemort servers that shuffles all the
versions so that the newly rsync'd files are now the live file (e.g.
foo.index) and the previous live file is now the first backup
(foo.index.1). This is transparent to the live system.

So the advantage is that the only overhead on the live system is from rsync,
which is much lighter than building a massive many-GB b-tree; you can
roll back easily if anything goes wrong in the offline system; and you
can build big read-only datasets more quickly. We think this is
pretty cool because we haven't really seen any good answers for what
to do once you have computed TBs of data on your Hadoop cluster
without this...
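
To make the swap in step 4 concrete, here is a rough sketch of the rotation for a single store's index file. This is just an illustration, not the actual implementation, which also has to coordinate with in-flight reads and handle the .data files the same way:

# Illustration only -- not Voldemort's implementation. The freshly pushed
# file (e.g. foo.index.N) becomes the live file (foo.index), the old live
# file becomes the first backup (foo.index.1), and older backups shift down.
import os

def swap(store_dir, name, num_backups):
    live = os.path.join(store_dir, name)      # e.g. .../foo.index
    pushed = "%s.%d" % (live, num_backups)    # the freshly rsync'd file
    tmp = live + ".swapping"
    os.rename(pushed, tmp)                    # take it out of the backup chain
    for i in range(num_backups - 1, 0, -1):   # foo.index.1 -> foo.index.2, ...
        src = "%s.%d" % (live, i)
        if os.path.exists(src):
            os.rename(src, "%s.%d" % (live, i + 1))
    if os.path.exists(live):
        os.rename(live, live + ".1")          # old live file is now the backup
    os.rename(tmp, live)                      # new data is now live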

Here is what we have open sourced in voldemort so far:

1. The read-only store type which supports reads off of these files
and swapping back and forth as described above.
2. A non-distributed java program which does the data partitioning and
builds these stores for a cluster. You can see a shell script wrapper
for this program under bin/build-readonly-store.sh.

What we haven't yet open-sourced is the Hadoop job which does the
build on a cluster of machines or the rsync script which does the push
operation. The reason is that it is currently entangled with our
Hadoop infrastructure which we aren't planning on open sourcing, so
there is a little surgery to do to separate it out.

It should also be noted that the serialization supported so far is
only the "json" type. A small amount of additional work will be needed
to make this user defined.

This is all in the early stages, so any thoughts or experience people
have in this area would be very helpful.

-Jay

Elias Torres

Jan 31, 2009, 3:18:18 PM
to project-...@googlegroups.com
On Sat, Jan 31, 2009 at 1:16 PM, Jay Kreps <jay....@gmail.com> wrote:
>
> Hey Elias,
>
> We didn't ever hear from Haldun, but there is an engineer at LinkedIn
> who does a lot of our Ruby work who wants to take a shot at a Ruby
> client. I think things are currently blocked on my converting the
> network protocol to protocol buffers or thrift (which is fairly easy
> to do, it just requires testing), and then it should be reasonably
> easy to generate starter code for doing a new client which either does
> an HTTP post with the serialized payload or uses sockets to do the
> same.
>
> Do you know the state of erlang support in thrift and protocol
> buffers? I see something for Thrift about Erlang but nothing for
> protocol buffers.

Thrift has what seems to be decent Erlang support. The downside is that
it is a direct conversion of the C++ so it's not very Erlang-ish, but there
was another rewrite that was more efficient. However, I'm having trouble
finding the latest git repository for it.

>
> My inclination was to go with protocol buffers for the network
> protocol since I have used it and it has a very nice api and actual
> documentation. However I have been told the support for php and ruby
> is not that good and I have no idea about erlang.

I say definitely go with your inclination, since you're the main coder. I found:

http://github.com/ngerakines/erlang_protobuffs/tree/master

But it's basic. It doesn't support enums or nested messages, but I guess
it could be made to work if necessary.

I don't know, but from what I have seen in Hadoop land the RPC should
be pluggable; the sooner you write the first implementation, the easier it
will be to make it pluggable. For now I'll just wait until we see the
other branch (pb/whatever).

Thanks!

Elias Torres

Jan 31, 2009, 3:35:44 PM
to project-...@googlegroups.com
On Sat, Jan 31, 2009 at 1:38 PM, Jay Kreps <jay....@gmail.com> wrote:
>
> To answer your question about the read only store here is some more
> info. There are a few problems we are trying to solve:
> 1. Building a b-tree can be very expensive, and much of the
> "re-balancing" ability a b-tree provides is not that helpful if your
> data is read-only
> 2. Doing the building on the live servers (e.g. by doing millions of
> inserts) will really impact the performance of the database for users
> 3. Often times the amount of time required to create the data offline
> is fairly large (say 8 hours), so if you push out data that is bad in
> some way (say do to a bug in your backend job), then it will take you
> 8 hours to recreate good data once you have fixed the bug.

I hear you. My job is shorter (2+ hours). Thanks for the clarification
that you don't update straight from Hadoop into Voldemort. I figured
as much.

>
> Here is the solution we are working on:
>
> 1. You build a foo.data and foo.index file for each node containing
> the data in store "foo" that the given node is responsible for. The
> format is that .index is a file of sorted key MD5 hashes along with
> the position of the data in the .data file; the .data is a file
> containing the corresponding values at each position. This seems to be
> fairly efficient for lookups, especially for large values. The real
> advantage is that external sort seems to be faster than b-tree
> building, and easy to do in a hadoop cluster.

I'm assuming you store the position and length, or are you using fixed-size
records? Also, is it a binary search on the index? Is the index binary or a
text list of MD5 hashes?

> 2. The live voldemort system has N versions of the data, so for the
> index it has foo.index, foo.index.1, ..., foo.index.N. These are your
> backups. In the simple case there is only one backup.
> 3. You rsync the data out to the appropriate voldemort nodes,
> overwriting foo.index.N.
> 4. You call a swap method on voldemort servers that shuffles all the
> versions so that the newly rsync'd files are now the live file (e.g.
> foo.index) and the previous live file is now the first backup
> (foo.index.1). This is transparent to the live system.
>
> So the advantage is the only overhead on the live system is from rsync
> which is much lighter than building a massive many gb b-tree, and you
> can rollback easily if anything goes wrong in the offline system, plus
> you can build big readonly datasets more quickly. We think this is
> pretty cool because we haven't really seen any good answers for what
> to do once you have computed TBs of data on your Hadoop cluster
> without this...

This was more or less my initial plan, with a few modifications. I wanted to
partition each version into chunks so I didn't have to merge into a
single file in Hadoop. Then use consistent hashing to spread the load
and route between the machines having the partitions. Then rsync the
partitions and have something serve it. I was looking at other file
formats such as cdb (http://cr.yp.to/cdb.html). My dilemma at that
point was what I was going to use to serve the contents. At my company
(Lookery) we love simplicity so I proposed going with an HTTP-only
route after looking at CouchDB, ThruDb, persistent-memcached. S3
stores chunks, php middle-tier looks up into chunks and varnish caches
those responses. S3 allows us to "swap" chunks because it won't serve
the new version until it has fully committed the PUT so it works
great.

>
> Here is what we have open sourced in voldemort so far:
>
> 1. The read-only store type which supports reads off of these files
> and swapping back and forth as described above.
> 2. A non-distributed java program which does the data partitioning and
> builds these stores for a cluster. You can see a shell script wrapper
> for this program under bin/build-readonly-store.sh.

I'll take a look.

>
> What we haven't yet open-sourced is the Hadoop job which does the
> build on a cluster of machines or the rsync script which does the push
> operation. The reason is that it is currently entangled with our
> Hadoop infrastructure which we aren't planning on open sourcing, so
> there is a little surgery to do to separate it out.

That's OK. I understand what you have now and I have enough from my
own Hadoop infrastructure to create the read-only files.

>
> It should also be noted that the serialization supported so far is
> only the "json" type. A small amount of additional work will be needed
> to make this user defined.

I also use JSON for my backend.

>
> This is all in the early stages, so any thoughts or experience people
> have in this area would be very helpful.

No problem. I'd love to collaborate on this stuff. Right now we use
HAproxy in front of our varnish servers, so if we had voldemort
swapping (partitioned files + redundancy) we could go straight from
our front-end to the voldemort cluster and not need my other moving parts,
although HTTP/REST (caching specifically) is a beautiful thing.

However, I'm more interested in Voldemort to support a more realtime
system. I would like to keep track of lists stored in Voldemort directly from
our log activity and surface them via a comet endpoint. That's why I
need the socket-based API for Erlang and PHP. This past week I have
made progress with Scalaris, but you show much better numbers in Java
(2,500 req/sec vs. 16K/sec on a single node) ;-).

-Elias

Jay Kreps

Jan 31, 2009, 4:59:00 PM
to project-...@googlegroups.com
The exact format of the files is:

The index file stores the following for each key:
  16-byte binary MD5
  8-byte position offset of the value in the data file

The entries in the index are in sorted order. The MD5 is the exact
bytes returned by Java's built-in MD5.

The data file stores the following, repeated for each value:
  4-byte value size
  variable-length value blob

A binary search goes against the index (with caching) to locate the
correct blob in the data file; we then seek to that position and read.
The values are whatever bytes were stored, so depending on the
serialization used for the store this could be text or anything else.
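
If you want to read these files from outside Java, a lookup would look roughly like the Python sketch below. It assumes the MD5s sort as unsigned bytes and that the integers are big-endian (DataOutputStream's format), and it leaves the caching out:

# Sketch of a lookup against the foo.index / foo.data layout described above.
# Assumes big-endian integers and MD5s sorted as unsigned bytes; no caching.
import hashlib
import struct

ENTRY_SIZE = 24  # 16-byte MD5 + 8-byte data-file offset per index entry

def lookup(index_file, data_file, key_bytes):
    target = hashlib.md5(key_bytes).digest()
    index_file.seek(0, 2)
    low, high = 0, index_file.tell() // ENTRY_SIZE - 1
    while low <= high:                       # binary search over sorted MD5s
        mid = (low + high) // 2
        index_file.seek(mid * ENTRY_SIZE)
        entry_md5, offset = struct.unpack(">16sQ", index_file.read(ENTRY_SIZE))
        if entry_md5 == target:
            data_file.seek(offset)
            (size,) = struct.unpack(">I", data_file.read(4))
            return data_file.read(size)      # the raw value blob
        elif entry_md5 < target:
            low = mid + 1
        else:
            high = mid - 1
    return None

# usage:
#   with open("foo.index", "rb") as idx, open("foo.data", "rb") as dat:
#       value = lookup(idx, dat, serialized_key)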

Currently there is only 1 data and index file per store per node. This
is not ideal because our hadoop cluster is bigger than our voldemort
cluster so it is somewhat under-used in the reduce phase where it is
sorting and building the index and data files, since the number of
reducers is fixed to the number of voldemort nodes. It is fairly easy
to generalize this so that multiple index and data files are used per
node, but this does slightly complicate the push logic. For the first
version we just want to get it out there since it is already much
better than all our current solutions.

If you have any thoughts on this, or it would be useful to you, let me
know. It sounds like a lot of people have a similar problem. I think
that by solving scalability problems in one place hadoop tends to
create scalability problems elsewhere (in the live/delivery part of
the system), since you can now create so much more data. S3 does seem
like a good solution.

-Jay

Elias Torres

Jan 31, 2009, 7:12:09 PM
to project-...@googlegroups.com
On Sat, Jan 31, 2009 at 4:59 PM, Jay Kreps <jay....@gmail.com> wrote:
>
> The exact format of the files is
>
> index stores the following for each key:
> 16 byte binary md5
> 8 byte position offset in data file for the value
>
> The entries in the index are in sorted order. The md5 is the exact
> bytes returned by Java's built in MD5
>
> data file stores the following repeated for each value:
> 4 byte value size
> variable value blob

Simple and functional, I like it. I thought about doing something
similar using HTTP ranges, since I'm all HTTP and S3 supports them, but
I'd still have needed to deal with distributing the index, which is still
large.

>
> Binary search goes against the index (with caching) to locate the
> correct blob in the data file, then seek to that position and read.
> The values are whatever bytes are stored so depending on the
> serialization used for the store this could be text, or whatever.
>
> Currently there is only 1 data and index file per store per node. This

If you have tens of millions of keys, you are talking around a gigabyte
per index file at 20 bytes per key.

> is not ideal because our hadoop cluster is bigger than our voldemort
> cluster so it is somewhat under-used in the reduce phase where it is
> sorting and building the index and data files, since the number of
> reducers is fixed to the number of voldemort nodes. It is fairly easy

So you are in fact partitioning in Hadoop land and distributing the
indexes based on something like hash(md5(key)) % number of voldemort nodes.
I take it you can't afford to lose a node in the voldemort cluster. This is
just like using a memcached server that we could bulk-load at startup,
without consistent hashing.

> to generalize this so that multiple index and data files are used per
> node, but this does slightly complicate the push logic. For the first
> version we just want to get it out there since it is already much
> better than all our current solutions.
>
> If you have any thoughts on this, or it would be useful to you let me
> know. It sounds like a lot of people have a similar problem. I think
> that by solving scalability problems in one place hadoop tends to
> create scalaibility problems elsewhere (in the live/delivery part of
> the system) since you can now create so much more data. S3 does seem
> like a good solution.

I think I'm going to have to quote you on that because that's the
truth. Hadoop is solving the big problem of quickly churning through
data, but now we need more tools to deliver that data since it's a big
firehose. No more MySQL to the rescue. We had a contractor that was
writing binary MySQL files at some point. I'll keep thinking about
this more and what could be a good tool to generalize out of this.
I'll run it by a few more folks as well.

-Elias

Jay Kreps

Jan 31, 2009, 8:00:30 PM
to project-...@googlegroups.com
Hi Elias,

The overhead is exactly 28 bytes per value (16 for md5, 8 for data
file offset, 4 for value size). This is rather high, but still
competitive with database overhead. In practice a lot of our values
are large lists of things so the overhead is not too bad. If we broke
the data into 2GB chunks on a single node and used variable length
encoding for the value sizes we could probably get this down to around
22 bytes of overhead, but I think we are going to save that for a
future version since binary compatibility is not a big deal for read
only data that gets recreated every day. So for 10 million keys this
would mean 280 MB overhead (some of that is in the index file and some
in the data file).

For your other question the partitioning is the exact same consistent
hashing done elsewhere in voldemort, and yes this is computed in
hadoop (we just call the voldemort partitioner as the hadoop
partitioner). Note that this does NOT mean that each value is only on
a single server. The partitioning (as well as the data deployment) is
done using the exact same stores.xml and cluster.xml metadata you use
for the live system, and the replication-factor you specify in the
stores.xml file will determine how many nodes a given key/value is
stored on. This means you can tolerate a number of simultaneous
failures equal to one less than the replication-factor you set for the
store without loss of data availability for any key (obviously more
redundancy means more data on each node, though).

So yes, we think this solution is totally superior to the mysql "load
data infile" route. You can build myisam files offline and ship them
around, which is pretty cool, but moving this into hadoop would be
awkward.

Cheers,

-Jay

Leon Mergen

Feb 12, 2009, 6:25:14 AM
to project-voldemort
Hello Jay,

Below are a few comments I have on this whole issue.

On Jan 21, 4:20 pm, Jay Kreps <jay.kr...@gmail.com> wrote:
> The protocol is not very well documented right now, but I do want to
> support clients that aren't in java. Rather than document the protocol
> and try to maintain n clients in sync as we evolve the protocol I was
> thinking that the best approach would be to change the on-wire
> serialization format for things like the vector-clock versions to use
> protocol buffers. I don't really know the performance implications of
> this, but a coworker of mine did some evaluation and it didn't look at
> all unreasonable
> (http://eishay.blogspot.com/2008/11/protobuf-with-option-optimize-for-...).

+1 for protocol buffers -- I'm using it here in my (C++) project, and
the performance with speed optimization is awesome, which seems to fit
Voldemort, where speed is an issue. Looking at your specification,
this especially is going to be a lot better than Thrift in
terms of object sizes: PB does fancy things like encoding small
integers such as 1 in a single byte instead of a full 32-bit integer,
which makes the protocol much more dense. I would expect a lot of the node
IDs, request types and error codes to be very small integers, so I
think you can get a big win here.
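
For reference, this is protocol buffers' standard varint encoding; a tiny generic sketch (not anything Voldemort-specific) shows the sizes:

# Standard base-128 varint encoding, as used by protocol buffers for int32 /
# int64 fields: small non-negative values fit in a single byte instead of four.
def encode_varint(value):
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

print(len(encode_varint(1)))         # 1 byte (vs. 4 for a fixed-width int32)
print(len(encode_varint(300)))       # 2 bytes
print(len(encode_varint(1 << 28)))   # 5 bytes -- large values cost more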

Other than that, Thrift does look like a viable option, given that it
has more code generators for different languages. But if you look at
it from the long term, I think it's only a matter of time before PB
will support more languages too.

Thrift does provide a complete means for client/server communication,
too, on top of the serialization: this would make the client coding a
/lot/ easier (basically, no client code would be required, and a few
usage examples on the voldemort website would suffice).

I would definitely vote against anything other than PB or Thrift: I
feel these two are the most widely used solutions for (high
performance) data serialization, and the whole point of this is
to provide an interface that is easier to integrate with other
languages. Using some obscure serialization format kind of defeats the
purpose there, I think.

> The default for the java client is to handle a lot of the routing and
> failure responsibility, but an initial python client need not be so
> ambitious. An initial version could do no client side routing and use
> the HTTP-based protocol we have. So a client would look like this:
>
> 1. Connect to one of a list of "bootstrap servers" and do an
> HTTP-based voldemort request to get the cluster and store metadata to
> get the list of nodes and stores
> 2. Use protocol buffers code to deserialize the vector clock version
> that comes with values and handle the fact that many versions might be
> returned
> 3. Round robin requests over servers and let the server handle correct
> routing and version reconciliation

OK, that looks acceptable. Personally I would rather use a load
balancer to spread requests over the voldemort servers than let my
front-end servers know about the cluster configuration, to reduce the
complexity of all this (and, more importantly, to provide a single
entry point for my front-end servers and not have to give them *any*
information about cluster configuration). I can see that that
extra level of indirection is a problem for many people too, but
personally I want to keep as much of the voldemort cluster configuration
out of my front-end servers as I can, since I don't want to bring them
down when changing the cluster configuration.

> The responsibility of the minimum viable client would be constructing
> the request/response objects and issuing HTTP GET, POST, and DELETE
> operations for the voldemort get, put, and delete operations. This
> would be a fairly achievable implementation effort and would still let
> us evolve the protocol (since protocol buffers is fairly good at
> compatibility).
>
> I have put a little thought in to this, but didn't want to start until
> I had a potential non-java client to try out. It sounds like you are
> digging around and ready to write some code. If I created a branch in
> which I swapped out the current protocol for protocol buffers objects
> would you be interested in taking a shot at a python client based on
> that?

I can help you out with developing/testing a C/C++ client, most likely
based on something like libcurl. Let me know if that is desirable.

Regards,

Leon Mergen

Jay Kreps

Feb 12, 2009, 12:21:29 PM
to project-...@googlegroups.com
Hey Leon,

Help with new clients would be very much appreciated. I have lagged a
bit on this project due to other conflicting projects, but plan to
actually get something done starting now. :-)

I completely agree that the varint support in protocol buffers is a
very important feature. In particular the vector clock entries we
serialize and the string/byte[] lengths are expected to be small
positive values, so they greatly benefit from a proper prefix free
encoding.

The rough plan for the refactoring is very simple. Basically in the
current mechanism the protocol (HTTP/raw sockets) is tied up with the
wire format (bytes in a request/response). The basic plan is to
separate these and attempt to reuse the same bytes for both HTTP and
sockets. So there will be an interface

interface WireFormat {
  public void writeGetRequest(DataOutputStream o, ...)
  public void writePutRequest(...)
  ...
}

I will add a protocol buffers implementation, and if the performance
is okay this may become the default. Others can add Thrift if they
want to get broader language support or just prefer it. The difficulty
of adding a new wire format implementation should be fairly low.

This will mean being more abusive of HTTP in the HTTP-based protocol,
and will offend some people. Likely all the HTTP requests will become
POSTs instead of the current GET/POST/DELETE for voldemort
GET/PUT/DELETE. I think this is okay, since the idea of REST does not
really make sense in a partition aware system (but makes great sense
in a public web service); and client and server are in fact tightly
coupled (unlike the clients of a real public web service, where client
and server are not owned by the same people). In addition since we
allow keys of unlimited bytes there is not an obvious mapping to a URL
safe thing, and unless the mapping is obvious much of the advantage of
REST goes away (you can no longer debug via URLs). I think the
solution here is just to add a query page in the voldemort admin
interface that understands common serialization types. The caching
benefits of REST don't really make sense for voldemort, since it is a
storage system, so it should be doing the caching--adding a SQUID
proxy (or whatever) would mess up the semantics and not improve
performance.

Cheers,

-Jay

Leon Mergen

Feb 12, 2009, 2:51:32 PM
to project-voldemort
Hey Jay,

On Feb 12, 6:21 pm, Jay Kreps <jay.kr...@gmail.com> wrote:
> Help with new clients would be very much appreciated. I have lagged a
> bit on this project due to other conflicting projects, but plan to
> actually get something done starting now. :-)

OK, good -- let me know when things have taken a bit of shape so I can
start working on a C++ client lib.

> I completely agree that the varint support in protocol buffers is a
> very important feature. In particular the vector clock entries we
> serialize and the string/byte[] lengths are expected to be small
> positive values, so they greatly benefit from a proper prefix free
> encoding.
>
> The rough plan for the refactoring is very simple. Basically in the
> current mechanism the protocol (HTTP/raw sockets) is tied up with the
> wire format (bytes in a request/response). The basic plan is to
> separate these and attempt to reuse the same bytes for both HTTP and
> sockets. So there will be an interface
>
> interface WireFormat {
>   public void writeGetRequest(DataOutputStream o, ...)
>   public void writePutRequest(...)
>   ...
>
> }
>
> I will add a protocol buffers implementation, and if the performance
> is okay this may become the default. Others can add Thrift if they
> want to get broader language support or just prefer it. The difficulty
> of adding a new wire format implementation should be fairly low.

Yeah, that makes sense -- instead of just dictating one format, make the
proper abstractions. Good to hear!

> This will mean being more abusive of HTTP in the HTTP-based protocol,
> and will offend some people. Likely all the HTTP requests will become
> POSTs instead of the current GET/POST/DELETE for voldemort
> GET/PUT/DELETE. I think this is okay, since the idea of REST does not
> really make sense in a partition aware system (but makes great sense
> in a public web service); and client and server are in fact tightly
> coupled (unlike the clients of a real public web service, where client
> and server are not owned by the same people). In addition since we
> allow keys of unlimited bytes there is not an obvious mapping to a URL
> safe thing, and unless the mapping is obvious much of the advantage of
> REST goes away (you can no longer debug via URLs). I think the
> solution here is just to add a query page in the voldemort admin
> interface that understands common serialization types. The caching
> benefits of REST don't really make sense for voldemort, since it is a
> storage system, so it should be doing the caching--adding a SQUID
> proxy (or whatever) would mess up the semantics and not improve
> performance.

I think the most important thing is that you settle on a system that works
here -- do not start conforming to REST just for the sake of
conforming.

That being said, is HTTP even a good idea at all for this? In essence
there will just be a client sending serialized commands to a server
and waiting for replies, no? And given that network latency
will probably be the biggest delay, is there anything *against* just a
really dense binary protocol that would look somewhat like this:

Message := Len Data
Len := uint32_t
Data := byte Data | byte

in which both a Request from client to server and a Response from
server to client would be a Message. This would make the client a
lot easier to code, and makes some sense if efficiency is
important.
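
In Python, for example, that framing is only a few lines (assuming the uint32 length is big-endian / network order):

# Sketch of the framing above: every message, in either direction, is a
# 4-byte length followed by that many bytes of payload. Big-endian (network
# order) for the uint32 length is an assumption.
import struct

def send_message(sock, payload):
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_message(sock):
    (length,) = struct.unpack(">I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)

def recv_exactly(sock, n):
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise EOFError("connection closed mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)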

Or am I proposing something really silly here? I'm just wondering
whether HTTP still makes sense if you only send POST messages, do not
want to use the caching facilities, etc.

Just my 2 cents worth.

Regards,

Leon Mergen

Jay Kreps

Feb 12, 2009, 3:15:38 PM
to project-...@googlegroups.com
I think HTTP is just a convenience. Implementing a ruby client that
uses HTTP is (slightly) easier than one that uses sockets. The
performance of HTTP for small key/values is not great.

-Jay

Jay Kreps

Feb 12, 2009, 3:16:39 PM
to project-...@googlegroups.com
And yes, the socket protocol is basically exactly what you describe.

Sebastian

Feb 12, 2009, 3:48:01 PM
to project-voldemort
Hi Jay

On Jan 21, 4:20 pm, Jay Kreps <jay.kr...@gmail.com> wrote:
> The protocol is not very well documented right now, but I do want to
> support clients that aren't in java. Rather than document the protocol
> and try to maintain n clients in sync as we evolve the protocol I was
> thinking that the best approach would be to change the on-wire
> serialization format for things like the vector-clock versions to use
> protocol buffers. I don't really know the performance implications of
> this, but a coworker of mine did some evaluation and it didn't look at
> all unreasonable
> (http://eishay.blogspot.com/2008/11/protobuf-with-option-optimize-for-...).
>

The problem I see with protobufs is the lack of support for anything
non-{java,python,c}.
For anything else (in my case PHP) it's rare to see a well-supported
library. Another option might be the Thrift binary protocol, which has
much wider language support and might be a better choice in terms of
"language market share". I'm not sure how the data is saved internally
in voldemort; maybe it is an option to simply save it in a "plain"
format and convert it at the interfaces. That way it would be
possible to support both thrift and protobufs (or whatever wire
protocol one would come up with).

Regards
Sebastian

Jay Kreps

Feb 12, 2009, 4:06:18 PM
to project-...@googlegroups.com
Yes the wire format is just the wire format, it is not persisted, and
your data is just your data (in whatever format you choose). The only
reason I would do protocol buffers for the wire format is because I
know it and it is a way to get the wire format abstracted so a thrift
expert can come in and add support for that. I have seen that Thrift
has superior language support.

-Jay

Leon Mergen

Feb 13, 2009, 3:12:11 AM
to project-voldemort
Hey Jay,

Thanks for your response. I had missed that there will be both a
socket protocol and an HTTP protocol -- I'll definitely use the socket
protocol for the C client.

Let me know when I can get started.

Regards,

Leon Mergen

On Feb 12, 9:15 pm, Jay Kreps <jay.kr...@gmail.com> wrote:
> I think HTTP is just a convenience. Implementing a ruby client that
> uses HTTP is (slightly) easier than one that uses sockets. The
> performance of HTTP for small key/values is not great.
>
> -Jay
>

mkoppanen

Feb 16, 2009, 5:26:19 AM
to project-voldemort
On Feb 13, 8:12 am, Leon Mergen <l...@solatis.com> wrote:
> Hey Jay,
>
> Thanks for you response. I misunderstood that there will be both a
> Socket protocol and a HTTP protocol -- I'll definitely use the Socket
> protocol for the C client.
>
> Let me know when I can get started.

Hi,

thank you for pointing me in the right direction, Leon. A C or C++
library will definitely make my work easier because I can just wrap
the library in a PHP extension and expose the API to PHP
users.

Let me know when you get started.

Thanks,
Mikko

Leon Mergen

Feb 16, 2009, 6:13:29 AM
to project-voldemort


On Feb 16, 11:26 am, mkoppanen <mikko.koppa...@gmail.com> wrote:
> > Thanks for you response. I misunderstood that there will be both a
> > Socket protocol and a HTTP protocol -- I'll definitely use the Socket
> > protocol for the C client.
>
> > Let me know when I can get started.
>
> thank you for pointing me to the right direction, Leon. C or C++
> library will definitely make my work easier because I can just wrap
> the library into PHP extension structure and expose the API to the PHP
> users.

I was planning on basing this API on libevent
(http://monkey.org/~provos/libevent/) -- will this be a problem for you? As
far as I'm aware, quite a lot of popular C APIs are making use of it (I know
libmemcached does too), and it will make my life a lot easier by
allowing the use of high-performance event notification mechanisms
(like epoll, kqueue, etc.) without explicitly having to write code for
these manually.

Regards,

Leon Mergen

Mikko Koppanen

Feb 16, 2009, 10:18:59 AM
to project-voldemort
On Feb 16, 11:13 am, Leon Mergen <l...@solatis.com> wrote:
> I was planning on basing this API on libevent (http://monkey.org/
> ~provos/libevent/) -- will this be a problem for you ? As far as I'm
> aware, quite a lot of popular C API's are making use of this (I know
> libmemcached does it too), and it will make my life a lot easier
> allowing the use of high-performance event notification mechanisms
> (like epoll, kqueue, etc) without explicitly having to write code for
> these manually.

There is an existing PHP extension [1] using libmemcached so I
wouldn't think libevent is an issue. You should also take a look at
libev which is a similar library. At least benchmarks [2] (provided by
them) suggest that libev is slightly faster. Then again my experience
with event loops is quite limited since I've used only libev in the
past.

Regards,
Mikko


[1] http://pecl.php.net/package/memcached
[2] http://libev.schmorp.de/bench.html




Leon Mergen

Feb 16, 2009, 10:35:15 AM
to project-...@googlegroups.com
Hello Mikko,

> > I was planning on basing this API on libevent (http://monkey.org/
> > ~provos/libevent/) -- will this be a problem for you ? As far as I'm
> > aware, quite a lot of popular C API's are making use of this (I know
> > libmemcached does it too), and it will make my life a lot easier
> > allowing the use of high-performance event notification mechanisms
> > (like epoll, kqueue, etc) without explicitly having to write code for
> > these manually.
>
> There is an existing PHP extension [1] using libmemcached so I
> wouldn't think libevent is an issue. You should also take a look at
> libev which is a similar library. At least benchmarks [2] (provided by
> them) suggest that libev is slightly faster. Then again my experience
> with event loops is quite limited since I've used only libev in the
> past.


Ah I see -- interesting results! My main issue is that I know libevent very well, and have no prior experience with libev. However, I can always put a few abstractions in place that allow the communications library to be pluggable. Might be a good idea anyway, in the spirit of Voldemort.

I did some research about the PHP Memcache module, and I see that they actually require you to manually install libevent for it:

http://nl2.php.net/manual/en/memcache.installation.php

And my guess is that there will be a big overlap in people using PHP's Voldemort module and PHP's Memcache module, so it might be a good idea to stick with the same dependencies.

So, bottom line: I'm going to put proper abstractions in place, and if people wish to do so, they can always use a different event synchronization library.

Regards,

Leon Mergen
