Version 2.1 Hopes - Hash/Dictionaries and improved Python code

56 views
Skip to first unread message

Will P

unread,
Aug 4, 2008, 11:20:55 PM8/4/08
to Protocol Buffers
Hey Folks,

I'm eagerly awaiting version 2.1 of Google's Protocol Buffers.
Primarily, I'm most interested in the addition of Hashmaps/Hashes/
Associative-Arrays/Dictionaries (depending on which language's syntax
you choose, they are all hashed key=value maps). I'm also very
interested in Python bindings that aren't littered with so many "TODO"
comments - i.e. please just _WRAP_IT_UP_ and get it done... I have
ZERO interest in deploying half-baked code to production, so the
Python 2.0 Protocol Buffers code really needs attention before you can
honestly consider it production-ready. Or perhaps I'm just
ridiculously strict? Please correct me if so...

Is there a timeline for a 2.1 release, and does it include a new
HashMap/Hash/Dictionary datatype? And is the Python binding code
supremely improved?

Your friend,
Will

Kenton Varda

unread,
Aug 5, 2008, 1:26:50 AM8/5/08
to Will P, Protocol Buffers
Are you offering to help?

Torbjörn Gyllebring

unread,
Aug 5, 2008, 2:11:30 AM8/5/08
to Kenton Varda, Will P, Protocol Buffers

re: Hash/Dictionary

message KeyValuePair{

  required <some_data_type> key = 1;

  required <some_data_type> value= 2;

}

message Dictionary {

  repeated KeyValuePair items = 1;

}

I haven't looked at the python code but would you prefer that they had left the TODO's out instead? Or are there any specific issues that's hindering you from accomplishing the task at hand with the current code?

Alek Storm

unread,
Aug 5, 2008, 4:03:35 AM8/5/08
to Protocol Buffers
For memory-sparse systems (requires no transformation to a different
in-memory hashtable):
message KeyValuePair {
// more generic w/ bytes
required bytes key = 1;
required bytes value= 2;
}

message Bucket {
repeated KeyValuePair contents = 1;
}

message Dictionary {
extensions 65536 to max; // reserve first few for buckets
}

And pseudo-Java to wrap it (I've never used the Java API before, and
haven't used Java itself in a year or two):
public class MyDictionary<K,V> {
private Dictionary.Builder dict_builder;
public void init() {}
public void init(Dictionary.Builder _dict_builder) {
dict_builder = _dict_builder
}
public V get(K key) {
for (KeyValuePair pair :
dict_builder.getUnknownFields().getField(key.hashCode()).contents)
if (pair.key == key)
return pair.value;
}
public void put(K key, V value) {
// really pseudo-code
hash = key.hashCode();
if (key not in dict_builder.getUnknownFields())
dict_builder.addField(hash, new Bucket());
dict_builder.getField(hash).addRepeatedField(new KeyValuePair(key,
value));
}
}

> > On Mon, Aug 4, 2008 at 8:20 PM, Will P <jwi...@gmail.com> wrote:
> >> you choose, they are all hashed key=value maps). I'm also very
> >> interested in Python bindings that aren't littered with so many "TODO"
> >> comments - i.e. please just _WRAP_IT_UP_ and get it done... I have
> >> ZERO interest in deploying half-baked code to production, so the
> >> Python 2.0 Protocol Buffers code really needs attention before you can
> >> honestly consider it production-ready. Or perhaps I'm just
> >> ridiculously strict? Please correct me if so...

So don't deploy it. Nobody gets a prize if you do. What specific
features do you need that are TODO'd?

> >> Is there a timeline for a 2.1 release, and does it include a new
> >> HashMap/Hash/Dictionary datatype? And is the Python binding code
> >> supremely improved?

For an informal 2.0 roadmap, you can try
http://groups.google.com/group/protobuf/browse_thread/thread/2ac680d9832ee1d2/470b6d55c50e2313

Will Pierce

unread,
Aug 5, 2008, 7:56:21 AM8/5/08
to Torbjörn Gyllebring, Kenton Varda, Protocol Buffers
Torbjörn, Thanks for the tip!
This definitely will encode a dictionary, but what would the resulting object look like, in terms of accessor methods...?  A Dictionary message would look like a list container/sequence, with a series of 2 element lists within it (just as you defined)... Duplicate keys could creep in, and if I'm not mistaken, access would literally require foo.key() and foo.value() to get the key/value pair, rather than a more native-appearing data hash/dictionary structure like  foo[key]=value  (perl: $foo{$key}=$value;)

It's actually an interesting problem- how do you efficiently encode a dictionary/hash?  I love the tight encoding of integers (the zig-zag method), and so it's an interesting problem to try and encode a dictionary/hash with as much intentionally tight encoding, versus the obvious angles... Any ideas?

The TODO's in the code are more of a scare factor to me than impacting (my grep scrolled)...

Alek Storm

unread,
Aug 5, 2008, 1:36:54 PM8/5/08
to Protocol Buffers
On Aug 5, 6:56 am, "Will Pierce" <jwi...@gmail.com> wrote:
> Torbjörn, Thanks for the tip!
> This definitely will encode a dictionary, but what would the resulting
> object look like, in terms of accessor methods...?  A Dictionary message
> would look like a list container/sequence, with a series of 2 element lists
> within it (just as you defined)... Duplicate keys could creep in, and if I'm
> not mistaken, access would literally require foo.key() and foo.value() to
> get the key/value pair, rather than a more native-appearing data
> hash/dictionary structure like  foo[key]=value  (perl: $foo{$key}=$value;)
>
> It's actually an interesting problem- how do you efficiently encode a
> dictionary/hash?  I love the tight encoding of integers (the zig-zag
> method), and so it's an interesting problem to try and encode a
> dictionary/hash with as much intentionally tight encoding, versus the
> obvious angles... Any ideas?

Sorry, I guess I wasn't clear what mine actually does. Torbjörn's
dictionary encodes a list of key-value pairs, which I would recommend
for almost everything. My version uses the hash value modulo 2^16 as
the tag number for every key-value pair. This saves decoding,
rehashing, and filling a new in-memory hash, because the PB in-memory
representation itself is the hash. In other words, you can use the
interface you wanted: foo[key]=value. It also makes it much easier to
check for duplicate values (an O(n^2) operation), since you only have
to check the same bucket. Of course, if you're transmitting it over a
wire, both endpoints have to agree on a hashing algorithm, and how
many possible buckets to reserve (you can use much fewer than 2^16 -
best is twice the expected number of key-value pairs).

Kenton Varda

unread,
Aug 5, 2008, 3:06:34 PM8/5/08
to Will Pierce, Petar Petrov, Protocol Buffers
Great.  You should coordinate with Petar, who is in charge of Python protobufs.

Please note that very few resources are actually devoted to protobuf development inside Google.  Petar works on the Python code and I work on C++ and Java, but we both have other projects on which we're supposed to be spending the majority of our time.  Usually, when people want a new feature implemented, they do it themselves, and we review their changes.  No one else is clamoring for map support, so I guess that leaves it up to you.  :)

In the .proto language, maps should be defined using the experimental_map_key option (this will be renamed to map_key later, once it is fully-supported).  Example:

  message Foo {
    optional string name = 1;
    optional int32 some_field = 2;
    ...
  }

  message Bar {
    repeated Foo foo = 1 [experimental_map_key = "name"];
  }

Note that options (like experimental_map_key) are defined in descriptor.proto.  Anything defined there is automatically picked up by the compiler and made available via the descriptors objects in each language.

I think in Python it would make sense to use syntax like:

  bar = Bar()
  bar.foo["some_name"].some_field = 123
  bar.foo["another_name"].some_field = 456
  assert bar.foo["some_name"].some_field == 123
  assert bar.foo["another_name"].some_field == 456

But I'm not a heavy Python user so you and Petar should make the final decision.

I'm not sure what should happen if someone tries to change a field's key, e.g.:

  bar.foo["some_name"].name = "other_name"

Ideally this would update the containing map, but making that work would potentially be complicated and inefficient (even more so in C++ and Java), since every class would have to be prepared for any of its fields to trigger such an update.  An alternative design that would solve this problem would be to use syntax like:

  message Foo {
    optional string name = 1 [is_map_key=true];
    optional int32 some_field = 2;
    ...
  }

  message Bar {
    repeated Foo foo = 1 [is_map=true];
  }

But this implies that Foo objects can *only* be indexed by name, which may be limiting.

Please note that you'll have to sign the contributor license agreement before you can contribute any code.  If you're doing this on your own time you can sign the individual agreement by just filling out the web form on this page:


If you're working on company time you'll need your employer to sign this version instead:


On Tue, Aug 5, 2008 at 3:22 AM, Will Pierce <jwi...@gmail.com> wrote:
You bet! 
Reply all
Reply to author
Forward
0 new messages