Feature proposal: mapped fields

1,678 views
Skip to first unread message

Igor Gatis

unread,
Oct 6, 2010, 9:23:03 AM10/6/10
to Protocol Buffers
Not sure whether this has been discussed before. In any case...

It would be nice to have mapped fields, e.g. key-value pairs. It would work similar to repeated fields, which are implicit maps, e.g 0..N keyed messages. Mapped fields would break from 0..N keys to int or string keys. Integers are very compact and that is very attractive in terms of wire format but settle with integer keys are not really greater than 0..N keys. Thus, string seems more suitable keys of mapped fields. Thus, it seems each item of a mapped field could be defined by the following template-like proto message:

message KeyValuePair_of_SomeMessageType {
  required string key = 1;
  optional SomeMessageType value = 2;
}


Let's pick a example. Consider the following messages:

message Foo {
  optional int int_field = 1;
  ...
}

message Bar {
  mapped Foo foo = 1;
}

Internally, protobuf would read the above code as something like:

message Foo {
  optional int int_field = 1;
  ...
}

// Known in code generation time only.
message KeyValuePair_of_Foo {
  required string key = 1;
  optional Foo value = 2;
}

message Bar {
  repeated KeyValuePair_of_Foo foo = 1;
}


And generated C++ code for Bar would look like:

int32 foo_size() const;
bool has_foo(const string& key) const;
const Foo& foo(const string& key) const;
Foo* mutable_foo(const string& key);
void put_foo(const string& key, const Foo& foo);
void remove_foo(const string& key);
const RepeatedPtrField<string>& foo_keys() const;
const RepeatedPtrField<const Foo&>& foo_values() const;


Thoughts?

-Gatis

Evan Jones

unread,
Oct 6, 2010, 10:40:08 AM10/6/10
to Igor Gatis, Protocol Buffers
On Oct 6, 2010, at 9:23 , Igor Gatis wrote:
> It would be nice to have mapped fields, e.g. key-value pairs.

I think that map support would probably be useful. I've basically
created my own maps in protocol buffers a couple times, either by
using two repeated fields, or a repeated field of a custom "pair"
type. In these cases, it would have been nice to be able to use the
Protocol Buffer as a map directly, rather than needing to transfer the
data to some other object that actually implements the map. I would be
interested to hear the opinion of the Google maintainers. I'm assuming
that there are probably many applications inside Google that exchange
map-like messages.

This would be a big change, although it wouldn't be an impossible one,
I don't think. I think it could be implemented as "syntactic sugar"
over a repeated Pair message. I think the biggest challenge is that
maps are a "higher level" abstraction than repeated fields, which
leads to many design challenges:

* Are the maps ordered or unordered?
* If ordered, how are keys compared? This needs to be consistent
across programming languages.
* If unordered, how are hash values computed? This could result in a
message being parsed and re-serialized differently, if different
languages compute the hashes differently.
* For both, how are "'unknown" fields handled?
* Do the maps support repeated keys?
* If not, what happens when parsing a message with repeated keys?


Other message protocols contain map-like structures: JSON, Thrift, and
Avro. Avro only supports string keys. JSON only supports primitive
keys. Thrift has a similar note about maps:

http://wiki.apache.org/thrift/ThriftTypes

> For maximal compatibility, the key type for map should be a basic
> type rather than a struct or container type. There are some
> languages which do not support more complex key types in their
> native map types. In addition the JSON protocol only supports key
> types that are base types.


Evan

--
Evan Jones
http://evanjones.ca/

David Yu

unread,
Oct 6, 2010, 11:07:07 AM10/6/10
to Igor Gatis, Protocol Buffers
In descriptor.proto, you'll see an experimental map field.  It's not usable atm.
In the meantime, you could always simulate a map serialization using a repeated message with 
odd field numbers as $key and even as $value (sequential).

--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to prot...@googlegroups.com.
To unsubscribe from this group, send email to protobuf+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.



--
When the cat is away, the mouse is alone.
- David Yu

Igor Gatis

unread,
Oct 6, 2010, 1:50:37 PM10/6/10
to Evan Jones, Protocol Buffers
On Wed, Oct 6, 2010 at 11:40 AM, Evan Jones <ev...@mit.edu> wrote:
On Oct 6, 2010, at 9:23 , Igor Gatis wrote:
It would be nice to have mapped fields, e.g. key-value pairs.

I think that map support would probably be useful. I've basically created my own maps in protocol buffers a couple times, either by using two repeated fields, or a repeated field of a custom "pair" type. In these cases, it would have been nice to be able to use the Protocol Buffer as a map directly, rather than needing to transfer the data to some other object that actually implements the map. I would be interested to hear the opinion of the Google maintainers. I'm assuming that there are probably many applications inside Google that exchange map-like messages.

This would be a big change, although it wouldn't be an impossible one, I don't think. I think it could be implemented as "syntactic sugar" over a repeated Pair message.

The syntactic sugar is what I meant by implementing it with repeated pairs.
 
I think the biggest challenge is that maps are a "higher level" abstraction than repeated fields, which leads to many design challenges:

* Are the maps ordered or unordered?
       * If ordered, how are keys compared? This needs to be consistent across programming languages.
       * If unordered, how are hash values computed? This could result in a message being parsed and re-serialized differently, if different languages compute the hashes differently.

Actually, I have a different opinion. High level is not a problem. It's a dictionary, e.g. string -> object. It does not matter whether the list is ordered or not. It does not matter how the hash is computed. Serialization is simple: list of Pairs. The it does not matter the order.
 
A given implementation may decide to read each pair at a time and insert it into a hashmap. The down side is that a big list will cause hashmap to grow many times. A solution could be reading the flat list to figure out what the size is, then create a map out of it.

       * For both, how are "'unknown" fields handled?
* Do the maps support repeated keys?
       * If not, what happens when parsing a message with repeated keys?

Hm... I'd say no. Simple (unique) string to object seems to fit lots of people needs. Perhaps, thought, values could be repeated. But that is easily achieved by a wrapping message which has a repeated object.

Igor Gatis

unread,
Oct 6, 2010, 2:17:05 PM10/6/10
to David Yu, Protocol Buffers
  // EXPERIMENTAL.  DO NOT USE.
  // For "map" fields, the name of the field in the enclosed type that
  // is the key for this map.  For example, suppose we have:
  //   message Item {
  //     required string name = 1;
  //     required string value = 2;
  //   }
  //   message Config {
  //     repeated Item items = 1 [experimental_map_key="name"];
  //   }
  // In this situation, the map key for Item will be set to "name".
  // TODO: Fully-implement this, then remove the "experimental_" prefix.
  optional string experimental_map_key = 9;

Interesting. It basically pins a field within repeated object to be the key. I like that. That's very aligned with my ideas. This allows other key types fairly easily. I guess it would make sense to support string, int and enum by default. I like the syntax I described better though. :)

Jason Hsueh

unread,
Oct 7, 2010, 5:19:34 PM10/7/10
to Igor Gatis, David Yu, Protocol Buffers
Indeed, maps have been brought up repeatedly. I forget the current state of the discussion, but I think it's generally agreed that it would be a good thing to add; it's just a matter of how to implement it (and finding the time to do it).

A couple of the major issues:
- backward compatibility with existing repeated fields
- ensuring that the client can't change the map key in the mapped value without also updating the map structure.

Igor Gatis

unread,
Oct 7, 2010, 10:02:38 PM10/7/10
to Jason Hsueh, David Yu, Protocol Buffers
On Thu, Oct 7, 2010 at 6:19 PM, Jason Hsueh <jas...@google.com> wrote:
Indeed, maps have been brought up repeatedly. I forget the current state of the discussion, but I think it's generally agreed that it would be a good thing to add; it's just a matter of how to implement it (and finding the time to do it).

A couple of the major issues:
- backward compatibility with existing repeated fields

I believe this has been addressed by the proposal I have made. Maps a repeated fields. So they can be exposed as list of pairs (string, message).
 
- ensuring that the client can't change the map key in the mapped value without also updating the map structure.

Could you provide an example?

Jason Hsueh

unread,
Oct 7, 2010, 10:18:04 PM10/7/10
to Igor Gatis, David Yu, Protocol Buffers
On Thu, Oct 7, 2010 at 7:02 PM, Igor Gatis <igor...@gmail.com> wrote:


On Thu, Oct 7, 2010 at 6:19 PM, Jason Hsueh <jas...@google.com> wrote:
Indeed, maps have been brought up repeatedly. I forget the current state of the discussion, but I think it's generally agreed that it would be a good thing to add; it's just a matter of how to implement it (and finding the time to do it).

A couple of the major issues:
- backward compatibility with existing repeated fields

I believe this has been addressed by the proposal I have made. Maps a repeated fields. So they can be exposed as list of pairs (string, message).

For backwards compatibility, the idea was to allow current repeated message fields to be converted to maps. You might expose these as pairs (string, message), or just as regular message accessors, except with a key parameter instead of an index. On the wire though, the map field would be serialized as the messages only. Keys are not serialized separately; they're just a special field within the message, as in the example in the .proto file. That leads to ...
 
 
- ensuring that the client can't change the map key in the mapped value without also updating the map structure.

Could you provide an example?

In the example from the .proto, suppose we generate key-based accessors for the items map field, and internally the field is represented as a hash_map<string, Item>.

Item* mutable_items(const string& name);
const Item& items(const string& name);

A client could do something like:
const string key = "foo";
foo.mutable_items(key)->set_name("bar");

i.e. its logical key (and its key when serialized and reparsed) is "bar" but the hash_map points to it using the old key "foo".

Igor Gatis

unread,
Oct 7, 2010, 11:02:31 PM10/7/10
to Jason Hsueh, David Yu, Protocol Buffers
On Thu, Oct 7, 2010 at 11:18 PM, Jason Hsueh <jas...@google.com> wrote:


On Thu, Oct 7, 2010 at 7:02 PM, Igor Gatis <igor...@gmail.com> wrote:


On Thu, Oct 7, 2010 at 6:19 PM, Jason Hsueh <jas...@google.com> wrote:
Indeed, maps have been brought up repeatedly. I forget the current state of the discussion, but I think it's generally agreed that it would be a good thing to add; it's just a matter of how to implement it (and finding the time to do it).

A couple of the major issues:
- backward compatibility with existing repeated fields

I believe this has been addressed by the proposal I have made. Maps a repeated fields. So they can be exposed as list of pairs (string, message).

For backwards compatibility, the idea was to allow current repeated message fields to be converted to maps.

I see. I don't think that should drive the new design though. The design I proposed requires a wrapper message. So if your declared repeated field is of type Foo. Internally, protobuf compiler would generate a new message, like the following:

message KeyValuePair_of_Foo {
  required string key = 1;
  optional Foo foo = 2;
}

In order to migrate a repeated field into a mapped field one should declare a new field which is the mapped one, and move all items from the repeated field to the mapped field. This is required because the wire types are different.

 
You might expose these as pairs (string, message), or just as regular message accessors, except with a key parameter instead of an index.

I see the value of exposing the internal message type (access key and values without doing several hash lookups). But that can be achieved easily by a simple wrapper just like STL's pair<string,*> object, or Java's Map.Entry, and Pythons (key, value) tuple. 
 
On the wire though, the map field would be serialized as the messages only. Keys are not serialized separately; they're just a special field within the message, as in the example in the .proto file. That leads to ...

Well, I believe the migration scheme I described above is just much simpler.

One thing to keep in mind: protobuf compiler is already too complicated. The mechanism you described adds more complexity to it. Every time you think of adding complexity to, remember to multiply it by the number of programming languages available out there, or even worse, multiply it by the number of third party implementations. Complexity is worth adding when tons of people get benefit of it. The backward compatibility stuff will benefit, say, less than 100 people out there, IMHO. The migration mechanism I described is way simpler and does not add complexity.
 
 
 
- ensuring that the client can't change the map key in the mapped value without also updating the map structure.

Could you provide an example?

In the example from the .proto, suppose we generate key-based accessors for the items map field, and internally the field is represented as a hash_map<string, Item>.

Item* mutable_items(const string& name);
const Item& items(const string& name);

A client could do something like:
const string key = "foo";
foo.mutable_items(key)->set_name("bar");

i.e. its logical key (and its key when serialized and reparsed) is "bar" but the hash_map points to it using the old key "foo".

You see. This is not possible in the proposal I've made. The key is separated from the object. The (String,Message) message never gets exposed. It is not possible to modify it through the generated API.
Reply all
Reply to author
Forward
0 new messages