About random access for Cap'n Proto messages


wangt...@gmail.com

Dec 5, 2017, 1:53:26 PM
to Cap'n Proto
Hi,

I am working on a project that uses protobuf to encode/decode messages, and I am evaluating whether it is worth migrating to Cap'n Proto. I am using the Java implementation: https://github.com/capnproto/capnproto-java

From the documentation (https://capnproto.org/index.html), random access is mentioned as a key feature, but I am not able to find any code example demonstrating it. Am I misunderstanding it? Does "random access" simply mean we can access any field without "deserializing" the whole message (it is actually not serialized at all unless packed)?

What I thought "random access" meant is that Cap'n Proto can read any field back from disk without loading the whole message into memory. But from the Java API implementation (the source code), it seems that it always reads the whole message into a byte buffer, gets the root, and then accesses fields. So I guess my understanding is wrong, isn't it?

Our scenario:
Our current protobuf message schema has many fields (~100), with other messages embedded. The serialized message size varies from hundreds of bytes to tens of kilobytes, and a few large messages may exceed 1 megabyte. We store the messages as compressed byte arrays in an underlying KV store, read them back from the KV store, decompress, and then parse them into protobuf objects.

In this case, do you think it is worth migrating from protobuf to Cap'n Proto? If so, how can I benefit from the "random access" feature?


Thanks,
Tao

Kenton Varda

Dec 5, 2017, 2:09:13 PM
to wangt...@gmail.com, Cap'n Proto
Hi Tao,

You can get random access to files on disk by memory mapping the file. In Java, you would use FileChannel.map() to get a MappedByteBuffer. You can then pass that ByteBuffer off to Cap'n Proto and use it like any other ByteBuffer. The operating system will not actually read in the data from disk until your program attempts to access the corresponding part of the MappedByteBuffer, which Cap'n Proto will only do when you invoke the accessor for a field located there. So, somewhat magically, you get random access.
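A minimal stdlib-only sketch of this mapping step, assuming a file that already holds a serialized message (the `Serialize.read` call and the `Foo` schema type in the comment are illustrative, not taken from this thread):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws IOException {
        // A small file standing in for a serialized Cap'n Proto message.
        Path file = Files.createTempFile("capnp-demo", ".bin");
        Files.write(file, new byte[] {1, 2, 3, 4, 5, 6, 7, 8});

        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the file. No data is read from disk here; pages are
            // faulted in lazily, only when the buffer is actually accessed.
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // Hand `buf` to Cap'n Proto like any other ByteBuffer, e.g.
            // (hypothetical schema type `Foo`):
            //   MessageReader message = org.capnproto.Serialize.read(buf);
            //   Foo.Reader foo = message.getRoot(Foo.factory);
            // Here we just touch one byte to show random access into the mapping.
            System.out.println(buf.get(4));  // prints 5
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```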

Unfortunately, you cannot get random access to compressed data this way, unless the compression is implemented inside the OS / filesystem. (And most compression methods are not random-access-friendly anyhow.)

-Kenton

--
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/capnproto.

wangt...@gmail.com

Dec 5, 2017, 5:27:30 PM
to Cap'n Proto
Thanks a lot, I got it. In my case, I will always read the compressed byte array back from the KV store, decompress, and then read fields. So, in this case, "random access" means Cap'n Proto will only create the object for the field being accessed, without creating temporary objects for the other fields; in other words, all other fields remain flat bytes with no managed objects created. Is that correct?
Moreover, another question is how to write a message in packed format to a byte array. I have to allocate a ByteBuffer with enough capacity to hold the message, but it is not possible to know the packed message size without packing it first. Currently, I allocate the unpacked size (computeSerializedSizeInWords * 8) and then use a tricky way to trim the trailing zeros. Do you know a better way to do this?

Thanks,
Tao

Kenton Varda

Dec 6, 2017, 6:05:48 PM
to wangt...@gmail.com, Cap'n Proto
On Tue, Dec 5, 2017 at 2:27 PM, <wangt...@gmail.com> wrote:
Thanks a lot, I got it. In my case, I will always read the compressed byte array back from the KV store, decompress, and then read fields. So, in this case, "random access" means Cap'n Proto will only create the object for the field being accessed, without creating temporary objects for the other fields; in other words, all other fields remain flat bytes with no managed objects created. Is that correct?

Yes. However, if you're reading *packed* messages, then packed bytes do need to be unpacked upfront. They are unpacked into another ByteBuffer. No message objects are created, but this does require reading through all the bytes.

The memory mapping strategy I described does not work for packed messages.
 
Moreover, another question is how to write a message in packed format to a byte array. I have to allocate a ByteBuffer with enough capacity to hold the message, but it is not possible to know the packed message size without packing it first. Currently, I allocate the unpacked size (computeSerializedSizeInWords * 8) and then use a tricky way to trim the trailing zeros. Do you know a better way to do this?

The only way to know the packed size is to actually run the packing algorithm. You could run the algorithm twice, once where you throw away the data just to get the size, and then another time to save it. Or, you could allocate successive buffers on-demand, and then assemble them into one big buffer at the end. Or, if you're going to write to an OutputStream anyway, write the bytes to the OutputStream as they are being packed, rather than packing everything first and writing second.
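Writing into a growable buffer as you pack, as suggested above, sidesteps the sizing problem. A simplified stdlib-only sketch of the packing idea shows why the size is only knowable by running the algorithm: each 8-byte word becomes a tag byte (one bit per nonzero byte) followed by only the nonzero bytes. (The real Cap'n Proto packing algorithm additionally run-length-encodes all-zero and all-nonzero words; that is omitted here.)

```java
import java.io.ByteArrayOutputStream;

public class PackSketch {
    // Simplified sketch of Cap'n Proto packing: per 8-byte word, emit a tag
    // byte whose bit i marks byte i as nonzero, then the nonzero bytes only.
    // Omitted: the real algorithm's run-length handling of 0x00/0xFF tags.
    static byte[] pack(byte[] words) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // grows on demand
        for (int w = 0; w < words.length; w += 8) {
            int tag = 0;
            for (int i = 0; i < 8; i++) {
                if (words[w + i] != 0) tag |= 1 << i;
            }
            out.write(tag);
            for (int i = 0; i < 8; i++) {
                if (words[w + i] != 0) out.write(words[w + i]);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Example word from the encoding spec: 08 00 00 00 03 00 02 00
        byte[] word = {0x08, 0, 0, 0, 0x03, 0, 0x02, 0};
        byte[] packed = pack(word);
        // Bytes 0, 4, and 6 are nonzero, so the tag is 0x51. The packed
        // size (4) is only known after running the algorithm over the data.
        System.out.println(packed.length);       // prints 4
        System.out.printf("%02x%n", packed[0]);  // prints 51
    }
}
```

The same structure works for streaming: because `ByteArrayOutputStream` grows on demand, no size needs to be known upfront, which is the "allocate successive buffers" strategy in miniature.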

-Kenton