Random access clarified

Johannes Dröge

Sep 4, 2023, 11:43:34 AM
to Cap'n Proto
Hi there!

The FAQ states "Random access: You can read just one field of a message without parsing the whole thing". However, does that also apply to List indexing? I have a flexible-length list of potentially large objects, and I need to access the nth list element from disk without having to hold other elements in memory.

I started using capnp for internal serialization in a prototype, with a more dynamic approach to data types and data structures. For this, I'm mostly attracted by the fast implementation and the random access option, which lets me mmap data structures and lazy-load attributes from disk. I'm currently using the Python interface, but I might switch to C++, Rust, or Go at a later stage.

I will try to profile this with a toy example. Nevertheless, I'd be thankful for a theoretical consideration here!

Cheers
Johannes

Kenton Varda

Sep 4, 2023, 11:47:22 AM
to Johannes Dröge, Cap'n Proto
Hi Johannes,

Yes, it applies to list indexing.

-Kenton

--
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/capnproto/48035174-9253-48d3-ad3d-b3fe69d249a3n%40googlegroups.com.

Johannes Dröge

Sep 5, 2023, 7:08:01 AM
to Cap'n Proto
Great, thanks for the clear statement!

I have some follow-up questions now:

1) What is the memory usage pattern for the deserialization of such members when they are accessed (either from disk or in memory)? I suppose that the accessed element must be converted to a hardware-specific representation in memory to be usable, right? I assume that one such copy will exist in memory while it is in use, but no other copies.

2) For embedded binary (aka Data) objects, I assume that no real deserialization is actually needed. Is there a way to read such objects in a stream-like fashion to avoid putting them into memory entirely?

3) I assume that the python interface does work the same. Are you aware of any limitations?

Thanks for your support, I really enjoy that piece of software and hope that I can also use it for RPC in the future!

Kenton Varda

Sep 5, 2023, 11:20:55 AM
to Johannes Dröge, Cap'n Proto
On Tue, Sep 5, 2023 at 6:08 AM Johannes Dröge <jd.b...@gmail.com> wrote:
1) What is the memory usage pattern for the deserialization of such members when they are accessed (either from disk or in memory)? I suppose that the accessed element must be converted to a hardware-specific representation in memory to be usable, right? I assume that one such copy will exist in memory while it is in use, but no other copies.

The Cap'n Proto wire encoding is documented on the web site:


As you'll see, there is no need to translate to a "hardware-specific representation", as the wire representation is already designed to be agreeable to all modern hardware without translation. This is the core design goal of Cap'n Proto serialization.

When reading a message in Cap'n Proto, the backing buffer is only accessed on-demand when you call the getter method to get a field. There is no preprocessing at all. When you call the getter method for a pointer field, only the pointer itself is read, in order to construct a Reader object; the destination data is not read until you call methods on the Reader object to read it.
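As an illustration of why indexing into a list needs no preprocessing, here is a minimal Python sketch that decodes a list pointer by hand and reads a single element. It is not the Cap'n Proto library API: `build_u64_list` and `read_u64_element` are hypothetical helpers, the pointer layout (2-bit tag, 30-bit signed word offset, 3-bit element size code, 29-bit element count) follows my reading of the published encoding, and the sketch only covers lists of 8-byte primitive values. Lists of structs use a different size code plus a tag word and are not handled here.

```python
import struct

WORD = 8           # Cap'n Proto words are 8 bytes, little-endian
U64_SIZE_CODE = 5  # element size code for 8-byte non-pointer elements


def build_u64_list(values):
    """Toy encoder (hypothetical): one list pointer followed by its elements."""
    offset_words = 0  # elements start right after the pointer
    low = 1 | (offset_words << 2)              # tag 1 marks a list pointer
    high = U64_SIZE_CODE | (len(values) << 3)  # size code, then element count
    return struct.pack("<iI", low, high) + struct.pack(f"<{len(values)}Q", *values)


def read_u64_element(buf, pointer_pos, index):
    """Read element `index` by decoding the pointer and one 8-byte slot."""
    low, high = struct.unpack_from("<iI", buf, pointer_pos)
    if low & 3 != 1 or high & 7 != U64_SIZE_CODE:
        raise ValueError("not a pointer to a list of 8-byte elements")
    count = high >> 3
    if not 0 <= index < count:
        raise IndexError(index)
    # The offset is counted in words from the end of the pointer.
    start = pointer_pos + WORD + (low >> 2) * WORD
    return struct.unpack_from("<Q", buf, start + index * WORD)[0]
```

Only the 8-byte pointer and the 8-byte slot of the requested element are read; the other elements are never touched, which is what makes the nth element cheap to reach in a memory-mapped file.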
 
2) For embedded binary (aka Data) objects, I assume that no real deserialization is actually needed. Is there a way to read such objects in a stream-like fashion to avoid putting them into memory entirely?

When reading a byte array, you essentially get a pointer into the backing buffer; none of the bytes are accessed until your code does the accessing. If you are reading from an mmapped file, then the pages will only be loaded into memory when you access them, and the kernel can automatically unload pages later when it needs memory for something else. There's nothing special you need to do to achieve "streaming" in this case.
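The paging behaviour described above can be sketched with Python's standard `mmap` module alone, independent of Cap'n Proto. The helpers `read_slice` and `make_demo_file` are hypothetical names for this illustration, and the file contents are just zero padding with a marker; the point is only that a slice of a mapped file copies the requested range and nothing else.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Return `length` bytes at `offset` from a memory-mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing the map copies only this range; the kernel loads
            # the pages backing it on demand.
            return mm[offset:offset + length]


def make_demo_file(total_size, offset, marker):
    """Write a zero-filled file with `marker` placed at `offset`."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * offset)
        f.write(marker)
        f.write(b"\0" * (total_size - offset - len(marker)))
        return f.name


def demo():
    path = make_demo_file(1 << 20, 400_000, b"hello")
    try:
        return read_slice(path, 400_000, 5)
    finally:
        os.remove(path)
```

Calling `demo()` maps a 1 MiB file but copies only five bytes out of it. How many pages the kernel actually keeps resident is up to the operating system and is not something this sketch measures.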

If you are reading a message from the network, though, it is necessary for the entire message to arrive in memory before you can begin accessing it. To achieve "streaming" from the network, you need to design your application to send chunks of the stream as separate RPC calls.
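One way to picture the chunking advice is the following Python sketch. It does not use the Cap'n Proto RPC API: `send_chunk` is a hypothetical callback standing in for one RPC call per chunk, and the 64 KiB default chunk size is an arbitrary assumption.

```python
def iter_chunks(payload, chunk_size=64 * 1024):
    """Yield consecutive zero-copy views of `payload`, each at most `chunk_size` bytes."""
    view = memoryview(payload)
    for start in range(0, len(view), chunk_size):
        yield view[start:start + chunk_size]


def send_stream(payload, send_chunk, chunk_size=64 * 1024):
    """Pass each chunk to `send_chunk` (hypothetical stand-in for one RPC call)."""
    count = 0
    for chunk in iter_chunks(payload, chunk_size):
        # Each call would carry one small message, so the receiver only
        # has to buffer a single chunk at a time.
        send_chunk(bytes(chunk))
        count += 1
    return count
```

For example, `send_stream(data, received.append)` with a plain list `received` collects the chunks in order, and joining them reproduces `data`.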
 
3) I assume that the python interface does work the same. Are you aware of any limitations?

The Python implementation wraps the C++ implementation, so it should broadly work the same, but not all APIs are exposed. I don't personally maintain the Python code, so I can't really answer detailed questions about what it can or can't do, sorry.

-Kenton

Johannes Dröge

Sep 19, 2023, 10:35:10 AM
to Cap'n Proto
Dear Kenton,

Thanks for taking the time for those detailed answers, I really appreciate it! Though most of this information is available elsewhere in more technical form, it does not always give beginners those answers in a comprehensible way.

Design-wise, capnp really seems outstanding, and my first tests are quite impressive. My take-home message for system design, based on this, is: "Though message objects can grow quite large (e.g. with a 500 MiB member/field), you should keep them small for streaming applications to avoid excessive buffering".

Cheers
Johannes