Streaming objects as they're built

277 views
Skip to first unread message

Scott Purdy

unread,
May 22, 2015, 5:13:41 PM5/22/15
to capn...@googlegroups.com
In our project we have "networks" that can be very large. They may be comprised of 10 different "nodes" that may be 500MB or a gig.

It would be nice to have a "Network" schema that encapsulates the entire network but my understanding is that we will have to copy the data for the entire network and all nodes into the capnp structs in memory before writing anything to disk. This also means you need 2x the memory during serialization. We could independently write out each node, but this requires us to keep track of the different pieces, which I'd prefer to rely on capnp for.

Is there any way to start writing a capnp builder out to disk before you have copied all of the data over? Or is there a different solution that people have had success with?

Kamal Marhubi

unread,
May 22, 2015, 5:37:35 PM5/22/15
to Scott Purdy, capn...@googlegroups.com
This is not an immediate solution, but I think it should be possible to write a MessageBuilder implementation backed by an mmapped file.

--
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+...@googlegroups.com.
Visit this group at http://groups.google.com/group/capnproto.

Scott Purdy

unread,
May 22, 2015, 5:44:41 PM5/22/15
to Kamal Marhubi, capnproto
That would be great but how would you allocate it initially when there are lists?

--
You received this message because you are subscribed to a topic in the Google Groups "Cap'n Proto" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/capnproto/kLQOsxjkjxM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to capnproto+...@googlegroups.com.

Kenton Varda

unread,
May 22, 2015, 9:51:23 PM5/22/15
to Scott Purdy, Kamal Marhubi, capnproto
MessageBuilder is designed to be growable -- you allocate a small segment first, then when it's out of space allocate a new larger segment, etc. So an mmap-backed MessageBuilder would do ftruncate() each time it needs to grow the file and mmap the new space as the next segment.

-Kenton

Scott Purdy

unread,
May 22, 2015, 11:08:36 PM5/22/15
to Kenton Varda, Kamal Marhubi, capnproto
Yeah that sounds like it would work really well. Any chance something like this would work to stream data over the net? Or only with disk?

Kamal Marhubi

unread,
May 22, 2015, 11:35:25 PM5/22/15
to Scott Purdy, Kenton Varda, capnproto

I've never done it, but I think something similar should be possible with some combination of splice, vmsplice, and sendfile.

Scott Purdy

unread,
May 22, 2015, 11:53:31 PM5/22/15
to Kamal Marhubi, Kenton Varda, capnproto

Ok thanks for the info, good to know those options are there!

Kamal Marhubi

unread,
May 23, 2015, 9:16:11 PM5/23/15
to Scott Purdy, capnproto
On Fri, May 22, 2015 at 2:44 PM Scott Purdy <sc...@fer.io> wrote:
That would be great but how would you allocate it initially when there are lists?

I'm not sure I follow. I am imagining an implementation whose `allocateSegment` method would return a pointer to some memory in the mapped region.

Kenton Varda

unread,
May 23, 2015, 9:18:32 PM5/23/15
to Kamal Marhubi, Scott Purdy, capnproto
(For some reason Kamal's last message landed in the Google Groups moderation queue flagged as spam, hence the delayed delivery.)

Kamal Marhubi

unread,
Jun 23, 2015, 12:10:38 AM6/23/15
to Kenton Varda, Scott Purdy, capnproto
Scott—I'm planning on taking a stab at an mmap-based MessageBuilder this week. To avoid duplicating work, is this something you started on?

Kenton—would this fit in the core Cap'n Proto library?

Kenton Varda

unread,
Jun 23, 2015, 1:16:03 AM6/23/15
to Kamal Marhubi, Scott Purdy, capnproto
On Mon, Jun 22, 2015 at 9:10 PM, Kamal Marhubi <ka...@marhubi.com> wrote:
Kenton—would this fit in the core Cap'n Proto library?

Maybe, I'll have to see how you do it. :)

The main problem is segment framing. The current standard format, with an upfront segment table, won't really work for mmap, for a couple reasons:
- When you need to grow the table, you'd have to push back all the data by a word, which is obviously unacceptable.
- Each segment will end up with some unused space at the end, since it's unlikely that the last object allocated in the segment will just happen to end at exactly the end of the segment. In the serialized format, we can easily just not serialize these unused bytes, but in an mmap() format that space is already on-disk. Currently you cannot ask Cap'n Proto to return those bytes to you; once allocated, those bytes are permanently part of the segment and Cap'n Proto may choose to go back and use them even after new segments have been allocated.

So you need some alternative framing approach.

While you're at it, you may want to aim for segments to be page-aligned.

Maybe you could define a format in which all segment sizes are hard-coded, so you don't need any table. E.g. have the segment sizes be:

1: N pages
2: N pages
3: 2*N pages
4: 4*N pages
5: 8*N pages
...

Thus the file size is always a power of two and you can easily determine the count, size, and locations of all segments from the file size.

(Note that segments have a hard limit of 4GB, so once you hit that subsequent segments will have to be 4GB each and file sizes will no longer be power-of-two...)

Another possibility is to store each segment as a separate file.

Or yet another, crazy-but-awesome strategy would be to have all segments be 4GB, but initially allocated only with ftruncate(), which adds a "hole" to the end of the file for which pages aren't actually allocated until needed. This way each page will be allocated when you first touch it. Since Cap'n Proto fills the segment sequentially, this actually works really well, at least in theory. The down side is that files with "holes" tend to behave badly with standard tools, e.g. `ls` will show the file as being 4GB even when it only has a few bytes of data, and `cp` will turn the hole into zeros, losing the optimization. There may also be performance problems with the TLB cache being flushed every time a new page is allocated; I'm not sure.

-Kenton

Kamal Marhubi

unread,
Jun 23, 2015, 4:19:45 PM6/23/15
to Kenton Varda, Scott Purdy, capnproto
On Tue, Jun 23, 2015 at 1:16 AM Kenton Varda <ken...@sandstorm.io> wrote:
On Mon, Jun 22, 2015 at 9:10 PM, Kamal Marhubi <ka...@marhubi.com> wrote:
Kenton—would this fit in the core Cap'n Proto library?

Maybe, I'll have to see how you do it. :)

Fair enough! Is a Linux only implementation ok for inclusion, assuming everything else is fine? It should be possible to make it use only POSIX, but I have to start somewhere.
 
The main problem is segment framing. The current standard format, with an upfront segment table, won't really work for mmap, for a couple reasons:
- When you need to grow the table, you'd have to push back all the data by a word, which is obviously unacceptable.

Possible mitigation that preserves compatibility with the recommended byte stream framing: always have the segment table be a multiple of the page size, with a bunch of zero-length segments to pad it out. I haven't looked at the code, but this sounds like it should work from the encoding specification page. I'm not sure how best to handle potentially different platforms' various page sizes though. Lowest common denominator, perhaps? More wastage would result, but they'd be readable on more systems.

An alternative that does not preserve compatibility: move the segment table to the end, delimited by some sort of magic sequence. This requires me to look more carefully at the encoding spec to pick some magic.
 
- Each segment will end up with some unused space at the end, since it's unlikely that the last object allocated in the segment will just happen to end at exactly the end of the segment. In the serialized format, we can easily just not serialize these unused bytes, but in an mmap() format that space is already on-disk. Currently you cannot ask Cap'n Proto to return those bytes to you; once allocated, those bytes are permanently part of the segment and Cap'n Proto may choose to go back and use them even after new segments have been allocated.

fallocate with FALLOC_FL_COLLAPSE_RANGE could be a way to limit the waste to FS block padding in each segment. I'll have to think about it a bit more that shou. This would be Linux only and 3.15+ at that.


So you need some alternative framing approach.

While you're at it, you may want to aim for segments to be page-aligned.

Probably a good idea!

Maybe you could define a format in which all segment sizes are hard-coded, so you don't need any table. E.g. have the segment sizes be:

1: N pages
2: N pages
3: 2*N pages
4: 4*N pages
5: 8*N pages
...

Thus the file size is always a power of two and you can easily determine the count, size, and locations of all segments from the file size.

(Note that segments have a hard limit of 4GB, so once you hit that subsequent segments will have to be 4GB each and file sizes will no longer be power-of-two...)

I think my initial PoC will actually just have constant sized segments for ease of implementation. Once that's done, I can experiment with this other stuff.
 
Another possibility is to store each segment as a separate file.

Workable, though it introduces more ways for user error to mess things up.

Or yet another, crazy-but-awesome strategy would be to have all segments be 4GB, but initially allocated only with ftruncate(), which adds a "hole" to the end of the file for which pages aren't actually allocated until needed. This way each page will be allocated when you first touch it. Since Cap'n Proto fills the segment sequentially, this actually works really well, at least in theory. The down side is that files with "holes" tend to behave badly with standard tools, e.g. `ls` will show the file as being 4GB even when it only has a few bytes of data, and `cp` will turn the hole into zeros, losing the optimization. There may also be performance problems with the TLB cache being flushed every time a new page is allocated; I'm not sure.

This is another idea that could work and is pretty attractive. In fact, this could be a workaround for the TLB cache problem: map in the whole file, but mprotect(PROT_NONE) most of it. In a SIGSEGV handler, unprotect the pages and use madvise(MADV_WILLNEED), or alternatively only partially map the file and use mmap(MAP_POPULATE) to prefault. The effect is to amortize the cost of new pages being allocated using a sort of subsegment system implemented through the memory manager. (Let me know if this is too crazy and / or cannot work!)

Writing this has taken—on and off—most of the day, so I'll end it here. One last question: is there anything like an RFC process for Cap'n Proto related stuff?

-Kamal

Alex Elsayed

unread,
Jun 23, 2015, 5:05:40 PM6/23/15
to capn...@googlegroups.com
Kamal Marhubi wrote:

> On Tue, Jun 23, 2015 at 1:16 AM Kenton Varda <ken...@sandstorm.io> wrote:

<snip>
> Or yet another, crazy-but-awesome strategy would be to have all segments
> be
>> 4GB, but initially allocated only with ftruncate(), which adds a "hole"
>> to the end of the file for which pages aren't actually allocated until
>> needed. This way each page will be allocated when you first touch it.
>> Since Cap'n Proto fills the segment sequentially, this actually works
>> really well, at least in theory. The down side is that files with "holes"
>> tend to behave badly with standard tools, e.g. `ls` will show the file as
>> being 4GB even when it only has a few bytes of data, and `cp` will turn
>> the hole into zeros, losing the optimization. There may also be
>> performance problems with the TLB cache being flushed every time a new
>> page is allocated; I'm not sure.
>>
>
> This is another idea that could work and is pretty attractive. In fact,
> this could be a workaround for the TLB cache problem: map in the whole
> file, but mprotect(PROT_NONE) most of it. In a SIGSEGV handler, unprotect
> the pages and use madvise(MADV_WILLNEED), or alternatively only partially
> map the file and use mmap(MAP_POPULATE) to prefault. The effect is to
> amortize the cost of new pages being allocated using a sort of subsegment
> system implemented through the memory manager. (Let me know if this is too
> crazy and / or cannot work!)

One downside of this: Some filesystems do not support sparse files at all.
You're unlikely to see one very often on *Linux*, but one in particular
might be a big problem elsewhere: HFS+

Kamal Marhubi

unread,
Jun 23, 2015, 5:17:01 PM6/23/15
to Alex Elsayed, capn...@googlegroups.com
On Tue, Jun 23, 2015 at 5:05 PM Alex Elsayed <etern...@gmail.com> wrote:
One downside of this: Some filesystems do not support sparse files at all.
You're unlikely to see one very often on *Linux*, but one in particular
might be a big problem elsewhere: HFS+

This is a great point. I don't know how multiplatform all these things need to be, and to what degree it matters that the output of an mmap based MessageBuilder can be read by the MessageReaders in serialize.h. Kenton?

Alex Elsayed

unread,
Jun 23, 2015, 5:23:49 PM6/23/15
to capn...@googlegroups.com
Oh, it'd be readable. I don't mean it'd error out or anything; I mean that
on HFS+ if you seek forward 4gb or call ftruncate() or whatever, you're
going to be waiting on 4gb of writes.

Kamal Marhubi

unread,
Jun 23, 2015, 5:33:04 PM6/23/15
to Alex Elsayed, capn...@googlegroups.com


On Tue, Jun 23, 2015 at 5:23 PM Alex Elsayed <etern...@gmail.com> wrote:
Kamal Marhubi wrote:
> This is a great point. I don't know how multiplatform all these things
> need to be, and to what degree it matters that the output of an mmap based
> MessageBuilder can be read by the MessageReaders in serialize.h. Kenton?

Oh, it'd be readable. I don't mean it'd error out or anything; I mean that
on HFS+ if you seek forward 4gb or call ftruncate() or whatever, you're
going to be waiting on 4gb of writes.

Right right. I guess a better way to phrase it is: how much should compatibility be weighted vs good performance. A lot of Kenton's suggestions would be incompatible with those readers, since there is no segment table header. I like the idea of maintaining that compatibility, but would like others' input. Eg, yours! :-)

Paul Pelzl

unread,
Jun 24, 2015, 2:01:02 AM6/24/15
to Alex Elsayed, capnproto
On Tue, Jun 23, 2015 at 2:05 PM, Alex Elsayed <etern...@gmail.com> wrote:
Kamal Marhubi wrote:
> This is another idea that could work and is pretty attractive. In fact,
> this could be a workaround for the TLB cache problem: map in the whole
> file, but mprotect(PROT_NONE) most of it. In a SIGSEGV handler, unprotect
> the pages and use madvise(MADV_WILLNEED), or alternatively only partially
> map the file and use mmap(MAP_POPULATE) to prefault. The effect is to
> amortize the cost of new pages being allocated using a sort of subsegment
> system implemented through the memory manager. (Let me know if this is too
> crazy and / or cannot work!)

One downside of this: Some filesystems do not support sparse files at all.
You're unlikely to see one very often on *Linux*, but one in particular
might be a big problem elsewhere: HFS+

I think the following strategy has equivalent behavior, but avoids excessive allocations on filesystems without sparse file support:
    1) mmap() a very large region using a dummy device (maybe /dev/null?), just to reserve address space
    2) use ftruncate() to set a smallish file size
    3) mmap() with MAP_FIXED to remap the first part of region (1) so it points to the file
    4) mprotect() the rest of the region, as above
    5) SIGSEGV handler uses ftruncate() to extend the file, and *also* uses mmap() with MAP_FIXED to extend the mapping

Paul

Kamal Marhubi

unread,
Jun 24, 2015, 3:14:56 PM6/24/15
to Paul Pelzl, Alex Elsayed, capnproto
On Wed, Jun 24, 2015 at 2:01 AM Paul Pelzl <pel...@gmail.com> wrote:
I think the following strategy has equivalent behavior, but avoids excessive allocations on filesystems without sparse file support:
    1) mmap() a very large region using a dummy device (maybe /dev/null?), just to reserve address space
    2) use ftruncate() to set a smallish file size
    3) mmap() with MAP_FIXED to remap the first part of region (1) so it points to the file
    4) mprotect() the rest of the region, as above
    5) SIGSEGV handler uses ftruncate() to extend the file, and *also* uses mmap() with MAP_FIXED to extend the mapping

I like it! I came across MAP_FIXED last night, and wheels were whirring towards something like this. At least on Linux, it seems mapping /dev/null is a no go. I'll see if I can borrow someone's mac to take a look on there.

Paul

Kenton Varda

unread,
Jun 24, 2015, 5:10:02 PM6/24/15
to Kamal Marhubi, Scott Purdy, capnproto
On Tue, Jun 23, 2015 at 1:19 PM, Kamal Marhubi <ka...@marhubi.com> wrote:
On Tue, Jun 23, 2015 at 1:16 AM Kenton Varda <ken...@sandstorm.io> wrote:
On Mon, Jun 22, 2015 at 9:10 PM, Kamal Marhubi <ka...@marhubi.com> wrote:
Kenton—would this fit in the core Cap'n Proto library?

Maybe, I'll have to see how you do it. :)

Fair enough! Is a Linux only implementation ok for inclusion, assuming everything else is fine? It should be possible to make it use only POSIX, but I have to start somewhere.

Hmm, I think it's probably best to start out shipping this as a separate library and see where things go. It's a lot easier to add things to the core library later than it is to remove things from it.
 
Possible mitigation that preserves compatibility with the recommended byte stream framing: always have the segment table be a multiple of the page size, with a bunch of zero-length segments to pad it out.

You could do that, but you'd want to leave plenty of space in the table, and also note that you will still have wasted space at the end of each segment: unlike when streaming, you cannot retract the segment endpoint back to the end of the last object allocated in it, because there's no way to tell Cap'n Proto "hey this segment that you're still actively using for a live message just became shorter".
 
An alternative that does not preserve compatibility: move the segment table to the end, delimited by some sort of magic sequence. This requires me to look more carefully at the encoding spec to pick some magic.

IMO if you're breaking compatibility anyhow, and you can't get away from wasted space at the end of each segment, then the implicitly-doubling sequence of segments is almost strictly better than having a segment table at the end.
 
- Each segment will end up with some unused space at the end, since it's unlikely that the last object allocated in the segment will just happen to end at exactly the end of the segment. In the serialized format, we can easily just not serialize these unused bytes, but in an mmap() format that space is already on-disk. Currently you cannot ask Cap'n Proto to return those bytes to you; once allocated, those bytes are permanently part of the segment and Cap'n Proto may choose to go back and use them even after new segments have been allocated.

fallocate with FALLOC_FL_COLLAPSE_RANGE could be a way to limit the waste to FS block padding in each segment. I'll have to think about it a bit more that shou. This would be Linux only and 3.15+ at that.

You could perhaps use that as a finalization step, after the MessageBuilder has been torn down. But you'll only be able to remove block-aligned holes this way. More likely, each segment will have a handful of words wasted -- probably not worth worrying about, honestly.
 
This is another idea that could work and is pretty attractive. In fact, this could be a workaround for the TLB cache problem: map in the whole file, but mprotect(PROT_NONE) most of it. In a SIGSEGV handler, unprotect the pages and use madvise(MADV_WILLNEED), or alternatively only partially map the file and use mmap(MAP_POPULATE) to prefault. The effect is to amortize the cost of new pages being allocated using a sort of subsegment system implemented through the memory manager. (Let me know if this is too crazy and / or cannot work!)

This sounds like something the kernel should already be doing. :) Maybe it does. I don't really know. You should investigate the kernel's exact behavior before going down this route.

Intuitively (but without a deep understanding of how these things work) I would expect that if you can do something in userspace which avoids TLB flushes, then the kernel is probably already doing the same thing internally by default. Conversely, if the kernel isn't avoiding TLB flushes, then it's probably not possible to do so, even in userspace.
 
Writing this has taken—on and off—most of the day, so I'll end it here. One last question: is there anything like an RFC process for Cap'n Proto related stuff?

Just this mailing list.

I tend to believe in people owning their own designs and decisions, without approval processes or gatekeepers. Offering this functionality as a separate library, rather than a feature in the base library, aligns well with that. (This is a library I suspect I'll end up using in Sandstorm, FWIW.)
 
This is a great point. I don't know how multiplatform all these things need to be, and to what degree it matters that the output of an mmap based MessageBuilder can be read by the MessageReaders in serialize.h. Kenton?

It depends on where you plan to use this. I think it's perfectly fine to design a library which requires Linux-specific features, if your use case is Linux-specific anyway. Other people can develop more-portable (but less-efficient) libraries as needed. Of course I wouldn't design something to be non-portable without a good reason, but I think sparse files' potential to hugely simplify the implementation here is a worthwhile reason -- assuming it actually works and is performant in practice, which you'll want to find out.

FWIW, I regularly rely on Linux features in Sandstorm (and even more so Blackrock) code, since Sandstorm is already thoroughly dependent on Linux anyhow.

-Kenton

Paul Pelzl

unread,
Jun 25, 2015, 12:08:25 AM6/25/15
to Kamal Marhubi, Alex Elsayed, capnproto
On Wed, Jun 24, 2015 at 12:14 PM, Kamal Marhubi <ka...@marhubi.com> wrote:
On Wed, Jun 24, 2015 at 2:01 AM Paul Pelzl <pel...@gmail.com> wrote:
I think the following strategy has equivalent behavior, but avoids excessive allocations on filesystems without sparse file support:
    1) mmap() a very large region using a dummy device (maybe /dev/null?), just to reserve address space
    2) use ftruncate() to set a smallish file size
    3) mmap() with MAP_FIXED to remap the first part of region (1) so it points to the file
    4) mprotect() the rest of the region, as above
    5) SIGSEGV handler uses ftruncate() to extend the file, and *also* uses mmap() with MAP_FIXED to extend the mapping

I like it! I came across MAP_FIXED last night, and wheels were whirring towards something like this. At least on Linux, it seems mapping /dev/null is a no go. I'll see if I can borrow someone's mac to take a look on there.

It's not in POSIX, but it looks like BSD/OS X and Linux all support MAP_ANON with fd == -1.  (On Linux it's deprecated in favor of MAP_ANONYMOUS, but maybe it's one of those deprecated things that will never actually go away.)

Paul

Alex Elsayed

unread,
Jun 25, 2015, 2:18:05 AM6/25/15
to capn...@googlegroups.com
There's also the option of mmapping /dev/zero, whose original purpose was
specifically for anonymous mappings (though MAP_ANON[YMOUS] has mostly
supplanted that in modern times)

Alex Elsayed

unread,
Jun 25, 2015, 2:25:06 AM6/25/15
to capn...@googlegroups.com
One enormous downside: Libraries consuming signals is... very very nasty.
Not simply in the mechanics, but in the ramification for the library's users
and other libraries they may use. This is a can of worms I really wouldn't
suggest opening.

See also: Andrea Arcangeli (of hugepages fame) is currently working on a
'userfaultfd' interface *exactly* to avoid the need to use SIGSEGV to do
userspace memory stuff.

Reply all
Reply to author
Forward
0 new messages