vector alignment of arrays

488 views
Skip to first unread message

Gregory Allen

unread,
Aug 26, 2013, 10:57:41 AM8/26/13
to capn...@googlegroups.com
Does capnproto have any consideration for SIMD vector alignment of arrays of data? That's 16-byte alignment for SSE and 32-byte for AVX.

I do high-throughput scientific computing, which generally involves big N-D arrays of data with small amounts of associated metadata (think HDF5). I've never used Protocol Buffers because I can't afford the encoding/decoding. The fact that capnproto avoids that expense makes it very interesting. The ability to have vector alignment of arrays would be even better.

Thoughts?

Thanks,
-Greg

Gregory E. Allen, PhD, Engineering Scientist
Applied Research Laboratories: The University of Texas at Austin
512-835-3487

Kenton Varda

unread,
Aug 26, 2013, 3:00:03 PM8/26/13
to Gregory Allen, capnproto
Hi Gregory,

Currently, no, there are no considerations for that, and honestly I'm not entirely sure how it could be done.  How do you handle this normally?  E.g. if you call malloc(), I assume it only guarantees word alignment, so how do you get a buffer aligned to 32 bytes?  Do you use some other memory allocator, or do you allocate space for size + 1 elements and then leave padding at the front to get it aligned?

Things that seem like they'd need to happen for this to work:
- You'd need a custom MessageBuilder subclass that allocates aligned segments.
- You'd need a way to specify alignment when allocating Data blobs.
- When you read the message later, you'd need some way to make sure the MessageReader loads the segments aligned, which means either changing or re-implementing the serialization code.

-Kenton

Gregory Allen

unread,
Aug 26, 2013, 5:09:17 PM8/26/13
to Kenton Varda, capnproto
There are various ways to get an aligned buffer. The modern malloc-equivalent is posix_memalign (see http://linux.die.net/man/3/memalign). Some libraries provide their own (e.g. fftw_malloc/fftw_free). On MacOS, malloc just does it. Memory pages are aligned, so mmap() and shared memory are also aligned.

Alignment can make a large performance difference for code operating directly on the buffer. Misaligned loads and stores turn into two operations at the load/store unit. In SIMD programming, memory system latency and throughput are often the bottleneck.

I don't claim to know much about the internals of capnproto, so your "would need to happen" list is somewhat Greek to me.

I'd say the buffer should just be aligned *relative* to the start of the message. Then there's still no need to serialize. Users who want aligned arrays have to put their messages in aligned buffers.

One way to specify alignment could be new types, e.g. Float32x4. Intel sort of did this with their SIMD types (__m64, __m128, __m256). I just learned there's also an __m512 coming up.

Specifying an alignment in bytes seems more future-proof, but would probably need language modification. This is how the above types are declared in C:

typedef struct __declspec(align(16)) { float f[4]; } __m128;

Perhaps:

data @0 :List(Float32) align(16);

or

data @0 :List(Float32, 16);


I realize this is probably out-of-bounds of your current vision of capnproto. But it could enable new territory for your "insanely fast data interchange format". :)

Thanks,
-Greg

Andreas Stenius

unread,
Aug 26, 2013, 5:16:30 PM8/26/13
to Gregory Allen, Kenton Varda, capnproto
Actually, there are annotations [1] you can apply to any element in your schema.
Then it would be up to the compiler to adhere to the annotation.. it could look something like this:

struct Array {
  data @0 :List(Float32) $align(16);
}




2013/8/26 Gregory Allen <gal...@arlut.utexas.edu>

Andrew Lutomirski

unread,
Aug 26, 2013, 5:17:58 PM8/26/13
to Gregory Allen, Kenton Varda, capnproto
On Mon, Aug 26, 2013 at 2:09 PM, Gregory Allen <gal...@arlut.utexas.edu> wrote:
> There are various ways to get an aligned buffer. The modern malloc-equivalent is posix_memalign (see http://linux.die.net/man/3/memalign). Some libraries provide their own (e.g. fftw_malloc/fftw_free). On MacOS, malloc just does it. Memory pages are aligned, so mmap() and shared memory are also aligned.
>
> Alignment can make a large performance difference for code operating directly on the buffer. Misaligned loads and stores turn into two operations at the load/store unit. In SIMD programming, memory system latency and throughput are often the bottleneck.
>
> I don't claim to know much about the internals of capnproto, so your "would need to happen" list is somewhat Greek to me.
>
> I'd say the buffer should just be aligned *relative* to the start of the message. Then there's still no need to serialize. Users who want aligned arrays have to put their messages in aligned buffers.
>
> One way to specify alignment could be new types, e.g. Float32x4. Intel sort of did this with their SIMD types (__m64, __m128, __m256). I just learned there's also an __m512 coming up.
>
> Specifying an alignment in bytes seems more future-proof, but would probably need language modification. This is how the above types are declared in C:
>
> typedef struct __declspec(align(16)) { float f[4]; } __m128;
>
> Perhaps:
>
> data @0 :List(Float32) align(16);
>
> or
>
> data @0 :List(Float32, 16);
>
>
> I realize this is probably out-of-bounds of your current vision of capnproto. But it could enable new territory for your "insanely fast data interchange format". :)

I have an old project that could have used this stuff. I'm not really
convinced it should be part of the schema, though -- alignment
requirements may vary by architecture. Having an annotation and
serializer support could be useful, though.

Geoffrey Romer

unread,
Aug 26, 2013, 5:31:16 PM8/26/13
to Kenton Varda, Gregory Allen, capnproto


--
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+...@googlegroups.com.
Visit this group at http://groups.google.com/group/capnproto.

Kenton Varda

unread,
Aug 27, 2013, 12:23:04 AM8/27/13
to Gregory Allen, capnproto
To be clear, I would like to support this.  I'm trying to figure out if there's a way to do it without too much trouble.

So, Cap'n Proto messages are composed of multiple "segments", each of which is composed of a number of "objects", where an object is e.g. a struct, a list, or a byte blob.  Each segment is a contiguous block of memory.  Objects are word aligned within their segments, where "word" always means 64-bit for Cap'n Proto.

So, in order to support higher alignment, we need to do two things:
1) Make sure objects are aligned relative to the segment start.
2) Make sure segments are aligned.  This has two sub-problems:
  a. Make sure segments of a newly-built message are aligned.
  b. Make sure segments read off the wire are aligned.

These turn out to be very different problems.

2a is the most trivial to solve:  Just use posix_memalign() instead of calloc() in MallocMessageBuilder and specify 32-byte alignment.  We could probably just do this for everyone whether they need it or not.  However, if people choose to pass a scratch array to MallocMessageBuilder's constructor, they will need to be responsible for aligning that array if it matters to them.  And, of course, anyone writing a custom MessageBuilder subclass has to deal with alignment.

2b is trickier:  A serialized message starts out with a table of segment sizes.  This table is always a whole number of words, but isn't aligned beyond that -- in fact, the most common table size is 1 word.  The first segment begins immediately after the table.

Now, InputStreamMessageReader actually reads the table separately from the content, so it could still allocate 32-byte aligned segments.  But there is also FlatArrayMessageReader, which takes a user-provided buffer containing the entire serialized message and references it directly.  FlatArrayMessageReader is particularly useful together with mmap().  Unfortunately, in the common case of a 1-word segment table, an mmap'd file will never end up with aligned segments.

One way we could fix this would be to automatically pad out the segment table to 16 or 32 bytes when writing a message which we know contains any content requiring alignment.  The table would claim that the message has more segments than it really does, but the extra segments would all be zero-size.  So, compatibility with the existing protocol can be preserved.


Finally, we come to point (1).  Complexifying the allocation code to support alignment is not a big deal in itself.  The bigger problem is deciding when to align.

I think that it probably makes the most sense to support higher alignment only for Data blobs.  We will not support higher-aligned fields within structs.  Here's why:

- The struct layout code is already too complex.

- If there are any systems where unaligned reads cause SIGBUS (or otherwise crash the app), then we'd need to verify alignment at the time the struct pointer is traversed.  However, we wouldn't want a protocol to lose backwards-compatibility when adding a new higher-aligned field to a struct that didn't have any higher-aligned fields before.  E.g. if I add a Float64x4 field to my struct, then go back and read an old message created before that field existed, the struct may not be aligned there.  That's actually OK, because that old struct won't contain the field anyway, but it will be complex to validate -- we'll need to keep track of the offset of the first aligned field and allow non-alignment if the struct is smaller than that offset.

- Struct lists are prefixed with a one-word tag, which could throw off alignment for the whole list.  We could support allocation with offset alignment -- e.g. "please allocate 17 words of memory aligned 1 word before a 4-word alignment boundary" -- but this is getting pretty weird.

- Cross-platformness.  If Cap'n Proto is going to explicitly support a Float64x4 type, it is going to have to be little-endian IEEE-754.  On any system where that is not acceptable to the vector processor (PPC, maybe?), all of this work on alignment is for naught, and we have to deal with byte-swapping.  On the other hand, if we don't offer an explicit Float64x4 type at all, and just say "you can allocate aligned byte blobs, but it's up to you what's in them", then we wash our hands of this problem -- it's up to the application to decide what format is appropriate for its needs.

- It seems to me that typical use cases for vector processing involve a large list of vectors anyway, so a Float64x4 field type may be a complete waste of time.


So, given all this, I'd proposing just adding two new types:  DataAligned128 and DataAligned256.  These two types work exactly like Data, but are guaranteed to be allocated aligned on a 128-bit/256-bit boundary from the start of the segment.  Moreover, allocating such a buffer anywhere in your message will set a flag which causes the segment table to be padded out so that all segments end up aligned relative to the start of the serialized message.  The system will always allocate 32-byte aligned segments, but if you are doing any allocation yourself (either because you are providing scratch space or because you are using FlatArrayMessageReader) then it's up to you to deal with alignment.  When reading a DataAligned128 or DataAligned256 pointer, the system will throw an exception if the target data is not actually aligned (whether because the sender failed to align it or because the segment is not aligned in local memory).

Thoughts?

-Kenton


On Mon, Aug 26, 2013 at 2:09 PM, Gregory Allen <gal...@arlut.utexas.edu> wrote:

David Renshaw

unread,
Aug 27, 2013, 8:46:04 AM8/27/13
to Kenton Varda, Gregory Allen, capnproto
What about lists of DataAligned128 and DataAligned256? They could be
laid out inline (with field size INLINE_COMPOSITE), rather than as
lists of pointers, right?

Kenton Varda

unread,
Aug 27, 2013, 9:08:07 AM8/27/13
to David Renshaw, Gregory Allen, capnproto
On Tue, Aug 27, 2013 at 5:46 AM, David Renshaw <dwre...@cs.cmu.edu> wrote:
What about lists of DataAligned128 and DataAligned256? They could be
laid out inline (with field size INLINE_COMPOSITE), rather than as
lists of pointers, right?

The idea here is that DataAligned* are equivalent to Data, except aligned.  Data is an arbitrary-length byte blob, so basically it's already a list.

Kenton Varda

unread,
Aug 27, 2013, 9:11:37 AM8/27/13
to Andreas Stenius, capnproto



On Tue, Aug 27, 2013 at 1:18 AM, Andreas Stenius <g...@astekk.se> wrote:
2013/8/27 Kenton Varda <temp...@gmail.com>
[...]
So, in order to support higher alignment, we need to do two things:
1) Make sure objects are aligned relative to the segment start.
2) Make sure segments are aligned.  This has two sub-problems:
  a. Make sure segments of a newly-built message are aligned.
  b. Make sure segments read off the wire are aligned.

I agree :)
 
Finally, we come to point (1).  Complexifying the allocation code to support alignment is not a big deal in itself.  The bigger problem is deciding when to align.

I think that it probably makes the most sense to support higher alignment only for Data blobs.  We will not support higher-aligned fields within structs.  Here's why:

- The struct layout code is already too complex.

Well, it's rather complex, yes. But; I looked at it, and from what I can tell, it wouldn't be too hard to support a generic alignment annotation to a field (allowing it to have a larger lgSize than default) along with support for lgSizes' > 6 this would solve it along with your proposal for making sure that the struct itself is aligned according to the largest lgSize of any field in the struct.
But then again, I guess I'd better write a patch for that to back the claim that it wouldn't be too hard.. :p

Right, I was thinking the same thing, but things always sound simpler in theory than they are in practice, and honestly it already sound hairy to me in theory.  :)
 


- Struct lists are prefixed with a one-word tag, which could throw off alignment for the whole list.  We could support allocation with offset alignment -- e.g. "please allocate 17 words of memory aligned 1 word before a 4-word alignment boundary" -- but this is getting pretty weird.

Oh.. this sure makes it trickier to ensure struct alignment..
 
So, given all this, I'd proposing just adding two new types:  DataAligned128 and DataAligned256.  These two types work exactly like Data, but are guaranteed to be allocated aligned on a 128-bit/256-bit boundary from the start of the segment.  Moreover, allocating such a buffer anywhere in your message will set a flag which causes the segment table to be padded out so that all segments end up aligned relative to the start of the serialized message.  The system will always allocate 32-byte aligned segments, but if you are doing any allocation yourself (either because you are providing scratch space or because you are using FlatArrayMessageReader) then it's up to you to deal with alignment.  When reading a DataAligned128 or DataAligned256 pointer, the system will throw an exception if the target data is not actually aligned (whether because the sender failed to align it or because the segment is not aligned in local memory).

Ah, now I get the suggestion with adding another type of Data; as Data is a pointer type, its alignment is not tied to that of the struct itself.


Thoughts?

But, can't the Data type be augmented with a align(X) annotation to make it more generic?
(Give it a few years, and along comes Mary asking for a DataAligned512 type... ;)
Oh well, maybe better put off the general case, or just add another type whenever the issue arise, if ever..

I think you got a pretty decent proposal on how to solve the alignment for 16 and 32 bytes data.


I suppose it could be AlignedData(x).

Gregory Allen

unread,
Aug 27, 2013, 6:53:52 PM8/27/13
to Kenton Varda, David Renshaw, capnproto
DataAligned* is a reasonable approach (that I'd be grateful to have), but does lose type information. There really shouldn't be any cross-platform concern; SIMD alignment and types are fairly cross-architecture, and you've already decreed that it shall be little endian.


The DataAligned* approach would always need another field to tell what type is in the blob (when it's an array of a regular type).

data @0 :DataAligned128;
typeOfData @1 :MyTypeEnum;


The approach of

data @0 :List(Float32x4);

or

data @0 :List(Float32Aligned128);

could turn into the same data structure in memory as DataAligned128 while communicating the type. Of course, you'd also get an explosion in type combinations:

{Int,UInt,Float}32x{4,8}
{Int,UInt,Float}64x{2,4}
{Int,UInt}16x{8,16}
{Int,UInt}8x{16,32}

The attribute approach avoids the type explosion, while still preserving the type info:

data @0 :List(Float32) $align(16);

However, this approach wouldn't capture complex<float>, so I'd still need a field to tell me that.

I think DataAligned* would be great.

One more thing: Intel announced AVX-512 last month. There are a few years before that gets into silicon (and a few more before there's actual 512-bit wide load/store), but DataAligned512 may want to be a consideration as well.

Kenton Varda

unread,
Aug 28, 2013, 2:23:05 AM8/28/13
to Gregory Allen, David Renshaw, capnproto
On Tue, Aug 27, 2013 at 3:53 PM, Gregory Allen <gal...@arlut.utexas.edu> wrote:
The DataAligned* approach would always need another field to tell what type is in the blob (when it's an array of a regular type).

data @0       :DataAligned128;
typeOfData @1 :MyTypeEnum;

Nah, it just needs a comment:

  data @0 :DataAligned128;
  # A list of Float32x4 vectors.

There's no need to send type information on the wire unless you actually expect different message instances to use different types -- and if you expect that, then static typing wouldn't help you anyway.
 
The attribute approach avoids the type explosion, while still preserving the type info:

data @0 :List(Float32) $align(16);

The problem here is that the List<float> type in C++ does not actually give you a direct pointer to the underlying buffer.  Instead, it gives you a Reader with operator[] (which takes care of byte swapping if you're on a big-endian machine).  So aligning this list doesn't actually help you at all, because at the end of the day the only way you're allowed to access it is one element at a time through operator[].

We could provide a way to bypass the Reader/Builder wrappers and get a direct pointer to the underlying data, but at that point you lose a lot of the benefit of having the thing be "type safe" in the first place.  We can't very well claim that this pointer actually points at a float array, for example, because on a big-endian system it would not be usable as a float array at all.  You end up not much better off than if you just used Data.

Moreover, List(Float32) is supposed to come with the guarantee that you can upgrade it to List(SomeStruct) where SomeStruct's @0 field has type Float32.  Such a change would cause problems for anyone who needs a direct pointer to the underlying data.

So I think the best we can do is AlignedData(x) where x is 64, 128, 256, 512...  I suppose we could support all powers of 2 without too much difficulty.

Andreas Stenius

unread,
Aug 28, 2013, 3:58:31 AM8/28/13
to Kenton Varda, Gregory Allen, David Renshaw, capnproto
2013/8/28 Kenton Varda <temp...@gmail.com>
[...]
So I think the best we can do is AlignedData(x) where x is 64, 128, 256, 512...  I suppose we could support all powers of 2 without too much difficulty.

Out of curiosity, are you talking about AlignedData as a new type or annotation?

Kenton Varda

unread,
Aug 28, 2013, 5:18:57 AM8/28/13
to Andreas Stenius, Gregory Allen, David Renshaw, capnproto
Type.  Annotations should be things that you can ignore if you don't care about them, but all implementations will be required to implement alignment correctly.

Andrew Lutomirski

unread,
Aug 28, 2013, 2:03:31 PM8/28/13
to Kenton Varda, Andreas Stenius, Gregory Allen, David Renshaw, capnproto
Am I correct in assuming that these things can't be upgraded to lists
of structs later on? That is, DataAligned(32) will be a variable
length array of bytes, aligned to a multiple of 32 bytes?

What will the length granularity be? 1 byte or N?

--Andy

Kenton Varda

unread,
Aug 28, 2013, 2:16:55 PM8/28/13
to Andrew Lutomirski, Andreas Stenius, Gregory Allen, David Renshaw, capnproto
On Wed, Aug 28, 2013 at 11:03 AM, Andrew Lutomirski <an...@luto.us> wrote:
Am I correct in assuming that these things can't be upgraded to lists
of structs later on?  That is, DataAligned(32) will be a variable
length array of bytes, aligned to a multiple of 32 bytes?

Correct.  The interface is exactly the same as for Data, meaning you get a direct pointer, so upgrade is not possible.

Note that I was thinking AlignedData(N) is aligned to N bits, not N bytes, just because field sizes are specified in bits.
 
What will the length granularity be?  1 byte or N?

Option 1 is that we require that the blob always be a multiple of N in size.  Option 2 is that we allow it to be any size, and the only difference is the guaranteed alignment of the start byte.  The latter would mean that the existing Data type is effectively AlignedData(64), since currently all blobs are word-aligned, but can have any byte length.

Gregory Allen

unread,
Aug 28, 2013, 3:58:16 PM8/28/13
to Kenton Varda, Andrew Lutomirski, Andreas Stenius, David Renshaw, capnproto
I'd say length granularity of 1 byte.

In SIMD code it's common to have leftover samples that aren't a multiple of a vector. You just have to deal with those leftovers in scalar code. Sure, it's not as fast, but life doesn't always come in multiples of 8 samples. :)

Option 2 allows the user to choose any length granularity they want. Option 1 would limit their choice.

Thanks,
-Greg

On Aug 28, 2013, at 1:16 PM, Kenton Varda <temp...@gmail.com> wrote:
> Option 1 is that we require that the blob always be a multiple of N in size. Option 2 is that we allow it to be any size, and the only difference is the guaranteed alignment of the start byte. The latter would mean that the existing Data type is effectively AlignedData(64), since currently all blobs are word-aligned, but can have any byte length.

Kenton Varda

unread,
Aug 28, 2013, 5:03:09 PM8/28/13
to Gregory Allen, Andrew Lutomirski, Andreas Stenius, David Renshaw, capnproto
Well, if your data isn't actually grouped in units of N, then it doesn't need alignment.  You can always process the first few elements in scalar code until you get to an aligned boundary, then switch over to SIMD.  :)

But yeah, I don't think there's any good reason to enforce larger granularity than bytes.  It's not a security issue because applications are likely to ignore the trailing bytes due to truncating divide anyway.

Gregory Allen

unread,
Aug 28, 2013, 8:09:35 PM8/28/13
to Kenton Varda, Andrew Lutomirski, Andreas Stenius, David Renshaw, capnproto
On Aug 28, 2013, at 4:03 PM, Kenton Varda <temp...@gmail.com> wrote:
> Well, if your data isn't actually grouped in units of N, then it doesn't need alignment. You can always process the first few elements in scalar code until you get to an aligned boundary, then switch over to SIMD. :)

You still want alignment, even without a multiple of N samples.

Consider a function that computes y += x:
vec_accumulate(float* y, const float* x, unsigned n)

bool alignedTheSame = ((x&15) == (y&15)); // for 16-byte / 128-bit SIMD

If x and y are aligned the same, you can do scalar ops for a few preceding and/or trailing samples, and SIMD for the middle (vector aligned) samples. You get an efficient SIMD implementation for arbitrary n.

If x and y are not aligned the same, one of the two loads is always misaligned. That incurs 3 loads (instead of 2) for every vector op.

Thanks,
-Greg

Gregory Allen

unread,
Jun 17, 2014, 5:02:59 PM6/17/14
to Kenton Varda, capnproto
Ware there ever any further steps toward AlignedData(64)? I notice it’s not present in the schema language docs. Is it still something you are interested in?

I’m interested because I’ve got some specific use cases. I’ve been working hard on a shared memory middleware that I call MCSB.
https://bitbucket.org/gallen/mcsb

It’s now publicly released. From the link:

The Multi-Client Shared Buffer, or MCSB, is a C++ middleware library designed to assist in implementing high-throughput, soft real-time DSP systems on POSIX systems. MCSB uses shared memory with a zero-copy interface to provide a scalable many-to-many message-based middleware, while providing formal guarantees about liveness and memory bounds. Message throughput on a shared memory machine is limited only by the memory bandwidth.

capnproto would be an excellent serialization format for me to use in performance-critical DSP applications built on MCSB, but vector alignment is a key issue for SIMD efficiency.

Thoughts?

Thanks,
-Greg

On Aug 28, 2013, at 1:16 PM, Kenton Varda <temp...@gmail.com> wrote:
> Option 1 is that we require that the blob always be a multiple of N in size. Option 2 is that we allow it to be any size, and the only difference is the guaranteed alignment of the start byte. The latter would mean that the existing Data type is effectively AlignedData(64), since currently all blobs are word-aligned, but can have any byte length.

Gregory E. Allen, PhD, Sr Engineering Scientist

Andrew Lutomirski

unread,
Jun 17, 2014, 5:22:09 PM6/17/14
to Gregory Allen, Kenton Varda, capnproto
On Tue, Jun 17, 2014 at 2:02 PM, Gregory Allen <gal...@arlut.utexas.edu> wrote:
> Ware there ever any further steps toward AlignedData(64)? I notice it’s not present in the schema language docs. Is it still something you are interested in?

It will stop being aligned if canonicalized, unless it gets its own
struct tag type. Maybe this isn't a big deal.

--Andy

Kenton Varda

unread,
Jun 17, 2014, 10:39:51 PM6/17/14
to Gregory Allen, Kenton Varda, capnproto
Hi Gregory,

Sorry, there's been no work on this. I personally can't really afford to spend time on anything that isn't on the critical path for Sandstorm.io, but I'd be happy to accept pull requests. :)

-Kenton
Reply all
Reply to author
Forward
0 new messages