RGB data

Khouri Giordano

Nov 5, 2013, 11:48:02 AM

to nt2...@googlegroups.com

I am working with Boost.SIMD 3.0b2, not the overall NT2.

We have interleaved RGB data as 8 and 16-bit integer values. I would like to deinterleave a set of pixels and convert them to floats for filtering. Then I want to saturate and convert the floats back to 8/16-bit integer and interleave them. For image processing, this is almost the default way of operating.

What is the best way to handle the widening/narrowing of a pack?

I think I can find what I need in Boost.SIMD to do every part of that process, except the 3-channel deinterleave and interleave. I already know the SSE/AVX code to implement these functions and I can make function overloads with SIMD extension tag dispatch. However, if I wanted to integrate these more tightly with Boost.SIMD, what is the best way to extend Boost.SIMD to add functionality like this?

Joel Falcou

Nov 5, 2013, 3:15:02 PM

to nt2...@googlegroups.com

My first idea would be to give your RGB data structure a proper
hierarchy type so that you can recognize it in a dispatch context. You
can then put your de/interleaving code in the load/store overloads for
this hierarchy. Making the structure a Fusion sequence may also help to
get a proper automatic AoS/SoA conversion.

This is what is done in the nt2 support for std::complex, even if it's
not complete. I can draft a small PoC during the week if you're interested.

Mathias Gaunard

Nov 6, 2013, 4:14:20 PM

to nt2...@googlegroups.com

On 05/11/13 17:48, Khouri Giordano wrote:
> I am working with Boost.SIMD 3.0b2, not the overall NT2.
>
> We have interleaved RGB data as 8 and 16-bit integer values. I would
> like to deinterleave a set of pixels and convert them to floats for
> filtering. Then I want to saturate and convert the floats back to
> 8/16-bit integer and interleave them. For image processing, this is
> almost the default way of operating.
>
> What is the best way to handle the widening/narrowing of a pack?

Widening/narrowing is done with split/group. You can also use groups to
narrow and saturate at the same time.
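
Independently of which Boost.SIMD functions end up doing the work, it can help to pin down what each lane should hold after a saturating float-to-integer narrow. The following is a scalar sketch only, not the Boost.SIMD API: the name saturate_cast and the choice of round-to-nearest are assumptions made for illustration, and a vectorized path may round differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical scalar reference for one lane of a saturating narrow:
// clamp to the range of the target integer type, then round to nearest.
// Intended for 8-bit and 16-bit targets, as in the RGB use case.
template <typename Int>
Int saturate_cast(float v)
{
    const float lo = static_cast<float>(std::numeric_limits<Int>::min());
    const float hi = static_cast<float>(std::numeric_limits<Int>::max());
    if (!(v > lo)) return std::numeric_limits<Int>::min();  // also catches NaN
    if (v >= hi)   return std::numeric_limits<Int>::max();
    return static_cast<Int>(std::lround(v));
}
```

With this reference, saturate_cast<std::uint8_t>(300.0f) gives 255 and saturate_cast<std::uint8_t>(-3.0f) gives 0, which is the behaviour a filtered pixel value needs before being stored back.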

> I think I can find what I need in Boost.SIMD to do every part of that
> process, except the 3-channel deinterleave and interleave.

Boost.SIMD supports arbitrary shuffling, but I'm not entirely sure what
the optimal way to load and shuffle such data is.

Boost.SIMD can however deinterleave, load and widen/narrow at the same
time, but that process isn't particularly optimized at the moment.

It works like this:

typedef tuple<uint8_t, uint8_t, uint8_t> pixel;
typedef tuple< native<float, X>, native<float, X>, native<float, X> >
simd_pixel;

pixel* data;
simd_pixel rgb = load<simd_pixel>(data);

but unless you're using MIC or BG/Q, we just do all that deinterleaving
and converting in scalar mode.

> I already
> know the SSE/AVX code to implement these functions and I can make
> function overloads with SIMD extension tag dispatch. However, If I
> wanted to integrate these more tightly with Boost.SIMD, what is the best
> way to extend Boost.SIMD to add functionality like this?

If you've found a good syntax to do stuff like this I'm interested to
see it and integrate it.

Khouri Giordano

Nov 6, 2013, 9:57:36 PM

to nt2...@googlegroups.com

On Wednesday, 6 November 2013 16:14:20 UTC-5, Mathias Gaunard wrote:

> If you've found a good syntax to do stuff like this I'm interested to
> see it and integrate it.

I'm not sure you could apply the method in a generic way, especially with things like AVX's 128-bit lanes. The code I have written is for the specific cases that we need. I got the general idea from an Intel guy who said reading data like this is just as fast as reading data that has already been deinterleaved. He said it's all over the internet, but I've looked and I never found anything like it.

To read and deinterleave:

For 32-bit values, the instructions are all there, so for my 8 and 16-bit values, I convert to 32-bit first.

For AVX, you have to first swap 128-bit halves so pixels 0-3 are completely in the low half and pixels 4-7 are completely in the high half.

Given four RGB pixels in three native registers:

i0: r0 g0 b0 r1
i1: g1 b1 r2 g2
i2: b2 r3 g3 b3

First collect values of the same channel with or(or(and(i0,m0),and(i1,m1)),and(i2,m2)) (where m0, m1 and m2 are lane masks) or with blend(blend(i0,i1,c0),i2,c1) (where c0 and c1 are blend controls):

j0: r0 r3 r2 r1
j1: g1 g0 g3 g2
j2: b2 b1 b0 b3

Then shuffle the values within each to get the result:

k0: r0 r1 r2 r3
k1: g0 g1 g2 g3
k2: b0 b1 b2 b3

To interleave and write:

The operation is the reverse of the above starting with the three pack<int,4> or pack<float,4>:

k0: r0 r1 r2 r3
k1: g0 g1 g2 g3
k2: b0 b1 b2 b3

Shuffle the values to put them into their proper positions:

j0: r0 r3 r2 r1
j1: g1 g0 g3 g2
j2: b2 b1 b0 b3

Disperse the values with or(or(and(j0,m0),and(j1,m1)),and(j2,m2)) or with blend(blend(j0,j1,c0),j2,c1):

i0: r0 g0 b0 r1
i1: g1 b1 r2 g2
i2: b2 r3 g3 b3
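
The two directions described above can be checked with a portable model before committing to intrinsics. This is a sketch, not SSE/AVX or Boost.SIMD code: reg4, select3, shuffle, triple4, deinterleave3 and interleave3 are hypothetical names, a std::array of four ints stands in for one register, and select3 stands in for either the mask form or the two nested blends.

```cpp
#include <array>
#include <cassert>

typedef std::array<int, 4> reg4;  // stand-in for one register of four 32-bit lanes

// Lane n of the result comes from a, b or c when sel[n] is 0, 1 or 2.
// Models blend(blend(a,b,c0),c,c1) or the and/or form with three masks.
inline reg4 select3(const reg4& a, const reg4& b, const reg4& c, const reg4& sel)
{
    reg4 out;
    for (int n = 0; n < 4; ++n)
        out[n] = sel[n] == 0 ? a[n] : (sel[n] == 1 ? b[n] : c[n]);
    return out;
}

// In-register permutation: out[n] = v[idx[n]].
inline reg4 shuffle(const reg4& v, const reg4& idx)
{
    reg4 out;
    for (int n = 0; n < 4; ++n)
        out[n] = v[idx[n]];
    return out;
}

struct triple4 { reg4 a, b, c; };  // three registers: i0..i2 or k0..k2

// Per-channel lane permutations; each one is its own inverse.
static const reg4 perm_r = {{0, 3, 2, 1}};
static const reg4 perm_g = {{1, 0, 3, 2}};
static const reg4 perm_b = {{2, 1, 0, 3}};

// i0..i2 (interleaved) -> k0..k2 (one channel per register).
inline triple4 deinterleave3(const triple4& i)
{
    const reg4 j0 = select3(i.a, i.b, i.c, reg4{{0, 2, 1, 0}});  // r0 r3 r2 r1
    const reg4 j1 = select3(i.a, i.b, i.c, reg4{{1, 0, 2, 1}});  // g1 g0 g3 g2
    const reg4 j2 = select3(i.a, i.b, i.c, reg4{{2, 1, 0, 2}});  // b2 b1 b0 b3
    triple4 k = { shuffle(j0, perm_r), shuffle(j1, perm_g), shuffle(j2, perm_b) };
    return k;
}

// k0..k2 (one channel per register) -> i0..i2 (interleaved).
inline triple4 interleave3(const triple4& k)
{
    const reg4 j0 = shuffle(k.a, perm_r);  // r0 r3 r2 r1
    const reg4 j1 = shuffle(k.b, perm_g);  // g1 g0 g3 g2
    const reg4 j2 = shuffle(k.c, perm_b);  // b2 b1 b0 b3
    triple4 i = { select3(j0, j1, j2, reg4{{0, 1, 2, 0}}),   // r0 g0 b0 r1
                  select3(j0, j1, j2, reg4{{1, 2, 0, 1}}),   // g1 b1 r2 g2
                  select3(j0, j1, j2, reg4{{2, 0, 1, 2}}) }; // b2 r3 g3 b3
    return i;
}
```

Each select index table in the model corresponds to one set of masks or blend controls, and each permutation to one in-register shuffle, so the tables can be carried over when writing the SSE version.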

Given the tuple overloads of (aligned_)load/store that you outlined, I can implement the ones that I need for my own use.

For the general case, I imagine something like the above, but without the element width change, that takes a vector M x pack<T,N> and returns N x pack<T,M>.

Khouri Giordano

Nov 8, 2013, 10:36:13 AM

to nt2...@googlegroups.com

On Wednesday, 6 November 2013 16:14:20 UTC-5, Mathias Gaunard wrote:

> typedef tuple<uint8_t, uint8_t, uint8_t> pixel;
> typedef tuple< native<float, X>, native<float, X>, native<float, X> >
> simd_pixel;
>
> pixel* data;
> simd_pixel rgb = load<simd_pixel>(data);
>
> but unless you're using MIC or BG/Q, we just do all that deinterleaving
> and converting in scalar mode.

I tried this, and it actually splatted the single pixel value that data points to across all elements of simd_pixel.

Khouri Giordano

Nov 8, 2013, 10:55:25 AM

to nt2...@googlegroups.com

On Wednesday, 6 November 2013 21:57:36 UTC-5, Khouri Giordano wrote:

> For the general case, I imagine something like the above, but without the element width change, that takes a vector M x pack<T,N> and returns N x pack<T,M>.

I tried working through a transpose of four N-channel tuples for N in the range [2,6]. While I was able to approach each case with the same general method I described previously, I don't think it can be done both efficiently and generally. On x86, this is also complicated by the lack of instructions for every element size.

I still think the general usefulness of (de)interleave3 is worth devoting dispatch tags and implementing a few cases.
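
Even if only a few cases get SIMD specializations, a scalar reference for any channel count could serve as the fallback and as an oracle for testing the specialized versions. The sketch below is an assumption about how that might look, not existing Boost.SIMD code; deinterleave_ref and interleave_ref are hypothetical names, and nested std::array stands in for a tuple of packs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar reference: split P interleaved pixels of C channels each into
// C planes of P values, so that out[c][p] == in[p * C + c].
template <std::size_t C, std::size_t P, typename T>
std::array<std::array<T, P>, C> deinterleave_ref(const T* in)
{
    std::array<std::array<T, P>, C> out;
    for (std::size_t p = 0; p < P; ++p)
        for (std::size_t c = 0; c < C; ++c)
            out[c][p] = in[p * C + c];
    return out;
}

// Inverse operation: write C planes of P values back as P interleaved pixels.
template <std::size_t C, std::size_t P, typename T>
void interleave_ref(const std::array<std::array<T, P>, C>& in, T* out)
{
    for (std::size_t p = 0; p < P; ++p)
        for (std::size_t c = 0; c < C; ++c)
            out[p * C + c] = in[c][p];
}
```

For RGB with four pixels per register, deinterleave_ref<3, 4>(ptr) yields three planes of four values each, which is the same result the blend-and-shuffle method produces.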

Khouri Giordano

Nov 20, 2013, 3:37:49 PM

to nt2...@googlegroups.com

On Wednesday, 6 November 2013 16:14:20 UTC-5, Mathias Gaunard wrote:

> Boost.SIMD supports arbitrary shuffling, but I'm not entirely sure
> what's the optimal way to load and shuffle such data.
>
> Boost.SIMD can however deinterleave, load and widen/narrow at the same
> time, but that process isn't particularly optimized at the moment.
>
> It works like this:
>
> typedef tuple<uint8_t, uint8_t, uint8_t> pixel;
> typedef tuple< native<float, X>, native<float, X>, native<float, X> >
> simd_pixel;
>
> pixel* data;
> simd_pixel rgb = load<simd_pixel>(data);

I'm sooo close to getting this worked out in the Boost.SIMD way. The only concept I'm having trouble with is creating the hierarchy for my declared types so I can catch them. These are the types that I'm working with and some of the code I would like to write:

    template< typename T >
    struct triple
    {
        typedef T scalar_type;
        typedef boost::simd::pack< scalar_type > simd_type;
        typedef boost::fusion::tuple< scalar_type, scalar_type, scalar_type > tuple_scalar_type;
        typedef boost::fusion::tuple< simd_type, simd_type, simd_type > tuple_simd_type;
    };

    typedef detail::triple< uint16_t >::tuple_scalar_type u16x3_t;
    typedef detail::triple< uint16_t >::tuple_simd_type u16x3_v;

    ...

    const u16x3_t *pixel_data = ...;
    const u16x3_v *simd_pixel_data = reinterpret_cast< const u16x3_v * >( pixel_data );
    pack< uint16_t > index = enumerate< uint16_t >();

    u16x3_v simd_pixel_0 = transpose_elements( load( simd_pixel_data ) );
    u16x3_v simd_pixel_1 = load< u16x3_v >( pixel_data ); // built in transpose_elements
    u16x3_v simd_pixel_2 = load< u16x3_v >( pixel_data, index, index < 5 ); // masked gather with transpose_elements

    store( transpose_elements( simd_pixel_0 ), simd_pixel_data );
    store( simd_pixel_1, pixel_data ); // built in transpose_elements
    store( simd_pixel_2, pixel_data, index, index < 5 ); // masked scatter with transpose_elements

How can I catch a fusion_sequence_< simd_< undefined_< A0 > > > and ensure that the sequence types are all the same? I know how to do this with MPL for a fixed sequence length, but maybe you have a better generic way.

Transposing elements and an indexed load (gather) or store (scatter) doesn't make sense unless the load/store knows to do the transposing. Again, the implementation isn't so much a problem as catching the special cases where I can write SSE/AVX overloads.

I'm sure a generic implementation of transpose_elements isn't too hard to write and then add a few SSE/AVX specializations for N x [2,4,8,16,32] elements.

Mathias Gaunard

Nov 20, 2013, 7:54:53 PM

to nt2...@googlegroups.com

On 20/11/2013 21:37, Khouri Giordano wrote:

>
> I'm sooo close to getting this worked out in the Boost.SIMD way.

Thank you for working on this!
Unfortunately I don't have a lot of time to look into it at the moment,
but this is definitely interesting.

> How can I catch a fusion_sequence_< simd_< undefined_< A0 > > > and
> ensure that the sequence types are all the same?

With fusion_sequence_ all you can do is fusion_sequence_<A0> + a SFINAE
condition (the BOOST_SIMD_FUNCTOR_IMPLEMENTATION_IF variants) that
checks that all elements of A0 are indeed SIMD and are the same.

I think we have an array_< simd_< undefined_<A0> > >, but only
boost::array goes through it at the moment. boost::array is also
considered a fusion sequence by the way (but array is a better match).

I suppose it would be possible to test whether a fusion sequence is
array-like when hierarchizing it. That is not necessarily much worse
than the SFINAE test.
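
The "all elements are the same type" check itself is small. The sketch below shows one way to write it and use it as a SFINAE condition; it is an illustration only, using std::tuple as a stand-in for a Fusion sequence and plain std::enable_if instead of the BOOST_SIMD_FUNCTOR_IMPLEMENTATION_IF macros. The names is_homogeneous and takes_fast_path are hypothetical.

```cpp
#include <tuple>
#include <type_traits>

// True when every element type of the tuple is the same type.
template <typename Seq>
struct is_homogeneous : std::false_type {};

template <>
struct is_homogeneous<std::tuple<> > : std::true_type {};

template <typename T>
struct is_homogeneous<std::tuple<T> > : std::true_type {};

template <typename T, typename U, typename... Rest>
struct is_homogeneous<std::tuple<T, U, Rest...> >
    : std::integral_constant<bool,
          std::is_same<T, U>::value &&
          is_homogeneous<std::tuple<U, Rest...> >::value> {};

// SFINAE-gated overloads: the first is selected only for homogeneous
// sequences, where a specialized transpose would be legal.
template <typename Seq>
typename std::enable_if<is_homogeneous<Seq>::value, bool>::type
takes_fast_path(const Seq&) { return true; }

template <typename Seq>
typename std::enable_if<!is_homogeneous<Seq>::value, bool>::type
takes_fast_path(const Seq&) { return false; }
```

A real condition would additionally check that the common element type is a SIMD pack, which this sketch leaves out.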

> I'm sure a generic implementation of transpose_elements isn't too hard
> to write and then add a few SSE/AVX specializations for N x
> [2,4,8,16,32] elements.

You could maybe write it in terms of shuffle, but it may be more
efficient if written explicitly.

Joel Falcou

Nov 21, 2013, 1:08:14 AM

to nt2...@googlegroups.com

On 21/11/2013 01:54, Mathias Gaunard wrote:
>
>> How can I catch a fusion_sequence_< simd_< undefined_< A0 > > > and
>> ensure that the sequence types are all the same?
>
> With fusion_sequence_ all you can do is fusion_sequence_<A0> + a
> SFINAE condition (the BOOST_SIMD_FUNCTOR_IMPLEMENTATION_IF variants)
> that checks that all elements of A0 are indeed SIMD and are the same.
>
> I think we have an array_< simd_< undefined_<A0> > >, but only
> boost::array goes through it at the moment. boost::array is also
> considered a fusion sequence by the way (but array is a better match).
>
> I suppose It could be a possibility to test if a fusion sequence is
> array-like when hierarchizing it. Not necessarily much worse than the
> SFINAE test.
>

I think we could have a homogeneous_sequence_< H, Size > somewhere in
between array and fusion sequence that is just this. I can take a shot
at it today and see if it helps.

Mathias Gaunard

Nov 26, 2013, 7:43:17 AM

to nt2...@googlegroups.com

This was a bug; it has been fixed in commit
9e2bb95935c7a6b087697ce6d48d3fd683a78684 and will be in the next release.

Mathias Gaunard

Nov 26, 2013, 7:45:19 AM

to nt2...@googlegroups.com

To avoid adding overhead, I think it's better to support efficient
handling of homogeneous fusion sequences only if boost::/std::array
is used.

Khouri, is your code available somewhere on github? I'd like to see how
we can integrate it.
