I'm not sure the method can be applied generically, especially given things like AVX's 128-bit lanes. The code I have written covers only the specific cases that we need. I got the general idea from someone at Intel, who said that reading data this way is just as fast as reading data that has already been deinterleaved. He said it's all over the internet, but I've looked and never found anything like it.
To read and deinterleave:
For 32-bit values, the instructions are all there, so I convert my 8- and 16-bit values to 32-bit first.
For AVX, you first have to swap 128-bit halves between the registers so that pixels 0-3 sit entirely in the low halves and pixels 4-7 entirely in the high halves.
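To make the half swap concrete, here is a scalar model of it. It is only a sketch: Reg8, combine_halves and split_pixel_groups are names I made up for illustration, plain arrays stand in for 256-bit registers, and the particular pairing of halves is my reading of the layout (eight RGB pixels stored contiguously across three registers), not something taken from real code.

```cpp
#include <array>
#include <cstddef>

// Scalar stand-in for one 256-bit register holding eight 32-bit values.
using Reg8 = std::array<int, 8>;

// Hypothetical helper: the low half of the result is half `ha` of `a`,
// the high half is half `hb` of `b` (0 selects the low half, 1 the high half).
inline Reg8 combine_halves(const Reg8& a, int ha, const Reg8& b, int hb) {
    Reg8 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i]     = a[static_cast<std::size_t>(ha) * 4 + i];
        out[4 + i] = b[static_cast<std::size_t>(hb) * 4 + i];
    }
    return out;
}

// Input:  a0 = r0 g0 b0 r1 | g1 b1 r2 g2
//         a1 = b2 r3 g3 b3 | r4 g4 b4 r5
//         a2 = g5 b5 r6 g6 | b6 r7 g7 b7
// Output: low halves hold pixels 0-3, high halves hold pixels 4-7.
inline std::array<Reg8, 3> split_pixel_groups(const Reg8& a0, const Reg8& a1,
                                              const Reg8& a2) {
    return {{
        combine_halves(a0, 0, a1, 1),  // r0 g0 b0 r1 | r4 g4 b4 r5
        combine_halves(a0, 1, a2, 0),  // g1 b1 r2 g2 | g5 b5 r6 g6
        combine_halves(a1, 0, a2, 1),  // b2 r3 g3 b3 | b6 r7 g7 b7
    }};
}
```

After this step each 128-bit half has the same layout as the three-register, four-pixel case described next.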
Given four RGB pixels in three native registers:
i0: r0 g0 b0 r1
i1: g1 b1 r2 g2
i2: b2 r3 g3 b3
First collect values of the same channel with or(or(and(i0,m0),and(i1,m1)),and(i2,m2)), where m0, m1 and m2 are masks, or with blend(blend(i0,i1,c0),i2,c1), where c0 and c1 are blend controls:
j0: r0 r3 r2 r1
j1: g1 g0 g3 g2
j2: b2 b1 b0 b3
Then shuffle the values within each to get the result:
k0: r0 r1 r2 r3
k1: g0 g1 g2 g3
k2: b0 b1 b2 b3
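The two steps above can be sketched with a scalar model. This is an illustration under assumptions, not the code I actually use: Pack4, blend, shuffle and deinterleave_rgb are hypothetical names, an array stands in for a 4-lane register, and the blend controls are written as bit masks (bit n set means lane n comes from the second operand). On SSE the same roles would be played by instructions such as _mm_blend_ps and _mm_shuffle_epi32.

```cpp
#include <array>
#include <cstddef>

using Pack4 = std::array<int, 4>;  // scalar stand-in for a 4-lane register

// Lane-wise select: lane n comes from b if bit n of mask is set, else from a.
inline Pack4 blend(const Pack4& a, const Pack4& b, unsigned mask) {
    Pack4 out{};
    for (std::size_t n = 0; n < 4; ++n)
        out[n] = ((mask >> n) & 1u) ? b[n] : a[n];
    return out;
}

// Lane permutation: out[n] = v[idx[n]].
inline Pack4 shuffle(const Pack4& v, const std::array<std::size_t, 4>& idx) {
    Pack4 out{};
    for (std::size_t n = 0; n < 4; ++n)
        out[n] = v[idx[n]];
    return out;
}

// Input:  i0 = r0 g0 b0 r1, i1 = g1 b1 r2 g2, i2 = b2 r3 g3 b3
// Output: {r0 r1 r2 r3, g0 g1 g2 g3, b0 b1 b2 b3}
inline std::array<Pack4, 3> deinterleave_rgb(const Pack4& i0, const Pack4& i1,
                                             const Pack4& i2) {
    // Step 1: collect the lanes of each channel into one register.
    Pack4 j0 = blend(blend(i0, i1, 0x4u), i2, 0x2u);  // r0 r3 r2 r1
    Pack4 j1 = blend(blend(i0, i1, 0x9u), i2, 0x4u);  // g1 g0 g3 g2
    Pack4 j2 = blend(blend(i0, i1, 0x2u), i2, 0x9u);  // b2 b1 b0 b3

    // Step 2: put the lanes of each register into pixel order.
    return {{
        shuffle(j0, {{0, 3, 2, 1}}),
        shuffle(j1, {{1, 0, 3, 2}}),
        shuffle(j2, {{2, 1, 0, 3}}),
    }};
}
```

Each output register costs two blends and one shuffle, and none of the operations crosses a register's own lane positions until the final shuffle.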
To interleave and write:
The operation is the reverse of the above, starting with three pack<int,4> or pack<float,4>:
k0: r0 r1 r2 r3
k1: g0 g1 g2 g3
k2: b0 b1 b2 b3
Shuffle the values to put them into their proper positions:
j0: r0 r3 r2 r1
j1: g1 g0 g3 g2
j2: b2 b1 b0 b3
Disperse the values with or(or(and(j0,m0),and(j1,m1)),and(j2,m2)) or with blend(blend(j0,j1,c0),j2,c1):
i0: r0 g0 b0 r1
i1: g1 b1 r2 g2
i2: b2 r3 g3 b3
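The interleave direction can be modelled the same way, this time using the and/or mask variant. Again this is a sketch under assumptions: Pack4, lane_and, lane_or, shuffle, select3 and interleave_rgb are made-up names, an array of 32-bit integers stands in for a register, and each mask lane is either all ones or all zeros.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Pack4 = std::array<std::uint32_t, 4>;  // scalar stand-in for a 4-lane register

constexpr std::uint32_t ON = 0xFFFFFFFFu;  // mask lane that keeps the value
constexpr std::uint32_t OFF = 0u;          // mask lane that clears the value

inline Pack4 lane_and(const Pack4& a, const Pack4& b) {
    Pack4 out{};
    for (std::size_t n = 0; n < 4; ++n) out[n] = a[n] & b[n];
    return out;
}

inline Pack4 lane_or(const Pack4& a, const Pack4& b) {
    Pack4 out{};
    for (std::size_t n = 0; n < 4; ++n) out[n] = a[n] | b[n];
    return out;
}

// Lane permutation: out[n] = v[idx[n]].
inline Pack4 shuffle(const Pack4& v, const std::array<std::size_t, 4>& idx) {
    Pack4 out{};
    for (std::size_t n = 0; n < 4; ++n) out[n] = v[idx[n]];
    return out;
}

// or(or(and(a,ma),and(b,mb)),and(c,mc)) with disjoint masks.
inline Pack4 select3(const Pack4& a, const Pack4& ma, const Pack4& b,
                     const Pack4& mb, const Pack4& c, const Pack4& mc) {
    return lane_or(lane_or(lane_and(a, ma), lane_and(b, mb)), lane_and(c, mc));
}

// Input:  k0 = r0 r1 r2 r3, k1 = g0 g1 g2 g3, k2 = b0 b1 b2 b3
// Output: {r0 g0 b0 r1, g1 b1 r2 g2, b2 r3 g3 b3}
inline std::array<Pack4, 3> interleave_rgb(const Pack4& k0, const Pack4& k1,
                                           const Pack4& k2) {
    // Step 1: move every value to the lane it will occupy in the output.
    Pack4 j0 = shuffle(k0, {{0, 3, 2, 1}});  // r0 r3 r2 r1
    Pack4 j1 = shuffle(k1, {{1, 0, 3, 2}});  // g1 g0 g3 g2
    Pack4 j2 = shuffle(k2, {{2, 1, 0, 3}});  // b2 b1 b0 b3

    const Pack4 outer = {ON, OFF, OFF, ON};   // lanes 0 and 3
    const Pack4 lane1 = {OFF, ON, OFF, OFF};  // lane 1 only
    const Pack4 lane2 = {OFF, OFF, ON, OFF};  // lane 2 only

    // Step 2: disperse the values into the three output registers.
    return {{
        select3(j0, outer, j1, lane1, j2, lane2),  // r0 g0 b0 r1
        select3(j1, outer, j2, lane1, j0, lane2),  // g1 b1 r2 g2
        select3(j2, outer, j0, lane1, j1, lane2),  // b2 r3 g3 b3
    }};
}
```

The shuffle patterns are the same as in the read direction because each of the three permutations is its own inverse.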
Given the tuple overloads of (aligned_)load/store that you outlined, I can implement the ones that I need for my own use.
For the general case, I imagine something like the above, but without the element-width change, that takes M x pack<T,N> and returns N x pack<T,M>.
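Read literally, that signature is a transpose. Here is a scalar reference of what I mean, which a SIMD version could be checked against. It is only a sketch: transpose_packs is a hypothetical name, and std::array<T,N> stands in for pack<T,N>.

```cpp
#include <array>
#include <cstddef>

// Scalar reference: M arrays of N elements in, N arrays of M elements out,
// with out[n][m] == in[m][n]. std::array<T, N> stands in for pack<T, N>.
template <typename T, std::size_t M, std::size_t N>
std::array<std::array<T, M>, N> transpose_packs(
    const std::array<std::array<T, N>, M>& in) {
    std::array<std::array<T, M>, N> out{};
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n)
            out[n][m] = in[m][n];
    return out;
}
```

For example, four RGB pixels passed as 4 x array<int,3> come back as 3 x array<int,4>, one array per channel.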