yuy2 to rgb

Benjamin David Lunt

Feb 5, 2018, 6:10:59 PM

Hi guys,

Here is something for you, if you are so inclined.

I have been working on my USB Camera code and have a routine
to convert the stream of data from yuy2 to RGB. Here is my
(generic) C routine:

void yuy2_to_rgb565(void *targ, void *src, int cnt) {
  bit8u *s = (bit8u *) src;
  bit16u *t = (bit16u *) targ;

  // cnt is the number of source bytes; each 4-byte group
  // (Y0 U Y1 V) produces two RGB565 pixels
  while (cnt > 0) {
    int y0 = *s++;
    int u0 = *s++;
    int y1 = *s++;
    int v0 = *s++;
    cnt -= 4;

    int c = y0 - 16;  // luma
    int d = u0 - 128; // cb (U)
    int e = v0 - 128; // cr (V)

    *t++ =
      ((298 * c + 409 * e + 128) & 0xF800) | // R
      (((298 * c - 100 * d - 208 * e + 128) & 0xFC00) >> 5) | // G
      (((298 * c + 516 * d + 128) & 0xF800) >> 11); // B

    c = y1 - 16;

    *t++ =
      ((298 * c + 409 * e + 128) & 0xF800) | // R
      (((298 * c - 100 * d - 208 * e + 128) & 0xFC00) >> 5) | // G
      (((298 * c + 516 * d + 128) & 0xF800) >> 11); // B
  }
}

(I tried to make it narrow enough not to wrap in the post, but
please watch for wrap)
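
The routine above masks each channel without clamping, so a value outside 0..255 wraps around instead of saturating. For example, Y = 255 with neutral chroma gives 298 * 239 + 128 = 71350, which no longer fits in 16 bits. Below is a minimal sketch of a clamped per-pixel conversion, assuming the same coefficients. The function names are made up for illustration, and the clamp is an addition rather than part of the posted routine.

```c
#include <stdint.h>

/* Scale a fixed-point channel value (8 fractional bits) down to 0..255.
   Clamping negatives first avoids right-shifting a negative int. */
static int scale_and_clamp(int v) {
    if (v < 0) return 0;
    v >>= 8;
    return v > 255 ? 255 : v;
}

/* Convert one (Y, Cb, Cr) sample to RGB565. The coefficients are the ones
   from the routine above; the clamp is an addition, not the author's code. */
uint16_t ycbcr_to_rgb565_clamped(int y, int cb, int cr) {
    int c = y - 16;
    int d = cb - 128;
    int e = cr - 128;

    int r = scale_and_clamp(298 * c + 409 * e + 128);
    int g = scale_and_clamp(298 * c - 100 * d - 208 * e + 128);
    int b = scale_and_clamp(298 * c + 516 * d + 128);

    /* Keep the top 5, 6 and 5 bits of R, G and B respectively. */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

With this version, ycbcr_to_rgb565_clamped(255, 128, 128) saturates to white (0xFFFF) instead of wrapping.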

The compiler generates fairly quick assembly code from it:

0084DB90 51 push ecx
0084DB91 8B442408 mov eax,[esp+0x8]
0084DB95 53 push ebx
0084DB96 56 push esi
0084DB97 8B742414 mov esi,[esp+0x14]
...
0084DB9D 89442418 mov [esp+0x18],eax
...
0084DBA6 83C404 add esp,byte +0x4
0084DBA9 87C9 xchg ecx,ecx
0084DBAB 8B442418 mov eax,[esp+0x18]
0084DBAF 85C0 test eax,eax
0084DBB1 0F8E04010000 jng dword 0x84dcbb
0084DBB7 48 dec eax
0084DBB8 C1E802 shr eax,byte 0x2
0084DBBB 40 inc eax
0084DBBC 55 push ebp
0084DBBD 8944241C mov [esp+0x1c],eax
0084DBC1 57 push edi
0084DBC2 0FB606 movzx eax,byte [esi]
0084DBC5 0FB64E01 movzx ecx,byte [esi+0x1]
0084DBC9 46 inc esi
0084DBCA 0FB65601 movzx edx,byte [esi+0x1]
0084DBCE 46 inc esi
0084DBCF 83C0F0 add eax,byte -0x10
0084DBD2 8D5980 lea ebx,[ecx-0x80]
0084DBD5 8BC8 mov ecx,eax
0084DBD7 69C02A010000 imul eax,eax,dword 0x12a
0084DBDD 89542410 mov [esp+0x10],edx
0084DBE1 69C92A010000 imul ecx,ecx,dword 0x12a
0084DBE7 0FB65601 movzx edx,byte [esi+0x1]
0084DBEB 46 inc esi
0084DBEC 8D6A80 lea ebp,[edx-0x80]
0084DBEF 8BD5 mov edx,ebp
0084DBF1 69ED99010000 imul ebp,ebp,dword 0x199
0084DBF7 896C2418 mov [esp+0x18],ebp
0084DBFB 69D2D0000000 imul edx,edx,dword 0xd0
0084DC01 8BFB mov edi,ebx
0084DC03 69DB04020000 imul ebx,ebx,dword 0x204
0084DC09 6BFF64 imul edi,edi,byte +0x64
0084DC0C 8BE9 mov ebp,ecx
0084DC0E 2BEF sub ebp,edi
0084DC10 2BEA sub ebp,edx
0084DC12 81C580000000 add ebp,0x80
0084DC18 C1FD05 sar ebp,byte 0x5
0084DC1B 8D8C0B80000000 lea ecx,[ebx+ecx+0x80]
0084DC22 81E5E0070000 and ebp,0x7e0
0084DC28 C1F90B sar ecx,byte 0xb
0084DC2B 83E11F and ecx,byte +0x1f
0084DC2E 0BE9 or ebp,ecx
0084DC30 8B4C2418 mov ecx,[esp+0x18]
0084DC34 8D840880000000 lea eax,[eax+ecx+0x80]
0084DC3B 2500F80000 and eax,0xf800
0084DC40 0BE8 or ebp,eax
0084DC42 8B44241C mov eax,[esp+0x1c]
0084DC46 668928 mov [eax],bp
0084DC49 83C002 add eax,byte +0x2
0084DC4C 8944241C mov [esp+0x1c],eax
0084DC50 8B442410 mov eax,[esp+0x10]
0084DC54 83C0F0 add eax,byte -0x10
0084DC57 8BC8 mov ecx,eax
0084DC59 69C02A010000 imul eax,eax,dword 0x12a
0084DC5F 69C92A010000 imul ecx,ecx,dword 0x12a
0084DC65 8BE9 mov ebp,ecx
0084DC67 2BEF sub ebp,edi
0084DC69 2BEA sub ebp,edx
0084DC6B 8B542418 mov edx,[esp+0x18]
0084DC6F 81C580000000 add ebp,0x80
0084DC75 C1FD05 sar ebp,byte 0x5
0084DC78 8D8C0B80000000 lea ecx,[ebx+ecx+0x80]
0084DC7F 8D841080000000 lea eax,[eax+edx+0x80]
0084DC86 C1F90B sar ecx,byte 0xb
0084DC89 81E5E0070000 and ebp,0x7e0
0084DC8F 2500F80000 and eax,0xf800
0084DC94 83E11F and ecx,byte +0x1f
0084DC97 0BE9 or ebp,ecx
0084DC99 0BE8 or ebp,eax
0084DC9B 8B44241C mov eax,[esp+0x1c]
0084DC9F 668928 mov [eax],bp
0084DCA2 83C002 add eax,byte +0x2
0084DCA5 8944241C mov [esp+0x1c],eax
0084DCA9 8B442420 mov eax,[esp+0x20]
0084DCAD 46 inc esi
0084DCAE 48 dec eax
0084DCAF 89442420 mov [esp+0x20],eax
0084DCB3 0F8509FFFFFF jnz dword 0x84dbc2
0084DCB9 5F pop edi
0084DCBA 5D pop ebp
...
0084DCC4 83C404 add esp,byte +0x4
0084DCC7 5E pop esi
0084DCC8 5B pop ebx
0084DCC9 59 pop ecx
0084DCCA C3 ret

(The ... marks where I made a timing call.)

Since you have to convert two pixels at a time, the two
sets of calculations can be somewhat combined, since they
do almost the exact same thing (as the compiler's optimizer
figured out).
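
As a rough sketch of that sharing, the chroma terms can be computed once per 4-byte group and reused for both pixels, leaving six multiplies per pair. The function name is hypothetical, and the masks are the ones from the routine above, so this version has the same lack of clamping.

```c
#include <stdint.h>

/* Convert one 4-byte YUY2 macropixel (Y0 U Y1 V) into two RGB565 pixels.
   Same masks as the original routine, so, like it, there is no clamping. */
void yuy2_macropixel_to_rgb565(const uint8_t src[4], uint16_t dst[2]) {
    int d = src[1] - 128; /* U (Cb) */
    int e = src[3] - 128; /* V (Cr) */

    /* Chroma terms, shared by both pixels; rounding constant folded in. */
    int r_off = 409 * e + 128;
    int g_off = -100 * d - 208 * e + 128;
    int b_off = 516 * d + 128;

    for (int i = 0; i < 2; i++) {
        int luma = 298 * (src[2 * i] - 16); /* Y0 at byte 0, Y1 at byte 2 */
        dst[i] = (uint16_t)(((luma + r_off) & 0xF800) |
                            (((luma + g_off) & 0xFC00) >> 5) |
                            (((luma + b_off) & 0xF800) >> 11));
    }
}
```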

On top of that, everything is 16-bit so I even wrote a
16-bit version that used less memory access and more
register access.

With all the tries I did, I couldn't beat the compiler.
Surprise. Surprise.

However, this is where I am at a loss. I have not worked
with SSE2 instructions, or whichever instruction set lets
you do multiple calculations at a time. I have only read
comments stating that I (the reader) should use SSE2 (or
AVX or whatever it is) to speed up the code (generic
comments on someone else's code, not even remotely close
to this routine).

So, this is why I came here. I know that some of you
are quite fluent in these matters. How would you write
a routine, in Intel x86 assembly, that would beat the
compiler's code above?

As you probably know, the conversion needs to be extremely
fast...

The sky is the limit, as long as this sky is 32-bit, not
64-bit. Any 32-bit instruction set is okay.

Ready? Go.

Thanks,
Ben

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Forever Young Software
http://www.fysnet.net/index.htm
http://www.fysnet.net/osdesign_book_series.htm
To reply by email, please remove the zzzzzz's

Batteries not included, some Assembly required.

Robert Wessel

Feb 5, 2018, 9:11:16 PM

You're only going to get, at best, a limited benefit if you don't
unroll that loop.

The exact code will depend on what level of SIMD support you're
assuming. I'll assume AVX2 for this, since AVX2 includes a bunch of
handy new integer operations.

AVX2 gets you 32 256-bit registers, so you can do eight iterations of
the above in parallel. You'd end up with something like:

;Compute eight iterations of above loop in parallel, AVX2:

;assumptions:
; esi points to source
; edi to output
; ymm21 = (0x00000010) x 8 (16)
; ymm22 = (0x00000080) x 8 (128)
; ymm23 = (0x0000012a) x 8 (298)
; ymm24 = (0x00000199) x 8 (409)
; ymm25 = (0x00000064) x 8 (100)
; ymm26 = (0x000000d0) x 8 (208)
; ymm27 = (0x00000204) x 8 (516)
; ymm28 = (0x0000f800) x 8
; ymm29 = (0x0000fc00) x 8

; Load eight instances
; note: loads three bytes in front of array, so [esi-3]
; must be addressable.
vmovdqu ymm1,[esi-3] ;top byte of each dword is y0
vmovdqu ymm2,[esi-2] ;top byte of each dword is u0
vmovdqu ymm3,[esi-1] ;top byte of each dword is y1
vmovdqu ymm4,[esi] ;top byte of each dword is v0

; x86 is little-endian, so the wanted byte is the top byte of
; each dword; shifting it down also clears the extra bytes
vpsrld ymm1,ymm1,24 ;ymm1 is y0
vpsrld ymm2,ymm2,24 ;ymm2 is u0
vpsrld ymm3,ymm3,24 ;ymm3 is y1
vpsrld ymm4,ymm4,24 ;ymm4 is v0

; compute c, d, e, and c'
vpsubd ymm5,ymm1,ymm21 ;ymm5 = c (y0-16)
vpsubd ymm6,ymm2,ymm22 ;ymm6 = d (u0-128)
vpsubd ymm7,ymm4,ymm22 ;ymm7 = e (v0-128)
vpsubd ymm8,ymm3,ymm21 ;ymm8 = c' (y1-16)

; compute individual products
vpmulld ymm10,ymm5,ymm23 ;ymm10 = c*298
vpmulld ymm11,ymm7,ymm24 ;ymm11 = e*409
vpmulld ymm12,ymm6,ymm25 ;ymm12 = d*100
vpmulld ymm13,ymm7,ymm26 ;ymm13 = e*208
vpmulld ymm14,ymm6,ymm27 ;ymm14 = d*516
vpmulld ymm15,ymm8,ymm23 ;ymm15 = c'*298

; First result, first term
vpaddd ymm17,ymm10,ymm11 ;ymm17 = c*298 + e*409
vpaddd ymm17,ymm17,ymm22 ;ymm17 = c*298 + e*409 + 128
vpand ymm17,ymm17,ymm28 ;mask with 0xf800

; First result, second term
vpsubd ymm18,ymm10,ymm12 ;ymm18 = c*298 - d*100
vpsubd ymm18,ymm18,ymm13 ;ymm18 = c*298 - d*100 - e*208
vpaddd ymm18,ymm18,ymm22 ;ymm18 = c*298 - d*100 - e*208 + 128
vpand ymm18,ymm18,ymm29 ;mask with 0xfc00
vpsrad ymm18,ymm18,5 ;shift right 5

; First result, third term
vpaddd ymm19,ymm10,ymm14 ;ymm19 = c*298 + d*516
vpaddd ymm19,ymm19,ymm22 ;ymm19 = c*298 + d*516 + 128
vpand ymm19,ymm19,ymm28 ;mask with 0xf800
vpsrad ymm19,ymm19,11 ;shift right 11

; or three terms of first result together
vpor ymm9,ymm17,ymm18
vpor ymm9,ymm9,ymm19 ;ymm9=first result

; Second result, first term
vpaddd ymm17,ymm15,ymm11 ;ymm17 = c'*298 + e*409
vpaddd ymm17,ymm17,ymm22 ;ymm17 = c'*298 + e*409 + 128
vpand ymm17,ymm17,ymm28 ;mask with 0xf800

; Second result, second term
vpsubd ymm18,ymm15,ymm12 ;ymm18 = c'*298 - d*100
vpsubd ymm18,ymm18,ymm13 ;ymm18 = c'*298 - d*100 - e*208
vpaddd ymm18,ymm18,ymm22 ;ymm18 = c'*298 - d*100 - e*208 + 128
vpand ymm18,ymm18,ymm29 ;mask with 0xfc00
vpsrad ymm18,ymm18,5 ;shift right 5

; Second result, third term
vpaddd ymm19,ymm15,ymm14 ;ymm19 = c'*298 + d*516
vpaddd ymm19,ymm19,ymm22 ;ymm19 = c'*298 + d*516 + 128
vpand ymm19,ymm19,ymm28 ;mask with 0xf800
vpsrad ymm19,ymm19,11 ;shift right 11

; or three terms of second result together
vpor ymm17,ymm17,ymm18
vpor ymm17,ymm17,ymm19 ;ymm17=second result

; Now ymm9 and ymm17 contain eight outputs each, with the two
; results for each output alternating in the two registers.
vpunpckhdq ymm10,ymm9,ymm17 ;interleave high dwords to ymm10
vpunpckldq ymm11,ymm9,ymm17 ;interleave low dwords to ymm11

; pack results from dwords to words, low dwords first
; note: if unsigned saturation is not desired, mask the source
; registers (ymm10, ymm11) with (0x0000ffff)*8
vpackusdw ymm12,ymm11,ymm10

; And store the results
vmovdqu [edi],ymm12

add esi,32
add edi,32

; end

The above hasn't been compiled, tested, etc., nor has any care been
taken in scheduling instructions or trying to work out any other
optimizations. But it ought to perform eight iterations of your
original loop in one go. Modulo bugs, of course.

The end cases where you have fewer than eight iterations' worth of
input (eight four-byte groups, 16 pixels) left to process have not
been considered in the above, but several approaches are possible.
You could just compute extra pixels (IOW rounded up to a multiple of
eight iterations) and discard extra results, or you could have a
separate loop to handle single cases (or perhaps multiples of four).
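
A minimal sketch of the separate-loop approach in C, under some assumptions: convert_block() is a hypothetical stand-in for the SIMD kernel (here it just calls the scalar code so the example stays portable), and the scalar path uses the masks from the original routine. None of these names come from the posts above.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_BYTES = 32 }; /* eight 4-byte groups, as in the AVX2 sketch */

/* Scalar path: converts n_bytes of YUY2 (a multiple of 4). */
static void convert_scalar(uint16_t *dst, const uint8_t *src, size_t n_bytes) {
    for (size_t i = 0; i + 4 <= n_bytes; i += 4) {
        int d = src[i + 1] - 128; /* U */
        int e = src[i + 3] - 128; /* V */
        for (int p = 0; p < 2; p++) {
            int c = 298 * (src[i + 2 * p] - 16) + 128;
            *dst++ = (uint16_t)(((c + 409 * e) & 0xF800) |
                                (((c - 100 * d - 208 * e) & 0xFC00) >> 5) |
                                (((c + 516 * d) & 0xF800) >> 11));
        }
    }
}

/* Hypothetical stand-in for the SIMD kernel: converts exactly one block. */
static void convert_block(uint16_t *dst, const uint8_t *src) {
    convert_scalar(dst, src, BLOCK_BYTES);
}

/* Full blocks go through the block kernel, the remainder through the
   scalar loop. Returns the number of full blocks processed. */
size_t yuy2_to_rgb565_blocked(uint16_t *dst, const uint8_t *src, size_t n_bytes) {
    size_t done = 0;
    while (n_bytes - done >= BLOCK_BYTES) {
        convert_block(dst + done / 2, src + done); /* 4 bytes in, 2 pixels out */
        done += BLOCK_BYTES;
    }
    convert_scalar(dst + done / 2, src + done, n_bytes - done);
    return done / BLOCK_BYTES;
}
```

The same driver shape would also cover the first-pixel special case mentioned for [esi-3], by sending the first group through the scalar path.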

Another issue is the assumption that [esi-3] is readable. If that's a
problem, the first pixel may need special handling. This might
combine with the non-multiple-of-eight processing.

Robert Wessel

Feb 5, 2018, 9:26:18 PM

Ignore the comment about unsigned saturation - all the values out of
the computation will be 16 bits, and so there are no overflows for the
pack to deal with.

Robert Wessel

Feb 5, 2018, 9:26:22 PM

On Mon, 05 Feb 2018 19:56:30 -0600, Robert Wessel
<robert...@nospicedham.yahoo.com> wrote:

Argh.

I forgot that pre-AVX-512 had only 16 registers. *sigh* Ah well, you'll
have to shuffle the register usage around a bit and move the constants
to memory. Other than the constants, the above code uses 18
registers, so shuffling things around to get to 16 would be easy. The
original loaded values (ymm1..ymm4), for example, aren't in the final
computations, and could trivially be used instead of ymm17..ymm19 in
those.

Benjamin David Lunt

Feb 6, 2018, 12:57:19 PM

"Robert Wessel" <robert...@nospicedham.yahoo.com> wrote in message
news:v14i7ddpmndtqijv8...@4ax.com...

Thanks Robert.

I am going to have to study the avx instruction set and
see if I can implement it.

Thanks again,
Ben

Rod Pemberton

Feb 6, 2018, 7:44:32 PM

On Mon, 5 Feb 2018 16:06:08 -0700

"Benjamin David Lunt" <zf...@nospicedham.fysnet.net> wrote:

> I have been working on my USB Camera code and have a routine
> to convert the stream of data from yuy2 to RGB.

Doesn't FFMPEG on Linux do such conversions? ...

> Here is my (generic) C routine:
>
> void yuy2_to_rgb565(void *targ, void *src, int cnt) {
> bit8u *s = (bit8u *) src;
> bit16u *t = (bit16u *) targ;
>
> while (cnt > 0) {
> int y0 = *s++;
> int u0 = *s++;
> int y1 = *s++;
> int v0 = *s++;
> cnt -= 4;
>
> int c = y0 - 16; // luma
> int d = u0 - 128; // cr
> int e = v0 - 128; // cb
>
> *t++ =
> ((298 * c + 409 * e + 128) & 0xF800) | // R
> (((298 * c - 100 * d - 208 * e + 128) & 0xFC00) >> 5) | //G
> (((298 * c + 516 * d + 128) & 0xF800) >> 11); // B
>
> c = y1 - 16;
>
> *t++ =
> ((298 * c + 409 * e + 128) & 0xF800) | // R
> (((298 * c - 100 * d - 208 * e + 128) & 0xFC00) >> 5) | // G
> (((298 * c + 516 * d + 128) & 0xF800) >> 11); // B
> }
> }
>
> (I tried to make it narrow enough not to wrap in the post, but
> please watch for wrap)
>

Well, I'm not familiar with these video formats or their mathematical
conversion, but I do notice an awful lot of multiplication, masking,
shifting, etc, while I don't see any look-up tables to reduce shifting
and masking, etc. I.e., I really doubt this is as fast as C can go.
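
One way the lookup-table idea could look, as a rough sketch only: each multiply is replaced by a 256-entry table indexed by the source byte. The table and function names are invented for illustration, the masks are the ones from the routine quoted above (so there is still no clamping), and nothing here claims it is faster than the multiply version.

```c
#include <stdint.h>

static int y_tab[256];  /* 298 * (Y - 16) + 128 */
static int rv_tab[256]; /* 409 * (V - 128) */
static int gu_tab[256]; /* -100 * (U - 128) */
static int gv_tab[256]; /* -208 * (V - 128) */
static int bu_tab[256]; /* 516 * (U - 128) */

/* Fill the tables once before converting anything. */
void yuy2_lut_init(void) {
    for (int i = 0; i < 256; i++) {
        y_tab[i]  = 298 * (i - 16) + 128;
        rv_tab[i] = 409 * (i - 128);
        gu_tab[i] = -100 * (i - 128);
        gv_tab[i] = -208 * (i - 128);
        bu_tab[i] = 516 * (i - 128);
    }
}

/* Convert one 4-byte group (Y0 U Y1 V) to two RGB565 pixels using
   table lookups in place of the multiplies. */
void yuy2_lut_macropixel(const uint8_t src[4], uint16_t dst[2]) {
    int r_off = rv_tab[src[3]];
    int g_off = gu_tab[src[1]] + gv_tab[src[3]];
    int b_off = bu_tab[src[1]];

    for (int i = 0; i < 2; i++) {
        int luma = y_tab[src[2 * i]]; /* rounding constant already included */
        dst[i] = (uint16_t)(((luma + r_off) & 0xF800) |
                            (((luma + g_off) & 0xFC00) >> 5) |
                            (((luma + b_off) & 0xF800) >> 11));
    }
}
```

The shifts and masks remain; only the six multiplies per pixel pair become five table reads plus the two luma reads.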

Without a full test harness, i.e., a program to read in a YUY2 file and
write out an RGB file, plus sample YUY2 input video with converted RGB
output video, so I can verify a replacement routine, I'm not about to
play around with this.

If you want to explain YUY2 format to us, you could drop a post on
a.l.a, a.o.d, or your website.

Rod Pemberton
--
"The strongest, richest, greatest nation in the world shouldn't leave
anyone behind," said Joe Kennedy. "We're going to put a lot of coal
miners and coal country out of business," said Hillary Clinton.

Benjamin David Lunt

Feb 6, 2018, 10:44:49 PM

"Rod Pemberton" <NoE...@nospicedham.trraxvfeqa.prg> wrote in message
news:p5dg6b$1kim$1...@gioia.aioe.org...

> On Mon, 5 Feb 2018 16:06:08 -0700
>

> Well, I'm not familiar with these video formats or their mathematical
> conversion, but I do notice an awful lot of multiplication, masking,
> shifting, etc, while I don't see any look-up tables to reduce shifting
> and masking, etc. I.e., I really doubt this is as fast as C can go.

If I remember right, I got it down to 6 multiplies and 4 shifts.

> Without a full test harness, i.e., a program to read in YUY2 file and
> write out a RGB file, and sample YUY2 input video with converted RGB
> output video, so I can to verify a replacement routine, I'm not about to
> play around with this.

I don't blame you. I can only test it and gather timing information
after a rebuild and a reboot, which is a somewhat lengthy process:
120 seconds or so each time.

> If you want to explain YUY2 format to us, you could drop a post on
> a.l.a, a.o.d, or your website.

YUY2 uses a two-pixel format and gives the color not in red/green/blue
shades but in Y, Cb, and Cr values.

https://www.loc.gov/preservation/digital/formats/fdd/fdd000364.shtml
"A digital, color-difference component video picture format identified
by the FOURCC code YUY2. This format employs 4:2:2 chroma subsampling
with each sample represented by 8 bits of data. It is essentially the
same as UYVY but with different component ordering packed within the
two-pixel macropixel: Byte 0=8-bit Y'0; Byte 1=8-bit Cb; Byte 2=8-bit
Y'1; Byte 3=8-bit Cr."

Since only the Y value differs between two adjacent pixels, each
pixel gets its own Y value while the Cb and Cr values are shared by
both. Hence, the macropixel is laid out as Y0 Cb Y1 Cr (four bytes,
two of them Y).
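
The byte layout quoted above can be sketched in C as follows. The struct and function names are hypothetical, chosen only to show which bytes belong to which pixel.

```c
#include <stdint.h>

/* One pixel's worth of Y'CbCr after unpacking. */
typedef struct {
    uint8_t y;
    uint8_t cb;
    uint8_t cr;
} ycbcr_pixel;

/* Split one 4-byte YUY2 macropixel into two pixels:
   byte 0 = Y'0, byte 1 = Cb, byte 2 = Y'1, byte 3 = Cr. */
void yuy2_unpack_macropixel(const uint8_t src[4], ycbcr_pixel out[2]) {
    out[0].y = src[0];
    out[1].y = src[2];
    out[0].cb = out[1].cb = src[1]; /* Cb is shared by both pixels */
    out[0].cr = out[1].cr = src[3]; /* Cr is shared by both pixels */
}
```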

Anyway, I will see what I can do with Robert's suggestion, as
well as keep working on my own.

If this discussion needs to be moved from c.l.a.x, Frank will surely
let me know. (I rarely visit a.o.d any more, sorry).

Ben

Terje Mathisen

Feb 7, 2018, 2:30:05 AM

Benjamin David Lunt wrote:
> "Rod Pemberton" <NoE...@nospicedham.trraxvfeqa.prg> wrote in message

>> If you want to explain YUY2 format to us, you could drop a post on
>> a.l.a, a.o.d, or your website.
>
> YUY2 uses a two-pixel format and gives the color not in red/green/blue
> shades but in Y, Cb, and Cr values.
>
> https://www.loc.gov/preservation/digital/formats/fdd/fdd000364.shtml
> "A digital, color-difference component video picture format identified
> by the FOURCC code YUY2. This format employs 4:2:2 chroma subsampling
> with each sample represented by 8 bits of data. It is essentially the
> same as UYVY but with different component ordering packed within the
> two-pixel macropixel: Byte 0=8-bit Y'0; Byte 1=8-bit Cb; Byte 2=8-bit
> Y'1; Byte 3=8-bit Cr."
>
> Since the Y value will be the only difference for two adjacent
> pixels, only this Y value is repeated for both pixels. Hence, it
> is usually expressed as UYVY (four nibbles, two Y nibbles).
>
> Anyway, I will see what I can do with Robert's suggestion, as
> well as keep working on my own.
>
> If this discussion needs to be moved from c.l.a.x, Frank will surely
> let me know. (I rarely visit a.o.d any more, sorry).

When you have (chroma) subsampling like that you probably need to work
on a complete macroblock, generating all the corresponding output RGB
pixels at once, right?

In the old days this was best done with lookup tables but today such
tables can easily lead to serial bottlenecks while a SIMD (SSE/AVX)
approach with explicit mul/add/sub/shift operations can run in parallel,
so I wouldn't be surprised to learn that this is now also faster.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"