
Alpha blending algorithm


Jason Victor

Dec 30, 2001, 10:18:56 AM
Hello.
If I have two colors (hex values), each an unsigned long, and I want to
find a color that will "blend" them (e.g. if I have black and white, I
want gray), how do I do this?

I'm pretty sure that it is the average of their respective R, G and B
values, but I don't have those, just the hex color (0 for black,
0xffff for white, and I think 0x7bef for gray).

I'm programming in Xlib if that matters. Thanks for your help.

-j

Robert Olofsson

Dec 30, 2001, 2:31:35 PM
: If I have two colors (hex values) each an unsigned long, and I want to
: find a color that will "blend" them (e.g. I have black and white I
: want gray) how do I do this?

average of R, G and B....

: I'm pretty sure that it is the average of their respective R G and B
: values, but I don't have those, just the hex color (0 for black,
: 0xffff for white, and i think 0x7bef for gray).

OK, my guess is that you have red = 5 bits, green = 6 bits and blue = 5 bits
and run in TrueColor mode. That just happens to fit the hex values above
well..

xdpyinfo can tell you more about which visual you are using. like for me:
default visual id: 0x21
visual:
visual id: 0x21
class: TrueColor
depth: 16 planes
available colormap entries: 64 per subfield
red, green, blue masks: 0xf800, 0x7e0, 0x1f
significant bits in color specification: 8 bits

which seems to be the same as you are using...

16 bits can be quite funny (or not). To get blue, mask the 5 lower bits; for
red, mask the 5 upper bits and shift; for green, mask out bits 5-10 and shift
down. Reverse to compose the color.

one can also have 1 bit alpha, and 5 bits each for red, green and blue but
that is not so common anymore (was it ever?)...

If you want to be nice and distribute your program you will have to handle
other bit depths too.....

/robo
******************************************************************************
Robert Olofsson * d94...@nada.kth.se * "Your eyes can deceive you
Flygkårsv 5,1 * * use the force."
183 62 Täby * tel: 08-732 71 39 * /Obi-Wan-Kenobi
Sweden * http://www.nada.kth.se/~d94-rol
******************************************************************************

gordy

Dec 31, 2001, 2:29:32 AM
given color (r, g, b)

mix = (((color1 - color2)*alpha) / 256) + color2

alpha would be 0..255
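Applied per channel, gordy's formula might look like this in C (a sketch; the function name is mine, and later replies in the thread discuss the rounding of the /256):

```c
/* gordy's mix for one channel; c1, c2, alpha all in 0..255.
   (Later replies in the thread discuss the /256 rounding.) */
static int mix_channel(int c1, int c2, int alpha)
{
    return ((c1 - c2) * alpha) / 256 + c2;
}
```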


gordy

Dec 31, 2001, 2:33:16 AM
> given color (r, g, b)
>
> mix = (((color1 - color2)*alpha) / 256) + color2
>
> alpha would be 0..255

ah, for just a 50/50 blend, alpha would be 128 and that reduces to (color1 +
color2) / 2


David Kastrup

Dec 31, 2001, 4:09:40 PM
"gordy" <gf...@home.com> writes:

That's junk. You need to divide by 255; and that does not take rounding
into account, nor the fact that processors in general truncate towards 0
instead of towards -infinity.

The following is correct including best rounding:

((255-alpha)*color2 + alpha*color1 + 127)/255;

It might be a good idea to have all variables unsigned char, and to
use unsigned constants as well (cast to unsigned short where
necessary). In that case chances are that the compiler will be able
by itself to replace the above arithmetic by something more efficient
but equivalent.
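In C, the formula per component might look like this (a sketch; the function name is mine):

```c
/* Exact blend with rounding; alpha, c1, c2 all in 0..255. */
static unsigned blend_channel(unsigned alpha, unsigned c1, unsigned c2)
{
    return ((255u - alpha) * c2 + alpha * c1 + 127u) / 255u;
}
```

Note that blend_channel(255, c1, c2) returns exactly c1 and blend_channel(0, c1, c2) exactly c2, which a /256 variant does not guarantee.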

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
Email: David....@t-online.de

gordy

Dec 31, 2001, 4:16:42 PM
one gives you alpha in a range of 0..256 (9 bits) and the other is 0..255;
the same rounding problems, it looks to me. But then again it's just for
blending a couple of colors.. if he's that worried about accuracy he would
use floating-point maths, and probably some lightness-preserving method at
that.

Steve Martin

Dec 31, 2001, 4:42:09 PM
David Kastrup wrote:

> > given color (r, g, b)
> >
> > mix = (((color1 - color2)*alpha) / 256) + color2
> >
> > alpha would be 0..255
>
> That's junk. You need to divide by 255, and that does not take
> rounding into account and processors in general truncating towards 0
> instead of -infinity.
>
> The following is correct including best rounding:
>
> ((255-alpha)*color2 + alpha*color1 + 127)/255;

Bear in mind also (and perhaps this is understood, but
just in case...) that this calculation must be done
*separately* on each color component, not simply on
the 24-bit hex number representing each color, else
you run the risk of the subtraction causing color
errors as carries and borrows occur in the two's-
complement operations.
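Concretely, for a packed 24-bit 0xRRGGBB value (an assumed layout, used here only for illustration), a 50/50 blend would unpack, average, and repack each channel:

```c
/* 50/50 blend of two packed 0xRRGGBB colors, done per channel so
   carries and borrows cannot leak between components. */
static unsigned long blend24_half(unsigned long c1, unsigned long c2)
{
    unsigned long out = 0;
    int shift;
    for (shift = 0; shift <= 16; shift += 8) {
        unsigned a = (unsigned)(c1 >> shift) & 0xffu;
        unsigned b = (unsigned)(c2 >> shift) & 0xffu;
        out |= (unsigned long)((a + b) / 2) << shift;
    }
    return out;
}
```

Blending 0xff0000 with 0x0000ff this way gives 0x7f007f; averaging the packed numbers directly gives 0x7f807f, with a visibly wrong green channel.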

Jason Victor

Jan 1, 2002, 9:42:45 AM
OK, is it impossible to alpha blend two hex colors?

-k

David Kastrup

Jan 1, 2002, 10:54:32 AM
jaso...@yahoo.com (Jason Victor) writes:

> OK, is it impossible to alpha blend two hex colors?

Yes, provided you are too lazy to consider either looking up or
programming trivial code pieces.

Jason Victor

Jan 1, 2002, 12:40:11 PM
> > OK, is it impossible to alpha blend two hex colors?
>
> Yes, provided you are too lazy to consider either looking up or
> programming trivial code pieces.

I don't understand your point. I'm not lazy, this is just my first
real stab at graphics programming.
Where do I get the lookup table?

Arne Schmitz

Jan 1, 2002, 1:07:49 PM
Jason Victor wrote:

> I don't understand your point. I'm not lazy, this is just my first
> real stab at graphics programming.
> Where do I get the lookup table?

What's your problem? The hex format is just a human-readable representation!
As soon as you've fscanf'ed or sscanf'ed the hex value, it's stored as an
integer value in the computer, and you can do arithmetic operations as
usual.

Arne

--
All the passions make us commit faults; love makes us commit the most
ridiculous ones.
-- La Rochefoucauld

[--- PGP key available on http://www.root42.de/ ---]

David Kastrup

Jan 1, 2002, 5:21:40 PM
jaso...@yahoo.com (Jason Victor) writes:

Lookup table? What lookup table? Look, I have provided you with the
basic per-component formula. And there are literally hundreds of code
examples floating around on the web you can look up. If that is not
sufficient, you obviously need to have something tailor-made for you.
And if that's the case, you should be offering a reasonable payment
for services rendered specifically to you.

There are lots of resources easy to find in case you invest the time
necessary to look for them. If you do not want to invest the time,
what makes you think someone else will?

Jason Victor

Jan 1, 2002, 7:38:39 PM
My problem is that with the hex color, and not individual red, green
and blue values, I can't do the arithmetic I need to do alpha
blending.

unless anybody knows how to do alpha blending with two COLORS that are
already a hex value, so I DON'T HAVE the individual R, G, B components for
each.

-j

Charles R. Bond

Jan 1, 2002, 9:26:53 PM

Jason Victor wrote:

Jason,
I think part of the problem you are experiencing has to do with
accepted practices for representing machine numbers, and
the conventions used for representing colors. For example,
a 32-bit integer is often represented in hex format as: 0x12345678
where each digit is a 4-bit hex digit in the range from 0 to F.
(0123456789ABCDEF). You asked about converting this to
decimal and, indeed, it can be done within the C/C++ language
by any of several methods.

As for colors, it is usual to denote RGB colors with a hex number
such as 0x00RRGGBB, where the RR means two hex digits which
form a 'red' value from 0 to 255, the GG means two hex digits
which form the 'green' value, and BB the blue. The 00 entries in this
format generally designate the alpha or blending channel. Now,
there are a variety of color formats, and some graphics environments
do not support an alpha channel explicitly, etc., etc. So when you
want to blend colors you have to know if your color format is
24-bit RGB, or 16/15-bit TrueColor, or 8-bit color or whatever.


You may find it helpful to find a text which covers computer graphics
at this level and study it; then you will be able to proceed
confidently.


Robert Olofsson

Jan 2, 2002, 2:47:46 AM
: unless anybody knows how to do alpha blending with two COLORS that are
: already a hex value, so i DONT HAVE the individual r g b component for
: each.

And I have already told you that you probably have to split your color
(16 bits) with the 5 top bits for red, 6 middle for green and 5 lower for
blue (but check with xdpyinfo that that is correct). With that info
you have the individual components...

Arne Schmitz

Jan 2, 2002, 4:35:15 AM
Jason Victor wrote:

You could either

- use the "&" and ">>" operators to extract the RGB codes like this:

  int hexcolor, r, g, b;
  r = ( hexcolor >> 16 ) & 255;
  g = ( hexcolor >> 8 ) & 255;
  b = hexcolor & 255;  /* or the other way round?? RGB -> BGR? */

But that is too dependent on screen depth (here 24/32 bpp).

- or, better, use XAllocColor(3X11) or XAllocColorCells(3X11) to allocate a
colour structure; this way is independent of any colour depth.

- or use the RENDER extension that is present in recent
XFree86 implementations. It does alpha blending all on its own. I have no
experience with it, though, but a search on Google will most probably help
you.

Arne

--
Old soldiers never die. Young ones do.

Donald Graft

Jan 2, 2002, 8:36:45 AM
Maybe he has an indexed color format and doesn't know how to do
the palette lookup.

Don

"Robert Olofsson" <d94...@my.nada.kth.se> wrote in message news:a0udv2$dtb$1...@news.kth.se...

Jukka Liimatta

Jan 5, 2002, 10:10:15 PM
> The following is correct including best rounding:
>
> ((255-alpha)*color2 + alpha*color1 + 127)/255;

A division instruction is not always the quickest way to divide. If the
range of values is known (for instance, for 8-bit alpha blending the maximum
value to be divided by 255 would be 255*255), we can rewrite the division:

int x = i + 127; // rounding
int v = x / 255;

Rewritten as:

int x = i + 127; // rounding
int v = ((x<<8) + x + (x>>7)) >> 16;


... this is hard to explain as I never had to prove it before, but here's
how I derived it:

y = (x / 256) - (x / 255)

.. where y is the error from dividing by the incorrect value. I use
x/256 + x/512 as the decision bits; they just carry into the final result,
which is why the actual calculation has to be done like it is above. This is
*not* precise in general; it is guaranteed only to work with values between
0 and 255*255. We can eliminate one bit-shift and addition if we can live
with the fact that we get an incorrect result only twice or so for the whole
range (the incorrect value just has a -1 difference to the correct value).
This slower method is absolutely bit-by-bit precise, though.
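The claim is cheap to check exhaustively. A sketch (function names mine) that compares the shift-add form against true division over the range used in the thread:

```c
/* Jukka's shift-add replacement for x/255; per the thread it is
   valid for x in [0, 255*255 + 127]. */
static int div255_approx(int x)
{
    return ((x << 8) + x + (x >> 7)) >> 16;
}

/* Exhaustive comparison against x/255; returns the mismatch count. */
static int div255_mismatches(void)
{
    int x, bad = 0;
    for (x = 0; x <= 255 * 255 + 127; ++x)
        if (div255_approx(x) != x / 255)
            ++bad;
    return bad;
}
```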

Below is benchmark code (not compilable as-is, since it requires linking
against a library with a high-precision timer in it, but it shows the
"method" of testing at least). The result of this quick test is that the
/255 method is roughly three times slower than the kludge-method(tm).

I didn't write this just because of this thread, though; I'm not that crazy
most of the time. I just tested this concept ages ago, when division was even
slower than it is on more contemporary architectures. I compiled and ran
this before posting, though. The result of 3x faster performance for the
no-division alternative was for a Pentium 4 1.7GHz, Windows 2000 SP2,
Microsoft Visual C++ 7.0 Beta 2.

I'm sure the results will vary for non-Intel and non-x86 processors,
different compilers, etc. But that's how it looks on my configuration,
and that's on a machine where division is supposed to be relatively cheap
(not sure if integer division folds into floating-point division in hardware
or not, but anyway.. that's the system and these were the results ;-)


// .... the benchmark/verification code
// <snip> <snip>

Timer timer;
float time0, time1;

// test the "alternative"
{
    int sum = 0;
    time0 = timer.GetTime();
    for ( int i=0; i<255*255; ++i )
    {
        int x = i + 127;
        int v = ((x<<8) + x + (x>>7)) >> 16;
        sum += v;
    }
    time1 = timer.GetTime();
    cout << "defeat optimizer value: " << sum << endl;
    cout << "runtime: " << (time1-time0)*1000 << "ms" << endl;
}

// test the division
{
    int sum = 0;
    time0 = timer.GetTime();
    for ( int i=0; i<255*255; ++i )
    {
        int v = (i+127) / 255;
        sum += v;
    }
    time1 = timer.GetTime();
    cout << "defeat optimizer value: " << sum << endl;
    cout << "runtime: " << (time1-time0)*1000 << "ms" << endl;
}

// verify algorithm
for ( int i=0; i<255*255; ++i )
{
    int x = i + 127;
    int v0 = ((x<<8) + x + (x>>7)) >> 16;
    int v1 = x / 255;
    if ( v0 != v1 )
    {
        cout << "#### not verified ####" << endl;
    }
}

return 0;
}


That said, I do (x>>8) in practice anyway. ;-)


--
Jukka


David Kastrup

Jan 6, 2002, 6:20:22 AM
"Jukka Liimatta" <jukka.l...@twilight2d.com> writes:

> > The following is correct including best rounding:
> >
> > ((255-alpha)*color2 + alpha*color1 + 127)/255;
>
> The division is not always the quickest way to divide, if the range of
> values is known, for instance for 8bit alpha blending, the maximum value
> would be 255*255, divided by 255, we can rewrite the division:
>
> int x = i + 127; // rounding
> int v = x / 255;
>
> Rewritten as:
>
> int x = i + 127; // rounding
> int v = ((x<<8) + x + (x>>7)) >> 16;
>

Look, if you are using a shitty compiler, that is your problem. The
following code compiles as follows under GNU gcc:

unsigned char a,x,y;
[...]

printf("%d\n", (unsigned char)((a*(x-y) + 0xff7fu)/255u + y));
printf("%d\n", (a*x+(255u-a)*y + 127u)/255u);
unsigned xx = (a*x+(255u-a)*y+127u);
printf("%d\n", ((xx<<8)+x+(xx>>7))>>16);

Note that the last line is yours. The other two are two "straight"
variants, where the first variant tries to avoid one multiplication.
Looking at the generated code, this seems like a failure.

movl %eax, %edi
movl %ebx, %edx
subl %edi, %edx
movl %ebx, -20(%ebp)
movl %esi, %ebx
movl $-2139062143, %eax
imull %edx, %ebx
addl $65407, %ebx
mull %ebx
shrl $7, %edx
xorl %ebx, %ebx
addl %edx, %ecx
movb %cl, %bl

Ok, blended byte in %ebx, gcc did not perform a division instruction.

[...]
movl -20(%ebp), %eax
movl %esi, %edx
popl %ecx
imull %eax, %edx
movl $255, %eax
subl %esi, %eax
imull %edi, %eax
leal 127(%eax,%edx), %edi
movl $-2139062143, %eax
mull %edi
shrl $7, %edx
[...]
Ok, blended byte in %edx, gcc did not perform a division instruction.
Trying to save one multiplication

For your version, gcc reuses %edi. All of the junk below generated by
your code is about equivalent to the last three lines of the above.

movl -20(%ebp), %eax
movl %edi, %ecx
sall $8, %ecx
shrl $7, %edi
addl %eax, %ecx
addl %edi, %ecx
shrl $16, %ecx

Ok, blended byte in %ecx. gcc did not perform a division instruction.

As you can see, the straight variant not only was the most legible and
easily provably correct, it also compiled the fastest code.

"Optimizations" like the one you propose are very likely to hamper
compiler optimizations when going to a different compiler, or a
different processor, or a different compiler version. They should, if
at all, be surrounded by
#if specific compiler and specific target and specific compiler version
and only after benchmarking.

The only way to determine whether hand-optimizations are better than
what the compiler produces, is benchmarking, and benchmarking is only
valid on one and only one platform.

Oh, by the way, your version blends the triple 255,255,255 into 254,
so it also produces wrong results.

Congratulations. Less readable, slower, wrong.

I'd rather trust my compiler's optimizer.

Jukka Liimatta

Jan 6, 2002, 9:19:44 AM
> Look, if you are using a shitty compiler, that is your problem. The
> following code compiles as follows under GNU gcc:

I have to use "a shitty compiler" when writing C++ for Windows.


> Oh, by the way, your version blends the triple 255,255,255 into 254,
> so it also produces wrong results.

If you run the test code, it will give the *same* result for the presented
cases, x : [0, 255*255].

int x = 255;

int v0 = ((x<<8) + x + (x>>7)) >> 16;
int v1 = x / 255;

cout << v0 << endl;
cout << v1 << endl;

Please elaborate what you mean by (255,255,255) being blended into 254; the
above gives the same value for v0 and v1 for values of x between 0 and
255*255. It gives the same value as /255, so how can it be any more or less
wrong than the result from division?


> Congratulations. Less readable, slower, wrong.

Less readable, agreed. I just gave an alternative to /255 which gives a
bit-by-bit identical result to the division. "Wrong" is thus what I don't
agree with.


> I'd rather trust my compiler's optimizer.

If I did, I would have three times slower code on a "crappy" compiler. I see
no reason to get upset; I just give alternatives and present the information
I happen to possess. Anyone can use what they like - but more information
never hurts.


--
Jukka

David Kastrup

Jan 6, 2002, 10:59:40 AM
"Jukka Liimatta" <jukka.l...@twilight2d.com> writes:

> > Look, if you are using a shitty compiler, that is your problem. The
> > following code compiles as follows under GNU gcc:
>
> I have to use "a shitty compiler" when writing C++ for Windows.

g++'s port mingw32 is available free for Windows.

> > Oh, by the way, your version blends the triple 255,255,255 into 254,
> > so it also produces wrong results.
>
> If you run the test code, it will give the *same* result for the presented
> cases. x : [0,255squared].

Small note aside: the range in question is x:[127,255squared+127];

But that's probably not the real problem. Please reread the code
example I posted. It had a typo in it (I wrote x once instead of xx).
So it won't give the same result, but through a fault of mine, not
yours.

So what is the resulting code if we do it correctly?

printf("%d\n", (unsigned char)((a*(x-y) + 0xff7fu)/255u + y));
printf("%d\n", (a*x+(255u-a)*y + 127u)/255u);
unsigned xx = (a*x+(255u-a)*y+127u);

printf("%d\n", ((xx<<8)+xx+(xx>>7))>>16);

goes to
movl %esi, -20(%ebp)
movl %ebx, %esi
movl %esi, %edx
movl %eax, %ebx
subl %ebx, %edx
movl -20(%ebp), %edi
imull %edx, %edi
movl $-2139062143, %eax
addl $65407, %edi


mull %edi
shrl $7, %edx

addl %edx, %ecx
xorl %edx, %edx
movb %cl, %dl
result in edx, second version:
movl -20(%ebp), %ecx
movl $255, %eax
subl %ecx, %eax
movl -20(%ebp), %edx
imull %esi, %edx
imull %ebx, %eax
leal 127(%eax,%edx), %esi
from here we work from esi:
movl $-2139062143, %eax
mull %esi
shrl $7, %edx

result in edx

your version, starts with esi like above:
movl %esi, %edx
sall $8, %edx
addl %esi, %edx
shrl $7, %esi
addl %esi, %edx
shrl $16, %edx

So you see that your version is more complicated. Since the various
partial results all need the shift unit of the Pentium and there are
dependencies, this does not schedule so well.

> > Congratulations. Less readable, slower, wrong.
>
> Less readable, agreed. I just gave alternative to /255 which gives
> bit-by-bit same result as the division. The wrong is thus what I
> don't agree with.

Was a typo of mine. I would not trust your version, though, before
having it run through a program checking all values of x from 0 to
65535. Whereas my version is self-evidently correct.

> > I'd rather trust my compiler's optimizer.
>
> If I did, I would have three times slower code on a "crappy"
> compiler.

Then, as mentioned above, you should special-case your code just for
that particular compiler and take care that the better optimizable
and plain code gets compiled in all other situations.

Oh, BTW, you would in most cases be radically better off to use MMX
instructions if you knew your target to support that. Not for a
single pixel, of course, but for an array of them.

Gernot Hoffmann

Jan 6, 2002, 12:21:46 PM
It has been overlooked that "255" is merely a scale factor S .

a= 0..1 Blend factor

c= (1-a)*c1+a*c2 New color

c= [S*(1-a)*c1+S*a*c2]/S Scaled

S=256 Example
A=S*a

c = [(S-A)*c1 + A*c2] Div 256

No division - a simple shift.
S = 128 for 16-bit signed integer, else any reasonable value.
For a 32-bit integer, S can be made much larger.
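A sketch of the S = 256 form in C (function name mine; as the follow-ups note, the shift by 8 is off by up to one step relative to dividing by 255):

```c
/* Gernot's scaled blend with S = 256: A = S*a is the weight in 0..256,
   and the division by S becomes a single shift. */
static int blend_s256(int A, int c1, int c2)
{
    return ((256 - A) * c1 + A * c2) >> 8;
}
```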

Best regards --Gernot Hoffmann

Jukka Liimatta

Jan 6, 2002, 4:51:36 PM
> g++'s port mingw32 is available free for Windows.

Finally! Last time I checked, it wasn't. I first checked in 1996, when
the Prophecy SDK's first lines of code were written. At some point the
code has been compiled with some of the following (with success):
MSVC++4,5,6,7
Borland C++Builder 4,5
GCC 2,3
Intel C/C++ 4,5

Platforms: Windows, BeOS, Linux and MS-DOS (the original implementation
platform, legacy stuff). Occasional straying to less mainstream platforms
has occurred and is not worth mentioning. ;-)

I didn't find the g++ for Windows last time I looked, which was late '99 or
early 2001 -- apologies if I cannot recall the precise date or time. Glad to
hear this is the case now; I will definitely look into it.


> Small note aside: the range in question is x:[127,255squared+127];

True, the test ran i from 0 to 255*255, but x was i+127, so the range
actually being tested was [127, 255*255+127]. I also separately tested
[0, 255*255] on another occasion (w/o the rounding) and it also gave a
correct test run w/o errors.

I always test my claims, or they are at least based on previous experience;
in other words I don't intentionally mislead people, at least on public
forums. I did not intend any insult or other harm to anyone. One arguing
thread at a time is sufficient for me (you cannot have failed to notice the
line drawing thread, which it is a great shame that I allowed myself to be
led into).


> So you see that your version is more complicated.

It is more complicated code; there is no disagreement about that. There are
three bitwise shifts and various additions to go with it; I'm quite aware of
this. My only claim is that the versions of compilers I have mainly been
working with in the past years favour the more complex code over the simpler
one, because integer division is not a very fast instruction on Intel's x86
implementations.

This gives me the impression that you are running on an AMD-based system --
I don't claim to know this, it's only a guess -- which is a very smart choice
of processor to build your desktop system around, in case you did.


> Since the various partial results all need the shift unit of the Pentium
> and there are dependencies, this does not schedule so well.

I only claimed the bottom line on my box. I also found this to be true on a
PentiumII and a Pentium 166 too many years ago to remember. It comes down to
whether the division is faster than the more complex code or not. I am not
disputing that this could easily be the case for a very large number of
people.

When one begins to claim his code is the fastest, one is on very thin
ice - this of course is entirely platform dependent, which is why I thought
alternative means to get to the same result, KNOWN to be faster on SOME
systems, were worth mentioning, at least.


> Was a typo of mine. I would not trust your version, though, before
> having it run through a program checking all values of x from 0 to
> 65535. Whereas my version is self-evidently correct.

Correction to my earlier statement: it has been found to be working for the
range from 0 to 65152 (255*255+127). The code is easy enough to verify by
compiling and running it. I definitely would NOT have posted if I wasn't
absolutely sure that the code is correct.


> Then, as mentioned above, you should special-case your code just for
> that particular compiler and take care that the better optimizable
> and plain code gets compiled in all other situations.

Indeed, but the above was not a retail application or an attempt at one. It
was just code presenting an alternative method of doing division by a
constant value for numbers in a certain range. When you actually begin to
optimize for various platforms is the time when all information is useful.


> Oh, BTW, you would in most cases be radically better off to use MMX
> instructions if you knew your target to support that. Not for a
> single pixel, of course, but for an array of them.

Of course. You can download the Prophecy SDK from www.liimatta.org; look into
src/prcore/x86/ for _alphablend.asm for blending implementations for BeOS,
Linux and Windows. The ASM code is compilable with NASM, and supports
linkage for all the mentioned platforms. It's meant for blending rectangular
blocks of data in 32-bit pixel format only.

As you might notice, the linkage pollutes the global namespace; we could have
managed with one third of the pollution if we had decorated the methods
with a leading or trailing underscore on the C/CPP side of the package.
Tradeoffs, as they say. ;-)

With more complex code, when you are writing inner loops for, say, a software
rasterizer, you want to integrate the blending into the actual inner loops.
More passes is always a bad thing. I could go on and on about what kind of
solutions we have in our software rasterizers, but that would be a whole
other story.. ;-)

The CPP fallback code in the SDK is actually quite horrible for the
blending. We simply assume that a Pentium MMX is the minimum system required
and that's that. The code *will* work on older systems as well, but the rate
at which things happen is not a consideration at all -- just that the code
won't bomb out is enough.

Where is this code used? It's used in texture generation, for example in
landscape rendering. This could be done within the 3D accelerator, IF
render-to-texture is supported. But we needed a software fallback route, and
that's it. The code is also quite handy for drawing anti-aliased fonts and
that sort of activity. FreeType and the 2D block blending code are a perfect
match, www.freetype.org -- recommended checking out.


--
Jukka

Jukka Liimatta

Jan 6, 2002, 4:59:06 PM
> c =[(S-A)*c1 + A*c2] Div 256
>
> No division - simple shift.
> S=128 for 16-bit signed integer, else any reasonable value.
> For 32-bit integer, S can me made much larger.

That is very obvious; the point was that for some this is too imprecise: a
±1 error in the final color out of 256 different combinations is too much. I
don't know how many layers they are blending and allowing the error to
accumulate, but in response to this criticism I developed the add-shift
method of doing "division by 255" a while ago.

I'm using the shift-right-by-eight method myself, since the loss of color
resolution doesn't matter to me. The sorts of errors that DO matter are
incorrect scan conversion (flickering pixels between triangles), under- or
overflow on gradients (what's nicer than getting white pixels where you
should get black?) and that sort.

IMHO, your suggestion is very prudent for most of the time. ;-)


[ other considerations ]

Why not make the multiplier 256? Then the division would be correct. The
reason against that is that the result must be (2^n)-1, for example 0xff
for byte-per-component pixel formats. This would overflow, unless we have
cheap saturation, which we do not have in C or C++. Just another
point of view on the problem..

( we actually DO have a relatively cheap way of saturating multiple
components at the same time, but it is quite a bit more complex than just
using "valid" ranges to begin with ).


--
Jukka

David Kastrup

Jan 6, 2002, 5:42:41 PM
"Jukka Liimatta" <jukka.l...@twilight2d.com> writes:

> True, the test ran i from 0 to 255*255, but the x was i+127, so actually the
> range being tested was: [127,255squared+127] -- I also separately tested
> [0,255*255] on other occasion (w/o the rounding) and it also gave correct
> testrun w/o errors.
>
> I always test my claims or they are atleast based on previous experience, in
> otherwords I don't intentionally mislead people, atleast on public forums. I
> did not intend no insult or other harm to anyone. One arguing thread at a
> time is sufficient for me (you must have not failed to notice the line
> drawing thread, which is a great shame that I allowed myself to be lead).
>
>
> > So you see that your version is more complicated.
>
> It is more complicated code, there is no disagreement about that. There are
> three bitwise shifts and various additions to go with it, I'm quite aware of
> this. My only claim is that the versions of compilers I been mainly working
> with in the past years favour the more complex code over the simpler one,
> because the integer division is not a very fast instruction on Intel's x86
> implementations.
>
> This gives me impression that you are running on AMD based system-- I don't
> claim that I do know this, only a guess - which is a very smart choise for
> processor to build your desktop system around to in case you did.

You have not been paying attention at all, have you? You have not
looked at the generated code, have you? gcc does not issue a division
instruction for the division by 255. It issues code roughly
equivalent to your arithmetic, only that it uses a single
multiplication where you use several shifts and adds, and the value
range for which its operations are correct is larger.

Since you have been missing the point in spite of my many posts
explaining this, here it is again: gcc does not generate a divide
instruction for a division by 255, it issues code using multiplication
and shift instead. It does this when I tell it to optimize for
Pentium.

> I only claimed the bottom line on my box. I also found this to be
> true on PentiumII and Pentium166 too many years ago to remember. It
> comes down to if the division is faster than the more complex code
> or not.

I repeat again: the code gcc generates does not do division. It does
a multiplication and a shift. This code is faster than division, the
optimizer knows this and uses it for that reason. It is primarily the
compiler writer's task to know what implementation of arithmetic
involving constants takes up what amount of time, and optimize
accordingly, taking the desired target into account. In that respect,
the gcc writers have done their homework. Your compiler writers
seemingly have failed to do so.

> When one begins to claim his code is the fastest, you are on a very
> thin ice - this ofcourse is entirely platform dependant,

Which is the very reason the compiler is the right tool for making the
choice. It is the tool that has to take all of the idiosyncrasies of
the target processor into account.

> and why I thought alternative means to get to the same result KNOWN
> to be faster on SOME systems are worth mentioning, atleast.

They are worth mentioning, but it is also worth mentioning that, given
a good compiler, they might actually decrease performance.

Such attempts at optimization should only be attempted when all other
measures have been tried, they should be made exclusively for a
certain platform and compiler version, their equivalence needs to be
proven, and benchmarks have to be employed in order to make sure that
they are indeed an improvement.

You, however, have stated your case as if your code generally was to
be preferred.

David Kastrup

Jan 6, 2002, 5:45:24 PM
hoff...@fho-emden.de (Gernot Hoffmann) writes:

> It has been overlooked, that "255" is merely a scale faktor S .

By whom?

> a= 0..1 Blend factor

It has been overlooked by some smart aleck, that a is in the range
0..255, not 0..1.



> c= (1-a)*c1+a*c2 New color
>
> c= [S*(1-a)*c1+S*a*c2]/S Scaled
>
> S=256 Example

It has been overlooked by somebody that S=255 was required.

> A=S*a
>
> c =[(S-A)*c1 + A*c2] Div 256
>
> No division - simple shift.

Please adapt your code to a=0..255, then come back. Oh, by the way,
somebody has forgotten about proper rounding.

> S=128 for 16-bit signed integer, else any reasonable value.
> For 32-bit integer, S can me made much larger.

--

Jukka Liimatta

Jan 6, 2002, 7:52:16 PM
> You, however, have stated your case as if your code generally was to
> be preferred.

Untrue; only over division, if division actually occurs, is slower,
and matters. There are many parameters.


--
Jukka


Jukka Liimatta

Jan 6, 2002, 8:52:51 PM
> You, however, have stated your case as if your code generally was to
> be preferred.

Alright, now I know better.


www.liimatta.org/misc/alpha.cpp


Compiled on Mandrake 8.1, g++ 2.96,

g++ -fomit-frame-pointer -O3 alpha.cpp

Results for me, Pentium II 400 MHz SMP,

division runtime: 140 ms
alternative runtime: 80 ms


I didn't check the compiler output, but at least on my SMP box the
"alternative" wins out. What compiler switches do I need to throw at
g++ 2.96 for the "division runtime" test to win?


--
Jukka


Stephane Brousseau

Jan 6, 2002, 10:28:07 PM
Shifts must be avoided on Pentium IV and up.

But more importantly, production applications actually use a 256*256
look-up table instead of multiplications/divisions (clamping can also
be precalculated in these tables).

Finally, even version 6.0 of what you later refer to as a 'shitty
compiler' detects the constant in the for(;;) loop in your test code
below and removes the loop completely when optimizations are on. So I
don't know how you could have run the benchmark exactly as written
below.
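
The look-up-table approach mentioned above can be sketched as follows.
This is an editorial illustration, not code from a shipping product;
the names are mine. tab[a][v] precomputes (a*v + 127)/255, so a blend
becomes two table reads and an add:

```cpp
#include <cstdint>

// 64K table: tab[a][v] = (a*v + 127)/255, precomputed once.
static uint8_t tab[256][256];

void initTab() {
    for (int a = 0; a < 256; ++a)
        for (int v = 0; v < 256; ++v)
            tab[a][v] = static_cast<uint8_t>((a * v + 127) / 255);
}

// Blend one channel: c1 weighted by (255-a), c2 weighted by a.
// May differ by 1 LSB from the single-expression rounded form,
// because each term is rounded separately before the add.
int blendLut(int a, int c1, int c2) { return tab[255 - a][c1] + tab[a][c2]; }
```

Clamping or other per-channel post-processing can be folded into the
same table at no extra per-pixel cost.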

"Jukka Liimatta" <jukka.l...@twilight2d.com> wrote in message
news:a18f6g$cdl$1...@news.kolumbus.fi...

Jukka Liimatta

Jan 6, 2002, 10:54:48 PM
> shifts must be avoided on Pentium IV and up.

Yes, because the Pentium 4 has only one barrel shifter. The ALU itself
is double-pumped, meaning a 1.7 GHz CPU's ALU runs at 3.4 GHz -- quite
a few bit shifts and additions flow through in the time it takes to do
an integer division.


> But more importantly, production application actually use a 256*256
> look-up table instead of multiplications/divisions. (clamping can
> also be precalculated in these tables)

True, if the calculation is non-trivial enough to warrant a
64K * sizeof(T) lookup table and its initialization. From experience I
know a lookup table to be actually quite efficient for this sort of
application, so it's definitely one of the things a production
application would try.

I did however present an alternative to / 255 with shifting and
addition operations only. A production application weighs different
alternatives and doesn't claim any is the winner unless it is clearly
defined which x86 implementation we are going to be running on, for
example.


> Finally, even version 6.0 of what you later refer to as a 'shitty
> compiler' detects the constant in the for(;;) loop in your test code
> below and removes the loop

I am compiling with Microsoft Visual C++ 7.0 Beta2, actually. The
"shitty" was in quotes because it was not a term I gave the compiler.


> completely when optimizations are on. So I don't know how you could
> have run the benchmark exactly as written below.

Clearly the compiler did not strength-reduce the calculation into
nothing, since there seemed to be a three-fold difference in the
runtimes of the two alternatives. Writing into "sum", and the cout of
the sum after the loop, forced the code to be generated. The compiler
did not seem to evaluate the constant number of iterations of the loop
and the resulting value at compile time, even though it had every
opportunity to do so (all the constants were visible in the same
translation unit).

The /Faout.txt output also seems to agree that the code was generated.

So I don't see any reason why I should not trust the results.


Just in case, because I do take constructive criticism into account, I
moved the code to be timed into another translation unit; here's the
code:

int test1(int count)
{
    int sum = 0;
    for ( int j = 0; j < count; ++j )
        for ( int i = 0; i < 255*255; ++i )
        {
            int x = i + 127;
            int v = x / 255;                        // plain division
            sum += v;
        }
    return sum;
}

int test2(int count)
{
    int sum = 0;
    for ( int j = 0; j < count; ++j )
        for ( int i = 0; i < 255*255; ++i )
        {
            int x = i + 127;
            int v = ((x<<8) + x + (x>>7)) >> 16;    // shift-and-add alternative
            sum += v;
        }
    return sum;
}


This time the results are:

562 ms
160 ms

The assembly outputs were:

-------------
test1
-------------
xor ecx, ecx
$L278:

; 7 : {
; 8 : int x = i + 127;
; 9 : int v = x / 255;

lea esi, DWORD PTR [ecx+127]
mov eax, -2139062143 ; 80808081H
imul esi
add edx, esi
sar edx, 7
mov eax, edx
shr eax, 31 ; 0000001fH
add edx, eax

; 10 : sum += v;

add edi, edx
inc ecx
cmp ecx, 65025 ; 0000fe01H
jl SHORT $L278

-------------
test2
-------------
xor edx, edx
mov ecx, 32639 ; 00007f7fH
$L292:

; 22 : {
; 23 : int x = i + 127;
; 24 : int v = ((x<<8) + x + (x>>7)) >> 16;

lea edi, DWORD PTR [edx+127]
sar edi, 7
add edi, ecx
sar edi, 16 ; 00000010H

; 25 : sum += v;

add eax, edi
add ecx, 257 ; 00000101H
inc edx
cmp ecx, 16744064 ; 00ff7e80H
jl SHORT $L292
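
The claim that the shift-and-add form in test2 reproduces x / 255 over
the whole tested range can be verified exhaustively. This check is an
editorial sketch (the function name is mine), not part of the original
benchmark:

```cpp
// Exhaustively compare x/255 against ((x<<8) + x + (x>>7)) >> 16
// for every x the benchmark feeds in, i.e. 0 .. 255*255 + 127.
bool shiftFormMatches() {
    for (int x = 0; x <= 255 * 255 + 127; ++x) {
        int div   = x / 255;
        int shift = ((x << 8) + x + (x >> 7)) >> 16;
        if (div != shift) return false;
    }
    return true;
}
```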


.... conclusions ....

The "shitty compiler" seems to use a trick similar to g++'s to avoid
the division, yet that code runs at roughly one quarter of the rate of
the alternative code. Explain or spin this any way you like; this time
the code was in a different translation unit and global optimizations
were switched OFF, so there was no way the declarations could have been
leaking from a different unit into the one being compiled.

But that is HARDLY the point. The point was to present alternative ways
to get the same result. The code undeniably works, since all cases
between 0 and 255*255+127 are tested and found to give the same answer.

I'm surprised that I have to "defend" the trick in any way at all. It
should just be an interesting "tidbit" of code, a curiosity, nothing
more. But if I claim it was 3-fold faster, I'm prepared to defend my
integrity at least, since that at least still means something to me.

I do not lie. I do not exaggerate. I make mistakes like anyone else
does, but my further trials, with separate translation units used as
the medium through which I did the timing, showed that the difference
was actually 4-fold, not 3-fold, so I was actually being overly modest.

The REAL question would be whether such "toy code" is of any value to
measure at all. The very same question I asked in another thread on one
of the groups listed in the message header.

Chill out! Christ! ;-)


--
Jukka

Jukka Liimatta

Jan 6, 2002, 11:32:47 PM
> That's junk. You need to divide by 255, and that does not take
> rounding into account and processors in general truncating towards 0
> instead of -infinity.

Btw. this is also possible. What's the difference between

x/256
x/255

It's a scale factor of 256/255 = 1.003921568....

Another alternative, obviously, is to multiply by a scaled reciprocal
and THEN shift down:

int v0 = x / 255;
int v1 = (int)(((unsigned)x * 65793u + 255u) >> 24); // 65793 = ceil(2^24/255)

The more bits we can spare, the more accurate the result is. Rounding
to nearest can still be added to x before this calculation. This is the
alternative both g++ 2.96 and MSVC++7 seem to implement in practice,
looking at the assembly listings.

How this performs against "alternative #1" is debated enough elsewhere
in this thread and is actually the least interesting thing overall. I'm
just interested in bringing alternatives to /255 to light. This one is
barely worth mentioning, since g++ and MSVC++7 both seem to employ this
optimization already, but it's included for interest's sake. I hope I
don't sound too arrogant or annoying, which is the only explanation I
can find for the amount of replies I'm getting.


--
Jukka

douglas_linder

Jan 7, 2002, 2:55:49 AM
Whoa.

It's not like this is a life-and-death thread. Take 5 and have a
breather before you post, please. We're straying toward
not-so-very-nice territory in your last few posts.

Just chill and be cool, man.

Later,
Doug.

David Kastrup

Jan 7, 2002, 7:04:33 AM
"Jukka Liimatta" <jukka.l...@twilight2d.com> writes:

First, I used gcc-3.0.3. Second, please use the code I have presented
to you. You use signed integers where unsigned chars would be needed
in order to clue the compiler in about what ranges to expect. This is
unfair since your code also requires restricted ranges of input.

Third, you should specify -march=i686 as a compiler option.

Gernot Hoffmann

Jan 7, 2002, 8:20:20 AM
jaso...@yahoo.com (Jason Victor) wrote in message news:<dd2b07b0.01123...@posting.google.com>...
> Hello.
> If I have two colors (hex values) each an unsigned long, and I want to
> find a color that will "blend" them (e.g. I have black and white I
> want gray) how do I do this?
> I'm pretty sure that it is the average of their respective R G and B
> values, but I don't have those, just the hex color (0 for black,
> 0xffff for white, and i think 0x7bef for gray).
> I'm programming in Xlib if that matters. Thanks for your help.
> -j

As far as I have read the thread, the gamma problem was not discussed.
Correct blending in image processing requires this workflow
(cg is a correction gamma value, a=0..1 the blend factor):

c1=c1^cg
c2=c2^cg
c =(1-a)*c1+a*c2
c =c^(1/cg)

Even if we blend nothing, e.g. a=1 or a=0, the forward/reverse
conversion through byte lookup tables creates, for cg=2.2, a loss of
28% of the 256 codes. Some steps differ by 3 or 4 LSBs.
The rounding for the tables was done correctly in floating point.
For me, considerations about rounding errors from the division etc.
are therefore not so relevant.
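
The gamma-corrected workflow above can be sketched for a single 8-bit
channel as follows. This is an editorial illustration; the function
name and the use of double precision are my choices, not from the post:

```cpp
#include <cmath>

// Blend two 8-bit channel values c1, c2 with blend factor a in 0..1:
// decode with correction gamma cg, blend linearly, re-encode, round.
int gammaBlend(int c1, int c2, double a, double cg = 2.2) {
    double l1 = std::pow(c1 / 255.0, cg);               // c1 = c1^cg
    double l2 = std::pow(c2 / 255.0, cg);               // c2 = c2^cg
    double l  = (1.0 - a) * l1 + a * l2;                // blend in linear light
    return (int)(std::pow(l, 1.0 / cg) * 255.0 + 0.5);  // c = c^(1/cg)
}
```

Note the practical difference: blending black and white at a=0.5 this
way gives about 186, not the naive average 128, which is exactly why
the gamma step matters perceptually.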

The correction gamma is not necessarily cg=2.2 for a gamma=2.2
calibrated system. Lower values deliver better results on the monitor.
http://www.fho-emden.de/~hoffmann/gamquest18102001.pdf (730 kBytes).
It also contains some diagrams which show the quantization.

Best regards --Gernot Hoffmann

Jukka Liimatta

Jan 7, 2002, 10:37:55 AM
> First, I used gcc-3.0.3.

Now you tell me. Well, I'll stick to 2.96 - and you tell me how the
code runs on your setup (which you still haven't described; we don't
even know if it is x86 at all).


> Second, please use the code I have presented to you. You use signed
> integers where unsigned chars would be needed in order to clue the
> compiler in about what ranges to expect. This is

8bit * 8bit = 16bit required for the result to fit in

The above leaves me little option to use an 8-bit x. I tried 16-bit
variables, but those were slower than 32-bit ones on the architecture
and compilers I am using (x86).

So let's evaluate:

8bit - doesn't fit the result of modulation
16bit - fits, slower
32bit - fits, faster

I don't see how my test code was unfair.


> unfair since your code also requires restricted ranges of input.

What do you mean, "unfair"? For alpha blending we are doing nothing but
component*scale, where both are indeed unsigned char -- so the
restricted range of 0..~64K is fair in my opinion.

The alternatives I presented DO work. That hasn't been a question for a
long time now. I haven't claimed they are the fastest in all situations
either; I ONLY stated that they were on a Pentium 4 1.7 GHz compiled
with MSVC++7. Now I have also added a Pentium II 400 MHz SMP with g++
2.96 to the list. Nothing more or less; I don't claim to be fair or
anything else.

Those are just simple facts.


> Third, you should specify -march=i686 as a compiler option.

As you wish.

division: 120 ms
alternative: 90 ms


--
Jukka

Patrick I Taylor

Jan 7, 2002, 1:46:52 PM

Someone just told you how to get the individual RGB values
out of the hex value.

The process is:

convert the 2 hex values to six individual values: cRed1 cGreen1 cBlue1,
cRed2 cGreen2 cBlue2
alpha blend the values to cRed3 cGreen3 cBlue3
convert back to a hex value and use however.

example:

cRed1   = (hColor >> 11) & 0x1F;   /* top 5 bits, same as trunc(hColor / 0x800) */
cGreen1 = (hColor >> 5)  & 0x3F;   /* middle 6 bits */
cBlue1  =  hColor        & 0x1F;   /* low 5 bits */
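
The full round trip described above, specialized to a 50/50 blend, can
be sketched like this. The function name is mine and this is an
illustration of the process, not code from the post:

```cpp
// Average two 5-6-5 pixels channel by channel: unpack, blend, repack.
unsigned blend565(unsigned p1, unsigned p2) {
    unsigned r = (((p1 >> 11) & 0x1F) + ((p2 >> 11) & 0x1F)) >> 1;
    unsigned g = (((p1 >> 5)  & 0x3F) + ((p2 >> 5)  & 0x3F)) >> 1;
    unsigned b = (( p1        & 0x1F) + ( p2        & 0x1F)) >> 1;
    return (r << 11) | (g << 5) | b;
}
```

Blending black (0x0000) and white (0xFFFF) this way yields 0x7BEF, the
gray value the original poster mentioned.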

Pt

gordy

Jan 15, 2002, 1:41:38 PM
I don't understand how this thread got so long. It seems someone is
complaining about the least-significant bit being wrong in some cases
due to dividing by 256 vs. dividing by 255. The whole point of
adjusting alpha to be in the range 0..256 is to be able to use a binary
shift for the divide. With this method you can alpha-blend more than
one pixel at a time, and with MMX up to 8 16-bit pixels per iteration.
It's a very efficient and accurate method for alpha-blending two
bitmaps - I've used it for years.


Jukka Liimatta

Jan 15, 2002, 2:41:25 PM

We know, but if the source alpha is in the input(s), it's a byte, and can
only store values between [0,255] so.. ;-)


--
Jukka


David Kastrup

Jan 16, 2002, 11:50:43 AM
"gordy" <gf...@home.com> writes:

When you have an alpha plane, its values are only from 0..255. In
addition, you want to have some sort of linearity and symmetry. The
following properties are nice to have (notation should be obvious from
equations):

0[b,c] = b
255[b,c] = c
a[255-b, 255-c] = 255-a[b,c]
(255-a)[b,c] = a[c,b]

The equation (divide by truncation)
a[b,c] = (a c + (255-a) b + 127)/255
provides all that.

However, so does
a[b,c] = (a c + (256-a) b + (255+c-b)/2)/256
or the equivalent
a[b,c] = (a c + (256-a) b + (256+c-b)/2)/256

Proving the equivalence of the last two forms is a nice exercise...
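
The first form, with divisor 255 and rounding, can be checked against
all four stated properties exhaustively over 8-bit inputs. This is an
editorial sketch (function names are mine):

```cpp
// The rounded blend from the post: a[b,c] = (a*c + (255-a)*b + 127)/255.
int blend(int a, int b, int c) { return (a * c + (255 - a) * b + 127) / 255; }

// Verify the four properties for every combination of 8-bit a, b, c:
//   0[b,c] = b,  255[b,c] = c,
//   a[255-b, 255-c] = 255 - a[b,c],  (255-a)[b,c] = a[c,b].
bool propertiesHold() {
    for (int b = 0; b < 256; ++b)
        for (int c = 0; c < 256; ++c) {
            if (blend(0, b, c) != b) return false;
            if (blend(255, b, c) != c) return false;
        }
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            for (int c = 0; c < 256; ++c) {
                int v = blend(a, b, c);
                if (blend(a, 255 - b, 255 - c) != 255 - v) return false;
                if (blend(255 - a, b, c) != blend(a, c, b)) return false;
            }
    return true;
}
```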

0 new messages