
How to use SSE with OWC without intrinsics


Heiko Nitzsche

Sep 9, 2010, 6:41:09 PM
I'm trying to keep the feature set of a multi-platform library
consistent across multiple compilers (specifically OWC, MSC, GCC).

After adding SSE support for MSC and GCC using SSE intrinsics,
I now face the problem that OpenWatcom still has only MMX
intrinsic support. I'm not really familiar with x86 assembler
programming, but I at least got the CPU SSE support detection
working via CPUID and inline assembler.

Is there a way to use SSE with the OpenWatcom C compiler other
than assembler? I just need equivalents for these four intrinsics:
_mm_set_ps
_mm_load1_ps
_mm_mul_ps
_mm_add_ps

It would be really nice if someone who has already solved this
problem could share the results.

Many thanks.

Uwe Schmelich

Sep 10, 2010, 10:01:44 AM
Heiko Nitzsche wrote:

> I'm trying to keep the feature set of a multi-platform library
> consistent across multiple compilers (specifically OWC, MSC, GCC).
>
> After adding SSE support for MSC and GCC using SSE intrinsics,
> I now face the problem that OpenWatcom still has only MMX
> intrinsic support. I'm not really familiar with x86 assembler
> programming, but I at least got the CPU SSE support detection
> working via CPUID and inline assembler.
>
> Is there a way to use SSE with the OpenWatcom C compiler other
> than assembler? I just need equivalents for these four intrinsics:
> _mm_set_ps
> _mm_load1_ps
> _mm_mul_ps
> _mm_add_ps

You will need to write your own #pragma aux macros in a similar spirit to
the ones in mmintrin.h.

e.g.:
extern __m128* _mm_mul_ps(__m128* dst, __m128* src);
#pragma aux _mm_mul_ps = \
    ".686" \
    "movaps xmm0, [eax]" \
    "mulps xmm0, [edx]" \
    "movaps [eax], xmm0" \
    parm [eax] [edx] \
    value [eax] \
    modify exact [];

However, a general problem with the OpenWatcom approach remains: it is
not very fast, because every parameter goes into and out of each macro
through a pointer. There is no inter-macro optimization like in the
Intel compiler or GCC, which pass parameters by value. We would need
compiler support to make it fast.
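
For illustration, a call site for such a macro could look like this
(the struct definition and the in-place destination convention here
are my assumptions, not tested code):

typedef struct { float m128_f32[4]; } __m128;

__m128 a = {{1.0f, 2.0f, 3.0f, 4.0f}};
__m128 b = {{0.5f, 0.5f, 0.5f, 0.5f}};

_mm_mul_ps(&a, &b); /* in place: a.m128_f32[i] *= b.m128_f32[i] */

Note that movaps requires both operands to be 16-byte aligned, which a
plain struct does not guarantee.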

Regards
Uwe

Heiko Nitzsche

Sep 11, 2010, 2:52:13 PM
> You will need to write your own #pragma aux macros in a similar spirit to
> the ones in mmintrin.h.
>
> e.g.:
> extern __m128* _mm_mul_ps(__m128* dst, __m128* src);
> #pragma aux _mm_mul_ps = \
>     ".686" \
>     "movaps xmm0, [eax]" \
>     "mulps xmm0, [edx]" \
>     "movaps [eax], xmm0" \
>     parm [eax] [edx] \
>     value [eax] \
>     modify exact [];
>
> However, a general problem with the OpenWatcom approach remains: it is
> not very fast, because every parameter goes into and out of each macro
> through a pointer. There is no inter-macro optimization like in the
> Intel compiler or GCC, which pass parameters by value. We would need
> compiler support to make it fast.

Thanks!

I tried the following style and it works fine, but it is awfully
slow, even slower than the FPU version:

typedef struct
{
    float m128_f32[4];
} __m128;

#define _mm_mul_ps(_A, _B, _R) \
{ \
    __asm \
    { \
        __asm movups xmm0,_A \
        __asm movups xmm1,_B \
        __asm mulps  xmm0,xmm1 \
        __asm movups _R,xmm0 \
    } \
}
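
Usage differs from the real intrinsic because the result is an output
parameter; a call would look like this (variable names made up):

__m128 va, vb, vr;
/* ... fill va and vb ... */
_mm_mul_ps(va, vb, vr); /* vr = va * vb, element-wise */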

This is the fastest version I arrived at after several attempts.
The code is now fully inlined but still about 10-15% slower than
the FPU version. With MSC and GCC I gained about 20-30%. So the
difference between OWC and MSC is huge.

I think the main problem is that I have not figured out how to
align the __m128 struct with OpenWatcom, and thus I can only use
movups instead of movaps. If the struct could be 16-byte aligned,
code like you proposed would be more efficient:

#define _mm_mul_ps(_A, _B, _R) \
{ \
    __asm \
    { \
        __asm movups xmm0,_A \
        __asm mulps  xmm0,_B \
        __asm movups _R,xmm0 \
    } \
}

I even tried to merge the individual operations:

#define mm_calc(a, r, g, b, weightp, bgra, _R) \
{ \
    const float *weightf = weightp; \
    (_R).m128_f32[0] = (b); \
    (_R).m128_f32[1] = (g); \
    (_R).m128_f32[2] = (r); \
    (_R).m128_f32[3] = (a); \
    __asm \
    { \
        __asm movups xmm2,bgra \
        __asm movups xmm1,weightf \
        __asm movups xmm0,_R \
        __asm mulps  xmm0,xmm1 \
        __asm addps  xmm0,xmm2 \
        __asm movups _R,xmm0 \
    } \
}

Any idea how to force the alignment of just the __m128 struct?

Wilton Helm

Sep 12, 2010, 2:27:04 PM
Take a close look at #pragma aux. It has two major advantages compared
with #define __asm:
1. __asm causes the compiler to flush any knowledge of anything kept in
   registers, which wreaks havoc with optimization.
2. #pragma aux cooperates with optimization. The compiler tries to
   arrange things so that the parameters coming into and out of the
   pragma are already in the specified registers, so they don't have to
   be moved back and forth.
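
A minimal sketch of the second point (my own example, untested): since
the parameter and result registers are declared, the optimizer can keep
x in EAX across the call, with no spill to memory:

static int add_one(int x);
#pragma aux add_one = \
    "inc eax" \
    parm [eax] \
    value [eax] \
    modify exact [];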

Wilton


Uwe Schmelich

Sep 12, 2010, 6:33:52 PM
Heiko Nitzsche wrote:

> I tried the following style and it works fine, but it is awfully
> slow, even slower than the FPU version:

It may depend on your CPU too. If you have an AMD K8, then the gain
from using SSE compared to the x87 FPU is not that big. Intel helped a
bit to make SSE look better by making its own FPU a bit slower in some
processors.

>
> This is the fastest version I arrived at after several attempts.
> The code is now fully inlined but still about 10-15% slower than
> the FPU version. With MSC and GCC I gained about 20-30%. So the
> difference between OWC and MSC is huge.

Because of the parameter passing via pointers in OW and the missing
compiler support for SSE optimization, you should try to do a full
block of work (perhaps a pixel line) in the pragma to minimize this
overhead if you really need speed.
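
As a rough, untested sketch of what I mean (the register assignment and
the local label inside the in-line code are assumptions):

/* dst[4k+j] += w[j] * src[4k+j]; assumes n >= 1                 */
/* eax = dst, edx = src, ebx = 4-float weight vector, ecx = n    */
static void mm_mul_add_row(float *dst, const float *src,
                           const float *weight4, unsigned n);
#pragma aux mm_mul_add_row = \
    ".686" \
    "movups xmm1,[ebx]" \
    "L1: movups xmm0,[edx]" \
    "mulps xmm0,xmm1" \
    "movups xmm2,[eax]" \
    "addps xmm2,xmm0" \
    "movups [eax],xmm2" \
    "add eax,16" \
    "add edx,16" \
    "dec ecx" \
    "jne L1" \
    parm [eax] [edx] [ebx] [ecx] \
    modify exact [eax edx ecx];

This way the pointer set-up cost is paid once per row instead of once
per pixel.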

>
> I even tried to merge the individual operations:
>
> #define mm_calc(a, r, g, b, weightp, bgra, _R) \
> { \
> const float *weightf = weightp; \
> (_R).m128_f32[0] = (b); \
> (_R).m128_f32[1] = (g); \
> (_R).m128_f32[2] = (r); \
> (_R).m128_f32[3] = (a); \
> __asm \
> { \
> __asm movups xmm2,bgra \
> __asm movups xmm1,weightf \
> __asm movups xmm0,_R \

Depending on your cpu you may have a store-to-load-forwarding problem here
too. Or at least you would have it with movaps.


> __asm mulps xmm0,xmm1 \
> __asm addps xmm0,xmm2 \
> __asm movups _R,xmm0 \
> } \
> }

On newer CPUs like the K10 or i7, movups should be more or less as fast
as movaps.


> Any idea how to force the alignment of __m128 struct only?

That's relatively easy.

#define ALIGN 16L
/* dynamically allocated: */
char* cbuf = (char*)malloc(sizeof(type) + ALIGN - 1);
type* tptr = (type*)((((ptrdiff_t)cbuf) + ALIGN - 1) & ~(ALIGN - 1));

If you are on the stack (local vars) and don't want to use alloca (not
C89), you should use something like:
char cbuf[sizeof(type) + ALIGN - 1];

For globals you could use:
static char cbuf[sizeof(type) + ALIGN - 1];


If you have more than one var to align, the easiest way may be to put
them all in one struct.
If you need it to be more programmer-friendly, create some preprocessor
macros, e.g. as sketched below.
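
For example (helper names made up, following the same rounding trick;
needs <stddef.h> for ptrdiff_t):

#define ALIGN 16L
#define DECL_ALIGNED(type, name) \
    char name##_raw[sizeof(type) + ALIGN - 1]; \
    type * const name = (type*)((((ptrdiff_t)name##_raw) \
                                 + ALIGN - 1) & ~(ALIGN - 1))

/* usage: vec is a 16-byte aligned __m128* into raw storage */
DECL_ALIGNED(__m128, vec);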

There are also some words on this in the AMD optimization manual
(40546.pdf) under "Dynamic Memory Allocation Consideration" (chapter 2.7).

Regards
Uwe


Heiko Nitzsche

Sep 13, 2010, 6:17:47 PM
Thanks to all!

But I think I'll give up on the SSE topic with OWC. It looks like
the issue cannot be resolved with reasonable effort. Rewriting the
whole algorithm in assembler is not an option, as it is not trivial.
Also, I had much bigger gains with parallel processing; SSE was just
a nice idea for an additional improvement. Interestingly, with 64-bit
MSC manual SSE coding seems unnecessary for many algorithms, as the
compiler-generated code easily reaches the same performance, maybe
also because there are twice as many registers in x64 mode.

Finally I tried with pragma aux and 16-byte alignment and ended up
with the following. I hope it helps others, though I'm not sure it's
really worth it from a performance point of view. Note that _mm_set_ps
differs slightly from the standard intrinsic and is tailored to the
specific use case:

typedef struct
{
    float m128_f32[4];
} __m128;

/* ---------------------- */
#pragma aux __mm_binary1 = parm [eax] value[esi] modify exact []
#pragma aux __mm_binary2nr = parm [eax] [edx] modify exact []
#pragma aux __mm_binary2 = parm [eax] [edx] value[esi] modify exact []
#pragma aux __mm_binary4 = parm [eax] [ebx] [ecx] [edx] value[esi] modify exact []
/* ---------------------- */
static __m128 _mm_set_ps(int __D, int __C, int __B, int __A);
#pragma aux (__mm_binary4) _mm_set_ps = \
".686" \
"cvtsi2ss xmm3,eax" \
"cvtsi2ss xmm2,ebx" \
"cvtsi2ss xmm1,ecx" \
"cvtsi2ss xmm0,edx" \
"unpcklps xmm0,xmm2" \
"unpcklps xmm1,xmm3" \
"unpcklps xmm0,xmm1" \
"movups [esi],xmm0"
/* ---------------------- */
static __m128 _mm_load1_ps(float const *__V);
#pragma aux (__mm_binary1) _mm_load1_ps = \
".686" \
"movss xmm0,[eax]" \
"shufps xmm0,xmm0,0" \
"movups [esi],xmm0"
/* ---------------------- */
static void _mm_store_ps(float *__V, __m128 *__m);
#define _mm_store_ps(__V, __m) _mm_store_ps((__V), &(__m))
#pragma aux (__mm_binary2nr) _mm_store_ps = \
".686" \
"movaps xmm0,[edx]" \
"movaps [eax],xmm0"
/* ---------------------- */
static __m128 _mm_add_ps(__m128 *__m1, __m128 *__m2);
#define _mm_add_ps(__m1, __m2) _mm_add_ps(&(__m1), &(__m2))
#pragma aux (__mm_binary2) _mm_add_ps = \
".686" \
"movaps xmm1,[eax]" \
"addps xmm1,[edx]" \
"movups [esi],xmm1"
/* ---------------------- */
static __m128 _mm_mul_ps(__m128 *__m1, __m128 *__m2);
#define _mm_mul_ps(__m1, __m2) _mm_mul_ps(&(__m1), &(__m2))
#pragma aux (__mm_binary2) _mm_mul_ps = \
".686" \
"movaps xmm1,[eax]" \
"mulps xmm1,[edx]" \
"movups [esi],xmm1"
/* ---------------------- */
static gbm_boolean isSupported_SSE(void)
{
    unsigned long _edxreg = 0;
    __asm
    {
        __asm mov eax,1
        __asm cpuid
        __asm mov _edxreg,edx
    }
    /* CPUID function 1: EDX bit 25 indicates SSE support */
    if (_edxreg & 0x02000000)
    {
        return GBM_TRUE;
    }
    return GBM_FALSE;
}

/* ---------------------- */
/* ---------------------- */
...
#pragma pack(16)
typedef struct
{
    __m128 weightf_bgra;
    __m128 weightf_mul;
    __m128 weightf;
    __m128 data;
    float  results[4];
} SSE_VALUES;
#pragma pack()

#define ALIGN 16L
char ssebuf[sizeof(SSE_VALUES) + ALIGN - 1];
SSE_VALUES * const ssep = (SSE_VALUES*)((((ptrdiff_t)ssebuf) + ALIGN - 1) & ~(ALIGN - 1));

...
ssep->data = _mm_set_ps(a8_2, r8_2, g8_2, b8_2);
ssep->weightf = _mm_load1_ps(&(contrib_x_j->weight));
ssep->weightf_mul = _mm_mul_ps(ssep->weightf, ssep->data);
ssep->weightf_bgra = _mm_add_ps(ssep->weightf_bgra, ssep->weightf_mul);
...


The processor is an Intel Core i5. As predicted, movaps vs. movups
makes no measurable difference on this CPU. Even the reduced number
of instructions has no effect. The whole code is still much slower
than the FPU version.
I compared the generated code with what MSC created (non-optimized)
and there is not much difference in the SSE part. So it is probably
related to the missing inter-macro optimization in OWC.

Uwe Schmelich

Sep 15, 2010, 5:08:54 AM
Heiko Nitzsche wrote:

> But I think I'll give up on the SSE topic with OWC. It looks like
> the issue cannot be resolved with reasonable effort. Rewriting the
> whole algorithm in assembler is not an option, as it is not trivial.
> Also, I had much bigger gains with parallel processing; SSE was just
> a nice idea for an additional improvement. Interestingly, with 64-bit
> MSC manual SSE coding seems unnecessary for many algorithms, as the
> compiler-generated code easily reaches the same performance, maybe
> also because there are twice as many registers in x64 mode.

> The processor is an Intel Core i5. As predicted, movaps vs. movups
> makes no measurable difference on this CPU. Even the reduced number
> of instructions has no effect. The whole code is still much slower
> than the FPU version.
> I compared the generated code with what MSC created (non-optimized)
> and there is not much difference in the SSE part. So it is probably
> related to the missing inter-macro optimization in OWC.

Have you given the Watcom sampler/profiler (wsample.exe, wprof.exe) a
try? If not, and you are willing to spend some more hours on this, it
would be interesting to see where the real culprits are.
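(The usual workflow, in case it helps: run the program under
wsample.exe, which writes a .smp sample file, and then open that file
in wprof.exe to browse the results.)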
From my experience I would say that the missing inter-macro
optimization has an influence in the range of a factor of 1.5 to 2 when
you use only short macros with little functionality.
I don't know how often you use _mm_set_ps(), but cvtsi2ss is one of the
operations you'd like to omit if you need speed.

Regards
Uwe

Heiko Nitzsche

Sep 15, 2010, 7:16:56 PM
> Have you given the Watcom sampler/profiler (wsample.exe, wprof.exe) a
> try? If not, and you are willing to spend some more hours on this, it
> would be interesting to see where the real culprits are.

Nice tool! I tried it some time ago on OS/2 (OWC 1.4?) but IIRC it
had serious resolution issues because the sampling rate was too slow.
The Windows version is much better.

> From my experience I would say that the missing inter-macro
> optimization has an influence in the range of a factor of 1.5 to 2
> when you use only short macros with little functionality.

I think I found the root cause of the big difference. The algorithm
does not necessarily read data sequentially from memory (it's part of
a resampling scaler). Here is its inner loop for scaling the bitmap
vertically; that's the worst case for the problem.
Obviously the read/write of the SSE float array (which is not
optimized away by OWC) causes the big overhead and potentially
somehow also interferes with the data caching of the actual bitmap
data being processed. Note that the algorithm optimization is not
yet finished.

FPU based version:
------------------
for (j = 0; j < contrib_y->n; ++j)
{
    contrib_y_j = &(contrib_y->p[j]);
    pBGR8 = raster_tmp + (contrib_y_j->pixel * rowspan_tmp);
    b8_2 = *pBGR8++;
    g8_2 = *pBGR8++;
    r8_2 = *pBGR8;
    if (isFirst)
    {
        b8_1 = b8_2; g8_1 = g8_2; r8_1 = r8_2;
        isFirst = GBM_FALSE;
    }
    else if ((b8_1 != b8_2) || (g8_1 != g8_2) || (r8_1 != r8_2))
    {
        isSame = GBM_FALSE;
    }
    weightf    = contrib_y_j->weight;
    weightf_b += weightf * b8_2;
    weightf_g += weightf * g8_2;
    weightf_r += weightf * r8_2;
}

SSE modified version:
---------------------
for (j = 0; j < contrib_y->n; ++j)
{
    contrib_y_j = &(contrib_y->p[j]);
    pBGR8 = raster_tmp + (contrib_y_j->pixel * rowspan_tmp);
    b8_2 = *pBGR8++;
    g8_2 = *pBGR8++;
    r8_2 = *pBGR8;
    if (isFirst)
    {
        b8_1 = b8_2; g8_1 = g8_2; r8_1 = r8_2;
        isFirst = GBM_FALSE;
    }
    else if ((b8_1 != b8_2) || (g8_1 != g8_2) || (r8_1 != r8_2))
    {
        isSame = GBM_FALSE;
    }
    ssep->data    = _mm_set_ps(0, r8_2, g8_2, b8_2);
    ssep->weightf = _mm_load1_ps(&(contrib_y_j->weight));

    ssep->weightf_mul  = _mm_mul_ps(ssep->weightf, ssep->data);
    ssep->weightf_bgra = _mm_add_ps(ssep->weightf_bgra, ssep->weightf_mul);
}


In the profiler it becomes obvious that the memory transfer from the
XMM registers back to memory (for the return values) is the bottleneck.
In the FPU version there are no such memory transfers, and that makes
the difference.

I also had a look at the optimized MSC assembler code, and yes, the
inter-macro optimization greatly helps as it completely removes all
intermediate memory transfers for the intrinsics. Here are the dumps
of both, OWC optimized and MSC optimized:

OWC optimized:
--------------
ssep->data = _mm_set_ps(0, r8_2, g8_2, b8_2);
00406F32 movzx edx,byte ptr -08[ebp]
00406F36 movzx ecx,byte ptr -18[ebp]
00406F3A movzx ebx,byte ptr -14[ebp]
00406F3E xor eax,eax
00406F40 lea esi,-011C[ebp]
00406F46 cvtsi2ss xmm3,eax
00406F4A cvtsi2ss xmm2,ebx
00406F4E cvtsi2ss xmm1,ecx
00406F52 cvtsi2ss xmm0,edx
00406F56 unpcklps xmm0,xmm2
00406F59 unpcklps xmm1,xmm3
00406F5C unpcklps xmm0,xmm1
00406F5F movups [esi],xmm0
00406F62 mov eax,dword ptr -48[ebp]
00406F65 lea edi,30[eax]
00406F68 lea esi,-011C[ebp]
00406F6E movsd
00406F6F movsd
00406F70 movsd
00406F71 movsd
ssep->weightf = _mm_load1_ps(&(contrib_y_j->weight));
00406F72 mov eax,dword ptr -84[ebp]
00406F78 add eax,00000004
00406F7B lea esi,-012C[ebp]
00406F81 movss xmm0,[eax]
00406F85 shufps xmm0,xmm0,00
00406F89 movups [esi],xmm0
00406F8C mov eax,dword ptr -48[ebp]
00406F8F lea edi,20[eax]
00406F92 lea esi,-012C[ebp]
00406F98 movsd
00406F99 movsd
00406F9A movsd
00406F9B movsd


ssep->weightf_mul = _mm_mul_ps(ssep->weightf, ssep->data);

00406F9C mov edx,dword ptr -48[ebp]
00406F9F add edx,00000030
00406FA2 mov eax,dword ptr -48[ebp]
00406FA5 add eax,00000020
00406FA8 lea esi,-013C[ebp]
00406FAE movaps xmm1,[eax]
00406FB1 mulps xmm1,[edx]
00406FB4 movups [esi],xmm1
00406FB7 mov eax,dword ptr -48[ebp]
00406FBA lea edi,10[eax]
00406FBD lea esi,-013C[ebp]
00406FC3 movsd
00406FC4 movsd
00406FC5 movsd
00406FC6 movsd


ssep->weightf_bgra = _mm_add_ps(ssep->weightf_bgra, ssep->weightf_mul);

00406FC7 mov edx,dword ptr -48[ebp]
00406FCA add edx,00000010
00406FCD mov eax,dword ptr -48[ebp]
00406FD0 lea esi,-014C[ebp]
00406FD6 movaps xmm1,[eax]
00406FD9 addps xmm1,[edx]
00406FDC movups [esi],xmm1

MSC optimized:
--------------
ssep->data = _mm_set_ps(0, r8_2, g8_2, b8_2);
00404F1E movzx eax,al
00404F21 cvtsi2ss xmm2,eax
00404F25 movzx edx,dl
00404F28 movzx eax,cl
00404F2B movss xmm1,xmm4
00404F2F cvtsi2ss xmm5,edx
00404F33 cvtsi2ss xmm0,eax
00404F37 add esi,8
00404F3A sub dword ptr [esp+30h],1
00404F3F unpcklps xmm0,xmm2
00404F42 unpcklps xmm5,xmm1
ssep->weightf = _mm_load1_ps(&(contrib_y_j->weight));
00404F45 movss xmm1,dword ptr [esi-4]
00404F4A unpcklps xmm0,xmm5
00404F4D shufps xmm1,xmm1,0


ssep->weightf_mul = _mm_mul_ps(ssep->weightf, ssep->data);

00404F51 movaps xmm2,xmm0
00404F54 mulps xmm2,xmm1


ssep->weightf_bgra = _mm_add_ps(ssep->weightf_bgra, ssep->weightf_mul);

00404F57 addps xmm3,xmm2


The inner loop above is executed about 70.5 million times for the
example (see below). And this is just for the vertical scaling;
horizontal is less because the image is first stretched horizontally
and then vertically. Based on the profiling results I guess it may be
worth doing it in the inverted order, to save multiplies for the
column addressing and also to gain more efficiency from CPU data
caching, as horizontally the data is read sequentially.
Anyway...

The inter-macro optimization gives such a boost that the single-thread
SSE version of MSC requires less than half the execution time of the
OWC FPU version. Interestingly, even the MSC FPU version is a lot
faster, although I tuned the compiler options for OWC quite
intensively. Obviously there is still room for improvement.

Scaling a 1024x722x24bpp to a 5000x3525x24bpp bitmap with a Mitchell
filter results in the following on my Core i5 750 (times in seconds):

Threads      MSC            OWC
          FPU   SSE      FPU   SSE
   1      1.92  1.54     3.72  5.48
   2      1.26  1.08     2.34  3.22
   4      0.95  0.86     1.65  2.11

Just for completeness: the MSC 64-bit version without hand-coded SSE
takes 0.73 seconds for the same task, and with it 0.75 seconds ;)
And this code is not even optimized for 64-bit, wow.

The less-than-linear scaling with more threads probably comes from the
Turbo Boost behavior of the Core i5 and most probably from memory
bandwidth limitations.

> I don't know how often you use _mm_set_ps(), but cvtsi2ss is one of
> the operations you'd like to omit if you need speed.

I also tried variants with movss, but cvtsi2ss was still the fastest.
And obviously even the MS compiler prefers it, so it can't be that
slow ;)

Heiko Nitzsche

Sep 20, 2010, 6:07:14 PM
> Scaling a 1024x722x24bpp to a 5000x3525x24bpp bitmap with a Mitchell
> filter results in the following on my Core i5 750 (times in seconds):
>
> Threads      MSC            OWC
>           FPU   SSE      FPU   SSE
>    1      1.92  1.54     3.72  5.48
>    2      1.26  1.08     2.34  3.22
>    4      0.95  0.86     1.65  2.11

OK, I resolved the issue by coding my own macros that address the
XMM registers directly and thus get code similar to MSC with
inter-macro optimization. Based on the wsample results I finished
tuning the algorithm as well, and now get the following results for
the mentioned scenario:

For 4 threads:
MSC 32bit SSE:  218ms (FPU: 265ms -> worse)
MSC 64bit SSE:  172ms (same for non-intrinsics version)
OpenWatcom SSE: 312ms (FPU: 375ms -> worse)

Well, a comparison with the very first measurements from the last
message makes no sense, because some other stuff was additionally
included there (as I figured out later :( ).
Nevertheless, the new macro approach with direct XMM register
addressing yields improvements similar to MSC with inter-macro
optimization; the code just looks a bit uglier ;)
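
The pattern is roughly the following (a simplified sketch, not my
exact code): each operation becomes a pragma that names fixed XMM
registers, plus load/store pragmas at the block boundaries, so the
intermediate values never touch memory. Since OWC itself never
generates SSE code, the XMM registers survive between the macros:

static void mm_load_x0(const float *p);
#pragma aux mm_load_x0 = ".686" "movups xmm0,[eax]" parm [eax] modify exact [];

static void mm_load_x1(const float *p);
#pragma aux mm_load_x1 = ".686" "movups xmm1,[eax]" parm [eax] modify exact [];

static void mm_mul_x0_x1(void);
#pragma aux mm_mul_x0_x1 = ".686" "mulps xmm0,xmm1" modify exact [];

static void mm_add_x0_x2(void);
#pragma aux mm_add_x0_x2 = ".686" "addps xmm0,xmm2" modify exact [];

static void mm_store_x0(float *p);
#pragma aux mm_store_x0 = ".686" "movups [eax],xmm0" parm [eax] modify exact [];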

Lynn McGuire

Sep 24, 2010, 5:57:33 PM
> The inter-macro optimization gives such a boost that the single-thread
> SSE version of MSC requires less than half the execution time of the
> OWC FPU version. Interestingly, even the MSC FPU version is a lot
> faster, although I tuned the compiler options for OWC quite
> intensively. Obviously there is still room for improvement.

Which version of MSC ? And is this OW 1.9 ?

> Scaling a 1024x722x24bpp to a 5000x3525x24bpp bitmap with a Mitchell
> filter results in the following on my Core i5 750 (times in seconds):
>
> Threads      MSC            OWC
>           FPU   SSE      FPU   SSE
>    1      1.92  1.54     3.72  5.48
>    2      1.26  1.08     2.34  3.22
>    4      0.95  0.86     1.65  2.11

Wow, that is quite an execution time difference for MSC. I am
guessing that the code generator in MSC takes advantage of a number
of CPU features that the OW code generator does not.

> Just for completeness: the MSC 64-bit version without hand-coded SSE
> takes 0.73 seconds for the same task, and with it 0.75 seconds ;)
> And this code is not even optimized for 64-bit, wow.

That timing improvement is not surprising to me. The x64 CPU has
many more registers than the x86 CPU, and a well-written compiler
like MSC will probably take advantage of those with glee <g>.

Thanks for the interesting comparisons,
Lynn

Heiko Nitzsche

Sep 25, 2010, 9:48:59 AM
> Which version of MSC ? And is this OW 1.9 ?

MSC: Version 15.00.30729.01
OWC: 1.8

I had some difficulties with OWC 1.9: sporadic crashes in code that
has been rock stable for several years. It has worked fine since
OWC 1.4 and also works fine with GCC 3.3.5 (OS/2), IBM VAC 3.08 (OS/2),
GCC 4.4.x (32/64-bit Linux) and MSC (32/64-bit Windows).
So I'll stick with OWC 1.8 until I find some time to analyze what
could be wrong with the OWC 1.9 build.

> Wow, that is quite an execution time difference for MSC. I am
> guessing that the code generator in MSC takes advantage of a
> number of cpu features that the OW code generator does not.

The target for MSC was set to the default (blend, which is /G6);
on OWC: -onatxhi -oe=100 -sg -ei -6r -fp6 -fpi87

> Thanks for the interesting comparisons,

I meanwhile also did some tests on GCC, and there the use of
intrinsics seems to be not worth the effort, as the compiler
generates much faster code when set to -mfpmath=sse -msse ;)

The code I have mainly uses single-precision floating point, which
is why SSE is sufficient; it has been available since the Pentium III /
Athlon XP, so backward compatibility is quite sufficient.

Lynn McGuire

Sep 25, 2010, 6:29:10 PM
> MSC: Version 15.00.30729.01

Is this the MSC in Visual Studio 2010 ?

> I meanwhile also did some tests on GCC, and there the use of
> intrinsics seems to be not worth the effort, as the compiler
> generates much faster code when set to -mfpmath=sse -msse ;)
>
> The code I have mainly uses single-precision floating point, which
> is why SSE is sufficient; it has been available since the Pentium III /
> Athlon XP, so backward compatibility is quite sufficient.

Ah, I am interested in double precision.

Thanks,
Lynn


Heiko Nitzsche

Sep 26, 2010, 12:28:40 PM
> Ah, I am interested in double precision.

Then try SSE2.
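
The macro pattern from earlier in the thread should carry over
directly; a rough, untested sketch (struct layout and names assumed,
reusing the __mm_binary2 prototype pragma from above):

typedef struct
{
    double m128d_f64[2];
} __m128d;

static __m128d _mm_mul_pd(__m128d *__m1, __m128d *__m2);
#define _mm_mul_pd(__m1, __m2) _mm_mul_pd(&(__m1), &(__m2))
#pragma aux (__mm_binary2) _mm_mul_pd = \
    ".686" \
    "movapd xmm1,[eax]" \
    "mulpd xmm1,[edx]" \
    "movupd [esi],xmm1"

(SSE2 support is indicated by CPUID EDX bit 26, i.e. mask 0x04000000,
instead of bit 25 for SSE.)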
