
paq8f slow execution in 64 bit Linux


Matt Mahoney

Jan 17, 2007, 11:03:12 PM
I finally got around to testing paq-x86_64.tgz, the 64 bit assembler
version of paq8f at http://cs.fit.edu/~mmahoney/compression/#paq8 which
unfortunately doesn't work on my Athlon-64 under Ubuntu. It
decompresses correctly on the files I tested, but the archives are
larger and not compatible with the 32-bit assembler and pure C++
versions. I need to fix this of course, but my question is about
execution speed. It seems to be much slower in 64-bit mode under Linux
than 32-bit mode under Windows, even for pure C++ code. Here are some
timings in seconds to compress 10^4 bytes of text (beginning of
enwik9): paq8f -6 enwik4 -> 2956 bytes

sec   build
----  -----
0.9   Win32, g++, optimized, linked with paq7asm
1.5   Win32, g++, optimized, pure C++ (-DNOASM)
3.6   Win32, g++, no optimizations
13.9  Linux, g++, optimized, linked with paq7asm-x86_64 (output size is 3225)
59    Linux, g++, optimized, pure C++ (-DNOASM)
308   Linux, g++, no optimizations

This is a dual boot system with a 2.2 GHz Athlon-64 and 2 GB memory.
Win32: XP SP2 home, MinGW g++ 3.4.5, compiled with g++ -O2 -Os -s
-march=pentiumpro -fomit-frame-pointer
Linux: Ubuntu 2.6.15.27-amd64-generic, g++ 4.0.3 x86_64, compiled with
-O2 -Os -s -fomit-frame-pointer

I will look into this further, but I wonder if anyone has seen this
behavior in other programs. Long ago I observed extremely slow
execution on a Solaris port of an earlier PAQ version, but I thought
that was just due to not using the assembler code.

One major difference between the 32 and 64 bit versions is the size of
type long and pointers. However, I compared the archives and they are
bitwise identical from all builds except the 64 bit assembler version.
XP Home does not run 64 bit programs even if you have the processor.
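
A quick way to confirm which data model a given build uses is to check the type sizes directly. The sketch below is illustrative only: `describe_data_model` is a hypothetical helper, not part of paq8f. On 64 bit Linux, `long` and pointers are normally 8 bytes, while a 32 bit build normally has 4 byte `long` and pointers.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from paq8f): names the data model implied by
// the sizes of int, long, and pointers in the current build.
std::string describe_data_model() {
    const std::size_t i = sizeof(int);
    const std::size_t l = sizeof(long);
    const std::size_t p = sizeof(void*);
    if (i == 4 && l == 4 && p == 4) return "ILP32";  // typical 32 bit build
    if (i == 4 && l == 8 && p == 8) return "LP64";   // typical 64 bit Linux
    if (i == 4 && l == 4 && p == 8) return "LLP64";  // typical 64 bit Windows
    return "other";
}
```

Printing the returned string from each executable shows at a glance whether two builds being compared really differ in the width of `long` and pointers.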

The 32-bit NASM assembler code does 16 bit signed vector operations
using the 64 bit MMX registers. The 64 bit YASM code by Matthew Fite
does the same using the 128 bit SSE2 registers. Of course I expected
the 64 bit version to be faster, with or without the assembler code
(which I hope to use in later PAQ versions). So I wonder if anyone has
observed other programs running much slower in 64 bit mode?
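
For readers unfamiliar with the vector code, here is a minimal sketch of a 16 bit signed dot product written with SSE2 intrinsics next to a scalar reference. It is not the paq7asm code: the function names `dot_scalar` and `dot_simd` are made up for illustration, the real routines are written in assembler and may scale intermediate sums differently, and this version assumes n is a multiple of 8 and that the sums fit in 32 bits.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Scalar reference: sum of products of 16 bit signed values.
std::int32_t dot_scalar(const std::int16_t* t, const std::int16_t* w, int n) {
    std::int32_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum += static_cast<std::int32_t>(t[i]) * w[i];
    return sum;
}

// SSE2 sketch: processes 8 values per step in a 128 bit register.
// Assumes n is a multiple of 8. Falls back to the scalar loop when
// the compiler does not report SSE2 support.
std::int32_t dot_simd(const std::int16_t* t, const std::int16_t* w, int n) {
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(t + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(w + i));
        // Multiply 8 pairs of 16 bit values and add adjacent products,
        // giving 4 partial sums of 32 bits each.
        acc = _mm_add_epi32(acc, _mm_madd_epi16(a, b));
    }
    std::int32_t lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    return dot_scalar(t, w, n);
#endif
}
```

An MMX version would have the same shape but handle 4 values per step in a 64 bit register, which is why the wider SSE2 registers were expected to help.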

-- Matt Mahoney

Sportman

Jan 18, 2007, 12:35:25 AM
Matt Mahoney wrote:
> So I wonder if anyone has
> observed other programs running much slower in 64 bit mode?

I know chess software benefits a lot from 64 bit (and multi core):

Something like:
32bit --> 64bit = 60% increase.
One core --> dual core = 70% increase.

See also some benchmarks:
http://www.sedatchess.com/hardwares.html

Christian

Jan 18, 2007, 4:57:06 AM
Hi Matt,

> ... execution speed. It seems to be much slower in 64-bit mode under Linux


> than 32-bit mode under Windows, even for pure C++ code.

<snip>


> This is a dual boot system with a 2.2 GHz Athlon-64 and 2 GB memory.
> Win32: XP SP2 home, MinGW g++ 3.4.5, compiled with g++ -O2 -Os -s
> -march=pentiumpro -fomit-frame-pointer
> Linux: Ubuntu 2.6.15.27-amd64-generic, g++ 4.0.3 x86_64, compiled with
> -O2 -Os -s -fomit-frame-pointer

Why do you force the compiler to produce code for the old 32-bit
Pentium Pro processor if you want a high-performance run on a 64-bit
Athlon CPU? Shouldn't you use "-march=athlon64" instead of
"-march=pentiumpro"?

Christian

Laurent

Jan 18, 2007, 5:13:59 AM
Matt Mahoney wrote:
> This is a dual boot system with a 2.2 GHz Athlon-64 and 2 GB memory.
> Win32: XP SP2 home, MinGW g++ 3.4.5, compiled with g++ -O2 -Os -s
> -march=pentiumpro -fomit-frame-pointer
> Linux: Ubuntu 2.6.15.27-amd64-generic, g++ 4.0.3 x86_64, compiled with
> -O2 -Os -s -fomit-frame-pointer

If Ubuntu is able to run 32 bit code, try to compile with -m32
and see what the speed is.

> One major difference between the 32 and 64 bit versions is the size of
> type long and pointers.

The problem is that since long is 64 bit you double data size
which results in dcache thrashing. Try to replace long with
int.
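
To illustrate the suggestion in general terms (this thread does not establish that paq8f actually stores its tables as long), here is a small sketch comparing the memory footprint of a table under different element types. `table_bytes` and `Weight` are hypothetical names chosen for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes occupied by a table of `count` elements of type T.
template <typename T>
constexpr std::size_t table_bytes(std::size_t count) {
    return count * sizeof(T);
}

// A fixed-width element type keeps the footprint the same in 32 bit
// and 64 bit builds, unlike long, whose size depends on the platform.
using Weight = std::int32_t;

static_assert(sizeof(Weight) == 4, "int32_t is 4 bytes where it exists");
```

With a million entries, `table_bytes<Weight>` gives 4 MB on any build, while `table_bytes<long>` gives 8 MB wherever long is 64 bits, so less of the table fits in the data cache.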

> Of course I expected
> the 64 bit version to be faster, with or without the assembler code
> (which I hope to use in later PAQ versions). So I wonder if anyone has
> observed other programs running much slower in 64 bit mode?

I have never observed a slowdown on compute-bound programs.
At least not when long has been replaced with int :)


Laurent

Matt Mahoney

Jan 18, 2007, 10:42:37 AM

In fact I had to take out -march=pentiumpro in Linux (machine type not
supported error). I forgot to mention this. In Windows it was the
lowest processor type that didn't significantly affect performance.

Anyway I will post when I find the bug.

-- Matt Mahoney

Laurent

Jan 18, 2007, 12:37:40 PM
As I explained to Matt by private e-mail, it looks like
there is a bug in the x86_64 C++ file:

for (int i=0; i<ncxt; ++i) {
  for( int j=0; j< nx; j++ )
#ifdef NOASM // no assembly language
    pr[i]=squash(dot_product(&tx[0], &wx[cxt[i]*N], nx)>>5);
#elif __x86_64
    pr[i]=squash(dot_product_x86_64(&tx[0], &wx[cxt[i]*N], nx)>>5);
#else
    pr[i]=squash(dot_product(&tx[0], &wx[cxt[i]*N], nx)>>5);
#endif

The internal j loop is spurious.
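
A stripped-down sketch of why such a loop costs so much without changing the output: the body does not depend on j, so every pass recomputes and stores the same value. The names below (`dot`, `predict_buggy`, `predict_fixed`, `g_calls`) are stand-ins invented for this example and are not the paq8f functions.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times the stand-in dot product is evaluated.
static long g_calls = 0;

// Stand-in for the real dot product: plain integer sum of products.
int dot(const std::vector<int>& t, const std::vector<int>& w) {
    ++g_calls;
    int sum = 0;
    for (std::size_t k = 0; k < t.size(); ++k) sum += t[k] * w[k];
    return sum;
}

// Buggy shape: the inner j loop repeats the same assignment nx times.
std::vector<int> predict_buggy(const std::vector<int>& t,
                               const std::vector<std::vector<int> >& w) {
    std::vector<int> pr(w.size());
    const int nx = static_cast<int>(t.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        for (int j = 0; j < nx; ++j)  // spurious: j is never used
            pr[i] = dot(t, w[i]);
    return pr;
}

// Fixed shape: one dot product per context.
std::vector<int> predict_fixed(const std::vector<int>& t,
                               const std::vector<std::vector<int> >& w) {
    std::vector<int> pr(w.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        pr[i] = dot(t, w[i]);
    return pr;
}
```

Both functions return the same predictions, which is consistent with the pure C++ archives being bitwise identical, but the buggy one performs nx times as many dot products.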

Matt Mahoney wrote:

> sec build
> --- -----
> 0.9 Win32, g++, optimized, linked with paq7asm
> 1.5 Win32, g++, optimized, pure C++ (-DNOASM)
> 3.6 Win32, g++, no optimizations
> 13.9 Linux, g++, optimized, linked with paq7asm-x86_64 (output size is
> 3225)
> 59 Linux, g++, optimized, pure C++ (-DNOASM)
> 308 Linux, g++, no optimizations
>
> This is a dual boot system with a 2.2 GHz Athlon-64 and 2 GB memory.
> Win32: XP SP2 home, MinGW g++ 3.4.5, compiled with g++ -O2 -Os -s
> -march=pentiumpro -fomit-frame-pointer
> Linux: Ubuntu 2.6.15.27-amd64-generic, g++ 4.0.3 x86_64, compiled with
> -O2 -Os -s -fomit-frame-pointer

On a 2.4 GHz Opteron, gcc 4.1.1 and NOASM I get this:

64 bit build: 1.55 sec
32 bit build: 1.96 sec

This definitely looks better :-)


Laurent

Matt Mahoney

Jan 18, 2007, 1:09:13 PM

Laurent wrote:
> As I explained to Matt by private e-mail, it looks like
> there is a bug in the x86_64 C++ file:
>
> for (int i=0; i<ncxt; ++i) {
>   for( int j=0; j< nx; j++ )
> #ifdef NOASM // no assembly language
>     pr[i]=squash(dot_product(&tx[0], &wx[cxt[i]*N], nx)>>5);
> #elif __x86_64
>     pr[i]=squash(dot_product_x86_64(&tx[0], &wx[cxt[i]*N], nx)>>5);
> #else
>     pr[i]=squash(dot_product(&tx[0], &wx[cxt[i]*N], nx)>>5);
> #endif
>
> The internal j loop is spurious.

Yes, that is exactly the problem (in Mixer::p()). I took out the extra
loop and it worked (with -DNOASM).

Now I just need to find the other bug in the 64 bit assembler code.
Hopefully this should result in 64 bit Linux versions of all PAQ
versions starting with PAQ7.

Matt Mahoney

Jan 19, 2007, 12:52:18 AM

I have fixed the bugs in the 64 bit SSE2 assembler code (in train()).
The new code can be linked to any paq7 or paq8 version with no source
code changes to produce a 64 bit Linux executable. I have not tested
it with 64 bit Windows or 32 bit Linux but there is no reason the code
should not work as written. On my Athlon-64, the Linux-64 version is
about 7% faster than the Win32 version. I have produced Linux
executables for paq8f and paq8jd. The code is here:

http://cs.fit.edu/~mmahoney/compression/#paq8

direct links:

http://cs.fit.edu/~mmahoney/compression/paq8f.zip
http://cs.fit.edu/~mmahoney/compression/paq8jd.zip (newer, better
compression, but slower)

Archive contents:

paq8f.cpp or paq8jd.cpp - source code (unchanged)
paq7asm.asm - 32 bit NASM/YASM assembler code (unchanged)
paq7asm-x86_64.asm - 64 bit YASM assembler code ver. 2 (fixed)
paq7asm-x86_64.o - above, assembled with YASM for 64 bit Linux (new)
paq8f.exe or paq8jd.exe - Win32 executable linked with paq7asm
(unchanged)
paq8f or paq8jd - Linux x86_64 executable (new)

See paq7asm-x86_64.asm comments for 64 bit compilation instructions.

-- Matt Mahoney

Note: the original 64 bit code at
http://ilovemyking.googlepages.com/paqpage uses the old assembler code,
which does not work. Use this code instead.

Laurent

Jan 21, 2007, 7:47:01 AM
Matt Mahoney wrote:

> I have not tested
> it with 64 bit Windows or 32 bit Linux but there is no reason the code
> should not work as written. On my Athlon-64, the Linux-64 version is
> about 7% faster than the Win32 version.

Couldn't you try 32 bit Linux with -m32? I really would like
to see how it compares.


Laurent

Matt Mahoney

Jan 22, 2007, 10:35:09 PM

I haven't tried it, but I think it should run at the same speed as the
32 bit Windows version or a bit slower. Some compilers like Intel and
VC++ produce code a bit faster than g++.

-- Matt Mahoney
