lmbench on Linux and DU, explanations welcome


Anton Ertl

Mar 9, 1999
I have run lmbench 1.1 on the same motherboard (a Cabriolet with a
300MHz 21064a and 2MB L2 cache) under both Digital Unix and Linux.
You can see the results at
http://www.complang.tuwien.ac.at/anton/a2a3.ps.gz. An HTML page about
this is available at
http://www.complang.tuwien.ac.at/anton/lmbench-linux-vs-du.html

There are interesting differences between the results for the two OSs.
I think the differences for the 128K-8M range can be explained by
different page allocation algorithms.

However, it seems unlikely to me that this is the explanation for the
differences in the 8K-16K range. So, my questions are:

Could these differences be caused by the virtually indexed L1 cache?
If I understand virtually indexed caches correctly, this cannot be the
case.

Or does Linux use only 8K (of 16K) data cache in the 21064A?
Incidentally, /proc/cpuinfo reports the CPU as EV4 (instead of EV45;
the EV4 has only an 8KB D-cache).

Or is there another explanation for this result?

Thanks in advance.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Solar Designer

Mar 10, 1999
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> I have run lmbench 1.1 on the same motherboard (a Cabriolet with a
> 300MHz 21064a and 2MB L2 cache) under both Digital Unix and Linux.
> You can see the results at
> http://www.complang.tuwien.ac.at/anton/a2a3.ps.gz. An HTML page about
> this is available at
> http://www.complang.tuwien.ac.at/anton/lmbench-linux-vs-du.html

Thanks for the interesting results.

> There are interesting differences between the results for the two OSs.
> I think the differences for the 128K-8M range can be explained by
> different page allocation algorithms.

> However, it seems unlikely to me that this is the explanation for the
> differences in the 8K-16K range. So, my questions are:

Isn't there a 50% chance of the two pages (or are there three, if the
array isn't page-aligned?) getting mapped onto physical addresses
corresponding to the same half of the cache?
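
To make that concrete (this is just my back-of-the-envelope sketch, not
anything from lmbench): with 8K pages and a 16K direct-mapped,
physically indexed cache, bit 13 of the physical address selects which
half of the cache a page lands in, so two pages conflict exactly when
those bits match:

/* sketch only: with an 8K page size, physical address bit 13 is bit 0
   of the page frame number; two pages fall into the same half of a
   16K direct-mapped, physically indexed cache (and you effectively
   get an 8K cache) iff those bits are equal */
static int same_cache_half(unsigned long pfn_a, unsigned long pfn_b)
{
  return ((pfn_a ^ pfn_b) & 1) == 0;
}

Under a uniform page allocation that happens with probability 1/2.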

> Or is there another explanation for this result?

While I don't have a good explanation, let me offer a simple benchmark
I did a while ago:

ftp://ftp.dataforce.net/pub/solar/membench.c

It has assembly code for Alpha, x86, and x86 with MMX, as well as a pure
C version. Requires GCC. The simple assembly loops should make it more
deterministic, which in turn should make analyzing its results easier.
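
The pure C variant is essentially just a sequential read loop. This is
not the actual code from membench.c, only roughly its shape (names and
constants here are made up):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static long total;      /* keeps the compiler from discarding the loads */

/* read `size' bytes sequentially, `passes' times, and return MB/s */
static double read_mbs(const long *buf, long size, long passes)
{
  long i, p, sum = 0, n = size / (long)sizeof(long);
  clock_t t0 = clock();

  for (p = 0; p < passes; p++)
    for (i = 0; i < n; i++)
      sum += buf[i];
  total = sum;
  return (double)size * passes / 1e6
         / ((double)(clock() - t0) / CLOCKS_PER_SEC);
}

int main(void)
{
  long size = 8 << 10;                  /* e.g. an 8K working set */
  long *buf = calloc(1, size);

  if (!buf)
    return 1;
  printf("%ldK: %.2f MB/s\n", size >> 10, read_mbs(buf, size, 1 << 16));
  free(buf);
  return 0;
}

The real inner loops are hand-written assembly precisely so that the
numbers don't depend on what the compiler makes of such a loop.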

P.S. Finally an opportunity to thank you for publishing the GCC "Labels
as Values" optimization for virtual machines that I've made use of. :-)
Thanks!

--
/sd

Anton Ertl

Mar 10, 1999
In article <7c50a1$btd$1...@prince.dataforce.net>,
Solar Designer <so...@cannabis.dataforce.net> writes:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> > I have run lmbench 1.1 on the same motherboard (a Cabriolet with a
> > 300MHz 21064a and 2MB L2 cache) under both Digital Unix and Linux.
> > You can see the results at
> > http://www.complang.tuwien.ac.at/anton/a2a3.ps.gz. An HTML page about
> > this is available at
> > http://www.complang.tuwien.ac.at/anton/lmbench-linux-vs-du.html
...

> > However, it seems unlikely to me that this is the explanation for the
> > differences in the 8K-16K range. So, my questions are:
>
> Isn't there a 50% chance of the two pages (or are there three, if the
> array isn't page-aligned?) getting mapped onto physical addresses
> corresponding to the same half of the cache?

For a direct-mapped physically indexed cache, yes. What is the
organization of the L1 cache in the 21064A? One web page I read
claimed it is virtually indexed, which would mean that the physical
page mappings should be irrelevant (except maybe for inclusion in L2).

I have made ten runs on Cabriolets under Linux (you can find runs from
another machine with and without 512K cache in
http://www.complang.tuwien.ac.at/anton/a4.ps.gz). They all have the
step at 8K, not at 16K. The chance of that happening in all ten runs
by accident is about 0.1% (0.5^10).

> While I don't have a good explanation, let me offer a simple benchmark
> I did a while ago:
>
> ftp://ftp.dataforce.net/pub/solar/membench.c

Here's what I got (results in MB/s read bandwidth):

size Linux DU
1K 1497.11 1466.47
4K 1519.95 1538.94
7K 1494.72 1488.24
8K 1509.16 1490.77
9K 858.61 1492.08
15K 362.99 1515.74
16K 344.40 1516.17
17K 344.65 1069.26

I.e., we get the same picture as with lmbench. The results under
Linux look as if the machine had only 8K L1 D-cache, whereas under DU
we get full L1 bandwidth up to the full 16K of the L1 D-cache.

> P.S. Finally an opportunity to thank you for publishing the GCC "Labels
> as Values" optimization for virtual machines that I've made use of. :-)

Actually the work was done by the GCC people, I only used it and wrote
about it.

Solar Designer

Mar 10, 1999
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> ...
>> > However, it seems unlikely to me that this is the explanation for the
>> > differences in the 8K-16K range. So, my questions are:
>>
>> Isn't there a 50% chance of the two pages (or are there three, if the

Actually, that is assuming a uniform distribution, which we probably
don't get.

>> array isn't page-aligned?) getting mapped onto physical addresses
>> corresponding to the same half of the cache?

> For a direct-mapped physically indexed cache, yes. What is the
> organization of the L1 cache in the 21064A? One web page I read
> claimed it is virtually indexed, which would mean that the physical
> page mappings should be irrelevant (except maybe for inclusion in L2).

I've just found another web page, which lists 21064A as physically
indexed:

Processor          level  size  line  assoc  wp      index  fow  coher
-----------------------------------------------------------------------
DEC Alpha 21064A   L1     16k   32    1      WT->L2  PIPT   no

http://www.dcs.warwick.ac.uk/~john/cache-specs.html

The right way to find out for sure (well, almost) would probably be to
download the PDF from Digital/Compaq and read.

> I have made ten runs on Cabriolets under Linux (you can find runs from
> another machine with and without 512K cache in
> http://www.complang.tuwien.ac.at/anton/a4.ps.gz). They all have the
> step at 8K, not at 16K. The chance of this happening in ten runs by
> chance is 0.1%.

Not necessarily; the page mapping might depend on things such as the
size of your binary's sections, which remain the same over the runs.

>> While I don't have a good explanation, let me offer a simple benchmark
>> I did a while ago:
>>
>> ftp://ftp.dataforce.net/pub/solar/membench.c

And also an old proof-of-concept program for detecting L1 data cache
size and associativity (actually, in the opposite order):

ftp://ftp.dataforce.net/pub/solar/cache.c

The code is a bit dirty, but seems to work most of the time (there're
a few exceptions though, such as under a high load). For example,
on my 21164PC, it reports:
L1 data cache: 8 Kb, direct-mapped
and on a Pentium II:
L1 data cache: 16 Kb, 4-way associative
both seem correct. :-)
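
For the curious, the basic idea behind it is the usual one (this is
not the code from cache.c, just a stripped-down sketch): sweep buffers
of growing size and look for the jump in access time that marks the L1
size; associativity can then be probed by touching k blocks spaced one
cache-size apart and seeing at which k the accesses start to miss.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS (1L << 25)

/* walk a power-of-two sized buffer with a 32-byte stride and return
   the time for ITERS accesses; clock() is crude, the real thing wants
   the cycle counter */
static double sweep_time(volatile char *buf, long size)
{
  long i;
  clock_t t0 = clock();

  for (i = 0; i < ITERS; i++)
    (void)buf[(i * 32) & (size - 1)];
  return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
  char *buf = malloc(1L << 20);
  long size;

  if (!buf)
    return 1;
  for (size = 1L << 10; size <= 1L << 17; size <<= 1)
    printf("%4ldK: %.3f s\n", size >> 10, sweep_time(buf, size));
  free(buf);
  return 0;
}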

> Here's what I got (results in MB/s read bandwidth):

> size Linux DU
> 1K 1497.11 1466.47
> 4K 1519.95 1538.94
> 7K 1494.72 1488.24
> 8K 1509.16 1490.77
> 9K 858.61 1492.08
> 15K 362.99 1515.74
> 16K 344.40 1516.17
> 17K 344.65 1069.26

> I.e., we get the same picture as with lmbench. The results under
> Linux look as if the machine had only 8K L1 D-cache, whereas under DU
> we get full L1 bandwidth up to the full 16K of the L1 D-cache.

Yes. It does indeed look like Linux doesn't use the full cache;
I guess it might be time to check the PALcode. How are you booting
Linux? Are you using the correct MILO image, if any?

--
/sd

Anton Ertl

Mar 10, 1999
In article <7c6a9r$pr3$1...@prince.dataforce.net>,
Solar Designer <so...@cannabis.dataforce.net> writes:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> The right way to find out for sure (well, almost) would probably be to
> download the PDF from Digital/Compaq and read.

I did that. The document is
http://ftp.digital.com/pub/Digital/info/semiconductor/literature/21064ads.pdf

It says on page 11 that there are two modes of using the D-cache: as
an 8K direct-mapped cache, or as a 16K cache that appears internally
as a virtually-indexed direct-mapped cache; externally (I guess this
means for cache-consistency bus snoops) it appears as a 2-way
set-associative cache. The manual has a funny way of spelling
"virtually indexed":
"Virtual address [13] and physical address [12:5] are the index into
the cache."

Bit 12 (DC_16K) of the ABOX_CTL register controls whether the D-cache
is 8K (clear) or 16K (set).

I bet that our Cabriolets run in 8K-D-Cache mode.

> And also an old proof-of-concept program for detecting L1 data cache
> size and associativeness (actually, in the opposite order):
>
> ftp://ftp.dataforce.net/pub/solar/cache.c
>
> The code is a bit dirty, but seems to work most of the time (there're
> a few exceptions though, such as under a high load). For example,
> on my 21164PC, it reports:
> L1 data cache: 8 Kb, direct-mapped
> and on a Pentium II:
> L1 data cache: 16 Kb, 4-way associative
> both seem correct. :-)

On a Cabriolet under DU this gives:

L1 data cache: 16 Kb, direct-mapped

Looks right. On a Cabriolet under Linux this gives:

L1 data cache: 8 Kb, 4-way associative

Hmm, let's try another run:

L1 data cache: 8 Kb, direct-mapped

Another run, same result. More runs, various results:-( Back to ZDU,
varying results. On a 21164A this program produces

[a5:~/ftp:1001] a.out
Working...
L1 data cache: 32 Kb, 2-way associative
[a5:~/ftp:1002] a.out
Working...
L1 data cache: 32 Kb, 2-way associative
[a5:~/ftp:1003] a.out
Working...
L1 data cache: 64 Kb, 4-way associative

Maybe it needs a little more work:-)

> Yes. It does indeed look like Linux doesn't use the full cache;
> I guess it might be time to check the PALcode. How are you booting
> Linux? Are you using the correct MILO image, if any?

We are booting ARC->MILO->Linux, and we use a Cabriolet MILO.

Would simply setting the DC_16K bit be enough? I doubt it; OSs
typically need special magic to deal with virtually-indexed caches
(e.g., to avoid aliasing when the same physical page is mapped at
virtual addresses that differ in the virtual index bit).

Solar Designer

Mar 10, 1999
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Bit 12 (DC_16K) of the ABOX_CTL register controls whether the D-cache
> is 8K (clear) or 16K (set).

Now that's interesting.

> Another run, same result. More runs, various results:-( Back to ZDU,
> varying results. On a 21164A this program produces

> [a5:~/ftp:1001] a.out
> Working...
> L1 data cache: 32 Kb, 2-way associative
> [a5:~/ftp:1002] a.out
> Working...
> L1 data cache: 32 Kb, 2-way associative
> [a5:~/ftp:1003] a.out
> Working...
> L1 data cache: 64 Kb, 4-way associative

I've never tried it on a 21164A; looks like it can't really distinguish
L1 from L2 there. It also expects the number of cache line sets to be
a power of two (I didn't know the details of 21164 when I was trying
this out two years ago).

> Maybe it needs a little more work:-)

Actually, it needs re-coding. ;-)

> Would simply setting the DC_16K bit be enough? I doubt it; OSs
> typically need special magic to deal with virtually-indexed caches.

I am not familiar with such magic, so I'd better not comment on this.

[ Sorry for bringing up a slightly different topic; I think that the
information below is funny enough, and might be interesting for other
readers of this newsgroup. ]

I did, however, play around with cache control IPRs in the past; just
like I expected, I was only able to change some of the timings on a
running system. First, I needed this for slowing down the cache on
a broken Multia VX40 (don't all of them get cache problems?) so that
it didn't crash too fast; now I am doing the opposite on my 164SX-1M
for better performance (1% to 5% for different applications, running
since December, no crashes).

What I would really like now is a copy of the 164SX SROM code (that
comes with EBSDK, but I don't feel like paying for overclocking, and
I'm afraid that shipping into Russia might be a bit expensive). All
I need is the 8K image (or is it 16K for 21164PC?).

Now, here's the 21164PC-533/164SX-1M cache speedup Linux kernel module.
The timings were adjusted for my particular board and might not work
for others. USE AT YOUR OWN RISK.

#define MODULE
#define __KERNEL__
#include <linux/module.h>

int init_module() {
  /* CBOX_CONFIG, addressed through Linux/Alpha's KSEG direct mapping
     (0xfffffc0000000000 + physical address) */
  int *reg = (int *)0xfffffcfffff00008UL;
  int old, new;

  old = *reg;
  new = (old & ~0xfff00) | 0x25500;  /* replace bits 8..19 with the tuned timings */
  *reg = new;

  printk("CBOX_CONFIG: %x -> %x\n", old, new);

  /* nonzero: insmod reports an error and the module is not kept loaded,
     but the register has already been written */
  return 1;
}

void cleanup_module() {
}
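
Build and load it like any old-style module (MODULE and __KERNEL__ are
already defined in the source; the file name below is just an example):

gcc -O -c cbox_speedup.c
insmod cbox_speedup.o

insmod will complain because init_module() returns nonzero; never mind
that, the register has already been written by then.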

--
/sd

Anton Ertl

Mar 11, 1999
In article <7c6ps1$up3$1...@prince.dataforce.net>,
Solar Designer <so...@cannabis.dataforce.net> writes:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> > Bit 12 (DC_16K) of the ABOX_CTL register controls whether the D-cache
> > is 8K (clear) or 16K (set).

I looked at the kernel source (2.2.1), the milo-2.0.30 source, and the
palcode source included with milo-2.0.30. As far as I can see, the
DC_16K bit is not set anywhere.

> > Would simply setting the DC_16K bit be enough? I doubt it; OSs
> > typically need special magic to deal with virtually-indexed caches.

Can anyone comment on this?

> [ Sorry for bringing up a slightly different topic; I think that the
> information below is funny enough, and might be interesting for other
> readers of this newsgroup. ]

Thanks, this inspired me for my solution below.

Your membench.c now gives 1478.98 Mb/sec even with 16K, and our LaTeX
benchmark improves from 32.67s to 31.17s (still some way to go to the
DU result of 24.8s).

- anton

Here's the source code for the module:

/* D-cache control on an 21064A

Just compile it with

gcc -O -c toggle_dc_16k.c

and invoke it with

insmod toggle_dc_16k.o

Never mind the error message. To see the result, invoke

dmesg |tail

which should contain a message like "abox_ctl=0x142c" or
"abox_ctl=0x42c". In the former case, the D-cache now has 16K, in the
latter 8K. Invoke the module again to toggle the state.


When you try to install this module, it toggles the state of the
D-cache between 8K and 16K and then prints the resulting value of the
ABOX_CTL register in hex. If bit 12 (mask: 0x1000) is set, the D-cache
is 16KB, if it is clear, the D-cache is 8KB. The module produces an
error, so it is not installed, but the effect is there nevertheless.

Warnings: It is not clear if Linux handles virtually indexed caches
correctly on a 21064A (there might be problems with cache
consistency, if the same physical page is mapped to several virtual
addresses). It is not clear if it is safe to switch during operation
(this module does not flush the cache).

However, we have experienced no problems during our experiments and
benchmarks with this module (on a non-critical machine:-)
*/

#define MODULE
#define __KERNEL__
#include <linux/module.h>

static long rd_abox_ctl()
{
  long abox_ctl;

  asm("mov 3, $18; call_pal 9; mov $0, %0" : "=r" (abox_ctl) : /* no inputs */
      : "$0","$16","$17","$18"); /* read abox_ctl register */
  return abox_ctl;
}

/* toggle bit 12 of the ABOX_CTL register; this bit controls whether
   the D-cache is 8KB or 16KB (virtually-indexed in the latter case) */
static void toggle_dc_16k()
{
  long abox_ctl;

  asm("mov 3, $18; call_pal 9; mov $0, %0" : "=r" (abox_ctl) : /* no inputs */
      : "$0","$16","$17","$18"); /* read abox_ctl register */
  abox_ctl ^= 0x1000;                       /* flip DC_16K */
  asm("mov 6, $18; mov %0, $16; call_pal 9" : : "r" (abox_ctl)
      : "$0","$16","$17","$18"); /* write abox_ctl register */
}

int init_module()
{
  toggle_dc_16k();
  printk("abox_ctl=0x%lx\n", rd_abox_ctl());
  return 1; /* device or resource busy, so the module is not installed */
}

void cleanup_module()
{
}


Anton Ertl

Mar 12, 1999
In article <7c8oqm$a9r$1...@news.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> In article <7c6ps1$up3$1...@prince.dataforce.net>,
> Solar Designer <so...@cannabis.dataforce.net> writes:
> > Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> > > Would simply setting the DC_16K bit be enough? I doubt it; OSs
> > > typically need special magic to deal with virtually-indexed caches.

Ok, I have tried to think up a case where the virtually indexed cache
could be a problem; I wrote a program to exercise this case
(http://www.complang.tuwien.ac.at/anton/toggle_dc_16k/mapcheck.c), and
it turns out that everything works well. I guess that the cache
consistency issues are handled in hardware. I just wonder why there is
an 8KB mode at all.
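
For reference, the test is along these lines (a simplified sketch, not
the actual mapcheck.c; the file handling and addresses are made up):
map the same page of a file at two virtual addresses that differ in
bit 13, store through one mapping, and load through the other. If the
two mappings were not kept consistent, the load would see stale data;
in my runs it never does.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
  long pagesize = sysconf(_SC_PAGESIZE);    /* 8K on Alpha */
  char tmpl[] = "/tmp/mapcheckXXXXXX";
  int fd = mkstemp(tmpl);
  volatile char *a, *b;

  if (fd < 0 || ftruncate(fd, pagesize) < 0)
    return 1;
  /* two mappings of the same physical page, at hint addresses that
     differ in bit 13, i.e. in different halves of a 16K virtually
     indexed cache (MAP_FIXED is quick and dirty, but this is a test) */
  a = mmap((void *)0x20000000UL, pagesize, PROT_READ | PROT_WRITE,
           MAP_SHARED | MAP_FIXED, fd, 0);
  b = mmap((void *)0x20002000UL, pagesize, PROT_READ | PROT_WRITE,
           MAP_SHARED | MAP_FIXED, fd, 0);
  if (a == MAP_FAILED || b == MAP_FAILED)
    return 1;

  a[0] = 42;                                /* store through one mapping */
  if (b[0] != 42)                           /* load through the other */
    printf("inconsistent: wrote 42, read %d\n", b[0]);
  else
    printf("consistent\n");

  unlink(tmpl);
  return 0;
}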

So, I think it is safe to switch the D-cache to 16K. You are invited
to try it (all the material is available at
http://www.complang.tuwien.ac.at/anton/toggle_dc_16k/), and report
results to me.

IMO this should go into the Linux startup code for EV45-based Alphas.
Does anyone know whom I should contact about these kernel patches?
(The MAINTAINERS file does not list an Alpha maintainer.)

Richard Rogers

Mar 12, 1999 (to Anton Ertl)

Anton Ertl wrote:

> Here's the source code for the module:
>
> /* D-cache control on an 21064A

I tried this on a pair of AlphaStation 200 4/233s running RedHat 5.2 and got
poor results. One of them hung shortly after performing the insmod, and the
other started experiencing frequent but intermittent long pauses before
echoing/responding to what I type (rsh'd in over the 10bT, since both machines
are headless). The pauses would go away when I toggled the cache back to 8K. I
didn't try any benchmarks since I was worried about the stability. Any idea
what might be going wrong?

--
Richard Rogers, Research Software Engineer rro...@statsci.com
Tel: 800-569-0123x311 206-283-8802x311 Fax: 206-283-8691
MathSoft, Inc., 1700 Westlake Ave. N. #500 Seattle, WA 98109

Anton Ertl

Mar 16, 1999
In article <36E98155...@statsci.com>,
Richard Rogers <rro...@statsci.com> writes:
> Anton Ertl wrote:
>
> > Here's the source code for the module:
> >
> > /* D-cache control on an 21064A
>
> I tried this on a pair of AlphaStation 200 4/233s running RedHat 5.2 and got
> poor results. One of them hung shortly after performing the insmod, and the
> other started experiencing frequent but intermittent long pauses before
> echoing/responding to what I type (rsh'd in over the 10bT, since both machines
> are headless). The pauses would go away when I toggled the cache back to 8K. I
> didn't try any benchmarks since I was worried about the stability. Any idea
> what might be going wrong?

After some emailing we have still not found the reason for the
problems or a solution; and I have heard about another case of a
problem with an Avanti (with an earlier attempt at switching the
D-Cache to 16K). So we have a Cabriolet where this works, and several
Avantis where it does not.

More experience reports are welcome (both positive and negative); you
can find the material at
http://www.complang.tuwien.ac.at/anton/toggle_dc_16k/

Anton Ertl

Mar 23, 1999
In article <7clab6$k53$2...@news.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> In article <36E98155...@statsci.com>,
> Richard Rogers <rro...@statsci.com> writes:
> > Anton Ertl wrote:
[switching the 21064A D-cache to 16K].

> > I tried this on a pair of AlphaStation 200 4/233s running RedHat 5.2 and got
> > poor results. One of them hung shortly after performing the insmod, and the
> > other started experiencing frequent but intermittent long pauses before
> > echoing/responding to what I type (rsh'd in over the 10bT, since both machines
> > are headless). The pauses would go away when I toggled the cache back to 8K. I
> > didn't try any benchmarks since I was worried about the stability. Any idea
> > what might be going wrong?
>
> After some emailing we have still not found the reason for the
> problems or a solution; and I have heard about another case of a
> problem with an Avanti (with an earlier attempt at switching the
> D-Cache to 16K). So we have a Cabriolet where this works, and several
> Avantis where it does not.

We have fixed the problem (by setting the DOUBLE_INVAL bit in
ABOX_CTL), and now all the machines I have reports for (three Avantis
and a Cabriolet) are running stable with the cache switched to 16K.
This fix is present in the current version of the toggle_dc_16k kernel
module (http://www.complang.tuwien.ac.at/anton/toggle_dc_16k/).

More news: The kernel module works only if you have booted using MILO
(it depends on the MILO palcode). Fortunately, it apparently is not
necessary if you have booted using SRM without MILO (SRM switches the
cache to 16K by itself). Nikita Schmidt will change the MILO palcode
to use 16K of D-cache by default.

More experience reports are welcome (both positive and negative).
