The background for this is that I'd like to be able to test
memory on a running system without having to take the system
offline for hours or even days to run memtest86.
Looks like it does in fact work. In case an error is detected
I currently just execute this piece of code:
printk(KERN_CRIT "error on page %ld (%d) - %d errors\n",
       (long)(p[i] - mem_map), i, ++errors);
Where p is an array with pointers to pages allocated with
alloc_page(). Now it would be a trivial matter to just leak
the page, but that alone is probably not enough to keep the
system stable, and for the most reliable test results I should
probably keep testing the bad pages again and again.
But now I'd like to print some more useful information about
the page. What relevant information could I get from a pointer
to a struct page?
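To partly answer my own question: from my reading of the 2.6 <linux/mm.h>, a struct page pointer gives you at least the page frame number, the physical address derived from it, the flags word, and the reference count. A sketch (not compiled against my module, helper names as I understand them):

```c
struct page *page = p[i];   /* the failing page */

printk(KERN_CRIT "bad page: pfn=%lu phys=0x%llx flags=0x%lx count=%d\n",
       page_to_pfn(page),                                   /* frame number   */
       (unsigned long long)page_to_pfn(page) << PAGE_SHIFT, /* physical addr  */
       page->flags,                                         /* zone/state bits */
       page_count(page));                                   /* reference count */
```

The physical address is probably the most useful part, since it could be correlated against memtest86 output or against which module/slot the address falls in.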
And assuming I want to test as much of the memory as possible,
are there any ways I could be more specific about which parts
of memory I want to allocate from rather than just using
alloc_page to allocate from a specific zone? If I free some
memory and try to allocate some new memory afterwards, I'd
probably just get the same pages back.
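As far as I can tell, the GFP zone modifiers are the only steering available; a sketch of what I believe works on 2.4/2.6 x86 (the ~16 MB DMA and ~896 MB lowmem boundaries are the usual x86 defaults, which I haven't verified for this configuration):

```c
/* Assumption: standard x86 DMA/Normal/HighMem zone split. */
struct page *low  = alloc_page(GFP_KERNEL | GFP_DMA); /* below ~16 MB  */
struct page *norm = alloc_page(GFP_KERNEL);           /* lowmem        */
struct page *high = alloc_page(GFP_HIGHUSER);         /* above ~896 MB */
```

Within a zone you still get whatever the buddy allocator hands out; as far as I know there is no interface for requesting a specific physical page.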
--
Kasper Dupont -- who spends too much time on usenet.
Note to self: Don't try to allocate 256000 pages
with GFP_KERNEL on x86.
OK, not an unreasonable idea. `memtest86` is very good and
exhaustive because it checks _all_ cells by relocating itself.
I wrote `burnBX` and later `burnMMX` as userland pgms to very
intensively test RAM. But they cannot test kernel memory, nor
pages that are not allocated to them. And addresses mean nothing between
runs because userland cannot see page allocation.
But I found userland very interesting. When my machine was
excessively overclocked, I'd see RAM errors at a whole series
of addresses, but only in one or two bits of the 64 bit bus.
These bus errors only slowly disappeared with downclocking.
Never saw them with `memtest86`.
One drawback of my burn* pgms is they are hogs and put enormous
pressure on VM. A kernel module could test only idle pages, and
even copy/move pages if they hadn't been tested. Considerably more
friendly as a background task.
-- Robert author `cpuburn` http://pages.sbcglobal.net/redelm
And thank you for those tools indeed! I found `burnK7' in particular
very useful when dealing with an undercooled CPU which the repair shop
insisted was cooled. I whipped up a tiny boot disk with /sbin/init
replaced with a stub which ran burnK7, then looped forever, printing out
periods, one per second. On the problem box, this froze in a couple of
seconds.
I handed the repairman the disk and told him that I'd consider the
problem gone when the machine *kept* printing dots for several minutes.
One week later, it came back, fixed. :)
--
`End users are just test loads for verifying that the system works, kind of
like resistors in an electrical circuit.' - Kaz Kylheku in c.o.l.d.s
Is burnK7 a memory testing program intended for
Athlon based systems? Maybe I should try it. The
system I'm having a little problem with is using
an Athlon XP 2200+ CPU.
When my kmemtest had been running for 4 days it
had found 22 errors. All of them were on different
pages, I don't know if that will tell me anything.
Guess it is about time I repeat the test with just
one module at a time.
BTW: What should an #ifdef look like to tell if I'm
compiling for 2.4 or 2.6? At the moment I need my
kmemtest to work on both, and it is only a few
lines that need to differ.
No, burnK7 is a CPU-intensive test designed to load up the
execution units for maximum current consumption and heat
production. The folks at AMD have told me they find it useful.
For testing RAM, I'd still recommend `burnMMX` which uses the
MMX unit for high bandwidth transfers. It is a little out-of-date
for modern CPUs, and I really need to make another release (but
I _hate_ doing MS-win32 compiles even with the excellent mingw).
> Maybe I should try it. The system I'm having
> a little problem with is using an Athlon XP 2200+ CPU.
Definitely. There are too many combinations for the mfrs to
test, and I consider it absolutely critical to run hardware
acceptance tests. That way I have confidence in the hardware,
and can blame faults on software!
> When my kmemtest had been running for 4 days it had found
> 22 errors. All of them were on different pages, I don't
> know if that will tell me anything.
Sounds like bus errors. PSU or memory modules as combined/ordered.
> Guess it is about time I repeat the test with just one
> module at a time.
That would be a start. Perhaps burnMMX can generate errors
quicker for you.
> BTW: What should an #ifdef look like to tell if I'm compiling
> for 2.4 or 2.6? At the moment I need my kmemtest to work
> on both, and it is only a few lines that need to differ.
I don't know since I avoid `c`. You could try passing
it as a gcc option as -D26 .
-- Robert
How did Windows come into this thread? Aren't we
talking about Linux user mode programs?
>
> > When my kmemtest had been running for 4 days it had found
> > 22 errors. All of them were on different pages, I don't
> > know if that will tell me anything.
>
> Sounds like bus errors. PSU or memory modules as combined/ordered.
Eight months ago when I returned from Switzerland
I had the computer running memtest86 for a few
days without showing any errors. Though that test
was without any harddisks connected. Since then I
have added an extra memory module.
>
> > BTW: What should an #ifdef look like to tell if I'm compiling
> > for 2.4 or 2.6? At the moment I need my kmemtest to work
> > on both, and it is only a few lines that need to differ.
>
> I don't know since I avoid `c`. You could try passing
> it as a gcc option as -D26 .
Well when you do a Makefile for a 2.6 module you
don't call gcc directly, so that would be quite
tricky. I could of course add a define when
compiling for 2.4, but I'd rather use some official
approach, I'm pretty sure there is some. Well, I
should go hunting in the source.
The burn* programs are a tiny bit of OS-independent assembler.
Yes, but I get too much demand for MS-win32 executables
to just ignore :(
> Eight months ago when I returned from Switzerland I had the
> computer running memtest86 for a few days without showing
> any errors. Though that test was without any harddisks
> connected. Since then I have added an extra memory module.
The extra module most definitely can change bus loading.
I like memtest86, but it isn't really high bandwidth
or severe. It is very thorough.
You may also be getting disk errors. I test by looping
`md5sum /dev/hdunmounted >>log` for a day or two.
> Well when you do a Makefile for a 2.6 module you don't
> call gcc directly, so that would be quite tricky. I could
> of course add a define when compiling for 2.4, but I'd
> rather use some official approach, I'm pretty sure there
> is some. Well, I should go hunting in the source.
If this is a kernel-tree makefile, then you can use the variable
$(PATCHLEVEL) defined at the top of /usr/src/linux/Makefile.
If it is your own makefile and you're compiling for that machine,
you can do something with `uname -r`.
-- Robert
If that produces incorrect checksums, how would you
know where the error had happened? Doing a memory
test in a kernel module is supposed to eliminate the
possibility of getting errors caused by bits being
flipped when being written to disk and later being
read back again. So the errors I find should only
be in the memory (or possibly cache or CPU).
>
> > Well when you do a Makefile for a 2.6 module you don't
> > call gcc directly, so that would be quite tricky. I could
> > of course add a define when compiling for 2.4, but I'd
> > rather use some official approach, I'm pretty sure there
> > is some. Well, I should go hunting in the source.
>
> If this is a kernel-tree makefile, then you can use the variable
> $(PATCHLEVEL) defined at the top of /usr/src/linux/Makefile.
> If it is your own makefile and you're compiling for that machine,
> you can do something with `uname -r`.
Oh, I already found the defines I needed. Now it
looks like this in my module source:
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
And I do use uname -r in my Makefile. (Currently
my Makefile is only for 2.6. For 2.4 you can
build the module without using a Makefile, and I
found that easier than hacking up a Makefile that
works in both cases).
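For the record, the guard expands to something like this (a sketch; the
branch contents are just hypothetical placeholders, not my actual code):

```c
#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
/* 2.6-only code here, e.g. module_param() */
#else
/* 2.4-only code here, e.g. MODULE_PARM() */
#endif
```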
Easy. You _progressively_ validate the system by
testing outwards from proven components:
1) Test CPU, cooling, PSU & mobo PS with `burnK7`
2) Test L1 with `burnMMX F`
3) Test L2 with `burnMMX H`
4) Test Northbridge, RAM busses & 3.3V PSU with `burnMMX P`
5) Test RAM cells with `memtest86`
6) Test HD (& 5V PSU) with looping md5sums
Obviously any early failure stops the series. Failure
isolation is important, and one reason I leave `cpuburn`
as a collection of pesky small pgms.