
2.4.16 memory badness (reproducible)


Leigh Orf

Dec 8, 2001, 10:39:14 AM

I've been having confounding out-of-memory problems with 2.4.16 on my
1.4GHz Athlon with 1 GB of memory (2 GB of swap). I just caught it in
the act and I think it relates to some of the weirdness others have been
reporting.

I'm running RedHat 7.2. After bootup, it runs a program called updatedb
(slocate -u) which does a lot of file I/O as it indexes all the files on
my hard drives. Following this action, my machine is in a state which
makes many applications give "cannot allocate memory" errors. It seems
the kernel is not freeing up buffered or cached memory, and even more
troubling is the fact that it isn't using any of my swap space.

Here is the state of the machine after updatedb runs:

home[1006]:/home/orf% free
             total       used       free     shared    buffers     cached
Mem:       1029820    1021252       8568          0     471036      90664
-/+ buffers/cache:     459552     570268
Swap:      2064344          0    2064344

home[1003]:/home/orf% cat /proc/meminfo
        total:      used:     free:   shared:  buffers:   cached:
Mem:  1054535680 1045901312   8634368         0 480497664  93954048
Swap: 2113888256          0 2113888256
MemTotal: 1029820 kB
MemFree: 8432 kB
MemShared: 0 kB
Buffers: 469236 kB
Cached: 91752 kB
SwapCached: 0 kB
Active: 383812 kB
Inactive: 229016 kB
HighTotal: 130992 kB
HighFree: 2044 kB
LowTotal: 898828 kB
LowFree: 6388 kB
SwapTotal: 2064344 kB
SwapFree: 2064344 kB

home[1005]:/home/orf% cat /proc/slabinfo
slabinfo - version: 1.1
kmem_cache 65 68 112 2 2 1
ip_conntrack 9 50 384 4 5 1
nfs_write_data 0 0 384 0 0 1
nfs_read_data 0 0 384 0 0 1
nfs_page 0 0 128 0 0 1
ip_fib_hash 10 112 32 1 1 1
urb_priv 0 0 64 0 0 1
clip_arp_cache 0 0 128 0 0 1
ip_mrt_cache 0 0 128 0 0 1
tcp_tw_bucket 0 0 128 0 0 1
tcp_bind_bucket 8 112 32 1 1 1
tcp_open_request 0 0 128 0 0 1
inet_peer_cache 4 59 64 1 1 1
ip_dst_cache 27 40 192 2 2 1
arp_cache 3 30 128 1 1 1
blkdev_requests 640 660 128 22 22 1
journal_head 0 0 48 0 0 1
revoke_table 0 0 12 0 0 1
revoke_record 0 0 32 0 0 1
dnotify cache 0 0 20 0 0 1
file lock cache 2 42 92 1 1 1
fasync cache 2 202 16 1 1 1
uid_cache 5 112 32 1 1 1
skbuff_head_cache 327 340 192 17 17 1
sock 188 198 1280 66 66 1
sigqueue 2 29 132 1 1 1
cdev_cache 2313 2360 64 40 40 1
bdev_cache 8 59 64 1 1 1
mnt_cache 19 59 64 1 1 1
inode_cache 439584 439586 512 62798 62798 1
dentry_cache 454136 454200 128 15140 15140 1
dquot 0 0 128 0 0 1
filp 1471 1500 128 50 50 1
names_cache 0 2 4096 0 2 1
buffer_head 144413 173280 128 5776 5776 1
mm_struct 57 80 192 4 4 1
vm_area_struct 2325 2760 128 92 92 1
fs_cache 56 118 64 2 2 1
files_cache 56 72 448 8 8 1
signal_act 64 72 1344 24 24 1
size-131072(DMA) 0 0 131072 0 0 32
size-131072 0 0 131072 0 0 32
size-65536(DMA) 0 0 65536 0 0 16
size-65536 1 1 65536 1 1 16
size-32768(DMA) 0 0 32768 0 0 8
size-32768 1 1 32768 1 1 8
size-16384(DMA) 0 0 16384 0 0 4
size-16384 1 1 16384 1 1 4
size-8192(DMA) 0 0 8192 0 0 2
size-8192 4 4 8192 4 4 2
size-4096(DMA) 0 0 4096 0 0 1
size-4096 64 68 4096 64 68 1
size-2048(DMA) 0 0 2048 0 0 1
size-2048 52 66 2048 27 33 1
size-1024(DMA) 0 0 1024 0 0 1
size-1024 11042 11048 1024 2762 2762 1
size-512(DMA) 0 0 512 0 0 1
size-512 12004 12016 512 1501 1502 1
size-256(DMA) 0 0 256 0 0 1
size-256 1678 1695 256 113 113 1
size-128(DMA) 2 30 128 1 1 1
size-128 29398 29430 128 980 981 1
size-64(DMA) 0 0 64 0 0 1
size-64 7954 7965 64 135 135 1
size-32(DMA) 34 59 64 1 1 1
size-32 66711 66729 64 1131 1131 1
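
(Added aside, not from the original post: a back-of-the-envelope reading
of the slabinfo above. The counts below are copied from the inode_cache,
dentry_cache and buffer_head lines; 4KB pages and one page per slab are
assumed from the last column.)

/* Added sketch: memory footprint of the three biggest slab caches above. */
#include <stdio.h>

int main(void)
{
    const unsigned long page = 4096;
    unsigned long inode  = 62798 * page;   /* inode_cache  slab pages */
    unsigned long dentry = 15140 * page;   /* dentry_cache slab pages */
    unsigned long bufhd  =  5776 * page;   /* buffer_head  slab pages */

    printf("inode_cache:  %lu MB\n", inode  >> 20);   /* ~245 MB */
    printf("dentry_cache: %lu MB\n", dentry >> 20);   /* ~59 MB  */
    printf("buffer_head:  %lu MB\n", bufhd  >> 20);   /* ~22 MB  */
    printf("total:        %lu MB\n", (inode + dentry + bufhd) >> 20);
    return 0;
}

Together with the ~460MB of buffers shown by free, that accounts for the
bulk of the ~878MB of low memory reported in meminfo above.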

Now, I try to run a common application:

home[1031]:/home/orf% xmms
Memory fault

Strace on xmms shows:

home[1008]:/home/orf/memfuck% cat xmms.strace
[snip]
modify_ldt(0x1, 0xbffff1fc, 0x10) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

Also, from my syslog (I have an ntfs partition):

Dec 8 09:55:01 orp kernel: NTFS: ntfs_insert_run: ntfs_vmalloc(new_size = 0x1000) failed
Dec 8 09:55:01 orp kernel: NTFS: ntfs_process_runs: ntfs_insert_run failed
Dec 8 09:55:01 orp kernel: NTFS: ntfs_getdir_unsorted(): Read failed. Returning error code -95.
Dec 8 09:55:01 orp kernel: NTFS: ntfs_insert_run: ntfs_vmalloc(new_size = 0x1000) failed
Dec 8 09:55:01 orp kernel: NTFS: ntfs_process_runs: ntfs_insert_run failed
Dec 8 09:55:01 orp kernel: NTFS: ntfs_getdir_unsorted(): Read failed. Returning error code -95.
Dec 8 09:55:01 orp kernel: NTFS: ntfs_insert_run: ntfs_vmalloc(new_size = 0x1000) failed
Dec 8 09:55:01 orp kernel: NTFS: ntfs_process_runs: ntfs_insert_run failed
Dec 8 09:55:01 orp kernel: NTFS: ntfs_insert_run: ntfs_vmalloc(new_size = 0x1000) failed
Dec 8 09:55:01 orp kernel: NTFS: ntfs_process_runs: ntfs_insert_run failed

The program nautilus, which is involved with the GNOME windowing stuff,
also complains that it can't allocate memory if I log into the console
after updatedb has run (that's what clued me into this problem in the
first place).

The only way I can find to make the system usable is to run an
application which aggressively recovers some of this buffered/cached
memory, and quit it. One easy way to do this:

home[1014]:/home/orf% lmdd opat=1 count=1 bs=900m

After I do this, much free memory is available.
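
(Added sketch, on the assumption that the essential effect of that lmdd
command is to allocate and dirty a ~900MB buffer: touching every page of
a large anonymous allocation forces the VM to reclaim buffered/cached
memory. Something like:)

/* Added sketch of the workaround: dirty a huge anonymous buffer so the
 * VM must reclaim buffer/page cache, then exit and give it all back. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t len = 900UL << 20;       /* ~900MB, as in bs=900m */
    char *buf = malloc(len);

    if (!buf)
        return 1;
    memset(buf, 1, len);            /* touch every page */
    free(buf);                      /* freed again on exit */
    return 0;
}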

Some applications are able to "reclaim" the buffered/cached memory,
while others aren't. Netscape, for instance, runs without a problem
after updatedb has run.

This is a pretty serious problem. Interestingly enough, it does NOT
occur on my other machine, running the same kernel and RH 7.2, with
256 MB of memory and 512 MB of swap.

Leigh Orf


Ken Brownfield

Dec 8, 2001, 10:56:20 AM
This parallels what I'm seeing -- perhaps inode/dentry cache bloat is
causing the memory issue (which mimics if not _is_ a memory leak) _and_
my kswapd thrashing? It fits both the situation you report and what I'm
seeing with I/O across a large number of files (inodes) -- updatedb,
smb, NFS, etc.

I think Andrea was on to this issue, so I'm hoping his work will help.
Have you tried an -aa kernel or an aa patch onto a 2.4.17-pre4 to see
how the kernel's behavior changes?

--
Ken.
brow...@irridia.com

Leigh Orf

Dec 8, 2001, 1:54:17 PM

Ken Brownfield wrote:

| This parallels what I'm seeing -- perhaps inode/dentry cache
| bloat is causing the memory issue (which mimics if not _is_
| a memory leak) _and_ my kswapd thrashing? It fits both the
| situation you report and what I'm seeing with I/O across a
| large number of files (inodes) -- updatedb, smb, NFS, etc.
|
| I think Andrea was on to this issue, so I'm hoping his work
| will help. Have you tried an -aa kernel or an aa patch onto
| a 2.4.17-pre4 to see how the kernel's behavior changes?
|
| --
| Ken.
| brow...@irridia.com

I get the exact same behavior with 2.4.17-pre4-aa1 - many applications
abort with ENOMEM after updatedb (which fills the buffers and cache). Is
there another kernel/patch I should try?

Andrew Morton

Dec 8, 2001, 2:41:22 PM
Leigh Orf wrote:
>
> Ken Brownfield wrote:
>
> | This parallels what I'm seeing -- perhaps inode/dentry cache
> | bloat is causing the memory issue (which mimics if not _is_
> | a memory leak) _and_ my kswapd thrashing? It fits both the
> | situation you report and what I'm seeing with I/O across a
> | large number of files (inodes) -- updatedb, smb, NFS, etc.
> |
> | I think Andrea was on to this issue, so I'm hoping his work
> | will help. Have you tried an -aa kernel or an aa patch onto
> | a 2.4.17-pre4 to see how the kernel's behavior changes?
> |
> | --
> | Ken.
> | brow...@irridia.com
>
> I get the exact same behavior with 2.4.17-pre4-aa1 - many applications
> abort with ENOMEM after updatedb (which fills the buffers and cache). Is
> there another kernel/patch I should try?
>

Just for interest's sake:

--- linux-2.4.17-pre6/mm/memory.c Fri Dec 7 15:39:52 2001
+++ linux-akpm/mm/memory.c Sat Dec 8 11:13:30 2001
@@ -1184,6 +1184,7 @@ static int do_anonymous_page(struct mm_s
 		flush_page_to_ram(page);
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
 		lru_cache_add(page);
+		activate_page(page);
 	}
 
 	set_pte(page_table, entry);

Leigh Orf

Dec 8, 2001, 3:04:40 PM

No change - identical behavior.

Leigh Orf

Leigh Orf

Dec 8, 2001, 4:42:10 PM

I've noticed a couple more things about the memory allocation problem
that appears when the buffers and cache grow large. Some applications
will fail with ENOMEM *even if* there is a considerable amount (say,
62 MB as below) of "truly" free memory.

The second thing I've noticed is that all these apps that die with
ENOMEM have pretty much the same strace output towards the end. What is
strange is that "display *.tif" dies while "ee *.tif" and "gimp *.tif"
do not. Piping the strace output of commands that *don't* trigger this
behavior through grep shows that modify_ldt is *not* being called by the
apps that *don't* die.

So I don't know if it's a symptom or a cause, but modify_ldt seems to be
triggering the problem. Not being a kernel hacker, I leave the analysis
of this to those who are.
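
(Added reproducer sketch, mimicking the failing modify_ldt(1, ..., 0x10)
call from the straces below; the 16-byte struct here is my own stand-in
for the descriptor layout in asm/ldt.h. On a healthy machine the call
should succeed.)

/* Added sketch: issue the same modify_ldt write the dying apps make.
 * In the bad state this should fail with ENOMEM before any real
 * application code is involved. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct ldt_entry {                  /* 16 bytes, mirroring asm/ldt.h */
    unsigned int entry_number;
    unsigned int base_addr;
    unsigned int limit;
    unsigned int flags;             /* seg_32bit, contents, ... packed */
};

int main(void)
{
    struct ldt_entry e;

    memset(&e, 0, sizeof(e));       /* writing entry 0 is enough to */
                                    /* force allocation of the LDT  */
    if (syscall(SYS_modify_ldt, 1, &e, sizeof(e)) == -1) {
        perror("modify_ldt");       /* expect ENOMEM in the bad state */
        return 1;
    }
    puts("modify_ldt succeeded");
    return 0;
}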

Leigh Orf

home[1029]:/home/orf% free
             total       used       free     shared    buffers     cached
Mem:       1029772     967096      62676          0     443988      98312
-/+ buffers/cache:     424796     604976
Swap:      2064344          0    2064344

home[1026]:/home/orf% strace xmms 2>&1 | tail
old_mmap(NULL, 1291080, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40316000
mprotect(0x40448000, 37704, PROT_NONE) = 0
old_mmap(0x40448000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x131000) = 0x40448000
old_mmap(0x4044e000, 13128, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4044e000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40452000
munmap(0x40018000, 72129) = 0
modify_ldt(0x1, 0xbffff1fc, 0x10) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

home[1027]:/home/orf% strace nautilus 2>&1 | tail
old_mmap(NULL, 1291080, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40958000
mprotect(0x40a8a000, 37704, PROT_NONE) = 0
old_mmap(0x40a8a000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x131000) = 0x40a8a000
old_mmap(0x40a90000, 13128, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40a90000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40a94000
munmap(0x40018000, 72129) = 0
modify_ldt(0x1, 0xbffff1fc, 0x10) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

home[1028]:/home/orf% strace display *.tif 2>&1 | tail
old_mmap(NULL, 1291080, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x404ff000
mprotect(0x40631000, 37704, PROT_NONE) = 0
old_mmap(0x40631000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x131000) = 0x40631000
old_mmap(0x40637000, 13128, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40637000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4063b000
munmap(0x401a8000, 72129) = 0
modify_ldt(0x1, 0xbfffefac, 0x10) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

Leigh Orf

Dec 8, 2001, 5:24:25 PM

More clues...

The only way I can seem to bring the machine back to being totally
normal after the buffers/cache fill up is to force some swap to be
written, such as by doing

lmdd opat=1 count=1 bs=900m

If I do

lmdd opat=1 count=1 bs=500m

about 500MB of memory is freed but no swap is written, and modify_ldt
still returns ENOMEM if I run xmms, display, etc.

It looks like the problem is somewhere in vmalloc, since that's what
returns a null pointer where ENOMEM gets set in arch/i386/kernel/ldt.c.
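
(For reference, the failing path is roughly of this shape -- paraphrased
from 2.4's write_ldt(), not an exact quote. The first LDT write for a
process vmalloc()s the whole 64KB LDT, and a NULL return there is what
surfaces as ENOMEM:)

/* Paraphrase of the allocation in arch/i386/kernel/ldt.c (2.4): */
if (!mm->context.segments) {
        void *segments = vmalloc(LDT_ENTRIES * LDT_ENTRY_SIZE); /* 8192*8 = 64KB */
        error = -ENOMEM;
        if (!segments)
                goto out_unlock;        /* becomes modify_ldt() == -ENOMEM */
        memset(segments, 0, LDT_ENTRIES * LDT_ENTRY_SIZE);
        mm->context.segments = segments;
}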

BTW I have been running kernels with

CONFIG_HIGHMEM4G=y
CONFIG_HIGHMEM=y

I am compiling a kernel with

CONFIG_NOHIGHMEM=y

and will see if the bad memory behavior continues.

Leigh Orf

Leigh Orf wrote:

| I've noticed a couple more things about the memory allocation
| problem that appears when the buffers and cache grow large.
| Some applications will fail with ENOMEM *even if* there is a
| considerable amount (say, 62 MB as below) of "truly" free
| memory.
|
| The second thing I've noticed is that all these apps that die
| with ENOMEM have pretty much the same strace output towards
| the end. What is strange is that "display *.tif" dies while
| "ee *.tif" and "gimp *.tif" do not. Piping the strace output
| of commands that *don't* trigger this behavior through grep
| shows that modify_ldt is *not* being called by the apps that
| *don't* die.
|
| So I don't know if it's a symptom or a cause, but modify_ldt
| seems to be triggering the problem. Not being a kernel
| hacker, I leave the analysis of this to those who are.
|
| Leigh Orf

Hugh Dickins

Dec 11, 2001, 2:07:41 PM
On Sat, 8 Dec 2001, Leigh Orf wrote:
>
> So I don't know if it's a symptom or a cause, but modify_ldt seems to be
> triggering the problem. Not being a kernel hacker, I leave the analysis
> of this to those who are.
>
> home[1029]:/home/orf% free
> total used free shared buffers cached
> Mem: 1029772 967096 62676 0 443988 98312
> -/+ buffers/cache: 424796 604976
> Swap: 2064344 0 2064344
>
> modify_ldt(0x1, 0xbffff1fc, 0x10) = -1 ENOMEM (Cannot allocate memory)

I believe this error comes, not from a (genuine or mistaken) shortage
of free memory, but from shortage or fragmentation of vmalloc's virtual
address space. Does the patch below (against 2.4.17-pre4-aa1 since I
think that's what you tried last; easily adaptable to other trees),
which doubles vmalloc's address space (on your 1GB machine or larger),
make any difference?
Perhaps there's a vmalloc leak and this will only delay the error.

Hugh

--- 1704aa1/arch/i386/kernel/setup.c Tue Dec 11 15:22:53 2001
+++ linux/arch/i386/kernel/setup.c Tue Dec 11 19:01:37 2001
@@ -835,7 +835,7 @@
 /*
  * 128MB for vmalloc and initrd
  */
-#define VMALLOC_RESERVE	(unsigned long)(128 << 20)
+#define VMALLOC_RESERVE	(unsigned long)(256 << 20)
 #define MAXMEM	(unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE)
 #ifdef CONFIG_HIGHMEM_EMULATION
 #define ORDER_DOWN(x)	((x >> (MAX_ORDER-1)) << (MAX_ORDER-1))
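
(Added illustration of the arithmetic behind this, using the 2.4 i386
defaults quoted in the patch: the kernel has 1GB of virtual space above
PAGE_OFFSET, split between the direct map and vmalloc. A 1GB box fills
MAXMEM = 896MB with the direct map, leaving only the reserve for
vmalloc, while a 256MB box leaves ~768MB of spare virtual space --
which is also why machines with less memory never see the failure.)

/* Added sketch of the i386 lowmem/vmalloc split (assumes the 2.4
 * defaults above: PAGE_OFFSET = 0xC0000000, so 1GB of kernel virtual
 * space, and VMALLOC_RESERVE = 128MB). */
#include <stdio.h>

int main(void)
{
    unsigned long kernel_va   = 1024UL << 20;   /* 4GB - PAGE_OFFSET */
    unsigned long vmalloc_res =  128UL << 20;   /* VMALLOC_RESERVE   */
    unsigned long maxmem      = kernel_va - vmalloc_res;

    printf("MAXMEM            = %lu MB\n", maxmem >> 20);       /* 896 */
    printf("vmalloc, 1GB box  = %lu MB\n", vmalloc_res >> 20);  /* 128 */
    printf("vmalloc, 256MB box= %lu MB\n",
           (kernel_va - (256UL << 20)) >> 20);                  /* 768 */
    return 0;
}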

Stephan von Krawczynski

Dec 11, 2001, 3:04:12 PM
On Tue, 11 Dec 2001 19:07:41 +0000 (GMT)
Hugh Dickins <hu...@veritas.com> wrote:

> I believe this error comes, not from a (genuine or mistaken) shortage
> of free memory,

Me, too.

> but from shortage or fragmentation of vmalloc's virtual
> address space. Does the patch below (against 2.4.17-pre4-aa1 since I
> think that's what you tried last; easily adaptable to other trees),
> which doubles vmalloc's address space (on your 1GB machine or larger),
> make any difference?
> Perhaps there's a vmalloc leak and this will only delay the error.

At least I think this direction for hunting the bug looks a lot more
promising than a general memory shortage. After reviewing modify_ldt,
this looked like the only usable lead on Leigh's problem.

Regards,
Stephan

Andrea Arcangeli

Dec 11, 2001, 5:59:08 PM
On Tue, Dec 11, 2001 at 07:07:41PM +0000, Hugh Dickins wrote:
> On Sat, 8 Dec 2001, Leigh Orf wrote:
> >
> > So I don't know if it's a symptom or a cause, but modify_ldt seems to be
> > triggering the problem. Not being a kernel hacker, I leave the analysis
> > of this to those who are.
> >
> > home[1029]:/home/orf% free
> >              total       used       free     shared    buffers     cached
> > Mem:       1029772     967096      62676          0     443988      98312
> > -/+ buffers/cache:     424796     604976
> > Swap:      2064344          0    2064344
> >
> > modify_ldt(0x1, 0xbffff1fc, 0x10) = -1 ENOMEM (Cannot allocate memory)
>
> I believe this error comes, not from a (genuine or mistaken) shortage
> of free memory, but from shortage or fragmentation of vmalloc's virtual

Definitely agreed. This is the same thing I was wondering about just now
while reading his report.

He always gets vmalloc failures, which is way too suspicious. If the VM
memory balancing were the culprit, he should see failures in all the
other allocations too. So it has to be a problem with a shortage of the
address space available to vmalloc, not a problem with the page
allocator.

> address space. Does the patch below (against 2.4.17-pre4-aa1 since I
> think that's what you tried last; easily adaptable to other trees),
> which doubles vmalloc's address space (on your 1GB machine or larger),
> make any difference?
> Perhaps there's a vmalloc leak and this will only delay the error.
>

> Hugh
>
> --- 1704aa1/arch/i386/kernel/setup.c Tue Dec 11 15:22:53 2001
> +++ linux/arch/i386/kernel/setup.c Tue Dec 11 19:01:37 2001
> @@ -835,7 +835,7 @@
>  /*
>   * 128MB for vmalloc and initrd
>   */
> -#define VMALLOC_RESERVE	(unsigned long)(128 << 20)
> +#define VMALLOC_RESERVE	(unsigned long)(256 << 20)
>  #define MAXMEM	(unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE)
>  #ifdef CONFIG_HIGHMEM_EMULATION
>  #define ORDER_DOWN(x)	((x >> (MAX_ORDER-1)) << (MAX_ORDER-1))

Yes, this will tend to hide it.

Even better would be to change fs/ntfs/* to avoid using vmalloc for tons
of little pieces. It's not only a matter of wasting direct-mapped
address space; it's also a matter of running fast, mainly on SMP with
the IPIs for the TLB flushes...

attr.c:233:     new = ntfs_vmalloc(new_size);
attr.c:235:     ntfs_error("ntfs_insert_run: ntfs_vmalloc(new_size = "
attr.c:458:     rlt = ntfs_vmalloc(rl_size);
inode.c:1297:   rl = ntfs_vmalloc(rlen << sizeof(ntfs_runlist));
inode.c:1638:   rlt = ntfs_vmalloc(rl_size);
inode.c:1942:   rl2 = ntfs_vmalloc(rl2_size);
inode.c:2006:   rlt = ntfs_vmalloc(rl_size);
super.c:810:    rlt = ntfs_vmalloc(rlsize);
super.c:1335:   buf = ntfs_vmalloc(buf_size);
support.h:29:   #include <linux/vmalloc.h>
support.h:35:   #define ntfs_vmalloc(size) vmalloc_32(size)


In short there are three solutions available:

1) don't use ntfs
2) fix ntfs (see the sketch after this list)
3) enlarge the vmalloc address space with the above patch, but this won't
   be a final solution because you'll overflow the vmalloc address space
   again once the number of files in your fs doubles
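
(Added sketch of option 2, hypothetical and not Anton's actual fix:
route small allocations to kmalloc and keep vmalloc_32() only for
buffers that genuinely need it, so thousands of tiny runlists stop
eating vmalloc address space.)

/* Hypothetical replacement for ntfs_vmalloc() in fs/ntfs/support.h:
 * kmalloc for anything that fits in a page, vmalloc_32() otherwise.
 * Callers must free with the matching helper. */
#include <linux/slab.h>
#include <linux/vmalloc.h>

static inline void *ntfs_alloc(size_t size)
{
        if (size <= PAGE_SIZE)
                return kmalloc(size, GFP_KERNEL);
        return vmalloc_32(size);
}

static inline void ntfs_free(void *ptr, size_t size)
{
        if (size <= PAGE_SIZE)
                kfree(ptr);
        else
                vfree(ptr);
}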

So I'd redirect this report to Anton Altaparmakov <ai...@cam.ac.uk>, and
I still have no VM bug report pending on my part.

thanks,

Andrea

Leigh Orf

Dec 12, 2001, 9:51:00 AM

Andrea,

I disabled ntfs and, as you suspected, my problem went away. This was
true for both 2.4.16 and 2.4.17-pre4-aa1.

Thanks a lot,

Leigh Orf

Damir C.

Dec 17, 2001, 4:50:31 AM
Leigh Orf <o...@mailbag.com> wrote in message news:<linux.kernel.200112...@orp.orf.cx>...

> I've been having confounding out-of-memory problems with 2.4.16 on my
> 1.4GHz Athlon with 1 GB of memory (2 GB of swap). I just caught it in
> the act and I think it relates to some of the weirdness others have been
> reporting.

I guess this is the right thread for me, since I have very similar
problems. I run Slackware Linux 8.0 (well, heavily modified ;)) with
kernel 2.4.16 on a Celeron 850, Chaintech 6BTM (Intel 440BX chipset)
motherboard, 128 MB RAM, 2*128 MB swap partitions, and a 20 GB Maxtor
hard drive (MAXTOR 4K020H1), and my machine starts oopsing each night at
04:40 (when updatedb runs!). Oddly enough, I have run updatedb by hand
several times during the day while compiling the kernel in another VC,
but I couldn't reproduce the oops. I do notice that my machine is very
reluctant to go into swap, but the buffer numbers climb very high (60-65
MB).

I have ruled out a hardware error because:
- I exchanged the motherboard two times (2 * different QDI Advance 10T
motherboards (VIA chipset), now the Chaintech 6BTM (Intel 440 BX) -
which worked without problems for 3 years in my desktop)
- I exchanged the RAM two times (3 different manufacturers, the one
that is installed now has been used on the same Chaintech board for
the last 2 years - and I had no memory related crashes)
- I exchanged the hard drive once (albeit for the same Maxtor type)
- I exchanged the IDE cables
- I exchanged the case and the power supply (to rule out an odd power
supply as the root of all evil)
- I exchanged the processor once (Celeron 800, now Celeron 850)
- I exchanged the network cards once (2 * Cnet 200WL Pro (DM9102,
module dmfe))
- I exchanged the video card once (NVidia Vanta LT)

So I basically completely changed the machine, but the problem
persists (yes, I did a fresh reinstall to rule out filesystem
corruption from the first install).

What strikes me as very odd is the fact that all the crashes appear in
exactly the same manner in the logs:

This one is from the day before yesterday:

Dec 15 04:40:51 gw kernel: invalid operand: 0000
Dec 15 04:40:51 gw kernel: CPU:    0
Dec 15 04:40:51 gw kernel: EIP:    0010:[rmqueue+217/424]    Not tainted
Dec 15 04:40:51 gw kernel: EFLAGS: 00010096
Dec 15 04:40:51 gw kernel: eax: c01fc0c0   ebx: c01fc0fc   ecx: 00000004   edx: c1200200
Dec 15 04:40:51 gw kernel: esi: c1000600   edi: 00000003   ebp: 00000008   esp: c4ab7ea8
Dec 15 04:40:51 gw kernel: ds: 0018   es: 0018   ss: 0018
Dec 15 04:40:51 gw kernel: Process find (pid: 7588, stackpage=c4ab7000)
Dec 15 04:40:51 gw kernel: Stack: c01fc2e0 00000201 00000000 00000000 00000018 00000286 00000000 c01fc0c0
Dec 15 04:40:51 gw kernel:        c0128cef 000001d2 c1251758 00000000 c0273a90 c01fc168 c01fc2d8 000001d2
Dec 15 04:40:51 gw kernel:        c01248f5 c0128b52 00000000 c012490c 00000000 00000000 00000000 00000000
Dec 15 04:40:51 gw kernel: Call Trace: [__alloc_pages+51/356] [read_cache_page+61/276] [_alloc_pages+22/24] [read_cache_page+84/276] [ext2_get_page+31/120]
Dec 15 04:40:51 gw kernel:    [ext2_readpage+0/20] [ext2_readdir+235/512] [vfs_readdir+91/124] [filldir64+0/364] [sys_getdents64+79/276] [filldir64+0/364]
Dec 15 04:40:51 gw kernel:    [sys_fcntl64+127/136] [system_call+51/56]
Dec 15 04:40:51 gw kernel:
Dec 15 04:40:51 gw kernel: Code: 0f 0b 83 c3 f4 8b 03 89 70 04 89 06 89 5e 04 89 33 8b 44 24

And this one is from today:

Dec 17 04:41:11 gw kernel: invalid operand: 0000
Dec 17 04:41:11 gw kernel: CPU:    0
Dec 17 04:41:11 gw kernel: EIP:    0010:[rmqueue+217/424]    Not tainted
Dec 17 04:41:11 gw kernel: EFLAGS: 00010012
Dec 17 04:41:11 gw kernel: eax: c01f45c0   ebx: c01f45fc   ecx: 00000004   edx: c1200200
Dec 17 04:41:11 gw kernel: esi: c1000600   edi: 00000003   ebp: 00000008   esp: c5245ea8
Dec 17 04:41:11 gw kernel: ds: 0018   es: 0018   ss: 0018
Dec 17 04:41:11 gw kernel: Process find (pid: 5259, stackpage=c5245000)
Dec 17 04:41:11 gw kernel: Stack: c01f47e0 00000201 00000000 00000000 00000018 00000286 00000000 c01f45c0
Dec 17 04:41:11 gw kernel:        c012531f 000001d2 c1251524 00000000 c02728f0 c01f4668 c01f47d8 000001d2
Dec 17 04:41:11 gw kernel:        c0120f25 c0125182 00000000 c0120f3c 00000000 00000000 00000000 00000000
Dec 17 04:41:11 gw kernel: Call Trace: [__alloc_pages+51/356] [read_cache_page+61/276] [_alloc_pages+22/24] [read_cache_page+84/276] [ext2_get_page+31/120]
Dec 17 04:41:11 gw kernel:    [ext2_readpage+0/20] [ext2_readdir+235/512] [vfs_readdir+91/124] [filldir64+0/364] [sys_getdents64+79/276] [filldir64+0/364]
Dec 17 04:41:11 gw kernel:    [sys_fcntl64+127/136] [system_call+51/56]
Dec 17 04:41:11 gw kernel:
Dec 17 04:41:11 gw kernel: Code: 0f 0b 83 c3 f4 8b 03 89 70 04 89 06 89 5e 04 89 33 8b 44 24

Also from today - after the first oops, I get this each and every time
mrtg runs:

Dec 17 08:35:01 gw kernel: <1>Unable to handle kernel paging request at virtual address 55555559
Dec 17 08:35:01 gw kernel: printing eip:
Dec 17 08:35:01 gw kernel: c0125024
Dec 17 08:35:01 gw kernel: *pde = 00000000
Dec 17 08:35:01 gw kernel: Oops: 0002
Dec 17 08:35:01 gw kernel: CPU:    0
Dec 17 08:35:01 gw kernel: EIP:    0010:[rmqueue+96/424]    Not tainted
Dec 17 08:35:01 gw kernel: EFLAGS: 00010087
Dec 17 08:35:01 gw kernel: eax: 55555555   ebx: c01f47e0   ecx: c01f45c0   edx: c01f45d8
Dec 17 08:35:01 gw kernel: esi: c1000840   edi: 00000000   ebp: c01f45d8   esp: c1809e54
Dec 17 08:35:01 gw kernel: ds: 0018   es: 0018   ss: 0018
Dec 17 08:35:01 gw kernel: Process mrtg (pid: 6747, stackpage=c1809000)
Dec 17 08:35:01 gw kernel: Stack: c01f47e0 00000201 00000001 00000000 0000032a 00000282 00000000 c01f45c0
Dec 17 08:35:01 gw kernel:        c012531f 000001d2 c69260a0 00000001 c6ee9c60 c01f4668 c01f47d8 000001d2
Dec 17 08:35:01 gw kernel:        ffffff00 c0125182 00104025 c011ce7f 083e8004 c69260a0 00000001 c6ee9c60
Dec 17 08:35:01 gw kernel: Call Trace: [__alloc_pages+51/356] [_alloc_pages+22/24] [do_anonymous_page+51/160] [do_no_page+51/268] [handle_mm_fault+82/180]
Dec 17 08:35:01 gw kernel:    [do_page_fault+352/1176] [do_page_fault+0/1176] [do_generic_file_read+997/1008] [do_brk+280/508] [sys_brk+187/228] [error_code+52/60]
Dec 17 08:35:01 gw kernel:
Dec 17 08:35:01 gw kernel: Code: 89 50 04 89 02 8b 44 24 1c 89 f3 2b 98 94 00 00 00 c1 fb 06

However, the day before yesterday the errors produced by mrtg looked
like this:

Dec 15 04:50:01 gw kernel: invalid operand: 0000
Dec 15 04:50:01 gw kernel: CPU:    0
Dec 15 04:50:01 gw kernel: EIP:    0010:[rmqueue+89/424]    Not tainted
Dec 15 04:50:01 gw kernel: EFLAGS: 00010812
Dec 15 04:50:01 gw kernel: eax: c1000800   ebx: c01fc2e0   ecx: c01fc0c0   edx: c01fc0c0
Dec 15 04:50:01 gw kernel: esi: c1000800   edi: 00000005   ebp: c01fc114   esp: c5ef9e54
Dec 15 04:50:01 gw kernel: ds: 0018   es: 0018   ss: 0018
Dec 15 04:50:01 gw kernel: Process mrtg (pid: 7627, stackpage=c5ef9000)
Dec 15 04:50:01 gw kernel: Stack: c01fc2e0 00000201 00000001 00000000 00000220 00000282 00000000 c01fc0c0
Dec 15 04:50:01 gw kernel:        c0128cef 000001d2 c695b140 00000001 c68c9bc0 c01fc168 c01fc2d8 000001d2
Dec 15 04:50:01 gw kernel:        ffffffef c0128b52 00104025 c012084f 083dd03c c695b140 00000001 c68c9bc0
Dec 15 04:50:01 gw kernel: Call Trace: [__alloc_pages+51/356] [_alloc_pages+22/24] [do_anonymous_page+51/160] [do_no_page+51/268] [handle_mm_fault+82/180]
Dec 15 04:50:01 gw kernel:    [do_page_fault+352/1176] [do_page_fault+0/1176] [do_generic_file_read+997/1008] [do_brk+280/508] [sys_brk+187/228] [error_code+52/60]
Dec 15 04:50:01 gw kernel:
Dec 15 04:50:01 gw kernel: Code: 0f 0b 8b 56 04 8b 06 89 50 04 89 02 8b 44 24 1c 89 f3 2b 98

I hope someone can help me with this :( Please? :)

I will downgrade to 2.4.13 for the time being - still happily running
several machines on 2.4.13 without any problems :(

Thanks in advance to anyone who may look into it (or for that matter
even got this far with the message :))

Stay good,
D.

Holger Lubitz

Dec 18, 2001, 9:27:44 AM
Andrea Arcangeli proclaimed:

> He always gets vmalloc failures, which is way too suspicious. If the VM
> memory balancing were the culprit, he should see failures in all the
> other allocations too. So it has to be a problem with a shortage of the
> address space available to vmalloc, not a problem with the page
> allocator.

Leigh pointed me to your post in reply to another thread (modify_ldt
failing on highmem machine).

Is there any special vmalloc handling on highmem kernels? I only run
into the problem if I am using high memory support in the kernel. I
haven't been able to reproduce the problem with 896M or less, which
strikes me as slightly odd. Why does _more_ memory trigger "no memory"
failures?

The problem is indeed not VM specific. The last -ac kernel shows the
problem, too (and that one still has the old VM, doesn't it?)

Holger
