Subtle MM bug


Rik van Riel

Jan 7, 2001, 4:37:06 PM
On 7 Jan 2001, Zlatko Calusic wrote:

> Things go berzerk if you have one big process whose working set
> is around your physical memory size.

"go berzerk" in what way? Does the system cause lots of extra
swap IO and does it make the system thrash where 2.2 didn't
even touch the disk?

> Final effect is that physical memory gets extremely flooded with
> the swap cache pages and at the same time the system absorbs
> ridiculous amount of the swap space.

This is mostly because Linux 2.4 keeps dirty pages in the
swap cache. Under Linux 2.2 a page would be deleted from the
swap cache when a program writes to it, but in Linux 2.4 it
can stay in the swap cache.

Oh, and don't forget that pages in the swap cache can also
be resident in the process, so it's not like the swap cache
is "eating into" the process' RSS ;)

> For instance on my 192MB configuration, firing up the hogmem
> program which allocates let's say 170MB of memory and dirties it
> leads to 215MB of swap used.

So that's 170MB of swap space for hogmem and 45MB for
the other things in the system (daemons, X, ...).

Sounds pretty ok, except maybe for the fact that now
Linux allocates (not uses!) a lot more swap space than
before and some people may need to add some swap space
to their system ...


Now if 2.4 has worse _performance_ than 2.2 due to one
reason or another, that I'd like to hear about ;)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

Zlatko Calusic

Jan 7, 2001, 3:59:37 PM
I'm trying to get more familiar with the MM code in 2.4.0, as can be
seen from lots of questions I have on the subject. I discovered nasty
mm behaviour under even moderate load (2.2 didn't have troubles).

Things go berzerk if you have one big process whose working set is
around your physical memory size. Typical memory hoggers are good
enough to trigger the bad behaviour. Final effect is that physical
memory gets extremely flooded with the swap cache pages and at the
same time the system absorbs ridiculous amount of the swap space.

xmem is as usual very good at detecting this and you just need to
press Alt-SysReq-M to see that most of the memory (e.g. 90%) is
populated with the swap cache pages.

For instance on my 192MB configuration, firing up the hogmem program
which allocates let's say 170MB of memory and dirties it leads to
215MB of swap used. vmstat 1 shows that the pagecache size is
constantly growing - that is the swapcache enlarging, in fact - during
the second pass of the hogmem program.

...
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 1 1 131488 1592 400 62384 4172 5188 1092 1298 353 1447 2 4 94
0 1 1 136584 1592 400 67428 5860 4104 1465 1034 322 1327 3 3 93
0 1 1 141668 1592 388 72536 5504 4420 1376 1106 323 1423 1 3 95
0 1 1 146724 1592 380 77592 5996 4236 1499 1060 335 1096 2 3 94
0 1 1 151876 1600 320 82764 6264 3712 1566 936 327 1226 3 4 93
0 1 1 157016 1600 320 87908 5284 4268 1321 1068 315 1248 1 2 96
1 0 0 157016 1600 308 87792 1836 5168 459 1293 281 1324 3 3 94
0 1 0 162204 1600 304 92892 7784 5236 1946 1315 385 1353 3 5 92
0 1 0 167216 1600 304 97780 3496 5016 874 1256 301 1222 0 2 97
0 1 1 177904 1608 284 108276 5160 5168 1290 1300 330 1453 1 4 94
0 1 2 182008 1588 288 112264 4936 3344 1268 838 293 801 2 3 95
0 2 1 183620 1588 260 114012 3064 1756 830 445 290 846 0 15 85
0 2 2 185384 1596 180 115864 2320 2620 635 658 285 722 1 29 70
0 3 2 187528 1592 220 117892 2488 2224 657 557 273 754 3 30 67
0 4 1 190512 1592 236 120772 2524 3012 725 760 343 1080 1 14 85
0 4 1 195780 1592 240 125868 2336 5316 613 1331 381 1624 2 2 96
1 0 1 200992 1592 248 131052 2080 2176 623 552 234 1044 3 23 74
0 1 0 200996 1592 252 130948 2208 3048 580 762 256 1065 10 10 80
0 1 1 206240 1592 252 136076 2988 5252 760 1314 309 1406 7 4 8
0 2 1 211408 1592 256 141080 5424 5180 1389 1303 395 1885 3 5 91
0 2 0 214744 1592 264 144280 4756 3328 1223 834 327 1211 1 5 95
1 0 0 214868 1592 244 144468 4344 5148 1087 1295 303 1189 11 2 86
0 1 1 214900 1592 248 144496 4360 3244 1098 812 318 1467 7 4 89
0 1 1 214916 1592 248 144520 4280 3452 1070 865 336 1602 3 3 94
0 1 1 214964 1592 248 144580 4972 4184 1243 1054 368 1620 3 5 92
0 2 2 214956 1592 272 144548 3700 4544 1081 1142 665 2952 1 1 98
0 1 0 214992 1592 272 144588 1220 5088 305 1274 282 1363 1 4 95
0 1 1 215012 1592 272 144600 3640 4420 910 1106 325 1579 3 2 9
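The hogmem program itself isn't included in the thread; below is a minimal sketch of what such a hog does (the name `hog` and its shape are my reconstruction, not the original code):

```c
/* Minimal memory-hog sketch in the spirit of the hogmem test above.
 * Hypothetical reconstruction: the original program is not shown in
 * the thread.  Allocates `mb` megabytes and dirties every page,
 * `passes` times, which is what forces the kernel to swap once the
 * working set approaches physical memory. */
#include <stdlib.h>
#include <stddef.h>

/* Returns the total number of bytes written, or 0 on allocation failure. */
static size_t hog(size_t mb, int passes)
{
    size_t sz = mb << 20;
    char *buf = malloc(sz);
    size_t written = 0;

    if (buf == NULL)
        return 0;
    for (int p = 0; p < passes; p++) {
        /* Touch one byte per 4K page: enough to dirty the page, so it
         * must be written to swap if it is evicted later. */
        for (size_t off = 0; off < sz; off += 4096)
            buf[off] = (char)p;
        written += sz;
    }
    free(buf);
    return written;
}
```

Running something like `hog(170, 2)` on a 192MB box while watching `vmstat 1` should reproduce the growing swap and swap-cache numbers above.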

Any thoughts on this?
--
Zlatko

Zlatko Calusic

Jan 7, 2001, 5:33:08 PM
Rik van Riel <ri...@conectiva.com.br> writes:

> On 7 Jan 2001, Zlatko Calusic wrote:
>

> > Things go berzerk if you have one big process whose working set
> > is around your physical memory size.
>

> "go berzerk" in what way? Does the system cause lots of extra
> swap IO and does it make the system thrash where 2.2 didn't
> even touch the disk ?
>

Well, I think yes. I'll do some testing on 2.2 before I can tell
you for sure, but the system is definitely behaving badly where I
think it should not.

> > Final effect is that physical memory gets extremely flooded with
> > the swap cache pages and at the same time the system absorbs
> > ridiculous amount of the swap space.
>

> This is mostly because Linux 2.4 keeps dirty pages in the
> swap cache. Under Linux 2.2 a page would be deleted from the
> swap cache when a program writes to it, but in Linux 2.4 it
> can stay in the swap cache.
>

OK, I can buy that.

> Oh, and don't forget that pages in the swap cache can also
> be resident in the process, so it's not like the swap cache
> is "eating into" the process' RSS ;)
>

So far so good... A little bit weird but not alarming per se.

> > For instance on my 192MB configuration, firing up the hogmem
> > program which allocates let's say 170MB of memory and dirties it
> > leads to 215MB of swap used.
>

> So that's 170MB of swap space for hogmem and 45MB for
> the other things in the system (daemons, X, ...).
>

Yes, that's it. So it looks like all of my processes are on the
swap. That can't be good. I mean, even Solaris (known to eat swap
space like there's no tomorrow :)) would probably be more polite.

> Sounds pretty ok, except maybe for the fact that now
> Linux allocates (not uses!) a lot more swap space then
> before and some people may need to add some swap space
> to their system ...
>

Yes, I would say really a lot more. Big difference.

Also, I don't see a difference between allocated and used swap space on
Linux. Could you elaborate on that?

>
> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
>

I'll get back to you later with more data. Time to boot 2.2. :)

Wayne Whitney

Jan 8, 2001, 12:29:29 AM

On Sunday, January 7, 2001, Rik van Riel <ri...@conectiva.com.br> wrote:

> Now if 2.4 has worse _performance_ than 2.2 due to one reason or
> another, that I'd like to hear about ;)

Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
and as it is the usual workload on my little cluster of 3 machines, they
are all running 2.2.19pre:

The application is some mathematics computations (modular symbols) using a
package called MAGMA; at times this requires very large matrices. The
RSS can get up to 870MB; for some reason a MAGMA process under linux
thinks it has run out of memory at 870MB, regardless of the actual
memory/swap in the machine. MAGMA is single-threaded.

The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
There is no problem with just one MAGMA process, it just hits that 870MB
barrier and gracefully exits. But if I do the following test, I notice
very different behaviour under 2.2 and 2.4: while running 'top d 1' I
simultaneously launch two instances of a job which actually requires more
than 870MB of memory to complete. So each instance will slowly grow in
RSS until it gets killed by OOM or hits that 870MB limit.

Under 2.2, everything proceeds smoothly: before physical RAM is exhausted,
top updates every second, and the jobs have all the CPU. When swapping
kicks in, top updates every 1-2 seconds and lists most of the CPU as
'system' (kswapd), but I perceive not much loss of interactivity.
Eventually the 1GB of virtual memory is exhausted, the OOM killer kills
one of the MAGMA's, and the other runs till it hits the 870MB barrier and
exits.

But under 2.4, interactivity suffers as soon as physical RAM is exhausted.
Top only updates every 2-10 seconds, the load average hits 3-4, and top
reports the CPUs are 90% idle. Eventually, the OOM killer kicks in and
all returns to normal. For practical purposes, the machine is unusable
while swapping like this.

I have heard 'vmstat' mentioned here, so below is the output of a 'vmstat
1' concomitant with the test above (top and the two MAGMA jobs). I would
be more than happy to provide any other relevant information about this.

I read the LKML via an archive that updates once a day, so please cc: me
if you would like a speedier response. I wish I knew of a newsgroup
interface to the LKML, then I could read it more often :-).

Cheers,
Wayne


procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id

0 0 0 49180 447840 840 54104 269 969 84 244 76 236 10 4 86
1 0 0 49180 443276 852 55972 0 0 470 0 163 150 15 2 83
2 0 0 49180 440060 852 56292 0 0 80 0 115 60 93 1 6
2 0 0 49180 438236 856 56292 0 0 1 0 107 53 99 1 0
2 0 0 49180 429468 856 56392 0 0 25 0 109 16 99 0 0
2 0 0 49180 421296 856 56392 0 0 0 0 104 13 98 2 0
2 0 0 49180 421132 856 56392 0 0 0 0 108 53 100 0 0
2 0 0 49180 421128 856 56392 0 0 0 0 108 47 100 0 0
2 0 0 49180 397520 856 56392 0 0 0 1 107 49 96 4 0
2 0 0 49180 364860 856 56392 0 0 0 0 106 47 95 5 0
2 0 0 49180 332244 856 56392 0 0 0 0 106 49 95 5 0
2 0 0 49180 299660 856 56392 0 0 0 0 106 54 92 8 0
2 0 0 49180 267076 856 56392 0 0 0 0 109 56 95 5 0
2 0 0 49180 234632 856 56392 0 0 0 0 110 57 94 6 0
2 0 0 49180 202096 872 56448 32 0 18 0 117 70 95 5 0
2 0 0 49180 169544 872 56448 0 0 0 0 103 13 96 4 0
2 0 0 49180 137108 872 56448 0 0 0 0 107 49 93 7 0
2 0 0 49180 104600 872 56448 0 0 0 0 107 51 94 6 0
2 0 0 49180 72368 872 56448 0 0 0 52 136 54 93 7 0
2 0 0 49180 39964 872 56448 0 0 0 0 110 59 92 8 0
2 0 2 7296 1576 96 13072 0 720 0 184 130 465 74 22 4


procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id

1 2 2 53620 1564 116 23512 1012 31876 565 7969 883 3802 1 8 92
2 1 2 68800 1560 96 20128 68 15396 17 3850 291 2775 1 7 92
3 0 1 99484 1556 96 26096 84 29552 21 7388 594 3832 1 4 95
1 3 2 114708 1560 104 32528 284 14696 161 3674 374 3125 0 4 96
1 4 2 175484 1560 124 31112 360 63000 237 15753 1404 14952 1 5 94
1 2 2 205900 1560 96 32748 12 30080 3 7520 606 8356 1 5 94
2 1 2 221156 1560 96 17848 412 14256 103 3564 308 8450 1 10 89
1 2 2 222128 1564 96 12736 0 16100 7 4025 346 1010 0 5 95
1 2 2 236580 1560 108 15220 276 13988 97 3497 347 4102 0 7 92
2 1 2 267488 1560 104 32044 260 17376 69 4346 405 1265 0 7 93
3 1 1 282756 1560 96 29380 16 15304 4 3827 335 4359 1 7 92
2 1 2 282756 1580 96 11460 92 14948 23 3737 332 4120 1 5 94
2 1 2 313496 1560 100 30476 200 15484 54 3871 318 2359 0 9 90
2 1 2 313496 1560 100 14148 0 13076 1 3270 246 5165 1 8 91
3 1 1 344564 1572 96 23892 16 18444 11 4613 419 1555 0 7 93
2 1 2 375020 1560 96 25400 172 26988 43 6747 556 2910 1 7 93
1 2 2 375020 1968 96 22760 8 17136 2 4284 378 787 0 2 98
2 1 2 406056 1568 96 20432 212 17320 53 4330 393 2704 1 10 89
3 0 3 421316 1560 96 25056 72 14416 18 3604 281 1731 0 5 94
1 3 0 452120 1544 100 21216 240 31480 116 7870 715 2681 1 6 94
2 2 2 467488 1588 108 27248 440 15056 123 3765 385 2206 0 5 94


procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id

2 1 0 467488 1564 136 13352 88 15376 49 3844 368 2913 1 4 95
3 0 1 482864 1560 96 15256 128 15384 32 3846 296 986 1 7 92
3 0 1 497920 1560 96 14144 0 12636 0 3159 245 2302 1 9 90
3 1 1 529844 1540 96 18632 940 33340 569 8336 1104 1366 1 10 88
0 1 0 269856 205944 148 21772 2628 0 1196 2 267 313 0 3 97
0 1 0 269856 182736 156 33180 11180 0 2854 0 309 451 6 3 91
0 1 0 269856 158668 156 44696 11516 0 2879 0 314 462 12 4 83
0 1 0 269856 131928 156 57588 12892 0 3223 0 312 466 8 4 88
0 1 0 269856 105176 156 70448 12864 0 3216 0 332 506 12 3 85
0 1 0 269856 79056 156 82644 12196 0 3049 0 456 602 10 6 83
1 1 0 269856 46948 156 96900 14252 0 3563 0 359 518 21 7 72

Andi Kleen

Jan 8, 2001, 12:42:25 AM
On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> The application is some mathematics computations (modular symbols) using a
> package called MAGMA; at times this requires very large matrices. The
> RSS can get up to 870MB; for some reason a MAGMA process under linux
> thinks it has run out of memory at 870MB, regardless of the actual
> memory/swap in the machine. MAGMA is single-threaded.

I think it's caused by the way malloc maps its memory.
Newer glibc should work a bit better by falling back to mmap even for
smaller allocations (older versions do it only for very big ones).

-Andi

Linus Torvalds

Jan 8, 2001, 1:04:04 AM
In article <2001010806...@gruyere.muc.suse.de>,

Andi Kleen <a...@suse.de> wrote:
>On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
>> The application is some mathematics computations (modular symbols) using a
>> package called MAGMA; at times this requires very large matrices. The
>> RSS can get up to 870MB; for some reason a MAGMA process under linux
>> thinks it has run out of memory at 870MB, regardless of the actual
>> memory/swap in the machine. MAGMA is single-threaded.
>
>I think it's caused by the way malloc maps its memory.
>Newer glibc should work a bit better by falling back to mmap even for smaller
>allocations (older does it only for very big ones)

That doesn't resolve the "2.4.x behaves badly" thing, though.

I've seen that one myself, and it seems to be simply due to the fact
that we're usually so good at getting memory from page_launder() that we
never bother to try to swap stuff out. And when we _do_ start swapping
stuff out it just moves to the dirty list, and page_launder() will take
care of it.

So far so good. The problem appears to be that we don't swap stuff out
smoothly: we start doing the VM scanning, but when we get enough dirty
pages, we'll let it be, and go back to page_launder() again. Which means
that we don't walk through the whole VM space, we just do some "spot
cleaning".

Linus

Rik van Riel

Jan 8, 2001, 12:16:48 PM
On Sun, 7 Jan 2001, Wayne Whitney wrote:

> Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,

> The typical machine is a dual Intel box with 512MB RAM and 512MB swap.

How does 2.4 perform when you add an extra GB of swap?

2.4 keeps dirty pages in the swap cache, so you will need
more swap to run the same programs...

Linus: is this something we want to keep or should we give
the user the option to run in a mode where swap space is
freed when we swap in something non-shared ?

regards,

Rik

Rik van Riel

Jan 8, 2001, 12:44:11 PM
On 7 Jan 2001, Linus Torvalds wrote:

> That doesn't resolve the "2.4.x behaves badly" thing, though.
>
> I've seen that one myself, and it seems to be simply due to the
> fact that we're usually so good at getting memory from
> page_launder() that we never bother to try to swap stuff out.
> And when we _do_ start swapping stuff out it just moves to the
> dirty list, and page_launder() will take care of it.
>
> So far so good. The problem appears to be that we don't swap
> stuff out smoothly: we start doing the VM scanning, but when we
> get enough dirty pages, we'll let it be, and go back to
> page_launder() again. Which means that we don't walk through the
> whole VM space, we just do some "spot cleaning".

You are right in that we need to refill the inactive list
before calling page_launder(), but we'll also need a few
other modifications:

1. adopt the latest FreeBSD tactic in page_launder()
- mark dirty pages we see but don't flush
- in the first loop, flush up to maxlaunder of the
already seen dirty pages
- in the second loop, flush as many pages as we
need to refill the free&inactive_clean list

2. go back to having a _static_ free target, at
   max(freepages.high, SUM(zone->pages_high)) ... this
   means free_shortage() will never be very big

3. keep track of how many pages we need to free in
   page_launder() and subtract one from the target
   when we submit a page for IO ... no need to flush
   20MB of dirty pages when we only need 1MB of pages
   cleaned
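The two-loop tactic in point 1 and the target countdown in point 3 can be illustrated with a toy simulation (not the kernel code; the names `SEEN_DIRTY`, `launder`, etc. are made up for illustration):

```c
/* Toy model of the FreeBSD-style two-pass page_launder() sketched
 * above -- an illustration only, not the actual kernel code. */
#include <stddef.h>

enum page_state { CLEAN, DIRTY, SEEN_DIRTY };

/* Returns how many pages were "flushed" (cleaned). */
static int launder(enum page_state *page, int npages,
                   int maxlaunder, int target)
{
    int flushed = 0;

    /* First loop: mark dirty pages we see but don't flush them yet;
     * flush at most maxlaunder pages already marked on an earlier scan. */
    for (int i = 0; i < npages; i++) {
        if (page[i] == SEEN_DIRTY && maxlaunder > 0 && target > 0) {
            page[i] = CLEAN;
            flushed++;
            maxlaunder--;
            target--;          /* point 3: one fewer page still needed */
        } else if (page[i] == DIRTY) {
            page[i] = SEEN_DIRTY;
        }
    }

    /* Second loop: flush only as many more pages as are needed to
     * refill the free & inactive_clean lists -- no 20MB flood when
     * only 1MB needs cleaning. */
    for (int i = 0; i < npages && target > 0; i++) {
        if (page[i] == SEEN_DIRTY) {
            page[i] = CLEAN;
            flushed++;
            target--;
        }
    }
    return flushed;
}
```

The point of the structure is that pages seen dirty for the first time are only marked, so a single call never floods the IO queue beyond what the free target actually requires.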

I have these things in my local tree and it seems to smooth
out the load quite well for a very large haskell run and for
the fillmem program from Juan Quintela's memtest suite.

When combined with your idea of refilling the freelist _first_,
we should be able to get the VM quite a bit smoother under loads
with lots of dirty pages.

I will work on this while travelling to and being in Australia.
Expect a clean patch to fix this problem once the 2.4 bugfix-only
period is over.

Other people on this list are invited to apply the VM patches from
my home page and give them a good beating. I want to be able to
submit a well-tested, known-good patch to Linus once 2.4 is out of
the bugfix-only period...

Linus Torvalds

Jan 8, 2001, 12:58:15 PM

On Mon, 8 Jan 2001, Rik van Riel wrote:

> On Sun, 7 Jan 2001, Wayne Whitney wrote:
>
> > Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
>
> > The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
>
> How does 2.4 perform when you add an extra GB of swap ?
>
> 2.4 keeps dirty pages in the swap cache, so you will need
> more swap to run the same programs...
>
> Linus: is this something we want to keep or should we give
> the user the option to run in a mode where swap space is
> freed when we swap in something non-shared ?

I'd prefer just documenting it and keeping it. I'd hate to have two fairly
different modes of behaviour. It's always been the suggested "twice the
amount of RAM", although there's historically been the "Linux doesn't
really need that much" that we just killed with 2.4.x.

If you have 512MB of RAM, you can probably afford another 40GB or so of
harddisk. They are disgustingly cheap these days.

Linus

Linus Torvalds

Jan 8, 2001, 1:02:12 PM

On Mon, 8 Jan 2001, Rik van Riel wrote:
>

> You are right in that we need to refill the inactive list
> before calling page_launder(), but we'll also need a few
> other modifications:

NONE of your three additions do _anything_ to help us at all if we don't
even see the dirty bit because the page is on the active list and the
dirty bit is in somebody's VM space.

I agree that they look ok, but they are all complicating the code. I
propose getting rid of complications, and getting rid of the precarious
"when do we actually scan the VM tables" balancing issue.

Quite frankly, I'd rather see somebody try the vmscan stuff FIRST. Your
suggestions look fine, but apart from the "let dirty pages go twice
through the list" they look like tweaks that would need re-tweaking after
the balancing stuff is ripped out.

Szabolcs Szakacsits

Jan 8, 2001, 3:39:43 PM

Andi Kleen <a...@suse.de> wrote:
> On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> > package called MAGMA; at times this requires very large matrices. The
> > RSS can get up to 870MB; for some reason a MAGMA process under linux
> > thinks it has run out of memory at 870MB, regardless of the actual
> > memory/swap in the machine. MAGMA is single-threaded.
> I think it's caused by the way malloc maps its memory.
> Newer glibc should work a bit better by falling back to mmap even
> for smaller allocations (older does it only for very big ones)

AFAIK newer glibc = CVS glibc, but the malloc() tuning parameters
work via environment variables for the current stable ones as well;
e.g. to overcome the above "out of memory" one could do:
% export MALLOC_MMAP_MAX_=1000000
% export MALLOC_MMAP_THRESHOLD_=0
% magma

By default, on 32-bit Linux the current stable glibc malloc uses brk
between 0x08??????-0x40000000 and at most 128 mmaps (MALLOC_MMAP_MAX_)
when the requested chunk is greater than 128 kB (MALLOC_MMAP_THRESHOLD_).
If MAGMA mallocs memory in chunks smaller than 128 kB then the above
out-of-memory behaviour is expected.
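For a program whose source you do control, the same tuning can be done from inside the process with glibc's mallopt() -- a sketch, assuming glibc (M_MMAP_THRESHOLD and M_MMAP_MAX are the mallopt knobs behind those environment variables; the wrapper name is made up):

```c
/* Programmatic equivalent of MALLOC_MMAP_THRESHOLD_=0 and
 * MALLOC_MMAP_MAX_=1000000: push glibc malloc toward mmap() for every
 * chunk, so a big heap is not confined to the contiguous brk arena
 * below the shared libraries.  mallopt() is glibc-specific. */
#include <malloc.h>
#include <stdlib.h>

static void *mmap_heavy_alloc(size_t bytes)
{
    mallopt(M_MMAP_THRESHOLD, 0);   /* mmap() even the smallest chunks  */
    mallopt(M_MMAP_MAX, 1000000);   /* allow up to a million mmap areas */
    return malloc(bytes);
}
```

Trading brk for mmap moves allocations into otherwise unused gaps of the address space, at the cost of page-granular overhead per chunk.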

Szaka

Wayne Whitney

Jan 8, 2001, 4:56:08 PM
On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:

> AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> via environment variables for the current stable ones as well,

Hmm, this must have been introduced in libc6? Unfortunately, I don't have
the source code to MAGMA, and the binary I have is statically linked. It
does not contain the names of the environment variables you mentioned.

I'll arrange a binary linked against glibc2.2, and then your suggestion
will hopefully do the trick. Thanks for your kind help!

Cheers,
Wayne

Andrea Arcangeli

Jan 8, 2001, 5:15:58 PM
On Mon, Jan 08, 2001 at 02:00:19PM -0800, Wayne Whitney wrote:
> I'd ask if this jives with your theory: if I configure the linux kernel
> to be able to use 2GB of RAM, then the 870MB limit becomes much lower, to
> 230MB.

It's because the virtual address space for userspace tasks gets reduced
from 3G to 2G, to give an additional gigabyte of direct mapping to the
kernel.

Also, the other limit you hit (at around 800MB) is partly because of
the too-small userspace virtual address space.

You can use this hack of mine to allow tasks to grow up to 3.5G per
task on IA32 on 2.4.0 (an equivalent hack exists for 2.2.19pre6aa1
with bigmem; btw it makes sense also without bigmem if you have lots
of swap, since this is all about virtual memory, not physical RAM).
However it doesn't work with PAE enabled yet.

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test11-pre5/per-process-3.5G-IA32-no-PAE-1

If you run your program on any 64bit architecture (in 64bit userspace mode)
supported by linux, you won't run into those per-process address space limits.

Andrea

Wayne Whitney

Jan 8, 2001, 4:30:56 PM
On Mon, 8 Jan 2001, Rik van Riel wrote:

> How does 2.4 perform when you add an extra GB of swap ?

OK, some more data:

First, I tried booting 2.4.0 with "nosmp" to see if the behavior I observe
is SMP related. It isn't, there was no difference under 2.4.0 between
512MB/512MB/1CPU and 512MB/512MB/2CPUs.

Second, I tried going to 2GB of swap with 2.4.0, so 512MB/2GB/2CPUs.
Again, there is no difference: as soon as swapping begins with two MAGMA
processes, interactivity suffers. I notice that while swapping in this
situation, the HD light is blinking only intermittently.

I also tried logging in to a fourth VT during this second test, and it got
nowhere. In fact, this stopped the top updates completely and the HD
light also stopped. After 30 seconds of nothing (all I could do is switch
VT's), I gave up and sent a ^Z to one MAGMA process; this eventually was
received, and the system immediately recovered.

Perhaps there is some sort of I/O starvation triggered by two swapping
processes?

Again, under 2.2.19pre6, the exact same tests yield hardly any loss of
interactivity, I can log in fine (a little slowly) during the top / two
MAGMA process test. And once swapping begins, the HD light is continually
lit.

Again, I'd be happy to do any additional tests, provide more info about my
machine, etc.

Cheers,
Wayne

Wayne Whitney

Jan 8, 2001, 5:00:19 PM
On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:

> AFAIK newer glibc = CVS glibc but the malloc() tune parameters work

> via environment variables for the current stable ones as well, e.g. to


> overcome the above "out of memory" one could do,
>
> % export MALLOC_MMAP_MAX_=1000000
> % export MALLOC_MMAP_THRESHOLD_=0
> % magma

As I just mentioned, I haven't been able to test this yet due to my
current binary being linked against an older libc which doesn't seem to
have these parameters. But here's one other data point; I just thought
I'd ask if this jives with your theory: if I configure the Linux kernel
to be able to use 2GB of RAM, then the 870MB limit becomes much lower,
at 230MB.

Cheers, Wayne

Wayne Whitney

Jan 8, 2001, 6:22:44 PM
On Mon, 8 Jan 2001, Wayne Whitney wrote:

> On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:
>
> > AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> > via environment variables for the current stable ones as well,
>

> I'll arrange a binary linked against glibc2.2, and then your suggestion
> will hopefully do the trick. Thanks for your kind help!

OK, I now have a binary dynamically linked against /lib/libc.so.6
(according to ldd), and that points to glibc-2.1.92. I tried setting
the environment variables you suggested; I checked that they are set
and that they appear in /lib/libc.so.6. But the behaviour is
unchanged: MAGMA still hits this barrier at 830M (not 870M, that was a
typo).

I guess I conclude that either (1) MAGMA does not use libc's malloc
(checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
variables but has not yet implemented the tuning (I'll try glibc-2.2) or
(3) this is not the problem.

I'll look at Andrea's hack as well. Thanks for everybody's help!

Andrea Arcangeli

Jan 8, 2001, 6:30:02 PM
On Mon, Jan 08, 2001 at 03:22:44PM -0800, Wayne Whitney wrote:
> I guess I conclude that either (1) MAGMA does not use libc's malloc
> (checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
> variables but has not yet implemented the tuning (I'll try glibc-2.2) or
> (3) this is not the problem.

You should monitor the program with strace while it fails (last few
syscalls). You can breakpoint at exit() and run `cat /proc/pid/maps` to
show us the vma layout of the task. Then we'll see why it's failing.
With CONFIG_1G in 2.2.x or 2.4.x (the configuration option doesn't
matter) you should at least reach something like 1.5G.
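As a minimal illustration of what reading the maps file looks like (run here against our own /proc/self/maps rather than the failing MAGMA task; the helper name is made up):

```c
/* Scan a /proc/<pid>/maps file for a substring -- e.g. "[heap]" to see
 * where the brk arena ends, or a library path to see what is mapped
 * above it.  Illustration only; on the failing task you would read
 * /proc/<pid>/maps at the breakpoint, as suggested above. */
#include <stdio.h>
#include <string.h>

/* Returns 1 if `needle` occurs in the file, 0 if not, -1 on open error. */
static int maps_contains(const char *path, const char *needle)
{
    FILE *f = fopen(path, "r");
    char line[512];
    int found = 0;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f))
        if (strstr(line, needle) != NULL)
            found = 1;
    fclose(f);
    return found;
}
```

Each maps line gives the start-end addresses, permissions, and backing file of one vma, which is exactly the layout information needed to see why brk stops where it does.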

Andrea

Linus Torvalds

Jan 8, 2001, 7:37:45 PM
In article <2001010900...@athlon.random>,

Andrea Arcangeli <and...@suse.de> wrote:
>On Mon, Jan 08, 2001 at 03:22:44PM -0800, Wayne Whitney wrote:
>> I guess I conclude that either (1) MAGMA does not use libc's malloc
>> (checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
>> variables but has not yet implemented the tuning (I'll try glibc-2.2) or
>> (3) this is not the problem.
>
>You should monitor the program with strace while it fails (last few syscalls).
>You can breakpoint at exit() and run `cat /proc/pid/maps` to show us the vma
>layout of the task. Then we'll see why it's failing. With CONFIG_1G in 2.2.x
>or 2.4.x (confinguration option doesn't matter) you should at least reach
>something like 1.5G.

It might be doing its own memory management with brk() directly - some
older UNIX programs will do that (for various reasons - it can be faster
than malloc() etc if you know your access patterns, for example).

If you do that, and you have shared libraries, you'll get a failure
around the point Wayne sees it.
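A rough way to see the ceiling Linus describes is to grow the break directly with sbrk() until it fails, or until a cap (a 64-bit box won't hit the ia32 layout limit, so the cap keeps it bounded). `brk_probe` is a made-up helper for illustration, not anything from MAGMA:

```c
/* Probe how far the classic brk() heap can grow.  On an ia32 3G/1G
 * kernel with shared libraries mapped at 0x40000000, a brk-only
 * allocator hits a ceiling near the one discussed in this thread.
 * Illustration only -- capped so it terminates quickly anywhere. */
#include <unistd.h>
#include <stdint.h>

/* Grows the break in steps of `step_mb` up to `cap_mb`, releases it
 * all again, and returns the number of MB successfully obtained. */
static long brk_probe(long step_mb, long cap_mb)
{
    long got = 0;
    intptr_t step = (intptr_t)step_mb << 20;

    while (got < cap_mb && sbrk(step) != (void *)-1)
        got += step_mb;
    sbrk(-((intptr_t)got << 20));   /* shrink the break back down */
    return got;
}
```

Called with a large cap on a 32-bit box, the returned figure shows how much contiguous heap is available below the first shared-library mapping, independent of malloc.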

But your suggestion to check with strace is a good one.

Linus

Zlatko Calusic

Jan 8, 2001, 9:01:52 PM
Rik van Riel <ri...@conectiva.com.br> writes:

> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
>

Oh, well, it seems that I was wrong. :)


First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
192MB machine)

kernel | swap usage | speed
-------------------------------
2.2.17 | 48 MB | 11.8 MB/s
-------------------------------
2.4.0 | 206 MB | 11.1 MB/s
-------------------------------

So 2.2 is only marginally faster. Also it can be seen that 2.4 uses 4
times more swap space. If Linus says it's ok... :)


Second test: kernel compile make -j32 (empirically this puts the VM
under load, but not excessively!)

2.2.17 -> make -j32 392.49s user 47.87s system 168% cpu 4:21.13 total
2.4.0 -> make -j32 389.59s user 31.29s system 182% cpu 3:50.24 total

Now, is this great news or what, 2.4.0 is definitely faster.

--
Zlatko

Zlatko Calusic

Jan 8, 2001, 6:41:14 PM
Linus Torvalds <torv...@transmeta.com> writes:

> On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> > On Sun, 7 Jan 2001, Wayne Whitney wrote:
> >
> > > Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
> >
> > > The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
> >

> > How does 2.4 perform when you add an extra GB of swap ?
> >

> > 2.4 keeps dirty pages in the swap cache, so you will need
> > more swap to run the same programs...
> >
> > Linus: is this something we want to keep or should we give
> > the user the option to run in a mode where swap space is
> > freed when we swap in something non-shared ?
>
> I'd prefer just documenting it and keeping it. I'd hate to have two fairly
> different modes of behaviour. It's always been the suggested "twice the
> amount of RAM", although there's historically been the "Linux doesn't
> really need that much" that we just killed with 2.4.x.
>
> If you have 512MB of RAM, you can probably afford another 40GB or so of
> harddisk. They are disgustingly cheap these days.
>

Yes, but a lot more data on the swap also means degraded performance,
because the disk head has to seek around in the much bigger area. Are
you sure this is all OK?

Linus Torvalds

Jan 8, 2001, 9:58:00 PM

On 9 Jan 2001, Zlatko Calusic wrote:
>
> Yes, but a lot more data on the swap also means degraded performance,
> because the disk head has to seek around in the much bigger area. Are
> you sure this is all OK?

Yes and no.

I'm not _sure_, obviously.

However, one thing I _am_ sure of is that the sticky page-cache
simplifies some things enormously, and makes some things possible that
simply weren't possible before.

Linus

Eric W. Biederman

Jan 9, 2001, 1:20:58 AM
Zlatko Calusic <zla...@iskon.hr> writes:

>
> Yes, but a lot more data on the swap also means degraded performance,
> because the disk head has to seek around in the much bigger area. Are
> you sure this is all OK?

I don't think we have more data on the swap, just more data has an
allocated home on the swap. With the earlier allocation we should
(I haven't verified) allocate contiguous chunks of memory contiguously
on the swap. And reusing the same swap pages helps out with this.

Eric

Linus Torvalds

Jan 9, 2001, 2:27:15 AM

On 8 Jan 2001, Eric W. Biederman wrote:

> Zlatko Calusic <zla...@iskon.hr> writes:
> >
> > Yes, but a lot more data on the swap also means degraded performance,
> > because the disk head has to seek around in the much bigger area. Are
> > you sure this is all OK?
>
> I don't think we have more data on the swap, just more data has an
> allocated home on the swap.

I think Zlatko's point is that because of the extra allocations, we will
have worse locality (more seeks etc).

Clearly we should not actually do any more actual IO. But the sticky
allocation _might_ make the IO we do be more spread out.

To offset that, I think the sticky allocation makes us much better able to
handle things like clustering etc more intelligently, which is why I think
it's very much worth it. But let's not close our eyes to potential
downsides.

Linus

Zlatko Calusic

Jan 9, 2001, 7:29:00 AM
Linus Torvalds <torv...@transmeta.com> writes:

> On 8 Jan 2001, Eric W. Biederman wrote:
>
> > Zlatko Calusic <zla...@iskon.hr> writes:
> > >
> > > Yes, but a lot more data on the swap also means degraded performance,
> > > because the disk head has to seek around in the much bigger area. Are
> > > you sure this is all OK?
> >
> > I don't think we have more data on the swap, just more data has an
> > allocated home on the swap.
>
> I think Zlatko's point is that because of the extra allocations, we will
> have worse locality (more seeks etc).

Yes that was my concern.

But in the end I'm not sure. I made two simple tests and haven't found
any problems with the 2.4.0 mm logic (as opposed to 2.2.17). In fact, the new
kernel was faster in the more interesting (make -j32) test.

Also I have found that the new kernel allocates 4 times more swap space
under some circumstances. That may or may not be alarming, it remains
to be seen.

--
Zlatko

Eric W. Biederman

Jan 9, 2001, 6:38:56 AM
Linus Torvalds <torv...@transmeta.com> writes:

> On 8 Jan 2001, Eric W. Biederman wrote:
>
> > Zlatko Calusic <zla...@iskon.hr> writes:
> > >
> > > Yes, but a lot more data on the swap also means degraded performance,
> > > because the disk head has to seek around in the much bigger area. Are
> > > you sure this is all OK?
> >
> > I don't think we have more data on the swap, just more data has an
> > allocated home on the swap.
>
> I think Zlatko's point is that because of the extra allocations, we will
> have worse locality (more seeks etc).
>

> Clearly we should not actually do any more actual IO. But the sticky
> allocation _might_ make the IO we do be more spread out.

The tradeoff when implemented correctly is that writes will tend to be
more spread out and reads should be better clustered together.

> To offset that, I think the sticky allocation makes us much better able to
> handle things like clustering etc more intelligently, which is why I think
> it's very much worth it. But let's not close our eyes to potential
> downsides.

Certainly, keeping our eyes open is a good thing.

But it has been apparent for a long time that by doing allocation as
we were doing it, that when it came to heavy swapping we were taking a
performance hit. So I'm relieved that we are now being more aggressive.

From the sounds of it what we are currently doing actually sucks worse
for some heavy loads. But it still feels like the right direction.

It's been my impression that work loads where we are actively swapping
are a lot different from work loads where we really don't swap. To
the extent that it might make sense to make the actively swapping case
a config option to get our attention in the code. It would be nice
to have a linux kernel for once that handles heavy swapping (below
the level of thrashing) gracefully. :)

Eric

Linus Torvalds

Jan 9, 2001, 1:47:57 PM

On 9 Jan 2001, Zlatko Calusic wrote:
>

> But in the end I'm not sure. I made two simple tests and haven't found
> any problems with the 2.4.0 mm logic (as opposed to 2.2.17). In fact, the new
> kernel was faster in the more interesting (make -j32) test.

I personally think 2.4.x is going to be as fast or faster at just about
anything. We do have some MM issues still to hash out, and tuning to do,
but I'm absolutely convinced that 2.4.x is going to be a _lot_ easier to
tune than 2.2.x ever was. The "scan the page tables without doing any IO"
thing just makes the 2.4.x memory management several orders of magnitude
more flexible than 2.2.x ever was.

(This is why I worked so hard at getting the PageDirty semantics right in
the last two months or so - and why I released 2.4.0 when I did. Getting
PageDirty right was the big step to make all of the VM stuff possible in
the first place. Even if it probably looked a bit foolhardy to change the
semantics of "writepage()" quite radically just before 2.4 was released).

> Also I have found that the new kernel allocates 4 times more swap space
> under some circumstances. That may or may not be alarming, it remains
> to be seen.

Yes. The new VM will allocate the swap space a _lot_ more aggressively.
Many of those allocations will not necessarily ever actually be used, but
the fact that we _have_ allocated backing store for a page is what allows
us to drop it from the VM page tables, so that it can be processed by
page_launder().

And this _is_ a downside, there's no question about it. There's the worry
about the potential loss of locality, but there's also the fact that you
effectively need a bigger swap partition with 2.4.x - never mind that
large portions of the allocations may never be used. You still need the
disk space for good VM behaviour.

There are always trade-offs, I think the 2.4.x tradeoff is a good one.

Linus


Daniel Phillips

Jan 9, 2001, 2:09:43 PM
Linus Torvalds wrote:
> (This is why I worked so hard at getting the PageDirty semantics right in
> the last two months or so - and why I released 2.4.0 when I did. Getting
> PageDirty right was the big step to make all of the VM stuff possible in
> the first place. Even if it probably looked a bit foolhardy to change the
> semantics of "writepage()" quite radically just before 2.4 was released).

On the topic of writepage, it's not symmetric with readpage at the
moment - it still takes (struct file *). Is this in the cleanup
pipeline? It looks like nfs_readpage already ignores the struct file *,
but maybe some other net filesystems are still depending on it.

--
Daniel

Trond Myklebust

Jan 9, 2001, 2:29:02 PM
>>>>> " " == Daniel Phillips <phil...@innominate.de> writes:

> Linus Torvalds wrote:
>> (This is why I worked so hard at getting the PageDirty
>> semantics right in the last two months or so - and why I
>> released 2.4.0 when I did. Getting PageDirty right was the big
>> step to make all of the VM stuff possible in the first
>> place. Even if it probably looked a bit foolhardy to change the
>> semantics of "writepage()" quite radically just before 2.4 was
>> released).

> On the topic of writepage, it's not symmetric with readpage at
> the moment - it still takes (struct file *). Is this in the
> cleanup pipeline? It looks like nfs_readpage already ignores
> the struct file *, but maybe some other net filesystems are
> still depending on it.

NO! We definitely want to pass the struct file down to nfs_readpage()
when it's available.

Al has mentioned that he wants us to move towards a *BSD-like system
of credentials (i.e. struct ucred) that could be used here, but that's
in the far future. In the meantime, we cache RPC credentials in the
struct file...

Cheers,
Trond

Simon Kirby

Jan 9, 2001, 2:53:52 PM
On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:

> And this _is_ a downside, there's no question about it. There's the worry
> about the potential loss of locality, but there's also the fact that you
> effectively need a bigger swap partition with 2.4.x - never mind that
> large portions of the allocations may never be used. You still need the
> disk space for good VM behaviour.
>
> There are always trade-offs, I think the 2.4.x tradeoff is a good one.

Hmm, perhaps you could clarify...

For boxes that rarely ever use swap with 2.2, will they now need more
swap space on 2.4 to perform well, or just boxes which don't have enough
RAM to handle everything nicely?

I've always been tending to make swap partitions smaller lately, as it
helps in the case where we have to wait for a runaway process to eat up
all of the swap space before it gets killed. Making the swap size
smaller speeds up the time it takes for this to happen, albeit something
which isn't supposed to happen anyway.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ s...@stormix.com ][ s...@netnation.com ]
[ Opinions expressed are not necessarily those of my employers. ]

Linus Torvalds

Jan 9, 2001, 2:37:05 PM
In article <3A5B61F7...@innominate.de>,

Daniel Phillips <phil...@innominate.de> wrote:
>Linus Torvalds wrote:
>> (This is why I worked so hard at getting the PageDirty semantics right in
>> the last two months or so - and why I released 2.4.0 when I did. Getting
>> PageDirty right was the big step to make all of the VM stuff possible in
>> the first place. Even if it probably looked a bit foolhardy to change the
>> semantics of "writepage()" quite radically just before 2.4 was released).
>
>On the topic of writepage, it's not symmetric with readpage at the
>moment - it still takes (struct file *). Is this in the cleanup
>pipeline? It looks like nfs_readpage already ignores the struct file *,
>but maybe some other net filesystems are still depending on it.

readpage() is always a synchronous operation, and is actually much more
closely linked to "prepare_write()"/"commit_write()" than to writepage,
despite the naming similarities.

So no, the two are not symmetric, and they really shouldn't be.

"readpage()" is for reading a page into the page cache, and is always
synchronous with the reader (even prefetching is "synchronous" in the
sense that it's done by the reader: it's asynchronous in the sense that
we don't wait for the results, but the _calling_ of readpage() is
synchronous, if you see what I mean).

Similarly, prepare_write() and commit_write() are synchronous to the
writer (again, we do not wait for the writes to have actually
_happened_, but we call the functions synchronously and they can choose
to let the actual IO happen asynchronously - the VM doesn't care about
that small detail).

So "readpage()" and "prepare_write()/commit_write()" are pairs. They
are different simply because reading is assumed to be a cacheable and
prefetchable operation (think regular CPU caches), while writing
obviously has to give a much stricter "write _these_ bytes, not the
whole cache line".

In contrast, writepage() is a completely different animal. It's
basically a cache eviction notice, and happens asynchronously to any
operations that actually fill or dirty the cache. So despite the name,
as an operation it really has absolutely nothing in common with
readpage(), other than the fact that it is supposed to obviously do the
IO associated with the name.

Writepage has a friend in "sync_page()", which is another asynchronous
call-back that basically says "we want you to start your IO _now_". It's
similar to "writepage()" in that it's a kind of cache state
notification: while writepage() notifies that the cached page wants to
be evicted, "sync_page()" notifies that the cached page is waited upon
by somebody else and that we want to speed up any background IO on it.

You'll notice that writepage()/sync_page() have similar calling
convention, while readpage/prepare_write/commit_write have similar
calling conventions.

The one operation that _really_ stands out is "bmap()". It has
absolutely no calling convention at all, and is not symmetric with
anything. Pretty ugly, but easily supported.

Linus

Zlatko Calusic

Jan 9, 2001, 3:10:54 PM
Simon Kirby <s...@stormix.com> writes:

> On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
>
> > And this _is_ a downside, there's no question about it. There's the worry
> > about the potential loss of locality, but there's also the fact that you
> > effectively need a bigger swap partition with 2.4.x - never mind that
> > large portions of the allocations may never be used. You still need the
> > disk space for good VM behaviour.
> >
> > There are always trade-offs, I think the 2.4.x tradeoff is a good one.
>
> Hmm, perhaps you could clarify...
>
> For boxes that rarely ever use swap with 2.2, will they now need more
> swap space on 2.4 to perform well, or just boxes which don't have enough
> RAM to handle everything nicely?
>

Just boxes that were already short on memory (swapped a lot) will need
more swap, empirically up to 4 times as much. If you already had
enough memory, things will stay almost the same for you.

But anyway, after some testing I've done recently, I would now not
recommend that anybody have a swap partition smaller than 2 x RAM.

> I've always been tending to make swap partitions smaller lately, as it
> helps in the case where we have to wait for a runaway process to eat up
> all of the swap space before it gets killed. Making the swap size
> smaller speeds up the time it takes for this to happen, albeit something
> which isn't supposed to happen anyway.
>

Well, if you continue with that practice now you will be even more
successful in killing such processes, I would say. :)
--
Zlatko

Linus Torvalds

Jan 9, 2001, 3:08:41 PM

On Tue, 9 Jan 2001, Simon Kirby wrote:
>
> On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
>
> > And this _is_ a downside, there's no question about it. There's the worry
> > about the potential loss of locality, but there's also the fact that you
> > effectively need a bigger swap partition with 2.4.x - never mind that
> > large portions of the allocations may never be used. You still need the
> > disk space for good VM behaviour.
> >
> > There are always trade-offs, I think the 2.4.x tradeoff is a good one.
>
> Hmm, perhaps you could clarify...
>
> For boxes that rarely ever use swap with 2.2, will they now need more
> swap space on 2.4 to perform well, or just boxes which don't have enough
> RAM to handle everything nicely?

If you don't have any swap, or if you run out of swap, the major
difference between 2.2.x and 2.4.x is probably going to be the oom
handling: I suspect that 2.4.x might be more likely to kill things off
sooner (but it tries to be graceful about which processes to kill).

Not having any swap is going to be a performance issue for both 2.2.x and
2.4.x - Linux likes to push inactive dirty pages out to swap where they
can lie around without bothering anybody, even if there is no _major_
memory crunch going on.

If you do have swap, but it's smaller than your available physical RAM, I
suspect that the Linux-2.4 swap pre-allocate may cause that kind of
performance degradation earlier than 2.2.x would have. Another way of
putting this: in 2.2.x you could use a fairly small swap partition to pick
up some of the slack, and in 2.4.x a really small swap-partition doesn't
really buy you much of anything.

> I've always been tending to make swap partitions smaller lately, as it
> helps in the case where we have to wait for a runaway process to eat up
> all of the swap space before it gets killed. Making the swap size
> smaller speeds up the time it takes for this to happen, albeit something
> which isn't supposed to happen anyway.

Yes, that kind of swap size tuning will still work in 2.4.x, but the sizes
you tune for would be different, I'm afraid. If you have, say, 128MB of
RAM, and you used to make a smallish partition of 64MB for "slop" in
2.2.x, I really suspect that you might like to increase it to 128MB or
196MB.

Of course, if you really only used your swap for "slop", I don't think
you'll necessarily notice the difference.

NOTE! The above guide-lines are pure guesses. The machines I use have had
big swap-partitions or none at all, so I think we'll just have to wait and
see.

Linus
