2.6 /proc/interrupts fails on systems with many CPUs

Randy.Dunlap

unread,

Nov 11, 2003, 5:34:59 PM11/11/03

to Erik Jacobson, linux-...@vger.kernel.org

On Tue, 11 Nov 2003 11:21:01 -0600 Erik Jacobson <er...@subway.americas.sgi.com> wrote:

| Howdy.
|
| On systems with lots of processors (512 for example), catting /proc/interrupts
| fails with a "not enough memory" error.
|
| This was observed in 2.6.0-test8
|
| I tracked this down to this in proc_misc.c:
|
| static int interrupts_open(struct inode *inode, struct file *file)
| {
| unsigned size = 4096 * (1 + num_online_cpus() / 8);
| char *buf = kmalloc(size, GFP_KERNEL);
|
| The kmalloc fails here.
|
| I'm looking for suggestions on how to fix this. I came up with one fix
| that seems to work OK for ia64. I have attached it to this message.
| I'm looking for advice on what should be proposed for the real fix.

An alternative is to limit 'size' to a maximum of 128 KB (or whatever
the max. kmalloc() on ia64 is) and continue to use kmalloc().
Does that work for you?

Another alternative is to convert show_interrupts to use a seq_file
iterator.

--
~Randy
MOTD: Always include version info.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Martin J. Bligh

unread,

Nov 11, 2003, 5:37:57 PM11/11/03

to Erik Jacobson, linux-...@vger.kernel.org

> On systems with lots of processors (512 for example), catting /proc/interrupts
> fails with a "not enough memory" error.
>
> This was observed in 2.6.0-test8
>
> I tracked this down to this in proc_misc.c:
>
> static int interrupts_open(struct inode *inode, struct file *file)
> {
> unsigned size = 4096 * (1 + num_online_cpus() / 8);
> char *buf = kmalloc(size, GFP_KERNEL);
>
> The kmalloc fails here.
>
> I'm looking for suggestions on how to fix this. I came up with one fix
> that seems to work OK for ia64. I have attached it to this message.
> I'm looking for advice on what should be proposed for the real fix.

I think it'd make more sense to only use vmalloc when it's explicitly
too big for kmalloc - or simply switch on num_online_cpus > 100 or
whatever a sensible cutoff is (ie nobody but you would ever see this ;-))

M.

Robert Love

unread,

Nov 11, 2003, 5:52:39 PM11/11/03

to Erik Jacobson, linux-...@vger.kernel.org

On Tue, 2003-11-11 at 12:21, Erik Jacobson wrote:

> static int interrupts_open(struct inode *inode, struct file *file)
> {
> unsigned size = 4096 * (1 + num_online_cpus() / 8);
> char *buf = kmalloc(size, GFP_KERNEL);
>
> The kmalloc fails here.

Ew.

We should probably use seq_file here, although vmalloc() should not
hurt.

Why __vmalloc() over vmalloc(), though? Eh, I do not even know what
__vmalloc() is?? ;)

Robert Love

Manfred Spraul

unread,

Nov 11, 2003, 6:17:35 PM11/11/03

to Erik Jacobson, linux-...@vger.kernel.org

Erik wrote:

> /* don't ask for more than the kmalloc() max size, currently 128 KB */
> if (size > 128 * 1024)
> size = 128 * 1024;
>- buf = kmalloc(size, GFP_KERNEL);
>+ buf = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
>
>
kmalloc needs a contiguous memory block. Thus after a long runtime,
large kmalloc calls can fail due to fragmentation. I'd prefer if you
switch to vmalloc after 2*PAGE_SIZE.

Or switch to a line based seq file iterator, then you wouldn't need the
huge buffer.
--
Manfred

Linus Torvalds

unread,

Nov 11, 2003, 6:19:06 PM11/11/03

to Erik Jacobson, Kernel Mailing List, Al Viro

On Tue, 11 Nov 2003, Erik Jacobson wrote:
>
> I'm looking for suggestions on how to fix this. I came up with one fix
> that seems to work OK for ia64. I have attached it to this message.
> I'm looking for advice on what should be proposed for the real fix.

This is not the real fix.

Allowing people to use up vmalloc() space by opening the /proc files would
be a major DoS attack. Not worth it.

Instead, just make /proc/interrupts use the proper _sequence_ things, so
that instead of trying to print out everything in one go, you have the
"s_next()" thing to print them out one at a time. The seqfile interfaces
will then do the rigth thing with blocking/caching, and you only need a
single page.

Al - do we have some good documentation of how to use the seq-file
interface?

In the meantime, without documentation, the best place to look is just at
other examples. One such example would be the kernel/kallsyms.c case: see
how it does s_start/s_show/s_next/s_stop (or /proc/slabinfo, or vmstat, or
any number of them).

Linus

vi...@parcelfarce.linux.theplanet.co.uk

unread,

Nov 11, 2003, 6:24:16 PM11/11/03

to Linus Torvalds, Erik Jacobson, Kernel Mailing List

On Tue, Nov 11, 2003 at 10:17:47AM -0800, Linus Torvalds wrote:

> Al - do we have some good documentation of how to use the seq-file
> interface?

There was a text on LWN and it's OK for starting point. I'll try to
put together something compact, but somebody will have to go through
the result and turn it into proper English...

Linus Torvalds

unread,

Nov 11, 2003, 6:27:21 PM11/11/03

to Martin J. Bligh, Erik Jacobson, linux-...@vger.kernel.org

On Tue, 11 Nov 2003, Martin J. Bligh wrote:
>
> I think it'd make more sense to only use vmalloc when it's explicitly
> too big for kmalloc - or simply switch on num_online_cpus > 100 or
> whatever a sensible cutoff is (ie nobody but you would ever see this ;-))

No, please please please don't do these things.

vmalloc() is NOT SOMETHING YOU SHOULD EVER USE! It's only valid for when
you _need_ a big array, and you don't have any choice. It's slow, and it's
a very restricted resource: it's a global resource that is literally
restricted to a few tens of megabytes. It should be _very_ carefully used.

There are basically no valid new uses of it. There's a few valid legacy
users (I think the file descriptor array), and there are some drivers that
use it (which is crap, but drivers are drivers), and it's _really_ valid
only for modules. Nothing else.

Basically: if you think you need more memory than a kmalloc() can give,
you need to re-organize your data structures. To either not need a big
area, or to be able to allocate it in chunks.

Linus

Martin J. Bligh

unread,

Nov 11, 2003, 6:33:18 PM11/11/03

to Linus Torvalds, Erik Jacobson, linux-...@vger.kernel.org

>> I think it'd make more sense to only use vmalloc when it's explicitly
>> too big for kmalloc - or simply switch on num_online_cpus > 100 or
>> whatever a sensible cutoff is (ie nobody but you would ever see this ;-))
>
> No, please please please don't do these things.
>
> vmalloc() is NOT SOMETHING YOU SHOULD EVER USE! It's only valid for when
> you _need_ a big array, and you don't have any choice. It's slow, and it's
> a very restricted resource: it's a global resource that is literally
> restricted to a few tens of megabytes. It should be _very_ carefully used.
>
> There are basically no valid new uses of it. There's a few valid legacy
> users (I think the file descriptor array), and there are some drivers that
> use it (which is crap, but drivers are drivers), and it's _really_ valid
> only for modules. Nothing else.
>
> Basically: if you think you need more memory than a kmalloc() can give,
> you need to re-organize your data structures. To either not need a big
> area, or to be able to allocate it in chunks.

OK, I was actually trying to avoid the use of vmalloc, instead of the
unconditional conversion to vmalloc, which is what the original patch did ;-)

But you are, of course, correct - in this case, it should be easy to use
the seq_file stuff to do it in smaller chunks, and use a smaller buffer.

M.

Linus Torvalds

unread,

Nov 11, 2003, 6:38:24 PM11/11/03

to Martin J. Bligh, Erik Jacobson, linux-...@vger.kernel.org

On Tue, 11 Nov 2003, Martin J. Bligh wrote:
>

> OK, I was actually trying to avoid the use of vmalloc, instead of the
> unconditional conversion to vmalloc, which is what the original patch did ;-)

Yes, I realize that, but it's the old case of

"I'm totally faithful to my husband - I never sleep with other men when
he is around"

joke.

Basically, if it's wrong to use, it's wrong to use even occasionally. In
fact, having two different code-paths just makes the code worse.

Yes, I realize that sometimes you have to do it that way, and it might be
the simplest way to fix something. In this case, though, the cost and
fragility of a generic interface is not worth it, since the problem isn't
actually in the generic code at all..

Linus

Randy.Dunlap

unread,

Nov 11, 2003, 6:39:26 PM11/11/03

to Linus Torvalds, er...@subway.americas.sgi.com, linux-...@vger.kernel.org, vi...@parcelfarce.linux.theplanet.co.uk

On Tue, 11 Nov 2003 10:17:47 -0800 (PST) Linus Torvalds <torv...@osdl.org> wrote:

|
| On Tue, 11 Nov 2003, Erik Jacobson wrote:
| >
| > I'm looking for suggestions on how to fix this. I came up with one fix
| > that seems to work OK for ia64. I have attached it to this message.
| > I'm looking for advice on what should be proposed for the real fix.
|
| This is not the real fix.
|
| Allowing people to use up vmalloc() space by opening the /proc files would
| be a major DoS attack. Not worth it.
|
| Instead, just make /proc/interrupts use the proper _sequence_ things, so
| that instead of trying to print out everything in one go, you have the
| "s_next()" thing to print them out one at a time. The seqfile interfaces
| will then do the rigth thing with blocking/caching, and you only need a
| single page.
|
| Al - do we have some good documentation of how to use the seq-file
| interface?

See http://lwn.net/Articles/22355/
and http://www.xenotime.net/linux/doc/seq_file_howto.txt

| In the meantime, without documentation, the best place to look is just at
| other examples. One such example would be the kernel/kallsyms.c case: see
| how it does s_start/s_show/s_next/s_stop (or /proc/slabinfo, or vmstat, or
| any number of them).

--
~Randy
MOTD: Always include version info.

Anton Blanchard

unread,

Nov 11, 2003, 7:27:37 PM11/11/03

to Erik Jacobson, linux-...@vger.kernel.org

> On systems with lots of processors (512 for example), catting
> /proc/interrupts fails with a "not enough memory" error.

FYI we are seeing this on ppc64 too (less cpus but probably more
interrupt sources).

Anton

Len Brown

unread,

Nov 11, 2003, 7:57:09 PM11/11/03

to linux-...@vger.kernel.org

Speaking of /proc/interrupts...

Does anybody else find hw_interrupt_type.typename lacking?

"XT-PIC" doesn't tell us if it is level or edge triggered, high or low
polarity.

"IO-APIC-edge" and "IO-APIC-level" don't tell us if it is high or low
polarity.

Having this info easily available in /proc/interrupts would make
recognizing and diagnosing interrupt configuration issues much easier.

thanks,
-Len

Jonathan Corbet

unread,

Nov 11, 2003, 8:16:12 PM11/11/03

to vi...@parcelfarce.linux.theplanet.co.uk, Erik Jacobson, torv...@osdl.org, Kernel Mailing List

> > Al - do we have some good documentation of how to use the seq-file
> > interface?
>
> There was a text on LWN and it's OK for starting point.

That would be http://lwn.net/Articles/22355/.

If you'd like a version packaged up for Documentation/, say the word and I
can do that.

jon

Jonathan Corbet
Executive editor, LWN.net
cor...@lwn.net

Anton Blanchard

unread,

Nov 11, 2003, 8:20:21 PM11/11/03

to Linus Torvalds, Martin J. Bligh, Erik Jacobson, linux-...@vger.kernel.org

> There are basically no valid new uses of it. There's a few valid legacy
> users (I think the file descriptor array), and there are some drivers that
> use it (which is crap, but drivers are drivers), and it's _really_ valid
> only for modules. Nothing else.

The IPC code is doing ugly things too:

void* ipc_alloc(int size)
{
void* out;
if(size > PAGE_SIZE)
out = vmalloc(size);
else
out = kmalloc(size, GFP_KERNEL);
return out;
}

Anton

Martin J. Bligh

unread,

Nov 11, 2003, 10:17:12 PM11/11/03

to Anton Blanchard, Linus Torvalds, Erik Jacobson, linux-...@vger.kernel.org

>> There are basically no valid new uses of it. There's a few valid legacy
>> users (I think the file descriptor array), and there are some drivers that
>> use it (which is crap, but drivers are drivers), and it's _really_ valid
>> only for modules. Nothing else.
>
> The IPC code is doing ugly things too:
>
> void* ipc_alloc(int size)
> {
> void* out;
> if(size > PAGE_SIZE)
> out = vmalloc(size);
> else
> out = kmalloc(size, GFP_KERNEL);
> return out;
> }

That seems particularly .... odd ... as PAGE_SIZE isn't anywhere near the
breakpoint. Worst case (and I know I'll get yelled at for this, but I'll
get another amusing analogy out of Linus ;-)) we should just call kmalloc
and if it fails, then try vmalloc ...

M.

Zwane Mwaikambo

unread,

Nov 11, 2003, 10:34:58 PM11/11/03

to Martin J. Bligh, Anton Blanchard, Linus Torvalds, Erik Jacobson, linux-...@vger.kernel.org

On Tue, 11 Nov 2003, Martin J. Bligh wrote:

> That seems particularly .... odd ... as PAGE_SIZE isn't anywhere near the
> breakpoint. Worst case (and I know I'll get yelled at for this, but I'll
> get another amusing analogy out of Linus ;-)) we should just call kmalloc
> and if it fails, then try vmalloc ...

You're Cruisin' for a bruisin' ;)

I wonder if it ever does exceed 128k anyway...

static int semctl_main(...)
{
int nsems = sma->sem_nsems;
...
sem_io = ipc_alloc(sizeof(ushort)*nsems);
...

Erlend Aasland

unread,

Nov 11, 2003, 11:38:46 PM11/11/03

to Linux Kernel Mailing List

Hi,

This is a bit OT in this thread but...
Doesn't the information in /proc/interrupts really belong somewhere in sysfs?

Regards
Erlend Aasland

Martin J. Bligh

unread,

Nov 12, 2003, 2:10:58 AM11/12/03

to Erlend Aasland, Linux Kernel Mailing List

> This is a bit OT in this thread but...
> Doesn't the information in /proc/interrupts really belong somewhere in sysfs?

Only if you want to open 10 billion files to get the data every time.

M.