gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?

Jim

unread,

Nov 6, 2010, 7:30:02 AM11/6/10

to

After upgrading my Dell laptop, both OS+kernel the i8k interface was giving
nonsensical output. As it turned out it's not the kernel but compiler
upgrade which broke this.

Guys at Archlinux have found the underlying cause (but don't seem to have
submitted a patch yet):
https://bbs.archlinux.org/viewtopic.php?pid=780692#p780692
gcc seems to optimize the assembly statements away.

And indeed, applying this patch makes the i8k interface work again,
i.e. replacing the asm(..) construct by asm volatile(..)

--- i8k.c.ORIG 2010-08-02 17:20:46.000000000 +0200
+++ i8k.c 2010-10-30 13:03:07.000000000 +0200
@@ -119,7 +119,7 @@
int eax = regs->eax;

#if defined(CONFIG_X86_64)
- asm("pushq %%rax\n\t"
+ asm volatile("pushq %%rax\n\t"
"movl 0(%%rax),%%edx\n\t"
"pushq %%rdx\n\t"
"movl 4(%%rax),%%ebx\n\t"
@@ -145,7 +145,7 @@
: "a"(regs)
: "%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");
#else
- asm("pushl %%eax\n\t"
+ asm volatile("pushl %%eax\n\t"
"movl 0(%%eax),%%edx\n\t"
"push %%edx\n\t"
"movl 4(%%eax),%%ebx\n\t"

gcc version:
Reading specs from /usr/lib/gcc/i486-slackware-linux/4.5.1/specs
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/i486-slackware-linux/4.5.1/lto-wrapper
Target: i486-slackware-linux
Configured with: ../gcc-4.5.1/configure --prefix=/usr --libdir=/usr/lib
--mandir=/usr/man --infodir=/usr/info --enable-shared --enable-bootstrap
--enable-languages=ada,c,c++,fortran,java,objc --enable-threads=posix
--enable-checking=release --with-system-zlib
--with-python-dir=/lib/python2.6/site-packages
--disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp
--with-gnu-ld --verbose --with-arch=i486 --target=i486-slackware-linux
--build=i486-slackware-linux --host=i486-slackware-linux
Thread model: posix
gcc version 4.5.1 (GCC)

as version:
GNU assembler (Linux/GNU Binutils) 2.20.51.0.11.20100810
Copyright 2010 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `i486-slackware-linux'.

I would be happy to submit a proper patch if I new this is actually the
correct fix, also there might be other places in the kernel source
suffering from this ??

Thanks,
Jim Bos

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Andi Kleen

unread,

Nov 7, 2010, 4:40:03 PM11/7/10

to

Jim <jim...@xs4all.nl> writes:

> After upgrading my Dell laptop, both OS+kernel the i8k interface was giving
> nonsensical output. As it turned out it's not the kernel but compiler
> upgrade which broke this.
>
> Guys at Archlinux have found the underlying cause (but don't seem to have
> submitted a patch yet):
> https://bbs.archlinux.org/viewtopic.php?pid=780692#p780692
> gcc seems to optimize the assembly statements away.
>
> And indeed, applying this patch makes the i8k interface work again,
> i.e. replacing the asm(..) construct by asm volatile(..)

The compiler really should not optimize the asm away, because
it has both input and output arguments which are later used.
"asm volatile" normally just means "don't move significantly"

I tested it with gcc version 4.5.0 20100604 [gcc-4_5-branch revision
160292] (SUSE Linux)
and the asm statement is there for both 32bit and 64bit
(with an allmodconfig, with both -O2 and -Os)

If gcc 4.5.1 broke that over 4.5.0 you should really file a bug report
for the compiler, it seems like a serious regression in 4.5.1

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Andreas Schwab

unread,

Nov 7, 2010, 5:50:02 PM11/7/10

to

Andi Kleen <an...@firstfloor.org> writes:

> Jim <jim...@xs4all.nl> writes:
>
>> After upgrading my Dell laptop, both OS+kernel the i8k interface was giving
>> nonsensical output. As it turned out it's not the kernel but compiler
>> upgrade which broke this.
>>
>> Guys at Archlinux have found the underlying cause (but don't seem to have
>> submitted a patch yet):
>> https://bbs.archlinux.org/viewtopic.php?pid=780692#p780692
>> gcc seems to optimize the assembly statements away.
>>
>> And indeed, applying this patch makes the i8k interface work again,
>> i.e. replacing the asm(..) construct by asm volatile(..)
>
> The compiler really should not optimize the asm away, because
> it has both input and output arguments which are later used.
> "asm volatile" normally just means "don't move significantly"

The asm fails to mention that it modifies *regs.

Andreas.

--
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Andi Kleen

unread,

Nov 7, 2010, 6:10:02 PM11/7/10

to

Andreas Schwab <sch...@linux-m68k.org> writes:
>
> The asm fails to mention that it modifies *regs.

It has a memory clobber, that should be enough, no?

Besides in any case it cannot be eliminated because it has
valid non dead inputs and outputs.

-Andi
--
a...@linux.intel.com -- Speaking for myself only.

Richard Guenther

unread,

Nov 8, 2010, 5:50:02 AM11/8/10

to

On Mon, Nov 8, 2010 at 12:03 AM, Andi Kleen <an...@firstfloor.org> wrote:
> Andreas Schwab <sch...@linux-m68k.org> writes:
>>
>> The asm fails to mention that it modifies *regs.
>
> It has a memory clobber, that should be enough, no?

No. A memory clobber does not cover automatic storage.

Btw, I can't see a testcase anywhere so I just assume Andreas got
it right as usual.

Richard.

Richard Guenther

unread,

Nov 8, 2010, 6:30:02 AM11/8/10

to

On Mon, Nov 8, 2010 at 12:20 PM, Andi Kleen <an...@firstfloor.org> wrote:
> Richard Guenther <richard....@gmail.com> writes:
>
>> On Mon, Nov 8, 2010 at 12:03 AM, Andi Kleen <an...@firstfloor.org> wrote:
>>> Andreas Schwab <sch...@linux-m68k.org> writes:
>>>>
>>>> The asm fails to mention that it modifies *regs.
>>>
>>> It has a memory clobber, that should be enough, no?
>>
>> No. A memory clobber does not cover automatic storage.
>

> That's a separate problem.

>
>> Btw, I can't see a testcase anywhere so I just assume Andreas got
>> it right as usual.
>

> An asm with live inputs and outputs should never be optimized
> way. If 4.5.1 started doing that it's seriously broken.

Please provide a testcase, such asms can be optimized if the
outputs are dead.

Richard.

Andi Kleen

unread,

Nov 8, 2010, 6:30:03 AM11/8/10

to

Richard Guenther <richard....@gmail.com> writes:

> On Mon, Nov 8, 2010 at 12:03 AM, Andi Kleen <an...@firstfloor.org> wrote:
>> Andreas Schwab <sch...@linux-m68k.org> writes:
>>>
>>> The asm fails to mention that it modifies *regs.
>>
>> It has a memory clobber, that should be enough, no?
>
> No. A memory clobber does not cover automatic storage.

That's a separate problem.

> Btw, I can't see a testcase anywhere so I just assume Andreas got
> it right as usual.

An asm with live inputs and outputs should never be optimized

way. If 4.5.1 started doing that it's seriously broken.

-Andi

Paul Koning

unread,

Nov 8, 2010, 6:50:02 AM11/8/10

to

On Nov 8, 2010, at 6:20 AM, Richard Guenther wrote:

> On Mon, Nov 8, 2010 at 12:20 PM, Andi Kleen <an...@firstfloor.org> wrote:
>> Richard Guenther <richard....@gmail.com> writes:
>>
>>> On Mon, Nov 8, 2010 at 12:03 AM, Andi Kleen <an...@firstfloor.org> wrote:
>>>> Andreas Schwab <sch...@linux-m68k.org> writes:
>>>>>
>>>>> The asm fails to mention that it modifies *regs.
>>>>
>>>> It has a memory clobber, that should be enough, no?
>>>
>>> No. A memory clobber does not cover automatic storage.
>>
>> That's a separate problem.
>>
>>> Btw, I can't see a testcase anywhere so I just assume Andreas got
>>> it right as usual.
>>
>> An asm with live inputs and outputs should never be optimized
>> way. If 4.5.1 started doing that it's seriously broken.
>
> Please provide a testcase, such asms can be optimized if the
> outputs are dead.

I don't know about 4.5, but I noticed that with 4.6 (trunk), testcasese like gcc.c-torture/compile/20000804-1.c optimize away the asm and all the operand generation except for -O0.

paul

Michael Matz

unread,

Nov 8, 2010, 7:30:01 AM11/8/10

to

Hi,

On Mon, 8 Nov 2010, Andi Kleen wrote:

> Richard Guenther <richard....@gmail.com> writes:
>
> > On Mon, Nov 8, 2010 at 12:03 AM, Andi Kleen <an...@firstfloor.org> wrote:
> >> Andreas Schwab <sch...@linux-m68k.org> writes:
> >>>
> >>> The asm fails to mention that it modifies *regs.
> >>
> >> It has a memory clobber, that should be enough, no?
> >
> > No. A memory clobber does not cover automatic storage.
>
> That's a separate problem.
>
> > Btw, I can't see a testcase anywhere so I just assume Andreas got
> > it right as usual.
>
> An asm with live inputs and outputs should never be optimized
> way. If 4.5.1 started doing that it's seriously broken.

You know the drill: testcase -> gcc.gnu.org/bugzilla/

(In particular up to now it's only speculation in some forum that the asm
really is optimized away, which I agree would be a bug, or if it isn't
merely that regs->eax isn't reloaded after the asm(), which would be
caused by the problem Andreas mentioned)

Ciao,
Michael.

Jakub Jelinek

unread,

Nov 8, 2010, 8:00:01 AM11/8/10

to

On Mon, Nov 08, 2010 at 06:47:59AM -0500, Paul Koning wrote:
> I don't know about 4.5, but I noticed that with 4.6 (trunk), testcasese
> like gcc.c-torture/compile/20000804-1.c optimize away the asm and all the
> operand generation except for -O0.

That's fine, the asm isn't volatile and the output is not used.

Jakub

Dave Korn

unread,

Nov 8, 2010, 1:20:02 PM11/8/10

to

On 08/11/2010 11:20, Andi Kleen wrote:

> An asm with live inputs and outputs should never be optimized
> way. If 4.5.1 started doing that it's seriously broken.

I don't see that. Consider:

void foo (void)
{
int x, y, z;
x = 23;
y = x + 1;
z = y + 1;
}

So far, you'd agree the compiler may optimise the entire function away? So
why not this:

void foo (void)
{
int x, y, z;
x = 23;
asm ("do something" : "=r" (y) : "r" (x) );
z = y + 1;
}

?

cheers,
DaveK

Michael Matz

unread,

Nov 9, 2010, 8:10:02 AM11/9/10

to

Hi,

On Mon, 8 Nov 2010, Dave Korn wrote:

> void foo (void)
> {
> int x, y, z;
> x = 23;
> asm ("do something" : "=r" (y) : "r" (x) );
> z = y + 1;
> }

The case in i8k.c really is different. It does use the value by
influencing the return value and the callers use the returned value in
conditionals and the like. It really, really _is_ used :-) and if GCC
removes the asm (which up to now is only speculation) then it's a GCC bug.

The code outlines like so:

int i8k_smm (regs) {
int rc;
asm (... : "=r"(rc) ...);
if (rc != 0 || ...)
return -EINVAL;
return 0;
}

...
struct regs regs = {.eax = ...}
return i8k_smm(regs) ?: regs.eax;
...

My speculation is, that the asm is not removed but rather that regs.eax
isn't reloaded after the asm because the memory clobber doesn't clobber
automatic variables.

Ciao,
Michael.

Andi Kleen

unread,

Nov 9, 2010, 8:50:01 AM11/9/10

to

> My speculation is, that the asm is not removed but rather that regs.eax
> isn't reloaded after the asm because the memory clobber doesn't clobber
> automatic variables.

Yes that makes sense. I wasn't able to verify it so far though.

Maybe the original poster could try the obvious patch
instead of the volatile change.

i8k: tell gcc that regs gets clobbered

Signed-off-by: Andi Kleen <a...@linux.intel.com>

diff --git a/drivers/char/i8k.c b/drivers/char/i8k.c
index 3bc0eef..f3bbf73 100644
--- a/drivers/char/i8k.c
+++ b/drivers/char/i8k.c
@@ -142,7 +142,7 @@ static int i8k_smm(struct smm_regs *regs)
"lahf\n\t"
"shrl $8,%%eax\n\t"
"andl $1,%%eax\n"
- :"=a"(rc)
+ :"=a"(rc), "=m" (*regs)

: "a"(regs)
: "%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");
#else

@@ -167,7 +167,7 @@ static int i8k_smm(struct smm_regs *regs)
"movl %%edx,0(%%eax)\n\t"
"lahf\n\t"
"shrl $8,%%eax\n\t"
- "andl $1,%%eax\n":"=a"(rc)
+ "andl $1,%%eax\n":"=a"(rc), "=m" (*regs)

: "a"(regs)
: "%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");

#endif

-Andi

Andreas Schwab

unread,

Nov 9, 2010, 9:00:02 AM11/9/10

to

Andi Kleen <an...@firstfloor.org> writes:

> @@ -142,7 +142,7 @@ static int i8k_smm(struct smm_regs *regs)
> "lahf\n\t"
> "shrl $8,%%eax\n\t"
> "andl $1,%%eax\n"
> - :"=a"(rc)
> + :"=a"(rc), "=m" (*regs)

I think this should be "+m".

Andreas.

--
Andreas Schwab, sch...@redhat.com
GPG Key fingerprint = D4E8 DBE3 3813 BB5D FA84 5EC7 45C6 250E 6F00 984E

"And now for something completely different."

Jim

unread,

Nov 9, 2010, 11:50:04 AM11/9/10

to

On 11/09/2010 02:57 PM, Andreas Schwab wrote:
> Andi Kleen <an...@firstfloor.org> writes:
>
>> @@ -142,7 +142,7 @@ static int i8k_smm(struct smm_regs *regs)
>> "lahf\n\t"
>> "shrl $8,%%eax\n\t"
>> "andl $1,%%eax\n"
>> - :"=a"(rc)
>> + :"=a"(rc), "=m" (*regs)
>
> I think this should be "+m".
>
> Andreas.
>

Just tested Andi's patch with Andreas' suggestion to make it +m,
i.e. like attached and can confirm it solves the issue.

Thanks guys,
Jim Bos

PATCH.i8k.c

Jim Bos

unread,

Nov 13, 2010, 6:20:02 AM11/13/10

to

More recent GCC caused the i8k driver to stop working, on Slackware
compiler was upgraded from gcc-4.4.4 to gcc-4.5.1 after which it
didn't work anymore, meaning the driver didn't load or gave total
nonsensical output.
As it turned out the asm(..) statement forgot to mention it modifies
the *regs variable.
Credits to Andi Kleen <an...@firstfloor.org> and Andreas Schwab
<sch...@linux-m68k.org> for providing the fix.

Signed-off-by: Jim Bos <jim...@xs4all.nl>

PATCH.i8k.c

James Cloos

unread,

Nov 14, 2010, 8:30:01 PM11/14/10

to

Gcc 4.5.1 running on an amd64 box "cross"-compiling for a P3 i8k fails
to compile the module since commit 6b4e81db2552bad04100e7d5ddeed7e848f53b48
with:

CC drivers/char/i8k.o
drivers/char/i8k.c: In function ‘i8k_smm’:
drivers/char/i8k.c:149:2: error: can't find a register in class ‘GENERAL_REGS’ while reloading ‘asm’
drivers/char/i8k.c:149:2: error: ‘asm’ operand has impossible constraints

-JimC
--
James Cloos <cl...@jhcloos.com> OpenPGP: 1024D/ED7DAEA6

Linus Torvalds

unread,

Nov 14, 2010, 10:30:01 PM11/14/10

to

On Sun, Nov 14, 2010 at 4:52 PM, James Cloos <cl...@jhcloos.com> wrote:
> Gcc 4.5.1 running on an amd64 box "cross"-compiling for a P3 i8k fails
> to compile the module since commit 6b4e81db2552bad04100e7d5ddeed7e848f53b48
> with:
>
> CC drivers/char/i8k.o
> drivers/char/i8k.c: In function ‘i8k_smm’:
> drivers/char/i8k.c:149:2: error: can't find a register in class ‘GENERAL_REGS’ while reloading ‘asm’
> drivers/char/i8k.c:149:2: error: ‘asm’ operand has impossible constraints

At this point, I think this falls clearly under "unresolvable gcc bug".

Quite frankly, I think gcc was buggy to begin with: since we had a
memory clobber, the "+m" (*regs) should not have mattered. The fact
that "*regs" may be some local variable doesn't make any difference
what-so-ever, since we took the address of the variable. So the memory
clobber _clearly_ can change that variable.

So when Richard Gunther says "a memory clobber doesn't cover automatic
storage", to me that very clearly spells "gcc is buggy as hell".
Because automatic storage with its address taken _very_ much gets
clobbered by things like memset etc. If the compiler doesn't
understand that, the compiler is just broken.

And now, if even the (superfluous) "+m" isn't working, it sounds like
we have no sane options left. Except to say that gcc-4.5.1 is totally
broken wrt asms.

Can we just get gcc to realize that when you pass the address of
automatic storage to an asm, that means that "memory" really does
clobber it? Because clearly that is the case.

Linus

Jakub Jelinek

unread,

Nov 15, 2010, 4:00:02 AM11/15/10

to

On Sun, Nov 14, 2010 at 07:21:50PM -0800, Linus Torvalds wrote:
> So when Richard Gunther says "a memory clobber doesn't cover automatic
> storage", to me that very clearly spells "gcc is buggy as hell".
> Because automatic storage with its address taken _very_ much gets
> clobbered by things like memset etc. If the compiler doesn't
> understand that, the compiler is just broken.

I'll leave the discussion about meaning of "memory" clobber aside to
Richard,

> And now, if even the (superfluous) "+m" isn't working, it sounds like
> we have no sane options left. Except to say that gcc-4.5.1 is totally

just to say that of course there are sane options left.

:"=a"(rc), "+m" (*regs)

: "a"(regs)
: "%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");

is simply too high register pressure for i386 if you force also
-fno-omit-frame-pointer, there is not a single register left.

Yes, reload should figure out it has address of regs already tied to %eax,
unfortunately starting with IRA it doesn't (I'll file a GCC bug about that;
so that leaves 4.4/4.5/4.6 currently not being able to compile it).

That said, changing the inline asm to just clobber one less register
would be completely sufficient to make it work well with all gccs out there,
just push/pop one of the register around the whole body. I doubt calling
out SMM BIOS is actually so performance critical that one push and one pop
would ruin it. Of course x86_64 version can stay as is, there are enough
registers left...

Jakub

Andi Kleen

unread,

Nov 15, 2010, 4:20:02 AM11/15/10

to

> That said, changing the inline asm to just clobber one less register
> would be completely sufficient to make it work well with all gccs out there,
> just push/pop one of the register around the whole body. I doubt calling
> out SMM BIOS is actually so performance critical that one push and one pop
> would ruin it. Of course x86_64 version can stay as is, there are enough
> registers left...

Yes traditionally clobbering all registers has been dangerous
and it clearly can be done inside the asm too.

Here's a untested patch to do some manual push/pops too. Could someone with
the hardware please test it? (running a 32bit kernel)

-Andi

---

i8k: Clobber less registers

gcc doesn't like inline assembler statements that clobber nearly
all registers. Save a few registers manually on i386 to avoid this
problem.

Fix suggested by Jakub Jelinek

Signed-off-by: Andi Kleen <a...@linux.intel.com>

diff --git a/drivers/char/i8k.c b/drivers/char/i8k.c
index f0863be..a2da38b 100644
--- a/drivers/char/i8k.c
+++ b/drivers/char/i8k.c
@@ -146,7 +146,10 @@ static int i8k_smm(struct smm_regs *regs)

: "a"(regs)
: "%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");

#else
- asm("pushl %%eax\n\t"

+ asm("pushl %%ebx\n\t"
+ "pushl %%ecx\n\t"
+ "pushl %%edx\n\t"
+ "pushl %%eax\n\t"

"movl 0(%%eax),%%edx\n\t"
"push %%edx\n\t"
"movl 4(%%eax),%%ebx\n\t"

@@ -167,10 +170,13 @@ static int i8k_smm(struct smm_regs *regs)
"movl %%edx,0(%%eax)\n\t"

"lahf\n\t"
"shrl $8,%%eax\n\t"

- "andl $1,%%eax\n"
+ "andl $1,%%eax\n\t"
+ "popl %%edx\n\t"
+ "popl %%ecx\n\t"
+ "popl %%ebx\n"

:"=a"(rc), "+m" (*regs)
: "a"(regs)

- : "%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");
+ : "%esi", "%edi", "memory");
#endif
if (rc != 0 || (regs->eax & 0xffff) == 0xffff || regs->eax == eax)
return -EINVAL;

--
a...@linux.intel.com -- Speaking for myself only.

Jakub Jelinek

unread,

Nov 15, 2010, 4:30:02 AM11/15/10

to

On Mon, Nov 15, 2010 at 09:56:05AM +0100, Jakub Jelinek wrote:
> Yes, reload should figure out it has address of regs already tied to %eax,
> unfortunately starting with IRA it doesn't (I'll file a GCC bug about that;

http://gcc.gnu.org/PR46479

Jakub Jelinek

unread,

Nov 15, 2010, 5:10:02 AM11/15/10

to

On Mon, Nov 15, 2010 at 09:56:05AM +0100, Jakub Jelinek wrote:

> On Sun, Nov 14, 2010 at 07:21:50PM -0800, Linus Torvalds wrote:
> > So when Richard Gunther says "a memory clobber doesn't cover automatic
> > storage", to me that very clearly spells "gcc is buggy as hell".
> > Because automatic storage with its address taken _very_ much gets
> > clobbered by things like memset etc. If the compiler doesn't
> > understand that, the compiler is just broken.
>
> I'll leave the discussion about meaning of "memory" clobber aside to
> Richard,

And for this the starting point should be what has been requested,
i.e. preprocessed source + gcc options + gcc version and some hints what
actually misbehaves (with the , "+m" (*regs) change reverted)
in gcc bugzilla. Only with that we can actually look at what has been
happening, see whether it is the tree optimizations or RTL and which one
makes a difference.
If I've missed a PR about this I apologize.

Richard Guenther

unread,

Nov 15, 2010, 5:30:02 AM11/15/10

to

On Mon, Nov 15, 2010 at 9:56 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Sun, Nov 14, 2010 at 07:21:50PM -0800, Linus Torvalds wrote:
>> So when Richard Gunther says "a memory clobber doesn't cover automatic
>> storage", to me that very clearly spells "gcc is buggy as hell".
>> Because automatic storage with its address taken _very_ much gets
>> clobbered by things like memset etc. If the compiler doesn't
>> understand that, the compiler is just broken.
>
> I'll leave the discussion about meaning of "memory" clobber aside to
> Richard,

Of course GCC handles memset just fine. Note that I was refering
to non-address taken automatic storage for "memory" (even though
when double-checking the current implementation GCC even thinks
that all address-taken memory is clobbered by asms as soon as
they have at least one memory operand or a "memory" clobber).

It's just that in future we might want to improve this and I think
not covering non-address taken automatic storage for "memory"
is sensible. And I see that you don't see address-taken automatic
storage as a sensible choice to exclude from "memory", and I
have noted that.

Btw, I still haven't seen an testcase for the actual problem we are
talking about.

Richard.

Andi Kleen

unread,

Nov 15, 2010, 6:00:02 AM11/15/10

to

> And for this the starting point should be what has been requested,
> i.e. preprocessed source + gcc options + gcc version and some hints what
> actually misbehaves (with the , "+m" (*regs) change reverted)
> in gcc bugzilla. Only with that we can actually look at what has been
> happening, see whether it is the tree optimizations or RTL and which one
> makes a difference.
> If I've missed a PR about this I apologize.

I tried to file one, but I can't reproduce it currently
(I don't have hardware, so have to rely on code reading and the 32bit
code looks correct to me even without the additional +m)

The preprocessed source is at
http://halobates.de/tmp/i8k.i

Options I used:

-D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m32 -msoft-float -mregparm=3 -freg-struct-return -mpreferred-stack-boundary=2 -march=i686 -mtune=pentium3 -mtune=generic -maccumulate-outgoing-args -Wa,-mtune=generic32 -ffreestanding -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wframe-larger-than=2048 -fno-stack-protector -fno-omit-frame-pointer -fno-optimize-sibling-calls -pg -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fconserve-stack

-Andi
--
a...@linux.intel.com -- Speaking for myself only.

Jakub Jelinek

unread,

Nov 15, 2010, 6:20:01 AM11/15/10

to

On Mon, Nov 15, 2010 at 11:54:46AM +0100, Andi Kleen wrote:
> > And for this the starting point should be what has been requested,
> > i.e. preprocessed source + gcc options + gcc version and some hints what
> > actually misbehaves (with the , "+m" (*regs) change reverted)
> > in gcc bugzilla. Only with that we can actually look at what has been
> > happening, see whether it is the tree optimizations or RTL and which one
> > makes a difference.
> > If I've missed a PR about this I apologize.
>
> I tried to file one, but I can't reproduce it currently
> (I don't have hardware, so have to rely on code reading and the 32bit
> code looks correct to me even without the additional +m)
>
> The preprocessed source is at
> http://halobates.de/tmp/i8k.i
>
> Options I used:
>
> -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m32 -msoft-float -mregparm=3 -freg-struct-return -mpreferred-stack-boundary=2 -march=i686 -mtune=pentium3 -mtune=generic -maccumulate-outgoing-args -Wa,-mtune=generic32 -ffreestanding -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wframe-larger-than=2048 -fno-stack-protector -fno-omit-frame-pointer -fno-optimize-sibling-calls -pg -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fconserve-stack

Indeed, with this and 4.5.2 20101111 (prerelease) from SVN as well as
gcc-4.5.1-5.fc14:

...
movl %eax, -16(%ebp) # regs, %sfp
movl (%eax), %eax # regs_2(D)->eax,
movl %eax, -20(%ebp) #, %sfp
movl -16(%ebp), %eax # %sfp,
#APP
# 149 "/home/lsrc/git/linux-work2/drivers/char/i8k.c" 1
...
#NO_APP
testl %eax, %eax #
movl $-22, %edx #, D.18378
movl %eax, -24(%ebp) #, %sfp
je .L7 #,
.L2:
movl -12(%ebp), %ebx #,
movl %edx, %eax # D.18378,
movl -8(%ebp), %esi #,
movl -4(%ebp), %edi #,
movl %ebp, %esp #,
popl %ebp #
ret
.p2align 4,,7
.p2align 3
.L7:
movl -16(%ebp), %eax # %sfp,
movl (%eax), %ecx # regs_2(D)->eax, D.18371
cmpw $-1, %cx #, D.18371
je .L2 #,
cmpl %ecx, -20(%ebp) # D.18371, %sfp
cmovne -24(%ebp), %edx # %sfp,, D.18378
jmp .L2 #
.size i8k_smm, .-i8k_smm

I don't see any problems on the assembly level. i8k_smm is
not inlined in this case and checks all 3 conditions.

Guess we need somebody who actually reported the problem, state what
gcc was actually used and post preprocessed source, gcc options
from his case.

Jakub

Andi Kleen

unread,

Nov 15, 2010, 6:40:03 AM11/15/10

to

> Guess we need somebody who actually reported the problem, state what
> gcc was actually used and post preprocessed source, gcc options
> from his case.

Jim Bos,
Can you please supply that?

Please use

rm drivers/char/i8k.o
make V=1 drivers/char/i8k.o
make drivers/char/i8k.i

and supply the .i file and the output of the first make line

Thanks,
-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Linus Torvalds

unread,

Nov 15, 2010, 11:10:02 AM11/15/10

to

On Mon, Nov 15, 2010 at 3:16 AM, Jakub Jelinek <ja...@redhat.com> wrote:
>
> I don't see any problems on the assembly level. i8k_smm is
> not inlined in this case and checks all 3 conditions.

If it really is related to gcc not understanding that "*regs" has
changed due to the memory being an automatic variable, and passing in
"regs" itself as a pointer to that automatic variable together with
the "memory" clobber not being sufficient, than I think it's the lack
of inlining that will automatically hide the bug.

(Side note: and I think this does show how much of a gcc bug it is not
to consider "memory" together with passing in a pointer to an asm to
always be a clobber).

Because if it isn't inlined, then "regs" will be seen a a real pointer
to some external memory (the caller) rather than being optimized to
just be the auto structure on the stack. Because *mem is auto only
within the context of the caller.

Which actually points to a possible simpler:
- remove the "+m" since it adds too much register pressure
- mark the i8k_smm() as "noinline" instead.

Quite frankly, I'd hate to add even more crud to that inline asm (to
save/restore the registers manually). It's already not the prettiest
thing around.

So does the attached patch work for everybody?

Linus

patch.diff

Jim Bos

unread,

Nov 15, 2010, 12:50:04 PM11/15/10

to

Hmm, that doesn't work.

[ Not sure if you read to whole thread but initial workaround was to
change the asm(..) to asm volatile(..) which did work. ]

Jim.

Jakub Jelinek

unread,

Nov 15, 2010, 12:50:02 PM11/15/10

to

On Mon, Nov 15, 2010 at 06:36:06PM +0100, Jim Bos wrote:
> On 11/15/2010 12:37 PM, Andi Kleen wrote:
> See attached, note this is the vanilla 2.6.36 i8k.c (without any patch).
> And to be 100% sure, if I build this (make drivers/char/i8k.ko) it won't
> work.
>
> [ The i8k.i is rather big, even gzipped 80k, not sure if it'll bounce ]

Please also say which exact gcc you are using.

Note, I've compiled it with current 4.5 branch and made the function
always_inline and still didn't see any issues in the *.optimized dump,
regs.eax after the inline asm has always been compared to the constant
that has been stored into regs.eax before the inline asm.

Jakub

Linus Torvalds

unread,

Nov 15, 2010, 1:20:02 PM11/15/10

to

On Mon, Nov 15, 2010 at 9:40 AM, Jim Bos <jim...@xs4all.nl> wrote:
>
> Hmm, that doesn't work.
>
> [ Not sure if you read to whole thread but initial workaround was to
> change the asm(..) to asm volatile(..) which did work. ]

Since I have a different gcc than yours (and I'm not going to compile
my own), have you posted your broken .s file anywhere? In fact, with
the noinline (and the removal of the "+m" thing - iow just the patch
you tried), what does just the "i8k_smm" function assembly look like
for you after you've done a "make drivers/char/i8k.s"?

If the asm just doesn't exist AT ALL, that's just odd. Because every
single call-site of i8k_smm() clearly looks at the return value. So
the volatile really shouldn't make any difference from that
standpoint. Odd.

Linus

Jim Bos

unread,

Nov 15, 2010, 1:20:02 PM11/15/10

to

# gcc -v
Reading specs from /usr/lib/gcc/i486-slackware-linux/4.5.1/specs
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/i486-slackware-linux/4.5.1/lto-wrapper
Target: i486-slackware-linux
Configured with: ../gcc-4.5.1/configure --prefix=/usr --libdir=/usr/lib
--mandir=/usr/man --infodir=/usr/info --enable-shared --enable-bootstrap
--enable-languages=ada,c,c++,fortran,java,objc --enable-threads=posix
--enable-checking=release --with-system-zlib
--with-python-dir=/lib/python2.6/site-packages
--disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp
--with-gnu-ld --verbose --with-arch=i486 --target=i486-slackware-linux
--build=i486-slackware-linux --host=i486-slackware-linux
Thread model: posix
gcc version 4.5.1 (GCC)

I'm re-reading this thread where I found the asm-> asm volatine suggestion:
https://bbs.archlinux.org/viewtopic.php?pid=752099#p752099
but nobody there reported their gcc version (but apparently first
people started complaining May 1st).

_
Jim

Jakub Jelinek

unread,

Nov 15, 2010, 1:30:02 PM11/15/10

to

On Mon, Nov 15, 2010 at 07:17:31PM +0100, Jim Bos wrote:
> # gcc -v
> Reading specs from /usr/lib/gcc/i486-slackware-linux/4.5.1/specs
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/i486-slackware-linux/4.5.1/lto-wrapper
> Target: i486-slackware-linux
> Configured with: ../gcc-4.5.1/configure --prefix=/usr --libdir=/usr/lib
> --mandir=/usr/man --infodir=/usr/info --enable-shared --enable-bootstrap
> --enable-languages=ada,c,c++,fortran,java,objc --enable-threads=posix
> --enable-checking=release --with-system-zlib
> --with-python-dir=/lib/python2.6/site-packages
> --disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp
> --with-gnu-ld --verbose --with-arch=i486 --target=i486-slackware-linux
> --build=i486-slackware-linux --host=i486-slackware-linux
> Thread model: posix
> gcc version 4.5.1 (GCC)

Does it have any patches applied? The gcc options look the same as what
I've been already trying earlier.
Thus, can you run gcc with those options on i8k.i and add -fverbose-asm
to make it easier to read and post i8k.s you get?

Jim Bos

unread,

Nov 15, 2010, 1:40:03 PM11/15/10

to

On 11/15/2010 07:30 PM, Jim Bos wrote:

> On 11/15/2010 07:08 PM, Linus Torvalds wrote:
>> On Mon, Nov 15, 2010 at 9:40 AM, Jim Bos <jim...@xs4all.nl> wrote:
>>>
>>> Hmm, that doesn't work.
>>>
>>> [ Not sure if you read to whole thread but initial workaround was to
>>> change the asm(..) to asm volatile(..) which did work. ]
>>
>> Since I have a different gcc than yours (and I'm not going to compile
>> my own), have you posted your broken .s file anywhere? In fact, with
>> the noinline (and the removal of the "+m" thing - iow just the patch
>> you tried), what does just the "i8k_smm" function assembly look like
>> for you after you've done a "make drivers/char/i8k.s"?
>>
>> If the asm just doesn't exist AT ALL, that's just odd. Because every
>> single call-site of i8k_smm() clearly looks at the return value. So
>> the volatile really shouldn't make any difference from that
>> standpoint. Odd.
>>
>> Linus
>>
>

> Attached version with plain 2.6.36 source and version with the committed
> patch, i.e with the '"+m" (*regs)'
>
>
> _
> Jim
>
>

And I just tried with your noninline patch which results in exactly the
same .s file as with plain 2.6.36 source, i.e. the noninline patch is
not doing anything here.

_
Jim

Jeff Law

unread,

Nov 15, 2010, 1:50:02 PM11/15/10

to

On 11/07/10 15:41, Andreas Schwab wrote:
> Andi Kleen<an...@firstfloor.org> writes:
>

>> Jim<jim...@xs4all.nl> writes:
>>
>>> After upgrading my Dell laptop, both OS+kernel the i8k interface was giving
>>> nonsensical output. As it turned out it's not the kernel but compiler
>>> upgrade which broke this.
>>>
>>> Guys at Archlinux have found the underlying cause (but don't seem to have
>>> submitted a patch yet):
>>> https://bbs.archlinux.org/viewtopic.php?pid=780692#p780692
>>> gcc seems to optimize the assembly statements away.
>>>
>>> And indeed, applying this patch makes the i8k interface work again,
>>> i.e. replacing the asm(..) construct by asm volatile(..)
>> The compiler really should not optimize the asm away, because
>> it has both input and output arguments which are later used.
>> "asm volatile" normally just means "don't move significantly"

> The asm fails to mention that it modifies *regs.

But there's a memory clobber, that should be sufficient to indicate
*regs is modified.

jeff

Jeff Law

unread,

Nov 15, 2010, 1:50:02 PM11/15/10

to

On 11/08/10 03:49, Richard Guenther wrote:

> On Mon, Nov 8, 2010 at 12:03 AM, Andi Kleen<an...@firstfloor.org> wrote:
>> Andreas Schwab<sch...@linux-m68k.org> writes:
>>> The asm fails to mention that it modifies *regs.

>> It has a memory clobber, that should be enough, no?
> No. A memory clobber does not cover automatic storage.
A memory clobber should clobber anything in memory, including autos in
memory; if it doesn't, then that seems like a major problem. I'd like
to see the rationale behind not clobbering autos in memory.

Jeff

Linus Torvalds

unread,

Nov 15, 2010, 2:00:03 PM11/15/10

to

On Mon, Nov 15, 2010 at 10:30 AM, Jim Bos <jim...@xs4all.nl> wrote:
>
> Attached version with plain 2.6.36 source and version with the committed
> patch, i.e with the '"+m" (*regs)'

Looks 100% identical in i8k_smm() itself, and I'm not seeing anything
bad. The asm has certainly not been optimized away as implied in the
archlinux thread.

There are differences, but they are with code generation *elsewhere*.

To me it is starting to look like the real problem is that gcc has
decided that the "i8k_smm()" function is "__attribute__((const))".

Which is clearly totally bogus. If a function has an inline asm that
has a memory clobber, it is clearly *not* 'const'. But that does
explain the bug, and does explain why "+m" makes a difference and why
"noinline" does not.

So what I _think_ happens is that

- gcc logic for the automatic 'const' attribute for functions is
broken, so it marks that function 'const'.

- since the rule for a const function is that it only _looks_ at its
attributes and has no side effects, now the callers will decide that
'i8k_smm()' cannot change the passed-in structure, so they'll happily
optimize away all the accesses to it.

Jakub Jelinek

unread,

Nov 15, 2010, 2:10:02 PM11/15/10

to

On Mon, Nov 15, 2010 at 07:30:35PM +0100, Jim Bos wrote:
> On 11/15/2010 07:08 PM, Linus Torvalds wrote:

> > On Mon, Nov 15, 2010 at 9:40 AM, Jim Bos <jim...@xs4all.nl> wrote:
> >>
> >> Hmm, that doesn't work.
> >>
> >> [ Not sure if you read to whole thread but initial workaround was to
> >> change the asm(..) to asm volatile(..) which did work. ]
> >
> > Since I have a different gcc than yours (and I'm not going to compile
> > my own), have you posted your broken .s file anywhere? In fact, with
> > the noinline (and the removal of the "+m" thing - iow just the patch
> > you tried), what does just the "i8k_smm" function assembly look like
> > for you after you've done a "make drivers/char/i8k.s"?
> >
> > If the asm just doesn't exist AT ALL, that's just odd. Because every
> > single call-site of i8k_smm() clearly looks at the return value. So
> > the volatile really shouldn't make any difference from that
> > standpoint. Odd.
> >
> > Linus
> >
>

> Attached version with plain 2.6.36 source and version with the committed
> patch, i.e with the '"+m" (*regs)'

Thanks, this actually helped to see the problem.
The problem is not inside of i8k_smm, which is not inlined, but in the
callers.
ipa-pure-const.c pass thinks i8k_smm is a pure function, thus
regs = {};
regs.eax = 166;
x = i8k_smm (&regs);
if (!x) x = regs.eax;
in the callers is optimized into
regs = {}
regs.eax = 166;
x = i8k_smm (&regs);
if (!x) x = 166;
Now, not sure why this happens, as there is
case GIMPLE_ASM:
for (i = 0; i < gimple_asm_nclobbers (stmt); i++)
{
tree op = gimple_asm_clobber_op (stmt, i);
if (simple_cst_equal(TREE_VALUE (op), memory_identifier_string) == 1)
{
if (dump_file)
fprintf (dump_file, " memory asm clobber is not const/pure");
/* Abandon all hope, ye who enter here. */
local->pure_const_state = IPA_NEITHER;
}
}
Debugging...

Jakub

Jakub Jelinek

unread,

Nov 15, 2010, 2:20:02 PM11/15/10

to

On Mon, Nov 15, 2010 at 07:58:48PM +0100, Jakub Jelinek wrote:
> Now, not sure why this happens, as there is
> case GIMPLE_ASM:
> for (i = 0; i < gimple_asm_nclobbers (stmt); i++)
> {
> tree op = gimple_asm_clobber_op (stmt, i);
> if (simple_cst_equal(TREE_VALUE (op), memory_identifier_string) == 1)
> {
> if (dump_file)
> fprintf (dump_file, " memory asm clobber is not const/pure");
> /* Abandon all hope, ye who enter here. */
> local->pure_const_state = IPA_NEITHER;
> }
> }
> Debugging...

Ah, the problem is that memory_identifier_string is only initialized in
ipa-reference.c's initialization, so it can be (and is in this case) NULL in
ipa-pure-const.c.

Two possible fixes (the latter is apparently what is used in
tree-ssa-operands.c, so is probably sufficient). Guess ipa-reference.c
should be changed to do the same and just drop memory_identifier_string.

Jakub

1

2

Linus Torvalds

unread,

Nov 15, 2010, 2:20:02 PM11/15/10

to

On Mon, Nov 15, 2010 at 10:45 AM, Jeff Law <l...@redhat.com> wrote:
>
> A memory clobber should clobber anything in memory, including autos in
> memory; if it doesn't, then that seems like a major problem. I'd like to
> see the rationale behind not clobbering autos in memory.

Yes. It turns out that the "asm optimized away" was entirely wrong (we
never saw that, it was just a report on another mailing list).

Looking at the asm posted, it seems to me that gcc actually compiles
the asm itself 100% correctly, and the "memory" clobber is working
fine inside that function. So the code generated for i8k_smm() itself
is all good.

But _while_ generating the good code, gcc doesn't seem to realize that
it writes to anything, so it decides to mark the function
"__attribute__((const))", which is obviously wrong (a memory clobber
definitely implies that it's not const). And as a result, the callers
will be mis-optimized, because they do things like

static int i8k_get_bios_version(void)
{
struct smm_regs regs = { .eax = I8K_SMM_BIOS_VERSION, };

return i8k_smm(&regs) ? : regs.eax;
}

and since gcc has (incorrectly) decided that "i8k_smm()" is a const
function, it thinks that "regs.eax" hasn't changed, so it doesn't
bother to reload it: it "knows" that it is still I8K_SMM_BIOS_VERSION
that it initialized it with. So it will basically have rewritten that
final return statement as

return i8k_smm(&regs) ? : I8K_SMM_BIOS_VERSION;

which obviously doesn't really work.

This also explains why adding "volatile" worked. The "asm volatile"
triggered "this is not a const function".

Similarly, the "+m" works, because it also makes clear that the asm is
writing to memory, and isn't a const function.

Now, the "memory" clobber should clearly also have done that, but I'd
be willing to bet that some version of gcc (possibly extra slackware
patches) had forgotten the trivial logic to say "a memory clobber also
makes the user function non-const".

Linus

Jim Bos

unread,

Nov 15, 2010, 2:20:02 PM11/15/10

to

On 11/15/2010 07:26 PM, Jakub Jelinek wrote:
> On Mon, Nov 15, 2010 at 07:17:31PM +0100, Jim Bos wrote:
>> # gcc -v
>> Reading specs from /usr/lib/gcc/i486-slackware-linux/4.5.1/specs
>> COLLECT_GCC=gcc
>> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/i486-slackware-linux/4.5.1/lto-wrapper
>> Target: i486-slackware-linux
>> Configured with: ../gcc-4.5.1/configure --prefix=/usr --libdir=/usr/lib
>> --mandir=/usr/man --infodir=/usr/info --enable-shared --enable-bootstrap
>> --enable-languages=ada,c,c++,fortran,java,objc --enable-threads=posix
>> --enable-checking=release --with-system-zlib
>> --with-python-dir=/lib/python2.6/site-packages
>> --disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp
>> --with-gnu-ld --verbose --with-arch=i486 --target=i486-slackware-linux
>> --build=i486-slackware-linux --host=i486-slackware-linux
>> Thread model: posix
>> gcc version 4.5.1 (GCC)
>
> Does it have any patches applied? The gcc options look the same as what
> I've been already trying earlier.
> Thus, can you run gcc with those options on i8k.i and add -fverbose-asm
> to make it easier to read and post i8k.s you get?
>
> Jakub
>

Slackware is typically not patching much (and I'm just using the
pre-compiled binary). Here is the link to how it's built:
http://slackware.osuosl.org/slackware-current/source/d/gcc/
there doesn't appear to be anything relevant changed.

I already posted the .s files, plain 2.6.36 and the one with working
patch, I =think= that's already using -fverbose-asm, at least that shows
in the output.

_
Jim

Linus Torvalds

unread,

Nov 15, 2010, 2:30:02 PM11/15/10

to

On Mon, Nov 15, 2010 at 11:12 AM, Jakub Jelinek <ja...@redhat.com> wrote:
>
> Ah, the problem is that memory_identifier_string is only initialized in
> ipa-reference.c's initialization, so it can be (and is in this case) NULL in
> ipa-pure-const.c.

Ok. And I guess you can verify that all versions of gcc do this
correctly for "asm volatile"?

Because since we'll have to work around this problem in the kernel, I
suspect the simplest solution is to remove the "+m" that causes
register pressure problems, and then use "asm volatile" to work around
the const-function bug.

And add a large comment about why "asm volatile" is probably always a
good idea when you have a memory clobber and don't have any other
visible memory modifications.

I do wonder if this explains some of the problems we had with the
bitop asms too.

Hmm?

Linus

Richard Henderson

unread,

Nov 15, 2010, 3:00:02 PM11/15/10

to

On 11/15/2010 11:12 AM, Jakub Jelinek wrote:
> - if (simple_cst_equal(TREE_VALUE (op), memory_identifier_string) == 1)
> + if (strcmp (TREE_STRING_POINTER (TREE_VALUE (link)), "memory") == 0)

I prefer this solution. I think memory_identifier_string is over-engineering.
Patch to remove it entirely is pre-approved.

r~

Jakub Jelinek

unread,

Nov 15, 2010, 3:00:02 PM11/15/10

to

On Mon, Nov 15, 2010 at 11:21:30AM -0800, Linus Torvalds wrote:
> On Mon, Nov 15, 2010 at 11:12 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> >
> > Ah, the problem is that memory_identifier_string is only initialized in
> > ipa-reference.c's initialization, so it can be (and is in this case) NULL in
> > ipa-pure-const.c.
>
> Ok. And I guess you can verify that all versions of gcc do this
> correctly for "asm volatile"?

Yes, reading 4.1/4.2/4.3/4.4/4.5/4.6 code ipa-pure-const.c handles
asm volatile correctly, in each case the function is no longer assumed to be
pure or const in the discovery (of course, user can still say the
function is const or pure). 4.0 and earlier didn't have ipa-pure-const.c.

Using the simplified

extern void abort (void);

__attribute__((noinline)) int
foo (int *p)
{
int r;
asm ("movl $6, (%1)\n\txorl %0, %0" : "=r" (r) : "r" (p) : "memory");
return r;
}

int
main (void)
{
int p = 8;
if ((foo (&p) ? : p) != 6)
abort ();
return 0;
}

testcase shows that in 4.1/4.2/4.3/4.4 this is miscompiled only when using
-fno-ipa-reference, in 4.5 it is miscompiled always when optimizing
unless -fno-ipa-pure-const (as 4.5 added local-pure-const pass which is run
before ipa-reference) and in 4.6 this has been fixed by Honza when
doing ipa cleanups.

> Because since we'll have to work around this problem in the kernel, I
> suspect the simplest solution is to remove the "+m" that causes
> register pressure problems, and then use "asm volatile" to work around
> the const-function bug.

Yes.

Jakub

Jim Bos

unread,

Nov 15, 2010, 3:30:03 PM11/15/10

to

Linus,

In case you didn't already fixed this, here's the follow-up patch.
---

The fix to work around the gcc miscompiling i8k.c to add "+m (*regs)"
caused register pressure problems. Changing the 'asm' statement to
'asm volatile' instead should prevent that and works around the gcc
bug as well.

Signed-off-by: Jim Bos <jim...@xs4all.nl>

PATCH2.i8k.c

Richard Guenther

unread,

Nov 15, 2010, 5:10:01 PM11/15/10

to

On Mon, Nov 15, 2010 at 7:45 PM, Jeff Law <l...@redhat.com> wrote:
> On 11/08/10 03:49, Richard Guenther wrote:
>>
>> On Mon, Nov 8, 2010 at 12:03 AM, Andi Kleen<an...@firstfloor.org> wrote:
>>>
>>> Andreas Schwab<sch...@linux-m68k.org> writes:
>>>>
>>>> The asm fails to mention that it modifies *regs.
>>>
>>> It has a memory clobber, that should be enough, no?
>>
>> No. A memory clobber does not cover automatic storage.
>
> A memory clobber should clobber anything in memory, including autos in
> memory; if it doesn't, then that seems like a major problem. I'd like to
> see the rationale behind not clobbering autos in memory.

Non-address taken automatic storage. (note that we don't excercise this
in optimization yet)

It's difficult to model thins kind of non-aliased memory with this kind
of aliasing mechanism (apart from taking all asms as clobbering
everything as we currently do).

Richard.

Andi Kleen

unread,

Nov 15, 2010, 5:50:01 PM11/15/10

to

> testcase shows that in 4.1/4.2/4.3/4.4 this is miscompiled only when using
> -fno-ipa-reference, in 4.5 it is miscompiled always when optimizing
> unless -fno-ipa-pure-const (as 4.5 added local-pure-const pass which is run
> before ipa-reference) and in 4.6 this has been fixed by Honza when
> doing ipa cleanups.

Maybe it would be better to simply change the kernel Makefiles to pass
-fno-ipa-pure-const instead of adding volatiles everywhere.

-Andi

Jakub Jelinek

unread,

Nov 15, 2010, 5:50:01 PM11/15/10

to

On Mon, Nov 15, 2010 at 11:43:22PM +0100, Andi Kleen wrote:
> > testcase shows that in 4.1/4.2/4.3/4.4 this is miscompiled only when using
> > -fno-ipa-reference, in 4.5 it is miscompiled always when optimizing
> > unless -fno-ipa-pure-const (as 4.5 added local-pure-const pass which is run
> > before ipa-reference) and in 4.6 this has been fixed by Honza when
> > doing ipa cleanups.
>
> Maybe it would be better to simply change the kernel Makefiles to pass
> -fno-ipa-pure-const instead of adding volatiles everywhere.

If you do this, please do it for 4.5.[012] only. If you disable all gcc
passes that ever had any bugs in it, you'd need to disable most of them if
not all.

Jakub

Richard Guenther

unread,

Nov 15, 2010, 6:10:01 PM11/15/10

to

On Mon, Nov 15, 2010 at 11:58 PM, Jeff Law <l...@redhat.com> wrote:
> On 11/15/10 15:07, Richard Guenther wrote:
>>
>> On Mon, Nov 15, 2010 at 7:45 PM, Jeff Law<l...@redhat.com> wrote:
>>>
>>> On 11/08/10 03:49, Richard Guenther wrote:
>>>>
>>>> On Mon, Nov 8, 2010 at 12:03 AM, Andi Kleen<an...@firstfloor.org>
>>>> wrote:
>>>>>
>>>>> Andreas Schwab<sch...@linux-m68k.org> writes:
>>>>>>
>>>>>> The asm fails to mention that it modifies *regs.
>>>>>
>>>>> It has a memory clobber, that should be enough, no?
>>>>
>>>> No. A memory clobber does not cover automatic storage.
>>>
>>> A memory clobber should clobber anything in memory, including autos in
>>> memory; if it doesn't, then that seems like a major problem. I'd like to
>>> see the rationale behind not clobbering autos in memory.
>>
>> Non-address taken automatic storage. (note that we don't excercise this
>> in optimization yet)
>

> If the address of the auto isn't taken, then why is the object in memory to
> begin with (with the obvious exception for aggregates).

Exactly sort of my point. If people pass the address of &x to an asm
and modify &x + 8 expecting the "adjacent" stack location to be changed
I want to tell them that's not a supported way to get to another stack
variable (even if they clobber "memory"). Or consider the C-decl guy
who wants to access adjacent parameters by address arithmetic on
the address of the first param ...

Jeff Law

unread,

Nov 15, 2010, 11:20:01 PM11/15/10

to

On 11/15/10 16:07, Richard Guenther wrote:
>>
>> If the address of the auto isn't taken, then why is the object in memory to
>> begin with (with the obvious exception for aggregates).
> Exactly sort of my point. If people pass the address of&x to an asm

> and modify&x + 8 expecting the "adjacent" stack location to be changed

> I want to tell them that's not a supported way to get to another stack
> variable (even if they clobber "memory"). Or consider the C-decl guy
> who wants to access adjacent parameters by address arithmetic on
> the address of the first param ...

Well, in that case, I think we can easily say that the programmer has
gone off the deep end and has entered the realm of undefined behavior.

Presumably we rooted out all relevant instances of the latter over the
last 20 years... It was fairly common in the past, but I doubt anyone
worth caring about is still writing code assuming they can take the
address of parameter A, offset it and get parameters B, C, D, etc.

jeff

Matthew Garrett

unread,

Jun 3, 2011, 9:40:01 AM6/3/11

to

On Fri, Jun 03, 2011 at 03:05:26PM +0200, Jim Bos wrote:
>
> *,
>
> Just applied 2.6.39.1 but my system: 64-bit SLackware, 8GB RAM, MSI mobo
> P67A-C45 bios V1.9. previously happily booting via EFI (nice to see
> Linux and the other OS actually NOT overwriting each others boot
> loaders!) immediately re-booted just after loading the kernel.
>
> Not even a single message is making it on the console, almost immediate
> system reset. As I noticed there are several EFI related patches, I
> first tried to backout the change to setup.c but that didn't help so
> backing out all the efi changes in the 2.6.39.1 patch, i.e. the files:
> - arch/x86/kernel/setup.c
> - arch/x86/platform/efi/efi.c
> - arch/x86/platform/efi/efi_64.c
> and sure enough system is booting on 2.6.39.1 with these changes removed.

Can you try just reverting

"x86, efi: Retain boot service code until after switching to virtual mode"

? You've got 143 boot services/code regions, which is more than I'd
tested against, so I'm unsure whether we're overflowing something here.

--
Matthew Garrett | mj...@srcf.ucam.org

Jim Bos

unread,

Jun 3, 2011, 10:30:01 AM6/3/11

to

On 06/03/2011 03:33 PM, Matthew Garrett wrote:
> On Fri, Jun 03, 2011 at 03:05:26PM +0200, Jim Bos wrote:
>>
>> *,
>>
>> Just applied 2.6.39.1 but my system: 64-bit SLackware, 8GB RAM, MSI mobo
>> P67A-C45 bios V1.9. previously happily booting via EFI (nice to see
>> Linux and the other OS actually NOT overwriting each others boot
>> loaders!) immediately re-booted just after loading the kernel.
>>
>> Not even a single message is making it on the console, almost immediate
>> system reset. As I noticed there are several EFI related patches, I
>> first tried to backout the change to setup.c but that didn't help so
>> backing out all the efi changes in the 2.6.39.1 patch, i.e. the files:
>> - arch/x86/kernel/setup.c
>> - arch/x86/platform/efi/efi.c
>> - arch/x86/platform/efi/efi_64.c
>> and sure enough system is booting on 2.6.39.1 with these changes removed.
>
> Can you try just reverting
>
> "x86, efi: Retain boot service code until after switching to virtual mode"
>
> ? You've got 143 boot services/code regions, which is more than I'd
> tested against, so I'm unsure whether we're overflowing something here.
>

That's seems to be the only EFI patch in 2.6.39.1 and I effectively
removed by =not= applying (skipping) the parts of the 2.6.39.1 patch to
above 3 files.
So yes removing "x86, efi: Retain boot service code until after
switching to virtual mode" indeed fixes the problem for me.

Jim

Matthew Garrett

unread,

Jun 3, 2011, 10:50:01 AM6/3/11

to

On Fri, Jun 03, 2011 at 04:26:27PM +0200, Jim Bos wrote:
> On 06/03/2011 03:33 PM, Matthew Garrett wrote:
> > ? You've got 143 boot services/code regions, which is more than I'd
> > tested against, so I'm unsure whether we're overflowing something here.
> >
>
> That's seems to be the only EFI patch in 2.6.39.1 and I effectively
> removed by =not= applying (skipping) the parts of the 2.6.39.1 patch to
> above 3 files.
> So yes removing "x86, efi: Retain boot service code until after
> switching to virtual mode" indeed fixes the problem for me.

Ok, thanks. I'll look into that. Might be best to drop it from stable
for the moment until I've made sure it works on machines with excessive
maps.

--
Matthew Garrett | mj...@srcf.ucam.org

Maarten Lankhorst

unread,

Jun 5, 2011, 9:00:02 AM6/5/11

to

Hi,

2011/6/5 Jim Bos <jim...@xs4all.nl>:

> On 06/03/2011 04:46 PM, Matthew Garrett wrote:
>> On Fri, Jun 03, 2011 at 04:26:27PM +0200, Jim Bos wrote:
>>> On 06/03/2011 03:33 PM, Matthew Garrett wrote:
>>>> ? You've got 143 boot services/code regions, which is more than I'd
>>>> tested against, so I'm unsure whether we're overflowing something here.
>>>>
>>>
>>> That's seems to be the only EFI patch in 2.6.39.1 and I effectively
>>> removed by =not= applying (skipping) the parts of the 2.6.39.1 patch to
>>> above 3 files.
>>> So yes removing "x86, efi: Retain boot service code until after
>>> switching to virtual mode" indeed fixes the problem for me.
>>
>> Ok, thanks. I'll look into that. Might be best to drop it from stable
>> for the moment until I've made sure it works on machines with excessive
>> maps.
>>
>

> Matthew,
>
> Another datapoint, just tried a 3.0.0-rc1 kernel, good news is that it
> boots. Maybe the "x86, efi: Merge contiguous memory regions of the same
> type" helps with these large number of maps?
>
> However, there seems to be something else wrong as I get several "BUG:
> Bad page state in process swapper pfn:00000" which have efi functions
> on the stack, see dmesg output.
For what it's worth, I have noticed the exact same problems here when
booting 2.6.39.1 on an asrock P67 chipset. The reboot on 2.6.39.1 and
the bag page states. I also noticed some instability with 3.0 with it
freezing up, but that's probably related to broken binary nvidia
drivers.

~Maarten

Maarten Lankhorst

unread,

Jun 6, 2011, 11:10:02 AM6/6/11

to

Hi Matthew,

2011/6/3 Matthew Garrett <mj...@srcf.ucam.org>:

> On Fri, Jun 03, 2011 at 04:26:27PM +0200, Jim Bos wrote:
>> On 06/03/2011 03:33 PM, Matthew Garrett wrote:
>> > ? You've got 143 boot services/code regions, which is more than I'd
>> > tested against, so I'm unsure whether we're overflowing something here.
>> >
>>
>> That's seems to be the only EFI patch in 2.6.39.1 and I effectively
>> removed by =not= applying (skipping) the parts of the 2.6.39.1 patch to
>> above 3 files.
>> So yes removing "x86, efi: Retain boot service code until after
>> switching to virtual mode" indeed fixes the problem for me.
>
> Ok, thanks. I'll look into that. Might be best to drop it from stable
> for the moment until I've made sure it works on machines with excessive
> maps.

Looking at your patch in 2.6.39.1

I see:
+ memblock_x86_reserve_range(start, start + size, "EFI Boot");

and to free it:
+ free_bootmem_late(start, size);

Maybe this is causing the pager issue on 3.0, can you test this patch?

efi: free memory with the correct call

Commit 916f676f8dc introduced a call to free_bootmem_late while it
reserves memory with memblock_x64_reserve_range
Fix this call to silence the swapper BUGs:

BUG: Bad page state in process swapper pfn:00000

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 0d3a4fa..d2eefaa 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -29,7 +29,6 @@
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/efi.h>
-#include <linux/bootmem.h>
#include <linux/memblock.h>
#include <linux/spinlock.h>
#include <linux/uaccess.h>
@@ -334,7 +333,7 @@ static void __init efi_free_boot_services(void)
md->type != EFI_BOOT_SERVICES_DATA)
continue;

- free_bootmem_late(start, size);
+ memblock_x86_free_memory_in_range(start, start + size);
}
}

--
~Maarten

Maarten Lankhorst

unread,

Jun 6, 2011, 11:30:03 AM6/6/11

to

Hi Jim,

2011/6/3 Jim Bos <jim...@xs4all.nl>:

> On 06/03/2011 03:33 PM, Matthew Garrett wrote:
>> On Fri, Jun 03, 2011 at 03:05:26PM +0200, Jim Bos wrote:
>>>
>>> *,
>>>
>>> Just applied 2.6.39.1 but my system: 64-bit SLackware, 8GB RAM, MSI mobo
>>> P67A-C45 bios V1.9. previously happily booting via EFI (nice to see
>>> Linux and the other OS actually NOT overwriting each others boot
>>> loaders!) immediately re-booted just after loading the kernel.
>>>
>>> Not even a single message is making it on the console, almost immediate
>>> system reset. As I noticed there are several EFI related patches, I
>>> first tried to backout the change to setup.c but that didn't help so
>>> backing out all the efi changes in the 2.6.39.1 patch, i.e. the files:
>>> - arch/x86/kernel/setup.c
>>> - arch/x86/platform/efi/efi.c
>>> - arch/x86/platform/efi/efi_64.c
>>> and sure enough system is booting on 2.6.39.1 with these changes removed.
>>
>> Can you try just reverting
>>
>> "x86, efi: Retain boot service code until after switching to virtual mode"
>>
>> ? You've got 143 boot services/code regions, which is more than I'd
>> tested against, so I'm unsure whether we're overflowing something here.
>>
>
> That's seems to be the only EFI patch in 2.6.39.1 and I effectively
> removed by =not= applying (skipping) the parts of the 2.6.39.1 patch to
> above 3 files.
> So yes removing "x86, efi: Retain boot service code until after
> switching to virtual mode" indeed fixes the problem for me.

Does manually applying commit 9cd2b07c1 fix things for 2.6.39.1?
commit is: x86, efi: Consolidate EFI nx control

There will be 1 apply failure, need to change the call to
early_mapping_set_exec in early_runtime_code_mapping_set_exec to
efi_set_executable.

~Maarten

Jim Bos

unread,

Jun 6, 2011, 11:50:03 AM6/6/11

to

For me that indeed gets rid of the "BUG Bad page state...", ie. all is
fine on 3.0-rc1 with this patch on top.

Tested-By: Jim Bos <jim...@xs4all.nl>

Jim.

Matthew Garrett

unread,

Jun 6, 2011, 11:50:03 AM6/6/11

to

On Mon, Jun 06, 2011 at 05:01:58PM +0200, Maarten Lankhorst wrote:
> Maybe this is causing the pager issue on 3.0, can you test this patch?

Looks good to me. Can you post this to x86@ ?

--
Matthew Garrett | mj...@srcf.ucam.org

Jim Bos

unread,

Jun 6, 2011, 12:20:03 PM6/6/11

to

Maarten,

Yes that boots, but with the "BUG Bad page state .." as well (so would
need you other patch).

Jim

Maarten Lankhorst

unread,

Jun 6, 2011, 12:50:03 PM6/6/11

to

Hi Jim,

Op 06-06-11 18:11, Jim Bos schreef:

So the options are applying that patch + free fix to stable,
or revert the change. Also looking at the comments of the
original patch, it seems it was changed because of Yinghai's comments:
> No, at that point memblock is not used any more. after mm_init()/mem_init()
> need to use free_bootmem_late() in free_efi_boot_services

But that appears to have been incorrect now, seeing that
commit 774ea0bcb27f57b removed that transition

~Maarten

Yinghai Lu

unread,

Jun 6, 2011, 8:20:01 PM6/6/11

to

that is not right fix.

>>
>> Jim
> So the options are applying that patch + free fix to stable,
> or revert the change. Also looking at the comments of the
> original patch, it seems it was changed because of Yinghai's comments:
>> No, at that point memblock is not used any more. after mm_init()/mem_init()
>> need to use free_bootmem_late() in free_efi_boot_services
>
> But that appears to have been incorrect now, seeing that
> commit 774ea0bcb27f57b removed that transition

free_bootmem_late() mean free early reserved pages (from memblock or
bootmem) after slab is available.

assume EFI in ram is not page-aligned?

Thanks

Yinghai Lu

Matthew Garrett

unread,

Jun 6, 2011, 9:50:01 PM6/6/11

to

On Mon, Jun 06, 2011 at 05:19:17PM -0700, Yinghai Lu wrote:

> assume EFI in ram is not page-aligned?

They'll be 4K aligned at least.

--
Matthew Garrett | mj...@srcf.ucam.org

Yinghai Lu

unread,

Jun 6, 2011, 10:10:01 PM6/6/11

to

On 06/06/2011 06:41 PM, Matthew Garrett wrote:
> On Mon, Jun 06, 2011 at 05:19:17PM -0700, Yinghai Lu wrote:
>
>> assume EFI in ram is not page-aligned?
>
> They'll be 4K aligned at least.
>

can you get boot log with "memblock=debug"?

wonder if

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4be9b39..c6724e4 100644 (file)
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -912,6 +912,13 @@ void __init setup_arch(char **cmdline_p)
memblock.current_limit = get_max_mapped();
memblock_x86_fill();

+ /*
+ * The EFI specification says that boot service code won't be called
+ * after ExitBootServices(). This is, in fact, a lie.
+ */
+ if (efi_enabled)
+ efi_reserve_boot_services(

wonder if double memblock ram array in memblock_x86_fill() that is overlapping with boot services.

Thanks

Yinghai

Maarten Lankhorst

unread,

Jun 7, 2011, 4:30:01 AM6/7/11

to

Hi,

Op 07-06-11 04:05, Yinghai Lu schreef:

> On 06/06/2011 06:41 PM, Matthew Garrett wrote:
>> On Mon, Jun 06, 2011 at 05:19:17PM -0700, Yinghai Lu wrote:
>>
>>> assume EFI in ram is not page-aligned?
>> They'll be 4K aligned at least.
>>
>
> can you get boot log with "memblock=debug"?
>
> wonder if
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 4be9b39..c6724e4 100644 (file)
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -912,6 +912,13 @@ void __init setup_arch(char **cmdline_p)
> memblock.current_limit = get_max_mapped();
> memblock_x86_fill();
>
> + /*
> + * The EFI specification says that boot service code won't be called
> + * after ExitBootServices(). This is, in fact, a lie.
> + */
> + if (efi_enabled)
> + efi_reserve_boot_services(
>
> wonder if double memblock ram array in memblock_x86_fill() that is overlapping with boot services.

Seems that numa's NODE_DATA beats the boot services to reserving the boot block, and it's already being reserved. Kinda makes my previous patch miss the point. :-)

Is moving efi_reserve_boot_memory to inside efi_init(); allowed?

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 3.0.0-rc1-patser+ (mlank...@patser.local) (gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4) ) #27 SMP PREEMPT Tue Jun 7 09:49:18 CEST 2011
[ 0.000000] Command line: ro rootflags=ssd,discard root=/dev/sdb2 devtmpfs.mount=1 video=efifb reboot=a,e,p init=/init quiet splash vt.handoff=7 memblock=debug
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000100 - 00000000000a0000 (usable)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000befa5000 (usable)
[ 0.000000] BIOS-e820: 00000000befa5000 - 00000000bf2c0000 (reserved)
[ 0.000000] BIOS-e820: 00000000bf2c0000 - 00000000bf311000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000bf311000 - 00000000bf60c000 (reserved)
[ 0.000000] BIOS-e820: 00000000bf60c000 - 00000000bf60d000 (usable)
[ 0.000000] BIOS-e820: 00000000bf60d000 - 00000000bf60e000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000bf60e000 - 00000000bf618000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000bf618000 - 00000000bf63d000 (reserved)
[ 0.000000] BIOS-e820: 00000000bf63d000 - 00000000bf680000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000bf680000 - 00000000bf800000 (usable)
[ 0.000000] BIOS-e820: 00000000fed1c000 - 00000000fed40000 (reserved)
[ 0.000000] BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 000000023f800000 (usable)
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] DMI not present or invalid.
[ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
[ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
[ 0.000000] No AGP bridge found
[ 0.000000] last_pfn = 0x23f800 max_arch_pfn = 0x400000000
[ 0.000000] MTRR default type: uncachable
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000] 00000-9FFFF write-back
[ 0.000000] A0000-BFFFF uncachable
[ 0.000000] C0000-CFFFF write-protect
[ 0.000000] D0000-DFFFF uncachable
[ 0.000000] E0000-E7FFF write-through
[ 0.000000] E8000-FFFFF write-protect
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000] 0 base 000000000 mask E00000000 write-back
[ 0.000000] 1 base 200000000 mask FC0000000 write-back
[ 0.000000] 2 base 0C0000000 mask FC0000000 uncachable
[ 0.000000] 3 base 23F800000 mask FFF800000 uncachable
[ 0.000000] 4 disabled
[ 0.000000] 5 disabled
[ 0.000000] 6 disabled
[ 0.000000] 7 disabled
[ 0.000000] 8 disabled
[ 0.000000] 9 disabled
[ 0.000000] x86 PAT enabled: cpu 0, old 0x7010600070106, new 0x7010600070106
[ 0.000000] e820 update range: 00000000c0000000 - 0000000100000000 (usable) ==> (reserved)
[ 0.000000] last_pfn = 0xbf800 max_arch_pfn = 0x400000000
[ 0.000000] found SMP MP-table at [ffff8800000fcf50] fcf50
[ 0.000000] memblock_x86_reserve_range: [0x000fcf50-0x000fcf5f] * MP-table mpf
[ 0.000000] memblock_x86_reserve_range: [0x000fcbf0-0x000fced3] * MP-table mpc
[ 0.000000] MEMBLOCK configuration:
[ 0.000000] memory size = 0x1fe8b6000
[ 0.000000] memory.cnt = 0x5
[ 0.000000] memory[0x0] [0x00000000010000-0x0000000009ffff], 0x90000 bytes
[ 0.000000] memory[0x1] [0x00000000100000-0x000000befa4fff], 0xbeea5000 bytes
[ 0.000000] memory[0x2] [0x000000bf60c000-0x000000bf60cfff], 0x1000 bytes
[ 0.000000] memory[0x3] [0x000000bf680000-0x000000bf7fffff], 0x180000 bytes
[ 0.000000] memory[0x4] [0x00000100000000-0x0000023f7fffff], 0x13f800000 bytes
[ 0.000000] reserved.cnt = 0x2
[ 0.000000] reserved[0x0] [0x0000000009d800-0x000000000fffff], 0x62800 bytes
[ 0.000000] reserved[0x1] [0x00000001000000-0x00000001c67fff], 0xc68000 bytes
[ 0.000000] initial memory mapped : 0 - 20000000
[ 0.000000] memblock_x86_reserve_range: [0x00098000-0x0009cfff] TRAMPOLINE
[ 0.000000] Base memory trampoline at [ffff880000098000] 98000 size 20480
[ 0.000000] init_memory_mapping: 0000000000000000-00000000bf800000
[ 0.000000] 0000000000 - 00bf800000 page 2M
[ 0.000000] kernel direct mapping tables up to bf800000 @ bf7fc000-bf800000
[ 0.000000] memblock_x86_reserve_range: [0xbf7fc000-0xbf7fdfff] PGTABLE
[ 0.000000] init_memory_mapping: 0000000100000000-000000023f800000
[ 0.000000] 0100000000 - 023f800000 page 2M
[ 0.000000] kernel direct mapping tables up to 23f800000 @ 23f7f6000-23f800000
[ 0.000000] memblock_x86_reserve_range: [0x23f7f6000-0x23f7fafff] PGTABLE
[ 0.000000] ACPI: RSDP 00000000000fcf60 00024 (v02 ALASKA)
[ 0.000000] ACPI: XSDT 00000000bf309070 0005C (v01 ALASKA A M I 01072009 AMI 00010013)
[ 0.000000] ACPI: FACP 00000000bf310048 000F4 (v04 ALASKA A M I 01072009 AMI 00010013)
[ 0.000000] ACPI: DSDT 00000000bf309158 06EED (v02 ALASKA A M I 00000000 INTL 20051117)
[ 0.000000] ACPI: FACS 00000000bf60ff80 00040
[ 0.000000] ACPI: APIC 00000000bf310140 00072 (v03 ALASKA A M I 01072009 AMI 00010013)
[ 0.000000] ACPI: SSDT 00000000bf3101b8 00102 (v01 AMICPU PROC 00000001 MSFT 03000001)
[ 0.000000] ACPI: MCFG 00000000bf3102c0 0003C (v01 ALASKA A M I 01072009 MSFT 00000097)
[ 0.000000] ACPI: AAFT 00000000bf310300 0006F (v01 ALASKA OEMAAFT 01072009 MSFT 00000097)
[ 0.000000] ACPI: HPET 00000000bf310370 00038 (v01 ALASKA A M I 01072009 AMI. 00000004)
[ 0.000000] ACPI: DMAR 00000000bf3103a8 000B0 (v01 ALASKA A M I 00000001 INTL 00000001)
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] No NUMA configuration found
[ 0.000000] Faking a node at 0000000000000000-000000023f800000
[ 0.000000] Initmem setup node 0 0000000000000000-000000023f800000
[ 0.000000] memblock_x86_reserve_range: [0x23f7fb000-0x23f7fffff] NODE_DATA
[ 0.000000] NODE_DATA [000000023f7fb000 - 000000023f7fffff]
[ 0.000000] memblock_x86_reserve_range: [0x23f7f5000-0x23f7f5fff] BOOTMEM
[ 0.000000] memblock_x86_reserve_range: [0x23f3f5000-0x23f7f4fff] BOOTMEM
[ 0.000000] memblock_x86_reserve_range: [0x23f3f4a00-0x23f3f4fff] BOOTMEM
[ 0.000000] memblock_x86_reserve_range: [0x23eff4a00-0x23f3f49ff] BOOTMEM
[ 0.000000] memblock_x86_reserve_range: [0x236e00000-0x23edfffff] BOOTMEM
[ 0.000000] memblock_x86_reserve_range: [0x23eff3000-0x23eff3fff] BOOTMEM
[ 0.000000] memblock_x86_reserve_range: [0x23eff2000-0x23eff2fff] BOOTMEM
[ 0.000000] memblock_x86_free_range: [0x23de00000-0x23edfffff]
[ 0.000000] [ffffea0000000000-ffffea0007dfffff] PMD -> [ffff880236e00000-ffff88023ddfffff] on node 0
[ 0.000000] memblock_x86_free_range: [0x23eff4a00-0x23f3f49ff]
[ 0.000000] memblock_x86_free_range: [0x23f3f5000-0x23f7f4fff]

~Maarten

Maarten Lankhorst

unread,

Jun 7, 2011, 5:20:02 AM6/7/11

to

Hi,

Op 07-06-11 04:05, Yinghai Lu schreef:

> On 06/06/2011 06:41 PM, Matthew Garrett wrote:
>> On Mon, Jun 06, 2011 at 05:19:17PM -0700, Yinghai Lu wrote:
>>
>>> assume EFI in ram is not page-aligned?
>> They'll be 4K aligned at least.
>>
>
> can you get boot log with "memblock=debug"?

Well that definitely helped me isolate things. It seems some ranges are reserved already.
I added a simple patch to ignore the reservations there. Just a proof of concept,
don't rate for style. :-)

diff --git a/arch/x86/mm/memblock.c b/arch/x86/mm/memblock.c
index aa11693..013ecf5 100644
--- a/arch/x86/mm/memblock.c
+++ b/arch/x86/mm/memblock.c
@@ -8,7 +8,7 @@
#include <linux/range.h>

/* Check for already reserved areas */
-static bool __init check_with_memblock_reserved_size(u64 *addrp, u64 *sizep, u64 align)
+bool __init check_with_memblock_reserved_size(u64 *addrp, u64 *sizep, u64 align)
{
struct memblock_region *r;
u64 addr = *addrp, last;
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 0d3a4fa..eb3a4d9 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -304,20 +304,29 @@ static void __init print_efi_memmap(void)
}
#endif /* EFI_DEBUG */

+extern bool __init check_with_memblock_reserved_size(u64 *addrp,
+ u64 *sizep,
+ u64 align);
+
void __init efi_reserve_boot_services(void)
{
void *p;

for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
efi_memory_desc_t *md = p;
- unsigned long long start = md->phys_addr;
- unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
+ u64 start = md->phys_addr;
+ u64 size = md->num_pages << EFI_PAGE_SHIFT;

if (md->type != EFI_BOOT_SERVICES_CODE &&

md->type != EFI_BOOT_SERVICES_DATA)
continue;
-

- memblock_x86_reserve_range(start, start + size, "EFI Boot");
+ if (check_with_memblock_reserved_size(&start, &size, 1<<EFI_PAGE_SHIFT)) {
+ /* Could not reserve, skip it */
+ md->num_pages = 0;
+ printk(KERN_INFO PFX "Could not reserve boot area "
+ "[0x%llx - 0x%llx]\n", start, start+size);
+ } else
+ memblock_x86_reserve_range(start, start+size, "EFI Boot");
}
}

@@ -334,6 +343,10 @@ static void __init efi_free_boot_services(void)

md->type != EFI_BOOT_SERVICES_DATA)
continue;

+ /* Could not reserve boot area */
+ if (size)
+ continue;
+
free_bootmem_late(start, size);

Maarten Lankhorst

unread,

Jun 7, 2011, 8:30:02 AM6/7/11

to

Op 07-06-11 11:08, Maarten Lankhorst schreef:

> Hi,
>
> Op 07-06-11 04:05, Yinghai Lu schreef:
>> On 06/06/2011 06:41 PM, Matthew Garrett wrote:
>>> On Mon, Jun 06, 2011 at 05:19:17PM -0700, Yinghai Lu wrote:
>>>
>>>> assume EFI in ram is not page-aligned?
>>> They'll be 4K aligned at least.
>>>
>> can you get boot log with "memblock=debug"?
> Well that definitely helped me isolate things. It seems some ranges are reserved already.
> I added a simple patch to ignore the reservations there. Just a proof of concept,
> don't rate for style. :-)
>

> @@ -334,6 +343,10 @@ static void __init efi_free_boot_services(void)
> md->type != EFI_BOOT_SERVICES_DATA)
> continue;
>
> + /* Could not reserve boot area */
> + if (size)

Oops.

> + continue;
> +
> free_bootmem_late(start, size);
> }
> }
>
>

It seems the error still occurs when calling free_bootmem_late
even if I only reserve blocks that haven't been reserved yet.

~Maarten

Yinghai Lu

unread,

Jun 7, 2011, 11:20:02 AM6/7/11

to

On 06/07/2011 01:25 AM, Maarten Lankhorst wrote:
> Hi,
>
> Op 07-06-11 04:05, Yinghai Lu schreef:
>> On 06/06/2011 06:41 PM, Matthew Garrett wrote:
>>> On Mon, Jun 06, 2011 at 05:19:17PM -0700, Yinghai Lu wrote:
>>>
>>>> assume EFI in ram is not page-aligned?
>>> They'll be 4K aligned at least.
>>>
>>
>> can you get boot log with "memblock=debug"?
>>
>> wonder if
>>
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 4be9b39..c6724e4 100644 (file)
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -912,6 +912,13 @@ void __init setup_arch(char **cmdline_p)
>> memblock.current_limit = get_max_mapped();
>> memblock_x86_fill();
>>
>> + /*
>> + * The EFI specification says that boot service code won't be called
>> + * after ExitBootServices(). This is, in fact, a lie.
>> + */
>> + if (efi_enabled)
>> + efi_reserve_boot_services(
>>
>> wonder if double memblock ram array in memblock_x86_fill() that is overlapping with boot services.
> Seems that numa's NODE_DATA beats the boot services to reserving the boot block, and it's already being reserved. Kinda makes my previous patch miss the point. :-)

NODE_DATA are get allocated from initmem_init() and it is rather later after efi_reserve_boot_services()

>
> Is moving efi_reserve_boot_memory to inside efi_init(); allowed?

no, memblock reserved array can not be resized yet at that point, so if you have too many entries for efi entries, will have problem.

should have some
memblock_x86_reserve_range: [xxxxx-xxxx] EFI Boot
...

Yinghai Lu

unread,

Jun 7, 2011, 6:30:02 PM6/7/11

to

On 06/07/2011 05:22 AM, Maarten Lankhorst wrote:
> It seems the error still occurs when calling free_bootmem_late
> even if I only reserve blocks that haven't been reserved yet.

please check this one.

---
arch/x86/platform/efi/efi.c | 2 +-
include/linux/mm.h | 2 ++
mm/page_alloc.c | 32 ++++++++++++++++++++++++++++++++
3 files changed, 35 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/platform/efi/efi.c
===================================================================
--- linux-2.6.orig/arch/x86/platform/efi/efi.c
+++ linux-2.6/arch/x86/platform/efi/efi.c
@@ -368,7 +368,7 @@ static void __init efi_free_boot_service

md->type != EFI_BOOT_SERVICES_DATA)
continue;

- free_bootmem_late(start, size);
+ free_bootmem_late_with_active_regions(start, size);
}
}

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1322,6 +1322,8 @@ extern void get_pfn_range_for_nid(unsign
extern unsigned long find_min_pfn_with_active_regions(void);
extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
+void free_bootmem_late_with_active_regions(unsigned long addr,
+ unsigned long size);
int add_from_early_node_map(struct range *range, int az,
int nr_range, int nid);
u64 __init find_memory_core_early(int nid, u64 size, u64 align,
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -3828,6 +3828,38 @@ void __init free_bootmem_with_active_reg
}
}

+/**
+ * free_bootmem_late_with_active_regions - Call free_bootmem_late for each active range
+ * @addr: starting address of the range
+ * @size: size of the range in bytes
+ *
+ * this function make sure on active regions only
+ */
+void __init free_bootmem_late_with_active_regions(unsigned long addr,
+ unsigned long size)
+{
+ int i;
+ int nid = MAX_NUMNODES;
+ unsigned long start_pfn = PFN_UP(addr);
+ unsigned long end_pfn = PFN_DOWN(addr + size);
+
+ if (start_pfn >= end_pfn)
+ return;
+
+ for_each_active_range_index_in_nid(i, nid) {
+ unsigned long common_start, common_end;
+
+ common_start = max(start_pfn, early_node_map[i].start_pfn);
+ common_end = min(end_pfn, early_node_map[i].end_pfn);
+
+ if (common_start >= common_end)
+ continue;
+
+ free_bootmem_late(common_start << PAGE_SHIFT,
+ (common_end - common_start) << PAGE_SHIFT);
+ }
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK
/*
* Basic iterator support. Return the last range of PFNs for a node

Yinghai Lu

unread,

Jun 8, 2011, 3:20:01 PM6/8/11

to

On 06/08/2011 09:44 AM, Jim Bos wrote:

> I applied this on top of 3.0-rc1 (with offsets), kernels boots but see
> now in dmesg output a new 'BUG: Bad page state in process swapper
> pfn:01800' see attached dmesg output.

that is 24M...

also looks like boot services will overlapped with kernel and ramdisk...

we will reserve boot service again, and later if free boot service range, will free up some kernel or ramdisk range.

[ 0.000000] EFI: mem00: type=3, attr=0xf, range=[0x0000000000000000-0x0000000000008000) (0MB)
[ 0.000000] EFI: mem01: type=7, attr=0xf, range=[0x0000000000008000-0x0000000000075000) (0MB)
[ 0.000000] EFI: mem02: type=2, attr=0xf, range=[0x0000000000075000-0x0000000000077000) (0MB)
[ 0.000000] EFI: mem03: type=4, attr=0xf, range=[0x0000000000077000-0x0000000000078000) (0MB)
[ 0.000000] EFI: mem04: type=3, attr=0xf, range=[0x0000000000078000-0x00000000000a0000) (0MB)
[ 0.000000] EFI: mem05: type=7, attr=0xf, range=[0x0000000000100000-0x0000000001000000) (15MB)
[ 0.000000] EFI: mem06: type=2, attr=0xf, range=[0x0000000001000000-0x0000000001100000) (1MB)
[ 0.000000] EFI: mem07: type=4, attr=0xf, range=[0x0000000001100000-0x000000000133a000) (2MB)
[ 0.000000] EFI: mem08: type=3, attr=0xf, range=[0x000000000133a000-0x000000000133e000) (0MB)
[ 0.000000] EFI: mem09: type=4, attr=0xf, range=[0x000000000133e000-0x0000000001342000) (0MB)
[ 0.000000] EFI: mem10: type=3, attr=0xf, range=[0x0000000001342000-0x0000000001345000) (0MB)
[ 0.000000] EFI: mem11: type=4, attr=0xf, range=[0x0000000001345000-0x000000000135b000) (0MB)
[ 0.000000] EFI: mem12: type=3, attr=0xf, range=[0x000000000135b000-0x000000000135d000) (0MB)
[ 0.000000] EFI: mem13: type=4, attr=0xf, range=[0x000000000135d000-0x00000000013c5000) (0MB)
[ 0.000000] EFI: mem14: type=3, attr=0xf, range=[0x00000000013c5000-0x00000000013c7000) (0MB)
[ 0.000000] EFI: mem15: type=4, attr=0xf, range=[0x00000000013c7000-0x00000000013ca000) (0MB)
[ 0.000000] EFI: mem16: type=3, attr=0xf, range=[0x00000000013ca000-0x00000000013d2000) (0MB)
[ 0.000000] EFI: mem17: type=4, attr=0xf, range=[0x00000000013d2000-0x00000000013d3000) (0MB)
[ 0.000000] EFI: mem18: type=3, attr=0xf, range=[0x00000000013d3000-0x00000000013d4000) (0MB)
[ 0.000000] EFI: mem19: type=4, attr=0xf, range=[0x00000000013d4000-0x00000000013d8000) (0MB)
[ 0.000000] EFI: mem20: type=3, attr=0xf, range=[0x00000000013d8000-0x00000000013da000) (0MB)
[ 0.000000] EFI: mem21: type=4, attr=0xf, range=[0x00000000013da000-0x00000000013dc000) (0MB)
[ 0.000000] EFI: mem22: type=3, attr=0xf, range=[0x00000000013dc000-0x00000000013de000) (0MB)
[ 0.000000] EFI: mem23: type=4, attr=0xf, range=[0x00000000013de000-0x00000000013df000) (0MB)
[ 0.000000] EFI: mem24: type=3, attr=0xf, range=[0x00000000013df000-0x00000000013e0000) (0MB)
[ 0.000000] EFI: mem25: type=4, attr=0xf, range=[0x00000000013e0000-0x0000000002334000) (15MB)

we need to revert patch from mjg.

From 916f676f8dc016103f983c7ec54c18ecdbb6e349 Mon Sep 17 00:00:00 2001
From: Matthew Garrett <m...@redhat.com>
Date: Wed, 25 May 2011 09:53:13 -0400
Subject: [PATCH] x86, efi: Retain boot service code until after switching to
virtual mode

Thanks

Yinghai

Yinghai Lu

unread,

Jun 8, 2011, 3:30:02 PM6/8/11

to

On 06/08/2011 12:23 PM, Matthew Garrett wrote:

> On Wed, Jun 08, 2011 at 12:17:52PM -0700, Yinghai Lu wrote:
>> we need to revert patch from mjg.
>>
>> From 916f676f8dc016103f983c7ec54c18ecdbb6e349 Mon Sep 17 00:00:00 2001
>> From: Matthew Garrett <m...@redhat.com>
>> Date: Wed, 25 May 2011 09:53:13 -0400
>> Subject: [PATCH] x86, efi: Retain boot service code until after switching to
>> virtual mode
>

> That's not a long-term option, since we have no way to distinguish
> between a machine that requires boot services code to be mapped and a
> machine that doesn't. Reverting just breaks the former set again.

what former? you can not fix some corner case by breaking most other cases.

> We need to ensure that the kernel isn't overlapping boot services code.

bootloader will put kernel from 16M, and boot services are on those area already.

Matthew Garrett

unread,

Jun 8, 2011, 3:30:02 PM6/8/11

to

On Wed, Jun 08, 2011 at 12:17:52PM -0700, Yinghai Lu wrote:

> we need to revert patch from mjg.
>
> From 916f676f8dc016103f983c7ec54c18ecdbb6e349 Mon Sep 17 00:00:00 2001
> From: Matthew Garrett <m...@redhat.com>
> Date: Wed, 25 May 2011 09:53:13 -0400
> Subject: [PATCH] x86, efi: Retain boot service code until after switching to
> virtual mode

That's not a long-term option, since we have no way to distinguish

between a machine that requires boot services code to be mapped and a

machine that doesn't. Reverting just breaks the former set again. We

need to ensure that the kernel isn't overlapping boot services code.

--
Matthew Garrett | mj...@srcf.ucam.org

Matthew Garrett

unread,

Jun 8, 2011, 3:40:02 PM6/8/11

to

On Wed, Jun 08, 2011 at 12:35:54PM -0700, Yinghai Lu wrote:

> On 06/08/2011 12:29 PM, Matthew Garrett wrote:
> >> what former? you can not fix some corner case by breaking most other cases.
> >

> > All Dell laptops, all new Apples, some Lenovos, various Intel server
> > platforms. That I've found so far.
>
> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?

Correct.

> > And we need to be able to map the boot services code, so we can't put
> > the kernel on top of it.
>
> after bootloader, those area should be free already.

That's what the spec says. Reality says differently. We need those
ranges to be available to the kernel until after SetVirtualAddressMap()
has been called, which means we need to avoid putting the kernel on top
of them.

--
Matthew Garrett | mj...@srcf.ucam.org

Yinghai Lu

unread,

Jun 8, 2011, 3:40:03 PM6/8/11

to

On 06/08/2011 12:29 PM, Matthew Garrett wrote:

> On Wed, Jun 08, 2011 at 12:27:03PM -0700, Yinghai Lu wrote:
>> On 06/08/2011 12:23 PM, Matthew Garrett wrote:
>>> On Wed, Jun 08, 2011 at 12:17:52PM -0700, Yinghai Lu wrote:
>>>> we need to revert patch from mjg.
>>>>
>>>> From 916f676f8dc016103f983c7ec54c18ecdbb6e349 Mon Sep 17 00:00:00 2001
>>>> From: Matthew Garrett <m...@redhat.com>
>>>> Date: Wed, 25 May 2011 09:53:13 -0400
>>>> Subject: [PATCH] x86, efi: Retain boot service code until after switching to
>>>> virtual mode
>>>
>>> That's not a long-term option, since we have no way to distinguish
>>> between a machine that requires boot services code to be mapped and a
>>> machine that doesn't. Reverting just breaks the former set again.
>>
>> what former? you can not fix some corner case by breaking most other cases.
>

> All Dell laptops, all new Apples, some Lenovos, various Intel server
> platforms. That I've found so far.

do you mean before that patch, all those machine will not boot linux kernel with UEFI support?

>

>>> We need to ensure that the kernel isn't overlapping boot services code.
>>
>> bootloader will put kernel from 16M, and boot services are on those area already.
>

> And we need to be able to map the boot services code, so we can't put
> the kernel on top of it.

after bootloader, those area should be free already.

Matthew Garrett

unread,

Jun 8, 2011, 3:40:05 PM6/8/11

to

On Wed, Jun 08, 2011 at 12:27:03PM -0700, Yinghai Lu wrote:

> On 06/08/2011 12:23 PM, Matthew Garrett wrote:
> > On Wed, Jun 08, 2011 at 12:17:52PM -0700, Yinghai Lu wrote:
> >> we need to revert patch from mjg.
> >>
> >> From 916f676f8dc016103f983c7ec54c18ecdbb6e349 Mon Sep 17 00:00:00 2001
> >> From: Matthew Garrett <m...@redhat.com>
> >> Date: Wed, 25 May 2011 09:53:13 -0400
> >> Subject: [PATCH] x86, efi: Retain boot service code until after switching to
> >> virtual mode
> >
> > That's not a long-term option, since we have no way to distinguish
> > between a machine that requires boot services code to be mapped and a
> > machine that doesn't. Reverting just breaks the former set again.
>
> what former? you can not fix some corner case by breaking most other cases.

All Dell laptops, all new Apples, some Lenovos, various Intel server

platforms. That I've found so far.

> > We need to ensure that the kernel isn't overlapping boot services code.

>
> bootloader will put kernel from 16M, and boot services are on those area already.

And we need to be able to map the boot services code, so we can't put

the kernel on top of it.

--
Matthew Garrett | mj...@srcf.ucam.org

Yinghai Lu

unread,

Jun 8, 2011, 3:50:03 PM6/8/11

to

On 06/08/2011 12:38 PM, Matthew Garrett wrote:
> On Wed, Jun 08, 2011 at 12:35:54PM -0700, Yinghai Lu wrote:
>> On 06/08/2011 12:29 PM, Matthew Garrett wrote:
>>>> what former? you can not fix some corner case by breaking most other cases.
>>>
>>> All Dell laptops, all new Apples, some Lenovos, various Intel server
>>> platforms. That I've found so far.
>>
>> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?
>
> Correct.

with or without EFI runtime services?

Yinghai

Yinghai Lu

unread,

Jun 8, 2011, 3:50:03 PM6/8/11

to

On 06/08/2011 12:38 PM, Matthew Garrett wrote:

> On Wed, Jun 08, 2011 at 12:35:54PM -0700, Yinghai Lu wrote:
>> On 06/08/2011 12:29 PM, Matthew Garrett wrote:
>>>> what former? you can not fix some corner case by breaking most other cases.
>>>
>>> All Dell laptops, all new Apples, some Lenovos, various Intel server
>>> platforms. That I've found so far.
>>
>> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?
>
> Correct.

good, they never test that. just let them to use bootcamp.

>
>>> And we need to be able to map the boot services code, so we can't put
>>> the kernel on top of it.
>>
>> after bootloader, those area should be free already.
>
> That's what the spec says. Reality says differently. We need those
> ranges to be available to the kernel until after SetVirtualAddressMap()
> has been called, which means we need to avoid putting the kernel on top
> of them.

bootloader will load kernel (bzImage) high, and it will decompressed to 16M ram position.

can you call SetVirtualAddressMap before you exit bootloader instead?

Yinghai

Matthew Garrett

unread,

Jun 8, 2011, 4:00:03 PM6/8/11

to

On Wed, Jun 08, 2011 at 12:48:48PM -0700, Yinghai Lu wrote:
> On 06/08/2011 12:38 PM, Matthew Garrett wrote:
> > On Wed, Jun 08, 2011 at 12:35:54PM -0700, Yinghai Lu wrote:
> >> On 06/08/2011 12:29 PM, Matthew Garrett wrote:
> >>>> what former? you can not fix some corner case by breaking most other cases.
> >>>
> >>> All Dell laptops, all new Apples, some Lenovos, various Intel server
> >>> platforms. That I've found so far.
> >>
> >> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?
> >
> > Correct.
>
> with or without EFI runtime services?

They'll boot with noefi, but that's useless because there's then no way
to configure a bootloader.

--
Matthew Garrett | mj...@srcf.ucam.org

Matthew Garrett

unread,

Jun 8, 2011, 4:00:03 PM6/8/11

to

On Wed, Jun 08, 2011 at 12:46:04PM -0700, Yinghai Lu wrote:
> On 06/08/2011 12:38 PM, Matthew Garrett wrote:
> >>> All Dell laptops, all new Apples, some Lenovos, various Intel server
> >>> platforms. That I've found so far.
> >>
> >> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?
> >
> > Correct.
>
> good, they never test that. just let them to use bootcamp.

If you'd like to tell Intel to stop filing bugs about Intel SDVs that
won't boot via EFI, be my guest. These are systems that have no problem
booting Windows because SetVirtualAddressMap() is called in the Windows
bootloader rather than in the Windows kernel.

> >> after bootloader, those area should be free already.
> >
> > That's what the spec says. Reality says differently. We need those
> > ranges to be available to the kernel until after SetVirtualAddressMap()
> > has been called, which means we need to avoid putting the kernel on top
> > of them.
>
> bootloader will load kernel (bzImage) high, and it will decompressed to 16M ram position.

Well that's a problem.

> can you call SetVirtualAddressMap before you exit bootloader instead?

No. We don't know where the kernel will map the runtime regions.

--
Matthew Garrett | mj...@srcf.ucam.org

Yinghai Lu

unread,

Jun 8, 2011, 4:10:03 PM6/8/11

to

On 06/08/2011 12:52 PM, Matthew Garrett wrote:
> On Wed, Jun 08, 2011 at 12:48:48PM -0700, Yinghai Lu wrote:
>> On 06/08/2011 12:38 PM, Matthew Garrett wrote:
>>> On Wed, Jun 08, 2011 at 12:35:54PM -0700, Yinghai Lu wrote:
>>>> On 06/08/2011 12:29 PM, Matthew Garrett wrote:
>>>>>> what former? you can not fix some corner case by breaking most other cases.
>>>>>
>>>>> All Dell laptops, all new Apples, some Lenovos, various Intel server
>>>>> platforms. That I've found so far.
>>>>
>>>> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?
>>>
>>> Correct.
>>
>> with or without EFI runtime services?
>
> They'll boot with noefi, but that's useless because there's then no way
> to configure a bootloader.

efibootmgr will need boot services in addition to run-time services?

if really need some info from boot services, could let bootloader to get them and pass them to kernel.

Yinghai

Matthew Garrett

unread,

Jun 8, 2011, 4:10:03 PM6/8/11

to

On Wed, Jun 08, 2011 at 01:03:25PM -0700, Yinghai Lu wrote:
> On 06/08/2011 12:52 PM, Matthew Garrett wrote:
> > On Wed, Jun 08, 2011 at 12:48:48PM -0700, Yinghai Lu wrote:
> >> On 06/08/2011 12:38 PM, Matthew Garrett wrote:
> >>> On Wed, Jun 08, 2011 at 12:35:54PM -0700, Yinghai Lu wrote:
> >>>> On 06/08/2011 12:29 PM, Matthew Garrett wrote:
> >>>>>> what former? you can not fix some corner case by breaking most other cases.
> >>>>>
> >>>>> All Dell laptops, all new Apples, some Lenovos, various Intel server
> >>>>> platforms. That I've found so far.
> >>>>
> >>>> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?
> >>>
> >>> Correct.
> >>
> >> with or without EFI runtime services?
> >
> > They'll boot with noefi, but that's useless because there's then no way
> > to configure a bootloader.
>
> efibootmgr will need boot services in addition to run-time services?

No, SetVirtualAddressMap() calls into boot services. This is code, not
data. We have no control over what it does.

--
Matthew Garrett | mj...@srcf.ucam.org

Yinghai Lu

unread,

Jun 8, 2011, 4:30:02 PM6/8/11

to

On 06/08/2011 01:09 PM, Matthew Garrett wrote:
> On Wed, Jun 08, 2011 at 01:03:25PM -0700, Yinghai Lu wrote:
>> On 06/08/2011 12:52 PM, Matthew Garrett wrote:
>>> On Wed, Jun 08, 2011 at 12:48:48PM -0700, Yinghai Lu wrote:
>>>> On 06/08/2011 12:38 PM, Matthew Garrett wrote:
>>>>> On Wed, Jun 08, 2011 at 12:35:54PM -0700, Yinghai Lu wrote:
>>>>>> On 06/08/2011 12:29 PM, Matthew Garrett wrote:
>>>>>>>> what former? you can not fix some corner case by breaking most other cases.
>>>>>>>
>>>>>>> All Dell laptops, all new Apples, some Lenovos, various Intel server
>>>>>>> platforms. That I've found so far.
>>>>>>
>>>>>> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?
>>>>>
>>>>> Correct.
>>>>
>>>> with or without EFI runtime services?
>>>
>>> They'll boot with noefi, but that's useless because there's then no way
>>> to configure a bootloader.
>>
>> efibootmgr will need boot services in addition to run-time services?
>
> No, SetVirtualAddressMap() calls into boot services. This is code, not
> data. We have no control over what it does.

Maybe you can parse efimeminfo in arch/x86/boot/compressed/head_64.S to find right decompress position for kernel.

Yinghai

Matthew Garrett

unread,

Jun 8, 2011, 4:40:02 PM6/8/11

to

On Wed, Jun 08, 2011 at 01:23:38PM -0700, Yinghai Lu wrote:
> On 06/08/2011 01:09 PM, Matthew Garrett wrote:
> > On Wed, Jun 08, 2011 at 01:03:25PM -0700, Yinghai Lu wrote:
> >> On 06/08/2011 12:52 PM, Matthew Garrett wrote:
> >>> On Wed, Jun 08, 2011 at 12:48:48PM -0700, Yinghai Lu wrote:
> >>>> On 06/08/2011 12:38 PM, Matthew Garrett wrote:
> >>>>> On Wed, Jun 08, 2011 at 12:35:54PM -0700, Yinghai Lu wrote:
> >>>>>> On 06/08/2011 12:29 PM, Matthew Garrett wrote:
> >>>>>>>> what former? you can not fix some corner case by breaking most other cases.
> >>>>>>>
> >>>>>>> All Dell laptops, all new Apples, some Lenovos, various Intel server
> >>>>>>> platforms. That I've found so far.
> >>>>>>
> >>>>>> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?
> >>>>>
> >>>>> Correct.
> >>>>
> >>>> with or without EFI runtime services?
> >>>
> >>> They'll boot with noefi, but that's useless because there's then no way
> >>> to configure a bootloader.
> >>
> >> efibootmgr will need boot services in addition to run-time services?
> >
> > No, SetVirtualAddressMap() calls into boot services. This is code, not
> > data. We have no control over what it does.
>
> Maybe you can parse efimeminfo in arch/x86/boot/compressed/head_64.S to find right decompress position for kernel.

Hm. I'll take a look into that. Thanks!

--
Matthew Garrett | mj...@srcf.ucam.org

Yinghai Lu

unread,

Jun 8, 2011, 4:40:02 PM6/8/11

to

On 06/08/2011 01:30 PM, Matthew Garrett wrote:
> On Wed, Jun 08, 2011 at 01:23:38PM -0700, Yinghai Lu wrote:
>> On 06/08/2011 01:09 PM, Matthew Garrett wrote:
>>> On Wed, Jun 08, 2011 at 01:03:25PM -0700, Yinghai Lu wrote:
>>>> On 06/08/2011 12:52 PM, Matthew Garrett wrote:
>>>>> On Wed, Jun 08, 2011 at 12:48:48PM -0700, Yinghai Lu wrote:
>>>>>> On 06/08/2011 12:38 PM, Matthew Garrett wrote:
>>>>>>> On Wed, Jun 08, 2011 at 12:35:54PM -0700, Yinghai Lu wrote:
>>>>>>>> On 06/08/2011 12:29 PM, Matthew Garrett wrote:
>>>>>>>>>> what former? you can not fix some corner case by breaking most other cases.
>>>>>>>>>
>>>>>>>>> All Dell laptops, all new Apples, some Lenovos, various Intel server
>>>>>>>>> platforms. That I've found so far.
>>>>>>>>
>>>>>>>> do you mean before that patch, all those machine will not boot linux kernel with UEFI support?
>>>>>>>
>>>>>>> Correct.
>>>>>>
>>>>>> with or without EFI runtime services?
>>>>>
>>>>> They'll boot with noefi, but that's useless because there's then no way
>>>>> to configure a bootloader.
>>>>
>>>> efibootmgr will need boot services in addition to run-time services?
>>>
>>> No, SetVirtualAddressMap() calls into boot services. This is code, not
>>> data. We have no control over what it does.
>>
>> Maybe you can parse efimeminfo in arch/x86/boot/compressed/head_64.S to find right decompress position for kernel.
>
> Hm. I'll take a look into that. Thanks!

or just let bootloader to mark those boot services just like run-time services in e820 table or setupdata?

Yinghai

Matthew Garrett

unread,

Jun 8, 2011, 4:50:02 PM6/8/11

to

On Wed, Jun 08, 2011 at 01:36:57PM -0700, Yinghai Lu wrote:

> or just let bootloader to mark those boot services just like run-time services in e820 table or setupdata?

That was my original approach, but if there's boot services code at the
top of RAM it means that max_pfn is wrong and it's difficult to recover
the memory.

--
Matthew Garrett | mj...@srcf.ucam.org

Yinghai Lu

unread,

Jun 8, 2011, 4:50:02 PM6/8/11

to

On 06/08/2011 01:42 PM, Matthew Garrett wrote:
> On Wed, Jun 08, 2011 at 01:36:57PM -0700, Yinghai Lu wrote:
>
>> or just let bootloader to mark those boot services just like run-time services in e820 table or setupdata?
>
> That was my original approach, but if there's boot services code at the
> top of RAM it means that max_pfn is wrong and it's difficult to recover
> the memory.

not all boot services ram. just those are called by run-time services code.

Yinghai

Matthew Garrett

unread,

Jun 8, 2011, 5:10:02 PM6/8/11

to

On Wed, Jun 08, 2011 at 01:46:45PM -0700, Yinghai Lu wrote:
> On 06/08/2011 01:42 PM, Matthew Garrett wrote:
> > On Wed, Jun 08, 2011 at 01:36:57PM -0700, Yinghai Lu wrote:
> >
> >> or just let bootloader to mark those boot services just like run-time services in e820 table or setupdata?
> >
> > That was my original approach, but if there's boot services code at the
> > top of RAM it means that max_pfn is wrong and it's difficult to recover
> > the memory.
>
> not all boot services ram. just those are called by run-time services code.

We have no way of telling which regions those are until they've been
called.

--
Matthew Garrett | mj...@srcf.ucam.org

Linus Torvalds

unread,

Jun 8, 2011, 5:10:02 PM6/8/11

to

On Wed, Jun 8, 2011 at 1:42 PM, Matthew Garrett <mj...@srcf.ucam.org> wrote:
> On Wed, Jun 08, 2011 at 01:36:57PM -0700, Yinghai Lu wrote:
>
>> or just let bootloader to mark those boot services just like run-time services in e820 table or setupdata?
>
> That was my original approach, but if there's boot services code at the
> top of RAM it means that max_pfn is wrong and it's difficult to recover
> the memory.

WHO CARES if the memory is difficult to recover? Just let it be. EFI
is an abomination in the eyes of God, and we sure as hell shouldn't
bend over backwards over the stupidities in it. If it means that you
lose a meg of RAM when you use EFI, that's the least of our problems.

F*%#ing morons who thought that we wanted some kind of extensible
firmware interface. We want a *cut-down* firmware interface, not the
crazy thing that is EFI. A boot loader, not some kind of run-time
services crap. And we definitely don't want to make it any more
complex than we need to.

So just turn the EFI stuff into the memory map, and let the kernel do
as little as possible with it.

Linus

Matthew Garrett

unread,

Jun 8, 2011, 5:30:02 PM6/8/11

to

On Wed, Jun 08, 2011 at 02:06:21PM -0700, Linus Torvalds wrote:
> On Wed, Jun 8, 2011 at 1:42 PM, Matthew Garrett <mj...@srcf.ucam.org> wrote:
> > On Wed, Jun 08, 2011 at 01:36:57PM -0700, Yinghai Lu wrote:
> >
> >> or just let bootloader to mark those boot services just like run-time services in e820 table or setupdata?
> >
> > That was my original approach, but if there's boot services code at the
> > top of RAM it means that max_pfn is wrong and it's difficult to recover
> > the memory.
>
> WHO CARES if the memory is difficult to recover? Just let it be. EFI
> is an abomination in the eyes of God, and we sure as hell shouldn't
> bend over backwards over the stupidities in it. If it means that you
> lose a meg of RAM when you use EFI, that's the least of our problems.

Boot services data includes everything that was allocated by the EFI
memory allocator. Depending on what the system decided to do before
deigning to run our code, that might be a meg - or it might be several
hundred. And in the process it's probably fragmented RAM into god knows
how many small chunks.

E820 limits us to 128 ranges, and systems I'm looking at right now are
already using over 140 boot services regions. Some of them are
contiguous and we could just merge them for e820, but that leaves us at
the whim of whoever wrote the allocator for the firmware. And do you
want to bet on them having done this sanely?

--
Matthew Garrett | mj...@srcf.ucam.org

H. Peter Anvin

unread,

Jun 8, 2011, 5:40:02 PM6/8/11

to

On 06/08/2011 02:28 PM, Matthew Garrett wrote:
>
> E820 limits us to 128 ranges, and systems I'm looking at right now are
> already using over 140 boot services regions. Some of them are
> contiguous and we could just merge them for e820, but that leaves us at
> the whim of whoever wrote the allocator for the firmware. And do you
> want to bet on them having done this sanely?
>

No, we're not limited to 128. We're limited to 128 in the boot
structure, but more can be passed via the linked list.

-hpa

Yinghai Lu

unread,

Jun 8, 2011, 5:40:02 PM6/8/11

to

On 06/08/2011 02:28 PM, Matthew Garrett wrote:

> E820 limits us to 128 ranges, and systems I'm looking at right now are
> already using over 140 boot services regions.

parse_setup_data() is used to handle entries above 128.

Yinghai

Matthew Garrett

unread,

Jun 8, 2011, 5:40:02 PM6/8/11

to

On Wed, Jun 08, 2011 at 02:31:38PM -0700, H. Peter Anvin wrote:
> On 06/08/2011 02:28 PM, Matthew Garrett wrote:
> >
> > E820 limits us to 128 ranges, and systems I'm looking at right now are
> > already using over 140 boot services regions. Some of them are
> > contiguous and we could just merge them for e820, but that leaves us at
> > the whim of whoever wrote the allocator for the firmware. And do you
> > want to bet on them having done this sanely?
> >
>
> No, we're not limited to 128. We're limited to 128 in the boot
> structure, but more can be passed via the linked list.

Oh, ok, that's not as bad as I thought. We're still potentially losing a
pile of memory, but...

--
Matthew Garrett | mj...@srcf.ucam.org

Linus Torvalds

unread,

Jun 8, 2011, 5:40:02 PM6/8/11

to

On Wed, Jun 8, 2011 at 2:28 PM, Matthew Garrett <mj...@srcf.ucam.org> wrote:
>
> Boot services data includes everything that was allocated by the EFI
> memory allocator. Depending on what the system decided to do before
> deigning to run our code, that might be a meg - or it might be several
> hundred. And in the process it's probably fragmented RAM into god knows
> how many small chunks.

In reality?

Whatever. I really think our EFI support is just fundamnetally broken.
We should do *everything* in the bootloader, and nothing at all in the
kernel. IOW, I think doing the whole "SetVirtualAddrMap()" (or
whatever) in the boot loader too, and just promise to neve rever call
any EFI routines from the kernel.

IOW, a sane EFI boot loader should just make things look like a
regular BIOS, and not bother the kernel with the EFI crap.

EFI was misdesigned. That doesn't mean that _we_ should then
mis-design our support for it.

Linus

Matthew Garrett

unread,

Jun 8, 2011, 5:50:02 PM6/8/11

to

On Wed, Jun 08, 2011 at 02:31:55PM -0700, Linus Torvalds wrote:
> On Wed, Jun 8, 2011 at 2:28 PM, Matthew Garrett <mj...@srcf.ucam.org> wrote:
> >
> > Boot services data includes everything that was allocated by the EFI
> > memory allocator. Depending on what the system decided to do before
> > deigning to run our code, that might be a meg - or it might be several
> > hundred. And in the process it's probably fragmented RAM into god knows
> > how many small chunks.
>
> In reality?
>
> Whatever. I really think our EFI support is just fundamnetally broken.
> We should do *everything* in the bootloader, and nothing at all in the
> kernel. IOW, I think doing the whole "SetVirtualAddrMap()" (or
> whatever) in the boot loader too, and just promise to neve rever call
> any EFI routines from the kernel.

Windows and MacOS both do all the EFI setup in their bootloader, and it
obviously makes a lot of things much easier. But unfortunately even in
that case we still need to make EFI calls from runtime, because there's
no other way to tell the firmware where we put the bootloader...

--
Matthew Garrett | mj...@srcf.ucam.org

H. Peter Anvin

unread,

Jun 8, 2011, 6:00:02 PM6/8/11

to

On 06/08/2011 02:31 PM, Linus Torvalds wrote:
> On Wed, Jun 8, 2011 at 2:28 PM, Matthew Garrett <mj...@srcf.ucam.org> wrote:
>>
>> Boot services data includes everything that was allocated by the EFI
>> memory allocator. Depending on what the system decided to do before
>> deigning to run our code, that might be a meg - or it might be several
>> hundred. And in the process it's probably fragmented RAM into god knows
>> how many small chunks.
>
> In reality?
>
> Whatever. I really think our EFI support is just fundamnetally broken.
> We should do *everything* in the bootloader, and nothing at all in the
> kernel. IOW, I think doing the whole "SetVirtualAddrMap()" (or
> whatever) in the boot loader too, and just promise to neve rever call
> any EFI routines from the kernel.
>
> IOW, a sane EFI boot loader should just make things look like a
> regular BIOS, and not bother the kernel with the EFI crap.
>
> EFI was misdesigned. That doesn't mean that _we_ should then
> mis-design our support for it.
>

No argument that our EFI support is misdesigned.

However, I suspect that what we *should* do is carry an kernel EFI stub
to go along with the BIOS stub... otherwise we're forever at mercy of
getting all the boot loader authors to change in lockstep, and there are
specific ones which are notoriously hard to work with.

The "kernel carries its own stub" approach been very successful in
dealing with BIOS, and would make a lot of sense to me for EFI as well.

-hpa

Linus Torvalds

unread,

Jun 8, 2011, 7:00:02 PM6/8/11

to

On Wed, Jun 8, 2011 at 2:51 PM, H. Peter Anvin <h...@linux.intel.com> wrote:
>
> However, I suspect that what we *should* do is carry an kernel EFI stub
> to go along with the BIOS stub... otherwise we're forever at mercy of
> getting all the boot loader authors to change in lockstep, and there are
> specific ones which are notoriously hard to work with.

Yes, that would probably be a good approach. We obviously have some
low-level asm code for the BIOS case that is technically linked into
the kernel, but is running before the kernel boots and not really
"part" of the kernel. Doing something similar for EFI support sounds
entirely sane to me.

Linus

Maarten Lankhorst

unread,

Jun 8, 2011, 8:00:02 PM6/8/11

to

Hey,

Op 09-06-11 00:57, Linus Torvalds schreef:

> On Wed, Jun 8, 2011 at 2:51 PM, H. Peter Anvin <h...@linux.intel.com> wrote:
>> However, I suspect that what we *should* do is carry an kernel EFI stub
>> to go along with the BIOS stub... otherwise we're forever at mercy of
>> getting all the boot loader authors to change in lockstep, and there are
>> specific ones which are notoriously hard to work with.
> Yes, that would probably be a good approach. We obviously have some
> low-level asm code for the BIOS case that is technically linked into
> the kernel, but is running before the kernel boots and not really
> "part" of the kernel. Doing something similar for EFI support sounds
> entirely sane to me.

I agree that's a long term solution, meanwhile can we just hope
that not too much boot memory is reserved and not free
that memory so it at least boots for more people?

Maybe add a printk of 'X amount of boot memory will not freed',
so at least people can quantify and judge for themselves whether
it's worth disabling kernel efi support or not.

~Maarten

Matthew Garrett

unread,

Jun 10, 2011, 12:50:02 PM6/10/11

to

So this is obviously even more of a hack, but before you check whether
the memblock has already been reserved could you __check_region it as
well? That ought to avoid us touching the kernel. I've got a patch for
grub that'll avoid the situation where we load the kernel on top of an
existing resource and I'll port that to grub2, but that's still going to
be awkward for existing bootloaders.

--
Matthew Garrett | mj...@srcf.ucam.org

Matthew Garrett

unread,

Jun 10, 2011, 2:00:02 PM6/10/11

to

On Fri, Jun 10, 2011 at 07:51:46PM +0200, Maarten Lankhorst wrote:
> Well,
>
> Op 10-06-11 18:47, Matthew Garrett schreef:

> > So this is obviously even more of a hack, but before you check whether
> > the memblock has already been reserved could you __check_region it as
> > well? That ought to avoid us touching the kernel. I've got a patch for
> > grub that'll avoid the situation where we load the kernel on top of an
> > existing resource and I'll port that to grub2, but that's still going to
> > be awkward for existing bootloaders.
> >

> Erm, __check_region calls __requestion_region which does a kzalloc,
> if I call __check_region it doesn't boot, probably because of that.

Oh, bother.

> Do you want me to manually run through iomem_resource? Is it even available up at that point?

Should be - we've already called insert_resource to set up the kernel at
this point.

Maarten Lankhorst

unread,

Jun 10, 2011, 2:00:02 PM6/10/11

to

Well,

Op 10-06-11 18:47, Matthew Garrett schreef:

> So this is obviously even more of a hack, but before you check whether
> the memblock has already been reserved could you __check_region it as
> well? That ought to avoid us touching the kernel. I've got a patch for
> grub that'll avoid the situation where we load the kernel on top of an
> existing resource and I'll port that to grub2, but that's still going to
> be awkward for existing bootloaders.
>

Erm, __check_region calls __requestion_region which does a kzalloc,
if I call __check_region it doesn't boot, probably because of that.

Do you want me to manually run through iomem_resource? Is it even available up at that point?

~Maarten

Maarten Lankhorst

unread,

Jun 10, 2011, 6:50:02 PM6/10/11

to

Op 10-06-11 19:54, Matthew Garrett schreef:

> On Fri, Jun 10, 2011 at 07:51:46PM +0200, Maarten Lankhorst wrote:
>> Well,
>>
>> Op 10-06-11 18:47, Matthew Garrett schreef:
>>> So this is obviously even more of a hack, but before you check whether
>>> the memblock has already been reserved could you __check_region it as
>>> well? That ought to avoid us touching the kernel. I've got a patch for
>>> grub that'll avoid the situation where we load the kernel on top of an
>>> existing resource and I'll port that to grub2, but that's still going to
>>> be awkward for existing bootloaders.
>>>
>> Erm, __check_region calls __requestion_region which does a kzalloc,
>> if I call __check_region it doesn't boot, probably because of that.
> Oh, bother.
>
>> Do you want me to manually run through iomem_resource? Is it even available up at that point?
> Should be - we've already called insert_resource to set up the kernel at
> this point.
>

Version with yinghai's free_bootmem_late_with_active_regions.

Still has an issue though, I'm getting 2 warnings from swapper:
[ 2.867034] BUG: Bad page state in process swapper pfn:01900
[ 2.867303] page:ffffea0000057800 count:0 mapcount:-127 mapping: (null) index:0x0
[ 2.867683] page flags: 0x100000000000000()
[ 2.867887] Pid: 1, comm: swapper Not tainted 2.6.39.1-patser+ #15
[ 2.867888] Call Trace:
[ 2.867893] [<ffffffff810f349b>] ? dump_page+0x9b/0xd0
[ 2.867894] [<ffffffff810f3599>] bad_page+0xc9/0x120
[ 2.867896] [<ffffffff810f36af>] free_pages_prepare+0xbf/0x110
[ 2.867898] [<ffffffff810f4fa9>] free_hot_cold_page+0x49/0x440
[ 2.867899] [<ffffffff810f59fd>] __free_pages+0x2d/0x40
[ 2.867900] [<ffffffff810f5a53>] free_pages+0x43/0x50
[ 2.867903] [<ffffffff81029542>] free_init_pages+0x132/0x1c0
[ 2.867904] [<ffffffff81029cd3>] mark_rodata_ro+0x143/0x150
[ 2.867906] [<ffffffff810001d8>] init_post+0x18/0xd0
[ 2.867909] [<ffffffff81ab7d45>] kernel_init+0x158/0x163
[ 2.867911] [<ffffffff815688d4>] kernel_thread_helper+0x4/0x10
[ 2.867913] [<ffffffff81ab7bed>] ? start_kernel+0x3dc/0x3dc
[ 2.867914] [<ffffffff815688d0>] ? gs_change+0xb/0xb
[ 2.867915] Disabling lock debugging due to kernel taint
[ 2.867922] BUG: Bad page state in process swapper pfn:01910
[ 2.868187] page:ffffea0000057b80 count:0 mapcount:-127 mapping: (null) index:0x0
[ 2.868567] page flags: 0x100000000000000()
[ 2.868769] Pid: 1, comm: swapper Tainted: G B 2.6.39.1-patser+ #15
[ 2.868770] Call Trace:
[ 2.868771] [<ffffffff810f349b>] ? dump_page+0x9b/0xd0
[ 2.868773] [<ffffffff810f3599>] bad_page+0xc9/0x120
[ 2.868774] [<ffffffff810f36af>] free_pages_prepare+0xbf/0x110
[ 2.868775] [<ffffffff810f4fa9>] free_hot_cold_page+0x49/0x440
[ 2.868777] [<ffffffff810f59fd>] __free_pages+0x2d/0x40
[ 2.868778] [<ffffffff810f5a53>] free_pages+0x43/0x50
[ 2.868779] [<ffffffff81029542>] free_init_pages+0x132/0x1c0
[ 2.868781] [<ffffffff81029cd3>] mark_rodata_ro+0x143/0x150
[ 2.868782] [<ffffffff810001d8>] init_post+0x18/0xd0
[ 2.868784] [<ffffffff81ab7d45>] kernel_init+0x158/0x163
[ 2.868785] [<ffffffff815688d4>] kernel_thread_helper+0x4/0x10
[ 2.868787] [<ffffffff81ab7bed>] ? start_kernel+0x3dc/0x3dc
[ 2.868788] [<ffffffff815688d0>] ? gs_change+0xb/0xb

Also don't rate for style, that wasn't the scope of this patch. This is just to have something to test with ;)

diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
index 19ae14b..0cd3800 100644
--- a/arch/x86/include/asm/memblock.h
+++ b/arch/x86/include/asm/memblock.h
@@ -4,7 +4,6 @@
#define ARCH_DISCARD_MEMBLOCK

u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
-void memblock_x86_to_bootmem(u64 start, u64 end);

void memblock_x86_reserve_range(u64 start, u64 end, char *name);
void memblock_x86_free_range(u64 start, u64 end);
@@ -19,5 +18,6 @@ u64 memblock_x86_hole_size(u64 start, u64 end);
u64 memblock_x86_find_in_range_node(int nid, u64 start, u64 end, u64 size, u64 align);
u64 memblock_x86_free_memory_in_range(u64 addr, u64 limit);
u64 memblock_x86_memory_in_range(u64 addr, u64 limit);
+bool memblock_x86_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);

#endif
diff --git a/arch/x86/mm/memblock.c b/arch/x86/mm/memblock.c
index aa11693..992da5e 100644
--- a/arch/x86/mm/memblock.c
+++ b/arch/x86/mm/memblock.c
@@ -8,7 +8,7 @@
#include <linux/range.h>

/* Check for already reserved areas */
-static bool __init check_with_memblock_reserved_size(u64 *addrp, u64 *sizep, u64 align)
+bool __init memblock_x86_check_reserved_size(u64 *addrp, u64 *sizep, u64 align)
{
struct memblock_region *r;
u64 addr = *addrp, last;
@@ -59,7 +59,7 @@ u64 __init memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align)
if (addr >= ei_last)
continue;
*sizep = ei_last - addr;
- while (check_with_memblock_reserved_size(&addr, sizep, align))
+ while (memblock_x86_check_reserved_size(&addr, sizep, align))
;

if (*sizep)
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 02b48dc..46e63ad 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -315,20 +315,85 @@ static void __init print_efi_memmap(void)
}
#endif /* EFI_DEBUG */

+static struct resource * __available_resource(struct resource *root, struct resource *new)
+{
+ resource_size_t start = new->start;
+ resource_size_t end = new->end;
+ struct resource *tmp, **p;
+
+ if (end < start)
+ return root;
+ if (start < root->start)
+ return root;
+ if (end > root->end)
+ return root;
+ p = &root->child;
+ for (;;) {
+ tmp = *p;
+ if (!tmp || tmp->start > end)
+ return NULL;
+ p = &tmp->sibling;
+ if (tmp->end < start)
+ continue;
+ return tmp;
+ }
+}
+
+static int is_used_region(struct resource *parent, struct resource *new)
+{
+ struct resource *first, *next;
+
+ for (;; parent = first) {
+ first = __available_resource(parent, new);
+ if (!first)
+ return 0;
+
+ if (first == parent)
+ return 1;
+ if (WARN_ON(first == new)) /* duplicated insertion */
+ return 1;
+
+ if ((first->start > new->start) || (first->end < new->end))
+ break;
+ if ((first->start == new->start) && (first->end == new->end))
+ break;
+ }
+
+ for (next = first; ; next = next->sibling) {
+ if (next->start < new->start || next->end > new->end)
+ return 1;
+ if (!next->sibling)
+ break;
+ if (next->sibling->start > new->end)
+ break;
+ }
+
+ return 0;
+}
+
+
void __init efi_reserve_boot_services(void)
{
void *p;

for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
efi_memory_desc_t *md = p;
- unsigned long long start = md->phys_addr;
- unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
+ u64 start = md->phys_addr;
+ u64 size = md->num_pages << EFI_PAGE_SHIFT;
+ struct resource dummy = { .start = start, .end = start + size };

if (md->type != EFI_BOOT_SERVICES_CODE &&
md->type != EFI_BOOT_SERVICES_DATA)
continue;
-
- memblock_x86_reserve_range(start, start + size, "EFI Boot");
+ if (is_used_region(&iomem_resource, &dummy) ||
+ memblock_x86_check_reserved_size(&start, &size,
+ 1<<EFI_PAGE_SHIFT)) {
+ /* Could not reserve, skip it */
+ md->num_pages = 0;
+ printk(KERN_INFO PFX "Could not reserve boot area "
+ "[0x%llx-0x%llx)\n", start, start+size);
+ } else
+ memblock_x86_reserve_range(start, start+size, "EFI Boot");
}
}

@@ -345,7 +410,11 @@ static void __init efi_free_boot_services(void)
md->type != EFI_BOOT_SERVICES_DATA)
continue;

- free_bootmem_late(start, size);
+ /* Could not reserve boot area */
+ if (!size)
+ continue;
+
+ free_bootmem_late_with_active_regions(start, size);
}
}

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6507dde..713287f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1352,6 +1352,8 @@ extern void get_pfn_range_for_nid(unsigned int nid,
extern unsigned long find_min_pfn_with_active_regions(void);
extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
+void free_bootmem_late_with_active_regions(unsigned long addr,
+ unsigned long size);
int add_from_early_node_map(struct range *range, int az,
int nr_range, int nid);
u64 __init find_memory_core_early(int nid, u64 size, u64 align,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e78b324..4c3bcd7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3790,6 +3790,38 @@ void __init free_bootmem_with_active_regions(int nid,
}
}

+/**
+ * free_bootmem_late_with_active_regions - Call free_bootmem_late for each active range
+ * @addr: starting address of the range
+ * @size: size of the range in bytes
+ *
+ * this function make sure on active regions only
+ */
+void __init free_bootmem_late_with_active_regions(unsigned long addr,
+ unsigned long size)
+{
+ int i;
+ int nid = MAX_NUMNODES;
+ unsigned long start_pfn = PFN_UP(addr);
+ unsigned long end_pfn = PFN_DOWN(addr + size);
+
+ if (start_pfn >= end_pfn)
+ return;
+
+ for_each_active_range_index_in_nid(i, nid) {
+ unsigned long common_start, common_end;
+
+ common_start = max(start_pfn, early_node_map[i].start_pfn);
+ common_end = min(end_pfn, early_node_map[i].end_pfn);
+
+ if (common_start >= common_end)
+ continue;
+
+ free_bootmem_late(common_start << PAGE_SHIFT,
+ (common_end - common_start) << PAGE_SHIFT);
+ }
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK
/*
* Basic iterator support. Return the last range of PFNs for a node

Yinghai Lu

unread,

Jun 10, 2011, 7:00:02 PM6/10/11

to

the problem is : overlapping between kernel code with boot services code.

now e820 table that is passed from bootloader do not include boot services code range. and also current boot/head_64.S will not
try to find usable range for decompressed kernel ( too early )...

So solution will be:
1. revert Matthew Garrett's patch, because it breaking unknown good platform.
2. ask vendor of system that Matthew try to fix to go back fix their firmware. otherwise user have stay with CSM with it.

Thanks

Yinghai Lu