And when X exits I get a bunch of lines like:
kernel: Xorg:3349 freeing invalid memtype 80020000-8002a000
I also noticed artifacts (a band of about 2 cm high across the screen) after
X goes to black but before the switch to VT1.
When I unset X86_PAT all this disappeared.
Apparently this has been seen before:
http://lkml.org/lkml/2008/5/2/139
The author of that mail also points to:
http://marc.info/?l=linux-kernel&m=120612742217604&w=2
Is this a bug in PAT or not. If it is, should PAT really be recommended to
be enabled by default?
System: Intel desktop with D945GCZ mainboard
Intel(R) Pentium(R) D CPU 3.20GHz
x86_64 kernel; Debian unstable
Video: 00:02.0 VGA compatible controller [0300]:
Intel Corporation 82945G/GZ Integrated Graphics Controller
[8086:2772] (rev 02)
Cheers,
FJP
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Copying the X-guys.
The messages all look like debug messages from X and should not cause
any side-effects other than little annoyance. I am not sure why band of
2 cm is there...
Do you see any performance difference with or without PAT.
Thanks,
Venki
These messages? They're coming from the kernel it looks like, from the
map_devmem routine in pat.c. I'm not sure they're accurate though; for PCI
regions /dev/mem is *supposed* to map with UC- and not WB, so maybe this
function needs to be updated?
> >And when X exits I get a bunch of lines like:
> >kernel: Xorg:3349 freeing invalid memtype 80020000-8002a000
> >
> >I also noticed artifacts (a band of about 2 cm high across the
> >screen) after
> >X goes to black but before the switch to VT1.
This is just a transient issue during VT switch or server exit though, right?
X functionality isn't affected, and your VTs work fine? If so, it might not
be a PAT issue but just a different memory layout or something (and therefore
it would really just be a cosmetic bug in the X driver).
I really think PAT should be on by default; if you're running into real
functional or performance problems we'd better get them fixed rather than
disabling PAT...
Thanks,
Jesse
Indeed.
I think these messages are due to X using the mprotect workaround to
change UC_MINUS to WB.
I don't see these error messages on my 965 here. May be I have different
x version.
What may be happening:
1) process A mmaps /dev/mem and gets UC_MINUS
2) Changes the page table to make pg_prot WB
3) Does a fork to create process B
4) While copying the vma, we go through map_devmem request WB, but get
UC_MINUS back
5) We are not changing vma pg_prot to new value at this point (we should
change this), so one more round of errors will be there when forked
process exits.
Again, this should not have any side-effect like the band etc. It just a
"friendly warning". It should go away when X moves to using WC or does
not use the mprotect workaround to make pg_prot WB.
Thanks,
Venki
Hm, yeah that could be. strace would tell us.
>
> What may be happening:
> 1) process A mmaps /dev/mem and gets UC_MINUS
> 2) Changes the page table to make pg_prot WB
> 3) Does a fork to create process B
> 4) While copying the vma, we go through map_devmem request WB, but get
> UC_MINUS back
> 5) We are not changing vma pg_prot to new value at this point (we should
> change this), so one more round of errors will be there when forked
> process exits.
>
> Again, this should not have any side-effect like the band etc. It just a
> "friendly warning". It should go away when X moves to using WC or does
> not use the mprotect workaround to make pg_prot WB.
More recent versions of X will use sysfs rather than /dev/mem (though we're
still using the mprotect hack there to give us UC-), so yeah this warning
should already be gone in more recent builds.
Jesse
> On Friday 02 May 2008, Jesse Barnes wrote:
> > This is just a transient issue during VT switch or server exit though,
> > right? X functionality isn't affected, and your VTs work fine?
>
> Transient only. I've just tested again and this time the band was
> visible on top of the text on VT1 for about 2 seconds. Then it
> disappeared. The artifacts also appear when I log out from KDE (i.e.
> without exiting the server), and I also get the messages when logging
> out and logging in again.
>
> I do not see any performance issues, but I've only used this kernel
> for a very short time.
>
> > If so, it might not be a PAT issue but just a different memory
> > layout or something (and therefore it would really just be a
> > cosmetic bug in the X driver).
>
> The artifacts may not be a PAT issue directly, but it is a clear
> regression for me as I currently have a nice clean screen when X shuts
> down. I'm also 100% sure that it is caused by enabling PAT. A kernel
> with same config and only PAT disabled does not show the artifacts.
ok, Cc:-ed more folks - this bug has to be resolved regardless of the
default selection.
Ingo
can you boot with pat=off and send out /proc/mtrr?
YH
I suspect an i810 driver bug is being uncovered here, since we do have
transient VT switch corruption on some other platforms (we're just exposing
our chip reprogramming on the screen, rather than keeping it off the whole
time). But there could also be something PAT specific going on, I'll have to
walk through those code paths...
> Also attached a dmesg that shows the messages. As you can see there are
> some repeats.
> - first 22 "expected mapping type" when X is started
> - second 22 "expected mapping type" when logging in
> - 9 "freeing invalid memtype" when logging out
> - 22 "expected mapping type" when logging in again
> - 9 "freeing invalid memtype" when logging out again
> - last series "expected mapping type" when restarting X server
>
> I must say that I'm fairly disappointed by your (plural) attitude to these
> regressions, especially given that you both seem to feel it is important
> that people will actually use PAT.
Oh the messages should be removed or somehow minimized, I agree. I'm just not
sure if the other bug is serious enough to block PAT by default yet, but
either way we should fix the bugs!
> That said, I'll be happy to help trace and get these issues fixed.
Thanks for your help so far...
Hmmm. Is 'pat=off' supposed to do anything?
With a default boot I see:
kernel: Command line: root=/dev/mapper/main-root ro vga=791
[...]
kernel: x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
And with 'pat=off' I get the same:
kernel: Command line: root=/dev/mapper/main-root ro vga=791 pat=off
[...]
kernel: x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
Anyway, the info in /proc/mtrr is identical with and without the option, and
is also identical to what I get with 2.6.25.1:
$ cat /proc/mtrr
reg00: base=0x00000000 ( 0MB), size=2048MB: write-back, count=1
reg01: base=0x7f800000 (2040MB), size= 8MB: uncachable, count=1
reg02: base=0x7f700000 (2039MB), size= 1MB: uncachable, count=1
reg03: base=0x80000000 (2048MB), size= 256MB: write-combining, count=1
Cheers,
FJP
>-----Original Message-----
>From: Frans Pop [mailto:ele...@planet.nl]
>Sent: Monday, May 05, 2008 9:55 AM
>To: Yinghai Lu
>Cc: Barnes, Jesse; Pallipadi, Venkatesh;
>linux-...@vger.kernel.org; Ingo Molnar; Packard, Keith
>Subject: Re: [git head] Should X86_PAT really default to yes?
>
>On Sunday 04 May 2008, Yinghai Lu wrote:
>> can you boot with pat=off and send out /proc/mtrr?
>
>Hmmm. Is 'pat=off' supposed to do anything?
>
The option for this is actually "nopat".
Thanks,
Venki
On Monday 05 May 2008, Jesse Barnes wrote:
> > > If so, it might not be a PAT issue but just a different memory layout
> > > or something (and therefore it would really just be a cosmetic bug in
> > > the X driver).
> >
> > The artifacts may not be a PAT issue directly, but it is a clear
> > regression for me as I currently have a nice clean screen when X shuts
> > down. I'm also 100% sure that it is caused by enabling PAT. A kernel
> > with same config and only PAT disabled does not show the artifacts.
> >
> > Would you like me to file a bug against X for these artifacts?
> > If so, against what component? The i810 driver or the server?
>
> I suspect an i810 driver bug is being uncovered here, since we do have
> transient VT switch corruption on some other platforms (we're just
> exposing our chip reprogramming on the screen, rather than keeping it off
> the whole time). But there could also be something PAT specific going
> on, I'll have to walk through those code paths...
I suspect it could be vesafb/fbcon related instead. Normally I boot my
system with 'quiet vga=791', i.e. with vesafb. I then see the artifacts.
When I boot without 'vga=791', I hit another, unrelated regression (which
I'll report separately) [1].
When I boot with 'video=vfb:off', I do _not_ get the artifacts when X exits.
Note that the "expected mapping type" errors remain the same both with and
without framebuffer console.
> Oh the messages should be removed or somehow minimized, I agree. I'm
> just not sure if the other bug is serious enough to block PAT by default
> yet, but either way we should fix the bugs!
OK. Thanks. Guess you've also seen "Xorg crash with xf86MapVidMem error"
that turned out to be due to PAT:
http://www.gossamer-threads.com/lists/linux/kernel/915300 ?
Cheers,
FJP
[1] Short version of this unrelated issue.
If I boot without vga=791 the console stops being updated after PCI probes.
At first I thought this was #9310, but this time it's unrelated to the
config option mentioned there. Symptoms are awfully similar though.
Ahhh, I missed that part of your config. That could definitely have an effect
on things... You'll probably want something like the attached (there are
other places in the fb layer that want similar treatment, iirc, maybe
fb_pgprotect?).
> Note that the "expected mapping type" errors remain the same both with and
> without framebuffer console.
>
> > Oh the messages should be removed or somehow minimized, I agree. I'm
> > just not sure if the other bug is serious enough to block PAT by default
> > yet, but either way we should fix the bugs!
>
> OK. Thanks. Guess you've also seen "Xorg crash with xf86MapVidMem error"
> that turned out to be due to PAT:
> http://www.gossamer-threads.com/lists/linux/kernel/915300 ?
No, I hadn't seen that... Venki/Ingo has that issue been fixed already?
Jesse
YH
No. We are following up on that issue to root cause the problem there.
AFAIK, that is the only other open issue related to PAT at this time.
There was another one reported here
http://bugzilla.kernel.org/show_bug.cgi?id=10580
which was root caused as a app bug unmasked by PAT.
Thanks,
Venki
Not sure what to make of this.
With your patch I still get the artifacts, but they are displayed a lot
shorter. I get only a flash instead of a-2 seconds.
> (there are other places in the fb layer that want similar treatment,
> iirc, maybe fb_pgprotect?).
Do you want to try a more complete review/patch or should this be punted
over to the framebuffer folks?
Cheers,
FJP
Yes, that does work. Thanks.
Contents of /proc/mtrr remain unchanged.
Hm, so I guess some other user is still using UC on the region. [Note to fb
list: the patch just made vesafb use ioremap_wc instead of ioremap).
> > (there are other places in the fb layer that want similar treatment,
> > iirc, maybe fb_pgprotect?).
>
> Do you want to try a more complete review/patch or should this be punted
> over to the framebuffer folks?
Yeah, those bits have changed since I last looked at them, hopefully the fb
guys can help.
Jesse
The patch below plugs the mprotect hole and should eliminate the
"expected mapping type" error messages. Can you check.
Thanks,
Venki
There is a hole in mprotect, which lets the user to change the page
cache type bits by-passing the kernel reserve_memtype and free_memtype
wrappers. Fix the hole by not letting mprotect change the PAT bits.
Some versions of X used the mprotect hole to change caching type from UC to WB,
so that it can then use mtrr to program WC for that region [1]. Change the
mmap of pci space through /sys or /proc interfaces from UC to UC_MINUS.
With this change, X will not need to use mprotect hole to get WC type.
[1] lkml.org/lkml/2008/4/16/369
Signed-off-by: Venkatesh Pallipadi <venkatesh...@intel.com>
Signed-off-by: Suresh Siddha <suresh....@intel.com>
---
arch/x86/pci/i386.c | 4 +---
include/asm-x86/pgtable.h | 5 ++++-
include/linux/mm.h | 1 +
mm/mmap.c | 13 +++++++++++++
mm/mprotect.c | 4 +++-
5 files changed, 22 insertions(+), 5 deletions(-)
Index: linux-2.6/include/asm-x86/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable.h 2008-05-06 14:16:50.000000000 -0700
+++ linux-2.6/include/asm-x86/pgtable.h 2008-05-06 14:18:57.000000000 -0700
@@ -65,6 +65,8 @@
#define _PAGE_CACHE_UC_MINUS (_PAGE_PCD)
#define _PAGE_CACHE_UC (_PAGE_PCD | _PAGE_PWT)
+#define _PAGE_PROT_PRESERVE_BITS (_PAGE_CACHE_MASK)
+
#define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)
#define PAGE_SHARED __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_NX)
@@ -288,7 +290,8 @@ static inline pte_t pte_modify(pte_t pte
* Chop off the NX bit (if present), and add the NX portion of
* the newprot (if present):
*/
- val &= _PAGE_CHG_MASK & ~_PAGE_NX;
+ /* We also preserve PAT bits from existing pte */
+ val &= (_PAGE_CHG_MASK | _PAGE_PROT_PRESERVE_BITS) & ~_PAGE_NX;
val |= pgprot_val(newprot) & __supported_pte_mask;
return __pte(val);
Index: linux-2.6/arch/x86/pci/i386.c
===================================================================
--- linux-2.6.orig/arch/x86/pci/i386.c 2008-05-06 14:16:50.000000000 -0700
+++ linux-2.6/arch/x86/pci/i386.c 2008-05-06 15:28:57.000000000 -0700
@@ -301,15 +301,13 @@ int pci_mmap_page_range(struct pci_dev *
prot = pgprot_val(vma->vm_page_prot);
if (pat_wc_enabled && write_combine)
prot |= _PAGE_CACHE_WC;
- else if (pat_wc_enabled)
+ else if (pat_wc_enabled || boot_cpu_data.x86 > 3)
/*
* ioremap() and ioremap_nocache() defaults to UC MINUS for now.
* To avoid attribute conflicts, request UC MINUS here
* aswell.
*/
prot |= _PAGE_CACHE_UC_MINUS;
- else if (boot_cpu_data.x86 > 3)
- prot |= _PAGE_CACHE_UC;
vma->vm_page_prot = __pgprot(prot);
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2008-05-06 14:16:50.000000000 -0700
+++ linux-2.6/include/linux/mm.h 2008-05-06 14:18:57.000000000 -0700
@@ -1177,6 +1177,7 @@ static inline unsigned long vma_pages(st
}
pgprot_t vm_get_page_prot(unsigned long vm_flags);
+pgprot_t vm_get_page_prot_preserve(unsigned long vm_flags, pgprot_t oldprot);
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-05-06 14:16:50.000000000 -0700
+++ linux-2.6/mm/mmap.c 2008-05-06 14:18:57.000000000 -0700
@@ -77,6 +77,19 @@ pgprot_t vm_get_page_prot(unsigned long
}
EXPORT_SYMBOL(vm_get_page_prot);
+#ifndef _PAGE_PROT_PRESERVE_BITS
+#define _PAGE_PROT_PRESERVE_BITS 0
+#endif
+
+pgprot_t vm_get_page_prot_preserve(unsigned long vm_flags, pgprot_t oldprot)
+{
+ pteval_t newprotval = pgprot_val(oldprot);
+
+ newprotval &= _PAGE_PROT_PRESERVE_BITS;
+ newprotval |= pgprot_val(vm_get_page_prot(vm_flags));
+ return __pgprot(newprotval);
+}
+
int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
int sysctl_overcommit_ratio = 50; /* default is 50% */
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c 2008-05-06 14:16:50.000000000 -0700
+++ linux-2.6/mm/mprotect.c 2008-05-06 14:18:57.000000000 -0700
@@ -192,7 +192,9 @@ success:
* held in write mode.
*/
vma->vm_flags = newflags;
- vma->vm_page_prot = vm_get_page_prot(newflags);
+ vma->vm_page_prot = vm_get_page_prot_preserve(newflags,
+ vma->vm_page_prot);
+
if (vma_wants_writenotify(vma)) {
vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
dirty_accountable = 1;
> There is a hole in mprotect, which lets the user to change the page
> cache type bits by-passing the kernel reserve_memtype and free_memtype
> wrappers. Fix the hole by not letting mprotect change the PAT bits.
>
> Some versions of X used the mprotect hole to change caching type from
> UC to WB, so that it can then use mtrr to program WC for that region
> [1]. Change the mmap of pci space through /sys or /proc interfaces
> from UC to UC_MINUS. With this change, X will not need to use mprotect
> hole to get WC type.
>
> [1] lkml.org/lkml/2008/4/16/369
>
> Signed-off-by: Venkatesh Pallipadi <venkatesh...@intel.com>
> Signed-off-by: Suresh Siddha <suresh....@intel.com>
>
> ---
> arch/x86/pci/i386.c | 4 +---
> include/asm-x86/pgtable.h | 5 ++++-
> include/linux/mm.h | 1 +
> mm/mmap.c | 13 +++++++++++++
> mm/mprotect.c | 4 +++-
> 5 files changed, 22 insertions(+), 5 deletions(-)
hm, that's one dangerous looking patch. (Cc:-ed more MM folks. I've
attached the patch below for reference.)
the purpose of the fix itself seems to make some sense - we dont want
mprotect() change the PAT bits in the pte from what they got populated
with at fault or mmap time.
the pte_modify() change looks correct at first sight.
The _PAGE_PROT_PRESERVE_BITS solution looks a bit ugly (although we do
have a couple of other similar #ifndefs in the MM already).
at minimum we should add vm_get_page_prot_preserve() as an inline
function to mm.h if _PAGE_PROT_PRESERVE_BITS is not defined, and make it
call vm_get_page_prot(). Also, vm_get_page_prot() in mm/mmap.c should
probably be marked inline so that we'll have only a single function call
[vm_get_page_prot() is trivial].
but i'm wondering why similar issues never came up on other
architectures - i thought it would be rather common to have immutable
pte details. So maybe i'm missing something here ...
Ingo
------------------------------------->
Subject: generic, x86, PAT: fix mprotect
From: Venki Pallipadi <venkatesh...@intel.com>
Date: Tue, 6 May 2008 15:42:40 -0700
There is a hole in mprotect, which lets the user to change the page
cache type bits by-passing the kernel reserve_memtype and free_memtype
wrappers. Fix the hole by not letting mprotect change the PAT bits.
Some versions of X used the mprotect hole to change caching type from UC to WB,
so that it can then use mtrr to program WC for that region [1]. Change the
mmap of pci space through /sys or /proc interfaces from UC to UC_MINUS.
With this change, X will not need to use mprotect hole to get WC type.
[1] lkml.org/lkml/2008/4/16/369
Signed-off-by: Venkatesh Pallipadi <venkatesh...@intel.com>
Signed-off-by: Suresh Siddha <suresh....@intel.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
arch/x86/pci/i386.c | 4 +---
include/asm-x86/pgtable.h | 5 ++++-
include/linux/mm.h | 1 +
mm/mmap.c | 13 +++++++++++++
mm/mprotect.c | 4 +++-
5 files changed, 22 insertions(+), 5 deletions(-)
Index: linux-x86.q/arch/x86/pci/i386.c
===================================================================
--- linux-x86.q.orig/arch/x86/pci/i386.c
+++ linux-x86.q/arch/x86/pci/i386.c
@@ -301,15 +301,13 @@ int pci_mmap_page_range(struct pci_dev *
prot = pgprot_val(vma->vm_page_prot);
if (pat_wc_enabled && write_combine)
prot |= _PAGE_CACHE_WC;
- else if (pat_wc_enabled)
+ else if (pat_wc_enabled || boot_cpu_data.x86 > 3)
/*
* ioremap() and ioremap_nocache() defaults to UC MINUS for now.
* To avoid attribute conflicts, request UC MINUS here
* aswell.
*/
prot |= _PAGE_CACHE_UC_MINUS;
- else if (boot_cpu_data.x86 > 3)
- prot |= _PAGE_CACHE_UC;
vma->vm_page_prot = __pgprot(prot);
Index: linux-x86.q/include/asm-x86/pgtable.h
===================================================================
--- linux-x86.q.orig/include/asm-x86/pgtable.h
+++ linux-x86.q/include/asm-x86/pgtable.h
@@ -66,6 +66,8 @@
#define _PAGE_CACHE_UC_MINUS (_PAGE_PCD)
#define _PAGE_CACHE_UC (_PAGE_PCD | _PAGE_PWT)
+#define _PAGE_PROT_PRESERVE_BITS (_PAGE_CACHE_MASK)
+
#define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)
#define PAGE_SHARED __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_NX)
@@ -289,7 +291,8 @@ static inline pte_t pte_modify(pte_t pte
* Chop off the NX bit (if present), and add the NX portion of
* the newprot (if present):
*/
- val &= _PAGE_CHG_MASK & ~_PAGE_NX;
+ /* We also preserve PAT bits from existing pte */
+ val &= (_PAGE_CHG_MASK | _PAGE_PROT_PRESERVE_BITS) & ~_PAGE_NX;
val |= pgprot_val(newprot) & __supported_pte_mask;
return __pte(val);
Index: linux-x86.q/include/linux/mm.h
===================================================================
--- linux-x86.q.orig/include/linux/mm.h
+++ linux-x86.q/include/linux/mm.h
@@ -1177,6 +1177,7 @@ static inline unsigned long vma_pages(st
}
pgprot_t vm_get_page_prot(unsigned long vm_flags);
+pgprot_t vm_get_page_prot_preserve(unsigned long vm_flags, pgprot_t oldprot);
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
Index: linux-x86.q/mm/mmap.c
===================================================================
--- linux-x86.q.orig/mm/mmap.c
+++ linux-x86.q/mm/mmap.c
@@ -77,6 +77,19 @@ pgprot_t vm_get_page_prot(unsigned long
}
EXPORT_SYMBOL(vm_get_page_prot);
+#ifndef _PAGE_PROT_PRESERVE_BITS
+#define _PAGE_PROT_PRESERVE_BITS 0
+#endif
+
+pgprot_t vm_get_page_prot_preserve(unsigned long vm_flags, pgprot_t oldprot)
+{
+ pteval_t newprotval = pgprot_val(oldprot);
+
+ newprotval &= _PAGE_PROT_PRESERVE_BITS;
+ newprotval |= pgprot_val(vm_get_page_prot(vm_flags));
+ return __pgprot(newprotval);
+}
+
int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
int sysctl_overcommit_ratio = 50; /* default is 50% */
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
Index: linux-x86.q/mm/mprotect.c
===================================================================
--- linux-x86.q.orig/mm/mprotect.c
+++ linux-x86.q/mm/mprotect.c
Thanks, Ingo. I've added Nick too (in part because I'm wondering
whether his s390 SPECIAL bit gets destroyed by mprotect and needs
similar attention).
>
> the purpose of the fix itself seems to make some sense - we dont want
> mprotect() change the PAT bits in the pte from what they got populated
> with at fault or mmap time.
Indeed.
>
> the pte_modify() change looks correct at first sight.
>
> The _PAGE_PROT_PRESERVE_BITS solution looks a bit ugly (although we do
> have a couple of other similar #ifndefs in the MM already).
>
> at minimum we should add vm_get_page_prot_preserve() as an inline
> function to mm.h if _PAGE_PROT_PRESERVE_BITS is not defined, and make it
> call vm_get_page_prot(). Also, vm_get_page_prot() in mm/mmap.c should
> probably be marked inline so that we'll have only a single function call
> [vm_get_page_prot() is trivial].
I've tried doing it slightly differently below,
don't know whether you'll consider it an improvement or not.
>
> but i'm wondering why similar issues never came up on other
> architectures - i thought it would be rather common to have immutable
> pte details. So maybe i'm missing something here ...
I believe it has come up from time to time in certain drivers,
but they've worked around it somehow, so it's never got important
enough to handle in the core. Glad we're getting to do so now.
>
> Ingo
>
> ------------------------------------->
> Subject: generic, x86, PAT: fix mprotect
> From: Venki Pallipadi <venkatesh...@intel.com>
> Date: Tue, 6 May 2008 15:42:40 -0700
>
> There is a hole in mprotect, which lets the user to change the page
> cache type bits by-passing the kernel reserve_memtype and free_memtype
> wrappers. Fix the hole by not letting mprotect change the PAT bits.
Maybe say "defect" rather than "hole"? I stupidly thought for a
while that you were talking about some gap in the address space.
>
> Some versions of X used the mprotect hole to change caching type from UC to WB,
> so that it can then use mtrr to program WC for that region [1]. Change the
> mmap of pci space through /sys or /proc interfaces from UC to UC_MINUS.
> With this change, X will not need to use mprotect hole to get WC type.
I'll trust you on that end of it, UC_MINUS is beyond me.
(Amused to see _PAGE_CACHE_MASK is very different from PAGE_CACHE_MASK!)
> #define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)
> #define PAGE_SHARED __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
> _PAGE_ACCESSED | _PAGE_NX)
> @@ -289,7 +291,8 @@ static inline pte_t pte_modify(pte_t pte
> * Chop off the NX bit (if present), and add the NX portion of
> * the newprot (if present):
> */
> - val &= _PAGE_CHG_MASK & ~_PAGE_NX;
(Nothing to do with your patch, but that's a silly line, isn't it?
_PAGE_NX isn't in _PAGE_CHG_MASK so it doesn't need further masking out.)
> + /* We also preserve PAT bits from existing pte */
> + val &= (_PAGE_CHG_MASK | _PAGE_PROT_PRESERVE_BITS) & ~_PAGE_NX;
> val |= pgprot_val(newprot) & __supported_pte_mask;
>
> return __pte(val);
An odd thing is that you end up keeping the PAT bits from the old
pte, and the PAT bits from vm_page_prot. So if they can get out
of synch, you'd be oring them together; and if they can't get out
of synch, then do you need a change here?
I've followed that dup in my version below, but it's probably wrong.
Yes, you want the PAT bits in vm_page_prot (so various places
will automatically set them), but I think here it's more correct
to take the PAT bits from the old pte and mask out those from the
vm_page_prot?
I suggest that way round because I'm vaguely anticipating that
silly case of a MAP_PRIVATE mapping of one of these things: we'll
need to add something in do_wp_page so as not to set the PAT bits
on private COWed copies of the PATted page (but that can follow
later as a fixup patch).
> Index: linux-x86.q/include/linux/mm.h
> ===================================================================
> --- linux-x86.q.orig/include/linux/mm.h
> +++ linux-x86.q/include/linux/mm.h
> @@ -1177,6 +1177,7 @@ static inline unsigned long vma_pages(st
> }
>
> pgprot_t vm_get_page_prot(unsigned long vm_flags);
> +pgprot_t vm_get_page_prot_preserve(unsigned long vm_flags, pgprot_t oldprot);
I'm thinking its equivalent needs to be in asm/pagetable.h.
> struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
> int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
> unsigned long pfn, unsigned long size, pgprot_t);
> Index: linux-x86.q/mm/mmap.c
> ===================================================================
> --- linux-x86.q.orig/mm/mmap.c
> +++ linux-x86.q/mm/mmap.c
> @@ -77,6 +77,19 @@ pgprot_t vm_get_page_prot(unsigned long
> }
> EXPORT_SYMBOL(vm_get_page_prot);
>
> +#ifndef _PAGE_PROT_PRESERVE_BITS
> +#define _PAGE_PROT_PRESERVE_BITS 0
> +#endif
> +
> +pgprot_t vm_get_page_prot_preserve(unsigned long vm_flags, pgprot_t oldprot)
> +{
> + pteval_t newprotval = pgprot_val(oldprot);
> +
> + newprotval &= _PAGE_PROT_PRESERVE_BITS;
> + newprotval |= pgprot_val(vm_get_page_prot(vm_flags));
> + return __pgprot(newprotval);
> +}
> +
I'm not convinced that this anding and oring is necessarily right for
any architecture: I think this function (or my equivalent pgprot_modify)
ought to go down in the arch headers: next to pte_modify helped my focus.
> int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
> int sysctl_overcommit_ratio = 50; /* default is 50% */
> int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> Index: linux-x86.q/mm/mprotect.c
> ===================================================================
> --- linux-x86.q.orig/mm/mprotect.c
> +++ linux-x86.q/mm/mprotect.c
> @@ -192,7 +192,9 @@ success:
> * held in write mode.
> */
> vma->vm_flags = newflags;
> - vma->vm_page_prot = vm_get_page_prot(newflags);
> + vma->vm_page_prot = vm_get_page_prot_preserve(newflags,
> + vma->vm_page_prot);
> +
> if (vma_wants_writenotify(vma)) {
> vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
> dirty_accountable = 1;
Here's my more minimal alternative; but I think we need to clarify
on that apparent duplication of the PAT bits in pte_modify.
arch/x86/pci/i386.c | 4 +---
include/asm-x86/pgtable.h | 14 +++++++++++++-
mm/mprotect.c | 11 ++++++++++-
3 files changed, 24 insertions(+), 5 deletions(-)
--- 2.6.26-rc1/arch/x86/pci/i386.c 2008-05-03 21:54:41.000000000 +0100
+++ linux/arch/x86/pci/i386.c 2008-05-07 17:14:53.000000000 +0100
@@ -301,15 +301,13 @@ int pci_mmap_page_range(struct pci_dev *
prot = pgprot_val(vma->vm_page_prot);
if (pat_wc_enabled && write_combine)
prot |= _PAGE_CACHE_WC;
- else if (pat_wc_enabled)
+ else if (pat_wc_enabled || boot_cpu_data.x86 > 3)
/*
* ioremap() and ioremap_nocache() defaults to UC MINUS for now.
* To avoid attribute conflicts, request UC MINUS here
* aswell.
*/
prot |= _PAGE_CACHE_UC_MINUS;
- else if (boot_cpu_data.x86 > 3)
- prot |= _PAGE_CACHE_UC;
vma->vm_page_prot = __pgprot(prot);
--- 2.6.26-rc1/include/asm-x86/pgtable.h 2008-05-03 21:55:10.000000000 +0100
+++ linux/include/asm-x86/pgtable.h 2008-05-07 19:09:42.000000000 +0100
@@ -57,7 +57,8 @@
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
_PAGE_DIRTY)
-#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_PWT | _PAGE_PCD | \
+ _PAGE_ACCESSED | _PAGE_DIRTY)
#define _PAGE_CACHE_MASK (_PAGE_PCD | _PAGE_PWT)
#define _PAGE_CACHE_WB (0)
@@ -294,6 +295,17 @@ static inline pte_t pte_modify(pte_t pte
return __pte(val);
}
+/*
+ * mprotect needs to preserve PAT bits when updating vm_page_prot
+ */
+#define pgprot_modify pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+ pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK;
+ pgprotval_t addbits = pgprot_val(newprot);
+ return __pgprot(preservebits | addbits);
+}
+
#define pte_pgprot(x) __pgprot(pte_val(x) & (0xfff | _PAGE_NX))
#define canon_pgprot(p) __pgprot(pgprot_val(p) & __supported_pte_mask)
--- 2.6.26-rc1/mm/mprotect.c 2008-01-24 22:58:37.000000000 +0000
+++ linux/mm/mprotect.c 2008-05-07 18:42:07.000000000 +0100
@@ -26,6 +26,13 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+ return newprot;
+}
+#endif
+
static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
@@ -192,7 +199,9 @@ success:
* held in write mode.
*/
vma->vm_flags = newflags;
- vma->vm_page_prot = vm_get_page_prot(newflags);
+ vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
+ vm_get_page_prot(newflags));
I did check that with _PAGE_PROT_PRESERVE_BITS defined to zero,
compiler optimizes vm_get_page_prot_preserve to generate same code as
vm_get_page_prot with no function call (on x86 64). So, things should be OK
from overhead perspective. But, from the code cleanliness aspect,
PRESERVE_BITS looks unclean and needs some cleanup. The reason I posted the
patch as is, was to get the confirmation on the original thread, whether this
indeed fixes those PAT related error messages.
> but i'm wondering why similar issues never came up on other
> architectures - i thought it would be rather common to have immutable
> pte details. So maybe i'm missing something here ...
Probably other architectures does not depend on preserving things in
vma->vm_page_prot once ptes are set correctly. With PAT, we use
vm_page_prot to keep track of PAT attributes for vmas from parent to
child across fork.
pte_modify() part will be required in all archs that wants to preserve
some bits in pte in a mprotect call.
Thanks,
Venki
Hugh: Thanks for looking into this. Yes. I like your modified patch. Simpler
and smaller.
> > ------------------------------------->
> > Subject: generic, x86, PAT: fix mprotect
> > From: Venki Pallipadi <venkatesh...@intel.com>
> > Date: Tue, 6 May 2008 15:42:40 -0700
> >
> > There is a hole in mprotect, which lets the user to change the page
> > cache type bits by-passing the kernel reserve_memtype and free_memtype
> > wrappers. Fix the hole by not letting mprotect change the PAT bits.
>
> Maybe say "defect" rather than "hole"? I stupidly thought for a
> while that you were talking about some gap in the address space.
Yes. we can call it a loophole or defect.
> > @@ -289,7 +291,8 @@ static inline pte_t pte_modify(pte_t pte
> > * Chop off the NX bit (if present), and add the NX portion of
> > * the newprot (if present):
> > */
> > - val &= _PAGE_CHG_MASK & ~_PAGE_NX;
>
> (Nothing to do with your patch, but that's a silly line, isn't it?
> _PAGE_NX isn't in _PAGE_CHG_MASK so it doesn't need further masking out.)
Yes. That PAGE_NX part indeed looks bogus.
> > + /* We also preserve PAT bits from existing pte */
> > + val &= (_PAGE_CHG_MASK | _PAGE_PROT_PRESERVE_BITS) & ~_PAGE_NX;
> > val |= pgprot_val(newprot) & __supported_pte_mask;
> >
> > return __pte(val);
>
> An odd thing is that you end up keeping the PAT bits from the old
> pte, and the PAT bits from vm_page_prot. So if they can get out
> of synch, you'd be oring them together; and if they can't get out
> of synch, then do you need a change here?
They cannot get out of sync. So, we can do without the change here. But, I felt
it is safer to take things from old pte.
> I've followed that dup in my version below, but it's probably wrong.
> Yes, you want the PAT bits in vm_page_prot (so various places
> will automatically set them), but I think here it's more correct
> to take the PAT bits from the old pte and mask out those from the
> vm_page_prot?
Yes. We can mask out these bits from vm_page_prot, that way we will not
'or' them if they get out of sync.
If your below modified patch is OK, I can send an incremental patch to
change pte_modify() to inherit PAT bits only from oldpte.
Thanks,
Venki
> > I've tried doing it slightly differently below, don't know whether
> > you'll consider it an improvement or not.
>
> Hugh: Thanks for looking into this. Yes. I like your modified patch.
> Simpler and smaller.
i have stuck your original patch into testing and nothing blew up so
far. Due to the mm/ bits this is not for the scope of x86.git, but
obviously it all looks good and is .26-worthy to me:
Acked-by: Ingo Molnar <mi...@elte.hu>
Tested-by: Ingo Molnar <mi...@elte.hu>
Venki, could you please send a full patch against -git that has
everything from Hugh included, with an updated changelog, for
Linus/Andrew to ack/apply?
Ingo
Ingo,
Split up the patch into two parts as the pci part was unrelated to mprotect
problem in a sense.
Here is the first patch.
Thanks,
Venki
Some versions of X used the mprotect workaround to change caching type from
UC to WB, so that it can then use mtrr to program WC for that region [1].
Change the mmap of pci space through /sys or /proc interfaces from UC to
UC_MINUS. With this change, X will not need to use mprotect
workaround to get WC type.
Also the bug with mprotect which lets caller to change PAT bits is fixed in
the follow on patch. So, this X workaround will stop working as well.
Signed-off-by: Venkatesh Pallipadi <venkatesh...@intel.com>
Signed-off-by: Suresh Siddha <suresh....@intel.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
arch/x86/pci/i386.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
Index: linux-2.6/arch/x86/pci/i386.c
===================================================================
--- linux-2.6.orig/arch/x86/pci/i386.c 2008-05-06 15:45:47.000000000 -0700
+++ linux-2.6/arch/x86/pci/i386.c 2008-05-09 10:05:45.000000000 -0700
@@ -301,15 +301,13 @@ int pci_mmap_page_range(struct pci_dev *
prot = pgprot_val(vma->vm_page_prot);
if (pat_wc_enabled && write_combine)
prot |= _PAGE_CACHE_WC;
- else if (pat_wc_enabled)
+ else if (pat_wc_enabled || boot_cpu_data.x86 > 3)
/*
* ioremap() and ioremap_nocache() defaults to UC MINUS for now.
* To avoid attribute conflicts, request UC MINUS here
* aswell.
*/
prot |= _PAGE_CACHE_UC_MINUS;
- else if (boot_cpu_data.x86 > 3)
- prot |= _PAGE_CACHE_UC;
vma->vm_page_prot = __pgprot(prot);
And the second patch for mprotect problem.
There is a defect in mprotect, which lets the user to change the page
cache type bits by-passing the kernel reserve_memtype and free_memtype
wrappers. Fix the problem by not letting mprotect change the PAT bits.
Signed-off-by: Venkatesh Pallipadi <venkatesh...@intel.com>
Signed-off-by: Suresh Siddha <suresh....@intel.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
Signed-off-by: Hugh Dickins <hu...@veritas.com>
---
include/asm-x86/pgtable.h | 16 +++++++++++++---
mm/mprotect.c | 11 ++++++++++-
2 files changed, 23 insertions(+), 4 deletions(-)
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c 2008-05-09 10:50:28.000000000 -0700
+++ linux-2.6/mm/mprotect.c 2008-05-09 11:01:23.000000000 -0700
@@ -26,6 +26,13 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+ return newprot;
+}
+#endif
+
static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
@@ -192,7 +199,9 @@ success:
* held in write mode.
*/
vma->vm_flags = newflags;
- vma->vm_page_prot = vm_get_page_prot(newflags);
+ vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
+ vm_get_page_prot(newflags));
+
if (vma_wants_writenotify(vma)) {
vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
dirty_accountable = 1;
Index: linux-2.6/include/asm-x86/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable.h 2008-05-09 10:50:28.000000000 -0700
+++ linux-2.6/include/asm-x86/pgtable.h 2008-05-09 11:01:23.000000000 -0700
@@ -57,7 +57,8 @@
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
_PAGE_DIRTY)
-#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define _PAGE_CHG_MASK (PTE_MASK |_PAGE_PCD | _PAGE_PWT | \
+ _PAGE_ACCESSED | _PAGE_DIRTY)
#define _PAGE_CACHE_MASK (_PAGE_PCD | _PAGE_PWT)
#define _PAGE_CACHE_WB (0)
@@ -288,12 +289,21 @@ static inline pte_t pte_modify(pte_t pte
* Chop off the NX bit (if present), and add the NX portion of
* the newprot (if present):
*/
- val &= _PAGE_CHG_MASK & ~_PAGE_NX;
- val |= pgprot_val(newprot) & __supported_pte_mask;
+ val &= _PAGE_CHG_MASK;
+ val |= pgprot_val(newprot) & (~_PAGE_CHG_MASK) & __supported_pte_mask;
return __pte(val);
}
+/* mprotect needs to preserve PAT bits when updating vm_page_prot */
+#define pgprot_modify pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+ pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK;
+ pgprotval_t addbits = pgprot_val(newprot);
+ return __pgprot(preservebits | addbits);
+}
+
#define pte_pgprot(x) __pgprot(pte_val(x) & (0xfff | _PAGE_NX))
#define canon_pgprot(p) __pgprot(pgprot_val(p) & __supported_pte_mask)
Thanks a lot for finalizing that, Venki: looks good to me.
I didn't originally sign-off my version of your patch, but am happy
to do so now that you've gone over it again and fixed the pte_modify
issue (though I'm not sure what the order of our signoffs should be).
I shall need to cover the MAP_PRIVATE COW issue in do_wp_page,
but I think that's better done as a separate patch: I'll attend
to it in a few days, it's not something that will upset anyone
bisecting, my mind is elsewhere at present.
Hugh
>-----Original Message-----
>From: Dave Airlie [mailto:air...@gmail.com]
>Sent: Friday, May 09, 2008 3:11 PM
>To: Pallipadi, Venkatesh
>Cc: Ingo Molnar; Hugh Dickins; Frans Pop; Barnes, Jesse;
>linux-...@vger.kernel.org; Packard, Keith; Yinghai Lu;
>Andrew Morton; Linus Torvalds; H. Peter Anvin; Thomas
>Gleixner; Nick Piggin; Jesse Barnes
>Subject: Re: [git head] X86_PAT & mprotect
>Wow this kinda puts X in a nasty position, we have 2.6.25 and
>previous kernels
>where we use the original /sys interfaces and nasty hack to
>workaround, but on 2.6.26 we magically need to
>switch to the /sys _uc interfaces or the users X will slow down.
>
No. With 2.6.26 you can still use same /sys resource file, without
the mprotect workaround and set the MTRR as before.
The change from UC to UC_MINUS in this patch ensures that the
X drivers MTRR will take precedence and X does not need any mprotect
hacks.
Thanks,
Venki
Wow this kinda puts X in a nasty position, we have 2.6.25 and previous kernels
where we use the original /sys interfaces and nasty hack to
workaround, but on 2.6.26 we magically need to
switch to the /sys _uc interfaces or the users X will slow down.
Granted I think only F9 is shipping libpciaccess so far, but now we
need to fix it up and make sure a new one exists before
2.6.26 hits users. Build it yourself users are going to be noticing
the slowdown I suspect.
Dave.
> Wow this kinda puts X in a nasty position, we have 2.6.25 and previous kernels
> where we use the original /sys interfaces and nasty hack to
> workaround, but on 2.6.26 we magically need to
> switch to the /sys _uc interfaces or the users X will slow down.
It didn't sound like that to me -- using UC- should mean that if we set
up MTRRs correctly, we'll get WC access for our frame buffer. Did I miss
something here?
Cool just making sure.
Keithp, I hope we didn't error check the mremap return values :)
Dave.
> Keithp, I hope we didn't error check the mremap return values :)
/* KLUDGE ALERT -- rewrite the PTEs to turn off the CD and WT bits */
mprotect (map->memory, map->size, PROT_NONE);
mprotect (map->memory, map->size, PROT_READ|PROT_WRITE);
No return value checks for us ;-)
Hi,
My apologies for the very late reply. Unfortunately I've had no time for
work on kernel issues the last few weeks.
I have not specifically tested any of the proposed patches, but I can
confirm that the "expected mapping type" errors are gone with -rc2
and -rc3. Thanks a lot.
The artifacts are still there though and frequently even worse than
originally reported. I will open a bugzilla entry for that.
Cheers,
FJP
AFAIK there has been no progress on this issue, which is now listed on the
regression list for .26 [1]. As there has been no response from any fb
devs, is someone else maybe willing to have a closer look at this?
I'm still seeing the artifacts with -rc6 and the severity seems to depend
on what was last displayed: sometimes the whole display is covered with
colored nonsense, artistic maybe but not desired...
Cheers,
FJP
[1] http://bugzilla.kernel.org/show_bug.cgi?id=10843