x86 ptep_get_and_clear question


Jamie Lokier
Feb 15, 2001, 12:47:29 PM
[Added Linus and linux-kernel as I think it's of general interest]

Kanoj Sarcar wrote:
> Whether Jamie was trying to illustrate a different problem, I am not
> sure.

Yes, I was talking about pte_test_and_clear_dirty in the earlier post.

> Look in mm/mprotect.c. Look at the call sequence change_protection() -> ...
> change_pte_range(). Specifically at the sequence:
>
> entry = ptep_get_and_clear(pte);
> set_pte(pte, pte_modify(entry, newprot));
>
> Go ahead and pull your x86 specs, and prove to me that between the
> ptep_get_and_clear(), which zeroes out the pte (specifically, when the
> dirty bit is not set), processor 2 can not come in and set the dirty
> bit on the in-memory pte. Which immediately gets overwritten by the
> set_pte(). For an example of how this can happen, look at my previous
> postings.

Let's see. We'll assume processor 2 does a write between the
ptep_get_and_clear and the set_pte, which are done on processor 1.

Now, ptep_get_and_clear is atomic, so we can talk about "before" and
"after". Before it, either processor 2 has a TLB entry with the dirty
bit set, or it does not (it has either a clean TLB entry or no TLB entry
at all).

After ptep_get_and_clear, processor 2 does a write. If it already has a
dirty TLB entry, then `entry' will also be dirty so the dirty bit is
preserved. If processor 2 does not have a dirty TLB entry, then it will
look up the pte. Processor 2 finds the pte is clear, so raises a page fault.
Spinlocks etc. sort everything out in the page fault.

Here's the important part: when processor 2 wants to set the pte's dirty
bit, it *rereads* the pte and *rechecks* the permission bits again.
Even though it has a non-dirty TLB entry for that pte.

That is how I read Ben LaHaise's description, and his test program tests
exactly this.

If the processor worked by atomically setting the dirty bit in the pte
without rechecking the permissions when it reads that pte bit, then this
scheme would fail and you'd be right about the lost dirty bits. I would
have thought it would be simpler to implement a CPU this way, but
clearly it is not as efficient for SMP OS design so perhaps CPU
designers thought about this.
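Jamie's argument can be sketched as a toy, single-threaded model of the interleaving (illustrative names only, not kernel code; `dirty_bit_lost` and the helpers are assumptions made up for the sketch). It models the behaviour described above: a dirty-bit update re-reads and re-checks the in-memory pte, so processor 2 faults instead of silently losing the dirty bit.

```c
#include <stdint.h>

#define PTE_PRESENT 0x01u
#define PTE_DIRTY   0x40u   /* the x86 D bit */

static uint32_t mem_pte;    /* the in-memory pte */

/* Processor 2 writes through a clean (non-dirty) writable TLB entry:
 * per the behaviour described above, the dirty-bit update re-reads
 * and re-checks the in-memory pte before setting D. */
static void cpu2_write(int *faulted)
{
    if (!(mem_pte & PTE_PRESENT)) {
        *faulted = 1;               /* raises a page fault instead */
        return;
    }
    mem_pte |= PTE_DIRTY;           /* locked cycle sets the D bit */
}

/* Returns 1 if processor 2's dirty bit was silently lost. */
static int dirty_bit_lost(int *faulted)
{
    *faulted = 0;
    mem_pte = PTE_PRESENT;          /* clean, present pte */

    uint32_t entry = mem_pte;       /* ptep_get_and_clear() on cpu 1 */
    mem_pte = 0;

    cpu2_write(faulted);            /* cpu 2 writes in the window */

    mem_pte = entry | PTE_PRESENT;  /* set_pte(pte_modify(entry, ...)) */

    /* Lost iff cpu 2 neither faulted nor left a surviving D bit. */
    return !*faulted && !(mem_pte & PTE_DIRTY);
}
```

Under the re-check assumption, the fault handler (holding the right locks) sorts things out and nothing is lost; the whole question of the thread is whether real x86 parts guarantee that assumption.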

The only remaining question is: is the observed behaviour defined for
x86 CPUs in general, or are we depending on the results of testing a few
particular CPUs?

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Kanoj Sarcar
Feb 15, 2001, 1:05:48 PM
>
> [Added Linus and linux-kernel as I think it's of general interest]
>
> Kanoj Sarcar wrote:
> > Whether Jamie was trying to illustrate a different problem, I am not
> > sure.
>
> Yes, I was talking about pte_test_and_clear_dirty in the earlier post.
>
> > Look in mm/mprotect.c. Look at the call sequence change_protection() -> ...
> > change_pte_range(). Specifically at the sequence:
> >
> > entry = ptep_get_and_clear(pte);
> > set_pte(pte, pte_modify(entry, newprot));
> >
> > Go ahead and pull your x86 specs, and prove to me that between the
> > ptep_get_and_clear(), which zeroes out the pte (specifically, when the
> > dirty bit is not set), processor 2 can not come in and set the dirty
> > bit on the in-memory pte. Which immediately gets overwritten by the
> > set_pte(). For an example of how this can happen, look at my previous
> > postings.
>

Now you are talking my language!

> Let's see. We'll assume processor 2 does a write between the
> ptep_get_and_clear and the set_pte, which are done on processor 1.
>
> Now, ptep_get_and_clear is atomic, so we can talk about "before" and
> "after". Before it, either processor 2 has a TLB entry with the dirty
> bit set, or it does not (it has either a clean TLB entry or no TLB entry
> at all).
>
> After ptep_get_and_clear, processor 2 does a write. If it already has a
> dirty TLB entry, then `entry' will also be dirty so the dirty bit is
> preserved. If processor 2 does not have a dirty TLB entry, then it will
> look up the pte. Processor 2 finds the pte is clear, so raises a page fault.
> Spinlocks etc. sort everything out in the page fault.
>
> Here's the important part: when processor 2 wants to set the pte's dirty
> bit, it *rereads* the pte and *rechecks* the permission bits again.
> Even though it has a non-dirty TLB entry for that pte.
>
> That is how I read Ben LaHaise's description, and his test program tests
> exactly this.
>

Okay, I asked Ben, he couldn't point me at specs and shut me up.

> If the processor worked by atomically setting the dirty bit in the pte
> without rechecking the permissions when it reads that pte bit, then this
> scheme would fail and you'd be right about the lost dirty bits. I would

Exactly. This is why I did not implement this scheme earlier when Alan
and I talked about this scenario, almost a couple of years back.

> have thought it would be simpler to implement a CPU this way, but
> clearly it is not as efficient for SMP OS design so perhaps CPU
> designers thought about this.
>
> The only remaining question is: is the observed behaviour defined for
> x86 CPUs in general, or are we depending on the results of testing a few
> particular CPUs?

Exactly!

So my claim still stands: ptep_get_and_clear() doesn't do what it claims
to do. I would be more than happy if someone can give me logic to break
this claim ... which would mean one longstanding data integrity problem
on Linux has been fixed satisfactorily.

Kanoj

Kanoj Sarcar
Feb 15, 2001, 1:23:38 PM
>
> [Added Linus and linux-kernel as I think it's of general interest]
>
> Kanoj Sarcar wrote:
> > Whether Jamie was trying to illustrate a different problem, I am not
> > sure.
>
> Yes, I was talking about pte_test_and_clear_dirty in the earlier post.
>
> > Look in mm/mprotect.c. Look at the call sequence change_protection() -> ...
> > change_pte_range(). Specifically at the sequence:
> >
> > entry = ptep_get_and_clear(pte);
> > set_pte(pte, pte_modify(entry, newprot));
> >
> > Go ahead and pull your x86 specs, and prove to me that between the
> > ptep_get_and_clear(), which zeroes out the pte (specifically, when the
> > dirty bit is not set), processor 2 can not come in and set the dirty
> > bit on the in-memory pte. Which immediately gets overwritten by the
> > set_pte(). For an example of how this can happen, look at my previous
> > postings.
>
> Let's see. We'll assume processor 2 does a write between the
> ptep_get_and_clear and the set_pte, which are done on processor 1.
>
> Now, ptep_get_and_clear is atomic, so we can talk about "before" and
> "after". Before it, either processor 2 has a TLB entry with the dirty
> bit set, or it does not (it has either a clean TLB entry or no TLB entry
> at all).
>
> After ptep_get_and_clear, processor 2 does a write. If it already has a
> dirty TLB entry, then `entry' will also be dirty so the dirty bit is
> preserved. If processor 2 does not have a dirty TLB entry, then it will
> look up the pte. Processor 2 finds the pte is clear, so raises a page fault.
> Spinlocks etc. sort everything out in the page fault.
>
> Here's the important part: when processor 2 wants to set the pte's dirty
> bit, it *rereads* the pte and *rechecks* the permission bits again.
> Even though it has a non-dirty TLB entry for that pte.
>
> That is how I read Ben LaHaise's description, and his test program tests
> exactly this.

Okay, I will quote from Intel Architecture Software Developer's Manual
Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27:

"Bus cycles to the page directory and page tables in memory are performed
only when the TLBs do not contain the translation information for a
requested page."

And on the same page:

"Whenever a page directory or page table entry is changed (including when
the present flag is set to zero), the operating system must immediately
invalidate the corresponding entry in the TLB so that it can be updated
the next time the entry is referenced."

So, it looks highly unlikely to me that the basic assumption about how
x86 works wrt tlb/ptes in the ptep_get_and_clear() solution is correct.

Kanoj

>
> If the processor worked by atomically setting the dirty bit in the pte
> without rechecking the permissions when it reads that pte bit, then this
> scheme would fail and you'd be right about the lost dirty bits. I would

> have thought it would be simpler to implement a CPU this way, but
> clearly it is not as efficient for SMP OS design so perhaps CPU
> designers thought about this.
>
> The only remaining question is: is the observed behaviour defined for
> x86 CPUs in general, or are we depending on the results of testing a few
> particular CPUs?
>

Manfred Spraul
Feb 15, 2001, 1:51:59 PM
Kanoj Sarcar wrote:
>
> Okay, I will quote from Intel Architecture Software Developer's Manual
> Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27:
>
> "Bus cycles to the page directory and page tables in memory are performed
> only when the TLBs do not contain the translation information for a
> requested page."
>
> And on the same page:
>
> "Whenever a page directory or page table entry is changed (including when
> the present flag is set to zero), the operating system must immediately
> invalidate the corresponding entry in the TLB so that it can be updated
> the next time the entry is referenced."
>

But there is another paragraph that mentions that an OS may use lazy tlb
shootdowns.
[search for shootdown]

You check the far too obvious chapters, remember that Intel wrote the
documentation ;-)
I searched for 'dirty' through Vol 3 and found

Chapter 7.1.2.1 Automatic locking.

.. the processor uses locked cycles to set the accessed and dirty flag
in the page-directory and page-table entries.

But that obviously doesn't answer your question.

Is the sequence
<< lock;
read pte
pte |= dirty
write pte
>> end lock;
or
<< lock;
read pte
if (!present(pte))
do_page_fault();
pte |= dirty
write pte.
>> end lock;
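Manfred's two candidate sequences can be written out as functions (a sketch with made-up names; the real behaviour is microcode, not C). Applied to a pte that ptep_get_and_clear() has just zeroed, the first sequence quietly writes a D bit that the following set_pte() overwrites, while the second faults so the page fault path and its locks can sort things out.

```c
#include <stdint.h>

#define PTE_PRESENT 0x01u
#define PTE_DIRTY   0x40u

/* First sequence: under the bus lock, OR in D unconditionally. */
static uint32_t dirty_set_blind(uint32_t pte, int *faulted)
{
    *faulted = 0;
    return pte | PTE_DIRTY;
}

/* Second sequence: re-check the present bit first, and fault if the
 * pte has been cleared in the meantime. */
static uint32_t dirty_set_checked(uint32_t pte, int *faulted)
{
    *faulted = 0;
    if (!(pte & PTE_PRESENT)) {
        *faulted = 1;               /* do_page_fault() */
        return pte;
    }
    return pte | PTE_DIRTY;
}
```

Only the second variant makes the mm/mprotect.c sequence safe; the thread is about which one the hardware actually implements.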

--
Manfred

Jamie Lokier
Feb 15, 2001, 1:42:46 PM
Kanoj Sarcar wrote:
> > Here's the important part: when processor 2 wants to set the pte's dirty
> > bit, it *rereads* the pte and *rechecks* the permission bits again.
> > Even though it has a non-dirty TLB entry for that pte.
> >
> > That is how I read Ben LaHaise's description, and his test program tests
> > exactly this.
>
> Okay, I will quote from Intel Architecture Software Developer's Manual
> Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27:
>
> "Bus cycles to the page directory and page tables in memory are performed
> only when the TLBs do not contain the translation information for a
> requested page."
>
> And on the same page:
>
> "Whenever a page directory or page table entry is changed (including when
> the present flag is set to zero), the operating system must immediately
> invalidate the corresponding entry in the TLB so that it can be updated
> the next time the entry is referenced."
>
> So, it looks highly unlikely to me that the basic assumption about how
> x86 works wrt tlb/ptes in the ptep_get_and_clear() solution is correct.

To me those quotes don't address the question we're asking. We know
that bus cycles _do_ occur when a TLB entry is switched from clean to
dirty, and furthermore they are locked cycles. (Don't ask me how I know
this though).

Does that mean, in jargon, the TLB does not "contain
the translation information" for a write?

The second quote: sure, if we want the TLB updated we have to flush it.
And eventually in mm/mprotect.c we do. But what about before then? It
keeps on using the old TLB entry? That's ok. If the entry was already
dirty then we don't mind if processor 2 continues with the old TLB entry
for a while, until we do the big TLB range flush.

In other words I don't think those two quotes address our question at
all.

What worries me more is that this is quite a subtle requirement, and the
code in mm/mprotect.c is not specific to one architecture. Do all SMP
CPUs supported by Linux do the same thing when converting TLB entries
from clean to dirty, or do they have a subtle, easily missed data
integrity problem?

-- Jamie

Kanoj Sarcar
Feb 15, 2001, 1:57:16 PM

Agreed. But these are the only relevant quotes I could come up with. And
to me, these quotes make the ptep_get_and_clear() assumption look risky
at best ... even though they do not give clear answers either way.

>
> What worries me more is that this is quite a subtle requirement, and the
> code in mm/mprotect.c is not specific to one architecture. Do all SMP
> CPUs supported by Linux do the same thing when converting TLB entries
> from clean to dirty, or do they have a subtle, easily missed data
> integrity problem?

No, not all architectures have this problem. For example, if the
Linux "dirty" (not the pte dirty) bit is managed by software, a fault
will actually be taken when processor 2 tries to do the write. The fault
is solely to make sure that the Linux "dirty" bit can be tracked. As long
as the fault handler grabs the right locks before updating the Linux "dirty"
bit, things should be okay. This is the case with mips, for example.

The problem with x86 is that we depend on automatic x86 dirty bit
update to manage the Linux "dirty" bit (they are the same!). So appropriate
locks are not grabbed.
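The software-managed scheme Kanoj describes for mips can be sketched like this (hypothetical names and a pthread mutex standing in for the page table lock; real mips dirty handling lives in the TLB refill and fault paths, not in C like this). The point is only that the first write faults, and the handler updates the "dirty" state under the same lock that ptep_get_and_clear() would hold, so there is no unlocked hardware update to race with.

```c
#include <pthread.h>
#include <stdint.h>

#define PTE_VALID   0x1u
#define PTE_WRITE   0x2u
#define SOFT_DIRTY  0x4u   /* hypothetical software-managed dirty bit */

static uint32_t pte = PTE_VALID;   /* write-protected: first store faults */
static pthread_mutex_t page_table_lock = PTHREAD_MUTEX_INITIALIZER;

/* The write-protect fault handler: record "dirty" and allow writes,
 * all under the page table lock. */
static void write_protect_fault(void)
{
    pthread_mutex_lock(&page_table_lock);
    pte |= SOFT_DIRTY | PTE_WRITE;
    pthread_mutex_unlock(&page_table_lock);
}

static void cpu_store(void)
{
    if (!(pte & PTE_WRITE))
        write_protect_fault();
    /* ... the store itself proceeds ... */
}
```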

Kanoj

Ben LaHaise
Feb 15, 2001, 2:06:30 PM
On Thu, 15 Feb 2001, Kanoj Sarcar wrote:

> No. All architectures do not have this problem. For example, if the
> Linux "dirty" (not the pte dirty) bit is managed by software, a fault
> will actually be taken when processor 2 tries to do the write. The fault
> is solely to make sure that the Linux "dirty" bit can be tracked. As long
> as the fault handler grabs the right locks before updating the Linux "dirty"
> bit, things should be okay. This is the case with mips, for example.
>
> The problem with x86 is that we depend on automatic x86 dirty bit
> update to manage the Linux "dirty" bit (they are the same!). So appropriate
> locks are not grabbed.

Will you please go off and prove that this "problem" exists on some x86
processor before continuing this rant? None of the PII, PIII, Athlon,
K6-2 or 486s I checked exhibited the worrisome behaviour you're
speculating about, plus it is logically consistent with the statements the
manual does make about updating ptes; otherwise how could an smp os
perform a reliable shootdown by doing an atomic bit clear on the present
bit of a pte?

-ben

Jamie Lokier
Feb 15, 2001, 2:07:01 PM
Manfred Spraul wrote:
> Is the sequence
> << lock;
> read pte
> pte |= dirty
> write pte
> >> end lock;
> or
> << lock;
> read pte
> if (!present(pte))
> do_page_fault();
> pte |= dirty
> write pte.
> >> end lock;

or more generally

<< lock;
read pte
if (!present(pte) || !writable(pte))
do_page_fault();
pte |= dirty
write pte.
>> end lock;

Not to mention, does it guarantee to use the newly read physical
address, does it check the supervisor permission again, does it use the
new PAT/CD/WT attributes?

I can vaguely imagine some COW optimisation where the pte is updated to
be writable with the new page's address, and there is no need to flush
other processor TLBs because they will do so when they first write to
the page. (But of course you have to be careful synchronising with
other uses of the shared page prior to the eventual TLB flush).

-- Jamie

Jamie Lokier
Feb 15, 2001, 2:19:45 PM
Kanoj Sarcar wrote:
> > Is the sequence
> > << lock;
> > read pte
> > pte |= dirty
> > write pte
> > >> end lock;
> > or
> > << lock;
> > read pte
> > if (!present(pte))
> > do_page_fault();
> > pte |= dirty
> > write pte.
> > >> end lock;
>
> No, it is a little more complicated. You also have to include the
> tlb state in this algorithm. Since that is what we are talking about.
> Specifically, what does the processor do when it has a tlb entry allowing
> RW, the processor has only done reads using the translation, and the
> in-memory pte is clear?

Yes (no to the no): Manfred's pseudo-code is exactly the question you're
asking. Because when the TLB entry is non-dirty and you do a write, we
_know_ the processor will do a locked memory cycle to update the dirty
bit. A locked memory cycle implies read-modify-write, not "write TLB
entry + dirty" (which would be a plain write) or anything like that.

Given you know it's a locked cycle, the only sensible design from Intel
is going to be one of Manfred's scenarios.
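The "locked cycle implies read-modify-write" point can be illustrated with C11 atomics standing in for the bus lock (a sketch, not what the CPU literally does): two agents OR-ing different bits into the same word with an atomic RMW can never lose each other's update, whereas a plain "write back TLB copy + D" store could overwrite a concurrent change.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t pte;

/* Each thread ORs one bit into the shared pte, like a "lock or". */
static void *set_bit(void *bit)
{
    atomic_fetch_or(&pte, (uint32_t)(uintptr_t)bit);
    return NULL;
}

/* Two concurrent locked updates: both bits always survive. */
static uint32_t concurrent_bit_sets(void)
{
    pthread_t t1, t2;
    atomic_store(&pte, 0);
    pthread_create(&t1, NULL, set_bit, (void *)0x40); /* D bit */
    pthread_create(&t2, NULL, set_bit, (void *)0x20); /* A bit */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&pte);
}
```

Of course, as the rest of the thread argues, knowing the cycle is an atomic RMW still does not tell you *what* the read-modify-write does with a cleared pte; that is the open question.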

An interesting thought experiment though is this:

<< lock;
read pte
pte |= dirty
write pte
>> end lock;

if (!present(pte))
do_page_fault();

It would have a mighty odd effect wouldn't it?

Kanoj Sarcar
Feb 15, 2001, 2:05:00 PM
>
> Kanoj Sarcar wrote:
> >
> > Okay, I will quote from Intel Architecture Software Developer's Manual
> > Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27:
> >
> > "Bus cycles to the page directory and page tables in memory are performed
> > only when the TLBs do not contain the translation information for a
> > requested page."
> >
> > And on the same page:
> >
> > "Whenever a page directory or page table entry is changed (including when
> > the present flag is set to zero), the operating system must immediately
> > invalidate the corresponding entry in the TLB so that it can be updated
> > the next time the entry is referenced."
> >
>
> But there is another paragraph that mentions that an OS may use lazy tlb
> shootdowns.
> [search for shootdown]
>
> You check the far too obvious chapters, remember that Intel wrote the
> documentation ;-)

:-) :-)

The good part is, there are a lot of Intel folks now active on Linux,
I can go off and ask one of them, if we are sufficiently confused. I
am trying to see whether we are.

> I searched for 'dirty' through Vol 3 and found
>
> Chapter 7.1.2.1 Automatic locking.
>
> .. the processor uses locked cycles to set the accessed and dirty flag
> in the page-directory and page-table entries.
>
> But that obviously doesn't answer your question.
>

> Is the sequence
> << lock;
> read pte
> pte |= dirty
> write pte
> >> end lock;
> or
> << lock;
> read pte
> if (!present(pte))
> do_page_fault();
> pte |= dirty
> write pte.
> >> end lock;

No, it is a little more complicated. You also have to include the
tlb state in this algorithm. Since that is what we are talking about.
Specifically, what does the processor do when it has a tlb entry allowing
RW, the processor has only done reads using the translation, and the
in-memory pte is clear?

Kanoj

>
> --
> Manfred

Kanoj Sarcar
Feb 15, 2001, 2:19:52 PM
>
> On Thu, 15 Feb 2001, Kanoj Sarcar wrote:
>
> > No. All architectures do not have this problem. For example, if the
> > Linux "dirty" (not the pte dirty) bit is managed by software, a fault
> > will actually be taken when processor 2 tries to do the write. The fault
> > is solely to make sure that the Linux "dirty" bit can be tracked. As long
> > as the fault handler grabs the right locks before updating the Linux "dirty"
> > bit, things should be okay. This is the case with mips, for example.
> >
> > The problem with x86 is that we depend on automatic x86 dirty bit
> > update to manage the Linux "dirty" bit (they are the same!). So appropriate
> > locks are not grabbed.
>
> Will you please go off and prove that this "problem" exists on some x86
> processor before continuing this rant? None of the PII, PIII, Athlon,

And will you please stop behaving like this is not an issue?

> K6-2 or 486s I checked exhibited the worrisome behaviour you're

And I maintain that this kind of race condition cannot be tickled
deterministically. There might be some piece of logic (or the absence of
it) that makes your finding over a thousand runs irrelevant.

> speculating about, plus it is logically consistent with the statements the
> manual does make about updating ptes; otherwise how could an smp os

Don't say this anymore, especially if you cannot point me to the specs.

> perform a reliable shootdown by doing an atomic bit clear on the present
> bit of a pte?

OS clears present bit, processors can keep using their TLBs and access
the page, no problems at all. That is why after clearing the present bit,
the processor must flush all tlbs before it can assume no one is using
the page. Hardware updated access bit could also be a problem, but an
error there does not destroy data, it just leads the os to choosing the
wrong page to evict during memory pressure.

Kanoj

>
> -ben
>

Linus Torvalds
Feb 15, 2001, 3:31:15 PM
In article <2001021520...@pcep-jamie.cern.ch>,
Jamie Lokier <l...@tantalophile.demon.co.uk> wrote:
>> > << lock;
>> > read pte
>> > if (!present(pte))
>> > do_page_fault();
>> > pte |= dirty
>> > write pte.
>> > >> end lock;
>>
>> No, it is a little more complicated. You also have to include in the
>> tlb state into this algorithm. Since that is what we are talking about.
>> Specifically, what does the processor do when it has a tlb entry allowing
>> RW, the processor has only done reads using the translation, and the
>> in-memory pte is clear?
>
>Yes (no to the no): Manfred's pseudo-code is exactly the question you're
>asking. Because when the TLB entry is non-dirty and you do a write, we
>_know_ the processor will do a locked memory cycle to update the dirty
>bit. A locked memory cycle implies read-modify-write, not "write TLB
>entry + dirty" (which would be a plain write) or anything like that.
>
>Given you know it's a locked cycle, the only sensible design from Intel
>is going to be one of Manfred's scenarios.

Not necessarily, and this is NOT guaranteed by the docs I've seen.

It _could_ be that the TLB data actually also contains the pointer to
the place where it was fetched, and a "mark dirty" becomes

read *ptr locked
val |= D
write *ptr unlock

Now, I will agree that I suspect most x86 _implementations_ will not do
this. TLB's are too timing-critical, and nobody tends to want to make
them bigger than necessary - so saving off the source address is
unlikely. Also, setting the D bit is not a very common operation, so
it's easy enough to say that an internal D-bit-fault will just cause a
TLB re-load, where the TLB re-load just sets the A and D bits as it
fetches the entry (and then page fault handling is an automatic result
of the reload).

However, the _implementation_ detail is not, as far as I can tell,
explicitly defined by the architecture. And in another post I quote a
book by the designers of the original 80386 that implies strongly that
the "re-walk the page tables on D miss" assumption is not what they
_meant_ for the architecture design, even if they probably happened to
implement it that way.

>An interesting thought experiment though is this:
>
><< lock;
>read pte
>pte |= dirty
>write pte
>>> end lock;
>
>if (!present(pte))
> do_page_fault();
>
>It would have a mighty odd effect wouldn't it?

Why do you insist on the !present() check at all? It's not implied by
the architecture - a correctly functioning OS is not supposed to ever
be able to cause it according to specs.

I think Kanoj is right to be worried. I _do_ believe that the current
Linux code works on "all current hardware". But I think Kanoj has a
valid point in that it's not guaranteed to work in the future.

That said, I think Intel tends to be fairly pragmatic in their design
(that's the nice way of saying that Intel CPU's tend to dismiss the
notion of "beautiful architecture" completely over the notion of "let's
make it work"). And I would be extremely surprised indeed if especially
MS Windows didn't do some really bad things with the TLB. In fact, I
think I can say from personal experience that I pretty much _know_
windows has big bugs in TLB invalidation.

And because of that, it may be that nobody can ever create a
x86-compatible CPU that does anything but "re-walk the TLB tables on
_anything_ fishy going on with the TLB".

(Basically, it seems to be pretty much a fact of life that the x86
architecture will NOT raise a page protection fault directly from the
TLB content - it will re-walk the page tables before it actually raises
the fault, and only the act of walking the page tables and finding that
it really _should_ fault will raise an x86-level fault. It all boils
down to "never trust the TLB more than you absolutely have to").

Linus

Linus Torvalds
Feb 15, 2001, 3:16:36 PM
In article <2001021519...@google.engr.sgi.com>,
Kanoj Sarcar <ka...@google.engr.sgi.com> wrote:
>>
>> Will you please go off and prove that this "problem" exists on some x86
>> processor before continuing this rant? None of the PII, PIII, Athlon,
>
>And will you please stop behaving like this is not an issue?

This is documented in at least

Programming the 80386
John Crawford & Patrick Gelsinger

which is still the best book I've ever seen on the x86 architecture. See
page 477, "Memory management, Protection, and Tasks", under "Multiple-
Processor Considerations". And I quote:

"Before changing a page table entry that may be used on another
procesor <sic>, software should use a locked AND instruction to
clear the P bit to 0 in an indivisible operation. Then the
entry can be changed as required, and made available by later
setting the P bit to 1.

At some point in the modification of a page table entry, all
processors in the system that may have had the entry cached must
be notified (usually with an interrupt) to flush their page
translation caches to remove any old copies of the entry. Until
these old copies are flushed, the processors can continue to
access the old page, and may also set the D bit in the entry
being modified. If this may case the modification of the entry
to fail, the paging caches should be flushed after the entry is
marked not present, but before the entry is otherwise modified".

Note the last sentence - that's the one that really matters to this
discussion.

And it does imply that the read-and-clear thing is not the right thing
to do and is not guaranteed to fix the race (even if I personally
suspect that all current x86 implementations will just re-walk the page
tables and set the D bit the same way they set the A bit, and basically
making the usage an "argument" to the page table walker logic).

However, I suspect that we could extend it to just re-read the entry
(which _should_ be zero, but could have the D bit set) after having
invalidated the TLB on the other CPU's. But Gelsinger suggests just
clearing the P bit - which is easily enough done, as the following
modification would be needed anyway in mm/vmscan.c:

pte = ptep_get_and_clear(page_table);
flush_tlb_page(vma, address);
+ pte = ptep_update_after_flush(page_table, pte);

where "ptep_update_after_flush()" would be a no-op on UP, and on SMP it
would just or in the D bit (which should be practically always zero)
from the page table entry into the pte.
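As a sketch, the suggested helper would just fold back any D bit another processor managed to set in the cleared in-memory entry during the window (the name comes from the diff above; this is an illustration of the idea, not real kernel code):

```c
#include <stdint.h>

#define PTE_DIRTY 0x40u

/* After the cross-CPU flush, re-read the (nominally zero) page table
 * slot and OR any D bit set in the race window into the saved pte. */
static uint32_t ptep_update_after_flush(const uint32_t *page_table,
                                        uint32_t pte)
{
    return pte | (*page_table & PTE_DIRTY);
}
```

This only closes the race if, once the remote TLBs are flushed, no further hardware update of the slot can happen, which is exactly what the flush guarantees.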

Just clearing the P bit actually makes "out_unlock_restore:" simpler: it
becomes a simple

lock ; orl $1, page_table

which makes the worry about overwriting the D bit at that point go away
(although, considering where we invalidate the TLB's and that we should
now have had the correct D bits anyway, the non-locked simple store
should also work reliably).

The case of munmap() is more worrisome, and has much worse performance
issues. Ben's experimental shootdown patch would appear to not be good
enough. The only simple solution is the "gather" operation that I've
already suggested because of its obvious correctness and simplicity.

A potential alternative would be to walk the page tables twice, and make
the page table zapping be a two-phase process. We'd only need to do
this when the "mm->cpu_vm_mask" bits implied that other CPU's might have
TLB entries, so we could avoid the double work for the normal case.

HOWEVER, this is also the only case where a CPU "gather" operation would
be necessary, so the thing basically boils down to the question of
whether "gather" or "double walk" is the more expensive operation.

The "gather" operation could possibly be improved to make the other
CPU's do useful work while being shot down (ie schedule away to another
mm), but that has its own pitfalls too.

>OS clears present bit, processors can keep using their TLBs and access
>the page, no problems at all. That is why after clearing the present bit,
>the processor must flush all tlbs before it can assume no one is using
>the page. Hardware updated access bit could also be a problem, but an
>error there does not destroy data, it just leads the os to choosing the
>wrong page to evict during memory pressure.

The bible (see above) explicitly mentions only the D bit here - the A
bit is set at page table walk time, and is explicitly NOT set if the P
bit is clear at walk time, so there is no apparent race on that.

But as the A bit isn't very important anyway, the apparent lack of a
race is not all that interesting.

Manfred Spraul
Feb 15, 2001, 4:29:11 PM
Manfred Spraul wrote:
>
> I just benchmarked a single flush_tlb_page().
>
> Pentium II 350: ~ 2000 cpu ticks.
> Pentium III 850: ~ 3000 cpu ticks.
>
I forgot the important part:
SMP, including a smp_call_function() IPI.

IIRC Ingo wrote that a local 'invlpg' is around 100 ticks.

--
Manfred

Manfred Spraul
Feb 15, 2001, 4:26:50 PM

Jamie wrote "one of my scenarios", that's the other option ;-)

> Now, I will agree that I suspect most x86 _implementations_ will not do
> this. TLB's are too timing-critical, and nobody tends to want to make
> them bigger than necessary - so saving off the source address is
> unlikely. Also, setting the D bit is not a very common operation, so
> it's easy enough to say that an internal D-bit-fault will just cause a
> TLB re-load, where the TLB re-load just sets the A and D bits as it
> fetches the entry (and then page fault handling is an automatic result
> of the reload).
>

But then the cpu would have to support setting the D bit in the page
directory, and it doesn't.

Probably Kanoj is right, the current code is not guaranteed by the
specs.

But if we change the interface, could we think about the poor s390
developers?

s390 only has a "clear the present bit in the pte and flush the tlb"
instruction.


From your other post:


> pte = ptep_get_and_clear(page_table);
> flush_tlb_page(vma, address);
>+ pte = ptep_update_after_flush(page_table, pte);

What about one arch specific

pte = ptep_get_and_invalidate(vma, address, page_table);


On i386 SMP it would

{
        pte = *page_table_entry;
        if (!present(pte))
                return pte;
        lock; andl 0xfffffffe, *page_table_entry;
        flush_tlb_page();
        return *page_table_entry | 1;
}

>
> The "gather" operation could possibly be improved to make the other
> CPU's do useful work while being shot down (ie schedule away to another
> mm), but that has its own pitfalls too.
>

IMHO scheduling away is the best long term solution.
Perhaps try to schedule away, just to improve the probability that
mm->cpu_vm_mask is clear.

I just benchmarked a single flush_tlb_page().

Pentium II 350: ~ 2000 cpu ticks.
Pentium III 850: ~ 3000 cpu ticks.

--

Jamie Lokier
Feb 15, 2001, 6:57:42 PM
Linus Torvalds wrote:
> It _could_ be that the TLB data actually also contains the pointer to
> the place where it was fetched, and a "mark dirty" becomes
>
> read *ptr locked
> val |= D
> write *ptr unlock

If you want to take it really far, it _could_ be that the TLB data
contains both the pointer and the original pte contents. Then "mark
dirty" becomes

val |= D
write *ptr

> Now, I will agree that I suspect most x86 _implementations_ will not do
> this. TLB's are too timing-critical, and nobody tends to want to make
> them bigger than necessary - so saving off the source address is
> unlikely.

Then again, these hypothetical addresses etc. aren't part of the
associative lookup, so could be located in something like an ordinary
cache ram, with just an index in the TLB itself.

-- Jamie

Linus Torvalds

Feb 15, 2001, 8:21:28 PM

On Thu, 15 Feb 2001, Manfred Spraul wrote:
>
> > Now, I will agree that I suspect most x86 _implementations_ will not do
> > this. TLB's are too timing-critical, and nobody tends to want to make
> > them bigger than necessary - so saving off the source address is
> > unlikely. Also, setting the D bit is not a very common operation, so
> > it's easy enough to say that an internal D-bit-fault will just cause a
> > TLB re-load, where the TLB re-load just sets the A and D bits as it
> > fetches the entry (and then page fault handling is an automatic result
> > of the reload).
>
> But then the cpu would also set the D bit in the page directory,
> and it doesn't.

Not necessarily. The TLB walker is a nasty piece of business, and
simplifying it as much as possible is important for hardware. Not setting
the D bit in the page directory is likely to be because it is unnecessary,
and not because it couldn't be done.

> But if we change the interface, could we think about the poor s390
> developers?
>
> s390 only has a "clear the present bit in the pte and flush the tlb"
> instruction.

Now, that ends up being fairly close to what it seems mm/vmscan.c needs to
do, so yes, it would not necessarily be a bad idea to join the
"ptep_get_and_clear()" and "flush_tlb_page()" operations into one.

However, the mm/memory.c use (ie region unmapping with zap_page_range())
really needs to do something different, because it inherently works with a
range of entries, and abstracting it to be a per-entry thing would be
really bad for performance anywhere else (S/390 might be ok with it,
assuming that their special instruction is really fast - I don't know. But
I do know that everybody else wants to do it with one single flush for the
whole region, especially for SMP).

> Perhaps try to schedule away, just to improve the probability that
> mm->cpu_vm_mask is clear.
>
> I just benchmarked a single flush_tlb_page().
>
> Pentium II 350: ~ 2000 cpu ticks.
> Pentium III 850: ~ 3000 cpu ticks.

Note that there is some room for concurrency here - we can fire off the
IPI, and continue to do "local" work until we actually need the "results"
in the form of stable D bits etc. So we _might_ want to take this into
account in the interfaces: allow for a "prepare_to_gather()" which just
sends the IPI but doesn't wait for it to necessarily get accepted, and
then only by the time we actually start checking the dirty bits (ie the
second phase, after we've invalidated the page tables) do we need to wait
and make sure that nobody else is using the TLB any more.

Done right, this _might_ be of the type

- prepare_to_gather(): sends IPI to all CPU's indicated in
mm->cpu_vm_mask
- go on, invalidating all VM entries
- busy-wait until "mm->cpu_vm_mask" only contains the local CPU (where
the busy-wait is hopefully not a wait at all - the other CPU's would
have exited the mm while we were cleaning up the page tables)
- go back, gather up any potential dirty bits and free the pages
- release the mm

Note that there are tons of optimizations for the common case: for
example, if we're talking about private read-only mappings, we can
possibly skip some or all of this, because we know that we simply won't
care about whether the pages were dirty or not as they're going to be
thrown away in any case.

So we can have several layers of optimizations: for UP or the SMP case
where we have "mm->cpu_vm_mask & ~(1 << current_cpu) == 0" we don't need
the IPI or the careful multi-CPU case at all. And for private stuff, we
need the careful invalidation, but we don't need to go back and gather the
dirty bits. So the only case that ends up being fairly heavy may be a case
that is very uncommon in practice (only for unmapping shared mappings in
threaded programs or the lazy TLB case).

I suspect getting a good interface for this, so that zap_page_range()
doesn't end up being the function from hell, is the most important thing.

Linus

Linus Torvalds

Feb 15, 2001, 7:55:02 PM

On Fri, 16 Feb 2001, Jamie Lokier wrote:
>
> If you want to take it really far, it _could_ be that the TLB data
> contains both the pointer and the original pte contents. Then "mark
> dirty" becomes
>
> val |= D
> write *ptr

No. This is forbidden by the intel documentation. First off, the
documentation clearly states that it's a locked r-m-w cycle.

Secondly, the documentation also makes it clear that the CPU page table
accesses work correctly in SMP environments, which the above simply would
not do. It doesn't allow for people marking the entry invalid, which is
documented to work (see the very part I quoted).

So while the above could be a valid TLB writeback strategy in general for
some hypothetical architecture, it would _not_ be an x86 CPU any more if
it acted that way.

So a plain "just write out our cached value" is definitely not legal.

> > Now, I will agree that I suspect most x86 _implementations_ will not do
> > this. TLB's are too timing-critical, and nobody tends to want to make
> > them bigger than necessary - so saving off the source address is
> > unlikely.
>

> Then again, these hypothetical addresses etc. aren't part of the
> associative lookup, so could be located in something like an ordinary
> cache ram, with just an index in the TLB itself.

True. I'd still consider it unlikely for the other reasons (ie this is not
a timing-critical part of the normal CPU behaviour), but you're right - it
could be done without making the actual TLB any bigger or different, by
just having the TLB fill routine having a separate "source cache" that the
dirty-marking can use.

Jamie Lokier

Feb 16, 2001, 9:18:39 AM
Linus Torvalds wrote:
> So the only case that ends up being fairly heavy may be a case that is
> very uncommon in practice (only for unmapping shared mappings in
> threaded programs or the lazy TLB case).

I can think of one case where performance is considered quite important:
mprotect() is used by several garbage collectors, including threaded
ones. Maybe mprotect() isn't the best primitive for those anyway, but
it's what they have to work with atm.

-- Jamie

Manfred Spraul

Feb 16, 2001, 9:59:17 AM
Jamie Lokier wrote:
>
> Linus Torvalds wrote:
> > So the only case that ends up being fairly heavy may be a case that is
> > very uncommon in practice (only for unmapping shared mappings in
> > threaded programs or the lazy TLB case).
>
The lazy tlb case is quite fast: a lazy tlb thread never writes to user
space pages, so we don't need to protect the dirty bits. And the first ipi
clears mm->cpu_vm_mask, so only one ipi is needed.

>
> I can think of one case where performance is considered quite important:
> mprotect() is used by several garbage collectors, including threaded
> ones. Maybe mprotect() isn't the best primitive for those anyway, but
> it's what they have to work with atm.
>

Does mprotect() actually care for wrong dirty bits?
The race should be invisible to user space apps.

>>>>>>> mprotect()
for_all_affected_ptes() {
lock andl ~PERMISSION_MASK, *pte;
lock orl new_permission, *pte;
}
< now another cpu could still write to the write protected pages
< and set the dirty bit, but who cares? Shouldn't be a problem.
flush_tlb_range().
< tlb flush before ending the syscall, user space can't notice
< the delay.
<<<<

--
Manfred

Jamie Lokier

Feb 16, 2001, 10:27:41 AM
Manfred Spraul wrote:
> > I can think of one case where performance is considered quite important:
> > mprotect() is used by several garbage collectors, including threaded
> > ones. Maybe mprotect() isn't the best primitive for those anyway, but
> > it's what they have to work with atm.
>
> Does mprotect() actually care for wrong dirty bits?
> The race should be invisible to user space apps.
>
> >>>>>>> mprotect()
> for_all_affected_ptes() {
> lock andl ~PERMISSION_MASK, *pte;
> lock orl new_permission, *pte;
> }
> < now another cpu could still write to the write protected pages
> < and set the dirty bit, but who cares? Shouldn't be a problem.
> flush_tlb_range().
> < tlb flush before ending the syscall, user space can't notice
> < the delay.
> <<<<

The user-space app doesn't even _know_ about dirty bits.

I don't think there's even the possibility of losing dirty bits with
mprotect(), so long as pte_modify doesn't clear the dirty bit, which it
doesn't, in this code:

/* mprotect.c */


entry = ptep_get_and_clear(pte);
set_pte(pte, pte_modify(entry, newprot));

I.e. the only code with the race condition is code which explicitly
clears the dirty bit, in vmscan.c.

Do you see any possibility of losing a dirty bit here?

If not, there's no need for the intricate "gather" or "double scan"
schemes for mprotect() and it can stay as fast as possible.

Btw, a possible mprotect optimisation: there is no need for
flush_tlb_range() when increasing permissions.

-- Jamie

Manfred Spraul

Feb 16, 2001, 10:54:44 AM
Jamie Lokier wrote:
>
> /* mprotect.c */
> entry = ptep_get_and_clear(pte);
> set_pte(pte, pte_modify(entry, newprot));
>
> I.e. the only code with the race condition is code which explicitly
> clears the dirty bit, in vmscan.c.
>
> Do you see any possibility of losing a dirty bit here?
>
Of course.
Just check the output after preprocessing.
It's
int entry;
entry = *pte;
entry &= ~_PAGE_CHG_MASK;
entry |= pgprot_val(newprot)
*pte = entry;

We need
atomic_clear_mask (_PAGE_CHG_MASK, pte);
atomic_set_mask (pgprot_val(newprot), *pte);

for multi threaded apps.

> If not, there's no need for the intricate "gather" or "double scan"
> schemes for mprotect() and it can stay as fast as possible.
>

Correct, but we need a platform specific "update_pte", and perhaps
update_begin, update_end hooks (empty on i386) for other archs.

--
Manfred

Jamie Lokier

Feb 16, 2001, 11:00:29 AM
Manfred Spraul wrote:
> > entry = ptep_get_and_clear(pte);
> > set_pte(pte, pte_modify(entry, newprot));
> >
> > I.e. the only code with the race condition is code which explicitly
> > clears the dirty bit, in vmscan.c.
> >
> > Do you see any possibility of losing a dirty bit here?
> >
> Of course.
> Just check the output after preprocessing.
> It's
> int entry;
> entry = *pte;
> entry &= ~_PAGE_CHG_MASK;
> entry |= pgprot_val(newprot)
> *pte = entry;

And how does that lose a dirty bit?

For the other processor to not write a dirty bit, it must have a dirty
TLB entry already which, along with the locked cycle in
ptep_get_and_clear, means that `entry' will have _PAGE_DIRTY set. The
dirty bit is not lost.

> We need
> atomic_clear_mask (_PAGE_CHG_MASK, pte);
> atomic_set_mask (pgprot_val(newprot), *pte);
>
> for multi threaded apps.

cmpxchg is probably faster.

-- Jamie

Manfred Spraul

Feb 16, 2001, 11:23:40 AM
Jamie Lokier wrote:
>
> And how does that lose a dirty bit?
>
> For the other processor to not write a dirty bit, it must have a dirty
^^^^^^^^^^^

> TLB entry already which, along with the locked cycle in
> ptep_get_and_clear, means that `entry' will have _PAGE_DIRTY set. The
> dirty bit is not lost.
>
The other cpu writes the dirty bit - we just overwrite it ;-)
After the ptep_get_and_clear(), before the set_pte().

The current assumption about the page dirty logic is:
A cpu that has a writable, non-dirty pte cached in its tlb may
unconditionally set the dirty bit - without honoring the present or write
protected bits.

--> set_pte() can either lose a dirty bit or a 'pte_none() entry' could
suddenly become a swap entry unless it's guaranteed that no cpu has a
cached valid tlb entry.

Linus, does the proposed pte gather code handle the second part?
pte_none() suddenly becomes 0x0040.

Back to the current mprotect.c code:

pte is writable, not-dirty.

cpu1:
has a writable, non-dirty pte in its tlb.
cpu 2: in mprotect.c
entry = ptep_get_and_clear(pte);
* pte now clear.
* entry contains the pte value without
the dirty bit
cpu decodes a write instruction, and dirties the pte.
lock; orl DIRTY_BIT, *pte
set_pte(pte, pte_modify(entry, newprot));
* pte overwritten with entry.

--> dirty bit lost.

--
Manfred

Jamie Lokier

Feb 16, 2001, 11:43:16 AM
Manfred Spraul wrote:
> The other cpu writes the dirty bit - we just overwrite it ;-)
> After the ptep_get_and_clear(), before the set_pte().

Ah, I see. The other CPU does an atomic *pte |= _PAGE_DIRTY, without
checking the present bit. ('scuse me for temporary brain failure).

How about a pragmatic solution.

Given that Ben's found that "checks pte_present on dirtying" works in
practice, and it is _much_ simpler to do things that way, perhaps we
could write a boot time test for this?

If the boot time test fails, we

(a) printk("Sorry we've never seen a CPU like this, please report");

(b) Put this in ptep_get_and_clear:

if (tlb_dirty_doesnt_sync)
flush_tlb_page(page)

It should be fast on known CPUs, correct on unknown ones, and much
simpler than "gather" code which may be completely unnecessary and
rather difficult to test.

If anyone reports the message, _then_ we think about the problem some more.

Ben, fancy writing a boot-time test?

-- Jamie

Ben LaHaise

Feb 16, 2001, 12:29:41 PM
On Fri, 16 Feb 2001, Jamie Lokier wrote:

> It should be fast on known CPUs, correct on unknown ones, and much
> simpler than "gather" code which may be completely unnecessary and
> rather difficult to test.
>
> If anyone reports the message, _then_ we think about the problem some more.
>
> Ben, fancy writing a boot-time test?

Sure, I'll whip one up this afternoon.

-ben

Jamie Lokier

Feb 16, 2001, 12:20:20 PM
Manfred Spraul wrote:
> Ok, is there one case where your pragmatic solution is vastly faster?

> * mprotect: No. The difference is at most one additional locked
> instruction for each pte.

Oh, what instruction is that?

> * munmap(anon): No. We must handle delayed accesses anyway (don't call
> free_pages_ok() until flush_tlb_ipi returned). The difference is that we
> might have to perform a second pass to clear any spurious 0x40 bits.

That second pass is what I had in mind.

> * munmap(file): No. Second pass required for correct msync behaviour.

It is?

-- Jamie

Manfred Spraul

Feb 16, 2001, 12:12:12 PM
Jamie Lokier wrote:
>
> Manfred Spraul wrote:
> > The other cpu writes the dirty bit - we just overwrite it ;-)
> > After the ptep_get_and_clear(), before the set_pte().
>
> Ah, I see. The other CPU does an atomic *pte |= _PAGE_DIRTY, without
> checking the present bit. ('scuse me for temporary brain failure).
>
> How about a pragmatic solution.
>
Ok, is there one case where your pragmatic solution is vastly faster?

* mprotect: No. The difference is at most one additional locked
instruction for each pte.

* munmap(anon): No. We must handle delayed accesses anyway (don't call
free_pages_ok() until flush_tlb_ipi returned). The difference is that we
might have to perform a second pass to clear any spurious 0x40 bits.

* munmap(file): No. Second pass required for correct msync behaviour.

* try_to_swap_out(): No, just another memory read.

Any other cases?


>
> Ben, fancy writing a boot-time test?
>

I'd never rely on such a test - what if the cpu checks in 99% of the
cases, but doesn't handle some cases ('rep movd', everything unaligned,
...)? And check the Pentium III errata. There is one with the tlb
that's only triggered if 4 instructions lie in a certain window and all
access memory in the same way of the tlb (EFLAGS incorrect if 'andl
mask,<memory_addr>' causes a page fault).

--
Manfred

Ben LaHaise

Feb 16, 2001, 12:44:05 PM
On Fri, 16 Feb 2001, Linus Torvalds wrote:

> How do you expect to ever see this in practice? Sounds basically
> impossible to test for this hardware race. The obvious "try to dirty as
> fast as possible on one CPU while doing an atomic get-and-clear on the
> other" thing is not valid - it's in fact quite likely to get into
> lock-step because of page table cache movement synchronization. And as
> such it could hide any race.

That's not the behaviour I'm testing, but whether the CPU is doing

lock
pte = *ptep
if (present && writable)
pte |= dirty
*ptep = pte
unlock

versus

lock
pte = *ptep
pte |= dirty
*ptep = pte
unlock

Which can be tested by means of getting the pte into the tlb then changing
the pte without flushing and observing the results (page fault vs changed
pte). I'm willing to bet that all cpus are doing the first version.

-ben

Linus Torvalds

Feb 16, 2001, 12:38:08 PM

On Fri, 16 Feb 2001, Ben LaHaise wrote:

> On Fri, 16 Feb 2001, Jamie Lokier wrote:
>
> > It should be fast on known CPUs, correct on unknown ones, and much
> > simpler than "gather" code which may be completely unnecessary and
> > rather difficult to test.
> >
> > If anyone reports the message, _then_ we think about the problem some more.
> >

> > Ben, fancy writing a boot-time test?
>

> Sure, I'll whip one up this afternoon.

How do you expect to ever see this in practice? Sounds basically
impossible to test for this hardware race. The obvious "try to dirty as
fast as possible on one CPU while doing an atomic get-and-clear on the
other" thing is not valid - it's in fact quite likely to get into
lock-step because of page table cache movement synchronization. And as
such it could hide any race.

Linus

Jamie Lokier

Feb 16, 2001, 12:37:07 PM
> > Ben, fancy writing a boot-time test?
> >
> I'd never rely on such a test - what if the cpu checks in 99% of the
> cases, but doesn't handle some cases ('rep movd', everything unaligned,
> ...)?

A good point. The test results are inconclusive.

> And check the Pentium III errata. There is one with the tlb
> that's only triggered if 4 instructions lie in a certain window and all
> access memory in the same way of the tlb (EFLAGS incorrect if 'andl
> mask,<memory_addr>' causes a page fault).

Nasty, but I don't see what an obscure and impossible to work around
processor bug has to do with this thread. It doesn't actually change
page fault handling, does it?

-- Jamie

Linus Torvalds

Feb 16, 2001, 12:36:07 PM

On Fri, 16 Feb 2001, Jamie Lokier wrote:

> Manfred Spraul wrote:
> > Ok, is there one case where your pragmatic solution is vastly faster?
>
> > * mprotect: No. The difference is at most one additional locked
> > instruction for each pte.
>

> Oh, what instruction is that?

The "set_pte()" thing could easily be changed into

lock ; orl pte,(ptepointer)

which actually should work as-is. We do not allow "set_pte()" on anything
but "pte_none()" entries anyway, so in the trivial case the "orl" is
exactly equivalent to a "movl". And in the (so far theoretical) case where
another CPU might have set the dirty bit, the locked "or" will again do
the right thing, and preserve it.

So that would basically be a one-liner that removes the set_pte() race for
mprotect() (and the vmscan.c case of re-establishing the pte, but as
vmscan needs to do something more anyway that part is probably not
interesting).

> > * munmap(anon): No. We must handle delayed accesses anyway (don't call
> > free_pages_ok() until flush_tlb_ipi returned). The difference is that we
> > might have to perform a second pass to clear any spurious 0x40 bits.
>

> That second pass is what I had in mind.
>

> > * munmap(file): No. Second pass required for correct msync behaviour.
>

> It is?

Not now it isn't. We just do a msync() + fsync() for msync(MS_SYNC). Which
is admittedly not optimal, but it works.

Linus

Linus Torvalds

Feb 16, 2001, 12:59:51 PM

On Fri, 16 Feb 2001, Manfred Spraul wrote:

> Jamie Lokier wrote:
> >
> > Linus Torvalds wrote:
> > > So the only case that ends up being fairly heavy may be a case that is
> > > very uncommon in practice (only for unmapping shared mappings in
> > > threaded programs or the lazy TLB case).
> >
> The lazy tlb case is quite fast: a lazy tlb thread never writes to user
> space pages, so we don't need to protect the dirty bits. And the first ipi
> clears mm->cpu_vm_mask, so only one ipi is needed.

This is NOT necessarily true in the generic case.

The lazy TLB thread itself may not write to the address space, but I can
in theory see a hardware implementation that delays writing out the dirty
bit from the TLB until it is invalidated. I agree that it is unlikely,
especially on an x86, but I think it's a case we should at least think
about for the generic kernel architecture.

Think of the TLB as a cache, and think of the dirty state as being either
write-through or write-back. Now, I will bet you that all current x86's
(a) _do_ actually check the P bit when writing D (ie current Linux code
is probably fine as-is, even if incorrect in theory)
and
(b) the D bit is write-through.

But even so, I want people to at least consider the case of a write-back
TLB dirty bit, in which case the real state of the D bit might not be
known until a TLB flush has been done (even on a UP machine - which is why
I'm certain that no current x86 actually does this optimization).

(And because of (a), I don't think I'll necessarily fix this during 2.4.x
anyway unless it gets fixed as a result of the generic TLB shootdown issue
which has nothing at all to do with the D bit)

Don't get too hung up on implementation details when designing a good
architecture for this thing.

Jamie Lokier

Feb 16, 2001, 1:09:19 PM
Manfred Spraul wrote:
> A very simple test might be
>
> cpu 1:
> cpu 2:

Ben's test uses only one CPU.

> Now start with variants:
> change to read only instead of not present
> a and b in the same way of the tlb, in a different way.
> change pte with write, change with lock;
> .
> .
> .
>
> But you'll never prove that you tested every combination.

Indeed.

-- Jamie

Manfred Spraul

Feb 16, 2001, 1:04:33 PM
Jamie Lokier wrote:
>
> > > Ben, fancy writing a boot-time test?
> > >
> > I'd never rely on such a test - what if the cpu checks in 99% of the
> > cases, but doesn't handle some cases ('rep movd', everything unaligned,
> > ...)?
>
> A good point. The test results are inconclusive.
>
> > And check the Pentium III errata. There is one with the tlb
> > that's only triggered if 4 instructions lie in a certain window and all
> > access memory in the same way of the tlb (EFLAGS incorrect if 'andl
> > mask,<memory_addr>' causes a page fault).
>
> Nasty, but I don't see what an obscure and impossible to work around
> processor bug has to do with this thread. It doesn't actually change
> page fault handling, does it?
>
Page fault handling is unchanged, but perhaps there are other races. And
note that these races wouldn't be processor bugs - the specs nowhere
guarantee the behaviour you assume.

Ben tries to prove that the current cpu _never_ sets the dirty bit
without checking the present bit.

A very simple test might be

cpu 1:
cli();
a = 0; b = 0; m = 0;
flush_local_tlb_page(a);
flush_local_tlb_page(b);
flush_local_tlb_page(a);
while(!m);
while (!a && !b);
a = 1;

cpu 2:
<wait>
cli();
both ptes for a and b as writable, not dirty.
m = 1;
udelay(100);
change the pte of a to not present.
wmb();
b = 1;

Now start with variants:
change to read only instead of not present
a and b in the same way of the tlb, in a different way.
change pte with write, change with lock;
.
.
.

But you'll never prove that you tested every combination.

--
Manfred

Linus Torvalds

Feb 16, 2001, 2:00:41 PM

On Fri, 16 Feb 2001, Manfred Spraul wrote:
>

> That leaves msync() - it currently does a flush_tlb_page() for every
> single dirty page.
> Is it possible to integrate that into the mmu gather code?

Not even necessary.

The D bit does not have to be coherent. We need to make sure that we flush
the TLB before we start the IO on the pages which clears the per-physical
D bit (so that no CPU will have done any modifications that didn't show up
in one of the D bits - whether virtual in the page tables or physical in
the memory map), but there are no other real requirements.

So you don't strictly need to gather them at all, although right now with
the type of setup we have I suspect it's hard to actually implement any
other way (because msync doesn't necessarily know when the IO has been
physically started and has no good way of hooking into it..).

Linus

Hugh Dickins

Feb 16, 2001, 1:36:58 PM
On Fri, 16 Feb 2001, Jamie Lokier wrote:
>
> > And check the Pentium III errata. There is one with the tlb
> > that's only triggered if 4 instructions lie in a certain window and all
> > access memory in the same way of the tlb (EFLAGS incorrect if 'andl
> > mask,<memory_addr>' causes a page fault).
>
> Nasty, but I don't see what an obscure and impossible to work around
> processor bug has to do with this thread. It doesn't actually change
> page fault handling, does it?

Obscure but not nasty: the copy of EFLAGS pushed onto the stack when
taking the fault is wrong, but once the instruction is restarted it
all sorts itself out (as I understand from the Spec Update).
Possible to work around, but just not worth the effort.

Nastier was its precursor, Pentium Pro Erratum #63, generated under
similar conditions: where the wrong (carry bit of) EFLAGS when faulting
in the middle of ADC, SBB, RCR or RCL could cause a wrong arithmetic
result when restarted. Perfectly possible to work around (only lower
permissions of a pte visible on another CPU while that CPU is pulled
into the kernel with an IPI), and necessary to work around it back
then (4 years ago) when the Pentium Pro was at the leading edge;
but I doubt it's worth redesigning now to suit an old erratum.

These errata do make the point that, whatever x86 specs say should
happen, Intel sometimes fails to match them; and the SMP TLB area
was certainly prone to errata at the time of the Pentium Pro -
but hopefully that means Intel exercise greater care there now.

Hugh

Manfred Spraul

Feb 16, 2001, 1:49:47 PM
Linus wrote:
>
> >
> > That second pass is what I had in mind.
> >
> > > * munmap(file): No. Second pass required for correct msync behaviour.
> >
> > It is?
>
> Not now it isn't. We just do a msync() + fsync() for msync(MS_SYNC). Which
> is admittedly not optimal, but it works.
>

Ok, munmap() will be fixed by the tlb shootdown changes - it also uses
zap_page_range().

That leaves msync() - it currently does a flush_tlb_page() for every
single dirty page.
Is it possible to integrate that into the mmu gather code?

tlb_transfer_dirty() in addition to tlb_clear_page()?

--
Manfred

Ben LaHaise

Feb 16, 2001, 2:02:10 PM
On Fri, 16 Feb 2001, Manfred Spraul wrote:

> That leaves msync() - it currently does a flush_tlb_page() for every
> single dirty page.
> Is it possible to integrate that into the mmu gather code?
>
> tlb_transfer_dirty() in addition to tlb_clear_page()?

Actually, in the filemap_sync case, the flush_tlb_page is redundant --
there's already a call to flush_tlb_range in filemap_sync after the dirty
bits are cleared. None of the cpus we support document having a writeback
tlb, and intel's docs explicitly state that they do not, as they state
that the dirty bit is updated on the first write to dirty the pte.

-ben

Ben LaHaise

Feb 16, 2001, 2:42:52 PM
On Fri, 16 Feb 2001, Linus Torvalds wrote:

> This is, actually, a problem that I suspect ends up being _very_ similar
> to the zap_page_range() case. zap_page_range() needs to make sure that
> everything has been updated by the time the page is actually free'd. While
> filemap_sync() needs to make sure that everything has been updated before
> the page is written out (or marked dirty - which obviously also guarantees
> the ordering, and makes the problems look even more similar).

Ah, I see what I was missing. So long as the tlb flush is in between the
ptep_test_and_clear_dirty and the set_page_dirty, we're fine (ie the
current code is good). If we really want to reduce the number of tlb
flushes, yes, we can use the gather code and then just do the
set_page_dirty after a tlb_flush_range.

Linus Torvalds

Feb 16, 2001, 2:32:27 PM

On Fri, 16 Feb 2001, Ben LaHaise wrote:
>
> Actually, in the filemap_sync case, the flush_tlb_page is redundant --
> there's already a call to flush_tlb_range in filemap_sync after the dirty
> bits are cleared.

This is not enough.

If another CPU has started write-out of one of the dirty pages (which, as
far as I can tell, is certainly unlikely but not impossible) while we were
still handling other dirty pages, that other CPU might clear the physical
dirty bit of that page while a third CPU (or the same writer, but that
makes the timing even _more_ unlikely) is still using a stale "dirty" TLB
entry and writing to the page (and not updating the virtual dirty bit
because it doesn't know that it has already been cleared).

So you have to somehow guarantee that you invalidate the TLB's before the
dirty bit from the "struct page" can be cleared (which in turn has to
happen before the writeout). That can obviously be done with the tlb range
flushing, but it needs more locking.

This is, actually, a problem that I suspect ends up being _very_ similar
to the zap_page_range() case. zap_page_range() needs to make sure that
everything has been updated by the time the page is actually free'd. While
filemap_sync() needs to make sure that everything has been updated before
the page is written out (or marked dirty - which obviously also guarantees
the ordering, and makes the problems look even more similar).

Linus
