
x86: page enter optimization


Maxime Villard

Oct 11, 2016, 12:20:20 PM10/11/16
to tech...@netbsd.org
Userland is pageable, so when mmap is called for one page, the kernel does not
immediately make the page available to the CPU. Rather, it waits for the page
to fault, and at fault time it enters the page for real. This means the kernel
code path from the fault to the moment the page is entered needs to be fast.

All this to say that in pmap_enter_ma [1] on x86, an optimization is possible.
In this function, new_pve and new_sparepve are always allocated, but not always
needed. It is done this way because preemption is disabled in the critical
section, so the allocations have to be performed earlier.

new_pve and new_sparepve are used in pmap_enter_pv. After adding atomic
counters there, a './build.sh tools' gives these numbers:

PVE: used=36441394 unused=58955001
SPAREPVE: used=1647254 unused=93749141

This means that 38088648 allocations were needed and used, and 152704142 were
performed but not used. In short, only about 20% of the allocated buffers were
needed. In fact, the real ratio may be even smaller than that, since I didn't
take into account the fact that there may be no p->v tracking at all (in which
case both buffers are unused as well).

I have a patch [2] which introduces two inlined functions that can tell earlier
whether these buffers are needed. One problem with this patch is that it makes
the code harder to understand, even though I tried to explain clearly what it
does. Another problem is that when both buffers are needed, the patch
introduces a small overhead (the cost of a few branches).

I don't know whether we care enough about things like this; if anyone has
particular comments, feel free.

[1] https://nxr.netbsd.org/xref/src/sys/arch/x86/x86/pmap.c#4061
[2] http://m00nbsd.net/garbage/pmap/enter.diff

Joerg Sonnenberger

Oct 11, 2016, 12:26:13 PM10/11/16
to tech...@netbsd.org
On Tue, Oct 11, 2016 at 02:42:00PM +0200, Maxime Villard wrote:
> All this to say that in pmap_enter_ma on x86, an optimization is possible. In
> this function, new_pve and new_sparepve are always allocated, but not always
> needed. The reason it is done this way is because preemption is disabled in the
> critical part, so obviously the allocation needs to be performed earlier.

What happens if we cache a pair in curcpu? Basically, instead of the pool get,
check whether curcpu already has one and take that; at the end, if the pair
was unused, put it back into curcpu.

Joerg

Jean-Yves Migeon

Oct 14, 2016, 6:04:10 AM10/14/16
to Maxime Villard, tech...@netbsd.org
I would benchmark both (with and without the "overhead" introduced); a
while back, when implementing PAE, I did not expect the paddr_t promotion
from 32 to 64 bits to have that much of an impact on pmap performance,
but the first attempt induced more than 5% overhead on a "cold"
./build.sh run.

Granted, you are not dealing with the same situation here, but pool
caches make the allocation used/unused dance almost free (except for the
slow path). When objects are in the pool cache but not yet obtained
through the getter, they are still allocated but basically not used. It
would be interesting to see whether the hit/miss ratio of the "pvpl"
pool is affected by your optimization.

--
Jean-Yves Migeon