Virtual Memory Page Size change


Sean Halle

Apr 6, 2018, 8:02:01 PM
to RISC-V HW Dev

Hi, I'm approaching this subject of virtual memory page size delicately, as it seems to create a lot of energy :-)  In our design, we would get a non-trivial boost by increasing the VM page size to 16KB -- both in IPC and in energy.  I heard a rumor that early Rocket looked at 16KB pages, but this was rejected.  However, I haven't found much on the reason.  

So, we have looked into the Linux kernel, and there appears to be one master location where the default page size is set:

Everything else appears to derive page size from there -- the page table walker, slab allocation, and so on.  But we don't yet have kernel expertise on the team to verify.

Also, looking further, for other examples, ARM and POWER both enable default page sizes larger than 4KB.  POWER looks like it defaults to 64KB, and ARM can be set to default to 16KB or 64KB (in addition to the standard 4KB).  As for the file system, in recent kernels the only requirement seems to be alignment on 512-byte boundaries -- the file system page size appears insulated from the virtual memory page size..

But..  this is at the heart of the kernel, with a tremendous amount of complexity throughout a vast system, with uncountable number of opportunities for things to go wrong..  so..

Does anyone have concrete examples of what goes wrong when the default VM page size is increased to, say, 16KB?  It makes me very nervous, messing with something this deep, but the gain is attractive..

Thanks for any pointers,

Sean


Andrew Waterman

Apr 6, 2018, 8:39:49 PM
to Sean Halle, RISC-V HW Dev
On Fri, Apr 6, 2018 at 5:01 PM, Sean Halle <sean...@gmail.com> wrote:

Hi, I'm approaching this subject of virtual memory page size delicately, as it seems to create a lot of energy :-)  In our design, we would get a non-trivial boost by increasing the VM page size to 16KB -- both in IPC and in energy.  I heard a rumor that early Rocket looked at 16KB pages, but this was rejected.  However, I haven't found much on the reason.  

Quoth the priv spec commentary: "After much deliberation, we have settled on a conventional page size of 4 KiB for both RV32 and RV64. We expect this decision to ease the porting of low-level runtime software and device drivers. The TLB reach problem is ameliorated by transparent superpage support in modern operating systems. Additionally, multi-level TLB hierarchies are quite inexpensive relative to the multi-level cache hierarchies whose address space they map."


So, we have looked into the Linux kernel, and there appears to be one master location where the default page size is set:

Everything else appears to derive page size from there -- the page table walker, slab allocation, and so on.  But we don't yet have kernel expertise on the team to verify.

Also, looking further, for other examples, ARM and POWER both enable default page sizes larger than 4KB.  POWER looks like it defaults to 64KB, and ARM can be set to default to 16KB or 64KB (in addition to the standard 4KB).  As for the file system, in recent kernels the only requirement seems to be alignment on 512-byte boundaries -- the file system page size appears insulated from the virtual memory page size..

But..  this is at the heart of the kernel, with a tremendous amount of complexity throughout a vast system, with uncountable number of opportunities for things to go wrong..  so..

Does anyone have concrete examples of what goes wrong when the default VM page size is increased to, say, 16KB?  It makes me very nervous, messing with something this deep, but the gain is attractive..

Don't forget that, if you stick to the current pattern, an RV64 page table will now have 2K entries, so superpages will become 32M, 64G, 128T (instead of 2M, 1G, 512G).

Although we don't support transparent superpages yet, once we do, 2M will be a far more practical size than 32M.  Most likely, some workloads will perform better with the 4K base page for this reason (though of course others will perform better with the 16K base page).


Thanks for any pointers,

Sean


--
You received this message because you are subscribed to the Google Groups "RISC-V HW Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hw-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to hw-...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/hw-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/hw-dev/CAJ4GwD%2BSdWfFvgNEx0fH9YmgkRuLbVa0jGeT0R7B9ZXGTfQZ9Q%40mail.gmail.com.

David Chisnall

Apr 7, 2018, 5:42:40 AM
to Sean Halle, RISC-V HW Dev
The problem is not the kernel, it is applications that expect mmap and mprotect to work with 4KB chunks. I don’t have concrete examples to hand, but there were quite a lot when we looked. Searching for mmap and mprotect in large open source code bases should give you a few. Note that even the ones that do try to support different page sizes have often never been tested with non-4KB sizes, so they may fail in subtle and exciting ways.

David

ron minnich

Apr 7, 2018, 11:00:46 AM
to David Chisnall, Sean Halle, RISC-V HW Dev
On Sat, Apr 7, 2018 at 2:42 AM David Chisnall <David.C...@cl.cam.ac.uk> wrote:


The problem is not the kernel, it is applications that expect mmap and mprotect to work with 4KB chunks.  I don’t have concrete examples to hand, but there were quite a lot when we looked.  Searching for mmap and mprotect in large open source code bases should give you a few.  Note that even the ones that do try to support different page sizes have often never been tested with non-4KB sizes, so they may fail in subtle and exciting ways.



well, we've been here before. There used to be lots of code that used pointers with a value of zero and worked. A LOT of code.

We did not, as a result, decide to allow zero pointers everywhere (except IBM, and that only for reading). Rather, we walled zero off and took the pain. It took a while.

I'm not sure "conformance to broken code" is ever a good reason to lock down an architectural feature.  

It's nice that riscv allows superpage support, and it certainly works fine in Harvey, which doesn't even support 4k pages.  I just wanted to take small exception to the reasoning here :-)

ron

Luke Kenneth Casson Leighton

Apr 7, 2018, 11:16:09 AM
to David Chisnall, Sean Halle, RISC-V HW Dev
oo, oo, i know one! LMDB. LMDB is awesome, i curated the wikipedia
page for a while, to get some {insert appropriate noun} from Oracle
off it, he kept scheduling it for deletion, surpriiiise

https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database

there's a list of applications there which is incomplete and will
give you some idea of how extensively used LMDB really is.

why am i mentioning this? because LMDB, in its awesomeness, uses
shared memory *copy on write* semantics. i quite literally had never
heard that that was even possible until i encountered LMDB.

why is *that* relevant? well, it's because LMDB formats its B+ Tree
page tables based EXACTLY on the underlying OS's page-table format.

why is *that* relevant? well, it's because if you want to migrate
an LMDB database from one system that supports a particular mmap
page-size over to another, the performance will be f*****d, basically.
LMDB does actually support it, but i would be extremely surprised if
it was a well-tested code-path.

so... basicallllyyy... you'd be forcing IT Sysadmins to dump and
migrate entire OpenLDAP databases (or use synchronisation - slapd - to
do it), or if you used any of the other software which happened *not*
to have distributed capability, they *would* be f*****d, and would
have to schedule offline down-time in order to migrate to an
alternative architecture.

*beams happily* - i love LMDB. 7.5 *million* sequential reads per
second. 2.5 million end-of-index writes per second on 100-byte
records! and built-in atomic transaction support and multi-readers
with one simultaneous writer! and.... and... and... :)

l.

Sean Halle

Apr 7, 2018, 3:39:10 PM
to David Chisnall, RISC-V HW Dev

I see, so mmap exposes the VM page size choice to application code..  which results in significant porting effort.  That would be a negative that outweighs the positive of the performance gain.  By a lot :-)

If I may ask a novice question..  my understanding of mmap is that it allows user space to map a file onto a range of virtual memory addresses.  Underneath, it is convenient when the file page size matches the VM page size..  so, at a minimum, we would need to modify the mmap implementation to maintain a 4KB page visible to user space..  but do I understand correctly that, when you looked at this in depth, there was something that prevented mmap from maintaining the 4KB file page size on the user side while underneath working with a 16KB VM page size?

Thanks David :-)
 

Sean Halle

Apr 25, 2018, 1:05:04 AM
to David Chisnall, RISC-V HW Dev

Thanks David, Luke, and Howard.  After diving into this more..  it looks, as far as I can tell, as though mmap and file system implementations can maintain a 4KB "file-page" abstraction seamlessly by having hardware support for 4KB permission granularity.  In other words, when chasing down mmap, mprotect, (and LMDB), it appears as though the only trip wire is that the granularity for permissions is 4KB -- any of the 4KB "file-page"s occupying the same 16KB virtual page could have different permissions.  So adding permission checks to the TLB for each 4KB chunk allows mmap to present 4KB pages to user space without any overhead.  User space thinks it's getting 4KB pages.  The extra permission granularity looks like a minimal impact on the TLB hardware, and also minimal impact on the page table.

However, it means we have to tweak the page table walker, and tweak mmap and mprotect implementations in the kernel.  We'll also have to do a survey to figure out if any other user-land interfaces expose the VM page size, then look closely at device drivers, to see whether there are built-in assumptions about page-size that bypass the defines in the kernel header..  which, I expect there will be..

So, for the MVP first product, we'll be sticking with 4KB pages and put this on the list for version 2 :-)  We're open to interested parties and would love to do some experiments with the 4KB permissions approach, to see what still breaks.

Sean
Intensivate


Luke Kenneth Casson Leighton

Apr 25, 2018, 1:48:12 AM
to Sean Halle, David Chisnall, RISC-V HW Dev
On Wed, Apr 25, 2018 at 6:04 AM, Sean Halle <sean...@gmail.com> wrote:

> However, it means we have to tweak the page table walker, and tweak mmap and
> mprotect implementations in the kernel.

lots of work... all of which would need to be mainlined, otherwise you
end up with 2+ months of effort to port every point-release of the
linux kernel (and other libraries)....

l.

David Chisnall

Apr 25, 2018, 5:13:50 AM
to Sean Halle, RISC-V HW Dev
On 25 Apr 2018, at 06:04, Sean Halle <sean...@gmail.com> wrote:
>
> However, it means we have to tweak the page table walker, and tweak mmap and mprotect implementations in the kernel. We'll also have to do a survey to figure out if any other user-land interfaces expose the VM page size, then look closely at device drivers, to see whether there are built-in assumptions about page-size that bypass the defines in the kernel header.. which, I expect there will be..
>
> So, for the MVP first product, we'll be sticking with 4KB pages, and put this on the list for version 2 :-) We're open to interested parties, would love to do some experiments with the 4KB permissions approach, to see what still breaks.

Some Apple folk that Andrew and I spoke to at ASPLOS hinted that they’d done this evaluation and come to the conclusion that enough stuff would break that it isn’t really worth it.

Modern operating systems are now very good at transparent superpage promotion. On FreeBSD, malloc asks for memory in 2-8MB chunks. These are initially backed by individual 4KB pages (allocated lazily), but once a region is packed with live objects, it almost always ends up being replaced by a smaller number of superpages. Anything being serviced by the buffer cache and backed by a file will often already be convenient for superpages, because it is more efficient to ask the disk to DMA a superpage-sized amount of data into a contiguous chunk of free memory (and evict the unused portions) than to do a larger number of 4KB requests.

On RISC-V, increasing the base page size also increases the smallest superpage size. This increases the effort in transparent superpage promotion and means that, somewhat counterintuitively, you are likely to see increased TLB pressure with larger leaf page table entry sizes.

If you want a more scalable mechanism, you should consider asking Krste about Mondrian Memory Protection, which allows you to decouple the protection and translation granule sizes. Doing something more restrictive than a full Mondrian implementation, giving 4KB protection granules but 64KB translation granules may give some improvements and would be an interesting avenue for experimentation.

David

Michael Clark

Apr 25, 2018, 6:18:23 AM
to David Chisnall, Sean Halle, RISC-V HW Dev


On 25/04/2018, at 9:13 PM, David Chisnall <David.C...@cl.cam.ac.uk> wrote:

On 25 Apr 2018, at 06:04, Sean Halle <sean...@gmail.com> wrote:

However, it means we have to tweak the page table walker, and tweak mmap and mprotect implementations in the kernel.  We'll also have to do a survey to figure out if any other user-land interfaces expose the VM page size, then look closely at device drivers, to see whether there are built-in assumptions about page-size that bypass the defines in the kernel header..  which, I expect there will be..

So, for the MVP first product, we'll be sticking with 4KB pages, and put this on the list for version 2 :-)  We're open to interested parties, would love to do some experiments with the 4KB permissions approach, to see what still breaks.

Some Apple folk that Andrew and I spoke to at ASPLOS hinted that they’d done this evaluation and come to the conclusion that enough stuff would break that it isn’t really worth it.

Agree totally. Just google “#define PAGE_SIZE 4096”

Modern operating systems are now very good at transparent superpage promotion.  On FreeBSD, malloc asks for memory in 2-8MB chunks.  These are initially backed by individual 4KB pages (allocated lazily), but once a region is packed with live objects, it almost always ends up being replaced by a smaller number of superpages.  Anything being serviced by the buffer cache and backed by a file will often already be convenient for superpages, because it is more efficient to ask the disk to DMA a superpage-sized amount of data into a contiguous chunk of free memory (and evict the unused portions) than to do a larger number of 4KB requests.

On RISC-V, increasing the base page size also increases the smallest superpage size.  This increases the effort in transparent superpage promotion and means that, somewhat counterintuitively, you are likely to see increased TLB pressure with larger leaf page table entry sizes.

If you want a more scalable mechanism, you should consider asking Krste about Mondrian Memory Protection, which allows you to decouple the protection and translation granule sizes.  Doing something more restrictive than a full Mondrian implementation, giving 4KB protection granules but 64KB translation granules may give some improvements and would be an interesting avenue for experimentation.

Interesting. I didn’t know about the 2002 Mondrian paper until now, but it seems it is prior art for Intel’s 2014 sub-page protection [1] for which there are now patches for the Linux kernel. In Intel’s implementation of the Mondrian system they have a 128 byte granularity for write access permissions. There is a sub page permission bit in the regular page tables that causes a walk in the sub page permission table.

It’s annoying that the US patent office doesn’t do prior art checks on filing, instead giving patents to anyone with lawyers [3].

Sean Halle

Apr 25, 2018, 1:40:36 PM
to Michael Clark, David Chisnall, RISC-V HW Dev

Thanks David and Michael.  It’s a good example of unintended consequences..  it seems fairly clear that the early idea was to make page size a free choice in the hardware, invisible to applications, and only visible to the page table walker.  But the “invisible” hardware choice leaked out, through operating system choices..  to succeed, you need both parties on the same “page” :-)  

I wonder what choices are happening now, in RISC-V, that have clear intentions from the hardware design, that will end up leaking out unless the system software community carefully thinks through down the road consequences of their choices.

I wonder whether that’s happening, and the loop is getting closed, pulling in pre-visualized system software and application software issues, and letting those drive hardware choices..?

I’ve seen here what looks to me like good dialog around the hypervisor.  I wonder whether all the areas have equally good dialog, open in both directions?



Samuel Falvo II

Dec 26, 2018, 1:43:19 PM
to Michael Clark, David Chisnall, Sean Halle, RISC-V HW Dev
On Wed, Apr 25, 2018 at 3:18 AM Michael Clark <michae...@mac.com> wrote:
> Interesting. I didn’t know about the 2002 Mondrian paper until now, but it seems it is prior art for Intel’s 2014 sub-page protection [1] for which there are now patches for the Linux kernel. In Intel’s implementation of the Mondrian system they have a 128 byte granularity for write access permissions. There is a sub page permission bit in the regular page tables that causes a walk in the sub page permission table.

Especially since I have a distinct interest in SASOS operating systems
(particularly as usable virtual address spaces exceed 64 bits), I
found MMP extremely interesting. Like you, I hadn't known about
Mondriaan until today. I just read some of the papers on MMP,
including the Ph.D. thesis on the topic. It all seemed so obvious
while reading through it; even familiar, somehow. And then I
reviewed the PMP registers in the privilege specification. They are
nearly (but not exactly) identical in scope and capability.

Bluntly, PMP is a proper, software-managed subset of MMP. It lacks
support for gates, but that can be remedied with more CSRs just for
that task. The only "weird" architectural aspect is that it must rely
on S-mode paging to perform address translation. Unlike MMP as it was
documented, it has proper RWX permissions and is entirely
software-driven (think MIPS software-managed TLBs). Otherwise, the
overlap in functionality is utterly uncanny.

It seems to me that if one wants to implement MMP at a supervisor
and/or user level of privilege, you'd just need to add a corresponding
number of registers (SMP and UMP?) to make this work. The backing
hardware seems like it'd be fairly trivial to replicate, as long as
the protection unit was modular enough.

--
Samuel A. Falvo II