Currently, I'm working on a new hash table implementation in Rust for the project at
github.com/project-oak, for their "Oak Functions" system. For some reason, Rust's memory allocator does not yet do what all the good allocators do, and this is killing memory efficiency. Good allocators realize they can't compete with Linux's memory manager for large arrays, so once a dynamic array grows beyond a certain threshold, it is removed from the heap and instead managed by the Linux kernel, using mmap and mremap.
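glibc's malloc, for instance, already works this way: anything above its M_MMAP_THRESHOLD (128KiB by default) is served by mmap and handed straight back to the kernel on free. A minimal illustration, with a made-up threshold of 1MiB:

#include <malloc.h>
#include <stdlib.h>

int main() {
  // Ask glibc to serve any allocation of 1MiB or more directly from mmap.
  mallopt(M_MMAP_THRESHOLD, 1 << 20);
  // This 16MiB buffer is mmap-backed; free() returns it to the kernel
  // immediately instead of keeping it on the heap's free lists.
  void* big = malloc((size_t)1 << 24);
  free(big);
  return 0;
}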
The 3X memory growth problem:
For whatever reason, the memory allocator in Rust's alloc crate does not use this trick. Maybe the standard allocator does? I haven't checked, but as my hash table grows, it is reallocated to twice the size as needed. Let's say a table has 2^30 entries and crosses the threshold. If each entry is 16 bytes, I have 16GiB in the array. Rust's alloc crate creates a new array of 2^31 entries, copies the data from the old array to the new one, and frees the old array. If you're watching with 'top' when this happens, you'll see memory usage suddenly jump from 16GiB to 48GiB after allocating just one more object! This is where my program crashes, in my case...
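To make the numbers concrete, here's a C sketch of that allocate-copy-free growth path (not Rust's actual code; grow_by_copy is just an illustrative name). The old and new arrays are both live until the copy finishes, which is where the 3X peak comes from:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Double the capacity of a table of 16-byte entries by allocate-copy-free.
static void* grow_by_copy(void* old_buf, size_t old_entries) {
  size_t entry_size = 16;
  void* new_buf = malloc(2 * old_entries * entry_size);  // new array: 2X the data
  memcpy(new_buf, old_buf, old_entries * entry_size);    // old + new both live: 3X peak
  free(old_buf);                                         // only now back down to 2X
  return new_buf;
}

int main() {
  size_t entries = (size_t)1 << 20;  // kept small for the demo; 2^30 gives the 16GiB -> 48GiB spike
  void* table = calloc(entries, 16);
  table = grow_by_copy(table, entries);
  free(table);
  return 0;
}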
The Linux magic solution
When Linux manages large dynamic arrays, the behavior is so much nicer! Consider this nutty C program:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  for (uint32_t i = 0; i < 1024; i++) {
    uint8_t* buf = malloc((uint64_t)1 << 35);
    printf("Address of buf = %p\n", (void*)buf);
  }
  return 0;
}
This code runs without any problem on my underpowered laptop. In theory, it allocates 1024 * 32GiB = 32TiB of memory on a PC that has only 32GiB of RAM. Why didn't it run out of memory?
Linux only allocates pages when you write to them.
Did you know that large allocations from malloc come back zeroed on Linux, just like calloc? That's because they're backed by mmap. The trick is that Linux has one special page of memory containing all 0's, which is marked read-only. Linux maps this same page into every page table entry backing the buffer returned by mmap. If you read anywhere in any of the arrays allocated above, you get 0's.
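Here's a small demo of that, calling mmap directly, which is essentially what a large malloc does under the hood. The sizes and index are arbitrary; reads from untouched pages all return 0 and cost no physical memory:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

int main() {
  size_t size = (size_t)1 << 33;  // reserve 8GiB of address space
  uint8_t* buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED) {
    perror("mmap");
    return 1;
  }
  // Every untouched page is backed by the shared read-only zero page.
  printf("buf[1234567] = %d\n", buf[1234567]);  // prints 0
  // The first write to a page faults in one real, writable page.
  buf[1234567] = 42;
  munmap(buf, size);
  return 0;
}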
When you write to one of these pages for the first time, it triggers a page fault. The page fault handler detects the write to the special zero page and allocates a real, writable page for that location. So if the virtual space is reserved up front, and my 16GiB hash table needs one more 16-byte entry, instead of suddenly consuming 48GiB, Linux just allocates one more page to the array. The memory location doesn't even move.
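mremap is what makes this work for a growing array: it extends the mapping without copying any data, and if you forbid it from moving, the address stays put. A rough sketch with arbitrary sizes and minimal error handling:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

int main() {
  size_t old_size = (size_t)1 << 30;  // 1GiB
  size_t new_size = (size_t)1 << 31;  // 2GiB
  uint8_t* buf = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED) { perror("mmap"); return 1; }
  buf[0] = 1;  // touch one page: exactly one physical page is allocated

  // Grow the mapping. Flags = 0 means "fail rather than move the address";
  // pass MREMAP_MAYMOVE if you can tolerate a new address (still no copy).
  uint8_t* grown = mremap(buf, old_size, new_size, 0);
  if (grown == MAP_FAILED) { perror("mremap"); return 1; }
  printf("Same address after growing: %s\n", grown == buf ? "yes" : "no");
  munmap(grown, new_size);
  return 0;
}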
In Rune, using SoA memory layout, we have a lot of large arrays growing dynamically. We double array sizes whenever we need more room. Fortunately, Linux doesn't actually map physical pages to all the new room at the end of the arrays. If it did, we'd pay on average a 25% memory penalty for using SoA memory layout. As it is, we pay 0%.
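Here's a minimal sketch of that doubling strategy on top of mmap/mremap (not Rune's actual runtime; DynArray and the function names are made up, and error handling is omitted). The capacity doubles, but only pages that have actually been written become resident:

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

typedef struct {
  uint8_t* data;
  size_t used;      // bytes written so far (roughly the resident footprint)
  size_t capacity;  // bytes reserved; untouched tail pages cost nothing
} DynArray;

static void dynArrayInit(DynArray* a, size_t capacity) {
  a->data = mmap(NULL, capacity, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  a->used = 0;
  a->capacity = capacity;
}

static void dynArrayAppend(DynArray* a, const void* bytes, size_t len) {
  if (a->used + len > a->capacity) {
    // Double the capacity. MREMAP_MAYMOVE lets the kernel relocate the
    // mapping if it can't grow in place; either way, no data is copied.
    a->data = mremap(a->data, a->capacity, a->capacity * 2, MREMAP_MAYMOVE);
    a->capacity *= 2;
  }
  memcpy(a->data + a->used, bytes, len);
  a->used += len;
}

int main() {
  DynArray a;
  dynArrayInit(&a, (size_t)1 << 21);  // reserve 2MiB up front
  uint64_t value = 42;
  dynArrayAppend(&a, &value, sizeof(value));
  return 0;
}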
Bill
P.S. Did you wonder why I allocated 1024 arrays of 2^35 bytes each, rather than one array of 2^45? Because malloc refuses to return an allocation larger than what the CPU can physically address. My CPU has 35 physical address bits on the package, so 32GiB is the largest memory configuration possible with my CPU. Cloud-class CPUs can have up to 1TiB of RAM, and in those cases, malloc will let you allocate up to 1TiB per call.