map-based large allocation, map_populate and mremap()

Waldek Kozaczuk

Mar 30, 2020, 12:38:42 AM3/30/20
to OSv Development
Hi,

I came up with this test trying to replicate the fragmentation scenario Rick Payne was experiencing:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <unistd.h>

int main() {
    size_t size = 0x4C1000;

    void* mem = malloc(size);
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < 100; i++) {
        size = (size * 105) / 100;  // grow by 5% each iteration
        malloc(0x4000);             // deliberately leaked, to fragment memory
        printf("%d allocation of %zu bytes\n", i + 1, size);
        mem = realloc(mem, size);
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
    printf("DONE - last size: %zu!\n", size);

    sleep(1000);
    free(mem);
}

So before the latest patch that implements map-based malloc_large(), this test needs at least 1.6GB of memory to pass - the last realloc() tries to allocate ~625MB of memory. However, I did not really see the fragmentation I expected in physical memory, so I do not think this test is very representative.

The latest version of the code with all patches applied needs a minimum of 1.2GB - which is not surprising, given that realloc() briefly needs two buffers of similar size: it copies the data from the old one to the new one before freeing the smaller one. But the good thing is, it no longer needs 625MB of contiguous physical memory to do it.

However, with this experiment I noticed that the new malloc_large() is slower - this test takes 3 seconds vs 2 seconds before the patch. As I suspected, the culprit was that mapped_malloc_large() calls map_anon() with mmap_populate, which pre-faults the entire range - something I thought was necessary to do. But it turns out that when I replaced mmap_populate with mmap_uninitialized, all the unit tests and a couple of other apps were still working just fine, and the test above took a similar amount of time as before - around 2 seconds. So maybe we should use mmap_uninitialized. The only concern would be kernel code running with preemption disabled and trying to access memory that was allocated with mapped_malloc_large(). Could that happen?

Finally, when I ran the same test on a Linux host it completed in under 2ms (milliseconds) - so 1000 times faster. Why is that? Linux obviously uses mremap() to implement realloc(), while OSv copies the data using memcpy(). So if we want to make realloc() faster (it could also be useful to speed up ramfs), we should implement mremap() - here is an open issue for that - https://github.com/cloudius-systems/osv/issues/184 - and supposedly Pekka sent a patch a while ago which we can build upon.

Waldek 

Nadav Har'El

Mar 30, 2020, 4:33:05 AM3/30/20
to Waldek Kozaczuk, OSv Development
On Mon, Mar 30, 2020 at 7:38 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Hi,

I came up with this test trying to replicate the fragmentation scenario Rick Payne was experiencing:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <unistd.h>

int main() {
    size_t size = 0x4C1000;

    void* mem = malloc(size);
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < 100; i++) {
        size = (size * 105) / 100;  // grow by 5% each iteration
        malloc(0x4000);             // deliberately leaked, to fragment memory
        printf("%d allocation of %zu bytes\n", i + 1, size);
        mem = realloc(mem, size);
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
    printf("DONE - last size: %zu!\n", size);

    sleep(1000);
    free(mem);
}

So before the latest patch that implements map-based malloc_large(), this test needs at least 1.6GB of memory to pass - the last realloc() tries to allocate ~625MB of memory. However, I did not really see the fragmentation I expected in physical memory, so I do not think this test is very representative.

The latest version of the code with all patches applied needs a minimum of 1.2GB - which is not surprising, given that realloc() briefly needs two buffers of similar size: it copies the data from the old one to the new one before freeing the smaller one. But the good thing is, it no longer needs 625MB of contiguous physical memory to do it.

However, with this experiment I noticed that the new malloc_large() is slower - this test takes 3 seconds vs 2 seconds before the patch. As I suspected, the culprit was that mapped_malloc_large() calls map_anon() with mmap_populate, which pre-faults the entire range - something I thought was necessary to do. But it turns out that when I replaced mmap_populate with mmap_uninitialized, all the unit tests and a couple of other apps were still working just fine, and the test above took a similar amount of time as before - around 2 seconds. So maybe we should use mmap_uninitialized. The only concern would be kernel code running with preemption disabled and trying to access memory that was allocated with mapped_malloc_large(). Could that happen?

Although malloc() cannot be called with preemption disabled (it uses locks), you're right that all preemption-disabled kernel code that *uses* previously-allocated memory today silently assumes that this memory is already populated ("swapped in"). There are even more delicate issues, like the lazy TLB flush explained in commit c9e5af6deef6ef9420bdf6437f8c34371c8df4e9. So this sort of preemption-disabled kernel code today relies on the old-style pre-populated and visible-on-all-cores malloc(), and may not work correctly with mmap(). However, I don't think this sort of preemption-disabled code in the kernel ever works with very large allocations, for which you fall back to mmap()?
 

Finally, when I ran the same test on a Linux host it completed in under 2ms (milliseconds) - so 1000 times faster. Why is that? Linux obviously uses mremap() to implement realloc(), while OSv copies the data using memcpy(). So if we want to make realloc() faster (it could also be useful to speed up ramfs), we should implement mremap() - here is an open issue for that - https://github.com/cloudius-systems/osv/issues/184 - and supposedly Pekka sent a patch a while ago which we can build upon.

As I noted a few weeks ago on this mailing list, Linux has had hundreds of people optimizing every tiny detail. The question is whether this optimization is worth it. Is it common for people to realloc() huge allocations? If they do, do they do it many times, or exponentially (i.e., double the size each time)?
If you think it's worth it, then yes - your new malloc() code now knows it used mmap(), and can use mremap() to do the actual work. I think the question is just whether you want to optimize this specific use case.
 

Waldek 

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/54a9bb96-ea86-4c75-8d86-9a0be128fc58%40googlegroups.com.

Waldek Kozaczuk

Mar 30, 2020, 3:06:36 PM3/30/20
to OSv Development


On Monday, March 30, 2020 at 4:33:05 AM UTC-4, Nadav Har'El wrote:
On Mon, Mar 30, 2020 at 7:38 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Hi,

I came up with this test trying to replicate the fragmentation scenario Rick Payne was experiencing:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <unistd.h>

int main() {
    size_t size = 0x4C1000;

    void* mem = malloc(size);
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < 100; i++) {
        size = (size * 105) / 100;  // grow by 5% each iteration
        malloc(0x4000);             // deliberately leaked, to fragment memory
        printf("%d allocation of %zu bytes\n", i + 1, size);
        mem = realloc(mem, size);
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
    printf("DONE - last size: %zu!\n", size);

    sleep(1000);
    free(mem);
}

So before the latest patch that implements map-based malloc_large(), this test needs at least 1.6GB of memory to pass - the last realloc() tries to allocate ~625MB of memory. However, I did not really see the fragmentation I expected in physical memory, so I do not think this test is very representative.

The latest version of the code with all patches applied needs a minimum of 1.2GB - which is not surprising, given that realloc() briefly needs two buffers of similar size: it copies the data from the old one to the new one before freeing the smaller one. But the good thing is, it no longer needs 625MB of contiguous physical memory to do it.

However, with this experiment I noticed that the new malloc_large() is slower - this test takes 3 seconds vs 2 seconds before the patch. As I suspected, the culprit was that mapped_malloc_large() calls map_anon() with mmap_populate, which pre-faults the entire range - something I thought was necessary to do. But it turns out that when I replaced mmap_populate with mmap_uninitialized, all the unit tests and a couple of other apps were still working just fine, and the test above took a similar amount of time as before - around 2 seconds. So maybe we should use mmap_uninitialized. The only concern would be kernel code running with preemption disabled and trying to access memory that was allocated with mapped_malloc_large(). Could that happen?

Although malloc() cannot be called with preemption disabled (it uses locks), you're right that all preemption-disabled kernel code that *uses* previously-allocated memory today silently assumes that this memory is already populated ("swapped in"). There are even more delicate issues, like the lazy TLB flush explained in commit c9e5af6deef6ef9420bdf6437f8c34371c8df4e9. So this sort of preemption-disabled kernel code today relies on the old-style pre-populated and visible-on-all-cores malloc(), and may not work correctly with mmap(). However, I don't think this sort of preemption-disabled code in the kernel ever works with very large allocations, for which you fall back to mmap()?

With my patches, malloc_large() uses map_anon() for any allocation >= 2MB (the size of so-called "huge pages"), and for allocations > 4K and < 2MB ONLY if we cannot find large enough page ranges that would fit ("fallback"). So do you think it is safe to change mmap_populate to mmap_uninitialized for the 1st case but not for the other? That is based on your thinking that the kernel does not operate on such large allocated memory areas?

Also, as I understand it, each CPU has its own TLB, so the TLB has to be flushed for that CPU to see any changes made to the global page table structures, right? So the commit ("mmu: flush tlb lazily for cpus that are running system threads" - https://github.com/cloudius-systems/osv/commit/7e38453390d6c0164a72e30b2616b0f3c3025349 by Gleb) optimizes the flushing logic by making it lazy for CPUs running system threads (reading the code in this commit, I think we are really optimizing for the application threads). This somehow affected the scheduler code, which would sometimes end up operating on mmapped areas of memory where thread info used to be stored, leading to a page fault while preemption was disabled, right? So your commit ("sched: only allocate sched::thread objects on the heap" - https://github.com/cloudius-systems/osv/commit/c9e5af6deef6ef9420bdf6437f8c34371c8df4e9) addressed it by hiding "new thread" behind thread::make(), which makes sure to use the regular non-mmapped way of allocating memory. So we hope that my commit did not break any of that. Also, even if we change to lazy population for >= 2MB allocations, we should be fine, as the allocation made by thread::make() should not try to allocate more than 4K of memory, and thus would not call malloc_large(), right?

    template <typename... Args>
    static thread* make(Args&&... args) {
        return new thread(std::forward<Args>(args)...);
    }

Lastly, do you think the changes to malloc_large() affect the patch I sent to make application threads use a lazily populated stack - https://groups.google.com/d/msg/osv-dev/tZnwiScmjZY/GkY0hV9EAwAJ ?


Finally, when I ran the same test on a Linux host it completed in under 2ms (milliseconds) - so 1000 times faster. Why is that? Linux obviously uses mremap() to implement realloc(), while OSv copies the data using memcpy(). So if we want to make realloc() faster (it could also be useful to speed up ramfs), we should implement mremap() - here is an open issue for that - https://github.com/cloudius-systems/osv/issues/184 - and supposedly Pekka sent a patch a while ago which we can build upon.

As I noted a few weeks ago on this mailing list, Linux has had hundreds of people optimizing every tiny detail. The question is whether this optimization is worth it. Is it common for people to realloc() huge allocations? If they do, do they do it many times, or exponentially (i.e., double the size each time)?
If you think it's worth it, then yes - your new malloc() code now knows it used mmap(), and can use mremap() to do the actual work. I think the question is just whether you want to optimize this specific use case.
 

Waldek 


Nadav Har'El

Mar 30, 2020, 5:10:44 PM3/30/20
to Waldek Kozaczuk, OSv Development
Maybe that's indeed a good idea. Also, for small regions the population part would be relatively cheap?


Also, as I understand it, each CPU has its own TLB, so the TLB has to be flushed for that CPU to see any changes made to the global page table structures, right? So the commit ("mmu: flush tlb lazily for cpus that are running system threads" - https://github.com/cloudius-systems/osv/commit/7e38453390d6c0164a72e30b2616b0f3c3025349 by Gleb) optimizes the flushing logic by making it lazy for CPUs running system threads (reading the code in this commit, I think we are really optimizing for the application threads). This somehow affected the scheduler code, which would sometimes end up operating on mmapped areas of memory where thread info used to be stored, leading to a page fault while preemption was disabled, right? So your commit ("sched: only allocate sched::thread objects on the heap" - https://github.com/cloudius-systems/osv/commit/c9e5af6deef6ef9420bdf6437f8c34371c8df4e9) addressed it by hiding "new thread" behind thread::make(), which makes sure to use the regular non-mmapped way of allocating memory. So we hope that my commit did not break any of that. Also, even if we change to lazy population for >= 2MB allocations, we should be fine, as the allocation made by thread::make() should not try to allocate more than 4K of memory, and thus would not call malloc_large(), right?

I don't remember what sizeof(thread) is - I think it is around 20K, certainly not less than 4KB (you can check).
I think that indeed, we need to make sure that thread::make() returns old-style malloc()ed memory, not mmap()ed memory.
Perhaps thread::make() should use, instead of aligned_alloc(), a special function which ensures the old-style behavior? Maybe, like we have alloc_phys_contiguous_aligned(), we should have another function like alloc_populated_and_visible_on_all() (or some other name) which ensures the memory is pre-populated and visible *now* (not later) on all CPUs.
Alternatively, I wonder if we could add an mmap() option which would ensure a full TLB flush (with no "laziness"), and use this only from the malloc()-doing-mmap() code.

I think outside the scheduler and thread::make(), we don't need to worry about Gleb's lazy TLB flush optimization:

The goal of this optimization is to reduce unnecessary IPIs to CPUs currently running kernel threads - such as I/O threads and even idle threads.
The thinking is that kernel threads would not be holding pointers into the user's mmap()ed memory, since the kernel has no control over when this memory may be unmapped by the user. So if some CPU is currently running a kernel thread, there is no reason to interrupt it (via an IPI, which is pretty slow, and even more so on VMs) and flush its TLB on every mmap() call; we can flush the TLB only when returning to an application thread, which may want to use the newly-mapped area (via a pointer it gets from some shared memory area).
I think the same thinking also follows for malloc() - kernel threads would also not keep pointers to user malloc()ed memory, because the user may free it at any time. So I think the TLB flush optimization is still fine for malloc(), or in other words: it's fine that malloc() actually calls mmap(), which has this TLB flush optimization.

 

    template <typename... Args>
    static thread* make(Args&&... args) {
        return new thread(std::forward<Args>(args)...);
    }

Lastly, do you think the changes to malloc_large() affect the patch I sent to make application threads use a lazily populated stack - https://groups.google.com/d/msg/osv-dev/tZnwiScmjZY/GkY0hV9EAwAJ ?


Finally, when I ran the same test on a Linux host it completed in under 2ms (milliseconds) - so 1000 times faster. Why is that? Linux obviously uses mremap() to implement realloc(), while OSv copies the data using memcpy(). So if we want to make realloc() faster (it could also be useful to speed up ramfs), we should implement mremap() - here is an open issue for that - https://github.com/cloudius-systems/osv/issues/184 - and supposedly Pekka sent a patch a while ago which we can build upon.

As I noted a few weeks ago on this mailing list, Linux has had hundreds of people optimizing every tiny detail. The question is whether this optimization is worth it. Is it common for people to realloc() huge allocations? If they do, do they do it many times, or exponentially (i.e., double the size each time)?
If you think it's worth it, then yes - your new malloc() code now knows it used mmap(), and can use mremap() to do the actual work. I think the question is just whether you want to optimize this specific use case.
 

Waldek 



Waldek Kozaczuk

Apr 3, 2020, 5:59:59 PM4/3/20
to OSv Development
I have added a new issue - https://github.com/cloudius-systems/osv/issues/1079 - that captures all the suggestions for improvements we have talked about here.