Hi,
We are seeing an OOM "GC overhead limit exceeded" issue in the Vert.x test suite on slow machines such as Cloudbees, or when using virtualization on a laptop.
This error means that the VM spends almost all of its time (98%) in GC rather than doing useful work, but I’m pretty sure everyone here knows that already.
In the case of the Vert.x test suite, it happens because we create and destroy many Vertx instances, and therefore many event loops / threads, during the tests.
I spent quite some time on the issue and found that it happens because the ThreadDeathWatcher's Recycler DELAYED_RECYCLE fast thread local, a WeakHashMap<Stack, WeakOrderQueue> used by the ThreadPoolCache, grows and retains many entries (up to 2000). This WeakHashMap contains recycled objects with a large footprint, and the Stack itself references its Thread, which also has a large footprint.
I am not saying it is a leak per se, but the map grows and takes a long time to be garbage collected. This does not prevent the test suite from running on a laptop, but it slows it down, and on slow machines it prevents the test suite from running at all.
I believe this behavior was introduced by this commit:
https://github.com/netty/netty/commit/afafadd3d7caf1e4b346da049baab0afeae0a4bc
The change that makes the whole difference is:
https://github.com/netty/netty/commit/afafadd3d7caf1e4b346da049baab0afeae0a4bc#diff-23eafd00fcd66829f8cce343b26c236aR226
The stack field introduced in WeakOrderQueue keeps a reference to the Stack object. However, Stack instances are also the weak keys of the fast thread local WeakHashMap, so a value that references its own key defeats the purpose of the WeakHashMap. Not entirely, since the entries do get GC’ed, but much less often, which increases the memory footprint.
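
To illustrate the mechanism, here is a minimal standalone sketch (not Netty code). The classes Stack, StrongQueue and WeakQueue are hypothetical stand-ins I made up for the example; StrongQueue mimics a value that holds a direct reference to its own key, and WeakQueue is one possible alternative that uses a WeakReference instead. In Netty the entries are eventually reclaimed, as said above, so this only shows the extreme form of the problem.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

final class WeakKeyPinnedByValue {

    // Hypothetical stand-in for the Stack used as weak key.
    static final class Stack {
        final byte[] payload = new byte[1024];
    }

    // Hypothetical stand-in for a value that strongly references its key.
    static final class StrongQueue {
        final Stack stack;
        StrongQueue(Stack stack) { this.stack = stack; }
    }

    // Hypothetical variant: the value only weakly references its key.
    static final class WeakQueue {
        final WeakReference<Stack> stack;
        WeakQueue(Stack stack) { this.stack = new WeakReference<>(stack); }
    }

    // The map holds the value strongly, and the value holds the key strongly,
    // so the key stays strongly reachable for as long as the map is reachable.
    static Map<Stack, StrongQueue> pinnedMap() {
        Map<Stack, StrongQueue> map = new WeakHashMap<>();
        Stack stack = new Stack();
        map.put(stack, new StrongQueue(stack));
        return map;
    }

    // Nothing references the key strongly once this method returns,
    // so the entry becomes eligible for removal.
    static Map<Stack, WeakQueue> collectableMap() {
        Map<Stack, WeakQueue> map = new WeakHashMap<>();
        Stack stack = new Stack();
        map.put(stack, new WeakQueue(stack));
        return map;
    }

    // Requests GC a few times and reports how many entries survive.
    // System.gc() is only a hint, hence the bounded retry loop.
    static int sizeAfterGc(Map<?, ?> map) {
        for (int i = 0; i < 10 && !map.isEmpty(); i++) {
            System.gc();
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("strong back-reference: " + sizeAfterGc(pinnedMap()));
        System.out.println("weak back-reference:   " + sizeAfterGc(collectableMap()));
    }
}
```

With the strong back-reference the entry can never be removed while the map is reachable, which is what the WeakHashMap javadoc warns about. With the weak back-reference the entry is normally cleared after a GC cycle, although when exactly depends on the collector.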
Let me know what you think.
Julien