So I was pretty happy with getting the CachingResourceManager plugged into my code, but I figured I would run some tests to make sure it was
actually caching stuff. Turns out it doesn't work. It actually has a number of fairly serious issues that seem to make it a no-op. I'm curious whether anyone has ever really tested it before, or if everyone just assumes it's working since it doesn't error. I spent a couple of days reverse engineering how it's supposed to work, and I'll try to break down all the issues I ran into categorically. Some of these issues are also in the DirectBufferCache and likely affect other uses of it outside of the CachingResourceManager. The good news is I have what appear to be fixes for many of these issues locally, so I can try to help with pull requests.
CacheEntry claimEnable() has code paths that leave invalid cache items
So the cyclomatic complexity of CachedResource's serve() method is pretty crazy. There are a ton of code paths which use non-idempotent methods like cacheItem.reference() or calls that require "un-doing" like cacheItem.claimEnable(). A number of those code paths leave cache entries in invalid states. Most notably, in its current form, a cached resource won't actually get cached until the SECOND time it's served (more on why below), but the serve() method tends to leave CacheEntries with an enabled status of "1", which means they can never be claimed by another thread, basically killing them. I had to completely refactor CachedResource.serve() to ensure any claimed CacheEntry which didn't get wrapped by the ResponseCachingSender was careful to disable the entry so it could get picked up by the next thread.
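To illustrate the invariant I refactored toward, here's a minimal sketch. This is a hypothetical model, not Undertow's actual classes: the states, the `handedToSender` flag, and the `serve()` helper are all illustrative. The point is the try/finally shape: any entry we claim but do not hand off to the caching sender must be disabled again, or no other thread can ever claim it.

```java
// Sketch of the claim/disable invariant (hypothetical model, not Undertow code).
public class ClaimSketch {
    static final int DISABLED = 0, CLAIMED = 1, ENABLED = 2;

    static class Entry {
        int state = DISABLED;
        boolean claimEnable() {            // only one thread may win the claim
            if (state != DISABLED) return false;
            state = CLAIMED;
            return true;
        }
        void enable()  { state = ENABLED; }
        void disable() { state = DISABLED; }
    }

    // 'handedToSender' models whether the entry was wrapped by a
    // ResponseCachingSender that takes over enabling/disabling it.
    static int serve(Entry e, boolean handedToSender) {
        boolean claimed = e.claimEnable();
        try {
            if (claimed && handedToSender) {
                e.enable();                // sender completed the cache fill
            }
            return e.state;
        } finally {
            if (claimed && !handedToSender) {
                e.disable();               // release the claim for the next thread
            }
        }
    }

    public static void main(String[] args) {
        Entry e = new Entry();
        serve(e, false);                   // a path that skips the caching sender
        System.out.println(e.state);       // back to DISABLED, claimable again
    }
}
```

With this shape, a code path that bails out before wrapping the sender can no longer strand the entry in the claimed state.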
CacheEntries aren't allocated right away, and are only allocated once the get() hit sampling runs (every 5 hits)
So this is the reason it takes TWO hits to a cached resource to actually cache it. There is no logic anywhere that actually asks newly created cache entries to allocate their buffers! This means the first time through CachedResource.serve(), the entry is unallocated, so it can't be cached. The ONLY code that eventually allocates an entry is inside the hit-sampling counter, which attempts to purge items from the cache to see if it can finally allocate buffers for unallocated entries. So it's only after 5 "hits" (the SAMPLE_INTERVAL) to the cache entry that it finally gets allocated. This clearly appears to be an oversight, as there's no reason not to immediately attempt buffer allocation for each new cache entry.
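The fix I'd propose looks roughly like this. Again, a sketch with made-up names (`add()`, `tryAllocate()`, a slice counter), not Undertow's API: attempt buffer allocation the moment an entry is added, and leave the hit sampler as a fallback for entries that couldn't get buffers at add time.

```java
// Hypothetical model of "allocate on add" instead of "allocate after 5 hits".
public class AllocSketch {
    static class Entry {
        boolean allocated;
        boolean tryAllocate(int freeSlices, int needed) {
            if (!allocated && freeSlices >= needed) allocated = true;
            return allocated;
        }
    }

    static int freeSlices = 8;   // stand-in for the slice pool's free regions

    // add() tries to allocate immediately; the sampling/purge path would only
    // need to retry for entries that failed here.
    static Entry add(int slicesNeeded) {
        Entry e = new Entry();
        if (e.tryAllocate(freeSlices, slicesNeeded)) freeSlices -= slicesNeeded;
        return e;
    }

    public static void main(String[] args) {
        Entry e = add(2);
        System.out.println(e.allocated); // allocated on the FIRST request, not the second
    }
}
```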
CacheEntries log way more hits than they should
I mentioned above that it takes 5 "hits" to a cache entry before it's allocated, so you may be wondering how it gets cached after only two HTTP requests. The reason is that the CacheEntry receives FOUR hits for every HTTP request due to multiple dataCache.get() calls throughout the code (such as getting the content length). So by the 2nd HTTP request, the cacheEntry is getting its 5th (through 8th) "hit", which finally manages to allocate it. To reduce contention and make the "hits" more accurate, I would introduce a "getQuiet()" method on the direct buffer cache which fetches an item from the cache without bumping the hit counter and triggering the cache reaping. I would have the CachedResource.serve() method be the only place that uses "get()" and have all the other places use "getQuiet()" so there is only a single "hit" registered per HTTP request serving the file. Ironically, all the extra hits helped mask the issue above.
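Here's a sketch of what I mean by "getQuiet()". This is my proposed addition, not an existing Undertow method, and the map-backed cache here is just a stand-in for the DirectBufferCache: same lookup as get(), but no hit accounting, so only serve() registers one hit per HTTP request.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed getQuiet() alongside the existing get().
public class QuietGetSketch {
    static final int SAMPLE_INTERVAL = 5;   // matches the sampling interval above

    static class Entry { int hits; }
    static final Map<String, Entry> cache = new HashMap<>();

    static Entry get(String key) {          // bumps hits, may trigger sampling
        Entry e = cache.computeIfAbsent(key, k -> new Entry());
        e.hits++;
        if (e.hits % SAMPLE_INTERVAL == 0) {
            // the purge/allocate sampling pass would run here
        }
        return e;
    }

    static Entry getQuiet(String key) {     // lookup only, no hit accounting
        return cache.get(key);
    }

    public static void main(String[] args) {
        get("index.html");                  // the one hit from serve()
        getQuiet("index.html");             // content-length lookup: no hit
        System.out.println(cache.get("index.html").hits); // 1
    }
}
```

With this split, the hit counter actually counts requests, and the SAMPLE_INTERVAL means what it appears to mean.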
ResponseCachingSender won't cache transfers, only sends
So I'll be honest-- I've read the code through about a dozen times and I still don't quite understand the difference between a sender that "transfers" and one that "sends". I also can't find a shred of information online about what "TransferMinSize" really does and what it should be set to. I see several of the Undertow test cases set the transfer min size to 10 MB, but I can't find any advice on how to tune it. The reason it matters is that any file larger than the transfer min size (or any request serving a range) uses a transfer, not a send (this logic is in the PathResource.serveImpl() method). This is problematic because the ResponseCachingSender doesn't implement any caching in its transferFrom() method. There's no explanation as to why, other than a comment simply saying "Transfer never caches". Can I get some clarity on why that is? This causes several issues, which I'll cover below.
PathResource.serveImpl() only uses serve() if the file is smaller than the transferMinSize, which means the ResponseCachingSender will never cache the buffers for files larger than the transferMinSize, regardless of the maxFileSize set in your CachingResourceManager
Given what I explained above, this creates a weird catch-22: files that are smaller than the CachingResourceManager's maxFileSize look like they are eligible to be cached as far as the CachedResource class is concerned. The CachedResource wraps the sender in a ResponseCachingSender with the expectation that
- The resource will be cached
- The cacheEntry will be enabled afterward (or disabled on error)
However, the transfer min size logic in the PathResource can create a separate code path through the ResponseCachingSender's transferFrom() method which never enables or disables the cacheEntry, leaving it in an invalid state and never caching the item. Furthermore, the default transfer min size is only 1024 bytes, which is very small and means all files over 1KB will never actually be cached! Is it expected that you always set your transfer min size to be at least as large as the caching resource manager's maxFileSize? This isn't documented anywhere and wasn't even obvious until I did some very deep debugging.
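To make the catch-22 concrete, here's a tiny model of the interaction as I understand it (assumed semantics from my debugging, not Undertow API): a file only ever gets cached when it is BOTH under the CachingResourceManager's maxFileSize AND under the PathResourceManager's transferMinSize, because larger files go through transferFrom(), which never caches.

```java
// Hypothetical predicate modeling which files actually get cached.
public class CacheGapSketch {
    static boolean willBeCached(long fileSize, long maxFileSize, long transferMinSize) {
        return fileSize <= maxFileSize      // looks cacheable to CachedResource
            && fileSize < transferMinSize;  // but only the send() path caches
    }

    public static void main(String[] args) {
        long maxFileSize = 1024 * 1024;     // a 1 MB cache limit
        long transferMin = 1024;            // the 1 KB default transfer min size
        System.out.println(willBeCached(512, maxFileSize, transferMin));     // true
        System.out.println(willBeCached(100_000, maxFileSize, transferMin)); // false: in the gap
    }
}
```

With the defaults, everything between 1 KB and maxFileSize falls into that uncached gap.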
Files smaller than the CachingResourceManager's maxFileSize but larger than a PathResourceManager's transferMinSize clog up the dataCache and keep trying to allocate buffers every time they are served, even though they will never be cached.
Furthermore (on the point above), all files small enough to be cached but larger than the transfer min size are not only not cached, but their CacheEntry instances are left in invalid states, clogging the dataCache with allocated buffers which will never be used. I think this needs some thought, as I can't find a way to detect from inside CachedResource.serve() that the file will not be cached in order to optimize. The CachedResource class doesn't know what specific type of Resource is being served, and the transferMinSize is specific only to the PathResourceManager, so there's just no way to get that information. I'd really like to know why we can't just cache the transfers, but assuming there's a good reason for that, there needs to be a better way for CachedResource.serve() to know when a resource isn't going to be cacheable and completely skip creating and allocating the CacheEntry instance for it. Otherwise, it's just a waste of CPU cycles and memory every time.
CachedResource.serve() references the cache entry twice, leaving the refs count at 2, meaning it can never be destroyed
This goes back to the crazy code paths I talked about in the first section above. The code in CachedResource.serve() can call the entry.reference() method twice, which creates two references to the item. Furthermore, there are code paths which reference the cacheEntry BUT opt not to use the ResponseCachingSender with its DereferencingCallback, leaving the cached entry in an invalid state where the number of refs gets stuck at 2. This means these cache entries can NEVER be destroyed, and their buffers will never be cleared, even after the cache entry is removed from the DirectBufferCache. The slices will stay used in the slice pool forever and will never even be garbage collected. It was pretty easy to put the cache in a completely invalid state by setting the size small enough to only have a few cache regions in the slice pool. Once they were all used, I could never allocate another cache entry since there were no more available regions in the slice pool, EVEN THOUGH all the cacheEntries had been removed from the cache. This is pretty bad as it renders the cache useless.
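The invariant I fixed toward is the usual reference-counting one, sketched below with stand-in classes (not Undertow's): exactly one reference() per serve, paired with exactly one dereference when the response completes, so the count can actually reach zero and destroy() can run.

```java
// Illustrative model of balanced reference counting (hypothetical classes).
public class RefSketch {
    static class Entry {
        int refs;
        boolean destroyed;
        void reference()   { refs++; }
        void dereference() { if (--refs == 0) destroy(); }
        void destroy()     { destroyed = true; /* free buffers here */ }
    }

    // One serve: a single reference, released by the completion callback
    // (playing the role of the DereferencingCallback).
    static void serveOnce(Entry e) {
        e.reference();
        // ... response is written ...
        e.dereference();   // callback fires exactly once
    }

    public static void main(String[] args) {
        Entry e = new Entry();
        serveOnce(e);
        System.out.println(e.refs);   // 0: the entry is destroyable again
    }
}
```

Any path that references twice, or references without arranging a dereference, breaks this balance permanently, which is exactly the stuck-at-2 behavior I observed.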
A DirectBufferCache$CacheEntry's destroy method doesn't actually release buffers
This item is similar to the one above, but instead of the refs counter being bumped too high so the cacheEntry can't be destroyed, here the actual destroy() method, when called, doesn't even work! Here's the code (from DirectBufferCache$CacheEntry.destroy()):
    private void destroy() {
        this.buffers = EMPTY_BUFFERS;
        for (PooledByteBuffer buffer : buffers) {
            buffer.free();
        }
    }
You can see this.buffers is replaced by EMPTY_BUFFERS BEFORE the buffers are looped over and freed, so the loop iterates over the empty array and frees nothing. This means the old buffers are never freed for any cache item, ever, so far as I can tell! The regions remain used in the slice pool forever, and the entire reaping mechanism of the direct buffer cache is moot: it may remove the cache entries, but the pool, once full, will never be able to allocate again. The order of those two operations needs to be reversed so the old buffers are freed BEFORE they are replaced by the empty buffers.
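Here's the one-line fix demonstrated in a self-contained form (a stand-in Buffer type instead of PooledByteBuffer so you can run it directly):

```java
// Demonstrates the corrected ordering in destroy(): free first, then swap.
public class DestroyFix {
    static class Buffer {
        boolean freed;
        void free() { freed = true; }
    }

    static final Buffer[] EMPTY_BUFFERS = new Buffer[0];
    Buffer[] buffers;

    DestroyFix(Buffer[] b) { buffers = b; }

    // Fixed order: free the old buffers FIRST, then replace the field.
    void destroy() {
        for (Buffer buffer : buffers) {
            buffer.free();
        }
        this.buffers = EMPTY_BUFFERS;
    }

    public static void main(String[] args) {
        Buffer b = new Buffer();
        DestroyFix entry = new DestroyFix(new Buffer[]{ b });
        entry.destroy();
        System.out.println(b.freed);   // true with the corrected order
    }
}
```

With the original ordering, the for loop reads the freshly assigned empty array, so `b.freed` would never become true.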
Note on slice relationship to files
This is more just something that wasn't obvious at first and wouldn't hurt to be documented somewhere. The number of slices in the direct buffer cache represents the max number of files which can be cached. I found this out when playing with larger slice sizes. Basically, one file may use 5 slices, but 1 slice can only be assigned to a single file. Therefore, a 10 MB cache with 1 MB slices can only cache 10 files, regardless of how small they are. The "left over" portions of the slices are just wasted. Again, this may be obvious to some people, but it wasn't to me at first, and I only figured it out via trial and error, so it's worth documenting somewhere. Most examples of how to use the direct buffer cache online were copied from the Undertow test suite, which tends to use very unrealistic numbers just for the sake of the test, such as this very non-production-ready example from the test suite, which you can see has worked its way into this online tutorial even though a cache with those settings would be nearly useless for a real app.
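The arithmetic behind that observation, as a quick runnable check (my own helper, reflecting the behavior I saw, not an Undertow method):

```java
// Since a slice is never shared between files, the slice count caps the
// number of cacheable files regardless of file size.
public class SliceMath {
    static long maxCachedFiles(long cacheBytes, long sliceBytes) {
        return cacheBytes / sliceBytes;   // one slice belongs to at most one file
    }

    public static void main(String[] args) {
        long cache = 10L * 1024 * 1024;   // 10 MB total cache
        long slice = 1L * 1024 * 1024;    // 1 MB slices
        System.out.println(maxCachedFiles(cache, slice)); // 10 files max, however small
    }
}
```

So when sizing the cache, pick the slice size for the number of files you want cached, not just for the largest file.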
-----------------------------------------
Whew-- so that represents about 3 full days of debugging the Undertow source and endless tests :) The good news is I have fixes for most of the issues described above working in my local build of Undertow, and the CachingResourceManager and DirectBufferCache seem to be working great. I can help by sending pull requests, but perhaps I should send everything I changed in one big pull because, honestly, many of the issues were so intertwined that the fixes only work together. I also added a few new methods to CacheEntry purely for debugging purposes. Perhaps leaving those in could be helpful for building a visualizer to see what a cache is doing; right now it's just a black box and you can't even tell if it's working. Please give me your thoughts on how you want to proceed. I'm really anxious to roll out this caching feature in my product, but it's basically DOA as it stands right now.
Thanks!
~Brad
Developer Advocate
Ortus Solutions, Corp