Waiting For Cache Optimization/Heuristic

59 views
Skip to first unread message

PhistucK

unread,
Jun 6, 2015, 4:09:46 PM6/6/15
to net-dev
Under heavy system load, I sometimes see "Waiting for cache..." for more than half a minute (which includes seeing the busy tab dialog box that lets you kill it).

While this is understandable (the system is under heavy load, a lot of processes are using the hard drive, or the memory in which the cached response resides is in the swap file and the like), I think there can be an optimization here - if "Waiting for cache..." is shown for more than X seconds and the cached response is less than Y bytes (where X is dependent on the size of Y), stop waiting for the cache and just re-fetch the content from the server.

For example, the cached response is less than a megabyte and 3 seconds have passed - stop waiting for cache and just re-fetch.

A better heuristic would be to also identify the connection speed and state (even a rough approximation) and change the X:Y ratio according to the speed (a slow or metered connection would wait 30 seconds for the a megabyte of cached response to be fetched).

(This happens on Windows 7, if it matters)

PhistucK

Darin Fisher

unread,
Jun 6, 2015, 4:53:39 PM6/6/15
to PhistucK, net-dev

This can also occur if we are waiting for write access to a cache entry, right? I think it may also happen if there is a writer and one or more readers waiting for the cache entry to be fully written so they can read it. (Or at least that used to be the case.)

-Darin

--
You received this message because you are subscribed to the Google Groups "net-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to net-dev+u...@chromium.org.
To post to this group, send email to net...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/net-dev/CABc02_J%3DOtJRiX-Co8a0KXfwuJCNwTR9NOTz4q%3DxrBYoi%2Bs9CA%40mail.gmail.com.

PhistucK

unread,
Jun 6, 2015, 4:56:32 PM6/6/15
to Darin Fisher, net-dev
I do not know, really. But I assume it happens because it tries to read a swapped memory, because I do not think my hard drive is that busy (then again, reading from swapped memory makes it busy for other operations...).

Perhaps you were not really asking me. :)


PhistucK

Chris Bentzel

unread,
Jun 7, 2015, 5:49:50 AM6/7/15
to PhistucK, Darin Fisher, net-dev
First - it'd be great to understand why we are blocking for 30 seconds. There is the "cache lock" which Darin mentions which could be a cause.

We've talked about this in the past though - basically bypassing the cache if it's taking too long. As you bring up, getting the right heuristics in place, particularly for metered connections. is going to take a lot of tuning. 

For writes, it might be good for us to pass data along to the URLRequestJob instead of waiting for writes to cache to succeed, but it also means that we need additional memory to buffer it.

PhistucK

unread,
Jun 7, 2015, 6:46:04 AM6/7/15
to Chris Bentzel, Darin Fisher, net-dev
How can I help you understand that? chrome:net-internals when it happens, or do you need me to go to it before it happens (or none of them)?

It happens under heavy system load, especially after I had not interacted with a background tab for a long time. Basically, the process was probably paged out (the tab content is white), but I go to the omnibox and press Control + Enter to get the page in a new tab (it is a Google Analytics report, if it matters. But it can happen in other pages as well, of course) and that is when it happens.


PhistucK

Chris Bentzel

unread,
Jun 8, 2015, 8:59:53 AM6/8/15
to PhistucK, Darin Fisher, net-dev
net-internals before it happens is pretty useful. You can always specify --log-net-log at command line and this will end up logging to a file rather than buffering in memory, which may help.

Thanks for details about how to recapture it.

PhistucK

unread,
Jun 13, 2015, 4:19:45 AM6/13/15
to Chris Bentzel, Darin Fisher, net-dev
Is a second or two of "Waiting for cache..." enough for you as a start?


PhistucK

Chris Bentzel

unread,
Jun 15, 2015, 7:40:45 AM6/15/15
to PhistucK, Darin Fisher, net-dev
If there's some net-internals associated with that could be useful.

PhistucK

unread,
Jun 15, 2015, 8:48:05 AM6/15/15
to Chris Bentzel, Darin Fisher, net-dev
Yes, I opened it beforehand. :)
I will share it with you later today.


PhistucK

Chris Bentzel

unread,
Jun 15, 2015, 8:49:05 AM6/15/15
to PhistucK, Darin Fisher, net-dev
Thanks!

PhistucK

unread,
Jun 15, 2015, 1:16:41 PM6/15/15
to Chris Bentzel, Darin Fisher, net-dev
Here (I granted access to Chris, if anyone else needs access, just request it) -


PhistucK

Matt Menke

unread,
Jun 15, 2015, 2:23:52 PM6/15/15
to PhistucK, Chris Bentzel, Darin Fisher, net-dev
There are three hanging requests (2 for *.gateway.messenger.live.com URLs, and one for client-channel.google.com).  The google one is definitely a deliberate hanging get, I assume the first two are as well.

So, looking at URL_REQUESTS, it looks like the cache thread (Or maybe the IO thread) was pretty much locked up from ~203175 to ~207870.  During that period, basically anything that round-tripped to the cache thread did very little for 4.5 seconds.  There aren't enough other things happening then to be sure if this was an issue with the cache thread or the IO thread.

Looking at the DISK_CACHE_ENTRY events is a bit more illuminating.

204191 to 205949:  Event source 256643 locks up the cache thread with a single 32k read attempt (And gets 15k).
205949 to 207397:  Event source 256644 locks up the cache with a 32k read of its own (And gets 25k).

Before that, there are a couple 100-200 millisecond reads as well, for getting cache entry metadata before those two longer reads.  So it looks like reads from disk as just being slow, for some reason.

Note that the block file cache does blocking reads one at a time on a single thread, though using more threads may not help if the drive is hiccuping, and the read start/read complete events are logged synchronously on that thread, so there's nothing else happening on the cache thread that could be distorting read time measurements.


PhistucK

unread,
Jun 15, 2015, 3:52:53 PM6/15/15
to Matt Menke, Chris Bentzel, Darin Fisher, net-dev
You can disregard anything unrelated to Google Analytics, really, since there were other unrelated open tabs as well (like outlook.com).

The drive was probably hiccuping, yes, it was under pretty heavy system load (meaning, the hard drive LED was constantly on - before, during and after it happened).

How do you want to proceed from here?


PhistucK

Matt Menke

unread,
Jun 15, 2015, 4:00:18 PM6/15/15
to PhistucK, Chris Bentzel, Darin Fisher, net-dev
I'll defer to the owners of the cache on what action we want to take here.

I just wanted to check if this was due to the cache's per-entry lock (Two requests trying to access the same entry at the same time) - it isn't, though some potential fixes would fix both issues, or at least work around them.

PhistucK

unread,
Jul 18, 2015, 9:03:53 AM7/18/15
to Matt Menke, Chris Bentzel, Darin Fisher, net-dev
I have a new log. I got another "Waiting for cache" for more than half a minute (after which I gave up) for mail.live.com. It did not even load chrome:net-internals at the time. Only about 5 minutes later, chrome:net-internals loaded. I proceeded to open mail.live.com again and after about 20 seconds, it opened.

You can find it here -

Chris, Darin and Matt were given access rights. If anyone else wants to access it, just request one.


PhistucK

Matt Menke

unread,
Jul 20, 2015, 1:14:11 PM7/20/15
to PhistucK, Chris Bentzel, Darin Fisher, net-dev
There are about 84 requests being blocked by the DownloadResourceThrottle.  They have been in that state for the entire time net-internals was open (4 hours, 45 minutes?) with no activity.  I'm not sure if they're relevant, though none look to be for files related to a mail.live.com page load.  I'm not sure how these fit into the described scenario - we don't back data log entries, or record log entries when net-internals isn't open.  So if net-internals is opened 15 minutes before the log is saved to disk, existing requests should claim  to be 15 minutes old, even if they were older.

Ignoring those, we have four hung connected streams, two for different *.gateway.messenger.live.com hosts, and one for *people.directory.live.com (The other is for a google URL) - I assume these are hanging gets used for a push-based stream.

Then we have a more interesting request to h.live.com.  This request looks to be hung for 15 seconds (Until the log dump was crated) with a CERT_VERIFIER_JOB.

After that, no other request takes more than 5 seconds, which isn't exactly zipped, but not sure any of them are related to the problem.  It looks like rtt varies from maybe 250 to 700 milliseconds to the various microsoft hosts we talk to.  It also looks like there's anywhere from a 70 to 1100 millisecond latency when reading from/writing to the cache, but we only report that state when starting up a request, not when reading the body, so I don't think that could account for such a long delay.

So I'm not really sure what the problem is.

PhistucK

unread,
Jul 20, 2015, 4:15:37 PM7/20/15
to Matt Menke, Chris Bentzel, Darin Fisher, net-dev
I mentioned that in the log message, I think, but anyway -
Those downloads were throttled long before I opened net-internals (net-internals was only open for less than ten minutes overall, even less than two minutes). Which was also weird, by the way... why were they throttled? Chrome asked me if I want to allow the page to download multiple files and I approved. I had to download twice or trice in order to get all of the 135 links (at first, it stopped after about 50). But this is irrelevant.

Once the page showed actual signs of active loading (the background showed up and so), everything was fast (or as usual). This is not the problem. The issue was that I waited for almost half a minute before it started to actively load the website.

The weird thing is, unless I am reading it incorrectly, the log never mentions that I actually tried to load mail.live.com, or https://dub122.mail.live.com/?fid=flinbox, you only see it as a referrer. How come it does not even mention it (while it did mention those hour (or more) old download attempts :|) given that I first opened net-internals and only then went to live.mail.com? :(


PhistucK

Matt Menke

unread,
Jul 21, 2015, 3:19:05 PM7/21/15
to PhistucK, Chris Bentzel, Darin Fisher, net-dev
The log shows 4 requests for https://mail.live.com/, all of which succeeded (On ran into a failure when trying to create a cache entry, curiously, but then went over the network and succeeded).  Only one of those was a main frame load, curiously.

It shows a request issued with a referrer of https://dub122.mail.live.com/?fid=flinbox, but does not show a request for that URL itself.  Note that requests served entirely out of blink's cache won't be shown on net-internals, since they never reach the network stack.

Matt Menke

unread,
Jul 21, 2015, 3:23:09 PM7/21/15
to PhistucK, Chris Bentzel, Darin Fisher, net-dev
Curiously, the log doesn't show any requests that hung for 30 seconds before continuing.  If this happens again, another log would be great.  Don't see a path forward here at the moment, unfortunately.

PhistucK

unread,
Jul 21, 2015, 4:48:24 PM7/21/15
to Matt Menke, Chris Bentzel, Darin Fisher, net-dev
I somehow missed those navigations when I search the log. Sorry.

I am reluctant to use the command line flag as it happens after a long time of using the browser and the log could blow up to over several gigabytes (also negatively affecting the performance of the system, or sparking a low disk space situation). But I understand I have no choice. :(


PhistucK

Matt Menke

unread,
Jul 21, 2015, 4:50:57 PM7/21/15
to PhistucK, Chris Bentzel, Darin Fisher, net-dev
You can try keeping net-internals open in the background, and checking "Discard old data under memory pressure" in the capture tab.  That should hopefully not reduce performance too much, and won't writ anything to disk (Modulo the swap file).

PhistucK

unread,
Jul 21, 2015, 4:52:56 PM7/21/15
to Matt Menke, Chris Bentzel, Darin Fisher, net-dev
Interesting. Thank you!


PhistucK
Reply all
Reply to author
Forward
0 new messages