Measuring load-time interactivity network delays

Maïeul Chevalier

4:12 AM (6 hours ago)
to web-vitals-feedback
> Disclaimer: I am a Qwik core maintainer. We are focused on web performance and we're looking for better ways to measure the Qwik performance gains compared to other web frameworks.

For web apps relying on JavaScript for interactivity, a page can appear fully rendered and visually complete yet be unable to respond to user input until the code has finished downloading. This applies to all current reactive frameworks like Angular, React, Vue, Svelte, Solid and even Qwik (although to a much lesser extent).

It is common for an interaction to be lost or delayed by a blocking download, but among the current Core Web Vitals metrics, only INP measures interactivity, and it does not account for network delays. I believe a metric that measures download delays in a stable and accurate way would be a great addition to Core Web Vitals and to the web. I know this is a long shot, so I'm hoping to get the ball rolling :).

This issue (1) describes the context we have from the Qwik framework side and what we are currently measuring, and (2) discusses the issues I see in the current and deprecated Google interactivity metrics regarding network delays (including INP, but also TBT, FID and TTI).

## Framework Context

### Qwik?

Qwik is a new reactive web framework similar to Angular, React, Vue, Svelte or Solid. Its main innovation is what we call "JavaScript streaming": other reactive web frameworks must execute, and therefore download, all of the JavaScript code present or visible on a page in order to become interactive, whereas Qwik can buffer JavaScript code bit by bit, prioritize preloading and executing the code needed for user events, and continue preloading the rest of the code when idle.

You can think of it like how video streaming compares to downloading. Instead of having to wait for the download to complete before being able to press play, users can usually start the video right away, and they can jump to other parts of the video, which resume instantly if the packets have already been buffered, or after a short delay if not. The difference from video is that JavaScript streaming is about avoiding code execution, and therefore the associated downloads.

Historically, reactive web frameworks were designed for CSR (Client Side Rendering). Only later did they introduce SSR (Server Side Rendering) to show content to the user sooner. They achieved this through a process called hydration, where the server generates and sends the HTML to the client, and the client then uses JavaScript to regenerate the tree and attach event listeners so that a given page/section becomes interactive. The problem with this approach is that because the framework uses JavaScript to attach event listeners, it still has to execute all of the code for that page/section and therefore must also download all of it.

### Manual testing

In manual local tests between Qwik, React, Vue, Svelte and Solid apps, on Firefox with 3G throttling and the CPU throttled to emulate a low-end device, on fairly similar applications, we measure input delays of only ~3s in Qwik vs ~10s to ~20s for the others (depending on the implementation, lazy-loading, etc.).

Those numbers already make for good demos, but mindful developers and lead engineers considering Qwik might be wary of such manual throttling measurements. For example, 3G throttling in Chrome adds a 2s latency penalty on each network request, which inflates the Qwik numbers to ~6s total because of its small-bundle streaming architecture, arguably more than is the case in reality. This is why we prefer to use Firefox's 3G throttling, as its 100ms latency yields more accurate results for Qwik in our experience.

Because of the 2s per-request latency penalty of Chrome's 3G throttling, we spent a fair amount of time looking into inlining the Qwik preloading logic or putting it into a worker instead of keeping it as a separate module. This would make for better Chrome demos, but we now understand that it is unclear whether it would benefit end users in reality. This is the kind of optimization we could only validate with a stable and accurate field metric. The answer might be that it depends on the type of application, but a field metric would at least help engineers make the right choice.

## State of the art of network delays measurement

Among INP, FID, TBT and TTI, TTI is currently the best metric we have at our disposal to measure network delays. The problem is that it is not only deprecated, but also too sensitive to outlier network requests and long tasks to be reliably measured in the lab, let alone in the field.

Even though INP, TBT and FID measure blocking delays, none of them effectively takes network delays into account.

For testing the metrics in real world conditions, I use the following links:
- https://qwikui.com/docs/styled/accordion/
- https://ui.shadcn.com/docs/components/radix/accordion
- https://shadcn-vue.com/docs/components/accordion
- https://www.shadcn-svelte.com/docs/components/accordion
- https://www.solid-ui.com/docs/components/accordion

Although optimization strategies might differ, those 5 libraries are all heavily inspired by ui.shadcn.com and therefore are somewhat comparable to one another.


### INP
As a CWV, INP does a fairly good job of tracking CPU delays and appears to be quite stable across the board, but it is pretty much blind to network delays and also seems to mis-report certain CPU delays.

The issues:
- INP cannot track user events until event listeners are attached, which only happens once hydration has completed. If a user clicks pre-hydration, INP will simply report a good ~10-20ms value, even though nothing meaningful happens from the user's perspective. If the user clicks the same element again post-hydration, the event listener is now attached and the real CPU delay, once the code has been downloaded, is recorded. Notice that the unresponsiveness of the first click is not captured by INP even though the user did experience it. This is easily reproducible in the performance tab with https://ui.shadcn.com/docs/components/radix/accordion.
- Even when event listeners are attached, INP can easily be fooled by in-flight network requests. In Qwik, while preloads are still ongoing, event listeners on the HTML trigger a few small scripts that preload the code for the user's events with priority and replay the events once that code is ready. Those are the 2-3s delays we experience on 3G throttling, but INP will report a delay of ~80ms. This is easily reproducible in the performance tab with https://qwikui.com/docs/styled/accordion/.
- When a user clicks during the hydration execution phase, which induces some long blocking tasks, INP will report a slower value as a result. On big apps it is not uncommon for this execution phase to last a few seconds, which users on good networks but low-end devices may very well encounter. This can be reproduced in https://ui.shadcn.com/docs/components/radix/accordion using low-end device calibration, comparing 3G throttling vs no network throttling and clicking repeatedly.
- In the performance tab, the hydration CPU execution phase might run for much longer than what INP reports, even when clicking repeatedly as soon as the element is visible. Here's a screenshot where hydration takes ~5s of scripting but INP only reports 2s: ![image](https://hackmd.io/_uploads/SJTo8rggMl.png)
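To make the second bullet more concrete, here is a minimal sketch of how the latency that INP sees relates to what the user perceives. It assumes the field names of the Event Timing API's `PerformanceEventTiming` entries (`startTime`, `processingStart`, `processingEnd`, `duration`). The `EventTimingLike` type, the `perceivedDelay` helper and its `effectPaintedAt` parameter are hypothetical names for illustration, and the numbers are made up to mirror the ~80ms vs 2-3s case above, not measured.

```typescript
// Subset of PerformanceEventTiming fields, redeclared so the sketch runs outside a browser.
interface EventTimingLike {
  startTime: number;       // when the user interacted (ms)
  processingStart: number; // when the first listener started running
  processingEnd: number;   // when the last listener finished
  duration: number;        // startTime until the next paint after the listeners ran
}

interface LatencyBreakdown {
  inputDelay: number;
  processingTime: number;
  presentationDelay: number;
}

// The three phases that make up the duration INP is based on.
function breakdown(e: EventTimingLike): LatencyBreakdown {
  return {
    inputDelay: e.processingStart - e.startTime,
    processingTime: e.processingEnd - e.processingStart,
    presentationDelay: e.startTime + e.duration - e.processingEnd,
  };
}

// Hypothetical helper: delay until the *meaningful* result of the interaction is
// painted, e.g. after the handler code was downloaded and the event replayed.
function perceivedDelay(e: EventTimingLike, effectPaintedAt: number): number {
  return Math.max(e.duration, effectPaintedAt - e.startTime);
}

// Illustrative click: the listener only schedules a download, so it returns quickly.
const sampleClick: EventTimingLike = {
  startTime: 1000,
  processingStart: 1010,
  processingEnd: 1030,
  duration: 80,
};

console.log(breakdown(sampleClick));            // inputDelay 10, processingTime 20, presentationDelay 50
console.log(perceivedDelay(sampleClick, 3500)); // 2500
```

In this sketch the entry's `duration` stays at 80ms because the listener returned quickly, while the accordion only opens 2.5s after the click.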

Considering all those issues, I believe it is fair to say that INP is not representative of the real user experience regarding interactivity and responsiveness.

### FID
FID also required event listeners to be attached in order to be recorded, so I take it that the same INP issues apply.

### TBT
TBT only measures the portion of each long task that exceeds the 50ms threshold. It does not measure network delays.
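As a minimal sketch of why that is, the following computes TBT over a list of main-thread tasks. It is a simplification: it only counts tasks that start between FCP and TTI and does not clip tasks that straddle those boundaries. The `Task` type and the sample numbers are assumptions for illustration.

```typescript
interface Task {
  start: number;    // ms since navigation start
  duration: number; // ms of main-thread work
}

const LONG_TASK_THRESHOLD_MS = 50;

// Sum, over all tasks in [fcp, tti), of the time by which each task exceeds 50ms.
function totalBlockingTime(tasks: Task[], fcp: number, tti: number): number {
  return tasks
    .filter((t) => t.start >= fcp && t.start < tti)
    .reduce((sum, t) => sum + Math.max(0, t.duration - LONG_TASK_THRESHOLD_MS), 0);
}

// Illustrative trace: the main thread is idle from 2250ms to 6000ms while
// code downloads, and that idle gap contributes nothing to TBT.
const sampleTasks: Task[] = [
  { start: 1200, duration: 30 },  // short task: 0ms blocking
  { start: 2000, duration: 250 }, // 200ms blocking
  { start: 6000, duration: 120 }, // 70ms blocking
];

console.log(totalBlockingTime(sampleTasks, 1000, 7000)); // 270
```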

### TTI
Because TTI does not rely on user inputs to measure interactivity delays, it **can** detect long delays caused by the first long task and network downloads that preceded it.

The issues:
- Outlier network requests can prevent it from working in the field.
- The long task 50ms threshold is arbitrary: there might be blocking downloads, but TTI will be as fast as FCP as long as there are no blocking tasks recorded.

TTI is therefore not a stable metric and cannot be used in the field. It is nevertheless the best metric we currently have at our disposal to take network delays into account.
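Both issues can be illustrated with a simplified sketch of the documented TTI definition: starting at FCP, look for a 5s quiet window with no long tasks and at most two in-flight network requests, then report the end of the last long task before that window (or FCP if there is none). The types, the `estimateTti` name, the `traceEnd` cut-off and the sample timelines are assumptions for illustration. This is not Lighthouse's implementation.

```typescript
interface LongTask { start: number; duration: number; } // tasks already known to exceed 50ms
interface NetRequest { start: number; end: number; }

const QUIET_WINDOW_MS = 5000;
const MAX_IN_FLIGHT = 2;

// True if [w, w + 5s) has no long task and never more than two requests in flight.
function isQuiet(w: number, tasks: LongTask[], requests: NetRequest[]): boolean {
  const windowEnd = w + QUIET_WINDOW_MS;
  if (tasks.some((t) => t.start < windowEnd && t.start + t.duration > w)) return false;
  // Concurrency can only rise at a request start, so probing those points is enough.
  const probes = [w, ...requests.map((r) => r.start).filter((s) => s > w && s < windowEnd)];
  return probes.every(
    (p) => requests.filter((r) => r.start <= p && p < r.end).length <= MAX_IN_FLIGHT
  );
}

// Returns the estimated TTI, or null if no quiet window fits before the trace ends.
function estimateTti(
  fcp: number, tasks: LongTask[], requests: NetRequest[], traceEnd: number
): number | null {
  const taskEnds = tasks.map((t) => t.start + t.duration);
  const candidates = [fcp, ...taskEnds, ...requests.map((r) => r.end)]
    .filter((c) => c >= fcp && c + QUIET_WINDOW_MS <= traceEnd)
    .sort((a, b) => a - b);
  for (const w of candidates) {
    if (isQuiet(w, tasks, requests)) {
      return Math.max(fcp, ...taskEnds.filter((e) => e <= w));
    }
  }
  return null;
}

// One slow blocking download and no long task: TTI collapses to FCP.
console.log(estimateTti(1000, [], [{ start: 1000, end: 9000 }], 15000)); // 1000

// Three parallel downloads followed by a 300ms hydration task: TTI is the end of that task.
const chunks: NetRequest[] = [1, 2, 3].map(() => ({ start: 1000, end: 4000 }));
console.log(estimateTti(1000, [{ start: 4000, duration: 300 }], chunks, 15000)); // 4300

// Three long-lived outlier requests: no quiet window, so no TTI at all.
const outliers: NetRequest[] = [1, 2, 3].map(() => ({ start: 1000, end: 20000 }));
console.log(estimateTti(1000, [], outliers, 15000)); // null
```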

## What next?

While INP already gives useful insights regarding interactivity, it paints an incomplete picture of what users actually experience. It would be great to have an interactivity metric that would reliably and accurately account for network delays.

I am curious to hear what you folks think about this, whether this is desirable, whether this is doable, etc. I haven't spent years dealing with the problem space so I'm probably missing a lot of things/context.

I have some ideas but I'll open a new thread for those as I imagine it is better to keep this thread focused on feedback and understanding of the problem.

Thanks,
Maïeul

Amit

4:41 AM (5 hours ago)
to web-vitals-feedback
Maïeul, 

It is not a sales pitch. It is not a place where you promote your framework.

Barry Pollard

4:49 AM (5 hours ago)
to Amit, web-vitals-feedback
FWIW, while I appreciate it's easy to go over the line, I didn't find the original mail overly pitchy, and it was honest up front and also gave good context on where this thinking was coming from. I think it raises interesting questions worthy of discussion, as will be shown by the larger response to it that I'm working on.

This is a moderated forum, so while we don't want it to be a promotion forum, we do encourage feedback based on people's experience. And sometimes (oftentimes?) that means including context of what you're working on. And that's fine as long as it doesn't take away from the intent of the post.

Also please do keep it respectful here.

Barry

P.S. I will say the markdown formatting is a little weird for an email (are we all AI now!?) and maybe suggests this could have been a blog post (and maybe was intended to be originally?).

--
You received this message because you are subscribed to the Google Groups "web-vitals-feedback" group.
To unsubscribe from this group and stop receiving emails from it, send an email to web-vitals-feed...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/web-vitals-feedback/29312e64-80c5-48a5-b694-e969620b9a17n%40googlegroups.com.

Barry Pollard

5:20 AM (5 hours ago)
to Maïeul Chevalier, web-vitals-feedback
Hi Maïeul,

I think you bring up a few key points, which I'll summarise as the following two points, if my understanding is correct:
  • INP is (and FID was) based around event handling measurement. No event handlers being attached means they are not measuring that type of interaction before an app is "hydrated".
  • INP does not measure the full impact of an interaction, especially of network interactions.
Those are both fair points and basically come down to the limitations of trying to set a broad metric that can be applicable to the web as a whole.

To explain some of the evolution of the thinking here, FID was one of our first interactivity metrics and was perhaps closer to what you're considering with a pre-event handler metric, in that the thinking was that it was important to measure that first delay time while the page was busy. There was also some concern that measuring more would create the wrong incentives. Despite its detractors now, FID had its usefulness when it first came about, and we did see good improvements in responsiveness, but it quickly showed its weaknesses in only measuring the delay and only for that first interaction.

INP was the next evolution and sought to measure all interactions AND more of the interaction. But yes, it's true that it still does not cover the two missing points you raise. Its primary intent was to encourage quick initial responsiveness, and so to keep a healthy bit of breathing room on the main thread (so just exiting the handler quickly, as we were concerned about with FID, would likely catch up with you in the end or, if done in a non-blocking manner, would be a good thing).

We believe INP is a good broad measure of page responsiveness that is broadly comparable across sites, and that it encourages good practices benefiting users. These were some of the key aims of the Web Vitals initiative.

However, INP is not a full end-to-end measurement of when an interaction has been fully (or even mostly) completed, and was not intended to be. This is more difficult to measure for a couple of reasons. For a start, it's quite difficult for the browser to know what processing is important to the user and what is not (e.g. screen updates likely are important, but sending analytics beacons is not). And secondly, it's impossible to fairly compare two very different interactions across a page (a video upload is going to take more time than opening a details/summary selector).

INP is intended to be a starting measure for site owners and for comparisons, rather than the end point. We encourage site owners to dig beyond this with custom metrics that can be hyper-specific to their particular sites.

At the moment I'm not convinced measuring network delays is necessarily a good user-centric measure. Some of these will impact users, but many (the analytics beacon example) will not. And therefore I think we should ideally look to measure the impact, rather than the potential impact.
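One hypothetical shape such a custom metric could take for the pre-hydration gap discussed in this thread is sketched below: for each click that lands before hydration completes, record how long the user waited until the page could actually respond. The `HydrationTimeline` type and function names are illustrative assumptions, and how a site learns its own `hydratedAt` timestamp is framework-specific and left out.

```typescript
interface HydrationTimeline {
  hydratedAt: number;   // when event listeners became active (ms since navigation start)
  clickTimes: number[]; // timestamps of user clicks on the component
}

// Wait experienced by each click that happened before hydration completed.
function preHydrationWaits(t: HydrationTimeline): number[] {
  return t.clickTimes
    .filter((click) => click < t.hydratedAt)
    .map((click) => t.hydratedAt - click);
}

// Worst wait on the page, or 0 if every click came after hydration.
function worstPreHydrationWait(t: HydrationTimeline): number {
  const waits = preHydrationWaits(t);
  return waits.length > 0 ? Math.max(...waits) : 0;
}

// Illustrative session: two clicks land before hydration at 8000ms, one after.
const session: HydrationTimeline = { hydratedAt: 8000, clickTimes: [2500, 4000, 8500] };

console.log(preHydrationWaits(session));     // [5500, 4000]
console.log(worstPreHydrationWait(session)); // 5500
```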

The new interaction-contentful-paint performance entry, being launched as part of the Soft Navigation API, allows more paints to be attributed to each interaction (at present only each larger paint, similar to LCP, but we are definitely thinking about expanding that to all paints). This allows measurement of the full time of an interaction until its largest paint, which goes a long way to solving the second point. Though as I say, unlike INP, this is not likely to be comparable across sites, or even interactions within a site, so is more of a custom metric.

For the first bullet, we don't yet have a good standard metric. A few RUM providers have experimented with measuring "rage clicks" (and Google has too btw!), which captures some of this, but also other things. I agree it would be good to do more thinking in this space...
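There is no standard definition of a rage click, so the following is only a minimal sketch of one possible heuristic: a burst of at least three clicks on the same target, each pair of consecutive clicks no more than a second apart once the burst has started. The thresholds, the `Click` type and the `countRageBursts` name are all assumptions, not what any particular RUM provider does.

```typescript
interface Click {
  time: number;   // ms since navigation start
  target: string; // some stable identifier for the clicked element
}

// Counts bursts of >= minClicks clicks on the same target within windowMs.
function countRageBursts(clicks: Click[], minClicks = 3, windowMs = 1000): number {
  const byTarget = new Map<string, number[]>();
  for (const c of clicks) {
    const times = byTarget.get(c.target) ?? [];
    times.push(c.time);
    byTarget.set(c.target, times);
  }

  let bursts = 0;
  byTarget.forEach((times) => {
    times.sort((a, b) => a - b);
    let i = 0;
    while (i + minClicks - 1 < times.length) {
      if (times[i + minClicks - 1] - times[i] <= windowMs) {
        bursts++;
        // Skip the remaining clicks that belong to the same burst.
        let j = i + minClicks;
        while (j < times.length && times[j] - times[j - 1] <= windowMs) j++;
        i = j;
      } else {
        i++;
      }
    }
  });
  return bursts;
}

// Illustrative session: four quick clicks on an unresponsive trigger, then one later click.
const clicks: Click[] = [1200, 1500, 1900, 2300, 9000].map((time) => ({
  time,
  target: "#accordion-trigger",
}));

console.log(countRageBursts(clicks)); // 1
```

A signal like this also fires for reasons unrelated to hydration (broken links, slow servers), which is the "other things" caveat above.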

