PSA: Adjacent CDATA sections in XML documents will now be merged

53 views
Skip to first unread message

David Bokan

unread,
Aug 21, 2023, 1:05:41 PM8/21/23
to blink-dev
Hi blink-dev,

I just landed https://crrev.com/f848cbce89b422 which will cause an XML document to merge sibling CDATA section nodes into a single CDATA section during parsing.

The reason for this is that libxml emits separate nodes at 300 byte boundaries for a CDATA section that's spread across more than one input chunk. If the CDATA is split across input chunks, libxml also emits separate nodes and this merging behavior was already applied to this case by https://crrev.com/2fdffd306d488 - the change I'm making generalizes this to also cover the 300 byte chunking behavior (see CL commit message for more details).

I'm hoping this is web-compatible but have added a flag-guard if we need to turn it off. If anyone sees any breakage related to this or knows specific sites/apps to check please let me know.

Thanks!
David

Mathias Bynens

unread,
Aug 22, 2023, 2:29:39 AM8/22/23
to David Bokan, blink-dev
Given that this presents a Web compat risk, is there a ChromeStatus entry + Intent thread for this?

--
You received this message because you are subscribed to the Google Groups "blink-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blink-dev+...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/blink-dev/4bfd9115-b843-4dc7-b10f-26b87d215689n%40chromium.org.

Yoav Weiss

unread,
Aug 22, 2023, 2:37:38 AM8/22/23
to Mathias Bynens, David Bokan, blink-dev
In particular, we recently clarified our process RE changes to existing behavior. It'd be good to answer the questions in "step 2" to understand if the breakage potential here is very unlikely.

David Bokan

unread,
Aug 22, 2023, 5:09:05 PM8/22/23
to Yoav Weiss, Mathias Bynens, blink-dev
Apologies, I wasn't aware the PSA guidance has changed. I've created https://chromestatus.com/feature/5168429241204736 to track the change.

As to an intent, I believed the breakage risk to be small so thought PSA was sufficient. I think so since we already can perform this kind of merge (from https://crrev.com/2fdffd306d488, granted, the case that CL merges is very unlikely to happen). Additionally, if this case does get hit, the resulting data would still be consistent, applications are likely to concatenate the CDATA sections anyway.

But I didn't have hard data so I did some digging today:

Via the XMLDocument UseCounter, 0.18% of page loads include an XMLDocument (this is the *upper bound* on potentially affected page loads).

I went looking through HTTPArchive data using the `requests` and `resposne_bodies` tables from 2023_08_01_desktop. I filtered requests to:

Content-Type header with "application/xhtml+xml"
request_type "Document"

Which I believe is the case where XMLDocument is used.

I then looked at the response bodies for those requests looking for adjacent CDATA sections: looking for instances of the string "]]><![CDATA[" and found none. (note, any kind of space, comment, other text would insert a node between them and thus prevent the merge).

Certainly not definitive but I expect this to be encountered very rarely and even when encountered, it's unlikely to cause an issue unless the page is using some kind of reflection (e.g. looking at `node.innerHTML` and then relying on there being separate CDATA sections for some reason).

Of all the cases I've seen in HTTPArchive, CDATA is used simply to ensure that <script> and <style> content is parsed correctly when loaded as XHTML.

Given all this, I think breakage is very unlikely (and we have a flag guard just in case) so I think PSA is enough but if anyone thinks a full intent is justified please let me know.

Thanks,
David


Yoav Weiss

unread,
Aug 22, 2023, 11:34:10 PM8/22/23
to David Bokan, Mathias Bynens, blink-dev
Thanks for digging into this. Given your investigation, I agree that a PSA+flag is sufficient here.
Reply all
Reply to author
Forward
0 new messages