Set Ready For Review
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Currently trying to figure out how to handle xml* and lit$ targets without unbounded buffering of both original + lowercased data. But I think we can land something and then update it.
// TODO(nrosenthal): allow the full unicode range of ID_Start?Consensus says no: https://github.com/whatwg/html/pull/12118#issuecomment-3847775588
token_.BeginProcessingInstruction();I made an editorial change to the spec to not create a PI token that we might later discard, which I also think made the handling of "xml" and "xml-stylesheet" easier to follow. This step should just be to clear the temp buffer and reconsume (below).
However, I now see that both spec and implementation only use the temp buffer to compare to the lowercase string "script", and doesn't use the buffer to "back up" and get the original characters as I've attempted.
I'm leaning towards changing the spec again to not use the temp buffer in this novel way, WDYT?
ParseError();Does it matter what character we're at when this is called, does this end up pointing to a specific location in DevTools or something?
I ask because the right location to point to for the "xml" and "xml-stylesheet" cases are probably the first "x" but we're way past that when we can see the error.
token_.Clear();I see that `token_.Clear()` is never called outside of HTMLTokenizer::Reset() before this change, just like the spec never said "discard the token".
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
EXPECT_EQ("<?foo ba|r ?>", SetAndGetSelectionText("<?foo ba|r ?>"))I guess this is in an XML document or https://chromium-review.googlesource.com/c/chromium/src/+/7531667 would have failed this test.
String processing_instruction_data_;Why can't this reuse data_? We don't need to use both at the same time, right?
TEST_P(HTMLDocumentParserTest, ProcessingInstructionNoQuestionMark) {Do you think these tests should be kept fairly minimal, or what goes here vs. WPT? Do these tests get the code coverage to 100%?
processing_instruction_test_equivalent("<?hey there?>", "<?hey there>");A few more processing_instruction_test_equivalent would be:
`<?hey?there>` equivalent to `<?hey ?there>` (with > or ?> as the closing syntax, doesn't matter)
`<?HEY THERE>` equivalent to `<?hey THERE>`.
processing_instruction_test("<?something ? >", [Can you also add a more basic trailing whitespace test like `<?something a >` to show that all trailing whitespace is preserved, not just following `?`?
"price$value",Throw in lit$123456789 in a compat category?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
EXPECT_EQ("<?foo ba|r ?>", SetAndGetSelectionText("<?foo ba|r ?>"))I guess this is in an XML document or https://chromium-review.googlesource.com/c/chromium/src/+/7531667 would have failed this test.
Acknowledged
String processing_instruction_data_;Why can't this reuse data_? We don't need to use both at the same time, right?
data_ is used for the target, as we can't use "name" since it has all kinds of special rules.
TEST_P(HTMLDocumentParserTest, ProcessingInstructionNoQuestionMark) {Do you think these tests should be kept fairly minimal, or what goes here vs. WPT? Do these tests get the code coverage to 100%?
Agreed, it was more about duplicating what what already there. Kept only one test for this.
// TODO(nrosenthal): allow the full unicode range of ID_Start?Consensus says no: https://github.com/whatwg/html/pull/12118#issuecomment-3847775588
Acknowledged
token_.BeginProcessingInstruction();I made an editorial change to the spec to not create a PI token that we might later discard, which I also think made the handling of "xml" and "xml-stylesheet" easier to follow. This step should just be to clear the temp buffer and reconsume (below).
However, I now see that both spec and implementation only use the temp buffer to compare to the lowercase string "script", and doesn't use the buffer to "back up" and get the original characters as I've attempted.
I'm leaning towards changing the spec again to not use the temp buffer in this novel way, WDYT?
Seems like that's what you've done?
ParseError();Does it matter what character we're at when this is called, does this end up pointing to a specific location in DevTools or something?
I ask because the right location to point to for the "xml" and "xml-stylesheet" cases are probably the first "x" but we're way past that when we can see the error.
I don't think we do anything with ParseError.
token_.Clear();Noam RosenthalI see that `token_.Clear()` is never called outside of HTMLTokenizer::Reset() before this change, just like the spec never said "discard the token".
Yea but without that we have an initialized token and we call "BeginComment" and assert. Should I not have initialized the "ProcessingInstruction" state until we know we have a valid target?
processing_instruction_test_equivalent("<?hey there?>", "<?hey there>");A few more processing_instruction_test_equivalent would be:
`<?hey?there>` equivalent to `<?hey ?there>` (with > or ?> as the closing syntax, doesn't matter)
`<?HEY THERE>` equivalent to `<?hey THERE>`.
Done the first.
I think the second one is no longer valid?
Can you also add a more basic trailing whitespace test like `<?something a >` to show that all trailing whitespace is preserved, not just following `?`?
Done
Throw in lit$123456789 in a compat category?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
String processing_instruction_data_;Noam RosenthalWhy can't this reuse data_? We don't need to use both at the same time, right?
data_ is used for the target, as we can't use "name" since it has all kinds of special rules.
I see, the name confused me. Will we still need this if we create a PI token early and just add to its target data?
token_.BeginProcessingInstruction();Noam RosenthalI made an editorial change to the spec to not create a PI token that we might later discard, which I also think made the handling of "xml" and "xml-stylesheet" easier to follow. This step should just be to clear the temp buffer and reconsume (below).
However, I now see that both spec and implementation only use the temp buffer to compare to the lowercase string "script", and doesn't use the buffer to "back up" and get the original characters as I've attempted.
I'm leaning towards changing the spec again to not use the temp buffer in this novel way, WDYT?
Seems like that's what you've done?
Yes, Henri thought we should just make target case-sensitive, so I did that. Now the spec creates a PI token as soon as seeing <? and if there's an error the target is read back and turned into a comment token instead. There's no observable difference between this and keeping a buffer until we know what token type it will be, but it made things more readable IMHO.
ParseError();Noam RosenthalDoes it matter what character we're at when this is called, does this end up pointing to a specific location in DevTools or something?
I ask because the right location to point to for the "xml" and "xml-stylesheet" cases are probably the first "x" but we're way past that when we can see the error.
I don't think we do anything with ParseError.
That's what I thought.
token_.Clear();Noam RosenthalI see that `token_.Clear()` is never called outside of HTMLTokenizer::Reset() before this change, just like the spec never said "discard the token".
Yea but without that we have an initialized token and we call "BeginComment" and assert. Should I not have initialized the "ProcessingInstruction" state until we know we have a valid target?
That's one option, but in the end I made the spec throw away a token instead, just like you're doing here.
processing_instruction_test_equivalent("<?hey there?>", "<?hey there>");Noam RosenthalA few more processing_instruction_test_equivalent would be:
`<?hey?there>` equivalent to `<?hey ?there>` (with > or ?> as the closing syntax, doesn't matter)
`<?HEY THERE>` equivalent to `<?hey THERE>`.
Done the first.
I think the second one is no longer valid?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
String processing_instruction_data_;Noam RosenthalWhy can't this reuse data_? We don't need to use both at the same time, right?
Philip Jägenstedtdata_ is used for the target, as we can't use "name" since it has all kinds of special rules.
I see, the name confused me. Will we still need this if we create a PI token early and just add to its target data?
This is AtomicHTMLToken, it's only created when we emit the valid token to the parser.
I changed it to "processing_instruction_target_" instead, is that clearer?
token_.BeginProcessingInstruction();Noam RosenthalI made an editorial change to the spec to not create a PI token that we might later discard, which I also think made the handling of "xml" and "xml-stylesheet" easier to follow. This step should just be to clear the temp buffer and reconsume (below).
However, I now see that both spec and implementation only use the temp buffer to compare to the lowercase string "script", and doesn't use the buffer to "back up" and get the original characters as I've attempted.
I'm leaning towards changing the spec again to not use the temp buffer in this novel way, WDYT?
Philip JägenstedtSeems like that's what you've done?
Yes, Henri thought we should just make target case-sensitive, so I did that. Now the spec creates a PI token as soon as seeing <? and if there's an error the target is read back and turned into a comment token instead. There's no observable difference between this and keeping a buffer until we know what token type it will be, but it made things more readable IMHO.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
String processing_instruction_data_;Noam RosenthalWhy can't this reuse data_? We don't need to use both at the same time, right?
Philip Jägenstedtdata_ is used for the target, as we can't use "name" since it has all kinds of special rules.
Noam RosenthalI see, the name confused me. Will we still need this if we create a PI token early and just add to its target data?
This is AtomicHTMLToken, it's only created when we emit the valid token to the parser.
I changed it to "processing_instruction_target_" instead, is that clearer?
Yep, that's clear. Just checking, now that the target can only be alphanumeric and hyphens, does name_ still have special rules that prevent reuse?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
String processing_instruction_data_;Noam RosenthalWhy can't this reuse data_? We don't need to use both at the same time, right?
Philip Jägenstedtdata_ is used for the target, as we can't use "name" since it has all kinds of special rules.
Noam RosenthalI see, the name confused me. Will we still need this if we create a PI token early and just add to its target data?
Philip JägenstedtThis is AtomicHTMLToken, it's only created when we emit the valid token to the parser.
I changed it to "processing_instruction_target_" instead, is that clearer?
Yep, that's clear. Just checking, now that the target can only be alphanumeric and hyphens, does name_ still have special rules that prevent reuse?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
String processing_instruction_data_;Noam RosenthalWhy can't this reuse data_? We don't need to use both at the same time, right?
Philip Jägenstedtdata_ is used for the target, as we can't use "name" since it has all kinds of special rules.
Noam RosenthalI see, the name confused me. Will we still need this if we create a PI token early and just add to its target data?
Philip JägenstedtThis is AtomicHTMLToken, it's only created when we emit the valid token to the parser.
I changed it to "processing_instruction_target_" instead, is that clearer?
Noam RosenthalYep, that's clear. Just checking, now that the target can only be alphanumeric and hyphens, does name_ still have special rules that prevent reuse?
Yea, it needs to be an HTML element name.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
String processing_instruction_data_;Noam RosenthalWhy can't this reuse data_? We don't need to use both at the same time, right?
Philip Jägenstedtdata_ is used for the target, as we can't use "name" since it has all kinds of special rules.
Noam RosenthalI see, the name confused me. Will we still need this if we create a PI token early and just add to its target data?
Philip JägenstedtThis is AtomicHTMLToken, it's only created when we emit the valid token to the parser.
I changed it to "processing_instruction_target_" instead, is that clearer?
Noam RosenthalYep, that's clear. Just checking, now that the target can only be alphanumeric and hyphens, does name_ still have special rules that prevent reuse?
Philip JägenstedtYea, it needs to be an HTML element name.
Is something else used for <my-button> then?
It gets the name "unknown" and then an additional string of "my-button". I find it a bit odd. Lots of asserts... we can use it awkwardly and say that it's an "unknown" with the target as the "real" name.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
String processing_instruction_data_;Noam RosenthalWhy can't this reuse data_? We don't need to use both at the same time, right?
Philip Jägenstedtdata_ is used for the target, as we can't use "name" since it has all kinds of special rules.
Noam RosenthalI see, the name confused me. Will we still need this if we create a PI token early and just add to its target data?
Philip JägenstedtThis is AtomicHTMLToken, it's only created when we emit the valid token to the parser.
I changed it to "processing_instruction_target_" instead, is that clearer?
Noam RosenthalYep, that's clear. Just checking, now that the target can only be alphanumeric and hyphens, does name_ still have special rules that prevent reuse?
Philip JägenstedtYea, it needs to be an HTML element name.
Noam RosenthalIs something else used for <my-button> then?
It gets the name "unknown" and then an additional string of "my-button". I find it a bit odd. Lots of asserts... we can use it awkwardly and say that it's an "unknown" with the target as the "real" name.
OK, let's not 😊
const AtomicString target_lower = target.AsAtomicString().LowerASCII();This works, but can't we use a ASCII case-insensitive compare instead of lowercasing the target?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
const AtomicString target_lower = target.AsAtomicString().LowerASCII();This works, but can't we use a ASCII case-insensitive compare instead of lowercasing the target?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Seems like there are tons of unrelated indentation/styling changes to VTS, can those be reverted?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Seems like there are tons of unrelated indentation/styling changes to VTS, can those be reverted?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
VTS lgtm; cc'ing kouhei@ (non-blocking) since he's usually interested in this kind of change.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Exportable changes to web-platform-tests were detected in this CL and a pull request in the upstream repo has been made: https://github.com/web-platform-tests/wpt/pull/57716.
When this CL lands, the bot will automatically merge the PR on GitHub if the required GitHub checks pass; otherwise, ecosystem-infra@ team will triage the failures and may contact you.
WPT Export docs:
https://chromium.googlesource.com/chromium/src/+/main/docs/testing/web_platform_tests.md#Automatic-export-process
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Support processing instructions in the HTML parser
The parser now recognizes <?target data> as a ProcessingInstruction and
adds it to the DOM instead of a bogus comment.
As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment.
We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?
Currently in this CL, PI targets are constrained to
/^[A-Za-z][A-Za-z0-9-]*$/.
Added a VTS that keeps current behavior, so that we don't lose some of
the existing html5lib tests while this is in development.
See spec PR: https://github.com/whatwg/html/pull/12118
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
The WPT PR for this CL has been merged upstream! https://github.com/web-platform-tests/wpt/pull/57716
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |