[bug] meta itemprop being modified by pagespeed (schema.org markup)

Luke Kenneth Casson Leighton

Aug 26, 2017, 8:32:57 AM
to mod-pagesp...@googlegroups.com
https://search.google.com/structured-data/testing-tool#url=http%3A%2F%2Fhorseboxseller.com%2Fhorseboxes-for-sale%2F

the source for the site most definitely does not insert a space before
the itemprop="currency" yet when running pagespeed it's modifying the
thing to add a space at the front! most likely because of the use of
"meta" as a tag. meta itemprop="currency" becomes meta itemprop="
currency". i can assure you that the source code for the site does
*not* do that.

is there any way to track that and get it disabled? any properties
that can be added to tell pagespeed to leave the tag(s) completely
alone?

l.

Luke Kenneth Casson Leighton

Aug 26, 2017, 8:53:31 AM
to mod-pagesp...@googlegroups.com
On Sat, Aug 26, 2017 at 1:32 PM, Luke Kenneth Casson Leighton
<lk...@lkcl.net> wrote:
> https://search.google.com/structured-data/testing-tool#url=http%3A%2F%2Fhorseboxseller.com%2Fhorseboxes-for-sale%2F
>
> the source for the site most definitely does not insert a space before
> the itemprop="currency" yet when running pagespeed it's modifying the
> thing to add a space at the front! most likely because of the use of
> "meta" as a tag. meta itemprop="currency" becomes meta itemprop="
> currency". i can assure you that the source code for the site does
> *not* do that.

.. but i did have a bug in the HTML: the closing quote of the href
value was missing (though the > was there) in this a href:

<a href="/horseboxes-for-sale/detail/1793/3.5-ton/2007-vauxhall-movano-long-stall-new-build-stalled-for-2-rear-facing/">
<span itemprop="currency" content="GBP">&#163;</span>
<span itemprop="price" content='17450'>17450
</span>
<meta itemprop="condition" content="used"/>
</a>

that caused pagespeed's HTML parser to go a bit.... skewiffy,
resulting in it not only inserting the a href as a <a href="xxxxxx" />
and leaving in what the browsers determined to be a faulty </a> later
on, but it then messed up the following tag...

ahhh i know what happened: it obviously thought that the *next* quote
- which happened to be on that span - was a *closing* quote.... so
added a space after it!

whoops that's bad. pagespeed shouldn't really be messing with the
HTML like that. i mean it's good that it did, so that the bug in the
page content was found (very indirectly) but it doesn't strike me as
being sensible to make subtle modifications like that.

yeeeees i knooow, i have comment-stripping etc. etc. all switched on... :)

l.

Joshua Marantz

Aug 26, 2017, 11:07:55 AM
to mod-pagespeed-discuss
It was definitely a goal of the HTML parser to interpret lexically broken HTML the same way browsers do, but I think you've found a case where that doesn't happen.  I think this is what happened:

  <a href="/horseboxes/path/>
  <span itemprop="currency" content="GBP">

and I think what PageSpeed did was give the href a value of "/horseboxes/path/>\n  <span itemprop=" and then considered currency to be a new attribute-name. This is how Chrome interprets this lexical structure:


I think that's actually pretty similar to PageSpeed up until the interpretation of currency" where Chrome considers the quote to be part of the attribute name, and its lexer has recovered by the time it hits content="GBP".  I think PageSpeed probably doesn't recover at all, and relies on quotes to be balanced.
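The quote-balancing behaviour guessed at above can be sketched in a few lines of Python. This is only an illustration of the hypothesis, not PageSpeed's actual parser: `lex_attributes_naive` and `serialize` are hypothetical names, and the lexer simply assumes that every opening quote is closed by the next matching quote, with no recovery.

```python
def lex_attributes_naive(html, start):
    """Lex attributes from `start` (just past the tag name), trusting quotes to balance.

    Returns (attrs, pos): a list of (name, value) pairs, where value is None for a
    bare attribute, and the index just past the '>' that ended the tag.
    """
    attrs = []
    i, n = start, len(html)
    while i < n:
        # Skip whitespace between attributes.
        while i < n and html[i].isspace():
            i += 1
        if i >= n:
            break
        if html[i] == '>':
            return attrs, i + 1
        if html[i] == '/':
            i += 1
            continue
        # Attribute name: everything up to whitespace, '=' or '>'.
        j = i
        while j < n and not html[j].isspace() and html[j] not in '=>':
            j += 1
        name, i, value = html[i:j], j, None
        if i < n and html[i] == '=':
            i += 1
            if i < n and html[i] in '"\'':
                # No recovery: the value runs to the next matching quote,
                # even if that quote belongs to a later tag.
                end = html.find(html[i], i + 1)
                if end == -1:
                    end = n
                value, i = html[i + 1:end], end + 1
            else:
                j = i
                while j < n and not html[j].isspace() and html[j] != '>':
                    j += 1
                value, i = html[i:j], j
        attrs.append((name, value))
    return attrs, i


def serialize(tag, attrs):
    """Re-emit a start tag with a single space between attributes."""
    parts = [tag]
    for name, value in attrs:
        parts.append(name if value is None else f'{name}="{value}"')
    return '<' + ' '.join(parts) + '>'


BROKEN = '<a href="/horseboxes/path/>\n  <span itemprop="currency" content="GBP">'

if __name__ == "__main__":
    attrs, _ = lex_attributes_naive(BROKEN, len('<a'))
    for name, value in attrs:
        print(repr(name), repr(value))
    print(serialize('a', attrs))
```

With the broken input, the href value swallows everything up to the quote that opens the itemprop value, and re-serialising with one space between attributes gives `itemprop=" currency"`, which matches the stray space reported at the top of the thread.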

In any case, it's probably good you find the imbalanced quote one way or another.
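One way to catch this kind of imbalanced quote before it reaches a live site is a rough lint pass over the page source. The sketch below is a heuristic only, and `find_suspicious_attribute_values` is a hypothetical helper, not part of PageSpeed or any validator: it flags quoted attribute values that contain a `<`, which usually means a closing quote went missing and the value ran on into the next tag. A raw `<` is technically allowed inside an attribute value, so false positives are possible.

```python
import re

# name = "value" or name = 'value'; the value runs to the next matching quote.
ATTR_RE = re.compile(r'([\w:-]+)\s*=\s*(["\'])(.*?)\2', re.DOTALL)


def find_suspicious_attribute_values(html):
    """Return (line_number, attribute_name, value) for values that contain '<'."""
    hits = []
    for match in ATTR_RE.finditer(html):
        name, value = match.group(1), match.group(3)
        if '<' in value:
            # 1-based line number of the attribute name in the source.
            line = html.count('\n', 0, match.start()) + 1
            hits.append((line, name, value))
    return hits


BROKEN = '<a href="/horseboxes/path/>\n  <span itemprop="currency" content="GBP">'
FIXED = '<a href="/horseboxes/path/">\n  <span itemprop="currency" content="GBP">'

if __name__ == "__main__":
    for line, name, value in find_suspicious_attribute_values(BROKEN):
        print(f"line {line}: {name} value looks unterminated: {value!r}")
```

Running it against the broken snippet reports the href on line 1, while the corrected snippet produces no findings.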

-Josh



Luke Kenneth Casson Leighton

Aug 26, 2017, 11:47:52 AM
to mod-pagesp...@googlegroups.com


On Sat, Aug 26, 2017 at 4:07 PM, 'Joshua Marantz' via mod-pagespeed-discuss <mod-pagesp...@googlegroups.com> wrote:
> It was definitely a goal of the HTML parser to interpret lexically broken HTML the same way browsers do,

that's a very good (important) goal, and also one that's really tricky to do right, given that different engines will interpret HTML (incorrect/bad HTML) differently.

> but I think you've found a case where that doesn't happen.  I think this is what happened:
>
>   <a href="/horseboxes/path/>
>   <span itemprop="currency" content="GBP">
>
> and I think what PageSpeed did was give the href a value of "/horseboxes/path/>\n  <span itemprop="

yehyeh... then puts a space after it... because that's what you do after you have an attribute in quotes...

> and then considered currency to be a new attribute-name.

.. with no spec / value... yeah makes sense.

> This is how Chrome interprets this lexical structure:
>
> I think that's actually pretty similar to PageSpeed up until the interpretation of currency" where Chrome considers the quote to be part of the attribute name, and its lexer has recovered by the time it hits content="GBP".  I think PageSpeed probably doesn't recover at all, and relies on quotes to be balanced.
>
> In any case, it's probably good you find the imbalanced quote one way or another.


 y'tellin me - live site and i messed up the structured data so the search engines freak out and start de-ranking the page...

l.