Linter doesn't recognize html5, and chokes if more than a thousand statements

The Bicycling Guitarist

Oct 6, 2015, 3:19:44 PM
to Structured Data on the Web
I've been adding some of the newer schema.org markup to one of my older pages, and your linter was handling it until I exceeded a thousand statements on the page. Even then, if I used direct input I could *sometimes* get results, but they looked strange compared to the ones I was used to: all the @id schema.org and productontology.org classes appeared at the top, and "_::g" numbers appeared in the boxes. I hope these results are useful to someone. I also checked the page in the Nu validator to make sure it is valid HTML5, and it is. But as I came close to or exceeded 1100 statements on the page, the linter started telling me there is no structured data on it when I know darn well there is.

The Google and Yandex tools read the page just fine. Google says "all good" and Yandex wants more information but still reads the whole thing. I really like the nested boxes of the Structured Data Linter. They make the nested relationships much easier to see, even more than indenting the HTML code on the page does. I am so sad that your linter will not recognize legitimate HTML5 tags and that it chokes when I have over a thousand statements on the page. The other tools I mentioned do read the page but don't give that cool summary at the end like yours does (or used to do!)


I also found that if I removed the "www." from the front of my web address, I could sometimes get results that I could not get with the canonical "www." in place. With the "SmartOptimizer" on my server, switching the case of some letters in the domain name also *sometimes* gave results that would not appear otherwise. But my page is valid HTML5 and has valid schema.org markup, and your tool no longer reads it or gives results. That makes me sad because, of the three tools I mentioned, yours is by far the most pleasing to look at and the easiest to understand.

Gregg Kellogg

Oct 6, 2015, 3:48:51 PM
to structure...@googlegroups.com
On Oct 6, 2015, at 11:13 AM, The Bicycling Guitarist <ch...@thebicyclingguitarist.net> wrote:

I've been adding some of the newer schema.org markup to one of my older pages, and your linter was handling it until I exceeded a thousand statements on the page. Even then, if I used direct input I could *sometimes* get results, but they looked strange compared to the ones I was used to: all the @id schema.org and productontology.org classes appeared at the top, and "_::g" numbers appeared in the boxes. I hope these results are useful to someone. I also checked the page in the Nu validator to make sure it is valid HTML5, and it is. But as I came close to or exceeded 1100 statements on the page, the linter started telling me there is no structured data on it when I know darn well there is.

The Google and Yandex tools read the page just fine. Google says "all good" and Yandex wants more information but still reads the whole thing. I really like the nested boxes of the Structured Data Linter. They make the nested relationships much easier to see, even more than indenting the HTML code on the page does. I am so sad that your linter will not recognize legitimate HTML5 tags and that it chokes when I have over a thousand statements on the page. The other tools I mentioned do read the page but don't give that cool summary at the end like yours does (or used to do!)



The latest version of the Linter tries to ignore the HTML5 errors reported by Nokogiri, which is used for HTML validation but has inadequate HTML5 support. I’m hopeful that I’ll be able to use Gumbo for parsing and validation in the future, which should handle HTML5 correctly, but it’s not quite ready yet.

My last check with the linter returned snippets and a structure, but it still flags the “main” tag as an error, which is something I missed. When it fails to return data (as with my first check), the likely cause is that generating the output takes longer than the 30 seconds Heroku allows per request; trying again worked, as the app was warmed up by then. I’m afraid that with a free plan on Heroku I’m limited in how much time queries can take. A better architecture would use worker threads to do the expensive work asynchronously, but as the linter has no budget, this isn’t really feasible. (Of course, if someone wanted to underwrite development and support, or offered resources, this could be revisited.)
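
To make the worker idea above concrete, here is a minimal sketch of the pattern in Python (not the linter's own Ruby code). The request handler hands the expensive work to a background pool and returns a job id at once, and the client polls for the result, so no single request has to fit inside the 30-second window. `JobQueue` and `render_report` are hypothetical names invented for this illustration.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


def render_report(statements):
    # Hypothetical stand-in for the expensive step (snippets, nested boxes).
    return {"statement_count": len(statements)}


class JobQueue:
    """Accept work immediately and let the client poll for the result."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, statements):
        # Return an id right away so the HTTP request itself stays short.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(render_report, statements)
        return job_id

    def status(self, job_id):
        # Cheap call, meant to be hit repeatedly by the client.
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "pending"}
        return {"state": "done", "result": future.result()}

    def wait(self, job_id, timeout=None):
        # Blocking variant, handy for scripts and tests.
        return self._jobs[job_id].result(timeout=timeout)
```

A client would call `submit`, then call `status` every second or so until the state is `done`. A real deployment would also need a separate worker process and a way to expire old jobs, which this sketch leaves out.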

If you’d like to better integrate the linter into your process, you could consider running it locally, where you can control the resources available to it. Cloning the repository from https://github.com/structured-data/linter allows you to run it on localhost using rackup or foreman. Also, the core work of validation is done by the rdf-rdfa and rdf-reasoner gems, which allow you to get many of the benefits without needing an HTTP interface.

The simplest re-write would generate the snippet and tabular graph output on the client side, as parsing the document takes just a small fraction of the time necessary to generate the output. As a workaround, I may add options to skip the snippet and/or hierarchical result when processing time is an issue, or do this in a separate background request. Also, a warning that behavior may deteriorate for larger documents might be useful.
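
The "skip the expensive output when time is short" workaround could look roughly like the Python sketch below. It is an illustration under stated assumptions rather than the linter's actual code: `lint_with_budget`, the stage names, and the injectable `clock` parameter are all made up here. Parsing always runs, and each optional stage runs only if the time budget has not already been used up.

```python
import time


def lint_with_budget(parse, optional_stages, budget_seconds, clock=time.monotonic):
    """Always parse; run optional output stages only while time remains."""
    deadline = clock() + budget_seconds
    result = {"graph": parse(), "skipped": []}
    for name, stage in optional_stages:
        # Checked once before each stage; a running stage is never interrupted.
        if clock() >= deadline:
            result["skipped"].append(name)
            continue
        result[name] = stage(result["graph"])
    return result
```

The `skipped` list gives the caller something to show the user, for example a note that the hierarchical view was omitted for a large document. Because the check happens only between stages, the budget should be set well below the hard limit.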

In the meantime, the linter isn’t really intended for production use, but rather as a way for developers to check the markup they’re generating; larger graphs become increasingly expensive to process.

I also found that if I removed the "www." from the front of my web address, I could sometimes get results that I could not get with the canonical "www." in place. With the "SmartOptimizer" on my server, switching the case of some letters in the domain name also *sometimes* gave results that would not appear otherwise. But my page is valid HTML5 and has valid schema.org markup, and your tool no longer reads it or gives results. That makes me sad because, of the three tools I mentioned, yours is by far the most pleasing to look at and the easiest to understand.

I’m glad you like the tool. I suspect the difference you’re seeing with different URLs is because repeated use of the same URL uses a cached document, which leaves more time for analysis. When you use a variation on the URL that isn’t in the cache, or if the cached copy isn’t fresh, the linter needs to fetch the URL remotely, which takes available time away from analysis.
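
The cache effect described above can be illustrated with a small Python sketch. It assumes, purely for illustration, that the cache is keyed on the exact URL string with no normalisation of "www." or letter case; how the linter's real cache keys its entries is not stated in this thread. `FetchCache` is a hypothetical name.

```python
class FetchCache:
    """Document cache keyed on the exact URL string (an assumption)."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._docs = {}
        self.remote_fetches = 0

    def get(self, url):
        # No normalisation: "www." and host-case variants are distinct keys.
        if url not in self._docs:
            self.remote_fetches += 1
            self._docs[url] = self._fetch(url)
        return self._docs[url]
```

Under that assumption, `http://www.example.org/` and `http://example.org/` each cost one remote fetch the first time they are requested, while repeating either URL is served from memory and leaves the whole time budget for analysis.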

Also, note that bugs can be reported at https://github.com/structured-data/linter/issues, which makes them easier to track.

Gregg
