The latest version of the linter tries to ignore HTML5 errors reported by Nokogiri, which is used for HTML validation but has inadequate HTML5 support. I’m hopeful that I’ll be able to use Gumbo for parsing/validating in the future, which should be correct, but it’s not quite ready yet.
My last check with the linter returned snippets and a structure, but it still flags the “main” tag as an error, which is something I missed. When it fails to return data (as with my first check), the likely cause is that generating the output took longer than the 30 seconds Heroku allows; trying again worked because the app was warmed up. I’m afraid that with a free plan on Heroku I’m limited in how much time queries can take. A better architecture would use worker threads to do the expensive work asynchronously, but as the linter has no budget, this isn’t really feasible. (Of course, if someone wanted to underwrite development and support, or offered resources, this could be revisited.)
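To make the worker idea concrete, here is a minimal sketch using only Ruby’s standard library. It is not how the linter works today, and the names (LintJobs, submit, status) are hypothetical: a request would enqueue the URL and return a job id right away, and the client would poll for the result instead of waiting inside the 30-second window.

```ruby
require 'securerandom'

# Hypothetical in-process job runner; illustrative only, not part of the linter.
class LintJobs
  # The block passed to the constructor stands in for the expensive lint step.
  def initialize(workers: 2, &work)
    @work = work
    @queue = Queue.new
    @results = {}
    @lock = Mutex.new
    @threads = Array.new(workers) { Thread.new { run } }
  end

  # Enqueue a URL and return a job id immediately, so the request can finish fast.
  def submit(url)
    id = SecureRandom.hex(8)
    @lock.synchronize { @results[id] = { status: :pending } }
    @queue << [id, url]
    id
  end

  # Return a copy of the current state of a job, or nil for an unknown id.
  def status(id)
    @lock.synchronize { @results[id]&.dup }
  end

  private

  # Worker loop: block until a job arrives, run it, and record the outcome.
  def run
    while (job = @queue.pop)
      id, url = job
      outcome = begin
        { status: :done, output: @work.call(url) }
      rescue StandardError => e
        { status: :failed, error: e.message }
      end
      @lock.synchronize { @results[id] = outcome }
    end
  end
end
```

Something like `jobs = LintJobs.new { |url| expensive_lint(url) }` followed by `id = jobs.submit(url)` and repeated `jobs.status(id)` calls would be the usage, where expensive_lint is a placeholder for the real work. On Heroku this would more likely be a separate worker process than threads in the web process, which is where the cost comes in.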
If you’d like to better integrate the linter into your process, you could consider running it locally, where you can control the resources available to it. Cloning the repository from https://github.com/structured-data/linter allows you to run it on localhost using rackup or foreman. Also, the core work of validation is done by the rdf-rdfa and rdf-reasoner gems, which let you get many of the benefits without needing an HTTP interface.
The simplest re-write would generate the snippet and tabular graph output on the client side, as parsing the document takes just a small fraction of the time necessary to generate the output. As a workaround, I may add options to skip the snippet and/or hierarchical result when processing time is an issue, or do this in a separate background request. Also, a warning that behavior may deteriorate for larger documents might be useful.
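As a rough illustration of the skip option, the sketch below always runs the parse step and then generates the snippet and hierarchical output only while enough of the time budget remains. Everything here is an assumption for illustration: the method name, the stage names, the 5-second reserve, and the use of a block to stand in for the real work; the 30-second budget is Heroku’s request limit.

```ruby
# Hypothetical time-budget check; none of these names exist in the linter.
BUDGET_SECONDS = 30.0

# Monotonic clock, so the measurement is not affected by wall-clock changes.
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# started_at is the monotonic time at which the request began.
# The block receives a stage name (:parse, :snippet, :hierarchy) and does the work.
def lint_with_budget(started_at, budget: BUDGET_SECONDS, reserve: 5.0)
  # Parsing is cheap and always runs.
  result = { graph: yield(:parse), skipped: [] }

  # The expensive outputs run only if more than `reserve` seconds are left.
  %i[snippet hierarchy].each do |stage|
    remaining = budget - (monotonic_now - started_at)
    if remaining > reserve
      result[stage] = yield(stage)
    else
      result[:skipped] << stage
    end
  end

  result
end
```

The skipped list could be used to show a warning that some output was omitted because the document was too large to process in time.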
In the meantime, the linter isn’t really intended for production use, but rather as a way for developers to check the markup they’re generating; larger graphs become increasingly expensive to process.
I’m glad you like the tool. I suspect the difference you’re seeing with different URLs is because repeated use of the same URL uses a cached document, which leaves more time for analysis. When you use a variation on the URL that isn’t in the cache, or if the cached copy isn’t fresh, the linter needs to fetch the URL remotely, which takes available time away from analysis.
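The effect can be shown with a small sketch of a cache keyed by the exact URL string. This is not the linter’s actual cache, only an illustration; the class name, the max_age value, and the injectable clock are all hypothetical.

```ruby
# Hypothetical document cache; illustrative only, not the linter's implementation.
class DocumentCache
  def initialize(max_age: 300, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @max_age = max_age # seconds a cached copy counts as fresh
    @clock = clock     # injectable so freshness can be tested without sleeping
    @entries = {}
  end

  # Returns [body, :hit] when a fresh copy is cached.
  # Otherwise calls the block to fetch the document and returns [body, :miss].
  def fetch(url)
    now = @clock.call
    entry = @entries[url]
    if entry && now - entry[:stored_at] <= @max_age
      [entry[:body], :hit]
    else
      body = yield(url) # the slow remote fetch happens only here
      @entries[url] = { body: body, stored_at: now }
      [body, :miss]
    end
  end
end
```

Because the key is the exact string, http://example.org/page and http://example.org/page?x=1 are separate entries, so a small variation on the URL pays for the remote fetch again.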