I came across Marpa::HTML
(http://search.cpan.org/dist/Marpa-HTML/lib/Marpa/HTML/Doc/HTML.pod)
and wondered if anyone might be interested, or perhaps had used it?
The author claims that his underlying parsing engine
(http://search.cpan.org/~jkegl/Marpa-0.204000/lib/Marpa/Doc/Marpa.pod)
can parse any context-free grammar in O(n^3) time and any LL/LR/etc
grammar in O(n) time, and has nice error reporting. On the other hand,
he says it's alpha software (and Marpa::XS, released only a month ago,
is presumably doubly alpha).
There's more on Marpa and the author's mission to kill Yacc at his
blog, http://blogs.perl.org/users/jeffrey_kegler/.
Miles.
At least when I've gone down this rabbit hole, the problem isn't the parsing
per se, but rather given this:
<html>
<h1>hi</
<tr><td>foo</tr></td
</body>
</html>
What the bleedin' feck does it mean?
(no opening body tag, borked closing h1, no table tags, missing > on the
closing td & the closing td/tr being out of order).
The "problem" is that browsers will display that just fine (for some value of
"fine") and thus you do get real world insanity like this when your home brew
web crawler starts chewing on public web pages.
You're trying to extract meaning from the markup, but the markup is ambiguous,
even after a successful parse.
What is the heading? Is the content tabular? What is the "text" for that page?
Marpa seems to have some smarts for missing/broken HTML, but I'd wager the
combined might of the internets can produce obscure html markup insanity faster
than any mortal can keep up. Tho it is a shame the browsers' rendering
engines' behaviour isn't more exposed for re-purposing in this regard.
This isn't to say that Marpa might not be a massive big win for dealing with
this sorta thing, just that I'm guessing you're still going to be amazed at how
insane some markup can be and end up dealing with piecemeal exceptions in the
real world. And for me, it's this that has historically dominated the pain.
All that said, Marpa sure looks worth further investigation.
(as would some kinda "but i'm on both those lists, so kindly only forward me
one copy ta" filtering service ... that isn't gmail ;)
I'm having a bit of trouble working out how to get Marpa::HTML to give
me a tree-like view (it's basically SAXy), but I knocked up the
following to see what it would make of your example.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Marpa::HTML qw(html);
use Data::Dumper;
my $html = <<END_HTML;
<html>
<h1>hi</
<tr><td>foo</tr></td
</body>
</html>
END_HTML
html( \$html, {
'*' => sub { say "tag: " . Marpa::HTML::tagname() },
':PCDATA' => sub { say "text: " . Marpa::HTML::literal() },
':CRUFT' => sub { say "cruft: '" . Marpa::HTML::literal() . "'" }
});
The result was
tag: head
text: hi
tag: h1
text: foo
tag: td
tag: tr
cruft: '</td
</body>'
tag: tr
tag: tbody
tag: table
tag: body
tag: html
Which I think means
<html>
<head />
<body>
<h1>hi</h1>
<table>
<tbody>
<tr><td>foo</td></tr>
<tr><![CDATA[</td
</body>]]></tr>
</tbody>
</table>
</body>
</html>
... but some of that may be wishful thinking.
Miles.
Unless I've missed something, the order of your HTML doesn't appear to match
the order of the output from Marpa. If it has 'html' as the last thing, which
is meant to be the first tag, then how come the 'hi' comes before the table?
That said, thanks :)
I still maintain tho that if you're doing this kinda stuff on true web-grade
HTML, you either have to decide to ignore the bad HTML or start picking
through insanity by hand and trying to infer meaning/rules. Which is to say,
the parsing per se isn't normally the PITA bit.
It's reverse Polish notation: it prints out elements when they're
closed, not when they're opened. The html tag encloses everything, so
it's output last.
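That close-order behaviour can be illustrated without Marpa::HTML at all. The following is a minimal sketch using only core Perl: a hand-built tree (shaped like the structure Marpa::HTML appeared to infer above) walked depth-first, with each element reported only once its children are done, just as a handler fires when the element's close is seen.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# A hand-built tree matching the structure inferred above:
# each node is [ tag_name, @child_nodes ].
my $tree = [ html =>
    [ head => ],
    [ body =>
        [ h1 => ],
        [ table =>
            [ tbody =>
                [ tr => [ td => ] ],
                [ tr => ],
            ],
        ],
    ],
];

my @order;

# Post-order walk: descend into children first, then record the
# element itself -- i.e. the element is reported when it *closes*.
sub close_order {
    my ($node) = @_;
    my ($tag, @children) = @$node;
    close_order($_) for @children;
    push @order, $tag;
}

close_order($tree);
say "tag: $_" for @order;
```

The outermost element (html) comes out last, and the td/tr pair comes out before tbody and table, which is exactly the ordering in the Marpa::HTML output above.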
Miles
Well, that depends on who you ask. HTML4 (and earlier versions) had
nothing whatsoever to say about what it might mean, because they only
talked about valid, well-formed markup. Traditional browsers^W user
agents each had their own interpretation (which may have coincided in
some cases, if you were very very lucky).
One of the aspects of HTML5 I like most is that it offers reasonable
hope for interoperability on this sort of thing: it doesn't require
implementations to do browser-like error recovery on broken markup,
but it does require that any implementation choosing to do error
recovery must use the HTML5 rules for doing so. Further, the HTML5
error recovery rules specify exactly what DOM should be generated for
any possible byte stream. In this case, HTML5 says that the
error-recovered version of that document induces the same DOM that
this document would:
<html><head></head><body><h1>hi<!--
<tr-->foo
</h1></body></html>
Your "what the bleedin' feck does that mean" question is still
entirely reasonable, and if you were starting from scratch, I don't
think you'd necessarily adopt exactly the same error-recovery rules as
HTML5. But, hey, at least we do have rules for this now.
> The "problem" is that browsers will display that just fine (for some value of
> "fine") and thus you do get real world insanity like this when your home brew
> web crawler starts chewing on public web pages.
Um, yes. I may have some experience of the pain that can cause.
> You're trying to extract meaning from the markup, but the markup is ambiguous,
> even after a successful parse.
If you follow HTML5, there's no ambiguity: the document is erroneous,
and the error-recovered DOM looks like the snippet above.
> What is the heading? Is the content tabular? What is the "text" for that page?
According to HTML5: "hifoo", no, and (if I understand the question) "hifoo".
> Marpa seems to have some smarts for missing/broken HTML,
I'm not convinced that Marpa::HTML is of much use in the "home-brew
web crawler" situation. It certainly looks like a nifty demonstration
of the power of the Marpa parsing tools, but it's not clear how you
could run arbitrary tag soup through Marpa::HTML and straightforwardly
generate an HTML-like DOM (or a well-formed HTML document, or what
have you).
> but I'd wager the
> combined might of the internets can produce obscure html markup insanity faster
> than any mortal can keep up.
Again, that's why the HTML5 rules specify error recovery behaviour for
any possible byte stream.
> Tho it is a shame the browsers' rendering engines'
> behaviour isn't more exposed for re-purposing in this regard.
I fully agree. This seems to be the fastest and most reliable
readily-reusable HTML parser library:
http://about.validator.nu/htmlparser/
It does have the non-trivial disadvantage (from my point of view) that
it's in Java. That said, it's written in such a way as to simplify
conversion to C++, and AIUI, the HTML parser in Gecko 2 (Firefox 4) is
indeed such a converted version of the Java code. But there's no
documentation I can see for how you might go about building it into,
say, a standalone C-bindable library.
--
Aaron Crane ** http://aaroncrane.co.uk/