Marpa::HTML

Miles Gould

May 20, 2011, 6:35:48 AM
to edinburgh.pm, glas...@googlegroups.com
There's been some chat on the lists about parsing real-world broken
HTML in Perl. I noticed this

http://search.cpan.org/dist/Marpa-HTML/lib/Marpa/HTML/Doc/HTML.pod

and wondered if anyone might be interested, or perhaps had used it?
The author claims that his underlying parsing engine
(http://search.cpan.org/~jkegl/Marpa-0.204000/lib/Marpa/Doc/Marpa.pod)
can parse any context-free grammar in O(n^3) time and any LL/LR/etc
grammar in O(n) time, and has nice error reporting. On the other hand,
he says it's alpha software (and Marpa::XS, released only a month ago,
is presumably doubly alpha).

There's more on Marpa and the author's mission to kill Yacc at his
blog, http://blogs.perl.org/users/jeffrey_kegler/.

Miles.

Murray

May 20, 2011, 7:36:26 AM
to glas...@googlegroups.com, edinburgh.pm
(disclaimer: not played with Marpa, tho it looks interesting)

At least when I've gone down this rabbit hole, the problem isn't the parsing
per se, but rather given this:

<html>
<h1>hi</
<tr><td>foo</tr></td
</body>
</html>

What the bleedin' feck does it mean?

(no opening body tag, borked closing h1, no table tags, missing > on the
closing td & the closing td/tr being out of order).

The "problem" is that browsers will display that just fine (for some value of
"fine") and thus you do get real world insanity like this when your home brew
web crawler starts chewing on public web pages.

You're trying to extract meaning from the markup, but the markup is ambiguous,
even after a successful parse.

What is the heading? Is the content tabular? What is the "text" for that page?

Marpa seems to have some smarts for missing/broken HTML, but I'd wager the
combined might of the internets can produce obscure html markup insanity faster
than any mortal can keep up. Tho it is a shame the browser's rendering engines
behaviour isn't more exposed for re-purposing in this regard.

This isn't to say that Marpa might not be a massive big win for dealing with
this sorta thing, just that I'm guessing you're still going to be amazed at how
insane some markup can be and end up dealing with piecemeal exceptions in the
real world. And for me, it's this that has historically dominated the pain.

All that said, Marpa sure looks worth further investigation.

(as would some kinda "but i'm on both those lists, so kindly only forward me
one copy ta" filtering service ... that isn't gmail ;)

paul mcnally

May 20, 2011, 7:53:00 AM
to glas...@googlegroups.com
I still can't get my head round the idea of building things to fix things that people shouldn't have built badly.

A company puts some dodgy scaffold around your building. It's a bit crap, so you need to get another company to put more scaffold around the first scaffold to check it.
The second company may or may not be as good as the first company, therefore you may need a third company.

p

Miles Gould

May 20, 2011, 10:16:55 AM
to glas...@googlegroups.com
On Fri, May 20, 2011 at 12:36 PM, Murray <pe...@minty.org> wrote:
> (disclaimer: not played with Marpa, tho it looks interesting)
>
> At least when I've gone down this rabbit hole, the problem isn't the parsing
> per se, but rather given this:
>
>    <html>
>      <h1>hi</
>      <tr><td>foo</tr></td
>    </body>
>    </html>
>
> What the bleedin' feck does it mean?

I'm having a bit of trouble working out how to get Marpa::HTML to give
me a tree-like view (it's basically SAXy), but I knocked up the
following to see what it would make of your example.

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

use Marpa::HTML qw(html);
use Data::Dumper;

my $html = <<END_HTML;


<html>
<h1>hi</
<tr><td>foo</tr></td
</body>
</html>

END_HTML

# Handlers run when each element (or pseudo-class item) is completed.
html( \$html, {
    '*'       => sub { say "tag: " . Marpa::HTML::tagname() },
    ':PCDATA' => sub { say "text: " . Marpa::HTML::literal() },
    ':CRUFT'  => sub { say "cruft: '" . Marpa::HTML::literal() . "'" },
});


The result was

tag: head
text: hi
tag: h1
text: foo
tag: td
tag: tr
cruft: '</td
</body>'
tag: tr
tag: tbody
tag: table
tag: body
tag: html


Which I think means

<html>
  <head />
  <body>
    <h1>hi</h1>
    <table>
      <tbody>
        <tr><td>foo</td></tr>
        <tr><![CDATA[</td
</body>]]></tr>
      </tbody>
    </table>
  </body>
</html>

... but some of that may be wishful thinking.

Miles.

Murray

May 20, 2011, 10:34:32 AM
to glas...@googlegroups.com
On Fri, May 20, 2011 at 03:16:55PM +0100, Miles Gould wrote:
> I'm having a bit of trouble working out how to get Marpa::HTML to give
> me a tree-like view (it's basically SAXy), but I knocked up the
> following to see what it would make of your example.

Unless I've missed something, the order of your HTML doesn't appear to match
the order of the output from Marpa. If it has 'html' as the last thing, which
is meant to be the first tag, then how come the 'hi' comes before the table?

That said, thanks :)

I still maintain tho that if you're doing this kinda stuff on true web-grade
HTML, you either have to decide to ignore the bad HTML or start picking
through insanity by hand and trying to infer meaning/rules. Which is to say,
the parsing per se isn't normally the pita bit.

Miles Gould

May 20, 2011, 10:43:43 AM
to glas...@googlegroups.com
On Fri, May 20, 2011 at 3:34 PM, Murray <pe...@minty.org> wrote:
> On Fri, May 20, 2011 at 03:16:55PM +0100, Miles Gould wrote:
>> I'm having a bit of trouble working out how to get Marpa::HTML to give
>> me a tree-like view (it's basically SAXy), but I knocked up the
>> following to see what it would make of your example.
>
> Unless I've missed something, the order of your HTML doesn't appear to match
> the order of the output from Marpa.  If it has 'html' as the last thing, which
> is meant to be the first tag, then how come the 'hi' comes before the table?

It's reverse Polish notation: it prints out elements when they're
closed, not when they're opened. The html tag encloses everything, so
it's output last.
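
To make the ordering concrete, here is a minimal Perl sketch of a post-order
walk. It does not use Marpa::HTML at all: the tree is hand-built from the
structure guessed earlier in the thread, and the node layout and the
postorder() helper are hypothetical, for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical hand-built tree mirroring the guessed document structure.
# Each node is [ kind, label, children... ]; this is NOT Marpa::HTML's
# own data model, just an illustration.
my $tree =
  [ tag => 'html',
    [ tag => 'head' ],
    [ tag => 'body',
      [ tag => 'h1', [ text => 'hi' ] ],
      [ tag => 'table',
        [ tag => 'tbody',
          [ tag => 'tr', [ tag => 'td', [ text => 'foo' ] ] ],
          [ tag => 'tr', [ cruft => "</td\n</body>" ] ],
        ],
      ],
    ],
  ];

# Post-order walk: report every child first, then the node itself,
# so an enclosing element always appears after its contents.
sub postorder {
    my ($node) = @_;
    my ( $kind, $label, @children ) = @{$node};
    my @events = map { postorder($_) } @children;
    push @events, $kind eq 'cruft' ? "cruft: '$label'" : "$kind: $label";
    return @events;
}

print "$_\n" for postorder($tree);
```

The events come out in the same order as the Marpa::HTML output quoted
above, with "tag: html" last, which is at least consistent with the guessed
tree.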

Miles

Aaron Crane

May 20, 2011, 12:53:28 PM
to glas...@googlegroups.com
Murray <pe...@minty.org> wrote:
> At least when I've gone down this rabbit hole, the problem isn't the parsing
> per se, but rather given this:
>
>    <html>
>      <h1>hi</
>      <tr><td>foo</tr></td
>    </body>
>    </html>
>
> What the bleedin' feck does it mean?

Well, that depends on who you ask. HTML4 (and earlier versions) had
nothing whatsoever to say about what it might mean, because they only
talked about valid, well-formed markup. Traditional browsers^W user
agents each had their own interpretation (which may have coincided in
some cases, if you were very very lucky).

One of the aspects of HTML5 I like most is that it offers reasonable
hope for interoperability on this sort of thing: it doesn't require
implementations to do browser-like error recovery on broken markup,
but it does require that any implementation choosing to do error
recovery must use the HTML5 rules for doing so. Further, the HTML5
error recovery rules specify exactly what DOM should be generated for
any possible byte stream. In this case, HTML5 says that the
error-recovered version of that document induces the same DOM that
this document would:

<html><head></head><body><h1>hi<!--
<tr-->foo

</h1></body></html>

Your "what the bleedin' feck does that mean" question is still
entirely reasonable, and if you were starting from scratch, I don't
think you'd necessarily adopt exactly the same error-recovery rules as
HTML5. But, hey, at least we do have rules for this now.
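
As a rough illustration of where the comment in that DOM comes from: in the
HTML5 tokenizer, "</" followed by a character that is neither a letter nor
">" is a parse error that starts a "bogus comment" running up to the next
">". The Perl sketch below imitates only that one rule with a regex. The
bogus_comments() helper is hypothetical, and a real HTML5 parser tracks far
more state (attribute values, script data and so on), so treat this as a toy
rather than a parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified sketch of ONE HTML5 tokenizer rule, not a full parser:
# "</" followed by anything other than a letter or ">" opens a
# "bogus comment" that swallows everything up to the next ">".
sub bogus_comments {
    my ($html) = @_;
    my @comments;
    while ( $html =~ m{</(?![A-Za-z>])([^>]*)>}g ) {
        push @comments, $1;    # comment text, without the closing ">"
    }
    return @comments;
}

my $html = "<html>\n<h1>hi</\n<tr><td>foo</tr></td\n</body>\n</html>\n";
print "comment: [$_]\n" for bogus_comments($html);
```

On the example document this finds a single comment consisting of a newline
followed by "<tr", which matches the comment node in the snippet above.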

> The "problem" is that browsers will display that just fine (for some value of
> "fine") and thus you do get real world insanity like this when your home brew
> web crawler starts chewing on public web pages.

Um, yes. I may have some experience of the pain that can cause.

> You're trying to extract meaning from the markup, but the markup is ambiguous,
> even after a successful parse.

If you follow HTML5, there's no ambiguity: the document is erroneous,
and the error-recovered DOM looks like the snippet above.

> What is the heading? Is the content tabular?  What is the "text" for that page?

According to HTML5: "hifoo", no, and (if I understand the question) "hifoo".

> Marpa seems to have some smarts for missing/broken HTML,

I'm not convinced that Marpa::HTML is of much use in the "home-brew
web crawler" situation. It certainly looks like a nifty demonstration
of the power of the Marpa parsing tools, but it's not clear how you
could run arbitrary tag soup through Marpa::HTML and straightforwardly
generate an HTML-like DOM (or a well-formed HTML document, or what
have you).

> but I'd wager the
> combined might of the internets can produce obscure html markup insanity faster
> than any mortal can keep up.

Again, that's why the HTML5 rules specify error recovery behaviour for
any possible byte stream.

> Tho it is a shame the browser's rendering engines
> behaviour isn't more exposed for re-purposing in this regard.

I fully agree. This seems to be the fastest and most reliable
readily-reusable HTML parser library:

http://about.validator.nu/htmlparser/

It does have the non-trivial disadvantage (from my point of view) that
it's in Java. That said, it's written in such a way as to simplify
conversion to C++, and AIUI, the HTML parser in Gecko 2 (Firefox 4) is
indeed such a converted version of the Java code. But there's no
documentation I can see for how you might go about building it into,
say, a standalone C-bindable library.

--
Aaron Crane ** http://aaroncrane.co.uk/
