[Haskell-cafe] HTML library with DOM?

Günther Schmidt

Oct 6, 2010, 5:30:54 PM
to haskel...@haskell.org
Hi all,

is there an HTML parsing library that creates a DOM from a page?

Günther

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Johannes Waldmann

Oct 6, 2010, 5:41:44 PM
to haskel...@haskell.org
> is there an HTML parsing library that creates a DOM from a page?

tagsoup produces trees ( http://hackage.haskell.org/package/tagsoup )

I use it with hxt ( http://hackage.haskell.org/package/hxt )
to tree-walk HTML pages.

J.W.

Gregory Collins

Oct 6, 2010, 7:44:44 PM
to Günther Schmidt, haskel...@haskell.org
Günther Schmidt <gue.s...@web.de> writes:

> Hi all,
>
> is there an HTML parsing library that creates a DOM from a page?

I've got the month of October off, and one of the things I've been
planning on working on is a compliant HTML5 parser for Haskell --
something which is sorely needed! I will ping the list back if/when I
get it finished.

G
--
Gregory Collins <gr...@gregorycollins.net>

Edward Z. Yang

Oct 7, 2010, 4:20:48 AM
to Gregory Collins, haskell-cafe, Günther Schmidt
Excerpts from Gregory Collins's message of Wed Oct 06 19:44:44 -0400 2010:

> I've got the month of October off, and one of the things I've been
> planning on working on is a compliant HTML5 parser for Haskell --
> something which is sorely needed! I will ping the list back if/when I
> get it finished.

I've heard that some of the existing HTML parsers in Haskell were
already HTML5 compliant (this topic came up when I was complaining
that there were some algorithms that you absolutely had to have
state for, because that was how they were specified.) I never
verified this assertion though.

Edward

Gregory Collins

Oct 7, 2010, 4:35:26 AM
to Edward Z. Yang, haskell-cafe, Günther Schmidt
"Edward Z. Yang" <ezy...@MIT.EDU> writes:

> Excerpts from Gregory Collins's message of Wed Oct 06 19:44:44 -0400 2010:
>> I've got the month of October off, and one of the things I've been
>> planning on working on is a compliant HTML5 parser for Haskell --
>> something which is sorely needed! I will ping the list back if/when I
>> get it finished.
>
> I've heard that some of the existing HTML parsers in Haskell were
> already HTML5 compliant (this topic came up when I was complaining
> that there were some algorithms that you absolutely had to have
> state for, because that was how they were specified.) I never
> verified this assertion though.

If there's already a library which *correctly* parses html5 documents
into DOM trees, could someone please let me know so I can use it instead
of wasting a bunch of time writing one?

Thanks,

G
--
Gregory Collins <gr...@gregorycollins.net>

Gregory Collins

Oct 7, 2010, 8:41:19 AM
to Michael Snoyman, Günther Schmidt, haskell-cafe
Michael Snoyman <mic...@snoyman.com> writes:

> As far as I know, Neil Mitchell's tagsoup[1] parses according to the
> HTML 5 parsing rules, but it just generates a list of Tags[2], so
> you'd have to build the DOM tree up from there. I personally have had
> great experience with tagsoup. It's even the core of HTML-scraping
> technology powering searchonce[3].

Yep, someone else wrote me privately to say this (that tagsoup respects
the html5 lexing rules). So I'll be using this as the basis of an html5
DOM parser. Stay tuned!

Michael Snoyman

Oct 7, 2010, 8:37:53 AM
to Gregory Collins, Günther Schmidt, haskell-cafe
2010/10/7 Gregory Collins <gr...@gregorycollins.net>:

> "Edward Z. Yang" <ezy...@MIT.EDU> writes:
>
>> Excerpts from Gregory Collins's message of Wed Oct 06 19:44:44 -0400 2010:
>>> I've got the month of October off, and one of the things I've been
>>> planning on working on is a compliant HTML5 parser for Haskell --
>>> something which is sorely needed! I will ping the list back if/when I
>>> get it finished.
>>
>> I've heard that some of the existing HTML parsers in Haskell were
>> already HTML5 compliant (this topic came up when I was complaining
>> that there were some algorithms that you absolutely had to have
>> state for, because that was how they were specified.)  I never
>> verified this assertion though.
>
> If there's already a library which *correctly* parses html5 documents
> into DOM trees, could someone please let me know so I can use it instead
> of wasting a bunch of time writing one?

As far as I know, Neil Mitchell's tagsoup[1] parses according to the
HTML 5 parsing rules, but it just generates a list of Tags[2], so
you'd have to build the DOM tree up from there. I personally have had
great experience with tagsoup. It's even the core of HTML-scraping
technology powering searchonce[3].

Michael

[1] http://hackage.haskell.org/package/tagsoup
[2] http://hackage.haskell.org/packages/archive/tagsoup/0.11.1/doc/html/Text-HTML-TagSoup.html#t:Tag
[3] http://www.search-once.com/
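
To make "build the DOM tree up from there" concrete, here is a rough sketch of turning a flat tag list into a tree. It does not use tagsoup itself: the Token type below is a hypothetical stand-in that only mirrors the shape of tagsoup's open/close/text tags, and Node, siblings, buildForest and voidElements are names invented for this illustration. It is also not the HTML 5 tree-construction algorithm. It simply nests on open tags, closes on the matching close tag, treats a few void elements as childless, and drops stray close tags.

```haskell
module Main where

-- Hypothetical stand-in for a tagsoup-style tag stream (not tagsoup's type).
data Token
  = Open String [(String, String)]  -- tag name and attributes
  | Close String                    -- closing tag name
  | Text String                     -- character data
  deriving (Eq, Show)

-- A minimal DOM-like tree.
data Node
  = Element String [(String, String)] [Node]
  | TextNode String
  deriving (Eq, Show)

-- A subset of the HTML void elements, which never have children.
voidElements :: [String]
voidElements = ["br", "hr", "img", "input", "link", "meta"]

-- Parse a run of sibling nodes. Stops at end of input or at the first
-- close tag, which is left unconsumed for the caller to inspect.
siblings :: [Token] -> ([Node], [Token])
siblings [] = ([], [])
siblings ts@(Close _ : _) = ([], ts)
siblings (Text s : ts) =
  let (rest, remaining) = siblings ts
  in (TextNode s : rest, remaining)
siblings (Open name attrs : ts)
  | name `elem` voidElements =
      let (rest, remaining) = siblings ts
      in (Element name attrs [] : rest, remaining)
  | otherwise =
      let (children, afterChildren) = siblings ts
          -- Consume the matching close tag; a non-matching one is left
          -- in place, which implicitly closes this element.
          afterClose = case afterChildren of
            (Close n : more) | n == name -> more
            other -> other
          (rest, remaining) = siblings afterClose
      in (Element name attrs children : rest, remaining)

-- Build a whole forest, dropping close tags that match nothing.
buildForest :: [Token] -> [Node]
buildForest ts =
  case siblings ts of
    (nodes, [])       -> nodes
    (nodes, _ : more) -> nodes ++ buildForest more

main :: IO ()
main = mapM_ print (buildForest example)
  where
    example =
      [ Open "p" [("class", "x")], Text "Hi "
      , Open "b" [], Text "there", Close "b", Close "p" ]
```

With the real library, the Token list would come from tagsoup's parser rather than being written by hand, and a compliant DOM builder would need the insertion modes and error recovery that the HTML 5 spec prescribes.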

Neil Mitchell

Oct 7, 2010, 5:34:06 PM
to Gregory Collins, haskell-cafe, Günther Schmidt
Yes, I don't think I've officially announced a version of TagSoup with
HTML 5 parsing, but the last few releases have done it as standard.
The HTML 5 spec is still changing, so it's entirely possible that
something is incorrect in a corner case. If you find one, please let me
know and I'll fix it.

Thanks, Neil
