mochiweb_html tokenizer and bad html.

Esteban

Oct 20, 2009, 12:08:16 AM
to MochiWeb
Hi there, new to the group, but I've been using mochiweb for a while.

I'm using mochiweb_html to parse some pages, and noticed that it
mishandles some ugly HTML constructs, such as unquoted attribute
values that contain an '=' character.

Example:

>mochiweb_html:tokens("<a href=userdetails.php?id=1340>ElFantasma</a>").

[{start_tag,<<"a">>,
[{<<"href">>,<<"userdetails.php?id">>},{<<>>,<<"1340">>}],
false},
{data,<<"ElFantasma">>,false},
{end_tag,<<"a">>}]

Note that the href attribute gets split in two because its value is
not quoted. With quotes, the same tag is tokenized correctly:

107> mochiweb_html:tokens("<a href=\"userdetails.php?id=1340\">ElFantasma</a>").

[{start_tag,<<"a">>,
[{<<"href">>,<<"userdetails.php?id=1340">>}],
false},
{data,<<"ElFantasma">>,false},
{end_tag,<<"a">>}]

I know the real problem is the poorly written HTML, but since I have
no control over the pages I'm parsing, I need to modify the
mochiweb_html code.

Any chance of getting this kind of problem fixed?

(I think I can hack the tokenize_attr_value function so that it does
not stop at the '=' char, but I'm not sure whether this would
introduce more bugs.)
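
A minimal sketch of the scanning rule described above, for illustration only. It is not the mochiweb_html source: the module name unquoted_attr and the function scan_value are hypothetical, and the real tokenize_attr_value may be structured quite differently. The sketch assumes the HTML5 rule for unquoted attribute values, where only whitespace or '>' ends the value and '=' is treated as an ordinary character.

```erlang
-module(unquoted_attr).
-export([scan_value/1]).

%% Hypothetical helper, NOT part of mochiweb_html.
%% Consumes an unquoted attribute value from the front of Bin and
%% returns {Value, Rest}, where Rest starts at the terminating byte.
scan_value(Bin) when is_binary(Bin) ->
    scan_value(Bin, <<>>).

%% Whitespace or '>' ends the value; the terminator is left in Rest.
scan_value(<<C, _/binary>> = Rest, Acc)
  when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r; C =:= $\f; C =:= $> ->
    {Acc, Rest};
%% Any other byte, including '=', '?' and '&', belongs to the value.
scan_value(<<C, Rest/binary>>, Acc) ->
    scan_value(Rest, <<Acc/binary, C>>);
%% End of input also ends the value.
scan_value(<<>>, Acc) ->
    {Acc, <<>>}.
```

With this rule, unquoted_attr:scan_value(<<"userdetails.php?id=1340>ElFantasma</a>">>) returns {<<"userdetails.php?id=1340">>, <<">ElFantasma</a>">>}, so the href value stays in one piece. One trade-off to keep in mind: an unquoted value immediately followed by "/>" would absorb the slash. That matches the HTML5 tokenizer, but it may differ from what the existing code does for self-closing tags.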

Thanks,

Esteban