Using xpath to select siblings


jm

Nov 26, 2009, 7:11:56 PM
to moch...@googlegroups.com
I am attempting to scrape a page with mochiweb_xpath. The page has the
following structure:


html
  body
    div
      h2
      table
      table
      h2
      table
      table

I can use

mochiweb_xpath:execute("//h2/text()", Page)

to select the label text (the h2 tags) and

mochiweb_xpath:execute("//table", Page)

to select the tables. With a few assumptions about the page format
(e.g. that there are always 2 tables per label), I can then match each
label to its tables correctly. What I'd like to do is remove this
assumption. Is there a way to get each label together with the tables
that belong to it? I was looking at the sibling axes, such as
following-sibling, but can't seem to get them to work.
Ideally I'd like to produce something like a list of {label, [tables]}
for further processing. Any suggestions?

Jeff.

Pablo Polvorin

Dec 2, 2009, 9:38:36 AM
to moch...@googlegroups.com
Hi Jeff,
Sadly, this isn't possible directly with mochiweb_xpath. It only
implements a limited subset of the XPath specification (at least as of
when I wrote it; I have no idea whether someone else has extended it
since).

Note also that mochiweb_xpath isn't part of the mochiweb distribution.

You could try the XPath engine in xmerl. It should be more powerful,
but I'm not sure whether it implements the following-sibling axis.
That would probably require you to parse the HTML with mochiweb,
export it to XML, parse it again with xmerl, and then run the XPath
expression. Or perhaps you can go from the mochiweb struct to a
simplified xmerl struct directly.









--
Pablo Polvorin
ProcessOne

jm

Dec 2, 2009, 5:31:34 PM
to moch...@googlegroups.com
Thanks for the reply. This HTML stuff is downright evil. In case anyone
else has a similar problem: I ended up working around the lack of
following-sibling by using a few patterns, as there were only up to 2
headings and 4 tables per page. That is, if Headings are the headings
above the tables and Tables are the tables, then


%% Pair headings with tables for the few page layouts seen so far.
%% Results are prepended to Out.
build_tables([], [], Out) ->
    Out;
build_tables([H1, H2], [T1, T2], Out) ->
    %% two headings, one table each
    %% do some cleanup of Hs and Ts
    build_tables([], [], [{H1, [T1]}, {H2, [T2]} | Out]);
build_tables([H1 | MoreH], [T1, T2 | MoreT], Out) ->
    %% general case: two tables per heading
    %% do some cleanup of Hs and Ts
    build_tables(MoreH, MoreT, [{H1, [T1, T2]} | Out]);
build_tables([H1], [T1, T2], Out) ->
    %% already covered by the clause above; kept for clarity
    %% do some cleanup of Hs and Ts
    build_tables([], [], [{H1, [T1, T2]} | Out]);
build_tables([H1], [T1], Out) ->
    %% one heading, one table
    %% do some cleanup of Hs and Ts
    build_tables([], [], [{H1, [T1]} | Out]).

which seems to do the job, but I was hoping for a more generalised way
of doing this.
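
One possibly more general approach, given here only as a rough sketch: skip XPath for the grouping step and walk the children of the enclosing div in document order, attaching each table to the most recent h2. This assumes the nodes have the shape mochiweb_html:parse/1 produces ({Tag, Attrs, Children} tuples with binary tag names, text nodes as plain binaries). The module name group_tables and the function group/1 are made up for illustration and are not part of mochiweb or mochiweb_xpath.

```erlang
-module(group_tables).
-export([group/1]).

%% Children: the child nodes of the enclosing div, in document order.
%% Returns [{H2Node, [TableNode]}], also in document order.
group(Children) ->
    group(Children, []).

%% End of input: the groups and the tables inside each group were
%% accumulated in reverse, so restore document order for both.
group([], Acc) ->
    lists:reverse([{H, lists:reverse(Ts)} || {H, Ts} <- Acc]);
%% A new heading opens a new, initially empty group.
group([{<<"h2">>, _, _} = H | Rest], Acc) ->
    group(Rest, [{H, []} | Acc]);
%% A table is attached to the most recently opened group.
group([{<<"table">>, _, _} = T | Rest], [{H, Ts} | Acc]) ->
    group(Rest, [{H, [T | Ts]} | Acc]);
%% Anything else (whitespace text, comments, other tags, or a table
%% that appears before the first heading) is skipped.
group([_ | Rest], Acc) ->
    group(Rest, Acc).
```

If the div itself is selected with something like "//div", its third tuple element would be the Children list to pass in. The number of tables per heading no longer matters, although nested layouts would need a recursive variant.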

Thanks for writing mochiweb_xpath. It (with mochiweb_html) does make
processing scraped web pages easier.

Jeff.