How to use findall and xpath to extract table rows?

Sean Charles

Nov 27, 2017, 8:49:34 AM
to SWI-Prolog
I am trying to extract some data from a page but it just seems to return one row when I know there are ten!
When I pretty print the loaded DOM it also only shows one element.

I wanted some practice at writing a screen scraper with Prolog and I didn't think it would be this slippery which, based on experience, leads me to believe that I have goofed once again!

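:- use_module(library(http/http_open)).
:- use_module(library(sgml)).
:- use_module(library(xpath)).

% Fetch URL and parse the response body into a DOM.
test(URL, DOM) :-
    http_open(URL, In, []),
    call_cleanup(
        load_html(In, DOM, [dialect(html5)]),
        close(In)).

% Collect every row of the posts table found in DOM.
test_rows(DOM, Xs) :-
    findall(
        X,
        xpath(DOM, //table(@id=group_posts_table)//tr, X),
        Xs).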

I attached the full page source HTML and the pretty printed dump for reference. I am confused as to why the pretty printed dump only has one element as well?!

Thanks,
Sean.


dump.txt
fc-all-page1.html

Jan Wielemaker

Nov 27, 2017, 9:23:11 AM
to Sean Charles, SWI-Prolog
If you use plain load_structure/3 you get all warnings and errors. In
particular, this seems to trigger the issue:

<a
href='https://groups.freecycle.org/group/PlymouthUK/posts/63704234/Gardener's%20World%20Magazines'
...

You see, there is a "'" inside the "'"-quoted attribute. AFAIK,
HTML5 does specify exactly how various errors must be dealt with and I
guess that will work ok. SWI-Prolog's HTML parser, however, is a generic
SGML parser with a few tweaks in HTML5 mode. It does not comply with
HTML5 error handling though. If you want to process such documents
reliably you probably need an external tool to normalize the HTML and emit
either compliant HTML or XML.

I don't know the rules here. If the rule is as simple as "if a matching
attribute quote is not followed by white space, ">" or "/>", treat it as
an ordinary character", we could add that. Does anyone know?
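
The rule floated above can be sketched as a small DCG over character codes. This is only an illustration of that one rule, not the HTML5 error-recovery algorithm and not what SWI-Prolog's parser does; the names attr_value//1, value_chars//2 and value_end//0 are made up for the example, and treating end of input as a terminator is an added assumption.

```prolog
% attr_value(-Codes)//   (hypothetical name)
% Parse a quoted attribute value. A quote matching the opening one only
% ends the value when it is followed by white space, ">", "/>" or the end
% of input; otherwise it is kept as an ordinary character.
attr_value(Codes) -->
    [Q],
    { memberchk(Q, [34, 39]) },      % 34 is the double quote, 39 the single quote
    value_chars(Q, Codes).

value_chars(Q, []) -->
    [Q], value_end, !.
value_chars(Q, [C|Cs]) -->
    [C],
    value_chars(Q, Cs).

% value_end//0 only looks ahead: the first three clauses push the codes
% they matched back onto the input, the last one matches end of input.
value_end, [C]        --> [C], { code_type(C, space) }.
value_end, [0'>]      --> [0'>].
value_end, [0'/, 0'>] --> [0'/, 0'>].
value_end             --> \+ [_].
```

Calling phrase(attr_value(V), Codes, Rest) on the codes of 'x/Gardener's%20World' class='a'> keeps the apostrophe in Gardener's as data and stops at the quote before the space, leaving the class attribute in Rest.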

Cheers --- Jan

> Thanks,
> Sean.
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "SWI-Prolog" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to swi-prolog+...@googlegroups.com
> <mailto:swi-prolog+...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/swi-prolog.
> For more options, visit https://groups.google.com/d/optout.

Carlo Capelli

Nov 27, 2017, 9:26:01 AM
to Sean Charles, SWI-Prolog
Hi Sean

I think the HTML structure is ill-formed (well, Jan already spotted it while I was writing this).
In Firefox, just show the source to get red coloring of bad structures.


emacstheviking

Nov 27, 2017, 10:03:12 AM
to Carlo Capelli, SWI-Prolog
Jan, Carlo,

Thanks for spotting that. I didn't know about load_structure/3 showing errors. I've not used any of these predicates before. I took a closer look at the HTML being returned from the site, it's *awful*. Firefox complains about the closing HEAD tag so that's not a good start is it!

I will have to find a means to isolate the table markup in the returned string and then either:

  - clean up the mess by replacing illegal markup with better markup
  - write a set of DCG rules to extract what I want.

I don't mind, it's all good practice which is why I am subjecting myself to this anyway!

Thanks again all,

Sean.

emacstheviking

Nov 27, 2017, 10:05:28 AM
to Carlo Capelli, SWI-Prolog
As an aside, I have just spotted a DIV element inside the HEAD element as well. Ouch!!

Jan Wielemaker

Nov 27, 2017, 10:41:42 AM
to emacstheviking, Carlo Capelli, SWI-Prolog
On 11/27/2017 04:02 PM, emacstheviking wrote:
> Jan, Carlo,
>
> Thanks for spotting that. I didn't know about load_structure/3 showing
> errors. I've not used any of these predicates before. I took a closer

It is more that load_html/3 suppresses them, as invalid HTML is more the
rule than the exception. Most of the time the resulting tree is still
close to the intent of the author, though. This is a rather bad
example.
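
A minimal sketch of that difference, assuming library(sgml): calling load_structure/3 with the HTML5 DTD and dialect, but without syntax_errors(quiet), so the parser's messages are printed instead of hidden. The predicate name load_html_verbose/2 is made up for illustration, and the option list is an assumption about what is useful here rather than a copy of what load_html/3 does internally.

```prolog
:- use_module(library(sgml)).

% load_html_verbose(+Source, -DOM)   (hypothetical helper name)
% Source is a file name or stream(Stream), as accepted by load_structure/3.
% Parses as HTML5 and leaves syntax error reporting switched on.
load_html_verbose(Source, DOM) :-
    dtd(html5, DTD),                 % assumption: the HTML5 DTD bundled with SWI-Prolog
    load_structure(Source, DOM,
                   [ dtd(DTD),
                     dialect(html5),
                     max_errors(-1)  % keep parsing however many errors are found
                   ]).               % no syntax_errors(quiet): messages are printed
```

Running this on the saved page should list the places where the parser had to guess, which makes it easier to see where the tree goes wrong.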

> look at the HTML being returned from the site, it's *awful*. Firefox
> complains about the closing HEAD tag so that's not a good start is it!
>
> I will have to find a means to isolate the table markup in the returned
> string and then either:
>
>   - clean up the mess by replacing illegal markup with better markup

It finds the table ok (despite all the rest), but corrupts the table
itself due to illegal quotes.

>   - write a set of DCG rules to extract what I want.
>
> I don't mind, it's all good practice which is why I am subjecting myself
> to this anyway!

If the target is just this site a dedicated DCG hack may get what you
want. In general this is hard though and I'd probably see whether there
are tools out there that implement the official HTML5 error handling and
can write valid HTML.

Seems this code is written by someone not familiar with the notion of
code injection :(

Cheers --- Jan


emacstheviking

Nov 27, 2017, 10:45:03 AM
to Jan Wielemaker, Carlo Capelli, SWI-Prolog
Hi Jan,

I have tried to contact the owners of FreeCycle many times over the last few years, with the offer of free assistance to sort things out but *zero* response is all I get. I get the impression it's one large AI application now that is self-running!

As for the HTML, I am going to pipe it through a local instance of "html Tidy" and if that doesn't fix it...I will resort to PHP! Noooooooooooooooo.




emacstheviking

Nov 27, 2017, 11:01:44 AM
to Jan Wielemaker, Carlo Capelli, SWI-Prolog
Success!

I downloaded the latest version of Tidy from here: http://binaries.html-tidy.org/

Ran it with options: -asxhtml -w 10000 -o foo.html

The new file behaves exactly as I expected, ten rows returned.... I might have a look-see in the future to see if I could be clever enough to wrap libtidy for SWI, might be useful?!
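
Short of wrapping libtidy, the same clean-up can be driven from Prolog by running the tidy executable as a child process. The following is a rough sketch under several assumptions: tidy is on the PATH, library(process) is available, and the helper names tidy_args/2 and load_tidied/2 are invented for the example. The -asxhtml and -w 10000 options are the ones quoted above; -q is an addition to keep tidy quiet, and output is read from standard output instead of being written with -o.

```prolog
:- use_module(library(process)).
:- use_module(library(sgml)).

% tidy_args(+File, -Args)   (hypothetical helper name)
% Command-line arguments for tidy: convert to XHTML with very long lines.
tidy_args(File, ['-asxhtml', '-w', '10000', '-q', file(File)]).

% load_tidied(+File, -DOM)   (hypothetical helper name)
% Run tidy on File and parse what it writes to standard output.
load_tidied(File, DOM) :-
    tidy_args(File, Args),
    setup_call_cleanup(
        process_create(path(tidy), Args,
                       [ stdout(pipe(Out)), stderr(null), process(Pid) ]),
        load_html(Out, DOM, []),
        ( close(Out),
          process_wait(Pid, _)   % tidy exits non-zero when it reports warnings or errors
        )).
```

Asking for process(Pid) and waiting explicitly means a non-zero exit status from tidy is collected rather than treated as a failure, which matters because repaired input normally produces warnings.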

Thanks again everybody.

Sean.

Jan Wielemaker

Nov 27, 2017, 11:20:16 AM
to emacstheviking, Carlo Capelli, SWI-Prolog
On 11/27/2017 05:01 PM, emacstheviking wrote:
> Success!
>
> I downloaded the latest version of Tidy from here:
> http://binaries.html-tidy.org/
>
> Ran it with options: -asxhtml -w 10000 -o foo.html

:)

> The new file behaves exactly as I expected, ten rows returned.... I
> might have a look-see in the future to see if I could be clever enough
> to wrap libtidy for SWI, might be useful?!

The trick SWI-Prolog has for that is filter streams, which you can define
in C. The zlib binding is an example. It means that you must translate
block read/write operations to process the blocks for the upstream
stream. That is quite nasty programming. Together with Wouter Beek
we once hacked something that allows arbitrary processes to do the
filtering. Wouter uses this to insert the `recode' process into the
translation if the input is in some wild encoding.

Maybe Wouter will chime in and show where this code is.

Cheers --- Jan


Jan Burse

Nov 27, 2017, 11:27:43 AM
to SWI-Prolog
It is a very poor web page:

- Mixture of HTML and XHTML:
    in HTML it's <meta >
    in XHTML it's <meta />
- No entities generated for attribute literal values:
    like Gardener's, where the quotes enclosing the attribute literal were also single quotes.
- Different quotes opening and closing attribute literal values:
    like div style=' .. "

You would need a very tolerant parser, I guess....

emacstheviking

Nov 27, 2017, 11:37:40 AM
to Jan Wielemaker, Carlo Capelli, SWI-Prolog
I get the concept... but it sounds like I would just be re-inventing all those HTML tidy rules to stay in Prolog space.

For now I am happy that I can continue my endeavours with SWI.

:)

