
html parsing


Bart
Oct 31, 2007, 5:41:43 PM

I would like to retrieve a page from one of my favorite sites. The
section I am interested in starts with an HTML header (in this case an h2),
followed by a table, with all the HTML formatting mixed in (fonts,
spans, etc.). Is there an easy way to pull out just the h2 header and
convert the table so that each row becomes a Tcl list?

I am experimenting with tDOM, but it is hard to see what I should look
for and what I should ignore.


Ian
Oct 31, 2007, 5:13:52 PM

Bart <bart...@yahoo.com> writes:

> I would like to retrieve a page from one of my favorite sites. The
> section I am interested in starts with a html header (in this case
> h2), followed by a table, with all the html formatting mixed in
> (fonts, spans, etc.). Is there an easy way to pull out just the h2
> header and convert the table so each row becomes a tcl list?

Here's a snippet of what I'm using to do something similar,
looking for a table with a known string in the first row and
extracting its contents.

Hope it helps get you started!


Regards,
Ian

package require htmlparse
package require struct

proc html2data s {
    # Parse the HTML into a struct::tree and strip purely visual markup.
    ::struct::tree x
    ::htmlparse::2tree $s x
    ::htmlparse::removeVisualFluff x

    set data [list]

    # Walk the tree looking for the text node that marks the table of
    # interest, climb back up to the table node, then loop over its rows.
    x walk root q {
        if {([x get $q type] eq "PCDATA") &&
            [string match R\u00e6kke/pulje [x get $q data]]} {

            set p $q
            for {set i 3} {$i} {incr i -1} {set p [x parent $p]}
            foreach row [lrange [x children $p] 1 end] {
                # ... extract the cells of $row into $data ...
            }
            break
        }
    }
    return $data
}

Bart
Nov 1, 2007, 10:55:38 AM

Ian wrote:

> Here's a snippet of what I'm using to do something similar,
> looking for a table with a known string in the first row and
> extracting its contents.
>
> Hope it helps get you started!
>

Thanks! That is a good start indeed!

Bart
Nov 1, 2007, 2:44:31 PM

Ian wrote:

> Here's a snippet of what I'm using to do something similar,
> looking for a table with a known string in the first row and
> extracting its contents.
>
> Hope it helps get you started!
>

Unfortunately, this does not work. The website has JavaScript in it,
which interferes with the htmlparse::2tree call.


chi...@singnet.com.sg
Nov 2, 2007, 4:19:37 AM

I answered a similar thread a couple of days ago:
http://groups.google.com/group/comp.lang.tcl/browse_thread/thread/8463293f2d6e82d5/88044a75553905df

The tDOM extension is what you need. You need to learn XPath syntax to
locate any node in the DOM tree.

Regarding the JavaScript that causes problems for the HTML parsing: you
can use a regular expression to strip those chunks of code before
parsing. You need to use non-greedy quantifiers; look under the Tcl man
page for "re_syntax".

Donal K. Fellows
Nov 2, 2007, 5:06:46 AM

chi...@singnet.com.sg wrote:
> tDOM extension is what you need. You need to learn the XPath syntax to
> locate any nodes in the DOM tree.

Not quite; it's just far, far easier than navigating the tree by hand.

Donal.

anoved
Nov 4, 2007, 12:07:27 AM

For what it's worth, this is the tdom code I use to extract the result
from a Google Translate translation (error checking omitted for
clarity):

package require tdom

# Parse the page and get its document element.
::dom parse -html $data doc
$doc documentElement tree

# Print every text node inside the div with id "result_box".
foreach line [$tree selectNodes {descendant::div[@id='result_box']/text()}] {
    puts [$line data]
}

The $data variable contains the whole page content as returned by
http::geturl. The selectNodes command gets every line of text from the
div identified as "result_box". Clearly, this is not a generic
example, but the point is that a relatively succinct solution may be
possible, depending on the structure of the HTML you're scraping and
the right XPath expression. For instance, you might be able to select
the content of each tr from the appropriate table.
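
As a rough sketch of that last idea applied to the original question
(untested, and assuming the table of interest immediately follows the h2,
so the XPath will need adjusting for the real page):

package require tdom

# $data holds the whole page, as returned by http::geturl.
::dom parse -html $data doc
$doc documentElement root

# Text of the first h2 on the page (assumed to be the one of interest).
set h2node  [lindex [$root selectNodes {descendant::h2}] 0]
set heading [$h2node text]

# Turn each row of the table following that h2 into a Tcl list of cell text.
set rows {}
foreach tr [$root selectNodes {descendant::h2[1]/following-sibling::table[1]//tr}] {
    set cells {}
    foreach cell [$tr selectNodes {td|th}] {
        lappend cells [string trim [$cell text]]
    }
    lappend rows $cells
}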

Jim
