
What kind of tcl tools would help me parse and use html info?


Larry W. Virden

Mar 24, 2006, 7:10:56 AM
I have a need to write a tool to do this:

fetch an html http URL
parse the html
Look through the A tags for some specific phrases
For each one found, check a file cache. If the URL associated with the
tag is in the cache, see if it has been modified since it was placed
into the cache. If not, continue.
If it has been modified, or if it doesn't exist in the cache, then
fetch the URL, place into the cache, and touch to make the cache copy
have the date and time from the web site.
For one of the specific phrases, instead of caching the file, treat it
as the next html to parse and search.
When one specific term is no longer found, application is finished.

The only other possible thing for the algorithm above is that one of
the URLs is the URL of a CGI with values. The other URLs are just
static HTML pages.

What are some examples using some of the Tcl tools for parsing that
fetched file and searching the A tags for phrases?

Michael Schlenker

Mar 24, 2006, 7:24:08 AM
You could use the htmlparse or tdom packages to do the HTML parsing, but
both of them like their HTML well-formed, so if you might have invalid
HTML files they can and do fail (trash in -> trash out).

The tdom page on the wiki has an example of a tdom script that fetches
a URL and extracts all links; that would probably be a good start. Using
the htmlparse module from tcllib would work too.

For the rest, http::geturl with the -command and probably the -channel
option should work quite well.
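
Something like this (untested) might serve as a starting point for the
tdom route; the URL is just a placeholder and error handling is left out:

package require http
package require tdom

set url "http://www.example.com/index.html"
set tok [::http::geturl $url]
set html [::http::data $tok]
::http::cleanup $tok

# -html selects tdom's forgiving HTML parser rather than the strict XML one
set doc [dom parse -html $html]
foreach node [[$doc documentElement] selectNodes {//a[@href]}] {
    puts [$node getAttribute href]
}
$doc delete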

Michael

Peselnik Michael

Mar 24, 2006, 7:42:53 AM
Use package tDOM

"Larry W. Virden" <lvi...@gmail.com> wrote in message
news:1143202256.8...@i39g2000cwa.googlegroups.com...

Bruce Hartweg

Mar 24, 2006, 9:04:35 AM

others have already mentioned htmlparse or an xml parser, but if
you have invalid html these will puke (and there is still plenty
of bad html out there). I have done web scraping in the past, and
often a simple RE will work to yank all the links out

set RE {<a.*(?!href)href=['"]([^'"]+)['"][^>]*>(.*(?!</a>))</a>}

foreach {tag href txt} [regexp -all -inline $RE $html] {
    # tag holds the full match, href the link target, txt the link text
}

Note that this isn't perfect either, if someone has a URL with
embedded quotes this will choke, miss it (although it only
misses that particular link, it won't stop handling the rest
of the file)

Bruce

Cameron Laird

Mar 24, 2006, 11:08:03 AM
In article <T%SUf.1$tp...@dfw-service2.ext.ray.com>,

Bruce Hartweg <bruce...@hartweg.us> wrote:
>
>
>Larry W. Virden wrote:
>> I have a need to write a tool to do this:
>>
>> fetch an html http URL
>> parse the html
.
.
.

>others have already mentioned htmlparse or an xml parser, but if
>you have invalid html these will puke (and there is still plenty
>of bad html out there). I have done web scraping in the past, and
>often a simple RE will work to yank all the links out
>
>set RE {<a.*(?!href)href=['"]([^'"]+)['"][^>]*>(.*(?!</a>))</a>}
>
>foreach {tag href txt} [regexp -all -inline $RE $html] {
>
>}
>
>Note that this isn't perfect either, if someone has a URL with
>embedded quotes this will choke, miss it (although it only
>misses that particular link, it won't stop handling the rest
>of the file)
.
.
.
... and if that isn't enough--or even just for
a different approach--make sure you read <URL:
http://www.crummy.com/software/BeautifulSoup/ >.

Gerald W. Lester

Mar 24, 2006, 1:19:07 PM
Larry W. Virden wrote:

I know others have replied, but...

> I have a need to write a tool to do this:
>
> fetch an html http URL

Use the http package

> parse the html

I'd use the htmlparse package from Tcllib with the -cmd option

> Look through the A tags for some specific phrases

The routine you specify to ::htmlparse::parse via the -cmd option will
be called for every tag; just check whether the tag is an A.
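
A rough sketch of such a callback (assuming the four arguments that
::htmlparse::parse appends: tag name, a slash marker for closing tags,
the raw parameter string, and the trailing text; the href extraction
here is a deliberately crude regexp):

package require htmlparse

proc collectLink {tag slash param text} {
    global links
    if {[string equal -nocase $tag a] && $slash eq ""} {
        # crude href extraction from the raw attribute string
        if {[regexp -nocase {href=["']?([^"' >]+)} $param -> url]} {
            lappend links $url
        }
    }
}

set links {}
::htmlparse::parse -cmd collectLink $html
# $links now holds every href found in $html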

> For each one found, check a file cache. If the URL associated with the
> tag is in the cache, see if it has been modified since it was placed
> into the cache.

file mtime, clock scan and string equal
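
For example (a sketch only; cacheFile and lastModified are hypothetical
variables, and clock scan is assumed to understand the server's date
format):

set cacheTime  [file mtime $cacheFile]
set serverTime [clock scan $lastModified]
if {$serverTime <= $cacheTime} {
    # the cached copy is still current, so skip this URL
}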

> If not, continue.
> If it has been modified, or if it doesn't exist in the cache, then
> fetch the URL,

Again use the http package

> place into the cache, and touch to make the cache copy
> have the date and time from the web site.

file mtime $filename $webDateTime
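
Something along these lines might do for that step (a sketch; it assumes
the server actually sends a Last-Modified header and that clock scan can
parse it):

package require http

set chan [open $cacheFile w]
set tok  [::http::geturl $url -channel $chan]
close $chan
array set meta [::http::meta $tok]
::http::cleanup $tok

if {[info exists meta(Last-Modified)]} {
    # stamp the cached copy with the server's modification time
    file mtime $cacheFile [clock scan $meta(Last-Modified)]
}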

> For one of the specific phrases, instead of caching the file, treat it
> as the next html to parse and search.

Put the above in a proc and recursively call it.

> When one specific term is no longer found, application is finished.

The stack unwinds and you exit.
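
Schematically, with hypothetical helpers standing in for the pieces above
(fetchPage, extractLinks, isFollowPhrase, isCachePhrase, updateCache):

proc processPage {url} {
    set html [fetchPage $url]
    foreach {href text} [extractLinks $html] {
        if {[isFollowPhrase $text]} {
            processPage $href          ;# descend into the next page
        } elseif {[isCachePhrase $text]} {
            updateCache $href          ;# cache/re-fetch logic from above
        }
    }
    return                             ;# nothing left to follow: unwind
}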

> The only other possible thing for the algorithm above is that one of
> the URLs is the URL of a CGI with values. The other URLs are just
> static HTML pages.
>
> What are some examples using some of the Tcl tools for parsing that
> fetched file and searching the A tags for phrases?
>

Take a look at the htmlparse.test on tcllib.sf.net

--
Gerald W. Lester
"The man who fights for his ideals is the man who is alive." - Cervantes

Joe English

Mar 24, 2006, 6:36:41 PM
Bruce Hartweg wrote:
>Larry W. Virden wrote:
>> I have a need to write a tool to do this:
>>
>> fetch an html http URL
>> parse the html
>> Look through the A tags for some specific phrases
>> [...]

>others have already mentioned htmlparse or an xml parser, but if
>you have invalid html these will puke (and there is still plenty
>of bad html out there). I have done web scraping in the past, and
>often a simple RE will work to yank all the links out
>
>set RE {<a.*(?!href)href=['"]([^'"]+)['"][^>]*>(.*(?!</a>))</a>}
>

>foreach {tag href txt} [regexp -all -inline $RE $html] { [...] }


... and then you have the other problem, namely that any
regexp you devise is likely to give wrong results on
valid HTML (there's actually quite a bit of valid HTML
out there, too ...).

>Note that this isn't perfect either, if someone has a URL with
>embedded quotes this will choke, miss it (although it only
>misses that particular link, it won't stop handling the rest
>of the file)

There are quite a few things wrong with the above regexp, actually.
(I can see four specific problems, including the one you've
already mentioned, without even looking at it too hard; and there
are no doubt many others.)

The regexp/screen-scraping approach can be made to work reasonably
well as long as you're dealing with a known quantity --
if you only need to screen-scrape a specific set of known sites,
you can probably hack up a regexp that will handle the kind of HTML
that those particular sites happen to be producing at the time --
but if you need to handle arbitrary purported HTML fetched from
arbitrary web sites, you really need a general-purpose tag soup
parser.

The htmlparse module in tcllib and tDOM's HTML parser do a reasonably
good job on tag soup, IME. I'd still recommend using one of those
instead of regexps.
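
For what it's worth, here is a small, made-up illustration of how tdom's
HTML parser copes with sloppy markup (unquoted attribute, unclosed tags):

package require tdom

set soup {<p>Unclosed paragraph <a href=page.html>a link <b>bold</p>}
set doc [dom parse -html $soup]
foreach node [[$doc documentElement] selectNodes {//a}] {
    puts [$node getAttribute href]    ;# should print: page.html
}
$doc delete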


--Joe English
