
[ISO] Web scraping with Tcl help anyone?


lvi...@yahoo.com

Jan 18, 2002, 5:48:14 AM

I would like to be able to write a cronable Tcl script that simulates me
web surfing to a web site, filling in a particular search form,
pressing the submit button, and then, with the results, grabs out just
the text results and writes it to a file.


However, the form in question doesn't use the notation where the form
arguments are placed in the CGI's URL. How does one simulate that sort
of interaction with a CGI?
--
"I know of vanishingly few people ... who choose to use ksh." "I'm a minority!"
<URL: mailto:lvi...@cas.org> <URL: http://www.purl.org/NET/lvirden/>
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.

David N. Welton

Jan 18, 2002, 6:05:39 AM
lvi...@yahoo.com writes:

> I would like to be able to write a cronable Tcl script that
> simulates me web surfing to a web site, filling in a particular
> search form, pressing the submit button, and then, with the results,
> grabs out just the text results and writes it to a file.

> However, the form in question doesn't use the notation where the
> form arguments are placed in the CGI's URL. How does one simulate
> that sort of interaction with a CGI?

That's a 'POST' operation, as opposed to a 'GET'.

If you are using the http package:

::http::geturl url ?options?
The ::http::geturl command is the main procedure
in the package. The -query option causes a POST
operation
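
In other words, roughly (untested, and the field name below is a
placeholder -- use the real names from the form's HTML):

package require http
set query [::http::formatQuery searchfield "what you'd type into the form"]
set tok   [::http::geturl http://www.example.com/cgi-bin/search -query $query]
puts [::http::data $tok]
::http::cleanup $tok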

Ciao,
--
David N. Welton
Consulting: http://www.dedasys.com/
Free Software: http://people.debian.org/~davidw/
Apache Tcl: http://tcl.apache.org/
Personal: http://www.efn.org/~davidw/

Cameron Laird

Jan 18, 2002, 2:14:15 PM
In article <871ygnu...@dedasys.com>,

David N. Welton <dav...@dedasys.com> wrote:
>lvi...@yahoo.com writes:
>
>> I would like to be able to write a cronable Tcl script that
>> simulates me web surfing to a web site, filling in a particular
>> search form, pressing the submit button, and then, with the results,
>> grabs out just the text results and writes it to a file.
>
>> However, the form in question doesn't use the notation where the
>> form arguments are placed in the CGI's URL. How does one simulate
>> that sort of interaction with a CGI?
>
>That's a 'POST' operation, as opposed to a 'GET'.
>
>If you are using the http package:
>
> ::http::geturl url ?options?
> The ::http::geturl command is the main procedure
> in the package. The -query option causes a POST
> operation
.
.
.
Good answer.

Larry, this is part of what the TclCurl advocates have
been chattering about. If all you have to do is, in
effect, push a "submit" button, -query will take care
of you (although Phil Ehrens has complaints I don't yet
understand about ::http even in this simple case). Web
scraping gets sticky when there are logins, cookies,
... to manage, and TclCurl definitely has a lot to of-
fer there.
--

Cameron Laird <Cam...@Lairds.com>
Business: http://www.Phaseit.net
Personal: http://starbase.neosoft.com/~claird/home.html

Cameron Laird

Jan 18, 2002, 4:34:10 PM
In article <a28uhe$n2l$1...@srv38.cas.org>, <lvi...@yahoo.com> wrote:
.
.

.
>However, the form in question doesn't use the notation where the form
>arguments are placed in the CGI's URL. How does one simulate that sort
>of interaction with a CGI?
.
.
.
David's already correctly pointed you to POST and
how that's different from GET.

Little-known fact that can sometimes be convenient
when working in this area: many pages set up to re-
ceive POSTed information happen to parse it equally
when GETted.

lvi...@yahoo.com

Jan 21, 2002, 7:32:28 AM

According to David N. Welton <dav...@dedasys.com>:

:lvi...@yahoo.com writes:
:> I would like to be able to write a cronable Tcl script that
:> simulates me web surfing to a web site, filling in a particular
:> search form, pressing the submit button, and then, with the results,
:> grabs out just the text results and writes it to a file.
:
:> However, the form in question doesn't use the notation where the
:> form arguments are placed in the CGI's URL. How does one simulate
:> that sort of interaction with a CGI?
:
:That's a 'POST' operation, as opposed to a 'GET'.
:

Does anyone have a sample of what this kind of tcl script would look like?
I've never written any kind of web scripting so I'm looking for some
samples as something to work by.

David N. Welton

Jan 21, 2002, 7:43:09 AM
lvi...@yahoo.com writes:

> :> I would like to be able to write a cronable Tcl script that
> :> simulates me web surfing to a web site, filling in a particular
> :> search form, pressing the submit button, and then, with the
> :> results, grabs out just the text results and writes it to a file.

> :> However, the form in question doesn't use the notation where the
> :> form arguments are placed in the CGI's URL. How does one
> :> simulate that sort of interaction with a CGI?

> :That's a 'POST' operation, as opposed to a 'GET'.

> Does anyone have a sample of what this kind of tcl script would look
> like? I've never written any kind of web scripting so I'm looking
> for some samples as something to work by.

From the Apache Rivet test suite:

set page [ ::http::geturl "${urlbase}$testfilename1" -query foobar=goober ]

set page [ ::http::geturl "${urlbase}$testfilename1" -query Más=Tú ]

set page [ ::http::geturl "${urlbase}$testfilename1" -query [ ::http::formatQuery Más Tú ] ]
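
And a slightly fuller sketch of the cronable thing Larry described
(untested; the URL, form field names, and regexp are placeholders to be
taken from the actual form and results page):

#!/usr/bin/tclsh
package require http

# Fill in the form fields and POST them, like pressing the submit button.
set query [::http::formatQuery city Columbus state OH]
set tok [::http::geturl http://www.example.com/cgi-bin/search -query $query]
set body [::http::data $tok]
::http::cleanup $tok

# Pull out just the text results and write them to a file.
if {[regexp {<pre>(.*?)</pre>} $body -> results]} {
    set out [open results.txt w]
    puts $out $results
    close $out
}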

Hope that helps

Bruce Hartweg

Jan 21, 2002, 8:08:07 AM

<lvi...@yahoo.com> wrote in message news:a2h1os$r8l$1...@srv38.cas.org...

>
> According to David N. Welton <dav...@dedasys.com>:
> :lvi...@yahoo.com writes:
> :> I would like to be able to write a cronable Tcl script that
> :> simulates me web surfing to a web site, filling in a particular
> :> search form, pressing the submit button, and then, with the results,
> :> grabs out just the text results and writes it to a file.
> :
> :> However, the form in question doesn't use the notation where the
> :> form arguments are placed in the CGI's URL. How does one simulate
> :> that sort of interaction with a CGI?
> :
> :That's a 'POST' operation, as opposed to a 'GET'.
> :
>
> Does anyone have a sample of what this kind of tcl script would look like?
> I've never written any kind of web scripting so I'm looking for some
> samples as something to work by.
>

Look at tkchat, it does POSTs

Bruce

Your name

Jan 28, 2002, 5:38:59 PM

When I first decided to use Tcl for web `scraping' (I prefer the term `web
molestation') I experimented for months on end (I have no friends). This is
what I learnt:


(1) Webserver operators are a paranoid, delusional bunch. If you write a
robot that molests web sites and culls data, you MUST look like a casual
surfer. Don't raise any warning flags.


(2) The http package, though great, is not as great as just exec'ing curl
(http://curl.haxx.se). Curl is a super-wicked tool. Its creator is also
one of my personal GAWDS. I wish I were as cool as him. Now what is so
great about this tool? Well, it's pretty simple folks: It works. It's cross-
platform. It does SSL, cookies, POST, GET, and can easily fake referers,
user agents and a whole LOT more. Praise be.

Check out one of my robotic-web-molesters, I like to call `fake_ie'

#!/usr/bin/tclsh
;# First arg is the URL to retrieve and to display.
set url [lindex $argv 0]
set timeout 10 ;# seconds
set uagent {Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)}
set data [exec curl -m $timeout -A $uagent -s $url]
puts $data
;# EOF

Now isn't that sexxxy? Damn str8 it is. Why? Because it's a small
component, and it's easily scriptable from any other proggy.

What is the point of it? It fakes being internet explorer, a very famous
browser that runs on Windows. LOL. Anyway, by doing this, your robot,
though of an unsavory character, will survive paranoid webOps.

Use this component to download the text source from a web page, not binary
data. Then you can run it through a myriad of tcl filters, and entertain
yourself for hours on end.

PLUS! --> Curl can be exec'ed from Windows. (It's open source)


(3) Use bgexec, and learn from others. Okay, since you're exec'ing curl,
bgexec is USEFUL. Imagine that.
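
Something like this, roughly (untested, from memory of the BLT docs; adjust
the curl options to taste):

package require BLT

# bgexec keeps servicing the event loop while curl runs, so a Tk GUI stays
# alive; when curl exits, ::status gets set and ::page holds curl's stdout.
blt::bgexec ::status -output ::page \
    curl -m 10 -s -A {Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)} \
    http://www.example.com/
puts "got [string length $::page] bytes"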


(4) Check out brx, learn from others. Why brx? Or should I say, er,
BrowseX (that sounds sooooo nasty, yum). What is cool about brx? Well,
I'll tell you what --> it does excellent downloading on Windows. It uses
its own resolver, so that it DOESN'T HANG ON NONEXISTENT SERVERS. This is
an incredibly sorry problem for the http package. This is the primary
reason I searched for alternatives. Sorry, I'm not good enough to patch the
http package, I'm just a sexy loser after all. Look for resolver.exe I
think... or maybe resolve.exe ... less than 20 KB


(5) TclCurl is a Tcl binding for libcurl (libcurl is curl's core
library). It's not bad, but curl works for me, and libcurl doesn't offer any
real advantages until it handles keeping the GUI alive while downloading.
That would be reALLLY SEXxxYY! Oh yeah, the creator is one sexy bwwwoy!


In summary:

o The Tcl http package is not useful to me - it hangs on Windows when it
can't resolve. What I mean by hangs is that the GUI isn't responsive.

o We need a new tool - use curl (http://curl.haxx.se - memorize this!)
It's easy to exec, easy to use, the creator is a sexy boy, and its fun!
Join Us.

o If you're on Windows, check out the source to BrowseX. Peter MacDonald
is very cool. Worship Lambda.

o bgexec from BLT is useful for keeping the GUI alive.

o Keep your eyes out for TclCurl developments.

P.S. The source for tcler's chat is large, I think over 50 KB, but it's
clean. I looked it over quickly once... might be worth looking at ;D


Now, for a real-world example of molesting web sites. Oh yeah anything I
write is yours as long as you say out loud 5 times:
"I am SOOOOO SEXXXY"

This program screws around with a Java Web Server, specifically the one at
elpress.com. Electric Press publishes books online and allows you to read
them, but not download them!!! JEEZZZ cmon, I'm just a lame little child who
needs to be spanked! So use your Tcl-FU and dammit download those books!!
Realize that I take ALL consequences for your actions, because I WANT TO BE
SPANKED BY THE LAW. Oh yeah, those books are really nice, I like the
database one. I might've even bought em, if I was a loser.

Note that you might actually understand the proggy if you ... wait for it...
look at the site first!!! and maybe even do a `view source'

Before running it, just change the isbn to get the book you want. The
default isbn is a math book...

How does it work?
o It starts a session with the web site,
o Does a fetch of a control panel page for that book to get the names of all
the actual content pages
o There are zooming levels (every page is a gif). We only want the highest
zoomed pages (best quality - fat pages though)
o Before getting a content page, we do a fetch that is required,
I think it's a surrounding frame (the damn web server checks for this!)
o Also before doing a fetch of a content page, we do a fetch to set the
darn zooming level!
o That is 3 fetches per page of content: the frame, the zoom, and the actual
content! DamN you JAVA ELPRESS WEB PROGRAMMERS - you deserve a hot, well
brewed death. Plus Java Sucks. Ha!
o Downloads book. Then you read, get smarter, then learn more tricks to
beat the system. It's a vicious cycle, 'cept it's not vicious.

I WANT TO LIVE VICARIOUSLY.

#!/usr/bin/tclsh

;# The ISBN number of the book that we want to suck down into our meta-maws.
;#set isbn "085199234X" ;# Biological Control of Weeds.
set isbn "081764038X" ;# A nice math book.
;# The directory to save the dump to. Make sure it exists before running!
set savedir "saves/$isbn"

;###########################################################################

;# Temporary file to dump the headers to start a session.
set headersFile "tmpSessionHeaders.txt"
;# The session ID that we initiated.
;# This is set by the `startSession' proc.
set sessionID ""

;# Mimics IE by using curl.
;# Supply all arguments needed to curl except for user agent (-A option).
proc curlwrap {args} {
    ;# Our fake user agent to send in web connexions.
    set UAGENT {Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)}
    set com [list exec curl -A $UAGENT]
    foreach a $args {lappend com $a}
    return [eval $com]
}

;# Starts a session with the java web server and
;# set the global `sessionID' variable.
proc startSession {} {
    global headersFile
    global sessionID
    ;# Dump the headers, including the session cookie, to a temp file.
    curlwrap -m 10 -D $headersFile -s "http://elpress.com/readforfree.jsp"
    ;# Parse out the session ID from the temp file.
    set f [open $headersFile]
    while {[gets $f line] >= 0} {
        if {[regexp {Set-Cookie\:\ jwssessionid\=(.*?)\;Path\=\/} $line -> sessionID]} {
            break
        }
    }
    close $f
    return
}

;# Sets up the files and dirs so that we can save the required images.
proc setupFiles {} {
    global savedir
    exec mkdir -p $savedir
    return
}

;# Get all names of the pages from the book,
;# according to their internal naming scheme.
proc getPageNames {} {
    global isbn
    global sessionID
    set pagenums {}
    ;# To get the names of the pages from this book,
    ;# first request a page from the book, then
    ;# get the control panel for that book.
    ;# The control panel has information on all the page names.
    ;# Fetch page 1.
    set frameURL "http://elpress.com/content.jsp?Book=${isbn}&Page=1"
    set sessionCookie "jwssessionid=$sessionID"
    curlwrap -m 10 -b $sessionCookie -s $frameURL
    ;# Fetch the control panel.
    set controlURL "http://elpress.com/readitcontrol.jsp"
    set sessionCookie "jwssessionid=$sessionID"
    set data [curlwrap -m 10 -b $sessionCookie -s $controlURL]
    ;# Parse the data from the control panel
    ;# for the names of the pages from the book.
    set seenMarker 0
    foreach line [split $data "\n"] {
        if {[regexp {^\<OPTION VALUE\=\"COM\.} $line]} {
            ;# Start of page numbers.
            set seenMarker 1
        } elseif {$seenMarker && [regexp {^\<\/SELECT\>} $line]} {
            ;# End of page numbers.
            break
        } elseif {$seenMarker && [regexp {^\<OPTION VALUE\=\"(.*?)\"} $line -> aPage]} {
            ;# Found a page number.
            lappend pagenums $aPage
        }
    }
    return $pagenums
}

;# Download the graphic for page `p' and save to file `s'.
;# Requires the name of the page according to their naming scheme,
;# e.g.: roman1, 43.
proc getPage {p s} {
    global isbn
    global sessionID
    ;# The java web server requires us to get a previous frame,
    ;# i.e. do a prefetch before getting the actual content.
    ;# We must also do another prefetch to get the zoomed image.
    ;# We want the highest quality image.
    ;# Do the frame prefetch.
    set frameURL "http://elpress.com/content.jsp?Book=${isbn}&Page=${p}"
    set sessionCookie "jwssessionid=$sessionID"
    curlwrap -m 10 -b $sessionCookie -s $frameURL
    ;# Do the zoom prefetch.
    set zoomURL "http://elpress.com/content.jsp?AbsZoom=6"
    set sessionCookie "jwssessionid=$sessionID"
    curlwrap -m 10 -b $sessionCookie -s $zoomURL
    ;# Get the content, large hi-rez image, and save it to disk.
    set contentURL "http://elpress.com/servlet/page/${isbn}/${p}/6"
    set sessionCookie "jwssessionid=$sessionID"
    curlwrap -m 10 -b $sessionCookie -s $contentURL > $s
    return
}

;# A wrapper for proc `getPage' that displays info and catches errors.
;# Requires the name of the page according to their naming scheme,
;# e.g.: roman1, 43.
proc niceGet {i} {
    global savedir
    set success 0
    while {!$success} {
        puts -nonewline "trying $i ... "
        flush stdout
        if {[catch {getPage $i [file join $savedir $i.gif]}]} {
            puts "failed $i"
            puts "waiting 60 seconds"
            flush stdout
            after 60000
        } else {
            set success 1
            puts "got $i"
            flush stdout
            after 1000
        }
    }
}


;# Download entire book.
setupFiles
startSession
foreach p [getPageNames] {
    niceGet $p
}

exit
;##################################################################
;# EOF


P.S. If you're not so hot on the HTTP protocol, then check out this O'Reilly
book - the bastards published it; though based on Perl, it's got a nice
HTTP intro...

http://www.oreilly.com/openbook/webclient/

Suck it Perl MONGRELS!

!============ ON another Note (dontcha love my formatting skilllezzz)

I've done a pretty big project using Tcl/Tk (will give you details another
time). I would just like to say:

THANK YOU GAWDS OF USENET.
I read all your posts to c.l.t and adore you.

I have printed many of *Sir Cameron Laird*'s postings and they paper my
bedroom walls.

*Mr. Larry Virden* is a personal hero.

*Senor Jeff Hobbs* is a great example of what it means to really be a `human
bean.' Thank you Jeffrey. You are a credit.

I will thank the rest of you, as I see fit. Thank you for listening to my
thanks.

Thank you Groups.Google.Com for being my gateway to c.l.t I spent many
lonely nights reading the archived postings of my cyber-heroes who don't
even know I exist - but still I love them!!!!!

If you have read this far, you will be rewarded in the afterlife. Praise
be.

Oh Yeah, I love Tcl/Tk, specially the community. Jeez are you all hippies
or something? you're all soooooo nice.

Slap me,
jooky

Your name

Jan 28, 2002, 5:45:45 PM

> Little-known fact that can sometimes be convenient
> when working in this area: many pages set up to re-
> ceive POSTed information happen to parse it equally
> when GETted.

That is a cool trick. Thank you O Wise One.

Is this because most cgi's are written in php, asp, jsp...dadada, and they
transparently handle arguments?

But... mmm if you want to fake being a random surfer (so often the case that
I want to) it's not good to stand out... I imagine it would look different
in logs...

Spank me,
jooky

Cameron Laird

Jan 28, 2002, 6:22:41 PM
In article <Xns91A4B6A4F...@66.185.95.104>,
.
.
.
All true. Yes, a lot of common CGI packages just toss in the
POST/GET switching along with all the parsing they share, so
the application never knows the difference.

It sure shows up in logs, though.

Neil Madden

Jan 28, 2002, 7:09:20 PM
Your name wrote:

<ranting, and code for (somewhat dubiously) getting books without
paying, snipped>

> Slap me,
> jooky

Is this the scariest posting ever made to c.l.t.?

On the points about http package, and GUI unresponsiveness - does the
-command option to http::geturl help you there? Also, you can do
http::config -useragent "Mimick IE to your heart's content here", to
mimick UA strings.
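
For example, a rough sketch (untested):

package require http
::http::config -useragent "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)"

proc fetched {tok} {
    # Called from the event loop when the transfer finishes (or fails).
    puts "status: [::http::status $tok]"
    ::http::cleanup $tok
}

# -command makes geturl return immediately instead of blocking.
::http::geturl http://www.example.com/ -command fetched
vwait forever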

Your name

Jan 28, 2002, 7:20:19 PM
> <ranting, and code for (somewhat dubiously) getting books without
> paying, snipped>

yeah i love the authors, but `think of the children' there is an element in
the world, that you despise, that you are afraid of, the children of the
corn. we eloi are not alone

> Is this the scariest posting ever made to c.l.t.?

oh the honore

> On the points about http package, and GUI unresponsiveness - does the
> -command option to http::geturl help you there? Also, you can do
> http::config -useragent "Mimick IE to your heart's content here", to
> mimick UA strings.

(I would take this moment to say: thank you brent welch, master of the ages,
lord of the undeserving, and my heart's content.)

Sure, the http package has a wonderful -command option, and can set the
uagent, BUT it hangs when doing a resolve on a host that is down. This
renders the GUI unresponsive, and is plain unacceptable for a professional
app. Sexy!

All things considered, Daniel Stenberg, the Creator of curl
(http://curl.haxx.se), has a devoted following and has spent countless hours
on a SINGLE tool, a Light Saber, an excellent Component, that is aimed
squarely at being a network-aware proggy. Yum.

Until the day that the http package does not hang on unresponsive servers, I
will have to hack around this problem. Note that I have spent months
searching and scouring for alternatives and for a cross-platform solution
(my main dev takes place on linux but I want my buddies to be able to use
mah proggies on Windows).

Before downloading, I'm pretty sure that BrowseX even goes so far as to exec
a resolver (a small .exe under 20 KB) on Win9x

Love your life. We are all children now.

jooky

Cameron Laird

Jan 28, 2002, 7:53:56 PM
In article <3C55E830...@cs.nott.ac.uk>,
Neil Madden <nem...@cs.nott.ac.uk> wrote:
.
.
.

>On the points about http package, and GUI unresponsiveness - does the
>-command option to http::geturl help you there? Also, you can do
>http::config -useragent "Mimick IE to your heart's content here", to
>mimick UA strings.

While jooky certainly has demonstrated an ability to
speak on his own behalf, I'll jump in here. I'm
thinking about Web scraping a lot this month myself,
and, if I understand correctly, NO, -command does
not help with the specific problem of http blocking
while it resolves a symbolic host name.

There's a lot to talk about here. -useragent is in-
deed useful, and worth mentioning. Jooky's not alone,
though, in advertising curl's virtues. While I've
been practicing with curl and libcurl lately, I still
don't have enough experience to trust my judgment on
how much value it adds over http. I hope Phil,
Andrés, Jooky, and others will chime in.

A separate question is what we want to do with http.
Andreas has shown a willingness to incorporate moder-
ate enhancements, and Mats, Pat, and others have
already written a fair amount of new code. I don't
know whether the community as a whole thinks
(pure-Tcl) http is worth maintaining, or whether it
should be abandoned in favor of a libcurl binding.

I don't agree with Jooky on the propriety of "crack-
ing" protection schemes, ignoring licenses, and such.
On the other hand, I work a lot on Web scraping with
techniques that I recognize can easily be put to
those ends. My own plan is to keep working in public,
even if I might be exchanging information with some-
one who makes different choices about legal or
ethical matters than I do.

'Know how Expect has no effective competition? What
would it be like if Tcl (rather than the Perl of the
ORA client-side Webbing book) gained a reputation as
the canonical language for Web scraping?

Your name

Jan 28, 2002, 7:33:09 PM
cla...@starbase.neosoft.com (Cameron Laird) wrote in
news:B7BFCBDB95C7B1FD.A1F5D48B...@lp.airnews.net:

> All true. Yes, a lot of common CGI packages just toss in the
> POST/GET switching along with all the parsing they share, so
> the application never knows the difference.

Thank you for enlightenment.

>
> It sure shows up in logs, though.

PRAISE HIM! Children gather round, Learn!

Cameron Laird

Jan 28, 2002, 7:58:12 PM
In article <3C55E830...@cs.nott.ac.uk>,
Neil Madden <nem...@cs.nott.ac.uk> wrote:
.
.
.
>Is this the scariest posting ever made to c.l.t.?
.
.
.
I think I'd nominate the ones where someone working
for the US government or a comparable institution
trashes Tcl, says he needs reliability/portability/
..., then concludes with a claim that the next
nuclear reactor/moonshot/vaccination program/...
will be built with Visual Basic.

Your name

Jan 28, 2002, 8:34:47 PM
cla...@starbase.neosoft.com (Cameron Laird) wrote in
news:F2AADD92D4078068.15B35DE3...@lp.airnews.net:

> 'Know how Expect has no effective competition? What
> would it be like if Tcl (rather than the Perl of the
> ORA client-side Webbing book) gained a reputation as
> the canonical language for Web scraping?

How can Tcl become the KingTcl? Hmmm..

I have an emotional attachment to the community, and perhaps have even been
an evangelizer - sexy! but I did not directly think of the community and
how we could increase its success. I am sorry to say that I am a short-
sighted, pathetic fool, and this is what separates men like me from men like
Sir Cameron Laird.

How could Tcl rise to become the canonical language for web-molestation
everywhere?

I have an idea.
It is a small idea.
It is not even original.
It is a small, unoriginal idea from a small, unoriginal man.

After 1 1/2 years of web-molestation, I started to think about why I was
such an idiot.

Why was I doing the same 80% in little squidgy scripts? What an absolute,
tremendous waste of time.

First there were the headers, you know, the command-line arguments. Yeah,
thanks to countless hackers, including Andreas Kupries (sexy tcl boy-toy),
this is made simpler with the `cmdline' tcllib package.
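
e.g. a rough sketch (the option names are made up):

package require cmdline

set options {
    {url.arg     ""            "URL of the form to post to"}
    {out.arg     "results.txt" "file to write the scraped text to"}
    {timeout.arg 10            "timeout in seconds"}
}
array set params [::cmdline::getoptions argv $options "scrape \[options]"]
puts "fetching $params(url) -> $params(out)"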

Then there were the connecting sequences to the web servers, the countless
catch statements, the directory creation routines, the disgusting bloat and
complexity of it all. There was the parsing of the pages, the sickening
retch I would feel in the hollow of my throat as I witnessed a clean emacsen
buffer fill with the delight of grotesqueness resolute in its attempt to
mire me in its bog of bloat.

So I sat down and wrote a little ditty, but forgot it for a year. It is now
resurrected, and I spew my bile forth, without restraint:


0000000000000000000000 BEGIN BILE 0000000000000000000000000000

Fellow seekers,

Today, it is not unusual to have the opportunity to program to open source
APIs. But where the most important information is to be found,
we still encounter the roadblock to greatness, information hoarding,
programming to closed source APIs. Information hoarding takes many
forms, not the least of which is the encapsulation of valuable information,
rich in depth and solace for the soul, in a mire of filth and
heartache, which is not uncommon on the World Wide Web.

Analysis of Paper on Semistructured Information

I read the following work today. It is 8 pages long, and I have taken notes
on it for the reader's, and my own, edification.

Available at http://citeseer.nj.nec.com/hammer97extracting.html
Extracting Semistructured Information from the Web
J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo
Department of Computer Science
Stanford University
Stanford, CA 94305-9040
{hector, joachim, cho, aranha, crespo}@cs.stanford.edu
http://www-db.stanford.edu

The abstract states that they `describe a configurable tool for extracting
semistructured data from a set of HTML pages.' I would agree with
them that the web is full of `semistructured' data, since HTML pages do have
some sort of logical structure that aids the human in
understanding the information that is presented on the webpage. There are
tables, line breaks, paragraphs, bulleted lists, ordered lists,
horizontal ruled lines, and graphical elements whose original purpose was to
aid the reader in understanding the information. When
presentation of a webpage aids the reader in understanding its informational
content, the world rotates on its axis just a little easier. These
structures that help humans understand the content of HTML pages, also help
us write tools to parse and extract the valuable information that
is generally lacking in the majority of the WebSpace.

The abstract goes on to say that their tool is for `converting the extracted
information into database objects.' This is a great idea, because
once you have extracted the valuable, information rich chunk of data, you
want to place it in a safe place. When you mine deep in the ground,
you want to save the gold rich ore somewhere safe. I will interpret their
use of `object' to mean that they have some sort of metadata about
this information chunk that will be valuable in its use. The format in which
they store this extracted information is in the OEM format, their
own specialized format. This format for representing, and storing objects is
also used in their other tool, in the TSIMMIS project.

`The input to the extractor is a declarative specification that states where
the data of interest is located on the HTML pages, and how the data
should be "packaged" into objects.' This declarative specification must
respect concerns for ease of maintenance, and structure as well as contain a
vast amount of metadata. Without this metadata, each match is simply a
chunk of rock, rich in precious metal, but we are unsure of whether it
came from Terra or Mars. Information without associated data about where it
was found, and what it is, is not as valuable as tagged information.

The authors of the paper in question noted that they have implemented the
Web extractor in the Python language. Note that our aim here is to
create a generic enough description of where the information on a webpage is
that a robot can be implemented in any language with modern
programming constructs. (we could write a robot that reads these engine
defn files in Tcl/Tk mmmm Portability is SexxxxY 28 Jan 2002)

Their prototype is installed in the TSIMMIS testbed as part of a DARPA I^3
(Intelligent Integration of Information) technology demonstration, where it
is used for extracting weather data from various WWW sites. It is curious to
note that DARPA is also involved in this project. I will have to do some
more research into this project; I wonder if it is still in use, noting that
this paper was written in 1997. I would appreciate your time and efforts in
locating said information.

`... the contents of the WWW cannot be queried and manipulated in a general
way. In particular, a large percentage of the information is
stored as static HTML pages that can only be viewed through a browser. Some
sites do provide search engines, but their query facilities are
often limited, and the results again come as HTML pages.'
This is unfortunately still the case in 2001; we have only exacerbated the
problem that was rampant and noted by academics in 1997. XML
interfaces to websites seek to remedy this; witness Reuters' transformation:
`MarketsML will bring together standardized XML formatted information from
the various Reuters data outlets so users can access all of the
information in a seamless format. "Say you want coffee grower prices, then
maybe you'll want weather reports, background research, yield
information, indexes or futures," says Hunt. "This is all quite difficult to
navigate at the moment." MarketsML looks to change that and make
accessing and assembling all of the information faster and easier as an
architecture on which to bring the information together for the end
user.'(footnote)
XML feeds like the one at moreover.com(footnote), are gaining in acceptance.
The reality though is not as rose-coloured as all those XML
advocates would have you believe. When was the last time a search engine or
any news website returned meta-tagged information, instead of
your content surrounded in a mess of blinking lights and java-enhanced
subliminal advertisements? The state of the web today is one in
which commercial interests are all vying for your sore, dry, irritated
eyeballs. This is the reason we parse websites, we cull them for valuable
information, because today, they are so bloody crowded in a jungle of
perversion. Hopefully tomorrow will be a brighter day.

`The descriptor is based on text patterns that identify the beginning and
end of relevant data; it does not use "artificial intelligence" to
understand the contents. This means that our extractor is efficient and can
be used to analyze large volumes of information. However it also
means that if a source changes the format of its exported HTML pages, the
specification for the site must be updated. Since the specification
is a simple text file, it can be modified directly using any editor.
However, in the future we plan to develop a GUI tool that generates the
specification based on high-level user input.'
We see here that their `descriptors' (that is the name I will adhere to for
the idea of a definition file for the structure of a website, from now on)
are generated by people. A GUI tool for updating the descriptors would be
quite a neat tool because it would allow the technically uninclined
to aid us, but this is a thought best reserved for the future. My mind wonders
at the magnitude of the algorithms needed to implement this, and
shivers a deep, cold, beautiful freezing. I should also note here that upon
further research, I have found screenshots of many tools that do
indeed implement GUI tools aiding humans in creating website descriptors.
I did not find their source though. Why does academia not
release their source code always? The future of mankind is at stake, no
less.

They go on to speak about their Object Exchange Model (OEM), it seems like a
precursor to XML. Upon further research it is noted that
their current generation of tools do indeed use XML, and searches on
citeseer (see bibliography) do indeed return papers that deal with the
translation of the OEM format to the XML one. I have not studied those
papers in any depth though, I encourage the interested to witness the
portability of legacy formats.

The Stanford IBM Manager of Multiple Information Sources (TSIMMIS) uses the
OEM format extensively, and this is probably why they
chose to use it for this extractor project.

They speak of TSIMMIS `wrappers'; these wrappers are not the ones most
usually thought of when writing new interfaces to programs; here the
idea is that the TSIMMIS wrappers are the interface for querying the
extracted information.

Their project is summarized in the following sentence: `The wrapper uses the
extractor to retrieve the relevant data in OEM format, and then
executes the query (or whatever query conditions have not been applied) at
the wrapper.' The query they speak of is one that is similar to
SQL. How it works, in English: An engine reads an extractor definition file
which tells it where the valuable data is in the website, then it
stores the data in an object (in the object-oriented sense of the word),
then they can answer queries about this object using some SQL variant.
If the web page information can be extracted and transformed into XML, then
almost any current generation tool will be able to use this data
and stuff it into any sort of database that is imaginable to the human mind.
There is no use, in my unadulterated opinion, in implementing our own
querying language. Someone who wishes to cull a large amount of information
from a website, be it images, movies, or raw text, could write
an extraction definition for a website of interest, then sit back as XML was
generated, this XML in turn could be used to populate a database,
or transformed into a new format. If it were, say used to populate a mySQL
database, then we could simply use SQL to query the tables of
interest. (i now like postgresql more Jan 28 2002)

The authors give a detailed account of how they extracted weather data from
an Intellicast site, they call their engine a `configurable
extraction program'. The specification file consists of a sequence of
commands, each defining one extraction step. Each command is of the
form `[ variables, source, pattern ]'. I encourage the interested to read
the original paper and to witness the example, it is very inspiring, and I
will use it as the basis for our Universal Parser (we should call it Tcl
Universal Parser ;>). I do indeed like their use of representing all data in
websites as objects, it does indeed map to the problem domain well. I do not
like their neglect of regular expressions; this is where I will improve
on their design. Their way of matching text is not a very flexible one, and
I suspect that most power users are or want to be familiar with them.
That is what makes them power users, the will to learn and stamina.

Your opinions, reflections, and comments are extremely important to me, I
need your ideas on how to improve the syntax (compared to the academic
exercise Jan 28 2002), as well as the grammar for the extraction
specification files.

Side Note on Social Engineering

For our uses, we will have to organize large quantities of humans to aid us,
we will need liberal amounts of social engineering. This will be,
by far, the most difficult part of this endeavour, the massive organization
of live machines. I would appreciate any, and all help in this task,
for I am not expert at this. (I do then submit to Sir Cameron Laird's
organizing principle, for he is the lynchpin of this affair, and I am but a
knight at his roundtable. Jan 28 2002) This is most probably why we do not
hear of this project's great success, for they were unable to organize a
community large enough for the maintenance of such a large number of website
descriptor files. I trust that we will not suffer the same fate. May Athena
be on our side. I will, in the near future, write a small dissertation on
the uses of social engineering, and the projects' success being intertwined
with its accurate, proper, and timely use.

Extra Research, or Neglecting the Real World

Once I had done an initial first reading of the main paper, I did research
on related projects, such as Lore, the Lorel query language, the OEM
format and the TSIMMIS project. I will present to you my findings. (not
bloody likely anymore Jan 28 2002) This additional research is the primary
reason that it took an extended period of time to write this dissertation. I
hope it is worth it.

Footnotes

Reuters Announces New XML Initiative, By Cristina McEachern , Wall Street &
Technology Online, Jan 26, 2001 (12:16 PM),
http://www.wstonline.com/printableArticle?doc_id=WST20010126S0006
(Reuters jumps on the XML bandwagon, more XML hype, the promise of a better
tomorrow?! You won't miss much if you don't read this ;))

Moreover::XML Feeds, http://w.moreover.com/dev/xml/
(moreover is a great resource for searching for the latest news events, use
& abuse them. Commercial bastards with good technology. At least
they offer free searching of their database, free for now. We must create a
free alternative before the interface looks like altavista. They've
got XML feeds, in many flavours. Stupid XML.)

Bibliography

Citation details: Extracting Semistructured Information from the Web -
Hammer, Garcia-Molina, Cho, Aranha, Crespo (ResearchIndex),
NEC Research Institute, http://citeseer.nj.nec.com/context/61091/8361
(These are the search results for a search of all the papers that have cited
the analysed paper. From citeseer, a deep, deep drill into the web.
Why do humans create more trouble, the web is chaotic, like my wandering
mind.)

ResearchIndex: The NECI Scientific Literature Digital Library, NEC Research
Institute, http://citeseer.nj.nec.com/directory.html
(Excellent Computer Science Directory, all pointers to papers written by
highly eminent! professors, HIGHLY RECOMMENDED SITE,
including sections on Agents, Face/Speech Recognition, Clustering, AI,
Natural Language, Compression, Databases, Information Retrieval,
Information Extraction, Machine Learning, Networking, OS Theory, Encryption,
Information Warfare, and loads about SEARCH
ENGINES)

0000000000000000000000 END BILE 0000000000000000000000000000

What do I think of this paper now? I realize that I am a bigger idiot than
I ever thought possible. Mobilize a community? Ha! I am a naive child and
I should be so lucky to be spanked!

CHAIN ME LOVE ME HATE ME KILL ME

Too bad I retain my ideals. They will be my demise.
jooky

Your name

Jan 28, 2002, 8:47:08 PM

Reading my own post, it's probably too much for most to read. Here is the
point:


you want tcl to get noticed?

let it be the first to seriously implement universal web parsing, and object
chunking.

what does that jargon mean?

instead of hand parsing every web page with hand made fragile scripts,
create a tcl x-platform robot that understands a definition language for
every web site.

so you want to parse the weather from a cnn weather page.
you create a definition file that is a bunch of regexps, as well as how they
fit together to yield an object, an object that has various members that
contain your data. mmm anyone smell xml? i dont want to contrib to the
xml hype, but in any case, we get some data extracted that is the weather
for omaha and marseilles. it's in 2 weather objects and then you pass it
along to the next program or just chuck it in a db.

sexy. now you just write defn files for each website instead of hand making
crappy tacky crafts like a used and abused native.
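
a toy sketch of what i mean (descriptor and patterns made up; $pageHtml is
the fetched page):

# The "defn file" is just field-name / regexp pairs; the engine applies each
# pattern to the page and builds a flat name/value list (the "object").
proc scrape {descriptor html} {
    set obj {}
    foreach {field pattern} $descriptor {
        if {[regexp $pattern $html -> value]} {
            lappend obj $field $value
        }
    }
    return $obj
}

set weatherDefn {
    city  {<title>Weather for ([^<]+)</title>}
    tempF {Current temperature: (\d+)}
}
puts [scrape $weatherDefn $pageHtml]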


read the previous post if you want some bullshit and an analysis of a paper
by some stuffed shirts.

love ya *lots*
jooky


Your name

Jan 28, 2002, 8:51:21 PM

> There's a lot to talk about here. -useragent is in-


> deed useful, and worth mentioning. Jooky's not alone,
> though, in advertising curl's virtues. While I've
> been practicing with curl and libcurl lately, I still
> don't have enough experience to trust my judgment on
> how much value it adds over http. I hope Phil,
> Andrés, Jooky, and others will chime in.

curl for one has http 1.1 and can do https.

> I don't agree with Jooky on the propriety of "crack-
> ing" protection schemes, ignoring licenses, and such.
> On the other hand, I work a lot on Web scraping with
> techniques that I recognize can easily be put to
> those ends. My own plan is to keep working in public,
> even if I might be exchanging information with some-
> one who makes different choices about legal or
> ethical matters than I do.

let he who is without sin cast the first stone
jooky

p.s. i would give my life to you, if you were to ask it.

p.p.s. i give all my evilly gained knowledge though you may become my enemy
one day and seek to hoard human knowledge. what right do you have to
withhold the shitty code to your app, when newton and leibniz shared
CALCULUS with the goddamn world. fuck you.

Cameron Laird

Jan 28, 2002, 10:14:10 PM
In article <Xns91A4D61AE...@66.185.95.104>,
Your name <jo...@jooky.jooky> wrote:
.
.

.
>curl for one has http 1.1 and can do https.
I figure you know this, but I'll make it explicit for
others: Tcl's http can do HTTPS. When it has help--
see the predictable <URL: http://mini.net/tcl/https >
for details.
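The usual recipe is roughly this, assuming the tls extension is installed:

package require http
package require tls

# Teach the http package to hand https URLs to the tls socket command.
::http::register https 443 ::tls::socket

set tok [::http::geturl https://www.example.com/]
puts [::http::code $tok]
::http::cleanup $tok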
.
.

.
>let he who is without sin cast the first stone
>jooky
>
>p.s. i would give my life to you, if you were to ask it.
>
>p.p.s. i give all my evilly gained knowledge though you may become my enemy
>one day and seek to hoard human knowledge. what right do you have to
>withhold the shitty code to your app, when newton and leibniz shared
>CALCULUS with the goddamn world. fuck you.
Yee-haw! Newton and Leibniz--why they *forced* calculus on
the world! They coerced it into place. Beat it into submis-
sion. Conquered calculus. Great men both, but let's be
careful how we employ their examples.

Anyway, jooky, thanks for all the tips today. I'd welcome
private correspondence. In its absence, I'll just repeat
my appreciation for the ideas.

Incidentally, I'm unlikely to become any more of a
knowledge-hoarder than I already am. 'Just can't keep tech-
nical stuff to myself.

Your name

Jan 29, 2002, 12:06:34 AM
cla...@starbase.neosoft.com (Cameron Laird) wrote in
news:83738C8248C658FD.3B923C0F...@lp.airnews.net:

>>curl for one has http 1.1 and can do https.
> I figure you know this, but I'll make it explicit for
> others: Tcl's http can do HTTPS. When it has help--
> see the predictable <URL: http://mini.net/tcl/https >
> for details.

Yes, you're right... I was wrong. I never sought to use the tcl http
package with HTTPS, since the resolver issues were my prime concern and
eclipsed any others. When I wrote that message I didn't know that the tcl
http package could handle HTTPS. Thank you for your informative message.
I know that curl can do more than just HTTP/HTTPS, also LDAP, FTP... but
then again there are tcl packages for that.

> Yee-haw! Newton and Leibniz--why they *forced* calculus on
> the world! They coerced it into place. Beat it into submis-
> sion. Conquered calculus. Great men both, but let's be
> careful how we employ their examples.

It is very likely that my analogy was improper.
My point is that we all have to realize how significant we are. The little
we accomplish in our lives should be shared between us, otherwise how are
we, as the human race, ever to truly advance ourselves as a people? There
is most likely no reason to tell you this though, as you are a fine example
of an altruistic person.

> Incidentally, I'm unlikely to become any more of a
> knowledge-hoarder than I already am. 'Just can't keep tech-
> nical stuff to myself.

I'm sorry, that last part of my post was not meant to be directed at you; my
use of language is not very clear. You're the last person I would ever
accuse of being a knowledge hoarder.

jooky

Andreas Kupries

Jan 29, 2002, 1:24:02 AM
Your name <jo...@jooky.jooky> writes:

> > On the points about http package, and GUI unresponsiveness - does the
> > -command option to http::geturl help you there? Also, you can do
> > http::config -useragent "Mimick IE to your heart's content here", to
> > mimick UA strings.
>
> (I would take this moment to say: thank you brent welch, master of the ages,
> lord of the undeserving, and my heart's content.)
>
> Sure, the http package has a wonderful -command option, and can set the
> uagent, BUT it hangs when doing a resolve on a host that is down. This
> renders the GUI unresponsive, and is plain unacceptable for a professional
> app. Sexy!

This is not something in the http package itself, but deeper, in the
[socket] command provided by the tcl core. It uses 'gethostbyname ()'
to map from the hostname to an IP address so that it will work with
whatever is configured in resolv.conf for the host, 'hosts', DNS,
LDAP, YP/NIS, ... The problem is that this function will not return until
it either got the IP address or hit a timeout. This was the topic of several
discussions here on c.l.t. IIRC the best solution which came up so
far was to have the core spawn a helper thread to process (and wait
for) the 'gethostbyname ()' while the rest of the core goes on
crunching.
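
A user-level workaround sketch, for what it's worth (untested; assumes a
threads-enabled core with the Thread extension): do the blocking geturl in
a worker thread so the main thread's event loop keeps running.

package require Thread
package require http

set worker [thread::create {
    package require http
    proc fetch {url} {
        set tok [::http::geturl $url]
        set body [::http::data $tok]
        ::http::cleanup $tok
        return $body
    }
    thread::wait
}]

# thread::send -async stores the script's result in ::page when it is done;
# a vwait or a variable trace in the main thread picks it up.
thread::send -async $worker [list fetch http://www.example.com/] ::page
vwait ::page
puts "got [string length $::page] bytes"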

--
Sincerely,
Andreas Kupries <akup...@shaw.ca>
Developer @ <http://www.activestate.com/>
Private <http://www.purl.org/NET/akupries/>
-------------------------------------------------------------------------------

Andreas Kupries

Jan 29, 2002, 1:39:07 AM

cla...@starbase.neosoft.com (Cameron Laird) writes:

> I don't agree with Jooky on the propriety of "crack-
> ing" protection schemes, ignoring licenses, and such.

I agree with Cameron here. Especially as there are so many places
where one can get public texts for free.

For example:

* Books in general are provided by "Project Gutenberg".
A mirror of its archives is at

ftp://ftp.ibiblio.org/pub/docs/books/gutenberg
(Currently uses 1.056.792 K on my disk)

* Then we have 'The Online Books Page' at

http://digital.library.upenn.edu/books/

or 'Eldritch Press' at

http://www.eldritchpress.org/

* For people more tending to anime/manga fan fiction there are

RAAC (rec.arts.anime.creative), readable as either newsgroup
or as mailing list. See

http://www.gweep.ca/mailman/listinfo.cgi/raac-by-mail

for the mailing list, and

ftp://ftp.cs.ubc.ca/pub/archive/anime-fan-works
(Currently uses 227.952 K on my disk)

for the archive of stories.

* Another anime/manga fanfiction list can be found through

.---Anime/Manga Fanfiction Mailing List----.
| Administrators - ffml-...@anifics.com |
| Unsubscribing - ffml-r...@anifics.com |
| Put 'unsubscribe' in the subject |
`---- http://ffml.anifics.com/faq.txt -----'

Toby A Inkster

Jan 29, 2002, 2:51:21 AM
On Mon, 28 Jan 2002 22:45:45 +0000, Your name wrote:

> Is this because most cgi's are written in php, asp, jsp...dadada, and
> they transparently handle arguments?

Well... PHP, ASP and JSP aren't really CGI. Most Perl CGI programs use
the CGI module (although it's perfectly easy to write one which doesn't)
and thus don't care about GET/POST either.

Personally, I never use the CGI module - big waste of memory that it is,
so all my stuff is hardwired into either GET or POST. Not both.

--
Toby A Inkster, Esq. - mailto:tobyink(at)goddamn.co.uk - gpg:0x5274FE5A
http://www.toby-inkster.co.uk/ - icq:6622880 - aim:inka80

Fuch's Warning:
If you actually look like your passport photo, you aren't well
enough to travel.

Cameron Laird

Jan 29, 2002, 9:22:29 AM
In article <Xns91A5313FB...@66.185.95.104>,

Your name <jo...@jooky.jooky> wrote:
.
.
.
>I know that curl can do more than just HTTP/HTTPS, also LDAP, FTP... but
>then again there are tcl packages for that.
.
.
.
... each of which has problems--the Tcl packages for
FTP, LDAP, and so on, I mean. I've worked both on and
with them, and I know.

I want to be clear: I think there's a lot of value in
TclCurl ('hope <URL: http://mini.net/tcl/tclcurl > is
starting to make that clear). I'm still working to
determine just how much incremental value is there.

Darren New

Jan 29, 2002, 12:58:31 PM
Andreas Kupries wrote:
> far, was to have the core spawn a helper thread to process (and wait
> for) the 'gethostbyname ()' while the rest of the core goes on
> crunching.

... with the addendum that gethostbyname() isn't universally threadsafe,
so you *still* can't look up two names at once, which is what I'd assume
a web crawler would want to do.

On the other hand, if you're writing a crawler to crawl other peoples'
sites, the likelihood that you care about anything besides DNS would
seem low.

--
Darren New
San Diego, CA, USA (PST). Cryptokeys on demand.
The opposite of always is sometimes.
The opposite of never is sometimes.

Your name

Jan 29, 2002, 1:53:15 PM
cla...@starbase.neosoft.com (Cameron Laird) wrote in
news:618421DB4F4C96ED.3D2ABFCC...@lp.airnews.net:

>
> I want to be clear: I think there's a lot of value in
> TclCurl ('hope <URL: http://mini.net/tcl/tclcurl > is
> starting to make that clear). I'm still working to
> determine just how much incremental value is there.

Only if the incremental value of TclCurl over the tcl http package is high
enough will people start using it on a day to day basis.

If anyone here has used TclCurl apart from myself, I would enjoy hearing
what you thought its advantages were over using the http package.

Thank you.

jooky

Your name

Jan 29, 2002, 1:55:00 PM
Toby A Inkster <UseTheAddr...@deadspam.com> wrote in
news:pan.2002.01.29.07...@deadspam.com:

> Personally, I never use the CGI module - big waste of memory that it is,
> so all my stuff is hardwired into either GET or POST. Not both.

Do you really find that you gain a large enough memory savings to warrant
not using a `standard' package? I would be pleased to hear more about your
experiences with this, and I am sure that there will be others who would as
well.

Thank you.

jooky

Your name

Jan 29, 2002, 1:59:09 PM

I am not familiar with the Tcl/Tk core. I am a user.

Might this problem of resolving be fixed in the foreseeable future? I would
appreciate your answer.

I would like to help, but I am not skilled enough yet.

May good things come your way.
jooky

I would like to take this moment to thank the Core Tcl'ers, and the Tcl'ers
who make the Tcl, who use the Tcl, and who improve the Tcl. They are the
few, they bind the community together, and without them I would have been
lost long ago.

Michael Schlenker

Jan 29, 2002, 2:48:09 PM
Your name wrote:
> 0000000000000000000000 BEGIN BILE 0000000000000000000000000000

<SNIP>

> 0000000000000000000000 END BILE 0000000000000000000000000000
>
> What do I think of this paper now? I realize that I am a bigger idiot than
> I ever thought possible. Mobilize a community? Ha! I am a naive child and
> I should be so lucky to be spanked!

Maybe you're naive and a child, but I found your text quite nice to
read. I think you are aware of the many different initiatives to make
some of this happen:

- RDF : The semantic web thing from the W3C <http://www.w3.org>. If
anyone besides fanatic librarians used it more aggressively it would be
very powerful. A pity that nearly no one uses it (it's rather complex
and bloated).

- OAI : The Open Archives Initiative. Tries to offer metadata-enhanced
access to distributed archives.

- citeseer : If they find your scientific paper they dig into it,
extract all the citations and link it with all the thousand other papers
they have got. You seem to love them (and they are doing a great job).

- DublinCore: One of the most used sets of metadata on the web, with its
own problems.

There are many, many problems in this field. First of all there is a lack
of standards and a lack of manpower, and because of that no real
applications.

You look a bit too naive, but you have identified some good points. It
would be great to revive the original idea of SGML and HTML in some ways,
to separate content from presentation. It would make my work much easier
too (having to deal with metadata extraction and use).

A "website scraping engine" would be very nice to have; the tcllib
htmlparse module is clumsy, but works well enough for simple
tasks. But why stop at websites? PDFs and other formats need to be
dissected and analyzed too.
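
For the simple tasks it is good for, the callback mode looks roughly like
this (untested; $html is assumed to hold the fetched page):

package require htmlparse

# Print the text that follows every <td> tag; the callback receives the tag
# name, a slash marker for closing tags, the parameter string, and the text.
proc cell {tag slash param text} {
    if {[string equal -nocase $tag td] && [string equal $slash ""]} {
        puts [string trim $text]
    }
}
::htmlparse::parse -cmd cell $html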

The main problem is not to rip the data from HTML or other formats, but
to convince the providers of such pages to deliver their data in a
useful format in the first place. The truth is that if you have
money you can [sometimes] convince people to give you the data in the form
you want it, but most of the time you do not have the money. Taking most
data without paying is considered piracy by society (or at least the
part that has property (intellectual or other)). We could discuss
that in private email (PGP key available), but it is very much
off-topic here in c.l.t.

Michael Schlenker

Michael Schlenker

Jan 29, 2002, 3:06:42 PM
Darren New wrote:
> Andreas Kupries wrote:
>
>>far, was to have the core spawn a helper thread to process (and wait
>>for) the 'gethostbyname ()' while the rest of the core goes on
>>crunching.
>>>
> ... with the adendum that gethostbyname() isn't universally threadsafe,
> so you *still* can't look up two names at once, which is what I'd assume
> a web crawler would want to do.

Someone mentioned a DNS package in pure Tcl (o.k., only the TCP version,
no standard UDP) that could be used if you're desperate...

Can't find the reference; maybe it is on the wiki?

Michael Schlenker

Rolf Ade

Jan 29, 2002, 3:44:30 PM
Michael Schlenker <sch...@uni-oldenburg.de> wrote:
>Someone mentioned a DNS package in pure TCL (o.k. only the TCP Version,
>no standard UDP) that could be used if your desperate...,
>
>Can´t find the reference, maybe it is on the wiki?

Jochen Loewer claims to have a pure Tcl dns package (see
http://mini.net/cgi-bin/wikit/1743.html). Judging from Jochen's code
that I've seen, I bet it would be worth a look. But I don't know if
it's available somewhere.
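
tcllib's dns package is another pure-Tcl take on this; roughly, and
untested:

package require dns

set tok [::dns::resolve www.example.com -timeout 5000]
::dns::wait $tok        ;# or pass -command for a fully asynchronous lookup
puts [::dns::address $tok]
::dns::cleanup $tok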

rolf

Don Porter

Jan 29, 2002, 6:11:17 PM
Your name wrote:
> Might this problem of resolving be fixed in the foreseeable future? I would
> appreciate your answer.

A good first step would be to report the bug in [socket].

http://sourceforge.net/projects/tcl/

--
| Don Porter Mathematical and Computational Sciences Division |
| donald...@nist.gov Information Technology Laboratory |
| http://math.nist.gov/~DPorter/ NIST |
|______________________________________________________________________|

Darren New

Jan 29, 2002, 6:47:20 PM
Don Porter wrote:
>
> Your name wrote:
> > Might this problem of resolving be fixed in the foreseeable future? I would
> > appreciate your answer.
>
> A good first step would be to report the bug in [socket].
>
> http://sourceforge.net/projects/tcl/

Except that it's not a bug in [socket]. It's not a bug at all. The basic
problem is that host name resolution is portably implemented only for
single-threaded applications. The choices to "fix" this problem are:

1) Run your lookups in separate processes,
2) Only use DNS for lookups,
3) Break on some platforms.

There's a difference between "this is wrong" and "this doesn't do what I
want".

David N. Welton

Jan 29, 2002, 7:22:51 PM
Darren New <dn...@san.rr.com> writes:

> Don Porter wrote:
> >
> > Your name wrote:
> > > Might this problem of resolving be fixed in the foreseeable future? I would
> > > appreciate your answer.
> >
> > A good first step would be to report the bug in [socket].
> >
> > http://sourceforge.net/projects/tcl/
>
> Except that it's not a bug in [socket]. It's not a bug at all. The basic
> problem is that host name resolution is portably implemented only for
> single-threaded applications. The choices to "fix" this problem are:
>
> 1) Run your lookups in separate processes,
> 2) Only use DNS for lookups,
> 3) Break on some platforms.
>
> There's a difference between "this is wrong" and "this doesn't do what I
> want".

Mozilla (Netscape) has a separate DNS lookup thread, for example.

--
David N. Welton
Consulting: http://www.dedasys.com/
Free Software: http://people.debian.org/~davidw/
Apache Tcl: http://tcl.apache.org/
Personal: http://www.efn.org/~davidw/

Your name

unread,
Jan 29, 2002, 7:48:13 PM1/29/02
to
dav...@dedasys.com (David N. Welton) wrote in
news:87ofjcs...@dedasys.com:

> Mozilla (Netscape) has a separate DNS lookup thread, for example.

And BrowseX calls a separate program "resolver.exe" on Windows...

jooky

Darren New

unread,
Jan 29, 2002, 7:50:02 PM1/29/02
to
"David N. Welton" wrote:
> Mozilla (Netscape) has a separate DNS lookup thread, for example.

Yes. That's if you're using DNS. You can write a reentrant DNS lookup
module. That would be #2 on the list, you see. However, that doesn't
work if you're using Sun's "Yellow Pages" for example.

marsd

unread,
Jan 30, 2002, 12:09:12 AM1/30/02
to
Andreas Kupries <akup...@shaw.ca> wrote in message news:<87hep56...@bluepeak.home>...

> Your name <jo...@jooky.jooky> writes:
>
> > > On the points about http package, and GUI unresponsiveness - does the
> > > -command option to http::geturl help you there? Also, you can do
> > > http::config -useragent "Mimick IE to your heart's content here", to
> > > mimick UA strings.
> >
> > (I would take this moment to say: thank you brent welch, master of the ages,
> > lord of the undeserving, and my heart's content.)
> >
> > Sure, the http package has a wonderful -command option, and can set the
> > uagent, BUT it hangs when doing a resolve on a host that is down. This
> > renders the GUI unresponsive, and is plain unnaceptable for a professional
> > app. Sexy!
>
> This is not something in the http package itself, but deeper, in the
> [socket] command provided by the tcl core. It uses 'gethostbyname ()'
> to map from the hostname to an IP address so that it will work with
> whatever is configured in resolv.conf for the host, 'hosts', DNS,
> LDAP, YP/NIS, ... Problem is that this function will not return until
> either got the IP adress or a timeout. This was the topic of several
> discussions here on c.l.t. IIRC the best solution which came up so
> far, was to have the core spawn a helper thread to process (and wait
> for) the 'gethostbyname ()' while the rest of the core goes on
> crunching.

Guys, I'm sorry, I was watching this thread vacillate here, and when it
comes to name resolution and the topic of a (get|head|post) hanging on a down
server, it seems to me that these are really two different things.
I can resolve a given hostname to an address, but whether or not
it actually responds to my requests for a page is another thing
altogether.
TNM has wonderful DNS functionality, but it doesn't help me when I try
to get the page on a crashed server.
So, my question is: where does the bung-up occur?
During the host name resolution, or during the http -get?

If the former (for jooky), something like:

    package require Tnm 2.0
    # guard the lookup with catch rather than testing a value
    # with [info exists]
    if {![catch {dns address -timeout 100 -retries 1 $hostaddy} rr]} {
        # resolved ok -- get_free_books codeskillz here..
    }

Else: there is a timeout option on the get, right? If not, you could bang
the listening port once with a socket feeler, read the ID or catch EOF.
It would probably appear innocuous.

Webserver administrators are paranoid bastards, and pay-per-view
books on the web is a shitty concept, but that's life.
You know there are libraries full of books, and these
same administrators get paid to keep the servers safe, which
entails a certain degree of paranoia.

Jochen Loewer

unread,
Jan 30, 2002, 12:36:53 AM1/30/02
to
"Michael Schlenker" <sch...@uni-oldenburg.de> wrote in message
news:3C5700D2...@uni-oldenburg.de...

> Someone mentioned a DNS package in pure Tcl (o.k., only the TCP version,
> no standard UDP) that could be used if you're desperate...
>
> Can't find the reference, maybe it is on the wiki?

Yes, I mentioned that and asked Andreas whether we should include
this (and my pure-Tcl LDAP) in tcllib.
I didn't proceed with that due to lack of time.

However, as you already mentioned, it can currently only work
if the peer DNS service listens on the TCP port.
I would really like to see general UDP support in the standard
Tcl core (Tcl 8.4). This would really open the way for implementing
other nice network packages (limited NMB/SMB, ...).

Also, my Tcl code would have to be refactored to use Tcl TCP event
handling to take full advantage of this implementation, but this would
also make the usage a little bit more complex. Right now an
'update' while waiting for the DNS answer does the job as well,
I guess.

Besides that, for web scraping one can/should use the HTML parser
within tDOM to do easy XPath-based addressing of the data embedded
deeply in the HTML. I demonstrated this at last year's European
Tcl user meeting in Hamburg with an eBay scraping demo.
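
For anyone curious, a minimal sketch of that approach looks like this
(the URL and the XPath expression are made-up placeholders; adapt them
to the page you actually scrape):

    package require http
    package require tdom

    # fetch the page (use -query here if the form needs a POST)
    set tok  [::http::geturl http://www.example.com/results.html]
    set html [::http::data $tok]
    ::http::cleanup $tok

    # tDOM's HTML parser copes with real-world, non-XML markup
    set doc  [dom parse -html $html]
    set root [$doc documentElement]

    # address the interesting bits with XPath instead of regexps
    foreach node [$root selectNodes {//table//td}] {
        puts [string trim [$node text]]
    }
    $doc delete

The XPath step is what keeps the scraper from turning into a pile of
fragile regexps.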

Regards,
Jochen.


--
Posted via Mailgate.ORG Server - http://www.Mailgate.ORG

Donal K. Fellows

unread,
Jan 30, 2002, 6:12:18 AM1/30/02
to
Darren New wrote:

> Don Porter wrote:
>> A good first step would be to report the bug in [socket].
>> http://sourceforge.net/projects/tcl/
>
> Except that it's not a bug in [socket]. It's not a bug at all. The basic
> problem is that host name resolution is portably implemented only for
> single-threaded applications. The choices to "fix" this problem are:
>
> 1) Run your lookups in separate processes,
> 2) Only use DNS for lookups,
> 3) Break on some platforms.
>
> There's a difference between "this is wrong" and "this doesn't do what I
> want".

So? Just because the problem comes from software that Tcl sits on top of
doesn't make it something that shouldn't be reported, and it may be that
something can actually be done about this at some point in the future - maybe
surprisingly much can be done soon. (Also, we can always move it to Feature
Request if we feel like it. :^)

Donal.
--
Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ fell...@cs.man.ac.uk
-- We shall, nevertheless, continue to dispense [insults] on the premise of
giving credit where credit is due, you ill-bred nanowit sack of bovine
fecal matter. -- Xelloss <jfos...@home.com>

Donal K. Fellows

unread,
Jan 30, 2002, 6:16:34 AM1/30/02
to
Cameron Laird wrote:
> I think I'd nominate the ones where someone working
> for the US government or a comparable institution
> trashes Tcl, says he needs reliability/portability/
> ..., then concludes with a claim that the next
> nuclear reactor/moonshot/vaccination program/...
> will be built with Visual Basic.

How many of those have we seen?

lvi...@yahoo.com

unread,
Jan 30, 2002, 8:22:02 AM1/30/02
to

According to Darren New <dn...@san.rr.com>:

:Don Porter wrote:
:>
:> Your name wrote:
:> > Might this problem of resolving be fixed in the foreseeable future? I would
:> > appreciate your answer.
:>
:> A good first step would be to report the bug in [socket].
:>
:> http://sourceforge.net/projects/tcl/
:
:Except that it's not a bug in [socket]. It's not a bug at all.

That's okay - instead of reporting it as a bug, report it as a feature
request!

It's okay for people to ask for things - really! Now, does that mean they
will happen in some time frame? Not unless one goes recruiting - money,
fame, skills trade, etc. all are potential incentives...

--
"I know of vanishingly few people ... who choose to use ksh." "I'm a minority!"
<URL: mailto:lvi...@cas.org> <URL: http://www.purl.org/NET/lvirden/>
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.

lvi...@yahoo.com

unread,
Jan 30, 2002, 8:30:51 AM1/30/02
to

According to Cameron Laird <cla...@starbase.neosoft.com>:
:I don't agree with Jooky on the propriety of "crack-
:ing" protection schemes, ignoring licenses, and such.
:On the other hand, I work a lot on Web scraping with
:techniques that I recognize can easily be put to
:those ends.

I agree with Cameron. I started some of this furor asking for some help
scraping some web pages. The pages I was scraping were ones containing
info about my library card status - the owner of the web site had no
stated policy against scraping. There are other web pages similarly
unprotected - these are the types of things I find myself dealing with.

So far, I have gotten to the point of either getting the entire HTML
page or getting just the text portion. I find it easier to grab the HTML
and then hand it to lynx to format than to try to parse through the
HTML for the useful bits. Maybe some day this info will be provided as
a web service ...
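
For the record, the whole cronable pipeline can stay pretty small. A
sketch, assuming lynx is on the PATH, with placeholder URL and file names:

    package require http

    # fetch the page (this is the part worth keeping in pure Tcl)
    set tok  [::http::geturl http://library.example.org/status]
    set html [::http::data $tok]
    ::http::cleanup $tok

    # hand the raw HTML to lynx for the text rendering
    set f [open /tmp/page.html w]
    puts $f $html
    close $f
    set text [exec lynx -dump /tmp/page.html]

    # save the result for later
    set f [open status.txt w]
    puts $f $text
    close $f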

Cameron Laird

unread,
Jan 30, 2002, 9:35:34 AM1/30/02
to
In article <3C57345C...@san.rr.com>, Darren New <dn...@san.rr.com> wrote:
>Don Porter wrote:
>>
>> Your name wrote:
>> > Might this problem of resolving be fixed in the foreseeable future? I would
>> > appreciate your answer.
>>
>> A good first step would be to report the bug in [socket].
>>
>> http://sourceforge.net/projects/tcl/
>
>Except that it's not a bug in [socket]. It's not a bug at all. The basic
>problem is that host name resolution is portably implemented only for
>single-threaded applications. The choices to "fix" this problem are:
>
>1) Run your lookups in separate processes,
>2) Only use DNS for lookups,
>3) Break on some platforms.
>
>There's a difference between "this is wrong" and "this doesn't do what I
>want".
.
.
.
Yes, in general, but we have a higher standard in Tcl.
While Perl and Python can get away with saying, "well,
that's the way C works", Tcl networking is undeniably
modeled on a higher-level abstraction. My conclusion,
therefore, is that there's a fault *somewhere* in the
system. At the very least, we need to document the
socket behavior.

I think your choices 1), 2), and 3) are nearly correct,
and that's certainly a valuable observation. Why,
though, are you excluding
4) Run lookups in separate threads or
processes, as a platform-specific
implementation detail?

Darren New

unread,
Jan 30, 2002, 11:45:51 AM1/30/02
to
lvi...@yahoo.com wrote:
> :Except that it's not a bug in [socket]. It's not a bug at all.
>
> That's okay - instead of reporting it as a bug, then report it as a feature
> request!

Sorry. For some reason, my text-communication skills have been in the
crapper this week. I didn't mean to imply that having more options was
bad. Only that this particular problem is not one you can fix without
losing functionality.

What I would suggest is adding a "-dns" flag to socket that says to
bypass gethostbyname() and use direct DNS lookups instead. Having done
that, you can then support even more things, like the DNS SRV records,
MX records, etc.

Or perhaps a separate [package require dns] that would provide such
things as name resolution, and then you could use
dns::fqdn_to_ip to get the IP address and pass that to socket so it
doesn't have to do the lookup itself, for example. Then you don't have
to touch socket at all.
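
To make the intent concrete, usage could look roughly like this (the
dns package and dns::fqdn_to_ip are the hypothetical pieces proposed
above, not something that exists today):

    package require dns      ;# hypothetical package, as proposed above

    proc fetch {host port} {
        # resolve up front so [socket] never has to call gethostbyname()
        set ip [dns::fqdn_to_ip $host]   ;# hypothetical command
        return [socket $ip $port]
    }

    set chan [fetch www.example.com 80]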

Sounds more like a TIP than a bug report, is all.

Chang Li

unread,
Jan 30, 2002, 12:14:09 PM1/30/02
to
Your name <jo...@jooky.jooky> wrote in message news:<Xns91A4B57E6...@66.185.95.104>...

> (2) The http package, though great, is not as great as just exec'ing curl
> (http://curl.haxx.se). Curl is a super-wicked tool. Its creator is also
> one of my personal GAWDS. I wish I was as cool as him. Now what is so
> great about this tool? Well it's pretty simple, folks: it works. It's cross-
> platform. It does SSL, cookies, POST, GET, and can easily fake referers,
> uagents and a whole LOT more. Praise be.

I like the http package because it is pure Tcl, so it is portable:
write once, run everywhere. If there is no big performance
difference, pure Tcl is nice. I hate porting a C extension
everywhere, because I have to write a new makefile for
different compilers.

I hope to improve the http package.

Chang

Your name

unread,
Jan 30, 2002, 1:59:06 PM1/30/02
to
lvi...@yahoo.com wrote in news:a38sib$onj$2...@srv38.cas.org:


> So far, I have gotten to the point of either getting the entire html
> page or getting just the text portion. I find it easier to grab the html
> and then hand it to lynx to format, than to try to parse through the
> html for the useful bits. Maybe some day this info will be provided as
> a web service ...

The downside to doing this is that a slight change in the format of the page
will be harder to notice when using something like `lynx -dump' and then
parsing it.

I seriously suggest using the source HTML as your parsing fodder. Use
something like `lynx -source' IIRC. I have no lynx on hand to test it; my
apologies.
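
In other words, the difference is just this (a sketch, with a placeholder
URL, and untested for the same reason):

    # rendered text: convenient, but layout changes reshape it silently
    set text [exec lynx -dump   http://www.example.com/page.html]

    # raw HTML: parse this instead, so format changes fail loudly
    set html [exec lynx -source http://www.example.com/page.html]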

On the ethics of web scraping:

My browser downloads files from websites and displays them.
My robot downloads files from websites and stores them to be displayed
later.

The website that hosts those books allows you to view them in your
web browser now, so why can't my robot allow me to view them later? Please note
that I am not `cracking' anything.

When knowledge becomes outlawed, only outlaws will have knowledge.

good day, sir.

lvi...@yahoo.com

unread,
Jan 30, 2002, 2:47:12 PM1/30/02
to


According to Your name <jo...@jooky.jooky>:
:instead of hand-parsing every web page with hand-made fragile scripts,
:create a Tcl x-platform robot that understands a definition language for
:every web site.


I'd go a step farther and suggest that someone with a great interest in this
take a look at projects like sitescooper, plucker, WWW::Search, etc. -
environments currently set up to behave in a manner similar to what is
being described.

Perhaps we can at least learn from them - and maybe, even better,
liberate code/algorithms for our use.


Having a common language shared among these various projects might
enable users to adapt more quickly to the ever-changing environment
of web site designs.
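
Purely as a sketch of that idea (the site definition, field names,
regexps and URL below are all invented for illustration), a "definition"
could be little more than a list of field names and regexps that one
generic fetch-and-extract proc walks over:

    package require http

    # hypothetical per-site definition: field name -> regexp with one capture
    set def(url)    http://weather.example.com/omaha
    set def(fields) {
        temperature {<td class="temp">([^<]+)</td>}
        conditions  {<td class="cond">([^<]+)</td>}
    }

    proc scrape {defName} {
        upvar #0 $defName def
        set tok  [::http::geturl $def(url)]
        set html [::http::data $tok]
        ::http::cleanup $tok
        set result {}
        foreach {field re} $def(fields) {
            if {[regexp $re $html -> value]} {
                lappend result $field [string trim $value]
            }
        }
        return $result
    }

    puts [scrape def]

New sites then become new definition files rather than new scripts.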

lvi...@yahoo.com

unread,
Jan 31, 2002, 7:59:40 AM1/31/02
to

According to Darren New <dn...@san.rr.com>:
:Sounds more like a TIP than a bug report, is all.

As long as people remember that filing a TIP requires someone predetermined
to do the implementation, while filing a feature request leaves out that part.
Of course, with someone to do the implementation, things are likely to
happen, while without it, one may never see it happen - until more people
from the community start looking at the feature requests and implementing
them. A low developer-to-feature-request ratio is why we don't have more
than we have now...

lvi...@yahoo.com

unread,
Jan 31, 2002, 8:09:21 AM1/31/02
to

According to Jochen Loewer <loe...@hotmail.com>:
:I would really like to see general UDP support in the standard
:Tcl core (Tcl 8.4). This would really open the way for implementing
:other nice network packages (limited NMB/SMB, ...).

I've heard this requested a number of times. What are people
doing for general UDP support?

lvi...@yahoo.com

unread,
Jan 31, 2002, 8:17:33 AM1/31/02
to

According to marsd <nomob...@hotmail.com>:
:TNM has wonderful dns functionality ....

What's the state of Tnm? It looks from the main web site that it hasn't
had any bug fixes or enhancements in about 2 years. Is it really, really
stable, or have people just moved on from there?

Juergen Schoenwaelder

unread,
Jan 31, 2002, 12:27:05 PM1/31/02
to
lvi...@yahoo.com wrote:

> What's the state of Tnm? It looks from the main web site that it hasn't
> had any bug fixes or enhancements in about 2 years. Is it really, really
> stable, or have people just moved on from there?

I guess it is both. :-)

Seriously, I am only occasionally using Tnm at the moment myself and
hence there is little forward progress. On the other hand, the stuff
people are using seems to work relatively well, and if someone reports
a real bug which is easy to fix, then a fix goes into the CVS
version relatively quickly. There are of course things I would do
differently these days - but since I am not regularly using Tnm at
the moment, there is little energy to work on improving things.
(And the people using Tnm probably like this, since it gives real
stability. ;-)

The 2-year thing is that I never managed to get 3.0 out of the door,
and hence the official version described on the web page is still
2.1.X, even though many people are using 3.0 snapshots and putting
them in distributions. In retrospect, I should have released 3.0 two
years ago even though I felt it was not ready at the time, since
the current situation is just confusing.

/js

--
Juergen Schoenwaelder University of Osnabrueck
<sch...@inf.uos.de> Dept. of Mathematics and Computer Science
Phone: +49 541 969 2483 Albrechtstr. 28, 49069 Osnabrueck, Germany
Fax: +49 541 969 2770 <http://www.informatik.uni-osnabrueck.de/>

marsd

unread,
Jan 31, 2002, 1:56:56 PM1/31/02
to
lvi...@yahoo.com wrote in message news:<a3bg5d$4m1$3...@srv38.cas.org>...

> According to marsd <nomob...@hotmail.com>:
> :TNM has wonderful dns functionality ....
>
> What's the state of Tnm? It looks from the main web site that it hasn't
> had any bug fixes or enhancements in about 2 years. Is it really, really
> stable, or have people just moved on from there?

That's a good question, of course. I do not believe that it is actively
being developed anymore, but I do know that the work already done has allowed
a) resolution of assorted DNS info easily in Tcl (extended)
b) utilization of UDP sockets (nice)
c) RPC stuff
d) SNMP!
and in conjunction with tclldap it really provides a very nice framework
for *nix-based management/monitoring apps.

Jacob Levy

unread,
Jan 31, 2002, 4:18:29 PM1/31/02
to
Everyone interested in this topic should find out about RSS. Check
http://www.zvon.org/xxl/RSS0.91reference/Output/index.html or
http://www.webreference.com/perl/tutorial/rss1/ or the many other hits
google returns when searching for RSS.
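
For instance, once a site publishes RSS, the scraping side collapses to
a little XML handling. A sketch (placeholder feed URL; tDOM assumed, as
mentioned elsewhere in this thread):

    package require http
    package require tdom

    set tok [::http::geturl http://www.example.com/headlines.rss]
    set doc [dom parse [::http::data $tok]]
    ::http::cleanup $tok

    # RSS is plain XML, so XPath gets straight at the item titles
    foreach title [[$doc documentElement] selectNodes //item/title] {
        puts [$title text]
    }
    $doc delete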

--JYL

Your name <jo...@jooky.jooky> wrote in message news:<Xns91A4D563C...@66.185.95.104>...
> Reading my own post, it's probably too much for most to read. Here is the
> point:
>
>
> you want Tcl to get noticed?
>
> let it be the first to seriously implement universal web parsing, and object
> chunking.
>
> what does that jargon mean?


>
> instead of hand-parsing every web page with hand-made fragile scripts,
> create a Tcl x-platform robot that understands a definition language for
> every web site.
>

> so you want to parse the weather from a CNN weather page.
> you create a definition file that is a bunch of regexps, as well as how they
> fit together to yield an object, an object that has various members that
> contain your data. Mmm, anyone smell XML? I don't want to contrib to the
> XML hype, but in any case, we get some data extracted that is the weather
> for Omaha and Marseilles. It's in 2 weather objects and then you pass it
> along to the next program or just chuck it in a db.
>
> sexy. now you just write defn files for each website instead of hand making
> crappy tacky crafts like a used and abused native.
>
>
> read the previous post if you want some bullshit and an analysis of a paper
> by some stuffed shirts.
>
> love ya *lots*
> jooky

Cameron Laird

unread,
Feb 4, 2002, 4:13:36 PM2/4/02
to
In article <a3bfm1$4m1$2...@srv38.cas.org>, <lvi...@yahoo.com> wrote:
.
.
.
>I've heard this requested a number of times. What are people
>doing for general UDP support?
.
.
.
Avoiding it. Or using one of Tcludp, Scotty, and Tcl-DP,
none of which is actively maintained, 'near as I can tell.

But then, not a lot of change is needed in this area.

pe...@browsex.com

unread,
Feb 6, 2002, 10:19:18 PM2/6/02
to
Cameron Laird <cla...@starbase.neosoft.com> wrote:
> In article <3C55E830...@cs.nott.ac.uk>,
> Neil Madden <nem...@cs.nott.ac.uk> wrote:
> .
> .
> .

>>On the points about http package, and GUI unresponsiveness - does the
>>-command option to http::geturl help you there? Also, you can do
>>http::config -useragent "Mimick IE to your heart's content here", to
>>mimick UA strings.

> While jooky certainly has demonstrated an ability to
> speak on his own behalf, I'll jump in here. I'm
> thinking about Web scraping a lot this month myself,
> and, if I understand correctly, NO, -command does
> not help with the specific problem of http blocking
> while it resolves a symbolic host name.


I'm another supporter of Tcl-only solutions, as they are easy to enhance.
As I mentioned in another related post, you can get around the
http.tcl hang problem with a very small enhancement
that adds an -address option.

However, you still need to manually implement a DNS lookup
method, probably externally exec'ed and caching (I wrote one for BrowseX).
Maybe a DNS lookup/cacher would make a good tcllib package to
complement http.tcl?
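
The parent-side plumbing for such a lookup/cacher is small. Here is a
rough sketch that assumes an external helper program (called "resolver"
below, standing in for something like BrowseX's resolver.exe) which
prints an IP address for the host name given on its command line:

    # cache host -> IP so each name is only looked up once
    array set dnsCache {}

    proc lookup {host callback} {
        global dnsCache
        if {[info exists dnsCache($host)]} {
            uplevel #0 $callback [list $dnsCache($host)]
            return
        }
        # run the external helper in a pipe; the fileevent keeps the
        # event loop (and any GUI) responsive while it works
        set pipe [open |[list resolver $host] r]
        fileevent $pipe readable [list lookupDone $pipe $host $callback]
    }

    proc lookupDone {pipe host callback} {
        global dnsCache
        set ip [string trim [read $pipe]]
        catch {close $pipe}
        set dnsCache($host) $ip
        uplevel #0 $callback [list $ip]
    }

    proc gotIP {ip} { puts "resolved to $ip" }
    lookup www.example.com gotIP
    vwait forever    ;# keep the event loop running in a plain tclsh

The cache plus fileevent is what keeps the GUI responsive; the helper
itself is free to block as long as it likes.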


--
Peter MacDonald BrowseX Systems Inc
Email: pe...@BrowseX.com http://BrowseX.com
Phone: (250) 595-5998
