Would anyone show me how to use htmllib?

jac...@my-deja.com

Oct 28, 2000, 4:22:43 AM

Hi
I have read the Python library reference. I am a Python newbie, and I think I
have to overload some functions to get it working. Could anyone give me
an example to show me how it works?
Thanks
Jack Xie


Alex Martelli

Oct 28, 2000, 6:37:13 AM

<jac...@my-deja.com> wrote in message news:8te2ch$8ou$1...@nnrp1.deja.com...

> Hi
> I have read the Python library reference. I am a Python newbie, and I think I
> have to overload some functions to get it working. Could anyone give me
> an example to show me how it works?

Override, rather than overload. Normally, yes. Unless
you just want the list of links from an HTML page, in
which case this simple script will do it:

import htmllib
import formatter

parser=htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(open('myfile.html').read())
parser.close()

print parser.anchorlist

Now, if, instead of just instantiating HTMLParser, you
instantiate a class of your own that derives from it
and overrides the methods you're interested in, then
you can do different things. But it's hard to give a
meaningful example without knowing what it is you
want to do. For some tasks, building your own
formatter-class and using the plain parser-class from
htmllib may be a simpler way, too.
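
To make the "derive and override" idea concrete, here is a minimal sketch. Note the assumption: htmllib was removed in Python 3, so this uses html.parser.HTMLParser from the Python 3 standard library instead of the htmllib class discussed above. The names ImageLister and image_sources are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class ImageLister(HTMLParser):
    """Hypothetical subclass: collects the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(html_text):
    """Parse html_text and return the image sources in document order."""
    parser = ImageLister()
    parser.feed(html_text)
    parser.close()
    return parser.sources
```

Overriding a different handler, or looking at a different tag, is what turns the same base class to a different task.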

Alex

Erno Kuusela

Oct 28, 2000, 2:23:53 PM

>>>>> "jackxh" == jackxh <jac...@my-deja.com> writes:

| I have read the Python library reference. I am a Python newbie, and I think I
| have to overload some functions to get it working. Could anyone give me
| an example to show me how it works?

frederik lundh has posted the following example:
<URL:http://www.deja.com/=dnc/getdoc.xp?AN=474703587>

-- erno

Paul Wright

Oct 28, 2000, 10:24:24 AM

In article <8te2ch$8ou$1...@nnrp1.deja.com>, <jac...@my-deja.com> wrote:
>Hi
>I have read the Python library reference. I am a Python newbie, and I think I
>have to overload some functions to get it working. Could anyone give me
>an example to show me how it works?

I've played with this a bit. Here's an example which prints out the
attributes associated with all the anchor tags in the file you give it
on the command line. If you just used the href attribute, you could
modify it to print a list of all the links in the document.

The way to use it is to define start_foo and end_foo for an HTML tag
which works like <FOO>Inside a foo</FOO>, and do_bar for an HTML tag
which doesn't have opening and closing tags (like IMG, for example). The
start_foo method is called when the start of the tag is seen, and
end_foo when the closing tag is seen. The attributes argument is a list
of 2-tuples (or pairs, if you like) giving the name and value of each
attribute of the tag, as you can see from the print_attributes
function below.

import htmllib
import formatter
import sys

def print_attributes (attributes):
    for pair in attributes:
        print pair [0], "=", pair [1]

class AnchorFinder (htmllib.HTMLParser):

    def __init__ (self):
        htmllib.HTMLParser.__init__ (self, formatter.NullFormatter ())
        # You could do other stuff here to set up your subclass

    def start_a (self, attributes):
        print_attributes (attributes)

if __name__ == "__main__":
    parser = AnchorFinder ()
    parser.feed (open (sys.argv [1]).read ())
    parser.close ()
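
For anyone reading this on Python 3, where htmllib is gone: the start_foo / end_foo / do_bar naming convention described above can be roughly emulated on top of html.parser.HTMLParser. This is only a sketch under that assumption, not how htmllib itself is implemented; DispatchingParser, AnchorAttributes and anchor_attributes are hypothetical names.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical helper: routes tags to start_<tag>, do_<tag>, end_<tag>."""

    def handle_starttag(self, tag, attrs):
        # Prefer start_<tag>; fall back to do_<tag> for tags with no close.
        method = getattr(self, "start_" + tag, None)
        if method is None:
            method = getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()


class AnchorAttributes(DispatchingParser):
    """Collects the (name, value) attribute pairs of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def start_a(self, attributes):
        self.seen.extend(attributes)


def anchor_attributes(html_text):
    parser = AnchorAttributes()
    parser.feed(html_text)
    parser.close()
    return parser.seen
```

Tags without a matching method are simply ignored, which is why a subclass only needs to define handlers for the tags it cares about.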

--
----- Paul Wright ------| "Their little anoraks bobbed and danced, their
-paul....@pobox.com--| cycling helmets swung with gay abandon - the NatSci
http://pobox.com/~pw201 | Elves were abroad!" -Simon Pick

jac...@my-deja.com

Oct 30, 2000, 11:51:46 PM

Thanks for all of the help. Your examples are very helpful. One question:
how did you guys figure those things out?

The Python library reference is not clear about a lot of the variables and
function calls.

Thanks again.

jac...@my-deja.com

Oct 31, 2000, 12:59:13 AM

Thank you for the example.
I went back and took another look at htmllib. Some parts make more sense
now. Here is what I want to do:

I noticed that there are lots of patterns in HTML pages, and I want to extract
information from HTML pages based on those patterns. I have done this using
Perl's regular expressions before. Now I am wondering if I can speed up
the development process and have a standard approach to this problem using
Python's htmllib.

For reference, the htmllib library documentation says:
######################################################################
#This module defines a class which can serve as a base for parsing text
#files formatted in the HyperText Mark-up Language (HTML).
######################################################################

All of the examples I have seen extract URL links from an HTML
page. I was wondering if I can do more with this module.

Jack X.

In article <8teab...@news1.newsguy.com>,

Alex Martelli

Oct 31, 2000, 5:24:31 AM

<jac...@my-deja.com> wrote in message news:8tln3f$sf$1...@nnrp1.deja.com...

> Thank you for the example.
> I went back and took another look at htmllib. Some parts make more sense
> now. Here is what I want to do:
>
> I noticed that there are lots of patterns in HTML pages, and I want to extract
> information from HTML pages based on those patterns. I have done this using
> Perl's regular expressions before. Now I am wondering if I can speed up
> the development process and have a standard approach to this problem using
> Python's htmllib.

Absolutely yes. Particularly because HTML syntax is NOT parsable
by regular-expressions (either Perl's or Python's -- they're quite
close); you can get, say, 80% of the way there with an amount X
of effort, then each halving of the remaining percentage of "cases
not well treated" doubles the overall effort. It's a no-win
strategy.
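
One small illustration of the kind of case a regular expression treats badly: a link that has been commented out. This is only a sketch, and it assumes Python 3's html.parser.HTMLParser rather than htmllib (which no longer exists in Python 3); SAMPLE, LinkCollector and the regex are made up for the example.

```python
import re
from html.parser import HTMLParser

# Made-up input: the first link is inside an HTML comment.
SAMPLE = '<!-- <a href="old.html">old</a> --><a href="new.html">new</a>'


class LinkCollector(HTMLParser):
    """Hypothetical subclass: collects href values of real <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# The naive pattern has no idea it is looking inside a comment.
regex_links = re.findall(r'href="([^"]*)"', SAMPLE)

parser = LinkCollector()
parser.feed(SAMPLE)
parser.close()
parser_links = parser.links  # the parser skips the commented-out link
```

The regex can of course be patched to skip comments, but that is exactly the "each fix doubles the effort" pattern described above.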

> For reference, the htmllib library documentation says:
> ######################################################################
> #This module defines a class which can serve as a base for parsing text
> #files formatted in the HyperText Mark-up Language (HTML).
> ######################################################################
>
> All of the examples I have seen extract URL links from an HTML
> page. I was wondering if I can do more with this module.

You have to inherit from HTMLParser, and override some methods, if
you want to do more than extracting links (or simple output formatting),
because that is what HTMLParser itself does. Sometimes it's handier
to use sgmllib rather than htmllib, actually -- sgmllib is "more
primitive" (htmllib's parser inherits from sgmllib's), but that IS
handy at times.
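
As a sketch of "doing more than extracting links", the following pulls out the text content of headings by tracking whether the parser is currently inside an <h1>. It assumes Python 3's html.parser.HTMLParser in place of htmllib, and HeadingText and heading_texts are hypothetical names. The point is that the tag handlers only set state, while handle_data receives the actual content.

```python
from html.parser import HTMLParser


class HeadingText(HTMLParser):
    """Hypothetical subclass: gathers the text inside each <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True
            self.headings.append("")  # start a new, empty heading

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the open heading.
        if self.in_heading:
            self.headings[-1] += data


def heading_texts(html_text):
    parser = HeadingText()
    parser.feed(html_text)
    parser.close()
    return parser.headings
```

Nested markup inside the heading (a <b>, say) does not need special handling here, because its text still passes through handle_data while in_heading is set.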

For an example of htmllib use, see, e.g., my post:
http://www.deja.com/getdoc.xp?AN=661888820
"converting an html table to a tree", and its thread.

Alex

Oct 31, 2000, 9:04:48 AM

> One question: how did you guys figure those things out? The Python
> library reference is not clear about a lot of the variables and
> function calls.

I agree. I figured out how to use it by reading the source of htmllib.
It could really use some better documentation.

Alex.

--
Speak softly but carry a big carrot.

jac...@my-deja.com

Nov 2, 2000, 3:53:08 AM

I went through your link. It seems to me that you can only
process the HTML tags by defining start_"TAG NAME" functions. This feature
is limited. A lot of the time, the meaningful stuff is in the content of
the HTML.
I don't know whether my thinking is right or not.
Jack Xie

In article <8tm6u...@news1.newsguy.com>,