Override, rather than overload. Normally, yes. Unless
you just want the list of links from an HTML page, in
which case this simple script will do it:
import htmllib
import formatter
parser=htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(open('myfile.html').read())
parser.close()
print parser.anchorlist
Now, if, instead of just instantiating HTMLParser, you
instantiate a class of your own that derives from it
and overrides the methods you're interested in, then
you can do different things. But it's hard to give a
meaningful example without knowing what it is you
want to do. For some tasks, building your own
formatter-class and using the plain parser-class from
htmllib may be a simpler way, too.
Alex
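As one possible illustration of "a class of your own that overrides the methods you're interested in": htmllib no longer exists in Python 3, so this sketch assumes the standard html.parser module instead, where the methods to override are handle_starttag, handle_endtag and handle_data. The class name HeadingCollector and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h1> and <h2> element (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # text chunks of the heading being read, else None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._current = []

    def handle_data(self, data):
        # Only keep text while we are inside a heading.
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2") and self._current is not None:
            self.headings.append("".join(self._current).strip())
            self._current = None


parser = HeadingCollector()
parser.feed("<h1>Intro</h1><p>text</p><h2>Details <em>here</em></h2>")
parser.close()
print(parser.headings)
```

The same shape (remember state in the start handler, gather text in the data handler, emit in the end handler) covers most "pull this element out of the page" tasks.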
| I have read the Python library reference. I am a Python newbie, and I think I
| have to overload some functions to get it working. Could anyone give me
| an example to show me how it works?
frederik lundh has posted the following example:
<URL:http://www.deja.com/=dnc/getdoc.xp?AN=474703587>
-- erno
I've played with this a bit. Here's an example which prints out the
attributes associated with all the anchor tags in the file you give it
on the command line. If you just used the href attribute, you could
modify it to print a list of all the links in the document.
The way to use it is to define start_foo and end_foo for an HTML tag
that works like <FOO>Inside a foo</FOO>, and do_bar for an HTML tag
that has no closing tag (like IMG, for example). The
start_foo method is called when the start of the tag is seen, and
end_foo when the closing tag is seen. The attributes argument is a list
of 2-tuples (or pairs, if you like) giving the name and value of each
attribute of the tag, as you can see from the print_attributes
function below.
import htmllib
import formatter
import sys

def print_attributes (attributes):
    # attributes is a list of (name, value) pairs
    for pair in attributes:
        print pair [0], "=", pair [1]

class AnchorFinder (htmllib.HTMLParser):
    def __init__ (self):
        htmllib.HTMLParser.__init__ (self, formatter.NullFormatter ())
        # You could do other stuff here to set up your subclass

    def start_a (self, attributes):
        # Called for every opening <A ...> tag
        print_attributes (attributes)

if __name__ == "__main__":
    parser = AnchorFinder ()
    parser.feed (open (sys.argv [1]).read ())
    parser.close ()
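
The "just use the href attribute" variation mentioned above might look like the following. This is only a sketch, and it assumes Python 3's html.parser rather than htmllib (which Python 3 does not ship); there the single handle_starttag method plays the role of start_a, and the attributes arrive as the same kind of list of pairs. LinkLister, list_links and the sample string are hypothetical names for the example.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Record the href value of every <a> tag (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs, as in print_attributes.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def list_links(html_text):
    """Return the href values found in html_text, in document order."""
    parser = LinkLister()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = ('<p><a href="a.html" class="x">A</a> '
          '<a name="top">anchor</a> <a href="b.html">B</a></p>')
print(list_links(sample))
```

Anchors without an href (such as the name="top" one in the sample) are skipped.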
--
----- Paul Wright ------| "Their little anoraks bobbed and danced, their
-paul....@pobox.com--| cycling helmets swung with gay abandon - the NatSci
http://pobox.com/~pw201 | Elves were abroad!" -Simon Pick
The Python library reference is not clear about a lot of the variables and
function calls.
Thanks again.
I noticed that there are lots of patterns in HTML pages, and I want to extract
information from HTML pages (based on those patterns). I have done this using
Perl's regular expressions before. Now I am wondering if I can speed up
the development process and have a standard approach to this problem using
Python's htmllib.
For reference, the htmllib library documentation says:
######################################################################
#This module defines a class which can serve as a base for parsing text
#files formatted in the HyperText Mark-up Language (HTML).
######################################################################
All of the examples I have seen extract URL links from an HTML
page. I was wondering if I can do more with this module.
Jack X.
In article <8teab...@news1.newsguy.com>,
Absolutely yes. Particularly because HTML syntax is NOT parsable
by regular-expressions (either Perl's or Python's -- they're quite
close); you can get, say, 80% of the way there with an amount X
of effort, then each halving of the remaining percentage of "cases
not well treated" doubles the overall effort. It's a no-win
strategy.
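
A small sketch of the kind of case that trips up a regular expression: a link inside a comment, and a ">" inside a quoted attribute value. The pattern and the sample text are made up for illustration, and the parser side assumes Python 3's html.parser rather than htmllib.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one commented-out link, one real link whose
# title attribute contains a ">" character.
sample = ('<!-- <a href="old.html">old</a> -->\n'
          '<a title="a > b" href="new.html">new</a>')

# A typical first-attempt pattern (an assumption, not anyone's real code).
regex_links = re.findall(r'<a\s[^>]*href="([^"]*)"', sample)


class Links(HTMLParser):
    """Collect href values of <a> tags using a real parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


parser = Links()
parser.feed(sample)
parser.close()

print("regex: ", regex_links)
print("parser:", parser.links)
```

The pattern reports the commented-out link and misses the real one; patching it for these two cases is exactly the "halving the remaining percentage" effort described above.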
> For reference, the htmllib library documentation says:
> ######################################################################
> #This module defines a class which can serve as a base for parsing text
> #files formatted in the HyperText Mark-up Language (HTML).
> ######################################################################
>
> All of the examples I have seen extract URL links from an HTML
> page. I was wondering if I can do more with this module.
You have to inherit from HTMLParser, and override some methods, if
you want to do more than extracting links (or simple output formatting),
because that is what HTMLParser itself does. Sometimes it's handier
to use sgmllib rather than htmllib, actually -- sgmllib is "more
primitive" (htmllib's parser inherits from sgmllib's), but that IS
handy at times.
For an example of htmllib use, see, e.g., my post:
http://www.deja.com/getdoc.xp?AN=661888820
"converting an html table to a tree", and its thread.
Alex
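To give a flavour of "more than extracting links", here is a sketch that pulls the cell text out of an HTML table into a list of rows. It is not the code from the post referenced above; it assumes Python 3's html.parser (htmllib and sgmllib are Python 2 modules), and the TableRows class and sample table are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect table cell text as a list of rows (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, else None
        self._cell = None  # text chunks of the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


sample = ("<table><tr><th>Name</th><th>Price</th></tr>"
          "<tr><td>Spam</td><td>1.50</td></tr></table>")
parser = TableRows()
parser.feed(sample)
parser.close()
print(parser.rows)
```

This simple version does not handle nested tables or cells whose closing tags are omitted; those would need extra state.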
I agree. I figured out how to use it by reading the source of htmllib.
It could really use some better documentation.
Alex.
--
Speak softly but carry a big carrot.
In article <8tm6u...@news1.newsguy.com>,