HTMLParser, htmllib and other questions

Lee Harr

May 13, 2003, 11:37:24 AM

My goal is to take some HTML and change instances of
http://mydomain.com/ to file:///home/me/foo/

Yes, I know about wget, but I can't seem to get it to
work very well with Zope (too much duplication) and so
I want to take the zopemir.py script (to fetch the
content) and extend it to fix up the HTML.

Now: In general, does this seem like something
suited to HTMLParser.HTMLParser? Or maybe to
htmllib.HTMLParser? Or would I be better off just
using ''.replace() ?

I experimented a bit with the two parsers, and I can get
either one to modify the required a and img tags, but
once I do, I am not quite sure how to reconstruct the
full lines. For example, if I have:

Here is some text with a <a href="location">link</a>.

I can get the parser to return a tag with the corrected
location, but how do I get it to return the whole corrected
line?
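
One way to get the whole corrected line back is to have the parser re-emit everything it sees, not just the tags it changes. The sketch below is only an illustration, not the Cookbook recipe: it assumes Python 3, where the HTMLParser module is available as html.parser, and the names LinkRewriter, rewrite, OLD_PREFIX and NEW_PREFIX are made up for this example. Tag names come back lowercased and attribute values are re-quoted with double quotes, so the output is not guaranteed to be byte-identical to the input.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical names for this sketch; the two prefixes come from the question.
OLD_PREFIX = "http://mydomain.com/"
NEW_PREFIX = "file:///home/me/foo/"


class LinkRewriter(HTMLParser):
    """Re-emit the input, changing only href/src values that start with OLD_PREFIX."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; as separate
        # events, so they can be written back as they appeared.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _rebuild(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                # Attribute without a value, e.g. <input disabled>
                parts.append(name)
                continue
            if name in ("href", "src") and value.startswith(OLD_PREFIX):
                value = NEW_PREFIX + value[len(OLD_PREFIX):]
            # The parser hands over unescaped values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._rebuild(tag, attrs, ">"))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._rebuild(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite(html_text):
    """Return html_text with matching link prefixes replaced."""
    parser = LinkRewriter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling rewrite() on the example line with href="http://mydomain.com/page.html" gives back the same line with only the href value changed; links to other hosts pass through untouched.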

This recipe from the Cookbook seems to point the way,
but I wonder if maybe this is more than I need:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/135005

I am pretty sure I could just go forward with .replace() :o)
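
For comparison, the plain string approach is a one-liner. This is a minimal sketch with a made-up function name, rewrite_with_replace. It changes every occurrence of the prefix, including any that appear in visible text or comments rather than in href or src attributes, which may or may not matter for a local mirror.

```python
# Hypothetical names for this sketch; the two prefixes come from the question.
OLD_PREFIX = "http://mydomain.com/"
NEW_PREFIX = "file:///home/me/foo/"


def rewrite_with_replace(html_text):
    """Replace every occurrence of OLD_PREFIX, wherever it appears in the text."""
    return html_text.replace(OLD_PREFIX, NEW_PREFIX)
```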

Any hints appreciated.

Anand Pillai

May 15, 2003, 8:31:09 AM

Hi Lee,

I have written a web-spider program in Python, and it
required a class that does a job like the one you want.
Basically, the requirement is to map Internet URLs to files
in a directory, given a base directory.
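
The general idea of such a mapping can be sketched as follows. This is not the WebHttpUrlPath.py code, only a guess at the shape of the problem using the Python 3 standard library; the function name url_to_local_path and the index.html default for directory URLs are assumptions made for this example. The host name is ignored, so everything lands directly under the base directory.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url_to_local_path(url, base_dir, index_name="index.html"):
    """Map a URL to a file path under base_dir (hypothetical helper)."""
    path = unquote(urlsplit(url).path)
    # Assumption: a URL ending in "/" (or with no path) is a directory listing,
    # stored locally under index_name.
    if path == "" or path.endswith("/"):
        path += index_name
    # Normalise away "." and ".." segments so the result stays inside base_dir.
    relative = posixpath.normpath("/" + path).lstrip("/")
    return posixpath.join(base_dir, relative)
```

For instance, url_to_local_path("http://mydomain.com/docs/", "/home/me/foo") points at index.html inside /home/me/foo/docs.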

A friend and I have come up with an implementation
of it. It is free, but we have not yet documented it or
announced it to this group. You can see the code on my webpage,
http://members.fortunecity.com/anandpillai. The code is
available as a link somewhere on the page. The module name is
WebHttpUrlPath.py.

Tell me if you find it useful :-)

Best Regards

Anand B Pillai

Lee Harr <mis...@frontiernet.net> wrote in message news:<slrnbc1md3...@localhost.my.domain>...
