HTML DOM parser?


Paul Rubin

Jul 18, 2002, 3:36:51 PM
Anyone know of a Python-callable HTML DOM parser? I mean a serious
one that tries to understand the crappy malformed HTML out there in the
real-world Web, the way a browser does. If it can interpret
Javascript that's even better. This is for a consulting client, so a
commercial library would be acceptable (though not preferred).

Thanks.

Gerhard Häring

Jul 18, 2002, 3:57:05 PM
* Paul Rubin <phr-n...@NOSPAMnightsong.com> [2002-07-18 12:36 -0700]:

> Anyone know of a Python-callable HTML DOM parser? I mean a serious
> one that tries to understand the crappy malformed HTML out there in the
> real-world Web, the way a browser does.

I see two options:
- use mxTidy (http://www.lemburg.com/files/python/mxTidy.html), then
  operate with a normal HTML parser on the output
- extract the parsing code from a real browser, like Mozilla or
  Konqueror. If it is win32 only, it might be possible to get to the DOM
  by interfacing with Internet Exploder via COM, too
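
The first option is a two-step pipeline: clean up the markup, then hand the result to an ordinary parser. What follows is only a sketch of the second step, assuming the input has already been tidied (the tidy call itself is not shown; a well-formed string stands in for its output). It uses `html.parser` from the current Python standard library, and `LinkCollector` and `collect_links` are illustrative names, not part of mxTidy.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_links(markup):
    """Parse already-cleaned markup and return the link targets in order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Assumption: this string plays the role of the tidied output.
tidied = (
    "<html><body>"
    '<p><a href="http://www.python.org/">Python</a></p>'
    '<p><a href="/docs">Docs</a></p>'
    "</body></html>"
)
links = collect_links(tidied)
```

An event-style parser like this does not build a DOM tree by itself; it only reports tags as they are encountered, so a tree would have to be assembled on top of it.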

> If it can interpret Javascript that's even better.

You'll need a browser engine for that. Or use one of the other
Javascript engines and feed them your DOM.

Gerhard
--
mail: gerhard <at> bigfoot <dot> de registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/ OpenPGP public key id AD24C930
public key fingerprint: 3FCC 8700 3012 0A9E B0C9 3667 814B 9CAA AD24 C930
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))


Peter Hansen

Jul 18, 2002, 6:13:16 PM

How about automating IE using Python?

from win32com.client import DispatchEx

ie = DispatchEx('internetexplorer.application')
ie.visible = 1
ie.navigate('http://www.nightsong.com')
dom = ie.document

etc...

Access to the DOM tree of the document might be too slow for your
needs, but if it's not, you definitely get a lot of bang for the buck...

-Peter

Paul Rubin

Jul 18, 2002, 6:39:45 PM
Peter Hansen <pe...@engcorp.com> writes:
> How about automating IE using Python?
>
> from win32com.client import DispatchEx
>
> ie = DispatchEx('internetexplorer.application')
> ie.visible = 1
> ie.navigate('http://www.nightsong.com')
> dom = ie.document
>
> etc...
>
> Access to the DOM tree of the document might be too slow for your
> needs, but if it's not, you definitely get a lot of bang for the buck...

That's a really interesting idea and I might try it. I had been
thinking in terms of a Linux solution, but automating IE might be
ok for this particular application. Thanks.

David LeBlanc

Jul 18, 2002, 8:08:37 PM


I put the above code into "ienavigate.py" and tried it and got:

Traceback (most recent call last):
  File "ienavigate.py", line 6, in ?
    dom = ie.document
  File "J:\Python22\lib\site-packages\win32com\client\dynamic.py", line 448, in __getattr__
    raise pythoncom.com_error, details
pywintypes.com_error: (-2147352567, 'Exception occurred.', (0, None, None, None, 0, -2147467259), None)

Also got a browser window with a "403" error telling me I don't have
permission to access index.html on www.nightsong.com.

I would be interested in getting this working, so any help is appreciated.

TIA,

Dave LeBlanc
Seattle, WA USA

Paul Rubin

Jul 18, 2002, 8:51:50 PM
"David LeBlanc" <whi...@oz.net> writes:
> I put the above code into "ienavigate.py" and tried it and got:
>
> Traceback (most recent call last):
>   File "ienavigate.py", line 6, in ?
>     dom = ie.document
>   File "J:\Python22\lib\site-packages\win32com\client\dynamic.py", line 448, in __getattr__
>     raise pythoncom.com_error, details
> pywintypes.com_error: (-2147352567, 'Exception occurred.', (0, None, None, None, 0, -2147467259), None)
>
> Also got a browser window with a "403" error telling me I don't have
> permission to access index.html on www.nightsong.com.
>
> I would be interested in getting this working, so any help is appreciated.

The exception might be legitimate, because of the 403 error. Try
www.yahoo.com instead of www.nightsong.com. www.nightsong.com really
does return a 403.

David LeBlanc

Jul 18, 2002, 9:11:17 PM
> The exception might be legitimate, because of the 403 error. Try
> www.yahoo.com instead of www.nightsong.com. www.nightsong.com really
> does return a 403.


Nope - it's the call to ie.document that chokes. ie.navigate does open
www.w3.org just fine, but trying to get to the DOM isn't working.

It looks like there's no attribute "document" for this interface...

Dave LeBlanc

David LeBlanc

Jul 18, 2002, 9:31:03 PM

adding time.sleep(2) made it work - gave the browser time to _have_ a
document :-)

David LeBlanc
Seattle, WA USA


Peter Hansen

Jul 18, 2002, 11:52:06 PM
David LeBlanc wrote:
>
> adding time.sleep(2) made it work - gave the browser time to _have_ a
> document :-)

Ah, sorry to post code that isn't very robust. I was just trying
to give you an example so you could judge whether it might be suitable.
I actually expected your answer to be that it was very far from being
useful to you...

In this case, you should probably look at "ie.busy", which you should
monitor after things like ie.navigate() to make sure the document
has been fully loaded.
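
Polling the busy flag instead of sleeping for a fixed time could look roughly like the sketch below. It assumes only that the browser object exposes a `Busy` property, as the IE automation object does; `wait_until_idle` and `FakeBrowser` are hypothetical names, and the fake stands in for the COM object so the loop can be tried without IE.

```python
import time


def wait_until_idle(browser, timeout=30.0, poll_interval=0.25):
    """Poll browser.Busy until it is false.

    Returns True if the browser went idle before the timeout, else False.
    """
    deadline = time.monotonic() + timeout
    while browser.Busy:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


class FakeBrowser:
    """Hypothetical stand-in for the IE object: busy for the first few polls."""

    def __init__(self, busy_polls):
        self._remaining = busy_polls

    @property
    def Busy(self):
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False


became_idle = wait_until_idle(FakeBrowser(3), timeout=5.0, poll_interval=0.01)
```

With the real object the call would come right after `ie.navigate(...)`, and the document would be touched only if the function returned True. The timeout is there so that a page that never finishes loading does not hang the script.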

My choice of www.nightsong.com was of course because that's the
domain the OP posted from. I didn't stop to think someone might
have a "www" subdomain which actually refused a connect request...
strange if you ask me.

-Peter

Paul Rubin

Jul 19, 2002, 12:18:56 AM
Peter Hansen <pe...@engcorp.com> writes:
> My choice of www.nightsong.com was of course because that's the
> domain the OP posted from. I didn't stop to think someone might
> have a "www" subdomain which actually refused a connect request...
> strange if you ask me.

www.nightsong.com doesn't refuse connect requests. It accepts
connections and sends back a valid HTTP 403 response indicating
there's no page available at "/" (e.g. no index.html at the document
root). If you use an interior URL like
<http://www.nightsong.com/phr/python/calc.py> it should work fine.
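
The difference between a refused connection and a valid HTTP error response shows up clearly in client code. This is a sketch under stated assumptions: a throwaway local server (`DemoHandler`, hypothetical) imitates a site that answers 403 at "/" and 200 at an interior path, and `fetch_status` is an illustrative helper. Nothing here contacts www.nightsong.com.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: 403 at "/", 200 for any other path."""

    def do_GET(self):
        if self.path == "/":
            self.send_error(403)
        else:
            body = b"hello\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


# Ignore any proxy settings so the request really goes to the local server.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_status(url):
    """Return the HTTP status code, or None if no HTTP response arrived."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, just not with success
    except urllib.error.URLError:
        return None  # e.g. connection refused: no response at all


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

root_status = fetch_status(base + "/")
interior_status = fetch_status(base + "/phr/python/calc.py")

server.shutdown()
server.server_close()
```

`HTTPError` has to be caught before `URLError` because it is a subclass of it; a refused connection would surface only as the plain `URLError`.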

Daniel E. Burrow

Jul 19, 2002, 1:01:45 AM
Paul Rubin <phr-n...@NOSPAMnightsong.com> wrote in message news:<7xvg7ck...@ruckus.brouhaha.com>...

Greetings!

My present task requires the automation of IE. After much prayer, I
found that the "dom = ie.Document" assignment IS case sensitive. The
lower case form of "ie.document" just would not work for my
ActiveState ActivePython 2.2.1 distribution.


Joyfully About Alerio's Business,

Daniel

Peter Hansen

Jul 19, 2002, 9:45:12 AM
"Daniel E. Burrow" wrote:
>
> My present task requires the automation of IE. After much prayer, I
> found that the "dom = ie.Document" assignment IS case sensitive. The
> lower case form of "ie.document" just would not work for my
> ActiveState ActivePython 2.2.1 distribution.

That's odd. I have never had to worry about case in almost any
aspect of IE automation, especially this one.

Does anyone with greater knowledge of the guts of this stuff
have any input on why Daniel would have to worry about case
while I do not?

Peter

Peter Hansen

Jul 19, 2002, 9:47:24 AM

My apologies for using the wrong phrasing. What I meant to say
was I didn't stop to think someone might have a "www" subdomain
which did not provide a valid HTTP 200 response at the top level,
which seems to me to be extremely unusual. (Note, I'm quite sure
many other domains do this too... it just seems unusual.)

-Peter

David LeBlanc

Jul 19, 2002, 12:16:55 PM
> -----Original Message-----
> From: python-l...@python.org
> [mailto:python-l...@python.org]On Behalf Of Peter Hansen
> Sent: Friday, July 19, 2002 6:45
> To: pytho...@python.org
> Subject: Re: HTML DOM parser?
>
>

David LeBlanc wrote:
> While trying to figure out how to make the recently posted (by Paul Rubin)
> ie navigation example work, I had occasion to run makepy.py on "Microsoft
> Internet Controls". After doing this, the sample would fail on
> "ie.visible = 1". Removing the generated file would return the sample
> to working order.

Mark Hammond replied:
"Was the problem "AttributeError: visible"? If so, the problem is simply
that the correct name for the property is "Visible". makepy is case
sensitive."

If he's run makepy, then the ?mapping? gets used, which is case sensitive.

I had to ask too :-)

Daniel E. Burrow

Jul 19, 2002, 2:15:48 PM
"David LeBlanc" <whi...@oz.net> wrote in message news:<mailman.1027042346...@python.org>...

Greetings!

I still get the exception if I don't use the ".Document" reference. My
code is supposed to wait for IE to return a "not busy" result before
going on.

I recall Mr. Hammond giving an example where he was waiting for IE's
"ReadyState" to return what would be true when ready state was
obtained. I used to use a method to determine ready state and then go
on to determine if IE was "Busy". I found that using the "Navigate2"
method seemed to obviate the need for checking IE's ready state.
Perhaps I should leave the ready state method in?
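
One way to keep both checks is a single predicate that requires the ready state to be complete and the busy flag to be clear. The sketch assumes the object exposes `Busy` and `ReadyState` and uses 4, the value of READYSTATE_COMPLETE in the IE object model. `page_loaded` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the IE object.

```python
from types import SimpleNamespace

READYSTATE_COMPLETE = 4  # IE's ReadyState value once loading has finished


def page_loaded(browser):
    """True only when IE reports both 'not busy' and a complete ready state."""
    return (not browser.Busy) and browser.ReadyState == READYSTATE_COMPLETE


# Stand-ins for the COM object in three situations.
still_downloading = SimpleNamespace(Busy=True, ReadyState=3)
idle_but_incomplete = SimpleNamespace(Busy=False, ReadyState=3)
fully_loaded = SimpleNamespace(Busy=False, ReadyState=READYSTATE_COMPLETE)

results = [page_loaded(b) for b in (still_downloading, idle_but_incomplete, fully_loaded)]
```

A wait loop would then poll `page_loaded(self.o_IE)` rather than `Busy` alone. Whether `Navigate2` really makes the ready-state check redundant is not something this sketch settles.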

Here are a few snippets:

################## start of snips #####################

In main I call:

o_AleApp.NavigateIE(s_URL)
##fire events after nav test
o_AleApp.ClickCrossFrameElement(1, 'Sub1t')

In NavigateIE I run:

def NavigateIE(self, s_URL):
    i_Sleep = self.GetDelaySecondsBetweenPageNavs()
    print "Class: AleAppWebReporter Method: NavigateIE: Sleeping", i_Sleep, "seconds prior to navigating..."
    time.sleep(i_Sleep)
    print "Class: AleAppWebReporter Method: NavigateIE: Navigating to:\n" + s_URL
    self.o_IE.Navigate2(s_URL)
    ##now lets make sure the page has fully loaded
    s_NotBusy = self.WaitForNotBusy()
    if s_NotBusy == 'TRUE':
        return 'TRUE'
    else:
        return s_NotBusy

In WaitForNotBusy I run:

def WaitForNotBusy(self):
    print "Class: AleAppWebReporter Method: WaitForNotBusy: Current URL:", self.GetIELocationURL()
    i_IEBusy = int(self.GetIEBusy())
    print "Class: AleAppWebReporter Method: WaitForNotBusy: i_IEBusy:", i_IEBusy
    if i_IEBusy != 1:
        return 'TRUE'
    else:
        i = 1
        i_State = 0
        while i_State == 0:
            print "Class: AleAppWebReporter Method: WaitForNotBusy: Waited:", i, "seconds for IE to complete downloading..."
            time.sleep(1)
            i_IEBusy = int(self.GetIEBusy())
            print "Class: AleAppWebReporter Method: WaitForNotBusy: i_IEBusy:", i_IEBusy
            if i_IEBusy != 1:
                i_State = 1
            i += 1
        return 'TRUE'

In GetIEBusy I run:

def GetIEBusy(self):
    return self.o_IE.Busy

################## end of snips #####################

Syver Enstad

Jul 19, 2002, 9:36:04 PM
Peter Hansen <pe...@engcorp.com> writes:

My guess: He has been running makepy on the MSHTML library, you
haven't. Many COM components don't care about casing (to please VB
users), the makepy generated wrappers obviously care (as Python is
case-sensitive with respect to identifiers).
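
A toy model of that guess, in plain Python and not using win32com at all: `DynamicProxy` imitates a late-bound object that matches names ignoring case, while `GeneratedWrapper` imitates a generated class whose properties are ordinary, case-sensitive Python attributes. Both class names are made up for illustration.

```python
class DynamicProxy:
    """Toy late-bound object: attribute names are matched ignoring case."""

    def __init__(self, properties):
        self._properties = {name.lower(): value for name, value in properties.items()}

    def __getattr__(self, name):
        # Only called when normal lookup fails; fold the case before looking up.
        try:
            return self._properties[name.lower()]
        except KeyError:
            raise AttributeError(name) from None


class GeneratedWrapper:
    """Toy generated wrapper: plain attributes, so case matters."""

    def __init__(self):
        self.Visible = 1
        self.Document = "<document>"


dynamic = DynamicProxy({"Visible": 1, "Document": "<document>"})
wrapper = GeneratedWrapper()

dynamic_ok = (dynamic.visible, dynamic.Visible)
wrapper_has_lowercase = hasattr(wrapper, "visible")
```

In this model `dynamic.visible` and `dynamic.Visible` both work, while `wrapper.visible` raises AttributeError. That matches the symptom reported in the thread once the generated file exists.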

--

Vennlig hilsen

Syver Enstad
