Thanks.
I see two options:
- use mxTidy (http://www.lemburg.com/files/python/mxTidy.html), then
operate with a normal HTML parser on the output
- extract the parsing code from a real browser, like Mozilla or
Konqueror. If it is win32 only, it might be possible to get to the DOM
by interfacing with Internet Exploder via COM, too
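The second half of the first option (running an ordinary HTML parser over already-tidied markup) might look roughly like the sketch below. It uses the standard library's html.parser from current Python 3 rather than anything from the Python 2.2 era, and the names LinkCollector and extract_links are made up for illustration. The mxTidy call itself is deliberately left out; the input is simply assumed to have been cleaned up already.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in the input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<html><body><a href="a.html">A</a> <A HREF="b.html">B</A></body></html>'
    print(extract_links(sample))
```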
> If it can interpret Javascript that's even better.
You'll need a browser engine for that. Or use one of the other
Javascript engines and feed them your DOM.
Gerhard
--
mail: gerhard <at> bigfoot <dot> de registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/ OpenPGP public key id AD24C930
public key fingerprint: 3FCC 8700 3012 0A9E B0C9 3667 814B 9CAA AD24 C930
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))
How about automating IE using Python?
from win32com.client import DispatchEx
ie = DispatchEx('internetexplorer.application')
ie.visible = 1
ie.navigate('http://www.nightsong.com')
dom = ie.document
etc...
Access to the DOM tree of the document might be too slow for your
needs, but if it's not, you definitely get a lot of bang for the buck...
-Peter
That's a really interesting idea and I might try it. I had been
thinking in terms of a Linux solution, but automating IE might be
ok for this particular application. Thanks.
I put the above code into "ienavigate.py" and tried it and got:
Traceback (most recent call last):
  File "ienavigate.py", line 6, in ?
    dom = ie.document
  File "J:\Python22\lib\site-packages\win32com\client\dynamic.py", line 448, in __getattr__
    raise pythoncom.com_error, details
pywintypes.com_error: (-2147352567, 'Exception occurred.', (0, None, None, None, 0, -2147467259), None)
Also got a browser window with a "403" error telling me I don't have
permission to access index.html on www.nightsong.com.
I would be interested in getting this working, so any help is appreciated.
TIA,
Dave LeBlanc
Seattle, WA USA
The exception might be legitimate, because of the 403 error. Try
www.yahoo.com instead of www.nightsong.com. www.nightsong.com really
does return a 403.
Nope - it's the call to ie.document that chokes. ie.navigate does open
www.w3.org just fine, but trying to get to the DOM isn't working.
It looks like there's no attribute "document" for this interface...
Dave LeBlanc
David LeBlanc
Seattle, WA USA
Ah, sorry to post code that isn't very robust. I was just trying
to give you an example so you could judge whether it might be suitable.
I actually expected your answer to be that it was very far from being
useful to you...
In this case, you should probably look at "ie.busy", which you should
monitor after things like ie.navigate() to make sure the document
has been fully loaded.
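A rough sketch of that kind of monitoring, in current Python 3: poll the browser object until it reports that it is no longer busy, and give up after a timeout. The helper name wait_until_loaded, the timeout values, and the FakeBrowser stand-in are all made up for illustration. The extra ReadyState check (4 is READYSTATE_COMPLETE in the IE automation model) goes beyond the advice above, which only mentions the busy flag.

```python
import time

READYSTATE_COMPLETE = 4  # READYSTATE_COMPLETE in the IE automation model


def wait_until_loaded(browser, timeout=30.0, poll=0.1,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until browser.Busy is false and ReadyState is complete.

    Returns True once the page looks loaded, False if the timeout expires.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if not browser.Busy and browser.ReadyState == READYSTATE_COMPLETE:
            return True
        sleep(poll)
    return False


class FakeBrowser:
    """Hypothetical stand-in for the COM object, so the helper can be
    tried without Windows: reports busy for the first few polls."""

    def __init__(self, busy_polls):
        self._remaining = busy_polls

    @property
    def Busy(self):
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False

    @property
    def ReadyState(self):
        return READYSTATE_COMPLETE if self._remaining == 0 else 1


if __name__ == "__main__":
    print(wait_until_loaded(FakeBrowser(busy_polls=3), poll=0.01))
```

With pywin32 on Windows, the same helper would presumably be called between ie.Navigate(url) and the first access to ie.Document, passing the object returned by DispatchEx.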
My choice of www.nightsong.com was of course because that's the
domain the OP posted from. I didn't stop to think someone might
have a "www" subdomain which actually refused a connect request...
strange if you ask me.
-Peter
www.nightsong.com doesn't refuse connect requests. It accepts
connections and sends back a valid HTTP 403 response indicating
there's no page available at "/" (e.g. no index.html at the document
root). If you use an interior URL like
<http://www.nightsong.com/phr/python/calc.py> it should work fine.
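The difference between a refused connection and a valid 403 response can be seen with a small experiment. This is only a sketch in current Python 3: it starts a throwaway local server that answers every GET with 403 (ForbiddenHandler, classify and demo are hypothetical names), fetches from it, then shuts it down and tries the same port again. Proxies are disabled explicitly so the requests really go to 127.0.0.1.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ForbiddenHandler(BaseHTTPRequestHandler):
    """Answer every GET with a well-formed HTTP 403 response."""

    def do_GET(self):
        self.send_error(403)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def classify(url, timeout=5):
    """Say whether the server answered over HTTP, and with which status."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return "http %d" % response.status
    except urllib.error.HTTPError as err:
        # The connection worked; the server sent back an error status.
        return "http %d" % err.code
    except urllib.error.URLError as err:
        # No HTTP response at all, e.g. the connection was refused.
        return "no http response (%s)" % err.reason


def demo():
    server = HTTPServer(("127.0.0.1", 0), ForbiddenHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        answered = classify("http://127.0.0.1:%d/" % port)
    finally:
        server.shutdown()
        server.server_close()
    # Nothing is listening on that port any more, so connecting should fail.
    refused = classify("http://127.0.0.1:%d/" % port)
    return answered, refused


if __name__ == "__main__":
    print(demo())
```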
Greetings!
My present task requires the automation of IE. After much prayer, I
found that the "dom = ie.Document" assignment IS case sensitive. The
lower case form of "ie.document" just would not work for my
ActiveState ActivePython 2.2.1 distribution.
Joyfully About Alerio's Business,
Daniel
That's odd. I have never had to worry about case in almost any
aspect of IE automation, especially this one.
Does anyone with greater knowledge of the guts of this stuff
have any input on why Daniel would have to worry about case
while I do not?
Peter
My apologies for using the wrong phrasing. What I meant to say
was I didn't stop to think someone might have a "www" subdomain
which did not provide a valid HTTP 200 response at the top level,
which seems to me to be extremely unusual. (Note, I'm quite sure
many other domains do this too... it just seems unusual.)
-Peter
David LeBlanc wrote:
> While trying to figure out how to make the recently posted (by Paul Rubin)
> ie navigation example work, I had occasion to run makepy.py on "Microsoft
> Internet Controls". After doing this, the sample would fail on
> "ie.visible = 1". Removing the generated file would return the sample to
> working order.
Mark Hammond replied:
"Was the problem "AttributeError: visible"? If so, the problem is simply
that the correct name for the property is "Visible". makepy is case
sensitive."
If he's run makepy, then the ?mapping? gets used, which is case sensitive.
I had to ask too :-)
Greetings!
I still get the exception if I don't use the ".Document" reference. My
code is supposed to wait for IE to return a "not busy" result before
going on.
I recall Mr. Hammond giving an example where he was waiting for IE's
"ReadyState" to return what would be true when ready state was
obtained. I used to use a method to determine ready state and then go
on to determine if IE was "Busy". I found that using the "Navigate2"
method seemed to obviate the need for checking IE's ready state.
Perhaps I should leave the ready state method in?
Here are a few snippets:
################## start of snips #####################
In main I call:
o_AleApp.NavigateIE(s_URL)
##fire events after nav test
o_AleApp.ClickCrossFrameElement(1, 'Sub1t')
In NavigateIE I run:
def NavigateIE(self, s_URL):
    i_Sleep = self.GetDelaySecondsBetweenPageNavs()
    print "Class: AleAppWebReporter Method: NavigateIE: Sleeping", \
        i_Sleep, "seconds prior to navigating..."
    time.sleep(i_Sleep)
    print "Class: AleAppWebReporter Method: NavigateIE: Navigating to:\n" + s_URL
    self.o_IE.Navigate2(s_URL)
    ##now lets make sure the page has fully loaded
    s_NotBusy = self.WaitForNotBusy()
    if s_NotBusy == 'TRUE':
        return 'TRUE'
    else:
        return s_NotBusy
In WaitForNotBusy I run:
def WaitForNotBusy(self):
    print "Class: AleAppWebReporter Method: WaitForNotBusy: Current URL:", \
        self.GetIELocationURL()
    i_IEBusy = int(self.GetIEBusy())
    print "Class: AleAppWebReporter Method: WaitForNotBusy: i_IEBusy:", i_IEBusy
    if i_IEBusy != 1:
        return 'TRUE'
    else:
        i = 1
        i_State = 0
        while i_State == 0:
            print "Class: AleAppWebReporter Method: WaitForNotBusy: Waited:", \
                i, "seconds for IE to complete downloading..."
            time.sleep(1)
            i_IEBusy = int(self.GetIEBusy())
            print "Class: AleAppWebReporter Method: WaitForNotBusy: i_IEBusy:", i_IEBusy
            if i_IEBusy != 1:
                i_State = 1
            i += 1
        return 'TRUE'
In GetIEBusy I run:
def GetIEBusy(self):
    return self.o_IE.Busy
################## end of snips #####################
My guess: He has been running makepy on the MSHTML library, you
haven't. Many COM components don't care about casing (to please VB
users), but the makepy-generated wrappers obviously do (as Python is
case-sensitive with respect to identifiers).
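
The effect can be imitated with a toy model in plain Python 3. This is not how pywin32 is implemented. FakeComObject, DynamicProxy and GeneratedWrapper are invented names that only mimic the two behaviours: a late-bound proxy that hands every name to a case-insensitive server at run time, and a generated wrapper whose names are ordinary, case-sensitive Python attributes.

```python
class FakeComObject:
    """Stand-in for a COM server that resolves member names ignoring case."""

    def __init__(self):
        self._members = {"visible": 0, "document": "<dom>"}

    def lookup(self, name):
        # Like many automation servers, ignore case when resolving a name.
        return self._members[name.lower()]


class DynamicProxy:
    """Late-bound: every attribute access is forwarded by name at run time."""

    def __init__(self, com_object):
        self._com = com_object

    def __getattr__(self, name):
        try:
            return self._com.lookup(name)
        except KeyError:
            raise AttributeError(name)


class GeneratedWrapper:
    """Early-bound: the names are fixed in Python source, spelled one way."""

    def __init__(self, com_object):
        self._com = com_object

    @property
    def Visible(self):
        return self._com.lookup("Visible")

    @property
    def Document(self):
        return self._com.lookup("Document")


if __name__ == "__main__":
    late = DynamicProxy(FakeComObject())
    early = GeneratedWrapper(FakeComObject())
    print(late.document, late.Document)  # both spellings resolve
    print(early.Document)                # only this spelling exists
    print(hasattr(early, "document"))    # the lowercase name is missing
```

In real pywin32 code, win32com.client.dynamic.Dispatch asks for late binding and win32com.client.gencache.EnsureDispatch for the generated wrappers. Spelling members the way the type library does (Visible, Document, Busy) should work in either mode.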
--
Vennlig hilsen
Syver Enstad