[rc] Retrieve full HTML source via Selenium?


ikent

Dec 7, 2010, 7:39:06 AM
to Selenium Users
Hi all

I know this has been asked quite a few times before but none of the
questions seem to be answered. I've got an existing selenium test
suite which runs automated testing against a web application and I'd
like to capture the HTML which can then be validated later - Selenium
should be perfect for this since the HTML that would then get
validated is the actual HTML the user would see when using that
application.

At first this seemed to work ok, I've managed to hook something into
waitForPageToLoad which calls browser.getHtmlSource and dumps the
response to disk. However this is returning the HTML content without
the leading <html> and trailing </html> tags (plus the associated
stuff like the doctype declaration). After a bit of googling it
appears this is because Selenium uses innerHTML to retrieve the
content, and though I haven't actually looked into that code I take it
that innerHTML comes from the HTML tag, explaining the lack of DTD and
HTML tags on the returned output. Additionally, other users have
commented that the HTML source returned is actually as parsed by the
browser in use (i.e. with quoted attributes etc.), and so it's not a
true representation of the HTML source retrieved from the server.

While this does mean I can perform some basic validation testing
against the saved content, the validator has to guess which DTD to
use or I must manually re-add the DTDs, partly defeating
the point of automating the process in the first place! Also by this
point, some originally invalid HTML markup will have now been fixed by
the browser thereby skewing the validation results.

So, is there a way to get Selenium to retrieve the full original
unadulterated HTML source? I can't think of any obvious way since
Selenium would need to retrieve the data via RC, which would then rely
on javascript. However I can't simply call the URLs manually to get at
the source since any post data would be missing and thus different
HTML output...

So, is this catch 22, or is there some workaround to this?

Thanks
Ian

Andrew

Dec 7, 2010, 11:32:54 AM
to Selenium Users
while this isn't what you're looking for, what I did was to write my
tests in PHP and use PHP Curl to pull the pages for HTML validation.
Yes it involved pulling down the page a 2nd time and via a different
means, but I could put it inline in my PHP selenium test to pull code
down and find things like the doc type, etc. I also used this
mechanism to compare images so it proved useful in more than one way.

ikent

Dec 8, 2010, 3:21:52 AM
to Selenium Users
Thanks for the suggestion, however I'm not sure it's quite suitable
for my purposes. When the pages are loaded by Selenium, the pages have
a session, and therefore context and data. This will be affecting the
output of the page, e.g. a user with incorrect security permissions
would see a very different page from someone with the correct
permissions. This potentially applies to the doctype declarations too,
e.g. the error page might be HTML 4 Transitional while other pages
might be HTML 4 Strict.

Because of this, if I used a script to download the file using the
URL, one of two things would happen:
- If using an identical URL with the same session ID, the page may
display differently since it'll have lost its post data (post data is
heavily used in the application to track application state), or the
second page request may cause some content to change (e.g. a one-off
warning would only get displayed on first page load unless data
changes again), though this should never affect doctype since the page
has context
- If using a new URL the page would lack context, since it wouldn't
have had the initial log in and navigation that the selenium script
does, and so the doctype or page content could change

I really could do with capturing the actual HTML source as displayed
while Selenium is running, but of course if this isn't possible then
I'll need to look at other options.

Out of interest, since it's just the HTML tags and DOCTYPE that appear
to be missing (though obviously there will be other subtle differences
between the real source and the DOM tree), I could possibly add a
method to Selenium (user-extensions?!) which would return the doctype,
allowing me to wrap the result of getHtmlSource with the HTML tags and
add the doctype. This at least means it's all kept in the same place
and using a single page load - will have a hack around this morning.
Is there anything else that might be missing from the HTML returned by
getHtmlSource that could be detected this way?
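
A minimal sketch of the wrapping step described above, assuming the doctype string and the getHtmlSource result are already in hand. The name wrapHtmlSource and the optional htmlAttributes parameter are made up for illustration and are not part of Selenium. One thing the innerHTML of the html element cannot supply is the set of attributes on the <html> tag itself (lang, xmlns and so on), so the sketch takes those as a separate argument; comments that sit outside the <html> element would be lost in the same way.

```javascript
// Hypothetical helper, not part of Selenium: rebuilds a full document from
// the pieces that getHtmlSource leaves out.
function wrapHtmlSource(doctype, innerHtml, htmlAttributes) {
  // htmlAttributes is an optional plain object such as { lang: "en" }.
  var source = htmlAttributes || {};
  var attrs = "";
  Object.keys(source).forEach(function (name) {
    // Escape characters that would otherwise end the attribute value early.
    var value = String(source[name])
      .replace(/&/g, "&amp;")
      .replace(/"/g, "&quot;");
    attrs += " " + name + "=\"" + value + "\"";
  });
  // Strip trailing whitespace from the doctype so exactly one newline
  // separates it from the opening <html> tag.
  return doctype.replace(/\s+$/, "") + "\n<html" + attrs + ">" +
    innerHtml + "</html>\n";
}

// Example input: a doctype as a user extension might return it, plus the
// kind of string getHtmlSource returns.
var doctype = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" " +
  "\"http://www.w3.org/TR/html4/strict.dtd\">\n";
var page = wrapHtmlSource(
  doctype,
  "<head><title>t</title></head><body></body>",
  { lang: "en" }
);
console.log(page);
```

The result is still the browser's serialisation of the DOM rather than the bytes sent by the server, so the validator sees the right DTD but not necessarily the original markup.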

ikent

Dec 9, 2010, 7:34:37 AM
to Selenium Users
I think I've got a sort of solution to the original problem - I've
added a javascript function to user-extensions.js which returns the
page doctype, which is then re-added to the HTML before it's saved.
This has solved that particular problem, however it has highlighted a
different problem.

Every so often my test suite will hang, always on a
'waitForPageToLoad' call. In my selenium log I have the following
actions recorded:
- click(name=someelementnamehere)
- waitForPageToLoad(300000)
- getLocation()
- getDocType()
- getHtmlSource()
- waitForPageToLoad(300000)

I'm confused by where the last waitForPageToLoad originates. My test
suite calls the first click/waitForPageToLoad methods. The click call
results in a page load and so the first waitForPageToLoad call
correctly returns when the page has loaded. It's in waitForPageToLoad
that I've added my code to store the HTML source, via a helper method
saveHTMLForValidation. It's my saveHTMLForValidation method which then
calls getLocation, getDocType and getHtmlSource.

The strange thing is there is no further call to waitForPageToLoad. I
can't see one anywhere in the test suite, nor can I see it in any of
the Selenium/Java code. It seems that getHtmlSource is causing a
waitForPageToLoad call, which because no page is actually going to
load (i.e. there are no click etc events) means the call hangs until
the timeout elapses, in my case 300 seconds. This happens for roughly
one in every 10 page loads which is adding a considerable delay to the
test suite execution time.

Is there anywhere that last waitForPageToLoad could be coming from, or
is there anything in waitForPageToLoad which could be generating the
new wait call?

Thanks in advance
Ian

ikent

Dec 13, 2010, 7:44:21 AM
to seleniu...@googlegroups.com
I've narrowed it down very slightly if this helps - in my local implementation I've altered waitForPageToLoad (which was using a constant DEFAULT_TIMEOUT) to use a hardcoded value of 15000 (calls browser.waitForPageToLoad("15000")).

When I run the same test suite again, the first call to waitForPageToLoad (after whichever action caused it, e.g. click) does indeed use the new timeout of 15000. However, the extraneous waitForPageToLoad call still gets called with the default timeout of 300000, which I think indicates that the call isn't originating in my code anywhere, or at the least is bypassing my local implementation and calling waitForPageToLoad on the browser object directly.

Does anybody with knowledge of the inner workings of Selenium know if waitForPageToLoad would get called at any point during a getHtmlSource call?

Finally, the waitForPageToLoad call after getHtmlSource doesn't seem to happen every time, perhaps just 95% of the time, which suggests it may not be called directly by getHtmlSource but could be an unintended side effect from another call?

Starting to tear my hair out now, feels a bit like I'm running around in circles! Will keep chipping away at it but any suggestions would be more than welcome!
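
One way to track down where a stray command comes from is to wrap the method and record a stack trace on every call. The sketch below shows the idea in JavaScript against a stand-in browser object. traceCalls, fakeBrowser and loginHelper are hypothetical names used only for illustration; with the Java client the equivalent would presumably be to wrap or subclass the browser object and log a stack trace from the overridden method.

```javascript
// Wraps target[methodName] so every call is logged together with a stack
// trace taken at call time. Returns a function that restores the original.
function traceCalls(target, methodName, log) {
  var original = target[methodName];
  target[methodName] = function () {
    var args = Array.prototype.slice.call(arguments);
    log.push({
      method: methodName,
      args: args,
      stack: new Error("trace").stack // shows which code path made the call
    });
    return original.apply(this, args);
  };
  return function restore() {
    target[methodName] = original;
  };
}

// Stand-in for the real browser object, for demonstration only.
var fakeBrowser = {
  waitForPageToLoad: function (timeout) {
    return "waited " + timeout;
  }
};

// Stand-in for a helper that calls the browser object directly, bypassing
// any wrapper method that passes a shorter timeout.
function loginHelper(browser) {
  browser.waitForPageToLoad("300000");
}

var calls = [];
var restore = traceCalls(fakeBrowser, "waitForPageToLoad", calls);
fakeBrowser.waitForPageToLoad("15000");
loginHelper(fakeBrowser);
restore();

// Print each recorded call followed by the stack that led to it.
calls.forEach(function (call) {
  console.log(call.method + "(" + call.args.join(", ") + ")");
  console.log(call.stack);
});
```

Each logged entry pairs the timeout argument with the call path, so a call carrying the old default timeout can be traced back to the method that issued it.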

ikent

Dec 13, 2010, 8:42:35 AM
to seleniu...@googlegroups.com
Ok, so it turns out it was (sort of) my fault. Actually, I blame whoever wrote the code I've inherited, but thankfully it was nothing within Selenium or Concordion that was causing it!

We had a java method which filled out some forms to get to the 'start' of a test suite, i.e. past the usual login stuff which is all tested elsewhere. Inside that login method we were calling another method to submit the form which itself was calling waitForPageToLoad, and we were then calling waitForPageToLoad again, apparently not realising it was already being done by the form submit.

Now that's gone the freezing has stopped, and I think it's now doing what I want! It now runs the test suites and downloads the HTML source (along with doctype) and stores a copy locally.

And here's my user extension to get the doctype - I'm sure someone will eventually find it useful:

Selenium.prototype.getDocType = function() {
  // The doctype node of the page under test (null if the page declares none)
  var dt = this.browserbot.getCurrentWindow().document.doctype;
  if (dt) {
      // The page has provided its doctype, so rebuild the declaration from its parts.
      // PUBLIC/SYSTEM are only emitted when the matching identifier is present,
      // so a bare declaration such as <!DOCTYPE html> comes back unchanged.
      var ret = "<!DOCTYPE " + dt.name;
      if (dt.publicId) {
          ret += " PUBLIC \"" + dt.publicId + "\"";
          if (dt.systemId) {
              ret += " \"" + dt.systemId + "\"";
          }
      } else if (dt.systemId) {
          ret += " SYSTEM \"" + dt.systemId + "\"";
      }
      ret += ">\n";
      ret += "<!-- Doctype automatically detected by Selenium user-extensions.js -->";
      return ret;
  }

  // No doctype was specified - return a default doctype
  // TODO Could probably do some manual detection here instead of simply assuming HTML4 Transitional
  var doctype = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n";
  doctype += "\n<!-- Doctype not detectable so auto-generated by Selenium user-extensions.js -->\n";
  return doctype;
};

And to use it from Selenium:
String doctype = commandProcessor.getString("getDocType", new String[] {} );

Ross Patterson

Dec 13, 2010, 8:50:38 AM
to seleniu...@googlegroups.com
No, getHtmlSource doesn't call waitForPageToLoad. It is actually one of the simplest parts of Selenium, using almost none of the rest of the system. It really is just a single line of code:

Selenium.prototype.getHtmlSource = function() {
    /** Returns the entire HTML source between the opening and
     * closing "html" tags.
     *
     * @return string the entire HTML source
     */
    return this.browserbot.getDocument().getElementsByTagName("html")[0].innerHTML;
};

Ross