Doctype and getPageSource()


Aaron

Mar 20, 2012, 9:51:57 AM
to webd...@googlegroups.com
Hi, I have a page with an XHTML doctype.

When I do getPageSource(), I get the page source as a string, but without the doctype prolog. Is that expected behavior? When I look at the DOM in Firebug, I see the doctype there.

Is there some other way of detecting the doctype using webdriver?

Here's the HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>A Title</title>
    </head>
    <body>
...  
    </body> 
</html>

David

Mar 20, 2012, 7:32:45 PM
to webdriver
I wonder whether it behaves the same under Selenium RC?

Worst case, you'll need to integrate with an external tool
that captures the network traffic during automation, then parse
the DOCTYPE out of the captured HTTP response body.

Alternatively, if the page in question doesn't use AJAX or you can
skip the AJAX, you could make a direct HTTP request to the URL
from code (Java, Python, etc.) rather than via WebDriver and parse
the DOCTYPE out of the response. You could set the User-Agent header
to match one of the browsers. This code can still live in your
WebDriver test code; it just doesn't use WebDriver for this one
check. And if cookies or a session are needed, you can extract them
from the browser session with WebDriver and pass them to your
code-based HTTP request in a Cookie header.
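
A rough sketch of the request-building part of this suggestion, assuming Java 11+ and its built-in java.net.http client. The class name BrowserLikeRequest, the method names, and the placeholder values are made up for illustration. In a real test the cookie map could be filled from driver.manage().getCookies() using each cookie's getName() and getValue(); here it is populated by hand so the sketch runs without a browser.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class BrowserLikeRequest {

    // Join cookie name/value pairs into one Cookie header value,
    // in the form "name1=value1; name2=value2".
    public static String cookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    // Build a GET request that presents the given User-Agent and cookies.
    // The request is only constructed here, not sent.
    public static HttpRequest build(String url, String userAgent,
                                    Map<String, String> cookies) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET();
        if (!cookies.isEmpty()) {
            builder.header("Cookie", cookieHeader(cookies));
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Hypothetical values; in a test these would come from the browser session.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        HttpRequest request =
                build("http://example.com/", "Mozilla/5.0 (placeholder)", cookies);
        System.out.println(request.headers().map());
    }
}
```

Sending the request with an HttpClient and reading the body as a string would then give the text the server returned, which may differ from what the browser ends up showing after scripts run.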

Andreas Tolf Tolfsen

Mar 21, 2012, 4:56:08 AM
to webd...@googlegroups.com
On 20 March 2012, at 14:51, Aaron wrote:

> Hi, I have a page with an XHTML doctype.
>
> When I do getPageSource(), I get the page source as a string, but
> without the doctype prolog. Is that expected behavior? When I look
> at the DOM in Firebug, I see the doctype there.

You should not trust the consistency of getPageSource(), especially
for cross-browser purposes.

Let me quote the WebDriver API:

Get the source of the last loaded page. If the page has been
modified after loading (for example, by Javascript) there is
no guarantee that the returned text is that of the modified
page. Please consult the documentation of the particular driver
being used to determine whether the returned text reflects the
current state of the page or the text last sent by the web
server. The page source returned is a representation of the
underlying DOM: do not expect it to be formatted or escaped in
the same way as the response sent from the web server. Think of
it as an artist's impression.

Some browsers may return the modified DOM, some may return the document
as it was when it was loaded. The text wrapping and indentation are
almost certainly going to be different, and it goes without saying that
the same thing applies to the doctype.

> Is there some other way of detecting the doctype using webdriver?

WebDriver is not designed for this purpose. I recommend using another
library, such as curl or a proxy, for determining the doctype of a
particular document.

Aaron

Mar 21, 2012, 8:55:49 AM
to webd...@googlegroups.com
It turns out, I had to just open up a network connection via Java and get the source that way.
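
The code used was not posted, so the following is only a minimal sketch of what such a check might look like, assuming Java 11+ (java.net.http). The class name RawDoctype, the method names, and the regular expression are illustrative choices, and the pattern assumes a doctype without an internal subset.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawDoctype {

    // Matches a declaration such as <!DOCTYPE html PUBLIC "..." "...">.
    // Assumption: the declaration contains no '>' before its closing one,
    // i.e. it has no internal subset.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE\\s+[^>]*>", Pattern.CASE_INSENSITIVE);

    // Return the first doctype declaration found in the raw source, if any.
    public static Optional<String> extractDoctype(String rawSource) {
        Matcher m = DOCTYPE.matcher(rawSource);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    // Fetch the response body exactly as the server sent it (no browser involved).
    public static String fetchRawSource(String url)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java RawDoctype <url>");
            return;
        }
        System.out.println(
                extractDoctype(fetchRawSource(args[0])).orElse("(no doctype found)"));
    }
}
```

Because this bypasses the browser, it only sees what the server returns for a plain GET; pages that require a login or session would need the relevant headers added to the request.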

As far as what WebDriver is meant for, I am fully aware of what it is "meant for", and it's great at that.

But when making frameworks, it seems to me, folks will use it in ways the designers never anticipated. There is, of course, a line in the sand where you have to limit the scope of the framework, but in this case it seems a waste not to "remember" the original page source, since WebDriver must be making an HTTP connection anyway. Why not have a method called getRawSource()?

Is there a technical limitation here?

Jim Evans

Mar 21, 2012, 9:41:05 AM
to webd...@googlegroups.com
WebDriver doesn't make an HTTP connection; the browser does. Yes, WebDriver uses HTTP to talk to its own server component, but it never talks directly to the web server from which the page is coming. So the technical limitation is that we only get what the browser lets us see. If the browser provides no way to access the "raw" HTTP response (IE, for example, doesn't), we can't provide the "raw" HTML. Incidentally, this is the same limitation that prevents us from supplying HTTP status codes on requests. Even if it weren't out of scope for the project, it wouldn't be possible, at least not for all browsers.

--Jim
