Getting the raw HTML source using JavaScript in a Chrome extension

Mat Kelly

Apr 9, 2015, 3:59:22 PM
to chromium-...@chromium.org
I have created a Chrome extension that allows the user to store the contents of a web page into a single concatenated file. The page may have been manipulated by the user or scripts, so I use the final HTML representation.

However, the HTML-getting properties (e.g., innerHTML, outerHTML) conventionally read from the document's elements return the decoded characters instead of the HTML source. For example, if a web page's source contains:

&rarr;
&#8594;
&#x2192;

the JavaScript functions return the decoded form:

→


How do I go about getting the original raw HTML source using either JavaScript or the Chrome extension API?

Daniel F

Apr 10, 2015, 1:02:30 PM
to chromium-...@chromium.org
I have not tested this, but I think it may be the JavaScript console that renders it, if you are reading innerHTML in the JavaScript console and looking at the result.
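One way to check this without trusting how the console draws text is to escape everything outside printable ASCII before logging. The sketch below is only an illustration: `escapeNonAscii` is a hypothetical helper name, not part of any Chrome API. If the variable still holds the character reference, the output shows `&rarr;` unchanged; if it holds the decoded arrow, the output shows `\u2192`.

```javascript
// Hypothetical helper: rewrites every UTF-16 code unit outside printable
// ASCII (0x20-0x7E) as a \uXXXX escape, so the console cannot hide what
// the string really contains. Note that newlines and tabs are escaped too.
function escapeNonAscii(text) {
  return text.replace(/[^\x20-\x7e]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

// Possible use inside a content script (assumes a DOM is available):
//   console.log(escapeNonAscii(document.documentElement.outerHTML));
```

Because the escaped output is plain ASCII, it reads the same in the console, in a log file, or in a message passed between parts of the extension.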

Simon Knott

Apr 10, 2015, 1:25:27 PM
to chromium-...@chromium.org
How are the end users using the captured page? Have you thought about using the pageCapture API (https://developer.chrome.com/extensions/pageCapture)? It captures the full content of the page in a single MHTML file, so you don't need to do all of the heavy lifting...

Mat Kelly

Apr 10, 2015, 1:44:39 PM
to Simon Knott, chromium-...@chromium.org
Simon,
The raw HTML of the page as well as the content of the other resources needed to construct the page (e.g., images, CSS, JavaScript) are concatenated together to create a WARC file (ISO 28500:2009), so MHTML is not an option. The tool already does the majority of this heavy lifting and produces WARC files, but I noticed this nuance in how escaped special characters are recorded.

I am using document.documentElement.outerHTML in the extension's JavaScript to obtain the HTML from within the content script, and then pass it elsewhere in the script for processing.

Daniel,
My actual code does not use the console but it's difficult to verify what is stored in the variable along the way when the console appears to be transforming these characters.

Is there any way to get the raw HTML, byte-for-byte?

Thank you,
Mat
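
For the byte-for-byte question, one possible approach (a sketch, not something confirmed in this thread) is to request the page's URL again from the content script and read the response body as bytes rather than as a parsed DOM. `fetchRawBytes` is a hypothetical name, and the sketch assumes the page was loaded with a plain GET and that the server returns the same bytes on a second request. Note that this yields the source as served, not the DOM after the user or scripts have manipulated it, so it trades the "final representation" for the original encoding of character references.

```javascript
// Hypothetical helper: re-requests a URL and returns the response body as
// raw bytes, with no HTML parsing and no character-reference decoding.
// fetchImpl is injectable only so the function can be exercised without a
// network; in a content script the default global fetch would be used.
async function fetchRawBytes(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, {
    credentials: "include", // send cookies so the server sees the same session
    cache: "force-cache",   // prefer a cached copy of the response if one exists
  });
  if (!response.ok) {
    throw new Error("Request for " + url + " failed with status " + response.status);
  }
  // arrayBuffer() exposes the body bytes as received, before any text decoding.
  return new Uint8Array(await response.arrayBuffer());
}

// Possible use inside a content script (assumes a browser environment):
//   fetchRawBytes(location.href).then((bytes) => {
//     /* hand the bytes to the WARC-writing code */
//   });
```

Pages produced by a form POST, or pages whose content changes on every request, would not round-trip this way, so the result should be treated as a best-effort copy of the original source.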
