Simon,
The raw HTML of the page, along with the content of the other resources needed to construct the page (e.g., images, CSS, JavaScript), is concatenated together to create a WARC file (ISO 28500:2009), so MHTML is not an option. The tool already does the majority of this heavy lifting and produces WARC files, but I noticed this nuance in how escaped special characters are recorded.
I am using document.documentElement.outerHTML in the extension's content script to obtain the HTML, then passing it elsewhere in the script for processing.
Daniel,
My actual code does not use the console, but it's difficult to verify what is stored in the variable along the way when the console appears to be transforming these characters.
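For illustration, here is a minimal sketch of one way to inspect the variable without depending on how the console renders it. The helper name toCodeUnits and the sample string are hypothetical and not part of my extension; in the content script the input would be the value of document.documentElement.outerHTML.

```javascript
// Hypothetical helper: list every UTF-16 code unit of a string as 4-digit hex,
// so an escaped entity ("&amp;") can be told apart from a literal character ("&")
// regardless of how a console chooses to display the string.
function toCodeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, "0"));
  }
  return units.join(" ");
}

// Assumption: a fixed sample stands in for document.documentElement.outerHTML
// so the sketch can run outside a browser.
const sample = "a&amp;b";

// JSON.stringify gives an unambiguous quoted form of the string.
console.log(JSON.stringify(sample));
// The code-unit dump shows exactly which characters are stored.
console.log(toCodeUnits(sample));
```

This only shows what the string contains after the browser has serialized the DOM; it does not by itself recover the bytes that were sent over the network.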
Is there any way to get the raw HTML, byte-for-byte?
Thank you,
Mat