WalkAll/Stream loading/UI-less HTML Dom

Mark

Mar 18, 2004, 6:26:05 PM
Hi...

A year and a half ago, I made a COM object to get UI-less access to the MSHTML DOM. Back then I had a dev box with IE 5.5 on it. To get my object to work, I basically grafted the MSDN WalkAll code sample
(http://msdn.microsoft.com/downloads/samples/internet/default.asp?url=/downloads/samples/internet/browser/walkall/default.asp)

with the MSDN code snippet on how to load MSHTML from a stream
(http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/webbrowser/WebBrowser.asp)

My COM object has a method to load the contents of MSHTML::IHTMLDocument2 from a URL or from a stream (the latter coming exactly from the MSDN article, where it queries for the IPersistStreamInit interface and calls Load() with a stream). The LoadUrl method works and I thought the LoadStream method *used* to work, but I haven't looked at this in over a year.

Now when I run it, LoadUrl works but LoadStream doesn't. It doesn't throw any errors, but the UI-less parser wrapper seems to get invoked differently depending on how the load is performed.

Does the load-from-stream example work? Are there some other things I need to hook up to get it to work?

Thanks
-mark

Yan-Hong Huang[MSFT]

Mar 19, 2004, 4:23:08 AM
Hi Mark,

As I understand it, the problem is that the COM component used the
IPersistStreamInit interface to load from a stream, and that no longer
seems to work. Is that right?

MSDN has a short code sample that you can refer to for more information:
http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/webbrowser/tutorials/webocstream.asp

The sample code is:
HRESULT LoadWebBrowserFromStream(IWebBrowser* pWebBrowser, IStream* pStream)
{
    HRESULT hr;
    IDispatch* pHtmlDoc = NULL;
    IPersistStreamInit* pPersistStreamInit = NULL;

    // Retrieve the document object.
    hr = pWebBrowser->get_Document( &pHtmlDoc );
    if ( SUCCEEDED(hr) && pHtmlDoc == NULL )
    {
        // Guard against a success code that comes back without a document.
        hr = E_FAIL;
    }
    if ( SUCCEEDED(hr) )
    {
        // Query for IPersistStreamInit.
        hr = pHtmlDoc->QueryInterface( IID_IPersistStreamInit,
                                       (void**)&pPersistStreamInit );
        if ( SUCCEEDED(hr) )
        {
            // Initialize the document.
            hr = pPersistStreamInit->InitNew();
            if ( SUCCEEDED(hr) )
            {
                // Load the contents of the stream.
                hr = pPersistStreamInit->Load( pStream );
            }
            pPersistStreamInit->Release();
        }
        // Release the reference returned by get_Document.
        pHtmlDoc->Release();
    }
    // Report the outcome to the caller.
    return hr;
}

For a complete sample, we can also refer to the MSDN KB article:
"BUG: PersistStreamInit::Load() Displays HTML Files as Text"
http://support.microsoft.com/?id=323569

Though it mainly describes a known issue, we can still use the
repro sample in it to achieve what we need.

Thanks very much. If there is anything unclear, please feel free to post
here.

Best regards,
Yanhong Huang
Microsoft Community Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.

Mikko Noromaa

Mar 19, 2004, 6:44:19 AM
Hi,

Just recently I had a very similar problem. Code that (I think) worked
earlier doesn't work anymore. I was using IPersistStreamInit to load HTML
from memory. I then checked the ReadyState property in a loop, and it was
returning 1 ("loading") all the time.

I tracked the problem down to my CoInitialize() call. The plain old
CoInitialize(NULL) didn't work but when I replaced it with the following,
everything started working fine:

CoInitializeEx(NULL,COINIT_MULTITHREADED);

I guess this is a new requirement of the MSHTML component that ships with
IE6. I never made any checks to confirm this, though.

Soon after fixing this, I was hit by the problem mentioned by Yan-Hong:

"BUG: PersistStreamInit::Load() Displays HTML Files as Text"
http://support.microsoft.com/?id=323569

This caused the file not to be parsed as HTML, but instead to be treated as a
single block of <PRE> text by IHTMLDocument2.

Because of this, I ended up writing my HTML to a temporary file and loading
it from there with IPersistFile (an approach I very much hate). And before
you ask, no, you cannot use FILE_FLAG_DELETE_ON_CLOSE on the temporary file.
IPersistFile would have to use the FILE_SHARE_DELETE flag for this to work.

--

Mikko Noromaa (mik...@excelsql.com)
- SQL in Excel, check out ExcelSQL! - see http://www.excelsql.com -



Igor Tandetnik

Mar 19, 2004, 1:30:00 PM
"Mark" <msdno...@lycos-inc.com> wrote in message
news:119464BF-BAB7-44C9...@microsoft.com
> Thanks for the tip on the bug report. As Mikko found, that seems to
> be more of what I'm tripping over... The bug doesn't really seem to
> go far enough in saying how easy it is to confuse this version of
> MSHTML; the page I was fetching has no large sections of text nor
> large script blocks. What it *does* have - which seems to confuse
> mshtml - is some <META> tags at the top before the <HTML> tag. For
> example:
>
> <META HTTP-EQUIV="Pragma" CONTENT="no-cache"><META
> HTTP-EQUIV="Expires" CONTENT="0"><META HTTP-EQUIV="Expires"
> CONTENT="Sun, 22 Mar 1998 16:18:35 GMT"><META
> HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"
> /><HTML> ... </HTML>
>
> Now, strictly speaking this isn't legit to have those up top, but to
> refuse to call the stream an html document based on that seems a bit
> harsh.

Well, you expect heroic efforts from a piece of software. How would you
definitively identify a piece of text as HTML content? MSHTML MIME
sniffing looks at the first 256 bytes of the text (IIRC) and expects to
see <html> there. If it does not, the content is treated as plain text, not
HTML. Well actually, when you download from the server or from a disk
file, there are two additional pieces of information that MSHTML takes
into account: the Content-Type sent by the server, and file extension
(.htm or .html strongly suggest the content is HTML). But when you load
content from a memory stream, you don't have this extra data and have to
rely on content alone.
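
To make the content-only case concrete, here is a minimal sketch of the check as described in the post above. It is a toy model, not MSHTML's or urlmon's actual algorithm: the function name LooksLikeHtml is hypothetical, and the 256-byte window is simply the figure recalled ("IIRC") in the post.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical toy model of content-only sniffing: report "HTML" only if
// "<html" appears (case-insensitively) within the first `window` bytes.
// The real detection logic is more involved and is not reproduced here.
bool LooksLikeHtml(const std::string& content, std::size_t window = 256)
{
    assert(window > 0);  // a zero-byte window could never match anything

    // Only the leading bytes are inspected; anything after them is ignored.
    std::string head = content.substr(0, window);
    std::transform(head.begin(), head.end(), head.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    return head.find("<html") != std::string::npos;
}
```

Under this model, a document whose opening tag is pushed past the window by a long preamble is classified as plain text, which is the failure mode being discussed. The real sniffer evidently applies further rules, since inputs shorter than the window have also been reported in this thread as misdetected.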

However, see this

http://groups.google.com/groups?threadm=031f01c3780b%24b860f080%24a301280a%40phx.gbl

This approach allows you to send a fake URL along with the content. Make
this URL refer to a dummy file with .html extension. With a bit more
work, it is possible to imitate sending Content-Type header, too - see

http://groups.google.com/groups?threadm=%23AIMBmRBEHA.3712%40tk2msftngp13.phx.gbl

--
With best wishes,
Igor Tandetnik

"For every complex problem, there is a solution that is simple, neat,
and wrong." H.L. Mencken


Mark

Mar 19, 2004, 1:41:05 PM
Another question just occurred to me. We're planning on using this COM object we've created in an ASP/ASPX context. When you use the MSHTML IPersistMoniker->Load() call to load an HTML DOM from a URL, does that use WinInet? I remember hearing that WinInet use is dangerous in an ASP context.

Thanks
-mark

Igor Tandetnik

Mar 19, 2004, 1:56:48 PM
"Mark" <msdno...@lycos-inc.com> wrote in message
news:D2F875B7-1969-4B6D...@microsoft.com

> Another question just occurred to me - We're planning on using this
> COM object we've created in an asp/aspx context; when you use mshtml
> IPersistMoniker->Load() call to load an html dom from a url does that
> use WinInet?

Of course not. You are loading from memory, there is no network activity
involved.

Igor Tandetnik

Mar 19, 2004, 3:06:13 PM
"Mark" <msdno...@lycos-inc.com> wrote in message
news:4EAF7600-9B00-4943...@microsoft.com
> Sorry I wasn't more clear... As I noted in my original post, the
> WalkAll sample comes with a method where you pass in a text string
> url and it loads it using the IPersistMoniker (after creating a
> Moniker from the url). That still worked and that does, indeed,
> involve network activity (though what utilities are used to perform
> the fetch are in question, as I noted).

Sorry, my bad. Somehow I thought you were still talking about
IPersistStreamInit.

Yes, when you create a URL Moniker, you use the URLMon library, which
ultimately uses WinInet. And yes, WinInet is not supported for use in
server-side applications - see KB Article KB238425 "INFO: WinInet Not
Supported for Use in Services". For that matter, neither is MSHTML - see KB
Article KB244085 "PRB: Parsing HTML on Server Using Internet Explorer
Components".

Mark

Mar 19, 2004, 3:11:09 PM
Okay, trying to read between the lines: if I implement IMoniker::GetDisplayName() to hold the URL and IMoniker::BindToStorage() along the lines of the Google examples to wrap the stream I already have, it seems like that might work around the about:blank issue. The part I'm still missing is how to get around the MIME type bug.

If I go through the URLMoniker method and let urlmon do the fetch, MSHTML knows to parse the return as HTML regardless of whether <html> is in the first 256 bytes. What's URLMoniker doing or exposing differently that allows MSHTML to proceed with confidence and not do MIME detection?

Thanks
-mark

Mikko Noromaa

Mar 19, 2004, 3:22:08 PM
Hi,

> Well, you expect heroic efforts from a piece of software. How would you
> definitively identify a piece of text as HTML content?

I wouldn't call it very heroic for an HTML parser to recognize even
malformed HTML content as HTML... If I were to implement an HTML parser, I'd
assume HTML format and make plain text a rare exception.

Besides, this still doesn't explain why MSHTML recognizes the file better
through IPersistFile than through IPersistStreamInit:

> Well actually, when you download from the server or from a disk
> file, there are two additional pieces of information that MSHTML takes
> into account: the Content-Type sent by the server, and file extension
> (.htm or .html strongly suggest the content is HTML).

My temporary file has a .tmp extension and still MSHTML is able to recognize
it as HTML through IPersistFile but not through IPersistStreamInit.

Igor Tandetnik

Mar 19, 2004, 3:25:15 PM
"Mark" <msdno...@lycos-inc.com> wrote in message
news:8047114B-DEA4-4D02...@microsoft.com

> Okay, trying to read between the lines, if I implement
> IMoniker::GetDisplayName() to hold the url and
> IMoniker::BindToStorage() along the lines of the Google examples, to
> wrap the stream I already have it seems like that might work around
> the about:blank issue. The part I'm still missing is what/how to get
> around the mimetype bug.

It helps with that, too. MSHTML now has an additional piece of
information - the file extension - and would use it to correctly
determine the MIME type.

> If I go through the URLMoniker method and let urlmon go do the fetch,
> mshtml knows to parse the return as html regardless of whether <html>
> is in the first 256 bytes. What's URLMoniker doing
> differently/exposing differently that allows mshtml to proceed with
> confidence and not do mime detection?

Reporting Content-Type sent by the server - that's a very strong
indication given a high weight in MSHTML's internal algorithm. Also, the
URL is available and the file extension may provide an additional hint.
See FindMimeFromData for details.
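
The ordering of hints described above can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the logic of FindMimeFromData: the function GuessMimeType, its helpers, and the strict "first hint wins" ordering are all hypothetical stand-ins for whatever weighting the real implementation uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: lower-case copy of an ASCII string.
static std::string ToLowerAscii(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: true if `s` ends with `suffix`.
static bool EndsWith(const std::string& s, const std::string& suffix)
{
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical model of the hint priority: server Content-Type first,
// then the URL's file extension, then sniffing the leading bytes.
std::string GuessMimeType(const std::string& serverContentType,
                          const std::string& url,
                          const std::string& content,
                          std::size_t sniffWindow = 256)
{
    assert(sniffWindow > 0);

    // 1. A server-supplied Content-Type is the strongest hint.
    //    Parameters such as "; charset=..." are dropped.
    if (!serverContentType.empty()) {
        std::string type =
            ToLowerAscii(serverContentType.substr(0, serverContentType.find(';')));
        while (!type.empty() && type.back() == ' ')
            type.pop_back();
        return type;
    }

    // 2. Otherwise the extension of the (possibly fake) URL is used.
    const std::string lowerUrl = ToLowerAscii(url);
    if (EndsWith(lowerUrl, ".html") || EndsWith(lowerUrl, ".htm"))
        return "text/html";

    // 3. Last resort: look for "<html" near the start of the content.
    const std::string head = ToLowerAscii(content.substr(0, sniffWindow));
    return head.find("<html") != std::string::npos ? "text/html" : "text/plain";
}
```

In this model, supplying a dummy URL that ends in .html is enough to get "text/html" even when the content alone would not be recognized, which is the effect the fake-URL moniker approach is after.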

Igor Tandetnik

Mar 19, 2004, 3:42:43 PM
"Mikko Noromaa" <mik...@excelsql.com> wrote in message
news:u7iFM$eDEHA...@TK2MSFTNGP09.phx.gbl

>> Well, you expect heroic efforts from a piece of software. How would
>> you definitively identify a piece of text as HTML content?
>
> I wouldn't call it very heroic for an HTML parser to recognize even
> malformed HTML content as HTML... If I were to implement an HTML
> parser, I'd assume HTML format and make plain text a rare exception.

What if you were to implement a parser that accepts plain text, HTML,
XML, CSS, JScript, VBScript, GIF, JPEG, ... How would you determine even
which "subparser" to flow the content through? Would you try to parse
through all of them, and pick the one that produces fewest errors or
something?

Also, where to draw the line between badly malformed HTML and plain
text that resembles HTML somewhat? Say, how would you interpret this:

One line of plain text comment
<html>
<!-- 100 KB of well-formed HTML -->
</html>

Is this a plain text file or an HTML file? Does the answer change if there are two
lines of text, and 50KB of HTML? 4 lines of text and 25KB of HTML? 50KB
of text and 50KB of HTML? You see what I'm getting at.

> Besides, this still doesn't explain why MSHTML recognizes the file
> better through IPersistFile than through IPersistStreamInit:
>
>> Well actually, when you download from the server or from a disk
>> file, there are two additional pieces of information that MSHTML
>> takes into account: the Content-Type sent by the server, and file
>> extension (.htm or .html strongly suggest the content is HTML).
>
> My temporary file has a .tmp extension and still MSHTML is able to
> recognize it as HTML through IPersistFile but not through
> IPersistStreamInit.

You sure? Strange. Maybe it passes a larger initial chunk to
FindMimeFromData when reading from a file, though I'm not sure why it
would do that.

Igor Tandetnik

Mar 19, 2004, 3:49:10 PM
"Mark" <msdo...@lycos-inc.com> wrote in message
news:B699A88D-7E60-4A74...@microsoft.com
> I'm not sure how different the DHTML editing component is from the
> mshtml component, or if it is any easier to work with. More reading
> to do, I guess...

I believe it's deprecated in favor of MSHTML editing support
(designMode="on"). I'm not very familiar with it though. Anyway, it is
about WYSIWYG HTML editing by the end user, not about HTML parsing.

Igor Tandetnik

Mar 19, 2004, 4:05:48 PM
"Mark" <msdno...@lycos-inc.com> wrote in message
news:C6AC3DE9-28BB-4844...@microsoft.com
> How does mshtml get the Content-Type: header from the URLMoniker?

IBindStatusCallback::OnProgress(BINDSTATUS_MIMETYPEAVAILABLE).

> I've been just reading the URLMoniker pages trying to figure that out
> and have yet to find the connection. FindMimeFromData is a function
> of URLMoniker but it's not clear from the documentation if it's
> invoked as a private function internally or externally.

URL Moniker calls it in order to generate appropriate
BINDSTATUS_MIMETYPEAVAILABLE notification.

> This may be a vain hope, but it seems like what I would really like
> is a way to create an URLMoniker on an existing stream instead of
> having it go ahead and do the fetch. Or find a way to derive a class
> from URLMoniker using an existing ServerXMLHTTP instance as the data
> source. If the fetch is already done with ServerXMLHTTP, all of that
> information is already available. I just need to know how to hook up
> the wires between ServerXMLHTTP and URLMoniker...

http://groups.google.com/groups?threadm=031f01c3780b%24b860f080%24a301280a%40phx.gbl
http://groups.google.com/groups?threadm=%23AIMBmRBEHA.3712%40tk2msftngp13.phx.gbl
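
As a rough picture of the BINDSTATUS_MIMETYPEAVAILABLE notification mentioned in this post, the sketch below models a data source that already holds fetched content and announces its MIME type before handing over the bytes. Everything here is a hypothetical, portable stand-in: BindSink, BindStatus, DeliverFetchedContent and RecordingSink are invented for illustration and are not the COM interfaces, and the assumption that the MIME type notification precedes the data is part of the model rather than something verified against urlmon.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the BINDSTATUS enumeration in urlmon.h.
enum class BindStatus { MimeTypeAvailable };

// Hypothetical stand-in for the callback the consumer (MSHTML) provides.
struct BindSink {
    virtual ~BindSink() = default;
    virtual void OnProgress(BindStatus status, const std::string& text) = 0;
    virtual void OnDataAvailable(const std::string& chunk) = 0;
};

// Hypothetical data source wrapping content fetched elsewhere (for example
// by ServerXMLHTTP): it reports the MIME type first, then the body, so the
// consumer does not need to sniff the content.
void DeliverFetchedContent(BindSink& sink,
                           const std::string& contentType,
                           const std::string& body)
{
    assert(!contentType.empty());  // the model assumes the type is known
    sink.OnProgress(BindStatus::MimeTypeAvailable, contentType);
    sink.OnDataAvailable(body);
}

// Hypothetical consumer that records the notifications in arrival order.
struct RecordingSink : BindSink {
    std::vector<std::string> events;

    void OnProgress(BindStatus, const std::string& text) override
    {
        events.push_back("mime:" + text);
    }
    void OnDataAvailable(const std::string& chunk) override
    {
        events.push_back("data:" + chunk);
    }
};
```

Passing a RecordingSink to DeliverFetchedContent with "text/html" and a body that starts with <META> tags yields a "mime:text/html" event followed by the data event, so in this model the consumer knows the type before it sees any content.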

Yan-Hong Huang[MSFT]

Mar 22, 2004, 12:31:53 AM
Hi Mark,

I agree with Igor here. The MSHTML editing platform is part of Microsoft
Internet Explorer. The editing platform is built on MSHTML, Internet
Explorer's HTML parsing and rendering engine. The MSHTML editing platform
provides a rich set of text editing and Web authoring features, enabling
host applications to support a fully WYSIWYG HTML editing experience.

For more information on it, please refer to the MSDN article:
"The MSHTML Editing Platform in Internet Explorer 5.5"
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnmshtml/html/mshtmleditplatf.asp

Thanks.

Mikko Noromaa

Mar 23, 2004, 2:35:15 PM
Hi,

> > My temporary file has a .tmp extension and still MSHTML is able to
> > recognize it as HTML through IPersistFile but not through
> > IPersistStreamInit.
>
> You sure? Strange. Maybe it passes a larger initial chunk to
> FindMimeFromData when reading from a file, though I'm not sure why it
> would do that.

Yes, I'm sure about this. As mentioned, I used the temporary file to work
around the initial problem described in KB323569. I haven't done any
extensive testing to see if the problem reappears if there are larger
sections of SCRIPT content before the <HTML> tag. If it does, perhaps I can
fix it by renaming my temporary file to .html.

Yan-Hong Huang[MSFT]

Mar 23, 2004, 10:40:28 PM
Hi Mark,

How is everything going? Do you have any more concerns about the link that
Igor provided?
http://groups.google.com/groups?threadm=uK%24I6nz5CHA.2364%40TK2MSFTNGP10.phx.gbl

If the problem is still there, please feel free to post here.
