A year and a half ago, I made a COM object to get UI-less access to the MSHTML DOM object. Back then I had a dev box with IE 5.5 on it. To get my object to work, I basically grafted the MSDN WalkAll code sample
(http://msdn.microsoft.com/downloads/samples/internet/default.asp?url=/downloads/samples/internet/browser/walkall/default.asp)
with the MSDN code snippet on how to load MSHTML from a stream
(http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/webbrowser/WebBrowser.asp)
My COM object has methods to load the contents of MSHTML::IHTMLDocument2 from a URL or from a stream (the latter taken directly from the MSDN article, where it queries for the IPersistStreamInit interface and calls Load() with a stream). The LoadUrl method works and I thought the LoadStream method *used* to work, but I haven't looked at this in over a year.
Now when I run it, LoadUrl works but LoadStream doesn't. It doesn't throw any errors, but the UI-less parser wrapper seems to get invoked differently depending on how the load is performed.
Does the load-from-stream example work? Are there some other things I need to hook up to get it to work?
Thanks
-mark
Based on my understanding, the problem is that the COM component uses the
IPersistStreamInit interface to load from a stream, but this no longer
seems to work. Is that right?
MSDN has a short code sample that shows this. Please refer to:
http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/webbrowser/tutorials/webocstream.asp
The sample code is:
HRESULT LoadWebBrowserFromStream(IWebBrowser* pWebBrowser, IStream* pStream)
{
    HRESULT hr;
    IDispatch* pHtmlDoc = NULL;
    IPersistStreamInit* pPersistStreamInit = NULL;
    // Retrieve the document object.
    hr = pWebBrowser->get_Document( &pHtmlDoc );
    if ( SUCCEEDED(hr) )
    {
        // Query for IPersistStreamInit.
        hr = pHtmlDoc->QueryInterface( IID_IPersistStreamInit,
                                       (void**)&pPersistStreamInit );
        if ( SUCCEEDED(hr) )
        {
            // Initialize the document.
            hr = pPersistStreamInit->InitNew();
            if ( SUCCEEDED(hr) )
            {
                // Load the contents of the stream.
                hr = pPersistStreamInit->Load( pStream );
            }
            pPersistStreamInit->Release();
        }
        // Release the document reference obtained from get_Document.
        pHtmlDoc->Release();
    }
    // Report the result of the last step attempted.
    return hr;
}
For a complete sample, see also the Microsoft KB article:
"BUG: PersistStreamInit::Load() Displays HTML Files as Text"
http://support.microsoft.com/?id=323569
Though it mainly describes a known issue, the repro sample in it can still
be used to achieve what we need.
Thanks very much. If there is anything unclear, please feel free to post
here.
Best regards,
Yanhong Huang
Microsoft Community Support
Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
Just recently I had a very similar problem. Code that (I think) worked
earlier doesn't work anymore. I was using IPersistStreamInit to load HTML
from memory. I then checked the ReadyState property in a loop, and it was
returning 1 ("loading") all the time.
I tracked the problem down to my CoInitialize() call. The plain old
CoInitialize(NULL) didn't work but when I replaced it with the following,
everything started working fine:
CoInitializeEx(NULL,COINIT_MULTITHREADED);
I guess this is a new requirement of the MSHTML component that ships with
IE6. I never made any checks to confirm this, though.
Soon after fixing this, I was hit by the problem mentioned by Yan-Hong:
"BUG: PersistStreamInit::Load() Displays HTML Files as Text"
http://support.microsoft.com/?id=323569
This caused the file not to be parsed as HTML, but instead to be treated
as a single block of <PRE> text by IHTMLDocument2.
Because of this, I ended up writing my HTML to a temporary file and loading
it from there with IPersistFile (an approach I very much hate). And before
you ask, no, you cannot use FILE_FLAG_DELETE_ON_CLOSE on the temporary file.
IPersistFile would have to use the FILE_SHARE_DELETE flag for this to work.
--
Mikko Noromaa (mik...@excelsql.com)
- SQL in Excel, check out ExcelSQL! - see http://www.excelsql.com -
"Mark" <msdno...@lycos-inc.com> wrote in message
news:82C6EE41-FA9F-4743...@microsoft.com...
Well, you expect heroic efforts from a piece of software. How would you
definitively identify a piece of text as HTML content? MSHTML MIME
sniffing looks at the first 256 bytes of the text (IIRC) and expects to
see <html> there. If it does not, it's treated as a plain text file, not
HTML content. Well actually, when you download from the server or from a disk
file, there are two additional pieces of information that MSHTML takes
into account: the Content-Type sent by the server, and file extension
(.htm or .html strongly suggest the content is HTML). But when you load
content from a memory stream, you don't have this extra data and have to
rely on content alone.
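
To make the content-only case concrete, here is a minimal sketch of the kind of check described above. It is not MSHTML's actual algorithm: LooksLikeHtml is a hypothetical helper, and the 256-byte window is simply the figure recalled (IIRC) in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of MSHTML or URLMon: returns true when an
// "<html" tag starts and ends within the first `window` bytes of `content`.
// The 256-byte default is the figure quoted from memory in the post.
bool LooksLikeHtml(const std::string& content, std::size_t window = 256)
{
    assert(window > 0);

    // Only the leading window is inspected; everything after it is ignored.
    std::string head = content.substr(0, window);

    // Lower-case the window so that <HTML> and <html> both match.
    std::transform(head.begin(), head.end(), head.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    return head.find("<html") != std::string::npos;
}
```

Under this sketch, a buffer that begins with <HTML> passes, while the same markup preceded by 300 bytes of other text does not, which would be consistent with the "displayed as text" symptom reported earlier in the thread.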
However, see this
http://groups.google.com/groups?threadm=031f01c3780b%24b860f080%24a301280a%40phx.gbl
This approach allows you to send a fake URL along with the content. Make
this URL refer to a dummy file with .html extension. With a bit more
work, it is possible to imitate sending Content-Type header, too - see
http://groups.google.com/groups?threadm=%23AIMBmRBEHA.3712%40tk2msftngp13.phx.gbl
--
With best wishes,
Igor Tandetnik
"For every complex problem, there is a solution that is simple, neat,
and wrong." H.L. Mencken
Thanks
-mark
Of course not. You are loading from memory, there is no network activity
involved.
Sorry, my bad. Somehow I thought you were still talking about
IPersistStreamInit.
Yes, when you create a URL Moniker, you use a URLMon library that
ultimately uses WinInet. And yes, WinInet is not supported for use in
server-side applications - see KB Article KB238425 "INFO: WinInet Not
Supported for Use in Services". For that matter, neither is MSHTML - see KB
Article KB244085 "PRB: Parsing HTML on Server Using Internet Explorer
Components"
If I go through the URLMoniker method and let urlmon go do the fetch, mshtml knows to parse the return as html regardless of whether <html> is in the first 256 bytes. What's URLMoniker doing differently/exposing differently that allows mshtml to proceed with confidence and not do mime detection?
Thanks
-mark
> Well, you expect heroic efforts from a piece of software. How would you
> definitively identify a piece of text as HTML content?
I wouldn't call it very heroic for an HTML parser to recognize even
malformed HTML content as HTML... If I were to implement an HTML parser, I'd
assume HTML format and make plain text a rare exception.
Besides, this still doesn't explain why MSHTML recognizes the file better
through IPersistFile than through IPersistStreamInit:
> Well actually, when you download from the server or from a disk
> file, there are two additional pieces of information that MSHTML takes
> into account: the Content-Type sent by the server, and file extension
> (.htm or .html strongly suggest the content is HTML).
My temporary file has a .tmp extension and still MSHTML is able to recognize
it as HTML through IPersistFile but not through IPersistStreamInit.
It helps with that, too. MSHTML now has an additional piece of
information - the file extension - and would use it to correctly
determine the MIME type.
> If I go through the URLMoniker method and let urlmon go do the fetch,
> mshtml knows to parse the return as html regardless of whether <html>
> is in the first 256 bytes. What's URLMoniker doing
> differently/exposing differently that allows mshtml to proceed with
> confidence and not do mime detection?
Reporting Content-Type sent by the server - that's a very strong
indication given a high weight in MSHTML's internal algorithm. Also, the
URL is available and the file extension may provide an additional hint.
See FindMimeFromData for details.
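
As a rough illustration of how those hints might combine, the sketch below picks a MIME type by checking the server's Content-Type first, then the URL extension, then falling back to whatever content sniffing concluded. The strict priority order, the PickMimeType name, and its parameters are assumptions made for illustration; the real weighting inside FindMimeFromData is more involved and is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helpers for this sketch only.
static std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

static bool EndsWith(const std::string& s, const std::string& suffix)
{
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// contentType: Content-Type reported by the server, empty if there was none.
// url:         URL (real or fake) supplied with the content, empty if none.
// sniffedType: what inspecting the bytes alone suggested, e.g. "text/plain".
std::string PickMimeType(const std::string& contentType,
                         const std::string& url,
                         const std::string& sniffedType)
{
    assert(!sniffedType.empty());

    // 1. Assumed strongest hint: the server's Content-Type, with any
    //    parameters such as "; charset=utf-8" stripped off.
    std::string type = ToLower(contentType.substr(0, contentType.find(';')));
    while (!type.empty() && type.back() == ' ')
        type.pop_back();
    if (!type.empty())
        return type;

    // 2. Next hint: a .htm or .html extension on the URL.
    const std::string lowerUrl = ToLower(url);
    if (EndsWith(lowerUrl, ".htm") || EndsWith(lowerUrl, ".html"))
        return "text/html";

    // 3. Nothing else to go on: rely on the content alone.
    return sniffedType;
}
```

In this model, loading from a bare memory stream corresponds to calling PickMimeType with an empty contentType and an empty url, so only the sniffed type is left, whereas a fake URL ending in .html is enough to tip the result to text/html.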
What if you were to implement a parser that accepts plain text, HTML,
XML, CSS, JScript, VBScript, GIF, JPEG, ... How would you determine even
which "subparser" to flow the content through? Would you try to parse
through all of them, and pick the one that produces fewest errors or
something?
Also, where to draw the line between a badly malformed HTML, and a plain
text that resembles HTML somewhat? Say, how would you interpret this:
One line of plain text comment
<html>
<!-- 100 KB of well-formed HTML -->
</html>
Is this plain text or HTML file? Does the answer change if there are two
lines of text, and 50KB of HTML? 4 lines of text and 25KB of HTML? 50KB
of text and 50KB of HTML? You see what I'm getting at.
> Besides, this still doesn't explain why MSHTML recognizes the file
> better through IPersistFile than through IPersistStreamInit:
>
>> Well actually, when you download from the server or from a disk
>> file, there are two additional pieces of information that MSHTML
>> takes into account: the Content-Type sent by the server, and file
>> extension (.htm or .html strongly suggest the content is HTML).
>
> My temporary file has a .tmp extension and still MSHTML is able to
> recognize it as HTML through IPersistFile but not through
> IPersistStreamInit.
You sure? Strange. Maybe it passes a larger initial chunk to
FindMimeFromData when reading from a file, though I'm not sure why it
would do that.
I believe it's deprecated in favor of MSHTML editing support
(designMode="on"). I'm not very familiar with it though. Anyway, it is
about WYSIWYG HTML editing by end user, not about HTML parsing.
IBindStatusCallback::OnProgress(BINDSTATUS_MIMETYPEAVAILABLE).
> I've been just reading the URLMoniker pages trying to figure that out
> and have yet to find the connection. FindMimeFromData is a function
> of URLMoniker but it's not clear from the documentation if it's
> invoked as a private function internally or externally.
URL Moniker calls it in order to generate appropriate
BINDSTATUS_MIMETYPEAVAILABLE notification.
> This may be a vain hope, but it seems like what i would really like
> is a way to create an URLMoniker on an existing stream instead of
> having it go ahead and do the fetch. Or find a way to derive a class
> from URLMoniker using an existing ServerXMLHTTP instance as the data
> source. If the fetch is already done with ServerXMLHTTP, all of that
> information is already available. I just need to know how to hook up
> the wires between ServerXMLHTTP and URLMoniker...
http://groups.google.com/groups?threadm=031f01c3780b%24b860f080%24a301280a%40phx.gbl
http://groups.google.com/groups?threadm=%23AIMBmRBEHA.3712%40tk2msftngp13.phx.gbl
I agree with Igor here. The MSHTML editing platform is part of Microsoft
Internet Explorer. The editing platform is built on MSHTML, Internet
Explorer's HTML parsing and rendering engine. The MSHTML editing platform
provides a rich set of text editing and Web authoring features, enabling
host applications to support a fully WYSIWYG HTML editing experience.
For more information on it, please refer to MSDN article:
"The MSHTML Editing Platform in Internet Explorer 5.5"
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnmshtml/html/mshtmleditplatf.asp
Thanks.
> > My temporary file has a .tmp extension and still MSHTML is able to
> > recognize it as HTML through IPersistFile but not through
> > IPersistStreamInit.
>
> You sure? Strange. Maybe it passes a larger initial chunk to
> FindMimeFromData when reading from a file, though I'm not sure why it
> would do that.
Yes, I'm sure about this. As mentioned, I used the temporary file to work
around the initial problem described in KB323569. I haven't done any
extensive testing to see if the problem reappears if there are larger
sections of SCRIPT content before the <HTML> tag. If it does, perhaps I can
fix it by renaming my temporary file to .html.
How is everything going? Do you have any more concerns on the link that
Igor provided?
http://groups.google.com/groups?threadm=uK%24I6nz5CHA.2364%40TK2MSFTNGP10.phx.gbl
If the problem is still there, please feel free to post here.