Skipping parts of a serialized file

Scoots

Mar 12, 2009, 2:30:18 PM

I'm sure this problem has been encountered before, but I haven't been
able to turn up anything in searches. It's fairly simple. I'm
decoding a file produced by another application (entirely legal
circumstances: the machine and application are provided for us, as
well as the source code). The file produced has a minimum of about
3 KB of runtime information from serializing a CDocument. However,
they don't serialize any of the information in that document; it is
written out in a stable file format. In fact, I don't even know why
they use serialization in the first place, as this is an "exported"
file format, and the only thing serialized is the CDocument itself.
But they do.

So I wind up with a bunch of information at the top of the file that
my application (which reads the files) completely doesn't care about.
Reading the rest of the information is really easy, and I've already
taken care of it. However, my trouble comes when it seems that this
initial runtime class information is NOT a fixed size. I haven't been
able to find any information at, say the top of the file, that
contains the size of this runtime information.

Is this size contained anywhere (if so, I haven't found it yet), or is
there an easy way to skip this runtime information so I can get to
what I need?

Thanks,
~Scoots

r norman

Mar 12, 2009, 2:35:06 PM

On Thu, 12 Mar 2009 11:30:18 -0700 (PDT), Scoots <linki...@msn.com>
wrote:

You say you have source information so just look at the code that
writes the file. Then read it with "inverse" code that just does
everything in exactly the same order with exactly the same data
structures but just backwards, reading instead of writing.
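
A minimal sketch of that write/read symmetry in portable C++ (not MFC). The Record type and the function names are hypothetical and exist only to show the idea: the reader consumes the same fields, with the same sizes, in the same order the writer produced them.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical record, used only to illustrate the idea.
struct Record {
    std::uint8_t kind;
    std::int32_t x;
    std::int32_t y;
};

// Writer: appends the fields to a byte buffer in a fixed order.
void writeRecord(std::vector<std::uint8_t>& out, const Record& r) {
    auto put = [&out](const void* p, std::size_t n) {
        const std::uint8_t* b = static_cast<const std::uint8_t*>(p);
        out.insert(out.end(), b, b + n);
    };
    put(&r.kind, sizeof r.kind);
    put(&r.x, sizeof r.x);
    put(&r.y, sizeof r.y);
}

// Reader: consumes the same fields, in the same order, with the same sizes.
// Advances pos past the record; returns false if the buffer is too short.
bool readRecord(const std::vector<std::uint8_t>& in, std::size_t& pos, Record& r) {
    auto get = [&in, &pos](void* p, std::size_t n) {
        if (pos > in.size() || in.size() - pos < n) return false;
        std::memcpy(p, in.data() + pos, n);
        pos += n;
        return true;
    };
    return get(&r.kind, sizeof r.kind)
        && get(&r.x, sizeof r.x)
        && get(&r.y, sizeof r.y);
}
```

Writing field by field, rather than dumping the whole struct, keeps compiler padding out of the file, so the reader does not depend on the writer's struct layout.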

Scoots

Mar 12, 2009, 2:54:43 PM

Unfortunately, we are NOT using CDocuments, nor do I have permission
to use the code in our own code (it's also massive). It is provided
to help us understand the graphics file format (the normal saved file
is done entirely with serialization, we just realized that this
exported file has all the information we need). So while that would
very easily solve the problem, I can't do that.

Oh, and the application does NOT read the files, so I don't have that
code to see how they solved it.

I've dug into the code (the machine does have VS2008 installed on it),
and I've traced it through the following code:

(All of this code is MFC)
COleDocument::OnSaveDocument
    (creates through StgCreateDocfile, which may write some of
    the information, I'm not sure)

    OnSaveDocument calls SaveToStorage()
        calls COleStreamFile::CreateStream with the name "Contents".
        This definitely shows up in that junk at the top of the file.

These are the only locations that I could possibly see writing
anything, as after that is the call to Serialize (which they have
overridden).

Unless the serialization is stored in memory and the Commit call in
OnSaveDocument is doing the work.

Like you, my experience with serialization is when the same
application is opening and saving so I haven't delved into this level
of detail on what exactly is going on. But since I can't do that in
this case (legally or practically), is there any other information
held in that mess that's useful?

Here is a sample of that header in ASCII (sorry, it doesn't appear to
support Unicode). But perhaps the keywords Root Entry and Contents can
help us sort it out.

ÐÏ à¡± á> þÿ
þÿÿÿþÿÿÿ [long run of ÿ (0xFF) bytes] ýÿÿÿþÿÿÿ [long run of ÿ bytes]
Root Entry ÿÿÿÿÿÿÿÿÿÿÿÿþÿÿÿ [run of ÿ bytes]
Root Entry ÿÿÿÿÿÿÿÿ Àª7‡ £žÉ þÿÿÿ
Contents ÿÿÿÿÿÿÿÿÿÿÿÿþÿÿÿ [run of ÿ bytes] þÿÿÿýÿÿÿ [long run of ÿ bytes]

Scoots

Mar 12, 2009, 3:23:55 PM

For comparison, here is one that has extra information:
And note that this is all generated by MFC code, not the application
source code.

ÐÏ à¡± á> þÿ
þÿÿÿ [long run of ÿ (0xFF) bytes] ýÿÿÿþÿÿÿ [long run of ÿ bytes]
Root Entry ÿÿÿÿÿÿÿÿÿÿÿÿþÿÿÿ [run of ÿ bytes]
Root Entry ÿÿÿÿÿÿÿÿ Ð Ð¨Ï É € Contents ÿÿÿÿÿÿÿÿÿÿÿÿ
{ ÿÿÿÿ [run of ÿ bytes] þÿÿÿýÿÿÿþÿÿÿ þÿÿÿ [long run of ÿ bytes]

þÿÿÿ [long run of ÿ bytes]

Thanks,
~Scoots

Joseph M. Newcomer

Mar 12, 2009, 10:25:44 PM

Serialization is one of those horrible techniques that, if Microsoft hadn't already implemented it that way, no one would bother to try. Serializing data in the presence of schema evolution is very difficult, and most of us who do it have written our own serializing code. I first faced this problem in 1977, and we addressed a number of complex issues that the Microsoft serialization ignored entirely, or got completely wrong. Your header-of-the-file problem is an example of this.

Generally, in serialization, there are two kinds of contents you need to be concerned with: data, and metadata. The data is the stuff you actually want to use. The metadata is data-about-data; for example, when you serialize a CString, the string length is written first. The string length is irrelevant from the viewpoint of the content of the file, but essential for the reading process.

Another piece of metadata is "relocation" information for pointers. I don't know how Microsoft does this in serialization, but it could be part of the large header you are seeing.

On Thu, 12 Mar 2009 11:54:43 -0700 (PDT), Scoots <linki...@msn.com> wrote:

>Unfortunately, we are NOT using CDocuments, nor do I have permission
>to use the code in our own code (it's also massive). It is provided
>to help us understand the graphics file format (the normal saved file
>is done entirely with serialization, we just realized that this
>exported file has all the information we need). So while that would
>very easily solve the problem, I can't do that.
>Oh, and the application does NOT read the files, so I don't have that
>code to see how they solved it.
>I've dug into the code (the machine does have VS2008 installed on it),
>and I've traced it through the following code:
>(All of this code is MFC)
>COleDocument::OnSaveDocument
> (creates through StgCreateDocfile, which may write some of
>the information, I'm not sure)
> OnSaveDocument calls SaveToStorage()
> calls COleStreamFile::CreateStream with the
>name "Contents". This definitely shows up in that junk at the top of
>the file.
>These are the only locations that I could possibly see writing
>anything, as after that is the call to Serialize (which they have
>overridden).
>Unless the serialization is stored in memory and the Commit call in
>OnSaveDocument is doing the work.

****
The Serialize call is what you need to study, because their overrides are what are defining the file format!
****

>Like you, my experience with serialization is when the same
>application is opening and saving so I haven't delved into this level
>of detail on what exactly is going on. But since I can't do that in
>this case (legally or practically), is there any other information
>held in that mess that's useful?
>Here is a sample of that header in ASCII (sorry, it doesn't appear to
>support Unicode). But perhaps the keywords Root Entry and Contents can
>help us sort it out.
>Root Entry ... Root Entry ... Contents
><rest of the file is useful information>

*****
It is amazingly tedious to do, but what I do is try to convert this to binary (hex, actually). For example

    000000 D0 CF ?? E0 //
    000004 A1 B1 ?? E1 //

There are a number of tools that will decode this (I couldn't decode the splodges because they represent any unprintable character). For example, use VS to open the file in binary mode. I find printing a copy and using several colors of highlighting pen (as well as having a handy hex-to-decimal calculator nearby) to be effective. The ÿ, for example, is 0xFF, and is probably filler. By reading the serialize code you can figure out what was intended to be written.

The real issue arises if an OLE object is serialized, because that serialization is private to the object, and can be far more difficult to decode. That is, if the code contains

    ar << thing;

the ThingType::Serialize method is invoked; unless you have the ThingType serialization code available, you may not be able to find out what is in it. Typically there is a version number (although this is optional), followed by a length word, followed by that many bytes, etc., but it is sometimes not this simple.

I've had to reverse-engineer far too much serialized data over the years, and sometimes brute force is all that is possible. Look for things you recognize and try to figure out the metadata that surrounds them. But there's no silver bullet here. Start by reading the serialize code and see if you can understand each thing that is written out. For example, if there is a CString value

    ar << stringvalue;

you will find

    05 00 00 00 'A' 'B' 'C' 'D' 'E' 00

Note the 00 is not the terminal NUL character, but just a filler (if I recall CString::Serialize correctly) to the next WORD boundary (watch out for filler bytes like this...it may be a DWORD. It has been too many years since I last looked at this).
****

Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
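
A minimal sketch of the convert-to-hex step in portable C++, for anyone who prefers a few lines of code to a binary editor. The function names hexDump and hasCompoundFileSignature are made up for this example. As a known reference point, files written through OLE structured storage begin with the eight bytes D0 CF 11 E0 A1 B1 1A E1, which matches the "ÐÏ à¡± á" visible at the top of the samples posted earlier in the thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Formats a buffer as "offset  hex bytes" lines, 16 bytes per line.
std::string hexDump(const std::vector<std::uint8_t>& data) {
    std::string out;
    char field[16];
    for (std::size_t i = 0; i < data.size(); i += 16) {
        std::snprintf(field, sizeof field, "%06zX ", i);
        out += field;
        for (std::size_t j = i; j < i + 16 && j < data.size(); ++j) {
            std::snprintf(field, sizeof field, " %02X",
                          static_cast<unsigned>(data[j]));
            out += field;
        }
        out += '\n';
    }
    return out;
}

// True if the buffer starts with the OLE compound file signature.
bool hasCompoundFileSignature(const std::vector<std::uint8_t>& data) {
    static const std::uint8_t sig[8] =
        {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    if (data.size() < 8) return false;
    for (std::size_t k = 0; k < 8; ++k)
        if (data[k] != sig[k]) return false;
    return true;
}
```

Reading the file into a std::vector<std::uint8_t> in binary mode and passing it to hexDump gives output that can be printed and marked up as described above.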

Tom Serface

Mar 13, 2009, 12:04:16 AM

If you can recognize the beginning of the useful data, you could just skip
past most of the header, then keep reading until something you recognize
(that is, not one of those funny chars) shows up, and start reading there.
Sans any useful information, that's probably the best you could do.

Tom

"Scoots" <linki...@msn.com> wrote in message
news:5fdecb80-5f33-4ab5...@v13g2000pro.googlegroups.com...

Goran

Mar 13, 2009, 1:14:32 AM

On Mar 12, 7:30 pm, Scoots <linkingf...@msn.com> wrote:
>... The file produced has a minimum of about 3kb's

> of runtime information from serializing a CDocument. However, they
> don't serialize any of the information in that Document, it is written
> out in a stable file format.

How about asking for/creating a function (you say you have all
sources) that can read up to the part you can read? Then you can
continue from that known point, in known waters. A DLL is probably
the best solution, but be aware of caveats: for example, your code
must be on the same MFC release as the DLL.

To create such a function, you could envisage splitting the document
class into base one that only reads data and the full-blown one used
in the external app. (But this may be hard, as document classes tend
to meddle everywhere in the app, and, if one is allowed an opinion,
for dubious reasons).

HTH,
Goran.

P.S. I disagree with Joseph on the general usefulness of MFC
serialization. Sure, one can do better, but MFC does correctly what it
does (e.g. serializing MFC stuff like strings and containers, backward-
compatible object versioning, storing/retrieving of object
references). There's no forward compatibility, though (that is, old code
being able to read newer files), which is probably Joe's primary complaint. There
are other issues I've seen, too, but hey, life ain't perfect! ;-)

Anthony Wieser

Mar 13, 2009, 3:09:12 AM

Goran wrote:
> P.S. I disagree with Joseph on the general usefulness of MFC
> serialization. Sure, one can do better, but MFC does correctly what it
> does (e.g. serializing MFC stuff like strings and containers,
> backward-compatible object versioning, storing/retrieving of object
> references). There's no forward-compatibility, though (old code can
> read newer files), which is probably Joe's primary complaint. There
> are other issues I've seen, too, but hey, life ain't perfect! ;-)

I too find serialization very useful in my apps, and with some effort you
can also support forward compatibility to a degree, though it does require
you to put in extra data into the stream that stores the size of the objects
you've just written, and you occasionally have to rewind the stream to write
the sizes back into the stream.
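
The size-prefix idea described above can be sketched in portable C++. This is not MFC's CArchive: a byte vector stands in for the stream, and the helper names beginObject, endObject, and skipObject are made up for illustration. The writer reserves a 4-byte slot, writes the object, then goes back and fills in the size; an older reader that does not understand the object can use that size to step over it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends a 32-bit value in little-endian byte order.
void putU32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

// Reads a little-endian 32-bit value at pos (caller guarantees 4 bytes exist).
std::uint32_t getU32(const std::vector<std::uint8_t>& in, std::size_t pos) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    return v;
}

// Reserves a 4-byte size slot and returns its position.
std::size_t beginObject(std::vector<std::uint8_t>& out) {
    std::size_t slot = out.size();
    putU32(out, 0);  // placeholder, patched by endObject
    return slot;
}

// "Rewinds" to the slot and stores the number of payload bytes written since.
void endObject(std::vector<std::uint8_t>& out, std::size_t slot) {
    std::uint32_t size = static_cast<std::uint32_t>(out.size() - slot - 4);
    for (int i = 0; i < 4; ++i)
        out[slot + i] = static_cast<std::uint8_t>(size >> (8 * i));
}

// A reader that does not understand an object skips it using the stored size.
// Returns the position of whatever follows the object.
std::size_t skipObject(const std::vector<std::uint8_t>& in, std::size_t pos) {
    return pos + 4 + getU32(in, pos);
}
```

With a real file stream the patch step is a seek back to the slot, a write of the size, and a seek forward again, which is the rewinding mentioned above.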

It's also very useful as a private format for supporting undo/redo.

Anthony Wieser
Wieser Software Ltd

Tom Serface

Mar 13, 2009, 10:45:35 AM

I agree that these kinds of files are very handy for things like undo and
redo and other temp files you may use in your applications since they are so
easy to set up. The bigger issue for me is reading the file after the fact.

Tom

"Anthony Wieser" <newsgroup...@wieser-software.com> wrote in message
news:uEpmCr6o...@TK2MSFTNGP02.phx.gbl...

Tom Serface

Mar 13, 2009, 10:44:09 AM

The biggest problem I have with serialization in the traditional MFC sense
is that it is very difficult for others to know how to read the files. It
is also very difficult, though not impossible by any means, to do
versioning. It is handy for temp files, but for something that may need to
persist I'd rather use some other, more easily distinguishable format, like
INI or XML, even if it takes more time to parse. Almost all formats have
forward compatibility issues, but I just find serialized files difficult to
work with. Of course, that's just an opinion :o)

Tom

"Goran" <goran...@gmail.com> wrote in message
news:202d0b3a-75fe-4beb...@a12g2000yqm.googlegroups.com...

Joseph M. Newcomer

Mar 13, 2009, 11:22:25 AM

There are many problems with the serialization as implemented. I know of very few
situations in which older software can read newer files (although we "solved" that in
1977, the solution addresses only a part of the problem, and cannot deal with structure
invariants and consequently cannot maintain forward-compatible output although it could
read newer files). I found that in general, writing my own serialization code ended up
with fewer problems than trying to use MFC's code, which takes a very naive view of the
world.
joe

Joseph M. Newcomer

Mar 13, 2009, 11:31:22 AM

These days, my preference is XML. XML has fewer problems than using binary files, and
huge advantages. Using techniques we developed in the 1970s, I once did an app that wrote
both an XML and a binary version of the file. If a program opened the binary version and
got a version skew reading the version ID, it fell back to reading the XML. The result
was that for stable files, input was about 10 times faster. Output showed about the same
ratio: writing the XML took 10 times longer than writing the binary, but output was also
substantially faster than input, so the roughly 10% added cost of also writing the binary
really paid off in program startup each time. We had the advantages of both a fast-input
representation and a textual representation.

Those who have access can check out the article on the techniques we developed in the 1979
time frame (the article was published in 1987), and more detail is available in our book
(1989)

http://portal.acm.org/citation.cfm?id=39309
http://www.amazon.com/Idl-Language-Implementation-Prentice-Hall-Software/dp/0134502140

The book, alas, is out of print. My XML version was an ad-hoc adaptation of the
principles described in these publications (if I had more time, and the client more money,
I would have done a binary representation driven off the DTD, but that would have been too
expensive for the project needs)
joe

Scoots

Mar 23, 2009, 9:07:41 AM

Thank you all for your responses. I actually just returned from an
extended vacation (in the Florida Keys, ahhhh 80 degrees....), so I
was unable to respond.

To reiterate, this application saves TWO files. One is a complex file
that relies entirely on serialization of complex types and bumps into
every problem with serialization you all have mentioned so far. This
is not a stable file format, as every time they change the data
structures in the slightest, the schema changes, and we don't know if
we will have access to future versions of the source code. This is
not the file type to use, but I do have the code that writes AND reads
this file. However, the application also writes a simpler file format
used by other applications and it is much more stable. It does use
serialization to trigger, which is why we get this file header issue,
but after that it is a very structured (and very easy to read) file
format. This file format I have already handled and is very simple.
It's just the header isn't always the same, or even the same size.

By following the reading path of the other graphic file, I've managed
to get into the MFC code that reads the COleDocument and found some
things that might be significant. The base header (the first one I
posted above) appears to be the COleDocument itself, and appears to be
a minimum of 2048 bytes. Even if the document is "empty", this gets
written. In looking at the loading algorithm for the COleDocument, it
hits LoadFromStorage(). The code is on another machine so I can't
just directly copy it here, but the LoadFromStorage (Line 703 in
oledoc1.cpp, for me) method DOES call COleStreamFile::OpenStream(...)
on the stream "Contents", which you will see in the headers I posted.
It then appears to serialize based on this stream position. There
must be a bit of automatic information in the stream, as there is no
"hokey pokey" to get the file pointer in the right position for the
serialization that follows.

Yes, Visual Studio's binary editor is the only reason I've been able
to get as far as I have in this file, but unfortunately I cannot copy
from the binary editor to here, otherwise I'd show the binary for the
header. It appears that this stream will get me close, but I still
haven't found what causes the file header to change (sometimes very
radically). The number of bytes following the Contents keyword is not
constant. If I delete all of the graphical elements in their
application and save, this extra header information disappears and I'm
left with the base header, and yet the very first thing that gets
serialized by their custom Serialize is a byte that is very easy to
pick out from the header information (in terms of pattern
recognition. None of those 0xFF's flying around.).

And yes, I do have their Serialize, Joseph (fortunately!). That's how
I've gotten as far as I have, and if I manually delete the header, my
code can read their file just fine. That, fortunately, was actually
fairly simple since they do NOT use any CStrings or more complex types
in the more stable of the two files. Mainly, they write out
structures, bytes, and chars, so it was a fairly simple process to
load.

The base header is fairly easy to pick out, as an empty page will
generate the COleDocument information and it always appears to be 2048
bytes. This is the first header I posted. However, this header is
sometimes significantly larger when data is present and I don't know
why. Their custom serialization does not write anything that should
cause this, and I haven't found anything in the COleDocument to
account for this. I'll keep plugging at it and post what I find.
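
For what it's worth, the strings "Root Entry" and "Contents" and the leading bytes in the samples are what an OLE compound file (structured storage, the format StgCreateDocfile produces) looks like. In that format a stream's data is stored in sectors that are located through a directory and allocation tables, so there is no fixed-size preamble to skip. A minimal sketch, assuming the standard compound file header layout, that reads a few fields from the first 512 bytes. The struct and function names are hypothetical, and in practice the Windows structured storage API does this lookup for you.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A few fields from the compound file header (names are illustrative).
struct CompoundHeader {
    std::uint16_t sectorShift;      // sector size is 1 << sectorShift (usually 512)
    std::uint16_t miniSectorShift;  // mini sector size is 1 << miniSectorShift (usually 64)
    std::uint32_t firstDirSector;   // sector number where the directory starts
    std::uint32_t miniStreamCutoff; // streams smaller than this live in the mini stream
};

static std::uint16_t le16(const std::vector<std::uint8_t>& b, std::size_t off) {
    return static_cast<std::uint16_t>(b[off] | (b[off + 1] << 8));
}

static std::uint32_t le32(const std::vector<std::uint8_t>& b, std::size_t off) {
    return static_cast<std::uint32_t>(b[off])
         | (static_cast<std::uint32_t>(b[off + 1]) << 8)
         | (static_cast<std::uint32_t>(b[off + 2]) << 16)
         | (static_cast<std::uint32_t>(b[off + 3]) << 24);
}

// Returns false if the buffer is too short or the signature does not match.
bool parseCompoundHeader(const std::vector<std::uint8_t>& file, CompoundHeader& h) {
    static const std::uint8_t sig[8] =
        {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    if (file.size() < 512) return false;
    for (std::size_t i = 0; i < 8; ++i)
        if (file[i] != sig[i]) return false;
    h.sectorShift      = le16(file, 0x1E);
    h.miniSectorShift  = le16(file, 0x20);
    h.firstDirSector   = le32(file, 0x30);
    h.miniStreamCutoff = le32(file, 0x38);
    return true;
}

// File offset of sector n; the header occupies the first sector-sized slot.
std::uint64_t sectorOffset(const CompoundHeader& h, std::uint32_t n) {
    return (static_cast<std::uint64_t>(n) + 1) << h.sectorShift;
}
```

The directory entries found from firstDirSector carry each stream's name, starting sector, and size, which is where the "Root Entry" and "Contents" strings in the dump come from.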

In the meantime, the CArchive has an m_bForceFlat member variable.
The COleDocument sets this to FALSE, but I haven't found documentation
on what behavior this causes. Anyone know?

Thanks again,
~Brian

Scoots

Mar 23, 2009, 9:50:36 AM

Following the suggestions of others, I added the lpstorage and
COleStreamFile code from the COleDocument loading algorithm, and this
appears to do the trick. I have not stress-tested this yet, but the
following appears to work:

bool Decoder::OpenAndReadPreamble(CString p_csFilename,
                                  COleStreamFile* p_pfr)
{
    // The preamble is completely and utterly unimportant. This gets written
    // even if (Name removed for confidentiality) doesn't even save the file!
    // It's runtime information left over from serialization.
    // We just... plain... don't care.

    USES_CONVERSION;  // needed by the T2COLE macro in non-Unicode builds

    // Note: the root storage is intentionally not released here, because
    // the stream opened on it must remain valid after this function returns.
    LPSTORAGE lpRootStg = NULL;

    // This is based on the COleDocument code for reading.
    TRY
    {
        if (lpRootStg == NULL)
        {
            LPCOLESTR lpsz = T2COLE(p_csFilename);

            // use STGM_CONVERT if necessary
            SCODE sc;
            LPSTORAGE lpStorage = NULL;
            if (StgIsStorageFile(lpsz) == S_FALSE)
            {
                // convert existing storage file
                sc = StgCreateDocfile(lpsz,
                    STGM_READWRITE|STGM_TRANSACTED|/*STGM_SHARE_EXCLUSIVE|*/STGM_CONVERT,
                    0, &lpStorage);
                if (FAILED(sc) || lpStorage == NULL)
                    sc = StgCreateDocfile(lpsz,
                        STGM_READ|STGM_TRANSACTED|/*STGM_SHARE_EXCLUSIVE|*/STGM_CONVERT,
                        0, &lpStorage);
            }
            else
            {
                // open new storage file (fall back to read-only on failure)
                sc = StgOpenStorage(lpsz, NULL,
                    STGM_READWRITE|STGM_TRANSACTED/*|STGM_SHARE_EXCLUSIVE*/,
                    0, 0, &lpStorage);
                if (FAILED(sc) || lpStorage == NULL)
                    sc = StgOpenStorage(lpsz, NULL,
                        STGM_READ|STGM_TRANSACTED/*|STGM_SHARE_EXCLUSIVE*/,
                        0, 0, &lpStorage);
            }
            if (FAILED(sc))
                AfxThrowOleException(sc);

            ASSERT(lpStorage != NULL);
            lpRootStg = lpStorage;
        }

        ASSERT(lpRootStg != NULL);

        // open Contents stream
        CFileException fe;
        if (!p_pfr->OpenStream(lpRootStg, _T("Contents"),
                CFile::modeRead|CFile::shareExclusive, &fe) &&
            !p_pfr->CreateStream(lpRootStg, _T("Contents"),
                CFile::modeRead|CFile::shareExclusive|CFile::modeCreate, &fe))
        {
            if (fe.m_cause == CFileException::fileNotFound)
                AfxThrowArchiveException(CArchiveException::badSchema);
            else
                AfxThrowFileException(fe.m_cause, fe.m_lOsError);
        }

        // load it with CArchive (loads from Contents stream)
        // Note: this archive is local and is destroyed at the end of the
        // TRY block; the caller keeps reading through p_pfr.
        CArchive loadArchive(p_pfr, CArchive::load |
            CArchive::bNoFlushOnDelete);
    }
    CATCH_ALL(e)
    {
        MessageBox(NULL, _T("Whoops"), _T("We did something bad."), MB_OK);
        return false;
    }
    END_CATCH_ALL

    return true;
}

Okay, so there is a fair amount of debugging left to do. Never mind
that the message box is completely uninformative or that I haven't
commented it; this appears to automatically handle the variable-size
header information. I'll let you know as I test it more.
~Scoots

Joseph M. Newcomer

Mar 23, 2009, 12:22:43 PM

Glad that it works. The way the file pointer advances when you read the embedded stream
is how serialization is ideally supposed to work: you move the current file
position forward until you hit an object of interest. Then you invoke its reader, which
understands its format, and you are left with the file position just past that object. It
is a tree-structured file, in that objects can themselves contain objects. The Good News
is that you don't have to know the implementation details of the objects you are reading;
their serializers will handle this for you. If I were doing XML, I'd have to use a
CDATA section (<![CDATA[ ... ]]>) to hold the unknown objects if I wanted them to be readable.

In my serialization, I start off with

[Version]
[File Length]
[File Version Info]

followed by a variety of objects such as

[Object type]
[Length]
[object values]*

where each object value is of the form

[Field type]
[length]
[bytes of data]
[padding]*

where padding is enough 0 bytes to get a DWORD alignment.

so a
typedef struct {
int n;
double d;
char x[80];
} SomeStruct;

would be

[SomeStructID]
[116]
[code for n]
[4]
[value of n]
[code for d]
[8]
[value of d]
[code for x]
[80]
[value of x]
[SomeStructID]
...same format as above

Note that I give each field a code, so I could change the header file, rearrange the
fields, and the reader automatically handles the assignment because the test is
essentially

if(fieldcode == code_for_n)
object->n = ReadIntValue(length of field);

and so on. This "tagged binary" representation is ancient; I saw it in specifications for
files in the late 1960s, and it was obviously well-established by that point.
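
A minimal sketch of the per-field part of such a tagged binary layout in portable C++, assuming 4-byte little-endian code and length words and zero padding to a 4-byte boundary. Under those assumptions the 116 in the example above works out as 92 bytes of field data plus three 8-byte code/length pairs. The numeric field codes and the helper names are hypothetical, and the outer object type/length wrapper is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

// Hypothetical field codes for the example struct.
enum : std::uint32_t { CODE_N = 1, CODE_D = 2, CODE_X = 3 };

struct SomeStruct {
    std::int32_t n;
    double d;
    char x[80];
};

// Rounds a length up to the next multiple of 4 (DWORD alignment).
std::size_t padded(std::size_t n) { return (n + 3) & ~static_cast<std::size_t>(3); }

void putU32(Bytes& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

std::uint32_t getU32(const Bytes& in, std::size_t pos) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    return v;
}

// Writes one field: [code][length][bytes][zero padding to a 4-byte boundary].
void putField(Bytes& out, std::uint32_t code, const void* data, std::uint32_t len) {
    putU32(out, code);
    putU32(out, len);
    const std::uint8_t* p = static_cast<const std::uint8_t*>(data);
    out.insert(out.end(), p, p + len);
    out.resize(out.size() + (padded(len) - len), 0);
}

// Reads fields until the buffer ends, assigning by code so that field order
// does not matter. Unknown codes (and known codes with an unexpected length)
// are skipped using the stored length. Returns false on a truncated field.
bool readSomeStruct(const Bytes& in, SomeStruct& s) {
    std::size_t pos = 0;
    while (pos < in.size()) {
        if (in.size() - pos < 8) return false;
        std::uint32_t code = getU32(in, pos);
        std::uint32_t len = getU32(in, pos + 4);
        pos += 8;
        if (in.size() - pos < padded(len)) return false;
        const std::uint8_t* p = in.data() + pos;
        if (code == CODE_N && len == sizeof s.n) std::memcpy(&s.n, p, len);
        else if (code == CODE_D && len == sizeof s.d) std::memcpy(&s.d, p, len);
        else if (code == CODE_X && len == sizeof s.x) std::memcpy(s.x, p, len);
        pos += padded(len);
    }
    return true;
}
```

Because the reader dispatches on the code and always advances by the stored length, rearranging the struct's fields or adding a field the reader has never heard of does not break older readers.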

The IDL system we did at CMU, and the LG (Linear Graph) system that preceded it, would
automatically generate the tables that drove the readers and writers. I've often thought
about how I might build binary reader/writer code from DTDs, but haven't pursued it, since
I've already done it twice, and both times better than XML could hope for.

We even stored complex structures using the equivalent of what Microsoft calls "based
pointers" for the binary representation, and we could handle forward-pointer resolution
when reading the text representation.

Since my latest project now takes too long to read in a 2MB XML file, I will probably add
a binary reader/writer in the near future.
joe