The plan is to work around this by modifying the algorithm to detect
files of the appropriate format and use knowledge of the PE spec to
CRC32 only the pertinent sections, skipping over timestamps. I've run
across the ImageHlp library which contains an interface called
MapFileAndCheckSum(), which gives rise to a few questions:
1. Is this "official" Win32 checksum a CRC32 or something else?
2. Does it have the smarts to avoid timestamps or is it just a blind
front-to-back checksum? Logic says it must skip at least part of the
header, the part where you're supposed to insert the checksum after
deriving it. I should make it clear that I have no intention or need of
inserting my CRC into the file; it will go in a database.
3. Is there existing open-source code for comparing PE files exclusive
of timestamp? It seems likely to have been done before but Google hasn't
found it for me yet.
4. Is there a way to tell the compiler not to insert timestamps? This
would be a poor workaround but would still be good to know about.
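
A minimal sketch of that plan, in Python for brevity. It assumes format detection has already happened and the input really is a COFF object or a PE image. It masks only the TimeDateStamp field of the COFF file header (4 bytes at offset 4 of that header). PE images carry further timestamps elsewhere, for example in the debug and export directories, and those are not handled here. The function names are made up for illustration.

```python
import struct
import zlib

# In the COFF file header, Machine (2 bytes) and NumberOfSections (2 bytes)
# precede the 4-byte TimeDateStamp field.
COFF_TIMESTAMP_OFFSET = 4


def coff_header_offset(data: bytes) -> int:
    """Return the offset of the COFF file header.

    Assumption: the caller already knows the data is a COFF .obj or a PE
    image. For a PE image the header follows the "PE\\0\\0" signature, which
    is located via e_lfanew at offset 0x3C of the DOS header.
    """
    if data[:2] == b"MZ" and len(data) >= 0x40:
        (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
        if data[e_lfanew:e_lfanew + 4] == b"PE\0\0":
            return e_lfanew + 4
    return 0  # plain COFF object: the header starts at byte 0


def crc32_without_timestamp(data: bytes) -> int:
    """CRC32 of the data with the COFF header timestamp zeroed out."""
    start = coff_header_offset(data) + COFF_TIMESTAMP_OFFSET
    if len(data) < start + 4:
        return zlib.crc32(data)  # too short to hold a timestamp field
    masked = bytearray(data)
    masked[start:start + 4] = b"\0\0\0\0"
    return zlib.crc32(bytes(masked))
```

Zeroing the field instead of skipping it keeps the byte offsets stable, so the result still depends on the file length.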
--
Thanks,
M.Biswas
Hey man,
If this was me, I would learn everything I could about the structure of
the PE file format and then write my own routines for working with
sections, calculating checksums and what not. I guess the possibilities
are really dictated by your knowledge of the PE format. Food for
thought. Not much help though =)
Kip
The type and quality of the hash is orthogonal to the issue I'm raising
here. I did mention that the CRC32 is an architectural requirement not
amenable to change. I hope you'll appreciate that changing such a thing
in midstream is quite difficult, for reasons like backward compatibility
and database schema etc.
And in response to a previous poster, of course writing my own code
based on the PE documentation is an option. But it would be madness to
do so if there's a supported Win32 API or a stable piece of freeware
that already handles it, hence my questions.
I asked specific questions in a bulleted form because those are the
questions I'm seeking answers to. Once I have those answers I'll know
what to do. FWIW I have answered the first two questions myself: based
on a test program, MapFileAndCheckSum() produces something which (a)
seems not to be a CRC32 and (b) changes with each recompile, thus
presumably not ignoring timestamps. It's a strange, monotonically
increasing, seemingly weak hash: as I recompile the test object I get
results like these:
CRC32 of hello.obj is 59144
CRC32 of hello.obj is 59150
CRC32 of hello.obj is 59184
CRC32 of hello.obj is 59188
CRC32 of hello.obj is 59223
No idea what algorithm this is but it doesn't seem like a lot of bits
are involved ...
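
One plausible explanation, offered as an assumption and not as a description of the actual ImageHlp code: the value behaves like a 16-bit additive sum with the carry folded back in, plus the file length. With a sum like that, changing a few timestamp bytes shifts the result by a small amount, which would match the closely spaced numbers above. The function names below are hypothetical.

```python
import struct


def folded_sum16(data: bytes) -> int:
    """Sum little-endian 16-bit words, folding any carry back into 16 bits."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("<H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total


def additive_checksum(data: bytes) -> int:
    """Assumed shape of the checksum: folded 16-bit sum plus file length."""
    return folded_sum16(data) + len(data)
```

Unlike a CRC32, a sum of this kind is insensitive to the order of the words, so it makes a weak fingerprint.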
--
Thanks,
M.Biswas
"Mohun Biswas" <m.bi...@invalid.addr> wrote in message
news:QiWRc.120815$eM2.64957@attbi_s51...
Sorry, monotonically was the wrong word. The remainder stands.
--
Thanks,
M.Biswas
Why do you think that the timestamp should be irrelevant? If two files
have different timestamps, they are different files.
>The plan is to work around this by modifying the algorithm to detect
>files of the appropriate format and use knowledge of the PE spec to
>CRC32 only the pertinent sections, skipping over timestamps. I've run
>across the ImageHlp library which contains an interface called
>MapFileAndCheckSum(), which gives rise to a few questions:
>
>1. Is this "official" Win32 checksum a CRC32 or something else?
That is the API called by the linker to compute the checksums that go into
PE files. The only place where they matter is when loading a kernel-mode
driver in NT/2K/XP; if the checksum doesn't match, the driver won't load.
Interesting tidbit: that API has a bug on 95/98, such that it will produce
the wrong results on a file that is an odd number of bytes long. It sucks
up one byte of random data beyond the end of the file.
>2. Does it have the smarts to avoid timestamps or is it just a blind
>front-to-back checksum? Logic says it must skip at least part of the
>header, the part where you're supposed to insert the checksum after
>deriving it. I should make it clear that I have no intention/need of
>inserting my CRC to the file; it will go in a database.
It does a dumb front-to-back checksum. That checksum is then inverted and
inserted into the PE file. MapFileAndCheckSum on the resulting file will
produce a 0, which means "checksum matches".
>3. Is there existing open-source code for comparing PE files exclusive
>of timestamp? It seems likely to have been done before but Google hasn't
>found it for me yet.
"Comparing" for what attributes? Your definition of "equal" is not at all
clear to me.
--
- Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc
Well, in that case, a CRC is not a good solution, since Debug vs Release
versions are functionally identical but binary different, and so are two exes
compiled from the same source with different compilers (at least
theoretically). Perhaps you should define better what you mean by
"functionally identical".
Arnaud
MVP- VC
> since Debug vs Release versions are functionally identical
Normally they are not functionally identical!
Debug versions should contain many "asserts" and additional checks...
So sometimes the function is also different (if your program is 100%
bug-free, then it should not be different; but this is theory).
--
Greetings
Jochen
I didn't use the word irrelevant but rather the phrase "functionally
identical". If two object files differ only in their timestamp they are
clearly functionally identical when run or linked.
> That is the API called by the linker to compute the checksums that go into
> PE files. The only place where they matter is when loading a kernel-mode
> driver in NT/2K/XP; if the checksum doesn't match, the driver won't load.
I know all this; it's documented in MSDN. But it only applies if the
checksum is subsequently implanted into the binary. I am (was)
interested in using it for an unintended purpose; as a standalone
digital fingerprint without ever modifying the binary file.
> "Comparing" for what attributes? Your definition of "equal" is not at all
> clear to me.
I thought I'd made it clear by specifying "functionally identical".
Perhaps that could be improved by adding "when used for the intended
purpose", i.e. linking for .obj files and executing for .exe and .dll
files. But the bottom line is that I consider two object files identical
iff they compare equal aside from timestamps.
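
That definition can be sketched directly as a byte comparison, in Python for brevity. The caller supplies the byte ranges to ignore. Where those ranges come from (for example, the offset of the COFF header timestamp) is left to the caller, and the function names are made up for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def differing_offsets(a: bytes, b: bytes) -> Optional[List[int]]:
    """Return the offsets at which a and b differ, or None if lengths differ."""
    if len(a) != len(b):
        return None
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def equal_aside_from(a: bytes, b: bytes,
                     ignored: Iterable[Tuple[int, int]]) -> bool:
    """True if every difference falls inside an ignored (start, length) range."""
    diffs = differing_offsets(a, b)
    if diffs is None:
        return False  # different sizes are never treated as identical
    skip = set()
    for start, length in ignored:
        skip.update(range(start, start + length))
    return all(offset in skip for offset in diffs)
```

For a COFF object whose only timestamp is the header field, the call would be `equal_aside_from(old, new, [(4, 4)])`. Any difference outside the listed ranges makes the files compare as different, which errs on the safe side.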
--
Thanks,
M.Biswas
> Hello Arnaud,
>
>
>>since Debug vs Release versions are functionally identical
>
>
> Normally they are not functionally identical!
> Debug-Versions should contain many "asserts" and additional checks...
Not to mention optimization differences, any code contained within
#ifdef DEBUG, etc.
In any case, for this application the definition of identical can be
clearly inferred from the original post: two object files are considered
identical if they compare equal aside from timestamps. This may be a
pessimistic algorithm, i.e. it may consider files different even when
the difference is insignificant, but it's safe and that's more important.
--
Thanks,
M.Biswas
This looked good for a while and I was excited, but doing a Google
groups search revealed a thread wherein someone had tried the exact same
thing for the exact same purpose and concluded that time stamps were
still included in the data provided by ImageGetDigestStream. Of course
he could have been wrong, so if you say you're sure timestamps aren't
present I'll give it a shot but otherwise I think it's a near miss.
--
Thanks,
M.Biswas
To say nothing about optimization, also.
--
-GJC [MS Windows SDK MVP]
-Software Consultant (Embedded systems and Real Time Controls)
- http://www.mvps.org/ArcaneIncantations/consulting.htm
-gcha...@mvps.org
If your makefile or build procedure is compiling or linking stuff
that's up-to-date, perhaps it needs some work?
--
Sev
This is an existing tool that works well today and has done so for
years. It is not a build system. It handles many many file types and it
already knows how to skip around embedded timestamps in other file
formats. It just happens that COFF and PE are the latest types to be
supported and I'm trying to find the best way to deal with timestamps
within them.
So far I've had 1 on-topic response and 5 which are merely ventings on
tangential topics. Please, people - either answer the questions I asked
or just hit Next.
--
Thanks,
M.Biswas
>Severian wrote:
Well hell, if your project were mine, I would be analyzing PE files,
determining where and how time stamps are stored, and building CRCs
(more likely MD5 digests) that skipped them, rather than expecting
that someone else had already solved my simple conundrum.
How is it 'tangential' for me to say that a decent build tool wouldn't
build an obj or exe unless the related source files have changed?
THAT'S HOW USEFUL BUILD TOOLS WORK!
--
Sev
What part of "it is not a build system" is hard for you to understand?
--
Thanks,
M.Biswas
Well, this is just one man's opinion, but in my view, the set of pairs of
functionally identical DLLs with differing timestamps is small enough as to
be ignorable. It just doesn't happen that often in real life.
>Severian wrote:
Well, *something* is creating these almost-identical files.
--
Sev
Example: ten developers working on the same branch of a SW product in
ten different folders and/or workstations may well produce ten files
which are identical aside from timestamps. Or, depending on environment
variables, registry entries, SDK versions, OS patch levels, etc., they
may not. This isn't a failure of the build tool, it's the build tool
working as designed.
Your attempts to impute stupidity to the *conception* of the problem are
simply a waste of bandwidth. For the last time - this is a stable
product. Issues such as you raise have long ago been thought through by
competent people. The only issue is adding a new file type to the
already large set of supported types, and it looks like I'll have to
start reading the PE spec to do so.
--
M.Biswas
> Your attempts to impute stupidity to the *conception* of the problem are
> simply a waste of bandwidth.
Your verbose attempts to impute stupidity to the *free advice* are simply a
bigger waste of bandwidth.
You have been told everything that could possibly have been told on the
subject, what else do you need to know?
[...]
> and it looks like I'll have to start reading the PE spec to do so.
This, and the very first post of yours in this thread are the only ones that
make sense by not wasting bandwidth.
S
P.S. You should understand that executables that are only different in
timestamps may be functionally different. I'm not going to explain why, to
avoid wasting the precious bandwidth.
I can only say that I disagree with the sarcasm. Anticipating the usual
round of "don't do that" posts found on Usenet, I tried to head them off
by prefixing my original post with the words "this much is an
architectural requirement and not subject to reimplementation". Despite
that I got a number of "don't do that" posts from people who had
apparently not read closely. Am I guilty of wasting bandwidth, or
calling them stupid, by trying to nudge the thread back onto the track?
I can't see it. If you look at the original post I think you'll see that
I worked hard at formulating a clear, concise, and well-backgrounded set
of questions.
> You have been told everything that could possibly have been told on the
> subject...
This may be evident to you. It is not evident from the thread.
> P.S. You should understand that executables that are only different in
> timestamps may be functionally different. I'm not going to explain why, to
> avoid wasting the precious bandwidth.
You say you have interesting and topical information but that posting it
would be a waste of bandwidth? Please drop the sarcasm and, for the sake
of other interested readers if not me, explain your reasoning. Clearly a
program which reads its own timestamp(s), or which hashes the file
containing it, can be functionally different. How else?
--
M.Biswas
> I can only say that I disagree with the sarcasm. Anticipating the usual
> round of "don't do that" posts found on Usenet, I tried to head them off
When I read those "head off" messages I did not like them. Because they were
unfriendly. It may be true that they were in reply to "tangential" posts but
that does not change the fact.
[...]
> This may be evident to you. It is not evident from the thread.
The two API routines, the conclusion that they were not suitable, and the
lack of any "non tangential" responses ever since... not evident indeed.
> You say you have interesting and topical information but that posting it
> would be a waste of bandwidth?
Given that you had made it very clear that you wanted to exclude the
timestamps and that it was an architectural decision taken at the
inflationary stage of the Universe and thus not debatable, that did not seem
very topical.
[...]
> Clearly a program which reads its own timestamp(s), or which hashes the
> file containing it, can be functionally different. How else?
"How else" implies that the possibility already mentioned is somehow invalid
or irrelevant. It can be invalid or irrelevant only when you can control the
behavior of the executables and the environment they are exposed to. If you
can control that, then all the other possibilities will be just as invalid
or irrelevant. I am sure you can finish the sentence "if you cannot control
that, then..."
S