Hash My Files !!BETTER!! Download

Gunvor Nazarian

Jan 20, 2024, 5:34:12 PM

to glanevucrei

The possibility of producing a collision is small, but not unheard of, and is the reason why more secure algorithms like SHA-2 have replaced SHA-1 and MD5. For example, the contents of the following two files, ship.jpg and plane.jpg, are clearly different, as a simple visual inspection shows, so they should produce different message digests.

However, when we calculate the value with MD5 we get a collision, falsely indicating that the files are identical. Here the output is from the command line on macOS using the Terminal.app, but you can see that the ship.jpg hash value is the same as we got from PowerShell earlier:

Second, the flaw in legacy AV has always been that detection requires foreknowledge of the threat, so by design an anti-malware solution that relies on a database of known hash values is always one step behind the next attack.

Hashes are a fundamental tool in computer security as they can reliably tell us when two files are identical, so long as we use secure hashing algorithms that avoid collisions. Even so, as we have seen above, two files can have the same behaviour and functionality without necessarily having the same hash, so relying on hash identity for AV detection is a flawed approach.

Despite that, hashes are still useful for security analysts for such things as sharing IOCs and threat-hunting, and you will undoubtedly encounter them on a daily basis if you work anywhere in the field of computer and network security.

The probability of just two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456. (From an answer on StackOverflow.)
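The figure quoted above can be checked with Python's arbitrary-precision integers. This is only a small sketch; the variable names are illustrative, and it assumes a 128-bit digest (such as MD5) that behaves like a uniformly random function.

```python
from fractions import Fraction

# A 128-bit digest has 2**128 possible values.
possible_digests = 2 ** 128

# Chance that two specific, independently chosen inputs share a digest,
# assuming the hash output is uniformly distributed (an idealisation).
pair_collision_probability = Fraction(1, possible_digests)

# Print the count with thousands separators to compare with the text.
print(f"{possible_digests:,}")
print(float(pair_collision_probability))
```

The first line printed is 340,282,366,920,938,463,463,374,607,431,768,211,456, which matches the number spelled out above.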

Around the year 2005, methods were discovered to create two documents that had the same MD5 hash (collision attack). See @user2357112's comment below. This means an attacker can create two executables, for example, that have the same MD5, and if you are depending on MD5 to determine which to trust, you'll be fooled.

Thus MD5 should not be used for cryptography or security. It's bad to publish an MD5 on a download site to ensure download integrity, for example. Depending on an MD5 hash you did not generate yourself to verify file or data contents is what you want to avoid.

If you generate your own, you know you're not being malicious to yourself (hopefully). So for your use, it's OK, but if you want someone else to be able to reproduce it, and you want to publicly publish the MD5 hash, a better hash should be used.

Note that it's possible for two Excel files to contain the same values in the same rows and columns, but for the bytestream of the file to be completely different due to different formatting, styles, settings, etc.

In practice, yes, an identical cryptographic hash means the files are the same, as long as the files were not crafted by an attacker or other malicious entity. The odds of a random collision with any well-designed cryptographic hash function are so small as to be negligible in practice and in the absence of an active attacker.

The way a cryptographic hash function works is to take an arbitrary-length input, and output a fixed-length value computed from the input. Some hash functions have multiple output lengths to choose from, but the output is still to some degree a fixed-length value. This value will be up to a few dozen bytes long; the hash algorithms with the longest output value in common use today have a 512-bit output, and a 512-bit output is 64 bytes.
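As a quick illustration of the fixed-length property, the sketch below hashes inputs of very different sizes with a few SHA-2 variants from Python's standard hashlib module and records the digest lengths. The sample inputs are arbitrary choices for the demonstration.

```python
import hashlib

# Inputs ranging from empty to one million bytes.
inputs = [b"", b"a", b"x" * 1_000_000]

# For each algorithm, collect the set of digest lengths (in bytes) seen.
digest_sizes = {}
for name in ("sha256", "sha384", "sha512"):
    lengths = {len(hashlib.new(name, data).digest()) for data in inputs}
    digest_sizes[name] = lengths

print(digest_sizes)
```

Each algorithm yields exactly one digest length no matter how long the input is: 32 bytes for SHA-256, 48 for SHA-384 and 64 for SHA-512.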

If an input to a hash function is longer than the output of the hash function, some information must be discarded to make the input fit in the output. Consequently, there must exist multiple inputs, longer than the output, that generate the same output.

Let's take the current workhorse, SHA-256, as an example. It outputs a hash of 256 bits, or 32 bytes. If you have two files which are each exactly 32 bytes long, but different, these should (assuming no flaw in the algorithm) hash to different values, no matter the content of the files; in mathematical terms, the hash is a function mapping a 2^256 input space onto a 2^256 output space, which should be possible to do without collisions. However, if you have two files that are each 33 bytes long, there must exist some combination of inputs that give the same 32-byte output hash value for both files, because we're now mapping a 2^264 input space onto a 2^256 output space; here, we can readily see that there should, on average, exist 2^8 inputs for every single output. Take this further, and with 64-byte files there should exist 2^256 inputs for every single output!
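The counting argument above is plain integer arithmetic, so it can be reproduced directly. In this sketch the function name is made up for illustration, and "average" means the size of the input space divided by the size of the output space.

```python
OUTPUT_BITS = 256  # SHA-256 digest size in bits

def average_inputs_per_output(file_bytes: int) -> int:
    """Average number of distinct files of exactly this length per digest value."""
    input_space = 2 ** (8 * file_bytes)   # every possible file of that length
    output_space = 2 ** OUTPUT_BITS       # every possible digest
    return input_space // output_space

for size in (32, 33, 64):
    ratio = average_inputs_per_output(size)
    # bit_length() - 1 gives the exponent, because each ratio is a power of two.
    print(f"{size}-byte files: 2^{ratio.bit_length() - 1} inputs per output")
```

For 32, 33 and 64 bytes this prints exponents of 0, 8 and 256 respectively, in line with the figures in the paragraph.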

Cryptographic hash functions are designed such that it's computationally difficult to compose an input that gives a particular output, or compose two inputs that give the same output. These properties are known as preimage resistance and collision resistance, respectively. It's not impossible to find these collisions; it's just intended to be really, really, really, really hard. (A bit of a special case of a collision attack is a birthday attack.)

Some algorithms are better than others at resisting attackers. MD5 is generally considered completely broken these days, but last I looked, it still sported pretty good first preimage resistance. SHA-1 is likewise effectively broken: practical collision attacks have been demonstrated, and although they still require considerable effort, there's no reason to believe that will be the case indefinitely; as the saying goes, attacks always get better, they never get worse. SHA-256/384/512 are currently still believed safe for most purposes. However, if you're just interested in seeing if two non-maliciously-crafted, valid files are the same, then any of these should be sufficient, because the input space is sufficiently constrained already that you'd be mostly interested in random collisions. If you have any reason to believe that the files were crafted maliciously, then you need to at the very least use a cryptographic hash function that is currently believed safe, which puts the lower bar at SHA-256.

First preimage is to find an input that yields a specific output hash value; second preimage is to find one input that gives the same output as another, specified input; collision is to find two inputs that yield the same output, without regard to what that is and sometimes without regard to what the inputs are.
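To make the difference between a second preimage and a collision concrete, here is a rough sketch that brute-forces both against a deliberately tiny toy hash. The toy hash (the first 16 bits of SHA-256), the function names and the example input are all assumptions made for this demonstration; nothing here is an attack on a real algorithm.

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> int:
    """Toy 16-bit hash: the first two bytes of SHA-256, as an integer."""
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

def find_collision():
    """Collision: ANY two distinct inputs that share a toy hash value."""
    seen = {}
    for i in count():
        msg = b"c%d" % i
        h = tiny_hash(msg)
        if h in seen:
            return seen[h], msg, i + 1   # both inputs and the number of tries
        seen[h] = msg

def find_second_preimage(target: bytes):
    """Second preimage: a different input matching ONE specified input."""
    wanted = tiny_hash(target)
    for i in count():
        msg = b"p%d" % i
        if msg != target and tiny_hash(msg) == wanted:
            return msg, i + 1            # the input found and the number of tries

a, b, collision_tries = find_collision()
other, preimage_tries = find_second_preimage(b"budget.xlsx")
print("collision after", collision_tries, "tries")
print("second preimage after", preimage_tries, "tries")
```

With a 16-bit output, a collision typically turns up after a few hundred tries (on the order of 2^8, the birthday bound), while a second preimage typically needs tens of thousands (on the order of 2^16). The same gap, scaled up enormously, is why a hash can have broken collision resistance and still have reasonable preimage resistance.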

All that said, it's important to keep in mind that the files may have very different data representations and still display exactly the same. So they can appear to be the same even though their cryptographic hashes don't match, but if the hashes match then they are extremely likely to appear the same.

If we consider a hypothetical (and very weak) 8-bit hashing algorithm, then this can represent 256 distinct values. As you start to run files through the algorithm, you will start to get hashes out... but before long you will start to see "hash collisions". This means that two different files were fed into the algorithm, and it produced the same hash value as its output. Clearly here, the hash is not strong enough, and we cannot assert that "files with matching hashes have the same content".
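The hypothetical 8-bit hash is easy to simulate. The sketch below uses the first byte of SHA-256 as a stand-in for such a weak algorithm (an assumption made purely for illustration) and hashes 1000 distinct inputs with made-up names.

```python
import hashlib
from collections import defaultdict

def hash8(data: bytes) -> int:
    """Toy 8-bit hash: the first byte of the SHA-256 digest (0..255)."""
    return hashlib.sha256(data).digest()[0]

# Group 1000 different inputs by their 8-bit hash value.
buckets = defaultdict(list)
for i in range(1000):
    name = f"file-{i}".encode()
    buckets[hash8(name)].append(name)

# The biggest group of distinct inputs that all share one hash value.
largest = max(buckets.values(), key=len)
print(len(buckets), "distinct hash values for 1000 different inputs")
print("largest group sharing one hash:", len(largest))
```

Because only 256 hash values exist, at least one value must be shared by four or more of the 1000 inputs, so a matching 8-bit hash says very little about whether two inputs are the same.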

Extending the size of the hash, and using stronger cryptographic hashing algorithms can significantly help to reduce collisions, and raise our confidence that two files with the same hash have the same content.

If two files have the same MD5 hash, and they haven't both been specially crafted, then they're identical. How hard it is to craft files with the same MD5 hash depends on the file format; I don't know how easy it is with Excel files.

So if you have files of your own that are just lying around and want to find duplicates, MD5 is safe. If you wrote one of the files, and the other file is of dubious origin, MD5 is still safe (the only way to get different files with the same MD5 checksum is to craft both files). If someone you don't trust sends you a budget proposal, and later sends another file which they claim is the same, then MD5 may not be enough.
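For the first case, finding duplicates among your own files, a minimal sketch might look like the following. The function name and the sample file names are hypothetical, and reading each file fully into memory is a simplification that suits small files only.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: Path) -> list[list[Path]]:
    """Group regular files under root whose contents share an MD5 digest."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only digests that more than one file maps to.
    return [paths for paths in groups.values() if len(paths) > 1]

# Usage sketch on a throwaway directory with three small files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"same content")
    (root / "b.txt").write_bytes(b"same content")
    (root / "c.txt").write_bytes(b"different content")
    duplicate_names = [[p.name for p in group] for group in find_duplicates(root)]

print(duplicate_names)
```

This reports a.txt and b.txt as one duplicate group. As the paragraph above notes, this is only appropriate when nobody had a chance to craft both files.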

To avoid any risk, use SHA-256 or SHA-512 instead of MD5. If two files have the same SHA-256 hash, then they're identical. The same goes for SHA-512. (There's a theoretical possibility that they could be different, but the probability of this happening accidentally is so much less than the probability of your computer flipping a bit during the verification that it just isn't relevant. As for someone deliberately crafting two files with the same hash, nobody knows how to do this for SHA-256 or SHA-512.)
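A comparison along these lines can be sketched with hashlib, reading each file in chunks so that large files do not need to fit in memory. The helper names, chunk size and file names below are illustrative assumptions.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, computed one chunk at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(path_a, path_b):
    """True if both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)

# Usage sketch with throwaway files (the names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "budget_v1.bin")
    copy = os.path.join(tmp, "budget_copy.bin")
    edited = os.path.join(tmp, "budget_v2.bin")
    samples = ((original, b"total=100"), (copy, b"total=100"), (edited, b"total=101"))
    for path, data in samples:
        with open(path, "wb") as handle:
            handle.write(data)
    copy_matches = same_content(original, copy)
    edit_matches = same_content(original, edited)

print(copy_matches, edit_matches)
```

The byte-for-byte copy matches and the edited file does not, even though only one character changed.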

If two Excel files have different hashes, then they're different, but there's no way to know by how much they differ. They could have identical data but different formatting, or they could just differ in the properties, or they might have been saved by different versions. In fact if Excel is anything like Word then merely saving a file updates its metadata. If you only want to compare the numerical and text data and ignore formatting and properties, you can export the spreadsheets to CSV to compare them.
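Once both spreadsheets are exported to CSV, a cell-by-cell comparison shows what a hash cannot: where they differ. The sketch below works on CSV text using Python's csv module; the function names and sample data are made up, and stripping whitespace from each cell is one possible normalisation choice, not a requirement.

```python
import csv
import io

def csv_cells(text: str) -> list[list[str]]:
    """Parse CSV text into rows of cell strings with outer whitespace removed."""
    return [[cell.strip() for cell in row] for row in csv.reader(io.StringIO(text))]

def differing_cells(csv_a: str, csv_b: str):
    """Yield (row, column, value_a, value_b) for every cell that differs."""
    rows_a, rows_b = csv_cells(csv_a), csv_cells(csv_b)
    for r in range(max(len(rows_a), len(rows_b))):
        row_a = rows_a[r] if r < len(rows_a) else []
        row_b = rows_b[r] if r < len(rows_b) else []
        for c in range(max(len(row_a), len(row_b))):
            # Cells missing from the shorter row or file count as empty.
            value_a = row_a[c] if c < len(row_a) else ""
            value_b = row_b[c] if c < len(row_b) else ""
            if value_a != value_b:
                yield r, c, value_a, value_b

first = "item,cost\nrent,1000\n"
second = "item,cost\nrent,1200\n"
print(list(differing_cells(first, second)))
```

Here the only difference reported is row 1, column 1 (counting from zero), where 1000 became 1200.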

Short answer: A cryptographic hash is supposed to help you be reasonably confident that files with matching hashes are the same. Unless deliberately crafted, the chance of two slightly different files having the same hash value is ridiculously small. But when it comes to comparing and verifying files that could be deliberately tampered with, MD5 is a poor choice. (Use another hash function like SHA-3 or BLAKE2.)

Long answer: An ideal hash function is one that creates an almost unique cryptographic hash for every unique piece of data. In other words, although we know for certain that there are files in this universe whose hash values collide, the chance of two such files naturally coming together is ridiculously small.

Ten years ago, I decided I must stay as far as I can from MD5. (Of course, until yesterday, I remembered the wrong reason for doing so; ten years is a long time, you see. I revisited my past memos to remember why and edited this answer.) You see, in 1996, MD5 was found to be susceptible to collision attacks. Nine years later, researchers were able to create pairs of PostScript documents and (ouch!) X.509 certificates with the same hash! MD5 was clearly broken. (Megaupload.com was also using MD5, and there was a lot of hanky-panky around hash collisions that gave me trouble at the time.)