constant hash?

Jonathon Paarlberg

Nov 6, 2015, 11:59:10 AM

to OpenRefine

I need to create a unique ID from concatenated values from separate columns, but I need to be able to reproduce the creation of the same unique ID every time the concatenated value reappears.

For example, let's say my concatenated value is 30648-Closed-Sydney-Rock View. I want to create a hash on that so that the original value isn't obvious, but the next time that value comes around, I want to be able to recreate the exact same hash. I'm supposing that just means feeding the hash a non-random seed, no?

Sorry if I'm making no sense at all.

Any help you can offer will be appreciated.

Martin Magdinier

Nov 6, 2015, 12:47:55 PM

to openr...@googlegroups.com

Jonathon,

The encoding and hashing functions do exactly this job. You can find more information:

on the wiki for the sha1 and md5: https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions#encoding-and-hashing
in this tutorial: http://blog.ouseful.info/2015/01/23/anonymising-data-with-open-refine/
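
As a rough illustration outside OpenRefine, here is a minimal Python sketch using the standard hashlib module (an assumption for illustration only; inside OpenRefine you would use the GREL hashing functions from the wiki page above). The helper name md5_hex is hypothetical. No seed is involved: the digest depends only on the bytes of the input.

```python
import hashlib

def md5_hex(text):
    """Return the MD5 digest of text as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

value = "30648-Closed-Sydney-Rock View"
first = md5_hex(value)
second = md5_hex(value)

print(first)
print(first == second)  # True: same input, same hash
```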

Martin

Jonathon Paarlberg

Nov 6, 2015, 2:13:20 PM

to OpenRefine

So I can expect an md5 hash to output the exact same hash each time the hashing code is reapplied? Even after I restart my computer operating system and reopen my project? Even if I am using a completely different install of OpenRefine on another computer?

Jonathon Paarlberg

Nov 6, 2015, 2:24:20 PM

to OpenRefine

From Wikipedia's entry on Hash functions:

Properties
Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below. The exact requirements are dependent on the application, for example a hash function well suited to indexing data will probably be a poor choice for a cryptographic hash function.

Determinism

A hash procedure must be deterministic—meaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the data to be hashed, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day. It also excludes functions that depend on the memory address of the object being hashed, because that address may change during execution (as may happen on systems that use certain methods of garbage collection), although sometimes rehashing of the item is possible.

So I guess my question is whether OpenRefine's md5 hash depends on external variable parameters or on my computer's memory address.

Thad Guidry

Nov 6, 2015, 3:10:31 PM

to openrefine

In OpenRefine we have the md5() GREL function. The source is here: https://github.com/OpenRefine/OpenRefine/blob/a2aa8dffb4d21146156647f979e65b5ce376abd1/main/src/com/google/refine/expr/functions/strings/MD5.java

It uses Apache Commons DigestUtils to calculate the MD5 digest and convert it to a 32-character hex string for conciseness: https://commons.apache.org/proper/commons-codec/archives/1.7/apidocs/org/apache/commons/codec/digest/DigestUtils.html#md5Hex(java.lang.String)

Martin is correct. MD5 is even used as a checksum for files, to ensure that what you downloaded has the same content as what is stored on the server you got it from. (You may have seen those MD5 files on an FTP server at times.)
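
One way to convince yourself that the result does not depend on the machine or the install: MD5 has published test vectors (RFC 1321), and any correct implementation has to reproduce them. The small check below uses Python's hashlib as a stand-in for the Java DigestUtils call; the helper name md5_hex is hypothetical.

```python
import hashlib

# Test vectors from RFC 1321: every correct MD5 implementation,
# on any machine, must produce exactly these digests.
RFC1321_VECTORS = {
    "": "d41d8cd98f00b204e9800998ecf8427e",
    "abc": "900150983cd24fb0d6963f7d28e17f72",
}

def md5_hex(text):
    """Return the MD5 digest of text as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

for text, expected in RFC1321_VECTORS.items():
    assert md5_hex(text) == expected
    print(repr(text), "->", md5_hex(text))
```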

OpenRefine has other encoding and hashing functions besides MD5 listed here:

https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions#encoding-and-hashing

BTW, if you do NOT need an exact match on the string data, and only care about the "intent" of the string, you can use the fingerprint() function instead. It produces what we call a "fingerprint" of your strings (ignoring case, word order, accents, etc.). It's kinda cool when you want to say that 2 strings are "equal but not identical", where MD5 and the other hashing and encoding algorithms would treat them as different. For example, each of these pairs has the same fingerprint:

{"schön","schon"},
{"\tABC \t DEF ","abc def"}, // test leading and trailing whitespace
{"bbb\taaa","aaa bbb"},
{"müller","muller"},
{"",""}
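
To make the idea concrete, here is a rough Python approximation of that kind of keying. It is a sketch written for this thread, not OpenRefine's actual implementation, and the exact steps (trim, lowercase, drop accents and punctuation, sort and de-duplicate the words) are assumptions chosen to reproduce the pairs above.

```python
import re
import unicodedata

def fingerprint(value):
    """Approximate a fingerprint key: case, accents and word order are ignored."""
    s = value.strip().lower()
    # Decompose accented letters and drop the combining marks (ö -> o).
    decomposed = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove punctuation, keeping word characters and whitespace.
    s = re.sub(r"[^\w\s]", "", s)
    # Split on any whitespace, de-duplicate, sort, and rejoin.
    return " ".join(sorted(set(s.split())))

pairs = [
    ("schön", "schon"),
    ("\tABC \t DEF ", "abc def"),
    ("bbb\taaa", "aaa bbb"),
    ("müller", "muller"),
    ("", ""),
]
for raw, expected in pairs:
    print(repr(raw), "->", repr(fingerprint(raw)), fingerprint(raw) == expected)
```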

Thad

+ThadGuidry

Thad Guidry

Nov 6, 2015, 3:44:51 PM

to openrefine

Jonathon,

After sending that previous message, I realized that what we are saying and what your use case is might actually differ.

MD5 is one-way, not two-way (well, mostly; dictionary attacks can be utilized :) ). http://stackoverflow.com/questions/1240852/is-it-possible-to-decrypt-md5-hashes

Is it that you're wanting to simply obfuscate your data and actually have an easy two-way encoding method?

In that case, you might just want to convert to base64 using Python as your expression language.

return value.encode('base64')

return value.encode('base64').decode('base64')

You can even zip and unzip :)

return value.encode('zip').decode('zip')

Here's a short listing of the various encoding aliases that you could use and play with:

https://docs.python.org/2/library/codecs.html?highlight=encode#python-specific-encodings
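
The 'base64' and 'zip' string codecs used above are Python 2 features. If you try the same round trip in a standalone Python 3 interpreter, a sketch along these lines should work; the helper names to_base64 and from_base64 are hypothetical.

```python
import base64
import zlib

def to_base64(text):
    """Encode text as base64 (reversible obfuscation, not a hash)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def from_base64(encoded):
    """Decode a base64 string back to the original text."""
    return base64.b64decode(encoded).decode("utf-8")

value = "30648-Closed-Sydney-Rock View"
encoded = to_base64(value)
print(encoded)
print(from_base64(encoded) == value)  # True: the original comes back

# The zip/unzip round trip, using zlib on the UTF-8 bytes.
compressed = zlib.compress(value.encode("utf-8"))
print(zlib.decompress(compressed).decode("utf-8") == value)  # True
```

Unlike a hash, anyone can reverse this, so it only hides the value from a casual glance.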

Thad

+ThadGuidry

Jonathon Paarlberg

Nov 9, 2015, 12:42:08 PM

to OpenRefine

No, I don't need a two-way encoding method, although that could be useful another day.

I just needed reassurance that the MD5 algorithm will give me the same result given the same string every time, regardless of the circumstances. I think that's what you've said, that it's one-way but highly consistent and reliable to give the same results given the same input -- and that's why it's used as a checksum. I just want to build a unique ID without spelling out the entire concatenated conglomeration that the unique ID is based on; obfuscation isn't really necessary, but it might be preferable in case somebody assumes that the unique ID can't be used to regenerate the original data without further information. (In other words, I don't want people accidentally disclosing information that they don't intend to.)

Thanks.

Thad Guidry

Nov 9, 2015, 3:47:08 PM

to openrefine

I would use SHA1 (secure hash) to generate the unique ID.

A Hash = A consistently generated result based on an input value.

The methods used to generate the hash result are up to the programmer / user.

MD5 and SHA1 are the most popular hashing algorithms for generating a Hash based on an input value.
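
As a sketch of what that could look like for the concatenated columns in this thread, here is a small Python function (hashlib is standard; the name make_id and the "-" separator are assumptions based on the example value given earlier):

```python
import hashlib

def make_id(parts, separator="-"):
    """Join the column values and return the SHA-1 digest as 40 hex characters."""
    joined = separator.join(parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

row = ["30648", "Closed", "Sydney", "Rock View"]
row_id = make_id(row)

print(row_id)
# The same values always give the same ID.
print(row_id == make_id(["30648", "Closed", "Sydney", "Rock View"]))  # True
```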

A UUID = a universally unique identifier. It is not a hashing algorithm.

A UUID is designed to be unique... BUT, not random. (read on)
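
For what it's worth, whether a UUID can be reproduced from the same input depends on the UUID version. The sketch below uses Python's uuid module purely for illustration; the choice of NAMESPACE_URL as the namespace is arbitrary.

```python
import uuid

value = "30648-Closed-Sydney-Rock View"

# Version 4 UUIDs are generated from random bits, so each call differs.
random_a = uuid.uuid4()
random_b = uuid.uuid4()
print(random_a == random_b)  # False, barring an astronomically unlikely clash

# Version 5 UUIDs are derived from SHA-1 of a namespace plus a name,
# so the same name always yields the same UUID.
named_a = uuid.uuid5(uuid.NAMESPACE_URL, value)
named_b = uuid.uuid5(uuid.NAMESPACE_URL, value)
print(named_a == named_b)  # True
```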

In theory, you can get collisions for unique keys generated even with MD5 and SHA1. In practice, if you used SHA1 or something with an even larger digest, and had started generating when your mother's mother's mother's mother's mom was alive, you probably would not have seen a collision yet.

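MD5 and SHA1 are deemed "cryptographic" hash algorithms because an accidental collision is that much of a long shot. (Deliberately constructed collisions are known for MD5, but that is a different problem from two of your own values clashing by chance.)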

If you want an algorithm that is VERY FAST and fairly unique (mostly avoiding collisions), then it looks like Murmur is "ok": http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

(I am not a Math major, nor a crypto expert, nor someone to be entirely trusted in this domain, so use the above information at your own risk level for your project. This guy is, though, and he explains the possibility of collisions for MD5 rather well: http://crypto.stackexchange.com/questions/15873/what-is-the-md5-collision-with-the-smallest-input-values) :)

In summary: the same input string always gives the same hash result. How well you avoid a different input string ending up with the same hash result depends on the digest size. Increase your digest size and you reduce the chance of collisions.
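
To put rough numbers on the digest-size point, the usual "birthday" approximation can be sketched in Python as below. It assumes the digests behave like uniformly random values, so it only describes accidental collisions, and the function name collision_probability is hypothetical.

```python
import math

def collision_probability(num_items, digest_bits):
    """Approximate chance that at least two of num_items random digests collide."""
    pairs = num_items * (num_items - 1) / 2.0
    # 1 - exp(-pairs / 2^bits), computed with expm1 to stay accurate for tiny values.
    return -math.expm1(-pairs / 2.0 ** digest_bits)

# One million distinct inputs at three different digest sizes.
for name, bits in [("32-bit hash", 32), ("MD5", 128), ("SHA1", 160)]:
    print(name, collision_probability(10**6, bits))
```

With a million values, a 32-bit digest is almost certain to collide somewhere, while the 128-bit and 160-bit results come out vanishingly small.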

Thad

+ThadGuidry

Jonathon Paarlberg

Nov 10, 2015, 11:34:46 AM

to OpenRefine

"A Hash = A consistently generated result based on an input value."

LOL

Thanks, Thad. That's exactly what I needed to know. Sorry, but I could swear I saw different results using the MD5 hash twice on the same string a while back when I was beginning to use OpenRefine. It must be that I slightly varied the way that I was generating the concatenated field that the hash was based on.
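
For anyone who runs into the same thing, here is a small Python sketch of how an invisible difference in the concatenated value changes the hash, and one possible guard against it (trimming each column value first). The helper names md5_hex and normalized_key are hypothetical.

```python
import hashlib

def md5_hex(text):
    """Return the MD5 digest of text as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def normalized_key(parts):
    """Trim each column value before joining, so stray whitespace cannot leak in."""
    return "-".join(part.strip() for part in parts)

clean = "30648-Closed-Sydney-Rock View"
trailing_space = "30648-Closed-Sydney-Rock View "

# The two inputs differ by a single trailing space, so the hashes differ.
print(md5_hex(clean) == md5_hex(trailing_space))  # False

# Trimming the column values first makes the messy row hash like the clean one.
messy_row = ["30648 ", "Closed", " Sydney", "Rock View"]
print(md5_hex(normalized_key(messy_row)) == md5_hex(clean))  # True
```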

Thanks again for helping to educate me on hashes.
