Hash prefix computation

571 views
Skip to first unread message

ben...@synapse-interactive.fr

unread,
Dec 13, 2017, 9:04:34 AM12/13/17
to Google Safe Browsing API
Hello there,

I don't understand how to compute prefix hashes: 

The documentation says I need to compute the most significant 4-32 bytes of a SHA256 hash, but I don't know what "the most significant" means.
It seems that it depends on the original string length, but I don't understand the rules.

For instance, if I take this prefix:
  •  testsafebrowsing [dot] appspot [dot] com/s/phishing.html *

Its sha256 is:

  • efbd4c3ab44f327eb13ca942ad7c7f0ab47ec260a4d0b8051684a01b2ef35220

Which string should I use to test my local database against hash prefix collision?
  1. efbd
  2. efbd4c
  3. efbd4c3a
  4. efbd4c3ab44f ?
Thank you,
Ben

* Replace [dot] by actual dots. The presence of this URL in my message causes Google Groups to consider it SPAM :)

Ben Sanders

unread,
Dec 14, 2017, 1:01:02 PM12/14/17
to Google Safe Browsing API
It's the first four (or more) bytes, so usually that would be 0xefbd4c3a (two hex digits per byte).

The variable length is because we can (and sometimes do) send longer hash prefixes (5+ bytes) in cases where the 4 byte prefix has a collision with a popular URL. So we lengthen the hash to reduce wasted bandwidth, though this does complicate the client implementation a bit.

ben...@synapse-interactive.fr

unread,
Dec 15, 2017, 11:27:31 AM12/15/17
to Google Safe Browsing API
Hello Ben,

On this point I don't understand. I must assume I'm not really familiar with binaries and hexadecimals.
The documentation says I have to:
  1. Canonicalize the URL
  2. Create expressions
  3. For each expression, sha256 hash it
  4. For each hash, shorten it (this is the point I struggle on)
  5. Check if the shortened hash matches local database
  6. If yes, send this shortened hash to GSB API for verification - the GSB API will return the matching full-length hashes
  7. If the expression full-length hash matches one returned by the API, the URL is considered unsafe.

At which point do you come to the string '0xefbd4c3a' ? Is this the type of string I should send to the API? I'm completely lost on that point.
To me the 1st 4 bytes of efbd4c3ab44f327eb13ca942ad7c7f0ab47ec260a4d0b8051684a01b2ef35220 result in efbd, which will collide with many hashes.

Sorry for my newbieness :)

Ben Sanders

unread,
Dec 15, 2017, 12:55:27 PM12/15/17
to Google Safe Browsing API
Some languages/programs output a sha256 as you are seeing it (efbd4c3ab44f32...). This is nice because it's easily readable. It uses two hexadecimal characters [0-9a-f] for each output byte of the hash. A disadvantage is it essentially uses twice as much space as is necessary.

We are sending (and receiving) the binary versions of these hashes. For JSON users, we are base64'ing to make it web safe.

For php, it looks like you can use hash('sha256', $expression, true) to output the binary string instead. With the binary string, you can take a prefix and compare it with the full update we sent you. As I said before, we may have sent both 4 byte and 5 byte (or longer) hashes. So you could first check the first 4 bytes of your hashed expression, then the 5 byte, etc.

When you send the matching hash prefix, you'll need to base64 it as well before encoding to json. In this case the 4-binary-byte base64'ed value should be 771MOg==.

Let me know if you have any more questions.

ben...@synapse-interactive.fr

unread,
Dec 19, 2017, 8:44:22 AM12/19/17
to Google Safe Browsing API
Hello Ben,

Alright, when working on sha256-strings I need to double the prefixSize to get it working. Now everything's fine.

Thanks for your help!
Ben
Reply all
Reply to author
Forward
0 new messages