Hash identifiers in buffers

509 views
Skip to first unread message

mikkelfj

unread,
Apr 10, 2016, 12:44:35 PM4/10/16
to FlatBuffers
I noticed the following issue https://github.com/google/flatbuffers/issues/3826 which now allows fully qualified type identifiers to be provided in the generated C++ code.

The use case is to assign a unique hashed identifier either before a flatbuffer in transmission, or as a replacement of the file identifier which isn't standardized on a per table type.

So I was thinking: rather than having everyone inventing their own hash scheme, wouldn't it make sense to define an official hash of each type name and make this available as a pre-calculated GetTypeHash() or something similar to that. The input to the hash is already well-defined. What remains is to choose an endian stable 32-bit hash function and define that it must be encoded as little endian in transmission, and define a strategy such as any hash that result in a 0 hash should instead hash to hash("").

From the following: FNV-1 32-bit seems to be a strong candidate based on speed, few english word collisions, and simplicity:


 hash = FNV_offset_basis
   for each byte_of_data to be hashed
        hash = hash × FNV_prime
        hash = hash XOR byte_of_data
   return hash
32 bit FNV_prime = 224 + 28 + 0x93 = 16777619
32 bit offset_basis = 2166136261

The above would be stable if converted to little endian after computation and stored in a 4 character array or string.

In the C-api it is currently possible to create buffers with `MyTable_create_as_root` which uses the last seen file_identifier in the schema, or `MyTable_create_as_root_with_identifier` which takes an extra user supplied 4 character identifier string. This string could be derived from the hash mentioned above.

One could also define that the hash is used instead of the file identifier in `MyTable_create_as_root` unless the table is the root_type and there is a file_identifier defined.

The above also solves and issue in the C verifier implementation: it doesn't verify the file identifier of any nested buffer because it is ambigous. The verifier could have two versions, one that continue to ignore nested file identifiers, and one that uses the hash schema.

Wouter van Oortmerssen

unread,
Apr 11, 2016, 1:08:51 PM4/11/16
to mikkelfj, FlatBuffers
Yes, a standardized hash would make sense to add.

FNV-1 is awesome, very elegant. I already use it in one of my other libraries: https://github.com/google/flatui/blob/master/include/flatui/internal/flatui_util.h

I don't think it should be the default file_identifier. Those are declared to be human readable in the schema. Some FlatBuffer types on purpose do not have a file_identifier (saves 4 bytes, for one). So these are separate concerns.

--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mikkel Fahnøe Jørgensen

unread,
Apr 11, 2016, 2:29:45 PM4/11/16
to Wouter van Oortmerssen, FlatBuffers
Great. I noticed that in flatui you are using the FNV-1a variant (xor before multiply) and different constants in the code you link, which is fine, but just to avoid ambiguity I reckon the official FNV-1 with the listed constants is the one to standardize on, it is also slightly better for english words than FNV-1a which is better at numbers.

I agree it should not be default. The C-API is already reserving file id 0 for missing or to be ignored identifiers.

I suggested one could use file_identifier if present, and hash otherwise, but that is confusing. If it is missing, it should be an absent identifier as it is now in the C api.

The hash identifier should either be set manually, or alternatively by a dedicated create call with hash for generators that support it, e.g. create_as_root_with_type_hash() in C.

For future uses: The type hash could also define a string prefix for buffers using different offset types such as 64 bit uoffsets.

mikkelfj

unread,
Apr 11, 2016, 4:54:20 PM4/11/16
to FlatBuffers, w...@google.com
Actually, I see the FNV1 page always recommends to use the FNV1-a version, so I suggest we go for that, just to avoid any questions:


NOTE: We recommend that you use the FNV-1a alternative algorithm instead of the FNV-1 hash where possible.

There is a minor variation of the FNV hash algorithm known as FNV-1a:

hash = offset_basis
for each octet_of_data to be hashed
        hash = hash xor octet_of_data
        hash = hash * FNV_prime
return hash

The only difference between the FNV-1a hash and the FNV-1 hash is the order of the xor and multiply. The FNV-1a hash uses the same FNV_prime and offset_basis as the FNV-1 hash of the same n-bit size.

m...@yahoo-inc.com

unread,
Apr 11, 2016, 5:22:48 PM4/11/16
to FlatBuffers
On Sunday, April 10, 2016 at 9:44:35 AM UTC-7, mikkelfj wrote:
I noticed the following issue https://github.com/google/flatbuffers/issues/3826 which now allows fully qualified type identifiers to be provided in the generated C++ code.

...
 
From the following: FNV-1 32-bit seems to be a strong candidate based on speed, few english word collisions, and simplicity:




Have you considered xxHash?  According to the propaganda, https://github.com/Cyan4973/xxHash, it's faster and better.  In my experience with FNV (not sure if it was the 1-a variant), I've seen poor distribution over buckets given "user identifier" input.

mikkelfj

unread,
Apr 11, 2016, 5:47:09 PM4/11/16
to FlatBuffers
Have you considered xxHash?  According to the propaganda, https://github.com/Cyan4973/xxHash, it's faster and better.  In my experience with FNV (not sure if it was the 1-a variant), I've seen poor distribution over buckets given "user identifier" input.

Yes I have. Yann Collet does some excellent work, and xxHash has a C implementation which I like, contrary to CityHash. But it is not very simple for this use case.
For other purposes I have consdiered xxHash, but eventually settled for MetroHash for fast hashing:

Regarding poor distribution, this is not very important, only actual collisions are, and it seems to be fairly good compared to the listed competition linked to, although it does not cover more recent hash algorithms.

In this instance FNV1a seems better because it is widely known, trusted, is much simpler than other hashes that are not mostly useless, and users of just about any language could easily write code to test if a given buffer has a given type hash without any sort of FlatBuffer generated code. It also reduces risk of implementation mistakes or inaccuracies - implementers just have to remember to use unsigned 32-bit integers (signed overflow is undefined in C). With 32-bits there will always be a risk of collisions, and I would have preferred a 128 bit or at least 64 bit signature to kill this issue entirely. But the schema compiler can warn against collisions in a schema, and the residual risk of a conflict is small unless you have high volumes of different schema, and then you also need to consider collisions in the input namespace.

Wouter van Oortmerssen

unread,
Apr 11, 2016, 7:09:15 PM4/11/16
to mikkelfj, FlatBuffers
Agreed that there's a lot of advantages to the simplicity of FNV1a. Speed is not a concern here, since these hashes are generated by the compiler.

mikkelfj

unread,
Apr 12, 2016, 6:49:15 PM4/12/16
to FlatBuffers, mik...@dvide.com
I implemented type hashes in flatcc master branch - still needs updates to the verifier:

mikkelfj

unread,
Apr 13, 2016, 2:35:21 AM4/13/16
to FlatBuffers, mik...@dvide.com

FYI: I added a lossless dispersion function that may be applied to the type hash value when better distribution is desired. It should not be used in buffer transmission since it just complicates matters and provides no addtional information. flatcc already uses it for pointer sets internally.

/*
 * This is a brute force tested collision free hash (i.e. permutation) of the type hash to
 * provide better distribution for use in hash tables. It is likely not
 * necessary in praxis, and for uniqueness of identifiers it provides no
 * advantage over just using the FNV-1a type hash, except when truncating
 * the identifier to less than 32-bits.
 */
static inline uint32_t flatbuffers_disperse_type_hash(flatbuffers_thash_t type_hash)
{
    uint32_t x = type_hash;

    x = ((x >> 16) ^ x) * 0x45d9f3bUL;
    x = ((x >> 16) ^ x) * 0x45d9f3bUL;
    x = ((x >> 16) ^ x);
    return x;
}



Wouter van Oortmerssen

unread,
Apr 13, 2016, 2:54:04 PM4/13/16
to mikkelfj, FlatBuffers
I doubt many people will be using these hashes in hash tables in a ways where the distribution is a problem, but who knows :)

If you want to make this into a "standard", you could make a PR for flatc too.

Mikkel Fahnøe Jørgensen

unread,
Apr 13, 2016, 3:05:31 PM4/13/16
to Wouter van Oortmerssen, FlatBuffers
Yeah, it was more to discourage concerned users from moving to alternative hashes.

I will see if can find some time to add the PR, but I really would appreciate if someone else could pick it up.
Reply all
Reply to author
Forward
0 new messages