Types Don't Know #


Howard Hinnant

Apr 2, 2014, 7:09:22 PM
to std-pr...@isocpp.org
http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html

We’re aiming at the pre-Rapperswil mailing. Any feedback, positive or negative, will be much appreciated. The paper currently lacks proposed wording. It is more of a technology demonstration. There’s a co-author spot open for anyone who wants to turn the demonstration into concrete proposed wording. :-)

Howard

Sean Middleditch

Apr 2, 2014, 9:24:52 PM
to std-pr...@isocpp.org
Very well done.

The only thing that stands out to me is the lack of mention of any utility for users who may have a contiguous buffer of what may or may not be contiguously hashable types. It'd be handy to pass in some pointer and a size and have it Do The Right Thing(tm). Essentially the T(&a)[N] overload but for dynamic-length arrays (pointers+length). If ranges are added to the stdlib, it would be fine to just make sure that range<T*> does the right thing when hashed. Otherwise, some new function (likely with a new name, not an overload, to avoid issues with the variadic overload of hash_append) that explicitly takes a T* and a size_t works.

In terms of use cases: plenty of users implement their own containers or buffers (even in modern C++). Look at how Chrome and LLVM both have a stack_vector kind of thing, as one easy example. Circular buffers, KD-trees (with leaf nodes having >1 object stored in a contiguous chunk), and specialized key-value containers (objects/tables like in YAML/JSON readers or Lua generators) all also could use this new hashing infrastructure, but they shouldn't need a lot of repeated template SFINAE magic to make use of the contiguous optimization.
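For what it's worth, the pointer+length helper could be sketched like this (the helper's name and the toy FNV-1a hasher are illustrative assumptions, not from the paper):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Toy byte-stream hasher (FNV-1a) standing in for the paper's Hasher
// concept; any real Hasher with this call signature would do.
struct fnv1a
{
    std::uint64_t state = 14695981039346656037ull;
    void operator()(void const* key, std::size_t len) noexcept
    {
        auto p = static_cast<unsigned char const*>(key);
        for (std::size_t i = 0; i != len; ++i)
            state = (state ^ p[i]) * 1099511628211ull;
    }
    explicit operator std::size_t() const noexcept
    {
        return static_cast<std::size_t>(state);
    }
};

// Hypothetical pointer+length helper (name not from the paper): when T is
// trivially copyable, the whole buffer is fed to the hasher in one call,
// which is the contiguous optimization being asked for.
template <class Hasher, class T>
void hash_append_range(Hasher& h, T const* data, std::size_t n) noexcept
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "this sketch covers only the contiguously hashable case");
    h(data, n * sizeof(T));
}
```

A custom stack_vector or circular buffer could then call `hash_append_range(h, buf.data(), buf.size())` from its own hash_append without any SFINAE of its own.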

John Bytheway

Apr 2, 2014, 10:02:50 PM
to std-pr...@isocpp.org
An excellent presentation.

I see no mention of how to handle types whose hash-worthy content is not
visible in the header (pimpl'ed types, etc.). For these a templated
hash_append function is impossible and I guess you would need some sort
of type-erased hasher.

These types of objects are less likely to be used as keys, but I think
they ought to be supported.

One way to support this nicely would be to use operator() in place of
append, and then std::function<void(void const*, size_t)> would work as
a type-erased hasher.
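A minimal sketch of that idea, assuming hashers spell their append step as operator() (the toy hasher and class below are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// If hashers append via operator(), std::function with this signature
// erases the concrete hasher type, so a pimpl'ed class can declare one
// non-template hash_append in its header and define it out of line.
using erased_hasher = std::function<void(void const*, std::size_t)>;

// Toy FNV-1a hasher standing in for any conforming Hasher.
struct fnv1a
{
    std::uint64_t state = 14695981039346656037ull;
    void operator()(void const* key, std::size_t len) noexcept
    {
        auto p = static_cast<unsigned char const*>(key);
        for (std::size_t i = 0; i != len; ++i)
            state = (state ^ p[i]) * 1099511628211ull;
    }
};

// This would live in the class's .cpp file; the header only declares it.
struct widget { int secret = 42; };
void hash_append(erased_hasher const& h, widget const& w)
{
    h(&w.secret, sizeof w.secret);
}
```

The caller wraps a concrete hasher with `std::ref` so its accumulated state remains visible after the call.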

Even easier for the user, you could require all hashers to inherit from
some common base class (with a pure virtual append member function), so
that any class X has the choice between the following two styles of
hash_append:

template <class Hasher>
friend void hash_append(Hasher& h, X const& x) noexcept
{
using hash_defaults::hash_append;
hash_append(h, x.date_, x.data_);
}

friend void hash_append(std::hasher_base& h, X const& x) noexcept
{
using hash_defaults::hash_append;
hash_append(h, x.date_, x.data_);
}

Concrete hasher classes could mark themselves final so as to avoid
unnecessary virtual function call overhead.

Unfortunately, I think any solution will inevitably entail virtual
function call overhead for every append call on every contiguous
subobject of x. All the more reason to hash the largest possible
contiguous ranges!


Do you think this issue warrants mention? Do you have any other ideas
as to how to handle it?

John Bytheway

Daniel James

Apr 2, 2014, 10:04:51 PM
to std-pr...@isocpp.org
On 3 April 2014 00:09, Howard Hinnant <howard....@gmail.com> wrote:
> We're aiming at the pre-Rapperswil mailing. Any feedback, positive or negative, will be much appreciated. The paper currently lacks proposed wording. It is more of a technology demonstration. There's a co-author spot open for anyone who wants to turn the demonstration into concrete proposed wording. :-)

I agree that hash_combine shouldn't be adopted; it has some problems
(e.g. no initialisation or finalisation of the hash value, limited
state). Since I think you're comparing various hash functions with the
boost implementation, it might be relevant that I'm working on a new
version, but I haven't found much time for it and am constrained by API
compatibility with previous versions. But it should be a bit better.

That said, using a faster, low-quality hash function is entirely in
line with the current standard. If higher-quality hash functions are
wanted, the standard really should say so. There's no requirement
about how hash values for unequal values are distributed. As it is,
unordered containers can only assume the very basic hash function
requirements that are stated in the standard, so they need to put in
extra effort to deal with poor hash functions. If a slower, higher
quality hash function is used, then the cost has to be paid twice -
both in the hash function and in the container.

A problem with Hasher as it's currently defined is that it doesn't
take into account the beginning and end of lists. For example, the
list [[1,2],[3]] will hash to the same value as [[1],[2,3]] and a list
of empty lists will always hash to the same value regardless of its
length.
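The collision is easy to demonstrate with a byte-recording stand-in for the hasher (illustrative code, not the paper's): a leaf-only hash_append presents identical byte streams for both lists.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Recording "hasher": instead of mixing bytes it stores them, making it
// obvious when two values present identical byte streams to any hasher.
struct byte_recorder
{
    std::vector<unsigned char> bytes;
    void operator()(void const* key, std::size_t len)
    {
        auto p = static_cast<unsigned char const*>(key);
        bytes.insert(bytes.end(), p, p + len);
    }
};

template <class Hasher>
void hash_append(Hasher& h, int i)
{
    h(&i, sizeof i);
}

// Deliberately naive: hashes only the leaf elements, with no size or
// delimiter appended, which is the behaviour being pointed at here.
template <class Hasher, class T>
void hash_append(Hasher& h, std::vector<T> const& v)
{
    for (auto const& e : v)
        hash_append(h, e);
}
```

With this, `{{1,2},{3}}` and `{{1},{2,3}}` record the exact same byte stream, so every downstream hash algorithm collides on them.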

It would be desirable to be able to use randomized hash functions,
such as SipHash. These would need to be constructed from values stored
in the hash function object, and so might not be default constructible;
therefore I don't think Hasher should be required to be default
constructible.

I dislike using a cast to get the hash value from the Hasher. It seems
like an operation that should have a name, especially because many
hash functions have a finalisation stage, so they're doing something
more than a conversion. Since it's an explicit conversion, it's often
more verbose to call 'static_cast' than to call a member function
anyway. And I feel like a conversion confuses two different types.

It would also be nice if the finalisation function could return types
other than std::size_t. In some circumstances a 64-bit hash value can
be expensive and unnecessary (the standard containers use std::size_t,
but this more general hash function could have other uses). Similarly,
this mechanism could be used for stronger hash functions with much
larger hash values.

A type that is contiguously hashable can probably also use a binary
comparison for equality checks, but I don't think that's strictly a
consequence, so it would be better to have a slightly stronger trait
which indicates: "x==y if and only if memcmp(addressof(x), addressof(y),
sizeof(T)) == 0". It could be called something like
contiguously_equality_comparable.
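A sketch of such a trait (the name is the suggestion above; the opt-ins shown are illustrative):

```cpp
#include <cassert>
#include <type_traits>

// Promise: x == y if and only if
// memcmp(addressof(x), addressof(y), sizeof(T)) == 0.
// Conservatively false by default; types must opt in explicitly.
template <class T>
struct contiguously_equality_comparable : std::false_type {};

// Padding-free integer types qualify.
template <>
struct contiguously_equality_comparable<int> : std::true_type {};
template <>
struct contiguously_equality_comparable<unsigned> : std::true_type {};

// float must NOT opt in: -0.0f == 0.0f, yet their object representations
// differ, so float can be contiguously hashable on many platforms while
// still failing this stronger equality property.
```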

Is it possible that there will be ambiguity if 'hash_append' is
overloaded separately for a Hasher and a type?

Sebastian Gesemann

Apr 3, 2014, 6:46:57 AM
to std-pr...@isocpp.org
As for hash functions I would like to point out another option:
SipHash. It is fast but "cryptographic enough" to be able to avoid
hash collision attacks. If any of you think about using
unordered_map for some service where an attacker can choose the keys for
values to store, it should be hard for an attacker to make the
unordered_map degenerate to a linked list. One way of avoiding this is
to use SipHash with a randomly chosen key.

https://131002.net/siphash/

SipHash is getting very popular; it's already used in Perl,
Python, Redis, Ruby, Rust, systemd, OpenDNS, Haskell ...
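A minimal sketch of the randomly keyed defence (the mixer below is a seeded FNV-1a placeholder, NOT the real SipHash rounds):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>

// Keyed hasher: the key is chosen randomly once per process, so an
// attacker cannot precompute colliding inputs offline. The mixing step
// is FNV-1a seeded with the key -- a stand-in for the SipHash algorithm.
struct keyed_hash
{
    static std::uint64_t key()
    {
        static const std::uint64_t k =
            (std::uint64_t(std::random_device{}()) << 32)
            | std::random_device{}();
        return k;
    }

    std::size_t operator()(std::string const& s) const
    {
        std::uint64_t h = key() ^ 14695981039346656037ull;
        for (unsigned char c : s)
            h = (h ^ c) * 1099511628211ull;
        return static_cast<std::size_t>(h);
    }
};
```

Dropping it into a container is then just `std::unordered_map<std::string, int, keyed_hash>`; hash values stay stable within a process but differ across runs.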

Daniel James

Apr 3, 2014, 7:18:23 AM
to std-pr...@isocpp.org
On 3 April 2014 11:46, Sebastian Gesemann <s.ges...@gmail.com> wrote:
> As for hash functions I would like to point out another option:
> SipHash.

FWIW I implemented a SipHash example a while ago, but didn't release
it because I wasn't happy with the implementation and didn't feel
enthusiastic enough to improve it:

https://github.com/boostorg/unordered/tree/0f080552fa6c14076938b2a2aa68635d61cca8a0/examples/siphash

But I think the strength of this proposal is that it supports a
generic mechanism for plugging in different hash functions, so it
should support both a randomized hash function and a more traditional
one, and not really specify the exact algorithm; a lot of the popular
hashing algorithms aren't portable across processor types.

Andrzej Krzemieński

Apr 3, 2014, 10:06:02 AM
to std-pr...@isocpp.org

This technique of adapting types to the framework appears to me to be similar to what Boost.Serialization does. I wonder if it could be generalized to things beyond hashing, like serialization. 

Regards,
&rzej

rhalb...@gmail.com

Apr 3, 2014, 3:42:37 PM
to std-pr...@isocpp.org
On Thursday, April 3, 2014 4:06:02 PM UTC+2, Andrzej Krzemieński wrote:
> This technique of adapting types to the framework appears to me to be similar to what Boost.Serialization does. I wonder if it could be generalized to things beyond hashing, like serialization.
>
> Regards,
> &rzej

Both hashing/serialization extract the bit sequence. A similar thing would be to allow to insert the bit sequence. This allows randomly generating any class X (e.g. to stress test the very hash function you are interested in!). 

So perhaps one could define two lower-level functions bit_insert() and bit_extract() [or maybe names like bit_push_back()/bit_pop_front() to reflect FIFO character of bit sequence?] on top of which higher-level functionality like hash_append, serialize() and random_generate() could be written.

Rein 

Alex B

Apr 3, 2014, 4:03:20 PM
to std-pr...@isocpp.org, rhalb...@gmail.com
I'm not sure that hashing and serialization can be combined so simply in a common interface.
 
For serialization, when extracting a sequence container, you usually need to first extract the size of the container.
For hashing, it is probably not desirable to hash_append the size of the container.

Richard Smith

Apr 3, 2014, 4:31:21 PM
to std-pr...@isocpp.org, rhalb...@gmail.com
On Thu, Apr 3, 2014 at 1:03 PM, Alex B <deva...@gmail.com> wrote:
> I'm not sure that hashing and serialization can be combined so simply in a common interface.
>
> For serialization, when extracting a sequence container, you usually need to first extract the size of the container.
> For hashing, it is probably not desirable to hash_append the size of the container.

As has been demonstrated already, leaving out the size when hashing a container results in a bad hash in this model; [ [1], [2, 3] ] and [ [1, 2], [3] ] would hash the same.
 

Andrew Tomazos

Apr 3, 2014, 6:10:57 PM
to std-pr...@isocpp.org
Hi Howard,

I did quickly read your proposal and would like to share a couple of ideas:

- In cryptography a secure hash function has a well-defined mathematical definition.  Roughly, if we define a true random hash function as a function that contains an infinite table with an independent true random integer value in the range [0..2^n) assigned to every possible bit string {0, 1, 00, 01, 10, 11, 000, 001, ...}, then a secure hash function is a function that cannot be distinguished from a true random hash by an attacker in polynomial time of n (except with "negligible" probability over guessing).  I believe it can be shown that for compound sequence types (such as arrays, vectors, lists, tuples and simple "struct" classes), applying a secure hash function to the bit string formed by concatenating the results (hash codes) of a secure hash function applied to its subobjects is itself a secure hash function.  This addresses the [[1], [2,3]] vs [[1,2], [3]] issue.  I think a similar approach can be argued less rigorously for "nearly" secure hash functions.  It wasn't 100% clear from the proposal whether you have studied or considered this idea.

- We are working on a reflection feature that will allow you to enumerate and visit the member sub-objects of a class type given as a type template parameter.  Interesting to think about would be rather than defining a separate "hash visitor" per type, you could instead define the visitor once in the program such that it defaults to a memberwise hash composition.  This is most likely orthogonal to what you are proposing, but worth pondering.

Regards,
Andrew.





Howard Hinnant

Apr 3, 2014, 8:24:34 PM
to std-pr...@isocpp.org
Thanks Sean. I’ll give this more thought, but I’m not immediately coming up with an idea for an API. If we had ranges (or when we get ranges), I can easily see that range<T> should do the right thing. Perhaps it is best to just wait for ranges to settle before proceeding on this point. In the meantime, if people use vector (or whatever std::container), they can just pass the whole vector to the Hasher and it will do the right thing.

Howard

Howard Hinnant

Apr 3, 2014, 8:27:01 PM
to std-pr...@isocpp.org
On Apr 2, 2014, at 10:02 PM, John Bytheway <jbyt...@gmail.com> wrote:

> On 2014-04-02 19:09, Howard Hinnant wrote:
>> http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html
>>
>> We're aiming at the pre-Rapperswil mailing. Any feedback, positive
>> or negative, will be much appreciated. The paper currently lacks
>> proposed wording. It is more of a technology demonstration. There's
>> a co-author spot open for anyone who wants to turn the demonstration
>> into concrete proposed wording. :-)
>
> An excellent presentation.
>
> I see no mention of how to handle types whose hash-worthy content is not
> visible in the header (pimpl'ed types, etc.). For these a templated
> hash_append function is impossible and I guess you would need some sort
> of type-erased hasher.
>
> These types of objects are less likely to be used as keys, but I think
> they ought to be supported.
>
> One way to support this nicely would be to use operator() in place of
> append, and then std::function<void(void const*, size_t)> would work as
> a type-erased hasher.

This is a most interesting suggestion, thanks. The authors have actually been debating such a signature, but purely on aesthetic grounds. Your argument gives heavy weight to the use of operator() on *functional* grounds. That is most significant, and most appreciated. Thanks!

Howard


Alex B

Apr 3, 2014, 8:53:27 PM
to std-pr...@isocpp.org
On Thu, Apr 3, 2014 at 4:31 PM, Richard Smith <ric...@metafoo.co.uk> wrote:
> On Thu, Apr 3, 2014 at 1:03 PM, Alex B <deva...@gmail.com> wrote:
>> I'm not sure that hashing and serialization can be combined so simply in a common interface.
>>
>> For serialization, when extracting a sequence container, you usually need to first extract the size of the container.
>> For hashing, it is probably not desirable to hash_append the size of the container.
>
> As has been demonstrated already, leaving out the size when hashing a container results in a bad hash in this model; [ [1], [2, 3] ] and [ [1, 2], [3] ] would hash the same.
 

You will only get a bad hash when you have containers of containers. If you want the hash of a single container/string, there is no need to combine/append the size of the container.

For pair<string, string>, you only need to combine/append the size of the first string, not both of them.

For vector<string>, you need the size of the size()-1 first strings (no need for last one). No need for the size of the vector.

But I admit that generalizing and expressing this (that the size is needed in some contexts but not all) with clean code design can be quite a challenge, and it is very tempting to just always use the size. But it would be a bit sad, considering that one of the most wanted hashable types is a simple std::string, which does not need the size in the hash (when taken alone).
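This boundary argument can be made concrete with a byte-recording stand-in for the hasher (illustrative code, not the paper's): appending just the first string's size already separates the two pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Recording "hasher": stores the bytes it is fed, so two values hash
// alike under any algorithm exactly when their recorded streams match.
struct byte_recorder
{
    std::vector<unsigned char> bytes;
    void operator()(void const* key, std::size_t len)
    {
        auto p = static_cast<unsigned char const*>(key);
        bytes.insert(bytes.end(), p, p + len);
    }
};

// Only the first string's size is appended: it marks the boundary
// between the two strings; the second string's size (and the pair's)
// add no information.
template <class Hasher>
void hash_append(Hasher& h, std::pair<std::string, std::string> const& p)
{
    std::size_t n = p.first.size();
    h(p.first.data(), n);
    h(&n, sizeof n);
    h(p.second.data(), p.second.size());
}
```

With the boundary encoded, ("ab","c") and ("a","bc") present different byte streams even though their concatenated characters are identical.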

Howard Hinnant

Apr 3, 2014, 9:33:47 PM
to std-pr...@isocpp.org
On Apr 2, 2014, at 10:04 PM, Daniel James <dnl...@gmail.com> wrote:

> A problem with Hasher as it's currently defined is that it doesn't
> take into account the beginning and end of lists. For example, the
> list [[1,2],[3]] will hash to the same value as [[1],[2,3]] and a list
> of empty lists will always hash to the same value regardless of its
> length.

Thanks for pointing out these issues.

The first issue looks like two different types in C++: list [[1,2],[3]] and list [[1],[2,3]]. There are two different classes of clients to hash_append:

1. The unordered containers which will be hashing types of only one kind (e.g. X, not Y). If hash(X) == hash(Y), such containers shouldn’t care.

2. Type Z, composed of X and Y, and is implementing its hash_append by hash_append’ing hash_append(X) and hash_append(Y). The author of Z may or may not be concerned that the hash of X and the hash of Y could be the same for degenerate cases. If the author of Z does have such concerns, we can easily deal with such issues in the hash_append for Z, for example:

template <class Hasher>
void
hash_append(Hasher& h, const Z& z)
{
    using std::hash_append;
    hash_append(h, "list 1", z.list1_);
    hash_append(h, "list 2", z.list2_);
}

This seems like the ideal: "pay for the feature only if you need it" solution. unordered_set<X> doesn’t care that it may hash(X) with collisions generated by unordered_set<Y>, and so should not suffer any extra hashing expense. Clients like Z may indeed care. And can very easily add whatever extra attributes (and expense) to meet their needs.

The second issue is similar: an empty list of lists. Here we only have one type, e.g. list<list<T>>. And so unordered_set<list<list<T>>> might rightly be concerned about this. However only the author of the instantiation unordered_set<list<list<T>>> can truly know if empty lists are a case of concern or not. And if so, can easily prescribe the minimal fix in exactly the right place for their problem domain:

template <class Hasher = spooky>
struct my_container_hash
{
    template <class Container>
    std::size_t
    operator()(Container const& c) const noexcept
    {
        Hasher h;
        using std::hash_append;
        hash_append(h, c.size(), c);
        return static_cast<std::size_t>(h);
    }
};



std::unordered_set<list<list<T>>, my_container_hash<>> my_container;

Now we have an especially elegant solution to the problem: We happily pay the price for prepending the hash of the outer container size, but have no desire/motivation to pay that price for the inner container size. Again, everybody ends up happy: paying for only as much as they need.

> It would be desirable to be able to use randomized hash functions,
> such as sip hash.

Agreed. We are currently using a hash function that takes a seed:

template <class T, class Hasher = detail::spooky_wrapper>
class hardened_hash
    : public detail::hardened_hash_base<std::size_t>
{
    typedef detail::hardened_hash_base<std::size_t> base;
public:
    typedef T argument_type;
    using detail::hardened_hash_base<std::size_t>::result_type;

public:
    hardened_hash() = default;
    explicit hardened_hash(result_type seed)
        : base(seed)
    {
    }

    result_type
    operator()(argument_type const& key) const noexcept
    {
        Hasher h{base::seed()};
        hash_append(h, key);
        return static_cast<result_type>(h);
    }
};

It is not currently clear to us if this API should be standardized, or if it should be left up to the programmer to provide this API if wanted. Shown above we have templated it on T. But this actually is not necessary, and could be simplified to:

template <class Hasher = my_favorite_hasher>
class hardened_hash
{
    // store seed here, expose via seed()...
public:
    typedef std::size_t result_type;

    hardened_hash() = default;
    explicit hardened_hash(result_type seed);

    template <class T>
    result_type
    operator()(T const& key) const noexcept
    {
        Hasher h{seed()};
        hash_append(h, key);
        return static_cast<result_type>(h);
    }
};

If we don’t standardize this, hardened_hash demonstrates that it is still easily available to the programmer with O(1) effort.

> I dislike using a cast to get the hash value from the Hasher.


This is an admittedly experimental interface.

> It would also be nice if the finalisation function could return types
> other than std::size_t.

I am interested in exploring this possibility as well.

> A type that is contiguously hashable can probably also use a binary
> comparison for equality checks, but I don't that's strictly a
> consequence, so it be better to have a slightly stronger trait which
> indicates: "x==y if and only if memcmp(addressof(x), addressof(y),
> sizeof(T)) == 0". Could be called something like
> contiguously_equality_comparable.

We will give this more thought.

> Is it possible that there will be ambiguity if 'hash_append' is
> overloaded separately for a Hasher and a type?


We are hoping there is zero motivation to do so. The goal is that hash_append is ignorant of the Hasher.

Howard

Howard Hinnant

Apr 3, 2014, 9:34:55 PM
to std-pr...@isocpp.org
On Apr 3, 2014, at 6:46 AM, Sebastian Gesemann <s.ges...@gmail.com> wrote:

> As for hash functions I would like to point out another option:
> SipHash. It is fast but "cryptographic enough" to be able to avoid
> hash collision attacks. If anyone of you thinks about using
> unordered_map for some service where an attacker can choose keys for
> values to store, it should be hard for an attacker to let the
> unordered_map degenerate to a linked list. One way of avoiding this is
> to use SipHash with a randomly chosen key.
>
> https://131002.net/siphash/
>
> So far, SipHash is getting very popular. It's already used in Perl,
> Python, Redis, Ruby, Rust, systemd, OpenDNS, Haskell …

I agree SipHash looks very attractive. I’ve only glanced at it, but I believe it can easily be cast to meet the requirements of a Hasher.

Howard

Howard Hinnant

Apr 3, 2014, 9:41:36 PM
to std-pr...@isocpp.org
On Apr 3, 2014, at 8:53 PM, Alex B <deva...@gmail.com> wrote:

> For pair<string, string>, you only need to combine/append the size of the first string, not both of them.
>

Actually for pair<string, string> you do not need the size for either. The proposal proposes to hash the trailing null of a string, even for empty strings. This is done to make it compatible (exactly the same) with the hash of string literals, which are always composed of at least one byte.
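The convention can be checked with a byte-recording stand-in for the hasher (illustrative overloads, not the proposal's exact code):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Records the bytes fed to it: equal streams mean equal hashes under
// any downstream algorithm.
struct byte_recorder
{
    std::vector<unsigned char> bytes;
    void operator()(void const* key, std::size_t len)
    {
        auto p = static_cast<unsigned char const*>(key);
        bytes.insert(bytes.end(), p, p + len);
    }
};

// std::string exposes its characters plus the trailing null...
template <class Hasher>
void hash_append(Hasher& h, std::string const& s)
{
    h(s.c_str(), s.size() + 1);  // +1: the trailing null, even when empty
}

// ...which matches a string literal exactly: an N-element char array
// already counts its terminating null in N.
template <class Hasher, std::size_t N>
void hash_append(Hasher& h, const char (&lit)[N])
{
    h(lit, N);
}
```

So `std::string("abc")` and the literal `"abc"` present the same four bytes, and even an empty string contributes its one null byte.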

Howard

Felipe Magno de Almeida

Apr 3, 2014, 9:44:15 PM
to std-pr...@isocpp.org
Wouldn't these two clash?

"\0" "abc"
"" "\0abc"


Regards,
--
Felipe Magno de Almeida

Howard Hinnant

Apr 3, 2014, 9:45:19 PM
to std-pr...@isocpp.org
On Apr 3, 2014, at 6:10 PM, Andrew Tomazos <andrew...@gmail.com> wrote:

> I think a similar approach can be argued less rigorously with "nearly" secure hash functions. It wasn't 100% clear if you have studied or considered this idea from the proposal.

This work was motivated by a desire to let the programmer apply hash function X, where the programmer gets to choose X, and not have it compromised by a combining step. Algorithm X may be motivated by cryptography, or efficiency, or possibly designed for a known finite set of inputs (a special-case, non-secure, perfect hash).

Howard

Howard Hinnant

Apr 3, 2014, 10:07:16 PM
to std-pr...@isocpp.org
> "" "\0abc"

Absolutely yes. So will these:

make_pair(1u, 2)
make_pair(1, 2u)

And so will these:

string("abc") + string("def")
string("abcdef")

Note in this latter case that also:

string("abc") + string("def") == string("abcdef")

and in the pair case, it would be reasonable to amend the standard such that:

make_pair(1u, 2) == make_pair(1, 2u)

In all cases you’ve got two (or more) values (maybe the same type, maybe not) and when their hashable states are concatenated, create the same contiguous stream of bytes. This *is* the feature.

If a type X has a pair of such types, and does not want this behavior, such a type can easily insert unique bytes between these embedded types:

template <class Hasher>
void
hash_append(Hasher& h, const X& x)
{
    using std::hash_append;
    hash_append(h, "string 1", x.string1_);
    hash_append(h, "string 2", x.string2_);
}

The point of this system is that it puts the class author in complete control of the hashed state of the object, without making the class author commit to any particular hashing algorithm. He only needs to write the hashing support code once. And once done, his clients can hash his object with any algorithm of *their* choosing (not his choosing).

Client of type X’s responsibility: Choose a generic hashing algorithm to hash X with.

Author of type X’s responsibility: Decide what bytes to present in X to an unknown hashing algorithm (and in what order).

Hashing algorithm author’s responsibility: Decide how to hash a stream of bytes.

None of these 3 programmers need coordinate with the other, beyond conforming to the universal requirements of Hasher and hash_append.
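That three-way split can be shown end-to-end in a few lines (the toy hasher and names are illustrative, not the proposal's exact interface):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hashing-algorithm author's side: a byte-stream hasher (toy FNV-1a).
struct fnv1a
{
    std::uint64_t state = 14695981039346656037ull;
    void operator()(void const* key, std::size_t len) noexcept
    {
        auto p = static_cast<unsigned char const*>(key);
        for (std::size_t i = 0; i != len; ++i)
            state = (state ^ p[i]) * 1099511628211ull;
    }
    explicit operator std::size_t() const noexcept
    {
        return static_cast<std::size_t>(state);
    }
};

template <class Hasher>
void hash_append(Hasher& h, int i)
{
    h(&i, sizeof i);
}

// Class author's side: decide which bytes X presents, and in what
// order, to an unknown hashing algorithm.
class X
{
    int a_ = 1;
    int b_ = 2;
public:
    template <class Hasher>
    friend void hash_append(Hasher& h, X const& x)
    {
        hash_append(h, x.a_);
        hash_append(h, x.b_);
    }
};

// Client's side: pick a concrete algorithm for the generic functor.
template <class Hasher = fnv1a>
struct generic_hash
{
    template <class T>
    std::size_t operator()(T const& t) const noexcept
    {
        Hasher h;
        hash_append(h, t);
        return static_cast<std::size_t>(h);
    }
};
```

Swapping in a different algorithm is just `generic_hash<other_hasher>`; neither X nor its hash_append changes.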

Howard

Howard Hinnant

Apr 4, 2014, 12:10:15 AM
to std-pr...@isocpp.org
I believe this discussion has helped me realize one further requirement:

A type X should document how it exposes itself to a hash function. For example it is important for the class X author to know that its member data vector<int> simply concatenates the hashable states of all those ints. And that its member string simply concatenates the state of all those chars.

The type should clearly document how it exposes its state to the Hasher, so that other types which embed it, can better decide how they want to expose themselves to a Hasher.

For example if some type Person, which contains two std::strings, wants to expose itself to a Hasher, it is important for Person to know if string exposes its length or not to the Hasher. That way Person can make an informed decision as to whether or not it wants to prepend (for example) extra data to either of its two std::strings while exposing itself to the hasher (such as the string’s size, or the color of its passport).

Such a decision is so much better made by Person, than it is by std::string. If std::string did not clearly document what it exposed, then Person could not portably decide how to expose / amend its two strings.

And even though all these types publicly disclose what bytes they are sending to the hashing algorithm, the hashing algorithm can still be just as secure / reliable. After all, the hashing algorithms simply say: give me your bytes unscrambled: I will scramble them for you. That’s my job, not yours.

Howard

Bjorn Reese

Apr 4, 2014, 7:48:00 AM
to std-pr...@isocpp.org
On 04/03/2014 04:06 PM, Andrzej Krzemieński wrote:

> This technique of adapting types to the framework appears to me to be
> similar to what Boost.Serialization
> <http://www.boost.org/doc/libs/1_55_0/libs/serialization/doc/index.html>
> does. I wonder if it could be generalized to things beyond hashing, like
> serialization.

What a wonderful observation.

I just wrote a small hashing archive for Boost.Serialization, and it
worked pretty much out-of-the-box. Class X then becomes:

class X
{
    std::tuple<short, unsigned char, unsigned char> date_;
    std::vector<std::pair<int, int>> data_;

public:
    X();

    template <typename T>
    void serialize(T& archive, const unsigned int)
    {
        archive & date_;
        archive & data_;
    }
};

The question of how data items are combined can be delegated to
specializations of the non-intrusive serialize() functions.

namespace boost { namespace serialization {

template <typename Archive, typename T1, typename T2>
void save(Archive& archive,
          const std::pair<T1, T2>& data,
          const unsigned int /* version */)
{
    archive << data.first;
    archive << data.second;
}

// And a serialize() function that calls split_free
}}

Now I can calculate the hash of class X using:

X x;
hasher::oarchive<hasher::fnv1a> archive;
archive << x;
std::size_t result = archive.get();

And I can serialize the same class to JSON without any changes to X:

X x;
std::ostringstream output;
json::oarchive archive(output);
archive << x;
std::string result = output.str();

dixl...@gmail.com

Apr 4, 2014, 1:28:18 PM
to std-pr...@isocpp.org
I like the idea; I think it could be even more generic and be templated on the resulting hash type.

When specialising std::hash the resulting type obviously has to be std::size_t; but with this concept the Hasher could actually return a std::array<unsigned char, 16> (MD5), std::array<unsigned char, 20> (SHA1) etc... Such a Hasher wouldn't be directly usable for a std::unordered_map any more, but it would open up different uses for it without (AFAICT) affecting the normal use with Hashers that return a std::size_t.

Sean Middleditch

Apr 4, 2014, 7:38:03 PM
to std-pr...@isocpp.org
On Fri, Apr 4, 2014 at 10:28 AM, <dixl...@gmail.com> wrote:
> I like the idea; I think it could be even more generic and be templated on
> the resulting hash type.

The general interface has no reliance on the result type. The types
being hashed operate on a template "hasher" type with no assumptions
about what the hasher ultimately produces. The example also shows that
a static_cast<size_t>() is used at least in one case, indicating that
a hasher could provide operator size_t(), operator std::array<char, 16>(),
or whatever else it likes. I'm not seeing any reason that a
hasher with a result type incompatible with std::hash (no size_t
output) couldn't be used with the general hash_append machinery. I
think this design already satisfies your needs here.

Geoffrey Romer

Apr 5, 2014, 1:41:45 AM
to std-pr...@isocpp.org

It might be helpful if the paper explicitly discussed how it relates to N3333 and N3898, e.g. where any points of agreement or disagreement are, etc.

dixl...@gmail.com

Apr 5, 2014, 3:19:44 AM
to std-pr...@isocpp.org
> The general interface has no reliance on the result type. The types
> being hashed operate on a template "hasher" type with no assumptions
> about what the hasher ultimately produces. The example also shows that
> a static_cast<size_t>() is used at least in one case, indicating that
> a hasher could provide operator size_t(), operator std::array<char,
> 16>(), or whatever else it likes.

Exactly, it doesn't matter what the hasher produces, uhash will always result in a std::size_t. That is my point, uhash could be templated on the result type and then the user could decide if he wants the "actual" hash result, a std::size_t or whatever. As it stands, if I had an MD5 hasher I would have to replicate uhash::operator() to get an actual MD5 hash value; if uhash was templated on the result type I could reuse uhash everywhere for every hash function.
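[A sketch of the generalization being suggested here. The toy fnv1a hasher, its constants, and the extra Result parameter are illustrative assumptions, not taken from the paper; the paper's uhash is hard-wired to std::size_t.]

```cpp
#include <cstddef>
#include <cstdint>

// Toy FNV-1a hasher in the style of the paper's examples (illustrative).
class fnv1a
{
    std::uint64_t state_ = 14695981039346656037ull;
public:
    void operator()(void const* key, std::size_t len) noexcept
    {
        unsigned char const* p = static_cast<unsigned char const*>(key);
        for (std::size_t i = 0; i < len; ++i)
            state_ = (state_ ^ p[i]) * 1099511628211ull;
    }
    explicit operator std::uint64_t() const noexcept { return state_; }
};

// Simplified hash_append for one contiguously hashable type.
template <class Hasher>
void hash_append(Hasher& h, int i)
{
    h(&i, sizeof(i));
}

// uhash generalized on the result type, as suggested above.  The default
// keeps the std::unordered_* use case unchanged; an MD5 hasher could
// instead be paired with Result = std::array<unsigned char, 16>.
template <class Hasher, class Result = std::size_t>
struct uhash
{
    template <class T>
    Result operator()(T const& t) const
    {
        Hasher h;
        hash_append(h, t);
        return static_cast<Result>(h);
    }
};
```

With this, `uhash<fnv1a>` behaves as in the paper, while `uhash<fnv1a, std::uint64_t>` extracts the hasher's full result type.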

David Krauss

unread,
Apr 5, 2014, 3:28:04 AM4/5/14
to std-pr...@isocpp.org
I’m working on a generalized alternative interface… hoped to post it yesterday… stay tuned…

Daniel James

unread,
Apr 5, 2014, 3:57:48 AM4/5/14
to std-pr...@isocpp.org
On 4 April 2014 02:33, Howard Hinnant <howard....@gmail.com> wrote:
> On Apr 2, 2014, at 10:04 PM, Daniel James <dnl...@gmail.com> wrote:
>
>> A problem with Hasher as it's currently defined is that it doesn't
>> take into account the beginning and end of lists. For example, the
>> list [[1,2],[3]] will hash to the same value as [[1],[2,3]] and a list
>> of empty lists will always hash to the same value regardless of its
>> length.
>
> Thanks for pointing out these issues.
>
> The first issue looks like two different types in C++: list [[1,2],[3]] and list [[1],[2,3]]. There are two different classes of clients to hash_append:

They were meant to be the same type. If it wasn't clear, I was using
JSON-style lists. In C++ it would be:

std::vector<std::vector<int>> x{{1,2},{3}};
std::vector<std::vector<int>> y{{1},{2,3}};

These both have the same type. The hash algorithm just hashes the leaf
elements, which are the same, so they'll always hash to the same value.

> 2. Type Z, composed of X and Y, and is implementing its hash_append by hash_append'ing hash_append(X) and hash_append(Y). The author of Z may or may not be concerned that the hash of X and the hash of Y could be the same for degenerate cases. If the author of Z does have such concerns, we can easily deal with such issues in the hash_append for Z, for example:
>
> template <class Hasher>
> void
> hash_append(Hasher& h, const Z& z)
> {
> using std::hash_append;

It might be better to use a 'boost::swap' style mechanism to call
hash_append in order to avoid requiring 'using std::hash_append' which
is easily forgotten.

> hash_append("list 1", z.list1_);
> hash_append("list 2", z.list2_);
> }
>
> This seems like the ideal: "pay for the feature only if you need it" solution.

If an attacker knew you used that method, they could possibly force
collisions for that algorithm. For example with strings, I think
{"","list 2\0"} and {"\0list 2", ""} will hash to the same value.

If there's a choice between an unsafe, fast method and a safe, but
slower method, IMO the default should be the safe method. The
processing requirement is not that large.

>unordered_set<X> doesn't care that it may hash(X) with collisions generated by unordered_set<Y>

I only wrote about problems when hashing the same type.

> The second issue is similar: an empty list of lists.

The problem is for a list containing empty lists. Appending them is a
no-op, so it doesn't alter the hash value. For example, these two are the
same length and compare unequal, but will hash to the same value:

std::list<std::list<int> > x{{},{1},{},{2}};
std::list<std::list<int> > y{{1},{},{},{2}};
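[A quick sketch of why appending each list's size, as the paper proposes, disambiguates this example. The byte-recording "hasher" below is purely illustrative: two values collide exactly when they send identical messages.]

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Stand-in "hasher" that just records the message bytes (illustrative).
struct recorder
{
    std::vector<unsigned char> bytes;
    void operator()(void const* key, std::size_t len)
    {
        unsigned char const* p = static_cast<unsigned char const*>(key);
        bytes.insert(bytes.end(), p, p + len);
    }
};

template <class Hasher>
void hash_append(Hasher& h, int i) { h(&i, sizeof(i)); }

template <class Hasher>
void hash_append(Hasher& h, std::size_t s) { h(&s, sizeof(s)); }

// Hash each element, then append the size, per the paper's suggestion.
template <class Hasher, class T>
void hash_append(Hasher& h, std::list<T> const& c)
{
    for (T const& e : c)
        hash_append(h, e);
    hash_append(h, c.size());
}
```

With the sizes appended, the inner empty lists each contribute a size message, so the two lists above no longer send identical byte sequences.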

>> It would be desirable to be able to use randomized hash functions,
>> such as sip hash.
>
> Agreed. We are currently using a hash function that takes a seed:
[snip]
> hardened_hash() = default;

This default constructor is problematic. It's very easy to
accidentally use a default constructed hash function, so in order to
be safe it should generate a random hash, which would be surprising
and potentially expensive. My point here was that the default
constructor shouldn't be required.

Daniel James

unread,
Apr 5, 2014, 4:05:17 AM4/5/14
to std-pr...@isocpp.org
On 4 April 2014 12:48, Bjorn Reese <bre...@mail1.stofanet.dk> wrote:
> On 04/03/2014 04:06 PM, Andrzej Krzemieński wrote:
>
>> This technique of adapting types to the framework appears to me to be
>> similar to what Boost.Serialization
>> <http://www.boost.org/doc/libs/1_55_0/libs/serialization/doc/index.html>
>>
>> does. I wonder if it could be generalized to things beyond hashing, like
>> serialization.
>
>
> What a wonderful observation.

If that's done, there will still need to be a separate hook for
creating a different hash function as serialization and hashing are
often concerned with different things. For example, a unicode string
might want to use a normalized form for comparison, but serialize the
original encoding to be lossless. Another example is an object which
caches its hash value so it doesn't have to be recomputed every time.

My worry here is that someone wouldn't consider a particular use case
when implementing the general function. Later someone else would use
it and assume that it'd work as there are no compile errors. So they
might find that their elements in an unordered_set aren't unique, or
that they have lossy serialization. I'm not sure how likely that is.

Bjorn Reese

unread,
Apr 5, 2014, 8:13:04 AM4/5/14
to std-pr...@isocpp.org
On 04/05/2014 10:05 AM, Daniel James wrote:

> If that's done, there will still need to be a separate hook for
> creating a different hash function as serialization and hashing are
> often concerned with different things. For example, a unicode string

I agree completely with this. I will even raise the ante and claim that
we also need to distinguish between different types of serialization
(e.g. some formats use type-length-value to serialize containers
whereas others use delimiters), and perhaps even between different
types of hashing.

There is actually a need for two different variation points. One for
the user-defined X::serialize() member function as you mentioned, and
one for the non-intrusive counterparts used to serialize classes like
std::pair and std::vector.

The latter could have been handled easily if C++ had had partial
specialization for template functions. However, we can work around this
by delegating the serialization to a functor that does the partial
specialization. Here is an actual example:


http://sourceforge.net/p/protoc/code/ci/master/tree/include/protoc/json/serialization.hpp

http://sourceforge.net/p/protoc/code/ci/master/tree/include/protoc/json/map.hpp

Regarding the first variation point, there are several potential
solutions. For example, Boost.Serialization already has a way of
distinguishing between the direction of serialization (saving and
loading) via its split functionality:

http://www.boost.org/libs/serialization/doc/tutorial.html#splitting

This could be generalized somehow, although that would not be my
favorite solution.

Another solution is to overload X::serialize() on the archive type. For
instance:

class X
{
    std::tuple<short, unsigned char, unsigned char> date_;
    std::vector<std::pair<int, int>> data_;

public:
    X();

    // Catch all
    template<typename T>
    void serialize(T& archive, const unsigned int)
    {
        archive & date_;
        archive & data_;
    }

    // Specialization for hashing
    template<typename T>
    void serialize(hasher::oarchive<T>& archive, const unsigned int)
    {
        archive & date_;
        // data_ is not hashed
    }
};

A third solution would be to use tag-based dispatching.

Chandler Carruth

unread,
Apr 5, 2014, 12:42:16 PM4/5/14
to std-pr...@isocpp.org
On Wed, Apr 2, 2014 at 4:09 PM, Howard Hinnant <howard....@gmail.com> wrote:
http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html

We're aiming at the pre-Rapperswil mailing.  Any feedback, positive or negative, will be much appreciated.

As Geoffrey Romer has mentioned, I too want to emphasize that I think it is a waste of our time to look at further papers on hashing which fail to even respond to the several existing papers on hashing, or to the fairly extensive discussion on these matters that has taken place in the committee.

The last time this was discussed there seemed to be some consensus from the group that the direction forward would be to update N3333 and add support for consistent / predictable result hashes to it.

I don't see anything incompatible with N3333 and your paper, other than the fundamental problem that using an append method on a class as the fundamental mechanism of composition has serious optimization problems in applications. (Please note, these problems will *not* show up in benchmarks or other small tests. This was discussed somewhat extensively in Issaquah, so I'm hesitant to re-(forgive me)-hash it here.)

-Chandler

Howard Hinnant

unread,
Apr 5, 2014, 12:56:30 PM4/5/14
to std-pr...@isocpp.org
On Apr 5, 2014, at 12:42 PM, Chandler Carruth <chan...@googlers.com> wrote:

> As Geoffrey Romer has mentioned, I too want to emphasize that I think it is a waste of our time to look at further papers on hashing which fail to even respond to the several existing papers on hashing, or to the fairly extensive discussion on these matters that has taken place in the committee.
>
> The last time this was discussed there seemed to be some consensus from the group that the direction forward would be to update N3333 and add support for consistent / predictable result hashes to it.
>
> I don't see anything incompatible with N3333 and your paper other than the fundamental problem of using an append method on a class as the fundamental mechanism of composition has serious optimization problems in applications. (Please note, these problems will *not* show up in benchmarks or other small tests. This was discussed somewhat extensively in Issaquah, so I'm hesitant to re-(forgive me)-hash it here.)
>

My apologies. N3333 did not go unread by me. I shall insert the appropriate references.

Howard

Howard Hinnant

unread,
Apr 6, 2014, 5:12:52 PM4/6/14
to std-pr...@isocpp.org
Updated version with many suggestions from this discussion implemented:

http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html

If I have mis-credited or failed to credit anyone, please let me know. Further constructive criticism welcome. Thanks to all who have helped us make this a much better paper.

Howard

Richard Smith

unread,
Apr 6, 2014, 7:47:51 PM4/6/14
to std-pr...@isocpp.org
It might be useful to explain how sending the size last satisfies Rule 2 -- without careful study, it might seem to be possible to construct two vector<vector<int>>s where the sizes of one are elements of the other. Further, it seems that this only works because your data blocks are self-delimiting when read from right to left. I think this also needs to be part of the documented contract, since using types that don't follow this rule (but do follow Rule 1 and Rule 2) can lead to hash collisions.

For instance, if I add MyVector<T> that's just like vector<T> but hashes the size first, then these collide:

typedef std::pair<vector<int>, MyVector<int>> P;
P x = { {0}, {} };
P y = { {}, {0} };

(both of them give the hasher the input "0 1 0").
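[This collision can be reproduced concretely with a byte-recording stand-in for the hasher. For the "0 1 0" messages to be byte-identical, lengths and elements must hash with the same width, so the sketch below uses std::size_t elements; all names are illustrative.]

```cpp
#include <cstddef>
#include <vector>

// Records the raw message so collisions are directly observable.
struct recorder
{
    std::vector<unsigned char> bytes;
    void operator()(void const* key, std::size_t len)
    {
        unsigned char const* p = static_cast<unsigned char const*>(key);
        bytes.insert(bytes.end(), p, p + len);
    }
};

void append_word(recorder& h, std::size_t w) { h(&w, sizeof(w)); }

// vector-style: elements, then size (the paper's convention)
void append_postfix(recorder& h, std::vector<std::size_t> const& v)
{
    for (std::size_t w : v) append_word(h, w);
    append_word(h, v.size());
}

// MyVector-style: size, then elements
void append_prefix(recorder& h, std::vector<std::size_t> const& v)
{
    append_word(h, v.size());
    for (std::size_t w : v) append_word(h, w);
}
```

Hashing P x = { {0}, {} } as postfix-then-prefix and P y = { {}, {0} } the same way yields the identical message "0 1 0" in both cases.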

Howard Hinnant

unread,
Apr 6, 2014, 8:36:57 PM4/6/14
to std-pr...@isocpp.org

On Apr 6, 2014, at 7:47 PM, Richard Smith <ric...@metafoo.co.uk> wrote:

> On Sun, Apr 6, 2014 at 2:12 PM, Howard Hinnant <howard....@gmail.com> wrote:
>> Updated version with many suggestions from this discussion implemented:
>>
>> http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html
>>
>> If I have mis-credited or failed to credit anyone, please let me know. Further constructive criticism welcome. Thanks to all who have helped us make this a much better paper.
>>
> It might be useful to explain how sending the size last satisfies Rule 2 -- without careful study, it might seem to be possible to construct two vector<vector<int>>s where the sizes of one are elements of the other.

I don’t have a theoretical proof that there can be no collision. If anyone sees one, I would welcome it. I have not come up with a collision scenario beyond an attacker somehow inserting their own type+message as you show below.

> Further, it seems that this only works because your data blocks are self-delimiting when read from right to left.

You are speaking of the size as the delimiter?

> I think this also needs to be part of the documented contract, since using types that don't follow this rule (but do follow Rule 1 and Rule 2) can lead to hash collisions.

I fully agree in the strongest terms. I will update the paper to make this clear. The messages a type sends to hash_append are part of its public API. Other types will need to depend on the details of these messages to avoid collisions (e.g. is the container size prepended or appended?), as you note below:

> For instance, if I add MyVector<T> that's just like vector<T> but hashes the size first, then these collide:
>
> typedef std::pair<vector<int>, MyVector<int>> P;
> P x = { {0}, {} };
> P y = { {}, {0} };
>
> (both of them give the hasher the input "0 1 0”).

Thanks,
Howard

Geoffrey Romer

unread,
Apr 7, 2014, 5:50:58 PM4/7/14
to std-pr...@isocpp.org
The paper still doesn't give me a good sense of how it relates to N3333. In particular, it's confusing that the paper begins by posing the question "How do we write the hash function for X?", but doesn't consider N3333's answer to that question as one of the options, instead treating it as a sort of footnote to N3876. N3333 seems to be strictly superior to N3876 as far as your paper is concerned, and N3333 seemed to have stronger support in Issaquah, so N3333 would be a better primary point of comparison. As a related editorial comment, I'd suggest that when referring to hash_combine, you should be explicit about which paper's proposal you're referring to ("What about using the proposed std::hash_combine you rightly ask?" is one example where that's a problem).

The paper also doesn't address the optimization concerns Chandler raised in Issaquah, but to be fair I'm not sure how it could, since I'm not aware of any record of those concerns apart from some very terse notes in the meeting wiki.

As best I can tell, the primary point of disagreement between N3333 and your paper is the fact that N3333's API does not permit the hash algorithm to be customized independently of the per-type hashing logic (although it seems to me that it's overstating the case to say that it would "hard-code into the standard a single hash concatenation algorithm", since it does not dictate any particular algorithm). On the other hand, your proposal does allow such customization, but reportedly pays for it with substantially degraded performance. That being the case, it would probably be helpful to explicitly discuss the perceived benefits of making the library generic with respect to the hash algorithm. My hunch is that these benefits will mostly apply outside the core use case of hashing data for lookup in a transient, process-local hash table; if that's the case, you should also address N3333's proposal that std::hash and related APIs should focus solely on that use case.


On Sun, Apr 6, 2014 at 2:12 PM, Howard Hinnant <howard....@gmail.com> wrote:

jgot...@gmail.com

unread,
Apr 7, 2014, 11:36:38 PM4/7/14
to std-pr...@isocpp.org

I think you should standardize type_erased_hasher as well. It seems complex enough that this would be worthwhile. If it is standardized, you can replace the use of std::function and std::ref with a class that does the right thing and hides std::ref and std::function as implementation details.

class type_erased_hasher {
public:
    template <class Hasher>
    explicit type_erased_hasher(Hasher& hasher)
        : h_(std::ref(hasher))
    {}

    void operator()(const void* key, size_t len) {
        h_(key, len);
    }

private:
    std::function<void (const void*, size_t)> h_;
};




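[A self-contained sketch of how such a wrapper behaves, with the class above lightly adapted (std:: qualifications) and paired with a toy, illustrative fnv1a hasher. Because the constructor captures the hasher via std::ref, calls through the erased wrapper mutate the original hasher's state.]

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Toy hasher standing in for whatever Hasher the user supplies.
class fnv1a
{
    std::uint64_t state_ = 14695981039346656037ull;
public:
    void operator()(void const* key, std::size_t len) noexcept
    {
        unsigned char const* p = static_cast<unsigned char const*>(key);
        for (std::size_t i = 0; i < len; ++i)
            state_ = (state_ ^ p[i]) * 1099511628211ull;
    }
    explicit operator std::uint64_t() const noexcept { return state_; }
};

// The wrapper from the message above, reproduced so this compiles alone.
class type_erased_hasher
{
public:
    template <class Hasher>
    explicit type_erased_hasher(Hasher& hasher)
        : h_(std::ref(hasher))
    {}

    void operator()(void const* key, std::size_t len)
    {
        h_(key, len);  // forwards to the referenced hasher
    }

private:
    std::function<void (void const*, std::size_t)> h_;
};
```

Feeding the same bytes directly and through the wrapper leaves both hashers in the same state, since the wrapper holds a reference, not a copy.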
Christopher Jefferson

unread,
Apr 12, 2014, 5:30:58 AM4/12/14
to std-pr...@isocpp.org
When it comes to avoiding hash collisions, the only real way to prove
there are no collisions is to prove that the data fed into hash_append is
enough to uniquely reconstruct the original object. I would personally
be unhappy about a design where no implementation of hash_combine
could avoid collisions for (for example) a combination of standard
library classes/containers.

If you prefix with the size of containers rather than postfix, then it
is easy to prove there are no hash collision issues. The only problem
with doing this is forward_list and list, where you don't know the
size until you have reached the end of the object.

Another option would be to add a 'mark()' function to hash_append, for
use with variable-length objects. hash_append implementations which
didn't care about collisions could just ignore this function, ones
which did could use this in various ways to ensure collisions do not
occur.

This is not just made up on the page, this is what I use in my
personal hashing library. I personally implement 'mark()' to place a
randomly generated (at program start time) 32-bit int into the stream,
which I find provides an acceptably low (never occurring in practice)
level of collisions.

Chris

Howard Hinnant

unread,
Apr 12, 2014, 6:26:00 AM4/12/14
to std-pr...@isocpp.org
What is your easy proof that if you prefix with the size of containers rather than postfix, there can be no collisions? I don’t doubt that you are correct that there will be no collisions. But I fail to see how prefix vs postfix makes a difference.

Howard

Christopher Jefferson

unread,
Apr 13, 2014, 7:02:23 PM4/13/14
to std-pr...@isocpp.org


On 12 Apr 2014 11:25, "Howard Hinnant" <howard....@gmail.com> wrote:
>
> What is your easy proof that if you prefix with the size of containers rather than postfix, that there can be no collisions?  I don't doubt that you are correct that there will be no collisions.  But I fail to see how prefix vs postfix makes a difference.
>

The advantage of prefix is that first we read the length, then we know how many values we have to read to get to the end of the object, so we can reconstruct the original object easily from the values we feed to hash_append. Similarly with recursive objects, we still always read the length first, so we know how far to read. With postfix we are never sure if we are reading another value in the object, or the length at the end of it. Of course it might turn out even with postfix you can prove there is only one valid reading, but it is not so obvious to me.

Howard Hinnant

unread,
Apr 13, 2014, 8:21:46 PM4/13/14
to std-pr...@isocpp.org
On Apr 13, 2014, at 7:02 PM, Christopher Jefferson <ch...@bubblescope.net> wrote:

> The advantage of prefix is that first we read the length, then we know how many values we have to read to get to the end of the object, so we can reconstruct the original object easily from the values we feed to hash_append. Similarly with recursive objects, we still always read the length first, so we know how far to read. With postfix we are never sure if we are reading another value in the object, or the length at the end of it. Of course it might turn out even with postfix you can prove there is only one valid reading, but it is not so obvious to me.

Ok, I guess this is a major difference between parsing and hashing. The proposed hash_append signatures are not meant to be parseable. Just to satisfy the rules relating hash_append to operator==:

http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html#hash_append_rules

It is my belief that if vector outputs the hash_append of each element followed by the hash_append of vector.size(), that no two vectors which compare equal will send different messages to the hasher (satisfies rule 1), and that no two vectors which compare unequal will send the same message to the hash (satisfies rule 2).

Admittedly I haven’t taken the time to work out a formal proof. But it seems intuitively obvious to me. Nonetheless, I would welcome a formal proof.

Note also that no other hash-related paper I’m aware of that has been submitted to the standards committee even deal with this issue at all. Indeed, they seem to leave it up to the client to get right. This proposal strongly advocates that this detail, however it may be resolved, should be resolved by std::vector itself, not by clients who want to hash a vector. I’ve come to believe that asking your clients to do your hashing for you is akin to asking your clients to compute operator== for you, or swap for you.

There is only one right answer as to what state of a type should participate in a hash algorithm. And the type author is in the best position to make that call.

Howard

Nevin Liber

unread,
Apr 13, 2014, 8:25:42 PM4/13/14
to std-pr...@isocpp.org
On 13 April 2014 19:21, Howard Hinnant <howard....@gmail.com> wrote:
Note also that no other hash-related paper I'm aware of that has been submitted to the standards committee even deal with this issue at all.  Indeed, they seem to leave it up to the client to get right.  This proposal strongly advocates that this detail, however it may be resolved, should be resolved by std::vector itself, not by clients who want to hash a vector.  I've come to believe that asking your clients to do your hashing for you is akin to asking your clients to compute operator== for you, or swap for you.

+1, especially since a == b --> hash(a) == hash(b).
--
 Nevin ":-)" Liber  <mailto:ne...@eviloverlord.com(847) 691-1404

Richard Smith

unread,
Apr 14, 2014, 12:39:18 AM4/14/14
to std-pr...@isocpp.org

On 13 Apr 2014 16:02, "Christopher Jefferson" <ch...@bubblescope.net> wrote:
>
>
> On 12 Apr 2014 11:25, "Howard Hinnant" <howard....@gmail.com> wrote:
> >
> > What is your easy proof that if you prefix with the size of containers rather than postfix, that there can be no collisions?  I don't doubt that you are correct that there will be no collisions.  But I fail to see how prefix vs postfix makes a difference.
> >
>
> The advantage of prefix is that first we read the length, then we know how many values we have to read to get to the end of the object, so we can reconstruct the original object easily from the values we feed to hash_append. Similarly with recursive objects, we still always read the length first, so we know how far to read. With postfix we are never sure if we are reading another value in the object, or the length at the end of it.

Postfix is exactly the same. Just parse right to left.

Jeremy Maitin-Shepard

unread,
Apr 14, 2014, 2:27:36 AM4/14/14
to std-pr...@isocpp.org
On Sunday, April 13, 2014 9:39:18 PM UTC-7, Richard Smith wrote:
Postfix is exactly the same. Just parse right to left.

It seems it may make sense to standardize (for user-defined types as well, not just for the standard containers) that the hashing be done in such a way as to allow parsing from right to left, since as has been noted, combining right-to-left parsable hash sequences with left-to-right parsable hash sequences can result in ambiguity.

Christopher Jefferson

unread,
Apr 14, 2014, 4:52:26 AM4/14/14
to std-pr...@isocpp.org
On 14 April 2014 01:21, Howard Hinnant <howard....@gmail.com> wrote:
> On Apr 13, 2014, at 7:02 PM, Christopher Jefferson <ch...@bubblescope.net> wrote:
>
>> The advantage of prefix is that first we read the length, then we know how many values we have to read to get to the end of the object, so we can reconstruct the original object easily from the values we feed to hash_append. Similarly with recursive objects, we still always read the length first, so we know how far to read. With postfix we are never sure if we are reading another value in the object, or the length at the end of it. Of course it might turn out even with postfix you can prove there is only one valid reading, but it is not so obvious to me.
>
> Ok, I guess this is a major difference between parsing and hashing. The proposed hash_append signatures are not meant to be parseable. Just to satisfy the rules relating hash_append to operator==:
>
> http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html#hash_append_rules
>
> It is my belief that if vector outputs the hash_append of each element followed by the hash_append of vector.size(), that no two vectors which compare equal will send different messages to the hasher (satisfies rule 1), and that no two vectors which compare unequal will send the same message to the hash (satisfies rule 2).
>
> Admittedly I haven't taken the time to work out a formal proof. But it seems intuitively obvious to me. Nonetheless, I would welcome a formal proof.
>

I will construct such a proof. I believe the easiest way to construct
such a proof is to show the output is parseable (I believe the
hash_append rules and "the output of hash_append being parsable back
to the original object" will be equivalent for most sensible
implementations). Furthermore, I now agree that both prefix and
postfix are correct.

> Note also that no other hash-related paper I'm aware of that has been submitted to the standards committee even deal with this issue at all. Indeed, they seem to leave it up to the client to get right. This proposal strongly advocates that this detail, however it may be resolved, should be resolved by std::vector itself, not by clients who want to hash a vector. I've come to believe that asking your clients to do your hashing for you is akin to asking your clients to compute operator== for you, or swap for you.

I agree. This setup is better than previous setups. But if we take the
choice away from the client, we obviously have to make sure we get it
right, as users and implementors can't fix our mistakes!

However, I now agree the following is true, and will write out a proof
just to satisfy myself:

Hashing variable-sized containers either prefix or postfix (where the
same choice, prefix or postfix, is used for all containers), and fixed
size containers simply enumerating the elements (and the same for user
types which can be treated like a pair/tuple), leads to no
'hash_append' collisions in the standard library, where (just for
clarity):

We treat std::map (and std::multimap) as a container of pairs.
We treat std::string like a container of chars, and prefix or postfix
the size like every other container.
We treat std::optional as a container of size 0 (disengaged) or size 1
(engaged).
I ignore the hashed containers for now.

We would then teach users that we recommend they follow the same rules
(just hash_append any fixed length data, postfix (if that's what we
decided) the length of any variable length data), although obviously
we wouldn't enforce that rule.
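[The optional treatment above can be sketched directly as a hypothetical hash_append overload, again with an illustrative byte-recording stand-in for the hasher and the size postfixed like the other variable-length containers.]

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Illustrative byte-recording "hasher".
struct recorder
{
    std::vector<unsigned char> bytes;
    void operator()(void const* key, std::size_t len)
    {
        unsigned char const* p = static_cast<unsigned char const*>(key);
        bytes.insert(bytes.end(), p, p + len);
    }
};

template <class Hasher>
void hash_append(Hasher& h, int i) { h(&i, sizeof(i)); }

template <class Hasher>
void hash_append(Hasher& h, std::size_t s) { h(&s, sizeof(s)); }

// optional treated as a container of size 0 (disengaged) or size 1
// (engaged), with the size appended last like the other containers.
template <class Hasher, class T>
void hash_append(Hasher& h, std::optional<T> const& o)
{
    if (o)
        hash_append(h, *o);
    hash_append(h, o ? std::size_t{1} : std::size_t{0});
}
```

Under this scheme a disengaged optional<int> and an engaged optional<int> holding 0 send different messages, so they cannot collide.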

Howard Hinnant

unread,
Apr 14, 2014, 10:52:10 AM4/14/14
to std-pr...@isocpp.org
On Apr 14, 2014, at 4:52 AM, Christopher Jefferson <ch...@bubblescope.net> wrote:

> We treat std::map (and std::multimap) as a container of pairs.

Agreed.

> We treat std::string like a container of chars, and prefix or postfix
> the size like every other container.

I much prefer to make an exception of std::basic_string. We can make std::basic_string unambiguous by hashing the trailing null. Doing so has one huge benefit:

hash_append(h, std::string("any text at all")) == hash_append(h, "any text at all")

for any h. (One can’t actually compare the results of hash_append using the operator== syntax.) Note that "any text at all" has type const char[16], which includes the trailing null char. Also note that const char[16] is a fixed length type, and can be unambiguously hashed by the same rules you’ve already stated for other fixed length containers. Note also that this is consistent with existing operator== overloads:

std::string("any text at all") == "any text at all"

> We treat std::optional as a container of size 0 (disengaged) or size 1
> (engaged).

Sounds good.

> I ignore the hashed containers for now.

As it turns out, one can not hash an unordered container. There is no way to satisfy rule 1:

http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html#hash_append_rules

because two unordered containers can compare equal, and yet have their elements in different orders. hash_append can not specify the proper order to hash_append the elements. Therefore hash_append for unordered containers simply will not exist, so if someone tries to hash an unordered container, they will get a compile-time error.

> We would then teach users that we recommend they follow the same rules
> (just hash_append any fixed length data, postfix (if that's what we
> decided) the length of any variable length data), although obviously
> we wouldn't enforce that rule.

Agreed completely.

Howard

Miro Knejp

unread,
Apr 14, 2014, 11:37:54 AM4/14/14
to std-pr...@isocpp.org

Am 14.04.2014 16:52, schrieb Howard Hinnant:
> size like every other container.
> I much prefer to make an exception of std::basic_string. We can make std::basic_string unambiguous by hashing the trailing null. Doing so has one huge benefit:
>
> hash_append(h, std::string("any text at all")) == hash_append(h, "any text at all")
>
> for any h. (one can't actually compare the results of hash_append using the operator== syntax) Note that "any text at all" has type const char[16], which includes the trailing null char. Also note that const char[16] is a fixed length type, and can be unambiguously hashed by the same rules you've already stated for other fixed length containers. Note also that this is consistent with existing operator== overloads:
>
> std::string("any text at all") == "any text at all"
>
This will unfortunately not work with string_view, should it become part
of std; it would certainly be beneficial if hashers produced the
same value for all three variations of strings we have, i.e. string,
string_view and const char[N]. Fortunately string_view can be implicitly
constructed from const char*, so maybe it's worth elaborating whether it
would be better to omit hash_append(const char[N]) and add
hash_append(string_view) instead, not hashing the trailing \0 in string.

Miro

Howard Hinnant

unread,
Apr 14, 2014, 12:13:41 PM4/14/14
to std-pr...@isocpp.org
You bring up a good point, and I agree that all 3 should hash identically. However I do not think there is a problem in appending the null char to the hash:

template <class Hasher, class charT, class traits>
std::enable_if_t
<
    is_contiguously_hashable<charT>{}
>
hash_append(Hasher& h, basic_string_view<charT, traits> const& sv)
{
    h(sv.data(), sv.size() * sizeof(charT));
    hash_append(h, charT{});
}

template <class Hasher, class charT, class traits>
std::enable_if_t
<
    !is_contiguously_hashable<charT>{}
>
hash_append(Hasher& h, basic_string_view<charT, traits> const& sv)
{
    for (charT const& c : sv)
        hash_append(h, c);
    hash_append(h, charT{});
}

Howard

Jeremy Maitin-Shepard

unread,
Apr 14, 2014, 12:44:27 PM4/14/14
to std-pr...@isocpp.org
On Monday, April 14, 2014 9:13:41 AM UTC-7, Howard Hinnant wrote:
You bring up a good point, and I agree that all 3 should hash identically. However I do not think there is a problem in appending the null char to the hash:


Appending a 0 character does not produce a parseable result, either left to right or right to left.  The string may itself contain an embedded 0, resulting in an ambiguity in left to right parsing:

Consider:
pair<string,string>(string("Hello\0World",11), string("Hello",5))
pair<string,string>(string("Hello",5), string("World\0Hello",11))

Under the proposed scheme of hashing the character data followed by 0, both will hash to:

Hello\0World\0Hello\0

Even without embedded 0 characters, it is never right-to-left parseable, which means it could result in an ambiguity if combined with right-to-left parseable types (i.e. all other containers).

It seems to me it would be better for string and string_view to remain consistent with other containers and hash just the container contents (excluding the implicit trailing 0 character in the case of string) followed by the length.

Trying to ensure hash consistency between objects of different types that compare equal is almost certainly impractical.  Consider integers of different sizes, integer vs double, static vs variable-length containers.

Howard Hinnant

Apr 14, 2014, 1:14:19 PM
to std-pr...@isocpp.org
On Apr 14, 2014, at 12:44 PM, Jeremy Maitin-Shepard <jer...@jeremyms.com> wrote:

> On Monday, April 14, 2014 9:13:41 AM UTC-7, Howard Hinnant wrote:
> You bring up a good point, and I agree that all 3 should hash identically. However I do not think there is a problem in hashing the appending null char:
>
>
> Appending a 0 character does not produce a parseable result, either left to right or right to left. The string may itself contain an embedded 0, resulting in an ambiguity in left to right parsing:
>
> Consider:
> pair<string,string>(string("Hello\0World",11), string("Hello",5))
> pair<string,string>(string("Hello",5), string("World\0Hello",11))
>
> Under the proposed scheme of hashing the character data followed by 0, both will hash to:
>
> Hello\0World\0Hello\0

That is a compelling example, thanks!

>
> Even without embedded 0 characters, it is never right-to-left parseable, which means it could result in an ambiguity if combined with right-to-left parseable types (i.e. all other containers).
>
> It seems to me it would be better for string and string_view to remain consistent with other containers and hash just the container contents (excluding the implicit trailing 0 character in the case of string) followed by the length.

I’m now wondering if T[N] should be appended with size_t(N), and/or if we should specialize char[N] et al. I will give this more thought.

>
> Trying to ensure hash consistency between objects of different types that compare equal is almost certainly impractical. Consider integers of different sizes, integer vs double, static vs variable-length containers.

I’m hoping to lay the groundwork for:

unordered_set<string> s;
// …
s.find("some text"); // avoids constructing string("some text")

Howard

Jeremy Maitin-Shepard

Apr 14, 2014, 1:53:22 PM
to std-pr...@isocpp.org
On Monday, April 14, 2014 7:52:10 AM UTC-7, Howard Hinnant wrote:
As it turns out, one can not hash an unordered container.  There is no way to satisfy rule 1:

http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html#hash_append_rules

because two unordered containers can compare equal, and yet have their elements in different orders.  hash_append can not specify the proper order to hash_append the elements.  Therefore hash_append for unordered containers simply will not exist, so if someone tries to hash an unordered container, they will get a compile-time error.

A standard approach for hashing unordered containers (taken by Java, for instance) is to sum the hash codes of the elements:

http://stackoverflow.com/questions/18021643/hashing-a-set-of-integers-in-an-order-independent-way

This seems to be an unfortunate limitation of the proposal, in that there is a very useful form of hash function composition that is not supported by it.  Summing the hash codes seems to be fundamentally incompatible with the message-based hash_append scheme.  However, if the proposal is accepted, hash_append will become the standard way to compose hash functions, and users would have to manually define the hash function for any type (transitively) containing unordered_{set,map} (or any other user-defined hash map) essentially from scratch.

A workaround would be to define hash_append for unordered containers as follows (ignoring the extra template parameters to unordered_set):

template <class Hasher, class T>
void hash_append(Hasher &h, unordered_set<T> const &x) {
    typename Hasher::result_type sum = 0;
    for (auto &&y : x) {
        Hasher h2 = h; // Note 1
        hash_append(h2, y);
        sum += static_cast<typename Hasher::result_type>(h2); // Note 2
    }
    hash_append(h, sum);
}

Note 1: Default constructing h2 does not make sense if h contains important runtime state independent of the data that has been hashed, for instance parameters of the hash function.  On the other hand, copy constructing h2 from h would likely copy the accumulated data state as well, which isn't really desired, though possibly not harmful.

Note 2: This assumes that result_type is summable, which would not necessarily be a requirement on Hasher.  For some hash functions, e.g. SHA1, it might be more natural to produce an array of bytes, which might require some encapsulation to become summable.

I don't know to what extent this definition would preserve the collision-avoidance properties of the original hash function.  If this definition of hash_append for unordered_set were included in the standard library, users would have to be warned that it behaves differently than other hash_append definitions, and may be more prone to collisions.  Still, for the common case where the user just wishes to use a type that contains unordered_{set,map} as the key of another unordered_{set,map}, this would be very suitable and convenient.  If the definition were not included in the standard library, a user would be forced to rely on a custom Hasher type or a user-defined type as the element of unordered_{set,map} in order for ADL to find the definition, which would be much less convenient.  That is a strong argument in favor of including the definition.

Howard Hinnant

Apr 14, 2014, 1:58:13 PM
to std-pr...@isocpp.org
Chris Jefferson also brought this to my attention.

Thanks, I will give this issue more thought.

Howard

Jeremy Maitin-Shepard

Apr 14, 2014, 2:07:41 PM
to std-pr...@isocpp.org
On Monday, April 14, 2014 10:14:19 AM UTC-7, Howard Hinnant wrote:
I’m now wondering if T[N] should be appended with size_t(N), and/or if we should specialize char[N] et al.  I will give this more thought.
If there is an array_view type, then it really doesn't matter much whether the size is appended or not.  (array_view would presumably be consistent with vector.)

Specializing char[N] seems like a bad idea to me.


>
> Trying to ensure hash consistency between objects of different types that compare equal is almost certainly impractical.  Consider integers of different sizes, integer vs double, static vs variable-length containers.

I’m hoping to lay the groundwork for:

unordered_set<string> s;
// …
s.find("some text");  // avoids constructing string("some text")


How about:

s.find(string_view("some text"))

Daniel James

Apr 14, 2014, 3:08:28 PM
to std-pr...@isocpp.org
On 14 April 2014 18:14, Howard Hinnant <howard....@gmail.com> wrote:
> I'm now wondering if T[N] should be appended with size_t(N), and/or if we should specialize char[N] et al. I will give this more thought.

Should note that since operator==(T[N],T[N]) compares the addresses of
the arrays, hashing the contents doesn't follow your guideline for
calling hash_append. Strictly speaking the address should be hashed,
although that might be error prone.

Jeremy Maitin-Shepard

Apr 14, 2014, 4:20:12 PM
to std-pr...@isocpp.org

It is hard to imagine any case in which decaying to a pointer would be the desired behavior for the hash function.  The safest choice might be to poison hash_append for T[N] to avoid unexpected behavior.

Howard Hinnant

Apr 26, 2014, 9:58:10 PM
to std-pr...@isocpp.org
Updated:

http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html

Lots of changes, including:

* std::string is no longer special. It hashes just like vector.

* unordered containers are mentioned, but not handled. Examples of ways to handle them are included.

* New requirement for Hashers: CopyConstructible, CopyAssignable, value semantics.

* Updated type_erase_hasher example to reflect new requirements.

* Updated tests.

* Updated comparison / contrast with N3333 (now using llvm implementation of N3333).

* Pointer to github implementation of this proposal + lots of example code.

* New siphash adaptor. Really nice. The authors claim the algorithm is cryptographically secure. And the speed is competitive, especially on longer message sizes. On shorter messages the very simple FNV-1a is hard to beat. Emphasis is placed not on the quality of any one hashing algorithm, but on the ability to easily change among them.

* New quality tests.

Still to-do: proposed wording.

Howard

Jeremy Maitin-Shepard

Apr 27, 2014, 4:09:13 AM
to std-pr...@isocpp.org
On Saturday, April 26, 2014 6:58:10 PM UTC-7, Howard Hinnant wrote:
Updated:

http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html

[snip]
* unordered containers are mentioned, but not handled.  Examples of ways to handle them are included.

There is one issue with hashing unordered_{set,map} by first sorting the keys: they only have an equal_to predicate but not an ordering predicate.  (A different, generic hash table library might have an ordering predicate, though.)

Christopher Jefferson

Apr 27, 2014, 7:44:11 AM
to std-pr...@isocpp.org
That is true. A better route would be to count the number of
occurrences of each value (which is very cheap if there are few hash
collisions, and basically free if there are none), hash
pairs of the form pair(value in hash table, count), and add these
hashes together.

I am convinced (and could produce a proof if necessary) that if your
base hash function is good (for an appropriate definition of good),
then just adding the hashes of these pairs will produce a good-quality
hash.

Of course for hash sets, every value occurs at most once, so this is
just the same as adding up the hashes. The only problem with just
adding the hashes of everything in an unordered_multiset is that as
values occur increasingly often, there is some lowering of quality in
the lower-order bits (as an extreme, if there were 2^30 occurrences of
some value in a hash table, and you were using a 32-bit hash, then any
number*(2^30) takes one of 4 values).

The only problem with all this is if you have a hash which (for
example) hashes the integers to themselves, then you can get very poor
behaviour. One could either fix that in the hash of unordered_* with a
mixing function, or just tell users that rubbish hash in -> rubbish
hash out.

Chris
Message has been deleted

Howard Hinnant

May 3, 2014, 3:28:52 PM
to std-pr...@isocpp.org
On Apr 29, 2014, at 9:44 AM, r4nd...@gmail.com wrote:

> > Should vector expose its size() to the Hasher?
> In general, yes, but e.g. for vector<int> there's no need to send the size. If we hash a vector<T> the hasher knows the number of bytes hashed and the bytes themselves, which means that vector<T> could reconstruct its own size from that. For the outermost type, the size is implicitly known. So I propose the following addition:
>
> // hash_append_first is guaranteed to be the first hash_append called.
> template <class Hasher, class T, class Alloc>
> void
> hash_append_first(Hasher& h, std::vector<T, Alloc> const& v) noexcept
> {
>     for (auto const& t : v)
>         hash_append(h, t);
>     // no need anymore for: hash_append(h, v.size());
> }

What if hash_append(h, t); sends no update to h? The client might write such a type. tuple<> is such a type. In this case vector<tuple<>>(0) hashes to the same thing as vector<tuple<>>(1).

> uhash::operator()(T const& t) const noexcept
> {
>     Hasher h;
>     using std::hash_append_first;
>     hash_append_first(h, t);  // falls back to hash_append if no overload found.
>     return static_cast<result_type>(h);
> }
>
> Both rules are satisfied, the proof that no collisions happen is analogous.
>
> This may seem like a micro-optimization, but from it we gain:
> 1. s.find(“some text”); works again, because std::string does not append the size anymore (unless it's embedded somewhere, e.g. in pair<string,string>).
> 2. unordered_set<void*> can have a faster hash algorithm. There are faster hash functions for fixed-size elements of small size (e.g. <=sizeof(size_t)). Depending on the data, the best hash function may be the identity. This sort of hashing is safe (there can't be any collisions) if the size of each element is always the same. Note we cannot *mix* different algorithms at runtime for (number of bytes hashed)<=sizeof(size_t) and >=sizeof(size_t), since there may be collisions. But we can change to a different algorithm for (number of bytes hashed)=const. Thus we can do
>
> template <class Hasher>
> void hash_append_first(Hasher& h, size_t x) noexcept
> {
>     hash_only(h, x); // we guarantee that this is the only hash append
> }
>
>
> and have the option to implement an optimized hash_only for each hasher (hash_only may fall back to hash_append).

But the cost is that *every* type now has to implement both hash_append and hash_append_first. This complicates things by O(N). The odds of this proposal being accepted are already extremely low. O(N) costs must bring huge benefits.

Howard

Howard Hinnant

May 3, 2014, 3:28:53 PM
to std-pr...@isocpp.org

John Bytheway

May 3, 2014, 5:19:12 PM
to std-pr...@isocpp.org
On 2014-05-03 15:28, Howard Hinnant wrote:
> Now with proposed wording:
>
> http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html

Comments on the proposed wording:

* HashAlgorithm

You mention "The finalize operation" without ever defining it.

You did not specify that two consecutive calls to operator() are
equivalent to one call with the two sequences of bytes concatenated. I
assume this was deliberate (given the note about calls with len==0).
However, later in the note about hash_append for arrays, you imply that
this is the case. It should perhaps be clarified one way or the other.

* hash_append

I don't like "except that the function will modify the state of t prior
to forwarding it to h in such a way that the contiguously hashable
condition is preserved". At the same time, I have tried and I realize
it's tough to write out the condition you really mean; something like:

Effects: h(p_t, len_t) where p_t points to len_t valid bytes with values
derived from t. If t==u then len_t==len_u and memcmp(p_t, p_u, len_t) == 0.

(I'm not sure whether 'valid' is the right adjective there)

John Bytheway

Howard Hinnant

May 3, 2014, 5:59:03 PM
to std-pr...@isocpp.org
Thanks John. Updated

http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html

On May 3, 2014, at 5:19 PM, John Bytheway <jbyt...@gmail.com> wrote:

> On 2014-05-03 15:28, Howard Hinnant wrote:
>> Now with proposed wording:
>>
>> http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html
>
> Comments on the proposed wording:
>
> * HashAlgorithm
>
> You mention "The finalize operation" without ever defining it.

Thanks, I’ve instead referred to the “conversion operation”.

>
> You did not specify that two consecutive calls to operator() are
> equivalent to one call with the two sequences of bytes concatenated. I
> assume this was deliberate (given the note about calls with len==0).
> However, later in the note about hash_append for arrays, you imply that
> this is the case. It should perhaps be clarified one way or the other.

I’ve attempted to clarify that concatenated calls are equivalent.

>
> * hash_append
>
> I don't like "except that the function will modify the state of t prior
> to forwarding it to h in such a way that the contiguously hashable
> condition is preserved". At the same time, I have tried and I realize
> it's tough to write out the condition you really mean; something like:
>
> Effects: h(p_t, len_t) where p_t points to len_t valid bytes with values
> derived from t. If t==u then len_t==len_u and memcmp(p_t, p_u, len_t) == 0.
>
> (I'm not sure whether 'valid' is the right adjective there)

<nod> This is a rough paragraph. I’ve taken another shot at it. I’m now taking advantage of some tricky standardization definitions:

shall - means in order to conform, one must do this.

should - means we would like for this to happen, but it doesn’t have to.

The latter is done to accommodate hash_append(h, nan). Given a floating point x with a value of nan: x != x, but hash_append(h, x) == hash_append(h, x) (I’m using pseudo-code here, hash_append returns void).

Howard

John Bytheway

May 3, 2014, 7:36:26 PM
to std-pr...@isocpp.org
On 2014-05-03 17:59, Howard Hinnant wrote:
> On May 3, 2014, at 5:19 PM, John Bytheway <jbyt...@gmail.com>
> wrote:
>
>> On 2014-05-03 15:28, Howard Hinnant wrote:
>>> Now with proposed wording:
>>>
>>> http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html
>>
>>> Comments on the proposed wording:
>>
>> * HashAlgorithm
>>
>> You did not specify that two consecutive calls to operator() are
>> equivalent to one call with the two sequences of bytes
>> concatenated. I assume this was deliberate (given the note about
>> calls with len==0). However, later in the note about hash_append
>> for arrays, you imply that this is the case. It should perhaps be
>> clarified one way or the other.
>
> I've attempted to clarify that concatenated calls are equivalent.

Why are concatenated calls not necessarily equivalent when at least one
has zero length? This seems an odd deviation from expectation, and I
don't see any justification for it in the rest of the paper. It seems
likely to cause problems because it *will* be true for most
HashAlgorithms, and will come to be relied upon in code which will then
break when someone supplies a HashAlgorithm for which it is not true.

Also, having made this change, the comment about optimizing hash_append
for std::basic_string can be demoted to a Note.

>> * hash_append
>>
>> I don't like "except that the function will modify the state of t
>> prior to forwarding it to h in such a way that the contiguously
>> hashable condition is preserved". At the same time, I have tried
>> and I realize it's tough to write out the condition you really
>> mean; something like:
>>
>> Effects: h(p_t, len_t) where p_t points to len_t valid bytes with
>> values derived from t. If t==u then len_t==len_u and memcmp(p_t,
>> p_u, len_t) == 0.
>>
>> (I'm not sure whether 'valid' is the right adjective there)
>
> <nod> This is a rough paragraph. I've taken another shot at it. I'm
> now taking advantage of some tricky standardization definitions:
>
> shall - means in order to conform, one must do this.
>
> should - means we would like for this to happen, but it doesn't have
> to.
>
> The latter is done to accommodate hash_append(h, nan). Given a
> floating point x with a value of nan: x != x, but hash_append(h, x)
> == hash_append(h, x) (I'm using pseudo-code here, hash_append returns
> void).

You are specifying in terms of the state to which h is updated, rather
than the call (or 'message') used to update that state. This seems a
little dangerous, although I guess the intent is clear. In particular,
if e.g. the state of h is 32 bits then there will be many pairs of
unequal doubles which update h to the same state.

Relatedly, I wonder whether a similar use of 'should' is warranted in
describing the effects of HashAlgorithm's operator(). Something like:

"... If h1 and h2 are given two non-equivalent keys then they should
have different updated state."

But perhaps this is redundant and unhelpful because it obviously can't
be achieved in general for any reasonable HashAlgorithm with bounded
state. And the properties we are really striving for in the state
update are rather more subtle.

John

Howard Hinnant

May 3, 2014, 8:41:36 PM
to std-pr...@isocpp.org

On May 3, 2014, at 7:36 PM, John Bytheway <jbyt...@gmail.com> wrote:

> On 2014-05-03 17:59, Howard Hinnant wrote:
>> On May 3, 2014, at 5:19 PM, John Bytheway <jbyt...@gmail.com>
>> wrote:
>>
>>> On 2014-05-03 15:28, Howard Hinnant wrote:
>>>> Now with proposed wording:
>>>>
>>>> http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html
>>>
>>>> Comments on the proposed wording:
>>>
>>> * HashAlgorithm
>>>
>>> You did not specify that two consecutive calls to operator() are
>>> equivalent to one call with the two sequences of bytes
>>> concatenated. I assume this was deliberate (given the note about
>>> calls with len==0). However, later in the note about hash_append
>>> for arrays, you imply that this is the case. It should perhaps be
>>> clarified one way or the other.
>>
>> I've attempted to clarify that concatenated calls are equivalent.
>
> Why are concatenated calls not necessarily equivalent when at least one
> has zero length? This seems an odd deviation from expectation, and I
> don't see any justification for it in the rest of the paper. It seems
> likely to cause problems because it *will* be true for most
> HashAlgorithms, and will come to be relied upon in code which will then
> break when someone supplies a HashAlgorithm for which it is not true.

Ah! That is a very important misunderstanding. Thank you for bringing it to light!

I am attempting to say that given HashAlgorithm1 and HashAlgorithm2, it is unspecified whether or not a 0 length key updates the state. However if two instances of the *same* HashAlgorithm type see a 0 length key, they should treat it the same way. I had not realized this ambiguity in my English, and will work on it.


> Also, having made this change, the comment about optimizing hash_append
> for std::basic_string can be demoted to a Note.

Good point, thanks.

>
>>> * hash_append
>>>
>>> I don't like "except that the function will modify the state of t
>>> prior to forwarding it to h in such a way that the contiguously
>>> hashable condition is preserved". At the same time, I have tried
>>> and I realize it's tough to write out the condition you really
>>> mean; something like:
>>>
>>> Effects: h(p_t, len_t) where p_t points to len_t valid bytes with
>>> values derived from t. If t==u then len_t==len_u and memcmp(p_t,
>>> p_u, len_t) == 0.
>>>
>>> (I'm not sure whether 'valid' is the right adjective there)
>>
>> <nod> This is a rough paragraph. I've taken another shot at it. I'm
>> now taking advantage of some tricky standardization definitions:
>>
>> shall - means in order to conform, one must do this.
>>
>> should - means we would like for this to happen, but it doesn't have
>> to.
>>
>> The latter is done to accommodate hash_append(h, nan). Given a
>> floating point x with a value of nan: x != x, but hash_append(h, x)
>> == hash_append(h, x) (I'm using pseudo-code here, hash_append returns
>> void).
>
> You are specifying in terms of the state to which h is updated, rather
> than the call (or 'message') used to update that state. This seems a
> little dangerous, although I guess the intent is clear. In particular,
> if e.g. the state of h is 32 bits then there will be many pairs of
> unequal doubles which update h to the same state.

Right. This is another good reason for the “should” in:

> And if t1 != t2, then hash_append(h, t1) should update the state of h to a different state as does hash_append(h, t2).


> Relatedly, I wonder whether a similar use of 'should' is warranted in
> describing the effects of HashAlgorithm's operator(). Something like:
>
> "... If h1 and h2 are given two non-equivalent keys then they should
> have different updated state."
>
> But perhaps this is redundant and unhelpful because it obviously can't
> be achieved in general for any reasonable HashAlgorithm with bounded
> state. And the properties we are really striving for in the state
> update are rather more subtle.

This is not a bad suggestion at all. I will try adding it. We can always take it back out if it doesn’t seem to work.

Howard

John Bytheway

May 4, 2014, 5:24:59 PM
to std-pr...@isocpp.org
On 2014-04-02 19:09, Howard Hinnant wrote:
> http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html
>
> We're aiming at the pre-Rapperswil mailing. Any feedback, positive or negative, will be much appreciated. The paper currently lacks proposed wording. It is more of a technology demonstration. There's a co-author spot open for anyone who wants to turn the demonstration into concrete proposed wording. :-)
>
> Howard

It occurs to me that the type trait is_contiguously_hashable proposed in
this paper is really declaring that, for a particular type, value
equality is equivalent to representation equality. By 'representation
equality' I mean equality according to memcmp. This has potential to
optimize things other than hashing. For example, it could be used to
optimize any of the std:: algorithms which need to test for equality.
This includes find, find_end, find_first, adjacent_find, count,
mismatch, equal, search, replace, remove, and unique.

Does this seem like a trait we should be offering for these purposes?

If so, it would be nice to give the trait a more generic name which
reflects its wider utility. Unfortunately, I can't think of a good
name. Another bikeshedding opportunity... The best I can think of is
along the lines of respects_representation_equality.

John Bytheway

John Bytheway

May 4, 2014, 5:58:56 PM
to std-pr...@isocpp.org
On 2014-05-03 20:41, Howard Hinnant wrote:
>
> On May 3, 2014, at 7:36 PM, John Bytheway <jbyt...@gmail.com>
> wrote:

<snip>

>> Why are concatenated calls not necessarily equivalent when at least
>> one has zero length? This seems an odd deviation from expectation,
>> and I don't see any justification for it in the rest of the paper.
>> It seems likely to cause problems because it *will* be true for
>> most HashAlgorithms, and will come to be relied upon in code which
>> will then break when someone supplies a HashAlgorithm for which it
>> is not true.
>
> Ah! That is a very important misunderstanding. Thank you for
> bringing it to light!
>
> I am attempting to say that given HashAlgorithm1 and HashAlgorithm2,
> it is unspecified whether or not a 0 length key updates the state.
> However if two instances of the *same* HashAlgorithm type see a 0
> length key, they should treat it the same way. I had not realized
> this ambiguity in my English, and will work on it.

Based on reading this message, it didn't seem that you had understood my
concern. However, I see the latest version no longer has a special case
for length zero keys, so I am happy :).

John

Howard Hinnant

May 4, 2014, 6:00:18 PM
to std-pr...@isocpp.org
Agreed. Your argument didn’t sink through my thick skull until I had slept on it. :-)

Howard

Howard Hinnant

May 4, 2014, 6:00:33 PM
to std-pr...@isocpp.org
On May 4, 2014, at 5:24 PM, John Bytheway <jbyt...@gmail.com> wrote:

> It occurs to me that the type trait is_contiguously_hashable proposed in
> this paper is really declaring that, for a particular type, value
> equality is equivalent to representation equality. By 'representation
> equality' I mean equality according to memcmp. This has potential to
> optimize things other than hashing. For example, it could be used to
> optimize any of the std:: algorithms which need to test for equality.
> This includes find, find_end, find_first, adjacent_find, count,
> mismatch, equal, search, replace, remove, and unique.
>
> Does this seem like a trait we should be offering for these purposes?
>
> If so, it would be nice to give the trait a more generic name which
> reflects its wider utility. Unfortunately, I can't think of a good
> name. Another bikeshedding opportunity... The best I can think of is
> along the lines of respects_representation_equality.

If we could demonstrate such use, perhaps. I’ve just looked at find, and I don’t really see an optimization opportunity there (though I could easily be missing it). There is precedent for introducing a general bit of infrastructure to support a specific proposal. ratio and common_type were standardized in this very way (introduced to support <chrono>).

At this time, I don’t see a lot of opportunity (perhaps with std::equal, not positive).

If the LWG shows interest in this proposal, that is certainly an aspect they could look at (and bike shed).

N3333 called this trait is_contiguous_layout<T>, which I don’t think is as good as is_contiguously_hashable<T>, because float has a contiguous layout, and yet is still not contiguously hashable. The closest thing there is to a reference implementation for N3333 (llvm/include/llvm/ADT/Hashing.h) calls it is_hashable_data<T>.

<shrug> Good names are tough.

Howard

Daniel James

May 5, 2014, 6:22:59 AM
to std-pr...@isocpp.org
On 3 May 2014 22:59, Howard Hinnant <howard....@gmail.com> wrote:
>
> http://htmlpreview.github.io/?https://github.com/HowardHinnant/papers/blob/master/hashing.html

The "plausible implementation" of a hash function for floating-point
numbers won't always work for IEEE floats, as they can include
padding. For example, Intel Linux represents long double with an
80-bit float (i.e. 10 bytes), but sizeof(long double) is 12 on 32-bit
Linux, and 16 on 64-bit.

Geoffrey Romer

May 5, 2014, 12:22:25 PM
to std-pr...@isocpp.org
In _Elements of Programming_, Stepanov uses the term "uniquely represented" for this property. How do you feel about is_uniquely_represented<T>?
 


Howard Hinnant

May 6, 2014, 10:07:11 AM
to std-pr...@isocpp.org
On May 5, 2014, at 12:22 PM, 'Geoffrey Romer' via ISO C++ Standard - Future Proposals <std-pr...@isocpp.org> wrote:

> In _Elements of Programming_, Stepanov uses the term "uniquely represented" for this property. How do you feel about is_uniquely_represented<T>?
>

I’ve gone back and forth on this good suggestion, but I think I’m settling on not liking it. Rationale:

A quote from Stepanov’s excellent book:

> For example, a type representing a truth value as a byte that interprets
> zero as false and nonzero as true is not uniquely represented.

And I respect that definition, and agree with it.

On the other hand, on my platform, bool is implemented by the compiler as a single byte constrained to hold one of the two bit patterns:

0x00
0x01

Without playing with undefined behavior / aliasing, one simply cannot give a bool any other bit pattern. E.g.:

bool b = 128;

compiles and sets b to 0x01.

So while not guaranteed by the standard, on my platform, bool is contiguously hashable. And that is a sufficiently important optimization that I would want my std::lib to make std::is_contiguously_hashable<bool>::value == true on this platform. On another platform that implemented bool as something that (for example) could take on any bit pattern of a byte, and any of the non-zero bit patterns represented true, std::is_contiguously_hashable<bool>::value should be false. And admittedly, this *is* precisely what Alexander says in his book.

However, in order to avoid the situation of people reading EOP and concluding that is_uniquely_represented<bool>::value shall *always* be false, I would rather not use that name, in favor of the more pragmatic concerns implied by is_contiguously_hashable<bool>. Otherwise a FAQ will be:

EOP says bool is not uniquely represented. But I’m seeing std::is_uniquely_represented<bool>::value == true. Is that a bug?

Howard

Roman Perepelitsa

May 6, 2014, 10:12:54 AM
to std-pr...@isocpp.org
EOP doesn't say that std::is_uniquely_represented<bool>::value is false. It says that std::is_uniquely_represented<MyBoolean>::value must be false if MyBoolean is a byte with 0 meaning false and 1 meaning true.

is_uniquely_represented looks like a perfect fit to me.

Roman Perepelitsa.

Geoffrey Romer

May 6, 2014, 1:29:38 PM
to std-pr...@isocpp.org
This sounds highly speculative to me, and it proves far too much; if we're willing to reject a perfectly good name on the grounds that some people might hypothetically misunderstand a particular book's discussion of that term, and apply that misunderstanding to C++'s use of that term, then no possible name for anything is safe.

Howard Hinnant

May 6, 2014, 11:00:44 PM
to std-pr...@isocpp.org
On May 6, 2014, at 1:29 PM, 'Geoffrey Romer' via ISO C++ Standard - Future Proposals <std-pr...@isocpp.org> wrote:

> This sounds highly speculative to me, and it proves far too much; if we're willing to reject a perfectly good name on the grounds that some people might hypothetically misunderstand a particular book's discussion of that term, and apply that misunderstanding to C++'s use of that term, then no possible name for anything is safe.

Fair enough. I’m inclined to include an Appendix of issues such as this. Ultimately it will be the LWG, not the authors of this paper, that will decide details like this, assuming the proposal even gets that far.

Thanks,
Howard

jgot...@gmail.com

May 9, 2014, 6:43:32 PM
to std-pr...@isocpp.org
You should also define hash_append for the string_view and optional classes in the Library Fundamentals paper.