Using a custom hasher in Cython

514 views
Skip to first unread message

Pasta farianist

unread,
Dec 16, 2015, 2:16:58 AM12/16/15
to cython-users
I am writing a wrapper for sparse_hash_map from Google's sparsehash library, and I would like to use a custom hash function in this wrapper. The interface of sparse_hash_map is very similar to unordered_map. According to the docs ( https://sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html ) it is defined as:

sparse_hash_map<Key, Data, HashFcn, EqualKey, Alloc>

My concrete use case is Key=uint32_t, Data=uint16_t. The wrapper in its current form is hosted on Github: https://github.com/Pastafarianist/cython-sparsehash . The reason for why I am doing this is described at https://stackoverflow.com/questions/34296868/sparse-hash-map-is-very-slow-for-specific-data .

I would appreciate a trivial example of writing a custom hasher. In reality, I will probably experiment with several different functions, but for now I am only interested in how to do that from Cython.

tl;dr: how to add a custom hasher to a wrapper of a container with the interface of unordered_map?

Robert Bradshaw

unread,
Dec 22, 2015, 1:05:14 AM12/22/15
to cython...@googlegroups.com
Non-type template parameters are not yet supported, but
https://github.com/cython/cython/pull/426 is close.
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "cython-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to cython-users...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Matthew Honnibal

unread,
Dec 26, 2015, 10:56:36 PM12/26/15
to cython-users


My concrete use case is Key=uint32_t, Data=uint16_t. The wrapper in its current form is hosted on Github: https://github.com/Pastafarianist/cython-sparsehash . The reason for why I am doing this is described at https://stackoverflow.com/questions/34296868/sparse-hash-map-is-very-slow-for-specific-data .

This isn't exactly what you asked, but you might find my Cython hash table, preshed, useful: https://github.com/syllog1sm/preshed

The table assumes keys are pre-hashed, so you have to do that yourself. But, the plus side is, you get to do that part yourself :). It uses open-addressing and linear probing, which is very fast and very memory efficient for my use-case (maintaining a weights table with ~30m entries for a linear model for NLP). I believe dense_hash_map uses the same algorithm, and performs similarly.

There's no docs, but "pip install preshed" works, and hopefully you can see the Cython and Python APIs from the code: https://github.com/syllog1sm/preshed/blob/master/preshed/maps.pyx

For hash function I always use murmurhash. You can get my wrapper from "pip install murmurhash". Code is here: https://github.com/syllog1sm/cython-murmurhash
 

Pasta farianist

unread,
Jan 2, 2016, 10:51:20 AM1/2/16
to cython-users
Thanks for the answer! By the way, my wrapper is heavily based on yours, I only added `sparse_hash_map` (which is almost identical to `dense_hash_map`) and a pretty simple Python wrapper.
Unfortunately, I need `sparse_hash_map` as I have several hundred million entries to store at least. I don't know yet how many exactly.
Still, you've given me an idea. I could insert a (bijective) hashing function into the Cython wrapper. This wouldn't be particularly convenient, since I would have to convert the existing data to this representation, and iterating through keys would require a reverse transformation, but that can solve the problem nonetheless.

I don't really know C++ well enough to understand what else I could do here. Maybe I could write a thin wrapper around `sparse_hash_map` that would instantiate template parameters, and base my Cython wrapper on top of that, for example?

воскресенье, 27 декабря 2015 г., 6:56:36 UTC+3 пользователь Matthew Honnibal написал:

Pasta farianist

unread,
Jan 2, 2016, 5:18:52 PM1/2/16
to cython-users
I have probably solved my problem. My solution is very unportable, but it kinda works.

I have subclassed `sparse_hash_map` in a .cpp file and filled the `HashFcn` parameter there.

#include "sparsehash/sparse_hash_map"
#include "stdint.h"

// syntax borrowed from
// https://stackoverflow.com/questions/14094104/google-sparse-hash-with-murmur-hash-function

struct UInt32Hasher {
    size_t
operator()(const uint32_t& x) const {
 
return (x ^ (x << 17) ^ (x >> 13) + 3238229671);
   
}    
};

template<class Key, class T>
class sparse_hash_map : public google::sparse_hash_map<Key, T, UInt32Hasher> {};


So far, my only problem with this code is that `g++` cannot find `sparsehash_wrapper.cpp` when I compile the module using `pyximport`. I had to include the full path to the wrapper in order to make it work.
Reply all
Reply to author
Forward
0 new messages