re: ruby-talk about google hash

Roger Pack

Dec 19, 2009, 11:49:33 PM
to Shot (Piotr Szotkowski), ruby-opt...@googlegroups.com
Yo yo.

Answered off of ruby-core since it's a more internal discussion. CC
ruby-optimization just because it's an optimization topic and I want
to have *something* listed on that group :)

> One of the routes I can take is writing my own IntSet class in (possibly
> inline) C or D, but I’d rather prototype it in Ruby first, and rewrite
> only if necessary. It seems I should at least try to make it use the
> Google’s hash/set implementation.

IntSet meaning the equivalent of a Set of BigNums or something?

So since you need bignums--I guess 63 bit longs are not large enough? :)

>> What would the ideal be in terms of requirements?
>
> Hm, I guess I should implement and profile the skeletal IntSet first
> to be sure,

that would be good.
Also if you can, time the GC [1]--that would be good, since that is
where this might help, as well. Sorry that timing it is 1.9-only
currently :)

> but something with fast #each and #combination(2) calls
> would already work wonders in my case (Set#each calls Hash#each_key,
> while my current Set#pairs implementation consists of calling
> to_a.combination(2), so probably there’s a lot of space for improvement
> already right there).

Interesting.

so you've got

set.each.to_a.combination(2) and the #each is most expensive?

I'd imagine that #combination isn't too expensive, since it's just
enumerating each option one at a time and handing it over to you.

I *think* that google's set iterator always goes in the same order, so
creating a #combination for a set would be possible.

Since it's "not the bottleneck" I wonder if just doing #combination in
ruby would work for now for our purposes...

I think I'll add VALUE => VALUE in there for hashes, and see if its
#each is still faster than Ruby's (gut hunch--it will be).

I suppose it's reprofile at that point--I'm not sure where the
slowdown is. If it's GC we could go one route, if it's slow
#combination, another, or maybe it would be worth it to use "real"
Google Set, or what not.

Feel free to fork/hack the code.

To get working .cpp files, run ruby extconf.rb within the ext folder
(they're auto-generated from a template currently--you could look at
the template, too, of course).


> shot@devielle:~$ gem install google_hash
> ERROR: Error installing google_hash:
> sane requires os (>= 0, runtime)

That is really odd. I used to see it when I had dependencies that
"spanned servers" [some from github, some from rubyforge] but I
thought those were done away with... I pinged gemcutter about it...

Wish me luck.
-r

[1] http://wiki.github.com/rdp/ruby_tutorials_core/gcprofiler

Piotr Szotkowski

Dec 20, 2009, 2:41:10 PM
to ruby-opt...@googlegroups.com
Roger Pack:

> Answered off of ruby-core since it's a more internal discussion. CC
> ruby-optimization just because it's an optimization topic and I want
> to have *something* listed on that group :)

Agreed. :) Happily joined the group, hence switching
to our-new-overlords-sanctioned email address. ;]

>> One of the routes I can take is writing my own IntSet class in (possibly
>> inline) C or D, but I’d rather prototype it in Ruby first, and rewrite
>> only if necessary. It seems I should at least try to make it use the
>> Google’s hash/set implementation.

> IntSet meaning the equivalent of a Set of BigNums or something?

IntSet meaning the equivalent of a Set of
Integers – be they Fixnums or Bignums.

> So since you need bignums--I guess 63 bit longs are not large enough? :)

Nope, that’s the problem: I need to operate on sets (Blankets) of
sets (Blocks) of integers (with the integers being row numbers of
an arbitrarily long transition table of a finite state machine).

I represent the ‘inner’ sets of integers (Blocks) as Fixnums/Bignums
with their bits set to 1 for the integers they hold, and so I need
an IntSet class to represent the ‘outer’ sets (Blankets).

In other words: Blanket[Block[0,1], Block[2,3], Block[3,4]] ends up
being represented as an IntSet consisting of 3, 12 and 24 (0b00011,
0b01100 and 0b11000, respectively).

With this representation, the longest FSM transition table I can fit
into Fixnums sports 62 rows. Unfortunately, I usually need to work
with larger FSMs – and I’d love to be able to keep the Fixnum/Bignum
transparency out of my head (as even for an over-62-row-FSM some of the
Blocks of a single Blanket might fit into a Fixnum, while others will
not – take Blanket[Block[0,1], Block[2,100]] for a contrived example).
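(A sketch of that encoding in plain Ruby – `block` here is just an
illustrative helper, not my actual code:)

```ruby
# Encode a Block (a set of row numbers) as an Integer bit vector:
# bit i is set iff row i is in the Block. (Illustrative sketch --
# `block' is a hypothetical helper, not the real implementation.)
def block(*rows)
  rows.inject(0) { |bits, row| bits | (1 << row) }
end

# A Blanket is then just a set of such integers:
blanket = [block(0, 1), block(2, 3), block(3, 4)]
# => [3, 12, 24], i.e. 0b00011, 0b01100 and 0b11000

# The Fixnum/Bignum transparency comes for free -- Block[2, 100]
# simply spills over into a Bignum, through the same code path:
big = block(2, 100)
big[100]  # Integer#[] reads bit 100; => 1
```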

>> Hm, I guess I should implement and profile
>> the skeletal IntSet first to be sure,

> that would be good.

Right. Currently I simply use stdlib’s Set, which uses the keys of
a Hash to provide element uniqueness – but if you recall our benchmarks
from ruby-talk this might not necessarily be the best approach (we
ended up with some benchmarks where a simple Hash-based IntSet is faster
than stdlib’s Set, and – inexplicably – even slightly faster than the
underlying Hash).

> Also if you can, time the GC [1]--that would be good, since that is
> where this might help, as well. Sorry that timing it is 1.9-only
> currently :)

1.9 is ok; my code was actually developed under Ruby 1.9, and only
later did I add 1.8 compatibility (mostly thanks to the wonderful
backports gem) so that I can use ruby-prof sanely.

FWIW, perftools.rb reports that the time spent in the garbage collector
is around 8% or so (i.e., not much; Hash#each_key and Array#combination
each take 30% or so).

>> but something with fast #each and #combination(2) calls would
>> already work wonders in my case (Set#each calls Hash#each_key,
>> while my current Set#pairs implementation consists of calling
>> to_a.combination(2), so probably there’s a lot of space for
>> improvement already right there).

> Interesting.

> so you've got

> set.each.to_a.combination(2) and the #each is most expensive?

No, sorry, I muddled this up too much.

I need the IntSet to be fast in two cases: IntSet#each (so I can iterate
over the IntSet’s elements) and IntSet#pairs (so I can iterate over
every possible pair in the set).

In my current implementation (which uses stdlib’s Set), Set#each maps to
Hash#each_key, and I implemented Set#pairs as Set#to_a.combination(2).
Not surprisingly, profiling suggests that I spend most of my time in
Hash#each_key and Array#combination.

(Making the future IntSet’s #each and #pairs return Enumerators, as
they do now in my case, would be most welcome. Also, if it clears
anything up, we can rename #pairs to #each_pair from now on.)
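(For reference, a minimal stdlib-Set-backed sketch of the interface
I have in mind – it inherits the very overheads discussed above, but
shows the #each/#each_pair/Enumerator shape:)

```ruby
require 'set'

# A thin stdlib-Set-backed sketch of the IntSet interface described
# above: #each delegates to Set#each (i.e. Hash#each_key underneath)
# and #each_pair is to_a.combination(2). Both return Enumerators when
# called without a block.
class IntSet
  include Enumerable

  def initialize(ints = [])
    @set = Set.new(ints)
  end

  def add(int)
    @set.add(int)
    self
  end

  def each(&block)
    return enum_for(:each) unless block
    @set.each(&block)
    self
  end

  def each_pair(&block)
    return enum_for(:each_pair) unless block
    @set.to_a.combination(2, &block)
    self
  end
end

IntSet.new([3, 12, 24]).each_pair.to_a
# => [[3, 12], [3, 24], [12, 24]]
```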

> I'd imagine that #combination isn't too expensive, since it's just
> enumerating each option one at a time and handing it over to you.

Right, Array#combination is implemented in C (as Hash#each_key is).
Unfortunately, I call them so often (in various places – e.g., I often
map the Blanket’s Blocks to be vertices in a graph, and then iterate
over every vertex to colour it or over every vertex pair to decide
whether to introduce an edge) that they end up eating a lot of the
runtime anyway.

I’m currently analysing ruby-prof’s output to introduce algorithmic
speed-ups (I sped the whole code up about three times last weekend),
but ultimately I’ll hit the end of my abilities, and I assume most
of the time will still be spent iterating over (a) IntSets’ elements
and (b) pairs of IntSet elements.

> I *think* that google's set iterator always goes in the same
> order, so creating a #combination for a set would be possible.

Hm, that would be a very good idea.

> Since it's "not the bottleneck" I wonder if just doing
> #combination in ruby would work for now for our purposes...

I started with implementing Set#pairs on the Ruby side, only
to find out that to_a.combination(2) is actually much faster.
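(The pure-Ruby attempt looked roughly like the double loop below –
a reconstruction, as I didn’t keep the original – and it lost because
Array#combination(2) does the same enumeration in C:)

```ruby
require 'set'

# Index-based double loop over the Set's elements -- a reconstructed
# sketch of the pure-Ruby Set#pairs attempt, not the original code.
def ruby_pairs(set)
  a = set.to_a
  pairs = []
  (0...a.size).each do |i|
    ((i + 1)...a.size).each do |j|
      pairs << [a[i], a[j]]
    end
  end
  pairs
end

ruby_pairs(Set[1, 2, 3])  # => [[1, 2], [1, 3], [2, 3]]
```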

> I think I'll add VALUE => VALUE in there for hashes, and see if
> its #each is still faster than Ruby's (gut hunch--it will be).

VALUE GoogleHash keys would mean I can have my IntSet implemented in its
keys in the same way stdlib’s Set is implemented in Hash’s keys; that
would be a very good starting point. :)

> I suppose it's reprofile at that point--I'm not sure where the
> slowdown is. If it's GC we could go one route, if it's slow
> #combination, another, or maybe it would be worth it to use
> "real" Google Set, or what not.

If implementing IntSet in terms of VALUE-keyed GoogleHash indeed ends
up working out, I guess a GoogleSet class (mapped to Google’s *_hash_set
code) would be the logical next step to try.

> Feel free to fork/hack the code.

I will. I need to brush up my C/C++ skills (and I actually need to get
the first draft of my thesis out first…), but a VALUE-keyed GoogleHash
(or GoogleSet) seems to be a viable route (as an alternative to coding
my own IntSet in C or D – I definitely trust other people’s coding
skills more than mine in these areas).

>> shot@devielle:~$ gem install google_hash
>> ERROR: Error installing google_hash:
>> sane requires os (>= 0, runtime)

> That is really odd. I used to see it when I had dependencies that
> "spanned servers" [some from github, some from rubyforge] but
> I thought those were done away...I pinged gemcutter about it...

Yeah – I double-checked and all of {os, sane, google_hash} are
there on Gemcutter (which is the first entry in my gem sources).

> Wish me luck.

Keeping my fingers crossed. :)

— Piotr Szotkowski
--
The real problem is not whether machines think but whether men do.
[B. F. Skinner]


rogerdpack

Dec 21, 2009, 9:24:58 AM
to ruby optimization
> >> shot@devielle:~$ gem install google_hash
> >> ERROR:  Error installing google_hash:
> >>   sane requires os (>= 0, runtime)
> > That is really odd. I used to see it when I had dependencies that
> > "spanned servers" [some from github, some from rubyforge] but
> > I thought those were done away...I pinged gemcutter about it...
>
> Yeah – I double-checked and all of {os, sane, google_hash} are
> there on Gemcutter (which is the first entry in my gem sources).

Luis recommended submitting it as a bug:
http://groups.google.com/group/gemcutter/browse_thread/thread/165c095cbb02e375
:)
-r

rogerdpack

Dec 21, 2009, 9:35:52 AM
to ruby optimization

rogerdpack

Dec 21, 2009, 4:06:09 PM
to ruby optimization
> Right. Currently I simply use stdlib’s Set, which uses the keys of
> a Hash to provide element uniqueness – but if you recall our benchmarks
> from ruby-talk this might not necessarily be the best approach (we
> ended up with some benchmarks where a simple Hash-based IntSet is faster
> than stdlib’s Set, and – inexplicably – even slightly faster than the
> underlying Hash).

> 1.9 is ok; my code was developed under Ruby 1.9 actually, and only later
> I added 1.8 compatibility (mostly thanks to the wonderful backports gem)
> so that I can use ruby-prof sanely.

Good to hear--there's a teeny optimization in there that I'm unsure of
how to do in 1.8.

With 1.9 it defines

size_t generate_hash(...)
...

with a "cheat" to not have to call #hash on the object itself:

case T_BIGNUM:
        return LONG2FIX(((long*)(RBIGNUM_DIGITS(hash_me)))[0]);

I'm not sure how to do that in 1.8 though, so currently in 1.8 it just
calls #hash on it [i.e. slower for insertions/retrieval].

> FWIW, perftools.rb reports that the time spent in the garbage collector
> is around 8% or so (i.e., not much; Hash#each_key and Array#combination
> each take 30% or so).

Interesting. If I can just get perftools to work with me on
Windows...


> >> but something with fast #each and #combination(2) calls would
> >> already work wonders in my case

K the latest edge should have a syntax like

require 'google_hash'
a = GoogleHashDenseRubyToRuby.new
a['abc'] = 'def'

a.keys # => array
a.keys_combination_2{|a, b| puts a, b}

Here are the latest timings:
http://github.com/rdp/google_hash/blob/master/results.txt

Give er a shot, let me know if there's more we can do.


> I will. I need to brush up my C/C++ skills (and I actually need to get
> the first draft of my thesis out first…), but a VALUE-keyed GoogleHash
> (or GoogleSet) seems to be a viable route (as an alternative to coding
> my own IntSet in C or D – I definitely trust other people’s coding
> skills more than mine in these areas).


"If you want to see the code/hack on it, run extconf.rb within the ext
directory, to create the code it actually uses (from a template)."

-r

Caleb Clausen

Dec 21, 2009, 10:15:37 PM
to ruby-opt...@googlegroups.com
Piotr Szotkowski wrote:
> I need the IntSet to be fast in two cases: IntSet#each (so I can iterate
> over the IntSet’s elements) and IntSet#pairs (so I can iterate over
> every possible pair in the set).
>
> In my current implementation (which uses stdlib’s Set), Set#each maps to
> Hash#each_key, and I implemented Set#pairs as Set#to_a.combination(2).
> Not surprisingly, profiling suggests that I spend most of my time in
> Hash#each_key and Array#combination.

I have to wonder... if those are really the only two cases that you need
to be fast, then why not use an array rather than any sort of hash-based
set? Surely ary.each and ary.combination(2) are going to be faster than
hash.each_key and hash.keys.combination(2).... but I suspect you
actually need Set#add and/or Set#include? to be fast too.
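Roughly, the trade-off in miniature (a sketch with made-up sizes, just
to state the point in code):

```ruby
require 'set'

# An Array gives the cheapest #each and a native #combination(2),
# but membership testing is a linear scan; a hash-backed Set pays
# hashing overhead on iteration but answers #include? via a hash
# lookup instead.
ints = (1..500).to_a
ary  = ints.dup
set  = Set.new(ints)

ary_pairs = 0
ary.combination(2) { |_| ary_pairs += 1 }
ary_pairs         # => 500 * 499 / 2 == 124_750

set_pairs = set.to_a.combination(2).to_a.size
# same count, but via an intermediate Array

ary.include?(500) # linear scan
set.include?(500) # hash lookup
```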

rogerdpack

Dec 26, 2009, 7:36:50 PM
to ruby optimization

> ...
> with a "cheat" to not have to call #hash on the object itself:
>
>               case T_BIGNUM:
>                       return LONG2FIX(((long*)(RBIGNUM_DIGITS(hash_me)))[0]);
>
> I'm not sure how to do that in 1.8 though, so currently in 1.8 it just
> calls #hash on it [i.e. slower for insertions/retrieval].

K this is working for 1.8, too, now.
Enjoy.
-r
