Mysterious heap dump from HLL on Apache Spark


julianke...@gmail.com

Dec 29, 2015, 8:48:57 AM
to stream-lib-user
I implemented a simple prototype that estimates distinct users on a stream of web tracking data with HyperLogLog on Apache Spark. I used stream-lib version 2.9.0, and my test dataset contains exactly 10,000,000 distinct elements.
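The core of the prototype looks roughly like this (a simplified sketch of what I run in local mode; the input path and the lambda-based aggregation are just illustrative, and I'm assuming the HyperLogLogPlus instances serialize cleanly across tasks):

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistinctUsersHll {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("hll-distinct-users").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One user id per line of the tracking data (path is illustrative).
        JavaRDD<String> userIds = sc.textFile("tracking-data.txt");

        // Fold every partition into an HLL+ sketch and merge the partial sketches.
        HyperLogLogPlus sketch = userIds.aggregate(
                new HyperLogLogPlus(16),
                (hll, id) -> { hll.offer(id); return hll; },
                (a, b) -> { a.addAll(b); return a; });

        System.out.println("Estimated distinct users: " + sketch.cardinality());
        sc.stop();
    }
}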

Now I wanted to test the accuracy and the memory consumption compared to a naive solution with a Java HashSet (for now only in Spark local mode).
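For reference, the naive solution is essentially just this (exact counting with a HashSet, so memory grows with the number of distinct ids):

import java.util.HashSet;
import java.util.Set;

public class DistinctUsersExact {
    // Exact distinct count: every distinct id is kept in memory.
    public static long countDistinct(Iterable<String> userIds) {
        Set<String> seen = new HashSet<>();
        for (String id : userIds) {
            seen.add(id);
        }
        return seen.size();
    }
}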

So I installed VisualVM and studied the heap dump to check the memory footprint. 

The first screenshot shows the heap dump at the end of the naive HashSet version; I interrupted the application because the memory limit was reached. The second screenshot shows the situation after the run of the HLL version. Everything is as expected: the total memory consumption is much better than for the naive solution. But I wonder about the peak in the middle of the run and the rapid decrease in heap size, although the maximum cardinality hadn't been reached at that point.

How can one explain this phenomenon? And is this an appropriate method for running such tests at all?

I'd be happy about any help, explanation, or advice. Thanks!
Attachments: hashCount.jpg, hllCount.jpg

Matt Abrams

Dec 29, 2015, 10:36:43 AM
to stream-lib-user

Hello

Just for clarity, were you using HLL or HLL+? HLL+ is much better than HLL. What parameters did you use to create it?

The most likely reason for the heap usage change is the conversion from sparse mode. When an HLL is initialized, it starts off in 'sparse' mode, which resembles a linear counter. This gives you accurate counts at small cardinalities. As the number of elements represented by the structure grows, it eventually hits a tipping point where it converts into 'normal' mode. This mode has a stable memory profile and should remain essentially flat for the lifetime of the structure, regardless of how many more elements you submit.

Ideally the conversion threshold is chosen so that the memory usage before and after the conversion is about the same, but that is hard to achieve in practice and can vary based on your configuration inputs.

HLL+ should have a less severe memory profile change from low to high cardinality sets.
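In stream-lib the mode is driven by the constructor arguments, if I remember correctly; something like:

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class ModeExample {
    public static void main(String[] args) {
        // Normal mode only: the full register array is allocated up front (p = 16).
        HyperLogLogPlus normalOnly = new HyperLogLogPlus(16);

        // Sparse mode enabled: starts as a compact sparse list (sp = 25) and converts
        // to the normal representation once an internal threshold is crossed.
        HyperLogLogPlus withSparse = new HyperLogLogPlus(16, 25);

        normalOnly.offer("user-1");
        withSparse.offer("user-1");
        System.out.println(normalOnly.cardinality() + " / " + withSparse.cardinality());
    }
}

With the single-argument constructor, sp defaults to 0, which, as far as I recall, disables the sparse phase entirely.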

Hope this helps,

Matt



julianke...@gmail.com

Dec 30, 2015, 3:47:43 AM
to stream-lib-user
Thanks.

I used HLL+ with the following constructor call: HyperLogLogPlus hll = new HyperLogLogPlus(16). 

So I used it explicitly without the sparse representation for this test.

Do you think one can see the effects of HyperLogLog itself in a heap dump of a Spark application at all? I think there is a lot of noise around it, so it could be very difficult to recognize trends. Can I maybe watch the memory footprint of a specific class? Or what other ideas do you have for getting a reasonable result?
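One idea I had (not sure if it is the right approach): instead of looking at the whole heap, sample the size of the sketch itself while offering elements, e.g. via its serialized form, roughly like this (assuming getBytes() is a fair proxy for the sketch's footprint):

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class SketchSizeProbe {
    public static void main(String[] args) throws Exception {
        HyperLogLogPlus hll = new HyperLogLogPlus(16);
        for (long i = 0; i < 10_000_000L; i++) {
            hll.offer("user-" + i);
            // Sample the serialized size of the sketch every million elements.
            if (i % 1_000_000 == 0) {
                System.out.println(i + " elements -> " + hll.getBytes().length + " bytes");
            }
        }
        System.out.println("Estimate: " + hll.cardinality());
    }
}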

Does anybody have experience using stream-lib/HLL+ in an Apache Spark context? It would be interesting to exchange experiences.