Weird hyperLogLog output


Federico Nieves

Feb 1, 2017, 2:36:45 PM
to Druid User
Hi!

When running a topN query that asks for a HyperLogLog metric, we sometimes get values like this:

1.7976931348623157E308

The weird thing is that we get that same number for many different values (which should differ in normal cases). Is this a problem in our environment? Or maybe a wrong format in the output? Can we specify the format in which we want to get the metric (integer, float, etc.)?
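(For context, a topN query of this shape might look roughly like the following sketch; the broker URL, datasource, and dimension names are placeholders, not the actual ones from this environment.)

import json
import requests  # assumes the requests library is installed

# Placeholder broker URL, datasource, and dimension names.
BROKER = "http://localhost:8082/druid/v2"

topn_query = {
    "queryType": "topN",
    "dataSource": "events",
    "dimension": "country",
    "metric": "user_unique",
    "threshold": 10,
    "granularity": "all",
    "intervals": ["2017-01-01/2017-02-01"],
    "aggregations": [
        {"type": "count", "name": "count"},
        # hyperUnique reads a column that was ingested as a hyperUnique metric.
        {"type": "hyperUnique", "name": "user_unique", "fieldName": "user_unique"},
    ],
}

response = requests.post(BROKER, json=topn_query)
print(json.dumps(response.json(), indent=2))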

Thanks,

Gian Merlino

Feb 1, 2017, 4:32:57 PM
to druid...@googlegroups.com
This is Double.MAX_VALUE and probably means that the estimate was too large to correct via the algorithm we use. Is this reproducible? And do you think your "real" unique counts are larger than about 10^19?
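(For reference, 1.7976931348623157E308 is the largest finite IEEE 754 double, i.e. Java's Double.MAX_VALUE; a quick sanity check in Python, where the same limit is exposed as sys.float_info.max:)

import sys

# Java's Double.MAX_VALUE and Python's float max are the same IEEE 754 limit.
print(sys.float_info.max)                             # 1.7976931348623157e+308
print(sys.float_info.max == 1.7976931348623157e308)   # True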

Gian


Federico Nieves

Feb 1, 2017, 5:31:51 PM
to Druid User
Wow, very weird, since it's impossible for the value to be above 10^19.

The count metric is always above the HyperLogLog estimate, and count is way below that number. For example:

"count" : 10387841,
"unique" : 1.7976931348623157E308,

"count" : 10387837,
"unique" : 1.7976931348623157E308,

When it works OK, the result is something like this:

"count" : 5860320,
"user_unique" : 5170144.904699262,

Yes, this is reproducible: every time I run this simple split with the count and HyperLogLog metrics, it returns those weird values.

Do you think this is a problem with HyperLogLog (in our environment)? Otherwise, how can I detect the problem?

Thanks Gian!


Gian Merlino

Feb 2, 2017, 10:31:03 AM
to druid...@googlegroups.com
It might be interesting to set "finalize": false in your query context to see the actual HLL objects returned by this query, rather than the cardinality estimates. That'd possibly provide some clues as to how they ended up the way they did.
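(A sketch of how that might look, reusing the placeholder query shape from earlier in the thread; only the "context" block is the relevant addition.)

import json
import requests

BROKER = "http://localhost:8082/druid/v2"   # placeholder broker URL

query = {
    "queryType": "topN",
    "dataSource": "events",                 # placeholder names, as before
    "dimension": "country",
    "metric": "count",
    "threshold": 10,
    "granularity": "all",
    "intervals": ["2017-01-01/2017-02-01"],
    "aggregations": [
        {"type": "count", "name": "count"},
        {"type": "hyperUnique", "name": "user_unique", "fieldName": "user_unique"},
    ],
    # The relevant part: with "finalize": false the query returns the raw
    # (base64-encoded) HLL objects instead of the finalized cardinality estimates.
    "context": {"finalize": False},
}

print(json.dumps(requests.post(BROKER, json=query).json(), indent=2))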

Gian


Federico Nieves

Feb 2, 2017, 11:26:26 AM
to Druid User
Hi Gian,

I did that, and here is the output:

- For the wrong numbers, the same HLL object was returned:

+zUCceNCa8lTe0RFUjQhEyQSRkSDNEIxQ0Q1JCJ0MmMmVCNDRhVDNUN0UYpEcTEjJUc1RBQ0QyNTWlYhMzREhTIyEzRENEEiIzRXEjZDJEFEgjIydCMzNkEhRDQqQmRVI1IzNDIzM0I5ZIFEVSUUJVYRUiMhU1JCUjNCWTI1MjQ0MiQkYyVEIyNVIjMxRURjFhQjMxMUN4cUMUJCZiklVDQjIBEzFANCdEMDMSQjRTKAQkQzRCMiJDJCZlVUYTEjYjIjY1IlZ1NVUwUhNFdUNUUiVSEUNyIyIyIzIxOkREMyJTVCViJjMSJCNkI0MiI0IiUxQzJkRFI0RoNDYyIlEyQ5IyMnSEdlEXU0QlQiQkJhRFX2VBQ0YiFCaDIwYzRVNyMkNkUzckQnMVQzNEc4RjNCUUJCRDIoUyM4MyM2RCQkRDOGgjIjciZUQkRWJEKDJUU2NWUzcyQjQmkmMiJHNEQkRTY0JDQyMmumMiQTNCUxNjUzkxMiJBVRVDI3UzYxNzJDI3MWM1QzJUZDQkJGFENFRDI0U0MzRHJEJDJVlDQkQ2VSMyVLJCYxJEJGUiQWJDI1N5NTYjdEE6NDNCFGMyEjRSMzMxZDMCQiMzdTAzM0MyU2JlN0Q8NWMyISFUZCMmdEIjMUIiRCRURTQmRDQzR0E1QTQjImNTQjVhIiREUjUlM3E1MjRjJUNHIzYiNWMzMVVSMTITEyFDJlEzVURTI0QiIodBY2FzNENDEiVSOLOTMVEjQzdREkJSUhNFQhIjRCRjMzYjM1YzMzSBREM0ODJRMlxSIVUiM6MUZDFCRFMiR5IyNTQ1EUUwMzMjKFIyYoE2IhMlMyMyV0R5JFNJRCOEGzRVFSZDQzMzITQ0FSVEUiN2QiBSZyMzRxNhJjQiYlJzVkZCNGJSVVIjYzQ0RBoiNjInMUNDYlVFMmIUFEZEhCYiUTMhNjJDNlJDItRSIyOSJzNEUlMzNDQjJDJUREMyIkRyIyQmoUZVNFJUJzQxQiQhIkZHJCQ5YzMyJFQiUjYkMjMxYnVUIyFUViNCM0MzASJIN0NBNCo3UVZWIyJlRUUkJXNUN2Q1NFQkRTQjZDEjgSU1ciFTFjRyVFdCYzV0QiM0NjIzQzVjIyImN3OFQ0IjQzESSTQlOBcyFDIUNGJjQiNkViVDWCEUc1RFV6JFNCYXQzNCQ0RCIyRyMzRlEzKBYpNnU1M2YyMjJCRGMVY0VTIgJENVJhEUJUQ1NEQ4IhNjJCkzazYUETU1IjdEQTMlKCFDtzJhMyV3R0EyMhNjZTIXcVYiRmNSUTAiVCVEQTUlSDhCMyUjMxQTNTNjMzA4IzVBRDI5MhMzUidkNFUkJUNjMUMkdRNDYXJCJZUkRFMzQ=

The good results, on the other hand, returned values like this:

AQAB5AAAAAAQAAAAAAABUzAQQAARARACAAIAACAAABAAECAAAAIiAAAAAAQAAAAAAAAAAAAAAQAAAAIAAAACAAAIADAAAAAAMAEgACAAAAAAAQAAMAMAAEAQAAIAAAABARABABAAAAIQAiAxAAAAEAADEAMgBAAAEAAiAAATEwABAAAAAQAAABAAARAAAAMAMQAQAAAAAABQAAAAMAQAEAEZMAICAADBAAAgEAMDEAABAAAQAAAQAAAEAAAAAgAAAAEQIAIAAQAQAAAAAQAQIAIAADAAUAAQABAAAAAAAAAAAAcQAAACAAAAAAAAEAAQAAIAAAABAwAAAwIQAAAAAAAAAAAAQAMAMAAAAAAwAwAAAAEAAQAAAQAgAAAAAAABQAAAAAAAMAESAAABQAAAAwASIBABABAQEAIAAAAAQAEAIAAAAAAAAAABEAABAAAAEAAQAAAAAAAAAAQBAgRQEQAAAAAwAAAhABADACEAAAADAQASECAQAAABAAYQEAAAAFAAABAAEAAAYABAAAAEAAAAAgEgACEAAAAQEQAAIAEAAgABAAEAUAAAEQAAAAEQEAAFAAAAAAAAAAEAABACAAAwAAAAEAACAAEAAgAAAAABABAAIwAQAAAAAAAAAAAAAQAgAAABIAAAAAAAABAQADARBQADAAAQABAAAQEAAAIBEQAzAFEwATEDACACEAEAAiEAAAAQABAAAAEAAIIAAAAAAAIBACABAAAAIABAAAAAAwAAAAAgAAAAUGAAAAAAQAAgABABAAAAAAMCAAAAAAIAAAABAEAAACAAAAEBAgAAAAAQAAAAAAAAFwAAExAAAGAAAAEAAAACADAhAQABAQEBAAACAAAAAgAAFAABAAACAAAAACAgAAEhEAADAAAAIAAAIAAAFjACAAAAAAAAAQAAAAIAACAQAQEBABABAAMQAAAAADAABwAAEAAQAEMAEAAAAAAAAAABADEAAAADAAEAEAQwABAQAAAAASAAABAAAQEAAAJAMBAAABAAEAAAIAAQAQACCQAwAQAAEQAQBDAAABAVAAEQMAAAYQAAAQAAAwAAAAMQASEAAAAAARAAAAABABEAABAAAAAAAmAAABAyIAAREAQAECAAEAAAAAAAABAAACAAEAMBMAAFABAAUBEAAAAwAEEAEAIBACAAADAAABAAAAAAAAAAIBIAAAABAwAAAwAAARAAAAIAABBAAABAEAAAAAAAAFEgAAAAAAEgQAAAAAAABAIBAVEAAAEAAAAAAAEAAAAhAAYAEwQAABAAAAA0ABAAIAAAAAAAABAAAAQAATARAAAwAQAAAgEAIQAAIAAAABAAAAAAAAAQUAIAEAIgAgAAASAAAEACEAAAIAAAAAA=


Does this help?

Thanks,


Federico Nieves

Feb 6, 2017, 8:27:14 AM
to Druid User
Sometimes Druid queries return the following error. Do you think it could be related to these weird HyperLogLog values?

java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkIndex(Buffer.java:540) ~[?:1.8.0_111]
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139) ~[?:1.8.0_111]
        at io.druid.query.aggregation.hyperloglog.HyperLogLogCollector.mergeAndStoreByteRegister(HyperLogLogCollector.java:681) ~[druid-processing-0.9.2.jar:0.9.2]
        at io.druid.query.aggregation.hyperloglog.HyperLogLogCollector.fold(HyperLogLogCollector.java:399) ~[druid-processing-0.9.2.jar:0.9.2]
        at io.druid.query.aggregation.hyperloglog.HyperUniquesAggregator.aggregate(HyperUniquesAggregator.java:48) ~[druid-processing-0.9.2.jar:0.9.2]
        at io.druid.query.timeseries.TimeseriesQueryEngine$1.apply(TimeseriesQueryEngine.java:73) ~[druid-processing-0.9.2.jar:0.9.2]
        at io.druid.query.timeseries.TimeseriesQueryEngine$1.apply(TimeseriesQueryEngine.java:57) ~[druid-processing-0.9.2.jar:0.9.2]
        at io.druid.query.QueryRunnerHelper$1.apply(QueryRunnerHelper.java:80) ~[druid-processing-0.9.2.jar:0.9.2]
        at io.druid.query.QueryRunnerHelper$1.apply(QueryRunnerHelper.java:75) ~[druid-processing-0.9.2.jar:0.9.2]
        at com.metamx.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:39) ~[java-util-0.27.10.jar:?]
        at com.metamx.common.guava.FilteringAccumulator.accumulate(FilteringAccumulator.java:40) ~[java-util-0.27.10.jar:?]
        at com.metamx.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:39) ~[java-util-0.27.10.jar:?]
        at com.metamx.common.guava.BaseSequence.accumulate(BaseSequence.java:67) ~[java-util-0.27.10.jar:?]
...

Thanks!

Gian Merlino

Feb 6, 2017, 1:07:01 PM
to druid...@googlegroups.com
I loaded up your "bad" HLL and it does have some weird values in it. Did you ever deploy a pre-release version of 0.9.2 built from master, or one of the earlier RCs? Some of them had a bug that could cause corrupt HLLs on disk. If you have that kind of corruption, then even if you upgraded to the final 0.9.2 release, you should still go and reindex your data to fix the segments on disk.
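(One way to do that re-indexing is to submit an indexing task that reads the existing segments back through the ingestSegment firehose and writes fresh ones. The sketch below is only an illustration: the overlord URL, datasource, columns, metrics, and interval are placeholders, and the dataSchema has to match your own ingestion spec.)

import requests

OVERLORD = "http://localhost:8090/druid/indexer/v1/task"   # placeholder overlord URL
DATASOURCE = "events"                                       # placeholder datasource
INTERVAL = "2017-01-01/2017-02-01"                          # placeholder interval

reindex_task = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": DATASOURCE,
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "auto"},
                    "dimensionsSpec": {"dimensions": ["country", "user_id"]},  # placeholder columns
                },
            },
            "metricsSpec": [
                {"type": "count", "name": "count"},
                {"type": "hyperUnique", "name": "user_unique", "fieldName": "user_unique"},
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "HOUR",
                "intervals": [INTERVAL],
            },
        },
        "ioConfig": {
            "type": "index",
            # Read the existing segments for the interval back in and re-write them.
            "firehose": {"type": "ingestSegment", "dataSource": DATASOURCE, "interval": INTERVAL},
        },
        "tuningConfig": {"type": "index"},
    },
}

print(requests.post(OVERLORD, json=reindex_task).json())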

Gian


Federico Nieves

Feb 6, 2017, 3:06:05 PM
to Druid User
Hi Gian!

Yes, we were using 0.9.1.1 before upgrading to 0.9.2, and we have data that was ingested by real-time nodes running 0.9.1.1, so as you said, that could be the reason. We will re-index the data and let you know if we still see weird values.

Thanks for the help!


Gian Merlino

Feb 6, 2017, 9:26:30 PM
to druid...@googlegroups.com
0.9.1.1 is fine – actually, no released versions of Druid had this bug. The only buggy versions would have been snapshots built from master at some points between 0.9.1.1 and 0.9.2, and the earlier 0.9.2 RCs. So if you only ever had 0.9.1.1 and 0.9.2 installed then I think you're hitting something else.

It might be some on-disk data corruption caused by some _other reason_, like bad hardware?

Gian


Federico Nieves

Feb 6, 2017, 9:59:28 PM
to Druid User
That could be possible. It would also explain why the error is so intermittent: after a couple of executions of the same query, it succeeds. And it really is very random, with no pattern at all, not tied to segment intervals or anything like that.

We will run a hardware check on all servers and get back to you! Thanks for everything, Gian :D


Federico Nieves

Feb 9, 2017, 4:57:36 PM
to Druid User
Hi there Gian,

I'm going mad. I ran a hardware check on every instance and found nothing. All disks are OK (smartctl shows good values, between 90 and 100 for remaining lifetime, not even close to bad), and dmesg doesn't show any disk errors.

Also, to check that all segments are OK, I ran segment-dump, downloaded every segment from HDFS, and compared CRC32 values (the zip's CRC32, the extracted file's CRC32, and the cached file's CRC32). Everything matched, all of them had the same CRC32, so the segments are fine. My guess is that the error is happening during real-time indexing. Should I try re-indexing all the data and see how it goes?
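(A minimal sketch of that kind of CRC32 comparison; the paths are hypothetical, and it assumes the index.zip files have already been pulled from HDFS to a local directory and that the historical's segment cache mirrors the zip contents.)

import os
import zipfile
import zlib

DEEP_STORAGE_COPY = "/tmp/segments-from-hdfs"   # hypothetical: index.zip files pulled from HDFS
SEGMENT_CACHE = "/var/druid/segment-cache"      # hypothetical: historical's local cache

def crc32_of_file(path):
    """CRC32 of a file, computed in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

for name in os.listdir(DEEP_STORAGE_COPY):
    if not name.endswith(".zip"):
        continue
    with zipfile.ZipFile(os.path.join(DEEP_STORAGE_COPY, name)) as zf:
        for info in zf.infolist():
            # Compare the CRC recorded in the zip entry with the CRC of the
            # corresponding extracted file in the historical's cache.
            cached = os.path.join(SEGMENT_CACHE, name[:-4], info.filename)
            if os.path.exists(cached) and crc32_of_file(cached) != info.CRC:
                print("CRC mismatch:", name, info.filename)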

Also, these errors are intermittent, just like the wrong HyperLogLog values, and that is bad because we put a lot of effort into switching our stack to Druid, but we can't push it to production with these random failures (some errors attached).

I don't know what else to try. Do you have any suggestions?

It would be great to track down which segment produced the failure, but it seems the logs don't show that. I've already started looking through the source code to see if I can find anything...

Thanks,
random_failures.log

Federico Nieves

Feb 16, 2017, 8:53:46 AM
to Druid User
Hi, can anyone help us with these problems? Any kind of hint would be much appreciated!

Thanks,

Nishant Bangarwa

Feb 21, 2017, 9:49:27 AM
to Druid User
Hi Federico,
Can you confirm whether you see the issue only on real-time nodes, or on historicals as well?

If it happens on historical nodes, there might be an issue with the segments, and re-indexing might help.
If it happens only on real-time nodes, I wonder if there is some issue with the real-time segments that caused this.

Federico Nieves

Feb 21, 2017, 11:42:47 AM
to Druid User
Hi Nishant, how are you?

I can confirm that the error happens on historical servers, but I can't say whether it's happening on real-time nodes as well.

I will try mass re-indexing; it's the only thing left to check. The error is still happening.

Thanks for your reply!

Jakub Liska

May 19, 2017, 10:16:53 AM
to Druid User
Hi Federico,

I started having this issue too. It began after moving to EC2 i3 instances: segments loaded from S3, the cluster stabilized, and now, as you say, we're getting these exact same errors on Druid 0.9.2... It is absolutely non-deterministic, which makes it impossible to resolve.

Have you made any progress? 

Jakub Liska

May 19, 2017, 11:17:14 AM
to Druid User
Another interesting thing is that when this happens, the Coordinator somehow loses track of this historical node... and sees only the other ones...

If I restart the historical node, it registers back with the Coordinator and everything works again...

Federico Nieves

May 19, 2017, 11:26:14 AM
to Druid User
Hi Jakub,

We didn't find the exact cause. We don't even know whether it's gone for good. But we made some changes that improved things, and today it's really, really hard to see that weird output.

The first thing I would recommend is to re-index your data (and as often as possible). For example, if you ingest data via real-time nodes, have at least a nightly job that re-indexes it. We are even considering re-indexing every hour to catch possible real-time ingestion errors faster, but right now the nightly batch job works well.

The other thing, and I believe the most important: on the Druid web console (default port 8081), check the size of your segments. Druid's documentation recommends segment sizes between 400 and 700 MB. We sometimes had 2 GB segments, so we changed the indexing granularity, and segments are now within that range.
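(As an illustration only, one way to bring segment sizes down is to use a smaller segmentGranularity in the granularitySpec, for example hourly instead of daily segments; the interval below is a placeholder.)

granularity_spec = {
    "type": "uniform",
    # Smaller segmentGranularity (HOUR instead of DAY) produces more, smaller
    # segments, which helps keep each one inside the recommended 400-700 MB range.
    "segmentGranularity": "HOUR",
    "queryGranularity": "HOUR",
    "intervals": ["2017-01-01/2017-02-01"],   # placeholder interval
}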

After these changes, the errors happen much less often (though they are not gone).

Hope that helps, and if you have any other questions I'd be more than happy to help.

Regards,

Jakub Liska

May 19, 2017, 11:31:38 AM
to Druid User
Hi,

Thank you for the suggestions. We have small 30-60 MB segments and we don't use real-time indexing, so re-indexing should not be necessary at all :-/

What instances are you using? Isn't it i3 by any chance, with those NVMe SSDs?

Federico Nieves

May 19, 2017, 11:43:04 AM
to Druid User
Don't you get any logs on the historical server when it drops out of the cluster?

No, we don't use AWS; we use dedicated physical servers with Samsung SSDs.

Also, I shared a script that checks segment integrity between the segments in deep storage and the segments present on the historicals. I wrote it for HDFS as deep storage; in your case you would need to adapt it to S3, but it could be useful for double-checking that your segments are not corrupted.

Gian Merlino

May 19, 2017, 12:22:51 PM
to druid...@googlegroups.com
I wonder if this is a similar issue to https://github.com/druid-io/druid/issues/4199. Are you all ever seeing exceptions too or just weird results?

Gian


Federico Nieves

May 19, 2017, 12:33:33 PM
to Druid User
In my case I also see exceptions, but in the form of:

ERROR [qtp466056887-27[topN_relyEventData_cb46052d-cf73-4586-8b89-d49593c02b76]] io.druid.server.QueryResource - Exception handling request: {class=io.druid.server.QueryResource, exceptionType=class net.jpountz.lz4.LZ4Exception, exceptionMessage=Error decoding offset 1579958 of input buffer, exception=net.jpountz.lz4.LZ4Exception: Error decoding offset 1579958 of input buffer...

It's always the same problem: "Error decoding offset".



Gian Merlino

May 19, 2017, 12:35:02 PM
to druid...@googlegroups.com
Could you see if there's a fuller stack trace available on a historical node for that? If so, please raise it as a GitHub issue.

Gian


Federico Nieves

May 19, 2017, 12:37:25 PM
to Druid User
Oh, and also, less frequently but worse: a JVM fatal error (the historical shuts down) that creates an error file whose header is the following:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd4cd241eb7, pid=15834, tid=0x00007fd4ccc46700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_111-b14) (build 1.8.0_111-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.111-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [liblz4-java576727587533814320.so+0x5eb7]  LZ4_decompress_fast+0x117
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Again the problem is in LZ4 decompression, but not all exceptions end up in this fatal error.

Federico Nieves

Jun 2, 2017, 12:38:05 PM
to Druid User
I just created the issue on GitHub:


Regards,