Druid group by query slow


asa...@princeton.edu

Jul 8, 2014, 2:00:29 PM7/8/14
to druid-de...@googlegroups.com
Hey all,

I have a ~1 GB file ingested into Druid, and now I'm trying to run a groupBy query over it (grouping by two dimensions). The file contains about 2 million records, but the query takes a long time (around 8-10 seconds) compared to SQL, which handles it very quickly. I'm wondering whether I made a mistake indexing the data, or maybe it's a memory issue, based on what I've read. Here is my indexing file:

{
    "type" : "index",
    "dataSource" : "niagara_full",
    "granularitySpec" :
    {
        "type" : "uniform",
        "gran" : "MONTH",
        "intervals" : [ "2013-01-01/2015-01-01" ]
    },
    "aggregators" :
    [
        { "type" : "count", "name" : "rows" },
        { "type" : "longSum", "name" : "value", "fieldName" : "dataItemValue"}
    ],
    "firehose" :
    {
        "type" : "local",
        "baseDir" : "/home/alok/druid_playground/niagara/tasks",
        "filter" : "niagara.json",
        "parser" :
        {
            "timestampSpec" : { "column" : "timestamp"},
            "data" :
            {
                "format" : "json",
                "dimensions" :
                [
                    "batchEffectiveDate",
                    "entityHierarchyTypeId",
                    "scenarioId",
                    "dataItemId",
                    "batchPriority",
                    "transactionBatchId",
                    "sortDate",
                    "parentEntityOId",
                    "dataItemValue",
                    "transactionId",
                    "childEntityOId",
                    "legalEntityOId",
                    "transactionStatusTypeId",
                    "transactionTypeId"
                ]
            }
        }
    }
}

And here is my group by query file:

{
        "queryType" : "groupBy",
        "dataSource" : "niagara_full",
        "granularity" : "all",
        "dimensions": ["dataItemId", "parentEntityOId"],
        "aggregations":
        [
                {"type": "count", "name": "rows" }
        ],
        "intervals": [ "2013-01-01/2014-07-01" ]
}



Maybe it's something related to my JVM heap size? I'm not too familiar with that, but when I launch the broker and historical nodes I add -Xmx256m to the java command. Also, my setup is a single historical node on a virtual machine, which doesn't resemble anything close to a production cluster.
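
For comparison, a less memory-starved historical launch might look something like the sketch below. The classpath and config paths are placeholders, the exact sizes are assumptions based on a machine with a few GB of free RAM, and only -Xmx/-Xms/-XX:MaxDirectMemorySize are standard JVM flags:

```shell
# Sketch only: paths are placeholders, sizes depend on available RAM.
# Druid keeps its processing buffers off-heap, so MaxDirectMemorySize
# matters as much as the heap size itself.
java -Xms2g -Xmx2g -XX:MaxDirectMemorySize=4g \
  -classpath "config/historical:lib/*" \
  io.druid.cli.Main server historical
```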

Thanks very much, you've all been extremely helpful so far!

Alok





Fangjin Yang

Jul 8, 2014, 2:30:07 PM7/8/14
to druid-de...@googlegroups.com
Hi,

Before investigating too much further, can you run a timeseries query over all your data and see how long that takes? I am almost certain that the groupBy slowness you are seeing is an artifact of configuration which we can help debug.

Just to get an idea, we have open source benchmarks: http://druid.io/blog/2014/03/17/benchmarking-druid.html

Thanks,
FJ

asa...@princeton.edu

Jul 8, 2014, 2:52:38 PM7/8/14
to druid-de...@googlegroups.com
I ran a simple timeseries query with a count aggregator and month granularity. It was fast, but maybe that's just because the query is much simpler?
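
For reference, a timeseries query of the kind described here might look like the following sketch (assuming the same dataSource and interval as the groupBy query above):

```json
{
    "queryType" : "timeseries",
    "dataSource" : "niagara_full",
    "granularity" : "month",
    "aggregations" :
    [
        { "type" : "count", "name" : "rows" }
    ],
    "intervals" : [ "2013-01-01/2014-07-01" ]
}
```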

Fangjin Yang

Jul 8, 2014, 3:01:56 PM7/8/14
to druid-de...@googlegroups.com
Hi,

Can you quantify "fast", and also describe the hardware you are using? E.g., how long does a timeseries with a single count aggregator take? I am trying to get a sense of rows scanned per second.



asa...@princeton.edu

Jul 8, 2014, 3:29:19 PM7/8/14
to druid-de...@googlegroups.com
The file contains 2 million rows, and a timeseries query with a count aggregation takes approximately 0.55 seconds.

Hardware (host computer):

Intel(R) Core(TM) i7-4770 CPU @ 3.40 GHz
RAM: 8 GB
64-bit operating system

Ubuntu 14.04 Virtual Machine:
RAM: 2 GB

Fangjin Yang

Jul 9, 2014, 12:02:21 PM7/9/14
to druid-de...@googlegroups.com
That is very slow. Can you share your broker and historical configs, as well as the total size of segments on your historical node?

asa...@princeton.edu

Jul 9, 2014, 5:15:07 PM7/9/14
to druid-de...@googlegroups.com
Switched everything to a new VM with 8 GB of RAM. That brought the query time down from 14 seconds to about 1.5 seconds (with 3 threads running). I'll let you know if I run into any other problems with query speeds -- pretty sure the initial issue had to do with hardware. Thanks!