I have a ~1 GB file ingested into Druid, and I'm now trying to run a groupBy query over it (grouping by two dimensions). The file contains about 2 million records, but the query takes a long time (around 8-10 seconds), whereas the equivalent SQL query on a relational database returns very quickly. I'm wondering whether I made a mistake when indexing the data, or whether it's a memory issue, based on what I've read. Here is my indexing file:
{
  "type" : "index",
  "dataSource" : "niagara_full",
  "granularitySpec" : {
    "type" : "uniform",
    "gran" : "MONTH",
    "intervals" : [ "2013-01-01/2015-01-01" ]
  },
  "aggregators" : [
    { "type" : "count", "name" : "rows" },
    { "type" : "longSum", "name" : "value", "fieldName" : "dataItemValue" }
  ],
  "firehose" : {
    "type" : "local",
    "baseDir" : "/home/alok/druid_playground/niagara/tasks",
    "filter" : "niagara.json",
    "parser" : {
      "timestampSpec" : { "column" : "timestamp" },
      "data" : {
        "format" : "json",
        "dimensions" : [
          "batchEffectiveDate",
          "entityHierarchyTypeId",
          "scenarioId",
          "dataItemId",
          "batchPriority",
          "transactionBatchId",
          "sortDate",
          "parentEntityOId",
          "dataItemValue",
          "transactionId",
          "childEntityOId",
          "legalEntityOId",
          "transactionStatusTypeId",
          "transactionTypeId"
        ]
      }
    }
  }
}
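For reference, I submit the task to the indexing service with something like the following (the filename index_task.json is just a placeholder for the spec above, and I'm assuming the overlord is on localhost with its default port -- adjust for your setup):

curl -X POST -H 'Content-Type: application/json' \
  -d @index_task.json \
  http://localhost:8090/druid/indexer/v1/task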
And here is my groupBy query file:
{
  "queryType" : "groupBy",
  "dataSource" : "niagara_full",
  "granularity" : "all",
  "dimensions" : [ "dataItemId", "parentEntityOId" ],
  "aggregations" : [
    { "type" : "count", "name" : "rows" }
  ],
  "intervals" : [ "2013-01-01/2014-07-01" ]
}
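I run the query against the broker roughly like this (group_by.json is the query file above; I'm assuming the broker is on localhost with its default port, so adjust to your configuration):

curl -X POST -H 'Content-Type: application/json' \
  -d @group_by.json \
  'http://localhost:8082/druid/v2/?pretty'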
Could it be something related to my JVM heap size? I'm not too familiar with that, but when I launch the broker and historical nodes I add -Xmx256m to the java command. Also, my setup is a single historical node on a virtual machine, so it doesn't resemble anything close to a production cluster.
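To be concrete about the heap settings, the historical node launch looks something like this (classpath abbreviated; io.druid.cli.Main is the standard Druid entry point, so only the -Xmx flag is the part I actually set myself):

java -Xmx256m \
  -classpath "config/historical:lib/*" \
  io.druid.cli.Main server historical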
Thanks very much, you've all been extremely helpful so far!
Alok