I have a data source of about 100M rows with just two columns:
- id [string UUID]
- categories [multi-value array of integers]. This column can take up to 200 possible values, but each row typically has 10-20 of them.
I am running a groupBy query with a set of filters. If I leave the dimensions list empty, as in:
{
  "queryType": "groupBy",
  "dataSource": "dataSource",
  "granularity": "all",
  "dimensions": [],
  "aggregations": [
    { "type": "count", "name": "eventcount" }
  ],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "categories", "value": "1" },
      { "type": "selector", "dimension": "categories", "value": "2" },
      { "type": "selector", "dimension": "categories", "value": "3" },
      { "type": "selector", "dimension": "categories", "value": "4" },
      { "type": "selector", "dimension": "categories", "value": "5" }
    ]
  },
  "intervals": ["1970-01-01T00:00/1970-01-02T00"]
}
then the query returns in about 1-2 seconds.
However, if I run the same query with the multi-value column as the dimension:
{
  "queryType": "groupBy",
  "dataSource": "dataSource",
  "granularity": "all",
  "dimensions": ["categories"],
  "aggregations": [
    { "type": "count", "name": "eventcount" }
  ],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "categories", "value": "1" },
      { "type": "selector", "dimension": "categories", "value": "2" },
      { "type": "selector", "dimension": "categories", "value": "3" },
      { "type": "selector", "dimension": "categories", "value": "4" },
      { "type": "selector", "dimension": "categories", "value": "5" }
    ]
  },
  "intervals": ["1970-01-01T00:00/1970-01-02T00"]
}
the query takes a very long time, 3-4 minutes. Is this expected behavior? I would have thought a grouped count with filters would be fast, since the data is already in memory.
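In case it matters: I was also wondering whether restricting the grouping itself would help, so that grouping only produces rows for the five filtered values rather than one group per array element. This is a sketch of what I mean, using Druid's listFiltered dimension spec in place of the plain "categories" dimension (assuming I'm reading the docs correctly — I haven't verified this is faster):

```json
"dimensions": [
  {
    "type": "listFiltered",
    "delegate": { "type": "default", "dimension": "categories", "outputName": "categories" },
    "values": ["1", "2", "3", "4", "5"]
  }
]
```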
- Alvin