Hi Harish / Experts,
Thank you for accepting my request.
I have been using Druid for the last four years for reporting, and it is excellent for low-latency queries.
Now I want to use Druid segment data for Spark analytics. I have gone through SparkLineData and it looks very exciting.
Basically, I want to fetch data from Druid segments into a distributed Spark cluster, with geospatial queries and exact distinct counts on dimensions.
For example, this query:
{
"queryType": "groupBy",
"dataSource": "segment",
"granularity": "all",
"dimensions": [
"uid,date,coordinates"
],
"filter": {
"type": "and",
"fields": [
{
"type": "spatial",
"dimension": "coordinates",
"bound": {
"type": "radius",
"coords": [
51.515338,
-0.16274
],
"radius": 0.0036231884057971015
}
},
{
"type": "spatial",
"dimension": "coordinates",
"bound": {
"type": "polygon",
"abscissa": [
54.347327104421474,
54.345851247431185,
54.34304947481254,
54.344250257891325,
54.347327104421474
],
"ordinate": [
-6.259245872497558,
-6.265425682067871,
-6.262807846069336,
-6.255512237548828,
-6.259245872497558
]
}
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "numz_count",
"fieldName": "numz"
},
{
"type": "distinctCount",
"name": "distinct_user_count",
"fieldName": "uid"
}
],
"intervals": [
"2017-05-30T03:00:00.000Z/2017-05-30T04:00:00.000Z"
]
}
In Spark SQL, the equivalent would look something like:
select uid, date, coordinates from <table> where <dateRange>;
I am using distinctCount to fetch the exact distinct count of uid; distinctCount is a community-contributed Druid extension, whereas the HLL-based aggregators give only approximate values.
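To make the intent concrete, here is a minimal standalone Python sketch of what I expect the two spatial bounds and the exact distinct count to compute. The sample events and field names (lat, lon, uid) are made up for illustration; the radius bound values come from the query above.

```python
import math

# Hypothetical sample events; field names are illustrative only.
events = [
    {"uid": "u1", "lat": 51.5150, "lon": -0.1628},
    {"uid": "u2", "lat": 51.5160, "lon": -0.1630},
    {"uid": "u1", "lat": 51.5151, "lon": -0.1627},
    {"uid": "u3", "lat": 40.0000, "lon": -3.0000},  # far outside the radius
]

def in_radius(lat, lon, center, radius):
    # Druid's "radius" bound is a plain Euclidean circle in coordinate space.
    return math.hypot(lat - center[0], lon - center[1]) <= radius

def in_polygon(lat, lon, abscissa, ordinate):
    # Ray-casting point-in-polygon test using the abscissa/ordinate
    # vertex layout from the polygon bound above.
    inside = False
    j = len(abscissa) - 1
    for i in range(len(abscissa)):
        if (ordinate[i] > lon) != (ordinate[j] > lon) and lat < (
            (abscissa[j] - abscissa[i]) * (lon - ordinate[i])
            / (ordinate[j] - ordinate[i]) + abscissa[i]
        ):
            inside = not inside
        j = i
    return inside

# Radius bound taken from the query above.
center, radius = (51.515338, -0.16274), 0.0036231884057971015
matched = [e for e in events if in_radius(e["lat"], e["lon"], center, radius)]

# Exact distinct count over uid, as the distinctCount extension
# computes it (a real set, no HLL approximation error).
distinct_user_count = len({e["uid"] for e in matched})
print(distinct_user_count)  # 2
```

This is what I want to run at scale inside Spark rather than re-implement per query.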
My Questions are:
1. Does SparkLineData support all geospatial query types, such as the polygon and radius (circle) bounds above?
2. Can SparkLineData fully use the segment data served by Druid historical nodes?
3. Do I need to load all the JSON files into Spark SQL for querying? (I have more than 1B rows per day, and I load data hourly.)
Please suggest a suitable infrastructure and how I can use this product.
Thank you,
Jitesh