GeoSpatial Query on SparkLineData


Jitesh Mogre

May 31, 2017, 4:08:17 AM
to sparklinedata
Hi Harish / Experts,

Thank you for accepting my request.

I have been using Druid for the last 4 years for reporting, and it is excellent for low-latency queries.
Now I want to use Druid segment data for Spark analytics. I have gone through SparkLineData and it looks very exciting.

Basically, I want to fetch data from Druid segments into a distributed Spark cluster, using GeoSpatial queries and distinct counts on dimensions.

For example, this Druid query:

{
  "queryType": "groupBy",
  "dataSource": "segment",
  "granularity": "all",
  "dimensions": [
    "uid,date,coordinates"
  ],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "spatial",
        "dimension": "coordinates",
        "bound": {
          "type": "radius",
          "coords": [
            51.515338,
            -0.16274
          ],
          "radius": 0.0036231884057971015
        }
      },
      {
        "type": "spatial",
        "dimension": "coordinates",
        "bound": {
          "type": "polygon",
          "abscissa": [
            54.347327104421474,
            54.345851247431185,
            54.34304947481254,
            54.344250257891325,
            54.347327104421474
          ],
          "ordinate": [
            -6.259245872497558,
            -6.265425682067871,
            -6.262807846069336,
            -6.255512237548828,
            -6.259245872497558
          ]
        }
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "numz_count",
      "fieldName": "numz"
    },
    {
      "type": "distinctCount",
      "name": "distinct_user_count",
      "fieldName": "uid"
    }
  ],
  "intervals": [
    "2017-05-30T03:00:00.000Z/2017-05-30T04:00:00.000Z"
  ]
}



In Spark SQL it would look like:
select uid, date, coordinates from <table> where <dateRange>;
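
A slightly fuller sketch of the same intent, assuming the Druid datasource is registered as a Spark table named events with separate lat/lon columns (the "radius" bound is approximated with plain SQL arithmetic here, rather than any spatial UDF):

select uid, date, coordinates,
       sum(numz) as numz_count
from events
where `time` >= '2017-05-30T03:00:00Z'
  and `time` <  '2017-05-30T04:00:00Z'
  -- plain-SQL stand-in for the "radius" bound above; no spatial UDF assumed
  and sqrt(pow(lat - 51.515338, 2) + pow(lon - (-0.16274), 2)) <= 0.0036231884057971015
group by uid, date, coordinates;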


I am using distinctCount to fetch the exact distinct count for uid. distinctCount is a community-contributed extension in Druid, whereas HLL gives approximate values.
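
For comparison, both flavors exist in plain Spark SQL (2.1+) as well; on the same hypothetical events table, count(distinct ...) is exact while approx_count_distinct is HyperLogLog-based:

select count(distinct uid)        as exact_user_count,   -- exact, like Druid's distinctCount
       approx_count_distinct(uid) as approx_user_count   -- HLL-based, approximate
from events;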

My Questions are: 

1. Does SparkLineData support all GeoSpatial queries, such as polygon and radius bounds?
2. Can SparkLineData fully use the segment data on Druid historical nodes?
3. Do I need to load all the JSON files into Spark SQL for querying? (I have more than 1B rows per day, and I load data hourly.)

Please suggest an infrastructure setup and how I can use this product.


Thank you,
Jitesh

harish

May 31, 2017, 10:10:13 AM
to sparklinedata
1. Geospatial and Approx Count support: https://github.com/SparklineData/spark-druid-olap/wiki/Approximate-Count-and-Spatial-Queries
2. Yes, both directly and via the broker; you just need to issue regular SQL: https://github.com/SparklineData/spark-druid-olap/wiki/Druid-Query-Cost-Model
3. If you are asking about querying, then you just write regular SQL, and we translate and optimize it into Druid queries (see the sketch below).
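
As a rough sketch of that flow (the option names below follow the pattern in the project's README but are placeholders for your setup; check the wiki for the exact, current DDL):

CREATE TABLE events
USING org.sparklinedata.druid
OPTIONS (
  druidDatasource "segment",      -- your Druid datasource name
  timeDimensionColumn "time",     -- placeholder: the datasource's time column
  druidHost "localhost",          -- placeholder: your Druid broker host
  columnMapping '{}'              -- placeholder: dimension/metric mapping, see wiki
);

-- after registration, regular SQL is translated into Druid queries:
select uid, date, coordinates
from events
where `time` >= '2017-05-30T03:00:00Z' and `time` < '2017-05-30T04:00:00Z';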

But we do suggest you consider our SNAP BI Platform (http://bit.ly/2oBJSpP). This is the logical progression of SQL on Druid, and it has many more advanced acceleration features.

regards,
Harish.