Apache Druid – Ingesting multiple objects in JSON returns only single row


Wilhelm D

Feb 16, 2021, 3:44:03 PM
to Druid User

I'm aiming to ingest this JSON data into Apache Druid as multiple rows. The data (about 10x more rows than this example) is served from a proprietary HTTP server that I have no control over. I cannot change how the data is presented from the HTTP server. The data will be pulled periodically, as there is no notification from the HTTP server.

{
  "1": {
    "lastseen": "2021-02-15T05:02Z",
    "name": "Temperature Sensor A",
    "state": {
      "alert": "none",
      "level": 152,
      "on": true,
      "reachable": true
    }
  },
  "3": {
    "lastseen": "2021-02-15T05:01Z",
    "name": "Temperature Sensor B",
    "state": {
      "alert": "none",
      "level": 13,
      "on": false,
      "reachable": true
    }
  },
  "4": {
    "lastseen": "2021-02-15T05:00Z",
    "name": "Temperature Sensor Backup",
    "state": {
      "alert": "none",
      "level": 76,
      "on": true,
      "reachable": true
    }
  }
}

Note that object 2 is missing. This number is an internal record ID from the server, and if object 2 is deleted, this ID will disappear from subsequent requests.

I have created a data source that successfully pulls the data, but it ends up as a single row. As I understand it, since the JSON object has no array at the root level, Druid sees these three objects as a single row.

I suspect that the answer lies somewhere in the flattenSpec area, but I have failed to get this going. The closest I have gotten so far is with the following flattenSpec, but this only returns the first row from the JSON data (though the row is represented correctly).

 "inputFormat": {
    "type": "json",
    "flattenSpec": {
      "fields": [
        {
          "type": "jq",
          "expr": ".[].name",
          "name": "name"
        },
        {
          "type": "jq",
          "expr": ".[].lastseen",
          "name": "lastseen"
        }
      ]
    }
  }

The above flattenSpec gives me a single row that looks like:

lastseen            | name
2021-02-15T05:02Z   | Temperature Sensor A

How can I get Druid to see all the rows? What is the right flattenSpec (or jq / JSONPath expression)?

I am not concerned about performance, but I would prefer not to have to spin up more proxy-like services between the HTTP server and Druid.

Cheers,

Wilhelm

Itai Yaffe

Feb 16, 2021, 3:58:31 PM
to druid...@googlegroups.com
Hey Wilhelm,
I think that essentially you're looking for something like "explode" functionality, so for 1 input row, you want N output rows, correct?
Assuming that's the case, I don't think Druid supports such functionality ATM (see https://github.com/apache/druid/pull/8698).
I think you'll have to pre-process the data before ingesting it to Druid.
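If it helps, here's a minimal pre-processing sketch, assuming a shell with curl and jq available (the URL and output path are placeholders I made up). It turns the keyed object into newline-delimited JSON, which Druid's "json" inputFormat reads as one row per line:

# Fetch the keyed object and emit one compact JSON object per line,
# carrying the server's record ID along as an "id" field.
curl -s http://sensor-server/devices \
  | jq -c 'to_entries[] | .value + {id: .key}' \
  > /tmp/sensors.ndjson

Each line of /tmp/sensors.ndjson then becomes its own row at ingestion time.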


Itai Yaffe

Feb 23, 2021, 8:57:32 AM
to druid...@googlegroups.com
Hey Wilhelm,
After reading your last email a few times, I must admit I'm still not sure whether my comment re: "explode" functionality was relevant or not, so let me make sure I understand correctly:
  1. When Druid retrieves a record from your HTTP server, that record contains multiple JSON objects (whether it's 3 or 30), correct?
    An example of such a single record would be (whitespace intentionally removed):
    {"1":{"lastseen":"2021-02-15T05:02Z","name":"Temperature Sensor A","state":{"alert":"none","level":152,"on":true,"reachable":true}},"3":{"lastseen":"2021-02-15T05:01Z","name":"Temperature Sensor B","state":{"alert":"none","level":13,"on":false,"reachable":true}},"4":{"lastseen":"2021-02-15T05:00Z","name":"Temperature Sensor Backup","state":{"alert":"none","level":76,"on":true,"reachable":true}}}
  2. If so, then essentially you are trying to produce N (again, whether it's 3 or 30) output records for 1 input record, right?
    So based on the previous record, you'd like Druid to produce 3 records, e.g.:

    lastseen            | name
    2021-02-15T05:02Z   | Temperature Sensor A
    2021-02-15T05:01Z   | Temperature Sensor B
    2021-02-15T05:00Z   | Temperature Sensor Backup


If the answer to both questions is "yes", then unfortunately I'm not familiar with a way to achieve that without pre-processing the data, but I'll be happy to be proven wrong here :)

Thanks and good luck!
Itai

On Fri, Feb 19, 2021 at 4:07 PM Wilhelm D <wil...@damsleth.com> wrote:
Hi Itai!

I was a bit imprecise with the wording in my original post; I apologise. From my example (with three JSON objects), I want three rows in Druid. (I was pointing out that my data source has around 30 sensors I want to gather data from, but I only showed three for brevity.)

My issue is that I cannot get Druid to see the consecutive rows. I suspect this is because the JSON objects are output in a key:object format instead of one JSON object per line, so when the JSON is interpreted, it is understood as a single row with a bunch of consecutive columns ("1.lastseen", "1.name", "3.lastseen", "3.name", and so on).

If you load the sample data I provided into jq play, enter ".[][]" as the filter, and enable "Slurp" mode (the checkbox at the top right), you will see a result that I think is close to what Druid expects. As far as I can tell, this would require pre-processing the data before it is loaded. However, I do feel I'm "just so close" to nailing it in the inputFormat/flattenSpec area; the Achilles heel seems to be that everything specified there is applied only after Druid has already decided what constitutes a row.

For example, if I were able to get Druid to run a jq command against the JSON after it is retrieved over HTTP, but before it is interpreted by the engine, this would be solved quite easily. The pipeline

curl -s <source-url> | jq --slurp -c '.[][]'

…would sort this right out. Thoughts on how to approach this?
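For illustration, the same filter on a trimmed-down two-sensor sample (hypothetical input; -c gives compact one-object-per-line output):

$ echo '{"1":{"name":"A"},"3":{"name":"B"}}' | jq --slurp -c '.[][]'
{"name":"A"}
{"name":"B"}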

Cheers,

Wilhelm
