Batch Ingestion via POST

Brad Peabody

Sep 2, 2015, 3:51:35 PM
to Druid User
Hi,

This is perhaps a very basic question, but I'm testing out loading some web traffic logs and other derived data into Druid. I'd like to be able to take a file of records (one line of JSON per record) and POST it directly to an indexer. This is not realtime: I have a set of about 10 log files corresponding to each hour, and this would be a batch load for that hour.

Looking at the documentation, I see that the "local" firehose type lets you read from local files on disk. In my case, though, it would be useful to POST the content over HTTP (the indexer and the machine containing the logs are not the same machine), and not one message at a time (as would seem to be the case with EventReceiverFirehose) but the whole batch at once. Is there an option for that? I realize it may be quite a bit of data and HTTP may not be the greatest transport for it, but I wanted to see if there is a feature I'm missing.
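
To make this concrete, here is roughly the batch ioConfig I have in mind with the local firehose (the baseDir and filter values below are made up for illustration):

"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "local",
    "baseDir": "/var/log/traffic",
    "filter": "*.json"
  }
}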

Best, Brad

Himanshu

Sep 2, 2015, 4:21:58 PM
to Druid User
Hi Brad,

With the event receiver firehose you can post a whole batch of events (Tranquility actually uses the same interface to push events), something like:

[{event1}, {event2}, ...]
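
For example, assuming a task whose firehose has "serviceName": "testService", a peon listening on localhost:8100, and the event fields shown (all placeholders here), a batch push to the firehose's push-events endpoint would look something like:

curl -X POST 'http://localhost:8100/druid/worker/v1/chat/testService/push-events' \
  -H 'Content-Type: application/json' \
  -d '[{"timestamp": "2015-09-02T15:00:00Z", "page": "Main_Page", "added": 12},
       {"timestamp": "2015-09-02T15:00:01Z", "page": "Druid", "added": 3}]'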

-- Himanshu


Brad Peabody

Sep 2, 2015, 4:54:14 PM
to Druid User
Thanks - that would do it. Will try that out.

Brad Peabody

Sep 21, 2015, 4:36:47 AM
to Druid User
Picking this back up: I am now running into an issue with the event receiver firehose. I am using an overlord node in local mode to load realtime data via an EventReceiverFirehose, and when I submit the task I get:

Problem accessing /druid/indexer/v1/task. Reason:

    com.google.inject.ProvisionException: Guice provision errors:

1) Error in custom provider, com.metamx.common.ISE: Cannot add a handler after the Lifecycle has started, it doesn't work that way.
  at io.druid.guice.DruidProcessingModule.getProcessingExecutorService(DruidProcessingModule.java:90)
  at io.druid.guice.DruidProcessingModule.getProcessingExecutorService(DruidProcessingModule.java:90)
  while locating java.util.concurrent.ExecutorService annotated with @io.druid.guice.annotations.Processing()
    for parameter 0 at io.druid.query.IntervalChunkingQueryRunnerDecorator.<init>(IntervalChunkingQueryRunnerDecorator.java:38)
  while locating io.druid.query.IntervalChunkingQueryRunnerDecorator
    for parameter 0 at io.druid.query.timeseries.TimeseriesQueryQueryToolChest.<init>(TimeseriesQueryQueryToolChest.java:72)
  at io.druid.guice.QueryToolChestModule.configure(QueryToolChestModule.java:71)
  while locating io.druid.query.timeseries.TimeseriesQueryQueryToolChest
    for parameter 0 at io.druid.query.timeseries.TimeseriesQueryRunnerFactory.<init>(TimeseriesQueryRunnerFactory.java:51)
  at io.druid.guice.QueryRunnerFactoryModule.configure(QueryRunnerFactoryModule.java:80)
  while locating io.druid.query.timeseries.TimeseriesQueryRunnerFactory
  while locating io.druid.query.QueryRunnerFactory annotated with @com.google.inject.multibindings.Element(setName=,uniqueId=18, type=MAPBINDER)
  at io.druid.guice.DruidBinders.queryRunnerFactoryBinder(DruidBinders.java:36)
  while locating java.util.Map<java.lang.Class<? extends io.druid.query.Query>, io.druid.query.QueryRunnerFactory>
    for parameter 0 at io.druid.query.DefaultQueryRunnerFactoryConglomerate.<init>(DefaultQueryRunnerFactoryConglomerate.java:34)
  while locating io.druid.query.DefaultQueryRunnerFactoryConglomerate
  at io.druid.guice.StorageNodeModule.configure(StorageNodeModule.java:53)
  while locating io.druid.query.QueryRunnerFactoryConglomerate

I don't follow the internals well enough to understand how this request creates a lifecycle problem.

The steps I am taking to trigger this are below. This is my task, rt-test.json:
{
  "type": "index_realtime",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "page",
              "language",
              "user",
              "unpatrolled",
              "newPage",
              "robot",
              "anonymous",
              "namespace",
              "continent",
              "country",
              "region",
              "city"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "doubleSum", "name": "added", "fieldName": "added" },
        { "type": "doubleSum", "name": "deleted", "fieldName": "deleted" },
        { "type": "doubleSum", "name": "delta", "fieldName": "delta" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE"
      }
    },
    "ioConfig": {
      "type": "realtime",
      "firehose": {
        "type": "receiver",
        "serviceName": "testService",
        "bufferSize": 10000
      },
      "plumber": {
        "type": "realtime"
      }
    },
    "tuningConfig": {
      "type": "realtime",
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10m",
      "windowPeriod": "PT10m",
      "basePersistDirectory": "/tmp/realtime/basePersist",
      "rejectionPolicy": {
        "type": "serverTime"
      }
    }
  }
}



And I am submitting it with:

curl -X POST 'http://localhost:8090/druid/indexer/v1/task' -H 'content-type: application/json' -d@rt-test.json
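
For reference, after submitting I check whether the overlord accepted the task by polling the status endpoint (TASK_ID below is a placeholder for the ID returned by the submit call):

curl 'http://localhost:8090/druid/indexer/v1/task/TASK_ID/status'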


Any input?

Brad Peabody

Sep 21, 2015, 4:57:22 AM
to Druid User
I realized after I posted this that I jumped subjects a bit: this is a realtime example, while my original post was about batch data. I ended up doing batch processing via local files; at this point I'm trying to get realtime to work, hence this question. (Sorry about the thread misuse!)

Fangjin Yang

Sep 22, 2015, 12:01:28 PM
to Druid User
Hi Brad, trying to stream files directly into Druid is going to be extremely difficult until https://groups.google.com/forum/#!searchin/druid-development/windowperiod/druid-development/kHgHTgqKFlQ/fXvtsNxWzlMJ is completed.

For the time being, if you have a static set of data, I recommend batch ingestion; if you have a stream of current-time data, use realtime ingestion.