How to roll up older data to monthly granularity


chitra raj

Feb 17, 2021, 7:54:53 PM
to druid...@googlegroups.com
Can someone guide me on how to do a monthly rollup of data that's already ingested into the Druid cluster at weekly granularity?

Chitra

Ben Krug

Feb 18, 2021, 12:00:15 PM
to druid...@googlegroups.com
You can reindex the data, which is basically ingesting from the existing data and either overwriting it or sending it to a new datasource, with a monthly query granularity.  E.g., https://druid.apache.org/docs/0.20.1/ingestion/data-management.html#reindexing-with-native-batch-ingestion .  I'm not sure why they warn against doing this for data > 1 GB; I've seen it used on much larger datasets.
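To sketch the idea: the heart of such a reindexing spec is a druid inputSource reading from the existing datasource, plus a granularitySpec with queryGranularity set to MONTH. This is not a complete spec (timestampSpec, dimensionsSpec, and metricsSpec are omitted, and the datasource name and interval are placeholders):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MONTH",
        "queryGranularity": "MONTH",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "my_datasource",
        "interval": "2020-01-01/2021-01-01"
      },
      "appendToExisting": false
    }
  }
}
```

Reading from and writing to the same datasource with appendToExisting false overwrites the covered interval with the re-rolled-up segments.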


chitra raj

Feb 18, 2021, 2:00:35 PM
to druid...@googlegroups.com
I am unable to find the exact API payload. Can you provide an example that pulls data from the current datasource and builds re-indexed data?
Chitra

Ben Krug

Feb 18, 2021, 3:02:35 PM
to druid...@googlegroups.com
Here is one I had lying around.  I'm not sure whether leaving dimensions and metrics out of the spec will grab everything; you can try that, or list them all explicitly.  In this one I load from one datasource into another, but you can make them the same and keep appendToExisting false, so that you overwrite the old data with the new.

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "reLoad",
      "timestampSpec": {
        "format": "auto",
        "column": "ts"
      },
      "transformSpec": {
        "transforms": [
          {
            "type": "expression",
            "name": "legacy_order_downloaded",
            "expression": "bytes_sent >= byte_offset"
          }
        ],
        "filter": {
          "type": "not",
          "field": {
            "type": "selector",
            "dimension": "order_id",
            "value": ""
          }
        }
      },
      "dimensionsSpec": {
        "dimensions": [
          {"name": "order_id", "type": "string"},
          {"name": "legacy_order_downloaded", "type": "long"}
        ]
      },
      "metricsSpec": [
        {"name": "bytes_sent", "type": "longSum", "fieldName": "bytes_sent"},
        {"name": "byte_offset", "type": "longMax", "fieldName": "byte_offset"}
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "firstLoad",
        "interval": "2020-11-09/2020-11-11",
        "dimensions": ["order_id", "legacy_order_downloaded"],
        "metrics": ["byte_offset", "bytes_sent"]
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 500000,
      "maxNumConcurrentSubTasks": 5
    }
  }
}


Peter Marshall

Feb 22, 2021, 10:03:43 AM
to Druid User
Hey Chitra - just a note of caution - it sounds like you want to roll up from week to month?  Have you checked whether the metrics for weeks that span two months end up inside an acceptable month?
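One way to sidestep the misalignment Peter describes: since week buckets do not line up with month boundaries, it may be safer to roll up from the original finer-grained (e.g. hourly) data straight to month, rather than re-rolling the weekly rollups. A sketch of the relevant granularitySpec fragment, assuming the hourly-granularity source data is still available:

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "MONTH",
  "queryGranularity": "MONTH",
  "rollup": true
}
```

With queryGranularity MONTH, every source row is truncated to the first instant of its month before rollup, so nothing can straddle a month boundary.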

chitra raj

Feb 22, 2021, 4:17:55 PM
to druid...@googlegroups.com
Hi Peter/Ben,

Currently the data is rolled up hourly at ingestion time.

I applied the JSON below to run weekly rollups with monthly segments.
JSON payload:
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "rollup-developmentReq",
      "dimensionsSpec" : {
        "dimensions" : [
          "repo",
          "username"
        ]
      },
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "longSum", "name" : "duration", "fieldName" : "duration" },
        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }

      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "month",
        "queryGranularity" : "week",
        "intervals" : ["2021-01-22/2021-02-01"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index",
      "inputSource" : {
        "type" : "druid",
        "index" : "development-requests"
      },
      "inputFormat" : {
        "type" : "json"

      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}

Error after submitting the task to the Overlord with the JSON above:
2021-02-22T21:08:29,020 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
java.lang.NullPointerException
	at org.apache.druid.indexing.common.task.FiniteFirehoseProcessor.process(FiniteFirehoseProcessor.java:96) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:859) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:467) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:137) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:419) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:391) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_242]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
2021-02-22T21:08:29,033 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Unregistering chat handler[index_rollup-developmentReq_2021-02-22T21:08:24.482Z]
2021-02-22T21:08:29,033 INFO [task-runner-0-priority-0] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_rollup-devhub_2021-02-22T21:08:24.482Z] status changed to [FAILED].
2021-02-22T21:08:29,041 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_rollup-developmentReq_2021-02-22T21:08:24.482Z",
  "status" : "FAILED",
  "duration" : 71,
  "errorMsg" : "java.lang.NullPointerException\n\tat org.apache.druid.indexing.common.task.FiniteFirehoseProcessor.pro...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
2021-02-22T21:08:29,066 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering org.apache.druid.server.http.SegmentListerResource as a root resource class
2021-02-22T21:08:29,067 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering com.fasterxml.jackson.jaxrs.smile.JacksonSmileProvider as a provider class
2021-02-22T21:08:29,067 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider as a provider class
2021-02-22T21:08:29,067 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering org.apache.druid.server.initialization.jetty.CustomExceptionMapper as a provider class
2021-02-22T21:08:29,067 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering org.apache.druid.server.initialization.jetty.ForbiddenExceptionMapper as a provider class
2021-02-22T21:08:29,068 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering org.apache.druid.server.initialization.jetty.BadRequestExceptionMapper as a provider class

Peter Marshall

Mar 10, 2021, 3:20:26 AM
to Druid User
Did you get to the bottom of this?

I noted that your ioConfig might need changing.
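To expand on that with a guess: the druid inputSource takes "dataSource" and "interval" fields, not "index", and it reads segments directly, so it should not need an "inputFormat". Assuming the NullPointerException comes from those fields being absent, a corrected ioConfig might look like:

```json
"ioConfig": {
  "type": "index",
  "inputSource": {
    "type": "druid",
    "dataSource": "development-requests",
    "interval": "2021-01-22/2021-02-01"
  },
  "appendToExisting": false
}
```

One caveat: the stack trace shows Druid 0.16.0-incubating, and older releases used the ingestSegment firehose rather than the druid inputSource for reindexing, so on that version the spec may need the firehose form (or an upgrade) instead.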
