Maintain metrics when reindexing rolled up Datasource to filter out rows


Daniel Nash

Mar 14, 2025, 4:02:54 PM
to Druid User
I'm attempting to reindex an existing datasource that has rollup turned on, along with a few accumulating metrics and ThetaSketch metrics, in order to remove the rows for a specific user identifier.  That user identifier is a dimension of the datasource, so any rolled-up row containing that identifier can safely be dropped as a unit.  Since this is a test of reindexing for us, I'm currently just reindexing a few hourly segments into a different datasource.

The issue is that the metrics for the remaining rows do not seem to be maintained.  All "count" metrics are reset to 1 in the new segments, and any metrics that were summing fields that were dropped during ingest are nulled out.  For example, our input data has a "Bytes" field that we aggregate into a "TotalBytes" metric, but the "Bytes" field itself is dropped from the data and not added as a dimension in the datasource.  In the new segments, the "TotalBytes" metric is null.  Similarly, the "SrcIp" field in the input isn't saved; only the thetaSketch metric giving the number of unique IPs is kept.  We have some additional fields in the data that transfer fine, but they aren't relevant to this issue.

In theory, this seems like it should work.  I just want Druid to copy all rows that do not have the specific user identifier exactly as they are.  I don't want it to actually do any new rolling up, but I do need the existing metrics preserved.  I feel like there might be something obvious I'm missing, but I don't know what that might be.  I've tried with rollup set to true and false with the same result.

Here is my current task spec:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "network-usage",
        "interval": "2025-02-23T00:00:00.000Z/2025-02-23T01:00:00.000Z",
      },
      "appendToExisting": false,
      "dropExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      },
      "maxNumConcurrentSubTasks": 1
    },
    "dataSchema": {
      "dataSource": "network-usage-reindextest",
      "transformSpec": {
        "filter": {
          "type": "not",
          "field": {
            "type": "in",
            "dimension": "UserId",
            "values": [ "bobbyjones" ]
          }
        }
      },
      "timestampSpec": {
        "column": "__time",
        "format": "millis"
      },
      "granularitySpec": {
        "rollup": true,
        "segmentGranularity": "hour",
      },
      "dimensionsSpec": {
        "dimensions": [
          {
            "type": "string",
            "name": "UserId",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          }
        ],
        "includeAllDimensions": false,
        "useSchemaDiscovery": false
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "ConnectionCount"
        },
        {
          "type": "longSum",
          "name": "ByteCount",
          "fieldName": "Bytes"
        },
        {
          "type": "thetaSketch",
          "name": "UniqueIps",
          "fieldName": "ScrIp",
          "size": 16384
        }
      ]
    }
  }
}

Appreciate any help.

~Dan

Daniel Nash

Mar 14, 2025, 5:11:13 PM
to Druid User
I just wanted to add that I discovered in a random conversation here that Compact tasks have a transformSpec now.  I ran a test and found I could filter out the rows I needed using a Compact task and retain the metrics for the remaining rows.  I feel like, given that, there must be a way to do it with the Reindex task too, but I haven't hit on the magic combination yet.
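
For anyone else trying this, a compaction task spec along these lines does the filtering (a sketch only; tuningConfig omitted, and the interval and filter values are the same as in my reindex spec above):

{
  "type": "compact",
  "dataSource": "network-usage",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2025-02-23T00:00:00.000Z/2025-02-23T01:00:00.000Z"
    }
  },
  "transformSpec": {
    "filter": {
      "type": "not",
      "field": {
        "type": "in",
        "dimension": "UserId",
        "values": [ "bobbyjones" ]
      }
    }
  }
}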

So, I have a solution if I keep it within the same datasource, which is really what I need, but I'd prefer to experiment some more by Reindexing to a new datasource while filtering out data if anyone has more thoughts on that.

~Dan

John Kowtko

Mar 14, 2025, 7:23:59 PM
to Druid User
Hi Daniel, I thought I replied to this, but now I don't see my reply --

A couple of things I noticed:

 * The count metric is type "count" ... that is fine for initial ingestion when you are counting the raw events coming in, but subsequent reindexes and compactions should use longSum there (see the snippet after this list).
 * Make sure you include all the dimension fields you want to keep in the dimensionsSpec; otherwise they will be removed.
 * Druid has a fluid schema, i.e. the column set can differ from one time interval to the next.  So the NULL values you are seeing are likely because older segments have fields that the new segments don't have ... those fields will still show up in queries, just with null values.
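
For example, to carry your existing "ConnectionCount" metric forward, the aggregator would look something like this (a longSum pointed at the metric column itself rather than at a raw input field):

{
  "type": "longSum",
  "name": "ConnectionCount",
  "fieldName": "ConnectionCount"
}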

Let me know if that makes sense, and if you have any followup questions.

Thanks.  John

Daniel Nash

Mar 15, 2025, 1:24:43 PM
to Druid User
Hey John,

Thank you so very much for your reply.  You gave me the missing pieces that got this all working.  The NULL values in my case should not have been NULL as the original data rows had valid data in them, but I appreciate what you are saying about the possibility.

Based on your comments, I was able to just change the metricsSpec section of my original reindex task spec to this and it all worked as intended:

      "metricsSpec": [
        {
          "type": "longSum",
          "name": "ConnectionCount"
          "fieldName": "ConnectionCount"

        },
        {
          "type": "longSum",
          "name": "ByteCount",
          "fieldName": "ByteCount"

        },
        {
          "type": "thetaSketch",
          "name": "UniqueIps",
          "fieldName": "UniqueIps",
          "size": 16384
        }
      ]

As you can see, I needed to change the metrics to point at the metric columns that were set up during initial ingestion instead of at the input fields that had been discarded.  Everything seemed to carry over correctly, even the ThetaSketch ones.

Cheers,
Dan

John Kowtko

Mar 17, 2025, 11:52:19 AM
to Druid User
Hi Daniel, glad to hear you got it all figured out ... another crisis averted ...!  ;) 

Let us know if you run into any other issues or questions.

Thanks.  john
