I'm attempting to reindex an existing datasource that has rollup turned on, with a few accumulating metrics and thetaSketch metrics, in order to remove the rows for a specific user identifier. That user identifier is a dimension of the datasource, so any rolled-up row containing it can safely be dropped whole. As this is a test of reindexing for us, I'm currently just reindexing a few hourly segments into a different datasource.
The issue is that the metrics for the remaining rows are not being preserved. All "count" metrics are reset to 1 in the new segments, and any metric that summed an input field that was dropped during the original ingestion is nulled out. For example, our input data has a "Bytes" field that we aggregate into a "ByteCount" metric, but the "Bytes" field itself is dropped from the data and not added as a dimension in the datasource. In the new segments, the "ByteCount" metric is null. Similarly, the "SrcIp" field in the input isn't saved, only the thetaSketch metric that gives the number of unique IPs. We have some additional fields in the data, but they transfer fine and aren't relevant to this issue.
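For reference, the mismatch can be checked with a native timeseries query like the sketch below, run once against each datasource (swap "dataSource" to "network-usage-reindextest" for the second run). It assumes the metric names from the task spec further down, and it re-aggregates the stored metrics, so the count is read back with a longSum rather than a count aggregator. The exact query is illustrative only.

```json
{
  "queryType": "timeseries",
  "dataSource": "network-usage",
  "intervals": ["2025-02-23T00:00:00.000Z/2025-02-23T01:00:00.000Z"],
  "granularity": "all",
  "aggregations": [
    { "type": "longSum", "name": "ConnectionCount", "fieldName": "ConnectionCount" },
    { "type": "longSum", "name": "ByteCount", "fieldName": "ByteCount" },
    { "type": "thetaSketch", "name": "UniqueIps", "fieldName": "UniqueIps", "size": 16384 }
  ]
}
```

Since only one user's rows are filtered out, the totals from the new datasource should be the same as or slightly lower than the originals, not reset to row counts or null.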
In theory, this seems like it should work. I just want Druid to copy all rows that do not have the specific user identifier exactly as they are. I don't want it to do any new rolling up, but I do need the existing metrics preserved. I feel like there might be something obvious I'm missing, but I don't know what that might be. I've tried with rollup set to both true and false, with the same result.
Here is my current task spec:
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "network-usage",
        "interval": "2025-02-23T00:00:00.000Z/2025-02-23T01:00:00.000Z"
      },
      "appendToExisting": false,
      "dropExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      },
      "maxNumConcurrentSubTasks": 1
    },
    "dataSchema": {
      "dataSource": "network-usage-reindextest",
      "transformSpec": {
        "filter": {
          "type": "not",
          "field": {
            "type": "in",
            "dimension": "UserId",
            "values": [ "bobbyjones" ]
          }
        }
      },
      "timestampSpec": {
        "column": "__time",
        "format": "millis"
      },
      "granularitySpec": {
        "rollup": true,
        "segmentGranularity": "hour"
      },
      "dimensionsSpec": {
        "dimensions": [
          {
            "type": "string",
            "name": "UserId",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          }
        ],
        "includeAllDimensions": false,
        "useSchemaDiscovery": false
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "ConnectionCount"
        },
        {
          "type": "longSum",
          "name": "ByteCount",
          "fieldName": "Bytes"
        },
        {
          "type": "thetaSketch",
          "name": "UniqueIps",
          "fieldName": "SrcIp",
          "size": 16384
        }
      ]
    }
  }
}
Appreciate any help.
~Dan