Ingestion spec vs compaction - clarification


richarde

Apr 19, 2024, 11:15:50 PM
to Druid User
Hi all,

I have the following in my Kafka ingestion spec:

"granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MONTH",
        "queryGranularity": {
          "type": "none"
        },
        "rollup": false,
        "intervals": []
      },

And then I have this autocompaction spec:

{
  "dataSource": "AxonProdTelegrafEvent",
  "taskPriority": 25,
  "inputSegmentSizeBytes": 100000000000000,
  "maxRowsPerSegment": null,
  "skipOffsetFromLatest": "P1M",
  "tuningConfig": {
    "maxRowsInMemory": null,
    "appendableIndexSpec": null,
    "maxBytesInMemory": null,
    "maxTotalRows": null,
    "splitHintSpec": null,
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000,
      "maxTotalRows": null
    },
    "indexSpec": null,
    "indexSpecForIntermediatePersists": null,
    "maxPendingPersists": null,
    "pushTimeout": null,
    "segmentWriteOutMediumFactory": null,
    "maxNumConcurrentSubTasks": null,
    "maxRetry": null,
    "taskStatusCheckPeriodMs": null,
    "chatHandlerTimeout": null,
    "chatHandlerNumRetries": null,
    "maxNumSegmentsToMerge": null,
    "totalNumMergeTasks": null,
    "maxColumnsToMerge": null,
    "type": "index_parallel",
    "forceGuaranteedRollup": false
  },
  "granularitySpec": {
    "segmentGranularity": "MONTH",
    "queryGranularity": {
      "type": "none"
    },
    "rollup": null
  },
  "dimensionsSpec": null,
  "metricsSpec": null,
  "transformSpec": null,
  "ioConfig": null,
  "taskContext": null
}

My question is: if my ingestion and compaction specs are effectively the same, do I need to autocompact at all?

Also, my ingestion can be bursty (not always at the same rate). Should I instead ingest at a finer segmentGranularity of DAY and then compact to MONTH after the first month? Even at MONTH my segment sizes are quite small, not even 100 MB.

richarde

Apr 20, 2024, 3:34:06 AM
to Druid User
Replying to my own post, as I think I figured it out on my own. I should probably run a compaction if the supervisor was stopped, if I did server maintenance, or if anything else caused ingestion to stop, to ensure uniform segments. In my case, manually kicking off a compaction job may be good enough. Am I thinking about this correctly?

John Kowtko

Apr 20, 2024, 10:50:36 AM
to Druid User
Hi Richard,

There are a number of settings that can cause fragmentation during the ingestion process ... taskCount, taskDuration, intermediateHandoffPeriod, maxRowsPerSegment, maxTotalRows ... these can be tweaked to minimize fragmentation, and in many cases my customers do not run compaction at all because it isn't needed.

If you have low ingestion volume I suggest:
 * make sure taskCount is set to 1
 * increase taskDuration from the default 1 hour to 8 or 12 hours
 * as for segmentGranularity, for the determination of MONTH vs DAY I would look at two things initially:
    - are you receiving late-arriving data for many days prior to today? If so, then MONTH will produce less fragmentation
    - if you are not receiving late-arriving data, you have lots of queries that tend to cover short intervals (e.g. 1 day, 1 week), and your overall data retention isn't that long (e.g. less than a year or two), then DAY granularity might be cleaner, because you can compact each day immediately after it passes rather than letting fragmentation build up for an entire month before you can compact it.
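In supervisor-spec terms, the first two suggestions above would look roughly like this. This is only a sketch of the relevant ioConfig fields, not a complete spec; the topic name is a placeholder, and note that Druid durations are ISO 8601, so 12 hours is written PT12H:

```json
{
  "ioConfig": {
    "type": "kafka",
    "topic": "your-topic-here",
    "taskCount": 1,
    "taskDuration": "PT12H"
  }
}
```

With taskCount 1 and a 12-hour taskDuration, you get at most two ingestion tasks' worth of segments per day per time chunk, which is where most of the fragmentation reduction comes from.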

If you are not sure, then please provide your supervisor spec (you can remove the dimensionsSpec and metricsSpec sections for privacy if you want) and a screenshot of the segment count grouped by interval for this datasource.

Thanks.  John

richarde

Apr 22, 2024, 3:35:36 AM
to Druid User
Thanks John,

This has been most helpful.

I have changed my ingestion taskDuration to PT12H, segmentGranularity to DAY, and compaction to MONTH with skipOffsetFromLatest set to P1M. I do not have late-arriving data. This has already created far fewer segments, but I will have to wait until the end of the month for this month's compaction to clean things up. This should give me roughly 60 segments in the current month (two 12-hour tasks per day), which will get compacted into a single segment the next month. As I am dealing with telemetry data, most of my queries will be over the last month or so, but I need to keep data indefinitely.
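For reference, the autocompaction settings this describes would look roughly like the following (a sketch showing only the fields that changed from the defaults; everything else from the earlier spec can stay as-is):

```json
{
  "dataSource": "AxonProdTelegrafEvent",
  "skipOffsetFromLatest": "P1M",
  "granularitySpec": {
    "segmentGranularity": "MONTH"
  }
}
```

skipOffsetFromLatest of P1M tells the coordinator not to touch the most recent month, so compaction only rolls up intervals the streaming tasks have finished writing to.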

John Kowtko

Apr 22, 2024, 8:35:21 AM
to Druid User
Hi Richard,

If you are receiving so little data that you will end up with only one segment for the entire month, then you could run concurrent compaction instead and change your ingestion back to MONTH level ... this will allow the current month to compact periodically, keeping the current month's overall segment count down to just 1 or 2 ...
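Assuming "concurrent compaction" here refers to Druid's concurrent append and replace feature (experimental as of Druid 29), it is enabled through task lock flags in the task context; a minimal sketch of the autocompaction side, under that assumption:

```json
{
  "dataSource": "AxonProdTelegrafEvent",
  "taskContext": {
    "useConcurrentLocks": true
  }
}
```

The matching flag would also go in the supervisor spec's context so the streaming ingestion takes APPEND locks while compaction takes REPLACE locks, letting both run against the current month at the same time. Check the concurrent append and replace documentation for your Druid version before relying on this.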

Thanks.  John

richarde

Apr 23, 2024, 3:00:17 AM
to Druid User
Hi John,

When you say "concurrent compaction", is it:
taskDuration to PT12H
segmentGranularity to MONTH
auto compaction to MONTH with skipOffsetFromLatest not set

I found that not setting skipOffsetFromLatest keeps a compaction task always open, thus chewing up a task slot. My cluster is small (2x i4i.xlarge data servers, m6i.xlarge query and master) as I don't have a high query load yet; I will scale as required. It works just fine at the moment using the basic cluster tuning examples (on v29, which has been rock solid for me). I have a second Kafka ingestion into the same cluster with a much, much higher load, so I am trying to conserve resources (task slots, memory, etc.) as much as possible.

Or would the config be

taskDuration to PT12H
segmentGranularity to MONTH
auto compaction to MONTH with skipOffsetFromLatest PT12H
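In autocompaction-config terms, this second option would look roughly like the following sketch (note that 12 hours in ISO 8601 is PT12H, not P12H; other fields left at their defaults):

```json
{
  "dataSource": "AxonProdTelegrafEvent",
  "skipOffsetFromLatest": "PT12H",
  "granularitySpec": {
    "segmentGranularity": "MONTH"
  }
}
```

The short offset means the coordinator would repeatedly recompact the current month as it fills in, trading some recurring compaction work for a consistently low segment count.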

Thanks again for your help.