John, thank you very much for some great ideas. I don't think we'd be able to implement your first suggestion just due to the amount of data we'd have to buffer, but I think the last two might be possibilities.
> Do your backfills using batch ingestion instead of streaming.
Would I be correct in assuming that a large number of hourly segments created during batch ingestion would not have the same performance impact that I saw during realtime ingestion?
Our initial data flows through a few microservices for characterization and enrichment before ending up in Druid, and the refeeds go through that same processing flow, so there's no easy way to distinguish a refeed from the quasi-realtime data other than by date. I think the easiest way to handle the refeeds as batch ingestion would be to insert a final microservice that filters these older messages out of the Kafka topic feeding our realtime Druid ingestion task and writes them out to disk to be batch ingested, along the lines of the sketch below. This idea seems the most promising, and I'll keep thinking on it.
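To make that concrete, here's a rough sketch of what that final filter service could look like, assuming kafka-python, JSON messages with an ISO-8601 `timestamp` field, and placeholder topic names and cutoff; the real serialization and output location would depend on our pipeline.

```python
# Hypothetical "refeed splitter": reads the enriched topic, forwards fresh events to
# the topic the streaming ingestion task consumes, and spools late events to disk for
# batch ingestion. Topic names, the bootstrap server, the timestamp field, and the
# 7-day cutoff are all placeholders.
import json
from datetime import datetime, timedelta, timezone

from kafka import KafkaConsumer, KafkaProducer

SOURCE_TOPIC = "enriched-events"          # output of the last enrichment service (assumed name)
REALTIME_TOPIC = "druid-realtime-events"  # topic feeding the streaming ingestion task (assumed name)
LATE_CUTOFF = timedelta(days=7)           # anything older than this goes to batch ingestion

consumer = KafkaConsumer(
    SOURCE_TOPIC,
    bootstrap_servers="localhost:9092",
    group_id="druid-refeed-splitter",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

with open("late_events.jsonl", "a") as late_file:
    for message in consumer:
        event = message.value
        # Assumes an ISO-8601 "timestamp" field on every event.
        event_time = datetime.fromisoformat(event["timestamp"])
        if event_time.tzinfo is None:
            event_time = event_time.replace(tzinfo=timezone.utc)

        if datetime.now(timezone.utc) - event_time > LATE_CUTOFF:
            # Old refeed data: write to disk for a later batch ingestion task.
            late_file.write(json.dumps(event) + "\n")
        else:
            # Quasi-realtime data: pass straight through to streaming ingestion.
            producer.send(REALTIME_TOPIC, event)
```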
> Choose a larger segment granularity for the initial ingestion so there are fewer intervals overall to deal with.
We are currently using concurrent append and replace to prevent conflicts between ingestion and compaction. Although we skip the past 7 days in our autocompaction configs, our data is only quasi-realtime, so we still occasionally see data land outside that window, and that was producing a lot of failed compaction tasks; concurrent append and replace has been great for fixing that. However, the "Known Limitations" section of the concurrent append and replace documentation explicitly warns against mixing segment granularities, and especially against compacting data into a finer granularity, which is what I think we'd want if we did the initial ingestion at DAY and compacted down to HOUR.
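For reference, this is roughly the shape of the per-datasource auto-compaction config I'm describing, pushed to the Coordinator. A sketch only: the datasource name and Coordinator URL are placeholders, and the `useConcurrentLocks` context flag is an assumption about how concurrent append and replace is enabled on our Druid version.

```python
# Rough shape of the auto-compaction config described above, submitted to the
# Coordinator's compaction config endpoint. Names and URLs are placeholders.
import requests

compaction_config = {
    "dataSource": "events",            # placeholder datasource name
    "skipOffsetFromLatest": "P7D",     # leave the trailing 7 days of data alone
    "granularitySpec": {
        "segmentGranularity": "HOUR"   # kept equal to the ingestion granularity
    },
    "taskContext": {
        "useConcurrentLocks": True     # concurrent locking for compaction tasks (assumed flag)
    },
}

response = requests.post(
    "http://coordinator:8081/druid/coordinator/v1/config/compaction",
    json=compaction_config,
)
response.raise_for_status()
```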
Honestly, I'm not sure we actually need the HOUR granularity we currently have. Our segments are pretty much hitting the roughly 5 million rows per segment that the documentation suggests at HOUR granularity, so we don't need a coarser granularity to reach that target. On the other hand, most of our time-based queries seem to span multiple days, so they don't really benefit from HOUR granularity. I need to talk to the user base more to see whether we can get away with DAY granularity instead.
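One way I plan to sanity-check this is to total up rows per day from `sys.segments`; if a typical day is far beyond ~5 million rows, DAY segments would need secondary partitioning to stay near the recommended size. A sketch, with the datasource name and Broker URL as placeholders:

```python
# Sum row counts per day across published, non-overshadowed segments via Druid SQL.
import requests

query = """
SELECT
  SUBSTRING("start", 1, 10) AS segment_day,
  SUM(num_rows)             AS rows_per_day,
  COUNT(*)                  AS hourly_segments
FROM sys.segments
WHERE datasource = 'events'
  AND is_published = 1
  AND is_overshadowed = 0
GROUP BY SUBSTRING("start", 1, 10)
ORDER BY rows_per_day DESC
LIMIT 30
"""

response = requests.post("http://broker:8082/druid/v2/sql", json={"query": query})
for row in response.json():
    print(row)
```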
~Dan