Historicals local storage - which AI answer is less of a lie?

richarde

Jan 2, 2025, 7:32:22 PM
to Druid User
Just having a bit of fun and trying to learn something about historical instances and storage, so I asked the Copilot and Perplexity LLMs this question. Which one is less of a lie? I would like the Copilot answer to be true, but I cannot find any mention of a "priority" option in the documentation.

The question asked:

In apache druid historical servers when setting the druid.segmentCache.locations array is there a way to specify priorities of the paths to use? For example if there is a mix of local SSD and network storage? historical/runtime.properties:druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":"600g"}]

Perplexity gave this answer:

In Apache Druid, when setting the `druid.segmentCache.locations` array for Historical servers, there is no built-in way to specify priorities for different storage types like local SSD and network storage[1][4]. The `druid.segmentCache.locations` configuration simply defines where the local cache resides for segments assigned to a Historical node[4].

However, you can implement a strategy to prioritize faster storage by:

1. Listing the faster storage (e.g., SSD) first in the `druid.segmentCache.locations` array.
2. Allocating more space to the faster storage.

For example:

```properties
druid.segmentCache.locations=[
  {"path":"/fast/ssd/path","maxSize":"500g"},
  {"path":"/slower/network/path","maxSize":"1000g"}
]
```

This configuration will cause Druid to fill up the faster SSD storage first before using the slower network storage[5]. While not a true priority system, it effectively prioritizes the use of faster storage for frequently accessed segments.

Remember that Druid uses memory mapping for segment caching, so having sufficient free system memory relative to the total size of `druid.segmentCache.locations` is crucial for optimal query performance[5].

Citations:
[1] https://druid.apache.org/docs/latest/configuration/
[2] https://docs.imply.io/druid/docs/operations/basic-cluster-tuning
[3] https://druid.apache.org/docs/latest/tutorials/cluster/
[4] https://src.isharkfly.com/Apache/druid/src/commit/7b65f0635d04fce905dd7f211c20b104e8009bca/docs/content/Configuration.md
[5] https://druid.apache.org/docs/latest/operations/basic-cluster-tuning/
[6] https://groups.google.com/g/druid-user/c/1MQ6IPngxU0
[7] https://github.com/aws-solutions/scalable-analytics-using-apache-druid-on-aws/blob/main/README.md
[8] https://druid.apache.org/docs/latest/design/deep-storage
[9] https://druid.apache.org/docs/latest/design/coordinator/
[10] https://www.reddit.com/r/HPC/comments/10d9o7n/distributed_ssd_cache_for_network_storage/
[11] https://druid.apache.org/docs/latest/design/historical/
[12] https://druid.apache.org/docs/latest/design/architecture/

Copilot gave this answer:

Yes, in Apache Druid, you can specify priorities for the paths in the druid.segmentCache.locations array by using the priority field. This allows you to prioritize certain storage paths over others, which is useful when you have a mix of local SSD and network storage.

Here's an example configuration:

```properties
druid.segmentCache.locations=[
  {"path": "var/druid/segment-cache-ssd", "maxSize": "300g", "priority": 1},
  {"path": "var/druid/segment-cache-network", "maxSize": "600g", "priority": 2}
]
```

In this example, segments will be stored in the var/druid/segment-cache-ssd path first (priority 1) until it reaches its maximum size, and then in the var/druid/segment-cache-network path (priority 2).

This setup ensures that the faster local SSD storage is utilized before falling back to the network storage[1][2].

Would you like more details on configuring other aspects of Apache Druid?



Ben Krug

Jan 3, 2025, 12:59:09 PM
to druid...@googlegroups.com
I think both are confidently wrong.  (If you find anything in their references that supports what they say, I'd be happy to learn.)

I used an older AI method (Google) and searched the docs for druid.segmentCache.locations. There's a section I eventually found that describes the options we have for how segment cache locations are chosen.

I don't see that any of the options described there would be in line with either response you got.  (Again, happy to learn if I'm missing stuff.)


Renato Cron

Jan 4, 2025, 10:43:27 AM
to Druid User

As already noted, based on the Druid documentation, neither AI response is correct.

There's no built-in priority system for segment cache locations. Here's what you can do instead:

For your use case of mixing local SSD and network storage, you'll want to use Druid's tiering feature instead of trying to prioritize paths within a single Historical, but then you will need to pay roughly double the RAM/CPU for that.

  1. Set up two separate Historical node groups:
    • A "hot" tier using local SSD storage
    • A standard tier using NAS storage
  2. Configure different tiers by setting druid.server.tier in each Historical's runtime.properties (a sketch of matching retention rules follows after this list):

```properties
# Hot tier historicals (local SSD)
druid.server.tier=hot
druid.segmentCache.locations=[{"path":"/local/ssd/path","maxSize":"300g"}]

# Standard tier historicals (NAS)
druid.server.tier=_default_tier
druid.segmentCache.locations=[{"path":"/nas/path","maxSize":"1000g"}]
```
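Note that druid.server.tier on its own only labels the Historicals; the coordinator still needs retention (load) rules to decide which data lands on which tier. Roughly something like the following, set per datasource in the web console or via the coordinator API (the P30D period and the replicant counts here are only placeholders, not a recommendation):

```json
[
  {"type": "loadByPeriod", "period": "P30D", "includeFuture": true, "tieredReplicants": {"hot": 1}},
  {"type": "loadForever", "tieredReplicants": {"_default_tier": 1}}
]
```

With rules along these lines, segments covering the most recent period stay on the SSD-backed hot tier and everything older is served from the NAS-backed tier.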

For segment distribution between cache locations within a single Historical, Druid only supports these strategies via druid.segmentCache.locationSelector.strategy: leastBytesUsed (the default), roundRobin, random, or mostAvailableSize. None of them will behave the way you want, since the NAS will usually be larger than the local disk and will therefore end up holding most of the segments.
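Just to illustrate what that setting looks like in a Historical's runtime.properties (the paths, sizes, and the mostAvailableSize choice are only examples, not a recommendation):

```properties
# Two cache locations on a single Historical
druid.segmentCache.locations=[{"path":"/local/ssd/segment-cache","maxSize":"300g"},{"path":"/nas/segment-cache","maxSize":"1000g"}]
# How Druid picks between them; leastBytesUsed is the default
druid.segmentCache.locationSelector.strategy=mostAvailableSize
```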

Also important to note: regardless of storage type, query performance heavily depends on how much RAM is available for page caching and heap memory to keep frequently accessed segments unpacked in memory. Make sure to properly size your system memory for optimal performance.

richarde

Jan 5, 2025, 12:34:30 AM
to Druid User
Thanks, Renato/Ben,

Renato, your answer, as you mentioned, requires a step upgrade in hardware/instance costs. It would be nice if there were some middle ground between what I have and what you propose. My use case is ingesting streaming telemetry data, thus time series. I have enough hardware resources to keep around one year's data on the Historical nodes' SSDs. Without spending double on instance costs (moving to the next instance type on AWS doubles the cost) to get more online storage, I would like to be able to deploy network storage as a lower-cost/slower segment cache on my Historical nodes. Ideally, segments would move from the fastest to the slower storage based on a priority setting, as the AIs suggested, with my streaming ingestion landing on the fastest/highest-priority storage.

I know about querying from deep storage and MSQ, but that would require quite a bit of redesign of my app to be able to handle the async nature of MSQ. Cloud network storage is quite cheap and I would like to be able to leverage it as a storage tier between local SSD and S3 to have more data online.

Ben Krug

Jan 6, 2025, 1:15:56 PM
to druid...@googlegroups.com
Hello richarde -

It sounds like tiering might help.  You could set up a "hot" tier and a "cold" tier, and update retention rules to move older segments to the cold tier.
I don't know whether moving based on age would help in your case or not, but that's the best I can think of.

See here for more details if you're interested.
