Druid DataNode storage vs HDFS Storage


ANANTHAN A

Mar 4, 2024, 6:23:02 AM
to Druid User
Hi Team,

We are new to Druid. In our Druid cluster setup, we are using HDFS as deep storage. From the documentation and posts we have read, it seems that Druid copies segments from deep storage (HDFS) to local storage on the data nodes to make them available for queries.
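
For context, our deep storage is configured along the usual lines in common.runtime.properties (only the relevant lines shown; the namenode host and path are placeholders):

    druid.extensions.loadList=["druid-hdfs-storage"]
    druid.storage.type=hdfs
    druid.storage.storageDirectory=hdfs://namenode:8020/druid/segments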

My question is: if I plan to ingest 100 TB of data into Druid with a replication factor of 2 in both HDFS and Druid, do I need a minimum of 200 TB of storage in both Druid's data nodes and in HDFS to hold that 100 TB of data?

In other words, if the replication factor is the same in HDFS and Druid, do I need the same amount of space on the Druid data nodes as I have in HDFS? If yes, is there any way to reduce the storage needed on the Druid data nodes, for example by fetching uncached data from HDFS on demand?


We are using Druid version 28.0.1, FYI.


Thanks,
Ananthan.

John Kowtko

Mar 4, 2024, 8:09:17 AM
to Druid User
Hi Ananthan,

Druid stores only one replica of each segment in Deep Storage ... it assumes the Deep Storage facility will take care of high-availability needs by doing whatever it needs to (presumably it replicates the files underneath, but Druid is not concerned with that). Also, segment files are zipped up when stored in Deep Storage, so they can be as little as 1/3 the size they occupy when loaded onto Historicals.
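
To put rough numbers on your example (purely illustrative, since compression varies with the data): if your segments occupy ~100 TB once loaded on Historicals, they might be only ~33 TB of zipped files in Deep Storage, and with an HDFS replication factor of 2 that is roughly 66 TB of raw HDFS capacity, not 200 TB.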

On the Historicals (data nodes), segments are stored in unzipped form, and the number of copies of each segment is dictated by each datasource's Retention Rules. An active segment can have zero or more copies stored on Historicals ... "zero" being the Cold Tier case. To make it easier to monitor the disk used, the Druid web console's Datasources tab has columns that show "Total Data Size" and "Replicated Size" for each datasource.
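
For example, a datasource's Retention Rules might look roughly like this (the period, tier name and replica counts are just placeholders; you set these via the web console or the Coordinator rules API):

    [
      { "type": "loadByPeriod", "period": "P3M", "tieredReplicants": { "_default_tier": 2 } },
      { "type": "loadForever",  "tieredReplicants": { "_default_tier": 1 } }
    ]

i.e. two copies of the most recent three months on the default tier, and one copy of everything older.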

John

ANANTHAN A

Mar 4, 2024, 12:42:19 PM
to Druid User
Thanks for the reply, John.

Initially, we thought that HDFS was the primary storage, with the data nodes serving as a cache for fast queries. We assumed the data nodes might load partial data from HDFS on demand, which would have allowed a much smaller storage footprint on the data nodes.

Peter Marshall

Mar 25, 2024, 10:44:50 AM
to Druid User
Just adding to John's note here that, via MSQ, you can query directly from deep storage. Check out the notebook on this that's called Full Timeline Queries in the query section, if memory serves...
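
From memory it goes through the async SQL statements API, something along these lines (router host and datasource are placeholders; double-check the query-from-deep-storage docs for 28.0.1):

    curl -X POST "http://ROUTER_HOST:8888/druid/v2/sql/statements" \
      -H "Content-Type: application/json" \
      -d '{"query": "SELECT COUNT(*) FROM \"my_datasource\"", "context": {"executionMode": "ASYNC"}}'

You then poll GET /druid/v2/sql/statements/<queryId> for status and results. It runs as an MSQ task, so it's slower than a normal query against the Historicals, but older data doesn't have to sit on local disk.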

Ben Krug

Mar 25, 2024, 12:13:02 PM
to druid...@googlegroups.com
To add some more ... As John said, you can load a copy (or more) of each segment to the data nodes, or not. You can also query deep storage directly with MSQ.

The general use case for "hot" data is to load 1 or 2 copies onto the data nodes, and I believe they mmap and cache that data as well as they can, for fast querying.

So, if you have, e.g., 100 GB of data in deep storage and 10 data nodes, you'd need 10 GB of segment cache on each data node for 1 replica, 20 GB for 2 replicas, etc.
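
The per-node capacity itself comes from the Historical's segment cache settings, something like this in the historical runtime.properties (sizes are just examples):

    druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":"300g"}]
    druid.server.maxSize=300g

The Coordinator won't assign more segment bytes to a node than druid.server.maxSize allows, so those values times the number of data nodes, divided by your replica count, bound how much data you can keep loaded.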

Hope this all helps!

