Sharing deep storage & metadata between clusters

AR

Apr 27, 2023, 11:51:08 PM
to Druid User
Hi,

We are setting up 2 clusters - primary & secondary. Both clusters have their own ZK, master, data & query nodes (nothing shared between them). The objective of the secondary is to provide load balancing for queries and to run ingestion tasks if the primary has to be brought down for any reason (upgrade, server maintenance, outage, etc.).

We were initially thinking of having separate deep storage & metadata for both clusters and ingesting data into both of them simultaneously. However, this wastes resources and adds the additional burden of maintaining the ingestion jobs and ensuring data integrity on both clusters. The deep storage & metadata DB already have redundancy and failover. What we actually need is a standby cluster which can serve queries (load balancing) and pick up ingestion workloads if the primary is down for any reason.

We were looking at the cluster setup options and it seems possible for both clusters to share the same deep storage and metadata DB.
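
Concretely, the idea is that both clusters would point their common.runtime.properties at the same deep storage location and the same metadata store, while each keeps its own ZooKeeper ensemble. Roughly along these lines (a sketch with hypothetical hosts and bucket names; the property names are the standard Druid settings):

# common.runtime.properties on BOTH clusters (hypothetical values)
# Shared deep storage
druid.storage.type=s3
druid.storage.bucket=my-druid-deep-storage
druid.storage.baseKey=druid/segments

# Shared metadata store
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://pg-host:5432/druid
druid.metadata.storage.connector.user=druid

# NOT shared -- each cluster points at its own ZK ensemble (primary shown here)
druid.zk.service.host=zk-primary-1:2181,zk-primary-2:2181,zk-primary-3:2181
druid.zk.paths.base=/druid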

Scenario 2 in the link below seems to imply that 2 clusters can share the segment files.

1. If ingestion workloads (realtime & batch) run on the primary, will the secondary pick up and load the segments when they are published from the primary cluster, since they share the metadata DB?
2. We are planning on keeping the MiddleManager processes down on the secondary so that no ingestion tasks can run there. We will start them up when needed. When we start a realtime ingestion task on the secondary which was previously running on the primary, will it continue from the same offset? We believe it should, since the metadata is the same.
3. Will the secondary cluster serve queries from the historicals if the MiddleManagers are not running? We believe that this should not be an issue.
4. We understand that queries to realtime processes will have to run solely on the primary, since the MiddleManager tasks are running there; otherwise the query results will not contain the latest data.
5. Are there any other operational issues or risks with this approach?

Thanks,
AR.

Sergio Ferragut

May 1, 2023, 11:24:29 AM
to Druid User
Answers: 
#1 - Yes. But this is where problems will start. The coordinator on each cluster will make different decisions about where segments need to be replicated and store that info in the metadata DB (MDB). The MDB would end up with a mix of segment-to-historical mappings from the two clusters. This would not work.
#2 - Correct. Only one cluster could be doing ingestion to avoid data corruption. Plus the overlord is the locking authority, so you should never have two that are active on the same data.
#3 - The conflict will occur in the MDB, where each cluster's coordinator would write different information about the distribution of segment replicas across its own historicals. Brokers would have info about segments on different clusters. I'm not sure if this would cause them to submit portions of queries to both clusters. I don't think this is workable without additional info about where to direct queries.
#4 - Not sure. Real-time segments are also written to the MDB, so a newly starting broker on the secondary cluster would also see these segments and likely direct queries to them. New segment info announced to the brokers would only be seen by the primary, but this might get problematic.
#5 - As mentioned in the other answers, the metadata conflicts would be the primary issue, I think.

Sharing Deep Storage and ingesting only once seem like interesting optimizations for multi-cluster HA.

I wonder if a future enhancement could go in this direction by adding a cluster_id to the metadata tables wherever it makes sense in order to solve this conflict. The coordinator in cluster one would only affect rows belonging to cluster one, and the coordinator in cluster two would only affect cluster two's metadata.
Overlords have a similar conflict when it comes to tasks and their lock management responsibilities. One solution, like you said, would be to only do ingestion in one cluster. But perhaps here there could also be some failover improvements to redirect ingestion to different clusters. With multi-cluster locking at the datasource level? Maybe...

For now, I think you will need to ingest twice.

John Kowtko

May 1, 2023, 11:47:25 AM
to Druid User
Sergio, will Historical/Broker Tiering help here, at least for the query isolation part?

AR

May 2, 2023, 6:01:07 AM
to Druid User
Hi Sergio,
Thank you for your inputs.
If I understand correctly, then the main blocker for the "ingest once and use in multiple clusters" approach is the segment distribution metadata that is updated in the MDB by the coordinator.

We actually set up the clusters as described above and ingested data for one of the datasources. I have queried the data from both clusters and the results look fine so far.

Looking at the Druid console (on both clusters), the segment distribution seems to be exposed through the "sys.server_segments" table. I was looking at the metadata tables in Postgres but I could not find any of the tables from the "sys" schema in there. Where are these tables persisted? If they are persisted in ZK, then it should not be an issue since the clusters don't share ZK (even with a shared ZK, they could still use a different base path). Querying this table shows no conflict between segments and hosts. All segments seem to be loaded on the historical nodes belonging to that cluster.

I restarted the secondary cluster and checked the logs on the coordinator & broker. Only the hosts belonging to the secondary cluster were detected. I shut down a couple of historical nodes on the secondary so that segment rebalancing would be triggered. The cluster did rebalance and the segments were reassigned to the remaining nodes in the same cluster (checked by querying the "sys.server_segments" table from the Druid console). When I checked the primary cluster, the segment allocation had not changed (I used a specific segment id for testing). The queries still returned correct results on both clusters.
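
For reference, the console check is roughly the following (a sketch; per the sys schema, "sys.server_segments" only has server and segment_id columns, so the server names are what tell you which cluster's historicals hold each segment):

-- Run on each cluster's console: the servers listed should all belong
-- to that cluster, and the counts should shift after rebalancing
SELECT server, COUNT(*) AS num_segments
FROM sys.server_segments
GROUP BY server
ORDER BY server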

Only the "Ingestion" tab on the secondary cluster shows the tasks that ran on primary as these details are coming from the "druid_tasks" metadata table.

While your point about the segment distribution conflict seems valid, the test doesn't seem to back it up. Or am I missing something very obvious?? :)

We will keep the setup running for a few more days and check if we hit any issues. In addition, we are also planning on shutting down the Overlord process on the secondary cluster (in addition to the MMs) so that there is no possibility of any task running on the secondary cluster. Is there any task that is triggered by the coordinator directly and doesn't run on a MiddleManager?

Thanks,
AR.

Sergio Ferragut

May 2, 2023, 10:20:15 AM
to Druid User
That is very interesting. I'm wondering whether the brokers are submitting queries to historicals across both clusters. I'm not sure how the brokers distinguish among the segments that belong to the "local" and "external" clusters. Perhaps there is a natural filtering mechanism where only the historicals that announce themselves locally are considered. 

I am very interested in seeing your continued results. Perhaps it makes sense to test failover scenarios and look in some detail at the queries that are running and which historicals are involved in each query response.

As John mentions, perhaps setting up tier affinity can help ensure the correct tier of historicals is used. I don't know of anyone who has set this up across clusters, so I really don't know the behavior. But if brokers are spreading queries across clusters, this might help get that under control by naming the historical tiers differently across clusters and having brokers restrict the tiers they watch with druid.broker.segment.watchedTiers.
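
A rough sketch of what that could look like, with hypothetical tier names (druid.server.tier on the historicals and druid.broker.segment.watchedTiers on the brokers are the relevant properties):

# Historicals, primary cluster (runtime.properties)
druid.server.tier=primary_tier

# Historicals, secondary cluster
druid.server.tier=secondary_tier

# Brokers, primary cluster -- only watch the local tier
druid.broker.segment.watchedTiers=["primary_tier"]

# Brokers, secondary cluster
druid.broker.segment.watchedTiers=["secondary_tier"]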

Also, are you using zookeeper or http for druid.serverview.type? Not sure what the difference will be across clusters, but if you are using ZK and there is a different ZK for each cluster, that might explain how this is working.

AR

May 2, 2023, 12:44:54 PM
to Druid User
Hi Sergio,
Each cluster has its own ZK ensemble. 

druid.serverview.type --> not modified. It is using the default value of "batch".

"I'm wondering whether the brokers are submitting queries to historicals across both clusters."
 - We have a load test running on both clusters currently. Metrics and request logging are enabled on both clusters. I checked the broker logs of the secondary cluster. No requests have been routed to the historicals of the primary cluster. As I mentioned previously, the only historicals detected by the secondary broker are the secondary historical nodes. I tried to check the same in the historical logs, but they don't seem to log the hostname the query originated from. I will have to extract a sample list of query IDs from the primary broker and scan for them in the historical nodes' logs.
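
For context, "request logging" here means the standard file-based request logger configured on the query services, roughly like this sketch (the directory is just an example):

druid.request.logging.type=file
druid.request.logging.dir=/var/druid/request-logs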

"I'm not sure how the brokers distinguish among the segments that belong to the "local" and "external" clusters."
- Where would the conflict arise from, since the only historicals the broker is aware of seem to be the ones that belong to the same cluster?

Is there anything else that can be checked to verify that each cluster is operating independently of the other?

Where is the "sys.server_segments" table (and other tables from the "sys" schema) persisted? This could be one place where a conflict could arise due to both clusters overwriting each other's mappings. But the contents of this table seem to be mutually exclusive between the clusters. I verified again that there are no secondary historicals when I query the table in the primary cluster, and vice versa.

Thanks,
AR.

John Kowtko

May 3, 2023, 9:08:24 AM
to Druid User
Hi AR, I believe the "sys" tables are virtual and cached on the Broker.

The information is garnered from the union of metadata of the individual segments ... to my knowledge there is no hard "data dictionary" in Druid.

AR

May 3, 2023, 11:52:36 PM
to Druid User
Hi John,
The "sys" and "information" schema tables may be cached on the broker but the broker cannot be the source of these tables. Since brokers can be scaled horizontally dynamically, where would a new broker get the "sys" and "information" schema details from? Also, the cache would be transient and last only up to the lifetime of the broker. If there is only one broker and it is restarted then the cache has to be refreshed again.

From my understanding, the coordinator would be the source of this info since it controls the segment loading/unloading across the cluster. Is this part of the cluster state? And where is this cluster state persisted? Since the coordinator process can also be restarted, it needs to load the latest cluster state from somewhere.

druid.coordinator.startDelay  -->  "The operation of the Coordinator works on the assumption that it has an up-to-date view of the state of the world when it runs, the current ZK interaction code, however, is written in a way that doesn’t allow the Coordinator to know for a fact that it’s done loading the current state of the world. This delay is a hack to give it enough time to believe that it has all the data."

Looking at the above config property description, is the cluster state stored in ZK?

Or does the coordinator rely on the historicals to relay the details of which segments they have loaded?

Thanks,
AR.

John Kowtko

May 4, 2023, 9:06:04 AM
to druid...@googlegroups.com
Hi AR,   I don't know the details well here (hopefully someone else will chime in) ... 

As far as I know the hard datasource schema info rests only within each segment ... therefore if you ingest new data into an existing datasource and new columns appear in the ingested data, the datasource then "magically" has these new columns in it. For that to work, the metadata must be fluid enough and the info must be able to propagate quickly.

So there is communication between Coordinator, Broker and Historicals (and MM for real-time segments) ... exactly how that is orchestrated I am not sure.

--
John Kowtko
Technical Success Manager

Sergio Ferragut

May 4, 2023, 5:08:30 PM
to Druid User
The metadata DB should have a "druid" schema where you will find the backing tables for the "sys" tables and others. 

AR

May 5, 2023, 1:41:43 AM
to Druid User
Hi Sergio,
I already checked the "druid" DB on the MDB. It contains the segment metadata in the "druid_segments" table. But that has only the load spec for each segment.
There is no table which contains the segment distribution details (the backing store for the "sys.server_segments" table), which would be one of the sources of any conflict between the primary & secondary clusters.

I will report back if we see any conflicts b/w the primary and secondary clusters.

Really appreciate both you and John for taking the time to help with your advice, inputs and suggestions.

Thank you,
AR.

John Kowtko

May 5, 2023, 9:37:33 AM
to Druid User
The DRUID_SEGMENTS table has a PAYLOAD field, but it is not stored in cleartext:

ij> describe druid_segments;
COLUMN_NAME         |TYPE_NAME|DEC&|NUM&|COLUM&|COLUMN_DEF|CHAR_OCTE&|IS_NULL&
------------------------------------------------------------------------------
ID                  |VARCHAR  |NULL|NULL|255   |NULL      |510       |NO      
DATASOURCE          |VARCHAR  |NULL|NULL|255   |NULL      |510       |NO      
CREATED_DATE        |VARCHAR  |NULL|NULL|255   |NULL      |510       |NO      
START               |VARCHAR  |NULL|NULL|255   |NULL      |510       |NO      
end                 |VARCHAR  |NULL|NULL|255   |NULL      |510       |NO      
PARTITIONED         |BOOLEAN  |NULL|NULL|1     |NULL      |NULL      |NO      
VERSION             |VARCHAR  |NULL|NULL|255   |NULL      |510       |NO      
USED                |BOOLEAN  |NULL|NULL|1     |NULL      |NULL      |NO      
PAYLOAD             |BLOB     |NULL|NULL|21474&|NULL      |NULL      |NO      

9 rows selected

ij> select payload from druid_segments fetch first 3 rows only;
PAYLOAD                                                                                                                         
--------------------------------------------------------------------------------------------------------------------------------
7b2264617461536f75726365223a2277696b697065646961222c22696e74657276616c223a22323031362d30362d32375430303a30303a30302e3030305a2f3&
7b2264617461536f75726365223a226b74746d2d6e65737465642d76322d323031392d30382d3235222c22696e74657276616c223a22323031392d30382d323&
7b2264617461536f75726365223a226b74746d2d6e6573746564222c22696e74657276616c223a222d3134363133363534332d30392d30385430383a32333a3&

3 rows selected
ij> 

AR

May 5, 2023, 10:55:01 AM
to Druid User
Hi John,
I was referring to the contents of the "payload" column in "druid_segments". It only contains the "version", "dimensions", "metrics", and "loadSpec" for the segment. It doesn't contain any info on where the segment is loaded.

Payload schema:
{
  "dataSource": "",
  "interval": "",
  "version": "",
  "loadSpec": {
    "type": "",
    "bucket": "",
    "key": "",
    "S3Schema": ""
  },
  "dimensions": "",
  "metrics": "",
  "shardSpec": {
    "type": "numbered",
    "partitionNum": 0,
    "partitions": 1
  },
  "binaryVersion": 9,
  "size": 37511,
  "identifier": ""
}

Thanks,
AR.

John Kowtko

May 5, 2023, 11:00:46 AM
to Druid User
Okay, that is SYS.SERVER_SEGMENTS ... it must be only a virtual table then, which I think only resides in the Coordinator and Broker caches. The Coordinator figures out which segments should be hosted by which Historicals, tells the Historicals (through ZK), then passes that info to the Brokers so they know which Historicals to access for each query.

Sergio please confirm!

Thanks.  John

AR

May 8, 2023, 4:31:54 AM
to Druid User
Hi John,
"sys.server_segments" is the table that i was referring to. If this is virtual and only resides in the caches of the coordinator & broker, then that might explain why we have not seen any conflicts b/w the primary and secondary cluster so far. We will keep a watch and I will report back if we see any issues.

Thanks,
AR.

John Kowtko

May 8, 2023, 1:55:45 PM
to Druid User
Okay, someone else may need to confirm ... to my knowledge the Coordinator figures out which historical holds which segments, then maybe pushes that info to ZK, but not into the metadata DB, since there is no hard persistence requirement for the segment-to-historical assignment, and it probably changes constantly as the Coordinator rebalances the load whenever ingestion creates new segments.