Table per Topic and implications with too many tables.


Praveena Manvi

unread,
Sep 29, 2016, 3:20:39 AM9/29/16
to debezium
Hi,

We are evaluating Debezium to drive 
  -  audit requirements, where we want to trace back everything that happened in a transaction; we plan to use GTIDs for this. 
  -  building materialized views by combining multiple tables into one.

It would be great if developers/users of this library could share thoughts on these perceived issues. We have a use case where we need to support ~60,000 tables (600 tables per database × 100 databases), which results in around 60K topics. 

What is the resource (memory, CPU) overhead of creating a Kafka topic vs. a Kafka partition, and is creating tens of thousands of Kafka topics a bad idea? On this, here is one answer from Jay Kreps (creator of Kafka):
"single topic with 100k partitions would be effectively the same as 100k topics with one partition each."

I am thinking of 2 solutions (is it a problem in the first place to create 10K topics?):
a. Use different Kafka clusters, say 600 topics per cluster, though this results in managing too many clusters.
b. Group multiple tables into a single topic to reduce the number of topics.

Please note that we have around a million inserts every day (across all 60K tables). 

In an experiment on my local machine, creating 10K topics increased ZooKeeper's memory usage by about 100 MB; the zk node count was ~47K and there were ~2K watchers, as seen from the JMX tab. I did not see any significant degradation in CPU or GC.

Thanks,
Praveen


Randall Hauch

unread,
Sep 29, 2016, 1:35:14 PM9/29/16
to debezium, Praveena Manvi



On September 29, 2016 at 2:20:41 AM, Praveena Manvi (praveen...@gmail.com) wrote:

Hi,

We are evaluating Debezium to drive 
  -  audit requirements, where we want to trace back everything that happened in a transaction; we plan to use GTIDs for this. 
  -  building materialized views by combining multiple tables into one.

It would be great if developers/users of this library could share thoughts on these perceived issues. We have a use case where we need to support ~60,000 tables (600 tables per database × 100 databases), which results in around 60K topics. 

What is the resource (memory, CPU) overhead of creating a Kafka topic vs. a Kafka partition, and is creating tens of thousands of Kafka topics a bad idea? On this, here is one answer from Jay Kreps (creator of Kafka):
"single topic with 100k partitions would be effectively the same as 100k topics with one partition each."


Right, one of the biggest factors with anticipating Kafka cluster performance is the number of topic partitions (i.e., the total number of partitions for all of the topics). It’s hard to say whether 60K topic partitions is that big, since it depends so much on the hardware resources where you run the Kafka cluster. Of course, you’re going to want quite a few Kafka brokers in your cluster so that each one is serving a smaller number of topics. IIUC, Kafka scales out very well to larger numbers of brokers in a cluster, but the question is whether you’d need more broker machines than you want to buy/run/manage.

I am thinking of 2 solutions (is it a problem in the first place to create 10K topics?):
a. Use different Kafka clusters, say 600 topics per cluster, though this results in managing too many clusters.

Yes, I think it makes sense to spread the topics for your databases across multiple Kafka clusters **if/when** you find that a single larger Kafka cluster is not able to support the total number of topic partitions. Personally, I’d try pretty hard to use a single Kafka cluster and to scale that to get the required throughput and performance.


b. Group multiple tables into a single topic to reduce the number of topics.

Grouping multiple tables into a single topic is a poor idea for a number of reasons. First, you’ll have the same volume of data, so I’m not sure it really buys you much in terms of Kafka performance. Second, it convolutes the meaning of a single stream, making it difficult for consumers to deal with the heterogeneity of a single topic. Think about a consumer that is building a materialized view: it has to check whether each event comes from one of the source tables for the view. Then compare that to a consumer that can simply process all messages from one or more topics corresponding to the view’s source tables. The latter is a lot simpler.

You’ve not mentioned sharded tables, but we are working on enhancing the MySQL connector to work with sharded tables so that all events from the different shards of a table go to a single topic. See https://issues.jboss.org/browse/DBZ-121 for details. If you’re using sharding, this may help reduce the number of topics somewhat, and feedback/input on the issue would be greatly appreciated. Part of the solution of DBZ-121 might allow you to provide a custom “topic selector” that allows you to control the topic to which the events from each table are sent.
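To make the “custom topic selector” idea concrete, here is a minimal sketch of the routing logic in Python. The function name, the shard-suffix naming convention, and the prefix scheme are all hypothetical illustrations, not Debezium’s actual (Java) interface; the point is only that all shards of one logical table can be collapsed onto a single topic name.

```python
# Hypothetical sketch: route change events from all shards of a logical
# table to one topic. The "_shard_N" suffix convention is an assumption
# for illustration, not something Debezium defines.
import re

def select_topic(prefix, database, table):
    # Collapse shard suffixes like "inventory_shard_3" -> "inventory"
    # so every shard of a logical database shares one topic per table.
    logical_db = re.sub(r"_shard_\d+$", "", database)
    return "{}.{}.{}".format(prefix, logical_db, table)

# Both shards of the same logical table map to the same topic:
print(select_topic("dbserver1", "inventory_shard_1", "orders"))
print(select_topic("dbserver1", "inventory_shard_2", "orders"))
# both print: dbserver1.inventory.orders
```

With 100 databases that are really shards of a much smaller set of logical databases, a selector like this is what would bring the topic count down.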

If you really want to merge the events from multiple disparate tables into a single topic because *all* of your consumers are expecting this, then you could always write a consumer app (perhaps using Kafka Streams) that does this and outputs other “combined” streams. The streams output by Debezium could then be “temporary” with a relatively short but fixed retention policy (long enough to ensure your merging processor has time to process each event, and to account for any downtime).
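The merging-processor idea can be sketched without any Kafka dependency as follows; a real implementation would use Kafka Streams or a plain consumer/producer pair, and would interleave events by offset or timestamp rather than topic by topic. All names here are illustrative.

```python
# Kafka-free sketch of a "merging processor": read events from several
# per-table streams and emit one combined stream, tagging each event with
# its source topic so downstream consumers can still tell tables apart.
def merge_streams(streams):
    # streams: {topic_name: [event_dict, ...]}
    combined = []
    for topic, events in streams.items():
        for event in events:
            tagged = dict(event)
            tagged["source_topic"] = topic  # preserve provenance
            combined.append(tagged)
    return combined

per_table = {
    "db1.orders":    [{"op": "c", "id": 1}],
    "db1.customers": [{"op": "u", "id": 7}],
}
for evt in merge_streams(per_table):
    print(evt)
```

Tagging each event with its source is what keeps the combined topic consumable: without it, consumers of the merged stream would have no way to recover which table an event came from.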



Please note that we have around a million inserts every day (across all 60K tables). 

In an experiment on my local machine, creating 10K topics increased ZooKeeper's memory usage by about 100 MB; the zk node count was ~47K and there were ~2K watchers, as seen from the JMX tab. I did not see any significant degradation in CPU or GC.

I think the bigger question about the cluster is its throughput and resiliency when one broker fails or is taken down. For example, with 1K topics and 10 brokers, each broker will (roughly) be leader for about 100 topics and follower for 200 more (assuming 3 replicas). When one such broker is down, it will impact performance and the ability to recover from (additional) failures. 
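The arithmetic behind those per-broker numbers can be checked with a quick back-of-the-envelope calculation, assuming one partition per topic and even distribution of replicas:

```python
# Back-of-the-envelope replica distribution: with T single-partition
# topics, B brokers, and replication factor R, each broker leads ~T/B
# partitions and follows the remaining (R-1) replicas spread over B brokers.
def replicas_per_broker(topics, brokers, replication):
    leaders = topics // brokers                        # one leader per partition
    followers = topics * (replication - 1) // brokers  # non-leader replicas
    return leaders, followers

leaders, followers = replicas_per_broker(1000, 10, 3)
print(leaders, followers)  # 100 200
```

At 60K topics the same formula shows why broker count matters: losing one broker forces leadership elections for all the partitions it led, and the surviving brokers absorb that load.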

If you have questions about sizing Kafka, I suggest contacting the Apache Kafka users mailing list or, if you want more involved consulting, contact Confluent.

Hope this was at least somewhat helpful. Best regards,

Randall
