Ksqldb Vs Postgresql

0 views

Skip to first unread message

Terina

unread,

Aug 4, 2024, 10:50:37 PM8/4/24

to browacdreadez

Aftera day spent experimenting with various configuration options, I've not been able to get the Debezium SQL Server connector to work with ksqlDB embedded Kafka Connect. There appears to be no clear guidance on either the ksqlDB or Debezium websites on how to set this up.

What documentation there is suggests that one just needs to install the relevant Kafka Connect plugin into a specified location and everything should work magically out of the box... but unfortunately I've had no such luck and the error messages I'm receiving aren't giving me any feedback to work with.

My Dockerfile for the ksqldb-sqlserver service above, which basically just takes the default image and then copies the contents of the Debezium SQL Server connector plugin to /usr/share/confluent-hub-components, looks as follows:

The ksqlDB CLI returns 'HTTP ERROR 500 Request failed' and, on inspecting the Docker logs for ksqldb-server, I see the following cryptic message (which I've had to truncate slightly because of StackOverflow character limits):

I'm not 100% sure, but I don't think this means that the SqlServerConnectorConfig class is missing, but rather that there was a fundamental error on initialization (probably something more fundamental than some errors in the connector configuration).

KsqlDB is an event streaming database for building stream processing applications on top of Apache Kafka. In the previous article, ksqlDB was introduced and the benefits, and reasons for the existence of another solution within the Kafka universe to build stream processing applications were outlined. KsqlDB proposes a different approach to Kafka Streams.

This article will deepen the presentation and provide a detailed picture of ksqlDB architecture and concepts necessary for a better understanding. As with the first article, the goal is to provide an understanding of the differences between stream processing solutions within the Kafka ecosystem (Kafka Streams and ksqlDB).

This article is part of a three-part series. It is recommended to read the previous article before reading this one. At the end of this article, you should gain a better understanding of ksqlDB and the use cases and issues it aims to address.

ksqlDB provides an SQL-like interface to data streams, allowing for filtering, aggregations and even joins across data streams. ksqlDB uses Kafka as the storage engine and to work as the compute engine. A solution with ksqlDB can be considered as a simple two-tier architecture with a compute layer and a storage layer, ksqlDB and Kafka. Data is stored in Kafka and the processing of data happens in ksqlDB. Figure 1 displays this simple two-tier architecture. It is important to note that both layers and infrastructure can be elastically scaled independently of the other.

ksqlDB is made of four components divided into two main groups of components: ksqlDB server and ksqlDB clients: a SQL engine, a REST service, a command line interface and a user interface.

The ksqlDB server is responsible for running stream processing applications. When you deploy your ksqlDB application, it runs on ksqlDB Server instances that are independent, fault-tolerant, and scalable.

A ksqlDB server is made up of an SQL engine and a rest API. ksqlDB servers are deployed separately from the Kafka cluster (usually on separate machines/containers). The following picture gives an overview of a ksqlDB server.

A query submitted to ksqlDB is compiled into a Kafka Streams application that is executed in the background, following the same execution model of Kafka Streams applications. A Kafka Streams application instance is run in a Java Virtual Machine (JVM) on a machine and multiple instances of an application are executed either on the same machine or spread across multiple machines. Each application instance runs in their own JVM process and is mapped to one or more partition, based on how many partitions the input topic has. This is depicted in Figure 4, where different instances of a Kafka Streams application MyApp process different partitions of a topic, with each instance running in a separate JVM process.

ksqlDB follows a similar approach for parallelism and workload distribution among ksqlDB server instances. A submitted query can be executed on multiple instances (ksqlDB servers), and each instance will process a portion of the input data from the input topic(s) as well as generate portions of the output data to output topic(s).

Workloads created by a specific query set can be distributed across multiple ksqlDB servers using the ksql.service.id configuration. The service id configuration is used to define if a ksqlDB server instance belongs to a specific ksqlDB cluster: If multiple ksqlDB servers connect to the same Kafka cluster with the same ksql.service.id they form a ksqlDB cluster and share the workload. Otherwise, if multiple ksqlDB servers connect to the same Kafka cluster but don't share the same ksql.service.id, then each ksqlDB server gets a different command topic and forms separate ksqlDB clusters based on the ksql.service.id configuration. This is shown in Figure 5, where 3 ksqlDB servers share different ksql.service.ids, building 2 different ksqlDB clusters. It is important here to notice that a ksqlDB cluster here refers to a group of cooperating ksqlDB servers, processing the same workload. In this example, the assignment of more ksqlDB servers to the "finance" cluster can be useful to enhance the processing of workloads.

Since ksqlDB servers with the same service ID are members of the same consumer group, Kafka automatically handles the reassignment/distribution of workload as new ksqlDB servers are added or removed (removal could be manual or automatic, e.g., due to a system fault).

Running additional instances with the same service id will grant your application additional processing capacity. Scaling down, removing ksqlDB servers can be done at any time. In all cases, ksqlDB server instances communicate with Kafka clusters so that changes can be conducted without a necessary restarting of the application.

The SQL engine is part of the ksqlDB server responsible for parsing SQL statements, converting the statements into Kafka Streams topologies, and ultimately running the Kafka Streams applications and queries. A visualization of this process is shown in Figure 6.

At the beginning of the process, the desired application logic is translated into SQL statements which are build and run by the engine on available ksqlDB servers. Each ksqlDB Server instance runs a ksqlDB engine. The ksqlDB engine is implemented in the KsqlEngine.java class.

ksqlDB includes a REST interface that allows clients to interact with the SQL engine. It enables communication from the ksqlDB command line interface (CLI), ksqlDB UI, Confluent Control Center, or from any other REST client. Over the REST interface, clients can submit different types of queries, like DML statements, and execute different tasks like checking the cluster status/health of the cluster and much more. By default, the rest interface listens on port 8088 and communicates over HTTP, but the endpoint's port can be changed using the listener's config. Communication over HTTPs can be enabled using the ssl configs.

The ksqlDB CLI (ksqlDB command line interface) is a command-line application that allows interactions with a running ksqlDB server. Through a console with interactive sessions, users can submit queries, inspect topics, adjust ksqlDB configurations, thus experimenting with ksqlDB and developing streaming applications. The ksqlDB CLI is designed to be familiar to users of MySQL, Postgres, and similar applications. The ksqlDB CLI is implemented in the io.confluent.ksql.cli package.

It is distributed as a Docker image (confluentinc/ksqldb-cli) and is part of various Confluent Platform distributions (fully managed on Confluent Cloud, or through a self-managed deployment).

The Confluent Platform also includes a UI for interacting with ksqlDB. This UI is a commercial feature, only available for the commercially licensed version of Confluent Platform and Confluent Cloud. More than a visualization of queries and submitted queries, it also allows additional operations like the visualization of the data flow of data, the creation of streams and tables using web forms.

In this mode, the REST interface is used by the various clients (favorite REST clients, ksqlDB CLI or Confluent Control) to connect to. Clients can submit new queries anytime, on the fly, interacting with ksqlDB servers through the REST interface (see Figure 7). For example, A user can explore the existing topics in the Kafka cluster, write queries, and inspect their results in real time. This mode also allows any number of server nodes to be dynamically started, and persistent queries to be added or removed without restarting the servers.

The sharing of statements with servers in the cluster is only possible with the help of an internal topic called the command topic. All queries submitted to the SQL engine (via the REST API) are written to this topic. This topic auto created and managed by ksqlDB, stores besides SQL statements, some metadata to ensures that statements are built compatibly across ksqlDB restarts and upgrades. KsqlDB servers in the same cluster (sharing the same ksql.service.id) are able to share statements being executed, and the workloads associated with them.

The headless mode, also called application mode, doesn't allow clients interactively submitting queries against the ksqlDB cluster. All queries are written in a SQL file, and ksqlDB started with this file as an argument: ksqlDB server instances will use this file and each server instance will read the given SQL file, compile the ksqlDB statements into Kafka Streams applications and start execution of the generated applications (see Figure 8).