Kafka connect - Push based connector support


Andrew Stevenson

Jan 24, 2016, 2:26:48 PM
to Confluent Platform
Hi Guys,

I'm interested in standardising around Kafka Connect for all ingress and egress from Kafka; however, the source connectors seem to be polling-based. I would like to support push-based sources as well, for example the Twitter Stream API. Is this possible with the current framework?

Additionally, I have data sources that send notifications via HTTP when data is available. I would like to write a connector that performs the HTTP GET based on information in the notification, so I'm wondering if this is possible, i.e. how could I get the notification into the connector to trigger a reconfigure?

This could also be a pattern for a JDBC import, since some systems I've worked against only want an extract to happen when they say so (for performance reasons). In this scenario they send a notification containing the SQL and connection string to an endpoint, which was then passed on to Spark to perform the import. I like this because the system performing the import (Spark) didn't need any business knowledge about the source to identify new events. It just executed what it was told.

If I can sort out these data sources, or a pattern for them, it allows the streaming engine (Flink or Samza) to focus solely on reading from and writing to Kafka.

Thanks

Andrew

Ewen Cheslack-Postava

Jan 26, 2016, 2:22:46 PM
to Confluent Platform
On Sun, Jan 24, 2016 at 11:26 AM, Andrew Stevenson <astev...@outlook.com> wrote:
Hi Guys,

I'm interested in standardising around Kafka Connect for all ingress and egress from Kafka; however, the source connectors seem to be polling-based. I would like to support push-based sources as well, for example the Twitter Stream API. Is this possible with the current framework?

Kafka Connect sources are pull-based for a few reasons. First, although connectors should generally run continuously, making them pull-based means that the connector/Kafka Connect decides when data is actually pulled, which allows for things like pausing connectors without losing data, brief periods of unavailability as connectors are moved, etc. Second, in distributed mode the tasks that pull data may need to be rebalanced across workers, which means they won't have a consistent location or address. While in standalone mode you could guarantee a fixed network endpoint to work with (and point other services at), this doesn't work in distributed mode where tasks can be moving around between workers.

Regarding your specific example of the Twitter API: while this is push-based in the sense that new events are pushed over a long-running HTTP request, you can implement it in a source connector and effectively pull from the API just by limiting how fast you read from the connection. When poll() is invoked, you grab however much data you want off the connection and return it. You'll have to set up a new connection if tasks are moved around, and as far as I know there's no concept of offsets in the Twitter API, but you should be able to implement a connector that pulls tweet data.
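[Editor's note: to make the pattern concrete, here is a minimal, hypothetical sketch of the "limit how fast you read in poll()" idea. The class and the Iterator stand-in for the long-lived HTTP connection are illustrative, not the real Twitter or Connect APIs.]

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of turning a push-style stream into a pull:
// the Iterator stands in for the long-running HTTP connection, and
// poll() reads at most maxBatch events per call, so the caller
// (the Connect framework) controls the effective pull rate.
public class ThrottledStreamPoller {
    private final Iterator<String> stream;
    private final int maxBatch;

    public ThrottledStreamPoller(Iterator<String> stream, int maxBatch) {
        this.stream = stream;
        this.maxBatch = maxBatch;
    }

    // Analogous to SourceTask.poll(): grab what is available, up to a cap.
    public List<String> poll() {
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxBatch && stream.hasNext()) {
            batch.add(stream.next());
        }
        return batch;
    }

    public static void main(String[] args) {
        Iterator<String> fakeStream = List.of("t1", "t2", "t3").iterator();
        ThrottledStreamPoller poller = new ThrottledStreamPoller(fakeStream, 2);
        System.out.println(poller.poll()); // [t1, t2]
        System.out.println(poller.poll()); // [t3]
    }
}
```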
 

Additionally, I have data sources that send notifications via HTTP when data is available. I would like to write a connector that performs the HTTP GET based on information in the notification, so I'm wondering if this is possible, i.e. how could I get the notification into the connector to trigger a reconfigure?

I don't think this would be easy to implement. The key problem is having a fixed URL to send those HTTP notifications to. You might be able to set up another service that buffers those notifications, which the connector could then retrieve and process. Or you might be able to set up some load balancer magic where tasks register themselves with the load balancer to handle routing to the connector tasks. I haven't tried anything like this yet, though, so I'm not sure how well it would actually work.
 

This could also be a pattern for a JDBC import, since some systems I've worked against only want an extract to happen when they say so (for performance reasons). In this scenario they send a notification containing the SQL and connection string to an endpoint, which was then passed on to Spark to perform the import. I like this because the system performing the import (Spark) didn't need any business knowledge about the source to identify new events. It just executed what it was told.

If this was passed to Spark, doesn't that mean there was some system buffering it along the way before the Spark job was setup to process the request?
 

If I can sort out these data sources, or a pattern for them, it allows the streaming engine (Flink or Samza) to focus solely on reading from and writing to Kafka.

Are you using these systems with custom connectors like this today? I'd be curious what the current approach is.
 
-Ewen


Thanks

Andrew

--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platf...@googlegroups.com.
To post to this group, send email to confluent...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/187f8dae-fadb-4946-8067-a937789a422c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Thanks,
Ewen

Andrew Stevenson

Jan 26, 2016, 3:04:50 PM
to confluent...@googlegroups.com
Thanks Ewen,

I’ll look at implementing the Twitter stream in a connector.

Regarding the Spark JDBC ingest, we do have a buffer… Kafka! I have Flume agents that accept the notification, convert it to my internal Avro schema, and dump it into a Kafka channel. On the other end we have a custom dispatcher/router process that serves the results to long-running Spark jobs (again via Kafka) to do the ingress. We scale the number of Spark drivers up and down based on the queue depth in Kafka. If I could collapse this into Kafka Connect it would be great!

I’m wondering if I could use the same approach with Connect and feed the notifications in via Kafka or JMS? My goal is to reduce the number of components in the ingestion layer.

Regards

Andrew
From: <confluent...@googlegroups.com> on behalf of Ewen Cheslack-Postava <ew...@confluent.io>
Reply-To: <confluent...@googlegroups.com>
Date: Tuesday 26 January 2016 at 20:22
To: Confluent Platform <confluent...@googlegroups.com>
Subject: Re: Kafka connect - Push based connector support

Ewen Cheslack-Postava

Jan 26, 2016, 11:45:15 PM
to Confluent Platform
On Tue, Jan 26, 2016 at 12:04 PM, Andrew Stevenson <astev...@outlook.com> wrote:
Thanks Ewen,

I’ll look at implementing the Twitter stream in a connector.

Regarding the Spark JDBC ingest, we do have a buffer… Kafka! I have Flume agents that accept the notification, convert it to my internal Avro schema, and dump it into a Kafka channel. On the other end we have a custom dispatcher/router process that serves the results to long-running Spark jobs (again via Kafka) to do the ingress. We scale the number of Spark drivers up and down based on the queue depth in Kafka. If I could collapse this into Kafka Connect it would be great!

Ok, that makes sense. I was going to suggest Kafka as the buffer but that seemed a bit redundant :)
 

I’m wondering if I could use the same approach with Connect and feed the notifications in via Kafka or JMS? My goal is to reduce the number of components in the ingestion layer.

Absolutely! You can always run Connect in standalone mode, which makes it behave much like a Flume agent. The obvious drawback is that you don't get the fault-tolerance benefits of distributed mode, and there's more configuration, monitoring, and management overhead. But if you really just need a couple of fixed endpoints you can push data to and have them publish to Kafka, it's a great solution. It also helps to reduce the number of frameworks/libraries/dependencies you're working with!
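[Editor's note: as a rough illustration of standalone mode, a worker takes a worker properties file plus one or more connector property files. File names and converter choices below are assumptions for the sketch, not a recommended setup.]

```properties
# Hypothetical standalone worker config (e.g. connect-standalone.properties).
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Standalone mode tracks source offsets in a local file rather than a Kafka topic.
offset.storage.file.filename=/tmp/connect.offsets
```

It would then be launched with something like `bin/connect-standalone.sh connect-standalone.properties my-connector.properties`, giving a single fixed process much like a Flume agent.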

-Ewen





--
Thanks,
Ewen

Unmesh Joshi

Apr 30, 2016, 2:39:39 AM
to Confluent Platform
I wanted to build a prototype with InfoSphere CDC (https://github.com/IBMStreams/streamsx.cdc/blob/master/CDCStreamsUserExit/) to push messages to Kafka. Because it is push-based, it seems difficult to map it to a connector.
I guess this will be a common issue for all CDC integrations? A related question: I was playing a bit with the bottled-water plugin. Is there a plan to refactor bottled-water as a Kafka connector?

Regards

Unmesh

Andrew Stevenson

Apr 30, 2016, 4:25:47 AM
to confluent...@googlegroups.com
Hi Unmesh,

You can buffer the inbound events in a queue, then drain it in poll(). We do this for Twitter, Bloomberg, and a Cassandra source. Let me know if I can help.
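[Editor's note: a minimal sketch of that buffer-and-drain pattern. The class and method names are illustrative, not the actual datamountaineer code: a push callback offers events into a bounded queue, and poll() drains them in batches.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative push-to-pull bridge: the source's callback thread
// offers into a bounded queue; the task's poll() thread drains it.
public class BufferedSource {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(10_000);

    // Called by the push-based client (e.g. a stream listener callback).
    public boolean onEvent(String event) {
        // offer() drops when the queue is full; block or apply
        // back-pressure here instead if loss is unacceptable.
        return buffer.offer(event);
    }

    // Analogous to SourceTask.poll(): drain up to 500 buffered events.
    public List<String> poll() {
        List<String> batch = new ArrayList<>();
        buffer.drainTo(batch, 500);
        return batch;
    }

    public static void main(String[] args) {
        BufferedSource source = new BufferedSource();
        source.onEvent("event-1");
        source.onEvent("event-2");
        System.out.println(source.poll()); // [event-1, event-2]
        System.out.println(source.poll()); // []
    }
}
```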

https://github.com/datamountaineer

Regards

Andrew

From: Unmesh Joshi
Sent: ‎30/‎04/‎2016 08:39
To: Confluent Platform

Unmesh Joshi

Apr 30, 2016, 5:00:34 AM
to confluent...@googlegroups.com
Yeah, buffering is the solution to connect these two models; I was trying to see if I could avoid it somehow.


Regards

Unmesh