Moving data from S3 to Kafka


Satish Varma Dandu

Jun 23, 2016, 4:50:49 PM6/23/16
to Confluent Platform
Our use case involves moving data from S3 to Kafka. We are looking into Kafka Connect. Does Kafka Connect support S3 as a source? What other options are available to stream data from S3 to Kafka? Any help would be great.

Regards,
-Satish

Alex Loddengaard

Jun 24, 2016, 1:59:25 PM6/24/16
to confluent...@googlegroups.com
Hi Satish,

Kafka Connect can certainly support S3 as a source. However, I don't know of an existing S3 source connector. It's possible one hasn't been built yet. Perhaps someone else knows of one?

A workaround for now could be to build your own S3 source connector (and hopefully open source it!), or build a custom producer. The tricky bit will be tracking which S3 objects (or partial objects) have been produced into Kafka.
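A minimal sketch of that bookkeeping, just to illustrate the idea (the function and field names here are hypothetical, not part of any existing connector): track each object's key together with its ETag, so a rewritten object is picked up again.

```python
def pending_objects(listed, produced):
    """Return the S3 objects that still need to be produced into Kafka.

    listed:   iterable of (key, etag) pairs from a bucket listing
    produced: set of (key, etag) pairs already committed to Kafka

    A changed ETag means the object was rewritten since it was last
    produced, so it shows up as pending again.
    """
    return [(key, etag) for key, etag in listed if (key, etag) not in produced]

listing = [("logs/a.json", "e1"), ("logs/b.json", "e2"), ("logs/c.json", "e3")]
done = {("logs/a.json", "e1"), ("logs/b.json", "old")}
# b.json is pending because its ETag changed; c.json was never produced
print(pending_objects(listing, done))
```

In a real connector this produced-set would live in Kafka Connect's offset storage rather than in memory, and you would also need a scheme for partially-read objects (e.g. byte offsets within an object).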

Hope this helps.

Alex

--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platf...@googlegroups.com.
To post to this group, send email to confluent...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/1d73770c-460a-4040-b60b-c2850a7bdb5d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Satish Varma Dandu

Jun 30, 2016, 6:15:28 PM6/30/16
to Confluent Platform
Thanks Alex for the reply. 

>> Kafka Connect can certainly support S3 as a source.

In distributed mode, how does Kafka Connect load balance? For example: if we need to load 5 files and we have 5 brokers, can each broker get an S3 object? Also, if one of the brokers goes down, does another broker take over and complete the job?

Thanks,
-Satish




David Tucker

Jul 5, 2016, 8:16:02 PM7/5/16
to confluent...@googlegroups.com
Satish,

Check out the documentation on Kafka Connect (http://docs.confluent.io/3.0.0/connect/design.html#architecture).  The Connectors are completely separate from the Kafka Brokers themselves.

The number of brokers does not affect the parallelism of the Connectors; that is controlled by the Connector's own configuration and the capacity of the Connect Workers. It would be fairly simple to define a Connector such that each S3 object is handled by a separate Connector Task (effectively a thread within the Worker processes).
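To make the object-to-task assignment concrete, here is a sketch of the kind of grouping a Connector might do when Connect asks it for task configurations (the helper below is illustrative only; in Java you would typically use something like Connect's `ConnectorUtils.groupPartitions` for this):

```python
def group_objects(object_keys, max_tasks):
    """Split S3 object keys into at most max_tasks groups,
    one group per Connector Task (round-robin assignment)."""
    groups = [[] for _ in range(min(max_tasks, len(object_keys)))]
    for i, key in enumerate(object_keys):
        groups[i % len(groups)].append(key)
    return groups

keys = ["s3://bucket/f1", "s3://bucket/f2", "s3://bucket/f3",
        "s3://bucket/f4", "s3://bucket/f5"]
print(group_objects(keys, 5))  # five tasks, one object each
print(group_objects(keys, 2))  # two tasks: 3 objects and 2 objects
```

So in the 5-files example, setting `tasks.max=5` would let each of the five objects be read by its own Task, regardless of how many brokers the cluster has.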

The Workers will support failover.   Connector Tasks from a failed Worker node will be redistributed to the other Worker nodes, picking up right where they left off.

It's probably worth taking a look at some other Source connectors. For example, the JDBC Source connector supports the bulk upload of a database table; that would be a good model for the transfer of a complete S3 object.
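For illustration only, a bulk-style S3 source configured in that spirit might look something like the fragment below. No such connector ships with the platform, so the class name and every `s3.*` property here are hypothetical; only `name`, `connector.class`, `tasks.max`, and `topic` follow standard Connect conventions.

```properties
name=s3-bulk-source
connector.class=com.example.S3SourceConnector   # hypothetical connector class
tasks.max=5                                     # up to 5 parallel Tasks
s3.bucket=my-bucket                             # hypothetical property
s3.prefix=logs/                                 # hypothetical property
topic=s3-ingest
```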

Regards,
   David

NOTE: It's worth paying particular attention to the consistency semantics of S3 if you suspect the Kafka connector will be reading S3 objects while they are being updated by some external process. You don't want to read half the data from "the first version" of an object and the remainder from "a later version".





Pradeep Sadashivamurthy

Jul 9, 2017, 5:24:54 PM7/9/17
to Confluent Platform