Thanks Ewen. I had figured out some of your points myself in the meantime while looking into the source code of the connectors, but I am still at an early stage of understanding all the side effects.
If I am not mistaken, the most important problem is the once-and-only-once (exactly-once) semantics, which are achieved using the write-ahead log, and that is where all the S3 problems play a big role. While I understand the importance of guaranteed delivery, I would argue that it is less important here, especially when the target S3 file system does not provide such guarantees either.
To be very blunt: what is more desirable? A guarantee that the data lands in S3 exactly once, at the expense of recent data being missing from the Kafka queue for a longer time? Or a few duplicates in S3 once in a while? I would prefer having all data in S3 rather than missing recent data for an unknown amount of time. Hence I would make the WAL optional in the S3 connector implementation, and probably in the HDFS connector as well, assuming it has similar negative side effects there.
Imagine you have a sensor producing fault codes. You do not want to lose any records, and you want to see the error codes rather quickly (rotate.interval.ms as the maximum latency). But if you get the same row twice in the target file, with the same error code and even the same sensor timestamp, you can handle that at query time. If that were a problem, a database with transactional consistency would be better suited anyhow.
Regarding the Parquet part of the question, I understand too little at the moment. I would just hate to have to use Spark Streaming for such a simple task: read Avro from Kafka and dump it to S3 in Parquet format.
-Werner
On Sunday, May 14, 2017 at 00:30:57 UTC+2, Ewen Cheslack-Postava wrote:
Despite the fact that HDFS provides an S3 filesystem implementation, it doesn't (and can't) truly provide the same semantics as a filesystem -- S3 is an eventually consistent system that does not provide the guarantees you'd expect from a regular filesystem.
Because of this, the designs of the two connectors for getting the semantics many people want (i.e. exactly-once) are different. In the case of HDFS, we can append to files, rely on read-after-write semantics, rename files efficiently, and list files in the filesystem efficiently. S3 offers none of these. This means the HDFS connector's approach of using temp files, a WAL file for file commits, moving files to commit them, and using file listings to recover offsets does not work for S3.
There was at least one attempt to adapt the HDFS connector for S3, but it required adding a separate ACID data store for the WAL file, and even then, recovery after rebalancing or crashing grows increasingly expensive over time and becomes impractical with even a relatively small number of files because of the performance of S3 LIST operations.
Regarding Parquet support, unfortunately the Parquet library makes a lot of assumptions about working with HDFS. We think it is possible to adapt it for the S3 connector, but we couldn't include it in the first version because it's not as simple as using the Parquet library to generate the file.
For rotate.interval.ms: because of S3's eventual consistency, we require that partitioners and commit triggers are deterministic to get exactly-once delivery (which generally means purely based on the data). There's some work to get time-based triggers using the record timestamps into the next release of the S3 connector.
-Ewen
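For reference, that time-based work did ship in later S3 sink connector releases, which can rotate files based on record timestamps while keeping the commit trigger deterministic. A minimal sketch of the relevant properties, assuming such a release; the interval and duration values below are placeholders:

  partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
  # Deterministic: rotation depends only on record timestamps, so reprocessing
  # after a crash or rebalance regenerates the same files.
  timestamp.extractor=Record
  partition.duration.ms=3600000
  path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
  locale=en
  timezone=UTC
  # Placeholder: close and commit a file once record timestamps advance by 10 minutes.
  rotate.interval.ms=600000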
On Fri, May 5, 2017 at 4:54 AM, Werner Daehn <werner...@gmail.com> wrote:
I struggle to understand why there are two Kafka connectors, the HDFS one and the S3 one. My requirements are:
- write into S3
- write in Parquet format
- support rotate.interval.ms (and flush.size)
Per my investigation, the S3 connector does not support the latter two, and the HDFS connector does not support the first. Looking at the source code, the two have a lot in common. So if anything, there should be one connector as the foundation and then multiple implementations with the various storage classes for HDFS, S3, Google Cloud, etc.
Can somebody shed some light on this, please, and tell me where I am wrong? And how can I achieve my goal of loading S3 with Parquet files and rotate.interval.ms?
Thanks in advance
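For reference, later releases of the Confluent S3 sink connector added both Parquet output and rotate.interval.ms, so this setup became possible without Spark. A minimal sketch, assuming such a release and Avro records with schemas (e.g. via Schema Registry); the name, topic, bucket, region, and sizing values are placeholders:

  name=s3-parquet-sink
  connector.class=io.confluent.connect.s3.S3SinkConnector
  topics=sensor-events
  s3.bucket.name=my-sensor-bucket
  s3.region=eu-central-1
  storage.class=io.confluent.connect.s3.storage.S3Storage
  # Parquet output requires records with schemas (e.g. Avro via Schema Registry).
  format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
  # Commit a file after 10,000 records or once record timestamps advance by one
  # minute, whichever comes first.
  flush.size=10000
  rotate.interval.ms=60000
  partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
  timestamp.extractor=Record
  partition.duration.ms=3600000
  path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
  locale=en
  timezone=UTC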