We are in the middle of setting up a tracking pipeline that sends various analytics data through Kafka. During the original design phase we didn't take Gobblin into account at all (of course, it didn't exist yet); we were planning on just writing our own Python-based Kafka consumer to pull events out of the Kafka topics, buffer them, and then dump them onto S3 or HDFS.
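For context, what we had in mind is nothing fancy, basically a consume/buffer/flush loop like the sketch below (in Java here to match the Gobblin sketch further down, even though our plan was Python; the broker list, topic name, batch size, and flushToS3 helper are all placeholders):

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TrackingEventSink {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker list
        props.put("group.id", "tracking-event-sink");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        // Commit offsets manually, only after a successful flush, so a crash
        // between flushes re-reads events rather than dropping them.
        props.put("enable.auto.commit", "false");

        List<String> buffer = new ArrayList<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("tracking-events"));  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    buffer.add(record.value());
                }
                if (buffer.size() >= 10_000) {  // placeholder batch size
                    flushToS3(buffer);          // hypothetical helper: write the batch as one file
                    consumer.commitSync();
                    buffer.clear();
                }
            }
        }
    }

    // Placeholder for the S3/HDFS write; the real version would handle
    // serialization, file naming, and retries.
    private static void flushToS3(List<String> events) {
        // ...
    }
}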
However, looking at the docs, it seems like we could build out a KafkaSource, Extractor, etc., and, say, kick off a job every 30 minutes to do the same thing. The upside is that we'd have Gobblin handling both the data from external partners and the data from the tracking pipeline, so we'd probably get benefits from code reuse, scheduling, and so on.

Aside from the short-term timeline hit we'd take from writing code against an unfamiliar ecosystem, are there any architectural downsides in Gobblin that anybody knows of that would make this a bad idea later? I know there is no pre-existing Kafka source, but I assume that's because LinkedIn already has Camus, so there was no need to write one.
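In case it helps frame the question, here's roughly the shape of what I think we'd end up writing, going off the Source and Extractor interfaces as best I can read them from the repo (the method bodies are stubs with comments describing the intent, I may well have API details wrong, and KafkaTrackingSource/KafkaTrackingExtractor are just names I made up). The idea would be one work unit per topic-partition, with each run resuming from the previous run's high watermark:

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import gobblin.configuration.SourceState;
import gobblin.configuration.WorkUnitState;
import gobblin.source.Source;
import gobblin.source.extractor.DataRecordException;
import gobblin.source.extractor.Extractor;
import gobblin.source.workunit.WorkUnit;

public class KafkaTrackingSource implements Source<String, String> {

    @Override
    public List<WorkUnit> getWorkunits(SourceState state) {
        // Real version: for each topic-partition, read the previous run's
        // high watermark out of `state`, fetch the current latest offset
        // from the brokers, and emit one WorkUnit covering that range.
        return Collections.emptyList();
    }

    @Override
    public Extractor<String, String> getExtractor(WorkUnitState state) throws IOException {
        return new KafkaTrackingExtractor(state);
    }

    @Override
    public void shutdown(SourceState state) {
        // Close any broker connections opened in getWorkunits().
    }

    static class KafkaTrackingExtractor implements Extractor<String, String> {

        KafkaTrackingExtractor(WorkUnitState state) {
            // Real version: open a consumer pinned to this work unit's
            // partition and seek to its start offset.
        }

        @Override
        public String getSchema() throws IOException {
            return "";  // schema of the tracking events, e.g. an Avro schema string
        }

        @Override
        public String readRecord(String reuse) throws DataRecordException, IOException {
            // Real version: return the next event in the offset range;
            // returning null tells Gobblin this work unit is exhausted.
            return null;
        }

        @Override
        public long getExpectedRecordCount() {
            return 0;  // end offset minus start offset
        }

        @Override
        public long getHighWatermark() {
            return 0;  // last offset actually read, persisted for the next run
        }

        @Override
        public void close() throws IOException {
            // Close the consumer.
        }
    }
}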
thanks
Eric