Investigating Gobblin's behavior when a Kafka -> S3 ingestion fails

Jorge Montero

May 16, 2016, 1:43:57 PM
to gobblin-users
I am planning to put Gobblin into production as a way to provide more durable access to Kafka topics. As part of that, I have been trying to see how Gobblin behaves when it fails.

My only change from a simple setup was creating a custom WriterPartitioner to retain the original Kafka partition number. I did this because, by default, I was getting one file per partition, but there was no way to know which partition the data had come from, and I really want to be able to tell whether Gobblin is dropping data. I can post the full version if someone is interested.
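
Roughly, the idea is something like this (a simplified sketch rather than my exact code; the constructor arguments are what I believe Gobblin passes when it instantiates partitioners, and the property used to look up the Kafka partition id is just illustrative):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import gobblin.configuration.State;
import gobblin.writer.partitioner.WriterPartitioner;

// Emits the Kafka partition id as the writer partition, so each output file
// lands under a .../<partition>/ subdirectory.
public class KafkaPartitionWriterPartitioner<D> implements WriterPartitioner<D> {

  private static final String PARTITION_FIELD = "PARTITION";

  private static final Schema SCHEMA = SchemaBuilder.record("KafkaPartition")
      .fields().requiredString(PARTITION_FIELD).endRecord();

  private final String partitionId;

  // Gobblin instantiates partitioners reflectively with (State, numBranches, branchId).
  public KafkaPartitionWriterPartitioner(State state, int numBranches, int branchId) {
    // Illustrative property name: whatever key your Kafka extractor uses to
    // record the source partition in the task state.
    this.partitionId = state.getProp("kafka.partition.id", "unknown");
  }

  @Override
  public Schema partitionSchema() {
    return SCHEMA;
  }

  @Override
  public GenericRecord partitionForRecord(D record) {
    GenericRecord partition = new GenericData.Record(SCHEMA);
    partition.put(PARTITION_FIELD, this.partitionId);
    return partition;
  }
}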

I have discovered some pretty strange behavior when I reuse the same base path across runs, though. Imagine I am writing to /day/hour/topic/partition:

The first time I run it in a given hour, everything works fine and it writes to S3 properly: /day/hour/topic/partition/part-something is created.
The second time I run it in the same hour, it doesn't overwrite the file or add another one: it writes to /day/hour/topic/partition/partition/part-something! So it saves the data, but to a place I would not have expected.
The third time, it fails with a stack trace like this:

16/05/16 17:16:29 ERROR runtime.AbstractJobLauncher: Failed to commit dataset state for dataset  of job <jobname>
java.io.IOException: java.io.IOException: Target /day/hour/topic/partition/partition is a directory
at gobblin.util.ParallelRunner.close(ParallelRunner.java:291)
at com.google.common.io.Closer.close(Closer.java:214)
at gobblin.publisher.BaseDataPublisher.close(BaseDataPublisher.java:122)
at com.google.common.io.Closer.close(Closer.java:214)
at gobblin.runtime.JobContext.commit(JobContext.java:390)
at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:274)
at gobblin.runtime.mapreduce.CliMRJobLauncher.run(CliMRJobLauncher.java:60)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.stripe.kafkatools.gobblin.AirflowMain$.main(AirflowMain.scala:51)
at com.stripe.kafkatools.gobblin.AirflowMain.main(AirflowMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Target /day/hour/topic/partition/partition is a directory
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:503)
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:505)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:351)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
at gobblin.util.HadoopUtils.movePath(HadoopUtils.java:155)
at gobblin.util.ParallelRunner$6.call(ParallelRunner.java:265)
at gobblin.util.ParallelRunner$6.call(ParallelRunner.java:258)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

And not only does it fail, but the Kafka consumer has moved forward, so the offsets that the failed run was ingesting will be lost unless I do manual recovery.
I can get similar errors, where data is ingested and then thrown away, just by configuring a completely invalid S3 output path.

So I wonder:

Is any of this expected behavior? I was very surprised by the behavior of the second run: overwriting the file, adding another file to the same directory, or failing outright are all things I would have expected it to do.

When there is an error in the middle of the commit sequence, is there any way I can recover/retry the work straight from HDFS, or is it all deleted?

Would it make sense to have a setting that makes Gobblin reset the consumer's offsets if there is a failure on commit? We have workflows where dropping data on the floor just isn't OK.

Is there any guide to Gobblin's behavior on errors at different stages? I am going to end up writing a runbook for when things inevitably go wrong in production, so any operational knowledge out there would be helpful.

Sahil Takiar

May 16, 2016, 2:00:30 PM
to Jorge Montero, gobblin-users
It's not the expected behavior: it should either append a new file if new data for that partition shows up, or create a new folder if the partition has not been seen before.

Can you send your configuration files over? Also, can you make sure you have set data.publisher.type=TimePartitionedDataPublisher?
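
For reference, the relevant pieces of the job config would look roughly like this (illustrative only: the paths and writer/partitioner classes are placeholders, so double-check the property names against your Gobblin version):

# Illustrative only -- paths are placeholders
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
# your custom WriterPartitioner class
writer.partitioner.class=com.yourcompany.KafkaPartitionWriterPartitioner
writer.staging.dir=/gobblin/task-staging
writer.output.dir=/gobblin/task-output
data.publisher.type=gobblin.publisher.TimePartitionedDataPublisher
data.publisher.final.dir=s3a://your-bucket/base/path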

--Sahil

Jorge Montero

May 16, 2016, 10:13:02 PM
to gobblin-users, hib...@gmail.com
Thank you! Switching from BaseDataPublisher to TimePartitionedDataPublisher made it work far better. I'd love to see this part of the system better documented: I wish I understood it well enough to PR the improvements myself.

However, what this does not fix is what happens if I trigger a failure when writing to S3 (by cutting connectivity, or by running an old, buggy version of the S3 library). In that case it's perfectly OK for Gobblin to fail to save the data, but Gobblin's Kafka consumer is still moved forward as if the run had succeeded, so without manual intervention, whatever that run read from Kafka is never going to get persisted.

If nothing else, connectivity problems happen, so this makes me want to write external tools to mitigate data loss if there is a crash during finalization.

Has this not been an issue for other people in production? Is this something that would make sense to add to Gobblin mainline itself?

Sahil Takiar

May 17, 2016, 12:08:34 PM
to Jorge Montero, gobblin-users
Hey Jorge,

Yes, unfortunately it's not very well documented. Documentation can be modified the same way code is, and a GitHub pull request can be opened for any documentation changes. All Gobblin documentation lives under the "gobblin-docs" folder; the page for partitioned writers is at gobblin-docs/user-guide/Partitioned-Writers.md.

We also have a GitHub Issue open to simplify this process: https://github.com/linkedin/gobblin/issues/883

As for your other questions, what version of Gobblin are you using? The behavior you outlined shouldn't be happening: the worst-case scenario is that Gobblin commits the data but fails to roll the offsets forward; the reverse should not happen. This should guarantee "at least once" delivery of the data.
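
To illustrate the intent (this is just a sketch of the ordering, not Gobblin's actual classes):

import java.io.IOException;

// Hypothetical sketch: the commit ordering that yields at-least-once delivery.
// Data is published first; offsets are only persisted after publishing succeeds.
public class AtLeastOnceCommit {

  interface Publisher {
    void publish() throws IOException;        // move staged files to the final output dir
  }

  interface OffsetStore {
    void persistOffsets() throws IOException; // record the new Kafka offsets / watermarks
  }

  private final Publisher publisher;
  private final OffsetStore offsetStore;

  public AtLeastOnceCommit(Publisher publisher, OffsetStore offsetStore) {
    this.publisher = publisher;
    this.offsetStore = offsetStore;
  }

  public void commit() throws IOException {
    // If publish() throws, persistOffsets() never runs, so the next run re-reads
    // the same offsets: duplicates are possible, but data loss is not.
    publisher.publish();
    offsetStore.persistOffsets();
  }
}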

We are actively working to make the guarantee "exactly once", but this feature is still a WIP. Documentation on the "exactly once" feature can be found here: http://gobblin.readthedocs.io/en/latest/miscellaneous/Exactly-Once-Support/

--Sahil

Shirshanka Das

Aug 22, 2016, 9:35:57 AM
to Sahil Takiar, Jorge Montero, gobblin-users
Hi Jorge,
  Could you reply with the status of your pipeline and your answers to Sahil's questions?

thanks!
Shirshanka

