[0.9.1.1] S3 read timeouts during batch ingestion tasks


Ryan O'Rourke

Oct 20, 2016, 1:14:01 PM
to Druid User
Hello,

In our batch ingestions we experience read timeouts communicating with S3. This is pretty much the only time we see ingestion tasks fail.

My understanding is that some timeouts are to be expected when communicating with S3. Maybe the peons could do a better job handling them, but we can tolerate some low rate of failure - we have a daemon that watches for failed tasks and restarts them.
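
For reference, the watchdog isn't anything fancy - roughly the shape sketched below, though this is a simplified illustration rather than our actual code. It assumes the Overlord's /druid/indexer/v1/task/{id}/status and /druid/indexer/v1/task endpoints, and it skips real JSON parsing, authentication, and error handling.

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Scanner;

// Simplified failed-task watchdog (illustration only, not our production daemon).
// taskSpecs maps each submitted taskId to the path of the JSON spec it was submitted with.
public class TaskWatchdog
{
  private static final String OVERLORD = "http://overlord.example.com:8090"; // hypothetical host

  public static void watch(Map<String, String> taskSpecs) throws Exception
  {
    while (true) {
      for (Map.Entry<String, String> entry : taskSpecs.entrySet()) {
        String status = httpGet(OVERLORD + "/druid/indexer/v1/task/" + entry.getKey() + "/status");
        // A real daemon would parse the JSON and track the new taskId returned on resubmit;
        // here we just look for a FAILED status and blindly resubmit the same spec.
        if (status.contains("\"FAILED\"")) {
          String spec = new String(Files.readAllBytes(Paths.get(entry.getValue())), StandardCharsets.UTF_8);
          httpPost(OVERLORD + "/druid/indexer/v1/task", spec);
        }
      }
      Thread.sleep(60_000); // poll once a minute
    }
  }

  private static String httpGet(String url) throws IOException
  {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8").useDelimiter("\\A")) {
      return s.hasNext() ? s.next() : "";
    }
  }

  private static void httpPost(String url, String body) throws IOException
  {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream os = conn.getOutputStream()) {
      os.write(body.getBytes(StandardCharsets.UTF_8));
    }
    conn.getResponseCode(); // force the request; the response carries the new taskId, ignored here
  }
}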

However, we're trying to scale up our indexer capacity to handle backfill situations, and it appears that as we increase the number of peons running on a single VM, the rate of failures due to S3 read timeouts also increases. I can't say this for sure or give you hard numbers - at this point it's just "anecdata". But for example, running 20 peons on a 40-CPU m4.10xlarge, I've had 14 failures out of 27 tasks.

Is there anything that can be done to reduce these failures? If not, does anyone have a sense of what a "healthy" number of workers to run in parallel would be?


java.lang.IllegalStateException: java.net.SocketTimeoutException: Read timed out
	at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:106) ~[commons-io-2.4.jar:2.4]
	at io.druid.data.input.impl.FileIteratingFirehose.hasMore(FileIteratingFirehose.java:52) ~[druid-api-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.common.task.IndexTask.generateSegment(IndexTask.java:389) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:221) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_60]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_60]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_60]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_60]
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_60]
	at java.net.SocketInputStream.read(SocketInputStream.java:170) ~[?:1.8.0_60]
	at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_60]
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[?:1.8.0_60]
	at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593) ~[?:1.8.0_60]
	at sun.security.ssl.InputRecord.read(InputRecord.java:532) ~[?:1.8.0_60]
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973) ~[?:1.8.0_60]
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930) ~[?:1.8.0_60]
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) ~[?:1.8.0_60]
	at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198) ~[httpcore-4.4.3.jar:4.4.3]
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178) ~[httpcore-4.4.3.jar:4.4.3]
	at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137) ~[httpclient-4.5.1.jar:4.5.1]
	at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:78) ~[jets3t-0.9.4.jar:0.9.4]
	at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:146) ~[jets3t-0.9.4.jar:0.9.4]
	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_60]
	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_60]
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_60]
	at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_60]
	at java.io.BufferedReader.fill(BufferedReader.java:161) ~[?:1.8.0_60]
	at java.io.BufferedReader.readLine(BufferedReader.java:324) ~[?:1.8.0_60]
	at java.io.BufferedReader.readLine(BufferedReader.java:389) ~[?:1.8.0_60]
	at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:95) ~[commons-io-2.4.jar:2.4]
	... 9 more
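
Per the trace, the timeout surfaces in jets3t's HTTP socket read while the firehose is iterating lines from the S3 object. One knob that might be relevant - untested on our side, and the property names come from the jets3t configuration docs rather than anything Druid-specific - is raising jets3t's HTTP client timeouts and retry count via a jets3t.properties file on the peon classpath, e.g.:

# jets3t.properties (illustrative values, not recommendations)
httpclient.connection-timeout-ms=120000
httpclient.socket-timeout-ms=120000
httpclient.retry-max=10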

Ryan O'Rourke

Oct 27, 2016, 11:09:15 AM
to Druid User
Bump... we've had some success scaling out with small instances running only a single peon each. But even then, it seems that once we cross some threshold of total workers, everything starts failing with S3 timeouts.

We've also seen this happen when our configuration hasn't changed at all - we were chugging along with 7 single-peon indexers, then one hour had roughly a 75% failure rate and the next hour a 100% failure rate, and then, without any changes on our side, it went back down to ~5% for a while.

Does anyone have any insight, or can anyone suggest somewhere more robust than S3 to put our data?

Gian Merlino

Oct 27, 2016, 12:42:04 PM
to druid...@googlegroups.com
Hey Ryan,

In my experience, sporadic failures are a reality of running with S3. https://github.com/druid-io/druid/pull/3611 should help here - we're aiming for 0.9.3 for that one. It adds automatic retries for data fetching in the index task.
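
The general idea is a retry with backoff around the data fetch. Roughly the shape below - this is just an illustration of the pattern, not the actual code in that PR, and openS3Object is a placeholder for whatever opens and reads the object:

import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Illustration of the general retry-with-backoff pattern; not the PR's implementation.
public class S3Retry
{
  // Re-runs 'fetch' when it fails with a read timeout. 'fetch' must re-open the
  // S3 object (and restart or resume the read) on every attempt.
  public static <T> T withRetries(Callable<T> fetch, int maxTries) throws Exception
  {
    for (int attempt = 1; ; attempt++) {
      try {
        return fetch.call();
      }
      catch (SocketTimeoutException e) {
        if (attempt >= maxTries) {
          throw e; // out of attempts, surface the timeout
        }
        // Exponential backoff with a cap, giving transient S3 slowness time to clear.
        long sleepMs = Math.min(60_000L, 1_000L * (1L << attempt));
        Thread.sleep(sleepMs);
      }
    }
  }
}

// Usage:
//   byte[] data = S3Retry.withRetries(() -> openS3Object(bucket, key), 5);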

Gian

