Frequent "PageTransportTimeoutException: Encountered too many errors talking to a worker node"

83 views
Skip to first unread message

Thomas Sun

unread,
Sep 22, 2016, 11:12:35 AM9/22/16
to presto...@googlegroups.com
Hi,

I'm running 0.143 version on a cluster of 200 machines, each with 56 cores and 128GB ram. We are running into "PageTransportTimeoutException: Encountered too many errors talking to a worker node" very frequently mostly with complex join queries but sometimes even simple select. I tried out suggestions in https://github.com/prestodb/presto/issues/4182, but didn't get any improvements.

I see a lot of these errors in worker nodes:
org.glassfish.jersey.server.internal.process.MappableException: org.eclipse.jetty.io.EofException at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:91) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162) at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1154) at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:621) at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:377) at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:367) at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:274) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) at org.glassfish.jersey.internal.Errors.process(Errors.java:315) at org.glassfish.jersey.internal.Errors.process(Errors.java:297) at org.glassfish.jersey.internal.Errors.process(Errors.java:267) at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:297) at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:254) at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1030) at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:373) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:381) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:344) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:835) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1685) at io.airlift.http.server.TraceTokenFilter.doFilter(TraceTokenFilter.java:63) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668) at io.airlift.http.server.TimingFilter.doFilter(TimingFilter.java:52) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:397) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1162) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:169) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.server.Server.handle(Server.java:517) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572) at java.lang.Thread.run(Thread.java:745) Caused by: org.eclipse.jetty.io.EofException at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:197) at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:419) at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:313) at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:141) at org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:732) at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241) at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:224) at org.eclipse.jetty.server.HttpConnection.send(HttpConnection.java:512) at org.eclipse.jetty.server.HttpChannel.sendResponse(HttpChannel.java:673) at org.eclipse.jetty.server.HttpChannel.write(HttpChannel.java:722) at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:179) at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:163) at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:415) at io.airlift.http.server.TimingFilter$TimedServletOutputStream.write(TimingFilter.java:162) at org.glassfish.jersey.servlet.internal.ResponseWriter$NonCloseableOutputStreamWrapper.write(ResponseWriter.java:298) at org.glassfish.jersey.message.internal.CommittingOutputStream.write(CommittingOutputStream.java:229) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$UnCloseableOutputStream.write(WriterInterceptorExecutor.java:299) at com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:1853) at com.fasterxml.jackson.core.json.UTF8JsonGenerator.writeEndObject(UTF8JsonGenerator.java:336) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:154) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContentsUsing(IndexedListSerializer.java:124) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:69) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:21) at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:183) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeWithType(BeanSerializerBase.java:525) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:507) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:100) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:21) at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:183) at com.fasterxml.jackson.databind.SerializerProvider.defaultSerializeValue(SerializerProvider.java:896) at com.fasterxml.jackson.datatype.jdk8.OptionalSerializer.serialize(OptionalSerializer.java:92) at com.fasterxml.jackson.datatype.jdk8.OptionalSerializer.serialize(OptionalSerializer.java:19) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) at com.fasterxml.jackson.databind.ser.std.MapSerializer.serializeFields(MapSerializer.java:491) at com.fasterxml.jackson.databind.ser.std.MapSerializer.serialize(MapSerializer.java:412) at com.fasterxml.jackson.databind.ser.std.MapSerializer.serialize(MapSerializer.java:27) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:100) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:21) at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:183) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:100) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:21) at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:183) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:100) at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:21) at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:183) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) at com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:602) at io.airlift.jaxrs.JsonMapper.writeTo(JsonMapper.java:245) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.invokeWriteTo(WriterInterceptorExecutor.java:265) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:250) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162) at org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:106) at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162) at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:85) ... 44 more Caused by: java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.writev0(Native Method) at sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51) at sun.nio.ch.IOUtil.write(IOUtil.java:148) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:504) at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:177)

If anyone could give some guidance that'd be greatly appreciated.

Thanks,
Tom

Stephen Sprague

unread,
Sep 22, 2016, 11:47:06 AM9/22/16
to presto...@googlegroups.com
i don't have any magic bullets for you but if you go to the presto admin page while the query is running, click on the query id.  then look the a DAG graph.   does it show any kind of statuses like "BLOCKED" or anything that looks unusual?

are there any other clues from examining that page and its hyperlinks that look like something's amiss?


what is the load on your 200 machines? is any particular one totally out of band? 


some ideas anyway.

--
You received this message because you are subscribed to the Google Groups "Presto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to presto-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Stephen Sprague

unread,
Sep 22, 2016, 12:31:52 PM9/22/16
to presto...@googlegroups.com
fyi. i think load is a pretty good proxy for trouble. no guarantees but its a good warning signal IMHO.

we have a 30 node cluster and while we have newrelic, pepperdata, ganglia, logic monitor, nagios and who knows what else... call me a cave man but i just run uptime from the cmd line in a loop.  nice 'n simple.

so i get something like this:

dndwr2
 09:18:43 up 221 days,  2:16,  0 users,  load average: 0.62, 3.46, 5.99
dndwr3
 09:18:43 up 231 days, 22:44,  0 users,  load average: 1.26, 4.13, 6.27
dndwr4
 09:18:43 up 1022 days, 14:41,  0 users,  load average: 1.44, 5.10, 6.97
dndwr5
 09:18:44 up 223 days,  2:04,  0 users,  load average: 1.47, 4.56, 7.00
dndwr6
 09:18:44 up 225 days, 52 min,  0 users,  load average: 1.51, 3.76, 5.94
dndwr7
 09:18:44 up 225 days,  1:40,  0 users,  load average: 1.42, 3.97, 6.28
dndwr8
 09:18:44 up 225 days, 58 min,  0 users,  load average: 0.93, 4.00, 6.34
dndwr9
 09:18:44 up 776 days, 18:46,  0 users,  load average: 1.74, 4.38, 6.76
dndwr10
 09:18:44 up 268 days,  2:59,  0 users,  load average: 1.28, 4.50, 6.56
dndwr11
 09:18:44 up 558 days, 19:29,  0 users,  load average: 1.25, 3.76, 6.30
dndwr12
 09:18:44 up 558 days, 18:56,  0 users,  load average: 1.55, 4.48, 6.67
dndwr13
 09:18:44 up 552 days, 23:51,  0 users,  load average: 0.98, 3.74, 6.31
dndwr14
 09:18:45 up 223 days,  2:10,  0 users,  load average: 1.49, 4.32, 6.32
dndwr15
 09:18:45 up 558 days, 18:52,  0 users,  load average: 1.06, 3.96, 6.24
dndwr16
 09:18:45 up 558 days, 20:50,  0 users,  load average: 1.07, 4.27, 6.63
dndwr17
 09:18:45 up 558 days, 19:41,  0 users,  load average: 1.02, 4.50, 6.82
dndwr18
 09:18:45 up 223 days,  2:22,  0 users,  load average: 0.93, 3.46, 5.66
dndwr19
 09:18:45 up 558 days, 19:35,  0 users,  load average: 0.70, 3.73, 6.14
dndwr20
 09:18:46 up 84 days, 15:45,  0 users,  load average: 1.34, 3.95, 6.40
dndwr21
 09:18:46 up 91 days,  8:30,  0 users,  load average: 0.61, 3.87, 6.52
dndwr22
 09:18:46 up 100 days, 16:45,  0 users,  load average: 0.85, 3.43, 6.02
dndwr23
 09:18:46 up 100 days, 16:04,  0 users,  load average: 1.22, 3.79, 6.18
dndwr24
 09:18:46 up 54 days, 2 min,  0 users,  load average: 0.66, 3.17, 5.99
dndwr25
 09:18:46 up 8 days, 21:15,  0 users,  load average: 1.60, 4.00, 6.48
dndwr26
 09:18:46 up 79 days,  7:50,  0 users,  load average: 0.95, 2.77, 4.43
dndwr27
 09:18:46 up 100 days, 16:03,  0 users,  load average: 1.78, 4.66, 6.81


pretty easy to spot something wacky on the cluster this way.


Cheers,
Stephen.

Dain Sundstrom

unread,
Sep 22, 2016, 12:59:22 PM9/22/16
to presto...@googlegroups.com
We have found two main causes of this problem.

The first is that we can get full process pauses for 10s of seconds and at 200 nodes you get unlucky enough that you see one lone enough to get a failure. In the past, this was been caused by GC, but since we moved to G1 and tuned the code, we typically don’t see this anymore (except for heavy writers which is why I’m writing a new ORC writer from scratch). Now days it is almost always cause by something bad running on the Presto machines. This could be as simple as a back ground process killing the disks or using to much memory which cause the Presto heap to be paged out. A common cause of this is running map reduce on Presto nodes. But it can be as complex as kernel bugs. We found one recently (on an old kernel) that caused the whole machine to lockup for 30s. I would start by looking for gaps in the http log on the machine that fell out of the cluster.

The other issue can be network saturation. In a big join you can generate a huge amount of traffic, which can saturate the top of rack switch or in your case of multiple racks the cluster switches. This is not a problem in places like AWS where you are spread out, but happens onprem when you are packed into racks. At FB we have problem saturating the top of rack switch in some clusters, but typically not the rack to rack paths (we use fabric now). Generally, the network should back off and slow down your system but depending not the configuration, you can get a retransmit storm that kills the whole network.

Als I’d update the the newest release and set `task.info-update-interval` to like 3 seconds. Which keeps the “chattiness” of our coordinator to worker protocol down.

-dain
> --
> You received this message because you are subscribed to the Google Groups "Presto" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to presto-users...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages