worker nodes go offline when querying a large data set


yuyang lan

Nov 12, 2013, 10:25:59 AM
to presto...@googlegroups.com
Hi guys, 

I'm testing presto in a 10 node cluster.

When I try to run a simple select query (select * from log where user_id = ...) against a relatively large data set (> 200 GB of RCFILE),
the query usually fails with 'No nodes available to run query' after 2 or 3 minutes.

Here's a sample of errors in coordinator's server.log:

2013-11-13T00:01:25.041+0900 ERROR Stage-20131112_145914_00181_pzfe3.1-21869 com.facebook.presto.execution.SqlStageExecution Error while starting stage 20131112_145914_00181_pzfe3.1
java.lang.IllegalStateException: No nodes available to run query
 at com.google.common.base.Preconditions.checkState(Preconditions.java:150) ~[guava-15.0.jar:na]
 at com.facebook.presto.execution.NodeScheduler$NodeSelector.selectNode(NodeScheduler.java:166) ~[presto-main-0.52.jar:0.52]
 at com.facebook.presto.execution.SqlStageExecution.chooseNode(SqlStageExecution.java:531) [presto-main-0.52.jar:0.52]
 at com.facebook.presto.execution.SqlStageExecution.startTasks(SqlStageExecution.java:467) [presto-main-0.52.jar:0.52]
 at com.facebook.presto.execution.SqlStageExecution.access$300(SqlStageExecution.java:80) [presto-main-0.52.jar:0.52]
 at com.facebook.presto.execution.SqlStageExecution$5.run(SqlStageExecution.java:435) [presto-main-0.52.jar:0.52]
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_45]
 at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_45]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
 at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]


There is no obvious error message in the worker nodes' server.log, except the following:

2013-11-12T23:59:53.281+0900    ERROR   Discovery-2 io.airlift.discovery.client.CachingServiceSelector  Cannot connect to discovery server for refresh (collector/general): Lookup of collector failed for http://coordinator:8411/v1/service/collector/general
2013-11-12T23:59:53.289+0900     INFO   Discovery-0 io.airlift.discovery.client.CachingServiceSelector  Discovery server connect succeeded for refresh (collector/general)


When I run "select * from sys.node" immediately after the failed query, I will normally see only my coordinator node left there.
But all the worker nodes returned online automatically in several minutes.

So, any hints on how to solve this problem?

Here's my setup:
worker nodes x 10: 32 GB RAM, -Xmx28G, task.max-memory=10GB
a dedicated server running the Presto coordinator and a standalone discovery service.
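
In case it helps, here is roughly how those settings are laid out on each worker (just a sketch of my layout; the discovery URI is a placeholder based on my dedicated server):

   etc/jvm.config:
      -Xmx28G

   etc/config.properties:
      coordinator=false
      task.max-memory=10GB
      discovery.uri=http://coordinator:8411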


Thanks

Lan

David Phillips

Nov 12, 2013, 10:20:43 PM
to presto...@googlegroups.com
On Tue, Nov 12, 2013 at 7:25 AM, yuyang lan <asyu...@gmail.com> wrote:
I'm testing presto in a 10 node cluster.

When I try to run a simple select query (select * from log where user_id = ...) against a relatively large data set (> 200 GB of RCFILE),
the query usually fails with 'No nodes available to run query' after 2 or 3 minutes.

[...]

When I run "select * from sys.node" immediately after the failed query, I will normally see only my coordinator node left there.
But all the worker nodes returned online automatically in several minutes.

So, any hints on how to solve this problem?

Here's my setup:
worker nodes x 10: 32 GB RAM, -Xmx28G, task.max-memory=10GB
a dedicated server running the Presto coordinator and a standalone discovery service.

It sounds like the worker nodes are "falling out" of discovery while running the query. Nodes need to periodically refresh their announcement with the discovery server, otherwise the discovery server will assume they died and remove the announcement. The internal table "sys.node" is populated via discovery, so this explains why it is empty when this problem occurs.
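
If you want to see what discovery itself currently has registered, you can also query the discovery server directly (a rough example; I'm assuming the discovery server is the one at coordinator:8411 shown in your worker log, and that "presto" is the service type the workers announce under, as in a default setup):

   curl http://coordinator:8411/v1/service/presto/general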

Another issue is that the coordinator's heartbeat monitor could be detecting the nodes as failed even if they are refreshing themselves in discovery. We actually have an endpoint on the coordinator that lets you view the failed nodes (/v1/node/failed), but unfortunately I just discovered that was broken by a recent refactoring, so you will get a 404 if you try to access it. (We will fix this soon with proper tests.)
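
Once that endpoint is fixed, checking it is just an HTTP GET against the coordinator (a sketch; substitute your coordinator's HTTP host and port):

   curl http://<coordinator-host>:<http-port>/v1/node/failed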

Can you give some detail about your cluster?
Are these real machines or virtual machines?
Is the network 1 gigabit or 10 gigabit? Is anything else running on these machines?
Are all of the machines connected to the same network switch?
What is the capacity of that switch?
Where are the HDFS data nodes located on the network in relation to the machines running Presto?

If the network or machines are severely overloaded during the large query it could cause this problem.

yuyang lan

Nov 14, 2013, 1:04:19 PM
to presto...@googlegroups.com, da...@acz.org
Hi, thank you for your reply.

> Can you give some detail about your cluster?
> Are these real machines or virtual machines?
> Is the network 1 gigabit or 10 gigabit? Is anything else running on these machines?
> Are all of the machines connected to the same network switch?
> What is the capacity of that switch?
> Where are the HDFS data nodes located on the network in relation to the machines running Presto?

It's a cluster of 11 real machines (1 coordinator & 10 workers).
They're all in the same rack and connected to a 1 Gb top-of-rack switch (24 ports, 32 Gbps switching bandwidth).
All Presto worker nodes also run a DataNode daemon locally, as part of a small HDFS cluster of about twenty nodes.

After reading your reply, I tried some standard kernel network tuning, but it didn't seem to help.
Then I added some debug code around DiscoveryNodeManager & HeartbeatFailureDetector
and found that the node failures are caused by ping request connection timeouts.

So adding "failure-detector.http-client.connect-timeout=1m" to my coordinator's config.properties finally solved the problem.
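
For anyone else hitting this, the change is a single extra line in the coordinator's etc/config.properties (everything else stays as before):

   failure-detector.http-client.connect-timeout=1m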

Thanks again !

Martin Traverso

Nov 15, 2013, 11:47:57 AM
to presto...@googlegroups.com, da...@acz.org
The failure detector has a default connection timeout of 1s. That should be plenty of time for any well-behaved system, so the fact that you've had to increase it may be indicative of other networking problems. By setting it to 1 minute, you've effectively disabled the failure detector. You can achieve the same result by setting the "failure-detector.enabled" config option (in config.properties) to false.
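
In other words, either of these lines in the coordinator's config.properties ends up with the same practical effect; the second disables the detector outright rather than making its timeout so long that it never fires:

   failure-detector.http-client.connect-timeout=1m

or

   failure-detector.enabled=false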

Martin

Ravi Shekhar

Mar 10, 2014, 6:23:56 AM
to presto...@googlegroups.com
Hi All,

I am facing a similar issue.

My configuration is as follows:
       Coordinator node: 1
       Worker nodes: 6

I am able to query the Hive metastore (show tables, schemas, etc.). But when I try to execute a select on any table, I get the following exception in the coordinator log.

java.lang.IllegalStateException: No nodes available to run query
        at com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[guava-16.0.1.jar:na]
        at com.facebook.presto.execution.NodeScheduler$NodeSelector.computeAssignments(NodeScheduler.java:183) ~[presto-main-0.60.jar:0.60]
        at com.facebook.presto.execution.SqlStageExecution.scheduleSourcePartitionedNodes(SqlStageExecution.java:639) [presto-main-0.60.jar:0.60]
        at com.facebook.presto.execution.SqlStageExecution.startTasks(SqlStageExecution.java:557) [presto-main-0.60.jar:0.60]
        at com.facebook.presto.execution.SqlStageExecution.access$200(SqlStageExecution.java:94) [presto-main-0.60.jar:0.60]
        at com.facebook.presto.execution.SqlStageExecution$4.run(SqlStageExecution.java:529) [presto-main-0.60.jar:0.60]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_45]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]

I have tried setting the failure-detector.enabled property (as discussed above) on the coordinator node, but the problem still persists.

Please suggest.

Regards
Ravi Shekhar

Dain Sundstrom

Mar 11, 2014, 1:34:06 PM
to presto...@googlegroups.com
The most common causes of this problem are: the nodes not registering with discovery, the failure detector kicking out nodes, and the datasources property not being set. The first two can be diagnosed with this query:

   select * from sys.node;

This will show every node in your cluster and whether the coordinator considers each node healthy.

For the last problem, check that etc/config.properties on every node contains:

  datasources=jmx,hive
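
For context, a minimal worker etc/config.properties looks roughly like this (a sketch; the port, memory limit, and discovery URI are placeholders for your own values):

  coordinator=false
  datasources=jmx,hive
  http-server.http.port=8080
  task.max-memory=1GB
  discovery.uri=http://<discovery-host>:<discovery-port>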

-dain