Hi guys,
I'm testing Presto on a 10-node cluster.
When I run a simple select query (select * from log where user_id = ... ) against a relatively large data set ( > 200 GB, RCFile),
the query usually fails with 'No nodes available to run query' after 2 or 3 minutes.
Here's a sample of errors in coordinator's server.log:
2013-11-13T00:01:25.041+0900 ERROR Stage-20131112_145914_00181_pzfe3.1-21869 com.facebook.presto.execution.SqlStageExecution Error while starting stage 20131112_145914_00181_pzfe3.1
java.lang.IllegalStateException: No nodes available to run query
    at com.google.common.base.Preconditions.checkState(Preconditions.java:150) ~[guava-15.0.jar:na]
    at com.facebook.presto.execution.NodeScheduler$NodeSelector.selectNode(NodeScheduler.java:166) ~[presto-main-0.52.jar:0.52]
    at com.facebook.presto.execution.SqlStageExecution.chooseNode(SqlStageExecution.java:531) [presto-main-0.52.jar:0.52]
    at com.facebook.presto.execution.SqlStageExecution.startTasks(SqlStageExecution.java:467) [presto-main-0.52.jar:0.52]
    at com.facebook.presto.execution.SqlStageExecution.access$300(SqlStageExecution.java:80) [presto-main-0.52.jar:0.52]
    at com.facebook.presto.execution.SqlStageExecution$5.run(SqlStageExecution.java:435) [presto-main-0.52.jar:0.52]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_45]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_45]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
There's no obvious error message in the worker nodes' server.log, except for the following:
2013-11-12T23:59:53.281+0900 ERROR Discovery-2 io.airlift.discovery.client.CachingServiceSelector Cannot connect to discovery server for refresh (collector/general): Lookup of collector failed for http://coordinator:8411/v1/service/collector/general
2013-11-12T23:59:53.289+0900 INFO Discovery-0 io.airlift.discovery.client.CachingServiceSelector Discovery server connect succeeded for refresh (collector/general)
When I run "select * from sys.node" immediately after the failed query, I normally see only my coordinator node listed.
All the worker nodes come back online automatically within several minutes, though.
Any hints on how to solve this problem?
Here's my setup:
10 worker nodes: 32 GB memory each, -Xmx28G, task.max-memory=10GB
A dedicated server running the Presto coordinator and the discovery service separately.
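To make the worker settings concrete, this is roughly where the two values above would sit, assuming the usual etc/jvm.config and etc/config.properties layout of a Presto install. Only -Xmx28G and task.max-memory=10GB are from my setup; the file paths are an assumption about a standard deployment, and all other entries are omitted rather than guessed.

```
# etc/jvm.config on each worker (other JVM flags omitted)
-Xmx28G

# etc/config.properties on each worker (other properties omitted)
task.max-memory=10GB
```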
Thanks
Lan