Query fails in Presto - java.lang.RuntimeException: Error fetching next

Sanjay Subramanian

Jun 2, 2014, 8:24:25 PM

to presto...@googlegroups.com

Hi guys

I have run more complex queries, and queries with longer running times (about 5 minutes), but this query keeps failing at about 4 min 34 sec.

After this error, I cannot run Presto unless I restart the Presto server and Discovery.

Presto is running on a 2 node cluster with CDH4.7.0 (64GB per node)

I restarted the servers and Presto is back but this query just cripples Presto

The query is a SELECT query from one table

SELECT COL1, COL2....................,COL64, MIN(COL65) from MYTABLE GROUP BY COL1, COL2....................,COL64

Any clues?

presto069 --debug -f ./foofla.hql --output-format TSV > ./foofla.hql.out

2014-06-02T16:54:28.895-0700 INFO main io.airlift.log.Logging Logging to stderr
2014-06-02T16:54:29.410-0700 INFO main org.eclipse.jetty.util.log Logging initialized @1019ms
java.lang.RuntimeException: Error fetching next
    at com.facebook.presto.client.StatementClient.advance(StatementClient.java:209)
    at com.facebook.presto.cli.Query.waitForData(Query.java:140)
    at com.facebook.presto.cli.Query.renderQueryOutput(Query.java:100)
    at com.facebook.presto.cli.Query.renderOutput(Query.java:82)
    at com.facebook.presto.cli.Console.process(Console.java:223)
    at com.facebook.presto.cli.Console.executeCommand(Console.java:216)
    at com.facebook.presto.cli.Console.run(Console.java:91)
    at com.facebook.presto.cli.Presto.main(Presto.java:31)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException
    at com.google.common.base.Throwables.propagate(Throwables.java:160)
    at io.airlift.http.client.ResponseHandlerUtils.propagate(ResponseHandlerUtils.java:22)
    at io.airlift.http.client.FullJsonResponseHandler.handleException(FullJsonResponseHandler.java:53)
    at io.airlift.http.client.FullJsonResponseHandler.handleException(FullJsonResponseHandler.java:33)
    at io.airlift.http.client.jetty.JettyHttpClient.execute(JettyHttpClient.java:205)
    at com.facebook.presto.client.StatementClient.advance(StatementClient.java:182)
    ... 7 more
Caused by: java.util.concurrent.TimeoutException
    at org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:208)
    at io.airlift.http.client.jetty.JettyHttpClient.execute(JettyHttpClient.java:198)
    ... 8 more

Running the same query inside the CLI:

Query 20140602_234045_00005_cjapn, RUNNING, 2 nodes, 328 splits
http://10.255.32.180:8081/v1/query/20140602_234045_00005_cjapn?pretty
Splits: 200 queued, 128 running, 0 done
CPU Time: 122.6s total, 1.46K rows/s, 3.99MB/s, 3% active
Per Node: 0.4 parallelism, 552 rows/s, 1.5MB/s
Parallelism: 0.8
2:43 [ 180K rows, 489MB] [ 1.1K rows/s, 3.01MB/s] [ <=> ]

STAGES        ROWS  ROWS/s  BYTES  BYTES/s  QUEUED  RUN  DONE
0.........R      0       0     0B       0B       0    1     0
  1.......R      0       0     0B       0B       0    4     0
    2.....S   180K    1.1K   489M    3.01M     200  123     0

2014-06-02T16:45:28.849-0700 DEBUG main com.facebook.presto.cli.StatusPrinter error printing status
java.lang.RuntimeException: Error fetching next
    at com.facebook.presto.client.StatementClient.advance(StatementClient.java:209) ~[presto:0.69]
    at com.facebook.presto.cli.StatusPrinter.printInitialStatusUpdates(StatusPrinter.java:94) ~[presto:0.69]
    at com.facebook.presto.cli.Query.renderQueryOutput(Query.java:97) [presto:0.69]
    at com.facebook.presto.cli.Query.renderOutput(Query.java:82) [presto:0.69]
    at com.facebook.presto.cli.Console.process(Console.java:223) [presto:0.69]
    at com.facebook.presto.cli.Console.runConsole(Console.java:165) [presto:0.69]
    at com.facebook.presto.cli.Console.run(Console.java:94) [presto:0.69]

Query 20140602_234045_00005_cjapn, RUNNING, 2 nodes
http://10.255.32.180:8081/v1/query/20140602_234045_00005_cjapn?pretty
Splits: 328 total, 0 done (0.00%)
CPU Time: 122.6s total, 1.46K rows/s, 3.99MB/s, 3% active
Per Node: 0.2 parallelism, 317 rows/s, 885KB/s
Parallelism: 0.4
4:43 [180K rows, 489MB] [634 rows/s, 1.73MB/s]
Query is gone (server restarted?)

Dain Sundstrom

Jun 3, 2014, 1:45:29 PM

to presto...@googlegroups.com

What version of Presto are you using? If it is not the latest, can you upgrade and try again?

It looks like the coordinator is getting overwhelmed and the client is having difficulties communicating with the coordinator. This is likely because you only have two nodes in the cluster and, I'm guessing, the coordinator is performing work on behalf of the query in addition to coordinating it. The problem is that the worker code is written to use all available resources, so it can starve the coordinator.

I would either add some more nodes to the cluster and disable work on the coordinator, or I would reduce "task.shard.max-threads" on the coordinator to be less than the number of cores on the box.
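As a rough sketch, the two options might look like this in the coordinator's etc/config.properties. The thread count of 6 is a hypothetical value that assumes an 8-core box (the core count is not given in this thread), and `node-scheduler.include-coordinator` is the scheduling property from the deployment instructions of this era, so check both against your version before relying on them.

```properties
# Option 1: keep the coordinator out of query execution.
# Only useful if other nodes remain to act as workers.
coordinator=true
node-scheduler.include-coordinator=false

# Option 2: let the coordinator keep doing query work, but cap its
# worker threads below the core count (6 is a placeholder, assuming 8 cores).
# node-scheduler.include-coordinator=true
# task.shard.max-threads=6
```

On a two-node cluster, option 1 leaves a single worker, so option 2 is probably the more practical one to try first.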

-dain

David Phillips

Jun 3, 2014, 2:50:34 PM

to presto...@googlegroups.com

On Mon, Jun 2, 2014 at 5:24 PM, Sanjay Subramanian <sanjaysu...@gmail.com> wrote:

After this error, I cannot run Presto unless I restart the Presto server and Discovery.

[...]

I restarted the servers and Presto is back but this query just cripples Presto

First off, try running with the embedded discovery server (which is what the example config in the deployment instructions uses):

discovery-server.enabled=true

We run this configuration at Facebook now without any problems, and plan to simplify the instructions by making this the default.
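A minimal sketch of what that could look like in etc/config.properties, assuming the coordinator is the node at http://10.255.32.180:8081 (taken from the query status URL earlier in this thread; substitute your own host and port):

```properties
# Coordinator: run the embedded discovery service and register with it.
coordinator=true
http-server.http.port=8081
discovery-server.enabled=true
discovery.uri=http://10.255.32.180:8081

# Worker (separate config.properties on the other node): no embedded
# discovery server, just point at the coordinator.
# coordinator=false
# http-server.http.port=8081
# discovery.uri=http://10.255.32.180:8081
```

With this layout there is no standalone Discovery process to install or restart; every node's discovery.uri points at the coordinator itself.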

When you say you have to restart, does this mean the Presto processes crashed (exited)? If so, is there any message from the JVM in launcher.log?

SELECT COL1, COL2....................,COL64, MIN(COL65) from MYTABLE GROUP BY COL1, COL2....................,COL64

This is a very wide GROUP BY (64 columns), which leads me to think you might be running out of memory.

Can you post your etc/config.properties and etc/jvm.config?

Sanjay Subramanian

Jun 4, 2014, 6:36:18 PM

to presto...@googlegroups.com

Hi Dain

It's a two-node cluster, and you can find the specs of my cluster here:

http://bigdatalatte.wordpress.com/2014/06/01/setting-up-presto0-69-on-an-existing-cloudera-cdh4-7-0-cluster/

This query runs in Hive and Impala and finishes successfully. Increasing the number of nodes won't be possible at this point, given that the non-Presto options are working on the same hardware.

I would like to learn whether there are any configuration tweaks that would help.

thanks

sanjay

Sanjay Subramanian

Jun 4, 2014, 6:44:54 PM

to presto...@googlegroups.com, da...@acz.org

hey David

My specs are here

http://bigdatalatte.wordpress.com/2014/06/01/setting-up-presto0-69-on-an-existing-cloudera-cdh4-7-0-cluster/

with this setting discovery-server.enabled=true I could not get the two nodes to run :-(

yes the Launcher log says java.lang.OutOfMemoryError: Java heap space - DOH sorry I should have checked this first

thanks

sanjay

Dain Sundstrom

Jun 11, 2014, 2:30:34 PM

to presto...@googlegroups.com

Two nodes is pretty tiny for a cluster. It should work, but you are unlikely to see peak performance out of the system.

My guess is the problem you are seeing is happening because the coordinator is configured to perform work on behalf of the query (which you would do when you only have two nodes). The work is overwhelming the ability of the coordinator to coordinate the query, so you see timeouts and then failures. Depending on the query you are running, this can be exacerbated by a lack of real memory on the machines and excessive garbage created by some functions. Additionally, if you are using hardware virtualization or something like Linux cgroups, processor affinity, or LXC, the default Presto configurations may be too aggressive. The Presto default configuration is based on the number of processors reported by the JVM, but if that doesn't match the real number of processors available to the process, we create too many threads.

You can attempt to dial in the system so it can finish the execution of your queries by tuning the number of worker threads with the `task.shard.max-threads` property. Additionally, I would watch the performance of the processes using VisualGC to make sure you are not stuck in a GC death spiral.

Also, Presto does not need a separate discovery installation anymore. For a cluster this small, it is most likely detrimental.

-dain

David Phillips

Jun 11, 2014, 2:45:15 PM

to presto...@googlegroups.com

On Wed, Jun 4, 2014 at 3:44 PM, Sanjay Subramanian <sanjaysu...@gmail.com> wrote:

with this setting discovery-server.enabled=true I could not get the two nodes to run :-(

This should work and is the recommended way to deploy. Can you tell us what error you saw?

yes the Launcher log says java.lang.OutOfMemoryError: Java heap space - DOH sorry I should have checked this first

The reason you run out of memory is that task.max-memory is set too large relative to your JVM max heap size. It should be a fraction of the JVM heap size.

The exact value required depends on the complexity and concurrency of running queries. For example, for Presto clusters at Facebook that are shared by many users, we use a 16GB JVM heap with a 256MB task memory limit.

Try using a lower value like 4GB or 2GB. Increase it if your queries need more memory, and decrease it if the JVM OOMs.

You can also increase your JVM heap size if the machines have more memory available.
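To make the relationship concrete, here is a sketch using the numbers quoted above for the shared Facebook clusters (16GB heap, 256MB per task). The file names follow the etc/ layout mentioned earlier in the thread. Treat the values as an illustration of the ratio rather than a recommendation for this particular cluster, whose heap size was not posted.

```properties
# etc/config.properties
# Pairs with a 16GB max heap, i.e. -Xmx16G in etc/jvm.config.
task.max-memory=256MB

# For a wide GROUP BY that needs more memory per task, a larger starting
# point such as the 2GB suggested above could be tried instead:
# task.max-memory=2GB
```

If the JVM still hits OutOfMemoryError with a given value, lower task.max-memory or raise -Xmx, as described above.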
