Query fails in Presto - java.lang.RuntimeException: Error fetching next

Sanjay Subramanian

Jun 2, 2014, 8:24:25 PM
to presto...@googlegroups.com
Hi guys

I have run more complex queries, some with longer running times (about 5 minutes), but this query keeps failing at about 4 min 34 sec.
After this error, I cannot run Presto unless I restart the Presto server and Discovery.

Presto is running on a 2 node cluster with CDH4.7.0 (64GB per node)

I restarted the servers and Presto is back but this query just cripples Presto

The query is a SELECT query from one table 

SELECT COL1, COL2....................,COL64, MIN(COL65) from MYTABLE GROUP BY COL1, COL2....................,COL64

Any clues?




presto069 --debug -f ./foofla.hql --output-format TSV > ./foofla.hql.out 

2014-06-02T16:54:28.895-0700     INFO   main    io.airlift.log.Logging  Logging to stderr
2014-06-02T16:54:29.410-0700     INFO   main    org.eclipse.jetty.util.log      Logging initialized @1019ms
java.lang.RuntimeException: Error fetching next
        at com.facebook.presto.client.StatementClient.advance(StatementClient.java:209)
        at com.facebook.presto.cli.Query.waitForData(Query.java:140)
        at com.facebook.presto.cli.Query.renderQueryOutput(Query.java:100)
        at com.facebook.presto.cli.Query.renderOutput(Query.java:82)
        at com.facebook.presto.cli.Console.process(Console.java:223)
        at com.facebook.presto.cli.Console.executeCommand(Console.java:216)
        at com.facebook.presto.cli.Console.run(Console.java:91)
        at com.facebook.presto.cli.Presto.main(Presto.java:31)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException
        at com.google.common.base.Throwables.propagate(Throwables.java:160)
        at io.airlift.http.client.ResponseHandlerUtils.propagate(ResponseHandlerUtils.java:22)
        at io.airlift.http.client.FullJsonResponseHandler.handleException(FullJsonResponseHandler.java:53)
        at io.airlift.http.client.FullJsonResponseHandler.handleException(FullJsonResponseHandler.java:33)
        at io.airlift.http.client.jetty.JettyHttpClient.execute(JettyHttpClient.java:205)
        at com.facebook.presto.client.StatementClient.advance(StatementClient.java:182)
        ... 7 more
Caused by: java.util.concurrent.TimeoutException
        at org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:208)
        at io.airlift.http.client.jetty.JettyHttpClient.execute(JettyHttpClient.java:198)
        ... 8 more


Running the same query inside the CLI:
Query 20140602_234045_00005_cjapn, RUNNING, 2 nodes, 328 splits
Splits:   200 queued, 128 running, 0 done
CPU Time: 122.6s total, 1.46K rows/s, 3.99MB/s, 3% active
Per Node: 0.4 parallelism,   552 rows/s,  1.5MB/s
Parallelism: 0.8
2:43 [ 180K rows,  489MB] [ 1.1K rows/s, 3.01MB/s] [       <=>                                ]
 
     STAGES   ROWS  ROWS/s  BYTES  BYTES/s  QUEUED    RUN   DONE
0.........R      0       0     0B       0B       0      1      0
  1.......R      0       0     0B       0B       0      4      0
    2.....S   180K    1.1K   489M    3.01M     200    123      0
2014-06-02T16:45:28.849-0700    DEBUG   main    com.facebook.presto.cli.StatusPrinter   error printing status
java.lang.RuntimeException: Error fetching next
        at com.facebook.presto.client.StatementClient.advance(StatementClient.java:209) ~[presto:0.69]
        at com.facebook.presto.cli.StatusPrinter.printInitialStatusUpdates(StatusPrinter.java:94) ~[presto:0.69]
        at com.facebook.presto.cli.Query.renderQueryOutput(Query.java:97) [presto:0.69]
        at com.facebook.presto.cli.Query.renderOutput(Query.java:82) [presto:0.69]
        at com.facebook.presto.cli.Console.process(Console.java:223) [presto:0.69]
        at com.facebook.presto.cli.Console.runConsole(Console.java:165) [presto:0.69]
        at com.facebook.presto.cli.Console.run(Console.java:94) [presto:0.69]
 
Query 20140602_234045_00005_cjapn, RUNNING, 2 nodes
Splits: 328 total, 0 done (0.00%)
CPU Time: 122.6s total, 1.46K rows/s, 3.99MB/s, 3% active
Per Node: 0.2 parallelism,   317 rows/s,  885KB/s
Parallelism: 0.4
4:43 [180K rows, 489MB] [634 rows/s, 1.73MB/s]
 
Query is gone (server restarted?)


Dain Sundstrom

Jun 3, 2014, 1:45:29 PM
to presto...@googlegroups.com
What version of presto are you using?  If it is not the latest, can you upgrade and try again?

It looks like the coordinator is getting overwhelmed and the client is having difficulty communicating with it.  This is likely because you only have two nodes in the cluster and, I'm guessing, the coordinator is performing work on behalf of the query in addition to coordinating.  The problem is that the worker code is written to use all available resources, so it can starve the coordinator.

I would either add some more nodes to the cluster and disable work on the coordinator, or I would reduce "task.shard.max-threads" on the coordinator to be less than the number of cores on the box.
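
As a rough illustration of the second option, the sketch below picks a thread cap that stays under the core count and prints the matching line for etc/config.properties. It is Python rather than anything Presto ships. The helper name `capped_worker_threads` and the choice to reserve two cores for coordination are assumptions for illustration; only the `task.shard.max-threads` property name comes from this thread.

```python
import os


def capped_worker_threads(cores, reserved=2):
    """Return a worker-thread cap that leaves `reserved` cores free.

    Hypothetical helper: reserving two cores is an arbitrary starting
    point, not a Presto recommendation.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Never go below one worker thread, even on very small machines.
    return max(1, cores - reserved)


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    print(f"task.shard.max-threads={capped_worker_threads(cores)}")
```

The printed value is only a starting point; it would need adjusting up or down depending on how the coordinator behaves under the failing query.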

-dain

David Phillips

Jun 3, 2014, 2:50:34 PM
to presto...@googlegroups.com
On Mon, Jun 2, 2014 at 5:24 PM, Sanjay Subramanian <sanjaysu...@gmail.com> wrote:
After this error, I cannot run Presto unless I restart the Presto server and Discovery.

[...]

I restarted the servers and Presto is back but this query just cripples Presto

First off, try running with the embedded discovery server (which is what the example config in the deployment instructions uses):

    discovery-server.enabled=true

We run this configuration at Facebook now without any problems, and plan to simplify the instructions by making this the default.

When you say you have to restart, does this mean the Presto processes crashed (exited)? If so, is there any message from the JVM in launcher.log?

SELECT COL1, COL2....................,COL64, MIN(COL65) from MYTABLE GROUP BY COL1, COL2....................,COL64

This is a very wide group, which leads me to think you might be running out of memory.
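
A back-of-the-envelope check using the figures from the status output above (180K rows, 489MB read) suggests why. The Python sketch below assumes the reported bytes roughly reflect in-memory row size and that every row forms its own group, which is the worst case for a 64-column GROUP BY. The 10-million-row figure is hypothetical, since the table size was not posted, and Presto's real memory accounting will differ.

```python
# Figures taken from the CLI status line in the original post.
ROWS_READ = 180_000            # "180K rows" (rounded by the CLI)
BYTES_READ = 489 * 1024 ** 2   # "489MB"


def bytes_per_row(rows=ROWS_READ, total_bytes=BYTES_READ):
    """Average bytes per row across the columns the query reads."""
    return total_bytes / rows


def worst_case_group_state_gb(total_rows):
    """Estimate GROUP BY state in GB if every row is a distinct group.

    The query reads 64 grouping columns plus one aggregated column,
    so the per-row size approximates the per-group state size.
    """
    return total_rows * bytes_per_row() / 1024 ** 3


if __name__ == "__main__":
    print(f"about {bytes_per_row():.0f} bytes per row")
    # Hypothetical table size, for illustration only.
    print(f"about {worst_case_group_state_gb(10_000_000):.1f} GB for 10M distinct groups")
```

At nearly 3KB per group, even a modest number of distinct groups can exceed a per-task memory limit or the JVM heap.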

Can you post your etc/config.properties and etc/jvm.config?

Sanjay Subramanian

Jun 4, 2014, 6:36:18 PM
to presto...@googlegroups.com
Hi Dain

It's a two-node cluster and you can find the specs of my cluster here 


This query runs in Hive and Impala and finishes successfully. Increasing the number of nodes won't be possible at this point, since the non-Presto options are working on the same hardware.
I would like to learn if there are any configuration tweaks.

thanks

sanjay

Sanjay Subramanian

Jun 4, 2014, 6:44:54 PM
to presto...@googlegroups.com, da...@acz.org
hey David

My specs are here 


with this setting  discovery-server.enabled=true  I could not get the two nodes to run :-( 

yes the Launcher log says java.lang.OutOfMemoryError: Java heap space - DOH sorry I should have checked this first 

thanks
sanjay

Dain Sundstrom

Jun 11, 2014, 2:30:34 PM
to presto...@googlegroups.com
Two nodes is pretty tiny for a cluster. It should work, but you are unlikely to see peak performance out of the system.

My guess is the problem you are seeing is happening because the coordinator is configured to perform work on behalf of the query (which you would do when you only have two nodes).  The work is overwhelming the ability of the coordinator to coordinate the query, so you see timeouts and then failures.  Depending on the query you are running, this can be exacerbated by a lack of real memory on the machines and excessive garbage created by some functions.  Additionally, if you are using hardware virtualization or something like Linux cgroups, processor affinity, or LXC, the default Presto configuration may be too aggressive.  The Presto default configuration is based on the number of processors reported by the JVM, but if that doesn't match the real number of processors available to the process, we create too many threads.
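
The mismatch described above is easy to see outside the JVM too. The sketch below is a Python analogue, not Presto code: it compares the machine's logical CPU count with the CPUs the current process may actually run on. `os.sched_getaffinity` is only available on some platforms (notably Linux), so the function falls back to the reported count elsewhere. Cgroup CPU quotas are not reflected by either number, so this only catches the processor-affinity case.

```python
import os


def reported_and_usable_cpus():
    """Return (logical CPUs on the machine, CPUs this process may use)."""
    reported = os.cpu_count() or 1  # os.cpu_count() may return None
    if hasattr(os, "sched_getaffinity"):
        # Set of CPU ids the current process (pid 0 = self) is allowed on.
        usable = len(os.sched_getaffinity(0))
    else:
        # No affinity API on this platform; assume all CPUs are usable.
        usable = reported
    return reported, usable


if __name__ == "__main__":
    reported, usable = reported_and_usable_cpus()
    print(f"reported={reported} usable={usable}")
    if usable < reported:
        print("thread pools sized from the reported count would be oversubscribed")
```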

You can attempt to dial in the system so it can finish the execution of your queries by tuning the number of worker threads with the `task.shard.max-threads` property. Additionally, I would watch the performance of the processes using VisualGC to make sure you are not stuck in a GC death spiral.

Also, Presto does not need a separate discovery installation anymore.  For a cluster this small, it is most likely detrimental.

-dain

David Phillips

Jun 11, 2014, 2:45:15 PM
to presto...@googlegroups.com
On Wed, Jun 4, 2014 at 3:44 PM, Sanjay Subramanian <sanjaysu...@gmail.com> wrote:
with this setting  discovery-server.enabled=true  I could not get the two nodes to run :-( 

This should work and is the recommended way to deploy. Can you tell us what error you saw? 

yes the Launcher log says java.lang.OutOfMemoryError: Java heap space - DOH sorry I should have checked this first 

The reason you run out of memory is that task.max-memory is set too large relative to your JVM max heap size. It should be a fraction of the JVM heap size.

The exact value required depends on the complexity and concurrency of running queries. For example, for Presto clusters at Facebook that are shared by many users, we use a 16GB JVM heap with a 256MB task memory limit.

Try using a lower value like 4GB or 2GB. Increase it if your queries need more memory, and decrease it if the JVM OOMs.

You can also increase your JVM heap size if the machines have more memory available.
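
To make the "fraction of the heap" rule concrete, here is a small Python sketch that parses sizes such as `16GB` and `256MB` and reports the ratio. The helper names and the accepted unit suffixes are assumptions for illustration. The 16GB/256MB pair is the Facebook example quoted above; the 4GB value is one of the suggested starting points, checked here against a hypothetical 16GB heap.

```python
# Binary multipliers for the two suffixes used in this thread.
UNITS = {"MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '256MB' or '16GB' to bytes."""
    text = text.strip()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    raise ValueError(f"unsupported size: {text!r}")


def task_memory_fraction(task_max_memory, jvm_heap):
    """Return task.max-memory as a fraction of the JVM max heap."""
    return parse_size(task_max_memory) / parse_size(jvm_heap)


if __name__ == "__main__":
    for task, heap in [("256MB", "16GB"), ("4GB", "16GB")]:
        print(f"{task} of {heap}: {task_memory_fraction(task, heap):.2%}")
```

A fraction close to 1 leaves little headroom once several tasks run at the same time, which is consistent with the OutOfMemoryError in launcher.log.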
