Presto OOM issues.


Thomas Rieck

May 26, 2016, 10:23:05 AM
to Presto

We have been using Presto 0.111 without issue on AWS EMR clusters to perform CROSS JOIN UNNEST operations on a field whose values are arrays of JSON objects. However, with the introduction of EMR 4.x, Amazon pre-loads the clusters with a later version of Presto and prevents installing any other version. This transition to a later version of Presto has revealed a fundamental issue for us that was not present with 0.111.
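
For context, the general shape of such a query is roughly the following; the table and column names are placeholders, not our actual schema:

    -- Illustrative only: "events" and "payload" are placeholder names,
    -- and payload is assumed to be a column of type array(json).
    SELECT e.id,
           json_extract_scalar(item, '$.type') AS item_type
    FROM events e
    CROSS JOIN UNNEST(e.payload) AS t (item)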


When running the same query as before, which performs a CROSS JOIN UNNEST on an array of JSON objects, the cluster fails internally with an OutOfMemoryError. Upon further investigation, we found that configuration properties we had used with 0.111, such as task.max-memory, have been deprecated in favor of query.max-memory and related settings. During our testing we determined that Presto 0.112 is the last release that works for our use case, which led us to compare the changes between releases 0.112 and 0.113. The major change appears to be the cluster memory manager, which is explicitly enabled in MemoryManagerConfig in 0.113. This change causes the MemoryPool to enable blocking, which we believe could be the cause of our issues.
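
For reference, the shift in memory properties in etc/config.properties looks roughly like this; the values are illustrative examples, not our actual settings:

    # Presto 0.111/0.112 style (now deprecated): per-task limit
    task.max-memory=2GB

    # Later releases: query-level limits (example values only)
    query.max-memory=20GB
    query.max-memory-per-node=2GB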


If blocking is not the cause of our problem, is there any configuration change we could make in later versions of Presto to alleviate the OutOfMemory problems we see when performing CROSS JOIN UNNEST operations?


Thank you very much,

Tom

Martin Traverso

May 28, 2016, 12:45:55 PM
to Presto
You can try tweaking the query.max-memory and query.max-memory-per-node settings. Take a look at this document for more details: https://prestodb.io/docs/current/installation/deployment.html 

In general, nodes shouldn't crash with a Java OutOfMemoryError. This is an indication that the nodes are not configured correctly (there's less memory in the system than the server was configured with), or there's a bug in how Presto is accounting for memory for certain operations. The fact that it used to work could mean there's a regression starting with that version, too.
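
As a rough illustration of what a consistent configuration looks like (the numbers here are examples only, not a recommendation for any particular cluster), the JVM heap in etc/jvm.config has to be large enough to cover the per-node query limit in etc/config.properties plus headroom for system memory:

    # etc/jvm.config (example heap size; size it to the instance type)
    -Xmx16G

    # etc/config.properties (example values; must fit within the heap above)
    query.max-memory=30GB
    query.max-memory-per-node=8GB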

If you can't get it to work with those settings, let us know and we'll help you investigate further.

Martin

Thomas Rieck

Jun 1, 2016, 9:14:49 AM
to Presto
Following your suggestion, we tweaked the query.max-memory and query.max-memory-per-node parameters and no longer see the OutOfMemory errors. However, we are now seeing “PrestoException: Could not communicate with the remote task. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes.” After retrying the query multiple times to confirm this was not a transient issue, we looked in the server log for the stack trace of the exception:

ERROR   remote-task-callback-33 com.facebook.presto.execution.StageStateMachine Stage 20160531_193313_00004_hquq6.1 failed
com.facebook.presto.spi.PrestoException: Could not communicate with the remote task. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes.
        at com.facebook.presto.server.HttpRemoteTask.updateTaskInfo(HttpRemoteTask.java:398)
        at com.facebook.presto.server.HttpRemoteTask.access$700(HttpRemoteTask.java:114)
        at com.facebook.presto.server.HttpRemoteTask$ContinuousTaskInfoFetcher.success(HttpRemoteTask.java:779)
        at com.facebook.presto.server.HttpRemoteTask$ContinuousTaskInfoFetcher.success(HttpRemoteTask.java:700)
        at com.facebook.presto.server.HttpRemoteTask$SimpleHttpResponseHandler.onSuccess(HttpRemoteTask.java:856)
        at com.facebook.presto.server.HttpRemoteTask$SimpleHttpResponseHandler.onSuccess(HttpRemoteTask.java:838)
        at com.google.common.util.concurrent.Futures$6.run(Futures.java:1319)
        at io.airlift.concurrent.BoundedExecutor.executeOrMerge(BoundedExecutor.java:69)
        at io.airlift.concurrent.BoundedExecutor.access$000(BoundedExecutor.java:28)
        at io.airlift.concurrent.BoundedExecutor$1.run(BoundedExecutor.java:40)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


Investigating further, we found that the following warning appeared repeatedly in the log before each occurrence of the error:

WARN    ContinuousTaskInfoFetcher-20160531_193313_00004_hquq6.2.3-548   com.facebook.presto.server.HttpRemoteTask       Error getting info for task 20160531_193313_00004_hquq6.2.3: Server refused connection: http://10.0.40.143:8889/v1/task/20160531_193313_00004_hquq6.2.3?summarize: http://10.0.40.143:8889/v1/task/20160531_193313_00004_hquq6.2.3

This refused connection does seem to be the underlying issue that manifests as the PrestoException, but we are unsure how to proceed with resolving it. Can you provide any additional insight into why the server might be refusing connections for a task?

Thank you for your support on this problem.