Query exceeded local memory limit


Ankur Goel

Aug 19, 2016, 4:37:49 PM
to Presto
Hi Guys,
         I am running into this issue again, but this time I have more info to share.

I am using a SQL window function (PARTITION BY col1, col2 ORDER BY col3)
on 30GB / 25M rows of data, and one of the nodes executing this particular
plan fragment fails with the error.

I have 25 workers running Presto 0.147

Looking closer at the plan I see that 8 buckets were chosen to partition the data,
which is clearly insufficient in this case, and searching the PrestoDB group led me
to the suggestion of tweaking

query.initial-hash-partitions

1. Is this property still being used?

2. Can I set this property in the query session?

3. How do we control partitioning/parallelism for different
    parts of a query plan (like in Pig/Hive/MR)?

4. Is the option of spill-to-disk available in 0.147?

For BIG queries involving WINDOW functions or COUNT(DISTINCT),
spill-to-disk / over-partitioning may be a better choice than failing the
query with 'Exceeded local memory limit'.  My 2 cents :-)

Thanks
-Ankur

Kamil Bajda-Pawlikowski

Aug 19, 2016, 4:49:24 PM
to Presto
Hi Ankur,

1. Yes
2. Yes (hash_partition_count)
3. See task.concurrency property and some tuning advice at http://teradata.github.io/presto/docs/148t/admin/tuning.html
4. No

Our team at Teradata is currently working on spill to disk. Aggregations and joins are first, then we will work on additional operators. No firm date yet.

What is your query.max-memory-per-node value? How many nodes do you have? 
You should be able to just add nodes and increase the hash partition count to get your query working in the meantime.
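
For example, from the client that could look like this (a sketch; the exact value is your call and should roughly match the number of workers):

SET SESSION hash_partition_count = 25;
-- then re-run the window-function query in the same session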

Thanks,
Kamil

Ankur Goel

Aug 19, 2016, 5:52:37 PM
to Presto

Thanks for your response Kamil,

Here are my Presto worker JVM settings
-server
-Xmx12G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p

query.max-memory-per-node = 6G
Number of nodes = 25 (workers) 1 (coordinator)

So I tried setting the session property hash_partition_count=17
and did 3 runs, all of which failed with the stack trace below.

The query made more progress than last time (hash_partition_count=8)
but failed every time with this:

com.facebook.presto.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://host:8080/v1/task/20160819_212743_00167_b5rbk.3.7/results/20160819_212743_00167_b5rbk.2.4/1 - requests failed for 61.49s)
	at com.facebook.presto.operator.HttpPageBufferClient$1.onFailure(HttpPageBufferClient.java:375)
	at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
	at com.google.common.base.Throwables.propagate(Throwables.java:160)
	at io.airlift.http.client.ResponseHandlerUtils.propagate(ResponseHandlerUtils.java:22)
	at com.facebook.presto.operator.HttpPageBufferClient$PageResponseHandler.handleException(HttpPageBufferClient.java:540)




I have seen Jetty throw this error when network bandwidth is saturated.

Any idea how to prevent this?

Should we try forcing local split scheduling?

Can node-scheduler.network-topology be set in a query session?

Thanks
-Ankur

Kamil Bajda-Pawlikowski

Aug 19, 2016, 5:56:20 PM
to presto...@googlegroups.com

Please try making hash_partition_count=25 so that all workers participate.

If this doesn't help, try reducing the value of initial_splits_per_node by 2 (to display the current value, please run "show session;").
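
As a rough sketch (the number to set depends on what "show session;" currently reports; the value 4 below is just an assumed example):

show session;   -- check the current initial_splits_per_node value
-- if it currently shows 4, for example, then:
SET SESSION initial_splits_per_node = 2;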

Thanks,
Kamil



Ankur Goel

Aug 19, 2016, 9:07:08 PM
to Presto
I did set hash_partition_count=24 (no. of workers)
and reduced initial_splits_per_node by 6,
and the query now finishes successfully.

Thank you for your help! 
Much appreciated :-)

The query performs poorly, though; it takes
13 minutes to process 44 GB of data.

Let me read the Teradata tuning page to see
what more I can find.

Thanks again
-Ankur

Kamil Bajda-Pawlikowski

Aug 21, 2016, 9:56:39 PM
to presto...@googlegroups.com
OK, great. Let's look a bit deeper...
Could you please share your query & explain plan?




Ankur Goel

Aug 22, 2016, 10:13:31 PM
to Presto
Here is the query


SELECT approx_distinct(col5) uniq_col_5, 
       approx_distinct(col6) uniq_col6,
       approx_distinct(col7) uniq_col7,
       sum(total_col_4) total_col_4,
       sum(if(col77 <> '-1' AND (cast(date_diff('SECOND', cast(col77 AS TIMESTAMP),cast(col40 AS TIMESTAMP)) AS bigint) < 2), total_col_4, 0)) total_dupe_col_4,
       sum(total_col_4) - sum(if(col77 <> '-1' AND (cast(date_diff('SECOND', cast(col77 AS TIMESTAMP),cast(col40 AS TIMESTAMP)) AS bigint) < 2), total_col_4, 0)) total_deduped_col_4
FROM
  ( SELECT substr(col40,1,10) dt ,
      col5, 
      col6,
      case when cardinality(mylist)>0 then mylist[1].col7 else null end col7,
      case when col11  = 'string1' then 1 else 0 end total_col_4,
      col40,
      COALESCE(lead(col40) over (partition BY col5,col6
                                                   ORDER BY col40 DESC),'-1') col77
   FROM schema.table_old
   WHERE substr(col22.subcol1,10,4) = 'string22'
     AND dt >= '2016-06-08'
     AND col22.subcol2 IN ('string2', 'string3')
     AND col33.isTest = FALSE
     AND NOT CONTAINS (col44,'U13')
     AND NOT CONTAINS (col44,'T13')
   UNION ALL
   SELECT substr(col40,1,10) dt ,
      col5,
      col6,
      case when cardinality(mylist)>0 then mylist[1].col7 else null end col7,
      case when col11  = 'string1' then 1 else 0 end total_col_4,
      col40,
      COALESCE(lead(col40) over (partition BY col5,col6
                                                   ORDER BY col40 DESC),'-1') col77
   FROM schema.table
   WHERE substr(col22.subcol1,10,4) = 'string22'
     AND dt <= '2016-08-17'
     AND col22.subcol2 IN ('string2', 'string3')
     AND col33.isTest = FALSE
     AND NOT CONTAINS (col44,'U13')
     AND NOT CONTAINS (col44,'T13')) a

Kamil Bajda-Pawlikowski

Aug 23, 2016, 1:20:35 PM
to Presto
I suggest looking at EXPLAIN and EXPLAIN ANALYZE. The Presto UI also displays a nicely visualized query plan with stats that are worth analyzing.
If you are comfortable posting the results here, someone can probably take a look.
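
For reference, both are just prefixes on the statement. A minimal sketch (count(*) is only a stand-in; substitute the full query from your previous message):

EXPLAIN SELECT count(*) FROM schema.table;          -- prints the plan without running the query
EXPLAIN ANALYZE SELECT count(*) FROM schema.table;  -- runs the query and adds per-operator stats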

One performance debugging technique is to materialize certain stages of the original query on disk (in your case, the result of the UNION ALL) and measure the time of the broken-down queries to identify which one is the bottleneck. After that, you can also break down the UNION ALL itself and see what you get.
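
A rough sketch of that approach, using one branch of the UNION ALL (tmp_schema is a placeholder writable schema, not something from your environment; adapt the column list to the real subquery):

CREATE TABLE tmp_schema.stage1 AS
SELECT col5,
       col6,
       col40,
       COALESCE(lead(col40) OVER (PARTITION BY col5, col6 ORDER BY col40 DESC), '-1') AS col77
FROM schema.table
WHERE dt <= '2016-08-17';

-- then run the outer aggregation (or a simple probe like this) against the materialized table and time it:
SELECT count(*) FROM tmp_schema.stage1;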

At the cluster level, are you observing high CPU utilization? If not, please consider increasing the session property "task_concurrency" (defaults to 1) to maybe 4 or 8, depending on the number of CPU cores per node.
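
In a session that would be, for example (a sketch; 4 assumes each node has at least that many cores):

SET SESSION task_concurrency = 4;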

Thanks,
Kamil