Crashing nodes under load


Sam Guleff

Sep 15, 2017, 6:47:20 PM
to Presto
Hi All,

We're looking for some help from the community, as we've recently seen queries failing due to nodes crashing. It's predictable in that a specific set of queries can reliably trigger it. Any advice on optimizing our settings or VM sizes would be greatly appreciated.

We operate a 25-node Presto 0.179 cluster in the Azure cloud.

VM sizes are 8 cores / 110 GB of memory each.

The cluster is currently used for ad-hoc queries against external Hive tables (Parquet and some ORC), and we have recently started to see nodes crashing with the following error:

com.facebook.presto.spi.PrestoException: Could not communicate with the remote task. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (10.115.1.62:8080)
	at com.facebook.presto.server.remotetask.ContinuousTaskStatusFetcher.updateTaskStatus(ContinuousTaskStatusFetcher.java:239)
	at com.facebook.presto.server.remotetask.ContinuousTaskStatusFetcher.success(ContinuousTaskStatusFetcher.java:170)
	at com.facebook.presto.server.remotetask.ContinuousTaskStatusFetcher.success(ContinuousTaskStatusFetcher.java:53)
	at com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.onSuccess(SimpleHttpResponseHandler.java:49)
	at com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.onSuccess(SimpleHttpResponseHandler.java:27)
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1135)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Our usage patterns vary:

Table sizes range from about 1 TB up to roughly 100 TB (some queries are not filtered by our partitioning scheme).
Max concurrent queries: typically 15-30, often with none running, and periodic peaks of around 70 queries.
* Note: I've seen 10 concurrent queries against a relatively small (500 GB) table cause a crash.


Our conf settings are:

conf.properties:
query.max-memory-per-node=50GB
query.max-memory=950GB

jvm.conf:
-server
-Xmx100G
-Xms100G
-XX:+UseG1GC
-XX:-UseBiasedLocking
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-XX:ReservedCodeCacheSize=512M
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=xxxx
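
For context, a rough back-of-the-envelope on the memory settings above (assuming all 25 nodes run as workers):

25 workers x 50 GB (query.max-memory-per-node) = 1250 GB of potential per-query memory cluster-wide
query.max-memory = 950 GB, i.e. below that ceiling
50 GB per node is half of the 100 GB JVM heap (-Xmx100G)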


hive.properties:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://x.x.x.x:xxxx,thrift://x.x.x.x:xxxx
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
hive.allow-drop-table=true
hive.storage-format=PARQUET
hive.immutable-partitions=true
hive.metastore-cache-ttl=5m
hive.metastore-refresh-interval=10s
hive.parquet.use-column-names=true

/etc/security/limits.conf:
*         soft    nproc       1024000
*         hard    nproc       1024000
*         hard    nofile      1024000
*         soft    nofile      1024000

Sky Yin

Sep 15, 2017, 7:10:38 PM
to presto...@googlegroups.com
One thing we did to solve this problem was lowering task.concurrency from the default of 16 to 8 in conf.properties. Hope it helps.
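
If it helps, the change was just one extra line in the same conf.properties shown above (everything else untouched):

# lowered from the default of 16
task.concurrency=8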


Prabakaran Hadoop

Oct 29, 2019, 10:29:31 PM
to Presto
Hi, I am also facing the same issue. Can anybody explain the reason for it?