Increase the parallelism / number of executors.


Ram Viswanadha

Nov 13, 2015, 8:35:46 PM
to Google Cloud Dataproc Discussions
Hi,
I was successful in running a recommender on Dataproc, but it took 8 hours to execute. I noticed that only 2 executors and 2 tasks were running at any given point in time. Is there any way to increase the level of parallelism on the cluster? I tried setting the spark.default.parallelism value, since it forces the level of parallelism and turns off dynamic allocation, but that did not have any effect.

I have 3 worker nodes with 104 GB of memory and 96 virtual cores in total. Neither the CPU nor the memory is fully utilized.

TIA!

Executors (2)
  • Memory: 998.3 MB Used (32.6 GB Total)
  • Disk: 0.0 B Used
Executor ID | Address                                       | RDD Blocks | Storage Memory     | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Input | Shuffle Read | Shuffle Write
2           | rc-spark-poc-w-2.c.<blah>-data.internal:47963 | 2          | 998.3 MB / 19.3 GB | 0.0 B     | 2            | 0            | 2              | 4           | 4.1 m     | 0.0 B | 0.0 B        | 3.5 MB
driver      | 10.240.0.2:59818                              | 0          | 0.0 B / 13.3 GB    | 0.0 B     | 0            | 0            | 0              | 0           | 0 ms      | 0.0 B | 0.0 B        | 0.0 B
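[Editor's note] A common Spark tuning rule of thumb is to aim for roughly 2-3 tasks per CPU core when choosing spark.default.parallelism. A minimal sketch of that arithmetic, using the core count reported in this thread (the multiplier itself is an assumption, not a Dataproc recommendation):

```python
# Rule-of-thumb sizing for spark.default.parallelism: roughly 2-3
# tasks per vCPU so every core stays busy.
# TOTAL_VCORES comes from this thread; TASKS_PER_CORE is an assumed
# rule-of-thumb value (2-3 is typical), not a measured figure.

TOTAL_VCORES = 96      # 3 worker nodes, 96 virtual cores in total
TASKS_PER_CORE = 2     # assumed multiplier

suggested_parallelism = TOTAL_VCORES * TASKS_PER_CORE
print(suggested_parallelism)  # 192
```

Note that spark.default.parallelism only sets the default partition count for shuffles and parallelize(); RDDs read from an input source keep the partitioning of that source, so an explicit rdd.repartition(n) may be needed to spread narrow input data across the cluster.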

Ram Viswanadha

Nov 13, 2015, 8:42:20 PM
to Google Cloud Dataproc Discussions

Here is more information. Although I set spark.default.parallelism to 96 (as recommended in the docs), setting this value appears to have no effect while the spark.dynamicAllocation.enabled property is true.

Spark Properties

spark.akka.frameSize = 512
spark.app.id = application_1446129053095_0020
spark.app.name = BigQueryRecommenderTest
spark.default.parallelism = 96
spark.driver.appUIAddress = http://10.240.0.2:4040
spark.driver.extraJavaOptions = -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
spark.driver.host = 10.240.0.2
spark.driver.maxResultSize = 13120m
spark.driver.memory = 26241m
spark.driver.port = 39578
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.initialExecutors = 100000
spark.dynamicAllocation.maxExecutors = 100000
spark.dynamicAllocation.minExecutors = 1
spark.eventLog.dir = file:///var/log/spark/events
spark.eventLog.enabled = true
spark.executor.cores = 8
spark.executor.extraJavaOptions = -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
spark.executor.id = driver
spark.executor.memory = 38281m
spark.externalBlockStore.folderName = spark-df1835a7-21cc-4099-b2c8-f778a3ea8bad
spark.fileserver.uri = http://10.240.0.2:45850
spark.history.fs.logDirectory = file:///var/log/spark/events
spark.jars = file:/tmp/a96f2f8d-3775-432f-a773-2b33548efea5/rc-bq-spark-poc-1.0-SNAPSHOT.jar,file:/tmp/a96f2f8d-3775-432f-a773-2b33548efea5/jedis-2.7.2.jar,file:/tmp/a96f2f8d-3775-432f-a773-2b33548efea5/commons-pool2-2.4.2.jar,file:/usr/lib/spark/lib/spark-assembly.jar
spark.master = yarn-client
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS = rc-spark-poc-m
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES = http://rc-spark-poc-m:8088/proxy/application_1446129053095_0020
spark.scheduler.minRegisteredResourcesRatio = 0.0
spark.scheduler.mode = FIFO
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled = true
spark.submit.deployMode = client
spark.ui.filters = org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.yarn.am.memory = 38281m
spark.yarn.application.tags = dataproc_job_a96f2f8d-3775-432f-a773-2b33548efea5
spark.yarn.executor.memoryOverhead = 2679
spark.yarn.historyServer.address = rc-spark-poc-m:18080
spark.yarn.tags = dataproc_job_a96f2f8d-3775-432f-a773-2b33548efea5
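[Editor's note] One thing worth checking in these properties: each executor container asks YARN for spark.executor.memory plus spark.yarn.executor.memoryOverhead, i.e. roughly 40 GB here. A back-of-the-envelope sketch of how many such containers the cluster could grant, assuming the 104 GB reported earlier is the total memory available to YARN (actual NodeManager limits may differ):

```python
# How many executor containers can YARN grant at these settings?
# Container request = spark.executor.memory + spark.yarn.executor.memoryOverhead.
# ASSUMPTION: total_yarn_mb treats the 104 GB from the first post as
# cluster-wide YARN memory; real per-node limits may be lower.

executor_memory_mb = 38281   # spark.executor.memory (from the table above)
memory_overhead_mb = 2679    # spark.yarn.executor.memoryOverhead
container_mb = executor_memory_mb + memory_overhead_mb  # 40960 MB per executor

total_yarn_mb = 104 * 1024   # assumed cluster-wide YARN memory

max_executors = total_yarn_mb // container_mb
print(container_mb, max_executors)  # 40960 2
```

Under that assumption only about two such containers fit, which would line up with the two executors seen in the UI; requesting smaller executors would let YARN pack more containers, and hence more parallel tasks, onto the same nodes.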

Yogesh Nath

Mar 16, 2016, 7:27:43 PM
to Google Cloud Dataproc Discussions
Were you able to increase the level of parallelism using this setting?

Dennis Huo

Mar 16, 2016, 8:08:13 PM
to Google Cloud Dataproc Discussions