Increase the parallelism / number of executors.

Ram Viswanadha

unread,

Nov 13, 2015, 8:35:46 PM11/13/15

to Google Cloud Dataproc Discussions

Hi,

I was successful in running a recommender on DataProc but it took 8 hrs for executing. I noticed that there were only 2 executors and 2 tasks running at any given point of time. Is there any way to increase the level of parallelism on the cluster? I tried setting spark.default.parallelism value since it forces the level of parallelism and turns off the dynamic allocation, but that did not have any affect.

I have 3 worker nodes with 104GB of memory and 96 virtual cores in total. Neither CPU or the Memory are fully utilized.

TIA!

Executors (2)

Memory: 998.3 MB Used (32.6 GB Total)
Disk: 0.0 B Used

Executor ID	Address	RDD Blocks	Storage Memory	Disk Used	Active Tasks	Failed Tasks	Complete Tasks	Total Tasks	Task Time	Input	Shuffle Read	Shuffle Write	Logs	Thread Dump
2	rc-spark-poc-w-2.c.<blah>-data.internal:47963	2	998.3 MB / 19.3 GB	0.0 B	2	0	2	4	4.1 m	0.0 B	0.0 B	3.5 MB	stdout stderr	Thread Dump
driver	10.240.0.2:59818	0	0.0 B / 13.3 GB	0.0 B	0	0	0	0	0 ms	0.0 B	0.0 B	0.0 B		Thread Dump

Ram Viswanadha

unread,

Nov 13, 2015, 8:42:20 PM11/13/15

to Google Cloud Dataproc Discussions

Here is more information. Although I set the spark.default.parallelism to 96 (as recommended in the docs), it appears that the setting of this value does not have any effect on spark.dynamicAllocation.enabled property

Spark Properties

Name	Value
spark.akka.frameSize	512
spark.app.id	application_1446129053095_0020
spark.app.name	BigQueryRecommenderTest
spark.default.parallelism	96
spark.driver.appUIAddress	http://10.240.0.2:4040
spark.driver.extraJavaOptions	-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
spark.driver.host	10.240.0.2
spark.driver.maxResultSize	13120m
spark.driver.memory	26241m
spark.driver.port	39578
spark.dynamicAllocation.enabled	true
spark.dynamicAllocation.initialExecutors	100000
spark.dynamicAllocation.maxExecutors	100000
spark.dynamicAllocation.minExecutors	1
spark.eventLog.dir	file:///var/log/spark/events
spark.eventLog.enabled	true
spark.executor.cores	8
spark.executor.extraJavaOptions	-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
spark.executor.id	driver
spark.executor.memory	38281m
spark.externalBlockStore.folderName	spark-df1835a7-21cc-4099-b2c8-f778a3ea8bad
spark.fileserver.uri	http://10.240.0.2:45850
spark.history.fs.logDirectory	file:///var/log/spark/events
spark.jars	file:/tmp/a96f2f8d-3775-432f-a773-2b33548efea5/rc-bq-spark-poc-1.0-SNAPSHOT.jar,file:/tmp/a96f2f8d-3775-432f-a773-2b33548efea5/jedis-2.7.2.jar,file:/tmp/a96f2f8d-3775-432f-a773-2b33548efea5/commons-pool2-2.4.2.jar,file:/usr/lib/spark/lib/spark-assembly.jar
spark.master	yarn-client
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS	rc-spark-poc-m
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES	http://rc-spark-poc-m:8088/proxy/application_1446129053095_0020
spark.scheduler.minRegisteredResourcesRatio	0.0
spark.scheduler.mode	FIFO
spark.serializer	org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled	true
spark.submit.deployMode	client
spark.ui.filters	org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.yarn.am.memory	38281m
spark.yarn.application.tags	dataproc_job_a96f2f8d-3775-432f-a773-2b33548efea5
spark.yarn.executor.memoryOverhead	2679
spark.yarn.historyServer.address	rc-spark-poc-m:18080
spark.yarn.tags	dataproc_job_a96f2f8d-3775-432f-a773-2b33548efea5

Yogesh Nath

unread,

Mar 16, 2016, 7:27:43 PM3/16/16

to Google Cloud Dataproc Discussions

were you able to increase level of parallelism using this setting?

Dennis Huo

unread,

Mar 16, 2016, 8:08:13 PM3/16/16

to Google Cloud Dataproc Discussions

I believe we followed up on this StackOverflow question instead: http://stackoverflow.com/questions/33713805/extremely-slow-processing-on-dataproc-9-hours-vs-3-mins-on-local-machine

The key was to use RDD.repartition: https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#repartition(int,%20scala.math.Ordering)

Reply all

Reply to author

Forward