number of threads on worker nodes

pto...@gmail.com

Aug 21, 2013, 11:42:32 AM
to spark...@googlegroups.com

Is there a way to specify the number of threads that run on the worker nodes?

I did notice that Spark workers process data using multiple threads when the data size grows, and it became evident after I performed the experiment below.

Let me explain what I did:
I had a three-node cluster and some data that I intended to process on Spark. I created an RDD and, in the map function (the call method of the class), printed out the rows of data to a file named <blah-blah>.<thread-id> (I suffixed the thread ID using the standard Thread.getId method). I did this just to find out whether things were being processed in parallel at the worker-node level. With this, I noticed that beyond a certain number of rows of data, I started seeing more thread IDs (and eventually more files being generated on the nodes). Note that I was appending data to the files.
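A minimal sketch of that kind of probe, assuming the Java API; the master URL, input path, and file prefix below are placeholders:

    // needs: org.apache.spark.api.java.*, org.apache.spark.api.java.function.Function, java.io.FileWriter
    JavaSparkContext sc = new JavaSparkContext("spark://master:7077", "thread-probe");
    JavaRDD<String> rows = sc.textFile("hdfs:///some/input/path");
    rows.map(new Function<String, String>() {
        public String call(String row) throws Exception {
            long tid = Thread.currentThread().getId();
            // append each row to a per-thread file, e.g. /tmp/probe.<thread-id>
            FileWriter out = new FileWriter("/tmp/probe." + tid, true);
            out.write(row + "\n");
            out.close();
            return row;
        }
    }).count();   // count() just forces the map to execute

Counting the distinct /tmp/probe.* files on each node shows how many threads actually touched the data.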
What I wanted to know is whether I can force this behavior using some API, i.e. request Spark to spawn N threads per worker.

Please advise.

Patrick Wendell

Aug 21, 2013, 12:46:09 PM
to spark...@googlegroups.com
You can set SPARK_WORKER_CORES when launching your cluster. 


In general you'll get the best performance by setting this to at least the number of actual cores on the slave node.
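For example, in conf/spark-env.sh on each worker node (the core count below is just a placeholder; restart the workers after changing it):

    # conf/spark-env.sh
    export SPARK_WORKER_CORES=8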


Pranay Tonpay

Aug 22, 2013, 12:06:11 AM
to spark...@googlegroups.com, pwen...@gmail.com
But then if I am running that over a Mesos cluster, that option won't work (?)... Also, when I tried this on the Spark cluster itself, I didn't see that many threads being spawned (after I set this variable) to achieve more parallelism... I really need more parallelism for two reasons:
a) Each of the operations being run in the call method is a bit slow, so my throughput goes down and things just crawl.
b) My machines are very high end (almost 32 cores) and they are dedicated to me, so I am allowed to bump up the parallelism factor to match that configuration.

Is there a way I can control this?

Please advise.

Thanks,
Pranay


Patrick Wendell

Aug 22, 2013, 12:34:46 AM
to Pranay Tonpay, spark...@googlegroups.com
If you are using Mesos then you need to tell Mesos that your machine has more cores; consult the Mesos lists for help doing this. Spark should greedily acquire as many cores as Mesos offers it. However, one thing to note is that Spark will only use one core for each task, so maybe that's the issue.

When running the Spark standalone scheduler, consult the UI to make sure it's seeing the full number of cores you have. That will be exactly the number of worker threads the executor will spawn.
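Concretely, since each task uses one core, the number of partitions in an RDD caps how many cores can be busy at once. A rough Java sketch of raising that cap (the path and partition counts are placeholders):

    // Ask for at least as many partitions as there are cores in the cluster,
    // so there are enough tasks to keep every core busy.
    JavaRDD<String> rows = sc.textFile("hdfs:///some/input/path", 96);

    // Or spread an existing RDD over more partitions before the slow map:
    JavaRDD<String> spread = rows.repartition(96);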

- Patrick

Pranay Tonpay

Aug 22, 2013, 1:02:01 AM
to Patrick Wendell, spark...@googlegroups.com
Thanks Patrick... you are right, I don't see all the cores available in the UI when I try with a smaller data set (100,000 rows), but when I bump it up to 25 million, I get to see all the cores.