Is there a way to specify the number of threads that run on the worker nodes?
I noticed that Spark workers process data using multiple threads once the data size grows, and this became evident after the experiment below.

Let me explain what I did:
I have a three-node cluster and some data that I intended to process on Spark. I created an RDD, and in the map function (the call function of the class) I printed the rows of data to a file named <blah-blah>.<thread-id>, suffixing the thread ID obtained via the standard Thread.getId() method. I did this just to find out whether things were being processed in parallel at the worker-node level. With this, I noticed that beyond a certain number of rows of data, more thread IDs started appearing (and consequently more files were generated on the nodes). Note that I was appending data to the files.
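For concreteness, here is a minimal sketch of the probe I described. The class name RowLogger, the /tmp/rows.<id> output path, and the input path are placeholders, not my actual names:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import java.io.FileWriter;
import java.io.IOException;

public class RowLogger implements Function<String, String> {
    @Override
    public String call(String row) throws IOException {
        // Suffix the output file with the executing thread's ID.
        long threadId = Thread.currentThread().getId();
        // Append each row to a per-thread file on the worker's local disk
        // (opening in append mode each time, as in my experiment).
        try (FileWriter out = new FileWriter("/tmp/rows." + threadId, true)) {
            out.write(row + "\n");
        }
        return row;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("thread-probe");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> rows = sc.textFile("hdfs:///path/to/input"); // placeholder path
        // count() forces evaluation so the side-effecting map actually runs.
        rows.map(new RowLogger()).count();
        sc.stop();
    }
}
```

Counting the distinct /tmp/rows.<id> files on each node is how I inferred how many threads had processed data there.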
What I wanted to know is whether I can force this behavior through some API, i.e., request Spark to spawn N threads per worker.
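To make the question concrete, the closest knobs I have found are the local[N] master and spark.executor.cores, but as far as I can tell these bound the number of concurrent tasks (one thread each) rather than letting me request N threads per worker directly:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelismConfig {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("parallelism-probe")
                // local[4]: run with exactly 4 worker threads in a single JVM,
                // but this only applies to local mode, not a real cluster.
                .setMaster("local[4]")
                // On a cluster, spark.executor.cores caps the number of
                // concurrent tasks per executor; each task runs on one thread.
                .set("spark.executor.cores", "4");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
    }
}
```

Is there something like this (or beyond this) that controls the thread count per worker explicitly?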
Please advise.