Worker role frequently "stabilizing" or "unresponsive"


kindohm

Jun 16, 2011, 1:44:55 PM
to Twister4Azure
I'm running a custom MapReduce operation which functions very similarly
to WordCount. I have about 225 input files that are about 4.7 MB each.
After getting about 20% through, the worker role stopped processing
and in the management portal it says that the deployment was
"transitioning" and that the worker role was being "stabilized".
Eventually the worker role started processing again for a short period
of time, but went back to "stabilizing" a few times. About 25% of the
way through the job, the worker role status was "Unresponsive" (in bold,
red text). Eventually the worker role came back online, but it keeps
going back to either a "stabilizing" or "unresponsive" state.

I am using one worker role instance, and one thread per instance.

I've stopped the deployment completely and will try the basic
WordCount job.

Any thoughts on what the issue might be?

-Mike

Thilina Gunarathne

Jun 16, 2011, 2:04:20 PM
to twiste...@googlegroups.com
Hi Mike,
I'm wondering whether you are running a memory-intensive job or a job that outputs a large number of key-value pairs. Azure small instances are notoriously bad at memory-intensive jobs. Currently, Twister4Azure reducers load all the data into memory for sorting, and that can also give rise to memory issues. One solution is to try multiple instances with multiple reducers, or to try larger instance types.

I'm planning to do some detailed benchmarks of memory behavior, probably in early July, and to address any memory-related issues.

thanks,
Thilina   


--
http://salsahpc.indiana.edu/twister4azure/
You received this message because you are subscribed to the Google Groups "Twister4Azure" group.
To unsubscribe from this group, send email to twister4azur...@googlegroups.com



--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina

kindohm

Jun 16, 2011, 2:11:09 PM
to Twister4Azure
I'm not sure what a "large number" is in this case :) My specific job
is generating thousands of keys, each with thousands of values. My
number of keys or values per key should not exceed 10,000 in my job,
and the total number of key value pairs may approach but should not
exceed 100,000,000. Does that seem large? :)

I've successfully re-run my job with less than 5% of the original data
(10 input files rather than 225).

In all likelihood my data size is just too large for a single, small
instance.

I'm currently spinning up a second worker role instance... I'll keep
tinkering with this.

-Mike

Thilina Gunarathne

Jun 16, 2011, 2:58:55 PM
to twiste...@googlegroups.com
My guess is that the reducer is the bottleneck. Azure small instances have only 1.7 GB of memory, and if we set aside ~1 GB for the OS, that leaves less than 1 GB for our applications. Since the reducers load everything into memory, I would imagine 100 million key-value pairs is a little bit too much for one reducer to handle :).
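
To put rough numbers on that, here is a back-of-envelope sketch in Python. The 32 bytes per key-value pair is purely an assumed figure for illustration (the real in-memory footprint in Twister4Azure has not been measured here), and the function names are made up for this example.

```python
import math

def reducer_memory_estimate(num_pairs, bytes_per_pair, instance_gb=1.7, os_gb=1.0):
    """Return (required_gb, available_gb, fits) for one in-memory reducer.

    bytes_per_pair is an assumption supplied by the caller, not a measured value.
    """
    available_gb = instance_gb - os_gb          # what is left after the OS
    required_gb = num_pairs * bytes_per_pair / 1e9
    return required_gb, available_gb, required_gb <= available_gb

def min_reduce_tasks(required_gb, available_gb):
    """Smallest number of equal-sized reduce tasks whose share fits in memory."""
    return math.ceil(required_gb / available_gb)

required, available, fits = reducer_memory_estimate(100_000_000, bytes_per_pair=32)
print(f"need ~{required:.1f} GB, have ~{available:.1f} GB, fits: {fits}")
print("reduce tasks needed:", min_reduce_tasks(required, available))
```

Even with a fairly small per-pair size, the estimate lands well above what a small instance has available, which is consistent with the symptoms described.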

One possibility is to have the reducers sort directly from disk to disk rather than in memory, but unfortunately this will incur a performance hit. Maybe we should provide multiple options for the shuffle/sort phases.
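
For reference, one common way to sort from disk to disk is an external merge sort. The Python sketch below is only an illustration of that general technique, not how Twister4Azure does or will implement it. It assumes the reducer input is a text file with one record per line, and `external_sort` and `max_lines_in_memory` are hypothetical names.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, max_lines_in_memory=100_000):
    """Sort the lines of input_path into output_path using bounded memory."""
    run_paths = []
    try:
        # Phase 1: read fixed-size chunks, sort each in memory, spill to disk.
        with open(input_path) as src:
            while True:
                chunk = list(itertools.islice(src, max_lines_in_memory))
                if not chunk:
                    break
                # Make sure every line ends with a newline so runs merge cleanly.
                chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".run")
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        # Phase 2: k-way merge of the sorted runs, streaming one line per run.
        runs = [open(p) for p in run_paths]
        try:
            with open(output_path, "w") as out:
                out.writelines(heapq.merge(*runs))
        finally:
            for r in runs:
                r.close()
    finally:
        # Clean up the temporary run files.
        for p in run_paths:
            os.remove(p)
```

Peak memory is bounded by the chunk size rather than the total input, at the cost of writing and re-reading every record once, which is the performance hit mentioned above.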

thanks,
Thilina

kindohm

Jun 16, 2011, 3:02:15 PM
to Twister4Azure
Thanks for the feedback.

I added another reducer task to my smaller job size and the total time
was reduced by almost half (not too surprising). If I get time I may
try my larger test with 1 GB of input with more reducer tasks added.
Should the number of reduce tasks equal the number of worker instances
(assuming one thread per instance)?

-Mike

Thilina Gunarathne

Jun 16, 2011, 3:12:41 PM
to twiste...@googlegroups.com
> Should the number of reduce tasks equal the number of worker instances
> (assuming one thread per instance)?
Not necessarily. Currently you can only configure the number of reduce workers per instance, so the minimum you can have is reduce workers == number of instances. If you are using larger instances, it's possible to have more map workers than reduce workers (e.g., extra large with 8 map workers and 1 reduce worker).

The number of reduce tasks per job depends on the nature of the job. If it's reducer-heavy (which looks like the case with your job), then it's better to have more reducers. Actually, I forgot to mention earlier that you can have more than one reduce task while having only one reduce worker. The job execution time will be longer, as the reduce tasks will execute one after another, but the memory load will be smaller.
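
To illustrate why several reduce tasks on a single reduce worker lower the memory load, here is a small Python sketch. It assumes keys are assigned to reduce tasks by a hash partitioner, which is a common MapReduce convention and not confirmed Twister4Azure behavior. All names in it are hypothetical.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_reduce_tasks):
    """Map a key to a reduce task with a stable hash (assumed partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def run_reduce_tasks_sequentially(read_pairs, num_reduce_tasks, reduce_fn):
    """Run the reduce tasks one after another on a single worker.

    read_pairs is a callable that returns a fresh iterator over (key, value)
    pairs, standing in for re-reading the map output from storage.
    """
    results = {}
    for task_id in range(num_reduce_tasks):
        # Only this task's partition is held in memory at any one time.
        groups = defaultdict(list)
        for key, value in read_pairs():
            if partition_for(key, num_reduce_tasks) == task_id:
                groups[key].append(value)
        for key in sorted(groups):
            results[key] = reduce_fn(key, groups[key])
    return results

# WordCount-style usage: sum the values for each key.
pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
counts = run_reduce_tasks_sequentially(lambda: iter(pairs), 3, lambda k, vs: sum(vs))
print(counts)
```

With N reduce tasks, each task holds roughly 1/N of the pairs if the keys are evenly distributed, so peak memory drops while total run time grows because the tasks run back to back.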

Another caveat is that fault tolerance is currently disabled for reducers. The reason is the 3-hour time limit of Azure queues: I had several jobs that ran for more than 3 hours, so I had to disable reducer fault tolerance, and I forgot to turn it back on. I'll probably introduce a config option to control this in the next release, which will hopefully happen at the end of this month.

thanks,
Thilina

