Nodes underperforming

Andrew HB

Apr 23, 2012, 5:15:14 PM
to Segue for R
Howdy,

I thought I'd brain dump my experiences so far using segue and see if
there is any potential for troubleshooting or lessons to be learned.

Humorously (in retrospect), I started working with the package in the
hours between Amazon's update to Squeeze and the release of segue for
Squeeze. That led to a lot of obstacles, which turned out to be
valuable lessons in how the package and EMR work together.

Even after the upgrades, I couldn't get 'cranPackages' or
'customPackages' to successfully bootstrap the packages I wanted to
use (R.oo and pracma), but I worked around it by shipping tarballs of
the packages with 'sourcePackagesToInstall'. Similarly, I couldn't get
'rObjectsOnNodes' to plant my objects in the local file path, but I
used 'filesOnNodes' to put binaries in the local file path, and then
loaded them in my function call.

myCluster <- createCluster(
  numInstances = 14,
  # source tarballs shipped to the nodes instead of using cranPackages
  sourcePackagesToInstall = c(
    "/home/andrew/R/x86_64-redhat-linux-gnu-library/2.13/R.methodsS3.tar.gz",
    "/home/andrew/R/x86_64-redhat-linux-gnu-library/2.13/R.oo.tar.gz",
    "/home/andrew/R/x86_64-redhat-linux-gnu-library/2.13/pracma.tar.gz",
    "/home/andrew/R/customPackages/Analyzer2EMR.tar.gz",
    "/home/andrew/R/customPackages/Model3EMR.tar.gz"
  ),
  # binary data file placed on each node and loaded inside the function call
  filesOnNodes = c("/home/andrew/this.DSS.Rdata"),
  instancesPerNode = 8,
  masterInstanceType = "c1.xlarge",
  slaveInstanceType = "c1.xlarge",
  masterBidPrice = "0.55",
  slaveBidPrice = "0.55"
)
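
For readers following along, the worker-side half of the 'filesOnNodes' workaround might look roughly like the sketch below. The function name runModelOnNode, the object scaleFactor, and the arithmetic standing in for the model are all hypothetical. It also assumes the shipped file ends up in the task's working directory, which is what the post describes but which is not verified here.

```r
# Hypothetical worker function: loads the shipped .Rdata file, then runs
# a stand-in "model" on one input value.
runModelOnNode <- function(x, dataFile = "this.DSS.Rdata") {
  # Assumption: files listed in filesOnNodes are readable from the
  # task's working directory under their base name.
  env <- new.env()
  load(dataFile, envir = env)   # restore the saved objects into env
  # Placeholder for the real model call: scale the input by a shipped value.
  x * env$scaleFactor
}

# Local dry run with a stand-in data file, no cluster required.
scaleFactor <- 2
dataFile <- file.path(tempdir(), "this.DSS.Rdata")
save(scaleFactor, file = dataFile)
results <- lapply(1:3, runModelOnNode, dataFile = dataFile)
```

Running the function through plain lapply first, as above, is a cheap way to confirm it is self-contained before paying for a cluster.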

My model runs in about 15 minutes, single threaded, on my local
machine, but on the EC2 instances (c1.xlarge) each run takes a minimum
of 25 minutes, sometimes up to 90 minutes, and sometimes doesn't seem
to complete at all, leaving the cluster seemingly hung. It could be
that the taskTimeout is terminating the model function call and then
retrying it. If my function takes 20+ minutes to run, is that seen as
a single task by EMR? If that were the case, though, the maximum
return time should be 3 x taskTimeout, and I'm getting nodes spinning
seemingly indefinitely. I will experiment with longer timeout settings.
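
The timeout reasoning above can be written out as a small calculation. This is only a sketch: the helper names are hypothetical, the timeout values are illustrative, and the three attempts per task is the figure assumed in this post, not a verified Hadoop or segue setting.

```r
# Worst-case wall time (minutes) for one task if every attempt runs
# until it is killed by the timeout.
# maxAttempts = 3 is the post's assumption, not a checked default.
worstCaseMinutes <- function(taskTimeout, maxAttempts = 3) {
  taskTimeout * maxAttempts
}

# Can a task of the given duration finish before the timeout kills it?
fitsInTimeout <- function(runMinutes, taskTimeout) {
  runMinutes < taskTimeout
}

shortTimeout <- 10    # illustrative value, in minutes
longTimeout  <- 120   # illustrative value, in minutes

worstShort <- worstCaseMinutes(shortTimeout)
fitsShort  <- fitsInTimeout(25, shortTimeout)   # a 25-minute run
fitsLong   <- fitsInTimeout(25, longTimeout)
```

If a 25-minute run never fits inside the timeout, every attempt would be killed and the task could never succeed, so comparing observed run times against the configured timeout seems like a reasonable first check.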

Any advice on which node types work best, how many instances to run on
each node, which Amazon Dashboard log files have the most information
and other tips for speeding and troubleshooting performance would be
much appreciated.

It might just be my imagination, but I think that it ran more smoothly
with 'On Demand' instances than with 'Spot Request' instances.

Many thanks to JD for creating segue. I'm glad to live in a world
where I can complain that my on demand supercomputer isn't working
fast enough.

James Long

Apr 24, 2012, 8:29:41 AM
to seg...@googlegroups.com
Andrew,

Thanks for sharing your experiences. I got caught flat-footed by the
Squeeze update too. Soon after, I had some work to finish quickly, so I
fired up Segue only to discover things didn't work! I quickly patched
things up and checked in the changes. One thing I have been doing
really poorly is incrementing version numbers. Right now this requires
changing some build scripts so I end up not doing it. I need to make
that easier so I'll actually do it. I've also been sloppy/lazy as I
just relocated abroad and that's been rather time consuming.

CRAN packages: I have not tested this so I will set up some tests and
let you know what I find.

Instances per node: I have had best luck running the same number of
instances as the number of processors I have. So, for example, you
illustrate running c1.xlarge instances. Those have 4 virtual
processors so I would run 4 instances per node.

Spot Requests: The spot request features were added by the community.
I have never used them other than a simple test. Since spot requests
have to go through the Amazon bidding engine, I would guess they might
have delays related to that. Also I have heard they can get knocked
out from under you if the price goes up. I've not tested that or
validated if that's true.

Run time: Can you cobble up an example where you get longer run time
on EMR than you get on a local single thread? There are a bunch of
things that can cause this behavior. You clearly have dived into the
code and probably are not making the simple mistakes. If you could
send a dummy example I'd be happy to try to run it and sleuth through
the log files, etc. and see what I can find. It's possible to ssh into
the master node and see the Hadoop reporting interface to see which
nodes are doing what, etc. I do that sometimes when I need to really
understand WTF is going on.
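
A dummy example of the kind asked for above could be as small as the following sketch. dummyModel is a hypothetical stand-in with no external dependencies, and the emrlapply line is left commented out because it needs a running cluster (myCluster from the earlier post).

```r
# Hypothetical CPU-bound stand-in for the real model; deterministic
# for a given seed so local and EMR results can be compared.
dummyModel <- function(seed) {
  set.seed(seed)
  mean(replicate(200, mean(rnorm(1000))))
}

# Baseline: single-threaded timing on the local machine.
localTime <- system.time(localResults <- lapply(1:4, dummyModel))
print(localTime["elapsed"])

# Same work on a running cluster (not executed here):
# emrTime <- system.time(
#   emrResults <- emrlapply(myCluster, 1:4, dummyModel)
# )
```

Comparing the two elapsed times for a task with no package or data dependencies would help separate per-task overhead on EMR from slowness in the model itself.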

I'm glad Segue is, at least sometimes, useful. It started as a bit of
a hack for me to speed up Monte Carlo sims, but it's grown into
something a bit more generally useful, thanks in large part to early
adopters like you who are willing to struggle alongside me. Thanks
for that.

-J

Radek Maciaszek

May 2, 2012, 7:38:38 AM
to seg...@googlegroups.com
Quick comment on spot instances, as I added this feature. The main advantage is price. In practice you get at least 50% lower prices on the cluster, which can be a significant amount when you are running calculations on tens or hundreds of nodes for a prolonged period of time.

As James wrote, setting up the cluster takes a bit longer (in my experience around 15-20 minutes), but you are not charged for the time you spend waiting for new servers to go through the bidding engine. You set your maximum price but pay the current hourly rate. You will lose a node only if the current hourly rate rises above your maximum bid. Losing a server shouldn't (in theory) affect your calculations, as Hadoop should rerun the unfinished or faulty mappers and reducers on the remaining servers. That is, unless you lose the entire cluster this way. In practice the spot prices are very rarely higher than on-demand prices. If you bid the on-demand price and run a cluster for a few hours every day, you can expect to lose maybe one in a few hundred clusters. In my opinion the 50% cost savings more than offset that risk.
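
The billing rule described above (pay the current hourly rate, lose the node once that rate exceeds the bid) can be sketched as below. The function spotRun and the price series are made up for illustration, the 0.55 bid is the one from the first post, and real spot billing has details this simplified model ignores.

```r
# Simplified model: each hour is billed at that hour's spot price, and
# the node is lost at the first hour whose price exceeds the bid.
spotRun <- function(hourlySpotPrices, bid) {
  over <- which(hourlySpotPrices > bid)
  hoursRun <- if (length(over) > 0) over[1] - 1 else length(hourlySpotPrices)
  list(
    hoursRun = hoursRun,
    cost     = sum(hourlySpotPrices[seq_len(hoursRun)]),
    lost     = length(over) > 0
  )
}

# Made-up hourly prices in USD; the fourth hour spikes above the bid.
prices <- c(0.25, 0.27, 0.30, 0.60, 0.26)
run <- spotRun(prices, bid = 0.55)
```

In this made-up series the node runs for three hours at the spot rate and is then lost, which is the failure mode the bid is meant to make rare.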

Best,
Radek

-- 
DataMine Lab Ltd
M: +44(0)79 6977 8924

Phoenix Yard, 65 King’s Cross Road
WC1X 9LW London
Registration No. 06685948 England & Wales.

James Long

May 2, 2012, 7:52:25 AM
to seg...@googlegroups.com
Radek,

Thanks for the clarification. Much appreciated.

-J