Guys,
Let me share some of the things I have figured out. A few days to a week back, I ran into a problem where the MasterImpl process was killed for an unknown reason and my benchmarks ended abruptly. It turns out that this was because the AgentBootStrap was sending a SIGTERM to it, to the Drivers, etc.
At the time I didn't know that this was not directly related to performance, so I decided to distribute the load across multiple machines to see if that would help. I looked at the instructions on the faban.org site, which say to run the harness on one machine and the agent daemon on another. Initially even this didn't work. The reason is that the agent daemon on the remote machine is sent four parameters [Name of Agent Host, IP1, IP2, JVM location]. In a simple private network, IP1 needs to be the same as IP2, which is the IP of the Master. However, IP1 is determined by a script that seems to get it wrong (I think it is meant to work for multihomed devices; on my basic setup it does not work well and incorrectly identifies IP1 as the IP of the Agent Host).
The symptom of the script not working is a "Retrying Connection to agent@ etc..." message in the log.
I have copied the script below to show what it tries to do.
------------------------------
#!/bin/sh
#########################################################################
# The interface script determines the ip address of the interface used to
# talk to the given remote host. If remote host cannot be contacted, it
# will exit with an exit value of 1.
#########################################################################
COMMAND="$0"
TARGET="$1"
usage() {
    echo "usage: ${COMMAND} host" >&2
    exit 1
}
if [ -z "${TARGET}" ] ; then
    usage
fi
INTERFACE=`ping -R -c1 "${TARGET}" | grep RR | awk '{ print $2 }' 2>/tmp/interface.$$.err`
echo $INTERFACE
------------------------------
I should mention that both of my machines run Ubuntu and are connected to the same switched network (nothing fancy here). However, they are virtualized, each running on a DIFFERENT Xen server. I am not sure if this is a bug, but I thought I would share it.
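For comparison, on Linux the kernel can be asked directly which local address it would use to reach a target, instead of inferring it from the record-route output of ping -R. The following is only a minimal sketch, not what Faban does: it assumes the iproute2 "ip" command is installed and that the target is given as an IP address rather than a host name. The helper names parse_src and local_ip_for are made up for illustration.

```shell
#!/bin/sh
# Sketch: print the local source address the kernel would use to reach a target.
# Assumptions: Linux with iproute2; the target is an IP address, not a host name.
# parse_src and local_ip_for are hypothetical helper names, not part of Faban.

# Read `ip route get` output on stdin and print the field that follows "src".
parse_src() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Print the source address for target $1, or return 1 if none can be found.
local_ip_for() {
    addr=`ip route get "$1" 2>/dev/null | parse_src`
    [ -n "${addr}" ] || return 1
    echo "${addr}"
}

# Example call (192.0.2.20 is a placeholder address):
#   local_ip_for 192.0.2.20 || exit 1
```

Because this follows the routing table rather than the hops recorded in the ping reply, it should report the master's own interface address even when the record-route data is unusual, though I have not verified this on a Xen setup like the one described here.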
I worked around the wrong IP1 by having the script simply echo IP2 (the master IP). That fixed the connection problem above, but the Master process was still being killed midway. So unfortunately the cause was not the distribution of load.
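The workaround of echoing the master IP could look roughly like the sketch below. It is an illustration only: MASTER_IP is a hypothetical environment variable and master_ip a made-up function name, neither of which is a Faban setting. The idea is that the script reports a fixed, known address and skips the ping -R probe entirely.

```shell
#!/bin/sh
# Workaround sketch: report a fixed master address instead of probing with ping -R.
# MASTER_IP is a hypothetical variable chosen for this example, not a Faban setting.

# Print the configured master address, or fail with a message if it is not set.
master_ip() {
    if [ -z "${MASTER_IP:-}" ]; then
        echo "MASTER_IP is not set" >&2
        return 1
    fi
    echo "${MASTER_IP}"
}

# In the script body one would then call:
#   master_ip || exit 1
```

This only makes sense on a single-homed master, where there is exactly one address an agent could use to reach it.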
Note that the master and the other processes are killed only when I run the benchmark for more than about 10 minutes. If I run the benchmark for a shorter period (< 10 min), I don't see the problem. I am currently looking at the code (luckily I have some of the source code of SPECJ and FABAN) to figure out what is going on, but if you guys have any hints or things to check, please do share.
Ashwin