trouble with redis on yellowstone

Dinesh Prasanth Ganapathi

Apr 1, 2014, 10:50:25 PM
to bigjob...@googlegroups.com
Hi,
    I was trying to run a simple-ensemble type of BigJob script with the coordination URL set to COORD="redis://ILikeBigJob...@gw68.quarry.iu.teragrid.org:6379". I am unable to connect to the Redis server, and my stderr file looks like this:
File "<string>", line 59, in <module>
  File "/glade/u/home/dinesh/SAGA_MDAnalysis_ENV/lib/python2.7/site-packages/BigJob-0.64.5-py2.7.egg/bigjob/bigjob_agent.py", line 163, in __init__
    self.coordination = bigjob_coordination(server_connect_url=self.coordination_url)
  File "/glade/u/home/dinesh/SAGA_MDAnalysis_ENV/lib/python2.7/site-packages/BigJob-0.64.5-py2.7.egg/pilot/filemanagement/../../coordination/bigjob_coordination_redis.py", line 84, in __init__
    raise Exception("Cannot connect to Redis server: %s" % str(ex))
Exception: Cannot connect to Redis server: Error 110 connecting gw68.quarry.iu.teragrid.org:6379. Connection timed out.


When I try to install Redis on Yellowstone by following the steps in the BigJob manual, as follows:
wget http://download.redis.io/redis-stable.tar.gz
tar xvzf redis-stable.tar.gz
cd redis-stable
make

cd redis-stable
./src/redis-server

This is what I see. (It is different from what is expected according to the BigJob manual and the Redis documentation, http://redis.io/topics/quickstart.)
[19434] 01 Apr 20:37:37.776 # Server started, Redis version 2.8.8
[19434] 01 Apr 20:37:37.776 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[19434] 01 Apr 20:37:37.776 * The server is now ready to accept connections on port 6379

When I open another terminal, cd to /redis-stable/src, and run redis-cli ping, I do not get back a PONG.

(Independently, this is also the case when I try to set up the Redis server locally on Stampede.)

The BigJob script with COORD = "redis://localhost:6379" also yields a "Cannot connect to Redis server" error.

I am not sure whether I am missing something or whether there is an issue with installing a local Redis server on Yellowstone.
Another point: whenever I log into Yellowstone, be it from my laptop or from Stampede, I invariably have to use the YubiKey. (Does this inhibit access to gw68?)
Does anyone have any ideas on how I could proceed further?

Thanks and regards,
Dinesh



Ole Weidner

Apr 1, 2014, 11:41:40 PM
to bigjob...@googlegroups.com
Hi Dinesh,

When you open another terminal, do you end up on the same login node? Most clusters have a load balancer in front of their login nodes, and you might end up on different physical machines with different hostnames. If that is the case, using 'localhost' won't work.

Can you verify / check? 

Cheers
Ole
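
A rough way to check this from Python (3) is to print the hostname of the current node and then send a PING to the Redis port on an explicitly named host instead of 'localhost'. This is only a sketch: the function name redis_ping is made up for illustration, and it assumes a Redis server without a password, which answers a plain-text inline PING with +PONG.

```python
import socket


def redis_ping(host, port=6379, timeout=3.0):
    """Return True if host:port answers a Redis PING with +PONG."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # connection refused, timed out, or hostname not resolvable
        return False
    try:
        conn.sendall(b"PING\r\n")  # inline-command form of PING
        reply = conn.recv(64)
    except OSError:
        return False
    finally:
        conn.close()
    return reply.startswith(b"+PONG")


if __name__ == "__main__":
    # Shows which login node this terminal actually landed on.
    here = socket.gethostname()
    print("this node is: %s" % here)
    # Replace 'here' with the node where redis-server was started.
    print("PING %s:6379 -> %s" % (here, redis_ping(here)))
```

If the hostname printed in the second terminal differs from the one where redis-server was started, passing that first hostname to redis_ping tells you whether the server is reachable across login nodes.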

Ole Weidner

Apr 2, 2014, 12:07:00 AM
to bigjob...@googlegroups.com
Another thought: does the script run locally (just on the head node)? If not, I assume that you need to set COORD to the hostname of the head node, rather than 'localhost'.

Best,
Ole
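
As a small illustration of that suggestion, the coordination URL can be built from a hostname rather than hard-coding 'localhost'. The helper name coord_url is hypothetical, and falling back to the current node's hostname is only correct if redis-server runs on the same node as the script.

```python
import socket


def coord_url(host=None, port=6379):
    """Build a redis:// coordination URL for the given host."""
    if host is None:
        # Assumption: Redis runs on the node this script is started from.
        host = socket.gethostname()
    return "redis://%s:%d" % (host, port)


if __name__ == "__main__":
    # "head-node" is a placeholder for the node running redis-server.
    print(coord_url("head-node"))
    print(coord_url())
```

Passing the hostname explicitly is the safer choice on a cluster with several login nodes, since the script and the Redis server may not share a node.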

Dinesh Prasanth Ganapathi

Apr 2, 2014, 12:18:17 AM
to bigjob...@googlegroups.com
Hi Ole,
           I know Yellowstone has 6 login nodes, but I am not sure how to determine which login node I am on. If there is a way to find out, maybe I'll be able to test whether redis-server works there. I did not get any errors during installation, though.

Thanks and regards,
Dinesh

Dinesh Prasanth Ganapathi

Apr 2, 2014, 1:29:14 AM
to bigjob...@googlegroups.com
Hi Ole,
           Thanks for the tip. It seems to work now.

Thanks and regards,
Dinesh





Dinesh Prasanth Ganapathi

Apr 2, 2014, 9:06:33 AM
to bigjob...@googlegroups.com
Hi,
     I am running the Redis server on the yslogin1 node. My WORKDIR is set to /glade/p/work/dinesh/md_traj_expts. I have a bash script, try_rmsd_calling_script.sh, that is the compute task for the BigJob script.

I am running the BigJob script io_saturation_mdanalysis_yellowstone.py from /glade/u/home/dinesh. I have set COORD="redis://yslogin1:6379".

When I set up Redis and run it, I get the following (logging onto yslogin1 and running redis-cli ping returns a PONG, so I think the Redis server is working):


[17090] 02 Apr 06:43:11.167 # Server started, Redis version 2.8.8
[17090] 02 Apr 06:43:11.167 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[17090] 02 Apr 06:43:11.184 * DB loaded from disk: 0.017 seconds
[17090] 02 Apr 06:43:11.184 * The server is now ready to accept connections on port 6379


Once I run io_saturation_mdanalysis_yellowstone.py from yslogin6, I get this:

(SAGA_MDAnalysis_ENV)[dinesh@yslogin6 ~]$ python io_saturation_mdanalysis_yellowstone.py
Unable to connect/initialize Pilot-Data. Exit BigJob.* Submitted task '0' with id 'cu-c90dbfa6-ba64-11e3-a2fc-0002c9445f91' to yslogin6
* Submitted task '1' with id 'cu-c90ec77a-ba64-11e3-a2fc-0002c9445f91' to yslogin6
* Submitted task '2' with id 'cu-c90fafbe-ba64-11e3-a2fc-0002c9445f91' to yslogin6
* Submitted task '3' with id 'cu-c91097a8-ba64-11e3-a2fc-0002c9445f91' to yslogin6
Waiting for tasks to finish...
Terminating BigJob...



The bj and sj folders are created in the WORKDIR location, and the stderr.txt file contains the following:

/bin/sh: aprun: command not found
/bin/sh: ibrun: command not found
/bin/sh: srun: command not found
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).

I am not sure why I should encounter an access-related issue, as I am already on Yellowstone and submitting to LSF using pilot_description.service_url = "lsf://localhost".


The BigJob script (io_saturation_mdanalysis_yellowstone.py) is as follows:


import os
import sys
import pilot
import traceback

""" DESCRIPTION: This example shows how to run BigJob locally to execute tasks.
"""

#------------------------------------------------------------------------------
# Redis password and 'user' name
#REDIS_PWD   = 'ILikeBigJob_wITH-REdIS'# Fill in the password to your server
#USER_NAME   = # Fill in your username on the resource you're running on

# The coordination server
#COORD       = "redis://ILikeBigJob...@gw68.quarry.iu.teragrid.org:6379"
COORD       = "redis://yslogin1:6379"
# The host to run BigJob on
HOSTNAME    = "yslogin6"
# The working directory on your machine
WORKDIR     = "/glade/p/work/dinesh/md_traj_expts"
# The number of jobs you want to run
NUMBER_JOBS = 4


#------------------------------------------------------------------------------
#

def main():
    # initialize both handles so the 'finally' block is safe even if
    # creating the service or the pilot raises an exception
    pilot_compute_service = None
    pilotjob = None
    try:
        # this describes the parameters and requirements for our pilot job
        pilot_description = pilot.PilotComputeDescription()
        pilot_description.service_url = "lsf://localhost"
        pilot_description.number_of_processes = 4
        pilot_description.working_directory = WORKDIR
        pilot_description.walltime = 50
        pilot_description.project = "URTG0003"
        pilot_description.queue = "regular"
        # create a new pilot job
        pilot_compute_service = pilot.PilotComputeService(coordination_url=COORD)
        pilotjob = pilot_compute_service.create_pilot(pilot_description)

        # submit tasks to pilot job
        tasks = list()
        for i in range(NUMBER_JOBS):
            task_desc = pilot.ComputeUnitDescription()
            task_desc.executable = 'bash'
            task_desc.arguments = ['/glade/work/p/dinesh/md_traj_expts/try_rmsd_calling_script.sh']
            #task_desc.environment = {'TASK_NO': i}
            task_desc.number_of_processes = 1
            task_desc.output = 'md_rmsd-stdout.txt'
            task_desc.error = 'md_rmsd-stderr.txt'

            task = pilotjob.submit_compute_unit(task_desc)
            print "* Submitted task '%s' with id '%s' to %s" % (i, task.get_id(), HOSTNAME)
            tasks.append(task)

        print "Waiting for tasks to finish..."
        pilotjob.wait()

        return(0)

    except Exception, ex:
        print "AN ERROR OCCURRED: %s" % ((str(ex)))
        # print a stack trace in case of an exception -
        # this can be helpful for debugging the problem
        traceback.print_exc()
        return(-1)

    finally:
        # always try to shut down pilots, otherwise jobs might end up
        # lingering in the queue
        print ("Terminating BigJob...")
        if pilotjob is not None:
            pilotjob.cancel()
        if pilot_compute_service is not None:
            pilot_compute_service.cancel()


if __name__ == "__main__":
    sys.exit(main())





Is there something I am missing here?

Thanks and regards
Dinesh
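
The first three lines of stderr.txt report that aprun, ibrun and srun were not found. Which component tried them is not stated in this thread, but a quick way to see which of these launchers exist on a given node is to look them up on PATH. The sketch below assumes Python 3.3+ (for shutil.which); the function name find_launchers is made up, and the list contains only the names taken from the stderr output.

```python
import shutil

# Launcher names taken from the stderr.txt lines; extend for your site.
LAUNCHERS = ["aprun", "ibrun", "srun"]


def find_launchers(names=LAUNCHERS):
    """Map each launcher name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in sorted(find_launchers().items()):
        print("%-6s %s" % (name, path or "not found"))
```

Running it on both a login node and inside a batch job can show whether the environment differs between the two.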

Andre Merzky

Apr 2, 2014, 1:06:24 PM
to bigjob...@googlegroups.com
Not sure, but it *could* be an application error, i.e. an error within
/glade/work/p/dinesh/md_traj_expts/try_rmsd_calling_script.sh. Does
that run ok when starting manually, as 'bash
/glade/work/p/dinesh/md_traj_expts/try_rmsd_calling_script.sh' ?

My $0.02, Andre.
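
Before running the script by hand, it may also be worth checking the path itself. The posted script sets WORKDIR to /glade/p/work/dinesh/md_traj_expts, while task_desc.arguments points to /glade/work/p/dinesh/md_traj_expts/..., with the "p" and "work" components swapped. Whether both paths exist on Yellowstone cannot be told from this thread, so the sketch below (Python 3, with the hypothetical helper is_inside) only checks that the file exists and whether it lies under WORKDIR.

```python
import os

# Both values copied from the posted BigJob script.
WORKDIR = "/glade/p/work/dinesh/md_traj_expts"
TASK_SCRIPT = "/glade/work/p/dinesh/md_traj_expts/try_rmsd_calling_script.sh"


def is_inside(path, directory):
    """Return True if path lies inside directory (purely lexical check)."""
    path = os.path.normpath(path)
    directory = os.path.normpath(directory)
    return path.startswith(directory + os.sep)


if __name__ == "__main__":
    print("script exists:     %s" % os.path.isfile(TASK_SCRIPT))
    print("script in WORKDIR: %s" % is_inside(TASK_SCRIPT, WORKDIR))
```

The check is lexical and does not follow symlinks, so a False result means only that the two strings disagree, not that the file is missing.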

--
Nothing is really difficult.