BigJob Redis Communication

Vishal Shah

Dec 24, 2013, 11:45:51 AM
to bigjob...@googlegroups.com
Hello,

How exactly does BigJob manage communication to obtain the state of a pilot before the pilot is "running" on a machine? I ask because I am using a Redis server with a timeout configured: without a timeout there appear to be lingering connections, which may or may not be causing an issue with an experiment that I am running (I am not completely sure yet).

For example, if I set the timeout to 100 seconds and the queue time is greater than 100 seconds, would there be an issue?

Thanks,
Vishal

Andre Luckow

Dec 25, 2013, 8:04:39 PM
to bigjob...@googlegroups.com
Hi Vishal,
I am not sure exactly which timeout you are referring to. I assume you
mean the Redis timeout that applies to the communication channel
between the BigJob manager and agent and Redis. This timeout is not
related to the queuing time, and thus there should be no impact.

Best,
Andre

Vishal Shah

Dec 26, 2013, 5:08:13 PM
to bigjob...@googlegroups.com
Hey Andre,

I was indeed referring to the timeout between the manager/agent and Redis; however, my issue still exists. I am currently running a script in which a single compute service is created, and compute descriptions are then created from it. The script is meant to run indefinitely; however, I hit the following error:

Cannot connect to Redis server: Error 110 connecting gw68.quarry.iu.teragrid.org:6379. Connection timed out.
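On Linux, errno 110 is ETIMEDOUT, so the message reports that the TCP connection attempt itself timed out. A minimal sketch for checking plain TCP reachability of the Redis port, independent of BigJob, might look like the following. The helper name `can_connect` and the 5-second default are assumptions made for illustration; nothing here is part of BigJob or Redis.

```python
import errno
import socket


def can_connect(host, port, timeout=5.0):
    """Return (ok, detail) for a plain TCP connection attempt to host:port."""
    try:
        # create_connection resolves the host and applies the timeout to connect().
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # exc.errno may be None (e.g. for a local timeout), hence the default.
        name = errno.errorcode.get(exc.errno, "no errno")
        return False, "%s [%s]" % (exc, name)
```

Calling `can_connect("gw68.quarry.iu.teragrid.org", 6379)` from the submission host at the moment the error appears would show whether the port is reachable at all. A refused connection and a timeout typically point to different causes: a refusal means the host answered but nothing accepted the connection, while a timeout usually means no answer came back.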

I attempted to create a new compute service after hitting the above exception; however, this does not appear to resolve the issue, because the exception is caught again immediately after the new compute service is created. It also appears that this occurs at a set interval; however, I am not completely sure, as I am still investigating this at the moment.
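Because reconnecting immediately fails again, one option is to wait between attempts and lengthen the wait each time. The sketch below shows that idea in plain Python. `retry_with_backoff` is a hypothetical helper, not a BigJob function, and it catches `Exception` broadly only because the thread does not say which exception type BigJob raises.

```python
import time


def retry_with_backoff(action, attempts=5, first_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    The delay between tries grows geometrically, so a struggling server
    is not hit again immediately after a failure.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= factor
```

The `sleep` parameter exists only so the waiting can be replaced in tests. In a script like the one discussed here, `action` would be whatever callable creates the compute service or pilot.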

I also attempted to use my own redis server; however, the same issue occurred.

Thanks,
Vishal


Andre Luckow

Dec 26, 2013, 8:34:42 PM
to bigjob...@googlegroups.com, bigjob...@googlegroups.com
Hi Vishal,
does this happen immediately or only after some time? Usually Redis is capable of sustaining quite a few connections. Maybe flushing and restarting Redis helps. How many pilots/CUs are you managing? What script are you using?

Thanks!
Andre

Vishal Shah

Dec 26, 2013, 8:53:24 PM
to bigjob...@googlegroups.com
Hey Andre,

This happened after roughly 108 minutes on the first run that I timed. I am running it again to see how long it takes and whether there is any consistency. I ran into the same issue with the Redis server that I had set up on my own system, which was a fresh installation. At any given time I am managing one CU.

The script I am using basically does the following:

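import pilot

def main():
    # One pilot service is created up front and reused for every pilot.
    pilot_service = pilot.PilotComputeService(XSEDE_COORD)
    job_length = ...
    while True:
        for j in job_length:
            for i in range(...):
                run_on_xsede(xsede_url=XSEDE_URL, number_of_cores=int(i),
                             task_length=int(j), pilot_service=pilot_service)

def run_on_xsede(xsede_url, number_of_cores, task_length, pilot_service):
    ...
    # A new pilot is created on every call.
    pilotjob = pilot_service.create_pilot(pilot_compute_description=pilot_description)
    ...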

Thanks,
Vishal

Andre Luckow

Jan 2, 2014, 11:15:38 PM
to bigjob...@googlegroups.com
Hi Vishal,
how many pilots are you starting? Are you directly submitting to the
pilots or are you using the ComputeDataService?

Thanks,
Andre

Andre Luckow

Jan 2, 2014, 11:18:26 PM
to bigjob...@googlegroups.com
Hi Vishal,
can you try to set the following parameters:

ulimit -Sn 100000 # This will only work if hard limit is big enough.
sysctl -w fs.file-max=100000

http://redis.io/topics/clients
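To see the limits that the two commands above change, Python's standard `resource` module can report the open-file limits of the current process. This is only a sketch: the limits that matter for Redis are those of the Redis server process, and the helper `can_raise_soft_limit` is a hypothetical name added to mirror the "hard limit is big enough" condition.

```python
import resource

# RLIMIT_NOFILE caps how many file descriptors (sockets included)
# this process may hold open at once.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft)
print("hard limit:", hard)


def can_raise_soft_limit(target, hard_limit):
    """True if the soft limit may be raised to target without raising the hard limit."""
    if hard_limit == resource.RLIM_INFINITY:
        return True
    return target <= hard_limit


print("can raise soft limit to 100000:", can_raise_soft_limit(100000, hard))
```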

Best,
Andre

Vishal Shah

Jan 2, 2014, 11:25:05 PM
to bigjob...@googlegroups.com
Hey Andre,

I am creating one pilot service and then creating pilots one at a time. I am not using the ComputeDataService. 

I set the parameters and will see if I run into the error.

The following is the error:
Unable to connect/initialize Pilot-Data. Exit BigJob.
Unable to connect/initialize Pilot-Data. Exit BigJob.
An exception occured during XSEDE submission: Cannot connect to Redis server: Error 110 connecting zephyrhomeserver.no-ip.org:6379. Connection timed out.

(Using the redis server on my own machine)

Thanks,
Vishal

Andre Luckow

Jan 3, 2014, 9:30:00 PM
to bigjob...@googlegroups.com
Hi Vishal,
how many pilots did you create in total?

Thanks,
Andre

Vishal Shah

Jan 4, 2014, 4:42:21 PM
to bigjob...@googlegroups.com
Hello Andre,

The number below each error is the number of pilots that had been created when the error was caught:


An exception occured during XSEDE submission: Cannot connect to Redis server: Error 110 connecting zephyrhomeserver.no-ip.org:6379. Connection timed out.
107

An exception occured during XSEDE submission: Cannot connect to Redis server: Error 110 connecting zephyrhomeserver.no-ip.org:6379. Connection timed out.
15

An exception occured during XSEDE submission: Cannot connect to Redis server: Error 110 connecting zephyrhomeserver.no-ip.org:6379. Connection timed out.
16

An exception occured during XSEDE submission: Cannot connect to Redis server: Error 110 connecting zephyrhomeserver.no-ip.org:6379. Connection timed out.
867

An exception occured during XSEDE submission: Cannot connect to Redis server: Error 110 connecting zephyrhomeserver.no-ip.org:6379. Connection timed out.
197

An exception occured during XSEDE submission: Cannot connect to Redis server: Error 110 connecting zephyrhomeserver.no-ip.org:6379. Connection timed out.
289
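The spread of these six counts can be summarized with a few lines of standard-library Python. This is just arithmetic on the numbers listed above, and the variable name is illustrative.

```python
import statistics

# Pilot counts at the moment of each failure, copied from the list above.
pilots_at_failure = [107, 15, 16, 867, 197, 289]

print("min:   ", min(pilots_at_failure))
print("max:   ", max(pilots_at_failure))
print("mean:  ", statistics.mean(pilots_at_failure))
print("median:", statistics.median(pilots_at_failure))
```

The counts range from 15 to 867, with a mean of 248.5 and a median of 152, so six samples on their own do not suggest a fixed pilot count at which the failure sets in.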

Thanks,
Vishal

Andre Luckow

Jan 8, 2014, 8:34:26 PM
to bigjob...@googlegroups.com
Hi Vishal,
do you have an environment where I can log in and observe this failure?

Thanks,
Andre

Andre Luckow

Jan 8, 2014, 10:24:17 PM
to bigjob...@googlegroups.com
Hi Vishal,
you are starting/stopping pilots in an endless loop which very likely
overwhelmed the Redis Server a bit.

Until now, BigJob has used one Redis connection per pilot. In the develop
branch I introduced connection pooling for the Redis connection. Could
you please give it another try with that version?
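As a general illustration of connection pooling (this is not BigJob's implementation, which is not shown in this thread), the sketch below caps the number of open connections and reuses returned ones instead of opening a new one per pilot. The class and its `connect` factory argument are hypothetical; the redis-py client ships its own `redis.ConnectionPool` for the same purpose.

```python
import queue
import threading


class ConnectionPool:
    """Hand out at most max_size connections and reuse released ones."""

    def __init__(self, connect, max_size=4):
        self._connect = connect        # factory that opens one connection
        self._max_size = max_size
        self._idle = queue.Queue()     # released connections awaiting reuse
        self._opened = 0               # connections opened so far
        self._lock = threading.Lock()

    def acquire(self, timeout=None):
        # Prefer a connection that has already been opened and released.
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            pass
        # Otherwise open a new one while still under the cap.
        with self._lock:
            if self._opened < self._max_size:
                conn = self._connect()
                self._opened += 1
                return conn
        # Cap reached: wait for a release (raises queue.Empty on timeout).
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)
```

With a pool like this, creating many pilots in sequence keeps the number of server-side connections bounded by `max_size` rather than growing with the pilot count.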

You can install the dev version with:

easy_install bigjob2

(please uninstall the prod version first).

Please let me know whether this helps.

Thanks,
Andre

Andre Merzky

Jan 9, 2014, 3:39:18 AM
to bigjob...@googlegroups.com
On Thu, Jan 9, 2014 at 4:24 AM, Andre Luckow <andre....@gmail.com> wrote:
> Hi Vishal,
> you are starting/stopping pilots in an endless loop which very likely
> overwhelmed the Redis Server a bit.

I believe the error is random, though, and not necessarily due to
overload -- from Vishal's mail, the numbers of pilots at failure are

107
15
16
...

My $0.02,

Andre.
Nothing is really difficult.

Vishal Shah

Jan 9, 2014, 11:24:58 AM
to bigjob...@googlegroups.com
Hello,

I have not tried the dev branch yet; however, the output of the script run with SAGA_VERBOSE and BIGJOB_VERBOSE set to DEBUG can be found here:
