Stopping Migrating Hydra Jobs?

Albert Law

Nov 16, 2014, 12:16:40 PM

to hydr...@googlegroups.com

Hi All,

Is it possible to stop migrating Hydra jobs? I ask because I've
noticed some of these migrations don't seem to complete after 10 hours
with very little data involved.

If it is possible, then will a "stop" command suffice? And is it safe
to stop a migrating job?

Here's a snippet of the relevant migration activity from the spawn.log.

WARN [Timer-3] 2014-11-16 01:21:20,695 Spawn.java (line 3634)
Migrating b6bf6a77-367b-4ca9-b193-45a52b0fc173/1 to
ffc1bf5b-e626-4b7c-87d2-662718267d85
WARN [qtp162108213-766] 2014-11-16 11:08:15,596 Spawn.java (line 2164)
[task.stop] stopping migrating b6bf6a77-367b-4ca9-b193-45a52b0fc173/1

WARN [Timer-3] 2014-11-16 01:01:11,637 Spawn.java (line 3634)
Migrating b6bf6a77-367b-4ca9-b193-45a52b0fc173/2 to
c0991eac-e107-4075-961b-96eb5cbcb2d0
WARN [qtp162108213-765] 2014-11-16 11:08:16,525 Spawn.java (line 2164)
[task.stop] stopping migrating b6bf6a77-367b-4ca9-b193-45a52b0fc173/2

WARN [Timer-3] 2014-11-16 00:29:41,546 Spawn.java (line 3634)
Migrating 1b595273-e742-4269-80fd-5839bb2f38f3/1 to
c0991eac-e107-4075-961b-96eb5cbcb2d0
WARN [qtp162108213-761] 2014-11-16 11:04:06,289 Spawn.java (line 2164)
[task.stop] stopping migrating 1b595273-e742-4269-80fd-5839bb2f38f3/1

WARN [Timer-3] 2014-11-15 23:44:14,394 Spawn.java (line 3634)
Migrating 1b595273-e742-4269-80fd-5839bb2f38f3/2 to
4f1bff93-c8a9-4d02-9fd2-c07e97dd141d
WARN [qtp162108213-756] 2014-11-16 11:08:11,418 Spawn.java (line 2164)
[task.stop] stopping migrating 1b595273-e742-4269-80fd-5839bb2f38f3/2
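
For reference, the elapsed time between each "Migrating" line and its matching "[task.stop]" line can be computed from the timestamps. The following is only a minimal sketch: it assumes the timestamp layout seen in the excerpt above, and the class and method names are invented for illustration (they are not part of Hydra). For the first pair it comes to a little under ten hours.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hydra: measures the gap between two spawn.log lines.
final class MigrationDuration {
    // Timestamp layout as it appears in the excerpt, e.g. "2014-11-16 01:21:20,695".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}");

    // Extracts the first timestamp found in a log line.
    static LocalDateTime timestampOf(String logLine) {
        Matcher m = TIMESTAMP.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no timestamp in: " + logLine);
        }
        return LocalDateTime.parse(m.group(), FORMAT);
    }

    // Elapsed time from the "Migrating" line to the "[task.stop]" line.
    static Duration between(String startLine, String stopLine) {
        return Duration.between(timestampOf(startLine), timestampOf(stopLine));
    }

    public static void main(String[] args) {
        String start = "WARN [Timer-3] 2014-11-16 01:21:20,695 Spawn.java (line 3634)";
        String stop = "WARN [qtp162108213-766] 2014-11-16 11:08:15,596 Spawn.java (line 2164)";
        Duration d = between(start, stop);
        System.out.printf("%dh %dm %ds%n", d.toHours(), d.toMinutesPart(), d.toSecondsPart());
    }
}
```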

--
Albert Law
Lead Engineer - Data Acquisition
NewBrand
http://www.newbrandanalytics.com/

Ian Barfield

Nov 17, 2014, 1:00:51 PM

to Albert Law, hydr...@googlegroups.com, Al

+ al in case he knows off the top of his head if this was a problem at some point

In general, I feel not great about interacting with jobs in any state other than IDLE or RUNNING. It sounds like a problem with migration if it is taking that long. Migration is only enabled for sufficiently small amounts of data to begin with.

It is a nice throughput optimization for running job tasks, but for your smaller clusters, and especially if it is causing problems, you might try simply disabling it via the system property, à la:

`task.migration.enable=false`

Al Staples-Moore

Nov 17, 2014, 1:10:24 PM

to Ian Barfield, Albert Law, hydr...@googlegroups.com

I'll preface this by saying that we've had migration turned off for our data centers for quite some time. This feature has not been tested in a while. It is probably a good idea to disable it via that system property.

Either 'stop' or 'kill' should stop the migration process. If you 'stop', then I believe it will try to migrate again the next time the task is run, so 'kill' may be preferable. In any case, you are not in danger of losing data by interacting with a migrating task.

Albert Law

Nov 17, 2014, 1:11:08 PM

to Al Staples-Moore, Ian Barfield, hydr...@googlegroups.com

Hi All,

Ah, turning off migration now. Though, doesn't it perform important
work to keep the cluster "healthy"? Should we be doing something else
if we are no longer migrating automatically?

Stewart Allen

Nov 17, 2014, 1:17:16 PM

to Albert Law, Al Staples-Moore, Ian Barfield, hydr...@googlegroups.com

I thought the migration failures (and related replication failures) were due to ssh box-to-box setup issues, no?

Al Staples-Moore

Nov 17, 2014, 1:20:02 PM

to Albert Law, Ian Barfield, hydr...@googlegroups.com

There are two separate processes to discuss here, rebalancing and migrating.

Rebalancing runs periodically on idle jobs. It moves tasks off of
heavily-used hosts, and balances job tasks between the minions in the
cluster. We use rebalancing in our clusters; it is well maintained and
does a reasonably good job. Rebalancing has some configurable
parameters (bytes to move, frequency, etc.) but I think you guys have a
very strange setup with no Spawn v2 UI running (?) so you would need to
edit your datastore directly to make changes. Note that, although
SpawnBalancerConfig appears to read system properties, it actually uses
the version in the SpawnDataStore if it exists. I think the defaults
will do a decent job, though.

Migration was a very aggressive form of rebalancing that I tried out
some time ago. The idea was to move any queued task to an available host
once it had waited for a while on the existing hosts. It had a
significant impact on the query system, and didn't seem to help that
much for cluster setup, so we essentially abandoned it. The code should
probably just be culled, but for now the system property will disable it.

Albert Law

Nov 17, 2014, 1:20:55 PM

to Stewart Allen, Al Staples-Moore, Ian Barfield, hydr...@googlegroups.com

Hi Stewart,

I don't think that is the case. I have manually checked all the SSH
box-to-box setups and can't find anything wrong. That said, I am also
trying to confirm/deny that by:

1) disabling GSSAPI auth-- perhaps this is getting in the way

2) creating some scripts that will constantly ssh into all the other
boxes to time the authentication wall-time-- perhaps that is taking
too long
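
A timing check along the lines of item 2 might look like the sketch below. It is only an illustration under stated assumptions: the class, the `Result` type, and the method names are hypothetical, the host names come from the command line, and key-based login is assumed to be configured already. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options; the measured wall time covers connection setup, authentication, and running `true` on the remote side.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: times a non-interactive ssh login to each host given on the command line.
final class SshAuthTimer {
    // Outcome of one timed run; exitCode is -1 if the command hit the timeout.
    static final class Result {
        final long millis;
        final int exitCode;

        Result(long millis, int exitCode) {
            this.millis = millis;
            this.exitCode = exitCode;
        }
    }

    // BatchMode=yes makes ssh fail instead of prompting for a password;
    // the remote command "true" exits immediately once login succeeds.
    static List<String> sshCommand(String host) {
        return List.of("ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "true");
    }

    // Runs the command, discards its output, and reports wall time and exit status.
    static Result time(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD);
        long start = System.nanoTime();
        Process p = pb.start();
        boolean finished = p.waitFor(timeoutSeconds, TimeUnit.SECONDS);
        long millis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (!finished) {
            p.destroyForcibly();
            return new Result(millis, -1);
        }
        return new Result(millis, p.exitValue());
    }

    public static void main(String[] args) throws Exception {
        for (String host : args) {
            Result r = time(sshCommand(host), 30);
            System.out.println(host + ": " + r.millis + " ms, exit " + r.exitCode);
        }
    }
}
```

To compare against item 1, the same command could be run with `-o GSSAPIAuthentication=no` added, which is also a standard OpenSSH client option.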

Albert Law

Nov 17, 2014, 1:26:27 PM

to Ian Barfield, hydr...@googlegroups.com, Al

Hi Ian,

Just checking, this is a spawn cmdline parameter, right?

Ian Barfield

Nov 17, 2014, 1:30:10 PM

to Albert Law, hydr...@googlegroups.com, Al

It is a system property. So however you guys do that should work fine. As a command line parameter to the `java` command it would look like `-Dtask.migration.enable=false`.
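
To illustrate how a `-D` flag reaches code inside the JVM, here is a minimal sketch. It is not Hydra's actual code: the class name is made up, and the default of `true` when the property is unset is an assumption made only for this example. Only the property name `task.migration.enable` comes from this thread.

```java
// Hypothetical example, not Hydra code: shows how a JVM system property set with -D is read.
final class MigrationFlag {
    // Property name as given in this thread.
    static final String KEY = "task.migration.enable";

    // Returns the flag value; falls back to "true" (an assumed default) when the property is unset.
    static boolean migrationEnabled() {
        return Boolean.parseBoolean(System.getProperty(KEY, "true"));
    }

    public static void main(String[] args) {
        System.out.println(KEY + " = " + migrationEnabled());
    }
}
```

Launching it as `java -Dtask.migration.enable=false MigrationFlag` makes `System.getProperty` return `"false"`. The `-D` option has to appear before the main class (or `-jar`) on the command line; anything after that is passed to the program as an ordinary argument instead.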
