Cannot Kick?


Albert Law

Nov 19, 2014, 10:33:15 AM
to hydr...@googlegroups.com
Hi All,

Does the following imply that the connection to that host machine's minion was lost? I'm trying to figure it out because, as far as I can tell, the machine was up at the specified time.

<from spawn.log>
WARN [Timer-3] 2014-11-19 06:41:59,566 Spawn.java (line 3549)
[taskQueuesByPriority]
cannot kick a7509e05-aac4-40cb-98ab-8ef891eaafce/5 because one or more
of its hosts is down or scheduled to be failed:
[
HostState{
type=STATUS_HOST_INFO,
uuid=4f1bff93-c8a9-4d02-9fd2-c07e97dd141d,
last-update-time=1416400851653,
host=hydra02,
port=7070,
group=local,
time=1416400851649,
uptime=729078,
used=com.addthis.hydra.job.mq.HostCapacity@3aaaf13b,
user=hydra,
path=/home/hydra/hydra/minion,
max=com.addthis.hydra.job.mq.HostCapacity@2cc37694,
up=false,
dead=false,
readOnly=false,
diskReadOnly=false
}
]

Thanks!


--
Albert Law
Lead Engineer - Data Acquisition
NewBrand
http://www.newbrandanalytics.com/

Ian Barfield

Nov 19, 2014, 11:00:50 AM
to Albert Law, hydr...@googlegroups.com
I am guessing you added newlines to that log line. I feel very foolish for having checked guava's source for some logic, previously unknown to me, that inserts newline separators after some size threshold. That would have been interesting, though.

The relevant code (in master, but the logic is likely similar) is:


and


So that will occur if a minion is in any of the following states (a rough sketch follows the list):

- down (process down, process cannot talk to zk, etc.)
- failed but not dropped
- disk is read only
- disabled (e.g. via the spawn ui)
- disk is full
- in the middle of being failed, and the failure was not specifically "fs-okay"
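
For illustration, here is a minimal standalone sketch of that availability check, assuming one boolean flag per condition; it is not the actual Spawn.java logic, and every name in it (Host, canKickOn, the flag fields) is hypothetical:

// Hypothetical sketch only -- not the actual Hydra code.
public class HostCheckSketch {

    /** Toy stand-in for Hydra's HostState; the field names are invented. */
    static class Host {
        boolean up;            // minion process alive and able to talk to zk
        boolean dead;          // failed but not yet dropped
        boolean diskReadOnly;  // filesystem mounted read-only
        boolean disabled;      // e.g. disabled via the spawn ui
        boolean diskFull;      // no space left for task data
        boolean beingFailed;   // a host failure is in progress
        boolean failFsOkay;    // ...and that failure is the "fs-okay" kind
    }

    /** A task can be kicked only if none of the conditions above apply. */
    static boolean canKickOn(Host h) {
        return h.up
                && !h.dead
                && !h.diskReadOnly
                && !h.disabled
                && !h.diskFull
                && !(h.beingFailed && !h.failFsOkay);
    }

    public static void main(String[] args) {
        Host h = new Host();
        h.up = false;  // mirrors up=false in the HostState dump from the first message
        System.out.println(canKickOn(h)
                ? "kickable"
                : "cannot kick: host is down or scheduled to be failed");
    }
}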


Albert Law

Nov 19, 2014, 11:53:23 AM
to Ian Barfield, hydr...@googlegroups.com
Hi Ian,

Is it unusual to see this type of error? I see it at least twice a day.

Ian Barfield

Nov 19, 2014, 12:04:06 PM
to Albert Law, hydr...@googlegroups.com
A cursory search of our spawn logs shows that it occurs reasonably often, presumably due to transient disk-full issues or a server problem. It could also be some weird issue in the older version, but it is a transient complaint that would probably get printed a lot more often if it were causing any real delay. I wouldn't worry about it at that rate of occurrence.

Albert Law

Nov 19, 2014, 12:09:04 PM
to Ian Barfield, hydr...@googlegroups.com
Hi Ian,

Ah, okay. The problem is that it will set the associated job to ERROR
and disable it. Is there a way to set spawn to just try again in a
bit?

Ian Barfield

Nov 19, 2014, 12:24:14 PM
to Albert Law, hydr...@googlegroups.com
Hm. Maybe that is a feature in newer versions?

Albert Law

Nov 19, 2014, 1:20:16 PM
to hydr...@googlegroups.com
Hi All,

Just to clarify: if spawn tries to kick a task for a job and receives a "down or scheduled to be failed" error, is it expected, designed behaviour that spawn will not retry that kick and will instead set the job to an ERROR+disabled state? In that case, the Hydra job will stay ERROR+disabled forever.

Is that correct?

PS: I actually inserted the newlines into the log output pasted in the original message so it would be more readable in email.

Ian Barfield

Nov 19, 2014, 2:26:05 PM
to Albert Law, hydr...@googlegroups.com
I dug through the code a bit (although, again, in master). It looks like it should only error when a replica's host was failed and the normal recovery/migration also somehow failed -- and even then, the only path I could find where that would be detected and result in an error state is during rebalancing.

I ran some experiments on our 4-minion test cluster by variously disabling/killing minions, kicking jobs, and forcing rebalancing on them. Everything worked fine until I tried disabling three of the minions at once and forcing a rebalance. I expected that, since it would be unable to find enough available hosts to move replicas to (one enabled minion < two requested replicas), it would finally reproduce your problem.

Interestingly, the job _did_ error, but due to an (unintentional) NPE in spawn. The job also entered a kind of weird state where enabling the minions caused the tasks to transition from IDLE immediately to REBALANCING. It seems similar to another bug I saw described in a commit from 4.2.12 (https://github.com/addthis/hydra/commit/171b8628acb6ff361216f4a8c248bcfa66e46f83). I am not sure what the behavior will be after the bug is fixed, but it is certainly possible your problem stems from either of these two NPEs... although I expect you would probably see something to that effect in the log.

If you have a replica count that is close to the number of minions in your cluster, I can imagine how transient minion issues could easily manifest as job errors with a much higher frequency than they do for us. You might be able to work around that by adjusting either of those two numbers or disabling rebalancing (a rough sketch of the host-count arithmetic follows below), but those aren't the best options.
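
To illustrate the host-count constraint from the rebalance experiment above, here is a minimal standalone sketch, with the assumption that each requested replica needs a distinct available minion; all names here are hypothetical and this is not Hydra's placement code:

// Hypothetical sketch only -- not Hydra's placement logic.
public class ReplicaFeasibilitySketch {

    // Assumption: each requested replica must land on a distinct available minion,
    // mirroring the "one enabled minion < two requested replicas" comparison above.
    static boolean enoughHostsForReplicas(int availableMinions, int requestedReplicas) {
        return availableMinions >= requestedReplicas;
    }

    public static void main(String[] args) {
        // The test-cluster scenario: three of four minions disabled, two replicas requested.
        System.out.println(enoughHostsForReplicas(1, 2)); // false -> rebalance cannot place replicas
        // Possible workarounds: re-enable/add minions, or lower the replica count.
        System.out.println(enoughHostsForReplicas(4, 2)); // true
        System.out.println(enoughHostsForReplicas(1, 1)); // true
    }
}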