push.ha not working without azkaban?


David Ongaro

Mar 2, 2017, 3:25:57 PM
to project-...@googlegroups.com
We configured one of our clusters with the push.ha properties to see how this works out. We set it up to allow a swap when the fetch fails on only one node, because our stores have a replication factor of two. Today we hit a case where one node was unreachable during the fetch, but when the push attempted to swap anyway, it failed like this:

2017-03-02 19:23:16,732 INFO [pool-9-thread-1] voldemort.store.readonly.swapper.AdminStoreSwapper|tcp://redacted.net:6666: tcp://redacted.net:6666 : About to attempt: DisableStoreOnFailedNodeFailedFetchStrategy
2017-03-02 19:23:16,934 ERROR [pool-9-thread-1] voldemort.client.protocol.admin.AdminClient: Node redacted.net:6666 [id 0] returned failed HandleFetchFailureResponse: Got exception while trying to execute pushHighAvailability. 
voldemort.utils.UndefinedPropertyException: Missing required property 'azkaban.flow.flowid'.
	at voldemort.utils.Props.getString(Props.java:214)
	at voldemort.store.readonly.swapper.HdfsFailedFetchLock.<init>(HdfsFailedFetchLock.java:85)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:116)
	at voldemort.utils.ReflectUtils.callConstructor(ReflectUtils.java:103)
	at voldemort.server.protocol.admin.AdminServiceRequestHandler.handleFetchFailure(AdminServiceRequestHandler.java:2097)
	at voldemort.server.protocol.admin.AdminServiceRequestHandler.handleRequest(AdminServiceRequestHandler.java:346)
	at voldemort.server.niosocket.AsyncRequestHandler.read(AsyncRequestHandler.java:192)
	at voldemort.common.nio.SelectorManagerWorker.run(SelectorManagerWorker.java:105)
	at voldemort.common.nio.AbstractSelectorManager.run(AbstractSelectorManager.java:243)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

It seems there is some heavy reflection going on during class initialization of HdfsFailedFetchLock. We don’t have any of these Azkaban properties configured, since we are not using Azkaban on our cluster; we call the VoldemortBuildAndPushJobRunner wrapper instead (or rather a customized version of it). Of course, we could still set some dummy values for these properties if that would help. So the question is whether this would actually help and which values we should set them to. There seem to be three different properties defined in HdfsFailedFetchLock:

private final static String AZKABAN_FLOW_ID = "azkaban.flow.flowid";
private final static String AZKABAN_JOB_ID = "azkaban.job.id";
private final static String AZKABAN_EXEC_ID = "azkaban.flow.execid";

Would setting these be enough or is there more to consider?

Thanks,

David

Felix GV

Mar 2, 2017, 4:24:08 PM
to project-...@googlegroups.com
Hi David,

Sorry about that. I wrote that piece of code and I admit that depending on Azkaban properties was a bit lazy. The main goal is just to create a unique file name so that multiple jobs can coordinate with each other and wait for the lock to become available without stepping on each other's toes. Using these properties also provides some basic traceability after the fact, though that's arguably not really necessary.

Passing in default values would not help much if you're statically always passing the same ones.

I think a simple code change to fix this would be to check the existence of these properties, and fall back on some (sufficiently large) random number or string instead if they're unavailable.
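A minimal sketch of such a fallback, assuming plain `java.util.Properties` in place of Voldemort's own `Props` class. The class `LockIdFallback` and the method `getOrRandom` are hypothetical names used for illustration only; they are not part of the Voldemort code base.

```java
import java.util.Properties;
import java.util.UUID;

// Hypothetical helper, not part of Voldemort: returns a job-supplied
// identifier if one is present, otherwise a random one.
final class LockIdFallback {

    private LockIdFallback() {}

    // Returns the value of `key` if it is set and non-blank. Otherwise
    // returns a random UUID string, so that concurrent jobs still end up
    // with distinct lock file names.
    static String getOrRandom(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            return UUID.randomUUID().toString();
        }
        return value;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("azkaban.job.id", "my-job");

        // Present: the configured value is used as-is.
        System.out.println(getOrRandom(props, "azkaban.job.id"));
        // Absent: a random identifier is generated instead.
        System.out.println(getOrRandom(props, "azkaban.flow.flowid"));
    }
}
```

A random fallback keeps the names unique but gives up the traceability mentioned above, since the generated name no longer points back to a particular job.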

What do you think?

--
Felix GV
Staff Software Engineer
Data Infrastructure
LinkedIn
 
f...@linkedin.com
linkedin.com/in/felixgv


--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-voldemort+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/project-voldemort.
For more options, visit https://groups.google.com/d/optout.

David Ongaro

Mar 2, 2017, 4:55:38 PM
to project-...@googlegroups.com
I thought something like that, so I didn’t mean to imply that we would just use static default values. We could use the Oozie workflow id, the Hadoop job id, or the PID of the BnP, for example. But then the question arises whether we can use that for all three properties and whether we have to adhere to any syntactic constraints.

Patching the code seems to be quite involved, since I think that would imply deploying a new server version (it seems to fail on the server side, not on the push job side). Also, since the problem seems to happen during static initialization, I’m not sure how to insert a check there. I guess the main culprit is this line in FailedFetchLock?

// Pass both server properties and the remote job's properties to the FailedFetchLock constructor
Object[] failedFetchLockParams = new Object[]{config, remoteJobProps};
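
One detail worth noting: the stack trace at the top of the thread shows the exception in the `HdfsFailedFetchLock.<init>` frame, which is how Java labels an instance constructor (a static initializer would appear as `<clinit>`), reached through `Constructor.newInstance`. The following is a simplified sketch of that pattern and not the Voldemort code: `DemoLock` and `ReflectiveConstructionDemo` are hypothetical stand-ins, and `java.util.Properties` replaces Voldemort's `Props`.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.util.Properties;

// Hypothetical stand-in for a lock class whose constructor requires a property.
final class DemoLock {
    final String flowId;

    DemoLock(Properties serverConfig, Properties remoteJobProps) {
        String value = remoteJobProps.getProperty("azkaban.flow.flowid");
        if (value == null) {
            // Thrown from the constructor, so a stack trace lists it as DemoLock.<init>.
            throw new IllegalArgumentException("Missing required property 'azkaban.flow.flowid'.");
        }
        this.flowId = value;
    }
}

final class ReflectiveConstructionDemo {

    private ReflectiveConstructionDemo() {}

    // Instantiates DemoLock reflectively. Returns null on success, or the
    // message of the exception thrown by the constructor on failure.
    static String tryCreate(Properties serverConfig, Properties remoteJobProps) {
        try {
            Constructor<DemoLock> ctor =
                    DemoLock.class.getDeclaredConstructor(Properties.class, Properties.class);
            ctor.newInstance(serverConfig, remoteJobProps);
            return null;
        } catch (InvocationTargetException e) {
            // Reflection wraps whatever the constructor threw; unwrap it here.
            return e.getCause().getMessage();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Properties server = new Properties();
        Properties job = new Properties();
        System.out.println(tryCreate(server, job)); // property missing

        job.setProperty("azkaban.flow.flowid", "flow-1");
        System.out.println(tryCreate(server, job)); // property supplied
    }
}
```

In this sketch, supplying the property in `remoteJobProps` is enough for construction to succeed, with no change to the class being instantiated.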

Thanks,

David



Felix GV

Mar 2, 2017, 5:34:59 PM
to project-...@googlegroups.com
Good points.

If you can pass the Hadoop job ID, I believe that'll be unique enough to suit the intended purpose. Feel free to also pass the rest if you'd like.




David Ongaro

Mar 6, 2017, 3:31:39 PM
to project-...@googlegroups.com
To bring closure to this: providing customized values for azkaban.flow.flowid, azkaban.job.id, and azkaban.flow.execid is enough to get push.ha working when you don’t use Azkaban. For Oozie, for example, we use something like this:

    <action name="pushStore">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJobRunner</main-class>
            <arg>${pushConfiguration}</arg>
            <arg>build.output.dir=web${nameNode}${voldemortStoreArchiveDir}${hdfsStorePath}</arg>
            <!-- Inject some dummy azkaban configuration to make push.ha work -->
            <arg>azkaban.flow.flowid=${pushConfiguration}</arg>
            <arg>azkaban.job.id=${wf:id()}</arg>
            <arg>azkaban.flow.execid=${wf:run()}</arg>
            <file>${pushConfiguration}</file>
        </java>
        <ok to="end"/>
        <error to="sendPushFailureEmail"/>
    </action>

Note that we only do a push here since we still do the build separately.

I can provide a PR for the changes we needed to make to VoldemortBuildAndPushJobRunner in order to provide dynamic values for build.output.dir and the Azkaban properties.
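
As an illustration of what passing such dynamic values can amount to, here is a minimal sketch that overlays `key=value` command-line arguments on a base set of properties, in the spirit of the `<arg>` entries above. It is an assumption about the general approach, not the actual `VoldemortBuildAndPushJobRunner` change. The class `ArgOverrides`, the method `apply`, and the key `example.key` are hypothetical names.

```java
import java.util.Properties;

// Hypothetical helper, not part of Voldemort.
final class ArgOverrides {

    private ArgOverrides() {}

    // Copies `base` and applies every argument of the form key=value on top.
    // Arguments without a key before '=' (for example a bare path to a
    // properties file) are skipped.
    static Properties apply(Properties base, String... args) {
        Properties merged = new Properties();
        merged.putAll(base);
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                continue; // not a key=value argument
            }
            // Split at the first '=' only, so values may themselves contain '='.
            String key = arg.substring(0, eq).trim();
            String value = arg.substring(eq + 1).trim();
            merged.setProperty(key, value);
        }
        return merged;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("example.key", "from-file");

        Properties merged = apply(base,
                "push.properties",            // skipped: no '='
                "azkaban.job.id=workflow-1",  // added
                "example.key=from-arg");      // overrides the base value

        for (String name : merged.stringPropertyNames()) {
            System.out.println(name + "=" + merged.getProperty(name));
        }
    }
}
```

Because later arguments win, a scheduler-provided value such as a workflow id can replace whatever the static properties file contains for the same key.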

Thanks,

David