Is CDAP Master service stateless?


Elkhan

Sep 27, 2016, 8:01:43 PM
to CDAP User
Hi,

Is the CDAP Master service stateless?

Does the new master service try to rerun the services that the dead master service was running before it was killed?

Result of my experiment: if the CDAP Master service is killed while jobs are running, all those jobs fail after the 2nd master service becomes the new main master service.

I did this experiment:

- Deployed two instances of the CDAP master service (and all other services) on 2 different nodes.

- Started the PurchaseHistory app from the UI on the 1st node, where the main master service was running.

- Killed the main master service (kill -9 MASTER_SERVICE1_PID).

- It took around 45 seconds for the 2nd master service to become the new main master service.

- But the Purchase job that I launched failed.

Here are the logs:


Logs of the 2nd master service after it became the new main master service:

2016-09-27 23:50:48,710 - INFO  [pool-6-thread-1:c.c.c.i.a.r.d.AbstractTwillProgramController$1@65] - Twill program running: program:elkhan.PurchaseHistory.flow.PurchaseFlow 9da7c6fa-671f-436f-90af-796ca5687a92
2016-09-27 23:50:50,360 - WARN  [pool-6-thread-1:c.c.c.i.a.s.ProgramLifecycleService@692] - Found 1 RunRecords with RUNNING status but the program is not actually running for program type Workflow
2016-09-27 23:50:50,610 - WARN  [pool-6-thread-1:c.c.c.i.a.s.ProgramLifecycleService@704] - Fixing RunRecord a5ed1a61-850c-11e6-b223-14feb5d7ab12 in program program:elkhan.PurchaseHistory.workflow.PurchaseHistoryWorkflow of type Workflow with RUNNING status but the program is not running
2016-09-27 23:50:50,851 - INFO  [pool-6-thread-1:c.c.c.i.a.r.d.AbstractTwillProgramController$1@65] - Twill program running: program:elkhan.PurchaseHistory.service.CatalogLookup 5ace917d-975d-4ac0-ac96-e4bd447edd7c
2016-09-27 23:50:50,858 - INFO  [pool-6-thread-1:c.c.c.i.a.r.d.AbstractTwillProgramController$1@65] - Twill program running: program:elkhan.PurchaseHistory.service.UserProfileService e254a327-c507-4b5a-ad9f-911f2fbfaf32
2016-09-27 23:50:50,864 - INFO  [pool-6-thread-1:c.c.c.i.a.r.d.AbstractTwillProgramController$1@65] - Twill program running: program:elkhan.PurchaseHistory.service.PurchaseHistoryService 176b8f07-1923-4531-b5b8-88915ffc7bfb
2016-09-27 23:50:51,893 - INFO  [pool-6-thread-1:c.c.c.i.a.s.ProgramLifecycleService@624] - Corrected 1 of run records with RUNNING status but no actual program running.

Logs of the 1st main master service, which was killed:
2016-09-27 23:46:46,256 - INFO  [CommonNettyHttpServiceBuilder-executor-7090:c.c.c.i.a.r.d.AbstractDistributedProgramRunner@178] - Setting scheduler queue to default
2016-09-27 23:46:46,373 - INFO  [CommonNettyHttpServiceBuilder-executor-7090:c.c.c.i.a.r.s.SparkUtils@133] - Located Spark Assembly JAR in /../..-hadoop2.6.0-...-..-1.3.1.jar
2016-09-27 23:46:46,391 - INFO  [CommonNettyHttpServiceBuilder-executor-7090:c.c.c.i.a.r.d.DistributedWorkflowProgramRunner@131] - Launching distributed workflow: PurchaseHistoryWorkflow:PurchaseHistoryWorkflow
2016-09-27 23:46:46,412 - INFO  [CommonNettyHttpServiceBuilder-executor-7090:c.c.c.i.a.r.d.AbstractDistributedProgramRunner$1$1@228] - Starting program:elkhan.PurchaseHistory.workflow.PurchaseHistoryWorkflow with debugging enabled: false, programOptions: {"name":"PurchaseHistoryWorkflow","arguments":{"apps.scheduler.queue":"default","runId":"a5ed1a61-850c-11e6-b223-14feb5d7ab12"},"userArguments":{"logical.start.time":"1475020005099"},"debug":false}, and logback: file:/var/tmp/cdap/data/tmp/1475020006256-0/logback.xml
2016-09-27 23:46:46,413 - INFO  [CommonNettyHttpServiceBuilder-executor-7090:c.c.c.i.a.r.d.AbstractDistributedProgramRunner$1$1@232] - Setting scheduler queue for app program:elkhan.PurchaseHistory.workflow.PurchaseHistoryWorkflow as default
2016-09-27 23:46:50,650 - WARN  [Thread-18136:o.a.h.c.Configuration@2670] - mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-09-27 23:46:50,711 - WARN  [Thread-18136:o.a.h.c.Configuration@2670] - mapred-site.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.


Shankar Selvam

Sep 27, 2016, 8:54:22 PM
to cdap...@googlegroups.com
Hi Elkhan,

From the 2nd master's logs, it seems the flows and services are running fine.
The MR job launched from the workflow might have failed, and the workflow would have completed before the 2nd master service came up. Is that the behavior you are seeing?

If you look into the MR job's logs, you can see the actual reason for the MR failure.

We maintain the state of programs in the "app.meta" table, and the first master would have recorded the workflow's state as RUNNING.

There is a RunRecordCorrector thread in the master that checks the recorded state of each program and verifies whether the program is actually running; if it is not, it fixes the run record for that program.

From your logs, I noticed that the RunRecordCorrector fixed the state for the workflow, which is the expected behavior if the workflow has already completed.
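The correction loop described above can be sketched roughly like this. This is an illustrative Java sketch only, not the actual CDAP RunRecordCorrector; the RunRecord shape and the liveness check are assumptions for this example.

```java
// Illustrative sketch of a run-record corrector -- NOT actual CDAP code.
// The RunRecord shape and the liveness check are assumptions.
import java.util.List;

public class RunRecordCorrectorSketch {

    enum Status { RUNNING, FAILED }

    static class RunRecord {
        final String runId;
        Status status;
        RunRecord(String runId, Status status) { this.runId = runId; this.status = status; }
    }

    // Stand-in for asking the cluster (e.g. Twill/YARN) whether the run is live.
    static boolean isActuallyRunning(RunRecord record, List<String> liveRunIds) {
        return liveRunIds.contains(record.runId);
    }

    // For every record stored as RUNNING whose program is not actually live,
    // flip the stored status, mirroring the "Fixing RunRecord ..." log lines.
    static int correct(List<RunRecord> records, List<String> liveRunIds) {
        int fixed = 0;
        for (RunRecord r : records) {
            if (r.status == Status.RUNNING && !isActuallyRunning(r, liveRunIds)) {
                r.status = Status.FAILED;
                fixed++;
            }
        }
        return fixed;
    }

    public static void main(String[] args) {
        List<RunRecord> records = List.of(
            new RunRecord("a5ed1a61", Status.RUNNING),  // workflow: no longer live
            new RunRecord("9da7c6fa", Status.RUNNING)); // flow: still live
        int fixed = correct(records, List.of("9da7c6fa"));
        System.out.println("Corrected " + fixed + " run records");
    }
}
```

The key point is that the stored state ("app.meta") and the actual cluster state can diverge after a master crash, and the new leader reconciles them on startup.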

Thanks
Shankar

--
You received this message because you are subscribed to the Google Groups "CDAP User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cdap-user+unsubscribe@googlegroups.com.
To post to this group, send email to cdap...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cdap-user/b0e1f160-f19b-490b-a15f-43911a9f126f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Elkhan

Sep 27, 2016, 9:19:02 PM
to CDAP User
Hi Shankar,

- No, the workflow did not complete; I see the workflow for that run fail after the 2nd master takes the lead.
- I ran a slightly modified experiment: I waited until the map task had started, then killed the master service (kill -9). Both the map and workflow jobs failed after the 2nd master took leadership.

Reason for the MR job failure in the logs:

09/27/2016 18:04:23 ERROR MapReduceRunner-PurchaseHistoryBuilder:co.cask.cdap.internal.app.runtime.ProgramControllerServiceAdapter$1#81 - Program terminated with exception
java.lang.IllegalStateException: MapReduce execution failure: job-id: job_1473728860156_5643, uber-mode: false, map-progress: 1.0, reduce-progress: 1.0, cleanup-progress: 1.0, setup-progress: 1.0, runstate: FAILED, start-time: 1475024609257, user-name: XXX, priority: NORMAL, scheduling-info: NA, num-used-slots: 0, num-reserved-slots: 0, used-mem: 0, reserved-mem: 0, needed-mem: 0
	at com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[com.google.guava.guava-13.0.1.jar:na]
	at co.cask.cdap.internal.app.runtime.batch.MapReduceRuntimeService.run(MapReduceRuntimeService.java:363) ~[co.cask.cdap.cdap-app-fabric-3.5.1.jar:na]
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52) ~[com.google.guava.guava-13.0.1.jar:na]
	at co.cask.cdap.internal.app.runtime.batch.MapReduceRuntimeService$1$1.run(MapReduceRuntimeService.java:422) [co.cask.cdap.cdap-app-fabric-3.5.1.jar:na]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]


09/27/2016 18:04:29 ERROR WorkflowDriver:co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver#308 - Failed to destroy the Workflow program_run:elkhan.PurchaseHistory.workflow.PurchaseHistoryWorkflow.1453d841-8517-11e6-a8e8-14feb5d9f859
java.lang.RuntimeException: Error making request to remote.system.operation service at <url>:60295/v1/execute/updateWorkflowToken while doing POST with headers  and body [{"type":"co.cask.cdap.proto.id.ProgramRunId","value":{"namespace":"elkhan","application":"PurchaseHistory","type":"Workflow","program":"PurchaseHistoryWorkflow","run":"1453d841-8517-11e6-a8e8-14feb5d9f859","entity":"PROGRAM_RUN"}},{"type":"co.cask.cdap.internal.app.runtime.workflow.BasicWorkflowToken","value":{"tokenValueMap":{"USER":{},"SYSTEM":{}},"maxSizeBytes":31457280,"nodeName":"PurchaseHistoryWorkflow","putAllowed":true,"bytesLeft":31457280}}]. Response: Response code: 503, message: 'Service Unavailable', body: 'Service 'DatasetService' is not available. Please wait till it is up and running.'.
	at co.cask.cdap.common.internal.remote.RemoteOpsClient.doRequest(RemoteOpsClient.java:127) ~[co.cask.cdap.cdap-common-3.5.1.jar:na]
	at co.cask.cdap.common.internal.remote.RemoteOpsClient.executeRequest(RemoteOpsClient.java:87) ~[co.cask.cdap.cdap-common-3.5.1.jar:na]
	at co.cask.cdap.common.internal.remote.RemoteOpsClient.executeRequest(RemoteOpsClient.java:83) ~[co.cask.cdap.cdap-common-3.5.1.jar:na]
	at co.cask.cdap.internal.app.store.remote.RemoteRuntimeStore.updateWorkflowToken(RemoteRuntimeStore.java:82) ~[co.cask.cdap.cdap-app-fabric-3.5.1.jar:na]
	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.destroyWorkflow(WorkflowDriver.java:299) [co.cask.cdap.cdap-app-fabric-3.5.1.jar:na]
	at co.cask.cdap.internal.app.runtime.workflow.WorkflowDriver.shutDown(WorkflowDriver.java:280) [co.cask.cdap.cdap-app-fabric-3.5.1.jar:na]
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:55) [com.google.guava.guava-13.0.1.jar:na]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]


09/27/2016 18:04:29 ERROR pcontroller-program:elkhan.PurchaseHistory.workflow.PurchaseHistoryWorkflow-1453d841-8517-11e6-a8e8-14feb5d9f859:co.cask.cdap.internal.app.runtime.distributed.AbstractProgramTwillRunnable$1#299 - Program runner error out.
java.lang.IllegalStateException: MapReduce execution failure: job-id: job_1473728860156_5643, uber-mode: false, map-progress: 1.0, reduce-progress: 1.0, cleanup-progress: 1.0, setup-progress: 1.0, runstate: FAILED, start-time: 1475024609257, user-name: XXX, priority: NORMAL, scheduling-info: NA, num-used-slots: 0, num-reserved-slots: 0, used-mem: 0, reserved-mem: 0, needed-mem: 0
	at com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[com.google.guava.guava-13.0.1.jar:na]
	at co.cask.cdap.internal.app.runtime.batch.MapReduceRuntimeService.run(MapReduceRuntimeService.java:363) ~[co.cask.cdap.cdap-app-fabric-3.5.1.jar:na]
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52) ~[com.google.guava.guava-13.0.1.jar:na]
	at co.cask.cdap.internal.app.runtime.batch.MapReduceRuntimeService$1$1.run(MapReduceRuntimeService.java:422) ~[co.cask.cdap.cdap-app-fabric-3.5.1.jar:na]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]

Shankar Selvam

Sep 27, 2016, 9:40:58 PM
to cdap...@googlegroups.com
Hi Elkhan,

By "completed" I meant "not running"; yes, it could have failed as well.

About the MapReduce failure: the MR logs just indicate that the MR job status is FAILED while the expected status is SUCCEEDED, which isn't useful for debugging. If you look above those log lines, there should be the actual reason for the failure.

The workflow logs indicate that the workflow is trying to update the workflow token, which requires talking to the Dataset Service. The Dataset Service runs as part of the master service, and since it is down, the request fails, causing the exception.

It's possible the MR job is also failing due to a similar exception, but it would be good to confirm that.

Flows and services would also hit these exceptions, but since those programs run continuously until stopped, they just log the exception and continue running.
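The contrast described above, batch programs failing fast while flows and services log the error and keep going, can be sketched as follows. This is illustrative only, not CDAP code.

```java
// Illustrative contrast only (not CDAP code): a one-shot batch step propagates
// its failure, while a long-running loop logs the error and keeps going.
import java.util.ArrayList;
import java.util.List;

public class FailureHandlingSketch {

    // Batch-style (MapReduce/Workflow): any exception terminates the run.
    static void runBatchStep(Runnable step) {
        step.run(); // exception propagates -> run ends up FAILED
    }

    // Flow/service-style: log the exception and move on to the next event.
    static List<String> runLongRunning(List<Runnable> events) {
        List<String> log = new ArrayList<>();
        for (Runnable event : events) {
            try {
                event.run();
                log.add("ok");
            } catch (RuntimeException e) {
                log.add("logged: " + e.getMessage()); // survive transient outages
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // A transient "DatasetService unavailable" error in the middle:
        List<Runnable> events = List.of(
            () -> { },
            () -> { throw new RuntimeException("DatasetService unavailable"); },
            () -> { });
        System.out.println(runLongRunning(events));

        try {
            runBatchStep(events.get(1));
        } catch (RuntimeException e) {
            System.out.println("batch run FAILED: " + e.getMessage());
        }
    }
}
```

A transient master outage therefore shows up as logged warnings in a flow, but as a terminal FAILED state in a MapReduce or workflow run.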

It would be good to understand what your expectation is in this scenario.

Thanks
Shankar


Elkhan

Sep 28, 2016, 1:31:06 PM
to CDAP User
Hi Shankar,

I looked at the logs. After 3 retries the map task fails, and the reason for failure in each attempt is that the dataset instance could not be created:

2016-09-28 01:04:03,516 - WARN  [AsyncDispatcher event handler:o.a.h.m.v.a.MRAppMaster@91] - Sep 28, 2016 1:04:03 AM org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$DiagnosticInformationUpdater transition
INFO: Diagnostics report from attempt_1473728860156_5643_m_000003_2: Error: co.cask.cdap.api.data.DatasetInstantiationException: Could not instantiate dataset 'elkhan:purchases'
Caused by: com.google.common.util.concurrent.UncheckedExecutionException: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
Caused by: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
Caused by: co.cask.cdap.api.dataset.DatasetManagementException: Failed to create dataset instance: dataset:elkhan.purchases
Caused by: co.cask.cdap.common.ServiceUnavailableException: Service 'DatasetService' is not available. Please wait till it is up and running.

...

2016-09-28 01:03:38,792 - WARN  [main:o.a.h.m.YarnChild@91] - Sep 28, 2016 1:03:38 AM org.apache.hadoop.mapred.YarnChild main
WARNING: Exception running child : co.cask.cdap.api.data.DatasetInstantiationException: Could not instantiate dataset 'elkhan:purchases'
Caused by: com.google.common.util.concurrent.UncheckedExecutionException: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
Caused by: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
Caused by: co.cask.cdap.api.dataset.DatasetManagementException: Error during talking to Dataset Service at <URL_OF_KILLED_MASTER>:45764/v3/namespaces/elkhan/data/datasets/purchases?owner=program:elkhan.PurchaseHistory.mapreduce.PurchaseHistoryBuilder while doing GET with headers null and body null
Caused by: java.net.ConnectException: Connection refused

It is the same reason you mentioned: the Dataset Service runs as part of the master service, and since it is down, the request fails. But it fails even before trying to update the workflow token; it cannot even start the map task, because it cannot instantiate the dataset.

My expected scenario would be:

If the new master changes state from follower to leader, it should track all the jobs that the old master was running before it died, kill them, and restart them so they can complete successfully.

Also, if the endpoint on the main master is not responding, clients should try a follower master's endpoint for instantiating the dataset.

Otherwise those jobs will still fail, because the dataset cannot be instantiated.

Ideally, a master service failover in HA mode should not affect any running jobs (i.e., cause them to fail).
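The client-side failover idea suggested above could look roughly like this. This is a hypothetical sketch, not how CDAP's RemoteOpsClient actually behaves today, and the endpoint names are made up.

```java
// Hypothetical client-side failover sketch (not CDAP's actual behavior):
// try the main master's endpoint first, and fall back to a follower's
// endpoint if the connection is refused.
import java.net.ConnectException;
import java.util.List;
import java.util.function.Function;

public class FailoverClientSketch {

    // Try each endpoint in order; return the first successful response.
    static <T> T withFailover(List<String> endpoints, Function<String, T> call)
            throws ConnectException {
        ConnectException last = null;
        for (String endpoint : endpoints) {
            try {
                return call.apply(endpoint);
            } catch (RuntimeException e) {
                if (e.getCause() instanceof ConnectException) {
                    last = (ConnectException) e.getCause(); // dead master: try next
                } else {
                    throw e; // other failures are not failover-able
                }
            }
        }
        throw last != null ? last : new ConnectException("no endpoints configured");
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoints: the killed master and the new leader.
        List<String> masters = List.of("http://master1:45764", "http://master2:45764");
        String result = withFailover(masters, endpoint -> {
            if (endpoint.contains("master1")) {
                // Simulates the "Connection refused" seen in the logs above.
                throw new RuntimeException(new ConnectException("Connection refused"));
            }
            return "dataset instantiated via " + endpoint;
        });
        System.out.println(result);
    }
}
```

In practice the follower's address would have to come from service discovery (e.g. ZooKeeper) rather than a static list, since the leader can change at any time.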

Thanks.

Elkhan

Oct 2, 2016, 9:26:46 PM
to CDAP User
Created JIRA ticket CDAP-7388 regarding this issue.