I looked at the logs. The map task fails after 3 retries, and the reason for the failure in each attempt is that the dataset instance could not be created:
2016-09-28 01:04:03,516 - WARN [AsyncDispatcher event handler:o.a.h.m.v.a.MRAppMaster@91] - Sep 28, 2016 1:04:03 AM org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$DiagnosticInformationUpdater transition
INFO: Diagnostics report from attempt_1473728860156_5643_m_000003_2: Error: co.cask.cdap.api.data.DatasetInstantiationException: Could not instantiate dataset 'elkhan:purchases'
Caused by: com.google.common.util.concurrent.UncheckedExecutionException: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
Caused by: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
Caused by: co.cask.cdap.api.dataset.DatasetManagementException: Failed to create dataset instance: dataset:elkhan.purchases
Caused by: co.cask.cdap.common.ServiceUnavailableException: Service 'DatasetService' is not available. Please wait till it is up and running.
...
2016-09-28 01:03:38,792 - WARN [main:o.a.h.m.YarnChild@91] - Sep 28, 2016 1:03:38 AM org.apache.hadoop.mapred.YarnChild main
WARNING: Exception running child : co.cask.cdap.api.data.DatasetInstantiationException: Could not instantiate dataset 'elkhan:purchases'
Caused by: com.google.common.util.concurrent.UncheckedExecutionException: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
Caused by: co.cask.cdap.api.data.DatasetInstantiationException: Failed to access dataset: dataset:elkhan.purchases
Caused by: co.cask.cdap.api.dataset.DatasetManagementException: Error during talking to Dataset Service at <URL_OF_KILLED_MASTER>:45764/v3/namespaces/elkhan/data/datasets/purchases?owner=program:elkhan.PurchaseHistory.mapreduce.PurchaseHistoryBuilder while doing GET with headers null and body null
Caused by: java.net.ConnectException: Connection refused
It is the same reason you mentioned: the Dataset Service runs as part of the master service, and since the Dataset Service is down, the request fails. But it fails even before trying to update the workflow token. The map task cannot even start, because it cannot instantiate the dataset.
My expected scenario would be:
When the new master changes state from follower to leader, it should track all the jobs that the old master tried to run before it died, kill them, and restart them so they can complete successfully.
Also, if the endpoint on the main master is not responding, the request should be retried against the follower master's endpoint to instantiate the dataset.
Otherwise those jobs will still fail, because the dataset cannot be instantiated.
Ideally, a master service failover in HA mode should not affect any running jobs (that is, cause them to fail).
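To make the second point concrete, here is a minimal sketch of the fallback I have in mind: try each known master endpoint in order and move on to the next one when the connection is refused. This is only an illustration, not how CDAP actually resolves the Dataset Service. The class EndpointFailover, the EndpointCall interface and the host names in main are hypothetical, and the lambda just simulates the "Connection refused" seen in the logs.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.Arrays;
import java.util.List;

public class EndpointFailover {

  /** One request against a single endpoint; a hypothetical stand-in for the dataset client call. */
  public interface EndpointCall<T> {
    T apply(String endpoint) throws IOException;
  }

  /**
   * Tries each endpoint in order and returns the first successful result.
   * Only connection failures trigger a fallback; any other exception propagates immediately.
   */
  public static <T> T callWithFailover(List<String> endpoints, EndpointCall<T> call)
      throws IOException {
    if (endpoints.isEmpty()) {
      throw new IllegalArgumentException("At least one endpoint is required");
    }
    IOException lastFailure = null;
    for (String endpoint : endpoints) {
      try {
        return call.apply(endpoint);
      } catch (ConnectException e) {
        // This endpoint is down (for example, the killed master); try the next one.
        lastFailure = e;
      }
    }
    // Every endpoint refused the connection; report the last failure.
    throw lastFailure;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical host names: the first simulates the killed master, the second the new leader.
    List<String> endpoints = Arrays.asList("killed-master:45764", "new-leader:45764");
    String result = callWithFailover(endpoints, endpoint -> {
      if (endpoint.startsWith("killed-master")) {
        throw new ConnectException("Connection refused");
      }
      return "dataset instantiated via " + endpoint;
    });
    System.out.println(result);
  }
}
```

With something like this, the map task would only fail if no master at all accepts the connection, instead of failing as soon as the old leader's address stops responding.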
Thanks.