cdap-master can't be restarted


ron cai

Sep 6, 2016, 1:48:54 AM
to CDAP User
I am using CDAP 3.5.1 on HDP 2.4.

There are exceptions in the master log, and I am trying to restart cdap-master from Ambari, but the restart hangs. According to the log, the master entered the stopping state but then took no further action.

The log is attached.

2016-09-06 05:33:49,327 - ERROR [netty-executor-1032:o.a.t.d.TransactionServiceClient@225] - Thrift error for org.apache.tephra.distributed.TransactionServiceClient$2@273e2fa4: Unable to discover tx service.
2016-09-06 05:33:49,341 - ERROR [netty-executor-1032:c.c.c.c.HttpExceptionHandler@49] - Unexpected error: request=GET /v3/namespaces user=<null>:
java.lang.RuntimeException: java.lang.Exception: Thrift error for org.apache.tephra.distributed.TransactionServiceClient$2@273e2fa4: Unable to discover tx service.
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[com.google.guava.guava-13.0.1.jar:na]
        at org.apache.tephra.distributed.TransactionServiceClient.startShort(TransactionServiceClient.java:268) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.TransactionContext.start(TransactionContext.java:91) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.DefaultTransactionExecutor.executeOnce(DefaultTransactionExecutor.java:135) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.DefaultTransactionExecutor.executeWithRetry(DefaultTransactionExecutor.java:117) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.DefaultTransactionExecutor.execute(DefaultTransactionExecutor.java:74) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.AbstractTransactionExecutor.executeUnchecked(AbstractTransactionExecutor.java:43) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at co.cask.cdap.store.DefaultNamespaceStore.list(DefaultNamespaceStore.java:161) ~[co.cask.cdap.cdap-data-fabric-3.5.0.jar:na]
        at co.cask.cdap.internal.app.namespace.DefaultNamespaceAdmin.list(DefaultNamespaceAdmin.java:360) ~[co.cask.cdap.cdap-app-fabric-3.5.0.jar:na]
        at co.cask.cdap.gateway.handlers.NamespaceHttpHandler.getAllNamespaces(NamespaceHttpHandler.java:61) ~[co.cask.cdap.cdap-app-fabric-3.5.0.jar:na]
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) ~[na:na]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_60]
        at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_60]
        at co.cask.http.HttpMethodInfo.invoke(HttpMethodInfo.java:80) ~[co.cask.http.netty-http-0.14.0.jar:na]
        at co.cask.http.HttpDispatcher.messageReceived(HttpDispatcher.java:38) [co.cask.http.netty-http-0.14.0.jar:na]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [io.netty.netty-3.6.6.Final.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [io.netty.netty-3.6.6.Final.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [io.netty.netty-3.6.6.Final.jar:na]
        at org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43) [io.netty.netty-3.6.6.Final.jar:na]
        at org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67) [io.netty.netty-3.6.6.Final.jar:na]
        at org.jboss.netty.handler.execution.OrderedMemoryAwareThreadPoolExecutor$ChildExecutor.run(OrderedMemoryAwareThreadPoolExecutor.java:314) [io.netty.netty-3.6.6.Final.jar:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
Caused by: java.lang.Exception: Thrift error for org.apache.tephra.distributed.TransactionServiceClient$2@273e2fa4: Unable to discover tx service.
        at org.apache.tephra.distributed.TransactionServiceClient.execute(TransactionServiceClient.java:227) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.distributed.TransactionServiceClient.execute(TransactionServiceClient.java:185) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.distributed.TransactionServiceClient.startShort(TransactionServiceClient.java:259) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        ... 22 common frames omitted
Caused by: org.apache.thrift.TException: Unable to discover tx service.
        at org.apache.tephra.distributed.AbstractClientProvider.newClient(AbstractClientProvider.java:106) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.distributed.AbstractClientProvider.newClient(AbstractClientProvider.java:85) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.distributed.PooledClientProvider$TxClientPool.create(PooledClientProvider.java:48) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.distributed.PooledClientProvider$TxClientPool.create(PooledClientProvider.java:41) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.distributed.ElasticPool.getOrCreate(ElasticPool.java:138) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.distributed.ElasticPool.obtain(ElasticPool.java:125) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.distributed.PooledClientProvider.getCloseableClient(PooledClientProvider.java:101) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        at org.apache.tephra.distributed.TransactionServiceClient.execute(TransactionServiceClient.java:214) ~[org.apache.tephra.tephra-core-0.8.0-incubating.jar:0.8.0-incubating]
        ... 24 common frames omitted
2016-09-06 05:33:49,343 - INFO  [NettyHttpService STOPPING:c.c.c.i.a.s.AppFabricServer$1@198] - AppFabric HTTP service stopped.
2016-09-06 05:33:49,344 - INFO  [DistributedSchedulerService STOPPING:c.c.c.i.a.r.s.DistributedSchedulerService@79] - Stopping scheduler.
2016-09-06 05:33:49,344 - INFO  [DistributedSchedulerService STOPPING:c.c.c.i.a.r.s.AbstractSchedulerService@98] - Stopped stream size scheduler
2016-09-06 05:33:49,345 - INFO  [DistributedSchedulerService STOPPING:c.c.c.i.a.r.s.AbstractSchedulerService@106] - Stopped time scheduler
2016-09-06 05:33:49,347 - INFO  [ApplicationLifecycleService STOPPING:c.c.c.i.a.s.ApplicationLifecycleService@165] - Shutting down ApplicationLifecycleService
2016-09-06 05:33:49,349 - INFO  [ProgramLifecycleService STOPPING:c.c.c.i.a.s.ProgramLifecycleService@143] - Shutting down ProgramLifecycleService
2016-09-06 05:33:49,350 - INFO  [authorization-privileges-fetcher-proxy-service:c.c.c.s.a.AbstractAuthorizationService@139] - Shutdown authorization privileges-fetcher-proxy service successfully.
2016-09-06 05:33:49,351 - ERROR [authorization-privileges-fetcher-proxy-service:c.c.c.l.a.k.SimpleKafkaProducer@64] - Exception when trying to publish log message to kafka with key system:services:appfabric and topic logs.user-v2
kafka.common.QueueFullException: Event queue is full of unsent messages, could not send event: KeyedMessage(logs.user-v2,system:services:appfabric,system:services:appfabric,[B@24471929)
        at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Producer.scala:113) ~[org.apache.kafka.kafka_2.10-0.8.2.2.jar:na]
        at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Producer.scala:91) ~[org.apache.kafka.kafka_2.10-0.8.2.2.jar:na]
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) ~[org.scala-lang.scala-library-2.10.4.jar:na]
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) ~[org.scala-lang.scala-library-2.10.4.jar:na]
        at kafka.producer.Producer.asyncSend(Producer.scala:91) ~[org.apache.kafka.kafka_2.10-0.8.2.2.jar:na]
        at kafka.producer.Producer.send(Producer.scala:78) ~[org.apache.kafka.kafka_2.10-0.8.2.2.jar:na]
        at kafka.javaapi.producer.Producer.send(Producer.scala:33) ~[org.apache.kafka.kafka_2.10-0.8.2.2.jar:na]
        at co.cask.cdap.logging.appender.kafka.SimpleKafkaProducer.publish(SimpleKafkaProducer.java:62) ~[co.cask.cdap.cdap-watchdog-3.5.0.jar:na]
        at co.cask.cdap.logging.appender.kafka.KafkaLogAppender.append(KafkaLogAppender.java:61) [co.cask.cdap.cdap-watchdog-3.5.0.jar:na]
        at co.cask.cdap.logging.appender.LogAppender.append(LogAppender.java:27) [co.cask.cdap.cdap-watchdog-3.5.0.jar:na]
        at ch.qos.logback.core.AppenderBase.doAppend(AppenderBase.java:85) [ch.qos.logback.logback-core-1.0.9.jar:na]
        at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:48) [ch.qos.logback.logback-core-1.0.9.jar:na]
        at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:280) [ch.qos.logback.logback-classic-1.0.9.jar:na]
        at ch.qos.logback.classic.Logger.callAppenders(Logger.java:267) [ch.qos.logback.logback-classic-1.0.9.jar:na]
        at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:449) [ch.qos.logback.logback-classic-1.0.9.jar:na]
        at ch.qos.logback.classic.Logger.filterAndLog_1(Logger.java:421) [ch.qos.logback.logback-classic-1.0.9.jar:na]
        at ch.qos.logback.classic.Logger.info(Logger.java:611) [ch.qos.logback.logback-classic-1.0.9.jar:na]
        at co.cask.cdap.security.authorization.AbstractAuthorizationService.shutDown(AbstractAuthorizationService.java:139) [co.cask.cdap.cdap-security-3.5.0.jar:na]
        at com.google.common.util.concurrent.AbstractScheduledService$1$3.run(AbstractScheduledService.java:219) [com.google.guava.guava-13.0.1.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_60]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_60]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_60]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
2016-09-06 05:33:49,352 - INFO  [leader-election-election-master.services:c.c.c.d.r.m.MasterServiceMain@325] - Stopping service in master: EndureService{DatasetService}
2016-09-06 05:33:49,440 - INFO  [DatasetService:c.c.c.d.d.d.s.DatasetService@201] - Stopping DatasetService...
2016-09-06 05:33:52,444 - INFO  [NettyHttpService STOPPING:c.c.c.d.d.d.s.DatasetTypeHandler@63] - Stopping DatasetTypeHandler
2016-09-06 05:33:52,447 - INFO  [leader-election-election-master.services:c.c.c.d.r.m.MasterServiceMain@325] - Stopping service in master: DatasetServiceStore [RUNNING]
2016-09-06 05:33:52,447 - INFO  [leader-election-election-master.services:c.c.c.d.r.m.MasterServiceMain@325] - Stopping service in master: DefaultAuthorizationEnforcementService [RUNNING]
2016-09-06 05:33:52,447 - INFO  [authorization-enforcement-service:c.c.c.s.a.AbstractAuthorizationService@139] - Shutdown authorization enforcement service successfully.
2016-09-06 05:33:52,447 - INFO  [leader-election-election-master.services:c.c.c.d.r.m.MasterServiceMain@325] - Stopping service in master: KafkaMetricsCollectionService [RUNNING]
2016-09-06 05:33:52,448 - INFO  [leader-election-election-master.services:c.c.c.d.r.m.MasterServiceMain@325] - Stopping service in master: ZKKafkaClientService [RUNNING]
2016-09-06 05:33:52,449 INFO  [kafka-publisher] producer.Producer: Shutting down producer
2016-09-06 05:33:52,449 INFO  [kafka-publisher] producer.Producer: Shutting down producer
2016-09-06 05:33:52,449 INFO  [kafka-publisher] producer.Producer: Shutting down producer
2016-09-06 05:33:52,459 - INFO  [Thread-4:c.c.c.d.r.m.MasterServiceMain@325] - Stopping service in master: org.apache.twill.zookeeper.ZKClientServices$1@2b560ea
2016-09-06 05:33:52,459 INFO  [kafka-publisher] producer.ProducerPool: Closing all sync producers
2016-09-06 05:33:52,460 INFO  [kafka-publisher] producer.ProducerPool: Closing all sync producers
2016-09-06 05:33:52,460 INFO  [kafka-publisher] producer.ProducerPool: Closing all sync producers
Regards,
Ron

nis...@cask.co

Sep 8, 2016, 6:27:56 PM
to CDAP User
Hi Ron,

Are you still having issues starting cdap-master? This seems to be related to https://issues.cask.co/browse/CDAP-7233, which was causing failures in the transaction service. Please let me know if you still need help with this.

Thanks,
Nishith

ron cai

Sep 8, 2016, 9:54:38 PM
to CDAP User, nis...@cask.co
Hi Nishith,

I have solved the problem now. It was mainly because my cluster was short of YARN containers.
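[Editor's note] For anyone hitting the same symptom, one way to confirm a YARN capacity shortage is the ResourceManager REST API (`/ws/v1/cluster/metrics`), which reports free memory and vcores. Below is a minimal sketch; the host `rm-host:8088` is a placeholder for your ResourceManager address, and the requested sizes are illustrative, not CDAP's actual requirements:

```python
import json
from urllib.request import urlopen

def free_memory_mb(metrics: dict) -> int:
    """Return available memory (MB) from a YARN clusterMetrics payload."""
    return metrics["clusterMetrics"]["availableMB"]

def has_capacity(metrics: dict, needed_mb: int, needed_vcores: int) -> bool:
    """True if the cluster reports enough free memory and vcores
    to launch a container of the given size."""
    m = metrics["clusterMetrics"]
    return m["availableMB"] >= needed_mb and m["availableVirtualCores"] >= needed_vcores

if __name__ == "__main__":
    # Replace rm-host with your ResourceManager; port 8088 is the default web UI/REST port.
    with urlopen("http://rm-host:8088/ws/v1/cluster/metrics") as resp:
        metrics = json.load(resp)
    print(has_capacity(metrics, needed_mb=2048, needed_vcores=2))
```

If this prints `False`, CDAP's system containers may be stuck waiting for resources, which can leave the master unable to come back up cleanly.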

Thanks,
Ron


On Friday, September 9, 2016 at 6:27:56 AM UTC+8, nis...@cask.co wrote: