Metadata cache fails with ZK Auth Error

anuptr...@gmail.com

Feb 25, 2019, 7:00:39 AM

to CDAP User

Hi Team,

Our RegionServer (RS) logs are being filled with the exception below, repeated at a rate of 5 per second. The problem appeared to go away when the metadata cache update thread terminated as the RS went offline (and came back online).

There is a similar issue here ( https://issues.cask.co/browse/CDAP-12454 ), but in this case the exception appears to originate from a different place.

This is on CDAP 4.3.3 on HDP 2.6.4.

2019-02-01 06:13:10,114 ERROR [tms-topic-metadata-cache-refresh] zookeeper.ZooKeeperWatcher: hconnection-0x880720d-0x999b40acc6nhf94, quorum=hostname:2181, baseZNode=/hbase-secure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase-secure/meta-region-server
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:354)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:624)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionState(MetaTableLocator.java:491)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionLocation(MetaTableLocator.java:172)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:611)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:592)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:565)
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1209)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1176)
    at org.apache.hadoop.hbase.client.CoprocessorHConnection.locateRegion(CoprocessorHConnection.java:41)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:340)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:159)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:61)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:211)
    at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:227)
    at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:201)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1275)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1179)
    at org.apache.hadoop.hbase.client.CoprocessorHConnection.locateRegion(CoprocessorHConnection.java:41)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:340)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:159)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:61)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:211)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:327)
    at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:302)
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:167)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:162)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:799)
    at org.apache.hadoop.hbase.client.HTableWrapper.getScanner(HTableWrapper.java:215)
    at co.cask.cdap.messaging.TopicMetadataCache.updateCache(TopicMetadataCache.java:141)
    at co.cask.cdap.messaging.TopicMetadataCache$2.run(TopicMetadataCache.java:183)
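
For anyone who wants to measure how often this error is logged, here is a minimal Python sketch that counts ERROR lines per second and per thread. It assumes the log line layout shown in the first line of the trace above (timestamp, level, thread name in square brackets). The function name `count_errors_per_second` and the sample lines are hypothetical, made up only for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the first line of the trace above:
# 2019-02-01 06:13:10,114 ERROR [tms-topic-metadata-cache-refresh] zookeeper.ZooKeeperWatcher: ...
LOG_LINE = re.compile(
    r"^(?P<second>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ERROR \[(?P<thread>[^\]]+)\]"
)


def count_errors_per_second(lines):
    """Return a Counter keyed by (timestamp truncated to the second, thread name)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # stack frames and other lines do not match and are skipped
            counts[(match.group("second"), match.group("thread"))] += 1
    return counts


# Hypothetical sample lines, shortened for illustration.
sample = [
    "2019-02-01 06:13:10,114 ERROR [tms-topic-metadata-cache-refresh] zookeeper.ZooKeeperWatcher: Received unexpected KeeperException",
    "\tat org.apache.zookeeper.KeeperException.create(KeeperException.java:123)",
    "2019-02-01 06:13:10,320 ERROR [tms-topic-metadata-cache-refresh] zookeeper.ZooKeeperWatcher: Received unexpected KeeperException",
    "2019-02-01 06:13:11,005 ERROR [tms-topic-metadata-cache-refresh] zookeeper.ZooKeeperWatcher: Received unexpected KeeperException",
]

counts = count_errors_per_second(sample)
for (second, thread), n in sorted(counts.items()):
    print(second, thread, n)
```

To run it against a real log, pass an open file object instead of `sample`; the path to the RegionServer log depends on the installation.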

Please let me know if there is any way to avoid this.

Thanks!

anuptr...@gmail.com

Jul 3, 2019, 2:44:21 AM

to CDAP User

Hello Andreas,

There has been no update on my earlier question. We are now seeing the same error from another thread, "cdap-configuration-cache-refresh", and still do not have a solution.

The PR below fixed this issue in multiple places, but we are still getting the error. Could someone check whether this is still a bug that needs to be fixed in other places?

https://github.com/cdapio/cdap/pull/9541/commits

Thanks!

Andreas Neumann

Jul 8, 2019, 7:50:57 PM

to cdap...@googlegroups.com

I am surprised that this still happens after the fix. The stack trace from the exception appears to be impossible with the 4.3.3 code base.

- Was this a fresh installation of CDAP 4.3.3?

- Or was it upgraded from an earlier version?

- If so, were all HBase tables upgraded to the latest version of the coprocessors?

Regards -Andreas

--
You received this message because you are subscribed to the Google Groups "CDAP User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cdap-user+...@googlegroups.com.
To post to this group, send email to cdap...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cdap-user/dac0d578-c85d-4ff4-b756-6a654f328735%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

anuptr...@gmail.com

Jul 10, 2019, 11:39:53 AM

to CDAP User

Hi Andreas,

Thanks for your reply.

It is a fresh installation of version 4.3.3. That should rule out any issue with a coprocessor upgrade, shouldn't it?

I see that the PR changes (https://github.com/cdapio/cdap/pull/9541) were merged to the release/4.3 branch on 15-Sep-17.

Is it possible that the RPMs we used did not include these changes?

That said, the version in the pom (in the PR above) is 4.3.1-SNAPSHOT, which implies that the 4.3.3 build was made later.

Thanks!

Andreas Neumann

Jul 10, 2019, 2:41:50 PM

to cdap...@googlegroups.com

Yes, this fix is included in the 4.3.3 release. It is hard to explain what is happening.

Just to be sure, can you run this in the hbase shell:

describe 'cdap_system:tms.message'

That will show the coprocessor library version for this table (I assume this is happening for the tms.message table, right?).

Thanks -Andreas

anuptr...@gmail.com

Jul 11, 2019, 6:50:42 AM

to CDAP User

Hi Andreas,

We are getting errors from "tms-topic-metadata-cache-refresh" as well as "cdap-configuration-cache-refresh". I believe the other relevant table whose coprocessor we need to check is 'cdap_system:config.store.table'?

Thanks!

anuptr...@gmail.com

Jul 11, 2019, 12:17:03 PM

to CDAP User

Andreas,

We ran this check on the following tables:

cdap_system:tms.message

cdap_system:config.store.table

cdap_system:job.queue.t

and found that the coprocessor is on 4.3.3. It appears we are running the correct version of the build.

cdap_system:config.store.table, {TABLE_ATTRIBUTES => {coprocessor$1 => '/cdap/cdap/lib/coprocessor-4.3.3-1522017174177-HBASE_11.jar|co.cask.cdap.data2.transaction.coprocessor.hbase11.DefaultTransactionProcessor|1073741823|', METADATA => {'cdap.hbase.version' => '1.1', 'cdap.version' => '4.3.3-1522017174177', 'dataset.table.prefix' => 'cdap'}}
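
To repeat this check across several tables without reading each `describe` output by eye, something like the following Python sketch could be used. It assumes the output has the same shape as the text above and that the shell may wrap long lines mid-token. The function name `coprocessor_versions` and both regular expressions are my own assumptions, checked only against this one sample.

```python
import re

# Assumed jar naming, based on the single sample above:
# coprocessor-<cdap version>-<build timestamp>-HBASE_<nn>.jar
COPROCESSOR_JAR = re.compile(r"coprocessor-(\d+\.\d+\.\d+)-\d+-HBASE_\d+\.jar")
# Assumed metadata entry: 'cdap.version' => '<cdap version>-<build timestamp>'
CDAP_VERSION = re.compile(r"'cdap\.version' => '(\d+\.\d+\.\d+)[^']*'")


def coprocessor_versions(describe_output):
    """Extract version strings from the text printed by `describe` for one table."""
    text = describe_output.replace("\n", "")  # undo line wrapping by the shell
    return {
        "coprocessor_jar": COPROCESSOR_JAR.findall(text),
        "cdap_version": CDAP_VERSION.findall(text),
    }


# The output quoted above, including one wrapped line.
sample = (
    "cdap_system:config.store.table, {TABLE_ATTRIBUTES => {coprocessor$1 => "
    "'/cdap/cdap/lib/coprocessor-4.3.3-1522017174177-HBASE_11.jar|co.cask.cdap.data2.transa\n"
    "ction.coprocessor.hbase11.DefaultTransactionProcessor|1073741823|', "
    "METADATA => {'cdap.hbase.version' => '1.1', 'cdap.version' => '4.3.3-1522017174177', "
    "'dataset.table.prefix' => 'cdap'}}"
)

result = coprocessor_versions(sample)
# True when every coprocessor jar version equals the table's recorded CDAP version.
versions_match = set(result["coprocessor_jar"]) == set(result["cdap_version"])
print(result, versions_match)
```

A mismatch between the two lists would point at a table that still carries an older coprocessor jar.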

Thanks!

Andreas Neumann

Jul 15, 2019, 3:17:18 PM

to cdap...@googlegroups.com

Hmm, that was my only clue. I wonder whether anybody else on this list has seen this issue?

-Andreas

anuptr...@gmail.com

Jul 22, 2019, 2:19:59 PM

to CDAP User

Hi Andreas,

Shall we raise a JIRA for this and link it to CDAP-12454, so that the team can take a closer look and fix anything if required?

Please let me know if any specific information is required for debugging, apart from what is mentioned in this thread.

Thanks!

Andreas Neumann

Jul 22, 2019, 2:56:15 PM

to cdap...@googlegroups.com

Yes, a JIRA would help. Please include the logs and also the information about your cluster: versions of Hadoop, CDAP, etc.

Thanks -Andreas

anuptr...@gmail.com

Aug 7, 2019, 3:23:39 PM

to CDAP User

Hi Andreas,

Here is the JIRA created for the issue, as discussed:

https://issues.cask.co/browse/CDAP-15718

Please let us know if the team finds anything on this.

Thanks!
