Embedded Kafka server down and may not restart


Fredrik

Jun 2, 2016, 9:09:13 PM
to CDAP User
Hello,

I encountered this Kafka error when I started CDAP:

16/06/03 01:02:31 FATAL server.KafkaServer: [Kafka Server 171121975], Fatal error during KafkaServer startup. Prepare to shutdown
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code 

Due to this issue, the app on the server may not receive stream events.

What is the issue, and how can I overcome it? Do I need to clean the Kafka log directories? Is cleaning the Kafka logs a safe action that will not impact the deployed apps and datasets?

Best Regards,
Fredrik
stdout.log
stderr.log

Terence Yim

Jun 2, 2016, 11:03:41 PM
to Fredrik, CDAP User
Hi,

This error comes from Kafka itself. What version of Java are you using, and what Kafka data directory have you set in the cdap-site.xml file?
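As a side note for anyone hitting this: both answers can be pulled from the host with a quick shell check. The sample cdap-site.xml written below is a stand-in so the commands are self-contained; on a real node the file typically lives under /etc/cdap/conf (path assumed, it varies by distribution).

```shell
# Write a minimal stand-in cdap-site.xml so this snippet is self-contained;
# on a real host you would read /etc/cdap/conf/cdap-site.xml instead
# (path assumed; it varies by distribution).
cat > /tmp/sample-cdap-site.xml <<'EOF'
<configuration>
  <property>
    <name>kafka.log.dir</name>
    <value>/var/tmp/cdap/kafka-logs</value>
  </property>
</configuration>
EOF

# Report the JVM in use; the InternalError above is raised by the JVM on a
# fault in an unsafe memory access, so the exact version matters.
java -version 2>&1 | head -n 1

# Extract the configured Kafka data directory from the site file.
sed -n '/<name>kafka.log.dir<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p}' /tmp/sample-cdap-site.xml
```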

Terence

Sent from my iPhone
--
You received this message because you are subscribed to the Google Groups "CDAP User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cdap-user+...@googlegroups.com.
To post to this group, send email to cdap...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cdap-user/3978dae3-3acb-42e1-84a3-905484b5140a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Fredrik

Jun 2, 2016, 11:23:26 PM
to CDAP User, frederic...@gmail.com
Hi Terence,

The issue was resolved after I manually cleaned the kafka-logs directory. What is stored in this directory, and will it be auto-cleaned? After CDAP runs for a long time (whether restarted or not), the files in that directory make the system run out of space.

Thanks.
Best Regards,
Fredrik

Terence Yim

Jun 3, 2016, 12:25:59 AM
to Fredrik, CDAP User
Kafka will do the cleanup periodically. I think by default CDAP Kafka sets it to once a day.

Terence

Sent from my iPhone

Fredrik

Jun 3, 2016, 1:49:20 AM
to CDAP User, frederic...@gmail.com
Hello Terence,

Thank you for the information. Is it configurable, and where can I find the configuration? Or should I keep monitoring the disk usage and clean it from time to time to avoid running out of disk?

Thanks.
Best Regards,
Fredrik

Terence Yim

Jun 3, 2016, 4:06:40 PM
to CDAP User, frederic...@gmail.com
Hi Fredrik,

Currently that one-day log cleanup interval is not settable; however, Kafka should be doing the cleanup, meaning you shouldn't need to clean it up yourself. Also, the CDAP Kafka is only used for log and metrics collection, which shouldn't use a lot of disk space unless your program is logging a lot of messages within a day (e.g., logging at TRACE level). Also, how big is the disk that "kafka.log.dir" points to?
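To see where the space is actually going, a per-partition disk-usage summary of the Kafka data directory helps. The toy directory layout below is fabricated for illustration (real partition directories look like logs.user-v2-0, metrics-3, ...); on a real node you would run the final du against the actual kafka.log.dir.

```shell
# Build a toy kafka-logs layout; on a real node, point the final `du`
# at the actual kafka.log.dir instead.
base=/tmp/demo-kafka-logs
mkdir -p "$base"/logs.user-v2-0 "$base"/metrics-0
dd if=/dev/zero of="$base"/logs.user-v2-0/00000000000000000000.log bs=1024 count=1024 2>/dev/null
dd if=/dev/zero of="$base"/metrics-0/00000000000000000000.log bs=1024 count=4 2>/dev/null

# Per-partition usage, largest first: quickly shows which topic is
# responsible when the volume fills up.
du -sk "$base"/* | sort -rn
```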

Terence

Fredrik

Jun 3, 2016, 9:40:16 PM
to CDAP User, frederic...@gmail.com
Hello Terence,

The disk is a 4 GB logical volume ("/var"). It looks like I cannot find "kafka.log.dir" on my CDH CDAP service configuration page. How can I configure it to direct the logs to another location? I opened another topic for the case that causes this: https://groups.google.com/forum/#!topic/cdap-user/AI36_yjyMyY

And another question: at runtime, without a CDAP restart, is it safe to manually clean up that directory if the disk runs out of space?

  <property>
    <name>kafka.log.dir</name>
    <value>{{LOCAL_DIR}}/kafka-logs</value>
  </property>

Thanks.
Best Regards,
Fredrik

Terence Yim

Jun 3, 2016, 9:52:55 PM
to Fredrik, CDAP User
Hi Fredrik,

In that case, it will by default write to "/tmp/kafka-logs". You can change it with the "kafka.log.dir" property in cdap-site.xml (much like what you have shown).
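For example, to direct the Kafka data to a larger volume, the cdap-site.xml entry would look like the following (the /data1 path here is only illustrative; substitute a mount point with room to spare):

  <property>
    <name>kafka.log.dir</name>
    <value>/data1/cdap/kafka-logs</value>
  </property>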

Manually cleaning up the Kafka log directory is never a good idea; you should let Kafka do it.

Terence

Fredrik

Jun 3, 2016, 10:10:34 PM
to CDAP User, frederic...@gmail.com
Hi Terence,

On my CDH environment, I found that it is directed to /var/tmp/cdap/kafka-logs. I'm wondering if there is a trigger to make Kafka clear that directory on demand, besides the automatic cleanup :-) since in my environment the disk space is not easy to extend.

Best Regards,
Fredrik

Poorna Chandra

Jun 6, 2016, 7:59:31 PM
to Fredrik, CDAP User
Hi Fredrik,

As Terence mentioned earlier, the Kafka logs get auto-cleaned every 24 hours (it is an automatic time-based trigger). Today that interval is not configurable. Here is the JIRA to make the interval configurable: https://issues.cask.co/browse/CDAP-5264

Poorna.


Fredrik

Jun 6, 2016, 11:31:13 PM
to CDAP User, frederic...@gmail.com
Hello Poorna,

Thank you for the detailed information.

Thanks.
Best Regards,
Fredrik

Fredrik

Oct 20, 2016, 7:08:38 AM
to CDAP User, frederic...@gmail.com
Hello,

Is there any way to reduce the size of the Kafka logs generated? Every day our cluster disk runs out of space due to the large number of 1 GB Kafka log files (e.g., /var/tmp/cdap/kafka-logs/logs.user-v2-8/00000000000027214800.log ....).

Please suggest and comment. Thanks.

Best Regards,
Frederic

Rohit Sinha

Oct 21, 2016, 3:15:28 PM
to CDAP User, frederic...@gmail.com
Hello Fredrik,
Are you logging excessively? If so, you should try reducing it. This will help reduce the size of the Kafka logs.

Also, from 3.5 onwards you have the option to shorten the amount of time logs are retained by Kafka by setting the following property in your cdap-site.xml:

  <property>
    <name>kafka.server.log.retention.hours</name>
    <value>24</value>
    <description>
      The number of hours to keep a log file before deleting it
    </description>
  </property>

Please be aware, though, that reducing it may lead to losing logs, metrics, notifications, etc. in some cases. So we advise you to reduce your logging instead.
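As a rough illustration of what a 24-hour retention window means on disk, the age cutoff can be mimicked with find. Note this is only an illustration of the arithmetic: Kafka itself deletes whole log segments based on segment timestamps, not a find sweep, and the file names below are made up.

```shell
# Two stand-in segment files: one "old" (25 hours ago) and one fresh.
base=/tmp/demo-retention
mkdir -p "$base"
touch -d '25 hours ago' "$base"/00000000000000000000.log
touch "$base"/00000000000027214800.log

# Files older than kafka.server.log.retention.hours=24 (1440 minutes)
# would be eligible for deletion under that retention setting.
find "$base" -name '*.log' -mmin +1440
```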

Thanks, 
Rohit

Fredrik

Oct 27, 2016, 7:17:43 AM
to CDAP User, frederic...@gmail.com
Hi Rohit,

I followed your instructions and did the following:
1. Reduced the excessive log items.
2. Decreased the log retention interval.
3. Increased the disk space.

Moreover, I also tried to redeploy the CDAP Kafka server to a dedicated node to better serve log collection and storage for the CDAP applications. After restarting CDAP on CDH 5.5.2, I encountered an error, and only the following files were generated in the kafka-logs directory on the newly assigned node:
.kafka_cleanshutdown  .  replication-offset-checkpoint  recovery-point-offset-checkpoint  .lock

The logs and metrics could not be shown in the CDAP UI.

Then I changed the Kafka server back to the previous node (on which the kafka-logs directory had been cleaned), and the whole directory structure was rebuilt after restarting CDAP:
.                                 audit-5                 audit-6         notifications-stream-7
replication-offset-checkpoint     logs.user-v2-8          audit-9         notifications-streamInternal-0
recovery-point-offset-checkpoint  audit-8                 logs.user-v2-0  notifications-streamInternal-1
logs.user-v2-6                    audit-1                 logs.user-v2-3  notifications-streamInternal-2
logs.user-v2-1                    audit-3                 logs.user-v2-5  notifications-streamInternal-3
logs.user-v2-9                    audit-0                 metrics-0       notifications-streamInternal-4
audit-2                           notifications-stream-0  metrics-1       notifications-streamInternal-5
audit-4                           notifications-stream-1  metrics-2       notifications-streamInternal-6
logs.user-v2-4                    notifications-stream-2  metrics-3       notifications-streamInternal-7
logs.user-v2-7                    notifications-stream-3  metrics-4       notifications-streamInternal-8
metrics-8                         notifications-stream-4  metrics-5       notifications-streamInternal-9
logs.user-v2-2                    notifications-stream-6  metrics-6       .lock
audit-7                           notifications-stream-8  metrics-7       ..
notifications-stream-5            notifications-stream-9  metrics-9

Could you let me know the reason for this issue, and how to fix it? Are extra manual steps needed to move the CDAP Kafka server besides the role assignment operation in Cloudera Manager 5.7.0?

A log snippet from the CDAP stderr.log is attached here:
16/10/27 01:26:51 INFO client.ClientUtils$: Fetching metadata from broker id:0,host:<NODE_NAME>,port:9092 with correlation id 560493 for 1 topic(s) Set(metrics)
16/10/27 01:26:51 INFO producer.SyncProducer: Connected to <NODE_NAME>:9092 for producing
16/10/27 01:26:51 INFO producer.SyncProducer: Disconnecting from <NODE_NAME>:9092
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 0     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,0]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 1     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,1]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 2     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,2]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 3     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,3]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 4     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,4]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 5     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,5]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 6     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,6]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 7     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,7]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 8     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,8]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN producer.BrokerPartitionInfo: Error while fetching metadata      partition 9     leader: none    replicas:       isr:    isUnderReplicated: false for topic partition [metrics,9]: [class kafka.common.LeaderNotAvailableException]
16/10/27 01:26:51 WARN async.DefaultEventHandler: Failed to send data since partitions [metrics,3] don't have a leader
...
16/10/27 01:30:24 ERROR utils.Utils$: fetching topic metadata for topics [Set(metrics)] from broker [ArrayBuffer(id:0,host:<NODE_NAME>,port:9092)] failed
kafka.common.KafkaException: fetching topic metadata for topics [Set(metrics)] from broker [ArrayBuffer(id:0,host:<NODE_NAME>,port:9092)] failed
        at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:72)
        at kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
        at kafka.producer.async.DefaultEventHandler$$anonfun$handle$2.apply$mcV$sp(DefaultEventHandler.scala:78)
        at kafka.utils.Utils$.swallow(Utils.scala:172)
        at kafka.utils.Logging$class.swallowError(Logging.scala:106)
        at kafka.utils.Utils$.swallowError(Utils.scala:45)
        at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:78)
        at kafka.producer.Producer.send(Producer.scala:77)
        at kafka.javaapi.producer.Producer.send(Producer.scala:42)
        at org.apache.twill.internal.kafka.client.SimpleKafkaPublisher$SimplePreparer.send(SimpleKafkaPublisher.java:122)
        at co.cask.cdap.metrics.collect.KafkaMetricsCollectionService.publish(KafkaMetricsCollectionService.java:94)
        at co.cask.cdap.metrics.collect.AggregatedMetricsCollectionService.publishMetrics(AggregatedMetricsCollectionService.java:104)
        at co.cask.cdap.metrics.collect.AggregatedMetricsCollectionService.runOneIteration(AggregatedMetricsCollectionService.java:99)
        at com.google.common.util.concurrent.AbstractScheduledService$1$1.run(AbstractScheduledService.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException
        at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
        at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
        at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
        at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
        at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
        ... 20 more
16/10/27 01:30:24 ERROR async.DefaultEventHandler: Failed to send requests for topics metrics with correlation ids in [562278,562301]

Sagar Kapare

Oct 27, 2016, 10:19:07 PM
to CDAP User, Fredrik
Hi Fredrik,

Once the Kafka server is moved to the new node, you will need to perform some cleanup of the state, as described in the JIRA https://issues.cask.co/browse/CDAP-7513:

1. Stop cdap-master.
2. Stop cdap-kafka-server.
3. Clear the ZooKeeper state by connecting with zookeeper-client and removing the Kafka directory with the command: rmr /cdap/kafka
4. Run the following command to delete the log offset state in HBase:

$ t=`date +%s%3N000000`; for r in d e; do for i in {0..9}; do echo "delete 'cdap_system:log.meta', \"\x00\x00\x00${r}logs.user-v2\x00\x00\x00\x0$i\", 'd:', $t"; done; done | hbase shell

5. Start cdap-kafka-server.
6. Start cdap-master.
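For anyone wary of step 4: the one-liner only generates twenty delete statements (row-key prefixes \x00\x00\x00d and \x00\x00\x00e, partitions 0 through 9) and pipes them into the HBase shell. Dropping the final | hbase shell pipe lets you preview exactly what would run before touching the table.

```shell
# Same generator as step 4, minus the `| hbase shell` pipe, so the
# statements can be inspected before anything is deleted.
t=`date +%s%3N000000`
for r in d e; do
  for i in {0..9}; do
    echo "delete 'cdap_system:log.meta', \"\x00\x00\x00${r}logs.user-v2\x00\x00\x00\x0$i\", 'd:', $t"
  done
done
```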

Let me know if this helps.

Thanks and Regards,
Sagar

rus...@cask.co

Nov 1, 2016, 12:25:25 PM
to CDAP User, frederic...@gmail.com
Hi Fredrik, Did this solve your issue?

Thanks,
Russ

Fredrik

Nov 1, 2016, 8:56:12 PM
to CDAP User, frederic...@gmail.com
Hello Russ,

I followed the instruction provided by Sagar. Thanks Sagar.

It succeeded :)

However, I also have a question about the CDAP meta table: it looks like there were no related records in the HBase table for the topic logs.user-v2?

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x00", 'd:', 1478045806064000000
0 row(s) in 0.4090 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x01", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x02", 'd:', 1478045806064000000
0 row(s) in 0.0060 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x03", 'd:', 1478045806064000000
0 row(s) in 0.0050 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x04", 'd:', 1478045806064000000
0 row(s) in 0.0040 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x05", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x06", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x07", 'd:', 1478045806064000000
0 row(s) in 0.0040 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x08", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00dlogs.user-v2\x00\x00\x00\x09", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x00", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x01", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x02", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x03", 'd:', 1478045806064000000
0 row(s) in 0.0020 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x04", 'd:', 1478045806064000000
0 row(s) in 0.0040 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x05", 'd:', 1478045806064000000
0 row(s) in 0.0040 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x06", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x07", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x08", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds

delete 'cdap_system:log.meta', "\x00\x00\x00elogs.user-v2\x00\x00\x00\x09", 'd:', 1478045806064000000
0 row(s) in 0.0030 seconds


Thanks.

Best Regards,
Frederic

rus...@cask.co

Nov 3, 2016, 2:03:04 PM
to CDAP User, frederic...@gmail.com
Hi Frederic, I'm not sure I completely understand what you are asking. Did you run the command from Sagar multiple times, or are those lines from the first time you ran the command?

Thanks,
Russ

Fredrik

Nov 7, 2016, 10:28:09 PM
to CDAP User, frederic...@gmail.com
Hi Russ,

I only ran the command once. It looks like there were no rows for the topic logs.user-v2 in the HBase table that needed to be removed, since no rows were deleted by that command.

Thank you.

Best Regards,
Frederic

Shankar Selvam

Nov 8, 2016, 6:22:38 PM
to cdap...@googlegroups.com, Fredrik
Hi Fredrik,

In the script, for each partition row key, we add a column family delete marker, and this delete marker has the most recent timestamp (based on the current time). So when you scan 'cdap_system:log.meta' after running the script, you will see three entries for a row key.

Example:

\x00\x00\x00elogs.user-v2\x00\x00\x00\x01  column=d:, timestamp=1478646758334000000, value=
\x00\x00\x00elogs.user-v2\x00\x00\x00\x01  column=d:maxEventTime, timestamp=1478300534381000000, value=\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF
\x00\x00\x00elogs.user-v2\x00\x00\x00\x01  column=d:nextOffset, timestamp=1478300534381000000, value=\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF

The timestamp of the deleted column family entry (the first row) is the most recent, 1478646758334000000. CDAP picks that most recent value, sees that the row key has no value for this column family, and ignores the other two rows. Hope this helps.

Thanks
Shankar


Fredrik

Nov 8, 2016, 9:25:30 PM
to CDAP User, frederic...@gmail.com
Hi Shankar,


Thank you for the detailed information.

Were those three rows inserted by an HBase coprocessor? Or is this a feature of the HBase delete operation?

I have scanned the table 'cdap_system:log.meta'; there are only 36 rows. In order to free HDFS space, I have deleted many avro directories that appear as the column values of those rows. Will this impact CDAP for the corresponding applications (e.g., CDAPFlowTest.WhoFlow, according to the record below)?

\x00\x00\x00\xC8M0116501:CDAPFlowTest:WhoFlow  column=d:\x00\x00\x01X"\x90\x89\xEC, timestamp=1478048653159000000, value=hdfs://mycask-001.mydomain.com:8020/cdap/namespaces/SVC/logs/avro/CDAPFlowTest/flow-WhoFlow/2016-11-02/1478048653039.avro

Best Regards,
Fredrik

Shankar Selvam

Nov 8, 2016, 10:01:58 PM
to cdap...@googlegroups.com, Fredrik
Hi Fredrik,

Replies inline.

Are those 3 rows inserted by HBase coprocessor? Or is a feature of HBase deletion operation?
Only the first row, for column family d:, is added due to the delete; the other two rows, for the columns nextOffset and maxEventTime, are managed and updated by CDAP.

I have scanned the table 'cdap_system:log.meta', only 36 rows. In order to free the HDFS spaces, I have deleted many avro directories which are the column values of those rows, will this impact the CDAP running for those CDAP applications (e.g., CDAPFlowTest.WhoFlow according to the below record)?
If you have deleted some files, you won't be able to retrieve the logs for those time ranges, but it should not affect the writes. It shouldn't cause any issues.
Thanks
Shankar


Fredrik

Nov 8, 2016, 10:30:31 PM
to CDAP User, frederic...@gmail.com
Hi Shankar,

Got it! Thank you :)

Best Regards,
Fredrik