Historical node crashes when data is growing


Roman Brunnemann

unread,
Aug 16, 2016, 10:38:30 AM
to Druid User
Our historical node has the following setup:

Disk Space: > 2TB
Memory: 128 GB

The config looks like this:

druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=4

druid.extensions.localRepository=/home/druid/.m2_hdfs/repository
druid.extensions.coordinates=["io.druid.extensions:druid-examples","io.druid.extensions:druid-kafka-eight","io.druid.extensions:mysql-metadata-storage","io.druid.extensions:druid-hdfs-storage:0.8.0-rc1"]

druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]

# Zookeeper
druid.zk.service.host=XXX.XXX.XXX.XXX


# If you choose to compress ZK announcements, you must do so for every node type
druid.announcer.type=batch
druid.curator.compress=true

druid.discovery.curator.path=/hdfs/discovery
druid.segmentCache.locations=[{"path": "/data/druid_hdfs/indexCache", "maxSize"\: 300000000000}]
druid.server.maxSize=300000000000


# Metadata Storage (mysql)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc\:mysql\://XXX.XXX.XXX.XXX\:3306/druid_hdfs
druid.metadata.storage.connector.user=xxxxxxxxxx
druid.metadata.storage.connector.password=xxxxxxxxxxx

druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://xxxxxxx.xxxxx.xxxxx:8020/user/druid/data

# Query Cache (we use a simple 10mb heap-based local cache on the broker)
druid.cache.type=local
druid.cache.sizeInBytes=10000000

druid.emitter=logging

druid.emitter.logging.logLevel=debug


I am starting the historical node with the following command:

nohup java -server \
  -Xmx12g \
  -Xms12g \
  -XX:NewSize=6g \
  -XX:MaxNewSize=6g \
  -XX:MaxDirectMemorySize=32g \
  -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Duser.timezone=UTC \
  -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager \
  -Dfile.encoding=UTF-8 \
  -Djava.io.tmpdir=/tmp/druid_hdfs \
  -classpath config/_common:config/historical:lib/*:`hadoop classpath` \
  io.druid.cli.Main server historical
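
For reference, the processing buffers alone should need far less direct memory than the 32g configured above; a rough check, assuming the usual rule of thumb of (druid.processing.numThreads + 1) * druid.processing.buffer.sizeBytes for this Druid version:

# Rough direct-memory estimate for the processing buffers (rule of thumb, not an
# exact accounting of all native memory the process uses):
echo $(( (4 + 1) * 100000000 ))   # 500000000 bytes (~0.5 GB), well under -XX:MaxDirectMemorySize=32g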


Everything worked fine for over a year. But now that the data has grown and the files in my druid.segmentCache.locations have become bigger than 115 GB, the server is no longer able to load any segments and even crashes with messages like:

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 28520448 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /usr/local/druid-0.8.0-rc1/hs_err_pid25770.log


Sometimes it just throws exceptions like:

13:50:57.264 [ZkCoordinator-0] ERROR io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[ .... 
io.druid.segment.loading.SegmentLoadingException: Exception loading segment[xxxxxxx
Caused by: io.druid.segment.loading.SegmentLoadingException: Error loading [hdfs://xxxxx.xxxx.xxxx
Caused by: java.io.IOException: No FileSystem for scheme: hdfs

But these HDFS exceptions seem to point in the wrong direction. I checked manually with hadoop fs -ls <filepath from exception> and the files were there.


Regarding the memory output:

I still had around 22 GB of RAM free during that time. Restarting the historical server resulted in it crashing again with the same messages. Sometimes it is the memory issue, sometimes it's the "Failed to load segment" exceptions.

Now that I have marked some old data as used=0 in the coordinator database, the server works fine. But from the first question at http://druid.io/faq.html I understood that I should be able to assign more data to a historical node than there is memory available. I can live with the fact that old data will take much longer to query.
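
Roughly, marking old segments as unused in the metadata store looks like the sketch below (this assumes the default druid_segments table; the datasource name and cut-off date are just placeholders):

# Sketch: disable (used=0) all segments of one datasource older than a cut-off date.
# Assumes the default druid_segments metadata table; adjust names and dates as needed.
mysql -h XXX.XXX.XXX.XXX -u druid -p druid_hdfs -e \
  "UPDATE druid_segments SET used = 0
   WHERE dataSource = 'my_datasource' AND start < '2015-01-01T00:00:00.000Z';"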

I am using Druid 0.8.

So what am I doing wrong, if it should be possible to have more data on the historical node than there is memory available?

Thanks for your help.
Best regards
Roman



Fangjin Yang

unread,
Aug 16, 2016, 6:17:05 PM
to Druid User
Hi Roman, this error points to the HDFS extension not being set up correctly. How have you included the HDFS extension and the appropriate Hadoop conf XML files in your classpath?
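
For example, a common way to make both visible to the historical JVM is to put the Hadoop configuration directory on the classpath alongside the Druid jars (a sketch only; /etc/hadoop/conf is an assumed location, adjust to your installation):

# Sketch: historical classpath with the Hadoop conf directory included so that
# core-site.xml / hdfs-site.xml are picked up (paths are placeholders).
java -server -Xmx12g -Xms12g -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath config/_common:config/historical:/etc/hadoop/conf:lib/*:$(hadoop classpath) \
  io.druid.cli.Main server historical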

Roman Brunnemann

unread,
Aug 17, 2016, 4:17:55 AM
to Druid User
Hi,

thanks for your answer. I doubt this is an HDFS problem, as this setup worked perfectly for over a year without any changes. And usually it's not the HDFS error that pops up, but the memory problem.

Can you confirm that I should be able to assign more data to a historical node than there is memory available? And what could be the reason that it runs out of memory when there is still 20 GB left on the system?

Thanks a lot for your help.

Fangjin Yang

unread,
Aug 25, 2016, 4:46:34 PM
to Druid User
Hi Roman, yes, you can definitely have more segments stored on disk than available memory.

For the error: 
13:50:57.264 [ZkCoordinator-0] ERROR io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[ .... 
io.druid.segment.loading.SegmentLoadingException: Exception loading segment[xxxxxxx
Caused by: io.druid.segment.loading.SegmentLoadingException: Error loading [hdfs://xxxxx.xxxx.xxxx
Caused by: java.io.IOException: No FileSystem for scheme: hdfs

To me, that indicates that the historicals are not able to talk to HDFS and correctly download segments.

For the error:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 28520448 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /usr/local/druid-0.8.0-rc1/hs_err_pid25770.log

You seem to have allocated more memory than what is available on your box. Are you running any other processes on this box that may require a lot of memory?

Roman Brunnemann

unread,
Aug 26, 2016, 7:24:41 AM
to Druid User
Hi, no, that's the strange thing. There was still more than 20 GB of free memory when it crashed again today. But after moving to 0.9.1.1 it seems to load a lot more data than there is memory available without crashing. So the update alone seems to have solved the issue. Thanks anyway for your help.

Dražen Bandić

unread,
Jan 17, 2017, 10:23:26 AM
to Druid User
Hi Fangjin,

I have the exact same problem as Roman, but on Druid 0.9.1.1.

Suddenly the historical nodes are crashing because of out-of-memory errors, but there is always about 30 GB of RAM free when they crash.
The configs for the historical nodes are the same as in Roman's case.

Error:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f7db0acf000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /opt/druid-0.9.1.1/hs_err_pid587779.log


I use the Kafka indexing service with HDFS as deep storage, and I also get this exception when loading segments:

 2017-01-17T14:12:09,696 ERROR [ZkCoordinator-loading-0] io.druid.server.coordination.ZkCoordinator - [topic_2016-12-30T00:00:00.000Z_2016-12-31T00:00:00.000Z_2016-12-30T00:00:00.988Z_191] failed t
 io.druid.segment.loading.SegmentLoadingException: Exception loading segment[druid_v3_2016-12-30T00:00:00.000Z_2016-12-31T00:00:00.000Z_2016-12-30T00:00:00.988Z_191]
 at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:309) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.server.coordination.ZkCoordinator.access$300(ZkCoordinator.java:62) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.server.coordination.ZkCoordinator$3.run(ZkCoordinator.java:398) [druid-server-0.9.1.1.jar:0.9.1.1]
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_91]
 at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
 at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
 Caused by: io.druid.segment.loading.SegmentLoadingException: Map failed
 at io.druid.segment.loading.MMappedQueryableIndexFactory.factorize(MMappedQueryableIndexFactory.java:52) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:96) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 ... 7 more
 Caused by: java.io.IOException: Map failed
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:940) ~[?:1.8.0_91]
 at com.google.common.io.Files.map(Files.java:864) ~[guava-16.0.1.jar:?]
 at com.google.common.io.Files.map(Files.java:851) ~[guava-16.0.1.jar:?]
 at com.google.common.io.Files.map(Files.java:818) ~[guava-16.0.1.jar:?]
 at com.google.common.io.Files.map(Files.java:790) ~[guava-16.0.1.jar:?]
 at com.metamx.common.io.smoosh.SmooshedFileMapper.mapFile(SmooshedFileMapper.java:124) ~[java-util-0.27.9.jar:?]
 at io.druid.segment.IndexIO$V9IndexLoader.load(IndexIO.java:1023) ~[druid-processing-0.9.1.1.jar:0.9.1.1]
 at io.druid.segment.IndexIO.loadIndex(IndexIO.java:216) ~[druid-processing-0.9.1.1.jar:0.9.1.1]
 at io.druid.segment.loading.MMappedQueryableIndexFactory.factorize(MMappedQueryableIndexFactory.java:49) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:96) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 ... 7 more
 Caused by: java.lang.OutOfMemoryError: Map failed
 at sun.nio.ch.FileChannelImpl.map0(Native Method) ~[?:1.8.0_91]
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937) ~[?:1.8.0_91]
 at com.google.common.io.Files.map(Files.java:864) ~[guava-16.0.1.jar:?]
 at com.google.common.io.Files.map(Files.java:851) ~[guava-16.0.1.jar:?]
 at com.google.common.io.Files.map(Files.java:818) ~[guava-16.0.1.jar:?]
 at com.google.common.io.Files.map(Files.java:790) ~[guava-16.0.1.jar:?]
 at com.metamx.common.io.smoosh.SmooshedFileMapper.mapFile(SmooshedFileMapper.java:124) ~[java-util-0.27.9.jar:?]
 at io.druid.segment.IndexIO$V9IndexLoader.load(IndexIO.java:1023) ~[druid-processing-0.9.1.1.jar:0.9.1.1]
 at io.druid.segment.IndexIO.loadIndex(IndexIO.java:216) ~[druid-processing-0.9.1.1.jar:0.9.1.1]
 at io.druid.segment.loading.MMappedQueryableIndexFactory.factorize(MMappedQueryableIndexFactory.java:49) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:96) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]
 ... 7 more


I have tried changing the Xms and Xmx parameters, but nothing helped. There is also no other program that uses much RAM, so I would really appreciate a hint as to where to look for a solution.

Thanks,
Drazen

Ben Vogan

unread,
Jan 17, 2017, 11:13:07 AM
to druid...@googlegroups.com
Hi Dražen,

Check how many segments are allocated to your historicals. Linux has a default limit of 65536 memory mappings per process, and if you exceed it you will get OOM errors. If this is the case, it would most likely be beneficial to compact your segments so that they are in the 400-700 MB range; alternatively, you can increase the Linux setting (vm.max_map_count).
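
For reference, a quick way to check both sides of this on the historical box (a sketch; the pgrep pattern and the 262144 value are just examples, pick whatever fits your segment count):

# Count the memory mappings currently held by the historical process
# (assumes one historical JVM on the box; adjust the pgrep pattern if needed).
HIST_PID=$(pgrep -f 'io.druid.cli.Main server historical')
wc -l /proc/$HIST_PID/maps

# Show the per-process mapping limit and raise it on the running system.
sysctl vm.max_map_count
sudo sysctl -w vm.max_map_count=262144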

Good luck,
--Ben


pushkar priyadarshi

unread,
Jan 17, 2017, 11:13:20 AM
to druid...@googlegroups.com
You may want to check the file handle limit on your system. Druid typically memory-maps segments before loading them.
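
For example, checking and raising the open-file limit might look like this (a sketch; the druid user name and the 65536 value are assumptions for illustration):

# Check the effective open-file limit of the running historical process.
cat /proc/$(pgrep -f 'io.druid.cli.Main server historical')/limits | grep 'open files'

# To raise it persistently, add lines like these to /etc/security/limits.conf
# and restart the service (user name and value are illustrative):
#   druid  soft  nofile  65536
#   druid  hard  nofile  65536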


Dražen Bandić

unread,
Jan 18, 2017, 4:53:45 AM
to Druid User
Thank you, Ben, thank you!

I had 64894 segments to load, and the limit was 65536, as you said:

2017-01-18T08:48:46,934 INFO [main] io.druid.server.coordination.ZkCoordinator - Loading segment cache file [7095/64894]

But I guess Druid memory-maps other things besides the segments as well?


I changed vm.max_map_count to a higher value, and now the historical nodes are not failing any more. So this did the trick.
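
Persisting the change across reboots looks roughly like this (a sketch; the file name and the 262144 value are illustrative, pick a value well above your segment count):

# Persist the higher mapping limit across reboots (file name and value are examples).
echo 'vm.max_map_count = 262144' | sudo tee /etc/sysctl.d/99-druid.conf
sudo sysctl --system    # reload sysctl settings from all config files without rebooting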

P.S. I am going to have a serious talk with my sysadmin and ask him why he did not suggest this earlier. :D

Thanks again,
Drazen


Arpan Khagram

unread,
Jul 30, 2017, 3:11:29 PM
to Druid User
Hi team, we were facing the same issue on one of our historical nodes and were not able to figure out what exactly was happening and what was going wrong. This post came in quite handy and helped resolve the issue. Thanks, everyone.

Link on how to increase the number of memory-mapped areas on Linux: https://stackoverflow.com/questions/11683850/how-much-memory-could-vm-use

Regards,
Arpan Khagram