Historical node with less data than the rest


Federico Nieves

Nov 7, 2016, 10:43:09 AM
to Druid User
Hi there!!

Our cluster consists of 4 historical nodes. All servers have the same configuration, but one of them holds less data than the others:

[screenshot: per-historical segment capacity; described later in the thread]

On that server I checked the historical's log files and found the following error:

2016-11-07T15:28:32,516 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[relyEventData_2016-11-07T11:00:00.000Z_2016-11-07T12:00:00.000Z_2016-11-07T11:00:30.211Z], segment=DataSegment{size=102894248, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[count, user_unique], dimensions=[id_partner, id_partner_user, event_type, created, url, tags, url_path, tagged, country, url_qs, vertical, url_subdomain, url_domain, segments, share_data, category, title, nav_type, ip, referer_subdomain, browser, search_keyword, id_segment_source, version, referer_qs, referer_path, referer, referer_domain, sec, data_type, gt, track_type, track_code], version='2016-11-07T11:00:30.211Z', loadSpec={type=hdfs, path=/druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip}, interval=2016-11-07T11:00:00.000Z/2016-11-07T12:00:00.000Z, dataSource='eventData', binaryVersion='9'}}
io.druid.segment.loading.SegmentLoadingException: Exception loading segment[eventData_2016-11-07T11:00:00.000Z_2016-11-07T12:00:00.000Z_2016-11-07T11:00:30.211Z]
        at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:309) ~[druid-server-0.9.1.1.jar:0.9.1.1]
        at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:350) [druid-server-0.9.1.1.jar:0.9.1.1]
        at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44) [druid-server-0.9.1.1.jar:0.9.1.1]
        at io.druid.server.coordination.ZkCoordinator$1.childEvent(ZkCoordinator.java:152) [druid-server-0.9.1.1.jar:0.9.1.1]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:522) [curator-recipes-2.10.0.jar:?]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.10.0.jar:?]
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.10.0.jar:?]
        at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.10.0.jar:?]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:514) [curator-recipes-2.10.0.jar:?]
        at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.10.0.jar:?]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:772) [curator-recipes-2.10.0.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_101]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_101]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_101]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_101]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_101]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_101]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_101]
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File /druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip does not exist
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
        at com.metamx.common.CompressionUtils.unzip(CompressionUtils.java:146) ~[java-util-0.27.9.jar:?]
        at io.druid.storage.hdfs.HdfsDataSegmentPuller.getSegmentFiles(HdfsDataSegmentPuller.java:235) ~[?:?]
        at io.druid.storage.hdfs.HdfsLoadSpec.loadSegment(HdfsLoadSpec.java:62) ~[?:?]
        at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:143) ~[druid-server-0.9.1.1.jar:0.9.1.1]
        at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:95) ~[druid-server-0.9.1.1.jar:0.9.1.1]
        at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]
        at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]
        ... 18 more
Caused by: java.io.FileNotFoundException: File /druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) ~[?:?]
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:722) ~[?:?]
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) ~[?:?]
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398) ~[?:?]
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137) ~[?:?]
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:765) ~[?:?]
        at io.druid.storage.hdfs.HdfsDataSegmentPuller$1.openInputStream(HdfsDataSegmentPuller.java:107) ~[?:?]
        at io.druid.storage.hdfs.HdfsDataSegmentPuller.getInputStream(HdfsDataSegmentPuller.java:298) ~[?:?]
        at io.druid.storage.hdfs.HdfsDataSegmentPuller$3.openStream(HdfsDataSegmentPuller.java:241) ~[?:?]
        at com.metamx.common.CompressionUtils$1.call(CompressionUtils.java:138) ~[java-util-0.27.9.jar:?]
        at com.metamx.common.CompressionUtils$1.call(CompressionUtils.java:134) ~[java-util-0.27.9.jar:?]
        at com.metamx.common.RetryUtils.retry(RetryUtils.java:60) ~[java-util-0.27.9.jar:?]
        at com.metamx.common.RetryUtils.retry(RetryUtils.java:78) ~[java-util-0.27.9.jar:?]
        at com.metamx.common.CompressionUtils.unzip(CompressionUtils.java:132) ~[java-util-0.27.9.jar:?]
        at io.druid.storage.hdfs.HdfsDataSegmentPuller.getSegmentFiles(HdfsDataSegmentPuller.java:235) ~[?:?]
        at io.druid.storage.hdfs.HdfsLoadSpec.loadSegment(HdfsLoadSpec.java:62) ~[?:?]
        at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:143) ~[druid-server-0.9.1.1.jar:0.9.1.1]
        at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:95) ~[druid-server-0.9.1.1.jar:0.9.1.1]
        at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]
        at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]
        ... 18 more

That is really weird, because the error is a FileNotFoundException, but the file is present in HDFS.
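
For reference, a check along these lines confirms the file is there (the path is copied from the loadSpec in the error above):

# does the segment archive exist in deep storage?
hdfs dfs -ls /druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip

# it is also worth running the same check from the failing historical itself,
# so it goes through that node's Hadoop client configuration
hdfs dfs -test -e /druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip && echo present || echo missing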

One thing I'd like to mention is that we have a kill task running every day for data older than 60 days.
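
The kill task we submit is the regular Druid kill task; roughly this shape (the interval below is only an illustration, not our exact payload):

{
  "type": "kill",
  "dataSource": "eventData",
  "interval": "2016-08-01T00:00:00.000Z/2016-09-08T00:00:00.000Z"
}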

Any help will be much appreciated, thank you!

Ben Vogan

Nov 7, 2016, 11:07:35 AM
to druid...@googlegroups.com
Hi Federico,

I ran into that error as well and I do not know what caused it.  What I did to resolve it was to delete the segment-cache from that historical node and let the Coordinator re-assign the segments.  You could probably just delete the single bad segment from the cache, but my cluster was in a bad spot and I was trying to reset all of my historicals.
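
Roughly, the steps on each historical were along these lines (a sketch only; the actual paths come from druid.segmentCache.locations / druid.segmentCache.infoDir in the historical's runtime.properties, and the stop/start commands depend on how you run Druid):

# stop the historical so it is not touching the cache while you delete it
sudo service druid-historical stop

# wipe the locally cached segment files and the info_dir bookkeeping
rm -rf /path/to/segment-cache/*

# start it again; the Coordinator sees an empty node and re-assigns segments to it
sudo service druid-historical start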

Good luck,
--Ben


Federico Nieves

Nov 8, 2016, 8:15:21 AM
to Druid User
Hi Ben, thank you very much for your quick response.

Yes, I saw your post on this forum, and I deleted all the content inside these folders:

CACHE_FOLDER/historical/eventData
CACHE_FOLDER/historical/info_dir
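
(CACHE_FOLDER above is just whatever druid.segmentCache.locations points to in the historical's runtime.properties; something like the following, with illustrative values:)

druid.segmentCache.locations=[{"path":"CACHE_FOLDER/historical","maxSize":130000000000}]
druid.segmentCache.infoDir=CACHE_FOLDER/historical/info_dir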

This made the Coordinator start assigning all the segments again. When it finished, it looked like the image I posted earlier: all historicals at ~70% capacity except the failing one, which is at 40%.

Any other thoughts?

Thanks again!

Federico Nieves

Nov 8, 2016, 12:23:22 PM
to Druid User
Never mind, I just found the issue: the Hadoop path was misconfigured in the service startup. We moved the Hadoop folders some time ago and it seems we forgot to update this value; because the old and new paths were very similar, it was hard to spot. I'm not sure why it was partially working even though the path was not set correctly. Thanks for the help!
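
For anyone hitting the same symptom, the kind of setting involved is where the service picks up the Hadoop client configuration at startup. A hypothetical sketch of the relevant pieces in a stock 0.9.1.1 layout (all paths are illustrative, not our actual values):

# conf/druid/_common/common.runtime.properties
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid

# historical start command: the extra classpath entry should point at the directory
# that holds core-site.xml / hdfs-site.xml; if it points at a stale location, the
# "hdfs" loadSpec can end up resolving against the local filesystem instead, which
# would match the RawLocalFileSystem frames in the stack trace above
java `cat conf/druid/historical/jvm.config | xargs` \
  -cp conf/druid/_common:conf/druid/historical:lib/*:/etc/hadoop/conf \
  io.druid.cli.Main server historical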