Historical nodes not loading segments


Nina

Jun 9, 2021, 3:40:39 PM
to Druid User
Hello Everyone, 

On Druid 0.20.2, our dev cluster ingests data successfully, but once the data is ingested the historical server doesn't load any segments.

The UI shows that the datasource has segments, but all 59 of them are unavailable.
[Attachment: Screen Shot 2021-06-09 at 2.23.58 PM.png]

In addition, the load queue is empty. Occasionally segments do show up in the queue, but nothing happens afterwards.

[Attachment: Screen Shot 2021-06-09 at 2.28.39 PM.png]

Here are the log errors we are currently getting:

```
Jun 9 12:14:46 druid-data-6 druid-data ERROR [ZKCoordinator--0] org.apache.druid.server.coordination.SegmentLoadDropHandler - Failed to load segment for dataSource: {class=org.apache.druid.server.coordination.SegmentLoadDropHandler, exceptionType=class org.apache.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none_2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z_2021-05-22T13:04:13.822Z_4], segment=DataSegment{binaryVersion=9, id=8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none_2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z_2021-05-22T13:04:13.822Z_4, loadSpec={type=>s3_zip, bucket=>druid-deep-storage-druid-devops-us, key=>druid/segments/8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none/2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z/2021-05-22T13:04:13.822Z/4/1fc09d88-e965-4c28-8dfc-17acfe7e407e/index.zip, S3Schema=>s3n}, dimensions=[serviceName, serviceId, enterpriseAccountName, enterpriseAccountGuid, enterpriseAccountPlan, enterpriseAccountStatus, accountName, accountGuid, accountPlan, accountStatus, costType, currency, planId, planPricingRegion, instancePlanName, instanceGroupType, instanceGroupName, instanceName, instanceId, usageMetric, usageUnit, cumulativeUsageQuantity, deltaUsageQuantity, cumulativeUsageCost, deltaUsageCost, cumulativeOriginalUsageCost, deltaOriginalUsageCost, instanceCreator, instanceAgeInDays, instanceRegion, sentAt, messageId, instanceTags, instanceTagKeys, instanceTagValues], metrics=[], shardSpec=NumberedShardSpec{partitionNum=4, partitions=0}, lastCompactionState=null, size=112954245}}
org.apache.druid.segment.loading.SegmentLoadingException: Exception loading segment[8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none_2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z_2021-05-22T13:04:13.822Z_4]
    at org.apache.druid.server.coordination.SegmentLoadDropHandler.loadSegment(SegmentLoadDropHandler.java:269) ~[druid-server-0.20.2.jar:0.20.2]
    at org.apache.druid.server.coordination.SegmentLoadDropHandler.addSegment(SegmentLoadDropHandler.java:313) ~[druid-server-0.20.2.jar:0.20.2]
    at org.apache.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:61) ~[druid-server-0.20.2.jar:0.20.2]
    at org.apache.druid.server.coordination.ZkCoordinator.lambda$childAdded$2(ZkCoordinator.java:147) ~[druid-server-0.20.2.jar:0.20.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.apache.druid.segment.loading.SegmentLoadingException: Failed to load segment 8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none_2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z_2021-05-22T13:04:13.822Z_4 in all locations.
    at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.loadSegmentWithRetry(SegmentLoaderLocalCacheManager.java:208) ~[druid-server-0.20.2.jar:0.20.2]
    at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:162) ~[druid-server-0.20.2.jar:0.20.2]
    at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:129) ~[druid-server-0.20.2.jar:0.20.2]
    at org.apache.druid.server.SegmentManager.getAdapter(SegmentManager.java:278) ~[druid-server-0.20.2.jar:0.20.2]
    at org.apache.druid.server.SegmentManager.loadSegment(SegmentManager.java:224) ~[druid-server-0.20.2.jar:0.20.2]
    at org.apache.druid.server.coordination.SegmentLoadDropHandler.loadSegment(SegmentLoadDropHandler.java:265) ~[druid-server-0.20.2.jar:0.20.2]
```


What we have tried so far:
- Verified that segments are being pushed to deep storage
- Verified that our deep storage configuration uses S3, not local storage
- Restarted the historical / coordinator pods


Does anyone know how to resolve this issue?


Thanks, 

Nina


Diego Lucas Jiménez

Jun 10, 2021, 4:47:39 AM
to Druid User
Hi Nina,

I have a few things in mind; let's see if any of them helps:
Did you set up the deep storage credentials properly? In Druid, the service account needs to be loaded, and that service account needs the proper permissions (https://druid.apache.org/docs/latest/development/extensions-core/s3.html).
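For reference, this is roughly what the relevant part of common.runtime.properties looks like. A minimal sketch only — the bucket name and keys below are placeholders, not values from this thread:

```properties
# Load the S3 extension and point deep storage at the bucket.
druid.extensions.loadList=["druid-s3-extensions"]
druid.storage.type=s3
druid.storage.bucket=your-bucket-name
druid.storage.baseKey=druid/segments

# Static credentials (alternatively use environment variables, an instance
# profile, or a credentials file -- see the docs linked above).
druid.s3.accessKey=YOUR_ACCESS_KEY
druid.s3.secretKey=YOUR_SECRET_KEY
```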

Another possibility is a firewall/network/DNS issue. Can the ZooKeepers talk to the historical, the coordinator, etc., and to each other? I once had a silly issue on my side with the brokers because I forgot to open their ports ^^U.

I hope this helps! Check these things out, and if they don't solve it, let us know and we'll try to help you from there :).

Nina

Jun 10, 2021, 12:16:06 PM
to Druid User

Hello Diego, 

Regarding the service account being loaded, how would I verify that our Druid has it loaded? I do know that our segments are being pushed to our S3 bucket, though. As for permissions, the S3 permission configuration for Druid is left at its default: druid.storage.disableAcl=false. When I port-forward our historical pods and hit the API endpoint /druid/historical/v1/loadstatus, I get the JSON body below, so I believe the port is open and working:
```
{
   "cacheInitialized": true
}
```

Sometimes we do see segments being loaded onto the historical, but nothing happens afterwards and it returns to the empty load/drop queues state.
[Attachment: Screen Shot 2021-06-10 at 10.44.04 AM.png]


I believe our ZooKeepers are able to speak with the historical and coordinator, as I do not see any error logs in our ZooKeeper pods. Logs here:

Zookeeper logs:
  • Jun 10 10:40:10 zk-0 zookeeper INFO [myid:1] - INFO [NIOWorkerThread-2:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:43370
  • Jun 10 10:40:10 zk-6 zookeeper INFO [myid:7] - INFO [NIOWorkerThread-2:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:36690
  • Jun 10 10:40:10 zk-1 zookeeper INFO [myid:2] - INFO [NIOWorkerThread-1:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:51778
  • Jun 10 10:40:11 zk-2 zookeeper INFO [myid:3] - INFO [NIOWorkerThread-1:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:42182

Druid logs:
  • Jun 10 10:40:08 druid-data-1 druid-data INFO [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none/2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z/2021-05-22T13:04:13.822Z/4]
  • Jun 10 10:40:08 druid-data-1 druid-data INFO [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none/2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z/2021-05-22T13:04:13.822Z]
  • Jun 10 10:40:08 druid-data-1 druid-data INFO [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none/2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z]
  • Jun 10 10:40:08 druid-data-1 druid-data INFO [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none]
  • Jun 10 10:40:08 druid-data-1 druid-data WARN [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Asked to cleanup something[8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none_2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z_2021-05-22T13:04:13.822Z_4] that didn't exist. Skipping.
  • Jun 10 10:40:08 druid-data-1 druid-data WARN [ZKCoordinator--0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - No path to unannounce segment[8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none_2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z_2021-05-22T13:04:13.822Z_4]
  • Jun 10 10:40:08 druid-data-1 druid-data INFO [ZKCoordinator--0] org.apache.druid.server.SegmentManager - Told to delete a queryable for a dataSource[8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none] that doesn't exist.
  • Jun 10 10:40:08 druid-data-1 druid-data WARN [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Asked to cleanup something[8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none_2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z_2021-05-22T13:04:13.822Z_4] that didn't exist. Skipping.
  • Jun 10 10:40:08 druid-data-1 druid-data WARN [ZKCoordinator--0] org.apache.druid.server.coordination.SegmentLoadDropHandler - Unable to delete segmentInfoCacheFile[var/druid/segment-cache/info_dir/8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none_2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z_2021-05-22T13:04:13.822Z_4]


In addition, I do not see the segments appearing under var/druid/segment-cache/info_dir/ or var/druid/segment-cache/.

Diego Lucas Jiménez

Jun 18, 2021, 6:18:58 AM
to Druid User
Hi Nina,
Sorry for the delay, I have quite a lot on my plate too.

Try the following to check the ZK <--> historical connection:

```
# From a ZooKeeper node
zk-0:~$ telnet druid-data-1 8083
```

It should be able to connect. If it can't, you have a firewall/network issue.

Now, for S3: https://druid.apache.org/docs/latest/development/extensions-core/s3.html#configuration describes how to configure the S3 "client" on the Druid side.
Double-check that you have druid.s3.accessKey and druid.s3.secretKey in place (or the environment variables, or the credentials file, whichever method you went with).

Another good test is to download a segment from the S3 bucket where you are storing them, using Boto3 or any other client with the same credentials your Druid is using. If that fails, you know Druid will fail to download too, and it's time to review the S3 bucket permissions and service account.
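A sketch of that check with the AWS CLI, using the bucket and key copied from the loadSpec in the error message earlier in the thread (the actual download line is commented out since it needs live credentials):

```shell
# Bucket and key exactly as printed in the historical's loadSpec error.
BUCKET="druid-deep-storage-druid-devops-us"
KEY="druid/segments/8bff0003-8393-42d1-a79b-3d5c3ffbfba2-rated-usage-none/2021-05-19T00:00:00.000Z_2021-05-19T01:00:00.000Z/2021-05-22T13:04:13.822Z/4/1fc09d88-e965-4c28-8dfc-17acfe7e407e/index.zip"

echo "s3://${BUCKET}/${KEY}"

# Uncomment to attempt the download with the same credentials Druid uses;
# an AccessDenied here means Druid's pulls would fail the same way.
# aws s3 cp "s3://${BUCKET}/${KEY}" /tmp/index.zip
```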

Bear in mind that I am not using AWS myself, so my ability to test the specifics is limited.

Best regards,

Nina

Jun 24, 2021, 8:23:33 AM
to Druid User
Hello Diego, 

No worries on the delay. 

I was able to resolve the issue by disabling Druid's server-side encryption. I'm unsure why using server-side encryption together with an encrypted bucket caused Druid to have trouble pulling segments from the bucket while it could still push to it. There may have been a problem with the way our credentials were set up; I haven't dug deeper into it, since relying on the encrypted bucket alone was sufficient for us.
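For anyone who hits the same thing: on the Druid side, server-side encryption for S3 is controlled by druid.storage.sse.type in common.runtime.properties. A sketch of what "disabling server-side encryption" looks like there, assuming SSE was previously enabled via that property (this is not a copy of the actual config from this cluster):

```properties
# Previously (Druid-side server-side encryption enabled):
# druid.storage.sse.type=s3

# Now the property is removed/commented out, so Druid no longer requests
# SSE on its own; bucket-level default encryption in S3 still applies.
```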



Thanks, 

Nina