Zookeeper Crashing - Heavy Workload


Bastian Zuehlke

Oct 16, 2021, 4:56:10 AM
to Druid User
Hi,

Druid: 0.21.1
Config: Single-Server-Medium (default)

Zookeeper is crashing every few minutes:
2021-10-16T08:27:14,126 ERROR [SyncThread:0] org.apache.zookeeper.server.ZooKeeperCriticalThread - Severe unrecoverable error, from thread : SyncThread:0
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[?:1.8.0_292]
at java.io.DataOutputStream.write(DataOutputStream.java:107) ~[?:1.8.0_292]
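Given the `java.lang.OutOfMemoryError: Java heap space` in the SyncThread, one obvious mitigation is to give the bundled ZooKeeper more heap. A sketch, assuming the stock Druid single-server layout where ZK's JVM flags live in `conf/zk/jvm.config` (verify the path and current values in your distribution before changing anything):

```
-server
-Xms512m
-Xmx512m
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```

Raising `-Xmx` only buys headroom; if the OOM is driven by oversized znodes (e.g. a huge load queue), the underlying churn still needs to be addressed.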


It happens after CuratorLoadQueuePeon throws thousands of massive errors like this one:

2021-10-16 08:08:12,917 ERROR o.a.d.s.c.CuratorLoadQueuePeon [Master-PeonExec--0] Server[/druid/loadQueue/localhost:8083], throwable caught when submitting [SegmentChangeRequestLoad{segment=DataSegment{binaryVersion=9, id=v3ad-1553-b17c-0306-4c14a_live_production_2021-04-15T00:00:00.000Z_2021-04-16T00:00:00.000Z_2021-04-15T00:18:27.264Z_24, loadSpec={type=>s3_zip, bucket=>vis, key=>druid/seg/v3ad-1553-b17c-0306-4c14a_live_production/2021-04-15T00:00:00.000Z_2021-04-16T00:00:00.000Z/2021-04-15T00:18:27.264Z/24/index.zip, S3Schema=>s3n}, dimensions=[param, subType, viewHost, selConf, ueid, userAgent, geo_country, geo_city, os_family, type], metrics=[], shardSpec=NumberedShardSpec{partitionNum=24, partitions=0}, lastCompactionState=null, size=35855}}].
org.apache.druid.java.util.common.ISE: /druid/loadQueue/localhost:8083/v3ad-1553-b17c-0306-4c14a_live_production_2021-04-15T00:00:00.000Z_2021-04-16T00:00:00.000Z_2021-04-15T00:18:27.264Z_24 was never removed! Failing this operation!
at org.apache.druid.server.coordinator.CuratorLoadQueuePeon$SegmentChangeProcessor.lambda$scheduleNodeDeletedCheck$1(CuratorLoadQueuePeon.java:285) ~[druid-server-0.21.1.jar:0.21.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]

and

2021-10-16 08:27:22,278 ERROR o.a.d.s.c.d.RunRules [Coordinator-Exec--0] Unable to find matching rules!: {class=org.apache.druid.server.coordinator.duty.RunRules, segmentsWithMissingRulesCount=815483, segmentsWithMissingRules=[v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_7, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_6, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_5, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_4, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_3, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_2, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_1, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z, v3ad-9c6f-a0ae-87c4-a1bfd_preview_production__overview_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T18:07:19.942Z_1, v3ad-9c6f-a0ae-87c4-a1bfd_preview_production__overview_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T18:07:19.942Z]}

It happens after issuing the following request many times:

POST
'/druid/coordinator/v1/datasources/' + dataSource + '/markUnused'
with parameter
{
"interval": "2019-01-01T00:00:00.000Z/" + end
}
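For reference, that markUnused call can be sketched as a small helper. The endpoint path and body come from the post above; the Coordinator host/port is an assumption matching the default single-server setup:

```python
import json
from urllib import request

# Assumed default single-server Coordinator address; adjust for your setup.
COORDINATOR = "http://localhost:8081"

def build_mark_unused(data_source: str, end: str):
    """Build URL and JSON body for the Coordinator markUnused API call."""
    url = f"{COORDINATOR}/druid/coordinator/v1/datasources/{data_source}/markUnused"
    body = json.dumps({"interval": "2019-01-01T00:00:00.000Z/" + end})
    return url, body

def post_json(url: str, body: str):
    """Issue the POST (requires a running Coordinator)."""
    req = request.Request(url, data=body.encode(),
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)
```

Note that every such call covers the full interval back to 2019, so each one touches a very large number of segments.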

And sometimes we also submit a kill task:

POST
'/druid/indexer/v1/task'
with 
task = {
"type": "kill",
"dataSource": dataSource,
"interval": "2019-01-01T00:00:00.000Z/" + end
};
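The kill-task submission can be sketched the same way. Again, the task shape and endpoint are from the post above; the host/port is an assumption for the combined Coordinator/Overlord process of the single-server setup:

```python
import json

# Assumed combined Coordinator/Overlord port of the single-server setup.
OVERLORD = "http://localhost:8081"

def build_kill_task(data_source: str, end: str):
    """Build URL and JSON body for submitting a kill task to the Overlord."""
    task = {
        "type": "kill",
        "dataSource": data_source,
        "interval": "2019-01-01T00:00:00.000Z/" + end,
    }
    return f"{OVERLORD}/druid/indexer/v1/task", json.dumps(task)
```

A kill task over a multi-year interval deletes metadata and deep-storage data for every unused segment it covers, which can be a lot of ZK and Coordinator traffic at once.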


Any help is more than appreciated.

Thanks,

Bastian



Rachel Pedreschi

Oct 16, 2021, 10:53:34 AM
to druid...@googlegroups.com
When does this error happen? Can you give us some details on what you are doing on the system at the time zookeeper OOMs?


--
Rachel Pedreschi
VP Developer Relations and Community
Imply.io

Bastian Zuehlke

Oct 16, 2021, 11:54:41 AM
to Druid User
We are basically running Druid natively (no docker) on Ubuntu 18 (using standard Single-Server-Medium with extensions ["druid-datasketches", "druid-s3-extensions", "mysql-metadata-storage"] ).
Our ingestion looks like this.

Ingestion:
{
"index": {
"type": "index",
"spec": {
"dataSchema": {
"dataSource": "v3ad-699b-9307-69d4-6b5a7_live_production",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"dimensionsSpec": {
"dimensions": [
"param",
"subType",
"campaignID",
"config",
"realm",
"subID",
"viewHost",
"selConf",
"ueid",
"userAgent",
"ipAddress",
"geo_country",
"geo_city",
"os_family",
"type",
"_ms"
]
},
"timestampSpec": {
"column": "timestamp",
"format": "iso"
}
}
},
"metricsSpec": [],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "day",
"queryGranularity": "none",
"rollup": false
}
},
"ioConfig": {
"type": "index",
"firehose": {
"type": "local",
"baseDir": "/uploads/",
"filter": "v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json"
},
"appendToExisting": true
},
"tuningConfig": {
"type": "index",
"maxRowsPerSegment": 5000000,
"maxRowsInMemory": 25000,
"forceExtendableShardSpecs": true
}
}
},
"ms": 1634309971801,
"id": "7e1df48f-4ec1-4838-921e-2b94df5972af",
"name": "v3ad-699b-9307-69d4-6b5a7_live_production",
"fname": "v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json",
"indexFile": "/uploads/v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json",
"data": null,
"state": 1
},
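As an aside, the deprecated `firehose` block in the `ioConfig` above maps to the newer `inputSource`/`inputFormat` form roughly like this (a sketch against the native batch docs for this era of Druid; verify the exact field names for your version):

```json
"ioConfig": {
  "type": "index",
  "inputSource": {
    "type": "local",
    "baseDir": "/uploads/",
    "filter": "v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json"
  },
  "inputFormat": { "type": "json" },
  "appendToExisting": true
}
```

With `inputSource`, the `parser` block also goes away: `timestampSpec` and `dimensionsSpec` move directly under `dataSchema`.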

We have been using Druid for 2.5 years. We recently migrated to a newer version, which is why you still see the firehose spec.
The latest setup had been running smoothly for several months without a single error in the logs. Two days ago we noticed that new segments were correctly ingested but no longer announced in the segment-cache.
Still no errors in any of the logs.
After several Druid restarts the system got more and more unstable; at some point even ingestion stopped working.
Now ZK (we are using the standard ZK config) is permanently restarting, and we get more and more errors in the logs.

For instance this one, which looks severe:

2021-10-16 13:12:38,835 ERROR o.a.d.s.r.CoordinatorRuleManager [CoordinatorRuleManager-Exec--0] Exception while polling for rules
org.apache.druid.java.util.common.IOE: Retries exhausted, couldn't fulfill request to [http://localhost:8081/druid/coordinator/v1/rules].
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:221) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:127) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.server.router.CoordinatorRuleManager.poll(CoordinatorRuleManager.java:137) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:55) [druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:51) [druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:97) [druid-core-0.21.1.jar:0.21.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_292]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]

but 

http://localhost:8081/druid/coordinator/v1/rules returns {"_default":[{"tieredReplicants":{"_default_tier":1},"type":"loadForever"}]}

Any ideas?

Thanks,

Bastian

Laxmikant Pandhare

Oct 22, 2023, 3:30:13 PM
to Druid User
We started getting this error after restarting the Druid servers. Any help or suggestions?

Sergio Ferragut

Oct 23, 2023, 1:06:04 PM
to druid...@googlegroups.com
What version of Druid are you on?
We are generally moving away from ZK.
Depending on your version, you might want to look at changing from ZK-based functions to HTTP-based functions with parameters such as:
druid.indexer.runner.type=httpRemote

Sergio Ferragut

Oct 23, 2023, 1:21:24 PM
to druid...@googlegroups.com
Sorry, hit send before I meant to...
also
druid.serverview.type=http - for HTTP-based segment discovery
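Putting both suggestions together, the relevant lines in `common.runtime.properties` might look like the sketch below. The `druid.coordinator.loadqueuepeon.type` property is an addition here: it switches the Coordinator away from the CuratorLoadQueuePeon seen in the errors earlier in this thread, but check the docs for your exact version before relying on it:

```properties
# HTTP-based segment discovery instead of ZK server view
druid.serverview.type=http

# HTTP-based load queue peon (replaces CuratorLoadQueuePeon)
druid.coordinator.loadqueuepeon.type=http

# HTTP-based task runner
druid.indexer.runner.type=httpRemote
```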

Laxmikant Pandhare

Oct 23, 2023, 4:35:24 PM
to Druid User
Yeah - it is working now. I did something wrong and it was crashing.