Zookeeper Crashing - Heavy Workload


Bastian Zuehlke

Oct 16, 2021, 4:56:10 AM
to Druid User
Hi,

Druid: 0.21.1
Config: Single-Server-Medium (default)

Zookeeper is crashing every few minutes:
2021-10-16T08:27:14,126 ERROR [SyncThread:0] org.apache.zookeeper.server.ZooKeeperCriticalThread - Severe unrecoverable error, from thread : SyncThread:0
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[?:1.8.0_292]
at java.io.DataOutputStream.write(DataOutputStream.java:107) ~[?:1.8.0_292]
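Given the `java.lang.OutOfMemoryError: Java heap space` in the SyncThread, one obvious mitigation is to give the bundled ZooKeeper more heap. A sketch, assuming the stock Druid single-server layout where ZK's JVM flags live in `conf/zk/jvm.config` (verify the path and current values in your distribution before changing anything):

```
-server
-Xms512m
-Xmx512m
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```

Raising `-Xmx` only buys headroom; if the OOM is driven by oversized znodes (e.g. a huge load queue), the underlying churn still needs to be addressed.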


It happens after CuratorLoadQueuePeon throws thousands of massive errors like this one:

2021-10-16 08:08:12,917 ERROR o.a.d.s.c.CuratorLoadQueuePeon [Master-PeonExec--0] Server[/druid/loadQueue/localhost:8083], throwable caught when submitting [SegmentChangeRequestLoad{segment=DataSegment{binaryVersion=9, id=v3ad-1553-b17c-0306-4c14a_live_production_2021-04-15T00:00:00.000Z_2021-04-16T00:00:00.000Z_2021-04-15T00:18:27.264Z_24, loadSpec={type=>s3_zip, bucket=>vis, key=>druid/seg/v3ad-1553-b17c-0306-4c14a_live_production/2021-04-15T00:00:00.000Z_2021-04-16T00:00:00.000Z/2021-04-15T00:18:27.264Z/24/index.zip, S3Schema=>s3n}, dimensions=[param, subType, viewHost, selConf, ueid, userAgent, geo_country, geo_city, os_family, type], metrics=[], shardSpec=NumberedShardSpec{partitionNum=24, partitions=0}, lastCompactionState=null, size=35855}}].
org.apache.druid.java.util.common.ISE: /druid/loadQueue/localhost:8083/v3ad-1553-b17c-0306-4c14a_live_production_2021-04-15T00:00:00.000Z_2021-04-16T00:00:00.000Z_2021-04-15T00:18:27.264Z_24 was never removed! Failing this operation!
at org.apache.druid.server.coordinator.CuratorLoadQueuePeon$SegmentChangeProcessor.lambda$scheduleNodeDeletedCheck$1(CuratorLoadQueuePeon.java:285) ~[druid-server-0.21.1.jar:0.21.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]

and

2021-10-16 08:27:22,278 ERROR o.a.d.s.c.d.RunRules [Coordinator-Exec--0] Unable to find matching rules!: {class=org.apache.druid.server.coordinator.duty.RunRules, segmentsWithMissingRulesCount=815483, segmentsWithMissingRules=[v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_7, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_6, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_5, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_4, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_3, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_2, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_1, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z, v3ad-9c6f-a0ae-87c4-a1bfd_preview_production__overview_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T18:07:19.942Z_1, v3ad-9c6f-a0ae-87c4-a1bfd_preview_production__overview_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T18:07:19.942Z]}

It happens after issuing the following request many times:

POST
'/druid/coordinator/v1/datasources/' + dataSource + '/markUnused'
with parameter
{
"interval": "2019-01-01T00:00:00.000Z/" + end
}
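For reference, that markUnused call can be sketched as a small helper. The endpoint path and body come from the post above; the Coordinator host/port is an assumption matching the default single-server setup:

```python
import json
from urllib import request

# Assumed default single-server Coordinator address; adjust for your setup.
COORDINATOR = "http://localhost:8081"

def build_mark_unused(data_source: str, end: str):
    """Build URL and JSON body for the Coordinator markUnused API call."""
    url = f"{COORDINATOR}/druid/coordinator/v1/datasources/{data_source}/markUnused"
    body = json.dumps({"interval": "2019-01-01T00:00:00.000Z/" + end})
    return url, body

def post_json(url: str, body: str):
    """Issue the POST (requires a running Coordinator)."""
    req = request.Request(url, data=body.encode(),
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)
```

Note that every such call covers the full interval back to 2019, so each one touches a very large number of segments.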

And sometimes we also submit a kill task:

POST
'/druid/indexer/v1/task'
with 
task = {
"type": "kill",
"dataSource": dataSource,
"interval": "2019-01-01T00:00:00.000Z/" + end
};
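The kill-task submission can be sketched the same way. Again, the task shape and endpoint are from the post above; the host/port is an assumption for the combined Coordinator/Overlord process of the single-server setup:

```python
import json

# Assumed combined Coordinator/Overlord port of the single-server setup.
OVERLORD = "http://localhost:8081"

def build_kill_task(data_source: str, end: str):
    """Build URL and JSON body for submitting a kill task to the Overlord."""
    task = {
        "type": "kill",
        "dataSource": data_source,
        "interval": "2019-01-01T00:00:00.000Z/" + end,
    }
    return f"{OVERLORD}/druid/indexer/v1/task", json.dumps(task)
```

A kill task over a multi-year interval deletes metadata and deep-storage data for every unused segment it covers, which can be a lot of ZK and Coordinator traffic at once.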


Any help is more than appreciated.

Thanks,

Bastian



Rachel Pedreschi

Oct 16, 2021, 10:53:34 AM
to druid...@googlegroups.com
When does this error happen? Can you give us some details on what you are doing on the system at the time zookeeper OOMs?


--
Rachel Pedreschi
VP Developer Relations and Community
Imply.io

Bastian Zuehlke

Oct 16, 2021, 11:54:41 AM
to Druid User
We are basically running Druid natively (no docker) on Ubuntu 18 (using standard Single-Server-Medium with extensions ["druid-datasketches", "druid-s3-extensions", "mysql-metadata-storage"] ).
Our ingestion looks like this.

Ingestion:
{
"index": {
"type": "index",
"spec": {
"dataSchema": {
"dataSource": "v3ad-699b-9307-69d4-6b5a7_live_production",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"dimensionsSpec": {
"dimensions": [
"param",
"subType",
"campaignID",
"config",
"realm",
"subID",
"viewHost",
"selConf",
"ueid",
"userAgent",
"ipAddress",
"geo_country",
"geo_city",
"os_family",
"type",
"_ms"
]
},
"timestampSpec": {
"column": "timestamp",
"format": "iso"
}
}
},
"metricsSpec": [],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "day",
"queryGranularity": "none",
"rollup": false
}
},
"ioConfig": {
"type": "index",
"firehose": {
"type": "local",
"baseDir": "/uploads/",
"filter": "v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json"
},
"appendToExisting": true
},
"tuningConfig": {
"type": "index",
"maxRowsPerSegment": 5000000,
"maxRowsInMemory": 25000,
"forceExtendableShardSpecs": true
}
}
},
"ms": 1634309971801,
"id": "7e1df48f-4ec1-4838-921e-2b94df5972af",
"name": "v3ad-699b-9307-69d4-6b5a7_live_production",
"fname": "v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json",
"indexFile": "/uploads/v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json",
"data": null,
"state": 1
},
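As an aside, the deprecated `firehose` block in the `ioConfig` above maps to the newer `inputSource`/`inputFormat` form roughly like this (a sketch against the native batch docs for this era of Druid; verify the exact field names for your version):

```json
"ioConfig": {
  "type": "index",
  "inputSource": {
    "type": "local",
    "baseDir": "/uploads/",
    "filter": "v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json"
  },
  "inputFormat": { "type": "json" },
  "appendToExisting": true
}
```

With `inputSource`, the `parser` block also goes away: `timestampSpec` and `dimensionsSpec` move directly under `dataSchema`.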

We have been using Druid for 2.5 years. We recently migrated to a newer version, which is why you still see the firehose spec.
The latest setup had been running smoothly for several months without a single error in the logs. Two days ago we noticed that new segments were correctly ingested but no longer announced in the segment-cache.
Still no errors in any of the logs.
After several Druid restarts the system got more and more unstable; at some point even ingestion stopped working.
Now ZK (we are using the standard ZK config) is permanently restarting, and we get more and more errors in the logs.

For instance this one, which looks severe:

2021-10-16 13:12:38,835 ERROR o.a.d.s.r.CoordinatorRuleManager [CoordinatorRuleManager-Exec--0] Exception while polling for rules
org.apache.druid.java.util.common.IOE: Retries exhausted, couldn't fulfill request to [http://localhost:8081/druid/coordinator/v1/rules].
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:221) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:127) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.server.router.CoordinatorRuleManager.poll(CoordinatorRuleManager.java:137) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:55) [druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:51) [druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:97) [druid-core-0.21.1.jar:0.21.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_292]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]

but 

http://localhost:8081/druid/coordinator/v1/rules returns {"_default":[{"tieredReplicants":{"_default_tier":1},"type":"loadForever"}]}

Any ideas?

Thanks,

Bastian

Laxmikant Pandhare

Oct 22, 2023, 3:30:13 PM
to Druid User
We started getting this error after restarting the Druid servers. Any help or suggestions?

Sergio Ferragut

Oct 23, 2023, 1:06:04 PM
to druid...@googlegroups.com
What version of Druid are you on?
We are generally moving away from ZK.
Depending on your version, you might want to look at changing from ZK-based functions to HTTP-based functions with parameters such as:
druid.indexer.runner.type=httpRemote

Sergio Ferragut

Oct 23, 2023, 1:21:24 PM
to druid...@googlegroups.com
Sorry, hit send before I meant to...
also
druid.serverview.type=http - for HTTP-based segment discovery
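Putting both suggestions together, the relevant lines in `common.runtime.properties` might look like the sketch below. The `druid.coordinator.loadqueuepeon.type` property is an addition here: it switches the Coordinator away from the CuratorLoadQueuePeon seen in the errors earlier in this thread, but check the docs for your exact version before relying on it:

```properties
# HTTP-based segment discovery instead of ZK server view
druid.serverview.type=http

# HTTP-based load queue peon (replaces CuratorLoadQueuePeon)
druid.coordinator.loadqueuepeon.type=http

# HTTP-based task runner
druid.indexer.runner.type=httpRemote
```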

Laxmikant Pandhare

Oct 23, 2023, 4:35:24 PM
to Druid User
Yeah - it is working now. I did something wrong and it was crashing.