Zookeeper Crashing - Heavy Workload

Bastian Zuehlke

Oct 16, 2021, 4:56:10 AM
to Druid User
Hi,

ZooKeeper is crashing every few minutes.

Druid: 0.21.1
Config: Single-Server-Medium (default)

2021-10-16T08:27:14,126 ERROR [SyncThread:0] org.apache.zookeeper.server.ZooKeeperCriticalThread - Severe unrecoverable error, from thread : SyncThread:0
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[?:1.8.0_292]
at java.io.DataOutputStream.write(DataOutputStream.java:107) ~[?:1.8.0_292]


It happens after CuratorLoadQueuePeon logs thousands of very large error messages:

2021-10-16 08:08:12,917 ERROR o.a.d.s.c.CuratorLoadQueuePeon [Master-PeonExec--0] Server[/druid/loadQueue/localhost:8083], throwable caught when submitting [SegmentChangeRequestLoad{segment=DataSegment{binaryVersion=9, id=v3ad-1553-b17c-0306-4c14a_live_production_2021-04-15T00:00:00.000Z_2021-04-16T00:00:00.000Z_2021-04-15T00:18:27.264Z_24, loadSpec={type=>s3_zip, bucket=>vis, key=>druid/seg/v3ad-1553-b17c-0306-4c14a_live_production/2021-04-15T00:00:00.000Z_2021-04-16T00:00:00.000Z/2021-04-15T00:18:27.264Z/24/index.zip, S3Schema=>s3n}, dimensions=[param, subType, viewHost, selConf, ueid, userAgent, geo_country, geo_city, os_family, type], metrics=[], shardSpec=NumberedShardSpec{partitionNum=24, partitions=0}, lastCompactionState=null, size=35855}}].
org.apache.druid.java.util.common.ISE: /druid/loadQueue/localhost:8083/v3ad-1553-b17c-0306-4c14a_live_production_2021-04-15T00:00:00.000Z_2021-04-16T00:00:00.000Z_2021-04-15T00:18:27.264Z_24 was never removed! Failing this operation!
at org.apache.druid.server.coordinator.CuratorLoadQueuePeon$SegmentChangeProcessor.lambda$scheduleNodeDeletedCheck$1(CuratorLoadQueuePeon.java:285) ~[druid-server-0.21.1.jar:0.21.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]

and

2021-10-16 08:27:22,278 ERROR o.a.d.s.c.d.RunRules [Coordinator-Exec--0] Unable to find matching rules!: {class=org.apache.druid.server.coordinator.duty.RunRules, segmentsWithMissingRulesCount=815483, segmentsWithMissingRules=[v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_7, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_6, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_5, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_4, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_3, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_2, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z_1, v3ad-a20d-aaf7-c6e9-d9d18_live_production_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T02:26:31.306Z, v3ad-9c6f-a0ae-87c4-a1bfd_preview_production__overview_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T18:07:19.942Z_1, v3ad-9c6f-a0ae-87c4-a1bfd_preview_production__overview_2021-10-07T00:00:00.000Z_2021-10-08T00:00:00.000Z_2021-10-07T18:07:19.942Z]}

It happens after issuing the following request many times:

POST
'/druid/coordinator/v1/datasources/' + dataSource + '/markUnused'
parameter
{
"interval": "2019-01-01T00:00:00.000Z/" + end
}

We also sometimes submit a kill task:

POST
'/druid/indexer/v1/task'
with 
task = {
"type": "kill",
"dataSource": dataSource,
"interval": "2019-01-01T00:00:00.000Z/" + end
};
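
For illustration, the two calls above could be sketched as small request builders. This is only a sketch under assumptions: the helper names `buildMarkUnusedRequest` and `buildKillTaskRequest` are hypothetical, `end` stands for whatever end timestamp the calling script computes (not shown in this thread), and nothing is sent over the network here.

```javascript
// Sketch only: builds the two requests described above without sending them.
// The helper names are hypothetical; the paths and payload fields are the
// ones quoted in this message.
const START = "2019-01-01T00:00:00.000Z";

// Marks all segments of `dataSource` in [START, end) as unused.
function buildMarkUnusedRequest(dataSource, end) {
  return {
    method: "POST",
    path: "/druid/coordinator/v1/datasources/" + encodeURIComponent(dataSource) + "/markUnused",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ interval: START + "/" + end }),
  };
}

// Submits a kill task for the same interval.
function buildKillTaskRequest(dataSource, end) {
  return {
    method: "POST",
    path: "/druid/indexer/v1/task",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      type: "kill",
      dataSource: dataSource,
      interval: START + "/" + end,
    }),
  };
}
```

Each returned object can then be handed to any HTTP client together with the host and port of the Druid service, which depend on the deployment.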


Any help is more than appreciated.

Thanks,

Bastian



Rachel Pedreschi

Oct 16, 2021, 10:53:34 AM
to druid...@googlegroups.com
When does this error happen? Can you give us some details on what you are doing on the system at the time zookeeper OOMs?

--
Rachel Pedreschi
VP Developer Relations and Community
Imply.io

Bastian Zuehlke

Oct 16, 2021, 11:54:41 AM
to Druid User
We are running Druid natively (no Docker) on Ubuntu 18, using the standard Single-Server-Medium configuration with the extensions ["druid-datasketches", "druid-s3-extensions", "mysql-metadata-storage"].
Our ingestion looks like this.

Ingestion:
{
  "index": {
    "type": "index",
    "spec": {
      "dataSchema": {
        "dataSource": "v3ad-699b-9307-69d4-6b5a7_live_production",
        "parser": {
          "type": "string",
          "parseSpec": {
            "format": "json",
            "dimensionsSpec": {
              "dimensions": [
                "param",
                "subType",
                "campaignID",
                "config",
                "realm",
                "subID",
                "viewHost",
                "selConf",
                "ueid",
                "userAgent",
                "ipAddress",
                "geo_country",
                "geo_city",
                "os_family",
                "type",
                "_ms"
              ]
            },
            "timestampSpec": {
              "column": "timestamp",
              "format": "iso"
            }
          }
        },
        "metricsSpec": [],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "day",
          "queryGranularity": "none",
          "rollup": false
        }
      },
      "ioConfig": {
        "type": "index",
        "firehose": {
          "type": "local",
          "baseDir": "/uploads/",
          "filter": "v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json"
        },
        "appendToExisting": true
      },
      "tuningConfig": {
        "type": "index",
        "maxRowsPerSegment": 5000000,
        "maxRowsInMemory": 25000,
        "forceExtendableShardSpecs": true
      }
    }
  },
  "ms": 1634309971801,
  "id": "7e1df48f-4ec1-4838-921e-2b94df5972af",
  "name": "v3ad-699b-9307-69d4-6b5a7_live_production",
  "fname": "v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json",
  "indexFile": "/uploads/v3ad-699b-9307-69d4-6b5a7_live_production_7e1df48f-4ec1-4838-921e-2b94df5972af.json",
  "data": null,
  "state": 1
},

We have been using Druid for 2.5 years. We recently migrated to a newer version, which is why you still see the firehose.
However, the current system had been running smoothly for several months, without a single error in the logs. Two days ago we noticed that new segments were ingested correctly but not announced in the segment cache.
There were still no errors in any of the logs.
After several Druid restarts the system became more and more unstable, and at some point even ingestion stopped working.
Now ZK (we are using the standard ZK config) is restarting constantly, and we get more and more errors in the logs.

For instance, this one looks severe:

2021-10-16 13:12:38,835 ERROR o.a.d.s.r.CoordinatorRuleManager [CoordinatorRuleManager-Exec--0] Exception while polling for rules
org.apache.druid.java.util.common.IOE: Retries exhausted, couldn't fulfill request to [http://localhost:8081/druid/coordinator/v1/rules].
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:221) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:127) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.server.router.CoordinatorRuleManager.poll(CoordinatorRuleManager.java:137) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:55) [druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:51) [druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:97) [druid-core-0.21.1.jar:0.21.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_292]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]

However, requesting the same URL directly works:

http://localhost:8081/druid/coordinator/v1/rules returns {"_default":[{"tieredReplicants":{"_default_tier":1},"type":"loadForever"}]}

Any ideas?

Thanks,

Bastian