Unstable Overlord after Druid upgrade to version 0.10.1


Jan Kogut

unread,
Sep 6, 2017, 08:34:44
to Druid User
Hello,

I upgraded Druid to version 0.10.1.
I followed the rolling-updates procedure and used the rolling restart (graceful-termination-based) strategy for the Middle Manager upgrade.

Unfortunately, after the upgrade the Overlord logs WARNINGs and ERRORs:

1. Exception while sending request

2017-09-06 11:07:30,597 WARN i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_sample_1_events-0] Exception while sending request
io.druid.java.util.common.IAE: Received 400 Bad Request with body: Task must be paused before changing the end offsets
    at io.druid.indexing.kafka.KafkaIndexTaskClient.submitRequest(KafkaIndexTaskClient.java:515) [druid-kafka-indexing-service-0.10.1.jar:0.10.1]
    at io.druid.indexing.kafka.KafkaIndexTaskClient.setEndOffsets(KafkaIndexTaskClient.java:293) [druid-kafka-indexing-service-0.10.1.jar:0.10.1]
    at io.druid.indexing.kafka.KafkaIndexTaskClient$11.call(KafkaIndexTaskClient.java:429) [druid-kafka-indexing-service-0.10.1.jar:0.10.1]
    at io.druid.indexing.kafka.KafkaIndexTaskClient$11.call(KafkaIndexTaskClient.java:425) [druid-kafka-indexing-service-0.10.1.jar:0.10.1]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]




2. Exception while getting current/latest offsets

2017-09-06 11:28:43,877 WARN i.d.i.k.s.KafkaSupervisor [KafkaSupervisor-prod_sample_1_events-Reporting-0] Exception while getting current/latest offsets
java.util.concurrent.TimeoutException: Timeout waiting for task.
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:269) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96) ~[guava-16.0.1.jar:?]
    at io.druid.indexing.kafka.supervisor.KafkaSupervisor.updateCurrentOffsets(KafkaSupervisor.java:1802) ~[?:?]
    at io.druid.indexing.kafka.supervisor.KafkaSupervisor.lambda$updateCurrentAndLatestOffsets$12(KafkaSupervisor.java:1810) ~[?:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_141]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_141]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_141]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]




3. "Kafka partitions [,,,] do not match task partitions [,,,]" occurs frequently

2017-09-06 11:25:31,410 WARN i.d.i.k.s.KafkaSupervisor [KafkaSupervisor-prod_sample_1_events-Reporting-0] Lag metric: Kafka partitions [0, 1, 2, 3, 4, 5, 6, 7] do not match task partitions [1, 2, 3, 4, 6]



The Overlord hangs, and while trying to restart it there are further problems:



Cannot get a connection, general error

2017-09-06 11:44:10,741 WARN i.d.j.u.c.RetryUtils [qtp1373220972-115] Failed on try 1, retrying in 720ms.
org.skife.jdbi.v2.exceptions.UnableToObtainConnectionException: java.sql.SQLException: Cannot get a connection, general error
    at org.skife.jdbi.v2.DBI.open(DBI.java:230) ~[jdbi-2.63.1.jar:2.63.1]
    at org.skife.jdbi.v2.DBI.withHandle(DBI.java:279) ~[jdbi-2.63.1.jar:2.63.1]
    at io.druid.metadata.SQLMetadataConnector$2.call(SQLMetadataConnector.java:129) ~[druid-server-0.10.1.jar:0.10.1]
    at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:63) [java-util-0.10.1.jar:0.10.1]
    at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) [java-util-0.10.1.jar:0.10.1]
    at io.druid.metadata.SQLMetadataConnector.retryWithHandle(SQLMetadataConnector.java:133) [druid-server-0.10.1.jar:0.10.1]
    at io.druid.metadata.SQLMetadataConnector.retryWithHandle(SQLMetadataConnector.java:142) [druid-server-0.10.1.jar:0.10.1]
    at io.druid.metadata.IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForIntervals(IndexerSQLMetadataStorageCoordinator.java:127) [druid-server-0.10.1.jar:0.10.1]
    at io.druid.metadata.IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval(IndexerSQLMetadataStorageCoordinator.java:119) [druid-server-0.10.1.jar:0.10.1]
    at io.druid.indexing.common.actions.SegmentAllocateAction.perform(SegmentAllocateAction.java:156) [druid-indexing-service-0.10.1.jar:0.10.1]
    at io.druid.indexing.common.actions.SegmentAllocateAction.perform(SegmentAllocateAction.java:54) [druid-indexing-service-0.10.1.jar:0.10.1]
    at io.druid.indexing.common.actions.LocalTaskActionClient.submit(LocalTaskActionClient.java:64) [druid-indexing-service-0.10.1.jar:0.10.1]
    at io.druid.indexing.overlord.http.OverlordResource$3.apply(OverlordResource.java:341) [druid-indexing-service-0.10.1.jar:0.10.1]
    at io.druid.indexing.overlord.http.OverlordResource$3.apply(OverlordResource.java:330) [druid-indexing-service-0.10.1.jar:0.10.1]
    at io.druid.indexing.overlord.http.OverlordResource.asLeaderWith(OverlordResource.java:673) [druid-indexing-service-0.10.1.jar:0.10.1]
    at io.druid.indexing.overlord.http.OverlordResource.doAction(OverlordResource.java:327) [druid-indexing-service-0.10.1.jar:0.10.1]
    at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_141]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_141]
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409) [jersey-server-1.19.3.jar:1.19.3]
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) [jersey-servlet-1.19.3.jar:1.19.3]
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558) [jersey-servlet-1.19.3.jar:1.19.3]
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733) [jersey-servlet-1.19.3.jar:1.19.3]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) [javax.servlet-api-3.1.0.jar:3.1.0]
    at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:286) [guice-servlet-4.1.0.jar:?]
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:276) [guice-servlet-4.1.0.jar:?]
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:181) [guice-servlet-4.1.0.jar:?]
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) [guice-servlet-4.1.0.jar:?]
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) [guice-servlet-4.1.0.jar:?]
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120) [guice-servlet-4.1.0.jar:?]
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135) [guice-servlet-4.1.0.jar:?]
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at io.druid.server.http.RedirectFilter.doFilter(RedirectFilter.java:72) [druid-server-0.10.1.jar:0.10.1]
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.Server.handle(Server.java:534) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.sql.SQLException: Cannot get a connection, general error
    at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:131) ~[commons-dbcp2-2.0.1.jar:2.0.1]
    at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1413) ~[commons-dbcp2-2.0.1.jar:2.0.1]
    at org.skife.jdbi.v2.DataSourceConnectionFactory.openConnection(DataSourceConnectionFactory.java:36) ~[jdbi-2.63.1.jar:2.63.1]
    at org.skife.jdbi.v2.DBI.open(DBI.java:212) ~[jdbi-2.63.1.jar:2.63.1]
    ... 63 more
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) ~[?:1.8.0_141]
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) ~[?:1.8.0_141]
    at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:524) ~[commons-pool2-2.2.jar:2.2]
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:438) ~[commons-pool2-2.2.jar:2.2]
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361) ~[commons-pool2-2.2.jar:2.2]
    at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:119) ~[commons-dbcp2-2.0.1.jar:2.0.1]
    at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1413) ~[commons-dbcp2-2.0.1.jar:2.0.1]
    at org.skife.jdbi.v2.DataSourceConnectionFactory.openConnection(DataSourceConnectionFactory.java:36) ~[jdbi-2.63.1.jar:2.63.1]
    at org.skife.jdbi.v2.DBI.open(DBI.java:212) ~[jdbi-2.63.1.jar:2.63.1]
    ... 63 more



"Couldn't stop Thread" and "Failed to get task runner"

2017-09-06 11:44:25,739 WARN o.e.j.u.t.QueuedThreadPool [Thread-41] qtp1373220972{STOPPING,40<=40<=40,i=0,q=36} Couldn't stop Thread[qtp1373220972-102,5,main]
2017-09-06 11:44:25,739 WARN o.e.j.u.t.QueuedThreadPool [Thread-41] qtp1373220972{STOPPING,40<=40<=40,i=0,q=36} Couldn't stop Thread[qtp1373220972-84,5,main]
2017-09-06 11:44:25,739 WARN o.e.j.u.t.QueuedThreadPool [Thread-41] qtp1373220972{STOPPING,40<=40<=40,i=0,q=36} Couldn't stop Thread[qtp1373220972-91,5,main]


2017-09-06 11:51:03,706 ERROR i.d.i.k.s.KafkaSupervisor [KafkaIndexTaskClient-prod_sample_1_events-0] Failed to get task runner because I'm not the leader!




The only way to stop it is kill -9 overlord_pid.
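(A way to capture state before the kill, in case it helps diagnose the hang; a rough sketch, assuming the JDK tools are installed and <overlord_pid> stands in for the actual process id:)

jstack -l <overlord_pid> > /tmp/overlord-threads.txt    # thread dump of the hung JVM
jstat -gcutil <overlord_pid> 1000 5 > /tmp/overlord-gc.txt    # five one-second GC samples
kill -9 <overlord_pid>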




OUR CONFIGS:


overlord.runtime.properties

druid.host={{overlord-fqdn}}
druid.port={{overlord-port}}
druid.service=druid/overlord

# # Upload all task logs to deep storage
# druid.indexer.logs.type=s3
# druid.indexer.logs.s3Bucket={{druid-s3-bucket-name}}
# druid.indexer.logs.s3Prefix=prod/logs/v1

# Run in remote mode
druid.indexer.runner.type=remote
druid.indexer.runner.minWorkerVersion=#{WORKER_VERSION}

# Store all task state in the metadata storage
druid.indexer.storage.type=metadata

druid.monitoring.monitors=["com.metamx.metrics.SysMonitor","io.druid.client.cache.CacheMonitor","io.druid.server.metrics.EventReceiverFirehoseMonitor","com.metamx.metrics.JvmMonitor"]



middleManager.runtime.properties

druid.host={{mm-fqdn}}
druid.port={{mm-port}}
druid.service=druid/middlemanager

druid.monitoring.monitors=["com.metamx.metrics.SysMonitor","io.druid.client.cache.CacheMonitor","io.druid.server.metrics.EventReceiverFirehoseMonitor","com.metamx.metrics.JvmMonitor"]


# Resources for peons
druid.indexer.runner.javaOpts=-server -Xmx6g -XX:MaxDirectMemorySize=4g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -Dcom.amazonaws.services.s3.enableV4
druid.indexer.task.baseTaskDir=/persistent/task/

# # Store task logs in deep storage --> please refer to runtime.properties in the _common dir
# druid.indexer.logs.type=s3
# druid.indexer.logs.s3Bucket=s3-deep-storage
# druid.indexer.logs.s3Prefix=prod/logs/v1

# Peon properties
druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1073741824
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.processing.numMergeBuffers=1

#druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/persistent/zk_druid", "maxSize": 0}]
druid.indexer.fork.property.druid.server.http.numThreads=100
druid.indexer.fork.property.druid.storage.archiveBaseKey=prod
druid.indexer.fork.property.druid.storage.archiveBucket={{druid-s3-bucket-name}}-archive
druid.indexer.fork.property.druid.storage.baseKey=prod/v1
druid.indexer.fork.property.druid.storage.bucket={{druid-s3-bucket-name}}
druid.indexer.fork.property.druid.storage.type=s3

druid.worker.capacity=11
druid.worker.ip=localhost
druid.worker.version=0



Any ideas?




Regards,
Jan

Slim Bouguerra

unread,
Sep 6, 2017, 09:51:45
to druid...@googlegroups.com
It does seem you have an issue with SQL metadata storage access; make sure that your Overlord points at the right SQL metadata storage for 0.10.1 and has the correct credentials.
You can manually reset the offsets to force reading from the latest one, but that can lead to loss of data: https://github.com/druid-io/druid/pull/3678/files
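If you do reset manually, the usual approach is to delete the datasource's row from the metadata store's dataSource table so the supervisor starts from fresh offsets. A rough sketch only, assuming a MySQL metadata store with the default druid_ table prefix; this is destructive, so shut down the supervisor first:

# destructive: wipes the saved Kafka offsets for one datasource
mysql -u druid -p druid -e \
  "DELETE FROM druid_dataSource WHERE dataSource = 'prod_sample_1_events';"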
 
-- 

B-Slim
_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______


Deleted post

Jan Kogut

unread,
Sep 6, 2017, 10:36:51
to Druid User
Hi Slim,

There is no problem with SQL access; otherwise ALL of the KIS tasks would fail (our Overlord task success ratio is 63.3% at the moment).




Sample SQL-access-related log lines from the Overlord:

INFO i.d.m.IndexerSQLMetadataStorageCoordinator [qtp1373220972-92] Updated metadata
INFO i.d.m.IndexerSQLMetadataStorageCoordinator [qtp1373220972-92] Published segment  
INFO i.d.m.IndexerSQLMetadataStorageCoordinator [qtp1373220972-90] Allocating pending segment 
INFO i.d.m.IndexerSQLMetadataStorageCoordinator [qtp1373220972-90] Found existing pending segment 



RDS: max_allowed_packet = 1024*1024*16 (16 MiB)
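(A quick way to double-check the value, assuming the mysql client can reach the RDS instance:)

mysql -h <rds-endpoint> -u druid -p -e "SHOW VARIABLES LIKE 'max_allowed_packet';"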


Regards,
Jan

jerry jung

unread,
Sep 7, 2017, 01:14:14
to Druid User
Can you check the offset information in the metadata store?
Please run the following query:

select * from druid_dataSource where dataSource='datasource_id';
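The commit_metadata_payload column holds the saved partition-to-offset map as JSON. Assuming a MySQL metadata store with the default druid_ table prefix (column names assumed from the default schema), something like this makes it readable:

# dump the stored Kafka offsets for one datasource
mysql -u druid -p druid -e \
  "SELECT dataSource, created_date, CAST(commit_metadata_payload AS CHAR) AS payload
   FROM druid_dataSource WHERE dataSource = 'datasource_id';"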

Sincerely,
Yousun Jeong

Gian Merlino

unread,
Sep 8, 2017, 03:32:15
to druid...@googlegroups.com
Hey Jan,

It seems like something has really slowed down. I think the "SQLException: Cannot get a connection, general error" messages you're getting mean that requests are waiting too long for a free connection from the connection pool, probably because the connections currently in use are taking a long time to finish.

I would suggest first looking to see if your Overlord is stuck because it's GCing a lot. If not, then I'd check whether your metadata store is healthy (can you query it directly? does it show slow queries in its query log?).
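For example, a rough sketch (substitute your Overlord pid and MySQL credentials):

# watch GC on the overlord once per second; rapidly growing FGC/FGCT means heavy full GCs
jstat -gcutil <overlord_pid> 1000

# enable the MySQL slow query log with a 1-second threshold to catch slow metadata queries
mysql -u root -p -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1;"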

Gian


Jan Kogut

unread,
Sep 8, 2017, 16:27:20
to Druid User
Hey Gian,


"SQLException: Cannot get a connection, general error"

I see this error only while trying to restart the unresponsive Overlord, whose process eventually has to be killed with kill -9.

I did some research on the system behavior and see the following patterns:

1. decreased druid.indexer.fork.property.druid.server.http.numThreads from 100 to 50

2. added more Middle Manager servers (6-core Xeon), increasing the peon count from 44 to 55 = (12 threads - 1 per server) * 5 servers

3. found that the Overlord assigns more peons than it is configured to:
for example, for a dataSource configured with "taskCount": 8, the Overlord assigned 24 (!) tasks (as I understand it, some of them got stuck and never finished),

and there are more "paused successfully" tasks than "completed successfully" ones:

grep 'completed successfully' /var/log/druid/overlord.log | wc -l
11
grep 'paused successfully' /var/log/druid/overlord.log | wc -l
54


When the Overlord is about to hang, you can see "Connection refused" for many dataSources:

2017-09-08 20:20:58,300 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_sample_1_events-5] submitRequest failed for [http://druid-mm02:8104/druid/worker/v1/chat/index_kafka_prod_sample_1_events_068972c80c54c05_himhgfpp/offsets/current], with message [Connection refused (Connection refused)]




4. checked the number of open files (see the note after the output):

ansible druid-mm -a "lsof | grep java | wc -l" -s

druid-mm05 | SUCCESS | rc=0 >>
1370934

druid-mm03 | SUCCESS | rc=0 >>
1731782

druid-mm02 | SUCCESS | rc=0 >>
1727398

druid-mm01 | SUCCESS | rc=0 >>
1715144

druid-mm04 | SUCCESS | rc=0 >>
1739617
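(Note on the numbers above: on Linux, plain lsof prints one row per thread per file for a JVM, so these counts overstate the real descriptor usage; counting /proc/<pid>/fd gives the per-process figure:)

# count actual open descriptors for a single java process
ls /proc/<java_pid>/fd | wc -l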


5. checked MySQL

Discovered (on the druid-user group) that pendingSegments should be cleaned up manually, and that a COUNT over 163280 rows should not take almost 2 minutes:

MySQL [druid]> select count(id) from druid_pendingSegments;
+-----------+
| count(id) |
+-----------+
|    163280 |
+-----------+
1 row in set (1 min 48.18 sec)


I also found that the total number of segments is less than twice the number of pending segments (so cleaning the pending segments should maybe help a little), though that count ran faster :)

MySQL [druid]> select count(id) from druid_segments;
+-----------+
| count(id) |
+-----------+
|    237642 |
+-----------+
1 row in set (1 min 27.62 sec)



So I will try to clean 95% of pendingSegments and monitor MySQL performance.
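(For the cleanup itself, something along these lines; a sketch only. created_date is an ISO timestamp string in the default schema, so string comparison works, but back up the table and pick a cutoff safely before any running task first:)

# destructive: remove pending segments created before the cutoff date
mysql -u druid -p druid -e \
  "DELETE FROM druid_pendingSegments WHERE created_date < '2017-08-01T00:00:00.000Z';"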


Regards,
Jan

Jan Kogut

unread,
Sep 10, 2017, 22:40:42
to Druid User
Hi Yousun/Jerry,

Yes, I checked the query you suggested, and everything is OK.

I also used the Overlord REST API to obtain status info about the KIS dataSource tasks, and found no problem at all; the Overlord knows exactly what the latestOffsets are :)
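(For reference, the status comes from the supervisor status endpoint; the excerpt below is from its response:)

curl http://{{overlord-fqdn}}:{{overlord-port}}/druid/indexer/v1/supervisor/prod_sample_1_events/status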

   
    "latestOffsets": {
     
"0": 306648700,
     
"1": 304854330,
     
"2": 308518602,
     
"3": 298993034,
     
"4": 305214569,
     
"5": 311090025,
     
"6": 315678079,
     
"7": 307856233
   
},


While observing the system I found that everything is fine until... the query of DEATH TO THE OVERLORD appears on the brokers... :)

The Overlord dies only when queries for the "suspected" dataSource appear. Does that make any sense?

Is it possible that an incorrect KIS configuration for a specific dataSource can lead to the Overlord's death?



Regards,
Jan