"Bad number of metrics" error logs in Realtime Node.


Flowyi

Nov 29, 2014, 11:54:03 PM
to druid-de...@googlegroups.com

Hi!

I deployed a Druid system and have been ingesting data into a realtime (RT) node through Kafka. I found that some data is missing compared with MySQL, which receives the same source data.

After checking the RT node log, I found the warnings and errors below occurring every hour. Are they the cause of the data loss? Any idea how to fix them?


2014-11-24 10:00:00,794 WARN  [dsp_client-overseer-2] io.druid.segment.realtime.plumber.RealtimePlumber - [2014-11-24T01:00:00.000Z] < [2014-11-24T00:00:00.000Z] Skipping persist and merge.
2014-11-24 10:00:00,802 ERROR [dsp_client-2014-11-18T08:00:00.000Z-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[dsp_client]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class com.metamx.common.IAE, exceptionMessage=Bad number of metrics[48], expected [45], interval=2014-11-18T08:00:00.000Z/2014-11-18T09:00:00.000Z}
com.metamx.common.IAE: Bad number of metrics[48], expected [45]
        at io.druid.segment.IndexMerger.merge(IndexMerger.java:269)
        at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:169)
        at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:162)
        at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:348)
        at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


My schemas.json:


{
    "schema": {
        "dataSource": "dsp_client",
        "aggregators": [
            {
                "type": "count",
                "name": "row_count"
            },
            {"type":"longSum",   "name":"ips",               "fieldName":"ips"},
            // ... 43 more aggs here
        ],
        "indexGranularity": "hour",
        "shardSpec": {
            "type": "linear",
            "partitionNum": 2
        }
    },
    "config": {
        "maxRowsInMemory": 500000,
        "intermediatePersistPeriod": "PT10m"
    },
    "firehose": {
        "type": "kafka-0.8",
        "consumerProps": {
            "zookeeper.connect": "192.168.3.16:2181,192.168.3.18:2181",
            "zookeeper.connection.timeout.ms": "15000",
            "zookeeper.session.timeout.ms": "40000",
            "zookeeper.sync.time.ms": "5000",
            "group.id": "druid-real-time-node-client",
            "fetch.message.max.bytes": "1048586",
            "auto.offset.reset": "largest",
            "auto.commit.enable": "true"
        },
        "feed": "dsp_client_topic",
        "parser": {
            "timestampSpec": {
                "column": "timestamp"
            },
            "data": {
                "format": "json",
                "dimensions": [
                    "campaign_id",
                    // ... 22 more dims here
                ]
            }
        }
    },
    "plumber": {
        "type": "realtime",
        "windowPeriod": "PT60m",
        "segmentGranularity": "hour",
        "basePersistDirectory": "/data/druid/realtime/basePersist",
        "rejectionPolicyFactory": {
            "type": "messageTime"
        }
    }
}

Fangjin Yang

Dec 1, 2014, 2:20:04 PM
to druid-de...@googlegroups.com
Hi Flowyi, do you always see this exception for the same interval, or for different intervals? Also, when you say the results differ, how much do they differ? How are you doing the comparison?

Flowyi

Dec 1, 2014, 10:30:26 PM
to druid-de...@googlegroups.com
Hi Fangjin, 
the exception is always for the same interval. 

We also found that the folder "2014-11-18T08:00:00.000Z_2014-11-18T09:00:00.000Z" still exists in the plumber's basePersistDirectory (e.g. /data/druid/realtime/basePersist/dsp_client/). After we removed that outdated folder, the exceptions went away.

I think this is because we had previously changed the RT node's schemas.json by renaming some metric columns. After we restarted the RT node, the old intermediate persist folder (2014-11-18T08:00:00.000Z_2014-11-18T09:00:00.000Z) was still there and could not be merged with the newly created persist files.
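
To illustrate the kind of change (the new metric name below is hypothetical, just for the example): a hydrant persisted under the old schema keeps the old metric set, which no longer matches the aggregator list in the current schema, so the hourly merge fails with the count mismatch above.

// Old schemas.json entry -- the on-disk persist folder still carries this metric:
{"type": "longSum", "name": "ips",      "fieldName": "ips"},

// After the rename (hypothetical new name), what the restarted RT node expects:
{"type": "longSum", "name": "ip_total", "fieldName": "ips"},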

As for the comparison, I issue two equivalent group-by queries, one against MySQL and one against Druid. The queries themselves shouldn't be the problem, because I tested them by batch loading CSV files dumped from MySQL into Druid, and both sides returned the same results.
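
For reference, the Druid side is a groupBy query over the affected interval, roughly like the sketch below (simplified to one dimension and two metrics from the schema above; it is not the exact query we run):

{
    "queryType": "groupBy",
    "dataSource": "dsp_client",
    "granularity": "all",
    "dimensions": ["campaign_id"],
    "aggregations": [
        {"type": "longSum", "name": "row_count", "fieldName": "row_count"},
        {"type": "longSum", "name": "ips",       "fieldName": "ips"}
    ],
    "intervals": ["2014-11-18T08:00:00.000Z/2014-11-18T09:00:00.000Z"]
}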

Flowyi

Dec 1, 2014, 10:43:58 PM
to druid-de...@googlegroups.com
Although removing that folder fixed the exception, it results in data loss. So what is the correct way to update a realtime node's schema?

Fangjin Yang

Dec 2, 2014, 1:05:28 PM
to druid-de...@googlegroups.com
Hi Flowyi, 

Schema changes can be tricky with standalone real-time nodes. We actually use tranquility (https://github.com/metamx/tranquility) and the indexing service in production to mitigate such issues. For a longer explanation, you may want to read: https://groups.google.com/forum/#!searchin/druid-development/fangjin$20yang$20%22thoughts%22/druid-development/aRMmNHQGdhI/muBGl0Xi_wgJ

BTW, do you have an accompanying batch pipeline to fix up your data?

Flowyi

Dec 2, 2014, 8:54:42 PM
to druid-de...@googlegroups.com
Thanks Fangjin, we are moving to Tranquility, and yes, there should always be a batch fix-up :)