Index Problem


Xiaoming Zhang

Sep 8, 2014, 11:05:04 PM
to druid-de...@googlegroups.com
Hi, druid team:

We have four machines. Each machine consumes from two Kafka clusters; the two clusters are identical, and each one provides two Kafka topics:

RT Machine 1:
[
    {
        "dataSource": "topic1",
        "partitionNum": 1,
        "zookeeper.connect": "cluster1"
    },
    {
        "dataSource": "topic2",
        "partitionNum": 1,
        "zookeeper.connect": "cluster1"
    },
    {
        "dataSource": "topic1",
        "partitionNum": 2,
        "zookeeper.connect": "cluster2"
    },
    {
        "dataSource": "topic2",
        "partitionNum": 2,
        "zookeeper.connect": "cluster2"
    }
]

And for the other three RT nodes, the partitionNum values are:

for RT2: 4, 4, 5, 5
for RT3: 6, 6, 7, 7
for RT4: 8, 8, 9, 9

We set the plumber as below for each partition:

  "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist",
            "rejectionPolicyFactory": {
                "type": "test"
            }
        }


We expect that every hour, 8 rows (one per partition) will be inserted into the "druid_segments" table, like this:
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_1
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_2
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_4
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_5
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_6
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_7
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_8
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_9

However, for some hours the rows are incomplete.

For this hour, we found only one row:
pulsar_raw_2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_4

For this hour, we lost 6 and 7:
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_1
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_2
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_4
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_5
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_8
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_9

I checked druid.log on one RT node and found this exception:

druid.log.2:2014-09-04 19:34:06,171 INFO  [chief-pulsar_raw] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-04T19:34:06.165-07:00","service":"realtime","host":"phxdbx1183.phx.ebay.com:8083","severity":"component-failure","description":"Problem loading sink[pulsar_raw] from disk.","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.FileNotFoundException","exceptionMessage":"/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory)","exceptionStackTrace":"java.io.FileNotFoundException: /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory)\n\tat java.io.FileInputStream.open(Native Method)\n\tat java.io.FileInputStream.<init>(FileInputStream.java:146)\n\tat io.druid.segment.SegmentUtils.getVersionFromDir(SegmentUtils.java:24)\n\tat io.druid.segment.IndexIO.loadIndex(IndexIO.java:128)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.bootstrapSinksFromDisk(RealtimePlumber.java:521)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.startJob(RealtimePlumber.java:152)\n\tat io.druid.segment.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:184)\n","interval":"2014-09-04T04:00:00.000-07:00/2014-09-04T05:00:00.000-07:00"}}]

druid.log.2:2014-09-04 19:34:07,126 ERROR [chief-pulsar_raw] io.druid.segment.realtime.plumber.RealtimePlumber - Problem loading sink[pulsar_raw] from disk.: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.FileNotFoundException, exceptionMessage=/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory), interval=2014-09-04T04:00:00.000-07:00/2014-09-04T05:00:00.000-07:00}
druid.log.2:2014-09-04 19:34:07,126 INFO  [chief-pulsar_raw] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-04T19:34:07.126-07:00","service":"realtime","host":"phxdbx1183.phx.ebay.com:8083","severity":"component-failure","description":"Problem loading sink[pulsar_raw] from disk.","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.FileNotFoundException","exceptionMessage":"/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory)","exceptionStackTrace":"java.io.FileNotFoundException: /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory)\n\tat java.io.FileInputStream.open(Native Method)\n\tat java.io.FileInputStream.<init>(FileInputStream.java:146)\n\tat io.druid.segment.SegmentUtils.getVersionFromDir(SegmentUtils.java:24)\n\tat io.druid.segment.IndexIO.loadIndex(IndexIO.java:128)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.bootstrapSinksFromDisk(RealtimePlumber.java:521)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.startJob(RealtimePlumber.java:152)\n\tat io.druid.segment.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:184)\n","interval":"2014-09-04T04:00:00.000-07:00/2014-09-04T05:00:00.000-07:00"}}]


Any ideas about this?

Thank you very much!






Fangjin Yang

Sep 8, 2014, 11:11:47 PM
to druid-de...@googlegroups.com
Hi Xiaoming, I believe this is the same problem discussed in a few of the previous threads. Can you provide the following info:

1) Druid version you are running with.
2) The segment versions do not seem to be in UTC. Are you running with UTC timezone?
3) What is in /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00?
4) What is in /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/?
5) If you clear out the sink directory and rerun the experiments, is this behavior reproducible with 0.6.152?

Thanks,
FJ
 


Xiaoming Zhang

Sep 8, 2014, 11:20:37 PM
to druid-de...@googlegroups.com
1) Druid version you are running with.
package: druid-services-0.6.121
2) The segment versions do not seem to be in UTC. Are you running with UTC timezone?
We use MST time
3) What is in /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00?
[cronus@phxdbx1183 kafka1]$ cd /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00
[cronus@phxdbx1183 2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00]$ ls -lh
total 40K
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:05 0
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:14 1
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:23 2
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:33 3
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:43 4
drwxr-xr-x 3 cronus app 4.0K Sep  4 19:29 5
drwxr-xr-x 2 cronus app 4.0K Sep  4 05:03 6
drwxr-xr-x 2 cronus app 4.0K Sep  4 05:13 7
drwxr-xr-x 2 cronus app 4.0K Sep  4 05:23 8
drwxr-xr-x 2 cronus app 4.0K Sep  4 05:33 9
4) What is in /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/?
[cronus@phxdbx1183 2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00]$ cd  /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/
[cronus@phxdbx1183 5]$ ls -lh
total 4.0K
drwxr-xr-x 2 cronus app 4.0K Sep  4 19:29 v8-tmp
5) If you clear out the sink directory and rerun the experiments, is this behavior reproducible with 0.6.152?
Will test this later

On Monday, September 8, 2014 8:11:47 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 9, 2014, 5:22:02 PM
to druid-de...@googlegroups.com
Hi Xiaoming, it seems like an index file went missing and is causing problems with your real-time node startup. I recall fixing similar problems between 0.6.121 and the current version, and I was wondering if you would mind trying to reproduce the problem with the latest stable?

Thanks,
FJ

Xiaoming Zhang

Sep 10, 2014, 6:03:20 AM
to druid-de...@googlegroups.com
Do I need to clean everything when I move my cluster to a new version?

I did not clean anything, started the cluster on the new version, and got the following:

2014-09-10 02:43:46,417 INFO  [chief-pulsar_raw] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-10T02:43:46.416-07:00","service":"realtime","host":"phxdbx1180.phx.ebay.com:8083","severity":"component-failure","description":"Problem loading sink[pulsar_raw] from disk.","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.FileNotFoundException","exceptionMessage":"/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/5/index.drd (No such file or directory)","exceptionStackTrace":"java.io.FileNotFoundException: /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/5/index.drd (No such file or directory)\n\tat java.io.FileInputStream.open(Native Method)\n\tat java.io.FileInputStream.<init>(FileInputStream.java:146)\n\tat io.druid.segment.SegmentUtils.getVersionFromDir(SegmentUtils.java:24)\n\tat io.druid.segment.IndexIO.loadIndex(IndexIO.java:159)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.bootstrapSinksFromDisk(RealtimePlumber.java:529)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.startJob(RealtimePlumber.java:150)\n\tat io.druid.segment.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:184)\n","interval":"2014-09-04T00:00:00.000-07:00/2014-09-04T01:00:00.000-07:00"}}]
2014-09-10 02:50:11,967 ERROR [pulsar_raw-2014-09-06T07:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_raw]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.FileNotFoundException, exceptionMessage=/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-06T07:00:00.000-07:00_2014-09-06T08:00:00.000-07:00/merged/v8-tmp/smoosher/meta.smoosh (No such file or directory), interval=2014-09-06T07:00:00.000-07:00/2014-09-06T08:00:00.000-07:00}

2014-09-10 02:50:11,968 ERROR [pulsar_raw-2014-09-06T07:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_raw]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class com.metamx.common.IAE, exceptionMessage=Unknown version[3], interval=2014-09-06T07:00:00.000-07:00/2014-09-06T08:00:00.000-07:00}

2014-09-10 02:50:11,968 INFO  [pulsar_raw-2014-09-06T07:00:00.000-07:00-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-10T02:50:11.968-07:00","service":"realtime","host":"phxdbx1180.phx.ebay.com:8083","severity":"component-failure","description":"Failed to persist merged index[pulsar_raw]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"com.metamx.common.IAE","exceptionMessage":"Unknown version[3]","exceptionStackTrace":"com.metamx.common.IAE: Unknown version[3]\n\tat io.druid.segment.data.GenericIndexed.read(GenericIndexed.java:366)\n\tat io.druid.segment.IndexIO$DefaultIndexIOHandler.convertV8toV9(IndexIO.java:378)\n\tat io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:839)\n\tat io.druid.segment.IndexMerger.merge(IndexMerger.java:308)\n\tat io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:170)\n\tat io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:163)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:348)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\n","interval":"2014-09-06T07:00:00.000-07:00/2014-09-06T08:00:00.000-07:00"}}]

BTW:

Is it possible to set "indexGranularity" to 10 seconds in the firehose spec?

Thanks!
yokspher


On Tuesday, September 9, 2014 2:22:02 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 10, 2014, 1:51:59 PM
to druid-de...@googlegroups.com
Hi Xiaoming,

I suspect that your real-time bootstrap is in a broken state because of some error that occurred. I'd like to see if the problem can be reproduced with the latest version. Can we try removing everything under /kafka1/druidrealtime/basepersist and trying 0.6.152? As for 10s index granularities, this is possible and you can extend QueryGranularity.java with the logic.
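If the version you are running already supports duration-based query granularities, a spec like the one below might also give you 10-second buckets without a code change (this is only a sketch; please double check that the "duration" granularity type is available in your version):

"granularity": {
    "type": "duration",
    "duration": 10000
}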

Xiaoming Zhang

Sep 10, 2014, 10:05:31 PM
to druid-de...@googlegroups.com
Hi, Fangjin:

You mean I need to build my own version of Druid if I want to change the index granularity to 10 seconds?

And I used version 0.6.146; you said I should try the latest stable, so I chose this one.

On Wednesday, September 10, 2014 10:51:59 AM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 10, 2014, 11:25:26 PM
to druid-de...@googlegroups.com
Hi Xiaoming, I'm assuming this is a POC environment. I was just wondering if you could: 
1) Remove the contents in /kafka1/druidrealtime/basepersist/
2) Use the latest stable
3) Try to reproduce the error.

You can contribute the code for a 10 second index granularity and we can include it in the next release.

Xiaoming Zhang

Sep 11, 2014, 4:01:30 AM
to druid-de...@googlegroups.com
Hi, Fangjin:

I removed everything under /kafka1/druidrealtime/basepersist/ on all RT nodes.

And I used the latest stable, 0.6.146.

I searched the logs and found no exceptions, which is good news!

But when I check the index entries in the SQL DB, they are still not right.

Our cluster has 8 partitionNums (1, 3, 4, 5, 6, 7, 8, 9), but there are only about 3 to 4 index segments for each hour.

Another problem:

We query the data count for each hour with this query:
{
  "queryType": "timeseries",
  "dataSource": "pulsar_raw",
  "granularity": "hour",
  "aggregations": [
    { "type": "count", "name": "count" }
  ],
  "intervals": [ "2014-09-09/2014-09-10" ]
}

We get the count of each hour like:
[ {
  "timestamp" : "2014-09-09T07:00:00.000Z",
  "result" : {
    "count" : 10381336
  }
}, {
  "timestamp" : "2014-09-09T08:00:00.000Z",
  "result" : {
    "count" : 11277862
  }
}, {
  "timestamp" : "2014-09-09T09:00:00.000Z",
  "result" : {
    "count" : 12163299
  }
}
.....
 {
  "timestamp" : "2014-09-10T04:00:00.000Z",
  "result" : {
    "count" : 8864453
  }
}, {
  "timestamp" : "2014-09-10T05:00:00.000Z",
  "result" : {
    "count" : 8816756
  }
}, {
  "timestamp" : "2014-09-10T06:00:00.000Z",
  "result" : {
    "count" : 9947131
  }
} ]


We have another Cassandra system which receives the same messages, and we queried its data:
cqlsh> select * from sojtracking.mc_mctimegroupmetric where metricname='MCDruidRawEventCount' and groupid='2014-09-09';

 metricname           | groupid    | metrictime               | tag_metrictime | value
----------------------+------------+--------------------------+----------------+----------
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 00:00:00-0700 |  1410246000000 | 18165012
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 01:00:00-0700 |  1410249600000 | 19739481
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 02:00:00-0700 |  1410253200000 | 21292068
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 03:00:00-0700 |  1410256800000 | 21587227
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 04:00:00-0700 |  1410260400000 | 22717716
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 05:00:00-0700 |  1410264000000 | 23881615
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 06:00:00-0700 |  1410267600000 | 24692035
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 07:00:00-0700 |  1410271200000 | 26587748
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 08:00:00-0700 |  1410274800000 | 27055307
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 09:00:00-0700 |  1410278400000 | 28132186
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 10:00:00-0700 |  1410282000000 | 28824256
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 11:00:00-0700 |  1410285600000 | 30515629
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 12:00:00-0700 |  1410289200000 | 31999651
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 13:00:00-0700 |  1410292800000 | 29975886
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 14:00:00-0700 |  1410296400000 | 26556515
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 15:00:00-0700 |  1410300000000 | 23895235
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 16:00:00-0700 |  1410303600000 | 19780544
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 17:00:00-0700 |  1410307200000 | 19707490
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 18:00:00-0700 |  1410310800000 | 20011370
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 19:00:00-0700 |  1410314400000 | 19602294
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 20:00:00-0700 |  1410318000000 | 18242883
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 21:00:00-0700 |  1410321600000 | 15509185
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 22:00:00-0700 |  1410325200000 | 15428086
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 23:00:00-0700 |  1410328800000 | 17408436


For each hour, the Druid count is about 57% of the Cassandra count, which does not match and makes us worried that data may be lost.
But I checked the logs and there are no thrown-away events or parse failures.

Any ideas about this?

Thanks a lot !
yokspher



On Wednesday, September 10, 2014 8:25:26 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 11, 2014, 4:11:56 AM
to druid-de...@googlegroups.com
BTW:

Could you please add one line to QueryGranularity.java?

TEN_SECOND (10 * 1000)


On Wednesday, September 10, 2014 8:25:26 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 11, 2014, 2:10:25 PM
to druid-de...@googlegroups.com
Hi Xiaoming,

The query you are issuing will not give you a count of ingested rows; it returns a count of Druid rows, which factors in rollup. If you want to query the number of ingested rows, you need to specify a 'count' aggregator at ingestion time. Then, at query time, issue a query with a longSum aggregator. I have some more comments inline.
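For example, the two specs might look roughly like this (the metric name "count" is only illustrative and should match whatever you configure at ingestion):

At ingestion time:
    { "type": "count", "name": "count" }

At query time:
    { "type": "longSum", "name": "ingested_rows", "fieldName": "count" }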


On Thursday, September 11, 2014 1:01:30 AM UTC-7, Xiaoming Zhang wrote:
Hi, Fangjin:

I removed everything under /kafka1/druidrealtime/basepersist/ on all RT nodes.

And I used the latest stable, 0.6.146.

I searched the logs and found no exceptions, which is good news!

But when I check the index entries in the SQL DB, they are still not right.

Our cluster has 8 partitionNums (1, 3, 4, 5, 6, 7, 8, 9), but there are only about 3 to 4 index segments for each hour.

Do you mean the rest of the partitions are not being persisted? Can you share:
1) Your configuration
2) rejectionPolicy
3) windowPeriod
4) Any exceptions or log messages during handoff on the real-time node
5) Any exceptions or log messages during handoff on the coordinator node
Your data is likely being rolled up.

Xiaoming Zhang

Sep 11, 2014, 10:41:21 PM
to druid-de...@googlegroups.com
My bad.

The query I gave you is the one I use; our QA uses a query like this:
{
  "queryType": "timeseries",
  "dataSource": "pulsar_raw",
  "granularity": "hour",
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ],
  "intervals": [ "2014-09-09/2014-09-10" ]
}

And we have this "count" aggregator configured at ingest time:
"aggregators": [
    {
        "type": "count",
        "name": "count"
    }
],
The query I use and the one QA uses give similar numbers, just slightly different, and still about 57% of Cassandra's.


1) Your configuration
All config files are attached.
2) rejectionPolicy
test
3) windowPeriod
PT50m
4) Any exceptions or log messages during handoff on the real-time node
These errors were produced after I switched to 0.6.146 and cleaned things under /kafka1/druidrealtime/basepersist/.
We also tried to add the HDFS dependency; after running for a while, the node shut down automatically.

2014-09-11 01:53:34,664 ERROR [pulsar_raw-2014-09-11T01:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_raw]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.FileNotFoundException, exceptionMessage=/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-11T01:00:00.000-07:00_2014-09-11T02:00:00.000-07:00/merged/v8-tmp/inverted.drd (No such file or directory), interval=2014-09-11T01:00:00.000-07:00/2014-09-11T02:00:00.000-07:00}

2014-09-11 01:53:34,667 INFO  [pulsar_raw-2014-09-11T01:00:00.000-07:00-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-11T01:53:34.664-07:00","service":"realtime","host":"phxdbx1180.phx.ebay.com:8083","severity":"component-failure","description":"Failed to persist merged index[pulsar_raw]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.FileNotFoundException","exceptionMessage":"/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-11T01:00:00.000-07:00_2014-09-11T02:00:00.000-07:00/merged/v8-tmp/inverted.drd (No such file or directory)","exceptionStackTrace":"java.io.FileNotFoundException: /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-11T01:00:00.000-07:00_2014-09-11T02:00:00.000-07:00/merged/v8-tmp/inverted.drd (No such file or directory)\n\tat java.io.FileOutputStream.open(Native Method)\n\tat java.io.FileOutputStream.<init>(FileOutputStream.java:221)\n\tat com.google.common.io.Files$FileByteSink.openStream(Files.java:201)\n\tat com.google.common.io.Files$FileByteSink.openStream(Files.java:189)\n\tat com.google.common.io.ByteSink.getOutput(ByteSink.java:84)\n\tat com.google.common.io.ByteSink.getOutput(ByteSink.java:47)\n\tat io.druid.common.utils.SerializerUtils.writeString(SerializerUtils.java:49)\n\tat io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:779)\n\tat io.druid.segment.IndexMerger.merge(IndexMerger.java:308)\n\tat io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:170)\n\tat io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:163)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:348)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\n","interval":"2014-09-11T01:00:00.000-07:00/2014-09-11T02:00:00.000-07:00"}}]

2014-09-11 02:04:02,324 ERROR [chief-pulsar_raw] io.druid.segment.realtime.plumber.RealtimePlumber - Problem loading sink[pulsar_raw] from disk.: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.FileNotFoundException, exceptionMessage=/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-10T23:00:00.000-07:00_2014-09-11T00:00:00.000-07:00/5/index.drd (No such file or directory), interval=2014-09-10T23:00:00.000-07:00/2014-09-11T00:00:00.000-07:00}

2014-09-11 02:47:38,649 INFO  [druid-group_phxdbx1181.phx.ebay.com-1410426252834-3c8f81d4_watcher_executor] kafka.consumer.ZookeeperConsumerConnector - [druid-group_phxdbx1181.phx.ebay.com-1410426252834-3c8f81d4], exception during rebalance 
org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /consumers/druid-group/ids/druid-group_phxdbx1180.phx.ebay.com-1410426242551-3ca0a833
        at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)

5) Any exceptions or log messages during handoff on the coordinator node
Sorry, I forgot to capture logs on the coordinator nodes...
druid-deploy.zip

Fangjin Yang

Sep 11, 2014, 11:43:11 PM
to druid-de...@googlegroups.com
Inline.


On Thursday, September 11, 2014 7:41:21 PM UTC-7, Xiaoming Zhang wrote:
My bad.

The query I gave you is the one I use; our QA uses a query like this:
{
  "queryType": "timeseries",
  "dataSource": "pulsar_raw",
  "granularity": "hour",
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ],
  "intervals": [ "2014-09-09/2014-09-10" ]
}

And we have this "count" aggregator configured at ingest time:
"aggregators": [
    {
        "type": "count",
        "name": "count"
    }
],
The query I use and the one QA uses give similar numbers, just slightly different, and still about 57% of Cassandra's.

Okay, I suspect the difference is because of errors across your realtime pipeline where not all partitions are actually working and handing off. I suspect that if you do run a batch job of the data you will see correct results.
 
Looking through the configs, I believe the problem is here:
        "rejectionPolicyFactory": {
            "type": "test"
        }

"test" should never be used beyond testing basic test of hand off. Please see:

It was a mistake to create this rejection policy and we will remove it. Please see:

I suggest you try and use "serverTime" as the rejectionPolicy for current time data.
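For example, the plumber section would look roughly like this (sketch only; the rest of your plumber config stays as it is):

"rejectionPolicyFactory": {
    "type": "serverTime"
}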

One other thing that is very curious is the "No such file or directory" errors. I have been trying to reproduce them locally but have been unsuccessful.

Please try to remove /kafka1/druidrealtime/basepersist/ and try again. If things still are not working, any logs during persist or handoff would be extremely helpful.

Xiaoming Zhang

Sep 12, 2014, 3:11:14 AM
to druid-de...@googlegroups.com
We tried to add the HDFS dependency but failed with the following error. We did not find any place that uses commons-codec:commons-codec:jar:1.7, and we also added druid.extensions.remoteRepositories=[] to the config, but it still did not help:

2014-09-11 02:25:15,096 INFO  [main] io.druid.initialization.Initialization - Loading extension[io.druid.extensions:druid-hdfs-storage:0.6.153] for class[io.druid.cli.CliCommandCreator]

2014-09-11 03:06:00,841 ERROR [main] io.druid.initialization.Initialization - Unable to resolve artifacts for [io.druid.extensions:druid-hdfs-storage:jar:0.6.153 (runtime) -> [] < [central (http://repo1.maven.org/maven2/, releases+snapshots),  (https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local, releases+snapshots)]].

org.eclipse.aether.resolution.DependencyResolutionException: Failed to collect dependencies at io.druid.extensions:druid-hdfs-storage:jar:0.6.153 -> net.java.dev.jets3t:jets3t:jar:0.9.1 -> org.apache.httpcomponents:httpclient:jar:4.2 -> commons-codec:commons-codec:jar:1.7

        at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:380)

        at io.tesla.aether.internal.DefaultTeslaAether.resolveArtifacts(DefaultTeslaAether.java:289)

        at io.druid.initialization.Initialization.getClassLoaderForCoordinates(Initialization.java:199)

        at io.druid.initialization.Initialization.getFromExtensions(Initialization.java:141)

        at io.druid.cli.Main.main(Main.java:78)

Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at io.druid.extensions:druid-hdfs-storage:jar:0.6.153 -> net.java.dev.jets3t:jets3t:jar:0.9.1 -> org.apache.httpcomponents:httpclient:jar:4.2 -> commons-codec:commons-codec:jar:1.7

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.collectDependencies(DefaultDependencyCollector.java:292)

        at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:342)

        ... 4 more

Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for commons-codec:commons-codec:jar:1.7

        at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.loadPom(DefaultArtifactDescriptorReader.java:335)

        at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.readArtifactDescriptor(DefaultArtifactDescriptorReader.java:217)

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:461)

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:573)

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:573)

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.collectDependencies(DefaultDependencyCollector.java:261)

        ... 5 more

Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact commons-codec:commons-codec:pom:1.7 from/to central (http://repo1.maven.org/maven2/): connect timed out

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:459)

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts(DefaultArtifactResolver.java:262)

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifact(DefaultArtifactResolver.java:239)

        at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.loadPom(DefaultArtifactDescriptorReader.java:320)

        ... 10 more

Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact commons-codec:commons-codec:pom:1.7 from/to central (http://repo1.maven.org/maven2/): connect timed out

        at io.tesla.aether.connector.AetherRepositoryConnector$2.wrap(AetherRepositoryConnector.java:830)

        at io.tesla.aether.connector.AetherRepositoryConnector$2.wrap(AetherRepositoryConnector.java:824)

        at io.tesla.aether.connector.AetherRepositoryConnector$GetTask.flush(AetherRepositoryConnector.java:619)

        at io.tesla.aether.connector.AetherRepositoryConnector.get(AetherRepositoryConnector.java:238)

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.performDownloads(DefaultArtifactResolver.java:535)

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:436)

        ... 13 more

Caused by: java.net.SocketTimeoutException: connect timed out

        at java.net.PlainSocketImpl.socketConnect(Native Method)

        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)

        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)

        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)

        at java.net.Socket.connect(Socket.java:579)

        at com.squareup.okhttp.Connection.connect(Connection.java:100)

        at com.squareup.okhttp.internal.http.HttpEngine.connect(HttpEngine.java:287)

        at com.squareup.okhttp.internal.http.HttpEngine.sendSocketRequest(HttpEngine.java:248)

        at com.squareup.okhttp.internal.http.HttpEngine.sendRequest(HttpEngine.java:197)

        at com.squareup.okhttp.internal.http.HttpURLConnectionImpl.execute(HttpURLConnectionImpl.java:388)

        at com.squareup.okhttp.internal.http.HttpURLConnectionImpl.getResponse(HttpURLConnectionImpl.java:339)

        at com.squareup.okhttp.internal.http.HttpURLConnectionImpl.getResponseCode(HttpURLConnectionImpl.java:540)

        at io.tesla.aether.okhttp.OkHttpAetherClient$ResponseAdapter.getStatusCode(OkHttpAetherClient.java:196)

        at io.tesla.aether.connector.AetherRepositoryConnector$GetTask.resumableGet(AetherRepositoryConnector.java:549)

        at io.tesla.aether.connector.AetherRepositoryConnector$GetTask.run(AetherRepositoryConnector.java:391)

        at io.tesla.aether.connector.AetherRepositoryConnector.get(AetherRepositoryConnector.java:232)


On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 12, 2014, 5:16:29 AM
to druid-de...@googlegroups.com
We changed one RT node config to this:

druid.service=realtime
druid.port=8083


druid.extensions.coordinates=["io.druid.extensions:druid-kafka-eight:0.6.142","io.druid.extensions:druid-hdfs-storage:0.6.142"]
druid.realtime.chathandler.type=announce

# The realtime config file.
druid.realtime.specFile=config/realtime/kafka.soj.spec

# Change this config to db to hand off to the rest of the Druid cluster
druid.publish.type=db

# These configs are only required for real hand off
druid.db.connector.connectURI=jdbc\:mysql\://mymisc1.db.stratus.ebay.com\:3306/sojdruid
druid.db.connector.user=sojdruid
druid.db.connector.password=sojdruid

druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1

druid.monitoring.monitors=["io.druid.segment.realtime.RealtimeMetricsMonitor"]
druid.extensions.remoteRepositories=[]

druid.segmentCache.locations=[{"path": "/kafka1/druid/segmentCache", "maxSize": 1000000000000}]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://artemis-nn.vip.ebay.com:8020/tmp/DruidSegments

When I try to start the node, it just hangs at loading the HDFS extension and is unable to start; here is the log:

2014-09-12 02:03:51,888 INFO  [main] io.druid.guice.PropertiesModule - Loading properties from runtime.properties
2014-09-12 02:03:51,920 INFO  [main] org.hibernate.validator.internal.util.Version - HV000001: Hibernate Validator 5.0.1.Final
2014-09-12 02:03:52,408 INFO  [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=[io.druid.extensions:druid-kafka-eight:0.6.142, io.druid.extensions:druid-hdfs-storage:0.6.142], localRepository='/home/cronus/.m2/repository', remoteRepositories=[]}]
2014-09-12 02:03:52,532 INFO  [main] io.druid.initialization.Initialization - Loading extension[io.druid.extensions:druid-kafka-eight:0.6.142] for class[io.druid.cli.CliCommandCreator]
2014-09-12 02:03:52,895 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/io/druid/extensions/druid-kafka-eight/0.6.142/druid-kafka-eight-0.6.142.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/apache/kafka/kafka_2.9.2/0.8.0/kafka_2.9.2-0.8.0.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/scala-lang/scala-library/2.9.2/scala-library-2.9.2.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/net/sf/jopt-simple/jopt-simple/3.2/jopt-simple-3.2.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/slf4j/slf4j-simple/1.6.4/slf4j-simple-1.6.4.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/scala-lang/scala-compiler/2.9.2/scala-compiler-2.9.2.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/com/101tec/zkclient/0.3/zkclient-0.3.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/com/yammer/metrics/metrics-annotation/2.2.0/metrics-annotation-2.2.0.jar]
2014-09-12 02:03:52,900 INFO  [main] io.druid.initialization.Initialization - Loading extension[io.druid.extensions:druid-hdfs-storage:0.6.142] for class[io.druid.cli.CliCommandCreator]

It takes a long time and nothing happens.


On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 14, 2014, 10:54:15 PM
to druid-de...@googlegroups.com
We uploaded the missing jars and removed everything under /kafka1/druidrealtime/basepersist/, and finally it seems to work.

But here is an exception found on a historical node:

druid.log:2014-09-14 19:33:56,736 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Node[/druid/loadQueue/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T01:00:00.000-07:00_2014-09-04T02:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_4] was removed
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - New request[LOAD: pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9] with node[/druid/loadQueue/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9].
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Loading segment pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.segment.loading.OmniSegmentLoader - Deleting directory[/kafka1/druid/indexCache/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9]
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.server.coordination.SingleDataSegmentAnnouncer - Unannouncing segment[pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9] at path[/druid/servedSegments/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.curator.announcement.Announcer - unannouncing [/druid/servedSegments/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]
druid.log:2014-09-14 19:33:56,742 ERROR [ZkCoordinator-0] io.druid.curator.announcement.Announcer - Path[/druid/servedSegments/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9] not announced, cannot unannounce.
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Completely removing [pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9] in [30,000] millis
druid.log:2014-09-14 19:33:56,746 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Completed request [LOAD: pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]
druid.log:2014-09-14 19:33:56,746 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9], segment=DataSegment{size=406328782, shardSpec=LinearShardSpec{partitionNum=9}, metrics=[count], dimensions=[_cn, _con, _cty, _lat, _lon, _rgn, app, bti, bu, curprice, dd_bf, dd_d, dd_os, g, itm, itmtitle, js_ev_type, js_evt_kafka_produce_ts, mloc, p, referer, remoteip, t, timestamp, u], version='2014-09-04T00:00:00.000-07:00', loadSpec={type=local, path=/kafka2/druid/localStorage/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9/index.zip}, interval=2014-09-04T00:00:00.000-07:00/2014-09-04T01:00:00.000-07:00, dataSource='pulsar_raw', binaryVersion='9'}}
druid.log:2014-09-14 19:33:56,746 INFO  [ZkCoordinator-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-14T19:33:56.746-07:00","service":"historical","host":"phxdbx1190.phx.ebay.com:8081","severity":"component-failure","description":"Failed to load segment for dataSource","data":{"class":"io.druid.server.coordination.ZkCoordinator","exceptionType":"io.druid.segment.loading.SegmentLoadingException","exceptionMessage":"Exception loading segment[pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]","exceptionStackTrace":"io.druid.segment.loading.SegmentLoadingException: Exception loading segment[pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]\n\tat io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:136)\n\tat io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44)\n\tat io.druid.server.coordination.BaseZkCoordinator$1.childEvent(BaseZkCoordinator.java:113)\n\tat org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:509)\n\tat org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:503)\n\tat org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92)\n\tat com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)\n\tat org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83)\n\tat org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:500)\n\tat org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35)\n\tat org.apache.curator.framework.recipes.cache.PathChildrenCache$10.run(PathChildrenCache.java:762)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:262)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:262)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\nCaused by: io.druid.segment.loading.SegmentLoadingException: Asked to load path[/kafka2/druid/localStorage/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9/index.zip], but it doesn't exist.\n\tat io.druid.segment.loading.LocalDataSegmentPuller.getFile(LocalDataSegmentPuller.java:100)\n\tat io.druid.segment.loading.LocalDataSegmentPuller.getSegmentFiles(LocalDataSegmentPuller.java:41)\n\tat io.druid.segment.loading.OmniSegmentLoader.getSegmentFiles(OmniSegmentLoader.java:125)\n\tat io.druid.segment.loading.OmniSegmentLoader.getSegment(OmniSegmentLoader.java:93)\n\tat io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:145)\n\tat io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:132)\n\t... 
17 more\n","segment":{"dataSource":"pulsar_raw","interval":"2014-09-04T00:00:00.000-07:00/2014-09-04T01:00:00.000-07:00","version":"2014-09-04T00:00:00.000-07:00","loadSpec":{"type":"local","path":"/kafka2/druid/localStorage/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9/index.zip"},"dimensions":"_cn,_con,_cty,_lat,_lon,_rgn,app,bti,bu,curprice,dd_bf,dd_d,dd_os,g,itm,itmtitle,js_ev_type,js_evt_kafka_produce_ts,mloc,p,referer,remoteip,t,timestamp,u","metrics":"count","shardSpec":{"type":"linear","partitionNum":9},"binaryVersion":9,"size":406328782,"identifier":"pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9"}}}]



On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 15, 2014, 5:07:01 AM
to druid-de...@googlegroups.com
Sorry to bother you again, I have two questions:

1. Is there a way to cleanly delete data files, including those on the RT nodes and historical nodes?

2. Is there a way to smoothly change the index segment granularity?

Thanks a lot for your help!

On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 15, 2014, 5:59:25 AM
to druid-de...@googlegroups.com
We cleaned everything under /kafka1/druidrealtime/basepersist/ and started all RT nodes.

However, we still get the following errors:

2014-09-15 01:50:00,559 ERROR [pulsar_event-2014-09-14T23:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_event]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No FileSystem for scheme: hdfs, interval=2014-09-14T23:00:00.000-07:00/2014-09-15T00:00:00.000-07:00}
2014-09-15 01:50:00,559 INFO  [pulsar_event-2014-09-14T23:00:00.000-07:00-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-15T01:50:00.559-07:00","service":"realtime","host":"phxdbx1182.phx.ebay.com:8083","severity":"component-failure","description":"Failed to persist merged index[pulsar_event]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.IOException","exceptionMessage":"No FileSystem for scheme: hdfs","exceptionStackTrace":"java.io.IOException: No FileSystem for scheme: hdfs\n\tat org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2304)\n\tat org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2311)\n\tat org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)\n\tat org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)\n\tat org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)\n\tat org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)\n\tat org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)\n\tat io.druid.storage.hdfs.HdfsDataSegmentPusher.push(HdfsDataSegmentPusher.java:75)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:356)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\n","interval":"2014-09-14T23:00:00.000-07:00/2014-09-15T00:00:00.000-07:00"}}]
2014-09-15 02:08:57,174 ERROR [pulsar_session-2014-09-14T23:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_session]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No FileSystem for scheme: hdfs, interval=2014-09-14T23:00:00.000-07:00/2014-09-15T00:00:00.000-07:00}


On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 15, 2014, 2:28:44 PM
to druid-de...@googlegroups.com
Hi Xiaoming, it looks like there are a few more configuration problems:

Looking at this log line:
 Asked to load path[/kafka2/druid/localStorage/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9/index.zip], but it doesn't exist.

It appears local storage is specified as the deep storage mechanism. Unless you are running all nodes on a single machine, this won't work for handoff. Do you guys want to load segments to HDFS?

Thanks,
FJ

Fangjin Yang

Sep 15, 2014, 2:32:27 PM
to druid-de...@googlegroups.com
Hi, see inline.


On Monday, September 15, 2014 2:07:01 AM UTC-7, Xiaoming Zhang wrote:
Sorry to bother you again, I have two questions:

No worries, we are happy to answer all questions. 

1. Is there a way to cleanly delete data files, including those on the RT nodes and historical nodes?

Real-time nodes should cleanly delete all files after a successful handoff. Historical nodes should cleanly delete all segment files once they've dropped a segment. You can use the concept of rules (http://druid.io/docs/latest/Rule-Configuration.html) to automate when data is dropped from a cluster. If you want to cleanly remove all segments, metadata, and segment files in deep storage, you can look at the kill task (http://druid.io/docs/latest/Tasks.html).
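For reference, a kill task spec is roughly of this shape (the dataSource and interval here are only illustrative):

{
    "type": "kill",
    "dataSource": "pulsar_raw",
    "interval": "2014-09-04T00:00:00/2014-09-05T00:00:00"
}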
 
2. Is there a way to smoothly change the index segment granularity?

Yes. You can change the segment granularity at ingest time, and if you want to change granularities for already created segments, you can look at http://druid.io/docs/latest/Ingestion-FAQ.html about reingesting data with different schemas. 

Xiaoming Zhang

Sep 17, 2014, 3:28:20 AM
to druid-de...@googlegroups.com
Thank you very much for your answer.

I mentioned that we have two clusters; they behave the same way: both send topic 1 and topic 2 to Druid.

When the RT nodes are ingesting data, we query the row counts, and the numbers do not match our Cassandra system.

Recently we started our historical nodes and queried their data, and we found that the numbers exactly match Cassandra.

Is this a bug in Druid?

On Monday, September 15, 2014 11:32:27 AM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 17, 2014, 6:13:59 AM
to druid-de...@googlegroups.com
BTW:

Is there any way to make Druid allow cross-domain requests?

Is there any config that can make this work?

On Monday, September 15, 2014 11:32:27 AM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 18, 2014, 12:56:52 AM
to druid-de...@googlegroups.com
I do not believe this is a bug, but instead a misconfiguration with the real-time pipeline. Do you still see this problem after removing the "test" rejection policy?

Fangjin Yang

Sep 18, 2014, 12:58:24 AM
to druid-de...@googlegroups.com
Hi Xiaoming, by cross domain request, do you mean deploying Druid in multiple data centers? Can you give me some more detail on the use case?

Xiaoming Zhang

Sep 18, 2014, 1:05:16 AM
to druid-de...@googlegroups.com
Yes, we use the serverTime rejection policy in our cluster, and this happens all the time.

We use JavaScript to send query requests to the Druid broker node, but the browser does not allow this.

On Wednesday, September 17, 2014 9:58:24 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 18, 2014, 9:48:37 PM
to druid-de...@googlegroups.com
Hi Xiaoming, when you say the results do not match up, do you mean they do not match up during real-time data ingestion? Can you provide some more details? I am not quite understanding the situation that causes the data not to match up.

Xiaoming Zhang

Sep 18, 2014, 10:22:00 PM
to druid-de...@googlegroups.com
When we query the data that is currently on the RT nodes, we get a count "a". After the RT nodes hand off the data to the historical nodes, we count the data for the same time period and get a count "b". "a" and "b" are not the same.

And "a" is about half of "b".

About the cross-domain problem: we use JavaScript to send Druid queries and get results, which the browser does not allow. Druid should add some header to allow this.

On Thursday, September 18, 2014 6:48:37 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 18, 2014, 11:35:28 PM
to druid-de...@googlegroups.com
Hmmm, that is very strange, and something I haven't seen before. Some things that come to mind:
1) Exceptions may be occurring on the real-time nodes, preventing correct data from returning
2) Exceptions or timeouts may be occurring on the brokers

In these 2 cases, logs will really help.

3) Misconfiguration confusing the broker. Using your coordinator console, make sure all shards of a segment appear before you make the query. Perhaps there is something wrong with segment announcement.

One thing you can try to do is query each rt node individually. All queryable Druid nodes share the same query API. Make sure each individual RT node actually returns results.

As for the cross domain problem, you can take a look at QueryServlet.java to make changes in how Druid parses, sends, and returns HTTP request headers. All queries use this class.

Xiaoming Zhang

Sep 26, 2014, 5:11:53 AM
to druid-de...@googlegroups.com
We query data from Druid; the query we use is:
        var oneMinuteRaw = {
            "queryType": "timeseries",
            "dataSource": "pulsar_event",
            "granularity": "second",
            "aggregations": [
                { "type": "count", "name": "hits" }
            ],
            "intervals": ["PT5m/" + nowTime]
        };

This query is executed every second to refresh our graph. We found the graph is not stable; it twists up and down from time to time.

I logged the data when this happened, as [index, value] pairs:

0,6418,1,6521,2,6378,3,6461,4,6471,5,6400,6,6530,7,6372,8,6480,9,6384,10,6618,11,6480,12,6510,13,6338,14,6394,15,6678,16,6504,17,6429,18,6464,19,6521,20,6414,21,6451,22,6449,23,6475,24,6338,25,6291,26,6427,27,6416,28,6363,29,6196,30,6280,31,6107,32,6265,33,6376,34,6460,35,6356,36,6588,37,6619,38,6425,39,6427,40,6509,41,6496,42,6642,43,6610,44,6407,45,6395,46,6362,47,6328,48,6413,49,6377,50,6481,51,6378,52,6360,53,6451,54,6319,55,6368,56,6278,57,6399,58,6435,59,6427,60,6508,61,6278,62,6380,63,6472,64,6372,65,6395,66,6341,67,6431,68,6496,69,6528,70,6484,71,6575,72,6445,73,6516,74,6413,75,6377,76,6441,77,6563,78,6650,79,6417,80,6445,81,6390,82,6378,83,6455,84,6387,85,6391,86,6413,87,6544,88,6424,89,6581,90,6386,91,6418,92,6303,93,6371,94,6448,95,6203,96,6342,97,6450,98,6469,99,6603,100,6368,101,6413,102,6428,103,6820,104,6213,105,6523,106,6233,107,6471,108,6478,109,6305,110,6250,111,6507,112,6503,113,6349,114,6542,115,6340,116,6476,117,6404,118,6379,119,6239,120,6569,121,6402,122,6428,123,6363,124,6378,125,6251,126,6378,127,6565,128,6396,129,6473,130,6581,131,6300,132,6310,133,6339,134,6338,135,6272,136,6363,137,6425,138,6479,139,6360,140,6314,141,6194,142,6511,143,6541,144,6349,145,6333,146,6502,147,6408,148,6262,149,6388,150,6214,151,6112,152,6472,153,6346,154,6421,155,6369,156,6256,157,6268,158,6073,159,6322,160,6148,161,6359,162,6351,163,6453,164,6314,165,6270,166,6357,167,6232,168,6050,169,6421,170,6409,171,6443,172,6320,173,6363,174,6319,175,6262,176,6307,177,6410,178,6432,179,6190,180,6239,181,6259,182,6286,183,6355,184,6550,185,6330,186,6316,187,6242,188,6351,189,6497,190,6136,191,6273,192,6336,193,6429,194,6260,195,6221,196,6269,197,6307,198,6369,199,6396,200,6292,201,6412,202,6310,203,6365,204,6199,205,6405,206,5982,207,6158,208,6383,209,6225,210,6205,211,6205,212,6323,213,6182,214,6254,215,6343,216,6470,217,6390,218,6568,219,6660,220,6475,221,6496,222,6369,223,6735,224,6291,225,6492,226,6272,227,6158,228,6284,229,6293,230,6218,231,6345,232,6326,233,6281,234,6358,235,6184,236,6361,237,6232,238,6261,239,6178,240,6367,241,6270,242,6250,243,6521,244,6253,245,6166,246,6450,247,6346,248,6280,249,6445,250,6194,251,6244,252,6259,253,6343,254,6329,255,6398,256,6184,257,6220,258,6342,259,6290,260,6404,261,6355,262,6094,263,6408,264,6393,265,6212,266,6236,267,6237,268,6411,269,6174,270,6315,271,6203,272,6353,273,6223,274,6196,275,6377,276,6416,277,6411,278,6441,279,6251,280,6147,281,6154,282,6010,283,6484,284,6091,285,6203,286,6201,287,6410,288,6337,289,6410,290,6269,291,6148,292,6199,293,6237,294,6404,295,6276,296,6319,297,6420,298,6246,299,6195 trafficsource.js:56

0,5551,1,5588,2,5590,3,5547,4,5650,5,5569,6,5594,7,5543,8,5736,9,5616,10,5665,11,5463,12,5601,13,5770,14,5610,15,5514,16,5595,17,5677,18,5556,19,5617,20,5602,21,5591,22,5449,23,5471,24,5584,25,5617,26,5524,27,5395,28,5431,29,5323,30,5403,31,5506,32,5602,33,5518,34,5736,35,5747,36,5558,37,5568,38,5626,39,5618,40,5728,41,5753,42,5543,43,5525,44,5502,45,5490,46,5510,47,5501,48,5615,49,5499,50,5537,51,5600,52,5461,53,5524,54,5482,55,5577,56,5534,57,5554,58,5672,59,5459,60,5526,61,5575,62,5522,63,5566,64,5502,65,5563,66,5648,67,5684,68,5655,69,5710,70,5573,71,5633,72,5578,73,5529,74,5545,75,5687,76,5762,77,5560,78,5591,79,5537,80,5512,81,5595,82,5537,83,5558,84,5544,85,5668,86,5592,87,5721,88,5539,89,5554,90,5480,91,5523,92,5592,93,5351,94,5469,95,5601,96,5632,97,5704,98,5487,99,5596,100,5569,101,5948,102,5411,103,5644,104,5424,105,5644,106,5631,107,5485,108,5404,109,5671,110,5639,111,5509,112,5681,113,5506,114,5618,115,5558,116,5497,117,5367,118,5667,119,5615,120,5576,121,5513,122,5472,123,5420,124,5537,125,5690,126,5521,127,5615,128,5736,129,5467,130,5454,131,5532,132,5503,133,5429,134,5526,135,5576,136,5607,137,5570,138,5487,139,5403,140,5641,141,5691,142,5515,143,5510,144,5647,145,5573,146,5425,147,5582,148,5396,149,5303,150,5573,151,5483,152,5531,153,5506,154,5453,155,5436,156,5277,157,5489,158,5375,159,5576,160,5498,161,5561,162,5467,163,5452,164,5487,165,5407,166,5260,167,5574,168,5523,169,5560,170,5477,171,5552,172,5514,173,5430,174,5466,175,5604,176,5593,177,5359,178,5434,179,5440,180,5457,181,5504,182,5699,183,5473,184,5488,185,5424,186,5500,187,5634,188,5336,189,5477,190,5494,191,5612,192,5388,193,5414,194,5430,195,5480,196,5480,197,5528,198,5486,199,5541,200,5483,201,5474,202,5353,203,5533,204,5216,205,5357,206,5522,207,5395,208,5398,209,5373,210,5448,211,5366,212,5420,213,5490,214,5595,215,5551,216,5684,217,5761,218,5623,219,5603,220,5509,221,5826,222,5473,223,5622,224,5411,225,5353,226,5488,227,5470,228,5404,229,5515,230,5456,231,5426,232,5481,233,5386,234,5501,235,5424,236,5427,237,5380,238,5510,239,5440,240,5421,241,5649,242,5418,243,5300,244,5608,245,5488,246,5463,247,5544,248,5406,249,5402,250,5432,251,5482,252,5509,253,5530,254,5375,255,5395,256,5492,257,5443,258,5560,259,5547,260,5292,261,5564,262,5548,263,5410,264,5452,265,5408,266,5567,267,5352,268,5502,269,5389,270,5494,271,5390,272,5368,273,5540,274,5542,275,5559,276,5572,277,5432,278,5335,279,5371,280,5201,281,5652,282,5301,283,5352,284,5399,285,5578,286,5491,287,5569,288,5445,289,5329,290,5421,291,5412,292,5486,293,5477,294,5443,295,5547,296,5439,297,5391,298,5486,299,5486 trafficsource.js:56

0,6310,1,6258,2,6393,3,6285,4,6333,5,6263,6,6490,7,6361,8,6368,9,6128,10,6353,11,6472,12,6312,13,6223,14,6342,15,6375,16,6279,17,6315,18,6302,19,6265,20,6128,21,6155,22,6328,23,6321,24,6188,25,6110,26,6165,27,6053,28,6101,29,6175,30,6317,31,6180,32,6433,33,6434,34,6359,35,6329,36,6328,37,6318,38,6478,39,6541,40,6216,41,6213,42,6188,43,6234,44,6239,45,6200,46,6296,47,6178,48,6273,49,6303,50,6164,51,6251,52,6185,53,6302,54,6269,55,6268,56,6362,57,6141,58,6238,59,6305,60,6292,61,6268,62,6234,63,6279,64,6397,65,6445,66,6413,67,6486,68,6271,69,6340,70,6279,71,6206,72,6292,73,6410,74,6487,75,6313,76,6311,77,6228,78,6215,79,6311,80,6264,81,6260,82,6276,83,6377,84,6335,85,6476,86,6249,87,6272,88,6198,89,6272,90,6257,91,6051,92,6223,93,6293,94,6307,95,6433,96,6226,97,6313,98,6262,99,6706,100,6091,101,6371,102,6187,103,6331,104,6393,105,6263,106,6101,107,6419,108,6462,109,6221,110,6382,111,6238,112,6350,113,6283,114,6182,115,6046,116,6371,117,6370,118,6291,119,6227,120,6177,121,6161,122,6252,123,6407,124,6200,125,6344,126,6441,127,6138,128,6166,129,6265,130,6229,131,6098,132,6226,133,6284,134,6301,135,6269,136,6240,137,6109,138,6370,139,6447,140,6265,141,6209,142,6368,143,6267,144,6107,145,6295,146,6094,147,5974,148,6275,149,6146,150,6264,151,6225,152,6158,153,6127,154,5979,155,6179,156,6066,157,6309,158,6220,159,6303,160,6163,161,6115,162,6217,163,6111,164,5939,165,6317,166,6229,167,6249,168,6218,169,6262,170,6229,171,6126,172,6177,173,6313,174,6338,175,6057,176,6147,177,6138,178,6149,179,6198,180,6455,181,6192,182,6232,183,6140,184,6186,185,6363,186,6008,187,6170,188,6204,189,6331,190,6096,191,6075,192,6208,193,6118,194,6230,195,6258,196,6197,197,6245,198,6249,199,6185,200,6037,201,6230,202,5899,203,6117,204,6265,205,6049,206,6078,207,6044,208,6150,209,6038,210,6155,211,6194,212,6265,213,6230,214,6391,215,6494,216,6344,217,6299,218,6217,219,6533,220,6221,221,6383,222,6150,223,6051,224,6200,225,6189,226,6085,227,6228,228,6160,229,6163,230,6202,231,6128,232,6234,233,6099,234,6154,235,6099,236,6194,237,6107,238,6100,239,6366,240,6114,241,5982,242,6307,243,6172,244,6164,245,6245,246,6137,247,6063,248,6124,249,6119,250,6168,251,6223,252,6052,253,6075,254,6165,255,6161,256,6270,257,6221,258,5990,259,6216,260,6231,261,6109,262,6185,263,6131,264,6260,265,6046,266,6219,267,6086,268,6187,269,6077,270,6080,271,6255,272,6294,273,6287,274,6228,275,6119,276,5988,277,6031,278,5907,279,6360,280,5954,281,6006,282,6146,283,6256,284,6200,285,6269,286,6140,287,6015,288,6145,289,6053,290,6166,291,6217,292,6138,293,6249,294,6153,295,6086,296,6204,297,6217,298,6156,299,6175


The second result is quite different from the first and third: the graph shows a drop, and it rises again once the third result is loaded.

Why does this happen, and how can we solve it?

Thank you!

On Thursday, September 18, 2014 at 8:35:28 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

unread,
Sep 28, 2014, 2:48:10 AM9/28/14
to druid-de...@googlegroups.com

We have two Kafka clusters that feed data into Druid:
PHX cluster:
SLC cluster:

There is a druid.zk.service.host property in each runtime.properties file; previously I only added the PHX machines to druid.zk.service.host.

When we query RT data, we find that the data volume is only half of what it was at the same time yesterday.

Do we need to add all machines to druid.zk.service.host on all nodes?

I tried this, and the historical nodes could not return any results; once I removed the SLC machines, the historical nodes returned results again.

Here is the RT config:
[
    {
        "schema": {
            "dataSource": "pulsar_event",
            "aggregators": [
                {
                    "type": "count",
                    "name": "count"
                }
            ],
            "indexGranularity": "second",
            "shardSpec": {
                "type": "linear",
                "partitionNum": 1
            }
        },
        "config": {
            "maxRowsInMemory": 500000,
            "intermediatePersistPeriod": "PT1m"
        },
        "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
                "zookeeper.connection.timeout.ms": "15000",
                "zookeeper.session.timeout.ms": "15000",
                "zookeeper.sync.time.ms": "5000",
                "group.id": "druid-group1",
                "fetch.message.max.bytes": "1048586",
                "auto.offset.reset": "largest",
                "auto.commit.enable": "false"
            },
            "feed": "Trkng.druid-sojEvent",
            "parser": {
                "timestampSpec": {
                    "column": "timestamp",
                    "format": "auto"
                },
                "data": {
                    "format": "json"
                },
                "dimensions": ["guid","uid","tenant","eventType","site","country","city","linespeed","continent","region","browserFamily","browserVersion","deviceFamily","deviceClass","osFamily","osVersion","page"]
            }
        },
        "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist1",
            "rejectionPolicyFactory": {
                "type": "serverTime"
            }
        }
    },
    {
        "schema": {
            "dataSource": "pulsar_session",
            "aggregators": [
                {
                    "type": "count",
                    "name": "count"
                }
            ],
            "indexGranularity": "second",
            "shardSpec": {
                "type": "linear",
                "partitionNum": 1
            }
        },
        "config": {
            "maxRowsInMemory": 500000,
            "intermediatePersistPeriod": "PT1m"
        },
        "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
                "zookeeper.connection.timeout.ms": "15000",
                "zookeeper.session.timeout.ms": "15000",
                "zookeeper.sync.time.ms": "5000",
                "group.id": "druid-group1",
                "fetch.message.max.bytes": "1048586",
                "auto.offset.reset": "largest",
                "auto.commit.enable": "false"
            },
            "feed": "Trkng.druid-sessionEvent",
            "parser": {
                "timestampSpec": {
                    "column": "timestamp",
                    "format": "auto"
                },
                "data": {
                    "format": "json"
                },
                "dimensions": ["uid","rv","entryPage","exitPage","trafficSource","tenant","site","city","region","country","continent","app","linespeed","browserFamily","browserVersion","deviceFamily","deviceClass","osFamily","osVersion","treatments"]
            }
        },
        "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist1",
            "rejectionPolicyFactory": {
                "type": "serverTime"
            }
        }
    },
    {
        "schema": {
            "dataSource": "pulsar_event",
            "aggregators": [
                {
                    "type": "count",
                    "name": "count"
                }
            ],
            "indexGranularity": "second",
            "shardSpec": {
                "type": "linear",
                "partitionNum": 2
            }
        },
        "config": {
            "maxRowsInMemory": 500000,
            "intermediatePersistPeriod": "PT1m"
        },
        "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
                "zookeeper.connection.timeout.ms": "15000",
                "zookeeper.session.timeout.ms": "15000",
                "zookeeper.sync.time.ms": "5000",
                "group.id": "druid-group1",
                "fetch.message.max.bytes": "1048586",
                "auto.offset.reset": "largest",
                "auto.commit.enable": "false"
            },
            "feed": "Trkng.druid-sojEvent",
            "parser": {
                "timestampSpec": {
                    "column": "timestamp",
                    "format": "auto"
                },
                "data": {
                    "format": "json"
                },
                "dimensions": ["guid","uid","tenant","eventType","site","country","city","linespeed","continent","region","browserFamily","browserVersion","deviceFamily","deviceClass","osFamily","osVersion","page"]
            }
        },
        "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist2",
            "rejectionPolicyFactory": {
                "type": "serverTime"
            }
        }
    },
    {
        "schema": {
            "dataSource": "pulsar_session",
            "aggregators": [
                {
                    "type": "count",
                    "name": "count"
                }
            ],
            "indexGranularity": "second",
            "shardSpec": {
                "type": "linear",
                "partitionNum": 2
            }
        },
        "config": {
            "maxRowsInMemory": 500000,
            "intermediatePersistPeriod": "PT1m"
        },
        "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
                "zookeeper.connection.timeout.ms": "15000",
                "zookeeper.session.timeout.ms": "15000",
                "zookeeper.sync.time.ms": "5000",
                "group.id": "druid-group1",
                "fetch.message.max.bytes": "1048586",
                "auto.offset.reset": "largest",
                "auto.commit.enable": "false"
            },
            "feed": "Trkng.druid-sessionEvent",
            "parser": {
                "timestampSpec": {
                    "column": "timestamp",
                    "format": "auto"
                },
                "data": {
                    "format": "json"
                },
                "dimensions": ["uid","rv","entryPage","exitPage","trafficSource","tenant","site","city","region","country","continent","app","linespeed","browserFamily","browserVersion","deviceFamily","deviceClass","osFamily","osVersion","treatments"]
            }
        },
        "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist2",
            "rejectionPolicyFactory": {
                "type": "serverTime"
            }
        }
    }
]


On Thursday, September 18, 2014 at 8:35:28 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

unread,
Sep 29, 2014, 11:59:18 PM9/29/14
to druid-de...@googlegroups.com
Inline.
Just to be clear, does each of your Kafka clusters use a 9-node ZK setup? Any particular reason why you need so many? Aren't you affected by quorum decision times?

There is a druid.zk.service.host property in each runtime.properties file; previously I only added the PHX machines to druid.zk.service.host.

Druid uses ZK for distributed coordination. The druid.zk.service.host property has nothing to do with Kafka.

When we query RT data, we find that the data volume is only half of what it was at the same time yesterday.

Have all events been ingested? Are you sure the bottleneck is Druid and not Kafka? We've seen similar questions a few times in the forum and Kafka is most often the bottleneck (https://groups.google.com/forum/#!topic/druid-development/ntAHm8HigMk). Do you have any other consumers in your cluster? 
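For instance, one rough way to check (a sketch only; the broker URL and the date intervals are placeholders) is to run the same hourly count query over yesterday and today and compare the ingested volumes:

// Rough sketch: compare hourly ingested row counts for two days.
// The broker URL and the date intervals are placeholders.
function hourlyCounts(interval) {
    return fetch("http://broker.example.com:8082/druid/v2/", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            "queryType": "timeseries",
            "dataSource": "pulsar_event",
            "granularity": "hour",
            "aggregations": [{ "type": "count", "name": "rows" }],
            "intervals": [interval]
        })
    }).then(function (resp) { return resp.json(); });
}

Promise.all([
    hourlyCounts("2014-09-27/2014-09-28"),
    hourlyCounts("2014-09-28/2014-09-29")
]).then(function (days) {
    days.forEach(function (rows, dayIndex) {
        rows.forEach(function (row) {
            console.log("day", dayIndex, row.timestamp, row.result.rows);
        });
    });
});

Comparing the two series makes it easier to tell whether the drop happens at ingestion time or only in the real-time query path.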

Do we need to add all machines to druid.zk.service.host on all nodes?

I think there is a misunderstanding here about how Druid works. 
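To make the separation concrete, here is a hedged sketch of how the two settings are usually split (all host names below are placeholders): druid.zk.service.host in runtime.properties lists only the ZooKeeper quorum that the Druid nodes themselves coordinate through, while each Kafka cluster's own ZooKeeper goes into the firehose consumerProps as zookeeper.connect.

# runtime.properties on every Druid node: Druid's own coordination ZK quorum only (placeholder hosts)
druid.zk.service.host=druid-zk1.example.com,druid-zk2.example.com,druid-zk3.example.com

And in each realtime spec, the firehose points at the Kafka cluster it consumes from:

"consumerProps": {
    "zookeeper.connect": "kafka-zk1.slc.example.com:2181,kafka-zk2.slc.example.com:2181",
    "group.id": "druid-group1"
}

Adding the Kafka clusters' ZK hosts to druid.zk.service.host does not help ingestion and, as observed above, can point Druid's own coordination at the wrong ensemble.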

Xiaoming Zhang

unread,
Oct 9, 2014, 2:15:33 AM10/9/14
to druid-de...@googlegroups.com
Hi, Fangjin:

Thank you very much for you answer!

The reason we use 9 ZK nodes is not that strange: each time we are allocated 9 machines as a batch, so we use one batch of machines as a cluster. Does it have any bad effects on Druid?

You said this property is used for Druid's own coordination, so I understand your point now; I made a mistake, sorry.

Could you please check the email I sent here on 9/26? It is about the unstable query result problem.

Hope to get your response soon.

Thanks
Zhang Xiaoming.



On Tuesday, September 30, 2014 at 11:59:18 AM UTC+8, Fangjin Yang wrote:
...

Fangjin Yang

unread,
Oct 9, 2014, 1:16:22 PM10/9/14
to druid-de...@googlegroups.com
Hi Xiaoming, I've sent several personal emails to you already; have you had a chance to look at them?

I've also gotten several personal emails from other members of your team about this issue. If possible, I'd like to work out the issue and update this thread once things are resolved.



On Tuesday, September 30, 2014 at 11:59:18 AM UTC+8, Fangjin Yang wrote: