Index Problem


Xiaoming Zhang

Sep 8, 2014, 11:05:04 PM
to druid-de...@googlegroups.com
Hi, druid team:

We have four machines. Each machine consumes from two Kafka clusters; the two clusters are identical, and each one provides two Kafka topics:

RT Machine 1:
[
    {
        "dataSource": "topic1",
        "partitionNum": 1,
        "zookeeper.connect": "cluster1"
    },
    {
        "dataSource": "topic2",
        "partitionNum": 1,
        "zookeeper.connect": "cluster1"
    },
    {
        "dataSource": "topic1",
        "partitionNum": 2,
        "zookeeper.connect": "cluster2"
    },
    {
        "dataSource": "topic2",
        "partitionNum": 2,
        "zookeeper.connect": "cluster2"
    }
]

And for the other three RT nodes, the partitionNum values are:

for RT2: 4, 4, 5, 5
for RT3: 6, 6, 7, 7
for RT4: 8, 8, 9, 9

We set the plumber as below for each partition:

  "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist",
            "rejectionPolicyFactory": {
                "type": "test"
            }
        }


We expect that every hour, 8 rows (one per partition) will be inserted into the "druid_segments" table, like this:
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_1
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_2
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_4
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_5
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_6
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_7
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_8
pulsar_raw_2014-09-04T05:00:00.000-07:00_2014-09-04T06:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_9

However, for some hours the rows are incomplete.

For this hour, we found only one row:
pulsar_raw_2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_4

For this hour, we lost 6 and 7:
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_1
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_2
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_4
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_5
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_8
pulsar_raw_2014-09-04T03:00:00.000-07:00_2014-09-04T04:00:00.000-07:00_2014-09-04T03:00:00.000-07:00_9

I checked druid.log on one RT node and found this exception:

druid.log.2:2014-09-04 19:34:06,171 INFO  [chief-pulsar_raw] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-04T19:34:06.165-07:00","service":"realtime","host":"phxdbx1183.phx.ebay.com:8083","severity":"component-failure","description":"Problem loading sink[pulsar_raw] from disk.","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.FileNotFoundException","exceptionMessage":"/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory)","exceptionStackTrace":"java.io.FileNotFoundException: /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory)\n\tat java.io.FileInputStream.open(Native Method)\n\tat java.io.FileInputStream.<init>(FileInputStream.java:146)\n\tat io.druid.segment.SegmentUtils.getVersionFromDir(SegmentUtils.java:24)\n\tat io.druid.segment.IndexIO.loadIndex(IndexIO.java:128)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.bootstrapSinksFromDisk(RealtimePlumber.java:521)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.startJob(RealtimePlumber.java:152)\n\tat io.druid.segment.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:184)\n","interval":"2014-09-04T04:00:00.000-07:00/2014-09-04T05:00:00.000-07:00"}}]

druid.log.2:2014-09-04 19:34:07,126 ERROR [chief-pulsar_raw] io.druid.segment.realtime.plumber.RealtimePlumber - Problem loading sink[pulsar_raw] from disk.: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.FileNotFoundException, exceptionMessage=/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory), interval=2014-09-04T04:00:00.000-07:00/2014-09-04T05:00:00.000-07:00}
druid.log.2:2014-09-04 19:34:07,126 INFO  [chief-pulsar_raw] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-04T19:34:07.126-07:00","service":"realtime","host":"phxdbx1183.phx.ebay.com:8083","severity":"component-failure","description":"Problem loading sink[pulsar_raw] from disk.","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.FileNotFoundException","exceptionMessage":"/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory)","exceptionStackTrace":"java.io.FileNotFoundException: /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/index.drd (No such file or directory)\n\tat java.io.FileInputStream.open(Native Method)\n\tat java.io.FileInputStream.<init>(FileInputStream.java:146)\n\tat io.druid.segment.SegmentUtils.getVersionFromDir(SegmentUtils.java:24)\n\tat io.druid.segment.IndexIO.loadIndex(IndexIO.java:128)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.bootstrapSinksFromDisk(RealtimePlumber.java:521)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.startJob(RealtimePlumber.java:152)\n\tat io.druid.segment.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:184)\n","interval":"2014-09-04T04:00:00.000-07:00/2014-09-04T05:00:00.000-07:00"}}]


Any ideas about this?

Thank you very much!






Fangjin Yang

Sep 8, 2014, 11:11:47 PM
to druid-de...@googlegroups.com
Hi Xiaoming, I believe this is the same problem discussed in a few of the previous threads. Can you provide the following info:

1) Druid version you are running with.
2) The segment versions do not seem to be in UTC. Are you running with UTC timezone?
3) What is in /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00?
4) What is in /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/?
5) If you clear out the sink directory and rerun the experiments, is this behavior reproducible with 0.6.152?

Thanks,
FJ
 


Xiaoming Zhang

Sep 8, 2014, 11:20:37 PM
to druid-de...@googlegroups.com
1) Druid version you are running with.
package: druid-services-0.6.121
2) The segment versions do not seem to be in UTC. Are you running with UTC timezone?
We use MST time
3) What is in /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00?
[cronus@phxdbx1183 kafka1]$ cd /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00
[cronus@phxdbx1183 2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00]$ ls -lh
total 40K
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:05 0
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:14 1
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:23 2
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:33 3
drwxr-xr-x 2 cronus app 4.0K Sep  4 04:43 4
drwxr-xr-x 3 cronus app 4.0K Sep  4 19:29 5
drwxr-xr-x 2 cronus app 4.0K Sep  4 05:03 6
drwxr-xr-x 2 cronus app 4.0K Sep  4 05:13 7
drwxr-xr-x 2 cronus app 4.0K Sep  4 05:23 8
drwxr-xr-x 2 cronus app 4.0K Sep  4 05:33 9
4) What is in /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/?
[cronus@phxdbx1183 2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00]$ cd  /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T04:00:00.000-07:00_2014-09-04T05:00:00.000-07:00/5/
[cronus@phxdbx1183 5]$ ls -lh
total 4.0K
drwxr-xr-x 2 cronus app 4.0K Sep  4 19:29 v8-tmp
5) If you clear out the sink directory and rerun the experiments, is this behavior reproducible with 0.6.152?
Will test this later

On Monday, September 8, 2014 8:11:47 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 9, 2014, 5:22:02 PM
to druid-de...@googlegroups.com
Hi Xiaoming, it seems like an index file went missing and is causing problems with your real-time node startup. I recall fixing similar problems between 0.6.121 and the current version, and I was wondering if you would mind trying to reproduce the problem with the latest stable?

Thanks,
FJ

Xiaoming Zhang

Sep 10, 2014, 6:03:20 AM
to druid-de...@googlegroups.com
Do I need to clean everything when I move my cluster to a new version?

I did not clean anything, started the cluster on the new version, and got the following:

2014-09-10 02:43:46,417 INFO  [chief-pulsar_raw] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-10T02:43:46.416-07:00","service":"realtime","host":"phxdbx1180.phx.ebay.com:8083","severity":"component-failure","description":"Problem loading sink[pulsar_raw] from disk.","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.FileNotFoundException","exceptionMessage":"/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/5/index.drd (No such file or directory)","exceptionStackTrace":"java.io.FileNotFoundException: /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/5/index.drd (No such file or directory)\n\tat java.io.FileInputStream.open(Native Method)\n\tat java.io.FileInputStream.<init>(FileInputStream.java:146)\n\tat io.druid.segment.SegmentUtils.getVersionFromDir(SegmentUtils.java:24)\n\tat io.druid.segment.IndexIO.loadIndex(IndexIO.java:159)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.bootstrapSinksFromDisk(RealtimePlumber.java:529)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber.startJob(RealtimePlumber.java:150)\n\tat io.druid.segment.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:184)\n","interval":"2014-09-04T00:00:00.000-07:00/2014-09-04T01:00:00.000-07:00"}}]
2014-09-10 02:50:11,967 ERROR [pulsar_raw-2014-09-06T07:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_raw]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.FileNotFoundException, exceptionMessage=/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-06T07:00:00.000-07:00_2014-09-06T08:00:00.000-07:00/merged/v8-tmp/smoosher/meta.smoosh (No such file or directory), interval=2014-09-06T07:00:00.000-07:00/2014-09-06T08:00:00.000-07:00}

2014-09-10 02:50:11,968 ERROR [pulsar_raw-2014-09-06T07:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_raw]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class com.metamx.common.IAE, exceptionMessage=Unknown version[3], interval=2014-09-06T07:00:00.000-07:00/2014-09-06T08:00:00.000-07:00}

2014-09-10 02:50:11,968 INFO  [pulsar_raw-2014-09-06T07:00:00.000-07:00-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-10T02:50:11.968-07:00","service":"realtime","host":"phxdbx1180.phx.ebay.com:8083","severity":"component-failure","description":"Failed to persist merged index[pulsar_raw]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"com.metamx.common.IAE","exceptionMessage":"Unknown version[3]","exceptionStackTrace":"com.metamx.common.IAE: Unknown version[3]\n\tat io.druid.segment.data.GenericIndexed.read(GenericIndexed.java:366)\n\tat io.druid.segment.IndexIO$DefaultIndexIOHandler.convertV8toV9(IndexIO.java:378)\n\tat io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:839)\n\tat io.druid.segment.IndexMerger.merge(IndexMerger.java:308)\n\tat io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:170)\n\tat io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:163)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:348)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\n","interval":"2014-09-06T07:00:00.000-07:00/2014-09-06T08:00:00.000-07:00"}}]

BTW:

Is it possible to set "indexGranularity" to 10 seconds in the firehose spec?

Thanks!
yokspher


On Tuesday, September 9, 2014 2:22:02 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 10, 2014, 1:51:59 PM
to druid-de...@googlegroups.com
Hi Xiaoming,

I suspect that your real-time bootstrap is in a broken state because of some error that occurred. I'd like to see if the problem can be reproduced with the latest version. Can we try removing everything under /kafka1/druidrealtime/basepersist and trying 0.6.152? As for 10s index granularities, this is possible and you can extend QueryGranularity.java with the logic.
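If the version you are running already supports duration-based query granularities, a spec like the one below might also give you 10-second buckets without a code change (this is only a sketch; please double check that the "duration" granularity type is available in your version):

"granularity": {
    "type": "duration",
    "duration": 10000
}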

Xiaoming Zhang

Sep 10, 2014, 10:05:31 PM
to druid-de...@googlegroups.com
Hi, Fangjin:

You mean I need to build my own version of Druid if I want to change the index granularity to 10 seconds?

And I used version 0.6.146; you said I should try the latest stable, so I chose this one.

On Wednesday, September 10, 2014 10:51:59 AM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 10, 2014, 11:25:26 PM
to druid-de...@googlegroups.com
Hi Xiaoming, I'm assuming this is a POC environment. I was just wondering if you could: 
1) Remove the contents in /kafka1/druidrealtime/basepersist/
2) Use the latest stable
3) Try to reproduce the error.

You can contribute the code for a 10 second index granularity and we can include it in the next release.

Xiaoming Zhang

Sep 11, 2014, 4:01:30 AM
to druid-de...@googlegroups.com
Hi, Fangjin:

I removed everything under /kafka1/druidrealtime/basepersist/ on all RT nodes.

And I used the latest stable, 0.6.146.

I searched the logs and found no exceptions, which is good news!

But when I check the index entries in the SQL DB, they are still not right.

Our cluster has 8 partitionNums (1, 3, 4, 5, 6, 7, 8, 9), but there are only about 3 to 4 index segments for each hour.

Another problem:

We query the data count for each hour with this query:
{
  "queryType": "timeseries",
  "dataSource": "pulsar_raw",
  "granularity": "hour",
  "aggregations": [
    { "type": "count", "name": "count" }
  ],
  "intervals": [ "2014-09-09/2014-09-10" ]
}

We get the count of each hour like:
[ {
  "timestamp" : "2014-09-09T07:00:00.000Z",
  "result" : {
    "count" : 10381336
  }
}, {
  "timestamp" : "2014-09-09T08:00:00.000Z",
  "result" : {
    "count" : 11277862
  }
}, {
  "timestamp" : "2014-09-09T09:00:00.000Z",
  "result" : {
    "count" : 12163299
  }
}
.....
 {
  "timestamp" : "2014-09-10T04:00:00.000Z",
  "result" : {
    "count" : 8864453
  }
}, {
  "timestamp" : "2014-09-10T05:00:00.000Z",
  "result" : {
    "count" : 8816756
  }
}, {
  "timestamp" : "2014-09-10T06:00:00.000Z",
  "result" : {
    "count" : 9947131
  }
} ]


We have another Cassandra system which receives the same messages, and we queried its data:
cqlsh> select * from sojtracking.mc_mctimegroupmetric where metricname='MCDruidRawEventCount' and groupid='2014-09-09';

 metricname           | groupid    | metrictime               | tag_metrictime | value
----------------------+------------+--------------------------+----------------+----------
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 00:00:00-0700 |  1410246000000 | 18165012
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 01:00:00-0700 |  1410249600000 | 19739481
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 02:00:00-0700 |  1410253200000 | 21292068
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 03:00:00-0700 |  1410256800000 | 21587227
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 04:00:00-0700 |  1410260400000 | 22717716
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 05:00:00-0700 |  1410264000000 | 23881615
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 06:00:00-0700 |  1410267600000 | 24692035
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 07:00:00-0700 |  1410271200000 | 26587748
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 08:00:00-0700 |  1410274800000 | 27055307
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 09:00:00-0700 |  1410278400000 | 28132186
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 10:00:00-0700 |  1410282000000 | 28824256
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 11:00:00-0700 |  1410285600000 | 30515629
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 12:00:00-0700 |  1410289200000 | 31999651
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 13:00:00-0700 |  1410292800000 | 29975886
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 14:00:00-0700 |  1410296400000 | 26556515
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 15:00:00-0700 |  1410300000000 | 23895235
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 16:00:00-0700 |  1410303600000 | 19780544
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 17:00:00-0700 |  1410307200000 | 19707490
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 18:00:00-0700 |  1410310800000 | 20011370
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 19:00:00-0700 |  1410314400000 | 19602294
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 20:00:00-0700 |  1410318000000 | 18242883
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 21:00:00-0700 |  1410321600000 | 15509185
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 22:00:00-0700 |  1410325200000 | 15428086
 MCDruidRawEventCount | 2014-09-09 | 2014-09-09 23:00:00-0700 |  1410328800000 | 17408436


For each hour, the Druid count is about 57% of the Cassandra count, which does not match and makes us worried that data may be lost.
But I checked the logs and there are no thrown-away events or parse failures.

Any ideas about this?

Thanks a lot !
yokspher



On Wednesday, September 10, 2014 8:25:26 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 11, 2014, 4:11:56 AM
to druid-de...@googlegroups.com
BTW:

Could you please add one line to QueryGranularity.java?

TEN_SECOND (10 * 1000)


On Wednesday, September 10, 2014 8:25:26 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 11, 2014, 2:10:25 PM
to druid-de...@googlegroups.com
Hi Xiaoming,

The query you are issuing will not give you a count of ingested rows; it returns a count of Druid rows, which factors in rollup. If you want to query the number of ingested rows, you need to specify a 'count' aggregator at ingestion time. Then, at query time, issue a query with a longSum aggregator. I have some more comments inline.
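For example, the two specs might look roughly like this (the metric name "count" is only illustrative and should match whatever you configure at ingestion):

At ingestion time:
    { "type": "count", "name": "count" }

At query time:
    { "type": "longSum", "name": "ingested_rows", "fieldName": "count" }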


On Thursday, September 11, 2014 1:01:30 AM UTC-7, Xiaoming Zhang wrote:
Hi, Fangjin:

I removed everything under /kafka1/druidrealtime/basepersist/ on all RT nodes.

And I used the latest stable, 0.6.146.

I searched the logs and found no exceptions, which is good news!

But when I check the index entries in the SQL DB, they are still not right.

Our cluster has 8 partitionNums (1, 3, 4, 5, 6, 7, 8, 9), but there are only about 3 to 4 index segments for each hour.

Do you mean the rest of the partitions are not being persisted? Can you share:
1) Your configuration
2) rejectionPolicy
3) windowPeriod
4) Any exceptions or log messages during handoff on the real-time node
5) Any exceptions or log messages during handoff on the coordinator node
Your data is likely being rolled up.

Xiaoming Zhang

Sep 11, 2014, 10:41:21 PM
to druid-de...@googlegroups.com
My bad.

The query I gave you is the one I use; our QA uses a query like this:
{
  "queryType": "timeseries",
  "dataSource": "pulsar_raw",
  "granularity": "hour",
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ],
  "intervals": [ "2014-09-09/2014-09-10" ]
}

And we have this "count" aggregator configured at ingest time:
"aggregators": [
    {
        "type": "count",
        "name": "count"
    }
],
The query I use and the one QA uses give similar numbers, just slightly different, and still about 57% of Cassandra's.


1) Your configuration
All config files are attached.
2) rejectionPolicy
test
3) windowPeriod
PT50m
4) Any exceptions or log messages during handoff on the real-time node
These errors were produced after I switched to 0.6.146 and cleaned things under /kafka1/druidrealtime/basepersist/.
We also tried to add the HDFS dependency; after running for a while, the node shut down automatically.

2014-09-11 01:53:34,664 ERROR [pulsar_raw-2014-09-11T01:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_raw]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.FileNotFoundException, exceptionMessage=/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-11T01:00:00.000-07:00_2014-09-11T02:00:00.000-07:00/merged/v8-tmp/inverted.drd (No such file or directory), interval=2014-09-11T01:00:00.000-07:00/2014-09-11T02:00:00.000-07:00}

2014-09-11 01:53:34,667 INFO  [pulsar_raw-2014-09-11T01:00:00.000-07:00-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-11T01:53:34.664-07:00","service":"realtime","host":"phxdbx1180.phx.ebay.com:8083","severity":"component-failure","description":"Failed to persist merged index[pulsar_raw]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.FileNotFoundException","exceptionMessage":"/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-11T01:00:00.000-07:00_2014-09-11T02:00:00.000-07:00/merged/v8-tmp/inverted.drd (No such file or directory)","exceptionStackTrace":"java.io.FileNotFoundException: /kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-11T01:00:00.000-07:00_2014-09-11T02:00:00.000-07:00/merged/v8-tmp/inverted.drd (No such file or directory)\n\tat java.io.FileOutputStream.open(Native Method)\n\tat java.io.FileOutputStream.<init>(FileOutputStream.java:221)\n\tat com.google.common.io.Files$FileByteSink.openStream(Files.java:201)\n\tat com.google.common.io.Files$FileByteSink.openStream(Files.java:189)\n\tat com.google.common.io.ByteSink.getOutput(ByteSink.java:84)\n\tat com.google.common.io.ByteSink.getOutput(ByteSink.java:47)\n\tat io.druid.common.utils.SerializerUtils.writeString(SerializerUtils.java:49)\n\tat io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:779)\n\tat io.druid.segment.IndexMerger.merge(IndexMerger.java:308)\n\tat io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:170)\n\tat io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:163)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:348)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\n","interval":"2014-09-11T01:00:00.000-07:00/2014-09-11T02:00:00.000-07:00"}}]

2014-09-11 02:04:02,324 ERROR [chief-pulsar_raw] io.druid.segment.realtime.plumber.RealtimePlumber - Problem loading sink[pulsar_raw] from disk.: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.FileNotFoundException, exceptionMessage=/kafka1/druidrealtime/basepersist/pulsar_raw/2014-09-10T23:00:00.000-07:00_2014-09-11T00:00:00.000-07:00/5/index.drd (No such file or directory), interval=2014-09-10T23:00:00.000-07:00/2014-09-11T00:00:00.000-07:00}

2014-09-11 02:47:38,649 INFO  [druid-group_phxdbx1181.phx.ebay.com-1410426252834-3c8f81d4_watcher_executor] kafka.consumer.ZookeeperConsumerConnector - [druid-group_phxdbx1181.phx.ebay.com-1410426252834-3c8f81d4], exception during rebalance 
org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /consumers/druid-group/ids/druid-group_phxdbx1180.phx.ebay.com-1410426242551-3ca0a833
        at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)

5) Any exceptions or log messages during handoff on the coordinator node
Sorry, I forgot to capture logs on the coordinator nodes...
druid-deploy.zip

Fangjin Yang

Sep 11, 2014, 11:43:11 PM
to druid-de...@googlegroups.com
Inline.


On Thursday, September 11, 2014 7:41:21 PM UTC-7, Xiaoming Zhang wrote:
My bad.

The query I gave you is the one I use; our QA uses a query like this:
{
  "queryType": "timeseries",
  "dataSource": "pulsar_raw",
  "granularity": "hour",
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ],
  "intervals": [ "2014-09-09/2014-09-10" ]
}

And we have this "count" aggregator configured at ingest time:
"aggregators": [
    {
        "type": "count",
        "name": "count"
    }
],
The query I use and the one QA uses give similar numbers, just slightly different, and still about 57% of Cassandra's.

Okay, I suspect the difference is because of errors across your realtime pipeline where not all partitions are actually working and handing off. I suspect that if you do run a batch job of the data you will see correct results.
 
Looking through the configs, I believe the problem is here:
        "rejectionPolicyFactory": {
            "type": "test"
        }

"test" should never be used beyond testing basic test of hand off. Please see:

It was a mistake to create this rejection policy and we will remove it. Please see:

I suggest you try and use "serverTime" as the rejectionPolicy for current time data.
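For example, the plumber section would look roughly like this (sketch only; the rest of your plumber config stays as it is):

"rejectionPolicyFactory": {
    "type": "serverTime"
}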

One other thing that is very curious is the "No such file or directory" errors. I have been trying to reproduce them locally but have been unsuccessful.

Please try to remove /kafka1/druidrealtime/basepersist/ and try again. If things still are not working, any logs during persist or handoff would be extremely helpful.

Xiaoming Zhang

Sep 12, 2014, 3:11:14 AM
to druid-de...@googlegroups.com
We tried to add the HDFS dependency but failed with the following error. We did not find any place that uses commons-codec:commons-codec:jar:1.7, and we also added druid.extensions.remoteRepositories=[] to the config, but it still did not help:

2014-09-11 02:25:15,096 INFO  [main] io.druid.initialization.Initialization - Loading extension[io.druid.extensions:druid-hdfs-storage:0.6.153] for class[io.druid.cli.CliCommandCreator]

2014-09-11 03:06:00,841 ERROR [main] io.druid.initialization.Initialization - Unable to resolve artifacts for [io.druid.extensions:druid-hdfs-storage:jar:0.6.153 (runtime) -> [] < [central (http://repo1.maven.org/maven2/, releases+snapshots),  (https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local, releases+snapshots)]].

org.eclipse.aether.resolution.DependencyResolutionException: Failed to collect dependencies at io.druid.extensions:druid-hdfs-storage:jar:0.6.153 -> net.java.dev.jets3t:jets3t:jar:0.9.1 -> org.apache.httpcomponents:httpclient:jar:4.2 -> commons-codec:commons-codec:jar:1.7

        at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:380)

        at io.tesla.aether.internal.DefaultTeslaAether.resolveArtifacts(DefaultTeslaAether.java:289)

        at io.druid.initialization.Initialization.getClassLoaderForCoordinates(Initialization.java:199)

        at io.druid.initialization.Initialization.getFromExtensions(Initialization.java:141)

        at io.druid.cli.Main.main(Main.java:78)

Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at io.druid.extensions:druid-hdfs-storage:jar:0.6.153 -> net.java.dev.jets3t:jets3t:jar:0.9.1 -> org.apache.httpcomponents:httpclient:jar:4.2 -> commons-codec:commons-codec:jar:1.7

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.collectDependencies(DefaultDependencyCollector.java:292)

        at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:342)

        ... 4 more

Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for commons-codec:commons-codec:jar:1.7

        at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.loadPom(DefaultArtifactDescriptorReader.java:335)

        at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.readArtifactDescriptor(DefaultArtifactDescriptorReader.java:217)

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:461)

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:573)

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:573)

        at org.eclipse.aether.internal.impl.DefaultDependencyCollector.collectDependencies(DefaultDependencyCollector.java:261)

        ... 5 more

Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact commons-codec:commons-codec:pom:1.7 from/to central (http://repo1.maven.org/maven2/): connect timed out

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:459)

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts(DefaultArtifactResolver.java:262)

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifact(DefaultArtifactResolver.java:239)

        at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.loadPom(DefaultArtifactDescriptorReader.java:320)

        ... 10 more

Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact commons-codec:commons-codec:pom:1.7 from/to central (http://repo1.maven.org/maven2/): connect timed out

        at io.tesla.aether.connector.AetherRepositoryConnector$2.wrap(AetherRepositoryConnector.java:830)

        at io.tesla.aether.connector.AetherRepositoryConnector$2.wrap(AetherRepositoryConnector.java:824)

        at io.tesla.aether.connector.AetherRepositoryConnector$GetTask.flush(AetherRepositoryConnector.java:619)

        at io.tesla.aether.connector.AetherRepositoryConnector.get(AetherRepositoryConnector.java:238)

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.performDownloads(DefaultArtifactResolver.java:535)

        at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:436)

        ... 13 more

Caused by: java.net.SocketTimeoutException: connect timed out

        at java.net.PlainSocketImpl.socketConnect(Native Method)

        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)

        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)

        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)

        at java.net.Socket.connect(Socket.java:579)

        at com.squareup.okhttp.Connection.connect(Connection.java:100)

        at com.squareup.okhttp.internal.http.HttpEngine.connect(HttpEngine.java:287)

        at com.squareup.okhttp.internal.http.HttpEngine.sendSocketRequest(HttpEngine.java:248)

        at com.squareup.okhttp.internal.http.HttpEngine.sendRequest(HttpEngine.java:197)

        at com.squareup.okhttp.internal.http.HttpURLConnectionImpl.execute(HttpURLConnectionImpl.java:388)

        at com.squareup.okhttp.internal.http.HttpURLConnectionImpl.getResponse(HttpURLConnectionImpl.java:339)

        at com.squareup.okhttp.internal.http.HttpURLConnectionImpl.getResponseCode(HttpURLConnectionImpl.java:540)

        at io.tesla.aether.okhttp.OkHttpAetherClient$ResponseAdapter.getStatusCode(OkHttpAetherClient.java:196)

        at io.tesla.aether.connector.AetherRepositoryConnector$GetTask.resumableGet(AetherRepositoryConnector.java:549)

        at io.tesla.aether.connector.AetherRepositoryConnector$GetTask.run(AetherRepositoryConnector.java:391)

        at io.tesla.aether.connector.AetherRepositoryConnector.get(AetherRepositoryConnector.java:232)


On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 12, 2014, 5:16:29 AM
to druid-de...@googlegroups.com
We changed one RT node config to this:

druid.service=realtime
druid.port=8083


druid.extensions.coordinates=["io.druid.extensions:druid-kafka-eight:0.6.142","io.druid.extensions:druid-hdfs-storage:0.6.142"]
druid.realtime.chathandler.type=announce

# The realtime config file.
druid.realtime.specFile=config/realtime/kafka.soj.spec

# Change this config to db to hand off to the rest of the Druid cluster
druid.publish.type=db

# These configs are only required for real hand off
druid.db.connector.connectURI=jdbc\:mysql\://mymisc1.db.stratus.ebay.com\:3306/sojdruid
druid.db.connector.user=sojdruid
druid.db.connector.password=sojdruid

druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1

druid.monitoring.monitors=["io.druid.segment.realtime.RealtimeMetricsMonitor"]
druid.extensions.remoteRepositories=[]

druid.segmentCache.locations=[{"path": "/kafka1/druid/segmentCache", "maxSize": 1000000000000}]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://artemis-nn.vip.ebay.com:8020/tmp/DruidSegments

When I try to start the node, it just hangs at loading the HDFS extension and is unable to start; here is the log:

2014-09-12 02:03:51,888 INFO  [main] io.druid.guice.PropertiesModule - Loading properties from runtime.properties
2014-09-12 02:03:51,920 INFO  [main] org.hibernate.validator.internal.util.Version - HV000001: Hibernate Validator 5.0.1.Final
2014-09-12 02:03:52,408 INFO  [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=[io.druid.extensions:druid-kafka-eight:0.6.142, io.druid.extensions:druid-hdfs-storage:0.6.142], localRepository='/home/cronus/.m2/repository', remoteRepositories=[]}]
2014-09-12 02:03:52,532 INFO  [main] io.druid.initialization.Initialization - Loading extension[io.druid.extensions:druid-kafka-eight:0.6.142] for class[io.druid.cli.CliCommandCreator]
2014-09-12 02:03:52,895 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/io/druid/extensions/druid-kafka-eight/0.6.142/druid-kafka-eight-0.6.142.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/apache/kafka/kafka_2.9.2/0.8.0/kafka_2.9.2-0.8.0.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/scala-lang/scala-library/2.9.2/scala-library-2.9.2.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/net/sf/jopt-simple/jopt-simple/3.2/jopt-simple-3.2.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/slf4j/slf4j-simple/1.6.4/slf4j-simple-1.6.4.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/scala-lang/scala-compiler/2.9.2/scala-compiler-2.9.2.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/com/101tec/zkclient/0.3/zkclient-0.3.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar]
2014-09-12 02:03:52,896 INFO  [main] io.druid.initialization.Initialization - Added URL[file:/home/cronus/.m2/repository/com/yammer/metrics/metrics-annotation/2.2.0/metrics-annotation-2.2.0.jar]
2014-09-12 02:03:52,900 INFO  [main] io.druid.initialization.Initialization - Loading extension[io.druid.extensions:druid-hdfs-storage:0.6.142] for class[io.druid.cli.CliCommandCreator]

It takes a long time and nothing happens.


On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 14, 2014, 10:54:15 PM
to druid-de...@googlegroups.com
We uploaded the missing jars and removed everything under /kafka1/druidrealtime/basepersist/, and finally it seems to work.

But here is an exception found on a historical node:

druid.log:2014-09-14 19:33:56,736 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Node[/druid/loadQueue/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T01:00:00.000-07:00_2014-09-04T02:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_4] was removed
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - New request[LOAD: pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9] with node[/druid/loadQueue/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9].
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Loading segment pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.segment.loading.OmniSegmentLoader - Deleting directory[/kafka1/druid/indexCache/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9]
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.server.coordination.SingleDataSegmentAnnouncer - Unannouncing segment[pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9] at path[/druid/servedSegments/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.curator.announcement.Announcer - unannouncing [/druid/servedSegments/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]
druid.log:2014-09-14 19:33:56,742 ERROR [ZkCoordinator-0] io.druid.curator.announcement.Announcer - Path[/druid/servedSegments/phxdbx1190.phx.ebay.com:8081/pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9] not announced, cannot unannounce.
druid.log:2014-09-14 19:33:56,742 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Completely removing [pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9] in [30,000] millis
druid.log:2014-09-14 19:33:56,746 INFO  [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Completed request [LOAD: pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]
druid.log:2014-09-14 19:33:56,746 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9], segment=DataSegment{size=406328782, shardSpec=LinearShardSpec{partitionNum=9}, metrics=[count], dimensions=[_cn, _con, _cty, _lat, _lon, _rgn, app, bti, bu, curprice, dd_bf, dd_d, dd_os, g, itm, itmtitle, js_ev_type, js_evt_kafka_produce_ts, mloc, p, referer, remoteip, t, timestamp, u], version='2014-09-04T00:00:00.000-07:00', loadSpec={type=local, path=/kafka2/druid/localStorage/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9/index.zip}, interval=2014-09-04T00:00:00.000-07:00/2014-09-04T01:00:00.000-07:00, dataSource='pulsar_raw', binaryVersion='9'}}
druid.log:2014-09-14 19:33:56,746 INFO  [ZkCoordinator-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-14T19:33:56.746-07:00","service":"historical","host":"phxdbx1190.phx.ebay.com:8081","severity":"component-failure","description":"Failed to load segment for dataSource","data":{"class":"io.druid.server.coordination.ZkCoordinator","exceptionType":"io.druid.segment.loading.SegmentLoadingException","exceptionMessage":"Exception loading segment[pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]","exceptionStackTrace":"io.druid.segment.loading.SegmentLoadingException: Exception loading segment[pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9]\n\tat io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:136)\n\tat io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44)\n\tat io.druid.server.coordination.BaseZkCoordinator$1.childEvent(BaseZkCoordinator.java:113)\n\tat org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:509)\n\tat org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:503)\n\tat org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92)\n\tat com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)\n\tat org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83)\n\tat org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:500)\n\tat org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35)\n\tat org.apache.curator.framework.recipes.cache.PathChildrenCache$10.run(PathChildrenCache.java:762)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:262)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:262)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\nCaused by: io.druid.segment.loading.SegmentLoadingException: Asked to load path[/kafka2/druid/localStorage/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9/index.zip], but it doesn't exist.\n\tat io.druid.segment.loading.LocalDataSegmentPuller.getFile(LocalDataSegmentPuller.java:100)\n\tat io.druid.segment.loading.LocalDataSegmentPuller.getSegmentFiles(LocalDataSegmentPuller.java:41)\n\tat io.druid.segment.loading.OmniSegmentLoader.getSegmentFiles(OmniSegmentLoader.java:125)\n\tat io.druid.segment.loading.OmniSegmentLoader.getSegment(OmniSegmentLoader.java:93)\n\tat io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:145)\n\tat io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:132)\n\t... 
17 more\n","segment":{"dataSource":"pulsar_raw","interval":"2014-09-04T00:00:00.000-07:00/2014-09-04T01:00:00.000-07:00","version":"2014-09-04T00:00:00.000-07:00","loadSpec":{"type":"local","path":"/kafka2/druid/localStorage/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9/index.zip"},"dimensions":"_cn,_con,_cty,_lat,_lon,_rgn,app,bti,bu,curprice,dd_bf,dd_d,dd_os,g,itm,itmtitle,js_ev_type,js_evt_kafka_produce_ts,mloc,p,referer,remoteip,t,timestamp,u","metrics":"count","shardSpec":{"type":"linear","partitionNum":9},"binaryVersion":9,"size":406328782,"identifier":"pulsar_raw_2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00_2014-09-04T00:00:00.000-07:00_9"}}}]



On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 15, 2014, 5:07:01 AM
to druid-de...@googlegroups.com
Sorry to bother you again, I have two questions:

1. Is there a way to cleanly delete data files, including those on the RT nodes and historical nodes?

2. Is there a way to smoothly change the index segment granularity?

Thanks a lot for your help!

On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 15, 2014, 5:59:25 AM
to druid-de...@googlegroups.com
We cleaned everything under /kafka1/druidrealtime/basepersist/ and started all RT nodes.

However, we still get the following errors:

2014-09-15 01:50:00,559 ERROR [pulsar_event-2014-09-14T23:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_event]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No FileSystem for scheme: hdfs, interval=2014-09-14T23:00:00.000-07:00/2014-09-15T00:00:00.000-07:00}
2014-09-15 01:50:00,559 INFO  [pulsar_event-2014-09-14T23:00:00.000-07:00-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-09-15T01:50:00.559-07:00","service":"realtime","host":"phxdbx1182.phx.ebay.com:8083","severity":"component-failure","description":"Failed to persist merged index[pulsar_event]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.IOException","exceptionMessage":"No FileSystem for scheme: hdfs","exceptionStackTrace":"java.io.IOException: No FileSystem for scheme: hdfs\n\tat org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2304)\n\tat org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2311)\n\tat org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)\n\tat org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)\n\tat org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)\n\tat org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)\n\tat org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)\n\tat io.druid.storage.hdfs.HdfsDataSegmentPusher.push(HdfsDataSegmentPusher.java:75)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:356)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\n","interval":"2014-09-14T23:00:00.000-07:00/2014-09-15T00:00:00.000-07:00"}}]
2014-09-15 02:08:57,174 ERROR [pulsar_session-2014-09-14T23:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[pulsar_session]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No FileSystem for scheme: hdfs, interval=2014-09-14T23:00:00.000-07:00/2014-09-15T00:00:00.000-07:00}


On Thursday, September 11, 2014 8:43:11 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 15, 2014, 2:28:44 PM
to druid-de...@googlegroups.com
Hi Xiaoming, it looks like there are a few more configuration problems:

Looking at this log line:
 Asked to load path[/kafka2/druid/localStorage/pulsar_raw/2014-09-04T00:00:00.000-07:00_2014-09-04T01:00:00.000-07:00/2014-09-04T00:00:00.000-07:00/9/index.zip], but it doesn't exist.

It appears local storage is specified as the deep storage mechanism. Unless you are running all nodes on a single machine, this won't work for handoff. Do you guys want to load segments to HDFS?

Thanks,
FJ

Fangjin Yang

Sep 15, 2014, 2:32:27 PM
to druid-de...@googlegroups.com
Hi, see inline.


On Monday, September 15, 2014 2:07:01 AM UTC-7, Xiaoming Zhang wrote:
Sorry to bother you again, I have two questions:

No worries, we are happy to answer all questions. 

1. Is there a way to cleanly delete data files, including those on the RT nodes and historical nodes?

Real-time nodes should cleanly delete all files after a successful handoff. Historical nodes should cleanly delete all segment files once they've dropped a segment. You can use the concept of rules (http://druid.io/docs/latest/Rule-Configuration.html) to automate when data is dropped from a cluster. If you want to cleanly remove all segments, metadata, and segment files in deep storage, you can look at the kill task (http://druid.io/docs/latest/Tasks.html).
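For reference, a kill task spec is roughly of this shape (the dataSource and interval here are only illustrative):

{
    "type": "kill",
    "dataSource": "pulsar_raw",
    "interval": "2014-09-04T00:00:00/2014-09-05T00:00:00"
}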
 
2. Is there a way to smoothly change the index segment granularity?

Yes. You can change the segment granularity at ingest time, and if you want to change granularities for already created segments, you can look at http://druid.io/docs/latest/Ingestion-FAQ.html about reingesting data with different schemas. 

Xiaoming Zhang

Sep 17, 2014, 3:28:20 AM
to druid-de...@googlegroups.com
Thank you very much for your answer.

I mentioned that we have two clusters; they behave the same way: both send topic 1 and topic 2 to Druid.

When the RT nodes are ingesting data, we query the row counts, and the numbers do not match our Cassandra system.

Recently we started our historical nodes and queried their data, and we found that the numbers exactly match Cassandra.

Is this a bug in Druid?

On Monday, September 15, 2014 11:32:27 AM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

Sep 17, 2014, 6:13:59 AM
to druid-de...@googlegroups.com
BTW:

Is there any way to make Druid allow cross-domain requests?

Is there any config that can make this work?

On Monday, September 15, 2014 11:32:27 AM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 18, 2014, 12:56:52 AM
to druid-de...@googlegroups.com
I do not believe this is a bug, but instead a misconfiguration with the real-time pipeline. Do you still see this problem after removing the "test" rejection policy?

Fangjin Yang

Sep 18, 2014, 12:58:24 AM
to druid-de...@googlegroups.com
Hi Xiaoming, by cross domain request, do you mean deploying Druid in multiple data centers? Can you give me some more detail on the use case?

Xiaoming Zhang

Sep 18, 2014, 1:05:16 AM
to druid-de...@googlegroups.com
Yes, we use the serverTime rejection policy in our cluster, and this happens all the time.

We use JavaScript to send query requests to the Druid broker node, but the browser does not allow this.

On Wednesday, September 17, 2014 9:58:24 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 18, 2014, 9:48:37 PM
to druid-de...@googlegroups.com
Hi Xiaoming, when you say the results do not match up, do you mean they do not match up during real-time data ingestion? Can you provide some more details? I am not quite understanding the situation that causes the data not to match up.

Xiaoming Zhang

Sep 18, 2014, 10:22:00 PM
to druid-de...@googlegroups.com
When we query the data that is currently on the RT nodes, we get a count "a". After the RT nodes hand off the data to the historical nodes, we count the data for the same time period and get a count "b". "a" and "b" are not the same.

And "a" is about half of "b".

About the cross-domain problem: we use JavaScript to send Druid queries and get results, which the browser does not allow. Druid should add some header to allow this.

On Thursday, September 18, 2014 6:48:37 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

Sep 18, 2014, 11:35:28 PM
to druid-de...@googlegroups.com
Hmmm, that is very strange, and something I haven't seen before. Some things that come to mind:
1) Exceptions may be occurring on the real-time nodes, preventing correct data from returning
2) Exceptions or timeouts may be occurring on the brokers

In these 2 cases, logs will really help.

3) Misconfiguration confusing the broker. Using your coordinator console, make sure all shards of a segment appear before you make the query. Perhaps there is something wrong with segment announcement.

One thing you can try to do is query each rt node individually. All queryable Druid nodes share the same query API. Make sure each individual RT node actually returns results.

As for the cross domain problem, you can take a look at QueryServlet.java to make changes in how Druid parses, sends, and returns HTTP request headers. All queries use this class.

Xiaoming Zhang

Sep 26, 2014, 5:11:53 AM
to druid-de...@googlegroups.com
We query data from Druid; the query we use is:
        var oneMinuteRaw = {
            "queryType": "timeseries",
            "dataSource": "pulsar_event",
            "granularity": "second",
            "aggregations": [
                { "type": "count", "name": "hits" }
            ],
            "intervals": ["PT5m/" + nowTime]
        };

This query is executed every second to refresh our graph. We found the graph is not stable; it twists up and down from time to time.

I logged the data when this happened, as [index, value] pairs:

0,6418,1,6521,2,6378,3,6461,4,6471,5,6400,6,6530,7,6372,8,6480,9,6384,10,6618,11,6480,12,6510,13,6338,14,6394,15,6678,16,6504,17,6429,18,6464,19,6521,20,6414,21,6451,22,6449,23,6475,24,6338,25,6291,26,6427,27,6416,28,6363,29,6196,30,6280,31,6107,32,6265,33,6376,34,6460,35,6356,36,6588,37,6619,38,6425,39,6427,40,6509,41,6496,42,6642,43,6610,44,6407,45,6395,46,6362,47,6328,48,6413,49,6377,50,6481,51,6378,52,6360,53,6451,54,6319,55,6368,56,6278,57,6399,58,6435,59,6427,60,6508,61,6278,62,6380,63,6472,64,6372,65,6395,66,6341,67,6431,68,6496,69,6528,70,6484,71,6575,72,6445,73,6516,74,6413,75,6377,76,6441,77,6563,78,6650,79,6417,80,6445,81,6390,82,6378,83,6455,84,6387,85,6391,86,6413,87,6544,88,6424,89,6581,90,6386,91,6418,92,6303,93,6371,94,6448,95,6203,96,6342,97,6450,98,6469,99,6603,100,6368,101,6413,102,6428,103,6820,104,6213,105,6523,106,6233,107,6471,108,6478,109,6305,110,6250,111,6507,112,6503,113,6349,114,6542,115,6340,116,6476,117,6404,118,6379,119,6239,120,6569,121,6402,122,6428,123,6363,124,6378,125,6251,126,6378,127,6565,128,6396,129,6473,130,6581,131,6300,132,6310,133,6339,134,6338,135,6272,136,6363,137,6425,138,6479,139,6360,140,6314,141,6194,142,6511,143,6541,144,6349,145,6333,146,6502,147,6408,148,6262,149,6388,150,6214,151,6112,152,6472,153,6346,154,6421,155,6369,156,6256,157,6268,158,6073,159,6322,160,6148,161,6359,162,6351,163,6453,164,6314,165,6270,166,6357,167,6232,168,6050,169,6421,170,6409,171,6443,172,6320,173,6363,174,6319,175,6262,176,6307,177,6410,178,6432,179,6190,180,6239,181,6259,182,6286,183,6355,184,6550,185,6330,186,6316,187,6242,188,6351,189,6497,190,6136,191,6273,192,6336,193,6429,194,6260,195,6221,196,6269,197,6307,198,6369,199,6396,200,6292,201,6412,202,6310,203,6365,204,6199,205,6405,206,5982,207,6158,208,6383,209,6225,210,6205,211,6205,212,6323,213,6182,214,6254,215,6343,216,6470,217,6390,218,6568,219,6660,220,6475,221,6496,222,6369,223,6735,224,6291,225,6492,226,6272,227,6158,228,6284,229,6293,230,6218,231,6345,232,6326,233,6281,234,6358,235,6184,236,6361,237,6232,238,6261,239,6178,240,6367,241,6270,242,6250,243,6521,244,6253,245,6166,246,6450,247,6346,248,6280,249,6445,250,6194,251,6244,252,6259,253,6343,254,6329,255,6398,256,6184,257,6220,258,6342,259,6290,260,6404,261,6355,262,6094,263,6408,264,6393,265,6212,266,6236,267,6237,268,6411,269,6174,270,6315,271,6203,272,6353,273,6223,274,6196,275,6377,276,6416,277,6411,278,6441,279,6251,280,6147,281,6154,282,6010,283,6484,284,6091,285,6203,286,6201,287,6410,288,6337,289,6410,290,6269,291,6148,292,6199,293,6237,294,6404,295,6276,296,6319,297,6420,298,6246,299,6195 trafficsource.js:56

0,5551,1,5588,2,5590,3,5547,4,5650,5,5569,6,5594,7,5543,8,5736,9,5616,10,5665,11,5463,12,5601,13,5770,14,5610,15,5514,16,5595,17,5677,18,5556,19,5617,20,5602,21,5591,22,5449,23,5471,24,5584,25,5617,26,5524,27,5395,28,5431,29,5323,30,5403,31,5506,32,5602,33,5518,34,5736,35,5747,36,5558,37,5568,38,5626,39,5618,40,5728,41,5753,42,5543,43,5525,44,5502,45,5490,46,5510,47,5501,48,5615,49,5499,50,5537,51,5600,52,5461,53,5524,54,5482,55,5577,56,5534,57,5554,58,5672,59,5459,60,5526,61,5575,62,5522,63,5566,64,5502,65,5563,66,5648,67,5684,68,5655,69,5710,70,5573,71,5633,72,5578,73,5529,74,5545,75,5687,76,5762,77,5560,78,5591,79,5537,80,5512,81,5595,82,5537,83,5558,84,5544,85,5668,86,5592,87,5721,88,5539,89,5554,90,5480,91,5523,92,5592,93,5351,94,5469,95,5601,96,5632,97,5704,98,5487,99,5596,100,5569,101,5948,102,5411,103,5644,104,5424,105,5644,106,5631,107,5485,108,5404,109,5671,110,5639,111,5509,112,5681,113,5506,114,5618,115,5558,116,5497,117,5367,118,5667,119,5615,120,5576,121,5513,122,5472,123,5420,124,5537,125,5690,126,5521,127,5615,128,5736,129,5467,130,5454,131,5532,132,5503,133,5429,134,5526,135,5576,136,5607,137,5570,138,5487,139,5403,140,5641,141,5691,142,5515,143,5510,144,5647,145,5573,146,5425,147,5582,148,5396,149,5303,150,5573,151,5483,152,5531,153,5506,154,5453,155,5436,156,5277,157,5489,158,5375,159,5576,160,5498,161,5561,162,5467,163,5452,164,5487,165,5407,166,5260,167,5574,168,5523,169,5560,170,5477,171,5552,172,5514,173,5430,174,5466,175,5604,176,5593,177,5359,178,5434,179,5440,180,5457,181,5504,182,5699,183,5473,184,5488,185,5424,186,5500,187,5634,188,5336,189,5477,190,5494,191,5612,192,5388,193,5414,194,5430,195,5480,196,5480,197,5528,198,5486,199,5541,200,5483,201,5474,202,5353,203,5533,204,5216,205,5357,206,5522,207,5395,208,5398,209,5373,210,5448,211,5366,212,5420,213,5490,214,5595,215,5551,216,5684,217,5761,218,5623,219,5603,220,5509,221,5826,222,5473,223,5622,224,5411,225,5353,226,5488,227,5470,228,5404,229,5515,230,5456,231,5426,232,5481,233,5386,234,5501,235,5424,236,5427,237,5380,238,5510,239,5440,240,5421,241,5649,242,5418,243,5300,244,5608,245,5488,246,5463,247,5544,248,5406,249,5402,250,5432,251,5482,252,5509,253,5530,254,5375,255,5395,256,5492,257,5443,258,5560,259,5547,260,5292,261,5564,262,5548,263,5410,264,5452,265,5408,266,5567,267,5352,268,5502,269,5389,270,5494,271,5390,272,5368,273,5540,274,5542,275,5559,276,5572,277,5432,278,5335,279,5371,280,5201,281,5652,282,5301,283,5352,284,5399,285,5578,286,5491,287,5569,288,5445,289,5329,290,5421,291,5412,292,5486,293,5477,294,5443,295,5547,296,5439,297,5391,298,5486,299,5486 trafficsource.js:56

0,6310,1,6258,2,6393,3,6285,4,6333,5,6263,6,6490,7,6361,8,6368,9,6128,10,6353,11,6472,12,6312,13,6223,14,6342,15,6375,16,6279,17,6315,18,6302,19,6265,20,6128,21,6155,22,6328,23,6321,24,6188,25,6110,26,6165,27,6053,28,6101,29,6175,30,6317,31,6180,32,6433,33,6434,34,6359,35,6329,36,6328,37,6318,38,6478,39,6541,40,6216,41,6213,42,6188,43,6234,44,6239,45,6200,46,6296,47,6178,48,6273,49,6303,50,6164,51,6251,52,6185,53,6302,54,6269,55,6268,56,6362,57,6141,58,6238,59,6305,60,6292,61,6268,62,6234,63,6279,64,6397,65,6445,66,6413,67,6486,68,6271,69,6340,70,6279,71,6206,72,6292,73,6410,74,6487,75,6313,76,6311,77,6228,78,6215,79,6311,80,6264,81,6260,82,6276,83,6377,84,6335,85,6476,86,6249,87,6272,88,6198,89,6272,90,6257,91,6051,92,6223,93,6293,94,6307,95,6433,96,6226,97,6313,98,6262,99,6706,100,6091,101,6371,102,6187,103,6331,104,6393,105,6263,106,6101,107,6419,108,6462,109,6221,110,6382,111,6238,112,6350,113,6283,114,6182,115,6046,116,6371,117,6370,118,6291,119,6227,120,6177,121,6161,122,6252,123,6407,124,6200,125,6344,126,6441,127,6138,128,6166,129,6265,130,6229,131,6098,132,6226,133,6284,134,6301,135,6269,136,6240,137,6109,138,6370,139,6447,140,6265,141,6209,142,6368,143,6267,144,6107,145,6295,146,6094,147,5974,148,6275,149,6146,150,6264,151,6225,152,6158,153,6127,154,5979,155,6179,156,6066,157,6309,158,6220,159,6303,160,6163,161,6115,162,6217,163,6111,164,5939,165,6317,166,6229,167,6249,168,6218,169,6262,170,6229,171,6126,172,6177,173,6313,174,6338,175,6057,176,6147,177,6138,178,6149,179,6198,180,6455,181,6192,182,6232,183,6140,184,6186,185,6363,186,6008,187,6170,188,6204,189,6331,190,6096,191,6075,192,6208,193,6118,194,6230,195,6258,196,6197,197,6245,198,6249,199,6185,200,6037,201,6230,202,5899,203,6117,204,6265,205,6049,206,6078,207,6044,208,6150,209,6038,210,6155,211,6194,212,6265,213,6230,214,6391,215,6494,216,6344,217,6299,218,6217,219,6533,220,6221,221,6383,222,6150,223,6051,224,6200,225,6189,226,6085,227,6228,228,6160,229,6163,230,6202,231,6128,232,6234,233,6099,234,6154,235,6099,236,6194,237,6107,238,6100,239,6366,240,6114,241,5982,242,6307,243,6172,244,6164,245,6245,246,6137,247,6063,248,6124,249,6119,250,6168,251,6223,252,6052,253,6075,254,6165,255,6161,256,6270,257,6221,258,5990,259,6216,260,6231,261,6109,262,6185,263,6131,264,6260,265,6046,266,6219,267,6086,268,6187,269,6077,270,6080,271,6255,272,6294,273,6287,274,6228,275,6119,276,5988,277,6031,278,5907,279,6360,280,5954,281,6006,282,6146,283,6256,284,6200,285,6269,286,6140,287,6015,288,6145,289,6053,290,6166,291,6217,292,6138,293,6249,294,6153,295,6086,296,6204,297,6217,298,6156,299,6175


The second result is quite different from the first and third: the graph shows a drop, and it rises again once the third result is loaded.

Why does this happen, and how can we solve it?

Thank you!

On Thursday, September 18, 2014 at 8:35:28 PM UTC-7, Fangjin Yang wrote:

Xiaoming Zhang

unread,
Sep 28, 2014, 2:48:10 AM9/28/14
to druid-de...@googlegroups.com

We have two Kafka clusters that feed data into Druid:
PHX cluster:
SLC cluster:

There is a druid.zk.service.host property in each runtime.properties file; previously I only added the PHX machines to druid.zk.service.host.

When we query RT data, we find that the data volume is only half of what it was at the same time yesterday.

Do we need to add all machines to druid.zk.service.host on all nodes?

I tried this, and the historical nodes could not return any results; once I removed the SLC machines, the historical nodes returned results again.

Here is the RT config:
[
    {
        "schema": {
            "dataSource": "pulsar_event",
            "aggregators": [
                {
                    "type": "count",
                    "name": "count"
                }
            ],
            "indexGranularity": "second",
            "shardSpec": {
                "type": "linear",
                "partitionNum": 1
            }
        },
        "config": {
            "maxRowsInMemory": 500000,
            "intermediatePersistPeriod": "PT1m"
        },
        "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
                "zookeeper.connection.timeout.ms": "15000",
                "zookeeper.session.timeout.ms": "15000",
                "zookeeper.sync.time.ms": "5000",
                "group.id": "druid-group1",
                "fetch.message.max.bytes": "1048586",
                "auto.offset.reset": "largest",
                "auto.commit.enable": "false"
            },
            "feed": "Trkng.druid-sojEvent",
            "parser": {
                "timestampSpec": {
                    "column": "timestamp",
                    "format": "auto"
                },
                "data": {
                    "format": "json"
                },
                "dimensions": ["guid","uid","tenant","eventType","site","country","city","linespeed","continent","region","browserFamily","browserVersion","deviceFamily","deviceClass","osFamily","osVersion","page"]
            }
        },
        "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist1",
            "rejectionPolicyFactory": {
                "type": "serverTime"
            }
        }
    },
    {
        "schema": {
            "dataSource": "pulsar_session",
            "aggregators": [
                {
                    "type": "count",
                    "name": "count"
                }
            ],
            "indexGranularity": "second",
            "shardSpec": {
                "type": "linear",
                "partitionNum": 1
            }
        },
        "config": {
            "maxRowsInMemory": 500000,
            "intermediatePersistPeriod": "PT1m"
        },
        "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
                "zookeeper.connection.timeout.ms": "15000",
                "zookeeper.session.timeout.ms": "15000",
                "zookeeper.sync.time.ms": "5000",
                "group.id": "druid-group1",
                "fetch.message.max.bytes": "1048586",
                "auto.offset.reset": "largest",
                "auto.commit.enable": "false"
            },
            "feed": "Trkng.druid-sessionEvent",
            "parser": {
                "timestampSpec": {
                    "column": "timestamp",
                    "format": "auto"
                },
                "data": {
                    "format": "json"
                },
                "dimensions": ["uid","rv","entryPage","exitPage","trafficSource","tenant","site","city","region","country","continent","app","linespeed","browserFamily","browserVersion","deviceFamily","deviceClass","osFamily","osVersion","treatments"]
            }
        },
        "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist1",
            "rejectionPolicyFactory": {
                "type": "serverTime"
            }
        }
    },
    {
        "schema": {
            "dataSource": "pulsar_event",
            "aggregators": [
                {
                    "type": "count",
                    "name": "count"
                }
            ],
            "indexGranularity": "second",
            "shardSpec": {
                "type": "linear",
                "partitionNum": 2
            }
        },
        "config": {
            "maxRowsInMemory": 500000,
            "intermediatePersistPeriod": "PT1m"
        },
        "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
                "zookeeper.connection.timeout.ms": "15000",
                "zookeeper.session.timeout.ms": "15000",
                "zookeeper.sync.time.ms": "5000",
                "group.id": "druid-group1",
                "fetch.message.max.bytes": "1048586",
                "auto.offset.reset": "largest",
                "auto.commit.enable": "false"
            },
            "feed": "Trkng.druid-sojEvent",
            "parser": {
                "timestampSpec": {
                    "column": "timestamp",
                    "format": "auto"
                },
                "data": {
                    "format": "json"
                },
                "dimensions": ["guid","uid","tenant","eventType","site","country","city","linespeed","continent","region","browserFamily","browserVersion","deviceFamily","deviceClass","osFamily","osVersion","page"]
            }
        },
        "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist2",
            "rejectionPolicyFactory": {
                "type": "serverTime"
            }
        }
    },
    {
        "schema": {
            "dataSource": "pulsar_session",
            "aggregators": [
                {
                    "type": "count",
                    "name": "count"
                }
            ],
            "indexGranularity": "second",
            "shardSpec": {
                "type": "linear",
                "partitionNum": 2
            }
        },
        "config": {
            "maxRowsInMemory": 500000,
            "intermediatePersistPeriod": "PT1m"
        },
        "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
                "zookeeper.connection.timeout.ms": "15000",
                "zookeeper.session.timeout.ms": "15000",
                "zookeeper.sync.time.ms": "5000",
                "group.id": "druid-group1",
                "fetch.message.max.bytes": "1048586",
                "auto.offset.reset": "largest",
                "auto.commit.enable": "false"
            },
            "feed": "Trkng.druid-sessionEvent",
            "parser": {
                "timestampSpec": {
                    "column": "timestamp",
                    "format": "auto"
                },
                "data": {
                    "format": "json"
                },
                "dimensions": ["uid","rv","entryPage","exitPage","trafficSource","tenant","site","city","region","country","continent","app","linespeed","browserFamily","browserVersion","deviceFamily","deviceClass","osFamily","osVersion","treatments"]
            }
        },
        "plumber": {
            "type": "realtime",
            "windowPeriod": "PT50m",
            "segmentGranularity": "hour",
            "basePersistDirectory": "/kafka1/druidrealtime/basepersist2",
            "rejectionPolicyFactory": {
                "type": "serverTime"
            }
        }
    }
]


On Thursday, September 18, 2014 at 8:35:28 PM UTC-7, Fangjin Yang wrote:

Fangjin Yang

unread,
Sep 29, 2014, 11:59:18 PM9/29/14
to druid-de...@googlegroups.com
Inline.
Just to be clear, does each of your Kafka clusters use a 9-node ZK setup? Any particular reason why you need so many? Aren't you affected by quorum decision times?

There is a druid.zk.service.host property in each runtime.properties file; previously I only added the PHX machines to druid.zk.service.host.

Druid uses ZK for distributed coordination. The druid.zk.service.host property has nothing to do with Kafka.

When we query RT data, we find that the data volume is only half of what it was at the same time yesterday.

Have all events been ingested? Are you sure the bottleneck is Druid and not Kafka? We've seen similar questions a few times in the forum and Kafka is most often the bottleneck (https://groups.google.com/forum/#!topic/druid-development/ntAHm8HigMk). Do you have any other consumers in your cluster? 
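For instance, one rough way to check (a sketch only; the broker URL and the date intervals are placeholders) is to run the same hourly count query over yesterday and today and compare the ingested volumes:

// Rough sketch: compare hourly ingested row counts for two days.
// The broker URL and the date intervals are placeholders.
function hourlyCounts(interval) {
    return fetch("http://broker.example.com:8082/druid/v2/", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            "queryType": "timeseries",
            "dataSource": "pulsar_event",
            "granularity": "hour",
            "aggregations": [{ "type": "count", "name": "rows" }],
            "intervals": [interval]
        })
    }).then(function (resp) { return resp.json(); });
}

Promise.all([
    hourlyCounts("2014-09-27/2014-09-28"),
    hourlyCounts("2014-09-28/2014-09-29")
]).then(function (days) {
    days.forEach(function (rows, dayIndex) {
        rows.forEach(function (row) {
            console.log("day", dayIndex, row.timestamp, row.result.rows);
        });
    });
});

Comparing the two series makes it easier to tell whether the drop happens at ingestion time or only in the real-time query path.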

Do we need to add all machines to druid.zk.service.host on all nodes?

I think there is a misunderstanding here about how Druid works. 
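To make the separation concrete, here is a hedged sketch of how the two settings are usually split (all host names below are placeholders): druid.zk.service.host in runtime.properties lists only the ZooKeeper quorum that the Druid nodes themselves coordinate through, while each Kafka cluster's own ZooKeeper goes into the firehose consumerProps as zookeeper.connect.

# runtime.properties on every Druid node: Druid's own coordination ZK quorum only (placeholder hosts)
druid.zk.service.host=druid-zk1.example.com,druid-zk2.example.com,druid-zk3.example.com

And in each realtime spec, the firehose points at the Kafka cluster it consumes from:

"consumerProps": {
    "zookeeper.connect": "kafka-zk1.slc.example.com:2181,kafka-zk2.slc.example.com:2181",
    "group.id": "druid-group1"
}

Adding the Kafka clusters' ZK hosts to druid.zk.service.host does not help ingestion and, as observed above, can point Druid's own coordination at the wrong ensemble.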

Xiaoming Zhang

unread,
Oct 9, 2014, 2:15:33 AM10/9/14
to druid-de...@googlegroups.com
Hi, Fangjin:

Thank you very much for you answer!

The reason we use 9 ZK nodes is not that strange: each time we are allocated 9 machines as a batch, so we use one batch of machines as a cluster. Does it have any bad effects on Druid?

You said this property is used for Druid's own coordination, so I understand your point now; I made a mistake, sorry.

Could you please check the email I sent here on 9/26? It is about the unstable query result problem.

Hope to get your response soon.

Thanks
Zhang Xiaoming.



On Tuesday, September 30, 2014 at 11:59:18 AM UTC+8, Fangjin Yang wrote:
...

Fangjin Yang

unread,
Oct 9, 2014, 1:16:22 PM10/9/14
to druid-de...@googlegroups.com
Hi Xiaoming, I've sent several personal emails to you already; have you had a chance to look at them?

I've also gotten several personal emails from other members of your team about this issue. If possible, I'd like to work out the issue and update this thread once things are resolved.



On Tuesday, September 30, 2014 at 11:59:18 AM UTC+8, Fangjin Yang wrote: