Realtime task Ingestion rate with Kafka. (Very slow)


Deepak Jain

Aug 19, 2014, 7:04:35 AM
to druid-de...@googlegroups.com
Configuration

Druid 0.6.145
Kafka 0.8
Brokers: 3

Case 1)
1 realtime task.
Ingestion rate: 16,006 per minute (266 per second)

Case 2)
3 realtime tasks.
Ingestion rates: 17,301 per minute (288 per second), 17,548 per minute (292 per second), 10,122 per minute (168 per second)

This is very slow. I was told of ingestion rates ranging from 10k to 60k events per node per second.

I grepped the task logs to get the above data.

Let's discuss if required. Am I goofing up something?
Regards,
Deepak

Deepak Jain

Aug 19, 2014, 7:09:23 AM
to druid-de...@googlegroups.com
Data:
6 metrics and 49 dimensions.

Deepak Jain

Aug 19, 2014, 7:11:55 AM
to druid-de...@googlegroups.com
I have written a Pig script that reads records locally and invokes a function for every event. This function creates a producer once (shared across all events) and sends each event.
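For reference, a minimal sketch of that pattern (the UDF name, broker list, and producer settings here are assumptions; the Kafka 0.8 producer API is used):

import java.io.IOException;
import java.util.Properties;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

// Hypothetical Pig UDF: the producer is created lazily on the first call
// and reused for every subsequent event.
public class SendToKafka extends EvalFunc<Long> {
    private static Producer<String, String> producer;
    private long sent = 0;

    private static synchronized Producer<String, String> get() {
        if (producer == null) {
            Properties props = new Properties();
            // broker list and producer settings are placeholders for the sketch
            props.put("metadata.broker.list", "broker-1:9092,broker-2:9092,broker-3:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("producer.type", "async");      // async send, as in the later attempts
            props.put("batch.num.messages", "200");   // batch sends to cut per-event overhead
            producer = new Producer<String, String>(new ProducerConfig(props));
        }
        return producer;
    }

    @Override
    public Long exec(Tuple input) throws IOException {
        String event = (String) input.get(0);          // one JSON event per call
        get().send(new KeyedMessage<String, String>("exptpoc", event));
        return ++sent;
    }
}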

Nishant Bangarwa

Aug 19, 2014, 9:26:59 AM
to druid-de...@googlegroups.com
In the attached sheet, I see a lot of variation in the ingestion rate, from 61K per minute down to around 5K per minute.
I wonder if it's some issue with the producer not being able to send enough events to Druid, or some GC issue in the realtime tasks.
Do you see any full GCs on the realtime tasks? How much heap are you running with?



Deepak Jain

Aug 20, 2014, 11:25:22 AM
to druid-de...@googlegroups.com
I reattempted the ingestion with more than 10 producers this time, each reading a file from local disk and using the asynchronous Kafka producer. The best I could see was 55k events per minute, or 900 per second, and that's way less than the suggested range of 10k to 60k per second. This rate is also way less than what we expect (4k per second).

Realtime Task Machine: 24GB total RAM.

MM: java -server -Xmx1g -Xms1g -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/tmp -classpath /opt/druid/druid-services-0.6.138/lib/*:/opt/druid/druid-services-0.6.138/config/overlord:/apache/hadoop/conf:/apache/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/share/hadoop/common/lib/commons-collections-3.2.1.jar:/apache/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:/apache/hadoop/share/hadoop/common/lib/hadoop-auth-2.4.1-EBAY-2.jar:/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-0.1-EBAY-2.jar:/apache/hadoop/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/commons-net-3.1.jar io.druid.cli.Main server middleManager > /tmp/druid-middle-manager.log & 

Peon (Realtime task) : 

druid.indexer.runner.javaOpts=-server -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

I had shared the task log that contained GC statements over IRC.

Please suggest.

-Deepak

Fangjin Yang

Aug 20, 2014, 1:50:01 PM
to druid-de...@googlegroups.com
Hi Deepak, can you look at the logs of the task? These rates are much slower than what we and others have reported when using Druid. As Nishant mentioned, if you create a simple Kafka consumer, how fast are you able to ingest data? One other thing to be aware of is that Druid will throttle ingestion if it has to constantly persist data. Make sure that this is also not occurring.
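For illustration, a bare Kafka 0.8 high-level consumer along those lines might look like the sketch below (the ZooKeeper address and consumer group are placeholders); it measures raw consumption speed with no parsing or indexing, so it separates Kafka/consumer throughput from Druid ingestion throughput:

import java.util.Collections;
import java.util.List;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

// Counts raw consumption throughput from the topic, dropping every message.
public class RawConsumeRate {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zookeeper:2181");  // placeholder
        props.put("group.id", "rate-check");               // separate group, leaves Druid's offsets alone
        props.put("auto.offset.reset", "smallest");
        ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        List<KafkaStream<byte[], byte[]>> streams =
            connector.createMessageStreams(Collections.singletonMap("exptpoc", 1)).get("exptpoc");
        ConsumerIterator<byte[], byte[]> it = streams.get(0).iterator();

        long count = 0;
        long start = System.currentTimeMillis();
        while (it.hasNext()) {
            it.next();                                      // discard the message, just count it
            if (++count % 1000000 == 0) {
                double secs = (System.currentTimeMillis() - start) / 1000.0;
                System.out.printf("%d msgs, %.0f msgs/sec%n", count, count / secs);
            }
        }
    }
}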

Deepak Jain

Aug 20, 2014, 9:23:26 PM
to druid-de...@googlegroups.com
This is the config:

"fireDepartmentConfig": {
    "maxRowsInMemory": 500000,
    "intermediatePersistPeriod": "PT10m"
  },
  "windowPeriod": "PT30m",
  "segmentGranularity": "hour",
   "rejectionPolicyFactory": {
        "type": "test"
    }

1) I thought a 10m intermediate persist was recommended.
2) I have seen the Kafka producer finish ingesting data while the consumers (Druid realtime tasks) were still reading from the cluster.
3) I do not list the dimensions explicitly while ingesting; do you think that can slow things down?
A sample record is below:

{"TSTAMP":"1407858586111","GUID":"17c305741450a620c8213821ffc1111","CAL_DT":"1900-01-01","EXPRMNT_ID":"9694","TRMT_ID":"22459","TRMT_VRSN_ID":"6","CUMM_START_DT":"2014-05-08","CUMM_END_DT":"2014-05-27","BROWSER_ID":"-999","CHNL_ID":"1","DVIC_ID":"1","FINAL_GEO_IND":"0","FINAL_BYR_CNTRY_ID":"-999","FINAL_TREATED_IND":"1","GLBL_VRTCL_CD":"100","SUCC_BID_OL_SC":"0.000000","SUCC_BID_CNT":"0.000000","GMB_USD":"0.000000","GMB_USD_OL_SC":"0.000000","FP_GMB_USD":"0.000000","FP_GMB_USD_OL_SC":"0.000000","AUCT_GMB_USD":"0.000000","AUCT_GMB_USD_OL_SC":"0.000000","TRANS_CNT":"0.000000","TRANS_OL_SC":"0.000000","FP_TRANS_CNT":"0.000000","FP_TRANS_OL_SC":"0.000000","AUCT_TRANS_CNT":"0.000000","AUCT_TRANS_OL_SC":"0.000000","ASQ_CNT":"0.000000","ASQ_OL_SC":"0.000000","BO_CNT":"0.000000","BO_OL_SC":"0.000000","BID_CNT":"0.000000","BID_OL_SC":"0.000000","BIN_CNT":"0.000000","BIN_OL_SC":"0.000000","BBOWA_CNT":"4.000000","BBOWA_OL_SC":"4.000000","UNIQUE_BBOWA_CNT":"4.000000","UNIQUE_BBOWA_OL_SC":"4.000000","WATCH_CNT":"4.000000","WATCH_OL_SC":"4.000000","FP_WATCH_OL_SC":"4.000000","AUCT_WATCH_OL_SC":"0.000000","FP_WATCH_CNT":"4.000000","AUCT_WATCH_CNT":"0.000000","BID_PRICE_USD":"0.00","OFFER_PRICE_USD":"0.00","QTY_BOUGHT_CNT":"0","QTY_BOUGHT_OL_SC":"0.000000","FP_QTY_BOUGHT_CNT":"0","FP_QTY_BOUGHT_OL_SC":"0.000000","AUCT_QTY_BOUGHT_CNT":"0","AUCT_QTY_BOUGHT_OL_SC":"0.000000","ADD2CART_CNT":"0","ADD2LIST_CNT":"0"}

I lost the node that did the ingestion and hence lost the log file. I will re-ingest and send it to you.
Regards,
Deepak

Deepak Jain

Aug 20, 2014, 9:56:36 PM
to druid-de...@googlegroups.com
Here is a producer perf test result.


[root@producer-1 kafka]# bin/kafka-producer-perf-test.sh  --broker-list=broker-1-288777.phx-os1.stratus.dev.ebay.com:9092,broker-2-288781.phx-os1.stratus.dev.ebay.com:9092,broker-3-288783.phx-os1.stratus.dev.ebay.com:9092 --messages 10000000 --topic test --threads 10 --message-size 1000 --batch-size 200 --compression-codec 1 

start.time, end.time, compression, message.size, batch.size, total.data.sent.in.MB, MB.sec, total.data.sent.in.nMsg, nMsg.sec
2014-08-21 01:49:55:901, 2014-08-21 01:53:02:900, 1, 1000, 200, 9536.74, 50.9989, 10000000, 53476.2218

[root@producer-1-288785 kafka]#

Here we see 53k messages per second.

Deepak Jain

Aug 20, 2014, 10:04:40 PM
to druid-de...@googlegroups.com
Kafka Consumer Test

bin/kafka-consumer-perf-test.sh   --messages 10000000 --topic test --threads 10 --message-size 1000 --batch-size 200 --compression-codec 1 --zookeeper zookeeper.phx-os1.stratus.com:2181

start.time, end.time, fetch.size, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2014-08-21 02:00:02:504, 2014-08-21 02:00:41:741, 1048576, 9536.7432, 278.5508, 10000000, 292081.6660

[root@producer-1-288785 kafka]# 

Hence the test consumer is able to consume 292K messages per second. This is far larger than the 900 per second from Druid.

If possible, can someone from IST work with me for 30 minutes and close this issue that I am facing? I have been stuck here for quite a few days now.

Deepak Jain

Aug 20, 2014, 11:57:59 PM
to druid-de...@googlegroups.com
After including several dimensions in the exclusion list, the max ingestion rate achieved was 100k per minute. (Still very slow, but a 100% improvement over the earlier attempt.)



[dvasthimal@historical-5-293214 ~]$  cat /tmp/persistent/task/index_realtime_expt_real_1_2014-08-21T09:26:34.013Z/c008cb4f-60df-402c-9655-a118653b1935/log | grep "events/processed" | cut -d "," -f7 | cut -d ":" -f2
3
3796
61204
74400
94874
98526
106200
107400
87600
85000
106519
107281
76400
99000
85400
64124
60930
83046
60671
56274
51624
55948
69591
46000
57800
51285
48600
44776
42766
71793
80400
51873
73948
60396
41863
79924
51600
67597
78436
63576
90200
36862
45709
60247
72865
64000
87400
65267
40917
85364
96400
19775
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

Now I have fewer than 15 dimensions; earlier it was 49.

Deepak Jain

Aug 21, 2014, 1:12:03 AM
to druid-de...@googlegroups.com

Xavier Léauté

Aug 21, 2014, 2:04:21 AM
to druid-de...@googlegroups.com
Deepak, I wonder if part of the problem is that you are sending too much unused information to Druid.
Currently, dimension exclusion happens after the event has been parsed, so Druid still has to parse all the data.
Try sending only the fields you are actually going to use; that should help.

You may also see a small boost by making all your dimensions lowercase, since Druid lowercases everything internally, but I haven't tested that.
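For illustration, trimming and lowercasing on the producer side could look something like the sketch below (Jackson and the keep-list are assumptions, not part of the original setup):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

// Drop fields Druid will never use and lowercase the keys before the event
// goes to the Kafka producer, so the realtime task parses less data.
public class EventTrimmer {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    // Hypothetical keep-list: the timestamp, the dimensions you query on, and metric inputs.
    private static final Set<String> KEEP = new HashSet<String>(Arrays.asList(
        "tstamp", "guid", "exprmnt_id", "trmt_id", "trmt_vrsn_id",
        "gmb_usd_ol_sc", "fp_gmb_usd_ol_sc", "auct_gmb_usd_ol_sc",
        "qty_bought_ol_sc", "fp_qty_bought_ol_sc", "auct_qty_bought_ol_sc"));

    public static String trim(String rawJson) throws IOException {
        Map<String, String> in = MAPPER.readValue(rawJson, new TypeReference<Map<String, String>>() {});
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : in.entrySet()) {
            String key = e.getKey().toLowerCase();   // lowercase keys up front
            if (KEEP.contains(key)) {
                out.put(key, e.getValue());
            }
        }
        return MAPPER.writeValueAsString(out);
    }
}

One thing to double-check with a change like this is that the timestampSpec column and the metricsSpec fieldNames in the task spec still line up with the keys actually being sent.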


Prashant Deva

Aug 21, 2014, 3:13:01 AM
to druid-de...@googlegroups.com
>Druid lowercases everything internally

By this I assume you mean the dimension column names are lowercased, right?
Or is the data in the dimension columns also lowercased? That would mean Druid cannot handle data that differs only by case...

Deepak Jain

Aug 21, 2014, 3:19:15 AM
to druid-de...@googlegroups.com
Xavier,
The reason for excluding the dimensions was to see if Druid actually slows down when there are 49 dimensions, and it looks like it does. Is this a correct statement to make?
My data is JSON and all the names are in UPPER CASE. So you want me to make them lowercase and then send them to Kafka?

Let me try these and come back to you.

Regards,
Deepak

ÐΞ€ρ@Ҝ (๏̯͡๏)

Aug 21, 2014, 4:07:21 AM
to druid-de...@googlegroups.com
I would like to test the above suggestions with serverTime as the rejectionPolicy, since the test policy has been really buggy at times and is not recommended in a production environment.
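For such a serverTime test, one way to make sure every event lands inside the window is to restamp it at send time. A minimal sketch, assuming Jackson and that TSTAMP stays in epoch millis as in the sample record:

import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// With the serverTime rejection policy, an event is only accepted if its timestamp
// falls within windowPeriod of the realtime node's clock, so stamp it at send time.
public class Restamp {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String withCurrentTime(String rawJson) throws IOException {
        ObjectNode event = (ObjectNode) MAPPER.readTree(rawJson);
        event.put("TSTAMP", String.valueOf(System.currentTimeMillis())); // epoch millis, as in the sample record
        return MAPPER.writeValueAsString(event);
    }
}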


--
Deepak

Nishant Bangarwa

Aug 21, 2014, 6:28:31 AM
to druid-de...@googlegroups.com
Hi Prashant, 
Yeah, it means the dimension column names are lowercased; the data, i.e. the values, is still case sensitive.





Deepak Jain

Aug 21, 2014, 6:32:09 AM
to druid-de...@googlegroups.com
Hello,
I need to achieve at least 480k per minute, or 8,000 per second. Please suggest.

1) Attached is the task log file.
2) I tested the above suggestions with the test rejectionPolicy, as serverTime seemed to reject everything even after making sure TSTAMP was set to now and windowPeriod was PT30M.
3) The max ingestion rate achieved (per minute) was again around 100k:
67162
92566
94600
96538
103862
99800
72072
92800
78308
79538
60146
67106
69516
92310
100400
84351
80716
75968
75734
50151
69032
20712


4) Here is the task file:
{
   task:"index_realtime_expt_real_1_2014-08-21T16:15:37.939Z",
   payload:{
      id:"index_realtime_expt_real_1_2014-08-21T16:15:37.939Z",
      resource:{
         availabilityGroup:"epgrp_1",
         requiredCapacity:1
      },
      spec:{
         dataSchema:{
            dataSource:"expt_real",
            parser:{
               type:"string",
               parseSpec:{
                  format:"json",
                  timestampSpec:{
                     column:"tstamp",
                     format:"auto"
                  },
                  dimensionsSpec:{
                     dimensions:[

                     ],
                     dimensionExclusions:[
                        "gmb_usd",
                        "final_treated_ind",
                        "bid_cnt",
                        "fp_watch_ol_sc",
                        "unique_bbowa_ol_sc",
                        "glbl_vrtcl_cd",
                        "bid_price_usd",
                        "auct_trans_ol_sc",
                        "add2cart_cnt",
                        "watch_cnt",
                        "add2list_cnt",
                        "succ_bid_ol_sc",
                        "auct_qty_bought_cnt",
                        "bo_ol_sc",
                        "bid_ol_sc",
                        "offer_price_usd",
                        "qty_bought_cnt",
                        "bin_cnt",
                        "chnl_id",
                        "bbowa_ol_sc",
                        "auct_watch_cnt",
                        "succ_bid_cnt",
                        "bo_cnt",
                        "auct_trans_cnt",
                        "fp_trans_cnt",
                        "final_byr_cntry_id",
                        "fp_trans_ol_sc",
                        "unique_bbowa_cnt",
                        "fp_watch_cnt",
                        "auct_watch_ol_sc",
                        "final_geo_ind",
                        "watch_ol_sc",
                        "asq_ol_sc",
                        "dvic_id",
                        "asq_cnt",
                        "bin_ol_sc",
                        "fp_qty_bought_cnt",
                        "trans_cnt",
                        "bbowa_cnt",
                        "fp_gmb_usd",
                        "trans_ol_sc"
                     ],
                     spatialDimensions:[

                     ]
                  }
               }
            },
            metricsSpec:[
               {
                  type:"count",
                  name:"count"
               },
               {
                  type:"doubleSum",
                  name:"gmb_usd",
                  fieldName:"GMB_USD_OL_SC"
               },
               {
                  type:"doubleSum",
                  name:"gmb_fp",
                  fieldName:"FP_GMB_USD_OL_SC"
               },
               {
                  type:"doubleSum",
                  name:"gmb_act",
                  fieldName:"AUCT_GMB_USD_OL_SC"
               },
               {
                  type:"doubleSum",
                  name:"bi",
                  fieldName:"QTY_BOUGHT_OL_SC"
               },
               {
                  type:"doubleSum",
                  name:"bi_fp",
                  fieldName:"FP_QTY_BOUGHT_OL_SC"
               },
               {
                  type:"doubleSum",
                  name:"bi_act",
                  fieldName:"AUCT_QTY_BOUGHT_OL_SC"
               }
            ],
            granularitySpec:{
               type:"uniform",
               segmentGranularity:"HOUR",
               queryGranularity:{
                  type:"duration",
                  duration:60000,
                  origin:"1970-01-01T00:00:00.000Z"
               },
               intervals:null
            }
         },
         ioConfig:{
            type:"realtime",
            firehose:{
               type:"kafka-0.8",
               consumerProps:{
                  zookeeper.connection.timeout.ms:"15000",
                  zookeeper.session.timeout.ms:"15000",
                  auto.offset.reset:"largest",
                  group.id:"expt_real_cgid",
                  fetch.message.max.bytes:"1048586",
                  zookeeper.connect:"zookeeper-288779.phx-os1.stratus.dev.ebay.com:2181",
                  zookeeper.sync.time.ms:"5000",
                  auto.commit.enable:"false"
               },
               feed:"exptpoc",
               parser:{
                  type:"string",
                  parseSpec:{
                     format:"json",
                     timestampSpec:{
                        column:"tstamp",
                        format:"auto"
                     },
                     dimensionsSpec:{
                        dimensions:[

                        ],
                        dimensionExclusions:[
                           "gmb_usd",
                           "final_treated_ind",
                           "bid_cnt",
                           "fp_watch_ol_sc",
                           "unique_bbowa_ol_sc",
                           "glbl_vrtcl_cd",
                           "bid_price_usd",
                           "auct_trans_ol_sc",
                           "add2cart_cnt",
                           "watch_cnt",
                           "add2list_cnt",
                           "succ_bid_ol_sc",
                           "auct_qty_bought_cnt",
                           "bo_ol_sc",
                           "bid_ol_sc",
                           "offer_price_usd",
                           "qty_bought_cnt",
                           "bin_cnt",
                           "chnl_id",
                           "bbowa_ol_sc",
                           "auct_watch_cnt",
                           "succ_bid_cnt",
                           "bo_cnt",
                           "auct_trans_cnt",
                           "fp_trans_cnt",
                           "final_byr_cntry_id",
                           "fp_trans_ol_sc",
                           "unique_bbowa_cnt",
                           "fp_watch_cnt",
                           "auct_watch_ol_sc",
                           "final_geo_ind",
                           "watch_ol_sc",
                           "asq_ol_sc",
                           "dvic_id",
                           "asq_cnt",
                           "bin_ol_sc",
                           "fp_qty_bought_cnt",
                           "trans_cnt",
                           "bbowa_cnt",
                           "fp_gmb_usd",
                           "trans_ol_sc"
                        ],
                        spatialDimensions:[

                        ]
                     }
                  }
               }
            }
         },
         tuningConfig:{
            type:"realtime",
            maxRowsInMemory:500000,
            intermediatePersistPeriod:"PT10M",
            windowPeriod:"PT30M",
            basePersistDirectory:"/tmp/1408637737947-0",
            versioningPolicy:{
               type:"intervalStart"
            },
            maxPendingPersists:0,
            shardSpec:{
               type:"linear",
               partitionNum:1
            },
            rejectionPolicyFactory:{
               type:"test"
            }
         }
      },
      groupId:"index_realtime_expt_real",
      dataSource:"expt_real"
   }
}



--
Deepak

tasklog.zip

Deepak Jain

Aug 21, 2014, 9:51:38 AM
to druid-de...@googlegroups.com
I am able to achieve around 450K per minute or 7500 per second. 

Ingestion rate per minute (events/processed) for 5 million messages.
8/21/2014 13:15:54  118806
8/21/2014 13:16:54  422247
8/21/2014 13:17:54  421424
8/21/2014 13:18:54  405155
8/21/2014 13:19:54  416273
8/21/2014 13:20:54  390623
8/21/2014 13:21:54  410785
8/21/2014 13:22:54  396796
8/21/2014 13:23:54  414828
8/21/2014 13:24:54  393610
8/21/2014 13:25:54  388453
8/21/2014 13:26:54  409272
8/21/2014 13:27:54  390044
8/21/2014 13:28:54  21684

The producer was able to ingest the 5M events into Kafka at a much faster rate.


2014-08-21 06:15:36,831|main| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileKafkaProducer|73|Starting a stream of 5,000,000 events
2014-08-21 06:15:36,979|main| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileKafkaProducer|82|Exit:FileKafkaProducer.main()
2014-08-21 06:17:25,918|pool-1-thread-9| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000
2014-08-21 06:17:26,668|pool-1-thread-2| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000
2014-08-21 06:17:27,693|pool-1-thread-1| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000
2014-08-21 06:18:49,422|pool-1-thread-5| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000
2014-08-21 06:18:50,527|pool-1-thread-10| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000
2014-08-21 06:18:51,018|pool-1-thread-6| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000
2014-08-21 06:18:51,616|pool-1-thread-3| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000
2014-08-21 06:18:51,796|pool-1-thread-4| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000
2014-08-21 06:18:51,841|pool-1-thread-7| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000
2014-08-21 06:18:52,450|pool-1-thread-8| INFO|KAFKA_FILE_PRODUCER|com.ebay.metricstore.producer.FileProducer|129|Total Events sent: 500,000


Druid took ~13 minutes to ingest 5 million events; the producer was able to push the data into Kafka in ~3.5 minutes.

I have 49 dimensions and 8 aggregates, and a 3-broker Kafka cluster with 3 partitions.
Regards,
Deepak



Gian Merlino

Aug 21, 2014, 10:57:59 AM
to druid-de...@googlegroups.com
The number one thing you can do to improve performance is index less data. That can be done by sending only the fields you actually need to index to druid (so it doesn't have to parse extraneous data), by providing the full set of "dimensions" (so it doesn't have to index extraneous data), and by using a coarser indexGranularity (same reason). If it is not possible to provide the full set of dimensions up-front, then providing as comprehensive a set of "dimensionExclusions" as you can will also help.

Once you have done everything you can there, you can look at performance bottlenecks in your hardware and configuration. The most common are heap memory, cpu, and disk throughput.

Heap memory- If you are getting a lot of GCs, you can allocate more heap or you can try lowering your maxRowsInMemory.

CPU- Realtime indexing is single-threaded. If you have lots of idle cores in your machine, because you only have one realtime shard, then you can add more shards to make use of more cores (either by adding new shards to your standalone realtime spec file, or submitting more tasks to the indexing service, depending on what you are using)

Disk- If your indexer is persisting very frequently, then you can try to write less frequently by using larger heaps with a higher maxRowsInMemory. That may or may not help depending on the distribution of your data (it will help only if it gets you better rollup on incremental indexes). If that is not possible, you need faster disks.

Xavier Léauté

Aug 21, 2014, 12:45:00 PM
to druid-de...@googlegroups.com
Deepak, since you were able to get 30x improvement from where you were initially, do you mind sharing what changes you made to get that improvement? It doesn't seem like you reduced the number of dimensions / aggregates, so it must be something else?



Prashant Deva

Aug 27, 2014, 12:47:00 PM
to druid-de...@googlegroups.com
+1. I am curious about the 30x improvement too.

Deepak Jain

Aug 27, 2014, 10:04:06 PM
to druid-de...@googlegroups.com
Looks like I did not reply on the group; I did it on IRC. My producer was emitting a log statement after every event, and this was actually causing lower volumes of data to be sent to Druid. I removed the log line and was able to see 7,500 events per second across six nodes.
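For anyone hitting the same thing, the fix amounts to taking per-event logging off the send path. A rough sketch, assuming SLF4J (the class and method names are illustrative):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logging once per event on the hot send path was the bottleneck;
// log progress only every N events instead.
public class ProducerProgress {
    private static final Logger LOG = LoggerFactory.getLogger(ProducerProgress.class);
    private static final long LOG_EVERY = 500000;
    private long sent = 0;

    void onEventSent() {
        sent++;
        if (sent % LOG_EVERY == 0) {
            LOG.info("Total events sent: {}", sent);
        }
    }
}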