Druid Segments dropping

Nitin Gautam

unread,
Apr 2, 2014, 4:27:29 AM4/2/14
to druid-de...@googlegroups.com
Hi

I am loading data in batches, one batch per hour. Each time a segment gets loaded I check the counts in the MySQL table druid_segments. What I can see is that every hour the number of segments with the used flag set to 0 increases by 1. I have no rules in the druid_rules table other than the default:

mysql> select * from druid_rules;
+-----------------------------------+------------+--------------------------+-----------------------------------------------------------------------------------+
| id                                | dataSource | version                  | payload                                                                           |
+-----------------------------------+------------+--------------------------+-----------------------------------------------------------------------------------+
| _default_2013-10-03T17:47:55.354Z | _default   | 2013-10-03T17:47:55.354Z | [{"period":"P5000Y","replicants":2,"tier":"_default_tier","type":"loadByPeriod"}] |
+-----------------------------------+------------+--------------------------+-----------------------------------------------------------------------------------+
1 row in set (0.00 sec)

I am not sure what is causing the used flag to be reset to 0. I did change the contents of the druid_rules table earlier, but the dump above is the latest output of the rules table. I am running Druid version 0.6.73.

Thanks
nitin

Nishant Bangarwa

unread,
Apr 2, 2014, 11:25:09 AM4/2/14
to druid-de...@googlegroups.com
Hi Nitin,

As per the default rule, Druid will set the used flag to 0 for a segment when it is overridden by another segment with a newer version for the same interval.
Do you have segments created with overlapping intervals?
Are you running the batch loading jobs with overlapping intervals, or do they all have distinct intervals?
Can you share the batch loading spec file for more details?
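The overshadowing behavior described above can be sketched roughly as follows. This is an illustrative Python model, not Druid's actual timeline code: the Segment fields and the "newest version wins" rule mirror what the default rule does to the used flag in druid_segments.

```python
# Sketch: when two segments cover the same interval, the one with the
# newer version stays used and the older one is marked used = 0.
from collections import namedtuple

Segment = namedtuple("Segment", ["interval", "version", "used"])

def mark_overshadowed(segments):
    """Set used=0 on any segment whose interval is also covered by a
    segment with a strictly newer version."""
    newest = {}
    for seg in segments:
        cur = newest.get(seg.interval)
        if cur is None or seg.version > cur.version:
            newest[seg.interval] = seg
    return [
        seg._replace(used=1 if newest[seg.interval].version == seg.version else 0)
        for seg in segments
    ]

segments = [
    Segment("2014-04-02T15:00/2014-04-02T16:00", "2014-04-02T16:10:00.000Z", 1),
    # A later batch job re-indexed the same interval with a newer version:
    Segment("2014-04-02T15:00/2014-04-02T16:00", "2014-04-02T17:20:00.000Z", 1),
]
result = mark_overshadowed(segments)
# The older segment is overshadowed (used -> 0); the newer one stays used.
```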


--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/f112d322-2a17-47be-b605-a91b744cdbbb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Nitin Gautam

unread,
Apr 3, 2014, 12:44:53 AM4/3/14
to druid-de...@googlegroups.com
Hi Nishant

My indexing procedure is as follows:
  1. Every hour a log file is generated on a server.
  2. I download the log and generate a JSON file; each row of the JSON is an entry in the log file. The first field of each line is a timestamp of the format <<"timestamp":"2014-04-02T06:02:01Z">>. I add this field while creating the JSON, and it represents the current time, so each JSON generated at the end of the hour will have entries in the first column corresponding to that hour.
  3. Once this is done the json is uploaded to S3.
  4. Next I submit an indexing task to the overlord node; the task JSON is as below:
{
  "type" : "index",
  "dataSource" : "dsn",
  "granularitySpec" : {
    "type" : "uniform",
    "gran" : "hour",
    "intervals" : [ "2013-01-01/2014-12-31" ]
  },
  "aggregators" : [{
     "type" : "count",
     "name" : "count"
    }, {
     "type" : "doubleSum",
     "name" : "imp",
     "fieldName" : "impressions"
    }, {
     "type" : "doubleSum",
     "name" : "click",
     "fieldName" : "clicks"
    }, {
     "type" : "doubleSum",
     "name" : "conversion",
     "fieldName" : "conversions"
  }],
  "firehose" : {
    "type" : "static-s3",
    "uris" : ["<<bucket>>"],
    "parser" : {
      "timestampSpec" : {
        "column" : "timestamp"
      },
      "data" : {
        "format" : "json",
        "dimensions" : [<<fields>>,"impressions","clicks","conversions"]
      }
    }
  }
}

Note that in the granularitySpec section the entry for intervals is always the same in all the JSONs. Is this a possible cause of the behavior? Should I change it to the hour interval for which the data is downloaded?
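The hourly interval mentioned above could be computed like this. A minimal sketch in Python (the spec file itself stays JSON; this only builds the "intervals" string, and the timestamp used is just an example):

```python
# Sketch: build an hourly ISO-8601 "start/end" interval for the batch
# spec, instead of the fixed "2013-01-01/2014-12-31" range above.
from datetime import datetime, timedelta

def hourly_interval(ts):
    """Return the 'start/end' interval covering the hour containing ts."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    end = start + timedelta(hours=1)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return "{}/{}".format(start.strftime(fmt), end.strftime(fmt))

interval = hourly_interval(datetime(2014, 4, 2, 6, 2, 1))
# -> "2014-04-02T06:00:00Z/2014-04-02T07:00:00Z"
```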

One more point: I have these settings on the coordinator. Do you think they could be the reason why some segments are getting dropped?
druid.coordinator.merge.on=true
druid.coordinator.conversion.on=true

Regards
Nitin 

Nishant Bangarwa

unread,
Apr 3, 2014, 5:41:25 AM4/3/14
to druid-de...@googlegroups.com
Hi Nitin,

If your batch data file contains any entry from a different hour, that can cause a segment with a newer version to be created and the previous segment to be invalidated.
E.g., while indexing data for 4-5 pm, if any of the file entries contains a timestamp between 3-4 pm, the batch ingestion will generate two segments: one overriding the existing 3-4 pm segment
and another for the current interval. I hope the log file generation mechanism already ensures that each data file to be loaded contains data for only one hour. Still, to rule out any possibility of bugs in log generation, I would recommend specifying hourly intervals for your batch jobs, so that they never override other segments unless you want those segments to be generated again.
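The sanity check suggested above can be sketched like this: before submitting the batch job, verify that every row's timestamp falls inside the target hour, so a stray entry cannot silently override a neighboring segment. The "timestamp" field name and format are taken from the thread; everything else is illustrative.

```python
# Sketch: reject a batch file if any JSON row's timestamp falls
# outside the hour the batch is supposed to cover.
import json
from datetime import datetime, timedelta

def rows_in_hour(lines, hour_start):
    """True if every row's timestamp is within [hour_start, hour_start + 1h)."""
    hour_end = hour_start + timedelta(hours=1)
    for line in lines:
        ts = datetime.strptime(json.loads(line)["timestamp"],
                               "%Y-%m-%dT%H:%M:%SZ")
        if not (hour_start <= ts < hour_end):
            return False
    return True

lines = [
    '{"timestamp":"2014-04-02T06:02:01Z","impressions":1}',
    '{"timestamp":"2014-04-02T06:59:59Z","impressions":2}',
]
ok = rows_in_hour(lines, datetime(2014, 4, 2, 6))
# A row from the previous hour should fail the check:
stray = rows_in_hour(['{"timestamp":"2014-04-02T05:30:00Z"}'],
                     datetime(2014, 4, 2, 6))
```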

Another possibility is that if your segment sizes are too small they might be getting merged, since you have set the property druid.coordinator.merge.on=true.
In that case you should see log lines for segments getting merged in the coordinator logs; there should not be any data loss, and your queries should still return correct results.



