Druid Segments dropping

Nitin Gautam

unread,
Apr 2, 2014, 4:27:29 AM4/2/14
to druid-de...@googlegroups.com
Hi

I am loading data in batches, one batch per hour. Each time a segment gets loaded I check the counts in the MySQL table druid_segments. What I can see is that every hour the number of segments with the used flag set to 0 increases by 1. I have no rules in the druid_rules table other than the default:

mysql> select * from druid_rules;
+-----------------------------------+------------+--------------------------+-----------------------------------------------------------------------------------+
| id                                | dataSource | version                  | payload                                                                           |
+-----------------------------------+------------+--------------------------+-----------------------------------------------------------------------------------+
| _default_2013-10-03T17:47:55.354Z | _default   | 2013-10-03T17:47:55.354Z | [{"period":"P5000Y","replicants":2,"tier":"_default_tier","type":"loadByPeriod"}] |
+-----------------------------------+------------+--------------------------+-----------------------------------------------------------------------------------+
1 row in set (0.00 sec)

I am not sure what is causing the used flag to be reset to 0. I did change the contents of the druid_rules table earlier, but the dump above is the latest output of the rules table. I am running Druid version 0.6.73.

Thanks
nitin

Nishant Bangarwa

unread,
Apr 2, 2014, 11:25:09 AM4/2/14
to druid-de...@googlegroups.com
Hi Nitin,

As per the default rule, Druid will set the used flag to 0 for a segment when it is overridden by another segment with a newer version for the same interval.
Do you have segments created with overlapping intervals?
Are you running the batch loading jobs with overlapping intervals, or do they all have distinct intervals?
Can you share the batch loading spec file for more details?
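The overshadowing behavior described above can be sketched roughly as follows. This is an illustrative Python model, not Druid's actual timeline code: the Segment fields and the "newest version wins" rule mirror what the default rule does to the used flag in druid_segments.

```python
# Sketch: when two segments cover the same interval, the one with the
# newer version stays used and the older one is marked used = 0.
from collections import namedtuple

Segment = namedtuple("Segment", ["interval", "version", "used"])

def mark_overshadowed(segments):
    """Set used=0 on any segment whose interval is also covered by a
    segment with a strictly newer version."""
    newest = {}
    for seg in segments:
        cur = newest.get(seg.interval)
        if cur is None or seg.version > cur.version:
            newest[seg.interval] = seg
    return [
        seg._replace(used=1 if newest[seg.interval].version == seg.version else 0)
        for seg in segments
    ]

segments = [
    Segment("2014-04-02T15:00/2014-04-02T16:00", "2014-04-02T16:10:00.000Z", 1),
    # A later batch job re-indexed the same interval with a newer version:
    Segment("2014-04-02T15:00/2014-04-02T16:00", "2014-04-02T17:20:00.000Z", 1),
]
result = mark_overshadowed(segments)
# The older segment is overshadowed (used -> 0); the newer one stays used.
```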


--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/f112d322-2a17-47be-b605-a91b744cdbbb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Nitin Gautam

unread,
Apr 3, 2014, 12:44:53 AM4/3/14
to druid-de...@googlegroups.com
Hi Nishant

My indexing procedure is as follows:
  1. Every hour a log file is generated on a server.
  2. I download the log and generate a JSON file; each row of the JSON is an entry in the log file. The first field of each line is a timestamp of the format <<"timestamp":"2014-04-02T06:02:01Z">>. I add this field while creating the JSON, and it represents the current time, so each JSON generated at the end of the hour will have entries in the first column corresponding to that hour.
  3. Once this is done the json is uploaded to S3.
  4. Next I submit an indexing task to the overlord node; the task JSON is as below:
{
  "type" : "index",
  "dataSource" : "dsn",
  "granularitySpec" : {
    "type" : "uniform",
    "gran" : "hour",
    "intervals" : [ "2013-01-01/2014-12-31" ]
  },
  "aggregators" : [{
     "type" : "count",
     "name" : "count"
    }, {
     "type" : "doubleSum",
     "name" : "imp",
     "fieldName" : "impressions"
    }, {
     "type" : "doubleSum",
     "name" : "click",
     "fieldName" : "clicks"
    }, {
     "type" : "doubleSum",
     "name" : "conversion",
     "fieldName" : "conversions"
  }],
  "firehose" : {
    "type" : "static-s3",
    "uris" : ["<<bucket>>"],
    "parser" : {
      "timestampSpec" : {
        "column" : "timestamp"
      },
      "data" : {
        "format" : "json",
        "dimensions" : [<<fields>>,"impressions","clicks","conversions"]
      }
    }
  }
}

Note that in the granularitySpec section the entry for intervals is always the same in all the JSONs. Is this a possible cause of the behavior? Should I change it to the hour interval for which the data is downloaded?
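The hourly interval mentioned above could be computed like this. A minimal sketch in Python (the spec file itself stays JSON; this only builds the "intervals" string, and the timestamp used is just an example):

```python
# Sketch: build an hourly ISO-8601 "start/end" interval for the batch
# spec, instead of the fixed "2013-01-01/2014-12-31" range above.
from datetime import datetime, timedelta

def hourly_interval(ts):
    """Return the 'start/end' interval covering the hour containing ts."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    end = start + timedelta(hours=1)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return "{}/{}".format(start.strftime(fmt), end.strftime(fmt))

interval = hourly_interval(datetime(2014, 4, 2, 6, 2, 1))
# -> "2014-04-02T06:00:00Z/2014-04-02T07:00:00Z"
```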

One more point: I have these settings on the coordinator. Do you think they could be the reason why some segments are getting dropped?
druid.coordinator.merge.on=true
druid.coordinator.conversion.on=true

Regards
Nitin 

Nishant Bangarwa

unread,
Apr 3, 2014, 5:41:25 AM4/3/14
to druid-de...@googlegroups.com
Hi Nitin,

If your batch data file contains any entry from a different hour, that can cause a segment with a newer version to be created and the previous segment to be invalidated.
E.g., while indexing data for 4-5 pm, if any of the file entries contains a timestamp between 3-4 pm, the batch ingestion will generate two segments: one overriding the existing 3-4 pm segment
and another for the current interval. I hope the log file generation mechanism already ensures that each data file to be loaded contains data for only one hour. Still, to rule out any possibility of bugs in log generation, I would recommend specifying hourly intervals for your batch jobs, so that they never override other segments unless you want those segments to be generated again.
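The sanity check suggested above can be sketched like this: before submitting the batch job, verify that every row's timestamp falls inside the target hour, so a stray entry cannot silently override a neighboring segment. The "timestamp" field name and format are taken from the thread; everything else is illustrative.

```python
# Sketch: reject a batch file if any JSON row's timestamp falls
# outside the hour the batch is supposed to cover.
import json
from datetime import datetime, timedelta

def rows_in_hour(lines, hour_start):
    """True if every row's timestamp is within [hour_start, hour_start + 1h)."""
    hour_end = hour_start + timedelta(hours=1)
    for line in lines:
        ts = datetime.strptime(json.loads(line)["timestamp"],
                               "%Y-%m-%dT%H:%M:%SZ")
        if not (hour_start <= ts < hour_end):
            return False
    return True

lines = [
    '{"timestamp":"2014-04-02T06:02:01Z","impressions":1}',
    '{"timestamp":"2014-04-02T06:59:59Z","impressions":2}',
]
ok = rows_in_hour(lines, datetime(2014, 4, 2, 6))
# A row from the previous hour should fail the check:
stray = rows_in_hour(['{"timestamp":"2014-04-02T05:30:00Z"}'],
                     datetime(2014, 4, 2, 6))
```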

Another possibility is that if your segment sizes are too small they might be getting merged, since you have set the property druid.coordinator.merge.on=true.
In that case you should see log lines for segments getting merged in the coordinator logs; there should not be any data loss, and your queries should still return correct results.



