Data not showing up at all after ingestion


Amy Troschinetz

Oct 17, 2014, 10:24:24 AM
to druid-de...@googlegroups.com
The index task log is 1.1MB, so instead of attaching it, I've uploaded it to my personal webspace: http://lexicalunit.com/shares/spilling.log

Some relevant excerpts:

[...]
2014-10-16 16:45:51,095 INFO [main] io.druid.indexing.worker.executor.ExecutorLifecycle - Running with task: {
  "type" : "index",
  "id" : "index_click_conversion_2014-10-16T16:45:43.002Z",
  "schema" : {
    "dataSchema" : {
      "dataSource" : "click_conversion",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "tsv",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "yyyy-MM-dd HH:mm:ss"
          },
          "dimensionsSpec" : {
            "dimensions" : [ "browser", "city", "country", "coupon", "device", "channel", "region", "site", "property", "outclick_type" ],
            "dimensionExclusions" : [ ],
            "spatialDimensions" : [ ]
          },
          "delimiter" : "\t",
          "columns" : [ "browser", "city", "timestamp", "country", "coupon", "device", "channel", "region", "site", "property", "outclick_type", "commissions", "sales", "orders" ]
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "count"
      }, {
        "type" : "doubleSum",
        "name" : "commissions",
        "fieldName" : "commissions"
      }, {
        "type" : "doubleSum",
        "name" : "sales",
        "fieldName" : "sales"
      }, {
        "type" : "doubleSum",
        "name" : "orders",
        "fieldName" : "orders"
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : {
          "type" : "duration",
          "duration" : 1000,
          "origin" : "1970-01-01T00:00:00.000Z"
        },
        "intervals" : [ "2014-08-01T00:00:00.000Z/2014-08-15T00:00:00.000Z" ]
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "static-s3",
        "parser" : {
          "type" : "string",
          "parseSpec" : {
            "format" : "tsv",
            "timestampSpec" : {
              "column" : "timestamp",
              "format" : "yyyy-MM-dd HH:mm:ss"
            },
            "dimensionsSpec" : {
              "dimensions" : [ "browser", "city", "country", "coupon", "device", "channel", "region", "site", "property", "outclick_type" ],
              "dimensionExclusions" : [ ],
              "spatialDimensions" : [ ]
            },
            "delimiter" : "\t",
            "columns" : [ "browser", "city", "timestamp", "country", "coupon", "device", "channel", "region", "site", "property", "outclick_type", "commissions", "sales", "orders" ]
          }
        },
      }
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 0,
      "rowFlushBoundary" : 0
    }
  },
  "dataSource" : "click_conversion",
  "groupId" : "index_click_conversion_2014-10-16T16:45:43.002Z",
  "interval" : "2014-08-01T00:00:00.000Z/2014-08-15T00:00:00.000Z",
  "resource" : {
    "availabilityGroup" : "index_click_conversion_2014-10-16T16:45:43.002Z",
    "requiredCapacity" : 1
  }
}
[...]
2014-10-16 17:29:29,934 INFO [task-runner-0] io.druid.indexing.common.index.YeOldePlumberSchool - Spilling index[4] with rows[130037] to: /tmp/persistent/task/index_click_conversion_2014-10-16T16:45:43.002Z/work/click_conversion_2014-08-06T00:00:00.000Z_2014-08-07T00:00:00.000Z_2014-10-16T16:45:43.003Z_0/click_conversion_2014-08-06T00:00:00.000Z_2014-08-07T00:00:00.000Z_2014-10-16T16:45:43.003Z/spill4
2014-10-16 17:29:29,936 INFO [task-runner-0] io.druid.segment.IndexMerger - Starting persist for interval[2014-08-06T00:00:00.000Z/2014-08-07T00:00:00.000Z], rows[130,037]
2014-10-16 17:29:30,237 INFO [task-runner-0] io.druid.segment.IndexMerger - outDir[/tmp/persistent/task/index_click_conversion_2014-10-16T16:45:43.002Z/work/click_conversion_2014-08-06T00:00:00.000Z_2014-08-07T00:00:00.000Z_2014-10-16T16:45:43.003Z_0/click_conversion_2014-08-06T00:00:00.000Z_2014-08-07T00:00:00.000Z_2014-10-16T16:45:43.003Z/spill4/v8-tmp] completed index.drd in 1 millis.
2014-10-16 17:29:30,342 INFO [task-runner-0] io.druid.segment.IndexMerger - outDir[/tmp/persistent/task/index_click_conversion_2014-10-16T16:45:43.002Z/work/click_conversion_2014-08-06T00:00:00.000Z_2014-08-07T00:00:00.000Z_2014-10-16T16:45:43.003Z_0/click_conversion_2014-08-06T00:00:00.000Z_2014-08-07T00:00:00.000Z_2014-10-16T16:45:43.003Z/spill4/v8-tmp] completed dim conversions in 105 millis.
2014-10-16 17:29:31,715 INFO [task-runner-0] io.druid.segment.IndexMerger - outDir[/tmp/persistent/task/index_click_conversion_2014-10-16T16:45:43.002Z/work/click_conversion_2014-08-06T00:00:00.000Z_2014-08-07T00:00:00.000Z_2014-10-16T16:45:43.003Z_0/click_conversion_2014-08-06T00:00:00.000Z_2014-08-07T00:00:00.000Z_2014-10-16T16:45:43.003Z/spill4/v8-tmp] completed walk through of 130,037 rows in 1,372 millis.
[...]
2014-10-16 18:23:24,687 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_click_conversion_2014-10-16T16:45:43.002Z",
  "status" : "SUCCESS",
  "duration" : 5853424
}
[...]

Given the above log, I would expect to see data for the date range [2014-08-01, 2014-08-15). However, when I run a timeseries query I get no results:

$ cat series.json 
{
    "queryType": "timeseries",
    "dataSource": "click_conversion",
    "granularity": "day",
    "aggregations": [
        {
            "type": "longSum",
            "fieldName": "count",
            "name": "events"
        }
    ], 
    "intervals": ["2014-08-01/2014-08-15"]
}


curl --silent --show-error -d @series.json -H 'content-type: application/json' 'http://broker-ip:8080/druid/v2/' --data-urlencode 'pretty' | python -mjson.tool | pygmentize -l json -f terminal256

[]

real 0m0.090s
user 0m0.003s
sys 0m0.002s

I have no idea what's going on here. Any help?
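
One sanity check that might help narrow things down (a sketch only; the /druid/v2/datasources endpoint and the port are assumptions based on the broker URL above, not verified against this Druid version) is to ask the broker which dataSources it is actually serving:

$ curl --silent 'http://broker-ip:8080/druid/v2/datasources' | python -mjson.tool
# If "click_conversion" is missing from the list, the broker has no queryable segments for it yet.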

Data Software Engineer

Gian Merlino

Oct 17, 2014, 10:34:23 AM
to druid-de...@googlegroups.com
It does look like the indexing went through properly, so possibly something is wrong with the coordinator/historical data loading dance. Are there any logs on the coordinator that look like it's trying to assign these segments to historical nodes (and succeeding)? Are there any logs on your historical nodes indicating they're trying to download these segments (and succeeding)?
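
If it helps, something like the following can check both sides from the command line (a rough sketch: the loadstatus endpoint comes from the docs of this era, and the ports and log file names are placeholders for your setup, not exact commands):

# Fraction of each dataSource's segments that historicals report as loaded (100.0 = fully available):
$ curl --silent 'http://coordinator-ip:8080/druid/coordinator/v1/loadstatus' | python -mjson.tool

# Assignment activity in the coordinator log, and load/download activity (or failures) on the historicals:
$ grep -i 'click_conversion' coordinator.log | grep -i assign
$ grep -i 'click_conversion' historical.log | grep -iE 'load|download|fail'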

Amy Troschinetz

Oct 17, 2014, 12:51:23 PM
to druid-de...@googlegroups.com
Sure enough, there are some exceptions in the historical node logs:

2014-10-17 16:46:43,191 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[click_conversion_2014-08-31T00:00:00.000Z_2014-09-01T00:00:00.000Z_2014-10-16T16:59:39.454Z], segment=DataSegment{size=68530163, shardSpec=NoneShardSpec, metrics=[count, commissions, sales, orders], dimensions=[browser, channel, city, country, coupon, device, outclick_type, property, region, site], version='2014-10-16T16:59:39.454Z', loadSpec={type=s3_zip, bucket=s3-int-std-agg-deep-storage, key=click_conversion/2014-08-31T00:00:00.000Z_2014-09-01T00:00:00.000Z/2014-10-16T16:59:39.454Z/0/index.zip}, interval=2014-08-31T00:00:00.000Z/2014-09-01T00:00:00.000Z, dataSource='click_conversion', binaryVersion='9'}}
io.druid.segment.loading.SegmentLoadingException: Exception loading segment[click_conversion_2014-08-31T00:00:00.000Z_2014-09-01T00:00:00.000Z_2014-10-16T16:59:39.454Z]
        at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:129)
        at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44)
        at io.druid.server.coordination.BaseZkCoordinator$1.childEvent(BaseZkCoordinator.java:113)
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:494)
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:488)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92)
        at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83)
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:485)
        at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35)
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$11.run(PathChildrenCache.java:755)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: io.druid.segment.loading.SegmentLoadingException: Problem decompressing object[S3Object [key=click_conversion/2014-08-31T00:00:00.000Z_2014-09-01T00:00:00.000Z/2014-10-16T16:59:39.454Z/0/index.zip, bucket=s3-int-std-agg-deep-storage, lastModified=Thu Oct 16 19:07:53 UTC 2014, dataInputStream=org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream@42ff28c0, Metadata={ETag="369cff46932e170a9cce1aa6e8df3019", Date=Fri Oct 17 16:46:38 UTC 2014, Content-Length=34937426, id-2=K53EBUsDk+HZxt+c8OazPSM43ZJ9i/1elkoty7V+m+Doc3kOQYXfC757/Gg1C2HKIItcYfnUM+Y=, request-id=1FDA28445F60E5F9, Last-Modified=Thu Oct 16 19:07:53 UTC 2014, md5-hash=369cff46932e170a9cce1aa6e8df3019, Content-Type=application/zip}]]
        at io.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:138)
        at io.druid.segment.loading.OmniSegmentLoader.getSegmentFiles(OmniSegmentLoader.java:125)
        at io.druid.segment.loading.OmniSegmentLoader.getSegment(OmniSegmentLoader.java:93)
        at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:146)
        at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:125)
        ... 17 more
Caused by: java.io.IOException: Problem decompressing object[S3Object [key=click_conversion/2014-08-31T00:00:00.000Z_2014-09-01T00:00:00.000Z/2014-10-16T16:59:39.454Z/0/index.zip, bucket=s3-int-std-agg-deep-storage, lastModified=Thu Oct 16 19:07:53 UTC 2014, dataInputStream=org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream@42ff28c0, Metadata={ETag="369cff46932e170a9cce1aa6e8df3019", Date=Fri Oct 17 16:46:38 UTC 2014, Content-Length=34937426, id-2=K53EBUsDk+HZxt+c8OazPSM43ZJ9i/1elkoty7V+m+Doc3kOQYXfC757/Gg1C2HKIItcYfnUM+Y=, request-id=1FDA28445F60E5F9, Last-Modified=Thu Oct 16 19:07:53 UTC 2014, md5-hash=369cff46932e170a9cce1aa6e8df3019, Content-Type=application/zip}]]
        at io.druid.storage.s3.S3DataSegmentPuller$1.call(S3DataSegmentPuller.java:114)
        at io.druid.storage.s3.S3DataSegmentPuller$1.call(S3DataSegmentPuller.java:88)
        at com.metamx.common.RetryUtils.retry(RetryUtils.java:22)
        at io.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:79)
        at io.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:86)
        ... 21 more
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:345)
        at com.google.common.io.ByteStreams.copy(ByteStreams.java:211)
        at io.druid.utils.CompressionUtils.unzip(CompressionUtils.java:104)
        at io.druid.storage.s3.S3DataSegmentPuller$1.call(S3DataSegmentPuller.java:103)
        ... 25 more
2014-10-17 16:46:43,192 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - /druid/loadQueue/10.101.187.175:8080/click_conversion_2014-08-31T00:00:00.000Z_2014-09-01T00:00:00.000Z_2014-10-16T16:59:39.454Z was removed

It looks like the root filesystem is full due to the indexCache in /tmp. I checked the config for the historical nodes and sure enough, the specified max size for the segmentCache is set too high and needs to be lowered. Could this be what's causing these exceptions?
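
For reference, confirming the disk pressure on the historical host is quick (paths here are illustrative; adjust to the segmentCache location in your config):

$ df -h /
$ du -sh /tmp/druid/indexCache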

On Friday, October 17, 2014 9:34:23 AM UTC-5, Gian Merlino wrote:
It does look like the indexing went through properly, so possibly something is wrong with the coordinator/historical data loading dance. Are there any logs on the coordinator that look like it's trying to assign these segments to historical nodes (and succeeding)? Are there any logs on your historical nodes indicating they're trying to download these segments (and succeeding)?

Nishant Bangarwa

Oct 17, 2014, 1:51:58 PM
to druid-de...@googlegroups.com
Hi Amy, 

Yeah, this is related to the disk being full, which means the historical node is not able to load new segments.
Lowering the max size on the historical should fix this; you may also need to provision more historical nodes in order to hold your data.


Fangjin Yang

Oct 17, 2014, 1:53:12 PM
to druid-de...@googlegroups.com
I think Nishant means increasing maxSize on the historical.

Fangjin Yang

Oct 17, 2014, 2:00:41 PM
to druid-de...@googlegroups.com
Okay, I'm dumb :P, the "no space left on device" error means reducing the maxSize.

Amy Troschinetz

Oct 17, 2014, 2:19:18 PM
to druid-de...@googlegroups.com
I reduced the maxSize but now I get this error on my historical nodes:

2014-10-17 18:12:53,531 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[click_conversion_2014-04-06T00:00:00.000Z_2014-04-07T00:00:00.000Z_2014-10-15T16:40:58.252Z], segment=DataSegment{size=61869449, shardSpec=NoneShardSpec, metrics=[count, commissions, sales, orders], dimensions=[browser, channel, city, country, coupon, device, outclick_type, property, region, site], version='2014-10-15T16:40:58.252Z', loadSpec={type=s3_zip, bucket=s3-int-std-agg-deep-storage, key=click_conversion/2014-04-06T00:00:00.000Z_2014-04-07T00:00:00.000Z/2014-10-15T16:40:58.252Z/0/index.zip}, interval=2014-04-06T00:00:00.000Z/2014-04-07T00:00:00.000Z, dataSource='click_conversion', binaryVersion='9'}}
io.druid.segment.loading.SegmentLoadingException: Exception loading segment[click_conversion_2014-04-06T00:00:00.000Z_2014-04-07T00:00:00.000Z_2014-10-15T16:40:58.252Z]
        at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:129)
        at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44)
        at io.druid.server.coordination.BaseZkCoordinator$1.childEvent(BaseZkCoordinator.java:113)
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:494)
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:488)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92)
        at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:83)
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:485)
        at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35)
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$11.run(PathChildrenCache.java:755)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.metamx.common.ISE: Segment[click_conversion_2014-04-06T00:00:00.000Z_2014-04-07T00:00:00.000Z_2014-10-15T16:40:58.252Z:61,869,449] too large for storage[/tmp/druid/indexCache:11,488,719].
        at io.druid.segment.loading.OmniSegmentLoader.getSegmentFiles(OmniSegmentLoader.java:114)
        at io.druid.segment.loading.OmniSegmentLoader.getSegment(OmniSegmentLoader.java:93)
        at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:146)
        at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:125)
        ... 17 more
2014-10-17 18:12:53,531 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - /druid/loadQueue/10.33.174.9:8080/click_conversion_2014-04-06T00:00:00.000Z_2014-04-07T00:00:00.000Z_2014-10-15T16:40:58.252Z was removed

I thought druid stored data in memory, not on the filesystem. I'm using machines with 30 GB of main memory, but only 6GB of disk space on /. Since we're talking about the segment cache here (keyword being cache) I figured that old data would be flushed from disk as new data flowed in, and that the segment cache didn't need to be that big.

You're saying druid stores the entire dataset both in memory and on disk at the same time, and that my segment cache needs to be at least as big as (or bigger than?) the machine's main memory?

Confused :(

Data Software Engineer


Fangjin Yang

Oct 19, 2014, 4:55:59 PM
to druid-de...@googlegroups.com
Hi Amy, please see inline.
Every historical node must first download a segment locally before it can run queries over it. Druid memory maps segments by default. This error is saying you've told Druid to download segments to /tmp/indexCache, but the maxSize is too small to accommodate the segment.
To set the maximum size your node can hold, please set these two configs (example sizes provided):
druid.server.maxSize=1550000000000
druid.segmentCache.locations=[{"path": "/mnt/persistent/zk_druid", "maxSize": 1550000000000}]

To understand how to configure historical nodes for your hardware, I'd recommend reading:


I thought druid stored data in memory, not on the filesystem. I'm using machines with 30 GB of main memory, but only 6GB of disk space on /. Since we're talking about the segment cache here (keyword being cache) I figured that old data would be flushed from disk as new data flowed in, and that the segment cache didn't need to be that big.

We rely on the OS to page columns and segments in and out of memory. Data that is queried often is held in memory and data that is not queried is paged out as more queries come in. 

You're saying druid stores the entire dataset both in memory and on disk at the same time, and that my segment cache needs to be at least as big as (or bigger than?) the machine's main memory?

Confused :(

If you provide your hardware, we can make suggestions about how to configure things for your particular hardware. We are also happy to do a follow up call to better discuss Druid configuration.
 


Amy Troschinetz

Oct 20, 2014, 1:41:29 PM
to druid-de...@googlegroups.com
On Oct 19, 2014, at 3:55 PM, Fangjin Yang <fangj...@gmail.com> wrote:

Every historical node must first download a segment locally before it can run queries over it. Druid memory maps segments by default. This error is saying you've told Druid to download segments to /tmp/indexCache, but the maxSize is too small to accommodate the segment.
To set the maximum size your node can hold, please set these two configs (example sizes provided):
druid.server.maxSize=1550000000000
druid.segmentCache.locations=[{"path": "/mnt/persistent/zk_druid", "maxSize": 1550000000000}]

I generate my Druid node configuration in a bash shell script that calculates maxSize (among other things) based on inspections of the machine's hardware. For the druid.segmentCache.locations size I use the following:

  # Index cache is 50% of available space at index cache path
  INDEX_CACHE_PATH="/tmp/druid/indexCache"
  mkdir -p "$INDEX_CACHE_PATH"
  INDEX_CACHE_SIZE="$(echo "(($(df $INDEX_CACHE_PATH -k --output=size | tail -n 1 | awk '{print $1}') * 1024 * .50) + 0.5) / 1" | bc)"

Which for my machine, comes out to be 4160450560 bytes, or approximately 4.2 GB. Here's the output from df for reference:

$ df -h /tmp/druid/indexCache
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.8G  6.3G  1.2G  85% /

As you can see, I've configured Druid to use an appropriate amount of space, avoiding filling up the drive all the way and running into out of memory errors.

As for the druid.server.maxSize, I use the following:

  # Max size is 90% of memory
  SERVER_MAX_SIZE="$(echo "(($(grep MemTotal /proc/meminfo | awk '{print $2}') * 1024 * .90) + 0.5) / 1" | bc)"

That comes out to 28407667507 bytes, or approximately 28 GB. Inspecting the /proc filesystem for reference:

$ grep MemTotal /proc/meminfo
MemTotal:       30824292 kB

Which shows that my machine has approximately 30 GB of main memory.

So just to summarize, this is my current configuration for maxSize for my historical nodes:

# 90% of main memory, 28 GB (out of 30 GB total)
druid.server.maxSize=28407667507

# 50% of disk space at the given path, 4.2 GB (out of 7.8 GB total)
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize": 4160450560}]

If I set my segmentCache's maxSize any higher I get out of memory errors when the /tmp filesystem gets completely full. I'm not anywhere near running out of main memory based on the output of top:

top - 14:19:52 up 2 days, 20:39,  1 user,  load average: 0.00, 0.01, 0.05
Tasks: 123 total,   1 running, 122 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  30824292 total,  7657208 used, 23167084 free,   155812 buffers
KiB Swap:        0 total,        0 used,        0 free.  5988252 cached Mem

So I should be able to accommodate another 23 GB of data in the main memory on my historical node.

To understand how to configure historical nodes for your hardware, I'd recommend reading:

Quoting from this link, I'm trying to understand exactly what this means:

"Historical nodes use off-heap memory to store intermediate results, and by default, all segments are memory mapped before they can be queried."

Does this mean that I need to have as much disk space for my segmentCache as I have for the druid server maxSize? That seems to be what you're implying here, unless I'm mistaken.

We rely on the OS to page columns and segments in and out of memory. Data that is queried often is held in memory and data that is not queried is paged out as more queries come in.

That doesn't seem to be happening here in this case. I have a lot more memory available to hold data that's not getting used because the disk space seems to be the real limiting factor. If you always have the same amount of disk space as main memory, you would never be able to get into a situation where the OS would need to page data out of memory. So I can't figure out how the scenario you've mentioned would ever actually happen.

If you provide your hardware, we can make suggestions about how to configure things for your particular hardware. We are also happy to do a follow up call to better discuss Druid configuration.

Hopefully the above details provide enough information to resolve this issue. In the meantime I'll talk to my boss about setting up a call, we might have some more people that would want to sit on that call :)

Data Software Engineer


Fangjin Yang

Oct 20, 2014, 4:07:09 PM
to druid-de...@googlegroups.com
Hi, see inline.


On Monday, October 20, 2014 12:43:19 PM UTC-7, Amy Troschinetz wrote:
On Oct 19, 2014, at 3:55 PM, Fangjin Yang <fangj...@gmail.com> wrote:

Every historical node must first download a segment locally before it can run queries over it. Druid memory maps segments by default. This error is saying you've told Druid to download segments to /tmp/indexCache, but the maxSize is too small to accommodate the segment.
To set the maximum size your node can hold, please set these two configs (example sizes provided):
druid.server.maxSize=1550000000000
druid.segmentCache.locations=[{"path": "/mnt/persistent/zk_druid", "maxSize": 1550000000000}]

I generate my Druid node configuration in a bash shell script that calculates maxSize (among other things) based on inspections of the machine's hardware. For the druid.segmentCache.locations size I use the following:

  # Index cache is 50% of available space at index cache path
  INDEX_CACHE_PATH="/tmp/druid/indexCache"
  mkdir -p "$INDEX_CACHE_PATH"
  INDEX_CACHE_SIZE="$(echo "(($(df $INDEX_CACHE_PATH -k --output=size | tail -n 1 | awk '{print $1}') * 1024 * .50) + 0.5) / 1" | bc)"

Which for my machine, comes out to be 4160450560 bytes, or approximately 4.2 GB. Here's the output from df for reference:

$ df -h /tmp/druid/indexCache
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.8G  6.3G  1.2G  85% /

As you can see, I've configured Druid to use an appropriate amount of space, avoiding filling up the drive all the way and running into out of memory errors.

As for the druid.server.maxSize, I use the following:

  # Max size is 90% of memory
  SERVER_MAX_SIZE="$(echo "(($(grep MemTotal /proc/meminfo | awk '{print $2}') * 1024 * .90) + 0.5) / 1" | bc)"

That comes out to 28407667507 bytes, or approximately 28 GB. Inspecting the /proc filesystem for reference:

Because of https://github.com/metamx/druid/issues/798, druid.server.maxSize and druid.segmentCache.locations are effectively the same config and should have the same maxSize value.
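
Concretely, and only as a sketch that echoes numbers already in this thread (not a sizing recommendation), that means either capping both settings at what the 7.8 GB root volume can spare, or attaching a larger volume and pointing both at it:

# Option A: both limited to the ~4.2 GB that / can spare
druid.server.maxSize=4160450560
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize": 4160450560}]

# Option B (hypothetical larger mount): both sized to the 28 GB figure from above
druid.server.maxSize=28407667507
druid.segmentCache.locations=[{"path": "/mnt/persistent/druid/indexCache", "maxSize": 28407667507}]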

$ grep MemTotal /proc/meminfo
MemTotal:       30824292 kB

Which shows that my machine has approximately 30 GB of main memory.

So just to summarize, this is my current configuration for maxSize for my historical nodes:

# 90% of main memory, 28 GB (out of 30 GB total)
druid.server.maxSize=28407667507

# 50% of disk space at the given path, 4.2 GB (out of 7.8 GB total)
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize": 4160450560}]

If I set my segmentCache's maxSize any higher I get out of memory errors when the /tmp filesystem gets completely full. I'm not anywhere near running out of main memory based on the output of top:

top - 14:19:52 up 2 days, 20:39,  1 user,  load average: 0.00, 0.01, 0.05
Tasks: 123 total,   1 running, 122 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  30824292 total,  7657208 used, 23167084 free,   155812 buffers
KiB Swap:        0 total,        0 used,        0 free.  5988252 cached Mem

So I should be able to accommodate another 23 GB of data in the main memory on my historical node.


What are you setting your heap and direct memory size to? 
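
For context, those are the -Xmx and -XX:MaxDirectMemorySize flags on the historical JVM command line. A sketch of where they live (sizes and paths here are made up, not a recommendation):

$ java -server -Xmx4g -XX:MaxDirectMemorySize=10g \
    -classpath /path/to/druid/lib/*:/path/to/config/historical \
    io.druid.cli.Main server historical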
To understand how to configure historical nodes for your hardware, I'd recommend reading:

Quoting from this link, I'm trying to understand exactly what this means:

"Historical nodes use off-heap memory to store intermediate results, and by default, all segments are memory mapped before they can be queried."

Does this mean that I need to have as much disk space for my segmentCache as I have for the druid server maxSize? That seems to be what you're implying here, unless I'm mistaken.

Yup. 

We rely on the OS to page columns and segments in and out of memory. Data that is queried often is held in memory and data that is not queried is paged out as more queries come in.

That doesn't seem to be happening here in this case. I have a lot more memory available to hold data that's not getting used because the disk space seems to be the real limiting factor. If you always have the same amount of disk space as main memory, you would never be able to get into a situation where the OS would need to page data out of memory. So I can't figure out how the scenario you've mentioned would ever actually happen.

If you provide your hardware, we can make suggestions about how to configure things for your particular hardware. We are also happy to do a follow up call to better discuss Druid configuration.

Hopefully the above details provide enough information to resolve this issue. In the meantime I'll talk to my boss about setting up a call, we might have some more people that would want to sit on that call :)

Ah cool, I sent some private messages before seeing this :) 