{ "error" : "Maximum number of rows [500000] reached" }

Amey Jain

unread,

Feb 9, 2016, 7:44:34 PM2/9/16

to Druid User

I am have pretty small dataset loaded in my druid.
I am querying it and I am constantly getting this error.
Any idea how to limit the output rows like limit n. I tried threshold:5 but does not help, it gives all the results it finds(if they are less than 500000.

query:

{
    "queryType": "groupBy",
    "dataSource": "avpingFull",
    "granularity": {"type": "duration", "duration": "86400000", "origin":"2015-01-01T23:55:00Z"},
    "dimensions": ["file_sha2"],
   "aggregations": [
        {"type": "count", "name": "rows"}
    ],
    "OrderByColumnSpec":{
        "dimension":"rows",
        "direction":"descending"
    },
    "intervals": ["2015-01-01T10:00:00/2016-01-01T11:01:00"]
}

max time I have in my cluster it 2015-01-01T23.59.59
Please advice.

Amey Jain

unread,

Feb 9, 2016, 8:17:22 PM2/9/16

to Druid User

Looks like it relates to the maxRowsinMemory in realtlime config file. Eventhough I tried giving a limitspec with a value 5. So now there is no way I can query my whole data at once. I have only one day of data.

{
    "queryType": "groupBy",
    "dataSource": "avpingFull",


    "granularity":"all",


    "dimensions": ["file_sha2"],
   "aggregations": [
        {"type": "count", "name": "rows"}
    ],


    "limitSpec":{
        "type"    : "default",
        "limit"   : 5,
        "columns" : [{


                "dimension":"rows",
                "direction":"descending"
        }]
    },


    "intervals": ["2015-01-01T00:10:00/2015-01-01T00:15:00"]
}

Amey Jain

unread,

Feb 9, 2016, 8:30:06 PM2/9/16

to Druid User

Another thing I noticed using plyql I am able to query full dataset for the same query

plyql -h hostname:8082 -a select -q "SELECT file_sha2 ,count() FROM avpingFull where __time>'2015-01-01T00:00:00' AND __time < '2015-01-01T23:59:59' group by file_sha2 limit 5 "
This give me the result as I need.

On Tuesday, February 9, 2016 at 4:44:34 PM UTC-8, Amey Jain wrote:

Gian Merlino

unread,

Feb 9, 2016, 10:54:11 PM2/9/16

to druid...@googlegroups.com

Hey Amey,

You're running into a limit on the size of the pre-limited resultset for groupBy queries (by default this is 500000). plyql is likely working because your query could actually be run as a topN (it's a single dimension ordered by a metric), and so plyql is probably doing that. You can see in verbose mode the query it's issuing (--verbose).

If topNs work for you, you should use them, as they are much more efficient than groupBys. They are approximate though and may have some 'noise' (especially near the bottom of the results) due to potentially not including 100% of the data points.

Gian

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/ccc867f9-32f2-4db0-a301-0daabda3acdd%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Amey Jain

unread,

Feb 11, 2016, 2:34:13 PM2/11/16

to Druid User

Ok i found the default properties in the documentation. Thanks for the help. Can I restart the realtime node without loosing the data. because the time range which realtime node is serving is stored in its memory(i guess) because there is no segment data in the basepersist directoy, it is empty. I am not 100 percent sure what happens if I restart the realtime node with changed configurations.

Fangjin Yang

unread,

Feb 17, 2016, 3:52:53 PM2/17/16

to Druid User

Should be fine.

Amey Jain

unread,

Feb 17, 2016, 7:45:46 PM2/17/16

to Druid User

I did restart the realtime node and I lost the last two segments when I restarted. The basepersist folder is empty.
now the max time I get
[
{
    "max___time": {
      "type": "TIME",
      "value": "2015-01-01T23:49:59.000Z"
    }
}
]
which should be 23:59:59. Why would this happen.
Another problem is I updated the groupby setting to following in the historical nodes but still I get the same error of 50000 max rows.
druid.query.groupBy.maxIntermediateRows=60000000
druid.query.groupBy.maxResults=24000000

Fangjin Yang

unread,

Feb 19, 2016, 6:57:08 PM2/19/16

to Druid User

Amey, I'm guessing your setup is misconfigured. How often are you persisting?

David Lim

unread,

Feb 19, 2016, 8:11:42 PM2/19/16

to Druid User

Addressing "Another problem is I updated the groupby setting to following in the historical nodes but still I get the same error of 50000 max rows.":

Try setting these properties on the broker nodes as well.

Amey Jain

unread,

Feb 22, 2016, 10:09:42 PM2/22/16

to Druid User

Thanks changing the broker node solved the problem.

Amey Jain

unread,

Feb 22, 2016, 10:48:04 PM2/22/16

to Druid User

intermediatePersist time is 10mins(default). From what I understand is realtime node keeps buffering the data unless the segment reaches the granularity and a data of later time stamp is received. It will still be in the memory till the window period. After that realtime node will write the data to local disk and commit to kafka that message is received. If something goes wrong with realtime node before it persists to local disk, it will consume the non commited data from
kafka. in my case it didn't happened. I checked the kafka topic and all data from cluster is gone now. As the rention period in kafka cluster is over the data is deleted from kafka as well. How should I handle this case?

Can we do something like realtime node write all the data it has because once our data is finished, realtime node will still hold the last chunk in its memory which might be lost in future? As in my case the last 10 minutes data should be written to the disk and loaded in historical nodes rather than being served by realtime node?
Here is my realtime spec file

[
  {
    "dataSchema" : {
      "dataSource" : "ping",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "tsv",
          "columns" : ["some dimensions"],

          "delimiter":"\t",
          "timestampSpec" : {
            "column" : "server_ts",
            "format" : "yyyy-MM-dd HH:mm:ss"
          },
          "dimensionsSpec" : {
            "dimensions": ["some dimensions"],

            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [{
        "type" : "count",
        "name" : "count"
      }],
      "aggregations": [{
         "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" }],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "FIVE_MINUTE",      "queryGranularity" : "NONE"
      }
    },
    "ioConfig" : {
      "type" : "realtime",
      "firehose": {
        "type": "kafka-0.8",
        "consumerProps": {
          "zookeeper.connect": "hostname:2181",
          "zookeeper.connection.timeout.ms" : "15000",
          "zookeeper.session.timeout.ms" : "15000",
          "zookeeper.sync.time.ms" : "5000",
          "group.id": "avping-full",
          "fetch.message.max.bytes" : "1048586",
          "auto.offset.reset": "largest",
          "auto.commit.enable": "false"
        },
        "feed": "pingFull"
      },
      "plumber": {
        "type": "realtime"
      }
    },
    "tuningConfig": {
      "type" : "realtime",
      "maxRowsInMemory": 250000000,
      "intermediatePersistPeriod": "PT10m",
      "windowPeriod": "PT5m",
      "basePersistDirectory": "/mnt/druid/realtime/basePersist",
      "rejectionPolicy": {
        "type": "messageTime"
      }
    }
  }
]

I have used messageTime rejection policy to show the realtime ingestion functionality to my team. I am not sure this can cause problems. Other than that everything is pretty standard. Please let me know as this will be big question from my time about data loss.

Reply all

Reply to author

Forward