Getting wrong records count when using python driver with query contains "IN".

826 views
Skip to first unread message

Wy Chang

unread,
Nov 25, 2014, 3:35:03 AM11/25/14
to python-dr...@lists.datastax.com
Hi

Have any one met the issue below:
when the query contains "IN" in criteria, the python driver will sometimes return the wrong record count.

Details:
1. The total count of records which meet some specific criteria is 1000.
2. Query is: "select XXX from XXX where col in (XXX, XXX, XXX) limit 100". (meet above specific criteria)
3. Set fetch_size to 800.

The problem1 is:  Even the limit size (100) is smaller than fetch_size (1000), the returned result is still a PagedResult (which I think should be a list).
The problem2 is:  The total size of returned is bigger than the limit size, which is 100.
BTW, when I removed "IN" part from criteria, and everything works fine.
And the version I am using is 2.1.2

My simple code is:

# Initialize cluster
cluster = Cluster(['mynode'])
session = cluster.connect('mycolumnfamily')


# Fetch all the records and get the count
cql = SimpleStatement("select * from table where pkcol in ('val1', '
val2', 'val3', 'val4', 'val5')", fetch_size=None)
print "Total count is: %d" % len(session.execute(cql))

# Fetch 20 records and set fetch_size to 40
cql = SimpleStatement("select * from table where pkcol in ('val1', '
val2', 'val3', 'val4', 'val5') limit 20", fetch_size=40)
result = session.execute(cql)

# print out result type and count of records
print
"Type of result is: %s" % type(result)
result = list(result)
print "Count returned: %d (should only return 20)" % len(result)

The output is:
Total count is: 54
Type of result is: <class 'cassandra.cluster.PagedResult'>
Count returned: 32 (should only return 20)

Please correct me if I am wrong at some place.


 

Adam Holmberg

unread,
Nov 25, 2014, 12:49:16 PM11/25/14
to python-dr...@lists.datastax.com

The problem1 is:  Even the limit size (100) is smaller than fetch_size (1000), the returned result is still a PagedResult (which I think should be a list).
 
The type of result returned is based on a paging state received from Cassandra. There is nothing to be done driver-side. There might be some optimization that could be made on the server for this type of query, to avoid unnecessary paging. Alternatively you might choose to split your multiget up into distinct selects by partition, and run them concurrently using cassandra.concurrent.

The problem2 is:  The total size of returned is bigger than the limit size, which is 100.
 
This looks like a Cassandra issue to me. There is some strange behavior when both limit and fetch_size are specified together. 

My observations:
fetch_size <= limit < items: limit is disregarded and all items are returned
limit < fetch_size < items: some arbitrary number of items returned, < items
limit < items <= fetch_size: limit is respected

This might be good to submit to the Cassandra JIRA if you're willing.

In the mean time, an easy workaround would be to avoid using LIMIT with multiget (WHERE..IN clause). The items in the set naturally limit the results returned. I'm not sure what you're trying to accomplish with the additional LIMIT.

Adam

To unsubscribe from this group and stop receiving emails from it, send an email to python-driver-u...@lists.datastax.com.

Wy Chang

unread,
Nov 25, 2014, 9:51:47 PM11/25/14
to python-dr...@lists.datastax.com
Thanks Adam

Actually what I am trying to accomplish is: Even after records is limited by items in set, there still could be a big amount of records left. That's why I use limit trying to reduce the result count.

Thanks,
Wenyu
Reply all
Reply to author
Forward
0 new messages