get_indexed_slices count causing timeout


Stephen Jones

Mar 4, 2014, 4:30:59 PM
to pycassa...@googlegroups.com
Hey There -

I'm new to Cassandra, and really to databases in general. I've been able to get data on and off of my DB, but I've run into an issue that perhaps someone might be able to shed some light on. When I query my column family (example below) I get a timed-out failure; with a smaller "count", I get no failures. I'm sure this has something to do with the way I've got things set up; an example of the issue is below. I believe that all of the columns I'm searching on are indexed keys, but frankly I'm not 100% sure.

Any information on how the count contributes to the connection timing out is greatly appreciated. I understand that count is the number of items to return, but how do I get all of the items, not just a few? Would multiple queries be a possibility? If so, how would I make sure that I've got the "next set" of keys? Thanks in advance for the help and information. Cheers.

Column Family Name = FileStats with 50 static columns

{column_name: cliname, validation_class: UTF8Type, index_type: KEYS},
{column_name: proj, validation_class: UTF8Type, index_type: KEYS},
{column_name: user, validation_class: UTF8Type, index_type: KEYS},


from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.index import create_index_clause, create_index_expression

pool = ConnectionPool(keyspace='file_stats',
                      server_list=['cassaws002', 'cassaws003',
                                   'cassaws004', 'cassaws005'],
                      pool_timeout=5.0, max_retries=3)
cf = ColumnFamily(pool, 'FileStats')

def SearchFrameData(column_family=None, count=10, **kwargs):
    if column_family:
        # Build one equality expression per keyword argument.
        expr_list = []
        for key, value in kwargs.items():
            expr_list.append(create_index_expression(key, value))
        clause = create_index_clause(expr_list, count=count)
        # get_indexed_slices yields (row_key, columns) pairs.
        data = {}
        for key, col in column_family.get_indexed_slices(clause):
            data[key] = col
        return data

matched=SearchFrameData(column_family=cf,cliname='lon',proj='zz0325',user='sjones')
############ Returns correctly without timeout, count = 10 #####################

This returns 10 matched items. However, I have 100 to 2,000+ items that should match. When I set "count" to anything higher than about 40, I get a timeout. What is the problem here? How do I let Cassandra know that this is a large query, and not to time out?

############ Error with count at 100 ###############

matched=SearchFrameData(column_family=cf,cliname='lon',proj='zz0325',user='sjones',count=100)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/pycassa/columnfamily.py", line 712, in get_indexed_slices
    key_slices = self.pool.execute('get_indexed_slices', cp, clause, sp, cl)
  File "/usr/lib/python2.6/site-packages/pycassa/pool.py", line 554, in execute
    return getattr(conn, f)(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/pycassa/pool.py", line 150, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/pycassa/pool.py", line 150, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/pycassa/pool.py", line 150, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/pycassa/pool.py", line 150, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/pycassa/pool.py", line 150, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/pycassa/pool.py", line 145, in new_f
    (self._retry_count, exc.__class__.__name__, exc))
pycassa.pool.MaximumRetryException: Retried 6 times. Last failure was timeout: timed out




Tyler Hobbs

Mar 5, 2014, 11:46:42 AM
to pycassa...@googlegroups.com
Hi Stephen,

There are a few issues here.  I'll cover the easiest parts first.

The "timed out" error indicates a client-side socket timeout.  By default, pycassa uses a timeout of 0.5 seconds.  You can adjust this with the timeout parameter to ConnectionPool (perhaps you might be confusing pool_timeout with this).

For get_indexed_slices() queries, pycassa transparently breaks up large queries into smaller chunks.  By default, it fetches 1024 rows at a time.  This is also configurable, with the buffer_size parameter to get_indexed_slices().  The timeout applies to each of these subqueries, so depending on how large the rows are and how quickly Cassandra can respond, you may want to use a lower buffer_size.
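
For example, a sketch using the cf and clause from your snippet (256 is an arbitrary illustrative value):

# Each chunk of at most 256 rows must complete within the socket
# timeout; smaller chunks are cheaper per request but need more trips.
data = {}
for key, cols in cf.get_indexed_slices(clause, buffer_size=256):
    data[key] = cols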

Last, if you want operations that you perform on a regular basis to be efficient, you probably don't want to depend on get_indexed_slices() for them.  With Cassandra, you typically want a separate column family for each read pattern, and you denormalize your data by writing a copy of each piece of data into each of those column families.  This means you write more, but writes are cheap; reads are what you should optimize for.  I recommend you check out the slides and videos here: http://wiki.apache.org/cassandra/DataModel.  Data modeling is one of the most difficult aspects of Cassandra for newcomers, but probably the most important.
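
As a rough illustration of that idea (the column family names and row keys here are hypothetical, not from your schema):

from pycassa.columnfamily import ColumnFamily

# Hypothetical denormalized layout: one column family per read pattern,
# each keyed by the value you will look up, so a read is a plain key
# lookup instead of a secondary-index scan.
by_proj = ColumnFamily(pool, 'FileStatsByProj')
by_user = ColumnFamily(pool, 'FileStatsByUser')

cols = {'cliname': 'lon', 'proj': 'zz0325', 'user': 'sjones'}

# Write the same data once per view; writes are cheap in Cassandra.
by_proj.insert('zz0325:file001', cols)
by_user.insert('sjones:file001', cols)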

I hope that helps!





--
Tyler Hobbs
DataStax

Stephen Jones

May 5, 2014, 9:54:27 PM
to pycassa...@googlegroups.com
Thanks for your insight, Tyler. Now that I've gone through and organized the data accordingly, I'm still having issues getting the expected results from get_indexed_slices. I'm trying to get the row keys whose columns match a given string across 3 indexed columns; the result could be anywhere from 0 to 10,000+ rows. I was able to solve my initial timeout issue by setting the buffer size lower. But when I do a search that should return one key, I get a timeout error again, while a search that returns 200+ rows doesn't time out. What am I missing? I'm confused: when less data needs to be returned I get the timeout error, but when more information is returned there's no timeout. To me this doesn't make practical sense. Thanks for your help! Cheers.

Tyler Hobbs

May 6, 2014, 12:17:08 PM
to pycassa...@googlegroups.com

On Mon, May 5, 2014 at 8:54 PM, Stephen Jones <sbjo...@gmail.com> wrote:
Thanks for your insight, Tyler. Now that I've gone through and organized the data accordingly, I'm still having issues getting the expected results from get_indexed_slices. I'm trying to get the row keys whose columns match a given string across 3 indexed columns; the result could be anywhere from 0 to 10,000+ rows. I was able to solve my initial timeout issue by setting the buffer size lower. But when I do a search that should return one key, I get a timeout error again, while a search that returns 200+ rows doesn't time out. What am I missing? I'm confused: when less data needs to be returned I get the timeout error, but when more information is returned there's no timeout. To me this doesn't make practical sense. Thanks for your help! Cheers.

This is definitely a bit of counter-intuitive behavior by Cassandra.  The reason that fetching one matching row is slow is that in 1.2 and 2.0, each token range is queried sequentially to look for matches.  If the single matching row is in the last token range, this will take a while.  In 2.1, we've improved this by parallelizing the token range queries: https://issues.apache.org/jira/browse/CASSANDRA-1337.  However, even with that improvement, single matches are the worst case for secondary indexes.  If you're going to make this query frequently, this is where a proper data model becomes important.

I will note that if you only expect a single matching row, setting a count of 1 in your IndexClause will cut the query time in half, on average.
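
A sketch, reusing the index helpers from your earlier snippet:

from pycassa.index import create_index_clause, create_index_expression

exprs = [create_index_expression('cliname', 'lon'),
         create_index_expression('proj', 'zz0325'),
         create_index_expression('user', 'sjones')]

# count=1 lets Cassandra stop scanning token ranges as soon as the
# single match is found, instead of filling a whole page of results.
clause = create_index_clause(exprs, count=1)
match = dict(cf.get_indexed_slices(clause))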


--
Tyler Hobbs
DataStax