pycassa connection errors


Ernst D Schoen-René

Jul 24, 2011, 11:58:58 PM
to pycassa...@googlegroups.com
I'm a newbie to cassandra, but here are some questions I haven't found
the answers to yet.

We're running a single node. I know that's wrong, but I don't have a
choice right now. Shortly we'll move to a 16 node cluster.

In the meantime:

I get lots of these in my log:

2011-07-25 03:45:34,791 Connection 7767968 (localhost:9160) in
ConnectionPool (id = 375856016) failed: Could not connect to localhost:9160
2011-07-25 03:45:34,792 Connection 7767968 (localhost:9160) in
ConnectionPool (id = 375856016) failed: Could not connect to localhost:9160
2011-07-25 03:45:34,803 Connection 7767968 (localhost:9160) in
ConnectionPool (id = 375856016) failed: Could not connect to localhost:9160

1: is there a way to tell pycassa to NOT assert these to the log at log
level INFO?

2: This message is sort of confusing. It seems like sometimes the
message comes in, but the connection is eventually successful, since
writes happen. What does this message actually mean?

3: Is there a way to reduce the number of connection failures I get?
I'm running a multiprocess script, and sometimes all connections are
fine, sometimes many of them fail, sometimes all of them fail. I can't
relate it to anything else going on. Nothing else runs on the machine,
it doesn't relate to compactions, it doesn't show up in the cassandra log.

Thanks! Any help would be appreciated!

Tyler Hobbs

Jul 25, 2011, 11:06:07 AM
to pycassa...@googlegroups.com
2011/7/24 Ernst D Schoen-René <er...@peoplebrowsr.com>

I get lots of these in my log:

2011-07-25 03:45:34,791 Connection 7767968 (localhost:9160) in
ConnectionPool (id = 375856016) failed: Could not connect to localhost:9160
2011-07-25 03:45:34,792 Connection 7767968 (localhost:9160) in
ConnectionPool (id = 375856016) failed: Could not connect to localhost:9160
2011-07-25 03:45:34,803 Connection 7767968 (localhost:9160) in
ConnectionPool (id = 375856016) failed: Could not connect to localhost:9160

1: is there a way to tell pycassa to NOT assert these to the log at log
level INFO?

No. You can increase your log level to WARN, but there's not a clean way to do this. (Of course, everything is possible in Python.)  This log message is for issues that can be indicative of an underlying problem, but each issue on its own is recoverable; I think that makes either INFO or WARN appropriate for the log message.  In your case, it's actually helped you discover a problem.  When things are running smoothly, you'll rarely see these logs.
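If pycassa emits these messages through Python's standard `logging` hierarchy under the "pycassa" name (an assumption here; check the pycassa logging docs for the supported interface), the not-so-clean route looks like this:

```python
import logging

# Assumed sketch: raise the level on the "pycassa" logger so INFO-level
# connection-failure messages are suppressed, but WARN and above still appear.
logging.getLogger('pycassa').setLevel(logging.WARN)

# Child loggers (e.g. "pycassa.pool") inherit the level from their parent:
assert logging.getLogger('pycassa.pool').getEffectiveLevel() == logging.WARN
```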

2: This message is sort of confusing.  It seems like sometimes the
message comes in, but the connection is eventually successful, since
writes happen.  What does this message actually mean?

It means that a single connection in the pool failed when it was used for an operation.  Assuming your pool settings allow for retries, the operation was retried with a different connection and eventually succeeded.
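As a generic illustration of that retry behavior (this sketches the idea, not pycassa's actual implementation; `run_with_retries` and `flaky` are made-up names):

```python
import socket

def run_with_retries(operation, servers, max_retries=5):
    """Try `operation` against the servers in turn, moving on after a failure.

    A single failed connection is recoverable: it gets logged and the
    operation is retried on another connection, so it can still succeed.
    """
    attempts = 0
    last_error = None
    for server in servers * (max_retries + 1):
        if attempts > max_retries:
            break
        try:
            return operation(server)
        except (socket.error, IOError) as exc:
            last_error = exc
            attempts += 1  # recoverable failure: try the next connection
    raise last_error

# An operation that fails twice ("Could not connect"), then succeeds:
calls = []
def flaky(server):
    calls.append(server)
    if len(calls) < 3:
        raise IOError("Could not connect to %s" % server)
    return "ok"

result = run_with_retries(flaky, ["cass1:9160", "cass2:9160"])
print(result)  # -> ok, despite two logged connection failures
```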

3: Is there a way to reduce the number of connection failures I get?
I'm running a multiprocess script, and sometimes all connections are
fine, sometimes many of them fail, sometimes all of them fail.  I can't
relate it to anything else going on.  Nothing else runs on the machine,
it doesn't relate to compactions, it doesn't show up in the cassandra log.

Neither ConnectionPool nor ColumnFamily is multiprocess-safe.  My guess is that you're sharing these between processes and the failures are due to race conditions.

--
Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra Python client library

Ernst D Schoen-René

Jul 25, 2011, 12:28:19 PM
to pycassa...@googlegroups.com, Tyler Hobbs
Thanks so much for the very helpful answer.

Regarding #3 - I'm not sharing the connection pool between worker processes.  Each worker process spawns its own pool and uses it, while events are piped to it from the parent via a queue.
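Roughly, the pattern is this (with a stand-in `make_pool` in place of the real pycassa pool, so the sketch is self-contained):

```python
import multiprocessing

def make_pool():
    # Stand-in for pycassa.pool.ConnectionPool(...); the important point
    # is that it is called *inside* the child process, after the fork.
    return object()

def worker(events):
    pool = make_pool()               # each child owns its own pool
    for event in iter(events.get, None):
        pass                         # ... use `pool` to write `event` ...

events = multiprocessing.Queue()
procs = [multiprocessing.Process(target=worker, args=(events,))
         for _ in range(4)]
for p in procs:
    p.start()
for event in ["e1", "e2", "e3"]:
    events.put(event)                # parent pipes events to the children
for _ in procs:
    events.put(None)                 # sentinel: tell each child to exit
for p in procs:
    p.join()
```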

Is there any way to diagnose the cause of these failures?  Should I up cassandra's log level?

Ernst D Schoen-René

Jul 25, 2011, 12:50:47 PM
to Tyler Hobbs, pycassa...@googlegroups.com
I already raised the ulimit on cassandra significantly.  The open files for cassandra is beneath that limit.
The cassandra logs show nothing relating to connections.  That's part of the mystery.

On 7/25/11 9:42 AM, Tyler Hobbs wrote:

2011/7/25 Ernst D Schoen-René <er...@peoplebrowsr.com>

Regarding #3 - I'm not sharing the connection pool between worker processes.  Each worker process spawns its own pool and uses it, while events are piped to it from the parent via a queue.

Is there any way to diagnose the cause of these failures?  Should I up cassandra's log level?

I suspect that you might be hitting an open file limit on the Cassandra node.  What's the output of 'cat /proc/<pid>/limits' where <pid> is the Cassandra process?

If that's not the problem, then increasing the Cassandra log level may help to uncover the issue.
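(On the client side, the same "Max open files" limit that /proc/&lt;pid&gt;/limits reports can be read from Python with the stdlib resource module; note this inspects the current process, not the Cassandra JVM:)

```python
import resource

# Soft limit is what the process is actually held to; hard limit is the
# ceiling the soft limit can be raised to without privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d hard=%d" % (soft, hard))
```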
 

Ernst D Schoen-René

Jul 25, 2011, 1:39:44 PM
to Tyler Hobbs, pycassa...@googlegroups.com
[cassy@cassarch1 conf]$ ulimit -Hn
200000
[cassy@cassarch1 conf]$ ulimit -Sn
100000

lsof -u cassy | wc -l
95798

So still under the ulimit.

Connection failures happen all over the place, even with the single-threaded example below.  NOTHING else is running on this machine, and nothing else is trying to connect to cassandra.  Sometimes they happen at the beginning, sometimes in the middle; no rhyme or reason.

import pycassa
CASSANDRA_NODE = 'localhost:9160'
connection_pool = pycassa.pool.ConnectionPool(keyspace='creds', server_list=[CASSANDRA_NODE], max_retries=5, pool_size=16)
print connection_pool

that produces this:

Traceback (most recent call last):
  File "testconnect.py", line 5, in <module>
    connection_pool = pycassa.pool.ConnectionPool(keyspace='creds', server_list=[CASSANDRA_NODE], max_retries=5, pool_size=16)
  File "/usr/local/lib/python2.7/site-packages/pycassa-1.1.0-py2.7.egg/pycassa/pool.py", line 617, in __init__
    self._q.put(self._create_connection(), False)
  File "/usr/local/lib/python2.7/site-packages/pycassa-1.1.0-py2.7.egg/pycassa/pool.py", line 123, in _create_connection
    raise AllServersUnavailable('An attempt was made to connect to each of the servers '
pycassa.pool.AllServersUnavailable: An attempt was made to connect to each of the servers twice, but none of the attempts succeeded.

however, the second run got me this:

<pycassa.pool.ConnectionPool object at 0xac53750>




On 7/25/11 10:34 AM, Tyler Hobbs wrote:
2011/7/25 Ernst D Schoen-René <er...@peoplebrowsr.com>
I already raised the ulimit on cassandra significantly.  The open files for cassandra is beneath that limit.

The cassandra logs show nothing relating to connections.  That's part of the mystery.


Can you verify the open file limit through the method I described?  It's common to misconfigure this limit because of several oddities with the limits module.

Are the connection failures happening quickly after startup (of the client), or do they take a while to appear?  Can you describe any patterns with how the log messages show up?
 

Ernst D Schoen-René

Jul 27, 2011, 2:56:47 PM
to Tyler Hobbs, pycassa...@googlegroups.com
Thanks, I appreciate the help.

In debug mode, all pycassa gives me is a bunch of this:

Connection 283668688 (cass3:9160) was checked out from ConnectionPool (id = 279661520)
Connection 283659216 (cass2:9160) was checked out from ConnectionPool (id = 279661520)
Connection 283650704 (cass3:9160) was checked out from ConnectionPool (id = 279661520)
Connection 280698832 (cass3:9160) was checked out from ConnectionPool (id = 279661520)
Connection 283659856 (cass5:9160) was checked out from ConnectionPool (id = 279661520)
Connection 283622544 (cass7:9160) was checked in to ConnectionPool (id = 279661520)
Connection 283659600 (cass5:9160) was checked out from ConnectionPool (id = 279661520)
Connection 283622992 (cass1:9160) was checked out from ConnectionPool (id = 279661520)
Connection 279661456 (cass7:9160) was checked in to ConnectionPool (id = 279661520)
Connection 283651152 (cass1:9160) was checked in to ConnectionPool (id = 279661520)
Connection 283991376 (cass7:9160) was checked in to ConnectionPool (id = 279661520)
Connection 283659216 (cass2:9160) was checked in to ConnectionPool (id = 279661520)
Connection 283668688 (cass3:9160) was checked in to ConnectionPool (id = 279661520)
Connection 280698832 (cass3:9160) was checked in to ConnectionPool (id = 279661520)


meanwhile, in my logs, I'm still getting a ton of

2011-07-27 18:54:18,549 Connection 140691664 (cass6:9160) in ConnectionPool (id = 136469392) failed: timed out
2011-07-27 18:54:18,564 Connection 140678928 (cass6:9160) in ConnectionPool (id = 136469392) failed: timed out
2011-07-27 18:54:18,570 Connection 140701904 (cass6:9160) in ConnectionPool (id = 136469392) failed: timed out
2011-07-27 18:54:18,651 Connection 140701968 (cass6:9160) in ConnectionPool (id = 136469392) failed: timed out
2011-07-27 18:54:18,715 Connection 140678672 (cass6:9160) in ConnectionPool (id = 136469392) failed: timed out
2011-07-27 18:54:26,306 Connection 139947664 (cass8:9160) in ConnectionPool (id = 136469392) failed: timed out
2011-07-27 18:54:26,322 Connection 140655440 (cass8:9160) in ConnectionPool (id = 136469392) failed: timed out
2011-07-27 18:54:26,332 Connection 140691600 (cass8:9160) in ConnectionPool (id = 136469392) failed: timed out

I am now running on an 8-node cluster, not on a single machine any more.



On 7/27/11 10:56 AM, Tyler Hobbs wrote:
Sorry I haven't been able to reply more quickly -- I unexpectedly had to fly out to OSCON :)

Did you make any progress on this?

If not, could you increase the pycassa log level to DEBUG and paste the logs here when the problem shows up?  You can see an example of changing the log level here: http://pycassa.github.com/pycassa/api/pycassa/logging/pycassa_logger.html

Ernst D Schoen-René

Jul 28, 2011, 6:40:58 PM
to Tyler Hobbs, pycassa...@googlegroups.com
Is the timeout set in the connection?  I've got it set to the default, which I read as being 30 seconds.  30 seconds seems like an awfully long time to have so many timeouts on all 8 of my cassandra nodes when this is the only app connecting to it.  I'm using ConnectionPool:

        connection_pool = pycassa.pool.ConnectionPool(keyspace = 'creds', server_list = CASSANDRA_NODE, max_retries = 5, pool_size = 1, prefill = False)

is there anything there that would cause it to timeout so much?

Thanks!

On 7/28/11 3:21 PM, Tyler Hobbs wrote:
"timed out" indicates that you're hitting the client-side (pycassa) timeout.  Depending on the application, it may be desirable to retry the query on a different node if it hasn't completed by this timeout, but it's worth looking at that timeout to make sure it matches your application needs.

In any case, this is definitely a different failure case than the previous log messages.

2011/7/27 Ernst D Schoen-René <er...@peoplebrowsr.com>

Tyler Hobbs

Jul 28, 2011, 7:20:31 PM
to Ernst D Schoen-René, pycassa...@googlegroups.com
Oops, meant to reply to the list on that last one.

Batch writes could certainly take longer than half a second in some cases.

2011/7/28 Ernst D Schoen-René <er...@peoplebrowsr.com>
I'm doing batch writes.  Maybe that's the problem


On 7/28/11 4:13 PM, Tyler Hobbs wrote:
The "timeout" kwarg in the ConnectionPool constructor controls the timeout; the default is 0.5 seconds.  You're probably thinking of "pool_timeout", which controls how long a thread will wait when trying to get a connection from the connection pool (this only matters in multi-threaded environments).

Are you doing really large reads or writes that might take longer than 0.5 seconds to complete?  Another possibility is that one particular node might be undergoing a large GC or compaction which could slow down the operation.
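To make "timed out" concrete, here's a pycassa-free sketch using a plain stdlib socket whose peer never responds; pycassa's "timeout" kwarg plays the same role as settimeout() here:

```python
import socket

# A connected pair of sockets where the peer never sends anything stands
# in for a node that is too slow to respond to a read or write.
client, server = socket.socketpair()
client.settimeout(0.2)  # analogous to the pool's `timeout` kwarg

try:
    client.recv(1)                    # blocks until the timeout elapses...
    outcome = "received"
except socket.timeout:
    outcome = "failed: timed out"     # ...then raises, as in the log message
finally:
    client.close()
    server.close()

print(outcome)
```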

2011/7/28 Ernst D Schoen-René <er...@peoplebrowsr.com>

Ernst D Schoen-René

Jul 28, 2011, 7:24:43 PM
to Tyler Hobbs, pycassa...@googlegroups.com
that would make sense.  Thanks!

Ernst Schoen-Rene

Aug 19, 2011, 2:00:02 PM
to pycassa-discuss
I'm back to having this same problem. My connections are being run in
a multi-process environment, but each ConnectionPool is created in the
child process and owned by the child process.
This code was chugging along fine for a few weeks, and now it's
started failing and throwing tons of timeout errors:

2011-08-19 17:39:09,599 Connection 439493328 (cass6:9160) in
ConnectionPool (id = 433491024) failed: timed out
2011-08-19 17:39:09,600 Connection 439493456 (cass6:9160) in
ConnectionPool (id = 433491024) failed: timed out
2011-08-19 17:39:09,603 Connection 439481424 (cass6:9160) in
ConnectionPool (id = 433491024) failed: timed out
2011-08-19 17:39:09,610 Connection 439481360 (cass6:9160) in
ConnectionPool (id = 433491024) failed: timed out
2011-08-19 17:39:09,616 Connection 439493392 (cass6:9160) in
ConnectionPool (id = 433491024) failed: timed out
2011-08-19 17:39:09,621 Connection 439485392 (cass6:9160) in
ConnectionPool (id = 433491024) failed: timed out
2011-08-19 17:39:09,641 Connection 439481744 (cass6:9160) in
ConnectionPool (id = 433491024) failed: timed out
2011-08-19 17:39:09,646 Connection 439493264 (cass6:9160) in
ConnectionPool (id = 433491024) failed: timed out
2011-08-19 17:39:09,683 Connection 439493136 (cass6:9160) in
ConnectionPool (id = 433491024) failed: timed out


The cluster is fine, no real load. I have 16 nodes, and I can connect
to all of them from the app machine via cassandra-cli.

Here is the connection statement

connection_pool = pycassa.pool.ConnectionPool(keyspace =
'creds', server_list = CASSANDRA_NODE, max_retries = 5, pool_size =
16, prefill = False, timeout=10)

where CASSANDRA_NODE is

CASSANDRA_NODE = ['cass1:9160',
'cass2:9160',
'cass3:9160',
'cass4:9160',
'cass5:9160',
'cass6:9160',
'cass7:9160',
'cass8:9160',
'cass9:9160',
'cass10:9160',
'cass11:9160',
'cass12:9160',
'cass13:9160',
'cass14:9160',
'cass15:9160',
'cass16:9160',
]

I see no connection complaints in the individual nodes' cassandra
log. Turning log level on my app up to DEBUG doesn't show any extra
information. All I get are endless connection failures, although the
nodes seem to be fine and connecting via cli is fine.

It would be really nice if I could get some sort of information about
why these connections are timing out. Any ideas?


Marty Weiner

Aug 19, 2011, 2:34:08 PM
to pycassa...@googlegroups.com
Are your processes distinct processes or are they forked from an
original single process?

pycassa is *not* multi-process-safe. If one process forks itself and
shares the same cass connection, you'll have problems.


Ernst D Schoen-René

Aug 19, 2011, 2:38:01 PM
to pycassa...@googlegroups.com, Marty Weiner
I'm no longer sharing connections across forks. Each child is forked,
THEN it creates the connection.
I believe I've found the issue, which is that my cassandra cluster is
really bogged down. However, what I'd really like to see is some more
verbose output from pycassa about why it's timing out. If I turn on
debug, I still only get the one error message.
It would also be nice to see some actual connection failure errors on
the cassandra node itself, but that's cassandra's problem, not pycassa's.

Marty Weiner

Aug 19, 2011, 4:47:57 PM
to Ernst D Schoen-René, pycassa...@googlegroups.com
Ok great! Do you know why the cluster is bogged down? Have you checked
iostat -xtc 5 for read/write IO?

If reads seem heavy, make sure you're running repair/compaction
regularly.

Ernst D Schoen-René

Aug 19, 2011, 5:28:47 PM
to Marty Weiner, pycassa...@googlegroups.com
The cluster is bogged down because we have 800G / node on 16 nodes, and
it's getting slow.
I'd just like to see some more info about what is happening when a
timeout is thrown. I guess the responsibility for that would be more on
the cassandra end and less on the pycassa end.