3 node cluster, seeing very heavy loads on one of the nodes???

stantonk

Apr 9, 2012, 1:45:06 PM

to pycassa...@googlegroups.com

I have a 3-node cluster set up with the RandomPlacement strategy. We run a weekly job that produces heavy load for several hours while it reads data for all our users and produces summary data. When that happens, I'm seeing a load average of ~10 on one of our nodes, and only 2-3 on the other two.

We are using PyCassa as our client to read all of the data to generate the reports, and I'm wondering if there is some way PyCassa could be the culprit? I looked at the source and saw the random.shuffle() call, so it doesn't seem like that would be the issue.
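For context on why the shuffle matters, here is a simplified sketch of the general idea (shuffle the server list once, then hand out servers round-robin). It is not pycassa's actual code; `make_server_cycle` is a hypothetical name used only for illustration.

```python
import itertools
import random
from collections import Counter


def make_server_cycle(server_list, rng=random):
    """Shuffle a copy of the server list, then cycle through it forever.

    Hypothetical helper illustrating shuffle-then-round-robin; the real
    client may differ in detail.
    """
    servers = list(server_list)
    rng.shuffle(servers)
    return itertools.cycle(servers)


servers = ['10.0.1.2:9160', '10.0.1.3:9160', '10.0.1.4:9160']

# Simulate many short-lived pools that each open a single connection.
first_picks = Counter()
for _ in range(300):
    cycle = make_server_cycle(servers)
    first_picks[next(cycle)] += 1

# Counts vary from run to run, but no server is systematically favored.
print(first_picks)
```

Because each new pool starts from a freshly shuffled list, even pools that open only one connection should spread their connections roughly evenly across the three nodes.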

Provided PyCassa (or the way in which we're using it) is not the culprit, that leaves me with a couple other ideas:

1) The data isn't evenly distributed across the nodes?

Running "nodetool ring" produces the following output:

Address     DC           Rack   Status  State   Load      Owns    Token
                                                                  113427455640312821154458202477256070485
10.0.1.2    datacenter1  rack1  Up      Normal  25.77 GB  33.33%  0
10.0.1.3    datacenter1  rack1  Up      Normal  24.76 GB  33.33%  56713727820156410577229101238628035242
10.0.1.4    datacenter1  rack1  Up      Normal  23.74 GB  33.33%  113427455640312821154458202477256070485

So, at a high level at least, this doesn't seem to be the issue. The data is written using a row-key of user_id, and the data sets being pulled back for each user_id should be about the same size.
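As a quick cross-check of the tokens above, evenly spaced tokens for RandomPartitioner (token range 0 to 2**127) can be computed as i * 2**127 / N. A minimal sketch; `balanced_tokens` is a hypothetical helper name:

```python
def balanced_tokens(node_count, ring_size=2 ** 127):
    """Return evenly spaced initial tokens for a RandomPartitioner ring."""
    return [i * ring_size // node_count for i in range(node_count)]


for token in balanced_tokens(3):
    print(token)
```

For three nodes this gives 0, 56713727820156410577229101238628035242 and 113427455640312821154458202477256070485, which match the ring output, so ownership really is split into equal thirds.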

2) The seed node in a cluster naturally sees more load?

Though I'm not entirely sure why. But the node that is experiencing a load of 10 is the seed node.

Can anyone shed some light on this? Is it a concern?

Thanks,

Kevin

Tyler Hobbs

Apr 9, 2012, 9:25:11 PM

to pycassa...@googlegroups.com

What parameters are you using when creating the connection pool?

--
Tyler Hobbs
DataStax

stantonk

Apr 10, 2012, 12:18:24 PM

to pycassa...@googlegroups.com

Sorry, I mistyped in my original post: we're using RandomPartitioner (not RandomPlacement ;-)), with SimpleStrategy as our placement strategy and an RF of 3 (and 3 nodes total).

Here's how we're creating the connection pool. We wrote a wrapper class around PyCassa's ColumnFamily class to handle creation of the connection pool and to seamlessly handle cases where we are running unit tests locally and might not have connectivity to a Cassandra cluster.

import sys

import pycassa
from pycassa.columnfamily import ColumnFamily

# `settings` is our application's own configuration module (import not shown).


class InternalColumnFamily(ColumnFamily):

    def __init__(self, **kwargs):
        if 'test' in sys.argv:
            # lame attempt at not requiring a local cassandra instance when running tests
            pool = pycassa.ConnectionPool('our_test_ks',
                                          server_list=('127.0.0.1:9160',),
                                          timeout=0,
                                          credentials=settings.TEST_CASSANDRA_CREDENTIALS,
                                          prefill=False)
        else:
            pool = pycassa.ConnectionPool('our_ks',
                                          server_list=('10.0.1.2:9160', '10.0.1.3:9160', '10.0.1.4:9160'),
                                          timeout=5,
                                          credentials=settings.CASSANDRA_CREDENTIALS)
        # self.column_family is a class attribute set by each subclass
        super(InternalColumnFamily, self).__init__(pool, self.column_family, **kwargs)

The way we use this is to subclass InternalColumnFamily for each column family we use, e.g.:

class StuffColumnFamily(InternalColumnFamily):

    column_family = 'stuff'

    def __init__(self, **kwargs):
        super(StuffColumnFamily, self).__init__(**kwargs)
        self.default_validation_class = JsonType()  # custom json serializer/deserializer
        self.column_name_class = DateType()

We instantiate our subclassed ColumnFamily objects as model objects in a web API, so a pool is created for each API request...

Hopefully we're just doing something obvious... :-P


Tyler Hobbs

Apr 10, 2012, 11:52:21 PM

to pycassa...@googlegroups.com

If you're creating a new pool per request, set prefill=False for the ConnectionPool. If you're doing quite a few operations with each pool, use a pool_size of 3; otherwise, just 1 will probably do it.
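A minimal sketch of that advice, assuming the pool really is created once per request. `per_request_pool_options` is a hypothetical helper, and the `many_operations` flag is only a stand-in for "quite a few operations"; `prefill` and `pool_size` are the ConnectionPool keyword arguments mentioned above.

```python
def per_request_pool_options(many_operations):
    """Choose ConnectionPool keyword arguments for a short-lived pool.

    Hypothetical helper: prefill=False avoids opening connections that
    a short-lived pool will never use; pool_size stays small.
    """
    return {
        'prefill': False,
        'pool_size': 3 if many_operations else 1,
    }


options = per_request_pool_options(many_operations=False)
print(options)

# Intended use (requires pycassa and a reachable cluster, so left commented):
# pool = pycassa.ConnectionPool('our_ks',
#                               server_list=('10.0.1.2:9160', '10.0.1.3:9160', '10.0.1.4:9160'),
#                               timeout=5,
#                               credentials=settings.CASSANDRA_CREDENTIALS,
#                               **options)
```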

Besides just the load on the nodes, what do the CPU and disk utilization look like? Running 'iostat -x 5' is a decent way to watch these; a full monitoring system like OpsCenter is better if you can use it. The load average alone isn't always a reliable indicator of how busy a node really is.

Hopefully we're just doing something obvious... :-P

Nothing else obvious here to me. Given that your RF equals the number of nodes, the only thing that would normally make a difference in the load is the number of requests that it's acting as a coordinator for.
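To illustrate why RF = 3 on 3 nodes means every node holds every row, here is a rough sketch of SimpleStrategy-style placement on the ring from the nodetool output. The `key_token` function only approximates how RandomPartitioner derives a token from a row key (an assumption for illustration), and the helper names are hypothetical.

```python
import bisect
import hashlib

# Tokens and addresses taken from the "nodetool ring" output in this thread.
RING = {
    0: '10.0.1.2',
    56713727820156410577229101238628035242: '10.0.1.3',
    113427455640312821154458202477256070485: '10.0.1.4',
}
TOKENS = sorted(RING)


def key_token(row_key):
    """Approximate a RandomPartitioner token: MD5 reduced to [0, 2**127)."""
    digest = hashlib.md5(row_key.encode('utf-8')).hexdigest()
    return int(digest, 16) % (2 ** 127)


def replicas(row_key, replication_factor):
    """Walk the ring clockwise from the key's token, one replica per node."""
    start = bisect.bisect_left(TOKENS, key_token(row_key)) % len(TOKENS)
    count = min(replication_factor, len(TOKENS))
    return [RING[TOKENS[(start + i) % len(TOKENS)]] for i in range(count)]


for user_id in ('user_1', 'user_2', 'user_3'):
    print(user_id, sorted(replicas(user_id, 3)))
```

Whatever the key, the replica list with RF = 3 contains all three nodes, so reads can be served by any of them and the remaining variable is which node coordinates each request.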

--
Tyler Hobbs
DataStax


stantonk

Apr 25, 2012, 2:23:07 PM

to pycassa...@googlegroups.com

Turns out it was likely Astyanax causing the load imbalances amongst the nodes. Our code there was not using the latency-aware stuff. Since adding that, we've seen loads normalize across our nodes.

Thanks for your help!
