python-driver seems rather slow when connecting to cluster (~500ms)


SuperSilveron

Mar 11, 2014, 9:50:03 AM
to python-dr...@lists.datastax.com
Hi,

Connecting to a cluster seems rather slow. On localhost it takes ~100ms, and on my production cluster it takes ~500ms.
Average ping to the hosts is around 8ms. Is this normal behavior? Or is there something I can do about it?
It seems that both 100ms and 500ms are too much.

Cassandra version: 2.0.4
python-driver version: 1.0.2

Test script (I have manually replaced the actual IP addresses with either ipadres1 or ipadres2):

import time

import cassandra.io.libevreactor
from cassandra.cluster import Cluster

start = time.time()  # time the full connect/shutdown cycle
cluster = Cluster(
    ["ipadres1", "ipadres2"],
    connection_class=cassandra.io.libevreactor.LibevConnection,
)
db = cluster.connect(keyspace="service")
cluster.shutdown()
print("%s seconds" % (time.time() - start))

Result:

Host ipadres1 is now marked up
Host ipadres2 is now marked up
[control connection] Opening new connection to ipadres2
Sending initial options message for new connection (41171984) to ipadres2
Starting libev event loop
Received options response on new connection (41171984) from ipadres2
Got ReadyMessage on new connection (41171984) from ipadres2
[control connection] Established new connection <LibevConnection(41171984) ipadres2:9042>, registering watchers and refreshing schema and topology
[control connection] Refreshing node list and token map
[control connection] Fetched ring info, rebuilding metadata
[control connection] Waiting for schema agreement
[control connection] Schemas match
[control connection] Fetched schema, rebuilding metadata
Control connection created
Initializing new connection pool for host ipadres1
Initializing new connection pool for host ipadres2
Sending initial options message for new connection (41185936) to ipadres2
Sending initial options message for new connection (40306320) to ipadres1
Received options response on new connection (41185936) from ipadres2
Got ReadyMessage on new connection (41185936) from ipadres2
Received options response on new connection (40306320) from ipadres1
Got ReadyMessage on new connection (40306320) from ipadres1
Sending initial options message for new connection (40421520) to ipadres2
Received options response on new connection (40421520) from ipadres2
Got ReadyMessage on new connection (40421520) from ipadres2
Finished initializing new connection pool for host ipadres2
Added pool for host ipadres2 to session
Sending initial options message for new connection (41185680) to ipadres1
Received options response on new connection (41185680) from ipadres1
Got ReadyMessage on new connection (41185680) from ipadres1
Finished initializing new connection pool for host ipadres1
Added pool for host ipadres1 to session
Shutting down Cluster Scheduler
Closing connection (41171984) to ipadres2
Closing connection (40306320) to ipadres1
Not executing scheduled task due to Scheduler shutdown
Closing connection (41185680) to ipadres1
Closing connection (41185936) to ipadres2
Closing connection (40421520) to ipadres2
All Connections currently closed, event loop ended
0.407905101776 seconds

Tyler Hobbs

Mar 11, 2014, 2:08:08 PM
to python-dr...@lists.datastax.com
There are a few points here:
  • Clusters and Sessions are meant to be long-lived and reused, so in general the driver expects this to be a one-time cost.
  • The size of your cluster and the number of keyspaces and tables that you have defined will have a large effect on how long connect() takes.  On startup, the driver processes the entire schema and the ring topology. Parts of this could be optimized by the driver, and a lot of it could be done asynchronously, but that hasn't been done yet.
  • If you want to minimize startup time, call cluster.set_core_connections_per_host(cassandra.policies.HostDistance.LOCAL, 1) before you call connect().
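
Put together, a minimal sketch (reusing the placeholder addresses from your test script):

from cassandra.cluster import Cluster
from cassandra.policies import HostDistance

cluster = Cluster(["ipadres1", "ipadres2"])
# Only one core connection per host; must be set before connect().
cluster.set_core_connections_per_host(HostDistance.LOCAL, 1)
session = cluster.connect(keyspace="service")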

Hopefully that helps!  As I mentioned, the startup process will likely get some optimizations in coming releases.






--
Tyler Hobbs
DataStax

SuperSilveron .

Mar 12, 2014, 8:56:03 AM
to python-dr...@lists.datastax.com
First of all, many thanks for the (fast) response :)
Setting the core connections per host significantly reduced the connection time, to ~200ms.

I will look into setting up a connection pool for the driver so the connection stays persistent. This isn't implemented in the current python-driver, right? Do you have any tips? I was thinking about using SQLAlchemy and writing a custom connection pool on top of it.

Thanks in advance!

Ron

Tyler Hobbs

Mar 12, 2014, 12:22:48 PM
to python-dr...@lists.datastax.com
On Wed, Mar 12, 2014 at 7:56 AM, SuperSilveron . <ronvan...@gmail.com> wrote:
Setting the core connections per host significantly reduced the connection time, to ~200ms.

Good to hear.  I also thought of one other thing you can tweak that should help if you have more than two nodes.  When the Session is created, it opens connections to nodes in parallel, but by default it uses a threadpool of size 2, so only two hosts are connected to at once.  You can increase the size of this threadpool through the 'executor_threads' kwarg of the Cluster constructor (e.g. cluster = Cluster(..., executor_threads=4)).
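
Combined with the core-connections tweak from before, something like this (placeholder addresses again):

from cassandra.cluster import Cluster
from cassandra.policies import HostDistance

# Four executor threads let the Session initialize connection pools
# for up to four hosts at once instead of the default two.
cluster = Cluster(["ipadres1", "ipadres2"], executor_threads=4)
cluster.set_core_connections_per_host(HostDistance.LOCAL, 1)
session = cluster.connect(keyspace="service")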

Maybe I'll start a FAQ with this as the first entry :)
 

I will look into setting up a connection pool for the driver so the connection stays persistent. This isn't implemented in the current python-driver, right? Do you have any tips? I was thinking about using SQLAlchemy and writing a custom connection pool on top of it.

The driver itself already maintains multiple connection pools.  I don't think a SQLAlchemy connection pool is what you want.

I would have to know more about your application stack to suggest anything.  For example, if you were running a webserver (say, with Django), I would suggest using gunicorn with the pre-fork worker model.  This would give you long-lived workers where the Cluster and Sessions can be created once and then reused many times.


--
Tyler Hobbs
DataStax

tommaso barbugli

Mar 12, 2014, 12:41:16 PM
to python-dr...@lists.datastax.com
Hi Tyler,
Perhaps I did not understand your suggestion about pre-fork; I thought it was not safe to share instances of Cluster and Session across different processes.

Cheers,
Tommaso



Tyler Hobbs

Mar 12, 2014, 1:20:31 PM
to python-dr...@lists.datastax.com

On Wed, Mar 12, 2014 at 11:41 AM, tommaso barbugli <tbar...@gmail.com> wrote:
Perhaps I did not understand your suggestion about pre-fork; I thought it was not safe to share instances of Cluster and Session across different processes.

You're correct, it's not safe to share Clusters or Sessions across processes.  You would need to create those objects post-fork (and then reuse them).
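
As a rough sketch with the standard library's multiprocessing (the query is only illustrative):

import multiprocessing

from cassandra.cluster import Cluster

def worker():
    # Created *after* the fork, inside the child process, and then
    # reused for the lifetime of that process.
    cluster = Cluster(["ipadres1", "ipadres2"])
    session = cluster.connect(keyspace="service")
    for row in session.execute("SELECT release_version FROM system.local"):
        print(row)
    cluster.shutdown()

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()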


--
Tyler Hobbs
DataStax

SuperSilveron .

Mar 13, 2014, 11:06:03 AM
to python-dr...@lists.datastax.com
Currently we are using mod_wsgi as the webserver, Flask as the framework, and python-driver to connect to our Cassandra cluster. Making one connection per worker seems like a good approach.

We are prepared to switch to Gunicorn if necessary/easier. But I'm a little confused: Gunicorn states that it uses a 'pre-fork worker model', yet sharing isn't safe and we need 'post-fork'?

So, regardless of software choice, what do you think is the best setup for this use case? Or did you suggest using Gunicorn with:

Thanks in advance,

Ron

Tyler Hobbs

Mar 18, 2014, 12:40:23 PM
to python-dr...@lists.datastax.com

On Thu, Mar 13, 2014 at 10:06 AM, SuperSilveron . <ronvan...@gmail.com> wrote:

We are prepared to switch to Gunicorn if necessary/easier. But I'm a little confused: Gunicorn states that it uses a 'pre-fork worker model', yet sharing isn't safe and we need 'post-fork'?

So, regardless of software choice, what do you think is the best setup for this use case? Or did you suggest using Gunicorn with:

Yes, I would use the pre-fork worker model and set up the Cluster and Session in the post_fork() hook.  Definitely a little confusing.
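
Roughly like this in the gunicorn config file (a sketch; "myapp.db" is a hypothetical module where the application keeps its Cassandra handles):

# gunicorn.conf.py
from cassandra.cluster import Cluster

def post_fork(server, worker):
    # Runs in each worker process right after the master forks it, so
    # every worker creates its own Cluster/Session and then reuses them.
    import myapp.db  # hypothetical module holding the shared handles
    myapp.db.cluster = Cluster(["ipadres1", "ipadres2"])
    myapp.db.session = myapp.db.cluster.connect(keyspace="service")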


--
Tyler Hobbs
DataStax

SuperSilveron .

Mar 19, 2014, 4:37:45 AM
to python-dr...@lists.datastax.com
Thank you, this was very helpful


