python-driver seems rather slow when connecting to cluster (~500ms)


SuperSilveron

Mar 11, 2014, 9:50:03 AM
to python-dr...@lists.datastax.com
Hi,

Connecting to a cluster seems rather slow. On localhost it takes ~100ms, and on my production cluster it takes ~500ms.
Average ping to the hosts is around 8ms. Is this normal behavior? Or is there something I can do about it?
It seems that both 100ms and 500ms are too much.

Cassandra version: 2.0.4
python-driver version: 1.0.2

Test script (I have manually replaced the actual IP addresses with either ipadres1 or ipadres2):

import time

import cassandra.io.libevreactor
from cassandra.cluster import Cluster

start = time.time()  # time the full connect/shutdown cycle
cluster = Cluster(
    ["ipadres1", "ipadres2"],
    connection_class=cassandra.io.libevreactor.LibevConnection,
)
db = cluster.connect(keyspace="service")
cluster.shutdown()
print("%s seconds" % (time.time() - start))

Result:

Host ipadres1 is now marked up
Host ipadres2 is now marked up
[control connection] Opening new connection to ipadres2
Sending initial options message for new connection (41171984) to ipadres2
Starting libev event loop
Received options response on new connection (41171984) from ipadres2
Got ReadyMessage on new connection (41171984) from ipadres2
[control connection] Established new connection <LibevConnection(41171984) ipadres2:9042>, registering watchers and refreshing schema and topology
[control connection] Refreshing node list and token map
[control connection] Fetched ring info, rebuilding metadata
[control connection] Waiting for schema agreement
[control connection] Schemas match
[control connection] Fetched schema, rebuilding metadata
Control connection created
Initializing new connection pool for host ipadres1
Initializing new connection pool for host ipadres2
Sending initial options message for new connection (41185936) to ipadres2
Sending initial options message for new connection (40306320) to ipadres1
Received options response on new connection (41185936) from ipadres2
Got ReadyMessage on new connection (41185936) from ipadres2
Received options response on new connection (40306320) from ipadres1
Got ReadyMessage on new connection (40306320) from ipadres1
Sending initial options message for new connection (40421520) to ipadres2
Received options response on new connection (40421520) from ipadres2
Got ReadyMessage on new connection (40421520) from ipadres2
Finished initializing new connection pool for host ipadres2
Added pool for host ipadres2 to session
Sending initial options message for new connection (41185680) to ipadres1
Received options response on new connection (41185680) from ipadres1
Got ReadyMessage on new connection (41185680) from ipadres1
Finished initializing new connection pool for host ipadres1
Added pool for host ipadres1 to session
Shutting down Cluster Scheduler
Closing connection (41171984) to ipadres2
Closing connection (40306320) to ipadres1
Not executing scheduled task due to Scheduler shutdown
Closing connection (41185680) to ipadres1
Closing connection (41185936) to ipadres2
Closing connection (40421520) to ipadres2
All Connections currently closed, event loop ended
0.407905101776 seconds

Tyler Hobbs

Mar 11, 2014, 2:08:08 PM
to python-dr...@lists.datastax.com
There are a few points here:
  • Clusters and Sessions are meant to be long-lived and reused, so in general the driver expects this to be a one-time cost.
  • The size of your cluster and the number of keyspaces and tables that you have defined will have a large effect on how long connect() takes.  On startup, the driver processes the entire schema and the ring topology. Parts of this could be optimized by the driver, and a lot of it could be done asynchronously, but that hasn't been done yet.
  • If you want to minimize startup time, call cluster.set_core_connections_per_host(cassandra.policies.HostDistance.LOCAL, 1) before you call connect().
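
Put together, a minimal sketch (reusing the placeholder addresses from your test script):

from cassandra.cluster import Cluster
from cassandra.policies import HostDistance

cluster = Cluster(["ipadres1", "ipadres2"])
# Only one core connection per host; must be set before connect().
cluster.set_core_connections_per_host(HostDistance.LOCAL, 1)
session = cluster.connect(keyspace="service")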

Hopefully that helps!  As I mentioned, the startup process will likely get some optimizations in coming releases.






--
Tyler Hobbs
DataStax

SuperSilveron .

Mar 12, 2014, 8:56:03 AM
to python-dr...@lists.datastax.com
First of all, many thanks for the (fast) response :)
Setting the core connections per host significantly reduced the connection time, to ~200ms.

I will look into setting up a connection pool for the driver so the connection stays persistent. This isn't implemented in the current python-driver, right? Do you have any tips? I was thinking about using SQLAlchemy and writing a custom connection pool on top of it.

Thanks in advance!

Ron

Tyler Hobbs

Mar 12, 2014, 12:22:48 PM
to python-dr...@lists.datastax.com
On Wed, Mar 12, 2014 at 7:56 AM, SuperSilveron . <ronvan...@gmail.com> wrote:
Setting the core connections per host significantly reduced the connection time, to ~200ms.

Good to hear.  I also thought of one other thing you can tweak that should help if you have more than two nodes.  When the Session is created, it opens connections to nodes in parallel, but by default it uses a threadpool of size 2, so only two hosts are connected to at once.  You can increase the size of this threadpool through the 'executor_threads' kwarg of the Cluster constructor (e.g. cluster = Cluster(..., executor_threads=4)).
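
Combined with the core-connections tweak from before, something like this (placeholder addresses again):

from cassandra.cluster import Cluster
from cassandra.policies import HostDistance

# Four executor threads let the Session initialize connection pools
# for up to four hosts at once instead of the default two.
cluster = Cluster(["ipadres1", "ipadres2"], executor_threads=4)
cluster.set_core_connections_per_host(HostDistance.LOCAL, 1)
session = cluster.connect(keyspace="service")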

Maybe I'll start a FAQ with this as the first entry :)
 

I will look into setting up a connection pool for the driver so the connection stays persistent. This isn't implemented in the current python-driver, right? Do you have any tips? I was thinking about using SQLAlchemy and writing a custom connection pool on top of it.

The driver itself already maintains multiple connection pools.  I don't think a SQLAlchemy connection pool is what you want.

I would have to know more about your application stack to suggest anything.  For example, if you were running a webserver (say, with Django), I would suggest using gunicorn with the pre-fork worker model.  This would give you long-lived workers where the Cluster and Sessions can be created once and then reused many times.


--
Tyler Hobbs
DataStax

tommaso barbugli

Mar 12, 2014, 12:41:16 PM
to python-dr...@lists.datastax.com
Hi Tyler,
Perhaps I did not understand your suggestion about pre-fork; I thought it was not safe to share instances of Cluster and Session across different processes.

Cheers,
Tommaso



Tyler Hobbs

Mar 12, 2014, 1:20:31 PM
to python-dr...@lists.datastax.com

On Wed, Mar 12, 2014 at 11:41 AM, tommaso barbugli <tbar...@gmail.com> wrote:
Perhaps I did not understand your suggestion about pre-fork; I thought it was not safe to share instances of Cluster and Session across different processes.

You're correct, it's not safe to share Clusters or Sessions across processes.  You would need to create those objects post-fork (and then reuse them).
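
As a rough sketch with the standard library's multiprocessing (the query is only illustrative):

import multiprocessing

from cassandra.cluster import Cluster

def worker():
    # Created *after* the fork, inside the child process, and then
    # reused for the lifetime of that process.
    cluster = Cluster(["ipadres1", "ipadres2"])
    session = cluster.connect(keyspace="service")
    for row in session.execute("SELECT release_version FROM system.local"):
        print(row)
    cluster.shutdown()

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()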


--
Tyler Hobbs
DataStax

SuperSilveron .

Mar 13, 2014, 11:06:03 AM
to python-dr...@lists.datastax.com
Currently we are using mod_wsgi as the webserver, Flask as the framework, and python-driver to connect to our Cassandra cluster. Making one connection per worker seems like a good approach.

We are prepared to switch to Gunicorn if necessary/easier. But I'm a little confused: Gunicorn states that it uses a 'pre-fork worker model', yet sharing isn't safe and we need 'post-fork'?

So, regardless of software choice, what do you think is the best setup for this use case? Or did you suggest using Gunicorn with:

Thanks in advance,

Ron

Tyler Hobbs

Mar 18, 2014, 12:40:23 PM
to python-dr...@lists.datastax.com

On Thu, Mar 13, 2014 at 10:06 AM, SuperSilveron . <ronvan...@gmail.com> wrote:

We are prepared to switch to Gunicorn if necessary/easier. But I'm a little confused: Gunicorn states that it uses a 'pre-fork worker model', yet sharing isn't safe and we need 'post-fork'?

So, regardless of software choice, what do you think is the best setup for this use case? Or did you suggest using Gunicorn with:

Yes, I would use the pre-fork worker model and set up the Cluster and Session in the post_fork() hook.  Definitely a little confusing.
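
Roughly like this in the gunicorn config file (a sketch; "myapp.db" is a hypothetical module where the application keeps its Cassandra handles):

# gunicorn.conf.py
from cassandra.cluster import Cluster

def post_fork(server, worker):
    # Runs in each worker process right after the master forks it, so
    # every worker creates its own Cluster/Session and then reuses them.
    import myapp.db  # hypothetical module holding the shared handles
    myapp.db.cluster = Cluster(["ipadres1", "ipadres2"])
    myapp.db.session = myapp.db.cluster.connect(keyspace="service")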


--
Tyler Hobbs
DataStax

SuperSilveron .

Mar 19, 2014, 4:37:45 AM
to python-dr...@lists.datastax.com
Thank you, this was very helpful


