Design for high concurrency

Dean

Apr 1, 2011, 1:02:59 PM

to Gremlin-users

Hi all,

We're looking at a Tinkerpop-based stack to build our object graph
database on. It looks very cool. Our application will require high
concurrency - i.e. we need to support many clients accessing the
database at the same time. Since our workload will be read-heavy, I'm
less worried about locking for the moment than I am about ensuring that
the architecture can scale reasonably to handle concurrent requests.

Assuming for the sake of discussion that we're building atop the
OrientDB platform, and referencing the Blueprints API for OrientDB:
http://code.google.com/p/orient/wiki/GraphDatabaseTinkerpop#Work_with_vertexes_and_edges
. We will have one server and multiple worker process clients (with
multiple connections per client) communicating via TCP/IP.
Performance is very important to us, so while REST via ReXster looks
like a great client API, it seems undesirable from a performance
perspective. I understand that Blueprints does not support connection
pooling, which I interpret to mean that it only supports a single
handle to the underlying datastore at a time.

What is the preferred way to support high performance concurrent
requests from remote clients?

Thanks,

Dean

Luca Garulli

Apr 1, 2011, 1:21:27 PM

to gremlin-users, Dean

On 1 April 2011 19:02, Dean <deanw...@gmail.com> wrote:

> Hi all,
>
> We're looking at a Tinkerpop-based stack to build our object graph
> database on. It looks very cool. Our application will require high
> concurrency - i.e. we need to support many clients accessing the
> database at the same time. Since our workload will be read-heavy, I'm
> less worried about locking for the moment than I am about ensuring that
> the architecture can scale reasonably to handle concurrent requests.
>
> Assuming for the sake of discussion that we're building atop the
> OrientDB platform, and referencing the Blueprints API for OrientDB:
> http://code.google.com/p/orient/wiki/GraphDatabaseTinkerpop#Work_with_vertexes_and_edges

Hi dean,

good choice! :-)

> . We will have one server and multiple worker process clients (with
> multiple connections per client) communicating via TCP/IP.
> Performance is very important to us, so while REST via ReXster looks
> like a great client API, it seems undesirable from a performance
> perspective. I understand that Blueprints does not support connection
> pooling, which I interpret to mean that it only supports a single
> handle to the underlying datastore at a time.
>
> What is the preferred way to support high performance concurrent
> requests from remote clients?

Blueprints doesn't cover this, so it's up to the implementation. OrientDB supports remote connections: just use a URL like "remote:localhost/demo", which means remote protocol, localhost as the server, and demo as the database.

Just avoid the AUTOMATIC transaction mode and always set it to MANUAL until this problem is fixed.
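
To make the shape of that URL concrete, here is a minimal sketch that splits "remote:localhost/demo" into the three parts described above. `RemoteUrl` is a hypothetical helper written for illustration only; it is not part of OrientDB or Blueprints, and OrientDB's own URL handling may accept forms this sketch rejects.

```java
// Hypothetical helper, NOT part of OrientDB or Blueprints. It only splits a
// URL of the form "<protocol>:<server>/<database>" into its three parts.
final class RemoteUrl {
    final String protocol;
    final String server;
    final String database;

    private RemoteUrl(String protocol, String server, String database) {
        this.protocol = protocol;
        this.server = server;
        this.database = database;
    }

    static RemoteUrl parse(String url) {
        int colon = url.indexOf(':');
        int slash = url.indexOf('/', colon + 1);
        // Require a non-empty protocol, server, and database name.
        if (colon <= 0 || slash <= colon + 1 || slash == url.length() - 1) {
            throw new IllegalArgumentException(
                "Expected <protocol>:<server>/<database>, got: " + url);
        }
        return new RemoteUrl(url.substring(0, colon),
                             url.substring(colon + 1, slash),
                             url.substring(slash + 1));
    }

    public static void main(String[] args) {
        RemoteUrl u = parse("remote:localhost/demo");
        System.out.println(u.protocol + " | " + u.server + " | " + u.database);
    }
}
```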

> Thanks,
>
> Dean

Lvc@

Dean

Apr 1, 2011, 1:33:20 PM

to Gremlin-users

Hi Luca,

Thanks for your fast response!

That approach was certainly one that I had identified, but seemed to
have the disadvantage in that we're now reliant on a feature exposed
by the underlying datastore. In other words, we lose a lot of the
benefit in Blueprints being datastore-agnostic.

This bit probably belongs in the OrientDB grouplist, but I'll ask it
here given that it relates to my original question: In the OrientDB
remote connection model, it appears that OrientDB supports concurrent
connections. Can you suggest how many connections we should run in
our pool? For example, we've noticed that SQL databases typically do
best with a few dozen handles, while some of the other NoSQL platforms
can handle several thousand without issue.

Cheers,

Dean

Luca Garulli

Apr 1, 2011, 1:46:45 PM

to gremlin-users, Dean

On 1 April 2011 19:33, Dean <deanw...@gmail.com> wrote:

> Hi Luca,
>
> Thanks for your fast response!
>
> That approach was certainly one that I had identified, but seemed to
> have the disadvantage in that we're now reliant on a feature exposed
> by the underlying datastore. In other words, we lose a lot of the
> benefit in Blueprints being datastore-agnostic.

You're still using the Blueprints API throughout, so not a single line of code will be OrientDB-specific.

> This bit probably belongs in the OrientDB grouplist, but I'll ask it
> here given that it relates to my original question: In the OrientDB
> remote connection model, it appears that OrientDB supports concurrent
> connections. Can you suggest how many connections we should run in
> our pool? For example, we've noticed that SQL databases typically do
> best with a few dozen handles, while some of the other NoSQL platforms
> can handle several thousand without issue.

OrientDB, under the hood, shares a single socket among all the OrientGraph instances. It would be nice to support a configurable pool of socket connections, but today it isn't implemented.

> Cheers,
>
> Dean

Lvc@

Marko Rodriguez

Apr 1, 2011, 1:52:26 PM

to gremli...@googlegroups.com

Hi Dean,

> We're looking at a Tinkerpop-based stack to build our object graph
> database on. It looks very cool. Our application will require high
> concurrency - i.e. we need to support many clients accessing the
> database at the same time. Since our workload will be read-heavy, I'm
> less worried about locking for the moment than I am about ensuring that
> the architecture can scale reasonably to handle concurrent requests.

That question should be oriented towards a particular database implementation.

> OrientDB platform, and referencing the Blueprints API for OrientDB:
> http://code.google.com/p/orient/wiki/GraphDatabaseTinkerpop#Work_with_vertexes_and_edges
> . We will have one server and multiple worker process clients (with
> multiple connections per client) communicating via TCP/IP.
> Performance is very important to us, so while REST via ReXster looks
> like a great client API, it seems undesirable from a performance
> perspective. I understand that Blueprints does not support connection
> pooling, which I interpret to mean that it only supports a single
> handle to the underlying datastore at a time.

I'm not familiar with the inner workings of OrientDB. However, in Neo4j, for example, there is no notion of connection pools. As long as you have a reference to the DB, you can query it, and it doesn't matter if other threads are doing the same thing. As such, Neo4j through Rexster is concurrent, as Rexster has non-blocking reads.
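
The "one shared reference, many reader threads" model can be sketched with nothing but the Java standard library. This is a toy illustration under stated assumptions: the immutable `GRAPH` map stands in for an embedded graph database handle, and `outDegree` stands in for a read query. None of these names come from Neo4j, Blueprints, or Rexster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedGraphReads {
    // Toy read-only graph: vertex id -> adjacent vertex ids. It is immutable
    // after construction, so every thread can share this one reference
    // without locks or a connection pool.
    static final Map<Integer, List<Integer>> GRAPH = Map.of(
        1, List.of(2, 3),
        2, List.of(3),
        3, List.of());

    // Stand-in for a read query against the shared handle.
    static int outDegree(int vertex) {
        return GRAPH.getOrDefault(vertex, List.of()).size();
    }

    // Runs `readers` concurrent reads of vertex 1 and sums their results.
    static int concurrentReads(int readers) {
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < readers; i++) {
                results.add(pool.submit(() -> outDegree(1)));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("concurrent read failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(concurrentReads(8));
    }
}
```

Whether a real datastore handle is safe to share like this depends on the implementation, which is why the question has to go to each database's own list.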

> What is the preferred way to support high performance concurrent
> requests from remote clients?

As for a non-REST API: I know that in OrientDB you can make a "remote:/" connection, and in Neo4j you can pass in a RemoteGraphDatabase, which I believe is RMI-based.

Finally, where do you plan to do your computations: on the database machine or on the client side? If you do them on the database machine, then using Rexster is not a bad idea, because you simply extend Rexster with Traversals written in Java [ https://github.com/tinkerpop/rexster/wiki/Traversals ]. Thus, the computation stays close to the data, and the results of the computation are returned over REST. If you plan to do them client side, then you are going to be pulling lots of data over the wire, and that's not very efficient. However, if you just need to read explicit data from the database (no computations), then yes, you can get better performance using "remote:/" or RemoteGraphDatabase. E.g.

new OrientGraph("remote:/host/database")

new Neo4jGraph(new RemoteGraphDatabaseService(??I forget the API??));
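
The locality trade-off above can be made concrete with a toy model that counts network round trips. Everything here is an assumption for illustration: `fetchNeighbours` stands in for one remote call, `roundTrips` is a plain counter, and the four-vertex graph is invented. It does not use the Rexster, OrientDB, or Neo4j APIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TraversalLocality {
    // Toy graph: vertex id -> adjacent vertex ids.
    static final Map<Integer, List<Integer>> GRAPH = Map.of(
        1, List.of(2, 3),
        2, List.of(4),
        3, List.of(4),
        4, List.of());

    // Counts simulated network requests.
    static int roundTrips = 0;

    // Stands in for one remote request that fetches a vertex's neighbours.
    static List<Integer> fetchNeighbours(int vertex) {
        roundTrips++;
        return GRAPH.get(vertex);
    }

    // Client-side traversal: each newly visited vertex costs one round trip.
    static Set<Integer> reachableClientSide(int start) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            int v = queue.poll();
            if (seen.add(v)) {
                queue.addAll(fetchNeighbours(v));
            }
        }
        return seen;
    }

    // Server-side traversal: one request, and the walk runs next to the data.
    static Set<Integer> reachableServerSide(int start) {
        roundTrips++;
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            int v = queue.poll();
            if (seen.add(v)) {
                queue.addAll(GRAPH.get(v));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        reachableClientSide(1);
        System.out.println("client-side round trips: " + roundTrips);
        roundTrips = 0;
        reachableServerSide(1);
        System.out.println("server-side round trips: " + roundTrips);
    }
}
```

On this graph the client-side walk makes four simulated requests and the server-side walk makes one. Real costs also depend on payload size and batching, which this sketch ignores.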

Hope that helps,

Marko.

http://markorodriguez.com

Dean

Apr 5, 2011, 10:39:04 AM

to Gremlin-users

Hi Marko,

Thanks for your response. I'll dig into the technical details of
connection concurrency on the OrientDB and Neo4j mailing lists.

You bring up an excellent question regarding locality of data for
computation. Our stack is primarily NoSQL of some type or another, so
we've become used to doing the bulk of the heavy lifting client side.
The ability to do traversals server-side with Rexster is certainly
interesting, though it may exacerbate the scaling challenges already
seemingly inherent in most graph databases.

It feels like it's time to do some benchmarking...
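
A starting point for that kind of benchmark could look like the sketch below: a small harness that runs the same operation from a varying number of threads and reports throughput. `ConcurrencyBench` and `placeholderRead` are hypothetical names; the placeholder would need to be replaced with a real read against the datastore under test, and a serious benchmark would also need warm-up runs and repeated measurements.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

final class ConcurrencyBench {
    // Runs `task` `opsPerThread` times on each of `threads` workers and
    // returns the total number of completed operations.
    static long run(int threads, int opsPerThread, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong completed = new AtomicLong();
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    for (int i = 0; i < opsPerThread; i++) {
                        task.run();
                        completed.incrementAndGet();
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await(); // wait until every worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        } finally {
            pool.shutdown();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // Placeholder workload; swap in a real graph read to measure a datastore.
        Runnable placeholderRead = () -> Math.sqrt(42.0);
        for (int threads : new int[] {1, 4, 16}) {
            long start = System.nanoTime();
            long ops = run(threads, 10_000, placeholderRead);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d threads: %.0f ops/s%n", threads, ops / seconds);
        }
    }
}
```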

Cheers,

Dean
