Pooling multiple EmbeddedGraphDatabase objects in an ObjectPool


Badri Janakiraman

May 18, 2012, 7:45:28 PM
to Neo4j
Hello

I believe a similar sort of question was asked in this thread about
multi-tenancy, https://groups.google.com/group/neo4j/browse_thread/thread/fc03f9fe24e7cf0f/a425e67e1c5410db?lnk=gst&q=tenancy#
- however, I couldn't find anything on that thread about this specific
approach.

We are trying to offer a cloud based service, where each user gets
assigned a dataset - and the application loads data into it, in the
background, and serves various analytics and reports off of it in the
foreground. We were wondering if it would be possible to implement
this by having a single java web-application and a pool of
EmbeddedGraphDatabase instances stored in an object pool. That way,
each request could checkout the database instance it wanted to, run
its reports and check it back in.

Does this approach sound feasible? Is there any reason one should be
worried about having a pool of databases and working with different
ones in each thread running through the application?

Specifically, the things we are worried about are: are there global
pieces of data that are not scoped to a database instance? Are
there any thread-locals or in-memory caching structures that won't get
cleared out? Are these a small, finite and well-known number that can
be managed, or are we treading sharky waters here?

Is this usage pattern something someone else here has attempted before?

Jim Webber

May 19, 2012, 7:51:37 AM
to ne...@googlegroups.com
Hi Badri,

The blocker for this approach is probably supplying configuration to the database instances. That is, each JVM gets some properties for things like cache size/type and I worry whether it's practical/possible to detangle the properties to be targeted at each internal instance - not to mention that's not a configuration we perf test on.

When folks want multiple databases, it's generally in multiple JVMs (and perhaps using neo4j server). If you definitely want multitenancy within a single instance of your product, then you'll likely be better off implementing it within your app.

Jim

PS - I'd also remind you that if you're embedding Neo4j for resale within products (OEM), then none of the open source licenses apply.

Badri Janakiraman

May 21, 2012, 1:15:28 PM
to Neo4j
Thanks a bunch for the advice, Jim. Just so I understand, your
concerns would be twofold

- whether it is practical to supply db configuration information per
object
- whether something like this would perform

and your recommendation is multiple JVMs or building multi-tenancy
into the application domain.

About the licensing issues, this is not something we will be bundling
and shipping out with our products. This is a new service that we are
building, which will be a SaaS offering, pricing and details not
decided yet. We will certainly not be redistributing Neo4J in part or
in total.

- b

Jim Webber

May 21, 2012, 5:55:43 PM
to ne...@googlegroups.com
Hey Badri,


> - whether it is practical to supply db configuration information per
> object

Per instance of GraphDatabaseService.

> - whether something like this would perform

Yeah, because of the above. I dunno what the implications are, but the kernel team folks would.

> and your recommendation is multiple JVMs or building multi-tenancy
> into the application domain.

Yes, unless the kernel team can contradict that advice.

> About the licensing issues, this is not something we will be bundling
> and shipping out with our products. This is a new service that we are
> building, which will be a SaaS offering, pricing and details not
> decided yet. We will certainly not be redistributing Neo4J in part or
> in total.

Just keeping you nice-and-legal mate. Pricing etc not the issue since OSS is not about price but provenance.

Jim

Michael Hunger

May 21, 2012, 6:23:35 PM
to ne...@googlegroups.com
What is the size of the dataset that is loaded into these graph databases?

And how free are the customers to query and analyze their data? Will that happen through an API / UI that you provide or can they drop down to the raw database level?
If that is the case you might fare better by implementing a simple multi-tenancy on top of neo4j as Jim outlined.

In general, if you know estimates of the dataset sizes, you can pass the memory-mapped I/O config to each database. It gets trickier with cache sizes, as they try to consume enough heap to work efficiently.
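
As a rough illustration of passing a memory-mapped I/O config per database, here is a minimal sketch that builds a separate settings map for each dataset. The three keys are the Neo4j 1.x mapped-memory settings for the node, relationship and property stores. The helper name `forBudget`, the 20/40/40 split and the budget figures are assumptions made up for this example, not recommendations. The string and array store settings are left out, and cache sizing is not addressed at all.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MappedMemoryConfig {
    // Neo4j 1.x mapped-memory settings, one per store file.
    static final String NODES = "neostore.nodestore.db.mapped_memory";
    static final String RELS = "neostore.relationshipstore.db.mapped_memory";
    static final String PROPS = "neostore.propertystore.db.mapped_memory";

    /**
     * Hypothetical helper: splits a per-dataset budget (in MB) across the
     * three stores. The 20/40/40 split is an assumption for illustration.
     */
    static Map<String, String> forBudget(int budgetMb) {
        if (budgetMb < 5) {
            throw new IllegalArgumentException("budget too small: " + budgetMb);
        }
        int nodes = budgetMb / 5;            // assumed 20% share
        int rels = budgetMb * 2 / 5;         // assumed 40% share
        int props = budgetMb - nodes - rels; // whatever is left
        Map<String, String> config = new LinkedHashMap<>();
        config.put(NODES, nodes + "M");
        config.put(RELS, rels + "M");
        config.put(PROPS, props + "M");
        return config;
    }

    public static void main(String[] args) {
        // One map per dataset. In Neo4j 1.x each map would be handed to the
        // database when it is opened, e.g. as the second constructor argument:
        //   new EmbeddedGraphDatabase(storeDir, config)
        System.out.println(forBudget(100));
        System.out.println(forBudget(500));
    }
}
```

Because every instance gets its own map, the sum of all budgets has to fit alongside the JVM heap, which is where a large number of tenants in one process starts to hurt.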

We do that in a simple form in the Neo4j console but that is for simple, throw-away databases anyway (just storing in-memory Neo4j-Test-Database instances in session state)

Michael

Badri Janakiraman

May 21, 2012, 10:47:52 PM
to Neo4j
Michael

The dataset that we expect to load into each instance of
GraphDatabaseService would probably range from around 50MB to 500MB.
This is the current on-disk storage size we see when we run du -sh.
The number of nodes is currently around 400,000, and we expect it to
top out at around 2M.

To provide you some context of how we arrived here:

We are currently building an application that might be best described
as "analytics-as-a-service" - where data from user systems gets
slurped up into a logical bundle that we refer to as a "dataset". The
application will generate analytics off of that dataset. We will be
developing the core analytics and presentation. However, a user might
have multiple datasets for each of which we provide analytics.

We were hoping to load each logical dataset into a separate physical
neo4j database. Each request-response cycle would only work with one
database, which would get set up for it as a ThreadLocal. Different
threads/requests running through the application will each work with
different databases, however. This is what brought us to the idea of
storing the multiple GraphDatabaseService objects, each initialized by
pointing to a different physical database on disk, into an ObjectPool.
Each request could check out one database object to work with for the
scope of that request alone. This simplifies our domain model
dramatically, as every traversal is guaranteed to be scoped to the
correct dataset without a problem. It also dramatically simplifies the
background job that acquires the data from the customers and stores
the generated objects for each dataset into their respective
databases.
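
For illustration only, the request-scoped binding described above could be sketched roughly as follows (Java 8 or later). Everything here is hypothetical: `DatasetRegistry`, `withDataset` and the `opener` function are made-up names, and the type parameter `D` stands in for GraphDatabaseService so that the sketch compiles without Neo4j on the classpath. Note that it keeps one shared instance per dataset rather than an exclusive check-out pool. That rests on the assumption, taken from the Neo4j embedded documentation, that a single instance can be shared between threads while a store directory can only be opened by one instance at a time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** D stands in for GraphDatabaseService so the sketch compiles without Neo4j. */
class DatasetRegistry<D> {
    private final Map<String, D> openDatabases = new ConcurrentHashMap<>();
    private final Function<String, D> opener; // hypothetical: opens the store for a dataset id
    private final ThreadLocal<D> current = new ThreadLocal<>();

    DatasetRegistry(Function<String, D> opener) {
        this.opener = opener;
    }

    /** Binds the dataset's database to the calling thread, opening it on first use. */
    D bind(String datasetId) {
        if (current.get() != null) {
            throw new IllegalStateException("thread is already bound to a dataset");
        }
        D db = openDatabases.computeIfAbsent(datasetId, opener);
        current.set(db);
        return db;
    }

    /** Returns the database bound to the calling thread, or fails loudly. */
    D current() {
        D db = current.get();
        if (db == null) {
            throw new IllegalStateException("no dataset bound to this thread");
        }
        return db;
    }

    /** Clears the binding so a recycled worker thread cannot see a stale database. */
    void unbind() {
        current.remove();
    }

    /** Runs one request against one dataset and always clears the binding afterwards. */
    <R> R withDataset(String datasetId, Function<D, R> request) {
        bind(datasetId);
        try {
            return request.apply(current());
        } finally {
            unbind();
        }
    }
}
```

A servlet filter or similar entry point would call `withDataset` once per request. The `finally` block matters most in a web container, where worker threads are reused and a leftover ThreadLocal value would otherwise leak one tenant's database into the next request.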

Initially, customers will not be able to drop down to the data level.
They will primarily interact with the app through the UI to perform a
few select operations. We may, however, provide the ability
for people to write plugins and extensions. Even so, we do not
see people dropping down to raw Cypher any time soon.

We could certainly set the memory-mapped I/O config on each instance
that we create to place in the pool. That shouldn't be a problem. We do
not foresee much affecting us by way of performance, and we will be
testing that quite extensively.

We were more worried about whether different web threads that were
intended to work with different GraphDatabaseService instances might
get their paths crossed and produce bad data or subtle bugs. Thread-local
state and implicit one-db-per-JVM caches are the sort of thing we were
worried about.

On the plus side, the application is intended to be almost exclusively
read-only, so the database is going to have only readers and no
transactions.

Jim

Much appreciate your help in clarifying the licensing terms.

Thank you all for your help and advice.
- b
