Michael
The dataset that we expect to load into each instance of
GraphDatabaseService would probably range from around 50MB to 500MB.
This is the current on-disk storage size we see when we run du -sh.
The number of nodes is currently around 400,000, and we expect it to
top out at around 2M.
To give you some context on how we arrived here:
We are currently building an application that might be best described
as "analytics-as-a-service" - where data from user systems gets
slurped up into a logical bundle that we refer to as a "dataset". The
application will generate analytics off of that dataset. We will be
developing the core analytics and presentation. However, a user might
have multiple datasets for each of which we provide analytics.
We were hoping to load each logical dataset into a separate physical
neo4j database. Each request-response cycle would only work with one
database, which would be set up for it as a ThreadLocal. Different
threads/requests running through the application will each work with
different databases, however. This is what brought us to the idea of
storing the multiple GraphDatabaseService objects, each initialized by
pointing to a different physical database on disk, into an ObjectPool.
Each request could check out one database object to work with for the
scope of that request alone. This simplifies our domain model
dramatically, as every traversal is guaranteed to be scoped to the
correct dataset without a problem. It also dramatically simplifies the
background job that acquires the data from the customers and stores
the generated objects for each dataset into their respective
databases.
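A minimal sketch of the idea, not our actual code. GraphDb is a
hypothetical stand-in for GraphDatabaseService so the snippet compiles
without Neo4j on the classpath, and DatasetDatabases and the opener
function are names made up for illustration. Note that it is a keyed
lookup rather than an exclusive check-out/return pool, on the assumption
that one instance per store directory can be shared by all threads
working on that dataset:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in for org.neo4j.graphdb.GraphDatabaseService.
interface GraphDb {
    String storeDir();
    void shutdown();
}

// Holds one long-lived database handle per dataset, opened lazily.
final class DatasetDatabases {
    private final Map<String, GraphDb> open = new ConcurrentHashMap<>();
    private final Function<String, GraphDb> opener;

    // opener maps a dataset id to a freshly opened database (assumption:
    // in the real app this would point at that dataset's store directory).
    DatasetDatabases(Function<String, GraphDb> opener) {
        this.opener = opener;
    }

    // Returns the handle for this dataset, opening it at most once.
    GraphDb forDataset(String datasetId) {
        return open.computeIfAbsent(datasetId, opener);
    }

    // Shuts down every opened database, e.g. on application shutdown.
    void shutdownAll() {
        for (GraphDb db : open.values()) {
            db.shutdown();
        }
        open.clear();
    }
}
```

Each request would call forDataset with its dataset id once and use only
the returned handle for the rest of the request.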
Initially, customers will not be able to drop down to the data level.
They will primarily interact with the app through the UI to perform a
few select operations. We may provide the ability
for people to write plugins and extensions, but we still do not
see people dropping down to raw Cypher any time soon.
We could certainly set the memory-mapped I/O config on each instance
that we create and place in the pool. That shouldn't be a problem. We do
not foresee much of a performance impact, and we will be
testing that quite extensively.
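As a rough sketch of how we might derive those settings per dataset: the
helper below (MappedMemoryConfig and forStoreSizes are hypothetical
names) turns the on-disk sizes of the three main store files into a
config map. The setting names are the mapped_memory keys as we
understand them from the Neo4j 1.x configuration docs, so they should be
checked against the version in use; sizing each mapping to the full file
size is our assumption, not a recommendation from the docs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MappedMemoryConfig {

    // Builds mapped-memory settings from store file sizes in bytes.
    // Assumption: each store file is mapped in full.
    static Map<String, String> forStoreSizes(long nodeStoreBytes,
                                             long relStoreBytes,
                                             long propStoreBytes) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("neostore.nodestore.db.mapped_memory",
                megabytes(nodeStoreBytes));
        config.put("neostore.relationshipstore.db.mapped_memory",
                megabytes(relStoreBytes));
        config.put("neostore.propertystore.db.mapped_memory",
                megabytes(propStoreBytes));
        return config;
    }

    // Rounds up to whole megabytes, with a floor of 1M.
    static String megabytes(long bytes) {
        long mb = Math.max(1, (bytes + (1L << 20) - 1) >> 20);
        return mb + "M";
    }
}
```

The resulting map would be handed over as the configuration when each
instance is created for the pool.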
We were more worried about whether different web
threads that were intended to work with different GraphDatabaseService
instances might get their paths crossed, resulting in bad data or
subtle bugs. Thread-local state and implicit one-db-per-JVM
caches are the sort of thing we were primarily worried about.
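On our side of the fence, the binding we have in mind looks roughly like
the sketch below. RequestScope is a hypothetical helper, generic so that
it does not depend on Neo4j; the point is only that the ThreadLocal is
always cleared when the request ends, so a reused web thread cannot carry
one request's database into the next. It says nothing about state held
inside Neo4j itself, which is the part we are asking about.

```java
import java.util.function.Supplier;

// Binds one database handle to the current thread for one request.
final class RequestScope<T> {
    private final ThreadLocal<T> current = new ThreadLocal<>();

    // Runs work with db bound to this thread, then always unbinds it.
    <R> R with(T db, Supplier<R> work) {
        if (current.get() != null) {
            throw new IllegalStateException(
                    "a database is already bound to this thread");
        }
        current.set(db);
        try {
            return work.get();
        } finally {
            current.remove();
        }
    }

    // Returns the handle bound to this thread, failing loudly if none is.
    T current() {
        T db = current.get();
        if (db == null) {
            throw new IllegalStateException(
                    "no database bound to this thread");
        }
        return db;
    }
}
```

Failing loudly when nothing is bound means a traversal that runs outside
a request scope raises an error instead of silently hitting the wrong
dataset.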
On the plus side, the application is intended to be almost exclusively
read-only, so the database will have only readers and no
transactions.
Jim
I much appreciate your help in clarifying the licensing terms.
Thank you all for your help and advice.
- b
On May 21, 3:23 pm, Michael Hunger <
michael.hun...@neotechnology.com>
wrote:
> What is the size of the dataset that is loaded into these graph databases?
>
> And how free are the customers to query and analyze their data? Will that happen through an API / UI that you provide or can they drop down to the raw database level?
> If that is the case you might fare better by implementing a simple multi-tenancy on top of neo4j as Jim outlined.
>
> In general if you know estimates of the data-set-sizes you can pass the memory mapped-io config to each database, it gets trickier with cache sizes as they try to consume enough heap to work efficiently.
>
> We do that in a simple form in the Neo4j console but that is for simple, throw-away databases anyway (just storing in-memory Neo4j-Test-Database instances in session state)
>
> Michael
>
> On 19.05.2012 at 01:45, Badri Janakiraman wrote:
>
> > Hello
>
> > I believe a similar sort of question was asked in this thread about
> > multi-tenancy,
https://groups.google.com/group/neo4j/browse_thread/thread/fc03f9fe24...