Hi all,
we are very happy to announce Neo4j 1.9.M01, the first milestone release of 1.9.
The highlight in terms of new functionality is a totally new High Availability
cluster communication framework, based on Paxos, which gets rid of the
hard-to-configure ZooKeeper coordinator subsystem. Testing, feedback and
comments are VERY welcome!
In this release we would like to thank Wes Freeman, who has been
contributing a lot of great features to Cypher, console.neo4j.org and to
the discussions on this list. You rock, Wes.
On Fri, Oct 26, 2012 at 10:58 PM, Axel Morgner <a...@morgner.de> wrote:
> Cool Peter!
> Watched your HA screencast, the new HA architecture sounds really good!
> Thanks!
This new setup for HA is awesome.
Just a couple of questions. You mention something at the end of the
screencast related to node ids. Did you mean that node ids don't change
across the instances? So, start n=node(2) return n will always return
the same node?
And the second one: let's imagine a very intense operation like the
creation of thousands of nodes, or cloning or importing a graph. How long
does it take to replicate to the other instances?
Javier,
cool you like the new setup and screencast - actually it is fun to do these!
Regarding your questions - node IDs do not change across instances. Yes,
start n = node(2) will always return the same node on all instances.
For replication, there are actually two protocols, I think. If there are no
transactions from a slave to merge (e.g. a new cluster member is joining),
the whole store is copied upon first connect, making this a comparatively
fast operation. After that, transactions are propagated using a transaction
protocol. So bringing new instances online should not take much time.
My #1 request would be to lift the 32 billion relationship/node and 64 billion property limit, or to implement distributed graphs. That is quickly becoming a very restrictive limitation. We're going to have to create our own sharding scheme as a workaround for now (and as a result, we've had to do a lot of "non-graphy" things since we can't maintain relationships across shards very easily).
On Friday, October 26, 2012 4:39:45 PM UTC-4, Peter Neubauer wrote:
Hi Rick,
Slightly off topic, I have been thinking about sharding lately, since I want to introduce that in one of the next versions of my software. One of the strategies I am considering now has the following properties:
Each node belongs to a shard.
Relationships between nodes belonging to the same shard are treated the same as relationships are treated now.
Creating a relationship between nodes belonging to different shards is treated differently. Suppose we want to create a relationship from node1 (in shard1) to node2 (in shard2). First we look up a node in shard2 that represents node1; if none exists, we create it. Then we do the same for node2 in shard1. Then we create two relationships: one between node1 and the representative of node2 in shard1, and one between node2 and the representative of node1 in shard2.
Representative nodes contain the uuid of the original node and have a relationship to a representative node of the shard of the original node, so it can transparently be looked up.
Taking these steps guarantees that shards are effectively disconnected from one another and can thus be distributed over different databases.
When a shard is moved from one database to another, all nodes representing that shard in all other shards need to be updated, unless we devise some central repository for shards.
Any thoughts?
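The scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Node, Shard, representative_of and relate are hypothetical names, shards are plain in-memory objects rather than Neo4j databases, and the origin shard is kept as a field on the representative instead of as a relationship to a shard-representing node.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    uid: str                            # globally unique id of this node
    shard: str                          # name of the shard that stores it
    represents: Optional[str] = None    # uid of the original, if a representative
    origin_shard: Optional[str] = None  # shard of the original (assumption: a field)


@dataclass
class Shard:
    name: str
    nodes: dict = field(default_factory=dict)          # uid -> Node
    relationships: list = field(default_factory=list)  # (from_uid, to_uid) pairs
    reps: dict = field(default_factory=dict)           # original uid -> representative

    def create_node(self):
        node = Node(uid=str(uuid.uuid4()), shard=self.name)
        self.nodes[node.uid] = node
        return node

    def representative_of(self, original):
        """Return the local stand-in for a foreign node, creating it if missing."""
        rep = self.reps.get(original.uid)
        if rep is None:
            rep = Node(uid=str(uuid.uuid4()), shard=self.name,
                       represents=original.uid, origin_shard=original.shard)
            self.nodes[rep.uid] = rep
            self.reps[original.uid] = rep
        return rep


def relate(shards, source, target):
    """Create a relationship source -> target, within or across shards."""
    if source.shard == target.shard:
        shards[source.shard].relationships.append((source.uid, target.uid))
        return
    home, away = shards[source.shard], shards[target.shard]
    # One relationship per shard; each one ends at a local representative,
    # so no relationship ever crosses a shard boundary.
    home.relationships.append((source.uid, home.representative_of(target).uid))
    away.relationships.append((away.representative_of(source).uid, target.uid))
```

Calling relate twice for the same pair of nodes reuses the existing representatives, which matches the "look up first, create only if missing" step in the description.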
On Saturday, October 27, 2012 4:29:41 PM UTC+2, RickBullotta wrote:
I suppose that a couple of the challenges would involve:
- Creating/managing node UUIDs (this would/could consume a lot of properties and a lot of cache memory, since the Long node id is not a reliable UUID)
- Looking up UUIDs to resolve them to a node, since Lucene doesn't seem to like very large indices and potentially every node would be in that index
- The number of extra nodes/relationships required to maintain connections between shards could be substantial depending on the specific graph's complexity
We're trying to keep fairly clear isolation between our shards so that we don't keep any significant "relationships" across nodes in different shards. In our model, most subgraphs are really discrete collections and it makes it (somewhat) easier for us to move them around between databases and servers.
On Sunday, October 28, 2012 8:49:04 AM UTC-4, Niels Hoogeveen wrote:
On Sun, Oct 28, 2012 at 6:36 PM, RickBullotta <rick.bullo...@gmail.com> wrote:
> I suppose that a couple of the challenges would involve:
> - Creating/managing node UUIDs (this would/could consume a lot of
> properties and a lot of cache memory, since the Long node id is not a
> reliable UUID)
A UUID is just two longs, so it doubles memory consumption ... hmm ... not
much on the one hand, a lot on the other. Maybe a switch to run the DB in two
different modes? Or something like that.
> - Looking up UUIDs to resolve them to a node, since Lucene doesn't seem to
> like very large indices and potentially every node would be in that index
> - The number of extra nodes/relationships required to maintain connections
> between shards could be substantial depending on the specific graph's
> complexity
It is simpler if you think in terms of a discovery service à la JXTA. That
means there is no need to remember where something is stored, only to know
where to ask (several places, or all of them).
> We're trying to keep fairly clear isolation between our shards so that we
> don't keep any significant "relationships" across nodes in different
> shards. In our model, most subgraphs are really discrete collections and
> it makes it (somewhat) easier for us to move them around between databases
> and servers.
I agree that 32 billion is too small a figure. If my site has 1M accounts,
only 32k nodes are left for objects per account, which is not much. Having
only one DB is much better than several, for many reasons.
When addressing the store size, would it be an option to include an id-offset for nodes and relationships, a parameter that can be set upon database creation? This would allow for cheap storage of sharding information. The ids are now longs, so theoretically 64 bits can be used to address nodes in the database. However, a database cannot contain more than 2^64 / record size nodes. This leaves room for database ids. If the record size is somewhere in the order of 32 bytes, we don't need 5 bits of the 64-bit address space (2^64 / 32 = 2^59), leaving room for 32 unique database ids; reserving 8 bits for 256 database ids would still leave 2^56 ids per database.
Any node or relationship with an id outside the range of the current database can be identified, and the corresponding database id can be determined for free.
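To make the id-offset idea concrete, here is a small Python sketch. It is not how Neo4j allocates ids; the function names and the 32-byte record size are assumptions taken from the proposal itself. Under the stated premise (a store addressable with 64 bits), a 32-byte record frees log2(32) = 5 bits, i.e. 32 database ids; freeing 8 bits would take 256-byte records or a deliberately smaller local id range.

```python
ID_BITS = 64  # ids are longs


def spare_bits(record_size_bytes):
    """Bits of a 64-bit id left unused when a store holds at most 2**64 bytes."""
    return record_size_bytes.bit_length() - 1  # floor(log2(record size))


def id_range(database_id, local_bits):
    """Half-open id range [offset, end) owned by one database."""
    offset = database_id << local_bits  # the per-database id-offset
    return offset, offset + (1 << local_bits)


def owning_database(entity_id, local_bits):
    """Database id encoded in the high bits of a node or relationship id."""
    return entity_id >> local_bits


def is_local(entity_id, database_id, local_bits):
    """True if the id falls inside the range of the given database."""
    low, high = id_range(database_id, local_bits)
    return low <= entity_id < high


local_bits = ID_BITS - spare_bits(32)   # 59 bits for local ids, 5 for database ids
offset, _ = id_range(3, local_bits)     # database 3 starts numbering here
foreign_id = offset + 42

print(owning_database(foreign_id, local_bits))  # 3
print(is_local(foreign_id, 0, local_bits))      # False
```

The point of the design is the last two lines: deciding whether an id is foreign, and which database owns it, is a shift and a comparison with no index lookup.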
On Sunday, October 28, 2012 11:21:59 PM UTC+1, Michael Hunger wrote:
> The store-size issue is planned to be addressed in 1.10 in spring 2013.
> Michael
Well, making it 128 bits would allow sharing the same id for the same node
across any DB (a globally unique id). That is much better than a workaround
with the database id as part of the node id. A global address space is the
dream of dreams :-)
On Mon, Oct 29, 2012 at 4:16 AM, Niels Hoogeveen <nielshoog...@gmail.com> wrote:
On Mon, Oct 29, 2012 at 2:30 PM, Axel Morgner <a...@morgner.de> wrote:
> +1 for UUIDs as optional/additional node id
> I mean UUIDs for nodes & relationships, not just nodes -)
Realize that the node ID is a sequential #, and this is essential to
preserve since (I assume) it provides extremely fast random retrieval from
a fixed offset. Therefore a UUID would be an additional memory item.
Adding 16 bytes (two longs) + overhead (let's just estimate 24 bytes) on a
system with a billion nodes or so quickly adds up! 24GB of storage/RAM.
Regarding discovery versus indices, it really doesn't matter - you'll still
need a monstrously huge index to do the lookup, won't you?
Regarding current size limitations, the one that we find more restrictive
is the # of properties. We hit that limit long before we hit the
node/relationship limit.
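The memory estimate above is simple enough to check in a few lines of Python. The 24 bytes per node is the rough figure from the message (16 bytes for the UUID plus estimated overhead), not a measured value, and the function name is hypothetical.

```python
UUID_BYTES = 16      # two 8-byte longs
BYTES_PER_NODE = 24  # UUID plus the overhead estimated in the message


def uuid_overhead_gb(node_count, bytes_per_node=BYTES_PER_NODE):
    """Extra storage/RAM in decimal gigabytes for one UUID per node."""
    return node_count * bytes_per_node / 1e9


print(uuid_overhead_gb(1_000_000_000))              # 24.0
print(uuid_overhead_gb(1_000_000_000, UUID_BYTES))  # 16.0
```

Giving relationships a UUID as well would add the same per-record cost for every relationship.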
On Sun, Oct 28, 2012 at 3:13 PM, Dmitriy Shabanov <shaban...@gmail.com> wrote:
A GUID is of course nice to have, but can easily be added as a property. What GUIDs lack is structural information.
The current node-id and relationship-id contain information about where to find the corresponding record: record length * id = position in the file.
As I stated in a previous message, not the entire address space can be used to locate node and relationship records, so the remaining space could in principle be used for other purposes, like a store id.
This would give a nice structural key, making it possible to locate a node or relationship within a particular store.
GUIDs are too opaque for this purpose, requiring an index to link a GUID to a particular node or relationship in a particular store. Such an index can easily become very big and would not only require a lot of storage, but also increase lookup time.
The proposal for a structural key based on store-id and node-id/relationship-id adds no overhead. It does, however, place a limit on the number of databases one installation can serve.
8 bit store-id + 56 bit node-id/relationship-id: 256 stores with approximately 10^17 nodes/relationships
12 bit store-id + 52 bit node-id/relationship-id: 4096 stores with approximately 10^16 nodes/relationships
16 bit store-id + 48 bit node-id/relationship-id: 65,536 stores with approximately 10^14 nodes/relationships
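As a rough illustration of the structural key, the following Python sketch computes the capacity of each split and decodes a key into a store id and a file position. The function names are hypothetical, and the 32-byte record length is only the order-of-magnitude figure mentioned earlier in the thread, not Neo4j's actual record size.

```python
def capacity(store_bits, id_bits=64):
    """Number of stores and ids per store for a given store-id width."""
    return 1 << store_bits, 1 << (id_bits - store_bits)


def split_key(key, store_bits, id_bits=64):
    """Split a structural key into (store id, local record id)."""
    local_bits = id_bits - store_bits
    return key >> local_bits, key & ((1 << local_bits) - 1)


def record_offset(local_id, record_length):
    """Byte position of a record in its store file: record length * id."""
    return local_id * record_length


# Capacities for the three splits listed above.
for store_bits in (8, 12, 16):
    stores, ids = capacity(store_bits)
    print(f"{store_bits:2d}-bit store id: {stores:6d} stores, {ids:.1e} ids per store")

# Decode a key from store 5 and locate its record (32-byte records assumed).
key = (5 << 56) | 1000
store_id, local_id = split_key(key, store_bits=8)
print(store_id, record_offset(local_id, record_length=32))  # 5 32000
```

Unlike a GUID, nothing here needs an index: both the owning store and the record position follow from arithmetic on the key.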
On Monday, October 29, 2012 10:23:47 AM UTC+1, Dmitriy Shabanov wrote:
The other consideration in these discussions is the portability of the IDs
- backup/archive/transfer of nodes or subgraphs between graphs should be
supported somehow (which may require making the ability to reuse IDs a
configurable option), as well as determining how to assign/manage the store
IDs.
On Mon, Oct 29, 2012 at 11:07 AM, Niels Hoogeveen <nielshoog...@gmail.com> wrote:
> A GUID is of course nice to have, but can easily be added as a property.
> What GUID's miss is structural information.
> The current node-id and relationship-id contain information where to find
> the corresponding record. Record length * id = position in the file.
> As I stated in a previous message, not the entire address space can be
> used to locate node and relationship records, so the remaining space could
> in principle be used for other purposes, like a store id.
> This would give a nice structural key, making it possible to locate a node
> or relationship within a particular store.
> GUID's are too opaque for this purpose, requiring an index to link a GUID
> to a particular node or relationship in a particular store. Such an index
> can easily become very big and would not only require a lot of storage, but
> also increase lookup time.
> The proposal for a structural key based on store-id and
> node-id/relationship-id, adds no overhead. It does however place a limit on
> the number of databases one installation can serve.
> 8 bit store-id + 56 bit node-id/relationship-id: 256 stores
> with approximately 10^17 nodes/relationships
> 12 bit store-id + 52 bit node-id/relationship-id: 4096 stores
> with approximately 10^16 nodes/relationships
> 16 bit store-id + 48 bit node-id/relationship-id: 65,536 stores with
> approximately 10^14 nodes/relationships
> Niels
> On Monday, October 29, 2012 10:23:47 AM UTC+1, Dmitriy Shabanov wrote:
>> Well, making it 128 bits would allow sharing the same id for the same node
>> across any db (a globally unique id). That is much better than the workaround
>> with a database id as part of the node id. A global address space is the
>> dream of dreams -)
>> On Mon, Oct 29, 2012 at 4:16 AM, Niels Hoogeveen <nielsh...@gmail.com> wrote:
>>> When addressing the store size, would it be an option to include an
>>> id-offset for nodes and relationships; a parameter that can be set upon
>>> database creation. This would allow for cheap storage of sharding
>>> information. The id's now are longs, so theoretically 64 bits can be used
>>> to address nodes in the database. However a database can not contain more
>>> than 2^64 / record size number of nodes. This leaves room for having
>>> database ids. If the record size is somewhere in the order of 32 byte, this
>>> would mean we don't need 8 bits of the 64 bit address space, leaving room
>>> for at least 256 unique database ids.
>>> Any node or relationship with an id that is not in the range of the
>>> current database can be identified and the corresponding database id can be
>>> determined for free.
>>> Niels
>>> On Sunday, October 28, 2012 11:21:59 PM UTC+1, Michael Hunger wrote:
>>>> The store-size issue is planned to be addressed in 1.10 in spring 2013.
>>>> Michael
>>>> On 28.10.2012 at 20:13, Dmitriy Shabanov wrote:
>>>> On Sun, Oct 28, 2012 at 6:36 PM, RickBullotta <rick.b...@gmail.com> wrote:
>>>>> I suppose that a couple of the challenges would involve:
>>>>> - Creating/managing node UUIDs (this would/could consume a lot of
>>>>> properties and a lot of cache memory, since the Long node id is not a
>>>>> reliable UUID)
>>>> A uuid is just 2 longs, so it doubles memory consumption ... hmm ... not
>>>> much on one side and a lot on the other. Maybe a switch to run the db in two
>>>> different modes? Or something similar.
>>>>> - Looking up UUIDs to resolve them to a node, since Lucene doesn't
>>>>> seem to like very large indices and potentially every node would be in that
>>>>> index
>>>>> - The number of extra nodes/relationships required to maintain
>>>>> connections between shards could be substantial depending on the specific
>>>>> graph's complexity
>>>> It is simpler if you think of a discovery service along the lines of JXTA.
>>>> That means there is no requirement to remember where something is stored,
>>>> only to know where to ask (several places or all).
>>>>> We're trying to keep fairly clear isolation between our shards so that
>>>>> we don't keep any significant "relationships" across nodes in different
>>>>> shards. In our model, most subgraphs are really discrete collections and
>>>>> it makes it (somewhat) easier for us to move them around between databases
>>>>> and servers.
>>>> I agree that 32 billion is too small a figure. If my site has 1M accounts,
>>>> only 32k nodes are left for objects per account, which is not much. Having
>>>> only one db is much better than several, for many reasons.
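To make the structural-key idea quoted above a bit more concrete, here is a minimal Python sketch of packing a store id and a record id into one 64-bit value, together with the "record length * id = position" lookup. The function names, the 8/56 default split, and the 32-byte record length are assumptions for illustration only; this is not Neo4j's actual store format or API.

```python
ID_BITS = 64  # ids are longs; treated here as unsigned 64-bit values


def pack_key(store_id, record_id, store_bits=8):
    """Combine a store id and a record id into one 64-bit structural key."""
    record_bits = ID_BITS - store_bits
    if not 0 <= store_id < (1 << store_bits):
        raise ValueError("store id does not fit in %d bits" % store_bits)
    if not 0 <= record_id < (1 << record_bits):
        raise ValueError("record id does not fit in %d bits" % record_bits)
    return (store_id << record_bits) | record_id


def unpack_key(key, store_bits=8):
    """Split a structural key back into (store_id, record_id)."""
    record_bits = ID_BITS - store_bits
    return key >> record_bits, key & ((1 << record_bits) - 1)


def record_offset(record_id, record_length=32):
    """Byte position of a fixed-length record: record length * id."""
    return record_id * record_length


def spare_bits(record_length):
    """Id bits left unused if byte positions themselves must fit in 64 bits."""
    return record_length.bit_length() - 1  # floor(log2(record_length))


# Capacity for the three splits mentioned in the thread.
for store_bits in (8, 12, 16):
    stores = 1 << store_bits
    records = 1 << (ID_BITS - store_bits)
    print(f"{store_bits:2d}-bit store id: {stores:6d} stores, "
          f"about {records:.1e} records each")
```

Under the quoted assumption that byte positions fit in 64 bits, `spare_bits(32)` gives 5 (room for 32 store ids); the 8 bits mentioned in the thread would correspond to records of 256 bytes. The fixed splits in the list above do not depend on that calculation.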
Well, properties come in because of the node/relationship limits (my guess). It
is possible to move properties into the node/relationship area. We can look at
the problem from different points of view:
- memory size (storage size) ... if you have a small db you will have small
requirements. If your db grows you have to provide more memory anyway.
- "structure" design ... that is always a question of finding a way to fit
within the limits
- "ideas" design ... that is the most interesting point, because it relates to
the way we think. Very often we need to find workarounds for our systems
to support growth (in most cases because of decisions made at the "structure"
design stage). But things become very simple as soon as we start thinking in
terms of a global addressing space. Of course, I can continue on this subject
(and will if anyone is interested). For now I hope these points are
understandable.
On Mon, Oct 29, 2012 at 7:30 PM, Rick Bullotta <rick.bullo...@gmail.com> wrote:
> Realize that the node ID is a sequential #, and this is essential to
> preserve since (I assume) it provides extremely fast random retrieval from
> a fixed offset. Therefore a UUID would be an additional memory item.
> Adding 16 bytes (two longs) + overhead (let's just estimate 24 bytes) on a
> system with a billion nodes or so quickly adds up! 24GB of storage/RAM.
> Regarding discovery versus indices, it really doesn't matter - you'll
> still need a monstrously huge index to do the lookup, won't you?
> Regarding current size limitations, the one that we find more restrictive
> is the # of properties. We hit that limit long before we hit the
> node/relationship limit.
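The arithmetic behind the 24GB figure quoted above, as a small Python sketch. The 24 bytes per node (16 bytes of UUID plus an estimated overhead) and the billion nodes are taken from the message; the function name is made up for illustration.

```python
def uuid_overhead_bytes(node_count, bytes_per_uuid=24):
    """Extra storage/RAM needed to keep one UUID (plus overhead) per node."""
    return node_count * bytes_per_uuid


total = uuid_overhead_bytes(1_000_000_000)
print(total / 10**9, "GB")  # 24.0 GB
```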
Niels, what you write is right from the storage infrastructure point of view,
BUT from the point of view of systems design it gives nothing. I don't want to
say that you are wrong. I just want to say that I (at a minimum) have to support
a UUID to node/relationship id mapping anyway.
Maybe it has to stay this way, and a single solution for these two
different-level problems doesn't exist at all.
To be clear, here they are:
- lookup in physical storage
- lookup in the global addressing space
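The two lookups listed above can be sketched roughly as follows: a global index maps a UUID to a node id, and the node id is then turned into a file position. The in-memory dict, the function names, and the 32-byte record length are assumptions for illustration; a real index would have to live on disk, which is the size concern raised in the quoted messages.

```python
import uuid

RECORD_LENGTH = 32  # assumed fixed record length in bytes

# Global addressing space: UUID -> node id (hypothetical in-memory index).
uuid_index = {}


def register(node_id):
    """Assign a fresh UUID (128 bits, i.e. two longs) to a node id."""
    key = uuid.uuid4()
    uuid_index[key] = node_id
    return key


def locate(key):
    """Resolve a UUID to (node id, byte position of its record)."""
    node_id = uuid_index[key]            # lookup in the global addressing space
    position = node_id * RECORD_LENGTH   # lookup in physical storage
    return node_id, position
```

For example, `locate(register(7))` resolves to node id 7 at byte position 224 under the assumed record length.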
On Mon, Oct 29, 2012 at 8:07 PM, Niels Hoogeveen <nielshoog...@gmail.com> wrote:
> A GUID is of course nice to have, but can easily be added as a property.
> What GUID's miss is structural information.
> The current node-id and relationship-id contain information where to find
> the corresponding record. Record length * id = position in the file.
> As I stated in a previous message, not the entire address space can be
> used to locate node and relationship records, so the remaining space could
> in principle be used for other purposes, like a store id.
> This would give a nice structural key, making it possible to locate a node
> or relationship within a particular store.
> GUID's are too opaque for this purpose, requiring an index to link a GUID
> to a particular node or relationship in a particular store. Such an index
> can easily become very big and would not only require a lot of storage, but
> also increase lookup time.
> The proposal for a structural key based on store-id and
> node-id/relationship-id, adds no overhead. It does however place a limit on
> the number of databases one installation can serve.
> 8 bit store-id + 56 bit node-id/relationship-id: 256 stores with approximately 10^17 nodes/relationships
> 12 bit store-id + 52 bit node-id/relationship-id: 4096 stores with approximately 10^16 nodes/relationships
> 16 bit store-id + 48 bit node-id/relationship-id: 65,536 stores with approximately 10^14 nodes/relationships