Merging two neo4j databases


bsge...@gmail.com

Jul 10, 2013, 3:40:02 PM7/10/13
to ne...@googlegroups.com
Hi,

  I am working on an application that collects data from different sources and stores it in a neo4j graph. I would like to speed up the data collection, so I am thinking of parallelizing it. For my purposes, though, I can really run multiple instances of the program entirely independently, which would save the overhead and the time investment of parallelization. If I were to do this, is there any way to merge the resulting graph databases into one large graph? If so, is there anything I should know before I try? It's surprisingly difficult to find an answer to this question online.

  Thanks in advance,
  B Gelley

Michael Hunger

Jul 10, 2013, 8:24:38 PM7/10/13
to ne...@googlegroups.com
Do you run into insert speed issues with a single neo4j instance?

In principle, yes, as long as you can identify the nodes that are duplicated across the graphs and merge them sensibly (e.g. find them via an index lookup while doing the insertion).

You would just iterate over each graph database (sequentially or in parallel), but make sure to create these connection nodes with an index lookup and a UniqueNodeFactory, so that they are only created once even in the multithreaded case.
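The get-or-create guarantee described above can be sketched in plain Java. In a real import you would use Neo4j's UniqueFactory.UniqueNodeFactory against the legacy index (or, in later versions, Cypher's MERGE); here a ConcurrentHashMap stands in for the index so the sketch is self-contained, and the class name and key format are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the get-or-create pattern: every importer thread resolves a
// node by its external key, and the node is created at most once no
// matter how many threads race on the same key. computeIfAbsent gives
// the same atomicity that UniqueNodeFactory provides against the index.
class UniqueNodeRegistry {
    private final AtomicLong nextId = new AtomicLong();
    private final Map<String, Long> index = new ConcurrentHashMap<>();

    // Returns the node id for this key, creating it exactly once.
    long getOrCreate(String key) {
        return index.computeIfAbsent(key, k -> nextId.getAndIncrement());
    }

    int size() {
        return index.size();
    }
}
```

When merging several source databases, each importer would call getOrCreate with the duplicate node's natural key (e.g. "user:42") before attaching relationships, so the shared nodes end up created only once in the target graph.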


HTH

Michael


Abhishek Gupta

Dec 13, 2013, 1:32:14 AM12/13/13
to ne...@googlegroups.com
Hey,

Did you find ways to do this? I am encountering the same problem.

Abhishek

Michael Hunger

Dec 15, 2013, 6:34:06 PM12/15/13
to ne...@googlegroups.com
What do you actually want to do?

And do you need a live database or read-only-snapshots of a concurrently running import?

I had an idea some time ago of driving the batch-inserter API to import a large amount of data concurrently.

Whenever there is a request for a snapshot, the batch-inserter is cleanly shut down, a copy of the database is taken, and the batch-inserter is restarted to continue the import. Incoming messages are fed into a message queue / event-processing system, so the short shutdown time shouldn't make a difference.

HTH

Michael
