While we are waiting for Kris' work to be ready (no pressure, Kris), we use the batch importer API to get data from Hadoop into Neo4j.
I'll describe our process in brief. First, we turn the raw data into a graph structure in Hadoop. When working with records, you have to figure out what becomes a node and what becomes a relationship. We turn our raw data into a nodes file and a relationships file on HDFS. Depending on the data, we either use handwritten jobs in Java/Cascading, or we manage things with Hive (which gives slower jobs, but less dev time).
Our file formats are trivial. The nodes file is tab-separated and each line starts with an ID field (a unique ID for that node). In the relationships file, each line starts with two IDs for the start and end node. One potential issue here is that your records do not have (unique) IDs, or don't have compact IDs (e.g. very long strings, which are inconvenient in our importer code). We solve this by assigning a monotonically increasing ID to each entity in the nodes file. There is a distributed approach for assigning row numbers described here:
http://waredingen.nl/monotonically-increasing-row-ids-with-mapredu
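To make that concrete, a couple of made-up example lines (tab separated, shown with spaces here; only the leading ID columns are fixed in our format, the rest is whatever properties you want to carry along):

    nodes file:
    0    Alice    1975
    1    Bob      1980

    relationships file:
    0    1    KNOWS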
When we have a nodes file and a relationships file, we use the batch importer API to create the database. So, the final step happens on a single machine. Our importer code does the following (there's a rough sketch of this in Java after the list):
- Insert each node into the graph
- While inserting the nodes, build up a mapping between the Neo4j ID and the node ID (from the original file)
- While inserting nodes, create index entries for the stuff we want indexed
- Insert each edge into the graph; the edges file contains the node IDs as they are on HDFS, but we can look up the Neo4j node IDs through the mapping mentioned earlier
- Possibly create index entries for edges (we don't currently use those).
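For reference, a minimal sketch of what such an importer loop can look like against the BatchInserter API. The file names, property columns and the KNOWS relationship type are made up, and the exact package names and signatures depend on your Neo4j version, so treat it as an illustration rather than our actual code:

    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.unsafe.batchinsert.BatchInserter;
    import org.neo4j.unsafe.batchinsert.BatchInserters;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class SimpleImporter {
        public static void main(String[] args) throws Exception {
            BatchInserter inserter = BatchInserters.inserter("target/graph.db");
            // mapping from our own node ID (first column) to the Neo4j node ID
            Map<Long, Long> idMapping = new HashMap<Long, Long>();

            BufferedReader nodes = new BufferedReader(new FileReader("nodes.tsv"));
            for (String line; (line = nodes.readLine()) != null; ) {
                String[] fields = line.split("\t");
                Map<String, Object> props = new HashMap<String, Object>();
                props.put("name", fields[1]); // hypothetical property column
                long neoId = inserter.createNode(props);
                idMapping.put(Long.parseLong(fields[0]), neoId);
                // this is also where we'd add index entries for the node
            }
            nodes.close();

            RelationshipType KNOWS = DynamicRelationshipType.withName("KNOWS");
            BufferedReader rels = new BufferedReader(new FileReader("relationships.tsv"));
            for (String line; (line = rels.readLine()) != null; ) {
                String[] fields = line.split("\t");
                long start = idMapping.get(Long.parseLong(fields[0]));
                long end = idMapping.get(Long.parseLong(fields[1]));
                inserter.createRelationship(start, end, KNOWS, new HashMap<String, Object>());
            }
            rels.close();

            inserter.shutdown();
        }
    }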
Because we need the in-memory mapping between Neo4j IDs and node IDs from HDFS, we like our node IDs to be compact. Also, because in our case we make sure they are monotonically increasing IDs, we can store the mapping in an array of longs instead of a map, which is way more memory efficient (as long as you don't have more than Integer.MAX_VALUE nodes, this works).
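In a sketch like the one above, that boils down to replacing the HashMap with something along these lines (assuming our node IDs run from 0 to nodeCount - 1):

    // index = the node ID from the nodes file, value = the Neo4j node ID
    long[] idMapping = new long[nodeCount];

    // while inserting nodes:
    idMapping[(int) Long.parseLong(fields[0])] = neoId;

    // while inserting relationships:
    long start = idMapping[(int) Long.parseLong(fields[0])];
    long end = idMapping[(int) Long.parseLong(fields[1])];

Eight bytes per node instead of boxed Longs and HashMap entries adds up quickly once you have a lot of nodes.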
One more important thing (for us) is that the batch importer code reads the nodes and relationships files directly from HDFS, so the data comes over the wire. This leaves the filesystem buffers/caches on the machine that does the import untouched, leaving more free memory for memory mapping and less paging pressure. Also, this way the reading side of the importer does not compete with the writing side for IOPS, which is nice. Ideally, you'll want your full graph to fit in memory on the importing machine.
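Reading straight from HDFS instead of the local filesystem is just a matter of going through the Hadoop FileSystem API rather than java.io. A small helper you could drop into an importer class like the sketch above (the URI is made up, and you need the Hadoop client jars on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;

    // returns a reader over a file that lives on HDFS,
    // e.g. "hdfs://namenode:8020/graph/nodes.tsv"
    static BufferedReader openHdfsFile(String uri) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        return new BufferedReader(new InputStreamReader(fs.open(new Path(uri)), "UTF-8"));
    }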
Like Kris said, we are working on a MapReduce-based approach that creates the Neo4j file structure in a distributed manner. Creating a Neo4j db from the result would then just be a matter of concatenating the splits and placing them on a Neo4j server. It's currently not top prio, though. If we have a working version, it will be on GitHub.
Depending on what you are trying to achieve, you could also do graph analysis in Hadoop (like Marko points out). However, for interactive applications, Neo4j (a database) is probably the way to go.
Hope this helps. Don't hesitate to ask follow-up questions. Cheers,
Friso