While we are waiting for Kris' work to be ready (no pressure, Kris), we use the batch importer API to get data from Hadoop into Neo4j.
I'll describe our process in brief. First, we turn the raw data into a graph structure in Hadoop. When working with records, you have to figure out what becomes a node and what becomes a relationship. We turn our raw data into a nodes file and a relationships file on HDFS. Depending on the data, we either use handwritten jobs in Java/Cascading, or we manage things with Hive (which gives slower jobs, but less dev time).
Our file formats are trivial. The nodes file is tab-separated and each line starts with an ID field (a unique ID for that node). In the relationships file, each line starts with two IDs for the start and end node. One potential issue here is that your records do not have (unique) IDs, or don't have compact IDs (e.g. very long strings, which are inconvenient in our importer code). We solve this by assigning a monotonically increasing ID to each entity in the nodes file. There is a distributed approach for assigning row numbers described here:
http://waredingen.nl/monotonically-increasing-row-ids-with-mapredu
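To make that concrete, a couple of made-up example lines (tab separated, shown with spaces here; only the leading ID columns are fixed in our format, the rest is whatever properties you want to carry along):

    nodes file:
    0    Alice    1975
    1    Bob      1980

    relationships file:
    0    1    KNOWS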
When we have a nodes file and a relationships file, we use the batch importer API to create the database. So, the final step happens on a single machine. Our importer code does the following (there's a rough sketch of this in Java after the list):
- Insert each node into the graph
- While inserting the nodes, build up a mapping between the Neo4j ID and the node ID (from the original file)
- While inserting nodes, create index entries for the stuff we want indexed
- Insert each edge into the graph; the edges file contains the node IDs as they are on HDFS, but we can look up the Neo4j node IDs through the mapping mentioned earlier
- Possibly create index entries for edges (we don't currently use those).
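For reference, a minimal sketch of what such an importer loop can look like against the BatchInserter API. The file names, property columns and the KNOWS relationship type are made up, and the exact package names and signatures depend on your Neo4j version, so treat it as an illustration rather than our actual code:

    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.unsafe.batchinsert.BatchInserter;
    import org.neo4j.unsafe.batchinsert.BatchInserters;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class SimpleImporter {
        public static void main(String[] args) throws Exception {
            BatchInserter inserter = BatchInserters.inserter("target/graph.db");
            // mapping from our own node ID (first column) to the Neo4j node ID
            Map<Long, Long> idMapping = new HashMap<Long, Long>();

            BufferedReader nodes = new BufferedReader(new FileReader("nodes.tsv"));
            for (String line; (line = nodes.readLine()) != null; ) {
                String[] fields = line.split("\t");
                Map<String, Object> props = new HashMap<String, Object>();
                props.put("name", fields[1]); // hypothetical property column
                long neoId = inserter.createNode(props);
                idMapping.put(Long.parseLong(fields[0]), neoId);
                // this is also where we'd add index entries for the node
            }
            nodes.close();

            RelationshipType KNOWS = DynamicRelationshipType.withName("KNOWS");
            BufferedReader rels = new BufferedReader(new FileReader("relationships.tsv"));
            for (String line; (line = rels.readLine()) != null; ) {
                String[] fields = line.split("\t");
                long start = idMapping.get(Long.parseLong(fields[0]));
                long end = idMapping.get(Long.parseLong(fields[1]));
                inserter.createRelationship(start, end, KNOWS, new HashMap<String, Object>());
            }
            rels.close();

            inserter.shutdown();
        }
    }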
Because we need the in-memory mapping between Neo4j IDs and node IDs from HDFS, we like our node IDs to be compact. Also, because in our case we make sure they are monotonically increasing IDs, we can store the mapping in an array of longs instead of a map, which is way more memory efficient (as long as you don't have more than Integer.MAX_VALUE nodes, this works).
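In a sketch like the one above, that boils down to replacing the HashMap with something along these lines (assuming our node IDs run from 0 to nodeCount - 1):

    // index = the node ID from the nodes file, value = the Neo4j node ID
    long[] idMapping = new long[nodeCount];

    // while inserting nodes:
    idMapping[(int) Long.parseLong(fields[0])] = neoId;

    // while inserting relationships:
    long start = idMapping[(int) Long.parseLong(fields[0])];
    long end = idMapping[(int) Long.parseLong(fields[1])];

Eight bytes per node instead of boxed Longs and HashMap entries adds up quickly once you have a lot of nodes.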
One more important thing (for us) is that the batch importer code reads the nodes and relationships files directly from HDFS, so the data comes over the wire. This leaves the filesystem buffers/caches on the machine that does the import untouched, leaving more free memory for memory mapping and less paging pressure. Also, this way the reading side of the importer does not compete with the writing side for IOPS, which is nice. Ideally, you'll want your full graph to fit in memory on the importing machine.
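Reading straight from HDFS instead of the local filesystem is just a matter of going through the Hadoop FileSystem API rather than java.io. A small helper you could drop into an importer class like the sketch above (the URI is made up, and you need the Hadoop client jars on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;

    // returns a reader over a file that lives on HDFS,
    // e.g. "hdfs://namenode:8020/graph/nodes.tsv"
    static BufferedReader openHdfsFile(String uri) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        return new BufferedReader(new InputStreamReader(fs.open(new Path(uri)), "UTF-8"));
    }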
Like Kris said, we are working on a MapReduce-based approach that creates the Neo4j file structure in a distributed manner. Creating a Neo4j db from the result would then just be a matter of concatenating the splits and placing them on a Neo4j server. It's currently not top prio, though. If we have a working version, it will be on GitHub.
Depending on what you are trying to achieve, you could also do graph analysis in Hadoop (like Marko points out). However, for interactive applications, Neo4j (a database) is probably the way to go.
Hope this helps. Don't hesitate to ask follow-up questions. Cheers,
Friso