Load very large CSV into Neo4j


mohsen

Dec 11, 2014, 1:24:38 AM
to ne...@googlegroups.com

I want to load a set of large RDF triple files into Neo4j. I have already written a map-reduce job that reads all the input N-Triples and outputs two CSV files: nodes.csv (7GB, 90 million rows) and relationships.csv (15GB, 120 million rows). I tried the batch-import command from Neo4j v2.2.0-M01, but it crashes after loading around 30M node rows. My machine has 16GB of RAM, so I set wrapper.java.initmemory=4096 and wrapper.java.maxmemory=13000. I then decided to split nodes.csv and relationships.csv into smaller parts and run batch-import for each part, but I don't know how to merge the databases created by the separate imports. I would appreciate any suggestion on how to load large CSV files into Neo4j.

[This question on Stack Overflow: http://stackoverflow.com/questions/27416262/load-very-large-csv-into-neo4j]
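For reference, a minimal sketch of invoking the 2.2-milestone import tool (bin/neo4j-import in 2.2 builds); the store directory, file paths, and heap size below are illustrative, not taken from this thread:

# hedged sketch; adjust paths and heap to your machine
export JAVA_OPTS="$JAVA_OPTS -Xmx10G"    # large heap for the importer
./bin/neo4j-import --into /data/graph.db \
    --nodes nodes.csv \
    --relationships relationships.csv \
    --stacktrace                         # print the full error on failure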

Andrii Stesin

Dec 11, 2014, 8:44:36 AM
to ne...@googlegroups.com
I'd suggest you take a look at the last 5-7 posts in this recent thread. You basically don't need any "batch import" command; just use the plain LOAD CSV functionality from Cypher and fill your database step by step.
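For example, a minimal LOAD CSV sketch (the column names row.id, row.value, row.start and row.end are guesses, and the batch size is arbitrary):

CREATE INDEX ON :Resource(uri);

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///nodes.csv" AS row
CREATE (:Resource {uri: row.id, value: row.value});

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///relationships.csv" AS row
MATCH (s:Resource {uri: row.start}), (e:Resource {uri: row.end})
CREATE (s)-[:RELATED {type: row.type}]->(e);

The index goes first so the MATCH lookups in the relationship pass don't scan all nodes; since Cypher cannot take a relationship type from a column, the type URI is stored as a property here.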

WBR,
Andrii

Chris Vest

Dec 11, 2014, 9:38:15 AM
to ne...@googlegroups.com
What does it say when you run the import command with --stacktrace?

--
Chris Vest
System Engineer, Neo Technology
[ skype: mr.chrisvest, twitter: chvest ]



mohsen

Dec 11, 2014, 9:30:03 PM
to ne...@googlegroups.com
I forgot to run it with --stacktrace. Someone suggested that I use Groovy with the batch inserter, and I am trying that right now. If I can't load the data with this approach, I will run batch-import again with --stacktrace to find the error message.

mohsen

Dec 11, 2014, 9:34:31 PM
to ne...@googlegroups.com
I guess the core code behind both batch-import and LOAD CSV is the same, so why do you think running it from Cypher (rather than through batch-import) would help? I am trying Groovy with the batch inserter now and will post how it goes.
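For anyone following along, a minimal sketch of that approach against the Neo4j 2.x batch inserter API (the file names, tab delimiter, and column layout are assumptions):

import org.neo4j.graphdb.DynamicLabel
import org.neo4j.graphdb.DynamicRelationshipType
import org.neo4j.unsafe.batchinsert.BatchInserters

def inserter = BatchInserters.inserter("/data/graph.db")
def ids = [:]  // CSV id -> internal node id; note this map must fit in the heap

new File("nodes.csv").splitEachLine("\t") { fields ->   // assumed tab-separated
    def (id, value) = fields
    ids[id] = inserter.createNode([value: value], DynamicLabel.label("Resource"))
}
new File("relationships.csv").splitEachLine("\t") { fields ->
    def (start, end, type) = fields
    inserter.createRelationship(ids[start], ids[end],
            DynamicRelationshipType.withName(type), [:])
}
inserter.shutdown()  // flushes the store; data is not durable before this returns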

Michael Hunger

Dec 12, 2014, 1:02:05 AM
to ne...@googlegroups.com
The Groovy one should work fine too. I wanted to augment the post with a version that uses @CompileStatic so that it's faster.
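For illustration, @CompileStatic just means annotating the hot code so Groovy compiles it with static dispatch (the class and method here are hypothetical):

import groovy.transform.CompileStatic

@CompileStatic
class CsvImport {
    // statically compiled: skips Groovy's dynamic method lookup on every call
    static String[] parseLine(String line) {
        return line.split("\t")
    }
}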

I'd also be interested in the --stacktrace output of the batch-import tool in Neo4j 2.2; perhaps you can let it run overnight or in the background.

Cheers, Michael


mohsen

Dec 12, 2014, 4:08:19 AM
to ne...@googlegroups.com
I could not load the data using Groovy either. I increased the Groovy heap size to 10G before running the script (via JAVA_OPTS). My machine has 16G of RAM. It halts after loading 41M rows from nodes.csv:


log: 
....
41200000 rows 38431 ms
41300000 rows 50988 ms
41400000 rows 63747 ms
41500000 rows 112758 ms 
41600000 rows 326497 ms

After logging 41,600,000 rows, nothing happened. I waited 2 hours and there was no progress. The process was still using CPU, but there was no free memory left at that point, which I guess is the cause. I have attached my Groovy script, where you can find the memory configuration. Something seems to go wrong with memory, since it halted exactly when all of my system's memory was used up.

I then switched back to the batch-import tool with --stacktrace. I think the error I got last time was due to a too-small heap, because I did not get it this time (after allocating a 10GB heap). Anyway, I have exactly 86,983,375 nodes and it loaded all of them this time, but then I got another error:

 Nodes
[INPUT-------------|ENCODER-----------------------------------------|WRITER] 86M
Calculate dense nodes
Import error: InputRelationship:
   properties: []
   startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
   endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
   type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
java.lang.RuntimeException: InputRelationship:
   properties: []
   startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
   endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
   type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
   properties: []
   startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
   endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
   type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)


It seems that it cannot find the start and end nodes of a relationship. However, both nodes exist in nodes.csv (I did a grep to be sure), so I don't know what is going wrong. Do you have any idea? Could it be related to the ID of the start node, "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?
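For what it's worth, the 2.2 import tool matches :START_ID and :END_ID values byte-for-byte against the :ID column of the nodes file, so a stray quote or escape earlier in the file can desynchronize the CSV parser even when grep finds both IDs. A sketch of the expected header layout (the property and label names are illustrative):

nodes.csv:          id:ID,value,:LABEL
relationships.csv:  :START_ID,:END_ID,:TYPE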

csv2neo4j.groovy

Michael Hunger

Dec 12, 2014, 4:40:56 AM
to ne...@googlegroups.com
It would have been good if you had taken a thread dump of the Groovy script.

But look at the memory:

off-heap (mmio) = 2+2+1+1 => 6GB
heap = 10GB

That leaves nothing for the OS, and the heap is probably GC-ing heavily as well. So you have to reduce the mmio mapping sizes; see the sketch after this message.

Was the output still on nodes, or already on relationships?

Perhaps also replace DynamicRelationshipType.withName(line.Type) with an enum.

You can also extend the trace output to include the number of nodes and relationships created so far.

Would you be able to share your CSV files?

Michael
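Following up on the mmio point, a hedged sketch of opening the batch inserter with reduced memory-mapped buffer sizes (the keys are Neo4j 2.x setting names; the values are guesses for a 16GB machine):

import org.neo4j.unsafe.batchinsert.BatchInserters

// smaller memory-mapped buffers leave room for the heap and the OS
def config = [
    "neostore.nodestore.db.mapped_memory"            : "512M",
    "neostore.relationshipstore.db.mapped_memory"    : "2G",
    "neostore.propertystore.db.mapped_memory"        : "1G",
    "neostore.propertystore.db.strings.mapped_memory": "512M",
]
def inserter = BatchInserters.inserter("/data/graph.db", config)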

mohsen

Dec 12, 2014, 5:26:17 AM
to ne...@googlegroups.com
Thanks, Michael, for following up on my problem. In the Groovy script, the output was still on nodes. It is not feasible to use an enum for the relationship types: the types are URIs of ontology predicates coming from the CSV file, and there are many of them. However, I think the real problem is that the script needs more than a 10GB heap, because it has to keep all the nodes in an in-memory map in order to create the relationships later. So I suspect that even reducing the mmio mapping sizes won't solve the problem, though I will try it tomorrow.

Regarding the batch-import command, do you have any idea why I am getting that error? 

mohsen

Dec 12, 2014, 5:27:56 AM
to ne...@googlegroups.com
I don't have any problem sharing my CSV files, but I don't know how and where I can share such large files.

Michael Hunger

Dec 12, 2014, 5:41:36 AM
to ne...@googlegroups.com
Your IDs are UUIDs, right? A 36-character string costs about 72 bytes, and Neo4j IDs are longs of 8 bytes, so roughly 80 bytes per entry. For 90M nodes you should allocate about 6G of heap for that map alone.

Btw., importing RDF 1:1 into Neo4j is not a good idea in the first place.

You should design a clean property graph model and import INTO that model.

As for the batch-import error: it's a bug that was fixed after the milestone. I'll try to get you a newer version to try.

Cheers, Michael


mohsen

Dec 12, 2014, 6:07:51 AM
to ne...@googlegroups.com
I would appreciate it if you could get me the newer version; I am already using 2.2.0-M01.

I want to run some graph queries over my RDF. First, I loaded the data into the Virtuoso triple store (which took 2-3 hours), but I could not get results for my SPARQL queries in a reasonable time. That is why I decided to load the data into Neo4j instead, to be able to run my queries.

I am importing RDF into Neo4j only for a specific research problem. I need to extract some patterns from the RDF data, which requires writing queries that do some sort of graph traversal. I don't want to do reasoning over the RDF. The graph structure is simple: nodes only have a label (Uri or Literal) and a value, and relationships don't have any properties.
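To illustrate the kind of traversal such a model supports, a query might look like this (the URI is made up; the labels and value property follow the description above):

MATCH (s:Uri {value: "http://example.org/subject"})-[p]->(o:Literal)
RETURN type(p) AS predicate, o.value AS object;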

Michael Hunger

Dec 12, 2014, 6:13:15 AM
to ne...@googlegroups.com
Right, that's the problem with an RDF model that only uses relationships to represent properties: you won't get the performance that you would get with a real property graph model.

I'll share the version separately.

Cheers, Michael

mohsen

Dec 12, 2014, 3:26:51 PM
to ne...@googlegroups.com
Thanks for sharing the new version. Here is my memory info before running batch-import:

Mem:  18404972k total,   549848k used, 17855124k free,    12524k buffers
Swap:  4063224k total,        0k used,  4063224k free,   211284k cached

I assigned 11G to the heap: export JAVA_OPTS="$JAVA_OPTS -Xmx11G"
I started batch-import at 11:13am; it is now 12:20pm and it seems to be stuck. Here is the log:

Nodes
[INPUT-------------------|NODE-------------------------------------------------|PROP|WRITER: W:] 86M
Done in 15m 21s 150ms
Calculate dense nodes
[INPUT---------|PREPARE(2)====================================================================|]   0


And this is my memory info right now:

top - 12:22:43 up  1:34,  3 users,  load average: 0.00, 0.00, 0.00
Tasks: 134 total,   1 running, 133 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.5%sy,  0.0%ni, 99.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  18404972k total, 18244612k used,   160360k free,     6132k buffers
Swap:  4063224k total,        0k used,  4063224k free, 14089236k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                   
 4496 root      20   0 7598m 3.4g  15m S  3.3 19.4  20:35.88 java        

It has been stuck in "Calculate dense nodes" for more than 40 minutes. Should I keep waiting, or should I kill the process?

mohsen

Dec 12, 2014, 5:33:14 PM
to ne...@googlegroups.com
Michael, I sent you a separate email with credentials to access the CSV files. Thanks.

mohsen

Dec 15, 2014, 11:53:34 PM
to ne...@googlegroups.com
With Michael's help, I could finally load the data using the batch-import command in Neo4j 2.2.0-M02. It took 56 minutes in total. The issue preventing Neo4j from loading the CSV files was the presence of \" in some values: it was interpreted as an escaped quotation character to be included in the field value, which messed up the parsing of everything from that point forward.
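For anyone hitting the same issue: standard CSV escapes a literal quote by doubling it (""), so one hedged fix is rewriting the backslash escapes before importing (a sketch; keep the originals until the import succeeds):

sed 's/\\"/""/g' nodes.csv > nodes-fixed.csv
sed 's/\\"/""/g' relationships.csv > relationships-fixed.csv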

mohsen

Dec 15, 2014, 11:54:24 PM
to ne...@googlegroups.com
Here are the statistics of the load:

Nodes
[INPUT----------|NODE--------------------------------------------------------------|PROP|WRITER] 87M
Done in 15m 11s 91ms
Calculate dense nodes
[INPUT--------------|PREPARE(2)=============================================================|CA]114M
Done in 18m 18s 880ms
Relationships
[INPUT--------------|PREPARE(2)=========================================================|REL]114M
Done in 18m 46s 226ms
Node first rel
[LINKER----------------------------------------------------------------------------------------] 84M
Done in 1m 1s 629ms
Relationship back link
[LINKER----------------------------------------------------------------------------------------]113M
Done in 2m 9s 3ms
Node counts
[NODE COUNTS-----------------------------------------------------------------------------------] 75M
Done in 12s 906ms
Relationship counts
[RELATIONSHIP COUNTS---------------------------------------------------------------------------]113M
Done in 38s 374ms

IMPORT DONE in 56m 25s 25ms