I want to load a set of large RDF triple files into Neo4j. I have already written a MapReduce job that reads all the input N-Triples and outputs two CSV files: nodes.csv (7 GB, 90 million rows) and relationships.csv (15 GB, 120 million rows). I tried the batch-import command from Neo4j v2.2.0-M01, but it crashes after loading around 30M node rows. My machine has 16 GB of RAM, so I set wrapper.java.initmemory=4096 and wrapper.java.maxmemory=13000. I then decided to split nodes.csv and relationships.csv into smaller parts and run batch-import on each part. However, I don't know how to merge the databases created by multiple imports. I would appreciate any suggestions on how to load large CSV files into Neo4j.
[Link to this question in StackOverflow: http://stackoverflow.com/questions/27416262/load-very-large-csv-into-neo4j]
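For reference, the splitting step mentioned in the question could look roughly like the sketch below. It assumes that each CSV is comma-delimited and starts with a header row that every part needs to repeat; the function name `split_csv` and the part-file naming are made up for illustration. It only covers the splitting and does nothing about merging the resulting databases.

```python
import csv
import itertools
from pathlib import Path


def split_csv(src, out_dir, rows_per_part):
    """Split src into numbered part files, repeating the header row in each.

    Assumption: the first row of src is a header. Returns the part paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        for index in itertools.count():
            # Pull at most rows_per_part rows; an empty chunk means end of file.
            chunk = list(itertools.islice(reader, rows_per_part))
            if not chunk:
                break
            part = out_dir / f"{Path(src).stem}-part{index:04d}.csv"
            with open(part, "w", newline="", encoding="utf-8") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(chunk)
            parts.append(part)
    return parts


# Hypothetical usage: split_csv("nodes.csv", "parts", 10_000_000)
```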
On 11 Dec 2014, at 07:24, mohsen <mh.tah...@gmail.com> wrote:
Nodes
[INPUT-------------|ENCODER-----------------------------------------|WRITER] 86M
Calculate dense nodes
Import error: InputRelationship:
properties: []
startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
java.lang.RuntimeException: InputRelationship:
properties: []
startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
properties: []
startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
type: http://purl.org/ontology/echonest/beatVariance specified start node that hasn't been imported
at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
It seems that it cannot find the start and end nodes of a relationship. However, both nodes exist in nodes.csv (I ran a grep to be sure), so I don't know what is going wrong. Do you have any idea? Could it be related to the id of the start node, "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?
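A stricter check than grep might help here, since grep matches the id as a substring anywhere on a line, while an exact comparison after CSV parsing would expose differences in whitespace, quoting, or escaping (for example `%20` versus a literal space). The sketch below is only an illustration: the column positions, the presence of a header row, and the function names are assumptions, not the importer's own logic. Holding 90 million ids in a Python set may also need more memory than is available, so it may only be practical on a sample of the files.

```python
import csv


def load_node_ids(nodes_path, id_column=0):
    """Collect the exact id strings from the nodes file (header row skipped)."""
    with open(nodes_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # assumption: first row is a header
        return {row[id_column] for row in reader if row}


def find_missing_endpoints(rels_path, node_ids, start_column=0, end_column=1, limit=10):
    """Return up to about `limit` (line, role, id) tuples for unknown endpoints."""
    missing = []
    with open(rels_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # assumption: first row is a header
        for row in reader:
            if not row:
                continue
            for role, column in (("start", start_column), ("end", end_column)):
                if row[column] not in node_ids:
                    # reader.line_num is the physical line just read from the file
                    missing.append((reader.line_num, role, row[column]))
            if len(missing) >= limit:
                break
    return missing


# Hypothetical usage:
# ids = load_node_ids("nodes.csv")
# for line, role, value in find_missing_endpoints("relationships.csv", ids):
#     print(line, role, repr(value))
```

Printing the id with `repr` makes stray spaces or quote characters visible.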
Mem: 18404972k total, 549848k used, 17855124k free, 12524k buffers
Swap: 4063224k total, 0k used, 4063224k free, 211284k cached
Nodes
[INPUT-------------------|NODE-------------------------------------------------|PROP|WRITER: W:] 86M
Done in 15m 21s 150ms
Calculate dense nodes
[INPUT---------|PREPARE(2)====================================================================|] 0
And this is my memory info right now:
top - 12:22:43 up 1:34, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 134 total, 1 running, 133 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 0.5%sy, 0.0%ni, 99.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 18404972k total, 18244612k used, 160360k free, 6132k buffers
Swap: 4063224k total, 0k used, 4063224k free, 14089236k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4496 root 20 0 7598m 3.4g 15m S 3.3 19.4 20:35.88 java
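One way to read the memory figures above: in this top layout, "used" equals total minus free (18404972k - 160360k = 18244612k), so it includes buffers and the page cache. The sketch below subtracts those to estimate what processes actually hold. The parsing helper and its key names are made up for illustration, and the two input lines are copied from the output above.

```python
import re

KIB_PER_GIB = 1024 * 1024


def parse_top_memory(lines):
    """Map e.g. 'mem_used' or 'swap_cached' to the value in KiB."""
    fields = {}
    for line in lines:
        section = line.split(":", 1)[0].strip().lower()
        for value, name in re.findall(r"(\d+)k\s+(\w+)", line):
            fields[f"{section}_{name}"] = int(value)
    return fields


top_lines = [
    "Mem: 18404972k total, 18244612k used, 160360k free, 6132k buffers",
    "Swap: 4063224k total, 0k used, 4063224k free, 14089236k cached",
]
mem = parse_top_memory(top_lines)

# top prints the page-cache figure at the end of the Swap line.
cache_kib = mem["swap_cached"]
process_kib = mem["mem_used"] - mem["mem_buffers"] - cache_kib

print(f"page cache:        {cache_kib / KIB_PER_GIB:.2f} GiB")    # 13.44 GiB
print(f"held by processes: {process_kib / KIB_PER_GIB:.2f} GiB")  # 3.96 GiB
```

Under that reading, most of the "used" memory is file cache rather than heap, which is roughly in line with the 3.4g resident size shown for the java process.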