issues with the Superfast Batch Importer


Rich Morin

May 31, 2014, 6:33:40 PM
to ne...@googlegroups.com
I've been trying out the new "Superfast Batch Importer":


It looks very promising, but I'm having a few problems.  Help?

Background

My real files will be enormous (e.g., 100M relationships), so I'm using moderate-sized test files:

  $ wc -l tmp/[nr]*
  128299  tmp/nodes.csv
   92661  tmp/rels.csv
  220960  total

Here is my batch.properties file, set up for a 32 GB, 8-core Mac Pro running OS X 10.7.5:

cache_type=none
use_memory_mapped_buffers=true
# 14 bytes per node
neostore.nodestore.db.mapped_memory=2G
# 33 bytes per relationship
neostore.relationshipstore.db.mapped_memory=20G
# 38 bytes per property
neostore.propertystore.db.mapped_memory=1G
# 60 bytes per long-string block
neostore.propertystore.db.strings.mapped_memory=1G
neostore.propertystore.db.index.keys.mapped_memory=50M
neostore.propertystore.db.index.mapped_memory=50M
# set up indexing
batch_import.node_index.Xhas_airport_code=exact
batch_import.node_index.XhasArea=exact
batch_import.node_index.Xhas_family_name=exact
batch_import.node_index.Xhas_GeoNames_Class_ID=exact
batch_import.node_index.Xhas_GeoNames_Entity_ID=exact
batch_import.node_index.Xhas_given_name=exact
batch_import.node_index.Xhas_gloss=exact
batch_import.node_index.Xhas_ISBN=exact
batch_import.node_index.Xhas_IMDB=exact
batch_import.node_index.Xhas_language_code=exact
batch_import.node_index.Xhas_motto=exact
batch_import.node_index.Xhas_official_language=exact
batch_import.node_index.Xhas_Synset_ID=exact
batch_import.node_index.Xhas_top-level_domain=exact
batch_import.node_index.Xhas_three-letter_language_code=exact
batch_import.node_index.Xis_preferred_meaning_of=exact
batch_import.node_index.Xlabel=exact
batch_import.node_index.Xns_name=exact
batch_import.node_index.Xpreferred_label=exact
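
(The per-record byte counts in those comments make the mapped-memory settings easy to sanity-check. A quick sketch of the arithmetic for my 100M-relationship target; the helper below is just illustrative, not part of the importer:)

```python
# Sanity-check a mapped-memory setting against the per-record sizes
# quoted in the batch.properties comments (illustrative helper only).
def mapped_mb(records, bytes_per_record):
    """Memory needed to map `records` records, in megabytes."""
    return records * bytes_per_record / (1024 * 1024)

# 100M relationships at 33 bytes each:
rels_mb = mapped_mb(100_000_000, 33)
print(f"relationshipstore needs ~{rels_mb:.0f} MB")  # ~3147 MB, far below the 20G configured
```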


Behavior

My Terminal output looks rather messy; perhaps some output buffering tweaks or newlines are needed:

$ time import.sh test.db -nodes ../nodes.csv -rels ../rels.csv
Neo4j Data Importer
Importer -db-directory <graph.db> -nodes <nodes.csv> -rels <rels.csv> -debug <debug config>

Using Existing Configuration File
[Current time:2014-05-31 15:11:32.258][Compile Time:Importer $ batch-import-2.1.0 $ 31/05/2014 04:12:24]
Node Import: [5] Property[292048] Node[128298] Relationship[0] Label[0] Disk[13 mb, 0 mb/sec] FreeMem[3173 mb]
[2014-05-31 15:11:41.02] Node file [nodes.csv] imported in 8 secs - [Property[292048] Node[128298] Relationship[0] Label[0]]
[2014-05-31 15:11:41.02]Node Import complete in 8 secs - [Property[292048] Node[128298] Relationship[0] Label[0]]java.lang.NumberFormatException: For input string: "owl_Thing"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:441)
at java.lang.Long.parseLong(Long.java:483)
at org.neo4j.batchimport.importer.structs.AbstractDataBuffer.getLong(AbstractDataBuffer.java:159)
at org.neo4j.unsafe.batchinsert.BatchInserterImplNew.accumulateNodeCount(BatchInserterImplNew.java:743)
at org.neo4j.batchimport.importer.stages.NodeStatsAccumulatorStage$2.execute(NodeStatsAccumulatorStage.java:24)
at org.neo4j.batchimport.importer.stages.ImportWorker.processData(ImportWorker.java:144)
at org.neo4j.batchimport.importer.stages.ImportWorker.run(ImportWorker.java:196)
Invoke stage method failed:ImportNode_Stage1:[Error in accumulateNodeCount - For input string: "owl_Thing"]:1
org.neo4j.kernel.api.Exceptions.BatchImportException: [Error in accumulateNodeCount - For input string: "owl_Thing"]
at org.neo4j.batchimport.importer.stages.ImportWorker.processData(ImportWorker.java:152)
at org.neo4j.batchimport.importer.stages.ImportWorker.run(ImportWorker.java:196)
Import worker:ImportNode_Stage1:[Error in accumulateNodeCount - For input string: "owl_Thing"]
Uncaught exception: java.lang.RuntimeException: [Error in accumulateNodeCount - For input string: "owl_Thing"]
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:321)
at org.neo4j.batchimport.importer.structs.DataBufferBlockingQ.putBuffer(DataBufferBlockingQ.java:243)
at org.neo4j.batchimport.importer.stages.ImportWorker.writeData(ImportWorker.java:160)
at org.neo4j.batchimport.importer.stages.ImportWorker.run(ImportWorker.java:198)
java.lang.InterruptedException: sleep interruptedImport worker:ImportNode_Stage0:null

at java.lang.Thread.sleep(Native Method)Uncaught exception: java.lang.RuntimeException

at org.neo4j.batchimport.importer.stages.ImportWorker.readData(ImportWorker.java:131)
at org.neo4j.batchimport.importer.stages.ImportWorker.run(ImportWorker.java:194)
Import worker:ImportNode_Stage4:sleep interrupted
java.lang.InterruptedException: sleep interruptedUncaught exception: java.lang.RuntimeException: sleep interrupted
...

Aside from the fact that the SBI doesn't like "owl_Thing", I have no clue what the problem is.
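
For what it's worth, the NumberFormatException comes from Long.parseLong choking on a non-numeric node reference, so a pre-flight scan of the rels file can list every offending value before a run. A hedged sketch, assuming tab-separated rels files with the start and end node IDs in the first two columns (adjust `id_columns`/`delimiter` to your actual layout):

```python
import csv

def find_non_numeric_ids(path, id_columns=(0, 1), delimiter="\t"):
    """Yield (line_no, value) for any start/end field that isn't an integer.

    Assumes tab-separated rels files with the start and end node IDs in
    the first two columns; both assumptions may need adjusting.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        next(reader, None)                      # skip the header row
        for line_no, row in enumerate(reader, start=2):
            for col in id_columns:
                try:
                    int(row[col])               # same check Long.parseLong performs
                except ValueError:
                    yield line_no, row[col]
```

Running it over my rels file (e.g. `for n, v in find_non_numeric_ids("tmp/rels.csv"): print(n, v)`) would have flagged "owl_Thing" before the importer did.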


Also, the Relationship Prescan counter (827, below) is incrementing only once a second:

  [827] Property[292048] Node[128298] Relationship[0] Label[0] Disk[13 mb, 0 mb/sec] FreeMem[3088 mb]

If this is counting relationships, this load is gonna take a loooong time...
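
To put numbers on that (assuming each once-per-second tick is one relationship, which I can't confirm from the output):

```python
# If each once-per-second tick of the prescan counter is one
# relationship (an assumption -- the counter's unit isn't documented),
# then the loads would take roughly:
test_rels = 92_661
print(f"test file: {test_rels / 3600:.1f} hours")           # ~25.7 hours
real_rels = 100_000_000
print(f"real file: {real_rels / 86_400 / 365:.1f} years")   # ~3.2 years
```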

Michael Hunger

May 31, 2014, 6:39:24 PM
to ne...@googlegroups.com
What do your files look like, can you show the first 10 lines of each?


--
You received this message because you are subscribed to the Google Groups "Neo4j" group.

Michael Hunger

May 31, 2014, 8:52:09 PM
to ne...@googlegroups.com
Rich, 

Thanks so much for looking into it.
As far as I understand, this new implementation does not yet support external IDs (i.e., those index lookups).

I'm working in parallel on another version that integrates with new kernel-level APIs that provide a similar functionality but will also support external ids.

So right now, for that new importer, you have to provide node IDs externally.

Cheers,

Michael


On Sun, Jun 1, 2014 at 12:33 AM, Rich Morin <r...@cfcl.com> wrote:


Rich Morin

Jun 1, 2014, 2:26:01 AM
to ne...@googlegroups.com
On Saturday, May 31, 2014 5:52:09 PM UTC-7, Michael Hunger wrote:
So right now for that new importer you have to provide node IDs externally

Yow!  OK, I can retrofit my latest changes onto an earlier version that generates IDs.  That said, can you give any idea of when you expect the indexing version to be available?
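
In the meantime, I'm thinking of something like this for the retrofit: map each external node ID to its row position in nodes.csv (assuming the importer assigns internal IDs sequentially in file order, which I haven't verified), then rewrite the rels file. The helpers below are my own sketch, not part of the importer, and assume tab-separated files with a header row:

```python
import csv

def build_id_map(nodes_path, id_col=0, delimiter="\t"):
    """Map each external node ID to its zero-based row position --
    assuming that's the internal ID the importer assigns in file order."""
    mapping = {}
    with open(nodes_path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        next(reader, None)                      # skip the header row
        for pos, row in enumerate(reader):
            mapping[row[id_col]] = pos
    return mapping

def rewrite_rels(rels_path, out_path, id_map, delimiter="\t"):
    """Write a copy of the rels file with the start/end columns
    replaced by their numeric internal IDs."""
    with open(rels_path, newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter=delimiter)
        writer = csv.writer(fout, delimiter=delimiter)
        writer.writerow(next(reader))           # copy the header through
        for row in reader:
            row[0], row[1] = id_map[row[0]], id_map[row[1]]
            writer.writerow(row)
```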

-r
 