LOAD CSV on bulk, performance


gg4u

Aug 12, 2014, 12:13:39 PM
to ne...@googlegroups.com
Hello,

I am trying to upload a massive network:
4M nodes, 100M correlations.

I am having memory and performance problems, and I'd like to know whether I am doing this right:

1. Before loading the correlations, I wanted to load the nodes.

2. Set up neo4j-wrapper.conf and neo4j.properties as written in 

with the JVM heap set to 4096 MB.

With these settings, the bulk load of 4M nodes failed.

3. Raised the min-heap and max-heap to 6144 MB.
Ran a test with 100K nodes.
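For reference, the heap settings live in conf/neo4j-wrapper.conf; the lines I raised look like this (values are in MB):

```
wrapper.java.initmemory=6144
wrapper.java.maxmemory=6144
```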

I got:
Nodes created: 98991
Properties set: 197982
Labels added: 98991
3438685 ms

Almost an hour to upload 100K nodes with two properties? I thought it would be much faster.

Am I doing something wrong?
This is the importer code I used:

CREATE CONSTRAINT ON (n:MYNODES) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :MYNODES(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///blablabla.csv' AS line  FIELDTERMINATOR '\t' 
WITH line, toInt(line.topicId) as id, line.name as name LIMIT 100000
MERGE (n:MYNODES { id: id, name: name });
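Or should I MERGE only on the property backed by the uniqueness constraint, and set the rest on create, so the lookup can use the constraint's index? Something like this (untested sketch):

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///blablabla.csv' AS line FIELDTERMINATOR '\t'
WITH line, toInt(line.topicId) AS id, line.name AS name LIMIT 100000
// MERGE on id alone so the uniqueness constraint's index is used,
// instead of matching on the full {id, name} map
MERGE (n:MYNODES { id: id })
ON CREATE SET n.name = name;
```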


Rik Van Bruggen

Aug 12, 2014, 12:46:00 PM
to ne...@googlegroups.com
I think you should use the batch importer for this size of a graph. You will be done in minutes, not hours.


Rik

gg4u

Aug 12, 2014, 1:44:04 PM
to ne...@googlegroups.com
Hi Rik!

...in minutes?

I'd like to understand how I could get closer to that result, though I will also try that library.

That's strange to me, because whether I use the LOAD CSV functionality from the shell or commit a transaction each time, I seem to run into a heap memory problem.

Why should a bulk import from the shell be so much slower than the batch-import script?

Also, I see the importer is flexible enough, but my custom file (an adjacency list, to avoid redundancy) is more than 1 GB; if I expand it into a CSV full of redundant node-rel-neighbor1, node-rel-neighbor2 rows, it will be much, much bigger, and I am worried whether it can be handled.
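If I do expand it, I would stream the adjacency list line by line and emit one relationship row per neighbor, so memory shouldn't be an issue regardless of file size. A rough sketch (the tab-separated layout and the relationship type are my assumptions):

```python
def expand_adjacency(adj_lines, rel_type="RELATED_TO"):
    """Yield one (node, type, neighbor) row per neighbor from
    tab-separated adjacency lines: node<TAB>nb1<TAB>nb2..."""
    for line in adj_lines:
        parts = line.rstrip("\n").split("\t")
        node, neighbors = parts[0], parts[1:]
        for nb in neighbors:
            yield (node, rel_type, nb)

# Two adjacency rows expand to three relationship rows
rows = list(expand_adjacency(["25\t39\t41", "39\t41"]))
print(rows)  # [('25', 'RELATED_TO', '39'), ('25', 'RELATED_TO', '41'), ('39', 'RELATED_TO', '41')]
```

Reading the real file with open(...) and writing rows out with csv.writer would keep memory flat no matter how big the expanded CSV gets.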

A question:
I read that node ids start from 0.

Are they temporary ids, or mandatory?
E.g. what if I want to load another subgraph into the same db with the batch importer (clearly without overwriting the existing nodes)?

gg4u

Aug 12, 2014, 6:12:41 PM
to ne...@googlegroups.com
Hello,

I am following Rik's advice, and it looks really promising!

I still have issues when using my own custom CSV with the batch importer:
Exception in thread "main" org.neo4j.graphdb.NotFoundException: id=39
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.getNodeRecord(BatchInserterImpl.java:1215)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.createRelationship(BatchInserterImpl.java:777)
at org.neo4j.batchimport.Importer.importRelationships(Importer.java:154)
at org.neo4j.batchimport.Importer.doImport(Importer.java:232)
at org.neo4j.batchimport.Importer.main(Importer.java:83)


I think I am getting closer to this mega-import! (I really hope so :P)
Could you please help me figure out what the problem may be?

My hypotheses:

1. I thought it was because it cannot find a node that is listed as the start/end of a relationship.

So I checked my nodes.csv and rels.csv and made trivial files with two nodes and one relationship, but I still got the error.

2. The batch importer documentation says:
  • have to know max # of rels per node, properties per node and relationship
Where and how should this be specified? In nodes.csv or rels.csv?
Should the number of relationships be specified in the 'rels' column of nodes.csv, as in the test.db example?
But it is not shown in the documentation example on git. I'm confused!

3. The documentation paragraph about schema indexes is not clear to me: does it mean I can take the nodes.csv and rels.csv files used for test.db and modify the header and the batch.properties file according to my own custom structure?

What does the counter:int property refer to?

Here is what I've done:

1. Headers in nodes.csv and rels.csv

nodes.csv headers:
id:int mynamelabel:label name:string:mynodeindex

rels.csv headers:
id:int id:int type proximity counter:int

2. Indexes

I want to use my own indexes:
node.id values (specified as int) are unique, but not in sequential order.
Is that an issue?
E.g. my node list looks like:

node.id property
25 mark
39 julie

What is the difference between an exact index and a fulltext index?

3. My batch.properties

dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.index.mapped_memory=5M
# 14 bytes per node
neostore.nodestore.db.mapped_memory=200M
# 33 bytes per relationship
neostore.relationshipstore.db.mapped_memory=4G
# 38 bytes per property
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=500M
batch_array_separator=,
#batch_import.csv.quotes=true
#batch_import.csv.delim=,
batch_import.keep_db=true
#
batch_import.node_index.mynodeindex=exact
batch_import.node_index.id=exact
batch_import.node_index.node_auto_index=exact
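The mapped-memory sizes above come from rough arithmetic on my graph (4M nodes, 100M relationships, two properties per node), using the per-record byte counts in the comments:

```python
nodes = 4_000_000
rels = 100_000_000
props = nodes * 2  # two properties per node

node_mb = nodes * 14 / 1e6  # 14 bytes per node record
rel_gb = rels * 33 / 1e9    # 33 bytes per relationship record
prop_mb = props * 38 / 1e6  # 38 bytes per property record

print(node_mb, rel_gb, prop_mb)  # 56.0 3.3 304.0
```

so 200M for the node store and 4G for the relationship store should leave some headroom.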

P.S.

Once the db is loaded into Neo4j, are constraints on properties and indexes already present?
I was trying to match a node in test.db, but I could not find it with a simple query:
MATCH (a {label:254782})-[r]-b Return r Limit 25
and the query takes a very long time to compute, which makes me suspect the indexes were not created properly.
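Maybe I should anchor the query on the label and the indexed property instead, so the index can actually be used? Something like this (the property name is my guess for test.db):

```cypher
// Anchoring on the label plus the indexed property lets Cypher use the
// index; a bare property map forces a scan over all nodes
MATCH (a:MYNODES { id: 254782 })-[r]-(b)
RETURN r LIMIT 25;
```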


Thank you very much for your help!

Rik Van Bruggen

Aug 13, 2014, 5:09:11 AM
to ne...@googlegroups.com
Batch import is completely different from LOAD CSV:
  • LOAD CSV is a transactional import into a running server
  • batch-import is a non-transactional, all-or-nothing import directly into the Neo4j store files. The server is not running at that time. You can then use the store files to run the server after the import.
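A typical run, as I remember it from the batch-import README (jar path and heap size are illustrative):

```
java -server -Xmx4G -jar batch-import-jar-with-dependencies.jar graph.db nodes.csv rels.csv
```

Afterwards, point the server at the resulting store (e.g. org.neo4j.server.database.location=graph.db in neo4j-server.properties) and start it.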
Hope that makes sense.

Rik





--
Rik Van Bruggen
skype: rvanbruggen
Join us at GraphConnect 2014 San Francisco! graphconnect.com
As a friend of Neo4j, use discount code *KOMPIS* for $100 off registration

gg4u

Aug 13, 2014, 6:27:32 AM
to ne...@googlegroups.com
Hi Rik,

Yes, that totally makes sense.
Yesterday I dove into the batch importer. I was able to import a test.db as described on git, using a generator, but I'm having some issues with my real db; I wrote them in the message above and couldn't solve them yet.