Re: large cypher statements


Michael Hunger

Nov 19, 2014, 8:04:44 AM
to "José F. Morales Ph.D.", ne...@googlegroups.com
José,

Let's continue the discussion on the google group

By larger I meant the amount of data, not the size of statements.

As I also point out in various places, we recommend creating only small subgraphs with a single statement, separated by semicolons.
E.g. up to 100 nodes and rels.

Gigantic statements just make the parser explode.

I recommend splitting them up into statements creating subgraphs,
or create nodes and later match them by label & property to connect them.
Make sure to have appropriate indexes / constraints.

You should also surround blocks of statements with begin and commit commands.
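For illustration, a batched script along these lines might look like this (the labels, properties, and values are invented; begin/commit as understood by the neo4j-shell):

```cypher
begin
// one small subgraph per transaction, each statement ending in a semicolon
CREATE (:Test {name:'Acetoacetate'});
CREATE (:Panel {name:'Ketones'});
MATCH (t:Test {name:'Acetoacetate'}), (p:Panel {name:'Ketones'})
CREATE (t)-[:PART_OF]->(p);
commit
```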

Sent from my iPhone

On Nov 19, 2014, at 04:18, José F. Morales Ph.D. <jm3...@columbia.edu> wrote:

Hey Michael and Kenny

Thank you guys a bunch for the help.

Let me give you a little background.  I am charged with making a prototype of a tool (“LabCards”) that we hope to use in the hospital and beyond at some point.  In preparation for making the main prototype, I made two prior Neo4j databases that worked exactly as I wanted them to.  The first database was built with NIH data and had 183 nodes and around 7500 relationships.  The second database was the pre-prototype and it had 1080 nodes and around 2000 relationships.  I created these in the form of cypher statements and either pasted them into the Neo4j browser or used the neo4j shell and loaded them as text files. Before doing that I checked the cypher code with Sublime Text 2, which highlights the code. Both databases loaded fine with both methods and did what I wanted them to do.

As you might imagine, the prototype is an expansion of the mini-prototype.  It has almost the same data model and I built it as a series of cypher statements as well.  My first version of the prototype had ~60k nodes and 160k relationships.  

I should say that a feature of this model is that all the source and target nodes have relationships that point to each other.  No node points to itself as far as I know. This file was 41 Mb of cypher code that I tried to load via the neo4j shell.  

In fact, I was following your advice on loading big data files... “Use the Neo4j-Shell for larger Imports”  (http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/).   This first time out, Java maxed out its allocated memory at 4 GB (2x) and did not complete loading in 24 hours.  I killed it.

I then contacted Kenny, and he generously gave me some advice regarding the properties file (below), and again the same deal (4 GB memory, 2x) with Java and no success in about 24 hours. I killed that one too.

Given my loading problems, I have subsequently eliminated a bunch of relationships (100k) so that the file is now 21 Mb. A lot of these were duplicates that I didn’t pick up before, and I am trying it again.  So far, 15 min into it, it's a similar situation.  The difference is that Java is using 1.7 and 0.5 GB of memory.

Here is the cypher for a typical node…

CREATE ( CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ{NAME:'Acetoacetate (ketone body)',SYNONYM:'',Sample:'SERUM, URINE',MEDCODE:10010,CUI:'NA'})

Here is the cypher for a typical relationship...

CREATE(CLT_1)-[:MEASUREMENT_OF{Phylum:'TZ',CAT:'TEST.NAME',Ui_Rl:'T157',RESULT:'',Type:'',Semantic_Distance_Score:'NA',Path_Length:'NA',Path_Steps:'NA'}]->(CLT_TARGET_3617),

I will let you know how this one turns out.  I hope this is helpful.

Many, many thanks fellas!!!

Jose

On Nov 18, 2014, at 8:33 PM, Michael Hunger <michael...@neotechnology.com> wrote:

Hi José,

can you provide perhaps more detail about your dataset (e.g. sample of the csv, size, etc. perhaps an output of csvstat (of csvkit) would be helpful), your cypher queries to load it

Have you seen my other blog post, which explains two big caveats that people run into when trying this? jexp.de/blog/2014/10/load-cvs-with-success/

Cheers, Michael

On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <k...@socialmoon.com> wrote:
Hey Jose,

There is definitely an answer. Let me put you in touch with the data import master: Michael Hunger.

Michael, I think the answers here will be pretty straight forward for you. You met Jose at GraphConnect NY last year, so I'll spare any introductions. The memory map configurations I provided need to be calculated and customized for the data import volume.

Thanks,

Kenny

Sent from my iPhone

On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <jm3...@columbia.edu> wrote:

Kenny,  

In 3 hours it’ll have been trying to load for 24 hours, so this is not working.  I’m catching shit from my crew too, so I’ve got to fix this, like, soon.

I haven’t done this before, but can I break up the data and load it in pieces?

Jose

On Nov 17, 2014, at 3:35 PM, Kenny Bastani <k...@socialmoon.com> wrote:

Hey Jose,

Try turning off the object cache. Add this line to the neo4j.properties configuration file:

cache_type=none

Then retry your import. Also, enable memory mapped files by adding these lines to the neo4j.properties file:

neostore.nodestore.db.mapped_memory=2048M
neostore.relationshipstore.db.mapped_memory=4096M
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=500M

Thanks,

Kenny


From: José F. Morales Ph.D. <jm3...@columbia.edu>
Sent: Monday, November 17, 2014 12:32 PM
To: Kenny Bastani
Subject: latest
 
Hey Kenny,

Here’s the deal. As I think I said, I loaded the 41 Mb file of cypher code via the neo4j shell. Before I tried the LabCards file, I tried the movies file and a UMLS database I made (8k relationships).  They worked fine. 

The LabCards file is taking a LONG time to load: I started at about 9:30 - 10 PM last night and it’s 3 PM now.

I’ve wondered if it’s hung up; the activity monitor’s memory usage is constant, at two rows of Java at 4 GB w/ the kernel at 1 GB.  The CPU panel changes a lot, so it looks like it’s doing its thing.

So is this how things are expected to go?  Do you think the loading is gonna take a day or two?

Jose


|//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
José F. Morales Ph.D.
Instructor
Cell Biology and Pathology
Columbia University Medical Center


Andrii Stesin

Nov 19, 2014, 8:35:42 AM
to ne...@googlegroups.com, jm3...@columbia.edu
Hi,

maybe I'm missing something, but the very idea of importing a database into Neo4j via a pack of Cypher "CREATE" statements seems counter-productive to me.

I'd suggest dumping the database into 2 .csv files:
  1. the first one describes nodes (and enumerates them via some my_node_id integer attribute), columns: my_node_id,label,node_prop_01,node_prop_ZZ
  2. the second one describes relations, columns: source_my_node_id,dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN
Then import the first file via LOAD CSV, then the second one in the same way. Don't forget to add your needed indexes and/or constraints before starting the import! This will save you much time and effort.

Mixing creation of nodes and relationships in a single LOAD CSV statement turned out to be a risky idea, at least as of the early 2.1.* releases. Also, I'd recommend importing no more than 10000-30000 lines in a single LOAD CSV statement - it depends on your hardware and especially your RAM setup. Just split your BIG .csv into parts with an appropriate number of lines - that's what I did, e.g., with my experimental 9-million-line .csv
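A rough sketch of that two-file import (the column and file names follow the layout above; the label, rel type, and paths are placeholders):

```cypher
// 1) nodes first
LOAD CSV WITH HEADERS FROM "file:/path/to/nodes.csv" AS line
CREATE (:MyLabel {my_node_id: toInt(line.my_node_id), node_prop_01: line.node_prop_01});

// 2) then relationships, looking both endpoints up by the temporary id
LOAD CSV WITH HEADERS FROM "file:/path/to/rels.csv" AS line
MATCH (s:MyLabel {my_node_id: toInt(line.source_my_node_id)}),
      (d:MyLabel {my_node_id: toInt(line.dest_my_node_id)})
CREATE (s)-[:REL_TYPE {rel_prop_01: line.rel_prop_01}]->(d);
```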

WBR,
Andrii

Andrii Stesin

Nov 19, 2014, 8:41:21 AM
to ne...@googlegroups.com, jm3...@columbia.edu
Also it would be interesting to have a look at the target (graph) data model.

José F. Morales

Nov 19, 2014, 2:36:50 PM
to ne...@googlegroups.com, jm3...@columbia.edu
OK Fellas,

As you might imagine, the last effort I made didn't work either, even though I cut down the relationships a lot.  Same story: Java maxing out at 4 GB and not doing anything for 12+ hours.

OK, so here is my understanding of the approaches and my likely course of action.  Some aspects you guys cite I’m not familiar with... particularly indexes and constraints. Never used them before.  I’m going to look for examples that can give me an idea of how to do them.

Approaches:

1. Approach 1:
a. “Creating only small subgraphs with a single statement separated by semicolons.  E.g. up to 100 nodes and rels”   
b. surround blocks of statements with begin and commit commands
c. I am assuming that this approach involves cypher statements uploaded via the neo4j shell

I am assuming that the format you are referring to is similar to what was used in the movies db.  There, a few nodes were created, then the relationships that used them were created, and so on.  Since I used the “movies” DB as my model, I did not use the “begin” and “commit” commands in my previous code. They seemed to work fine and I didn’t know I needed them.  I will look up how to use them.  However, this means making sure the nodes and relationships are in the proper order.  That’ll take a little work.

2. Approach 2
a. “…create nodes and later match them by label & property to connect them”
b. surround blocks of statements with begin and commit commands
c. I am assuming that this approach involves cypher statements uploaded via the neo4j shell

I am not sure exactly what you mean here in terms of “…match them by label & property to connect them”

3. CSV approach 
a. “Dump the base into 2 .csv files:”
b. CSV1:  “Describe nodes (enumerate them via some my_node_id integer attribute),  columns: my_node_id,label,node_prop_01,node_prop_ZZ”
c. CSV2:  “Describe relations, columns: source_my_node_id, dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN”
d. Indexes constraints: before starting import  —> have appropriate indexes / constraints
e. via LOAD CSV, import CSV1, then CSV2. 
f. Import no more than 10,000-30,000 lines in a single LOAD CSV statement 

This seems to be a very well elaborated method and the easiest for me to do.  I have files such that I can create these without too much problem.  I figure I’ll split the nodes into three files 20k rows each.  I can do the same with the Rels.  I have not used indexes or constraints yet in the db’s that I already created and as I said above, I’ll have to see how to use them.

I am assuming column headers that fit with my data are consistent with what you explained below (Like, I can put my own meaningful text into Label 1 -10 and node_prop_01 - 05).... 
my_node_id, label1, label2, label3, label4, label5, label6, label7, label8, label9, label10, node_prop_01, node_prop_02, node_prop_03, node_prop_04, node_prop_ZZ

Thanks again Fellas!!

Jose


Michael Hunger

Nov 19, 2014, 4:34:35 PM
to ne...@googlegroups.com, José F. Morales Ph.D.
Make sure to have a look at my blog posts:

1. elaborating the individual cypher commands: 

if you create nodes and then later match them to connect:

// create indexes
create index on :Movie(title);
create index on :Person(name);

// or alternatively unique constraints

create constraint on (m:Movie) assert m.title is unique;
create constraint on (p:Person) assert p.name is unique;

begin
create (:Movie {title:"The Matrix", ...});
create (:Person {name:"Keanu Reeves", ...});
....

// match + create rel
match (m:Movie {title:"The Matrix"}), (p:Person {name:"Keanu Reeves"}) create (p)-[:ACTED_IN {role:"Neo"}]->(m);
...

commit

2. load csv



Andrii Stesin

Nov 20, 2014, 6:22:34 AM
to ne...@googlegroups.com, jm3...@columbia.edu
Before you start.

1. On nodes and their labels. First of all, I strongly suggest you separate your nodes into different .csv files by label. So you won't have a column `label` in your .csv but rather a set of files:

nodes_LabelA.csv
...
nodes_LabelZ.csv

whatever your labels are. (Consider a label to be kind of a synonym for `class` in object-oriented programming or `table` in an RDBMS.) That's due to the fact that labels in Cypher are somewhat specific entities and you probably won't be allowed to parameterize them as variables inside your LOAD CSV statement.

2. Then consider one additional "technological" label, let's name it `:Skewer` because it will "penetrate" all your nodes of every different label (class) like a kebab skewer.

Before you start (or at least before you start importing relationships) do

CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;

3. When doing LOAD CSV with nodes, make sure each node gets 2 (two) labels, one of them being `:Skewer`. This will create an index on the `my_node_id` attribute (which makes relationship creation some orders of magnitude faster) and, as a bonus, you'll be sure you don't have occasional duplicate nodes.
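Putting steps 1-3 together, one node file might be loaded like this (the file path and property columns are placeholders):

```cypher
// the constraint first: it indexes my_node_id and guards against duplicates
CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;

// then the nodes, each carrying its own label plus :Skewer
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/path/to/nodes_LabelA.csv" AS line
CREATE (:Skewer:LabelA {my_node_id: toInt(line.my_node_id), node_prop_01: line.node_prop_01});
```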

4. Now when you are done with nodes and start doing LOAD CSV for relationships, you may give the MATCH statement, which looks up your pair of nodes, a hint for fast lookup, like

LOAD CSV ...from somewhere... AS csvline
MATCH
(source_node:Skewer {my_node_id: ToInt(csvline[0])}), (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
CREATE
(source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., rel_prop_NN: csvline[ZZ]}]->(dest_node);

Adding the `:Skewer` label in MATCH will tell Cypher to (implicitly) use your index on my_node_id, which was created when you created your constraint. Or you may try to explicitly give it a hint to use the index, with a USING INDEX... clause after MATCH, before CREATE. Btw, some earlier versions of Neo4j refused to use the index in LOAD CSV for some reason; I hope this problem is gone with 2.1.5.

5. While importing, be careful to explicitly specify type conversions for each property which is not a string. I have seen numerous occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]) - and Cypher silently stored their (supposed) numerics as strings. It's Ok, dude, you say it :) This led to confusion afterwards, when, say, numerical comparisons didn't MATCH and so on (though it's easy to correct with a single Cypher command, but anyway).
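A minimal illustration of that pitfall (the column names are made up):

```cypher
LOAD CSV WITH HEADERS FROM "file:/path/to/nodes.csv" AS line
CREATE (:Measurement {
  name:  line.name,            // a string, fine as-is
  code:  toInt(line.code),     // without toInt() this would silently be stored as a string
  score: toFloat(line.score)   // likewise for floating-point values
});
```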

WBR,
Andrii

Andrii Stesin

Nov 20, 2014, 7:59:17 AM
to ne...@googlegroups.com, jm3...@columbia.edu
Just a remark. If you don't create an index on the :Skewer label and don't make double-labeled nodes while importing, you'll need to create a separate index on the my_node_id property instead, for each of LabelA ... LabelZ - really not a problem, but somewhat boring I think. Without indexes, LOAD CSV for relationships is painfully slow with thousands of nodes (no surprise, yeah?)

Anyway, after you complete your data import, you can REMOVE the extra label from your (my_node:LabelN:Skewer) nodes, REMOVE the extra (no longer needed) my_node.my_node_id properties (just to save storage space and the corresponding RAM), and DROP the 'technological' constraint if you wish.
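That cleanup could be sketched as follows (on a big graph you would want to batch the REMOVE rather than run it as one giant transaction):

```cypher
// drop the 'technological' constraint first
DROP CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;

// then strip the helper label and the temporary id
MATCH (n:Skewer)
REMOVE n:Skewer
REMOVE n.my_node_id;
```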

The only (harmless) consequence of those actions will be an extra label Skewer hanging around in your database forever. I didn't find any way to completely eliminate an unneeded (or once mistyped in a CREATE) label from the database. Maybe I'll ask the Neo4j team to look at this 'feature' someday. Or maybe not... who cares, really.

WBR,
Andrii

José F. Morales

Nov 20, 2014, 8:20:03 AM
to ne...@googlegroups.com, jm3...@columbia.edu
OK, Gents,

I will implement your very gracious suggestions and let you know what happens.  

Thanks a load,

Jose

On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:

Andrii Stesin

Nov 20, 2014, 10:38:57 AM
to ne...@googlegroups.com, jm3...@columbia.edu
BTW if you need your nodes to have multiple labels each, I'd suggest you prepare a simple file (not even .csv, just 2 columns) beforehand:

$ cat node_labels.txt
1 :LabelA:LabelN:LabelV
2 :LabelD:LabelV:LabelX:LabelZ
...
9999 :LabelF:LabelJ:LabelN:LabelX:LabelZ
$

Then import all nodes with a single label Skewer as suggested above and import relationships too.

As soon as all your nodes are already inside the Neo4j database, you can write a simple bash script which reads your node_labels.txt in a loop, line by line, splits each line into 2 shell variables, my_node_id and labels, and just issues a shell command for each line, like this (untested):

neo4j-shell -c "MATCH (n:Skewer \{my_node_id: $my_node_id \}) SET n $labels ;"

This will do the job in some finite (not too long) time.
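Written out as a dry run, that loop could look like the sketch below; it only prints the neo4j-shell commands (drop the `echo` to actually execute them; the file name, labels, and ids are stand-ins):

```shell
# stand-in data in the two-column format described above
printf '1 :LabelA:LabelN\n2 :LabelD:LabelX\n' > node_labels.txt

# read my_node_id and labels from each line and print the command to run
while read -r my_node_id labels; do
  echo neo4j-shell -c "MATCH (n:Skewer {my_node_id: $my_node_id}) SET n$labels;"
done < node_labels.txt
```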

WBR,
Andrii

José F. Morales

Nov 20, 2014, 10:42:50 AM
to ne...@googlegroups.com, jm3...@columbia.edu
Thanks Andrii, 
I will let you know how it goes.  I am trying to clean up the data some more to get the size down right now.  It will probably help, I figure.
J

Andrii Stesin

Nov 20, 2014, 11:02:35 AM
to ne...@googlegroups.com, jm3...@columbia.edu
I've had some painful experiences with massive data imports, so I hope my accumulated bumps and bruises will help someone else, too :)

Good luck :)
Andrii

José F. Morales

Nov 20, 2014, 11:03:58 AM
to ne...@googlegroups.com, jm3...@columbia.edu
That's awesome of you Andrii!!  Thanks again!  

Jose

Michael Hunger

Nov 20, 2014, 11:36:32 AM
to ne...@googlegroups.com, José F. Morales Ph.D.
Hi Andrii,

did you blog about your experiences? The collected information might be helpful to many?

Did you ever look into my shell import tools?

Cheers Michael


Andrii Stesin

Nov 20, 2014, 12:05:08 PM
to ne...@googlegroups.com, jm3...@columbia.edu
Hi Michael,

1) No, I didn't - I was too busy running over the rakes and pitfalls of data imports and had no time to blog about it :) But you see, I'm ready to contribute as soon as someone asks for help in some sample practical situation.

2) No, I didn't - I decided to rely solely on the LOAD CSV functionality (with some preparatory Perl & bash wooden legs) from the day LOAD CSV appeared. I prefer to learn 1 single tool (even a complex "professional" one) perfectly and use it on a regular basis, rather than collect a toolbox of occasionally useful heterogeneous tools.

Again, as soon as someone brings up a sample situation which I consider worth the effort, I'll probably blog something on it (btw, I don't have a blog yet :) like that thing with mobile phones - I'm familiar with the telco domain from one of my past lives :) so I was able to say something valuable.

BTW I think it's difficult to add anything really significant to your series of blog posts on the LOAD CSV topic. You already wrote down all the knowledge one needs to do the job.

For now, the problem of size and performance with the RDF representation (described by Niclas Hoyer) worries me most. I suspect some data modelling issue there; I will try to devise something on it.

WBR,
Andrii

Andrii Stesin

Nov 20, 2014, 12:07:14 PM
to ne...@googlegroups.com, jm3...@columbia.edu
Maybe I'll assemble some FAQ-blog on Neo4j data import and other tricky topics someday. Where do you suggest I go to get a blog?

Andrii Stesin

Nov 27, 2014, 10:38:07 AM
to ne...@googlegroups.com, jm3...@columbia.edu
Dear José, note that the 2.1.6 release just hit the streets - did you try it?

José F. Morales

Nov 27, 2014, 10:48:41 AM
to ne...@googlegroups.com, jm3...@columbia.edu
Hey Andrii  !

Not yet, still trying to get the data in with your and Michael's tips.

Anything major in this release?  Something that helps with loading data perhaps?

Happy turkeycide day!

Jose

Andrii Stesin

Nov 27, 2014, 11:22:56 AM
to ne...@googlegroups.com, jm3...@columbia.edu
From the release notes I didn't get the impression that any drastic changes happened with regard to import, but who knows - maybe there are some non-trivial cross-influences between bugfixes?

Thank you :)
Andrii

José F. Morales

Nov 28, 2014, 6:12:52 PM
to ne...@googlegroups.com, jm3...@columbia.edu
Hey Andrii,

I've been thinking a lot about your recommendations.   I have some questions, some of which show how ignorant I am.  Apologies for the basics if necessary.


On Thursday, November 20, 2014 6:22:34 AM UTC-5, Andrii Stesin wrote:
Before you start.

1. On nodes and their labels. First of all, I strongly suggest you separate your nodes into different .csv files by label. So you won't have a column `label` in your .csv but rather a set of files:

nodes_LabelA.csv
...
nodes_LabelZ.csv

whatever your labels are. (Consider a label to be kind of a synonym for `class` in object-oriented programming or `table` in an RDBMS.) That's due to the fact that labels in Cypher are somewhat specific entities and you probably won't be allowed to parameterize them as variables inside your LOAD CSV statement.


OK, so you have modified your original idea of putting the db into two files: 1 for nodes, 1 for relationships.  Now here you say, put all the nodes into 1 file per label.   The way I have worked with it, I created 1 file for a class of nodes I'll call CLT_SOURCE and another file for a class of nodes called CLT_TARGET.  Then I have a file for the relationships. Perhaps foolishly, I originally would create 1 file that combined all of this info and try to paste it in the browser or in the shell.  Neither worked, even though with a smaller amount of data it did.

You are recommending that with the nodes, I take two steps...
1) Combine my CLT_SOURCE and CLT_TARGET nodes, 
2) then I split that file into files that correspond to the node: my_node_id,  1 label, and then properties P1...Pn.  Since I have 10 Labels/node, I should have 10 files named..... Nodes_LabelA... Nodes_LabelJ.  Thus...

File:  CLT_Nodes-LabelA     columns:  my_node_id, label A, property P1..., property P4
...
File:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, property P1..., property P4


Q1: What are the rules about what can be used for my_node_id?  I have usually seen them as a letter-integer combination. Is that the convention?   Sometimes I've seen a letter being used with a specific class of nodes: a1..a100 for one class and b1..b100 for another.  I learned the hard way that you have to give each node a unique ID.  I used CLT_1...CLT_n for my CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It worked with the smaller db I made.  Anything wrong with using the convention n1...n100?
 
 
2. Then consider one additional "technological" label, let's name it `:Skewer` because it will "penetrate" all your nodes of every different label (class) like a kebab skewer.

Before you start (or at least before you start importing relationships) do

CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;


Q2:  Should I do scenario 1 or 2?

Scenario 1:  add two labels to each file?  One from my original nodes and one as "Skewer"

File 1:  CLT_Nodes-LabelA     columns:  my_node_id, label A, Skewer, property P1..., property P4
...
File 2:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, Skewer, property P1..., property P4
 
OR 

Scenario 2:  Include an eleventh file thus....

File 11:  CLT_Nodes-LabelK     columns:  my_node_id, Skewer, property P1..., property P4 

From below, I think you mean Scenario 1.

Q3: “Skewer” is just an integer, right?  It corresponds in a way to my_node_id.

3. When doing LOAD CSV with nodes, make sure each node gets 2 (two) labels, one of them being `:Skewer`. This will create an index on the `my_node_id` attribute (which makes relationship creation some orders of magnitude faster) and, as a bonus, you'll be sure you don't have occasional duplicate nodes.


Here is some sort of cypher….

 

//Creating the nodes

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
MERGE (n:Skewer:LabelA {property1: csvline.property1})
ON CREATE SET
  n.Property2 = csvline.Property2,
  n.Property3 = csvline.Property3,
  n.Property4 = csvline.Property4;

….

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
MERGE (n:Skewer:LabelJ {property1: csvline.property1})
ON CREATE SET
  n.Property2 = csvline.Property2,
  n.Property3 = csvline.Property3,
  n.Property4 = csvline.Property4;


 

Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine the various labels and their respective values with their corresponding nodes? 

Q5: Since I think of my data in terms of the two classes of nodes in my data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after loading the nodes, how do I then get two classes of nodes? 

Q6: Is there a step missing that explains how the code below got to have a “source_node” and a “dest_node” that appear to correspond to my CLT_SOURCE and CLT_TARGET nodes?

 


 
4. Now when you are done with nodes and start doing LOAD CSV for relationships, you may give the MATCH statement, which looks up your pair of nodes, a hint for fast lookup, like

LOAD CSV ...from somewhere... AS csvline
MATCH
(source_node:Skewer {my_node_id: ToInt(csvline[0])}), (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
CREATE
(source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., rel_prop_NN: csvline[ZZ]}]->(dest_node);


Q7: This LOAD CSV command (line 1) looks into the separate REL.csv file you mentioned first, right?

Q8: csvline is some sort of temp variable that holds one line of the csv file?

Q9: Do you imply in line 2 that the REL.csv file has headers that include source_node, dest_node?

Q10: While I see how Skewer is a label, how is my_node_id a property (line 2)?

Q11: How does my_node_id relate to either ToInt(csvline[0]) or ToInt(csvline[1]) (line 2)?

Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?

Does csvline[0] refer to a column in REL.csv, as do csvline[2] and csvline[ZZ] (line 3)?

 
Adding the `:Skewer` label in MATCH will tell Cypher to (implicitly) use your index on my_node_id, which was created when you created your constraint. Or you may try to explicitly give it a hint to use the index, with a USING INDEX... clause after MATCH, before CREATE. Btw, some earlier versions of Neo4j refused to use the index in LOAD CSV for some reason; I hope this problem is gone with 2.1.5.

OK
 
5. While importing, be careful to explicitly specify type conversions for each property which is not a string. I have seen numerous occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]) - and Cypher silently stored their (supposed) numerics as strings. It's Ok, dude, you say it :) This led to confusion afterwards, when, say, numerical comparisons didn't MATCH and so on (though it's easy to correct with a single Cypher command, but anyway).

Think I did that re. type conversion.  Only applies to properties for my data.
  
Sorry for so many questions.  I am really interested in figuring this out!

Thanks loads,  
Jose

Michael Hunger

Nov 28, 2014, 6:16:49 PM
to ne...@googlegroups.com
José 

if you watch Nicole's webinar many things will become clear. https://vimeo.com/112447027
You don't have to overcomplicate things.

The Skewer(id) thing is not really needed if each of your entities has a label and a primary key of some sort.
It is just an optimization to not have to think about separate entities.

Cheers, Michael

José F. Morales

Nov 28, 2014, 7:18:56 PM
to ne...@googlegroups.com
Hey Michael,

I'll check it out.   Trouble is knowing what over-complicating is.  Thanks for the heads up!

I am trying to figure out inductively how to use LOAD CSV from various examples.  Thanks for another one.  

It's killing me that it's taking so long.

Jose

Michael Hunger

Nov 28, 2014, 7:50:30 PM
to ne...@googlegroups.com
What takes so long? The loading? Or figuring it out?

Michael

FANC2

Nov 28, 2014, 7:53:26 PM
to ne...@googlegroups.com
Both. Using what I did before, the loading either never finished or failed.  I’m trying not to follow that example with the figuring it out!!  :)


José F. Morales Ph.D.



Michael Hunger

Nov 28, 2014, 8:03:34 PM
to ne...@googlegroups.com
If you look at the video it is pretty obvious:
she outlines all the major steps and pitfalls.

Except for one: create nodes and rels separately if you need more than one MERGE.

GOOD: merge|match|create node; merge|match|create node; create rel
GOOD: match node; match node; MERGE rel

BAD: match|create node; merge node; merge rel
BAD: match node; set node.prop
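Spelled out with the movie example from earlier in the thread, the GOOD pattern is separate statements: create/match the nodes first, then MERGE the relationship on its own:

```cypher
MERGE (m:Movie {title:"The Matrix"});
MERGE (p:Person {name:"Keanu Reeves"});

MATCH (m:Movie {title:"The Matrix"}), (p:Person {name:"Keanu Reeves"})
MERGE (p)-[:ACTED_IN]->(m);
```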


FANC2

Nov 28, 2014, 8:05:13 PM
to ne...@googlegroups.com
Thanks Dude!  I’ll let you know!

Andrii Stesin

Nov 29, 2014, 6:35:33 AM
to ne...@googlegroups.com, jm3...@columbia.edu
Hi Jose,

On Saturday, November 29, 2014 1:12:52 AM UTC+2, José F. Morales wrote:

1. On nodes and their labels. First of all, I strongly suggest you separate your nodes into different .csv files by label. So you won't have a column `label` in your .csv but rather a set of files:

nodes_LabelA.csv
...
nodes_LabelZ.csv

whatever your labels are. (Consider a label to be kind of a synonym for `class` in object-oriented programming or `table` in an RDBMS.) That's due to the fact that labels in Cypher are somewhat specific entities and you probably won't be allowed to parameterize them as variables inside your LOAD CSV statement.


OK, so you have modified your original idea of putting the db into two files: 1 for nodes, 1 for relationships.  Now here you say, put all the nodes into 1 file per label.   The way I have worked with it, I created 1 file for a class of nodes I'll call CLT_SOURCE and another file for a class of nodes called CLT_TARGET.

Ok, but how many valid distinct combinations of your 10 node labels may exist? I was speaking about a simple case where you have some limited number of possible node labels (or their combinations), say fewer than 10.

 You are recommending that with the nodes, I take two steps...
1) Combine my CLT_SOURCE and CLT_TARGET nodes, 

Not necessarily "combine"; just give each node a unique (temporary) my_node_id. See my "10+M tree" example below.
 
2) then I split that file into files that correspond to the node: my_node_id,  1 label, and then properties P1...Pn.  Since I have 10 Labels/node, I should have 10 files named..... Nodes_LabelA... Nodes_LabelJ.  Thus...

You may have as many labels per node as you wish, but it is all about how many valid distinct combinations of labels you have. (One single label is a combination itself, obviously.)

If you have some limited quantity of valid label combinations, it's one story. But if we are talking about on the order of 10! possible valid combinations, the story is somewhat more interesting :) Which setup is yours?
 
File:  CLT_Nodes-LabelA     columns:  my_node_id, label A, property P1..., property P4
...
File:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, property P1..., property P4


Q1: What are the rules about what can be used for my_node_id?  I have usually seen them as a letter-integer combination. Is that the convention?   Sometimes I've seen a letter being used with a specific class of nodes: a1..a100 for one class and b1..b100 for another.  I learned the hard way that you have to give each node a unique ID.  I used CLT_1...CLT_n for my CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It worked with the smaller db I made.  Anything wrong with using the convention n1...n100?

I'm not aware of any conventions here; the only thing I know for sure is that a schema index works much(!) faster on plain integers than on Unicode strings. That's the only difference I consider significant, so my personal preference is for my_node_id to be a unique integer. Once, when importing 10+ million nodes into a tree with variable height [1..7], where each level of nodes was in a separate file (because each level had its own unique label and unique set of properties), I just selected a scheme for numbering them like

:Skewer:Level1 my_node_id = 10000000 + file1.csv line number
:Skewer:Level2 my_node_id = 20000000 + file2.csv line number
...
:Skewer:Level7 my_node_id = 70000000 + file7.csv line number

so the relationship file (all relationships were of the same single type) became a simple 2-column .csv like this, with 10+ million lines

10000017,20000362
10000017,20000547
10000017,40083215
10000018,30000397
...

After successfully importing the 7 node files (with the nodes ready in the db and indexed on their unique my_node_id under the label :Skewer), I split relationships.csv into 1000+ files of 10000 lines each and wrote a dumb shell script which loaded them with `neo4j-shell -c` file by file, doing `sleep 60` between files (to give neo4j a minute to complete each batch transaction), then started it Friday evening and got my tree ready on Monday morning :)

If you prefer alphanumerics for my_node_id, it's completely up to you :) Anyway, after a successful import you may prefer to remove those temporary ids from the database entirely, just to conserve space where properties are stored.
 
2. Then consider one additional "technological" label, let's name it `:Skewer` because it will "penetrate" all your nodes of every different label (class) like a kebab skewer.

Before you start (or at least before you start importing relationships) do

CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;
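As a sketch of how a node file might then be loaded against this constraint (the file name, the column layout, and the assumption that the 1xxxxxxx-style id has already been precomputed into column 0 are all mine, not from the thread):

```cypher
// level 1 nodes: column 0 is assumed to already hold 10000000 + line number
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:/level1_nodes.csv" AS csvline
MERGE (n:Skewer {my_node_id: TOINT(csvline[0])})
SET n:Level1;
```

The MERGE touches only the single :Skewer label and the single constrained property, so the lookup can use the index; the level label is attached afterwards with SET.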


Q2:  Should I do scenario 1 or 2?

Scenario 1:  add two labels to each file?  One from my original nodes and one as "Skewer"

File 1:  CLT_Nodes-LabelA     columns:  my_node_id, label A, Skewer, property P1..., property P4
...
File 2:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, Skewer, property P1..., property P4
 
OR 

Scenario 2:  Include an eleventh file thus....

File 11:  CLT_Nodes-LabelK     columns:  my_node_id, Skewer, property P1..., property P4 

From below, I think you mean Scenario 1.

Yes, and you don't need to add a column for the :Skewer label to the files; the LOAD CSV statement should assign it.
 
Q3: “Skewer” is just an integer right?  It corresponds in a way to my_node_id 

No, it's a label! So in Cypher your node (suppose it has 2 labels, :LabelA and :LabelJ) is described like

MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something', p2: 'something else', p3: 'etc.'})

 Here is some sort of cypher….

 

//Creating the nodes

 

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
MERGE (n:Skewer:LabelA {property1: csvline.property1})
ON CREATE SET
  n.Property2 = csvline.Property2,
  n.Property3 = csvline.Property3,
  n.Property4 = csvline.Property4;


….

LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
MERGE (n:Skewer:LabelJ {property1: csvline.property1})
ON CREATE SET
  n.Property2 = csvline.Property2,
  n.Property3 = csvline.Property3,
  n.Property4 = csvline.Property4;


 

Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine the various labels and their respective values with their corresponding nodes? 


A label is not a variable; it does not have a value. It's just a label, consider it a "tag".
my_node_id, on the other hand, IS a variable, so it does have a value.

Looking at your 2 code snippets: in case you hope that the first one will create a node with LabelA and the second one will assign LabelJ to the node which was created earlier, you are wrong. But... if you remove the labels from MERGE, it will work; look here with attention:

LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
MERGE (new_node_A:Skewer {my_node_id: csvline.node_unique_number, property1: csvline.property1})
// only the my_node_id and property1 values are taken into account! no labels, no other properties
// AFAIR we do not need `ON CREATE SET` here; do you really care whether the node is new or was created earlier?
SET
  new_node_A:LabelA,
  new_node_A.Property2 = csvline.Property2,
  new_node_A.Property3 = csvline.Property3,
  new_node_A.Property4 = csvline.Property4;


LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
MERGE (new_node_J:Skewer {my_node_id: csvline.node_unique_number, property1: csvline.property1})
// only the my_node_id and property1 values are taken into account! no labels, no other properties
// AFAIR we do not need `ON CREATE SET` here; do you really care whether the node is new or was created earlier?
SET
  new_node_J:LabelJ,
  new_node_J.Property2 = csvline.Property2,
  new_node_J.Property3 = csvline.Property3,
  new_node_J.Property4 = csvline.Property4;


What you get if doing things this way:

  1. When doing the LabelA .csv you will create whatever uniquely numbered nodes were not already in the database, fill their properties (or maybe overwrite them?) and label each node (be it new or existing) with LabelA, no matter what other labels the node (possibly) had,
  2. when doing the LabelJ .csv you will again create whatever uniquely numbered nodes were not already in the database, again either fill or overwrite properties, and again label each node (be it new or existing) with LabelJ, no matter what other labels the node (possibly) had,
  3. so if you created some node from the first file and labeled it LabelA, and the same unique my_node_id occurs in both files, your node will get 2 labels: LabelA and LabelJ.
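A tiny self-contained illustration of that label accumulation (the id value and label names here are made up):

```cypher
MERGE (n:Skewer {my_node_id: 42})
SET n:LabelA;           // node created (or found) and tagged :LabelA

MERGE (n:Skewer {my_node_id: 42})
SET n:LabelJ;           // same node matched again, now also tagged :LabelJ

MATCH (n:Skewer {my_node_id: 42})
RETURN labels(n);       // the node now carries :Skewer, :LabelA and :LabelJ
```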
 

Q5: Since I think of my data in terms of the two classes of nodes in my Data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after loading the nodes, how then I get two classes of nodes?


Make them 2 labels: CLTSource and CLTTarget respectively.
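For example (a sketch; the file name and the id column are assumptions), the class label can be set while loading each file:

```cypher
// while loading the source-node file, tag each node as :CLTSource
LOAD CSV FROM "file:/CLT_SOURCE_nodes.csv" AS csvline
MERGE (n:Skewer {my_node_id: TOINT(csvline[0])})
SET n:CLTSource;
// and likewise SET n:CLTTarget when loading the target file
```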
 

Q6: Is there a step missing that explains how the code below got to have a “source_node” and a “dest_node” that appears to correspond to my CLT_SOURCE and CLT_TARGET nodes?


// suppose we coded relationships as the 2 my_node_id's of the nodes
LOAD CSV FROM "...somewhere..." AS csvline
MATCH (s:CLTSource:Skewer {my_node_id: TOINT(csvline[0])})
USING INDEX s:Skewer(my_node_id)
WITH s, csvline
MATCH (t:CLTTarget:Skewer {my_node_id: TOINT(csvline[1])})
USING INDEX t:Skewer(my_node_id)
MERGE (s)-[r:MY_RELATIONSHIP_TYPE]->(t)
SET r.prop1 = 'smth';



4. Now, when you are done with nodes and start doing LOAD CSV for relationships, you may give the MATCH statement which looks up your pair of nodes a hint for fast lookup, like

LOAD CSV ...from somewhere... AS csvline
MATCH (source_node:Skewer {my_node_id: TOINT(csvline[0])}), (dest_node:Skewer {my_node_id: TOINT(csvline[1])})
CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., rel_prop_NN: csvline[ZZ]}]->(dest_node);


Q6: This LOAD CSV  command (line 1) looks into the separate REL.csv file you mentioned first right?  


Yep
 

Q7: csvline is some sort of temp file that is a series of lines of the csv file? 


This is a variable: a collection which is filled with the column values of the .csv, line by line. You can use it either as an array, referring to fields by their index (my preferred way), or, if you use `WITH HEADERS` mode, as a keyed map. See http://neo4j.com/docs/2.1.6/cypherdoc-importing-csv-files-with-cypher.html
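The two access modes side by side (file names are hypothetical):

```cypher
// "collection mode": no headers, fields referred to by index
LOAD CSV FROM "file:/rels.csv" AS csvline
RETURN TOINT(csvline[0]) AS src_id, TOINT(csvline[1]) AS dst_id;

// "map mode": WITH HEADERS, fields referred to by column name
LOAD CSV WITH HEADERS FROM "file:/nodes.csv" AS csvline
RETURN csvline.my_node_id, csvline.property1;
```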

Q8: Do you imply in line 2 that the REL.csv file has headers that include  source_node, dest_node ?


No, I don't use headers, so I refer to csvline fields by their index ("collection mode").
 

Q9: While I see how Skewer is a label,  how is my_node_id a  property (line 2) ? 


Because it IS a property of a node, and you build the constraint & index on this exact property inside the scope of the label :Skewer.
 

Q10: How does my_node_id relate to either ToInt(csvline[0]} or ToInt(csvline[1]}  (line 2) ?


For the .csv with relationships, csvline[0] is the value of the my_node_id property of the source node, csvline[1] is the value of the my_node_id property of the target node, and the TOINT() type conversion is used because my personal preference is to use integers for ids.
 

Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?  

Does csvline[0] refer to a column in REL.csv as do csvline[2] and csvline[ZZ] (line 3) ?


I think you can combine the import of multiple .csv files in a single LOAD CSV statement, but I haven't ever tried this mode.

WBR,
Andrii

Michael Hunger

Nov 29, 2014, 7:45:23 AM
to ne...@googlegroups.com
Just one thing

Never use more than one label and one property in MERGE, otherwise it won't use indexes

And use ... ON CREATE SET ...
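Put together, that advice looks something like this (the label, id, and property names are placeholders, not from the thread's actual data):

```cypher
// index-friendly: MERGE on a single label and a single (indexed) property
MERGE (n:Skewer {my_node_id: 42})
ON CREATE SET n.property1 = 'something'
// extra labels and properties go in a separate SET
SET n:LabelA;

// NOT index-friendly (the index will not be used):
// MERGE (n:Skewer:LabelA {my_node_id: 42, property1: 'something'})
```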

Sent from my iPhone
--

Andrii Stesin

Nov 29, 2014, 9:37:13 AM
to ne...@googlegroups.com
On Saturday, November 29, 2014 2:45:23 PM UTC+2, Michael Hunger wrote:
Never use more than one label and one property in MERGE, otherwise it won't use indexes

Huh! Yes! I discovered this empirically at some point but had already forgotten :(

The diagnostic was something like "cannot use index in this context", and I can't recall this being mentioned in the documentation.

After some probing we found that only a single label will work, and that's where the :Skewer approach came from...

Thanks for the reminder!
WBR,
Andrii

Andrii Stesin

Nov 29, 2014, 2:22:33 PM
to ne...@googlegroups.com
BTW José, can you describe the target data model you are importing the data into? I took a brief look at LabCards and got the impression that I can't completely understand what you are trying to achieve.

What are the entities you are dealing with?
What classifications do you apply to them? (What are labels and what are properties?)
How do they relate to one another?

Is the dataset you use available to the public? I'd like to look at it, maybe rethink the whole thing from scratch, and develop a sample case on graph data modelling from it.

Thanks in advance!
WBR,
Andrii

Andrii Stesin

Nov 29, 2014, 2:25:20 PM
to ne...@googlegroups.com
What is 'CLT'? Is it an abbreviation of 'Clinical Test'? (Sorry, English is not my native language and I'm way too far from medical topics in general.)

José F. Morales

Nov 30, 2014, 5:33:58 PM
to ne...@googlegroups.com, jm3...@columbia.edu


On Saturday, November 29, 2014 6:35:33 AM UTC-5, Andrii Stesin wrote:
Hi Jose,

On Saturday, November 29, 2014 1:12:52 AM UTC+2, José F. Morales wrote:

1. On nodes and their labels. First of all, I strongly suggest you separate your nodes into different .csv files by label. So you won't have a column `label` in your .csv but rather a set of files:

nodes_LabelA.csv
...
nodes_LabelZ.csv

whatever your labels are. (Consider a label to be kind of a synonym for `class` in object-oriented programming or `table` in an RDBMS.) That's due to the fact that labels in Cypher are somewhat specific entities and you probably won't be allowed to parameterize them into variables inside your LOAD CSV statement.


OK, so you have modified your original idea of putting the db into two files: 1 for nodes, 1 for relationships.  Now here you say, put all the nodes into 1 file per label.   The way I have worked with it, I created 1 file for a class of nodes I'll call CLT_SOURCE and another file for a class of nodes called CLT_TARGET.

Ok, but how many valid distinct combinations of your 10 node labels may exist?

JFM: 264
 
I was speaking about a simple case where you have some limited number of possible node labels (or their combinations), say less than 10.

JFM: Lot more than that.
 

 You are recommending that with the nodes, I take two steps...
1) Combine my CLT_SOURCE and CLT_TARGET nodes, 

Not necessarily "combine", but just give each node a unique (temporary) my_node_id; see my "10+M tree" example below.
 
2) then I split that file into files that correspond to the node: my_node_id,  1 label, and then properties P1...Pn.  Since I have 10 Labels/node, I should have 10 files named..... Nodes_LabelA... Nodes_LabelJ.  Thus...

You may have as many labels per node as you wish, but it is all about how many valid distinct combinations of labels you have. (One single label is a combination itself, obviously.)

If you have some limited quantity of valid label combinations, it's one story. But if we are talking about on the order of 10! possible valid combinations, the story is somewhat more interesting :) Which setup is yours?

JFM:  Like I said, there are 264 unique combinations in all my nodes. Some are redundant: the full spelling of a term/phrase and an abbreviation.  Some are a code for a term/phrase.  Some were created in anticipation of other values I would create later.  I am trying to anticipate queries I'll make later.
 
 
File:  CLT_Nodes-LabelA     columns:  my_node_id, label A, property P1..., property P4
...
File:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, property P1..., property P4


Q1: What are the rules about what can be used for my_node_id?  I have usually seen them as a letter integer combination. Is that the convention?   Sometimes I've seen a letter being used with a specific class of nodes  a1..a100 for one class and b1..b100 for another.  I learned the hard way that you have to give each node a unique ID.  I used CLT_1...CLT_n for my CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It worked with the smaller db I made.  Anything wrong using the convention n1...n100?

I'm not aware of any conventions here; the only thing I know for sure is that a schema index works much(!) faster on plain integers than on Unicode strings. That's the only difference I consider significant, so my personal preference is for my_node_id to be a unique integer. Once, when importing 10+ million nodes into a tree with variable height [1..7], where each level of nodes was in a separate file (because each level had its own unique label and unique set of properties), I just selected a scheme for numbering them like


JFM: Makes sense for speed. I guess it depends upon the size of one's data.
 
:Skewer:Level1 my_node_id = 10000000 + file1.csv line number
:Skewer:Level2 my_node_id = 20000000 + file2.csv line number
...
:Skewer:Level7 my_node_id = 70000000 + file7.csv line number

so the relationship file (all relationships were of the same single type) became a simple 2-column .csv like this, with 10+ million lines

10000017,20000362
10000017,20000547
10000017,40083215
10000018,30000397
...

After successfully importing the 7 node files (with the nodes ready in the db and indexed on their unique my_node_id under the label :Skewer), I split relationships.csv into 1000+ files of 10000 lines each and wrote a dumb shell script which loaded them with `neo4j-shell -c` file by file, doing `sleep 60` between files (to give neo4j a minute to complete each batch transaction), then started it Friday evening and got my tree ready on Monday morning :)

If you prefer alphanumerics for my_node_id, it's completely up to you :) Anyway, after a successful import you may prefer to remove those temporary ids from the database entirely, just to conserve space where properties are stored.
 

JFM: OK.  Sounds good.

 
2. Then consider one additional "technological" label, let's name it `:Skewer` because it will "penetrate" all your nodes of every different label (class) like a kebab skewer.

Before you start (or at least before you start importing relationships) do

CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;


Q2:  Should I do scenario 1 or 2?

Scenario 1:  add two labels to each file?  One from my original nodes and one as "Skewer"

File 1:  CLT_Nodes-LabelA     columns:  my_node_id, label A, Skewer, property P1..., property P4
...
File 2:  CLT_Nodes-LabelJ     columns:  my_node_id, label J, Skewer, property P1..., property P4
 
OR 

Scenario 2:  Include an eleventh file thus....

File 11:  CLT_Nodes-LabelK     columns:  my_node_id, Skewer, property P1..., property P4 

From below, I think you mean Scenario 1.

Yes, and you don't need to add a column for the :Skewer label to the files; the LOAD CSV statement should assign it.
 

JFM: OK.  Sounds good.  
 
Q3: “Skewer” is just an integer right?  It corresponds in a way to my_node_id 

No, it's a label! So in Cypher your node (suppose it has 2 labels, :LabelA and :LabelJ) is described like

MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something', p2: 'something else', p3: 'etc.'})


JFM: Got that!

JFM: ok basic question...  MATCH (n:  <--- What is "n"? Does it just indicate that it's a node of a particular class?  What letter it is is arbitrary, right?  Is there a name for what "n" is? For a while there, I thought it was my_node_ID.  
 
 Here is some sort of cypher….

 

//Creating the nodes

 

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
MERGE (n:Skewer:LabelA {property1: csvline.property1})
ON CREATE SET
  n.Property2 = csvline.Property2,
  n.Property3 = csvline.Property3,
  n.Property4 = csvline.Property4;


….

LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
MERGE (n:Skewer:LabelJ {property1: csvline.property1})
ON CREATE SET
  n.Property2 = csvline.Property2,
  n.Property3 = csvline.Property3,
  n.Property4 = csvline.Property4;


 

Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine the various labels and their respective values with their corresponding nodes? 


A label is not a variable; it does not have a value. It's just a label, consider it a "tag".
my_node_id, on the other hand, IS a variable, so it does have a value.


JFM: OK, I am not understanding this.  I understood a "Label" as a general category for a node, as opposed to a "Property" that was specific to a particular node.  As I understood it, a "Label" has different values, so the Label could be "Category" and there could be two categories, for example CLT_SOURCE and CLT_TARGET.  I thought that makes it like a variable.  If not, the label is all the same on a given set of nodes, and what's the point in that?
 
JFM: OK, I get that my_node_id is a variable.  
 
Looking at your 2 code snippets: in case you hope that the first one will create a node with LabelA and the second one will assign LabelJ to the node which was created earlier, you are wrong.
 
But... if you remove the labels from MERGE, it will work; look here with attention:

LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
MERGE (new_node_A:Skewer {my_node_id: csvline.node_unique_number, property1: csvline.property1})
// only the my_node_id and property1 values are taken into account! no labels, no other properties
// AFAIR we do not need `ON CREATE SET` here; do you really care whether the node is new or was created earlier?
SET
  new_node_A:LabelA,
  new_node_A.Property2 = csvline.Property2,
  new_node_A.Property3 = csvline.Property3,
  new_node_A.Property4 = csvline.Property4;

LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
MERGE (new_node_J:Skewer {my_node_id: csvline.node_unique_number, property1: csvline.property1})
// only the my_node_id and property1 values are taken into account! no labels, no other properties
// AFAIR we do not need `ON CREATE SET` here; do you really care whether the node is new or was created earlier?
SET
  new_node_J:LabelJ,
  new_node_J.Property2 = csvline.Property2,
  new_node_J.Property3 = csvline.Property3,
  new_node_J.Property4 = csvline.Property4;


What you get if doing things this way:

  1. When doing the LabelA .csv you will create whatever uniquely numbered nodes were not already in the database, fill their properties (or maybe overwrite them?) and label each node (be it new or existing) with LabelA, no matter what other labels the node (possibly) had,
 JFM: OK.  I get it.
  2. when doing the LabelJ .csv you will again create whatever uniquely numbered nodes were not already in the database, again either fill or overwrite properties, and again label each node (be it new or existing) with LabelJ, no matter what other labels the node (possibly) had,
 JFM: OK.  I get it.
  3. so if you created some node from the first file and labeled it LabelA, and the same unique my_node_id occurs in both files, your node will get 2 labels: LabelA and LabelJ.
JFM: That's what I want!! 
 

Q5: Since I think of my data in terms of the two classes of nodes in my Data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after loading the nodes, how then I get two classes of nodes?


Make them 2 labels: CLTSource and CLTTarget respectively.
 

JFM: OK.  Regarding the labels... my csv file has a column called DESC that has two values, CLT_SOURCE and CLT_TARGET.  You are saying that my source csv should have a CLT_SOURCE column and my target csv should have a CLT_TARGET column?  My csv files should NOT have the configuration I described?

JFM: Since my csv file has its A thru J columns with A (2) values, B (1), C (4), D (83), E (83), F (11), G (11), H (83), J (83), K (2), I should have A LOT of csv files instead of just two for nodes!

 

Q6: Is there a step missing that explains how the code below got to have a “source_node” and a “dest_node” that appears to correspond to my CLT_SOURCE and CLT_TARGET nodes?


// suppose we coded relationships as the 2 my_node_id's of the nodes
LOAD CSV FROM "...somewhere..." AS csvline
MATCH (s:CLTSource:Skewer {my_node_id: TOINT(csvline[0])})
USING INDEX s:Skewer(my_node_id)
WITH s, csvline
MATCH (t:CLTTarget:Skewer {my_node_id: TOINT(csvline[1])})
USING INDEX t:Skewer(my_node_id)
MERGE (s)-[r:MY_RELATIONSHIP_TYPE]->(t)
SET r.prop1 = 'smth';



JFM: What I am not getting from this is that there is one csv file that has the CLTSOURCE and CLTTARGET labels in it. That contradicts what I said above, because that would make only 1 csv file.  I assume there is one LOAD CSV statement, and that my_node_id: TOINT(csvline[0]) and my_node_id: TOINT(csvline[1]) refer presumably to two lines in that file.
 

4. Now, when you are done with nodes and start doing LOAD CSV for relationships, you may give the MATCH statement which looks up your pair of nodes a hint for fast lookup, like

LOAD CSV ...from somewhere... AS csvline
MATCH (source_node:Skewer {my_node_id: TOINT(csvline[0])}), (dest_node:Skewer {my_node_id: TOINT(csvline[1])})
CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., rel_prop_NN: csvline[ZZ]}]->(dest_node);


Q6: This LOAD CSV  command (line 1) looks into the separate REL.csv file you mentioned first right?  


Yep
 

Q7: csvline is some sort of temp file that is a series of lines of the csv file? 


This is a variable: a collection which is filled with the column values of the .csv, line by line. You can use it either as an array, referring to fields by their index (my preferred way), or, if you use `WITH HEADERS` mode, as a keyed map. See http://neo4j.com/docs/2.1.6/cypherdoc-importing-csv-files-with-cypher.html

Q8: Do you imply in line 2 that the REL.csv file has headers that include  source_node, dest_node ?


No, I don't use headers, so I refer to csvline fields by their index ("collection mode").
 

Q9: While I see how Skewer is a label,  how is my_node_id a  property (line 2) ? 


Because it IS a property of a node, and you build the constraint & index on this exact property inside the scope of the label :Skewer.
 

JFM: OK.
 

Q10: How does my_node_id relate to either ToInt(csvline[0]} or ToInt(csvline[1]}  (line 2) ?


For the .csv with relationships, csvline[0] is the value of the my_node_id property of the source node, csvline[1] is the value of the my_node_id property of the target node, and the TOINT() type conversion is used because my personal preference is to use integers for ids.
 

Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?  

Does csvline[0] refer to a column in REL.csv as do csvline[2] and csvline[ZZ] (line 3) ?



JFM: OK, I think I get it.
 
I think you can combine the import of multiple .csv files in a single LOAD CSV statement, but I haven't ever tried this mode.

WBR,
Andrii
 

JFM: Thanks!

Andrii Stesin

Dec 1, 2014, 5:09:36 PM
to ne...@googlegroups.com, jm3...@columbia.edu
Hi José,


On Monday, December 1, 2014 12:33:58 AM UTC+2, José F. Morales wrote:
Ok, but how many valid distinct combinations of your 10 node labels may exist?

JFM: 264

This makes me think that maybe your target data model needs some refactoring. What are the entities (classes), and what would be better considered attributes? Again, I'm not familiar with LabCards, so in case you give some explanations and a publicly available sample dataset, I'd take a close look at it.
 
JFM:  Like I said, there are 264 unique combinations in all my nodes. Some are redundant: the full spelling of a term/phrase and an abbreviation.  Some are a code for a term/phrase.  Some were created in anticipation of other values I would create later.  I am trying to anticipate queries I'll make later.

Once again, I foresee a data modelling issue here.
 
JFM: Makes sense for speed. I guess it depends upon the size of one's data.

Sure it does :)
 
Q3: “Skewer” is just an integer right?  It corresponds in a way to my_node_id 

No, it's a label! so in Cypher your node (suppose it has 2 labels :LabelA and :LabelJ ) is described like

MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something', p2: 'something else', p3: 'etc.'})


JFM: Got that!

JFM: ok basic question...  MATCH (n:  <---What is "n"? Does it just indicate that its a node of a particular class?  What letter it is is arbitrary right?  Is there a name for what "n" is? For a while there, I thought it was my_node_ID.  

n is just the name of a variable. Cypher, like any other programming language, has the notion of a "variable", which has a name and can take different values; here I just happened to choose n as the variable name.
 

Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J combine the various labels and their respective values with their corresponding nodes? 


Label is not a variable, it does not have a value. It's just a label, consider "tag".
Also my_node_id IS a variable so it does have a value.

JFM: OK, I am not understanding this.  I understood a "Label" as a general category for a node.

That's OK, or maybe even better, imagine a tag. A node may have multiple tags (labels); they can be added and/or removed.
 
This was as opposed to a "Property" that was specific to a particular node.  As I understood it, a "Label" has different values.

A label is just a label. It doesn't have any value itself; it just marks (tags) some (sub)set of your nodes and allows you to distinguish between them. Labels may overlap. Consider the automotive domain, and let's take a look at a data model for it.

A brand seems better modelled as a label. Say `Opel`, `Volvo` or `Peugeot`.
The kind of vehicle is definitely(???) a label. Say `Truck`, `SUV`, `Car`.
How to model some deeper things depends on what you are going to achieve.
Is body color a label or a property? Which approach is better, either

MATCH (vhcl:Truck:Volvo {body_color: 'red', VIN: 'VE18727673826812634X65' })

or

MATCH (vhcl:Opel:Yellow:SUV {VIN: 'VE18727673826812634X65'})

? I'm not sure, it depends on the goal; as for me, I'd prefer color to be a property of some exact single car (after all, you can always decide to repaint your yellow car white or some other color)

But VIN is definitely a property of one exact single car.

Is a car's license plate a label or a property? Definitely neither, because you can sell your car and the new owner will get another license plate for it, so I'd model this as

MATCH (vhcl:Car:Ford {body_color: 'pink', VIN: 'FGT87356873HU8745'})-[:HAS_LICENSE_PLATE]->(lp:LicensePlate {state: 'AL', str: 'WH4TWR'})


but as you see, `LicensePlate` obviously should never be mixed with either `Car` or `Truck`, so they are different labels which do not intersect.

So that Label could be "Category" and there could be two categories, for example...  CLT_SOURCE and CLT_TARGET .    I thought that makes it like a variable.  If not, the label is all the same on a given set of nodes and what's the point in that?
 
JFM: OK, I get that my_node_id is a variable.  

Agh, exactly.
 
  1. When doing the LabelA .csv you will create whatever uniquely numbered nodes were not already in the database, fill their properties (or maybe overwrite them?) and label each node (be it new or existing) with LabelA, no matter what other labels the node (possibly) had,
 JFM: OK.  I get it.
  2. when doing the LabelJ .csv you will again create whatever uniquely numbered nodes were not already in the database, again either fill or overwrite properties, and again label each node (be it new or existing) with LabelJ, no matter what other labels the node (possibly) had,
 JFM: OK.  I get it.
  3. so if you created some node from the first file and labeled it LabelA, and the same unique my_node_id occurs in both files, your node will get 2 labels: LabelA and LabelJ.
JFM: That's what I want!! 

Huh, Ok so far :)
 

Q5: Since I think of my data in terms of the two classes of nodes in my Data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after loading the nodes, how then I get two classes of nodes?


Make them 2 labels: CLTSource and CLTTarget respectively.

JFM: OK.  Regarding the labels... my csv file has a column called DESC that has two values, CLT_SOURCE and CLT_TARGET.  You are saying that my source csv should have a CLT_SOURCE column and my target csv should have a CLT_TARGET column?  My csv files should NOT have the configuration I described?

What does CLT really mean in real life? I failed to parse it :( sorry for that. Once again, in case you describe the LabCards domain and provide me with a dataset, I'd be able to give you some better ideas (this could also become a good tutorial sample case for future Neo4j users).
 
JFM: Since my csv file has its A thru J columns with A (2) values, B (1), C (4), D (83), E (83), F (11), G (11), H (83), J (83), K (2), I should have A LOT of csv files instead of just two for nodes!

Again, I strongly suspect a data modelling issue here.
  

JFM: What I am not getting from this is that there is one csv file that has the CLTSOURCE and CLTTARGET labels in it. That contradicts what I said above, because that would make only 1 csv file.  I assume there is one LOAD CSV statement, and that my_node_id: TOINT(csvline[0]) and my_node_id: TOINT(csvline[1]) refer presumably to two lines in that file.


As soon as you have both the source and target nodes already inside the database, you need a .csv file which describes only the relationships: the 1st column contains the source nodes' ids, the 2nd column contains the destination nodes' ids, and thus 1 row of the .csv describes 1 single relationship per (linked) pair of nodes.

For the .csv with relationships, csvline[0] is the value of the my_node_id property of the source node, csvline[1] is the value of the my_node_id property of the target node, and the TOINT() type conversion is used because my personal preference is to use integers for ids.
 

Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?  

Does csvline[0] refer to a column in REL.csv as do csvline[2] and csvline[ZZ] (line 3) ?



JFM: OK, I think I get it.
 
I think you can combine the import of multiple .csv files in a single LOAD CSV statement, but I haven't ever tried this mode.

WBR,
Andrii
 

JFM: Thanks!

:)

WBR,
Andrii

José F. Morales

Dec 4, 2014, 6:09:16 PM
to ne...@googlegroups.com, jm3...@columbia.edu
Andrii and Michael,

Sorry for the delay in response; I was a little under the weather.   ANYHOW, it looks like I figured out how to do the data loading! I was trying several approaches and the one using Michael's shell tools seems to have worked! There was info from Andrii that proved important as well (my_node_ID as integer).  The loading of the 18k NODES took seconds. When I tested the RELS with a tiny data set it worked perfectly.  I am cleaning up the 52k RELS file after the first attempt failed because of a missing "  '  ".  

My only issue is that the RELs loading is slow....

commit after 1000 row(s)  0. 1%: nodes = 0 rels = 1000 properties = 7000 time 7059450 ms total 7059450 ms


Now I thought that if I created an index (below), it would be faster. Apparently not.  

neo4j-sh (?)$ auto-index LC_ID

Enabling auto-indexing of Node properties: [LC_ID]


Do I have this wrong?  Should it have been CREATE INDEX ON :LC_ID?

Jose

Michael Hunger

unread,
Dec 4, 2014, 6:27:53 PM12/4/14
to ne...@googlegroups.com, José F. Morales Ph.D.
Perhaps you should show the statement too? Not just the log output? :)

use this: CREATE INDEX ON :{Label}(LC_ID); <- replace with your label(s)
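For instance, if the nodes carried a hypothetical :LabCard label, the statement would be:

```cypher
CREATE INDEX ON :LabCard(LC_ID);
```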

--

José F. Morales

unread,
Dec 4, 2014, 6:54:35 PM12/4/14
to ne...@googlegroups.com, jm3...@columbia.edu
OK Fellas,  

What do you think of this?

Did this first...

auto-index LC_ID

Then this...

import-cypher -d , -i SAMPLE/Tz/Total_RELS_2.csv -b 1000  MATCH (LEFT_NODE {LC_ID:{LEFT_NODE}}), (RIGHT_NODE {LC_ID:{RIGHT_NODE}}) CREATE LEFT_NODE-[:#{REL} {PHYLUM:#{PHYLUM},CAT:#{CAT},UI_RL:#{UI_RL},RESULT:#{RESULT},INT_TYPE:#{INT_TYPE},DEG:toINT(#{DEG}),SDS_TD:toFloat(#{SDS_TD}),Path_L_TD:toINT(#{Path_L_TD}),Path_S_TD:#{Path_S_TD}}]->RIGHT_NODE return *


Michael Hunger

unread,
Dec 4, 2014, 7:00:16 PM12/4/14
to ne...@googlegroups.com
No, not at all
auto-index is for legacy indexes
do the create index that I said

in your MATCH you _must_ provide the label then.

MATCH (LEFT_NODE:LABEL1 {LC_ID:{LEFT_NODE}}), (RIGHT_NODE:LABEL2 {LC_ID:{RIGHT_NODE}})
..

You should also _never_ use #{} expressions for values, only for labels and rel-types.
Only use Cypher parameters: {CAT}.

I also saw that you have a ton of relationship-properties. do you think you need them all?
Perhaps there is also a Node / Entity actually hiding in your relationships?

Michael
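Applying both of Michael's corrections to the import-cypher statement quoted above gives roughly the following (LABEL1/LABEL2 are placeholders for the actual node labels; this is a sketch, not a tested statement):

```cypher
import-cypher -d , -i SAMPLE/Tz/Total_RELS_2.csv -b 1000
  MATCH (LEFT_NODE:LABEL1 {LC_ID:{LEFT_NODE}}), (RIGHT_NODE:LABEL2 {LC_ID:{RIGHT_NODE}})
  CREATE LEFT_NODE-[:#{REL} {PHYLUM:{PHYLUM}, CAT:{CAT}, UI_RL:{UI_RL}, RESULT:{RESULT},
    INT_TYPE:{INT_TYPE}, DEG:toInt({DEG}), SDS_TD:toFloat({SDS_TD}),
    Path_L_TD:toInt({Path_L_TD}), Path_S_TD:{Path_S_TD}}]->RIGHT_NODE
```

Note that the #{} expression survives only for the relationship type; every property value now uses a {param} Cypher parameter.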

José F. Morales

unread,
Dec 4, 2014, 8:28:32 PM12/4/14
to ne...@googlegroups.com
Some questions...


On Thursday, December 4, 2014 7:00:16 PM UTC-5, Michael Hunger wrote:
No, not at all
auto-index is for legacy indexes
do the create index that I said

Got it.
 
in your MATCH you _must_ provide the label then.

MATCH (LEFT_NODE:LABEL1 {LC_ID:{LEFT_NODE}}), (RIGHT_NODE:LABEL2 {LC_ID:{RIGHT_NODE}})
..

I have a node label whose header in the csv is called DESC with two values that for brevity are ... s and  t .  
Do you mean I should write...  MATCH (LEFT_NODE:s {LC_ID:{LEFT_NODE}}), (RIGHT_NODE:t {LC_ID:{RIGHT_NODE}}) OR MATCH (LEFT_NODE:DESC {LC_ID:{LEFT_NODE}}), (RIGHT_NODE:DESC {LC_ID:{RIGHT_NODE}})?
 

You should also _never_ use #{} expressions for values, only for labels and rel-types.
Only use Cypher parameters: {CAT}.

Yes, got it. It worked.
 
I also saw that you have a ton of relationship-properties. do you think you need them all?

I could live without 3 of the 9.  5 are essential. 1 is maybe.  Could be useful.
 
Perhaps there is also a Node / Entity actually hiding in your relationships?

I am quite sure that my data model can be improved.  But I wanted to have a really simple one to start. Time is a factor now.   Those 5 are very much qualities of the relationship.  One of those properties applies only to one or two types of relationships.

José F. Morales

unread,
Dec 4, 2014, 8:34:35 PM12/4/14
to ne...@googlegroups.com
in your MATCH you _must_ provide the label then.

MATCH (LEFT_NODE:LABEL1 {LC_ID:{LEFT_NODE}}), (RIGHT_NODE:LABEL2 {LC_ID:{RIGHT_NODE}})
..

I have a node label whose header in the csv is called DESC with two values that for brevity are ... s and  t .  
Do you mean I should write...  MATCH (LEFT_NODE:s {LC_ID:{LEFT_NODE}}), (RIGHT_NODE:t {LC_ID:{RIGHT_NODE}}) OR MATCH (LEFT_NODE:DESC {LC_ID:{LEFT_NODE}}), (RIGHT_NODE:DESC {LC_ID:{RIGHT_NODE}})?

I suspect its the first option. 

Jose 

Michael Hunger

unread,
Dec 5, 2014, 2:44:40 AM12/5/14
to ne...@googlegroups.com
Not sure what you mean by: I have a node label whose header in the csv is called DESC with two values that for brevity are ... s and  t .  

If you have a CSV and you provide a label for both left node and right node then probably something like this:

MATCH (LEFT_NODE:#{LEFT_LABEL} {LC_ID:{LEFT_NODE_ID}}), (RIGHT_NODE:#{RIGHT_LABEL} {LC_ID:{RIGHT_NODE}})

If your label is fixed, eg. ":Document" then you can use it directly.

Michael


--

José F. Morales

unread,
Dec 5, 2014, 11:56:55 AM12/5/14
to ne...@googlegroups.com
Hey M,

Just to make sure... I have my Node csv which has columns for labels.  One label is DESC.

The DESC column has two values...  "s" and "t".  

The Node csv file has 18k nodes, 146 of which have the "s" label and the rest have the "t" label.  Since I created relationships between "s" and "t" nodes in both directions, half the time the "s" node is the "FROM" node and the other half of the time, the "s" node is the "TO" node.  The same arrangement applies to the "t" nodes.

Consequently, am I right in assuming the statement should be as follows...MATCH (LEFT_NODE:#{DESC} {LC_ID:{LEFT_NODE_ID}}), (RIGHT_NODE:#{DESC} {LC_ID:{RIGHT_NODE}})?

Jose

Andrii Stesin

unread,
Dec 8, 2014, 7:06:03 AM12/8/14
to ne...@googlegroups.com
BTW I just finished an import operation, first step (only nodes, relationships are ongoing), so just to understand performance ability:

neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV FROM "file:///home/stesin/Documents/step1/3_generic/50_main.csv" AS row
...14 lines of query...
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1849312
Properties set: 13127979
Labels added: 3698624
768413 ms
neo4j-sh (?)$

The machine is an Intel i5 desktop, Ubuntu 14.04 x64, with 16 Gb total RAM (of which 2/3 is used by circa 50 Chrome tabs, a VirtualBox with m$ windows, 6 files in SublimeText, 6 terminals, ...). Neo4j is running with out-of-the-box RAM config settings.

A hint: about 200k of those nodes are to be labeled with multiple labels, some 1000+ combinations. For initial import, I stored them into collections inside the nodes, and later I'll launch a perl script which will take those collections and convert them into labels.

WBR,
Andrii

Andrii Stesin

unread,
Dec 8, 2014, 12:46:04 PM12/8/14
to ne...@googlegroups.com, Michael Hunger
And here the problem arrives... The same input .csv file described above, nodes imported, now establishing relationships. Second pass, the same file. The query is pretty simple:

//
// I *do* know that Neo4j is very sensitive with batch import of relationships, so 50-100 is more than enough
//
USING PERIODIC COMMIT 50
LOAD CSV FROM
"file:///home/stesin/Documents/step1/3_generic/50_main.csv"
AS row
MATCH
  (src:LeftLabel  { p1: row[0] }),
  (dst:RightLabel { p2: row[2], coll1: SPLIT(row[3], ':'), future_labels_coll: SPLIT(row[4], ':') })
USING INDEX src:LeftLabel(p1)
USING INDEX dst:RightLabel(p2)
MERGE
  (src)-[r:IS_RELATED_TO { coll3: SPLIT(row[1], ':') }]->(dst)
ON CREATE SET
  ...just SETting a bunch of properties r.some_payload here...
;

The profiler assured me that both src and dst nodes would be accessed via fast lookups in the corresponding indices. Ok so far.

I freed all my 16Gb RAM almost completely, configured 4Gb for the neo4j JVM heap, increased MMIO buffers x2.5, and made sure that there isn't any auto-indexing.
Then I opened http://localhost:7474/webadmin/# and fed the script to neo4j-shell.

The first time, the import attempt got stuck when webadmin reported some 380000+ relationships in the db. No diagnostics, complete silence in the logs, but the neo4j server just stopped responding both to webadmin and to new shell connections. After I restarted the server (it was shut down brutally after being unable to close gracefully) webadmin told me that there were about 16k+ relationships in the db.

Tried once more with the same result; it just went south much earlier, at about 190k+ relationships.

So, this is the very same behavior we observed with early 2.1 releases! So here is what I did - just an old proven trick:

bash$ split -a 3 -l 10000 50_main.csv 50_main.

Got a bunch of 10000-line files named from 50_main.aaa to 50_main.ajo - a total of 249 files -

then wrote a bash script:

#!/bin/bash

for file_name in `echo 50_main.a*`
do
    echo -n "${file_name} ... "

    qry="USING PERIODIC COMMIT 50"
    qry="$qry LOAD CSV FROM \"file:///home/stesin/Documents/step1/3_generic/${file_name}\" AS row"
    qry="$qry MATCH"
    # ... line by line I built the whole query ...

    neo4j-shell -c "$qry" || exit 1

    echo -n "sleeping... "
    sleep 5
    echo "done."

done

This worked: each 10000-line file took between 1600 and 2600 ms (say ~8 seconds per file including the sleep 5), and the whole import completed Ok in some half an hour.

I think that this may be some kind of TCP/IP (HTTP) session timeout issue -- when some operation (transaction) takes too long, one side closes its end of the session while the other side (the server?) continues to think that the session is still up.

Otherwise, why does the very same data, cut into 10000-line pieces, import Ok - while the complete file does not?

WBR,
Andrii

Andrii Stesin

unread,
Dec 11, 2014, 4:17:30 AM12/11/14
to ne...@googlegroups.com, michael...@neotechnology.com
Thanks to Michael for his kind explanation,

it turned out that my import of relationships hit an "eager loading problem", which is completely described in a blog post by Mark Needham and in posts by Michael himself referenced from there.

Import in 10000-line chunks hits it too, but as long as it does not hit the JVM heap size limit, it works Ok. So, the heavy life of the datastore manager is good again :)
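For anyone who lands here with the same symptom: the Eager operator forces Cypher to materialize all input rows before writing, which silently defeats USING PERIODIC COMMIT. The usual workaround (a sketch reusing the labels from the query above; the file path and batch size are arbitrary) is to split one combined statement into several simpler passes over the same file:

```cypher
// Pass 1: create the nodes only
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///path/to/50_main.csv" AS row
MERGE (:LeftLabel { p1: row[0] });

// Pass 2: create the relationships, now that both endpoints exist
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///path/to/50_main.csv" AS row
MATCH (src:LeftLabel { p1: row[0] }), (dst:RightLabel { p2: row[2] })
MERGE (src)-[:IS_RELATED_TO]->(dst);
```

Each pass then stays a pure write (or pure match-then-write) and the periodic commit batching actually takes effect.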

WBR,
Andrii

