Missing data MATCH vs START


JDS

Jan 1, 2014, 1:26:14 PM
to ne...@googlegroups.com
Maybe I'm wrong, but I thought all three of the top queries would return data, given what the 1st and 4th queries return, so I'm a little confused. The server is 2.0.0 Enterprise stable.

neo4j-sh (?)$ START n=node(*) WHERE HAS (n.code) AND n.code = 'IN05_lvl1' RETURN n.code;
+-------------+
| n.code      |
+-------------+
| "IN05_lvl1" |
+-------------+
1 row
102405 ms
neo4j-sh (?)$ MATCH (ndt:NLRS_DATA_TYPE) WHERE ndt.code = 'IN05_lvl1' RETURN ndt.code;        
+----------+
| ndt.code |
+----------+
+----------+
0 row
31 ms
neo4j-sh (?)$ MATCH (ndt:NLRS_DATA_TYPE { code : 'IN05_lvl1' }) RETURN ndt.code;              
+----------+
| ndt.code |
+----------+
+----------+
0 row
20 ms
neo4j-sh (?)$ MATCH (ndt:NRLS_DATA_TYPE) RETURN ndt.code;                                     
+-------------------+
| ndt.code          |
+-------------------+
| "RP07"            |
| "IN07"            |
| "Age_at_Incident" |
| "ST01_LVL1"       |
| "PD09"            |
| "PD05_lvl1"       |
| "IN05_lvl1"       |
| "IN03_lvl1"       |
| "IN07_01MMYY"     |
| "PD11"            |
| "IN02_A_01"       |
| "IN01"            |
| "PD02"            |
+-------------------+
13 rows
113 ms
neo4j-sh (?)$ MATCH (ndt:NLRS_DATA_TYPE { code : "IN05_lvl1" }) RETURN ndt.code;


Michael Hunger

Jan 1, 2014, 1:32:16 PM
to ne...@googlegroups.com
Typo:

In query #4 you use "NRLS_DATA_TYPE"; in the previous ones you use "NLRS_DATA_TYPE".

N_RL_S vs. N_LR_S

HTH

Michael


JDS

Jan 1, 2014, 1:33:47 PM
to ne...@googlegroups.com
Ugh *shame*

Thanks Mike

Michael Hunger

Jan 1, 2014, 1:36:04 PM
to ne...@googlegroups.com
No worries, it's still early in the New Year :)

But you definitely want to write a blog post about what you're doing with Neo4j, right?

Happy New Year

Michael

JDS

Jan 1, 2014, 1:43:41 PM
to ne...@googlegroups.com
Actually I would. I've been working with REST::Neo4p, and the writer of the module has been more than helpful; I'm getting decent results doing my inserts with Perl now. I'm in the process of migrating my CREATE UNIQUE queries to MERGE to test the performance benefits. Once that's done I'll start looking at the application layer (I'm still loading data, and there's a lot of it: over 2 million incident nodes, and I'm barely halfway through loading the spreadsheets). Eventually I hope to present my results as "normalizing and graphing medical patient data with Neo4j".

Happy New Year

~ icenine

JDS

Jan 1, 2014, 1:48:56 PM
to ne...@googlegroups.com
BTW, I love the simplicity of something like this:

neo4j-sh (?)$ MATCH (ndt:NRLS_DATA_TYPE { code : 'IN05_lvl1' })
> MERGE (ic:INCIDENT_CATEGORY { category_level_01 : 'FOOBAR' })-[r:HAS_NRLS_DATA_TYPE]->(ndt)
> RETURN ic, r;
+-----------------------------------------------------------------------------+
| ic                                        | r                               |
+-----------------------------------------------------------------------------+
| Node[2121668]{category_level_01:"FOOBAR"} | :HAS_NRLS_DATA_TYPE[16880045]{} |
+-----------------------------------------------------------------------------+
1 row
Nodes created: 1
Relationships created: 1
Properties set: 1


Michael Hunger

Jan 1, 2014, 5:33:33 PM
to ne...@googlegroups.com
Great !

Looks good.

I think if you use parameters (Neo4p's cypher support lets you pass Perl hashes as parameters) and the transactional endpoint for your import data, it shouldn't take too long to import your 2 million data points.

#1 parameters
#2 transactional endpoint
#3 sensible batch-size (e.g. 20k per commit)
#4 when just creating data you usually don't have to return anything.
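Points #1-#3 taken together boil down to one request per batch, carrying many parameterized statements. A minimal sketch of such a request body, in Python just to show the shape (the label, property, and row names here are illustrative, not from the actual import code):

```python
import json

def build_batch(rows):
    """Build one request body for the transactional endpoint
    (POST /db/data/transaction/commit in Neo4j 2.0), carrying many
    parameterized statements applied in a single transaction."""
    statements = [
        {
            # Parameters are supplied separately from the query text,
            # so the server can cache and reuse the plan (point #1).
            "statement": "MERGE (i:INCIDENT { incident_id: { incident_id }})",
            "parameters": {"incident_id": row["incident_id"]},
        }
        for row in rows
    ]
    return json.dumps({"statements": statements})

body = build_batch([{"incident_id": "A1"}, {"incident_id": "A2"}])
```

Posting one such body per batch of rows (point #3, e.g. 20k at a time) keeps the per-request and per-commit overhead small.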

Cheers

Michael

icenine

Jan 1, 2014, 6:14:00 PM
to ne...@googlegroups.com
Hi Michael,

I've posted the newest code here: http://pastebin.com/7PikkZRP

I've switched all of my CREATE UNIQUE statements to MERGE. I'm still convinced, though, that the inserts are under-performing. It's taking about 5 minutes for all of the statements in that code to execute 500 times; over time this grows to around 8, even up to 12, minutes. My biggest bottleneck seems to be the incident nodes and their relationships:

neo4j-sh (?)$ START n=node(*) MATCH (n)-[r]-() RETURN DISTINCT labels(n), type(r), count(*) ORDER BY labels(n)[0], type(r);
+-----------------------------------------------------------------------+
| labels(n)               | type(r)                          | count(*) |
+-----------------------------------------------------------------------+
| ["DEGREE_OF_HARM"]      | "HAS_INCIDENT_DEGREE_OF_HARM"    | 2120424  |
| ["DEGREE_OF_HARM"]      | "HAS_NRLS_DATA_TYPE"             | 7        |
| ["INCIDENT"]            | "HAS_INCIDENT_CATEGORY"          | 2120457  |
| ["INCIDENT"]            | "HAS_INCIDENT_DEGREE_OF_HARM"    | 2120424  |
| ["INCIDENT"]            | "HAS_INCIDENT_PATIENT"           | 2120432  |
| ["INCIDENT"]            | "HAS_INCIDENT_REPORTER"          | 2120442  |
| ["INCIDENT"]            | "HAS_INCIDENT_SPECIALITY"        | 2120450  |
| ["INCIDENT"]            | "HAS_NRLS_DATA_TYPE"             | 2120486  |
| ["INCIDENT"]            | "IS_NHS_TRUST_INCIDENT"          | 2120483  |
| ["INCIDENT"]            | "IS_NHS_TRUST_LOCATION_INCIDENT" | 2114664  |
| ["INCIDENT_CATEGORY"]   | "HAS_INCIDENT_CATEGORY"          | 2120457  |
| ["INCIDENT_CATEGORY"]   | "HAS_NRLS_DATA_TYPE"             | 16       |
| ["INCIDENT_REPORTER"]   | "HAS_INCIDENT_REPORTER"          | 2120442  |
| ["INCIDENT_REPORTER"]   | "HAS_NRLS_DATA_TYPE"             | 12       |
| ["INCIDENT_SPECIALITY"] | "HAS_INCIDENT_SPECIALITY"        | 2120450  |
| ["INCIDENT_SPECIALITY"] | "HAS_NRLS_DATA_TYPE"             | 17       |
| ["NHS_TRUST"]           | "HAS_NHS_TRUST_LOCATION"         | 480      |
| ["NHS_TRUST"]           | "HAS_NRLS_DATA_TYPE"             | 63       |
| ["NHS_TRUST"]           | "IS_NHS_TRUST_INCIDENT"          | 2120483  |
| ["NHS_TRUST_LOCATION"]  | "HAS_NHS_TRUST_LOCATION"         | 480      |
| ["NHS_TRUST_LOCATION"]  | "IS_NHS_TRUST_LOCATION_INCIDENT" | 2114664  |
| ["NRLS_DATA_TYPE"]      | "HAS_NRLS_DATA_TYPE"             | 2123426  |
| ["PATIENT"]             | "HAS_INCIDENT_PATIENT"           | 2120432  |
| ["PATIENT"]             | "HAS_NRLS_DATA_TYPE"             | 2825     |
+-----------------------------------------------------------------------+
24 rows
247418 ms

MERGE seems to be slightly more consistent in performance than CREATE UNIQUE, though not much faster.

I've tried the following to tune the instance (note I have 8G of RAM on the VM, and nothing else is using it besides Neo4j and my extract process, which never takes much more than 100M of RAM now that I've tuned it with MAJ's suggestions):

cache_type=hpc
node_cache_array_fraction=6
relationship_cache_array_fraction=7
#node_cache_size=1024
relationship_cache_size=2G

I haven't bothered tuning node_cache_size itself, since my relationship store seems to be the biggest access point: accessing the node count takes 14 seconds, but accessing a relationship count takes around 2-3 minutes.

Current heap usage, after a restart and with the script running, is ~500M after processing 1000 rows.

Current neostore sizes are:

[root@miyu graph.db]# ls -l neostore* | awk '{printf("%10s %s\n", $5, $9)}'
        63 neostore
         9 neostore.id
        55 neostore.labeltokenstore.db
       456 neostore.labeltokenstore.db.names
  29850422 neostore.nodestore.db
         9 neostore.nodestore.db.id
        68 neostore.nodestore.db.labels
 177676780 neostore.propertystore.db
       128 neostore.propertystore.db.arrays
       162 neostore.propertystore.db.index
       722 neostore.propertystore.db.index.keys
 679805312 neostore.propertystore.db.strings
 560433951 neostore.relationshipstore.db
        45 neostore.relationshiptypestore.db
       380 neostore.relationshiptypestore.db.names
      1600 neostore.schemastore.db

Current cached mappings settings are:

neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=756M
neostore.propertystore.db.mapped_memory=300M
neostore.propertystore.db.strings.mapped_memory=756M
neostore.propertystore.db.arrays.mapped_memory=50M

Current initial heap settings are:

# Initial Java Heap Size (in MB)
wrapper.java.initmemory=2048

# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=5632

Current schema:

neo4j-sh (?)$ schema
Welcome to the Neo4j Shell! Enter 'help' for a list of commands
[Reconnected to server]
Indexes
  ON :DEGREE_OF_HARM(degree_of_harm)           ONLINE (for uniqueness constraint) 
  ON :INCIDENT(incident_description)           ONLINE                             
  ON :INCIDENT(incident_timestamp)             ONLINE                             
  ON :INCIDENT(incident_id)                    ONLINE (for uniqueness constraint) 
  ON :INCIDENT_CATEGORY(category_level_01)     ONLINE (for uniqueness constraint) 
  ON :INCIDENT_REPORTER(reporter_level_01)     ONLINE (for uniqueness constraint) 
  ON :INCIDENT_SPECIALITY(speciality_level_01) ONLINE (for uniqueness constraint) 
  ON :NHS_TRUST(name)                          ONLINE (for uniqueness constraint) 
  ON :NHS_TRUST_LOCATION(location_level_01)    ONLINE (for uniqueness constraint) 
  ON :NRLS_DATA_TYPE(code)                     ONLINE (for uniqueness constraint) 
  ON :PATIENT(patient_age)                     ONLINE                             
  ON :PATIENT(patient_sex)                     ONLINE                             
  ON :PATIENT(patient_ethnicity)               ONLINE                             

Constraints
  ON (nrls_data_type:NRLS_DATA_TYPE) ASSERT nrls_data_type.code IS UNIQUE
  ON (nhs_trust:NHS_TRUST) ASSERT nhs_trust.name IS UNIQUE
  ON (degree_of_harm:DEGREE_OF_HARM) ASSERT degree_of_harm.degree_of_harm IS UNIQUE
  ON (incident:INCIDENT) ASSERT incident.incident_id IS UNIQUE
  ON (nhs_trust_location:NHS_TRUST_LOCATION) ASSERT nhs_trust_location.location_level_01 IS UNIQUE
  ON (incident_reporter:INCIDENT_REPORTER) ASSERT incident_reporter.reporter_level_01 IS UNIQUE
  ON (incident_category:INCIDENT_CATEGORY) ASSERT incident_category.category_level_01 IS UNIQUE
  ON (incident_speciality:INCIDENT_SPECIALITY) ASSERT incident_speciality.speciality_level_01 IS UNIQUE

I'm going to keep tweaking, but since I can't use property index hints with my MERGE statements (which I think would help with the incident relationships), I'm just loading anyway so I can get this done; I've been at it for a while.

If you have any further suggestions (or anyone else does) I'd be glad to try them out.

~ icenine

Michael Hunger

Jan 1, 2014, 6:56:43 PM
to ne...@googlegroups.com, maj.for...@gmail.com, Mark Needham
CC'ing the Marks here.

I don't know which endpoint Neo4p uses by default.

As far as I can see, you also run just one query per HTTP request / tx?
Usually you want to batch a score of them in a single tx (e.g. 20k elements).


To run your code we probably need the xls files as well :)

Also, your query below shouldn't run that long, even if you have some 2 million entries in your db.

Any chance to share your db with me?

Mark Jensen

Jan 1, 2014, 7:41:35 PM
to Michael Hunger, ne...@googlegroups.com, Mark Needham
Ice-9, you can use the transaction support in the latest Neo4p (>0.2220):

REST::Neo4p->begin_work;
# ... queries and stuff
if ($something_bad_happened) {
  REST::Neo4p->rollback;
}
else {
  REST::Neo4p->commit;
}
and see if it improves perf.
(The ../cypher endpoint is used by default, and ../transaction within a txn marked as above)



icenine

Jan 1, 2014, 7:49:15 PM
to ne...@googlegroups.com, Michael Hunger, Mark Needham
Hey MAJ,

Should I wrap a commit around the entire end-to-end process for the row, or around each individual statement? BTW, I'm putting together an strace of read/write/open/close calls for an end-to-end transaction (minus the creation of the NHS trust / NHS trust location, as that's usually done once; the main bulk don't hit these).

Michael,

In regards to a db, I could give you a mocked db for 2-4 trusts with sanitized data, but I'll need to put that together tomorrow. The spreadsheets are a bit trickier, but I could give you the same backing ones I use for the db; I'll just have to clean them up. The existing full db I'm using is ~13G and behind a VPN.

Thanks guys.

~ icenine

Mark Jensen

Jan 1, 2014, 10:53:16 PM
to ne...@googlegroups.com, Michael Hunger, Mark Needham
Ice-9 - I would wrap each loop iteration in a transaction to start with; my guess is you'll see some improvement right away. You wouldn't get any benefit by wrapping each individual query in a txn. Cheers, MAJ
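The batching MAJ describes can be sketched as a plain loop. This is a language-neutral sketch in Python; `begin`, `run_statement`, and `commit` stand in for whatever the client library provides (REST::Neo4p uses begin_work/commit as shown above):

```python
def load_rows(rows, batch_size, begin, run_statement, commit):
    """Commit once per `batch_size` rows instead of once per statement."""
    count = 0
    begin()
    for row in rows:
        run_statement(row)      # the per-row queries
        count += 1
        if count % batch_size == 0:
            commit()            # flush this batch in one transaction
            begin()
    commit()                    # flush the final, possibly partial, batch
    return count
```

With per-statement commits, every query pays the full transaction overhead; batching amortizes it across the whole batch.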

Karol Brejna

Jan 2, 2014, 4:52:42 AM
to ne...@googlegroups.com
Ice,

It seems you are doing something very interesting. What system are you building? Is it possible for you to shed some light on it?

Regards,
Karol




icenine

Jan 2, 2014, 6:02:15 AM
to ne...@googlegroups.com
Hi Karol,

It has to do with health care policy in the UK and is academic in nature. That's about as much as I can say at the moment.

~ icenine

icenine

Jan 2, 2014, 6:18:28 AM
to ne...@googlegroups.com, Michael Hunger, Mark Needham
Hello MAJ, Michael,

I've wrapped the loop in a transaction now, and it has cut my time to about a third, but I have to revise my original time quotes, because those were with the CREATE UNIQUEs in place instead of MERGE. With MERGE the time for 500 rows was ~15 minutes; now it's at 5 minutes. I've done a trace of a transaction, attached with masking for certain fields. It only covers reads and writes (produced by strace -Tttt -e read,write -s 4096 -fp $pid), but it gives a good idea where the heavy hitters are. I've produced the following timetable from the trace (and divided the sections in the attached trace file). The times are estimates, as there would be calls before and after the reads and writes.

Time table (in seconds with microseconds):

Total time in transaction: 1.804445

Per query time (including read callback):

Incident merge: 0.018740
Incident HAS_NRLS_DATA_TYPE: 0.852701
Incident IS_NHS_TRUST_INCIDENT: 0.023515
Incident IS_NHS_TRUST_LOCATION_INCIDENT: 0.754672
Incident Category merge + HAS_NRLS_DATA_TYPE: 0.014473
Incident HAS_INCIDENT_CATEGORY: 0.012777
Incident Speciality merge + HAS_NRLS_DATA_TYPE: 0.012035
Incident HAS_INCIDENT_SPECIALITY: 0.010719
Incident Reporter merge + HAS_NRLS_DATA_TYPE: 0.011301
Incident HAS_INCIDENT_REPORTER: 0.010741
Incident Patient merge + HAS_NRLS_DATA_TYPE: 0.021814
Incident HAS_INCIDENT_PATIENT: 0.011810
Incident DOH merge + HAS_NRLS_DATA_TYPE: 0.011132
Incident HAS_DEGREE_OF_HARM: 0.010414
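Summing the table (a quick check in Python; the values are copied verbatim from the trace above) shows that the two relationship MERGEs account for roughly 90% of the per-query time:

```python
# Per-query times from the strace timetable, in seconds.
times = {
    "incident_merge": 0.018740,
    "HAS_NRLS_DATA_TYPE": 0.852701,
    "IS_NHS_TRUST_INCIDENT": 0.023515,
    "IS_NHS_TRUST_LOCATION_INCIDENT": 0.754672,
    "category_merge": 0.014473,
    "HAS_INCIDENT_CATEGORY": 0.012777,
    "speciality_merge": 0.012035,
    "HAS_INCIDENT_SPECIALITY": 0.010719,
    "reporter_merge": 0.011301,
    "HAS_INCIDENT_REPORTER": 0.010741,
    "patient_merge": 0.021814,
    "HAS_INCIDENT_PATIENT": 0.011810,
    "doh_merge": 0.011132,
    "HAS_DEGREE_OF_HARM": 0.011132 - 0.011132 + 0.010414,
}
total = sum(times.values())                       # ~1.777 s of the 1.804 s tx
heavy = (times["HAS_NRLS_DATA_TYPE"]
         + times["IS_NHS_TRUST_LOCATION_INCIDENT"])
share = heavy / total                             # fraction in the two heavy MERGEs
```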

So it seems based on this my two heaviest queries are:

@ 0.852701 - (i:INCIDENT)-[r:HAS_NRLS_DATA_TYPE]->(ndt:NRLS_DATA_TYPE)

MATCH (i:INCIDENT { incident_id : { incident_id }}),
(ndt:NRLS_DATA_TYPE { code : { code }})
MERGE (i)-[hndt:HAS_NRLS_DATA_TYPE]->(ndt)
RETURN hndt

@ 0.754672 - (i:INCIDENT)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl:NHS_TRUST_LOCATION)

MATCH (ntl:NHS_TRUST_LOCATION { location_level_01 : { location_level_01 }}),
(i:INCIDENT { incident_id : { incident_id }})
MERGE (i)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl)
RETURN r

These two relationships make up the bulk of unique relationship counts:

neo4j-sh (?)$ MATCH (i:INCIDENT)-[r:HAS_NRLS_DATA_TYPE]->(ndt:NRLS_DATA_TYPE) USING SCAN ndt:NRLS_DATA_TYPE RETURN type(r), count(i.incident_id);
+---------------------------------------------+
| type(r)              | count(i.incident_id) |
+---------------------------------------------+
| "HAS_NRLS_DATA_TYPE" | 2129522              |
+---------------------------------------------+
1 row
62598 ms
neo4j-sh (?)$ MATCH (i:INCIDENT)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl:NHS_TRUST_LOCATION) USING SCAN ntl:NHS_TRUST_LOCATION RETURN type(r), count(i.incident_id);
+---------------------------------------------------------+
| type(r)                          | count(i.incident_id) |
+---------------------------------------------------------+
| "IS_NHS_TRUST_LOCATION_INCIDENT" | 2123700              |
+---------------------------------------------------------+
1 row
14950 ms
neo4j-sh (?)$ 

So I'm thinking I need to optimize these two queries and things should get pretty streamlined. Combining the incident-to-NRLS-data-type relationship with the initial incident create doesn't seem like it'd be faster, but surely that query should be taking advantage of the fact that incident_id is unique, and should run a lot faster than it does. For the IS_NHS_TRUST_LOCATION_INCIDENT relationship, is there a way to force a scan on NHS_TRUST_LOCATION, just like I would if I were to query those relationships? Again this seems to me to be a problem with taking full advantage of the uniqueness of incident_id.

Recommendations welcome.

Thanks.

~ icenine
transaction_trace.txt

Michael Hunger

Jan 2, 2014, 6:22:33 AM
to icenine, ne...@googlegroups.com, Mark Needham
Hi,

if you followed the mailing list, you've seen that there seems to be a perf issue with read-your-own-writes during a tx, i.e. reading the in-transaction state.

Just for comparison, can you change your MERGE statements to CREATE (instead of CREATE UNIQUE)? The main difference is that CREATE doesn't read anything from the graph but just dumbly creates relationships.

How many relationships do you have on average on your different node types? You had the queries with the total counts; can you also produce one with min, max, avg? There are also some percentile functions in Cypher (see http://docs.neo4j.org/refcard/2.0/).


Would you also be able to attach a profiler to the server (jvisualvm comes with the JDK), run the "sample" CPU profiling, and send the snapshot data / hotspot-method screenshots?

See the screenshot below; in the resulting screen, just hit the "CPU" button before your import and "Snapshot" after.



icenine

Jan 2, 2014, 9:20:06 AM
to ne...@googlegroups.com, icenine, Mark Needham
Hi Michael,

From the threads I read, it seemed MERGE had better performance, but I can switch back to unique. However, I've run into a new problem I noticed while trying to load a new file for your profiler:

neo4j-sh (?)$ begin
Transaction started
neo4j-sh (?)$ MATCH (ndt:NRLS_DATA_TYPE { code : 'RP07'})
> MERGE (n:NHS_TRUST { name : 'masked_nhs_trust' })-[r:HAS_NRLS_DATA_TYPE]->(ndt)
> RETURN n.name;
+--------+
| n.name |
+--------+
| "xxx"  |
+--------+
1 row
Nodes created: 1
Relationships created: 1
Properties set: 1
Labels added: 1
48 ms
neo4j-sh (?)$ MERGE (ntl:NHS_TRUST_LOCATION { location_level_01 : 'General / acute hospital' })
> RETURN ntl.location_level_01
> ;
+----------------------------+
| ntl.location_level_01      |
+----------------------------+
| "General / acute hospital" |
+----------------------------+
1 row
27 ms
neo4j-sh (?)$ MATCH (nt:NHS_TRUST { name : 'masked_nhs_trust'}), 
> (ntl:NHS_TRUST_LOCATION { location_level_01 : 'General / acute hospital'})
> MERGE (nt)-[hntl:HAS_NHS_TRUST_LOCATION]->(ntl) RETURN hntl
> ;
+-------------------------------------+
| hntl                                |
+-------------------------------------+
| :HAS_NHS_TRUST_LOCATION[17036432]{} |
+-------------------------------------+
1 row
Relationships created: 1
1211 ms
neo4j-sh (?)$ MERGE (i:INCIDENT { incident_id : 'masked_incident_id' })
> ON CREATE SET i.incident_timestamp = '2005-03-28T00:00:00',
> i.incident_description = 'masked_incident_description'
> RETURN i.incident_id;
+---------------+
| i.incident_id |
+---------------+
| "xxxxxx"      |
+---------------+
1 row
Nodes created: 1
Properties set: 3
Labels added: 1
40 ms
neo4j-sh (?)$ MATCH (i:INCIDENT { incident_id : 'masked_incident_id'}), 
> (ndt:NRLS_DATA_TYPE { code : 'IN07'}) 
> MERGE (i)-[hndt:HAS_NRLS_DATA_TYPE]->(ndt)
> RETURN hndt;
+---------------------------------+
| hndt                            |
+---------------------------------+
| :HAS_NRLS_DATA_TYPE[17036433]{} |
+---------------------------------+
1 row
Relationships created: 1
3782 ms
neo4j-sh (?)$ MATCH (n:NHS_TRUST { name : 'masked_nhs_trust'}), 
> (i:INCIDENT { incident_id : 'masked_incident_description'})
> MERGE (i)-[r:IS_NHS_TRUST_INCIDENT]->(n) 
> RETURN r;
+------------------------------------+
| r                                  |
+------------------------------------+
| :IS_NHS_TRUST_INCIDENT[17036434]{} |
+------------------------------------+
1 row
Relationships created: 1
81 ms
neo4j-sh (?)$ MATCH (ntl:NHS_TRUST_LOCATION { location_level_01 : 'General / acute hospital'}), 
> (i:INCIDENT { incident_id : 'masked_incident_id'})
> MERGE (i)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl) 
> RETURN r;
+---------------------------------------------+
| r                                           |
+---------------------------------------------+
| :IS_NHS_TRUST_LOCATION_INCIDENT[17036435]{} |
+---------------------------------------------+
1 row
Relationships created: 1
1973 ms
neo4j-sh (?)$ MATCH (ndt:NRLS_DATA_TYPE { code : 'IN05_lvl1'})
> MERGE (ic:INCIDENT_CATEGORY { category_level_01 : 'Implementation of care and ongoing monitoring / review' 
> })-[r:HAS_NRLS_DATA_TYPE]->(ndt)
> RETURN ic.category_level_01;
+----------------------------------------------------------+
| ic.category_level_01                                     |
+----------------------------------------------------------+
| "Implementation of care and ongoing monitoring / review" |
+----------------------------------------------------------+
1 row
54 ms
neo4j-sh (?)$ MATCH (i:INCIDENT { incident_id : 'masked_incident_id'})
> MERGE (i)-[r:HAS_INCIDENT_CATEGORY]->
> (ic:INCIDENT_CATEGORY { category_level_01 : 'Implementation of care and ongoing monitoring / review'})
> RETURN r;
CypherExecutionException: Node 386 already exists with label INCIDENT_CATEGORY and property "category_level_01"=[Implementation of care and ongoing monitoring / review]

Shouldn't MERGE be fine with the fact that INCIDENT_CATEGORY already exists? Is it trying to recreate it? I can't specify it as:

MATCH (i:INCIDENT { incident_id : { incident_id }}),
(ic:INCIDENT_CATEGORY { category_level_01 : { category_level_01 }})
MERGE (i)-[r:HAS_INCIDENT_CATEGORY]->(ic)
RETURN r

Because if the category node doesn't exist, then this form won't create it. However, I can accomplish the same thing with CREATE UNIQUE, which creates both the node and the relationship.

Something I'm doing wrong here?

Btw MAJ,

This wasn't caught as an exception within the try statement I'm using inside the transaction; should I be checking for _tx_errors every time I execute a query? In this case the script just died with no output when it hit this exception.

~ icenine

icenine

Jan 2, 2014, 2:52:25 PM
to ne...@googlegroups.com, icenine, Mark Needham
Actually, sorry, I'm ignoring the obvious here: the incident category would have already been created in the previous statement, so in reality I could use:

neo4j-sh (?)$ MATCH (i:INCIDENT { incident_id : 'masked_incident_id'}),
> (ic:INCIDENT_CATEGORY { category_level_01 : 'Implementation of care and ongoing monitoring / review'})
> MERGE (i)-[r:HAS_INCIDENT_CATEGORY]->(ic)
> RETURN r;
+------------------------------------+
| r                                  |
+------------------------------------+
| :HAS_INCIDENT_CATEGORY[17036442]{} |
+------------------------------------+
1 row
Relationships created: 1
411 ms

But I would still think the exception thrown earlier is a bug in MERGE?

~ icenine

Michael Hunger

Jan 2, 2014, 5:37:26 PM
to ne...@googlegroups.com, icenine, Mark Needham
It's how MERGE works.

It either matches and returns what it has found, or creates everything that's missing.

So if you have two nodes that you know are there upfront, do:

MATCH (a:A {foo:"foo"}),(b:B {bar:"bar"})
MERGE (a)-[:XX]->(b)

If you want to create the nodes uniquely (i.e. you don't know if they are there):

MERGE (a:A {foo:"foo"})
MERGE (b:B {bar:"bar"})
MERGE (a)-[:XX]->(b)

If you want to attach a "subgraph" to an existing node, where the uniqueness is within the subgraph only:

MATCH (a:A {foo:"foo"})
MERGE (a)-[:XX]->(b:B {bar:"bar"})




Michael Hunger

Jan 2, 2014, 5:40:15 PM
to icenine, ne...@googlegroups.com, Mark Needham
How many nodes of each label do you have?

MATCH (n) RETURN labels(n), count(*)

Your counts always included the relationships, so I cannot infer the # of nodes.

It would be cool to have a model of your graph with the numbers annotated on the elements (# of nodes with each label, # of rels between the labeled nodes).

How long does it take if you leave off the MERGE in these statements, i.e. just do the matches and return?

MATCH (i:INCIDENT { incident_id : { incident_id }}),
(ndt:NRLS_DATA_TYPE { code : { code }})
MERGE (i)-[hndt:HAS_NRLS_DATA_TYPE]->(ndt)
RETURN hndt

->

MATCH (i:INCIDENT { incident_id : { incident_id }}),
(ndt:NRLS_DATA_TYPE { code : { code }})
RETURN i,ndt
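This MATCH-only vs. MERGE comparison can be timed with a tiny harness; a Python sketch, where `run_query` stands in for whatever client call executes the statement (an illustrative helper, not part of the actual import code):

```python
import time

def time_query(run_query, repeats=100):
    """Average wall-clock seconds per execution of run_query()."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_query()
    return (time.perf_counter() - start) / repeats

# e.g. compare:
#   match_avg = time_query(lambda: run_query_via_client(match_only_cypher))
#   merge_avg = time_query(lambda: run_query_via_client(merge_cypher))
```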
Cheers

Michael


icenine

Jan 2, 2014, 5:51:17 PM
to ne...@googlegroups.com, icenine, Mark Needham
Hi Michael,

Thanks for the explanation, but I'm still a little confused as to why it wouldn't find the category node in this query:

MATCH (i:INCIDENT { incident_id : '_incident_id_'})
MERGE (i)-[r:HAS_INCIDENT_CATEGORY]->
(ic:INCIDENT_CATEGORY { category_level_01 : 'Implementation of care and ongoing monitoring / review'})
RETURN r

In this case, i and ic are both completely unique nodes, and i.incident_id and ic.category_level_01 are both unique properties; creating a new node for ic.category_level_01 would have violated the unique constraint on the property.

Just FYI, I've reverted to using CREATE UNIQUE since I couldn't get any realistic performance out of MERGE. It seems that MERGE works better when the nodes and relationships already exist, as it smartly skips over them; otherwise CREATE UNIQUE seems to be a bit slower. I'm collecting the stats you asked for earlier by loading an entire new 12,000+ row file (none of the nodes/relationships existed) with CPU profiling on, using this latest code (I'm also using the transaction endpoint). I'll post it soon (it's currently at 10,000 rows loaded). I'll also post the latest code with the latest queries and the relationship averages you asked for (I wanted to wait until this was done so the profiling is purely on the load process).

Thanks for the help.

~ icenine

icenine

Jan 2, 2014, 6:10:17 PM
to ne...@googlegroups.com, icenine, Mark Needham
Hi Michael,

Gathering data, but just fyi, a sample of node counts were here:

icenine

Jan 2, 2014, 6:49:19 PM
to ne...@googlegroups.com, icenine, Mark Needham
Ok, this is going to be a long post:

First off, current code here: http://pastebin.com/FaYRDhx7
As stated, it's using CREATE UNIQUE rather than MERGE now.

Second, the file loaded contained 12207 rows and loaded as follows:

Connecting to Neo4j : http://localhost:7474 Thu Jan  2 20:43:00 2014
Done connecting to Neo4j : http://localhost:7474 Thu Jan  2 20:43:00 2014
Creating REST::Neo4p::Query objects Thu Jan  2 20:43:00 2014
Done creating REST::Neo4p::Query objects Thu Jan  2 20:43:00 2014
Beginning processing loop Thu Jan  2 20:43:00 2014
Processed 100 rows. Thu Jan  2 20:43:58 2014
Processed 200 rows. Thu Jan  2 20:44:56 2014
Processed 300 rows. Thu Jan  2 20:45:55 2014
Processed 400 rows. Thu Jan  2 20:46:53 2014
... truncated ...
Processed 11700 rows. Thu Jan  2 23:10:58 2014
Processed 11800 rows. Thu Jan  2 23:12:15 2014
Processed 11900 rows. Thu Jan  2 23:13:29 2014
Processed 12000 rows. Thu Jan  2 23:14:49 2014
Processed 12100 rows. Thu Jan  2 23:16:13 2014
Processed 12200 rows. Thu Jan  2 23:17:33 2014
Completed processing loop Thu Jan  2 23:17:39 2014

Current node count (the null node is the default installed node 0):

neo4j-sh (?)$ START n=node(*) RETURN labels(n)[0] AS node_labels, count(*) AS node_count ORDER BY node_labels;
+------------------------------------+
| node_labels           | node_count |
+------------------------------------+
| "DEGREE_OF_HARM"      | 7          |
| "INCIDENT"            | 2143501    |
| "INCIDENT_CATEGORY"   | 16         |
| "INCIDENT_REPORTER"   | 12         |
| "INCIDENT_SPECIALITY" | 17         |
| "NHS_TRUST"           | 67         |
| "NHS_TRUST_LOCATION"  | 12         |
| "NRLS_DATA_TYPE"      | 13         |
| "PATIENT"             | 2825       |
| <null>                | 1          |
+------------------------------------+
10 rows
25754 ms

Current relationship count per node label (you can probably infer the structure from this, but if not I can email you a PDF with the simple model, which Mark N also has a copy of). Note that inconsistencies in the incident relationship counts are due to partially loaded files:

neo4j-sh (?)$ START n=node(*) MATCH (n)-[r]-() RETURN DISTINCT labels(n)[0] AS node_labels, type(r) AS relationship, count(*) AS rel_per_node_count ORDER BY node_labels;
+-------------------------------------------------------------------------------+
| node_labels           | relationship                     | rel_per_node_count |
+-------------------------------------------------------------------------------+
| "DEGREE_OF_HARM"      | "HAS_INCIDENT_DEGREE_OF_HARM"    | 2143422            |
| "DEGREE_OF_HARM"      | "HAS_NRLS_DATA_TYPE"             | 7                  |
| "INCIDENT"            | "HAS_INCIDENT_DEGREE_OF_HARM"    | 2143422            |
| "INCIDENT"            | "HAS_NRLS_DATA_TYPE"             | 2143490            |
| "INCIDENT"            | "HAS_INCIDENT_SPECIALITY"        | 2143448            |
| "INCIDENT"            | "IS_NHS_TRUST_LOCATION_INCIDENT" | 2137662            |
| "INCIDENT"            | "HAS_INCIDENT_CATEGORY"          | 2143455            |
| "INCIDENT"            | "IS_NHS_TRUST_INCIDENT"          | 2143481            |
| "INCIDENT"            | "HAS_INCIDENT_REPORTER"          | 2143440            |
| "INCIDENT"            | "HAS_INCIDENT_PATIENT"           | 2143430            |
| "INCIDENT_CATEGORY"   | "HAS_NRLS_DATA_TYPE"             | 16                 |
| "INCIDENT_CATEGORY"   | "HAS_INCIDENT_CATEGORY"          | 2143455            |
| "INCIDENT_REPORTER"   | "HAS_NRLS_DATA_TYPE"             | 12                 |
| "INCIDENT_REPORTER"   | "HAS_INCIDENT_REPORTER"          | 2143440            |
| "INCIDENT_SPECIALITY" | "HAS_NRLS_DATA_TYPE"             | 17                 |
| "INCIDENT_SPECIALITY" | "HAS_INCIDENT_SPECIALITY"        | 2143448            |
| "NHS_TRUST"           | "HAS_NHS_TRUST_LOCATION"         | 498                |
| "NHS_TRUST"           | "IS_NHS_TRUST_INCIDENT"          | 2143481            |
| "NHS_TRUST"           | "HAS_NRLS_DATA_TYPE"             | 67                 |
| "NHS_TRUST_LOCATION"  | "HAS_NHS_TRUST_LOCATION"         | 498                |
| "NHS_TRUST_LOCATION"  | "IS_NHS_TRUST_LOCATION_INCIDENT" | 2137662            |
| "NRLS_DATA_TYPE"      | "HAS_NRLS_DATA_TYPE"             | 2146434            |
| "PATIENT"             | "HAS_INCIDENT_PATIENT"           | 2143430            |
| "PATIENT"             | "HAS_NRLS_DATA_TYPE"             | 2825               |
+-------------------------------------------------------------------------------+
24 rows
366826 ms

And finally the MERGE times you asked for:

neo4j-sh (?)$ MATCH (i:INCIDENT { incident_id : 'masked_incident_id' }),
> (ndt:NRLS_DATA_TYPE { code : 'IN07' })
> MERGE (i)-[hndt:HAS_NRLS_DATA_TYPE]->(ndt)
> RETURN hndt;
+---------------------------------+
| hndt                            |
+---------------------------------+
| :HAS_NRLS_DATA_TYPE[17050540]{} |
+---------------------------------+
1 row
2320 ms

neo4j-sh (?)$ MATCH (i:INCIDENT { incident_id : 'masked_incident_id' }),
> (ndt:NRLS_DATA_TYPE { code : 'IN07' })
> RETURN i, ndt;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| i                                                                                                                                                                                                                                  | ndt                                                                            |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Node[2143024]{incident_id:"masked_incident_id",incident_timestamp:"2005-02-05T12:00:00",incident_description:"masked_incident_description"} | Node[2]{code:"IN07",description:"Incident description",input_type:"free-text"} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
124 ms

Let me know if I missed anything.

~ icenine
org.neo4j.server.Bootstrapper.nps.zip
org.neo4j.server.Bootstrapper.csv.zip

icenine

unread,
Jan 6, 2014, 6:30:56 AM1/6/14
to ne...@googlegroups.com, icenine, Mark Needham
By the way, here's an updated list of counts. It's still taking about 1 minute 10 to 1 minute 20 to load 100 rows, but it's staying steady around that mark thanks to the file cleanup and the use of the transaction endpoint. It would probably be even better if I wrapped, say, every 20 rows into one transaction, but there's no row_count function in Excel::Reader::XLSX, so I'd have to iterate through the entire spreadsheet first just to get the row count.
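For what it's worth, batching doesn't strictly require knowing the row count up front: you can accumulate rows in a buffer and flush whenever it reaches the batch size, plus once more at the end for the remainder. A minimal sketch in Python (`run_batch` is a hypothetical callback; in practice it would POST the buffered statements to the transactional endpoint):

```python
def batched(rows, batch_size=20):
    """Yield lists of at most `batch_size` rows from any iterable,
    without needing to know the total row count in advance."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:            # flush the final partial batch
        yield batch

def load(rows, run_batch, batch_size=20):
    # One transaction per batch; run_batch is a stand-in for
    # posting the statements to Neo4j's transactional endpoint.
    for batch in batched(rows, batch_size):
        run_batch(batch)

# Example: 45 rows in batches of 20 -> sizes 20, 20, 5
sizes = [len(b) for b in batched(range(45), 20)]
```

The same counter-and-flush pattern works in Perl while streaming rows out of Excel::Reader::XLSX, so no pre-scan of the spreadsheet is needed.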

neo4j-sh (?)$ START n=node(*) MATCH (n)-[r]-() RETURN labels(n)[0] AS node_labels, type(r) AS relationship, count(*) AS rel_per_node_count ORDER BY rel_per_node_count;
+-------------------------------------------------------------------------------+
| node_labels           | relationship                     | rel_per_node_count |
+-------------------------------------------------------------------------------+
| "DEGREE_OF_HARM"      | "HAS_NRLS_DATA_TYPE"             | 7                  |
| "INCIDENT_REPORTER"   | "HAS_NRLS_DATA_TYPE"             | 12                 |
| "INCIDENT_SPECIALITY" | "HAS_NRLS_DATA_TYPE"             | 17                 |
| "INCIDENT_CATEGORY"   | "HAS_NRLS_DATA_TYPE"             | 17                 |
| "NHS_TRUST"           | "HAS_NRLS_DATA_TYPE"             | 74                 |
| "NHS_TRUST"           | "HAS_NHS_TRUST_LOCATION"         | 580                |
| "NHS_TRUST_LOCATION"  | "HAS_NHS_TRUST_LOCATION"         | 580                |
| "PATIENT"             | "HAS_NRLS_DATA_TYPE"             | 2913               |
| "INCIDENT"            | "HAS_INCIDENT_DEGREE_OF_HARM"    | 2565754            |
| "INCIDENT"            | "HAS_INCIDENT_PATIENT"           | 2565754            |
| "INCIDENT"            | "HAS_INCIDENT_SPECIALITY"        | 2565755            |
| "INCIDENT"            | "IS_NHS_TRUST_LOCATION_INCIDENT" | 2565755            |
| "INCIDENT"            | "HAS_INCIDENT_CATEGORY"          | 2565755            |
| "INCIDENT"            | "IS_NHS_TRUST_INCIDENT"          | 2565755            |
| "INCIDENT"            | "HAS_INCIDENT_REPORTER"          | 2565755            |
| "INCIDENT"            | "HAS_NRLS_DATA_TYPE"             | 2565756            |
| "INCIDENT_REPORTER"   | "HAS_INCIDENT_REPORTER"          | 2565849            |
| "INCIDENT_SPECIALITY" | "HAS_INCIDENT_SPECIALITY"        | 2565855            |
| "INCIDENT_CATEGORY"   | "HAS_INCIDENT_CATEGORY"          | 2565862            |
| "DEGREE_OF_HARM"      | "HAS_INCIDENT_DEGREE_OF_HARM"    | 2565872            |
| "PATIENT"             | "HAS_INCIDENT_PATIENT"           | 2565927            |
| "NHS_TRUST"           | "IS_NHS_TRUST_INCIDENT"          | 2566029            |
| "NHS_TRUST_LOCATION"  | "IS_NHS_TRUST_LOCATION_INCIDENT" | 2566062            |
| "NRLS_DATA_TYPE"      | "HAS_NRLS_DATA_TYPE"             | 2568825            |
+-------------------------------------------------------------------------------+
24 rows
507473 ms
neo4j-sh (?)$ START n=node(*) MATCH (n)-[r]-() WITH labels(n)[0] AS node_labels, type(r) AS relationship, count(*) AS rel_per_node_count RETURN min(rel_per_node_count), max(rel_per_node_count), avg(rel_per_node_count);
+-----------------------------------------------------------------------------+
| min(rel_per_node_count) | max(rel_per_node_count) | avg(rel_per_node_count) |
+-----------------------------------------------------------------------------+
| 7                       | 2569478                 | 1711292.1666666667      |
+-----------------------------------------------------------------------------+
1 row
491720 ms
neo4j-sh (?)$ START n=node(*) RETURN labels(n)[0] AS node_labels, count(*) ORDER BY node_labels;
+----------------------------------+
| node_labels           | count(*) |
+----------------------------------+
| "DEGREE_OF_HARM"      | 7        |
| "INCIDENT"            | 2567079  |
| "INCIDENT_CATEGORY"   | 17       |
| "INCIDENT_REPORTER"   | 12       |
| "INCIDENT_SPECIALITY" | 17       |
| "NHS_TRUST"           | 74       |
| "NHS_TRUST_LOCATION"  | 12       |
| "NRLS_DATA_TYPE"      | 13       |
| "PATIENT"             | 2913     |
| <null>                | 1        |
+----------------------------------+
10 rows
19573 ms

icenine

unread,
Jan 6, 2014, 11:10:36 AM1/6/14
to ne...@googlegroups.com, icenine, Mark Needham
Ok, so I've managed to tune my queries to the point where loading is now running at 23 - 31 seconds per 500 rows, a saving of 40 - 50 seconds per 500 rows. I've done this by using CREATE UNIQUE in conjunction with USING INDEX where I could (there are probably more places I can use it, but I wanted to change a bit at a time).


An example query change:

  1. MATCH (i:INCIDENT), (ndt:NRLS_DATA_TYPE)
  2. WHERE i.incident_id = { incident_id }
  3. AND ndt.code = { code }
  4. CREATE UNIQUE
  5. (i)-[hndt:HAS_NRLS_DATA_TYPE]->ndt
  6. RETURN hndt

changes to

  1. MATCH (ndt:NRLS_DATA_TYPE), (i:INCIDENT)
  2. USING INDEX ndt:NRLS_DATA_TYPE(code)
  3. USING INDEX i:INCIDENT(incident_id)
  4. WHERE i.incident_id = { incident_id }
  5. AND ndt.code = { code }
  6. CREATE UNIQUE
  7. (i)-[hndt:HAS_NRLS_DATA_TYPE]->(ndt)
  8. RETURN hndt

    So now I'm definitely using the unique constraint for incident_id, which I don't think was happening before. This single query was taking just over a second on average (1.04 s) before I added the USING INDEX statements, so this is my biggest saving. It would be great to have the same type of functionality on MERGE. Should I make a post about optimizing inserts using indexes + CREATE UNIQUE? If you'd like me to profile this type of loading as well, I'd be happy to do so and provide the output (granted, it should now only take me ~12 minutes to load a 12,000-row file ;).
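One quick way to see whether a query is actually hitting the index is to prefix it with PROFILE in neo4j-shell (a sketch; the exact operator names vary across 2.x versions, but an index-backed plan shows an index seek operator such as SchemaIndex rather than a label scan like NodeByLabel):

```cypher
// Sketch: compare the plan with and without the USING INDEX hints.
PROFILE MATCH (ndt:NRLS_DATA_TYPE), (i:INCIDENT)
USING INDEX ndt:NRLS_DATA_TYPE(code)
USING INDEX i:INCIDENT(incident_id)
WHERE i.incident_id = { incident_id }
  AND ndt.code = { code }
RETURN i, ndt;
```

If the unhinted version of the plan scans all :INCIDENT nodes, that would explain the second-per-query timings seen before the hints were added.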

    icenine

    unread,
    Jan 6, 2014, 11:44:06 AM1/6/14
    to ne...@googlegroups.com, icenine, Mark Needham
    Just fyi, this change is most helpful when you're reloading data that already exists, but it definitely helps with new data as well.

    icenine

    unread,
    Jan 10, 2014, 6:51:48 AM1/10/14
    to ne...@googlegroups.com, icenine, Mark Needham
    Are we just considering this thread closed then?

    Mark Needham

    unread,
    Jan 10, 2014, 6:57:58 AM1/10/14
    to ne...@googlegroups.com, icenine
    Are you still seeing problems with the performance of MERGE? If you could take a thread dump while you're running the import job and send that to me I can have a look into it.

    Michael Hunger

    unread,
    Jan 10, 2014, 7:18:30 AM1/10/14
    to ne...@googlegroups.com
    Also I'd like to rerun your scripts with the current master

    Mark could you share a version for testing?

    Sent from mobile device

    icenine

    unread,
    Jan 10, 2014, 8:21:37 AM1/10/14
    to ne...@googlegroups.com
    Hi Guys,

    I've dropped using MERGE in favour of CREATE UNIQUE, especially when creating relationships (and possibly nodes too). I've provided a simple example of why below. In each set (MERGE; CREATE UNIQUE; CREATE UNIQUE with USING INDEX) I've created a unique node to avoid any kind of caching clashes. CREATE UNIQUE with USING INDEX was the fastest by far.

    What I'm trying to understand is why MERGE isn't (seemingly) taking advantage of the INDEX i:INCIDENT(incident_id). I had asked Mark a while ago whether MERGE or CREATE UNIQUE was faster, but I've found on my own that it seems to depend on context. I've signed up for both the webinar on tuning Cypher queries and the meet-up on the 29th, so hopefully I can get a better understanding of the underlying subtleties of MERGE vs CREATE UNIQUE. I'm also starting to design the application layer for this now, so it will help to know the Cypher tricks a bit better.

    Note this is on a copy of the database with the counts below, so no other loading process is running against this database. The database is also 2.0.0 stable enterprise with the cache settings following the counts.

    --- Counts:

    neo4j-sh (?)$ START n=node(*) WHERE NOT (id(n) = 0) RETURN labels(n)[0], count(*);
    +----------------------------------+
    | labels(n)[0]          | count(*) |
    +----------------------------------+
    | "INCIDENT_REPORTER"   | 12       |
    | "INCIDENT_CATEGORY"   | 17       |
    | "PATIENT"             | 3020     |
    | "NRLS_DATA_TYPE"      | 13       |
    | "INCIDENT"            | 3061831  |
    | "INCIDENT_SPECIALITY" | 17       |
    | "NHS_TRUST_LOCATION"  | 12       |
    | "DEGREE_OF_HARM"      | 7        |
    | "NHS_TRUST"           | 89       |
    +----------------------------------+
    9 rows
    25399 ms
    neo4j-sh (?)$ START n=node(*) MATCH (n)-[r]-() WITH labels(n)[0] AS node_labels, type(r) AS relationship, count(*) AS rel_per_node_count RETURN min(rel_per_node_count), max(rel_per_node_count), avg(rel_per_node_count);
    +-----------------------------------------------------------------------------+
    | min(rel_per_node_count) | max(rel_per_node_count) | avg(rel_per_node_count) |
    +-----------------------------------------------------------------------------+
    | 7                       | 3064993                 | 2041541.0               |
    +-----------------------------------------------------------------------------+
    1 row
    580615 ms

    --- Settings:

    [jsymons@larva conf]$ egrep '^[^#].*(mapped|cache)' neo4j.properties
    neostore.nodestore.db.mapped_memory=50M
    neostore.relationshipstore.db.mapped_memory=756M
    neostore.propertystore.db.mapped_memory=300M
    neostore.propertystore.db.strings.mapped_memory=1024M
    neostore.propertystore.db.arrays.mapped_memory=50M
    node_cache_array_fraction=6
    relationship_cache_array_fraction=7
    relationship_cache_size=2G
    [jsymons@larva conf]$ ls -l ../data/graph.db/neostore*
    -rw-r--r--. 1 jsymons saichonstaff        63 Jan 10 12:49 ../data/graph.db/neostore
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.id
    -rw-r--r--. 1 jsymons saichonstaff        55 Jan 10 10:21 ../data/graph.db/neostore.labeltokenstore.db
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.labeltokenstore.db.id
    -rw-r--r--. 1 jsymons saichonstaff       456 Jan 10 10:21 ../data/graph.db/neostore.labeltokenstore.db.names
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.labeltokenstore.db.names.id
    -rw-r--r--. 1 jsymons saichonstaff  43138368 Jan 10 12:49 ../data/graph.db/neostore.nodestore.db
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.nodestore.db.id
    -rw-r--r--. 1 jsymons saichonstaff        68 Jan 10 10:21 ../data/graph.db/neostore.nodestore.db.labels
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.nodestore.db.labels.id
    -rw-r--r--. 1 jsymons saichonstaff 256888780 Jan 10 12:49 ../data/graph.db/neostore.propertystore.db
    -rw-r--r--. 1 jsymons saichonstaff       128 Jan 10 10:21 ../data/graph.db/neostore.propertystore.db.arrays
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.propertystore.db.arrays.id
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.propertystore.db.id
    -rw-r--r--. 1 jsymons saichonstaff       162 Jan 10 10:21 ../data/graph.db/neostore.propertystore.db.index
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.propertystore.db.index.id
    -rw-r--r--. 1 jsymons saichonstaff       722 Jan 10 10:21 ../data/graph.db/neostore.propertystore.db.index.keys
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.propertystore.db.index.keys.id
    -rw-r--r--. 1 jsymons saichonstaff 995533184 Jan 10 10:21 ../data/graph.db/neostore.propertystore.db.strings
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.propertystore.db.strings.id
    -rw-r--r--. 1 jsymons saichonstaff 809811255 Jan 10 12:49 ../data/graph.db/neostore.relationshipstore.db
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.relationshipstore.db.id
    -rw-r--r--. 1 jsymons saichonstaff        50 Jan 10 10:21 ../data/graph.db/neostore.relationshiptypestore.db
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.relationshiptypestore.db.id
    -rw-r--r--. 1 jsymons saichonstaff       418 Jan 10 10:21 ../data/graph.db/neostore.relationshiptypestore.db.names
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.relationshiptypestore.db.names.id
    -rw-r--r--. 1 jsymons saichonstaff      1600 Jan 10 10:21 ../data/graph.db/neostore.schemastore.db
    -rw-r--r--. 1 jsymons saichonstaff         9 Jan 10 10:21 ../data/graph.db/neostore.schemastore.db.id
    [jsymons@larva conf]$ grep 'wrapper.java.*memory' neo4j-wrapper.conf 
    wrapper.java.initmemory=2048
    wrapper.java.maxmemory=5632
    [jsymons@larva conf]$ free -m
                 total       used       free     shared    buffers     cached
    Mem:          7873       7736        136          0        133       2236
    -/+ buffers/cache:       5366       2506
    Swap:        10063         22      10041

    --- Create INCIDENT and INCIDENT/NHS_TRUST_LOCATION relationship using only MERGE

    MERGE (i:INCIDENT { incident_id : '99999999' })
    ON CREATE SET i.incident_timestamp = 'test',
    i.incident_description = 'test'
    RETURN i.incident_id

    neo4j-sh (?)$ MERGE (i:INCIDENT { incident_id : '99999999' })
    > ON CREATE SET i.incident_timestamp = 'test',
    > i.incident_description = 'test'
    > RETURN i.incident_id;
    +---------------+
    | i.incident_id |
    +---------------+
    | "99999999"    |
    +---------------+
    1 row
    Nodes created: 1
    Properties set: 3
    Labels added: 1
    107 ms

    MATCH (ntl:NHS_TRUST_LOCATION { location_level_01 : 'General / acute hospital' }),
    (i:INCIDENT { incident_id : '99999999' })
    MERGE (i)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl)
    RETURN r

    neo4j-sh (?)$ MATCH (ntl:NHS_TRUST_LOCATION { location_level_01 : 'General / acute hospital' }),
    > (i:INCIDENT { incident_id : '99999999' })
    > MERGE (i)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl)
    > RETURN r;
    +---------------------------------------------+
    | r                                           |
    +---------------------------------------------+
    | :IS_NHS_TRUST_LOCATION_INCIDENT[24515883]{} |
    +---------------------------------------------+
    1 row
    Relationships created: 1
    22412 ms

    MATCH (i:INCIDENT)-[r]-()
    USING INDEX i:INCIDENT(incident_id)
    WHERE i.incident_id = '99999999' DELETE i, r

    neo4j-sh (?)$ MATCH (i:INCIDENT)-[r]-()
    > USING INDEX i:INCIDENT(incident_id)
    > WHERE i.incident_id = '99999999' DELETE i, r;
    +-------------------+
    | No data returned. |
    +-------------------+
    Nodes deleted: 1
    Relationships deleted: 1
    72 ms

    --- Create INCIDENT and INCIDENT/NHS_TRUST_LOCATION relationship using only MERGE (create node) and CREATE UNIQUE (relationship)

    MERGE (i:INCIDENT { incident_id : '00000000' })
    ON CREATE SET i.incident_timestamp = 'test',
    i.incident_description = 'test'
    RETURN i.incident_id

    neo4j-sh (?)$ MERGE (i:INCIDENT { incident_id : '00000000' })
    > ON CREATE SET i.incident_timestamp = 'test',
    > i.incident_description = 'test'
    > RETURN i.incident_id;
    +---------------+
    | i.incident_id |
    +---------------+
    | "00000000"    |
    +---------------+
    1 row
    Nodes created: 1
    Properties set: 3
    Labels added: 1
    79 ms

    MATCH (ntl:NHS_TRUST_LOCATION), (i:INCIDENT)
    WHERE ntl.location_level_01 = 'General / acute hospital'
    AND i.incident_id = '00000000' 
    CREATE UNIQUE
    (i)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl)
    RETURN r

    neo4j-sh (?)$ MATCH (ntl:NHS_TRUST_LOCATION), (i:INCIDENT)
    > WHERE ntl.location_level_01 = 'General / acute hospital'
    > AND i.incident_id = '00000000' 
    > CREATE UNIQUE
    > (i)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl)
    > RETURN r;
    +---------------------------------------------+
    | r                                           |
    +---------------------------------------------+
    | :IS_NHS_TRUST_LOCATION_INCIDENT[24515886]{} |
    +---------------------------------------------+
    1 row
    Relationships created: 1
    110 ms

    MATCH (i:INCIDENT)-[r]-()
    USING INDEX i:INCIDENT(incident_id)
    WHERE i.incident_id = '00000000' DELETE i, r

    neo4j-sh (?)$ MATCH (i:INCIDENT)-[r]-()
    > USING INDEX i:INCIDENT(incident_id)
    > WHERE i.incident_id = '00000000' DELETE i, r;
    +-------------------+
    | No data returned. |
    +-------------------+
    Nodes deleted: 1
    Relationships deleted: 1
    69 ms

    --- Create INCIDENT and INCIDENT/NHS_TRUST_LOCATION relationship using only MERGE (create node) and CREATE UNIQUE USING INDEX (relationship)

    MERGE (i:INCIDENT { incident_id : '11110000' })
    ON CREATE SET i.incident_timestamp = 'test',
    i.incident_description = 'test'
    RETURN i.incident_id

    neo4j-sh (?)$ MERGE (i:INCIDENT { incident_id : '11110000' })
    > ON CREATE SET i.incident_timestamp = 'test',
    > i.incident_description = 'test'
    > RETURN i.incident_id;
    +---------------+
    | i.incident_id |
    +---------------+
    | "11110000"    |
    +---------------+
    1 row
    Nodes created: 1
    Properties set: 3
    Labels added: 1
    128 ms

    MATCH (ntl:NHS_TRUST_LOCATION), (i:INCIDENT)
    USING INDEX ntl:NHS_TRUST_LOCATION(location_level_01)
    USING INDEX i:INCIDENT(incident_id)
    WHERE ntl.location_level_01 = 'General / acute hospital'
    AND i.incident_id = '11110000' 
    CREATE UNIQUE
    (i)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl)
    RETURN r

    neo4j-sh (?)$ MATCH (ntl:NHS_TRUST_LOCATION), (i:INCIDENT)
    > USING INDEX ntl:NHS_TRUST_LOCATION(location_level_01)
    > USING INDEX i:INCIDENT(incident_id)
    > WHERE ntl.location_level_01 = 'General / acute hospital'
    > AND i.incident_id = '11110000' 
    > CREATE UNIQUE
    > (i)-[r:IS_NHS_TRUST_LOCATION_INCIDENT]->(ntl)
    > RETURN r;
    +---------------------------------------------+
    | r                                           |
    +---------------------------------------------+
    | :IS_NHS_TRUST_LOCATION_INCIDENT[24515887]{} |
    +---------------------------------------------+
    1 row
    Relationships created: 1
    56 ms

    MATCH (i:INCIDENT)-[r]-()
    USING INDEX i:INCIDENT(incident_id)
    WHERE i.incident_id = '11110000' DELETE i, r

    neo4j-sh (?)$ MATCH (i:INCIDENT)-[r]-()
    > USING INDEX i:INCIDENT(incident_id)
    > WHERE i.incident_id = '11110000' DELETE i, r;
    +-------------------+
    | No data returned. |
    +-------------------+
    Nodes deleted: 1
    Relationships deleted: 1
    72 ms

    Regards,

    Joshua

    icenine

    unread,
    Jan 10, 2014, 9:02:27 AM1/10/14
    to ne...@googlegroups.com
    Maybe I'm just in the same boat as Jason at https://groups.google.com/forum/#!topic/neo4j/g63fTmPM4GE due to the huge number of relationships via INCIDENT, and I'll have to wait for 2.1 to get better performance? I've also submitted a query to MAJ regarding the commit behaviour and auto-commit, which can be viewed here: https://rt.cpan.org/Public/Bug/Display.html?id=91898 .

    There are 8 relationships to each INCIDENT and the number of incidents is 3,061,831, so that's 24,494,648 total relationships one way, and 48,989,296 counting both directions, if I'm thinking of this correctly. So anything going against INCIDENT where one of its properties isn't specified directly is going to be painful.
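The arithmetic above checks out; a quick sanity check (note that each relationship is stored once but is traversable from both ends, which is presumably what the "both ways" figure counts):

```python
# Relationship-count arithmetic from the post above.
incidents = 3_061_831
rels_per_incident = 8

one_way = incidents * rels_per_incident   # relationships as stored
both_ways = one_way * 2                   # counting each from both endpoints
```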

    Mark Needham

    unread,
    Jan 10, 2014, 9:42:04 AM1/10/14
    to ne...@googlegroups.com
    Hi Joshua,

    Could you paste the output of 'schema' as well? I'm going to try to re-create this; it sounds like strange behaviour. MERGE should pick up indexes, and it's intended to be a better version of CREATE UNIQUE, so something must have gone wrong given that you're not seeing that!

    Cheers
    Mark


    --

    Mark Needham

    unread,
    Jan 10, 2014, 9:45:45 AM1/10/14
    to ne...@googlegroups.com
    Joshua,

    Could you paste the output of PROFILE on the various versions of creating the relationship from incident -> NHS trust location?

    Mark

    icenine

    unread,
    Jan 10, 2014, 9:51:42 AM1/10/14
    to ne...@googlegroups.com
    Hi Mark,

    I'm going to send both you and Michael a small version of the database (with 2 NHS Trusts), would that be better?

    Regards,

    Joshua

    Mark Needham

    unread,
    Jan 10, 2014, 9:57:06 AM1/10/14
    to ne...@googlegroups.com
    Sounds good. 

    icenine

    unread,
    Jan 10, 2014, 1:13:57 PM1/10/14
    to ne...@googlegroups.com
    Database sent via email. It's 16M so if you have a problem receiving it let me know and I'll use the college file exchange website to send it.