Version 2.1.0-M01 CSV Import Index Lookup

Michel Ávila

Mar 26, 2014, 5:13:40 PM
to ne...@googlegroups.com
I have 3 files, containing a set of companies, persons and the relationships between these entities, respectively.
I managed to load the companies and persons files in no time, but I'm having performance issues when loading the last one (the relationships).
It ran for more than 1 hour before I killed it, because I knew something was not right.
The sample contains the following:
  • ~100k companies;
  • ~100k persons;
  • ~250k relationships;
I needed to be sure that the file was being read correctly, so I left only one data row in the "rels" file and ran the following Cypher:

LOAD CSV WITH HEADERS FROM "file:D:\\rels.csv" AS f MATCH (c:company { document: f.company_document } ) RETURN c

It took about 20 seconds to return the company, so the problem was not reading the file but finding the company.
Then I profiled the query, and the result was:

ColumnFilter(symKeys=["f", "c"], returnItemNames=["c"], _rows=1, _db_hits=0)
Filter(pred="Property(c,document(3)) == Property(f,company_document)", _rows=1, _db_hits=112865)
NodeByLabel(identifier="c", _db_hits=0, _rows=112865, label="company", identifiers=["c"], producer="NodeByLabel")
LoadCSV(_rows=1, _db_hits=0)

The way I see it, the loader is reading every node with the label "company" and only then applying the document filter.
When I run the same MATCH outside the LOAD CSV command, the profile is this:

profile MATCH (c:company { document: "76875897000169" } ) RETURN c;
SchemaIndex(identifier="c", _db_hits=0, _rows=1, label="company", query="Literal(76875897000169)", identifiers=["c"], property="document", producer="SchemaIndex")

It's clear to me that this one is querying the index on the "company" label, as it was designed to do.
So why does LOAD CSV use a different query plan for the same lookup?

Thanks in advance!

Michael Hunger

Mar 27, 2014, 9:11:00 AM
to ne...@googlegroups.com
Can you try using MERGE instead of MATCH in your relationship statement? That should definitely use the index.
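
For illustration, a relationship-loading statement along the lines of this suggestion might look like the sketch below. Only the `company` label, the `document` property and the `company_document` column come from the query posted earlier in the thread; the `person` label, the `person_document` column and the `WORKS_AT` relationship type are assumptions made up for this example.

```cypher
// Sketch only: the person label, the person_document column and the
// WORKS_AT relationship type are hypothetical and not taken from the thread.
LOAD CSV WITH HEADERS FROM "file:D:\\rels.csv" AS f
// MERGE on label + property is a get-or-create. With all nodes loaded
// beforehand, it should simply find the existing node.
MERGE (c:company { document: f.company_document })
MERGE (p:person { document: f.person_document })
// Connect the two nodes, creating the relationship only if it is missing.
MERGE (p)-[:WORKS_AT]->(c);
```

One thing to keep in mind with this form: unlike MATCH, MERGE creates a new node for any document value in the file that has no matching node, so a typo in the CSV would produce a stray node rather than a skipped row.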


--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michel Ávila

Mar 27, 2014, 10:11:55 AM
to ne...@googlegroups.com
Yes Michael, that definitely did the trick! Works like a charm now.
Can you explain the differences between MERGE and MATCH in this case, so I can choose between them consciously next time?

Thank you again!

Michael Hunger

Mar 27, 2014, 10:31:35 AM
to ne...@googlegroups.com
Actually, MATCH should work as well. Could you raise a GitHub issue about it? http://github.com/neo4j/neo4j/issues

The difference is: MERGE is a get-or-create whereas MATCH is a lookup only.

So MERGE does more, but at least it uses the index/constraint as it should.
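
As a rough illustration of that difference, the two statements below use the same pattern as the standalone query profiled earlier in the thread. This is a minimal sketch of the semantics only, not a statement about which plan a given Neo4j version will choose.

```cypher
// MATCH is a lookup only: if no company has this document,
// the query returns no rows and the graph is left unchanged.
MATCH (c:company { document: "76875897000169" })
RETURN c;

// MERGE is a get-or-create: if no company has this document,
// a new node with the company label and this document value is
// created and returned; otherwise the existing node is returned.
MERGE (c:company { document: "76875897000169" })
RETURN c;
```

When every node is known to exist already, both return the same node; the difference only shows up for values that are missing from the graph.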

Cheers,

Michael

Michel Ávila

Mar 27, 2014, 2:43:05 PM
to ne...@googlegroups.com
That is why I was using MATCH instead of MERGE: I had created all the nodes beforehand, and only then loaded the relationships file.
I tested both commands outside LOAD CSV, and they used the index.
It's only inside the LOAD CSV context that MATCH doesn't behave as it's supposed to.
Weird stuff indeed.
I'll raise a GitHub issue about it, as you suggested.

Thanks.


--
Michel Leite de Ávila