I have 3 files, containing a set of companies, persons and the relationships between these entities, respectively.
I managed to load the companies and the persons files in no time, but and i'm having some performance issues when loading the last one (the relationships).
It took more than 1 hour and i killed it, because i knew something was not right.
This sample has following:
- ~100k companies;
- ~100k persons;
- ~250k relationships;
I needed to be sure that the file was being read correctly, so i left only one data row in the "rels" file and ran the following cypher:
LOAD CSV WITH HEADERS FROM "file:D:\\rels.csv" AS f MATCH (c:company { document: f.company_document } ) RETURN c
The result took about 20 seconds to bring me back the company, so it was not a problem reading the file, but finding the company.
Then i asked the prompt to profile the cypher, and the result was:
ColumnFilter(symKeys=["f", "c"], returnItemNames=["c"], _rows=1, _db_hits=0)
Filter(pred="Property(c,document(3)) == Property(f,company_document)", _rows=1, _db_hits=112865)
NodeByLabel(identifier="c", _db_hits=0, _rows=112865, label="company", identifiers=["c"], producer="NodeByLabel")
LoadCSV(_rows=1, _db_hits=0)
The way i see it, the loader is reading the entire node set under the label "company" and applying the document filter later.
When i make the same "MATCH" cypher outside the "LOAD" command, the profile is this:
profile MATCH (c:company { document: "76875897000169" } ) RETURN c;
SchemaIndex(identifier="c", _db_hits=0, _rows=1, label="company", query="Literal(76875897000169)", identifiers=["c"], property="document", producer="SchemaIndex")
It's clear to me that it's querying the "company" label index as it was designed to do.
So, why the "LOAD CSV" uses another query plan to do the same lookup?
Thanks in advance!