I have a problem where I need to be able to generate edges between nodes using 2 or more join fields to properly resolve the match.
It's similar to this question on Stack Overflow. The solution given there is to add multiple joinFieldName entries to the edge transformer, but that didn't work as expected when I tried it.
If I change the data by appending a new row, 2,1, to each data file, I get this:
data1.csv
data2.csv
Then, using the JSON provided:
data1.json
{
  "source": { "file": { "path": "./data1.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "A" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:./test.orientdb",
      "dbType": "graph",
      "dbAutoCreate": true,
      "classes": [
        {"name": "A", "extends": "V"},
        {"name": "B", "extends": "V"},
        {"name": "Conn", "extends": "E"}
      ]
    }
  }
}
data2.json
{
  "source": { "file": { "path": "./data2.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "B" } },
    { "edge": { "class": "Conn",
                "joinFieldName": "b1",
                "lookup": "A.a1",
                "joinFieldName": "b2",
                "lookup": "A.a2",
                "direction": "out"
    }}
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:./test.orientdb",
      "dbType": "graph",
      "dbAutoCreate": true,
      "classes": [
        {"name": "B", "extends": "V"},
        {"name": "Conn", "extends": "E"}
      ]
    }
  }
}
Running oetl.sh on data1.json and then on data2.json gives me this:
orientdb {db=test.orientdb}> select from v
+----+-----+------+----+----+-------------+----+----+-------------+
|# |@RID |@CLASS|a1 |a2 |in_Conn |b2 |b1 |out_Conn |
+----+-----+------+----+----+-------------+----+----+-------------+
|0 |#25:0|A |1 |1 |[#41:0,#45:0]| | | |
|1 |#26:0|A |1 |2 |[#44:0] | | | |
|2 |#27:0|A |2 |3 |[#43:0] | | | |
|3 |#28:0|A |2 |1 |[#42:0,#46:0]| | | |
|4 |#33:0|B | | | |1 |1 |[#41:0,#42:0]|
|5 |#34:0|B | | | |3 |2 |[#43:0] |
|6 |#35:0|B | | | |2 |1 |[#44:0] |
|7 |#36:0|B | | | |1 |2 |[#45:0,#46:0]|
+----+-----+------+----+----+-------------+----+----+-------------+
8 item(s) found. Query executed in 0.01 sec(s).
This seems wrong to me. If I write out the edges:
A(1,1) <-- #41:0 --- B(1,1) OK
A(1,1) <-- #45:0 --- B(2,1) WRONG
A(1,2) <-- #44:0 --- B(1,2) OK
A(2,3) <-- #43:0 --- B(2,3) OK
A(2,1) <-- #42:0 --- B(1,1) WRONG
A(2,1) <-- #46:0 --- B(2,1) OK
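To double-check the edge list above, here is a small sketch in plain Python (not OrientDB; the function names are made up for illustration). It uses the (a1, a2) and (b1, b2) values from the query output and compares a join on the second field only with a join on both fields. Under these assumptions, the second-field-only join reproduces the six edges listed above, while the two-field join gives the four I expected.

```python
# Rows copied from the query output: A rows are (a1, a2), B rows are (b1, b2).
a_rows = [(1, 1), (1, 2), (2, 3), (2, 1)]
b_rows = [(1, 1), (2, 3), (1, 2), (2, 1)]


def edges_on_second_field(a_rows, b_rows):
    """Hypothetical helper: link B -> A whenever b2 == a2, ignoring b1/a1."""
    return [(b, a) for b in b_rows for a in a_rows if b[1] == a[1]]


def edges_on_both_fields(a_rows, b_rows):
    """Hypothetical helper: link B -> A only when b1 == a1 AND b2 == a2."""
    return [(b, a) for b in b_rows for a in a_rows if b == a]


observed = edges_on_second_field(a_rows, b_rows)
expected = edges_on_both_fields(a_rows, b_rows)

# Edges that appear in the observed result but should not exist.
unwanted = sorted(set(observed) - set(expected))

print("second field only:", len(observed), "edges")
print("both fields:      ", len(expected), "edges")
for b, a in unwanted:
    print(f"unwanted: B{b} --> A{a}")
```

The two unwanted pairs it reports are B(1,1) to A(2,1) and B(2,1) to A(1,1), which are the two edges marked WRONG above.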
My understanding is that the two joinFieldName entries should combine as an AND across the two keys, so I expect an A to match a B only if A.a1 == B.b1 AND A.a2 == B.b2. That isn't what happens: judging by the output, the first joinFieldName is ignored and only the second joinFieldName entry is actually used for matching.
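One possible explanation is that repeating a key inside a single JSON object does not give the parser two values. The sketch below uses Python's standard json module purely as an illustration; I'm assuming, without having confirmed it, that the ETL's JSON parser treats duplicate keys in a similar "last one wins" way. That would be consistent with the output above.

```python
import json

# The "edge" transformer body from data2.json, with its repeated keys.
edge_config = """
{
  "class": "Conn",
  "joinFieldName": "b1",
  "lookup": "A.a1",
  "joinFieldName": "b2",
  "lookup": "A.a2",
  "direction": "out"
}
"""

# Default parsing builds a dict, so a repeated key keeps only its last value.
parsed = json.loads(edge_config)
print(parsed["joinFieldName"], parsed["lookup"])

# object_pairs_hook=list keeps every (key, value) pair in document order,
# which makes the duplicates visible.
pairs = json.loads(edge_config, object_pairs_hook=list)
join_fields = [value for key, value in pairs if key == "joinFieldName"]
print(join_fields)
```

In Python, at least, the b1 / A.a1 pair is dropped before any application code sees the configuration, so the transformer would only ever be given b2 and A.a2.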
Is this a bug? If not and it's working as intended, how can I set up something in ETL to generate edges between nodes based on more than one field?
Thanks!
-William