Creating edges using ETL that require multiple join fields

221 views

Skip to first unread message

William

unread,

Apr 11, 2017, 4:08:54 PM4/11/17

to OrientDB

I have a problem where I need to be able to generate edges between nodes using 2 or more join fields to properly resolve the match.

It's similar to this question on stack overflow... the solution in that problem is to add multiple joinFieldName entries into the edge transformer, but this isn't quite working as expected when I tried it out...

If I change the data by appending a new row, 2,1 to each data files to get this:

data1.csv

a1,a2
1,1
1,2
2,3
2,1

data2.csv

b1,b2
1,1
2,3
1,2
2,1

then using the json provided:

data1.json

{
  "source": { "file": { "path": "./data1.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "A" } }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "plocal:./test.orientdb",
       "dbType": "graph",
       "dbAutoCreate": true,
       "classes": [
         {"name": "A", "extends": "V"},
         {"name": "B", "extends": "V"},
         {"name": "Conn", "extends": "E"}
       ]
    }
  }
}

data2.json

{
  "source": { "file": { "path": "./data2.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "B" } },
    { "edge": { "class": "Conn",
                "joinFieldName": "b1",
                "lookup": "A.a1",
                "joinFieldName": "b2",
                "lookup": "A.a2",
                "direction": "out"
            }}
  ],
  "loader": {
    "orientdb": {
       "dbURL": "plocal:./test.orientdb",
       "dbType": "graph",
       "dbAutoCreate": true,
       "classes": [
         {"name": "B", "extends": "V"},
         {"name": "Conn", "extends": "E"}
       ]
    }
  }
}

the result from running oetl.sh on data1.json then data2.json gives me this:

orientdb {db=test.orientdb}> select from v


+----+-----+------+----+----+-------------+----+----+-------------+
|#   |@RID |@CLASS|a1  |a2  |in_Conn      |b2  |b1  |out_Conn     |
+----+-----+------+----+----+-------------+----+----+-------------+
|0   |#25:0|A     |1   |1   |[#41:0,#45:0]|    |    |             |
|1   |#26:0|A     |1   |2   |[#44:0]      |    |    |             |
|2   |#27:0|A     |2   |3   |[#43:0]      |    |    |             |
|3   |#28:0|A     |2   |1   |[#42:0,#46:0]|    |    |             |
|4   |#33:0|B     |    |    |             |1   |1   |[#41:0,#42:0]|
|5   |#34:0|B     |    |    |             |3   |2   |[#43:0]      |
|6   |#35:0|B     |    |    |             |2   |1   |[#44:0]      |
|7   |#36:0|B     |    |    |             |1   |2   |[#45:0,#46:0]|
+----+-----+------+----+----+-------------+----+----+-------------+


8 item(s) found. Query executed in 0.01 sec(s).

which seems wrong to me... if I write out the edges:

A(1,1) <-- #41:0 --- B(1,1) OK

A(1,1) <-- #45:0 --- B(2,1) WRONG

A(1,2) <-- #44:0 --- B(1,2) OK

A(2,3) <-- #43:0 --- B(2,3) OK

A(2,1) <-- #42:0 --- B(1,1) WRONG

A(2,1) <-- #46:0 --- B(2,1) OK

My understanding here is that the two joinFieldName entries should be creating an AND operation between the two keys... so I expect to match an A to a B if A.a1 == B.b1 AND A.a2 == B.b2, but this isn't what is happening. From the looks of it, the first joinFieldName is ignored and the 2nd joinFieldName entry is the thing that's actually used to match.

Is this a bug? If not and it's working as intended, how can I set up something in ETL to generate edges between nodes based on more than one field?

Thanks!

-William

William

unread,

Apr 11, 2017, 4:53:43 PM4/11/17

to OrientDB

Upon further investigation... I think the issue is answered here w/rt what is happening in data2.json... multiple "joinFieldName" entries don't cause an error in JSON, but the last one smashes the first one.

Enabling DEBUG messages in oetl.sh for my example resulted in:

OrientDB etl v.2.2.18 (build 3e8d46e73aa087fce245fa1125ab7d984a247f6e) https://www.orientdb.com
[file] INFO Load from file ./data2.csv
BEGIN ETL PROCESSOR
[file] INFO Reading from file ./data2.csv with encoding UTF-8
Started execution with 1 worker threads
[orientdb] DEBUG orientdb: found 0 vertices in class 'null'
[orientdb] DEBUG orientdb: found 0 vertices in class 'null'
Start extracting
[csv] DEBUG document={b2:1,b1:1}
[1:vertex] DEBUG Transformer input: {b2:1,b1:1}
[csv] DEBUG document={b2:3,b1:2}
[csv] DEBUG document={b2:2,b1:1}
[csv] DEBUG document={b2:1,b1:2}
Extraction completed
<SNIP/>
[4:vertex] DEBUG Transformer input: {b2:1,b1:2}
[4:vertex] DEBUG Transformer output: v(B)[#36:0]
[4:edge] DEBUG Transformer input: v(B)[#36:0]
[4:edge] DEBUG joinCurrentValue=1, lookupResult=[#25:0, #28:0]
[4:edge] DEBUG created new edge=e[#45:0][#36:0-Conn->#25:0]
[4:edge] DEBUG created new edge=e[#46:0][#36:0-Conn->#28:0]
[4:edge] DEBUG Transformer output: v(B)[#36:0]
[4:log] DEBUG Transformer input: v(B)[#36:0]
[4:log] INFO >>> v(B)[#36:0]
[4:log] DEBUG Transformer output: v(B)[#36:0]
[orientdb] INFO committing
Pipeline worker done without errors: true
END ETL PROCESSOR

Looking at just the last vertex matching b(2,1), which I'd like to link to a(2,1)... but it's only matching that 2nd value... so the edges are only getting generated based on a match between A.a2 and B.b2.

I'm still interested in knowing if it's possible to generate edges between nodes using ETL transformers in OrientDB based on multiple keys rather than just one. If it's not possible, I'll have to write some custom code to do this... but I'd rather do that through the ETL process if possible :)

Reply all

Reply to author

Forward

0 new messages