node merge history: best approach


swami....@ishafoundation.org

Aug 3, 2015, 10:28:11 AM
to OrientDB
Hi,

We are building a contact management application. Each contact is a node. If two or more contacts are discovered to be duplicates, we want to be able to merge them into a single node. Additionally, we want to keep the pre-merge node states, so that we can undo the merge if required (*).

We propose to model this by creating a new node, linking each old node to it with a "merged_into" edge, and setting a status property on the old nodes to "removed".
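A minimal sketch of this model, assuming a plain in-memory graph rather than the OrientDB API. The `Graph` class, the `merge` helper, the node ids and the "active" status value are all hypothetical names chosen for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> property dict
    edges: list = field(default_factory=list)  # (source id, label, target id)

    def add_node(self, node_id, **props):
        # "active" is an assumed default; the post only names "removed".
        self.nodes[node_id] = {"status": "active", **props}

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))


def merge(graph, duplicate_ids, new_id, **props):
    """Create new_id, link each duplicate to it, and mark the duplicates removed."""
    graph.add_node(new_id, **props)
    for old_id in duplicate_ids:
        graph.add_edge(old_id, "merged_into", new_id)
        graph.nodes[old_id]["status"] = "removed"
    return new_id


g = Graph()
g.add_node("c1", name="J. Smith")
g.add_node("c2", name="John Smith")
merge(g, ["c1", "c2"], "c3", name="John Smith")
```

The old nodes keep all of their properties, which is what would make an undo possible later.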

Now we have two options:

1. We copy all the existing edges from the two merged nodes to the new node

2. We don't.

Option 2 gives a simpler data structure; however, it makes all our queries much more complex, because we have to traverse back through potentially multiple levels of merged nodes to fetch all the edges.

Option 1 would keep the queries the same, but would introduce a lot of extra edges.
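To make the trade-off concrete, here is a rough sketch of both options on a list of `(source, label, target)` tuples. It is plain Python, not OrientDB code, and the function names and sample ids are made up for the example:

```python
def merged_family(edges, node_id):
    """Return node_id plus every node merged into it, directly or indirectly."""
    family = {node_id}
    stack = [node_id]
    while stack:
        current = stack.pop()
        for src, label, dst in edges:
            if label == "merged_into" and dst == current and src not in family:
                family.add(src)
                stack.append(src)
    return family


def out_option2(edges, node_id, label):
    """Option 2: walk back through the merge history on every query."""
    family = merged_family(edges, node_id)
    return {dst for src, lbl, dst in edges if lbl == label and src in family}


def copy_edges_option1(edges, old_ids, new_id):
    """Option 1: at merge time, re-create the old nodes' edges on the new node."""
    result = list(edges)
    for src, lbl, dst in edges:
        if lbl == "merged_into":
            continue
        copy = (new_id if src in old_ids else src, lbl,
                new_id if dst in old_ids else dst)
        if copy != (src, lbl, dst) and copy not in result:
            result.append(copy)
    return result


# c1 and c2 were merged into c3; later c3 and c4 were merged into c5.
edges = [
    ("c1", "attended_class", "k1"),
    ("c2", "attended_class", "k2"),
    ("c1", "merged_into", "c3"),
    ("c2", "merged_into", "c3"),
    ("c3", "merged_into", "c5"),
    ("c4", "attended_class", "k3"),
    ("c4", "merged_into", "c5"),
]

# Option 2: edges untouched, but each query needs the extra traversal.
via_history = out_option2(edges, "c5", "attended_class")

# Option 1: a plain one-hop query works, at the cost of copied edges.
with_copies = copy_edges_option1(edges, {"c1", "c2"}, "c3")
with_copies = copy_edges_option1(with_copies, {"c3", "c4"}, "c5")
direct = {dst for src, lbl, dst in with_copies
          if src == "c5" and lbl == "attended_class"}
```

Both queries return the same three classes, but in this small example the edge list grows from 7 to 12 entries under Option 1, and the copies accumulate with every further level of merging.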

We are also considering a third option: creating a copy of the full database with all the merged nodes collapsed, i.e. just a view of the current contacts. This would need to be kept in sync with the main database.
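A rough sketch of how such a collapsed view could be derived, again on plain `(source, label, target)` tuples rather than through OrientDB. `resolve` and `collapsed_view` are hypothetical helpers, and the sketch assumes each node is merged into at most one other node and that the merge links contain no cycles:

```python
def resolve(merged_into, node_id):
    """Follow merged_into links until reaching a node that was not merged further."""
    while node_id in merged_into:
        node_id = merged_into[node_id]
    return node_id


def collapsed_view(edges):
    """Rewrite every non-merge edge so it points at the current, surviving nodes."""
    merged_into = {src: dst for src, lbl, dst in edges if lbl == "merged_into"}
    view = set()
    for src, lbl, dst in edges:
        if lbl != "merged_into":
            view.add((resolve(merged_into, src), lbl, resolve(merged_into, dst)))
    return view


# c1 and c2 were merged into c3; later c3 and c4 were merged into c5.
edges = [
    ("c1", "attended_class", "k1"),
    ("c2", "attended_class", "k2"),
    ("c1", "merged_into", "c3"),
    ("c2", "merged_into", "c3"),
    ("c3", "merged_into", "c5"),
    ("c4", "attended_class", "k3"),
    ("c4", "merged_into", "c5"),
]

view = collapsed_view(edges)
```

Rebuilding the whole view like this is the simplest way to stay in sync; updating it incrementally on every merge would be cheaper but is not shown here.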

Would appreciate any advice/suggestions on the best way to handle this.

I'd also like to suggest a new "collapse" query feature, which would make Option 2 easier to work with. Something like this:

select out("attended_class") collapse("merged_into") from #10:12

which would collapse the specified edges until there are no further outbound "merged_into" edges, and thus retrieve all the edges attached to the previous (pre-merged) nodes.


* To keep things *simple*, we won't allow the unmerge operation after any edges have been defined on the new node.
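A minimal sketch of that rule, assuming Option 2 (no copied edges) and the same plain-Python representation of nodes as a dict and edges as `(source, label, target)` tuples. The `unmerge` helper and the "active" status value are hypothetical:

```python
def unmerge(nodes, edges, merged_id):
    """Undo a merge, refusing if the merged node has acquired edges of its own."""
    incoming = [e for e in edges if e[1] == "merged_into" and e[2] == merged_id]
    # Any other edge touching merged_id was defined after the merge.
    blocking = [e for e in edges if merged_id in (e[0], e[2]) and e not in incoming]
    if blocking:
        raise ValueError(f"{merged_id} has edges of its own; cannot unmerge")
    for src, _, _ in incoming:
        nodes[src]["status"] = "active"  # assumed name for the pre-merge status
    del nodes[merged_id]
    return [e for e in edges if e not in incoming]


nodes = {
    "c1": {"status": "removed"},
    "c2": {"status": "removed"},
    "c3": {"status": "active"},
}
edges = [
    ("c1", "attended_class", "k1"),
    ("c1", "merged_into", "c3"),
    ("c2", "merged_into", "c3"),
]

remaining = unmerge(nodes, edges, "c3")
```

With Option 1 the copied edges would also sit on the new node, so they would need to be marked as copies to tell them apart from edges added after the merge.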

Kind Regards

Swami Kevala

SavioL

Aug 4, 2015, 3:53:49 AM
to OrientDB
Hi,
There are various ways to solve this, as you have rightly listed. One of them, perhaps the simplest, might be the one shown in the picture:



I don't think it would be a bad idea to build it this way (though the only way to be sure would be to compare it with other solutions), because you have a field that indicates whether a node is active, and the N edges of the individual nodes remain linked to the duplicates. As a DB structure, it should not be complex or expensive in terms of database performance.
To repeat, it is one of N possible solutions. Is it the best one? To answer that, you would have to develop the others and compare them. I don't know whether this has been of any help; I have simply described how I would have done it.

regards,
Savio L.
