Snapshots and Diff reports for detecting and propagating changes


Regunath Balasubramanian

Mar 3, 2014, 5:54:37 AM
to netfli...@googlegroups.com
Hi,

I intend to use Zeno as a library to detect changes between two cycles. The use case is like this:

  1. I have a source data store where data gets updated by a number of user-initiated and backend update jobs. The changes that happen on this system are of interest to multiple other systems. Each client system normally starts off at a snapshot and is interested in the changes that happen after it.
  2. In order to do this, I presume I need to create snapshots periodically - I could version each snapshot with the time when it was created.
  3. Given that each client is interested only in changes (at the field/property level of the POJO), I guess this will have to come from the DiffReport emerging from comparing two states. 
  4. Based on the Zeno Wiki comments, deltas appear to be more resource-friendly than snapshots. However, it appears incorrect for me to compare a snapshot with a delta, even though all objects are added to the state engine in each cycle.
In light of the above scenario, is this a reasonable approach to take:
  1. Create only snapshots in each cycle, no deltas are produced. Number/version them suitably using timestamp.
  2. Based on client's last stored version (say timestamp), pick up a suitable snapshot. 
  3. Perform a diff between the oldest snapshot and the one picked up for a client (from step 2 above). Inspect the Diff report to identify "extra" objects and "different" objects. From the "different" objects, identify fields/properties that have changed. This may then be served suitably to the client in a form like:
class ChangesInA {
     String property;
     Object oldValue;
     Object newValue;
}

Thanks 

Drew Koszewnik

Mar 3, 2014, 3:42:23 PM
to netfli...@googlegroups.com
Hi again Regunath,

You are correct that you will want to produce snapshots periodically, so that newly started clients don't have to bootstrap their state engines from the beginning of time (the first and only snapshot ever produced). Based on the comment "it appears incorrect for me to compare a Snapshot with a Delta even though all objects are added to the state engine in each cycle", I think there might be some confusion about deltas and snapshots: When you compare two data states, you are not comparing "deltas" vs "snapshots". Deltas and snapshots are transitions to data states, they are not the states themselves.

For example, let's say we create a data state on our server and call it version "1", then produce a snapshot. If we load that snapshot on a client, the client state engine will be at version "1". Now, if we create a new data state on our server and apply it on top of the state engine which we previously loaded with version "1", we can call this version "2". We can produce both a snapshot and a delta from this state. A new client can load the snapshot and get to version "2". However, our client with version "1" already loaded can load the delta and also get to version "2". There will be no difference in the data between the new client which loaded a full snapshot of version "2", and the preexisting client which applied a delta on top of version "1" to get to version "2".

When we do a diff, we compare data states, not data transitions. It doesn't matter how we arrived at the data states.
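
To make the version "1" / version "2" example concrete, here is a minimal plain-Java sketch. It does not use the Zeno API at all: `DataState`, `Snapshot`, and `Delta` are hypothetical names invented for illustration, and a data state is modeled as nothing more than a versioned key-to-value map. It only shows that a state reached through a delta can be identical to the same state reached through a snapshot.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not a Zeno class: a data state is a versioned key -> value map.
final class DataState {
    final String version;
    final Map<String, String> objects;

    DataState(String version, Map<String, String> objects) {
        this.version = version;
        this.objects = new HashMap<>(objects);
    }
}

// A snapshot is a transition from "nothing" to a complete state.
final class Snapshot {
    private final DataState target;

    Snapshot(DataState target) { this.target = target; }

    DataState load() {
        return new DataState(target.version, target.objects);
    }
}

// A delta is a transition from one specific state to the next one.
final class Delta {
    private final String fromVersion;
    private final String toVersion;
    private final Set<String> removedKeys = new HashSet<>();
    private final Map<String, String> addedOrReplaced = new HashMap<>();

    Delta(DataState from, DataState to) {
        fromVersion = from.version;
        toVersion = to.version;
        for (String key : from.objects.keySet()) {
            if (!to.objects.containsKey(key)) removedKeys.add(key);
        }
        for (Map.Entry<String, String> e : to.objects.entrySet()) {
            // Values are assumed non-null in this sketch.
            if (!e.getValue().equals(from.objects.get(e.getKey()))) {
                addedOrReplaced.put(e.getKey(), e.getValue());
            }
        }
    }

    DataState applyTo(DataState current) {
        if (!current.version.equals(fromVersion)) {
            throw new IllegalStateException("delta expects version " + fromVersion);
        }
        Map<String, String> next = new HashMap<>(current.objects);
        next.keySet().removeAll(removedKeys);
        next.putAll(addedOrReplaced);
        return new DataState(toVersion, next);
    }
}

final class StateTransitionSketch {
    public static void main(String[] args) {
        Map<String, String> m1 = new HashMap<>();
        m1.put("a", "1");
        m1.put("b", "2");
        Map<String, String> m2 = new HashMap<>();
        m2.put("a", "1");
        m2.put("b", "3");
        m2.put("c", "4");
        DataState v1 = new DataState("1", m1);
        DataState v2 = new DataState("2", m2);

        DataState existingClient = new Delta(v1, v2).applyTo(new Snapshot(v1).load());
        DataState newClient = new Snapshot(v2).load();
        // Both clients hold the same data, regardless of the transition used.
        System.out.println(existingClient.objects.equals(newClient.objects));
    }
}
```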

Your proposed approach will work, but there are much more efficient ways to get this.

Here is one way to achieve your goal more efficiently.  If you recall, we added a class TypeDeserializationStateListener soon after release, partially based on your earlier comments.  For each of the types you are interested in tracking add a TypeDeserializationStateListener.  Each time you apply a delta (or another snapshot) and any field changes in a specific object, you will receive one call to removedObject and one call to addedObject.  Each time you receive a removedObject call, store it off into some temporary "removed objects" collection.  Each time you receive an "addedObject" call, store it off into some temporary "added objects" collection.  At the end of this process, you can match up the removed and added objects by their keys, then check each of those for their differences.
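
The bookkeeping described above can be sketched in plain Java as follows. This is only an illustration of the collect-then-match idea: `ChangeCollector` is a hypothetical stand-in and does not extend Zeno's TypeDeserializationStateListener, whose exact method signatures are not reproduced here. Only the callback names `removedObject` and `addedObject` come from this thread; the `Movie` type and its `id` key are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical immutable POJO used only for this illustration.
final class Movie {
    final int id;
    final String title;

    Movie(int id, String title) {
        this.id = id;
        this.title = title;
    }
}

// Hypothetical stand-in for a listener: collects removals and additions by key.
final class ChangeCollector<K, T> {
    private final Function<T, K> keyOf;
    private final Map<K, T> removed = new LinkedHashMap<>();
    private final Map<K, T> added = new LinkedHashMap<>();

    ChangeCollector(Function<T, K> keyOf) { this.keyOf = keyOf; }

    void removedObject(T obj) { removed.put(keyOf.apply(obj), obj); }

    void addedObject(T obj) { added.put(keyOf.apply(obj), obj); }

    // Removed and added under the same key: the object was modified.
    Set<K> modifiedKeys() {
        Set<K> keys = new LinkedHashSet<>(removed.keySet());
        keys.retainAll(added.keySet());
        return keys;
    }

    // Removed with no matching addition: the object was deleted.
    Set<K> deletedKeys() {
        Set<K> keys = new LinkedHashSet<>(removed.keySet());
        keys.removeAll(added.keySet());
        return keys;
    }

    // Added with no matching removal: the object is new.
    Set<K> createdKeys() {
        Set<K> keys = new LinkedHashSet<>(added.keySet());
        keys.removeAll(removed.keySet());
        return keys;
    }

    T before(K key) { return removed.get(key); }

    T after(K key) { return added.get(key); }
}

final class ChangeCollectorDemo {
    public static void main(String[] args) {
        ChangeCollector<Integer, Movie> c = new ChangeCollector<Integer, Movie>(m -> m.id);
        c.removedObject(new Movie(1, "Old title"));
        c.addedObject(new Movie(1, "New title"));
        c.removedObject(new Movie(2, "Gone"));
        c.addedObject(new Movie(3, "Fresh"));
        for (Integer key : c.modifiedKeys()) {
            System.out.println(key + ": " + c.before(key).title + " -> " + c.after(key).title);
        }
    }
}
```

The two maps would need to be cleared after each update is processed, so that pairs from one cycle are not matched against callbacks from the next.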

You may want to do the matching yourself between your from / to collections, then detect the differences yourself as well.  However, if you want to do this in a semantically independent way, you may save yourself some maintenance down the line if your data model changes.  You should be able to quickly get the differences as a TypeDiff object by creating a TypeDiffOperation and a DiffSerializationFramework.  Call performDiff on your TypeDiffOperation with your removed objects collection as the "fromState" argument, and your added objects collection as the "toState" argument.

Give this a try and please let us know if this works for you.  

Thanks again,
Drew.

Regunath Balasubramanian

Mar 4, 2014, 12:20:40 AM
to Drew Koszewnik, netfli...@googlegroups.com
Hello Drew,

Thanks for the detailed answers and explanation. Appreciate it.
I learned something new from your mail and would like to paraphrase as follows:
  1. The state engine is empty i.e. does not hold anything if no snapshots have been saved. If a snapshot exists, load it into the state engine using the FastBlobReader. This behavior holds good for every cycle.
  2. Perform a scan from the source data store and add all the items to the state engine. Use some recency logic to determine whether you want to write a snapshot or delta - say write a snapshot once every 5 cycles or 24 hours.
  3. From 1 and 2 above, I will now have a series of timestamped (version) snapshots and corresponding deltas. A client may then choose to apply a snapshot or delta depending on its stored version.
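
The recency rule in step 2 and the client-side choice in step 3 might look roughly like the sketch below. `SnapshotPolicy` and `canApplyDelta` are hypothetical helpers, not part of Zeno; the thresholds (5 cycles, 24 hours) are the ones mentioned above, and versions are treated as opaque strings.

```java
import java.time.Duration;

// Hypothetical helper, not part of Zeno: decides when a cycle should write a snapshot.
final class SnapshotPolicy {
    private final int maxCycles;
    private final Duration maxAge;

    SnapshotPolicy(int maxCycles, Duration maxAge) {
        this.maxCycles = maxCycles;
        this.maxAge = maxAge;
    }

    // True once either the cycle count or the elapsed time reaches its threshold.
    boolean shouldWriteSnapshot(int cyclesSinceLastSnapshot, Duration timeSinceLastSnapshot) {
        return cyclesSinceLastSnapshot >= maxCycles
                || timeSinceLastSnapshot.compareTo(maxAge) >= 0;
    }

    // A client can apply a delta only if it already holds the version the delta starts from;
    // otherwise it has to load a snapshot.
    static boolean canApplyDelta(String clientVersion, String deltaFromVersion) {
        return clientVersion != null && clientVersion.equals(deltaFromVersion);
    }
}

final class SnapshotPolicyDemo {
    public static void main(String[] args) {
        SnapshotPolicy policy = new SnapshotPolicy(5, Duration.ofHours(24));
        System.out.println(policy.shouldWriteSnapshot(2, Duration.ofHours(3)));
        System.out.println(SnapshotPolicy.canApplyDelta("1", "1"));
    }
}
```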
Regarding change detection, I would like to stay away from doing it myself, for various reasons. One is the need to maintain it as the schema changes, as you rightly pointed out.

I have one question on the TypeDeserializationStateListener and its callback methods. When are removedObject and addedObject called? I mean, what is the semantic meaning of these objects? Do they denote deleted and new objects in the source system? I guess not, and suspect they reflect changes in the data stored by the state engine.

From your comments, I assume I'll get field/property level changes by performing the diff operation as described?

Thanks
Regu



Drew Koszewnik

Mar 4, 2014, 2:26:03 PM
to netfli...@googlegroups.com, Drew Koszewnik
Hi Regu,

All POJOs in the zeno data model are immutable.  Consequently, the only way to change any field in an object is to remove the whole object and add a replacement.  

When you apply an update to a state engine, and you have hooked up a TypeDeserializationStateListener to a type, then that listener will receive a call for each object removal and addition.  

The way to detect whether additions / removals are actually changes is to inspect the keys of the objects.  If there is a removal for a given key, and an addition for the same key, then this pair represents a modification.

The TypeDiffOperation should do this pairing for you.  In order to accomplish this task, you define how to identify the keys of your objects with a TypeDiffInstruction.  Once the operation is completed, you can gather details about specifically what changed for each object by inspecting the returned TypeDiff.  
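
For readers who want to see what a field-level comparison of one matched pair amounts to, here is a rough plain-Java approximation using reflection. It is not how TypeDiffOperation works internally and it calls no Zeno classes; `FieldChange` simply mirrors the `ChangesInA` shape proposed earlier in the thread, and `Person` and `FieldDiffSketch` are hypothetical names.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Mirrors the ChangesInA shape from earlier in the thread.
final class FieldChange {
    final String property;
    final Object oldValue;
    final Object newValue;

    FieldChange(String property, Object oldValue, Object newValue) {
        this.property = property;
        this.oldValue = oldValue;
        this.newValue = newValue;
    }
}

// Hypothetical immutable POJO used only for this illustration.
final class Person {
    final int id;
    final String name;
    final String city;

    Person(int id, String name, String city) {
        this.id = id;
        this.name = name;
        this.city = city;
    }
}

final class FieldDiffSketch {
    // Compares the declared instance fields of two objects of the same class.
    // Superclass fields are not inspected in this sketch.
    static List<FieldChange> diff(Object before, Object after) {
        if (before.getClass() != after.getClass()) {
            throw new IllegalArgumentException("objects must be of the same class");
        }
        List<FieldChange> changes = new ArrayList<>();
        for (Field field : before.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) continue;
            try {
                field.setAccessible(true);
                Object oldValue = field.get(before);
                Object newValue = field.get(after);
                if (!Objects.equals(oldValue, newValue)) {
                    changes.add(new FieldChange(field.getName(), oldValue, newValue));
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read field " + field.getName(), e);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Person removed = new Person(7, "Ann", "Pune");
        Person added = new Person(7, "Ann", "Goa");
        for (FieldChange c : diff(removed, added)) {
            System.out.println(c.property + ": " + c.oldValue + " -> " + c.newValue);
        }
    }
}
```

A reflective comparison like this has to be revisited by hand whenever nested types or collections enter the data model, which is the maintenance cost the Zeno diff classes are meant to avoid.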

Drew.

Regunath Balasubramanian

Mar 5, 2014, 12:11:13 AM
to Drew Koszewnik, netfli...@googlegroups.com
Hello Drew,

Thanks for the explanation. I will try out the diff operation using the suggested approach of hooking up a TypeDeserializationStateListener.

Thanks
Regu 

Regunath Balasubramanian

Mar 26, 2014, 9:00:18 AM
to netfli...@googlegroups.com, Drew Koszewnik
Hello Drew,

Just an update to say that the approach of hooking up a TypeDeserializationStateListener works and I was able to interpret changes. Thanks for the help!
I ran a test for half a million objects and am quite impressed with the memory footprint and performance.

Regards
Regu