Faster diffing for large-ish nested data structures

438 views
Skip to first unread message

lvh

unread,
Apr 19, 2017, 7:32:45 PM4/19/17
to clo...@googlegroups.com
Hi,


I have two deeply nested data structures (consisting of maps, vecs and the occasional seq; althoguh I can make it maps and vecs consistently if need be). I want to compute (and store) diffs; ideally diffs that store [path oldval newval] so that I can apply them in either direction.

Using clojure.data/diff on them takes a long time (well north of 10 minutes on my new laptop).

If I flatten these nested map out to entries they have about 2E5 entries. I'm expecting between 1E5 and 1E6 entries per map. These maps  represent the same data at two close points in time, so I'm expecting small differences. The tree is unbalanced: it has inconsistent depth and branching factors, but they're still going to be consistent between snapshots.

Here are some ideas I'm trying (but I'm open to suggestions, experiences):

- The machines I'm doing this on have plenty of beefy cores. Since the data structures are immutable, I should be able to parallelize this operation somewhat, even if it's only a constant speedup of ~4x or so. (I care about minor speedups since it takes 10 minutes, not 10 hours, to do the diff right now -- so it's entirely possible that enough small speedups add up.)

- clojure.data/diff builds a giant data structure of things that are the same. I don't care about the parts that are the same; just parts that are different. That takes time.

- clojure.data/diff doesn't use transients. While I'm not expecting a lot of diffs, this might be a speedup.

I've found https://groups.google.com/forum/#!topic/clojure/VPpjlRC2INg , but it appears that mostly doesn't go anywhere unless I want to maintain something that knows a lot about internal Clojure data structures :)


lvh

Timothy Baldridge

unread,
Apr 19, 2017, 8:03:47 PM4/19/17
to clo...@googlegroups.com
I've gotten really fast diffing in Clojure by using the following concepts:

1) If you can sometimes make parts of A and B be identical, then your differ can skip checking those parts by doing a (identical? A B), this improves performance
2) Take a side effecting approach, pass into the differ a function of (fn [path a-val b-val] ...), and whenever you see a difference call that function, this turns the differ into a event generator. From there it's trivial to use transients, tuples, etc. to improve performance
3) If you can restrict yourself to maps, vectors and seqs, you can use reduce-kv for maps and vectors, and walk the seq with a index count for the latter. This will result in some rather efficient walking. 


Not perfect, but it's the fastest method I've come up with so far. Could probably also replace the calls to condp with nested case statements. 

Timothy

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+unsubscribe@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
“One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.”
(Robert Firth)

Mars0i

unread,
Apr 22, 2017, 1:22:57 AM4/22/17
to Clojure
This might be a job for which Specter is particularly useful. You might have to dive pretty deep into it, but if you get stuck, the creator Nathan Marz is often very helpful.

Timothy Baldridge

unread,
Apr 22, 2017, 9:47:53 AM4/22/17
to clo...@googlegroups.com
Can Specter walk two sequences in lock-step? That's what's needed for a good diffing engine, and that seems quite far removed from Specter's design. 

On Fri, Apr 21, 2017 at 11:22 PM, Mars0i <mars...@logical.net> wrote:
This might be a job for which Specter is particularly useful.  You might have to dive pretty deep into it, but if you get stuck, the creator Nathan Marz is often very helpful.
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+unsubscribe@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Laurens Van Houtven

unread,
Apr 22, 2017, 10:52:52 AM4/22/17
to clo...@googlegroups.com
Not to speak for Nathan, but I asked in #specter and he indicated it's unlikely to help, which I imagine is primarily for the reason Tim mentioned :)

(It bears repeating though: I was wrong about specter. It's awesome and Nathan is incredibly helpful.)

lvh

Sent from my iPhone

For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.

Mars0i

unread,
Apr 22, 2017, 11:42:34 AM4/22/17
to Clojure
tbc++, lvh, I'm not surprised that Specter doesn't help. That seems reasonable.  Although I have come up against Specter's limits, I've been surprised at what it can do, so as a non-expert I tend to err on the side of thinking that it might help if I don't know that it can't.

Sophia Gold

unread,
Apr 28, 2017, 10:08:30 PM4/28/17
to Clojure
I actually asked Nathan about a somewhat similar problem recently and he told me that making Specter support operations on multiple data structures would require a significant overhaul. Cases like this do seem quite common to me, though, so if there's a critical mass of interest I'd be willing to pitch in on it. I wonder if it makes sense to open a Github issue in order to gauge interest and discuss what changes it would require.


On Saturday, April 22, 2017 at 10:52:52 AM UTC-4, lvh ‌ wrote:
Not to speak for Nathan, but I asked in #specter and he indicated it's unlikely to help, which I imagine is primarily for the reason Tim mentioned :)

(It bears repeating though: I was wrong about specter. It's awesome and Nathan is incredibly helpful.)

lvh

Sent from my iPhone

On Apr 22, 2017, at 08:47, Timothy Baldridge <tbald...@gmail.com> wrote:

Can Specter walk two sequences in lock-step? That's what's needed for a good diffing engine, and that seems quite far removed from Specter's design. 
On Fri, Apr 21, 2017 at 11:22 PM, Mars0i <mars...@logical.net> wrote:
This might be a job for which Specter is particularly useful.  You might have to dive pretty deep into it, but if you get stuck, the creator Nathan Marz is often very helpful.

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
“One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.”
(Robert Firth)
Reply all
Reply to author
Forward
0 new messages