[ANN and RFC] Bifurcan: impure functional data structures

Zach Tellman

Mar 27, 2017, 12:51:46 PM

to Clojure

This is a slightly irregular announcement, because it's not for a Clojure library. Rather, it's for a library written purely in Java: https://github.com/lacuna/bifurcan.

This is a collection of mutable and immutable data structures, designed to address some of my personal frustrations with what's available in the Clojure and Java ecosystems. Notably, they have pluggable equality semantics, so while they *can* use Clojure's expensive hash and equality checks, they don't *have* to. They also provide high-performance mutable variants of the data structure which share an API with their immutable cousins.
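To make "pluggable equality semantics" concrete, here is a minimal Java sketch of the general idea. It is not Bifurcan's API: `PluggableMap`, its constructor, and its methods are hypothetical names chosen for illustration, and the map simply delegates to `java.util.HashMap`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;

// Hypothetical illustration only; this is NOT Bifurcan's API.
// The caller supplies the hash and equality functions used for keys.
final class PluggableMap<K, V> {
    private final ToIntFunction<K> hashFn;
    private final BiPredicate<K, K> equalsFn;
    private final Map<Wrapped, V> entries = new HashMap<>();

    PluggableMap(ToIntFunction<K> hashFn, BiPredicate<K, K> equalsFn) {
        this.hashFn = hashFn;
        this.equalsFn = equalsFn;
    }

    // Wraps a key so that hashCode/equals go through the supplied functions.
    private final class Wrapped {
        final K key;
        Wrapped(K key) { this.key = key; }

        @Override public int hashCode() { return hashFn.applyAsInt(key); }

        @Override @SuppressWarnings("unchecked")
        public boolean equals(Object other) {
            // Only Wrapped instances created by this map are ever compared.
            return equalsFn.test(key, ((Wrapped) other).key);
        }
    }

    void put(K key, V value) { entries.put(new Wrapped(key), value); }
    V get(K key) { return entries.get(new Wrapped(key)); }
    int size() { return entries.size(); }
}
```

With `Object::hashCode` and `Object::equals`, the keys `1` (an `Integer`) and `1L` (a `Long`) are distinct. With a numeric hash and comparison based on `longValue()`, they collapse into a single entry, at the cost of the extra work on every lookup.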

I'm posting it here to ask for people's thoughts on how, if at all, this should be exposed as a Clojure library. It would be simple to wrap them in the Clojure interfaces and make them behave identically to Clojure's own data structures, but that kind of defeats the point. However, creating an entirely new set of accessors means that we can't leverage Clojure's standard library.

It's possible that I'm alone in my frustrations, and no Clojure wrapper is necessary. But if this does solve a problem you have, I'd like to hear more about what it is, and how you think Bifurcan might help. Please feel free to reply here, or to grab me at Clojure/West and talk about it there.

Thanks in advance,

Zach

Michael Gardner

Mar 27, 2017, 1:05:37 PM

to clo...@googlegroups.com

> On Mar 27, 2017, at 09:51, Zach Tellman <ztel...@gmail.com> wrote:
>
> They also provide high-performance mutable variants of the data structure which share an API with their immutable cousins.

How does their performance compare to Clojure's transients? Transients are slower than Java's native mutable collections, so if the mutable collections in this library deliver the same performance as the latter, they could act as a drop-in replacement for the former (given a compatible Clojure wrapper).

Zach Tellman

Mar 27, 2017, 1:14:01 PM

to clo...@googlegroups.com

Benchmarks are available here, and the Clojure benchmarks make use of transients wherever possible: https://github.com/lacuna/bifurcan/blob/master/doc/benchmarks.md.

More generally, while transients are often used in practice to quickly construct a read-only data structure, the more formal definition is that they provide an O(1) mechanism for transforming between immutable and mutable forms. This isn't possible with purely mutable data structures like Java's HashMap or Bifurcan's LinearMap. So while wrapping these data structures in the Clojure API would provide better performance for construction and lookups, it wouldn't be quite the same thing as a transient.
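The limitation described above can be sketched with plain JDK collections. This is only an illustration using `java.util.HashMap`; `FreezeDemo` and its method names are hypothetical. A purely mutable map can be "frozen" in O(1) only as a read-only view that still tracks later writes, while an independent snapshot requires an O(n) copy.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper names, JDK collections only.
final class FreezeDemo {
    // O(1), but only a read-only *view*: later writes to `source` stay visible.
    static <K, V> Map<K, V> freezeAsView(Map<K, V> source) {
        return Collections.unmodifiableMap(source);
    }

    // An independent snapshot, but O(n): every entry is copied first.
    static <K, V> Map<K, V> freezeAsCopy(Map<K, V> source) {
        return Collections.unmodifiableMap(new HashMap<>(source));
    }
}
```

Neither option is a cheap conversion in both directions, which is the property that sets transients apart.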

Luke Burton

Mar 27, 2017, 3:30:00 PM

to clo...@googlegroups.com

I'm not well versed enough in these data structures to know this without asking (apologies if it's really obvious to some people): is there opportunity to improve Clojure's built-in data structures with Bifurcan rather than trying to wrap Bifurcan's structures in Clojure?

As an aside, I want to draw people's attention to the sweet little criterium + gnuplot setup you have there for generating benchmarking plots. Nice!

Zach Tellman

Mar 27, 2017, 3:42:10 PM

to clo...@googlegroups.com

Both the 'List' and 'Map' data structures in Bifurcan use innovative approaches that were published after Clojure's original release [1] [2]. In the case of the immutable map, you get faster iteration and the structural invariants allow for some clever stuff w.r.t. equality checks and set operations. In the case of the immutable list/vector, you get fast concatenation, the ability to add and remove from both ends of the collection, and a `subvec` that doesn't hold onto the entire underlying data structure.

All of this is MIT licensed, so please feel welcome to open a PR against Clojure to change the core data structures using my code, but I'd rate the chance of that being accepted as somewhere between "low" and "nonexistent". Also, it should be noted that Clojure's implementation is much more battle-tested than my own at this point. But if anyone wants to tilt at that particular windmill, feel free to ask me any questions you may have about the implementation.

Zach

[1] https://michael.steindorfer.name/publications/oopsla15.pdf

[2] https://infoscience.epfl.ch/record/169879/files/RMTrees.pdf

Dave Dixon

Mar 27, 2017, 6:49:31 PM

to Clojure

I think this would solve an issue I'm facing. I'm working on implementing variations of Monte Carlo tree search for very large trees, with states and actions represented by maps. There are several lookup tables indexed by either state or state-action pairs. I haven't done any detailed benchmarking or perf analysis, but I'm guessing that hash/equality consumes no small amount of time.

Mark Engelberg

Mar 28, 2017, 7:12:42 AM

to clojure

I do a lot of work with data structures, so this, I think, would be useful to me.

For the immutable data structures, it seems like they could be done as a drop-in replacement for the Clojure built-ins. There are a couple new functions for splitting and concatenating. I'd recommend following precedents set by core.rrb-vector when relevant. Map linear/forked to transient API.

For the mutable Linear data structures, my instinct would be to hook into the transient functions when possible, even though it doesn't behave exactly like transients. Name new functions on the mutable collections with a `!` character, but return the mutated collection as output (as opposed to returning void), so you don't have to write functions over the data as a "bang in place".
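The "return the mutated collection" convention can be sketched in Java as follows. Java identifiers cannot contain `!`, so the naming half of the suggestion does not carry over. `ChainableList` is a hypothetical class, not part of Bifurcan or Clojure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mutable list whose mutators return the collection itself
// rather than void, so calls can be chained or used as reducing functions.
final class ChainableList<T> {
    private final List<T> items = new ArrayList<>();

    ChainableList<T> add(T item) {
        items.add(item);
        return this; // same instance, mutated in place
    }

    ChainableList<T> removeLast() {
        if (!items.isEmpty()) {
            items.remove(items.size() - 1);
        }
        return this;
    }

    T get(int index) { return items.get(index); }
    int size() { return items.size(); }
}
```

A call chain such as `new ChainableList<String>().add("a").add("b").removeLast()` then reads much like the equivalent code over an immutable collection, even though a single instance is being mutated.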

--Mark

Dave Dixon

Apr 17, 2017, 4:52:39 PM

to Clojure

What is the issue with wrapping in Clojure interfaces? Added overhead of function calls?

I'm finding myself in the process of doing some of this, at least for constructors. Also thinking of generating predicates/generators for use with spec.

On Monday, March 27, 2017 at 9:51:46 AM UTC-7, Zach Tellman wrote:

Dave Dixon

Apr 18, 2017, 9:53:20 AM

to Clojure

Stared at this a bit yesterday. Seems like if you want to leverage spec while using bifurcan, then the bifurcan types need to have the Clojure wrapper. The alternative appears to be re-implementing at least a large subset of collection-related spec code, which is a lot to bite off. Also tried updating some existing code to use bifurcan. Similar to spec, there are going to be cases which are less perf sensitive, where it would be nice to use code that is polymorphic for collections, and drop down to the fast interface in perf-sensitive parts.

Zach Tellman

Apr 18, 2017, 12:32:32 PM

to Clojure

To be clear, my intention was always to wrap the implementations in the appropriate Clojure interfaces, and I don't believe that will cause much, if any, of a performance hit (inlining is magic). However, there are some real questions regarding how to expose non-standard equality semantics, and whether transients should be represented using the immutable or mutable collection variants.

For what it's worth, I have about 1/3 of an implementation of Clojure-compatible versions of these data structures; I just wanted to mull over the above questions a bit before going further. I'm happy to discuss them here in more depth if you have any questions or opinions.

Zach

Mikera

Apr 19, 2017, 12:57:18 AM

to Clojure

Looks cool! I'm going to mine this for ideas and potentially use it. FWIW I've also been implementing some Java functional data structures for my language design experiments.

If anyone is interested, I'm happy to share code. My own motivations were:

- I wanted decent persistent Lists, Sets, Maps for language experiments without pulling in the whole of Clojure as a dependency

- I care about some operations that are not very efficient in Clojure (sublists and concatenation especially)

- It's actually quite a fun challenge writing functional data structures

I have the same annoyance that it isn't easy to play nicely with Clojure code unless you implement the Clojure interfaces (IPersistentVector etc.). It would be nice if Clojure used protocols so you could extend interoperability to arbitrary types, but I can't see any way that is going to happen and it probably isn't a good idea overall for performance reasons.

So I agree the practical way forward would be to write Clojure "wrappers" that extend IPersistentVector etc. if you want to use bifurcan in Clojure.

Dave Dixon

Apr 20, 2017, 11:54:56 AM

to Clojure

Sounds great. If you have time, I'd certainly like to hear your thoughts on the issues of equality semantics and transients. Maybe I can ponder and make some suggestions based on my target use-case.

Zach Tellman

Apr 21, 2017, 12:53:56 AM

to Clojure

Sure, happy to elaborate. Bifurcan offers potential performance wins a few different ways:

* We can use standard Java equality semantics, bypassing all the overhead of the hash calculations and enhanced numeric equality checks (this can lead to moderate performance gains)

* We can use a mutable data structure as long as it never escapes a local context (this can lead to significant performance gains)

* We can use the extra capabilities the data structures expose, like concatenation, slicing, set operations, etc. (this is too dependent on the use case to really quantify)
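The second point in the list above, a mutable structure that never escapes a local context, can be sketched with JDK collections alone. `LocalMutation.frequencies` is a hypothetical example, not code from Bifurcan.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example: the mutable map is only reachable inside the method.
final class LocalMutation {
    static Map<String, Integer> frequencies(List<String> words) {
        Map<String, Integer> counts = new HashMap<>(); // local, mutable
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);       // in-place update
        }
        // Callers only ever see a read-only wrapper; since no other reference
        // to `counts` survives, the result is effectively immutable.
        return Collections.unmodifiableMap(counts);
    }
}
```

Because the mutable reference is dropped when the method returns, no copy is needed to hand out a safe result.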

It would be easy to have `map` and `map*` methods that expose Clojure and Java equality semantics, respectively, but that puts a big onus on the developer to determine whether the latter is safe for their use case. I've been bitten by this when I've used j.u.c.ConcurrentHashMap before, so I expect people would suffer similarly weird bugs.

However, I think there's a way to use the mutable data structures. Technically, transient data structures allow arbitrary persistent data structures to be batch updated, but in practice they tend to be empty, and after they're populated they tend to be treated as read-only.

If we're convinced this is common enough, every empty transient data structure could be mutable, and when we make it persistent we could wrap it in a "virtual" collection [1] which allows updates without touching the base collection. This would allow for faster writes, faster reads, and only marginally slower updates if those are required.
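A rough Java sketch of the "virtual" collection idea: a read-only base map plus a small overlay that records later updates. This is an assumption-laden illustration, not the implementation linked as [1]. `OverlayMap` and its methods are hypothetical, and the overlay is copied on every update for simplicity, where a real implementation would presumably use a persistent structure.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; NOT Bifurcan's implementation.
final class OverlayMap<K, V> {
    private final Map<K, V> base;   // never modified after wrapping
    private final Map<K, V> added;  // entries written since wrapping
    private final Set<K> removed;   // base keys deleted since wrapping

    private OverlayMap(Map<K, V> base, Map<K, V> added, Set<K> removed) {
        this.base = base;
        this.added = added;
        this.removed = removed;
    }

    static <K, V> OverlayMap<K, V> wrap(Map<K, V> base) {
        return new OverlayMap<K, V>(base,
                Collections.<K, V>emptyMap(), Collections.<K>emptySet());
    }

    V get(K key) {
        if (added.containsKey(key)) return added.get(key);
        if (removed.contains(key)) return null;
        return base.get(key); // untouched keys are read straight from the base
    }

    // Returns a new map; copies only the overlay, never the base.
    OverlayMap<K, V> put(K key, V value) {
        Map<K, V> nextAdded = new HashMap<>(added);
        nextAdded.put(key, value);
        Set<K> nextRemoved = new HashSet<>(removed);
        nextRemoved.remove(key);
        return new OverlayMap<K, V>(base, nextAdded, nextRemoved);
    }

    OverlayMap<K, V> remove(K key) {
        Map<K, V> nextAdded = new HashMap<>(added);
        nextAdded.remove(key);
        Set<K> nextRemoved = new HashSet<>(removed);
        if (base.containsKey(key)) nextRemoved.add(key);
        return new OverlayMap<K, V>(base, nextAdded, nextRemoved);
    }
}
```

Reads of untouched keys cost one or two extra lookups before reaching the base, which matches the "only marginally slower" trade-off described above when updates after construction are rare.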

This is all predicated on a bunch of assumptions that are hard to validate, but if this describes enough real-world use cases, it could lead to a big, easy performance win. It's even possible to automatically replace the base Clojure collections with these alternatives using something like Sleight [2].

Anyway, that's what I've been mulling over. If anyone has opinions, I'm happy to hear them.

Zach

[1] https://github.com/lacuna/bifurcan/blob/master/src/io/lacuna/bifurcan/Maps.java#L103

[2] https://github.com/ztellman/sleight

Dave Dixon

Apr 23, 2017, 2:02:01 PM

to Clojure

FWIW, the use-case I have essentially involves Monte Carlo simulations. So we start with a state (non-empty map), and then make a series of modifications to it. Various statistics are held in hash-maps keyed by the state, so there's a lot of lookups and modifications in those maps.

That said, I'm not sure if for this particular case I care too much about using Clojure idioms vs. direct API access. The algorithms tend to be hand-tweaked for performance anyway. The big win for me in wrapping bifurcan would be the ability to use spec without having to write specialized specs, generators, etc.

Zach Tellman

Apr 23, 2017, 4:18:56 PM

to Clojure

Are you relying on the immutability of these structures, or are they effectively always transient?

Dave Dixon

Apr 24, 2017, 11:49:00 AM

to Clojure

Both, actually. The algorithm incrementally builds a tree via simulation. There's a step of traversing the current version of the tree to find a leaf, which requires immutability, as I have to remember the path taken (the path is generated via simulation, not by walking an existing tree structure, so it requires updates to the state maps). But once at a leaf, I run multiple simulations in parallel, and provided each starts with an independent mutable "copy", the subsequent states can be mutable. That should be a big perf win, since it's where the algo spends a lot of its time. The data structures that hold the statistics about the states in the tree could also be mutable. These are keyed by state maps, so I'm hoping improved hash/equality performance will help a little there as well.
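The parallel step described above might look roughly like this in Java, using JDK collections only. Everything here is hypothetical: `ParallelRollouts`, the `"visits"` and `"seed"` keys, and the trivial rollout body stand in for the real simulation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: each rollout mutates its own private copy of the leaf.
final class ParallelRollouts {
    static Map<String, Integer> rollout(Map<String, Integer> leaf, int seed, int steps) {
        Map<String, Integer> state = new HashMap<>(leaf); // independent mutable copy
        for (int i = 0; i < steps; i++) {
            state.merge("visits", 1, Integer::sum);       // in-place update
        }
        state.put("seed", seed);
        return state;
    }

    // The shared leaf is only read, so the rollouts can run concurrently.
    static List<Map<String, Integer>> simulate(Map<String, Integer> leaf,
                                               int rollouts, int steps) {
        return IntStream.range(0, rollouts)
                .parallel()
                .mapToObj(seed -> rollout(leaf, seed, steps))
                .collect(Collectors.toList());
    }
}
```

Here the copy is O(n) per rollout. The appeal of a structure with cheap conversion between immutable and mutable forms is that this initial copy could be avoided.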

Sophia Gold

Apr 28, 2017, 9:38:05 PM

to Clojure

I'm a bit late to this, but it caught my eye.

The only common use case I have for mutating data structures in Clojure is when storing state in a global map (similar to Om.Next), but I almost always make them atomic to account for nondeterminism in the order operations on them will finish. Would the performance gains of Bifurcan's Map and IntMap over Clojure's PersistentHashMap hold for atomic versions?
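For reference, the "atomic version" pattern in question can be sketched in Java as an `AtomicReference` holding an immutable map, similar in spirit to a Clojure atom. `AtomicState` and `assoc` are hypothetical names, and the O(n) copy merely stands in for a persistent map's update.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of an atom-like holder around an immutable map.
final class AtomicState<K, V> {
    private final AtomicReference<Map<K, V>> ref =
            new AtomicReference<>(Collections.<K, V>emptyMap());

    Map<K, V> snapshot() {
        return ref.get();
    }

    // updateAndGet retries the function if another thread won the race,
    // so the function must be free of side effects.
    Map<K, V> assoc(K key, V value) {
        return ref.updateAndGet(current -> {
            Map<K, V> next = new HashMap<>(current); // stand-in for a persistent update
            next.put(key, value);
            return Collections.unmodifiableMap(next);
        });
    }
}
```

In this pattern the value inside the reference has to stay immutable, so any gain would have to come from the immutable variants and their update cost, not from the mutable ones. Whether the benchmarked gains carry over is not something this sketch can answer.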

Faster set operations might be useful for me as well, since I often end up rolling my own, although less so for maps and more for vectors with a small number of values so not sure whether that's applicable here.