[Proposal] Sorted Map and Sorted Set

Skip to first unread message

Ricky Han

May 24, 2017, 5:10:48 AM5/24/17
to elixir-lang-core
The most important data structure elixir is missing - sorted, ordered sets and maps. Say what? Elixir only has non-ordered sets. The default(recommended) map is HashMap whereas Haskell Data.Map is backed by binary tree. JVM, stdlib, .NET are treemaps too. This is strange because there is 0 penalty for functional language to use trees.(sidenode: Redis uses skip lists which is bad for immutability).

These are actually several solutions out there:

:orddict, :ordset, :gb_sets, :gb_trees

:orddict and :ordset are very bad as they are backed by lists and have linear time complexity.

:gb_sets and :gb_trees are performant because they are backed by AA trees. However, annoyingly they don't track size of subtrees which means can't be indexed unless converted to list(expensive).

Here is why this would be better than all the existing solutions and should be integrated into elixir stdlib.

1. Being able to index keys means we can it can replace Redis sorted set completely. With ZRANK, ZRANGE abilities
2. All the other languages have it
3. Proves to be fast and space efficient
4. First class citizen so no need to use inferior Erlang library with limited functionalities and 1990s documentation

I find sorted sets and maps very useful in other languages(especially ruby). So like a good hacker I ported red black tree from Haskell[1] Currently, both data structures are functional but documentations are few and far between.

José Valim

May 24, 2017, 7:20:58 AM5/24/17
to elixir-l...@googlegroups.com
Thanks for the proposal Ricky Han!

It is also worth mentioning the red black trees library from Robert Virding: https://github.com/rvirding/rb

While having a library is already great for the ecosystem, we can consider this being added as part of Elixir given Elixir doesn't have an indexed data structure besides tuples (which are expensive to transform).

However, there is a lot of work to do before we are able to fully evaluate it:

1. As you said, it needs documentation. You mention the 1990 look and feel from the Erlang libraries but they are at least *documented*

2. We need to validate the claims it is fast and space efficient. Have you benchmarked it against gb_trees and gb_sets and measured insertion/retrieval/deletion times as well as memory usage? For example, your implementation uses maps for nodes and using tuples will likely be much more efficient

The core team has discussed adding indexed data structures multiple times in the past but we haven't found something that feels right.

José Valim
Skype: jv.ptec
Founder and Director of R&D

You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-core+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/5b22966b-3f82-4446-b494-ec06d892991c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ricky Han

May 28, 2017, 3:19:59 AM5/28/17
to elixir-lang-core, jose....@plataformatec.com.br
Hi José,

I posted an update on the benchmark here [1].

To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.

José Valim

May 28, 2017, 4:00:12 AM5/28/17
to Ricky Han, elixir-lang-core
Thanks for the numbers! And let's please continue the discussion in the mailing list.

At the current benchmark numbers, we don't believe the current implementation is ready for being added to Elixir. I would expect them to perform at least in the same order of magnitude as gb_trees or gb_sets.

Given the current results, it is also worth comparing them to orddict and ordsets. The last two will definitely be slower for larger numbers but they are often the fasted up to certain sizes as they generate less garbage. You can see the previous HashDict and HashSet implementations in Elixir to see how they performed resizing and other optimization ideas.

It is also worth repeating that we have already explored solutions to this problem and we haven't found an answer we were comfortable with, so we definitely don't want to rush a solution here.

José Valim
Skype: jv.ptec
Founder and Director of R&D

Robert Virding

Jun 1, 2017, 6:09:25 AM6/1/17
to elixir-lang-core, jose....@plataformatec.com.br
I am just curious: what is your use case?

If you look in my luerl system, https://github.com/rvirding/luerl, you will see a module ttdict.erl which uses 2-3 trees. These are basically the same as red-black trees. Well, actually, rb trees are sort 2-3 trees but only using binary nodes. I have not done any serious speed comparison but my guess is they should be about as fast s rb trees. An easy way to handle the slowness of size would be to carry around an explicit size field. the 2-3 trees and rb trees have the same interface as dict.

For fun I also implemented aa trees and sets, which Arne Andersson trees.

Data structures are fun.


P.S. I use 23-trees in luerl as maps are missing one very important function needed for implementing Lua and that is to be able to efficiently step through a tree. I need a 'first' function which returns the first key-value pair and then a 'next' which returns the next pair after a given key. Order is unimportant but I need guarantees that I will see all pairs.

Wiebe-Marten Wijnja

Jun 16, 2017, 5:20:28 PM6/16/17
to elixir-lang-core, jose....@plataformatec.com.br
The 'indexed data structure' that is often used now in many projects is the built-in (Hash)Map that the newer versions of Erlang/OTP provide. The keys of a map kan be seen as a poor-man's pointers if you want to have (amortized) constant field access and update behaviour data structure. This is for instance the approach I took in the implementation of sparse vectors/matrices/tensors in the Tensor library (https://github.com/qqwy/tensor). The nice thing about offloading most of the work to a built-in data structure is that the handling of this data structure happens in a compiled piece of code, that is furthermore allowed to 'cheat' the functional rules (as long as the final outcome remains pure).

It is worth noting that many of Erlang's built-in tools are still of the time that Erlang did not have map-support, which means that there are now more options available to us to implement certain data structures differently.
(That being said, maps are definitely slower than plain tuples at least up to a certain size and depending on the kind of structure you want to build and the guarantees it should have.) Benchmarking will show us what happens in practice.

I'd love there to be an Elixir-variant of Haskell's Data.Sequence (which is built on top of 2-3 trees and has O(1) element appending and prepending, and O(min(log(length(seq1), length(seq1))) concatenation) and Patrizia trees+sets.
On a related note, I've lately done a little bit of work to write out the different efficient functional FIFO queues (and deques) that Okasaki wrote about, both the trivial amortized O(1) variant (which is similar to what :queue does now) and the less trivial hard real-time O(1) variant. I plan to benchmark these against :queue and each other, to see for what input which version works the best.

Data structures really are fun! :D

Charles Lanahan

Mar 1, 2024, 7:21:20 PMMar 1
to elixir-lang-core
Would love this to be available.  Was this ever explored in more depth?

José Valim

Mar 2, 2024, 3:17:43 AMMar 2
to elixir-l...@googlegroups.com
The recommendation is definitely for those to be developed as packages and made available to the community as such.

For more information on Elixir development, please check this link: 

Reply all
Reply to author
0 new messages