[erlang-questions] towards a unified dict api

58 views
Skip to first unread message

Richard Carlsson

unread,
Dec 23, 2011, 5:09:00 PM12/23/11
to erlang-q...@erlang.org
A thing I've been tinkering with on and off for years is making a
unified API for dictionaries in Erlang (dict, orddict, gb_trees, ets,
dets). This requires figuring out a set of function names and calling
conventions that are mostly familiar but which don't already have a
conflicting definition in one or more of the modules involved.

In the end I went for using most of the dict module API unchanged, with
some new synonyms, and a number of additional useful functions. I also
made the dict module define a new 'dict' behaviour (even though it's
just an interface rather than a complete behaviour).

One particular detail is that gb_trees (with its user-unfriendly name
and rather different calling conventions) can now quite simply be used
through the dict module as an ordered variant of dict, and you can
pretend you never heard of the gb_trees module unless you want to use
one of its specially implementation-dependent functions. An ordered dict
can be created through dict:new([ordered_set]). This also resolved some
major problems with clashing function definitions.

The code (based on the OTP maint branch) can be found here:

https://github.com/richcarl/otp/tree/dict-api

The quickest way of having a look at the suggested API is to follow this
link and scroll down to see the behaviour callback:

https://github.com/richcarl/otp/blob/dict-api/lib/stdlib/src/dict.erl


Reactions, suggestions, and test pilots are most welcome. I'm not ready
to write an EEP until I feel the API is good enough. Things still on the
TODO-list are: updating the docs for ets and dets, figuring out a good
set of iterator functions, and maybe include QLC table/1/2 functions and
match/select functions in the API. And then the same kind of unified API
should be made for sets and probably also for sequences.

A Good Yule to y'all,

/Richard
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Fred Hebert

unread,
Dec 23, 2011, 6:04:11 PM12/23/11
to Richard Carlsson, erlang-q...@erlang.org
I'm wondering how that changes impact modules like 'sets', 'ordsets' and 'gb_sets', where ordsets were based on the same data structure as orddict, sets the same as dict, and gb_sets were based on gb_trees' structure. There was some kind of general relationship between all these modules that is now being lost.

Otherwise, that sounds good.

Andrew Berman

unread,
Dec 23, 2011, 7:43:47 PM12/23/11
to Richard Carlsson, erlang-q...@erlang.org
I like this idea. Coming from Java, I'm used to interfaces and abstract classes whereby Set, Map, List, etc. implementations can share the same basic API except if you need to use something specific to that implementation.  So, basically you want to mimicking the behavior of the Map interface which has multiple implementations below it of HashMap, TreeMap, etc.  I think it greatly simplifies usage and makes switching between implementations far easier.  The same can/should be done with sets too, no?

Eric Merritt

unread,
Dec 23, 2011, 9:26:31 PM12/23/11
to Andrew Berman, erlang-q...@erlang.org
When Torben and I did signatures (something akin to java interfaces)
last year for erlware_commons we went through the same process.

You can see the dictionary signature here

https://github.com/erlware/erlware_commons/blob/master/src/ec_dictionary.erl

which contains what we came up with along those lines.

signatures are our attempt to provide, standard interfaces that are
friendly to tools like dialyzer and xref. The best part about them is
they don't require language changes.

T

Witold Baryluk

unread,
Dec 29, 2011, 4:39:00 AM12/29/11
to Richard Carlsson, erlang-q...@erlang.org
On 12-23 23:09, Richard Carlsson wrote:
> A thing I've been tinkering with on and off for years is making a
> unified API for dictionaries in Erlang (dict, orddict, gb_trees,
> ets, dets). This requires figuring out a set of function names and
> calling conventions that are mostly familiar but which don't already
> have a conflicting definition in one or more of the modules
> involved.
>
> In the end I went for using most of the dict module API unchanged,
> with some new synonyms, and a number of additional useful functions.
> I also made the dict module define a new 'dict' behaviour (even
> though it's just an interface rather than a complete behaviour).
>
> One particular detail is that gb_trees (with its user-unfriendly
> name and rather different calling conventions) can now quite simply
> be used through the dict module as an ordered variant of dict, and
> you can pretend you never heard of the gb_trees module unless you
> want to use one of its specially implementation-dependent functions.
> An ordered dict can be created through dict:new([ordered_set]). This
> also resolved some major problems with clashing function
> definitions.
>
> The code (based on the OTP maint branch) can be found here:
>
> https://github.com/richcarl/otp/tree/dict-api
>

I agree dict API is way to go, because it is so used. However I would be
against introdusing any synonymous, it may start confusing.

What I do not link in dict API is how update/4 function behaves.
It is called like update(Key, Fun, Initial, Dict) -> Dict.
Problem is that if Initial is some sort of complex and costly
to compute initial value, then it will be most of the time
wasted, because it will be not needed.

THis is why I often use function like this:

update_full(Key, Fun, FunInit, Dict) when is_function(Fun, 1), is_function(FunInit, 0) ->
case dict:is_key(Key, Dict) of
true ->
dict:update(Key, Fun, Dict);
false ->
Val = FunInit(),
dict:store(Key, Val, Dict)
end;

Of course integrating it directly into dict will make it even faster, by
not needing to traverse datastructure twice.

I'm not really sure if integrating abstract collections into existing
dict module is good idea. It should not break compatibility, but it will
for sure bring some performance hit (due pattern matching in the dict
module and delegation to other modules).

I think dict module should be leaves as is, and new module should be
introduced, like gen_dict. Sure in some sense, it is easier to just
find all dict:new() using simple grep, and change it to dict:new([...]),
where appropriate without worring about other call sites, but if for some
reasons one changes dict:new([dict]), to something else, some functions
may subtelly change how they work, so I think it should not messed too much.

And running sed s/\bdict:\b/gen_dict:/ or similar things, shouldn't be hard,
but will make sure developer understand it can be:
1) slightly slower, due additional level of indirection
2) have slightly different API

In fact I prepared much simpler (but ugly) wrapper years ago, and often
I use it when I know I needs to store and access randomly potentially
lots of elements, but want to test if it is actually faster than using
simple orddict, or proplist or gb_trees or own datastructures:

https://github.com/baryluk/common_collection % I have somewhere on disk even newer version, will need to search

No tests, no real documentation, no nothing, no simple way of adding
collections without editing file (howver it should be simple to fix). I
will for sure write it now in different way (I was using for some time
a parametrized modules as wrappers to dict and proplists also, but because
of unknown future of them Erlang - I personally like them very much! -
i stoped using them, despite being even simpler to use). But it works.

It looks that https://github.com/erlware/erlware_commons/tree/master/src
is more modular and easier to extend by other persons without changing
source code. It probably is faster. But it lacks many very useful
functions, as well is not separate project on its own (as it should be).

--
Witold Baryluk
JID: witold.baryluk // jabster.pl

Richard Carlsson

unread,
Dec 29, 2011, 12:35:46 PM12/29/11
to Witold Baryluk, erlang-q...@erlang.org
On 2011-12-29 10:39, Witold Baryluk wrote:
> I agree dict API is way to go, because it is so used. However I would be
> against introdusing any synonymous, it may start confusing.

The synonyms I was referring to are only the old names for certain
functions in the dict module. For backwards compatibility reasons, they
cannot simply be removed from dict.erl, but new names are needed for the
unified API to avoid collisions with functions in other modules (or, in
a few cases, to allow better names and more consistent argument order).
The suggested unified dict API does not in itself contain more than one
name for any function.

> What I do not link in dict API is how update/4 function behaves.
> It is called like update(Key, Fun, Initial, Dict) -> Dict.
> Problem is that if Initial is some sort of complex and costly
> to compute initial value, then it will be most of the time
> wasted, because it will be not needed.

This could be a good point. The function update/4 is part of the
existing dict module and cannot be changed, but my suggested new
function map/4 could be made to take an initialization function (Key) ->
term() instead of just a term to be stored.

> I'm not really sure if integrating abstract collections into existing
> dict module is good idea. It should not break compatibility, but it will
> for sure bring some performance hit (due pattern matching in the dict
> module and delegation to other modules).

And this is definitely not the intention. The only delegation that is
done is in order to save the user from knowing about the gb_trees
module. The dict module becomes more like ets, in that you can choose
between a hashed and an ordered dictionary, but you shouldn't have to
know anything about the implementation. Other kinds of dictionaries such
as ets and orddict only implement the same interface; you are not
supposed to be able to use the dict module to manipulate an orddict or
an ets table. This is not object oriented programming. If you want to do
that kind of wrapper where the data structure knows its implementation
and does its own dispatch, you can build that yourself on top of this API.

> I think dict module should be leaves as is, and new module should be
> introduced, like gen_dict. Sure in some sense, it is easier to just
> find all dict:new() using simple grep, and change it to dict:new([...]),
> where appropriate without worring about other call sites, but if for some
> reasons one changes dict:new([dict]), to something else, some functions
> may subtelly change how they work, so I think it should not messed too much.

You should not have to change anything in your code. Existing calls to
dict:new() work as they are, since the default is still to create a
plain old hashed dictionary. Only if you want to create an ordered
dictionary do you need to call dict:new([ordered_set]). You do not have
to specify a callback module as in your example dict:new([dict]).

I agree that if you want to make a more object oriented implementation
with method dispatch, another module name should be used for that. But
what I have suggested is just a unified API for the existing dictionary
data types.

/Richard

Steve Davis

unread,
Dec 29, 2011, 7:25:14 PM12/29/11
to erlang-q...@erlang.org
Once, I would have strongly agreed with you. Now, I am not so sure
that this tenet is true for many cases. Specifically what guides this
change of heart for me is the painful progress in RPC/CORBA/JRMI
abstractions over the years. I suspect that it also applies to data/
state sources.

My 2c.
/s

On Dec 29, 11:35 am, Richard Carlsson <carlsson.rich...@gmail.com>
wrote:
> ...but you shouldn't have to know anything about the implementation.

Richard Carlsson

unread,
Dec 30, 2011, 5:26:45 AM12/30/11
to erlang-q...@erlang.org
On 2011-12-30 01:25, Steve Davis wrote:
> On Dec 29, 11:35 am, Richard Carlsson<carlsson.rich...@gmail.com>
> wrote:
>> ...but you shouldn't have to know anything about the implementation.
>
> Once, I would have strongly agreed with you. Now, I am not so sure
> that this tenet is true for many cases. Specifically what guides this
> change of heart for me is the painful progress in RPC/CORBA/JRMI
> abstractions over the years. I suspect that it also applies to data/
> state sources.
>
> My 2c.
> /s

And I basically agree, which is why I'm not aiming for an object
oriented approach. But I think that letting the dict module have an
ordered version, just like ets tables, is a good idea. You still get to
choose between O(n) time for an unordered table and O(log n) time for an
ordered table when you initialize the dict, and from a usability
perspective it means that for most cases where you want a functional
dictionary data type, the dict module is all you need to know.

I couldn't get a good unified dict API until I decided to remove the
gb_trees module from the equation by making it accessible via dict
instead, because its current API is too different from the existing
dict/orddict API and the name clashes got too difficult to resolve. I
could of course also invent a new name for gb_trees, but that seems
unnecessary.

/Richard

Robert Virding

unread,
Jan 12, 2012, 12:31:00 PM1/12/12
to Richard Carlsson, erlang-q...@erlang.org
I think that in many cases you DO want to know about the implementation as different algorithms have different properties and choosing the right one can have a significant effect on an application. This was the reason we have different versions with the same API, so you could test and choose between them and find the best one. Unfortunately it took a long time before there came a new one after the original two.

I think it is good to have at least one tree based version as well. Though I would much prefer to have it in a separate module (gbdict?) instead of baking it into the dict version. I prefer to be explicit rather then "hiding" behind one module. In most case you do really want to know what is going on. I have two tree versions that I have long been meaning to include in the distribution, rbdict based on Red-Black trees and aadict based on the same algorithms as gb_trees but implemented in a different way. The more the merrier.

But as I said I strongly feel it is much better to have separate modules. It also makes it easier to add versions, which as you can probably understand I think there should be.

I have not yet looked through the suggestion for extending the API, but I will shortly.

Robert

José Valim

unread,
Jan 13, 2012, 6:44:15 AM1/13/12
to Robert Virding, erlang-q...@erlang.org
I agree with the general direction of creating another module called gen_dict that adds a level of indirection instead of changing dict implementation. So developers can choose between transparency (gen_dict) and performance (dict).

But the truth is that Erlang lacks the proper abstractions to (elegantly) solve this problem. Even if we create a module that properly wraps gb_trees, orddict and dict, if the developer decides to create a new dictionary (for example, a red black tree based implementation) there is no way the developer will be able to hook his own dictionary into the gen_dict API. An abstraction like Clojure's protocols would be able to solve this more elegantly.

Tuple modules could be a possible solution to this problem, so one would be able to call Dict:find(Key) and not worry about what a Dict really is, but they wouldn't work for orddicts or gbtrees unless they are rewritten to be in the tuple module format and I doubt this is a direction people would like OTP libraries to move towards to (I certainly don't).

Richard Carlsson

unread,
Jan 13, 2012, 8:57:08 AM1/13/12
to José Valim, erlang-q...@erlang.org

There seems to be a lot of confusion going on here between *interface*
(API) and *dispatch* (redirecting the API calls to the actual
implementation depending on the data structure). I have only suggested a
unified API; not a generic dispatch mechanism.

"Tuple modules" (please call them Parameterized modules or Abstract
modules - the implementation-as-tuples is a temporary hack, and you
should treat them as opaque objects just like funs) are not needed,
because there is nothing here to parameterize. If all dict-like modules
implement the same API, it's enough to use the module's name. For example:

MyDict = orddict,
...
D = MyDict:new(),
...

What you *could* do is to parameterize the module that uses the dictionary:

-module(my_module, [SomeDict]).
...
foo() ->
D = SomeDict:new(),
...

and then you could easily change implementation as you like:

M1 = my_module:new(orddict),
M1:foo(),
M2 = my_module:new(ets),
M2:foo(),
...

That said, I think Clojure's protocols (which are a way of dynamically
getting the same effect that you get with Haskell's type classes) are a
neat idea that would probably work well with Erlang. But that's a
different story.

José Valim

unread,
Jan 13, 2012, 9:10:16 AM1/13/12
to Richard Carlsson, erlang-q...@erlang.org
There seems to be a lot of confusion going on here between *interface* (API) and *dispatch* (redirecting the API calls to the actual implementation depending on the data structure). I have only suggested a unified API; not a generic dispatch mechanism.

If the main purpose is an unified API, agreed. Thanks for clearing it up.

"Tuple modules" (please call them Parameterized modules or Abstract modules - the implementation-as-tuples is a temporary hack, and you should treat them as opaque objects just like funs).

I thought it was the opposite. Parameterized modules were the temporary hack and the implementation-as-tuples/"tuple modules" were here to stay.
 
That said, I think Clojure's protocols (which are a way of dynamically getting the same effect that you get with Haskell's type classes) are a neat idea that would probably work well with Erlang. But that's a different story.

Agreed as well. Regardless, protocols should be possible to implement in Erlang via parse transforms. I have implemented it for a language of top of the Erlang VM I am working on, but it is not officially released yet. Hoping to talk more about it on Erlang Factory Lite in Kraków this march.

Richard Carlsson

unread,
Jan 13, 2012, 9:21:15 AM1/13/12
to Robert Virding, erlang-q...@erlang.org
On 01/12/2012 06:31 PM, Robert Virding wrote:
> I think that in many cases you DO want to know about the
> implementation as different algorithms have different properties and
> choosing the right one can have a significant effect on an
> application. This was the reason we have different versions with the
> same API, so you could test and choose between them and find the best
> one. Unfortunately it took a long time before there came a new one
> after the original two.
>
> I think it is good to have at least one tree based version as well.
> Though I would much prefer to have it in a separate module (gbdict?)
> instead of baking it into the dict version. I prefer to be explicit
> rather then "hiding" behind one module. In most case you do really
> want to know what is going on. I have two tree versions that I have
> long been meaning to include in the distribution, rbdict based on
> Red-Black trees and aadict based on the same algorithms as gb_trees
> but implemented in a different way. The more the merrier.
>
> But as I said I strongly feel it is much better to have separate
> modules. It also makes it easier to add versions, which as you can
> probably understand I think there should be.

It is of course possible to *also* give gb_trees a new interface (which
requires a new module name) although I didn't want to do that as part of
this exercise. Still, I think that it's a good thing to just make the
dict module itself offer both a hashed form and an ordered (tree based)
form, because it makes Erlang more approachable. These things are not
mutually exclusive.

The actual implementation used for the hash and tree forms could change
between Erlang releases depending on what is deemed to be the most
efficient; after all, the current hash table implementation is
explicitly said to be opaque and might be replaced with some more
efficient variant one day.

But for the basic Erlang user, I think it ought to be enough to know
that the dict module covers most of his needs, both for ordered and
unordered dictionaries. If he later wants to optimize his code, he can
then familiarize himself with the other N dict-like modules in the
distribution and pick one that might be better for his use case. If they
all speak the same API, swapping them should then be simple.

Robert Virding

unread,
Jan 18, 2012, 3:54:09 AM1/18/12
to Richard Carlsson, erlang-q...@erlang.org

Yes, the most important thing is definitely that the various dict modules all have the same API. My thought with having separate modules for each implementation would be that it would be easier to add new and different implementations. You could have versions tuned to different types of usage profiles. Even if you add the ordered tree option to 'dict' I think there should still be a version of it in a separate module, like gbdict. For backwards compatibility.

Robert

Tim Watson

unread,
Feb 28, 2012, 8:42:27 PM2/28/12
to Richard Carlsson, erlang-q...@erlang.org
On 13 Jan 2012, at 13:57, Richard Carlsson wrote:
There seems to be a lot of confusion going on here between *interface* (API) and *dispatch* (redirecting the API calls to the actual implementation depending on the data structure). I have only suggested a unified API; not a generic dispatch mechanism.

"Tuple modules" (please call them Parameterized modules or Abstract modules - the implementation-as-tuples is a temporary hack, and you should treat them as opaque objects just like funs) are not needed, because there is nothing here to parameterize. If all dict-like modules implement the same API, it's enough to use the module's name. For example:

  MyDict = orddict,
  ...
  D = MyDict:new(),
  ...

What you *could* do is to parameterize the module that uses the dictionary:

 -module(my_module, [SomeDict]).
 ...
 foo() ->
     D = SomeDict:new(),
     ...

and then you could easily change implementation as you like:

 M1 = my_module:new(orddict),
 M1:foo(),
 M2 = my_module:new(ets),
 M2:foo(),
 ...

That said, I think Clojure's protocols (which are a way of dynamically getting the same effect that you get with Haskell's type classes) are a neat idea that would probably work well with Erlang. But that's a different story.

A quick question Richard, if I may. How do you see this looking in the consumer's (Erlang) code? I can see a bunch of ways to make the definition available and do codegen from it etc, but it's the consumer who's suffering at the moment for lack of a decent abstraction. So far I've thought about having the 'protocol' declaration whilst I'm build and in a parse transform (or post build code instrumentation step, even at runtime I suppose) looking for an implementation that matches the call site. Rewriting the call to the 'appropriate' handler isn't hard, but the problem I'm having is understanding what is 'the right way' to resolve the implementation that should be used. 

I do also worry that without some kind of 'namespaces' concept, this could get messy in a hurry. Perhaps the solution to that is to allow the consumer to control the implementations which are in scope, but again, I'm not sure whether that's even sensible let alone usable.

I'd love to know your thoughts about this, as I'm on a mission to rid my Erlang code of voluminous boilerplate and this issue of dynamically binding an API call to an implementation is very high on my hit list.

Cheers,

Tim 

Tim Watson

unread,
Feb 29, 2012, 8:35:43 AM2/29/12
to Tim Watson, erlang-q...@erlang.org
And just to clarify, my question was about having something like Clojure's protocols in Erlang. Apologies if that wasn't clear at first!
Reply all
Reply to author
Forward
0 new messages