Proposal: Add new Protocols, specifically Dict

177 views
Skip to first unread message

Ed W

unread,
Apr 15, 2015, 10:14:39 AM4/15/15
to elixir-l...@googlegroups.com
Hi, as per a discussion on elixir-lang, it seems theoretically more elegant to me if we had Protocols for certain Elixir datatypes, starting with Dict

As per the documentation for Dict, the core API is simply:
    * `delete/2`
    * `fetch/2`
    * `put/3`

and it seems to me for performance reasons we should also consider the following core:
    * update
    * size/1
    * equal?/2

The above is a compact API and default implementations of the remainder of the Dict interface can be built using these definitions, albeit performance reasons may encourage writing specific protocol handlers for a given datatype

Motivation:
- The Dict interface is fundamantally "key value store" and that's quite a common and well understood interface. 
- Making Dict a protocol rather than a behaviour allows developers to wrap any third party datatype in a common interface.
- Obviously the developer also gets to keep any concerns arising from forcing a Dict interface onto the datatype, but the point is that this doesn't need to concern the original author of the datatype, ie you don't need the code or to alter the upstream datatype
- To my eye, Dict is a common interface that I would like to make use of more widely?

- Additionally there *is* already a polymorphism implementation in the Dict interface.  If you call Dict.xx on a map then it forwards that call to Map.xx
- So at present there are two styles of polymorphism in active use, using Protocol, and using Behaviour with target() based dispatch.  However, the latter does not allow a developer to easily wrap a thirdparty datatype with the given interface functions (that's the magic of a Protocol)

- Splitting out the interface should help keep it clean and tight. I think at present I tend to see HashDict and Map and fail to realise they are really the same thing.  I find it easier to resist packing ever more functionality into a generic interface, because I can think of it in new contexts. Essentially I find it simpler to see the core/generic/high-level requirements and not get pulled into the implementation details.


Observations:
- It might be worth splitting Dict into two interfaces? I see that the core operations delete/fetch/put/update look like a core API that I might want to use elsewhere, eg for my Array or Queue protocol? 
- Perhaps there is room for a core "Access" interface (I realise that word is already taken) which implements get/put and the specific Dict interfaces are a secondary Protocol?  This feels overkill, but encouraging re-use of the same words fetch/put for the same meaning in all datatypes is surely a win?

- "size/1" is repeated in the Enum interface as count/1.  Do we need both? If Dict implementations were required to implement Enum then it seems we could strike size from the Dict interface?
- size/1 feels like a function which can be significantly optimised for many datatypes so worth worrying about?
- size/1 feels almost like a protocol in its own right? We already have size & count, should we be encouraging a standardisation on the word choice for new datatypes?

- equal?/2 feels like a function which can be significantly optimised for certain datatypes.
- equal?/2 feels VERY MUCH a protocol in its own right?  Is there an argument for including it in Enum or even as its own protocol?

- update is very much an optimisation, but for many datatypes it might be an essential optimisation.
- Should it be considered a "core" part of the interface


Example of Map
The core protocol for Dict is quite small.  In practice most of the derived Dict types are implemented generically. eg, Specifically looking at Map the entire module can be abbreviated approx as follows (I have removed merge, could possibly remove has_key?):

  def size, do: maps.size(map)
  def has_key?(map, key), do: :maps.is_key(key, map)
  def fetch(map, key), do: :maps.find(key, map)
  def put(map, key, val), do: :maps.put(key, val, map)
  def delete(map, key), do: :maps.remove(key, map)
  def equal?(%{} = map1, %{} = map2), do: map1 === map2

So Map roughly breaks down as 8 lines of code for the core implementation.  Most of this is pass-through to the erlang :maps implementation. Also with a few exceptions the additional non core functions don't seem to be performance critical.

Checking HashDict for example, we see it's already implemented as a Struct type (ie straightforward to convert to a Protocol)


Reasons Not to?
- There appear to be performance concerns with a Protocol implementation that don't exist with the current
- Anything else?


Anyone want to pour some sauce on this?

Ed W

José Valim

unread,
Apr 15, 2015, 11:02:47 AM4/15/15
to elixir-l...@googlegroups.com
Hi, as per a discussion on elixir-lang, it seems theoretically more elegant to me if we had Protocols for certain Elixir datatypes, starting with Dict

I think this is conceptually wrong. We don't define protocols for datatypes. We define protocols because we are interested in some characteristics and some data-types may provide those characteristics.
 
As per the documentation for Dict, the core API is simply:
    * `delete/2`
    * `fetch/2`
    * `put/3`

and it seems to me for performance reasons we should also consider the following core:
    * update
    * size/1
    * equal?/2

put_new, pop, get_and_update, keys, values would have to be part of the list too. The first three operations can be done in one pass in many data types (instead of two). Regarding keys and values, we could actually have fast mechanisms for retrieving those, specially for small maps, where keys and values are allocated separately and continuously in memory. So there is a chance for optimizing them too.

This points back to what I said before: if we think of a protocol as an abstraction for the datatypes themselves, something will have to give somewhere forcing us to sacrifice performance or API size, and I don't think we should sacrifice any.
 
Motivation:
- The Dict interface is fundamantally "key value store" and that's quite a common and well understood interface. 

So define a KeyValue protocol in your application. It doesn't make sense to put all of Dict into a protocol. You need to decide what you want:

* If you want a dictionary - define all dictionary functions according to the Dict behaviour. You will have to implement a large "interface" but that is what it means to be a dictionary in Elixir

* If you want something like a key-value - define your KeyValue protocol. Different users may expect different properties and operations from their KeyValue operation so define a protocol with functions that make sense to you and your application 

This becomes clear when you ask: "is it worth splitting Dict into two interfaces"? No, it is not. A Dictionary is a single "interface". If you have a data structure that is not a dictionary but can work as key-value, it still is not a dictionary. A dictionary is designed up-front as a dictionary. Dictionaries can, however, implement multiple protocols, including your own protocols.

This is not the first time that someone proposes adding more protocols to Elixir but this is a slippery slope because we can't add a single protocol that makes sense to all applications. For example, take a possible size protocol. Some data structures can calculate their size as O(1). Others will take linear time. Should they all be in the same protocol then? Who draws the line?

We had to take some decisions like those when designing Enum. For example, we assume that everything is linear and, even though some data-structure may optimize count, you should still assume it is linear. This assumption makes sense for Enum because going through Stream makes everything linear anyway. But I wouldn't want to design all language and all protocols based on worst case times! Sometimes it is expected and important to know something will be constant time.

That also happens with Dict. If you are don't care about some properties, go through Dict. But if you expect a particular behaviour, you use the specific implementation. That's why being able to optimize every function is *very* important.

So the solution is always: define your own protocols with the properties that make sense to you.
 
- "size/1" is repeated in the Enum interface as count/1.  Do we need both?

Yes because Dict.size is going to work only with dictionaries.
  
- size/1 feels almost like a protocol in its own right? We already have size & count, should we be encouraging a standardisation on the word choice for new datatypes?

count is a verb. That's why we have Enum.count, it reflects the linear nature of the operation. size is when the size can be retrieved in constant time.
 
Reasons Not to?
- There appear to be performance concerns with a Protocol implementation that don't exist with the current

And that is the final point worth discussing. Note we have an Access protocol and it is going away in Elixir v1.1 because, although they are fast in production, in development they can become a bottleneck when multiple processes are involved. We haven't noticed issues with any other protocol besides the Access one because the Access operations are tiny and common, and we end-up invoking the protocol many, many times, which ended-up causing the performance issues.

Again, design your protocols in a way that makes sense for your application. Shipping with a bunch of tiny protocols, without well-defined characteristics, is going to make your code harder to understand and reason in the long term. You always sacrifice something when you make your code generic.

José Valim

unread,
Apr 15, 2015, 11:05:15 AM4/15/15
to elixir-l...@googlegroups.com
Also, please look at Functors in OCaml, which is another way for doing polymorphism and more inline with our Dict implementation. Although they don't have a dispatching module like Dict, they still allow you to implement your own dictionaries with default implementations while allowing you to customize any function you want. OCaml functors are less general purpose than protocols though (in my opinion).



José Valim
Skype: jv.ptec
Founder and Lead Developer

Chris Keele

unread,
Apr 15, 2015, 12:34:33 PM4/15/15
to elixir-l...@googlegroups.com, jose....@plataformatec.com.br
Out of curiosity, then, why are Enum/Enumerable separate? Is it to maintain parity with Stream?

Saša Jurić

unread,
Apr 15, 2015, 12:49:45 PM4/15/15
to elixir-l...@googlegroups.com
Continuing on the discussion from the "talk" list, I'd like to add that it somehow seems "off" that Dict is not implemented through protocols. I understand the reasons explained in that thread, but as Peter mentioned, default implementations can also be done in the style of  Enumerable.member? and Enumerable.count. Alternatively we may offer the `use` helper with default implementations, so one can do:

defimpl Dictionary, for: MySuperDict do
  use Dict.GenericImplementation

  # Here I need to implement only few functions...
  # ... but I can also override default impls for perf reasons.
end

Having Dict implemented in the current way possibly sends a wrong message and confuses people. Ed was obviously puzzled, and I was myself confused at some point, and even now I'm not sure how would I explain to some newcomer why Dict based polymorphism doesn't rely on protocols.

Another question is whether Dict is needed at all? I seem to recall that the reason for introducing it was related to upcoming introduction of maps. Maybe I remembered wrong, but either way I can't remember the details.

José Valim

unread,
Apr 15, 2015, 1:07:02 PM4/15/15
to elixir-l...@googlegroups.com
Out of curiosity, then, why are Enum/Enumerable separate? Is it to maintain parity with Stream?

To clarify: Enum is *not* a datatype, it is a bunch of functions that work with collections, including dicts. I believe a lot of the confusion  is because we are trying to put things like Enum and Dict in the same bucket.

* Enum and Stream are modules that use the Enumerable protocol.
* Dicts, lists, streams, etc implement the Enumerable protocol

With Dict, we don't have a protocol, we have a datatype specification and functions to work with this specification. For example, if you take Dict.merge, it doesn't simply dispatch to the underlying Dict, it actually has logic to work with different datatypes.

Ed W

unread,
Apr 15, 2015, 1:12:04 PM4/15/15
to elixir-l...@googlegroups.com
Hi


As per the documentation for Dict, the core API is simply:
    * `delete/2`
    * `fetch/2`
    * `put/3`

and it seems to me for performance reasons we should also consider the following core:
    * update
    * size/1
    * equal?/2

put_new, pop, get_and_update, keys, values would have to be part of the list too. The first three operations can be done in one pass in many data types (instead of two). Regarding keys and values, we could actually have fast mechanisms for retrieving those, specially for small maps, where keys and values are allocated separately and continuously in memory. So there is a chance for optimizing them too.

This points back to what I said before: if we think of a protocol as an abstraction for the datatypes themselves, something will have to give somewhere forcing us to sacrifice performance or API size, and I don't think we should sacrifice any.

Understood.... But I thought with a Protocol I *can* have all of the above? I don't see that any of the above isn't true both for a Behaviour or a Protocol?

- Default implementations (non optimal) are provided through the Protocol fallback (just as they are at present through the behaviour __USING__)
- If I need an optimised "get_and_update" for my specific datatype then I just define that protocol instead of relying on the default implementation (just as at present where there are overrides in HashDict/Map)

What am I missing?  I am anticipating that the code would remain identical in HashDict/Map, only the method of dispatching to the functions is what I'm proposing?




Motivation:
- The Dict interface is fundamantally "key value store" and that's quite a common and well understood interface. 

So define a KeyValue protocol in your application. It doesn't make sense to put all of Dict into a protocol.

But why doesn't it make sense to pull all of Dict into a protocol?  Seems like a fine Protocol to me?  It already *is* an interface definition in that its a behaviour, all I'm saying is that its a fine interface and perhaps we should make use of it more widely elsewhere?

Its no problem for me to pull KV/Dict into my own protocol, so will Bob, and so will Dave and Jane and so on and we will all repeat each other.  My observation is that this looks like a re-usable "paradigm", (ie show me a modern language which doesn't have a built in Dict/Hash type?), so why not push it out as a core concept for people to re-use in their projects?

Why is Enum a more obvious Protocol than Dict?


* If you want a dictionary - define all dictionary functions according to the Dict behaviour. You will have to implement a large "interface" but that is what it means to be a dictionary in Elixir

* If you want something like a key-value - define your KeyValue protocol. Different users may expect different properties and operations from their KeyValue operation so define a protocol with functions that make sense to you and your application

Can you explain what difference you see in Dict vs Key-Value?  To me they are synonymous types?

Conversely, the reason I would argue for a clear interface is *BECAUSE* if you aren't careful, one tends to add functions based on some knowledge that "it's implemented as a hash" and forget that you are implementing "a Dict".  ie you get drawn into knowledge of your implementation.  If you aren't careful you will implement things that work well for HashDict and would be problematic for Map say.

Finally, this is a philosophical question because I'm still trying to get my head around functional programming, but:
- Isn't the whole point that we look at FP as about the *function* on the data?
- Ie we think in terms of what we want the function to *do*, and less about what the data is?

This is why I am coming at the problem in terms of "I want to put/3 some value into this datatype".  Of course I also am conscious of efficiency, (but for the vast number of situations efficiency is not a problem.)



This becomes clear when you ask: "is it worth splitting Dict into two interfaces"? No, it is not. A Dictionary is a single "interface". If you have a data structure that is not a dictionary but can work as key-value, it still is not a dictionary. A dictionary is designed up-front as a dictionary. Dictionaries can, however, implement multiple protocols, including your own protocols.

OK, I'm happy with that.  Can we have a Dictionary protocol please?

I'm not really getting why you think Dict isn't a Key-Value? The words feel the same to me?  I look at the docs page for Dict and it basically shows me all the functions I might expect to implement for all kinds of things from a cache store to an advanced priority queue?  For example, if I wanted a generic config parser, I would suggest that Dict is a great starting point for an interface?

I would have thought Erlang was already a great example of how one can bend a few careful interfaces and re-use them in wonderful ways?  We have gen_server which can become almost anything.  I don't quite see why Dict isn't a candidate for careful re-use in all kinds of wonderful ways?




This is not the first time that someone proposes adding more protocols to Elixir but this is a slippery slope because we can't add a single protocol that makes sense to all applications.

Sure... But I don't see a behaviour is any less easy to change than a protocol?  Both are about getting it out there and kicking it around and seeing who can offer some insight?

I'm suggesting to leave the actual functions exactly as it is.  The interface functions looks fine to me whether its a behaviour or a protocol?


For example, take a possible size protocol. Some data structures can calculate their size as O(1). Others will take linear time. Should they all be in the same protocol then? Who draws the line?

Don't see that this makes a difference?  That's a problem for the implementation?  In all cases though the interface will be the same!



That also happens with Dict. If you are don't care about some properties, go through Dict. But if you expect a particular behaviour, you use the specific implementation. That's why being able to optimize every function is *very* important.

Err?  Aren't you arguing FOR protocols here?

If you used a Protocol then you would call the generic interface in all cases and automatically get upgraded to the specific implementation if one was provided?  This is surely BETTER than having to figure out whether you use the generic or specific implementation in your code!




So the solution is always: define your own protocols with the properties that make sense to you.
 
- "size/1" is repeated in the Enum interface as count/1.  Do we need both?

Yes because Dict.size is going to work only with dictionaries.

Sorry, I'm not getting it?  You solve this by just defining a specific count/1 protocol function for Map/HashDict and now count works optimally on all our Dictionary types? Isn't this what you are supposed to do anyway? (define a count function if you can do better than the default reduce implementation)

This is already how it seems to work, eg Enum.count for HashDict simply calls HashDict.size:
    https://github.com/elixir-lang/elixir/blob/master/lib/elixir/lib/hash_dict.ex#L233
Maps appears to call :erlang.map_size (although because it's not a protocol this is tricky to discover as it needs to be implemented in enum.ex...)

So right now I don't see that size gives a different answer to count for the current Dictionary types?  Can you explain when and where it would?  If it doesn't then why do we need both implementations if they give identical answers?



- size/1 feels almost like a protocol in its own right? We already have size & count, should we be encouraging a standardisation on the word choice for new datatypes?

count is a verb. That's why we have Enum.count, it reflects the linear nature of the operation. size is when the size can be retrieved in constant time.

Yeah, but count/1 will have constant time implementations for situations where it's possible?

I'm not sure I understand your point here?  Some future actual Dictionary might really not have a constant time size function available?  eg I could write some tree implementation which doesn't bother to record the number of nodes, I would still have to implement the size function (because it says so in the behaviour spec), but the result wouldn't be constant time just because of the function name? 

I don't see any special reason all Dictionaries will implement constant time size functions (although I concede they would be foolish not to...).  Likewise I would expect many other non dictionary structures to be able to implement Enum.count in constant time


You can ignore the rest of my email, but I really think you should reconsider if we need Dictionary.size, given that Enum.count is implemented optimally for all Dictionary types?

Likewise almost EVERYONE needs an equal?/2 function for their datatype don't they?!  Why don't we bang that in as a standard protocol right away??!



Reasons Not to?
- There appear to be performance concerns with a Protocol implementation that don't exist with the current

And that is the final point worth discussing. Note we have an Access protocol and it is going away in Elixir v1.1 because, although they are fast in production, in development they can become a bottleneck when multiple processes are involved. We haven't noticed issues with any other protocol besides the Access one because the Access operations are tiny and common, and we end-up invoking the protocol many, many times, which ended-up causing the performance issues.

Performance is a superb reason not to do this.  Is this the real answer? In which case I learn from this not to use Protocols in my performance critical code?  (Remember I started out by asking a question about when I should use Protocols and when I should use Behaviour with a handrolled polymorphism a-la Dict)



Again, design your protocols in a way that makes sense for your application. Shipping with a bunch of tiny protocols, without well-defined characteristics, is going to make your code harder to understand and reason in the long term. You always sacrifice something when you make your code generic.

Hmm, most of the world would argue that abstracting things generally helps clean them up?  I could implement some tree direct in my datastructure, but it's usually a win to say abstract it out into a dictionary a-like type and simplify the interfaces.

I would strongly argue that the Hash, (and Array) interface is an *extremely* common pattern to implement?  For example, its practically mandatory in Perl to offer a tied hash interface to any interesting module. Wierd random example to illustrate: the link below shows how you can search Google through a hash type datatype (erk!):
    http://search.cpan.org/~darren/Tie-Google-0.03/Google.pm
I won't hold up Perl as an example of why Elixir should do something (!), but to say that the Hash/Dict pattern is universal and extremely widely used seems like an understatement? 

Cheers

Ed W

José Valim

unread,
Apr 15, 2015, 1:24:54 PM4/15/15
to elixir-l...@googlegroups.com
Continuing on the discussion from the "talk" list, I'd like to add that it somehow seems "off" that Dict is not implemented through protocols.

I am not sure how else I can answer this question because, in my mind, I have expressed all concerns. Or I am not being clear (which is completely possible) or you don't value the trade-offs in the same why as I do (which is completely fine). So I am glad to give this one last try.

* Large protocols are smell. Why? Because a protocol should be a minimum set of functions that provide some particular characteristics. If you have a large protocol, it means you have failed to provide this specification. If the current Dict implementation is confusing, I would still choose a confusing Dict implementation than opening a precedent or giving the example of a large protocol.

* Performance. The Dict module does not work as a protocol. For example, Dict.merge does not do a custom dispatch but has custom logic for merging different types. So we would need a Dict module as front-end, that calls the DIctionary protocol. This brings us to the performance issues that lead us to remove the Access protocol in the first place and we would likely hit those with a Dictionary protocol too.

* Conceptually, don't treat dictionaries or datatypes as a protocol. They are the definition of a datatype. I agree this is very fuzzy because we don't have a name for this concept in Elixir and things without name are confusing. I recommending reading about OCaml functors to see if maybe it helps.

To make the last point a bit clearer, consider the following case. With dictionaries, sometimes I want to use the functions in Dict and sometimes I want to use the function in the implementation, like HashDict or Map. In fact, I use HashDict and Map directly more than I use the Dict one. If we had protocols, we would need three modules:

* Dict (polymorphic)
* HashDict (the datatype)
* Dictionary.HashDict (the implementation of the protocol)

So I would call HashDict.new to create a new HashDict and I could call Dict to use the polymorphic implementation. However, if I wanted to use directly the HashDict module, or I would need to:

1. use Dictionary.HashDict directly (which is really, really bad. What was the last time you called Enumerable.List?)
2. duplicate all functions in Dictionary.HashDict in the HashDict module (which is really, really bad)

Again, protocols != datatype definition.

I hope this clarifies.

Ed W

unread,
Apr 15, 2015, 1:25:18 PM4/15/15
to elixir-l...@googlegroups.com
Aha, ok now I see the core of your argument.  BUT:

- I would argue that the underlying datatype is the underlying datatype!
- It could be a Redis wrapper, it could be a gb_trees implementation, it could be a trie implementation
- KeyValue (to use a different name) is therefore is a protocol to dispatch to the underlying datatype

So perhaps based on my new understanding above, let me rephrase my proposal:

- Dict behaviour helps create certain modules that implement a key-value type operation
- I see a requirement for a "KeyValue" type Protocol which allows me to have a more universal API to address Map, HashDict, gb_trees, Redis and much more.

- The mistake in my previous proposal is that I see this Protocol as being basically identical functions to Dict behaviour.
- I mistakenly thought this meant we don't need both, I see your point above
- Therefore I see value in having a new Protocol which is extremely similar to the current Behaviour and draws this and other datatypes into a similar top level interface
- Making it a protocol means I can easily wrap some new datatype and give it a common access API.  For example why invent a new API for simple use cases for amazon simpledb. Lets just build a simpledb datatype and then wrap it in our common KeyValue protocol (this doesn't have to exclude a more advanced interface for more complex use cases).  Sure the performance profile is going to be very different - but that's implicit


Does this seem more sensible?

Ed W


Ed W

unread,
Apr 15, 2015, 1:32:11 PM4/15/15
to elixir-l...@googlegroups.com
Please see my other email.  I withdraw my proposal to remove the Dict behaviour, I now propose that Dict stays as a behaviour to build datatypes (which is Jose's point), however, I still maintain that a very set of functions is a valuable Protocol and so my new proposal is similar in that we need a core and approved Protocol to use when accessing KeyValue type structures...

This is roughly the same thing in a roundabout way, but I do now see Jose's point (apologies english keyboard)

To clarify, I think we have gone astray because:

- We obviously have a datatype which needs  a standardised interface
- But as highlighted by myself and Saša, one often wants a standardised way to access LOTS of random datastructures, often they might be very different in underlying implementation to a Dictionary. eg accessing a config file can be very satisfying through a Dictionary type interface, even though the underlying may well be an .INI file or a JSON structure or whatever.
- I don't even see that for those cases its a problem not to implement some functions, eg the config file parser might refuse to "put" and only "get" values (say)

Can I reshuffle the discussion along those lines?  I think I agree with Jose that Dict behaviour is something different. BUT I think the proposed Protocol is nearly identical and that is confusing

Thanks

Ed W
--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/0cab6b5e-410e-4e13-acb0-fa1a81e40f0c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ed W

unread,
Apr 15, 2015, 1:37:30 PM4/15/15
to elixir-l...@googlegroups.com
Hi

> * Large protocols are smell. Why? Because a protocol should be a
> minimum set of functions that provide some particular characteristics.
> If you have a large protocol, it means you have failed to provide this
> specification.

But it's not currently a large "interface". Its actually quite a
reasonable size.

I think if I were asked to design a *protocol* (lets call it "KeyValue"
so we don't misunderstand the proposal) that could be used as a generic
way to talk to HashDict, Map and some third party datatype, I would
design very roughly what is in the current Dict behaviour.

I guess there will be minor deviations because one is an interface to a
datatype (as you have corrected me), and the other is a generic set of
functions to work on key value type structures, but I'm looking again
now and I don't really see anything I would change right now? They
would both have identical interface signatures I think?

> If the current Dict implementation is confusing,

I don't think it is! I think it's great! Hence the proposal that we
should *also* add a Protocol based on this "interface". I think the
Protocol would be really useful to unify lots of different datatypes!

Cheers

Ed W

José Valim

unread,
Apr 15, 2015, 1:48:26 PM4/15/15
to elixir-l...@googlegroups.com


Can you explain what difference you see in Dict vs Key-Value?  To me they are synonymous types?

It depends. For example, in your follow-up e-mail, you mention about a KeyValue protocol that could also be backed by HashDict, Redis or an Amazon Web Service. This is a very dangerous idea for many reasons:

1. A KV.fetch operation in a in memory dictionary can only fail because the key is missing. However, a KV.fetch in Redis can fail by many, many reasons, including network partitions.

2. The performance characteristics are very, very different. For example, we have a function named Enum.group_by/3 that expects a dictionary as argument. If you are grouping by a list of 1000 elements into a dict, that is going to be very, very fast. Now, through a web service, it will be order of magnitudes slower.

Those characteristics should be all part of your protocol definition. It is not only about the data type semantics, but the error conditions, performance characteristics and so on. I would like to be very confident that when I write KV.fetch, there isn't a chance it is going to invoke something that causes a crash in my current process.

Finally, this is a philosophical question because I'm still trying to get my head around functional programming, but:
- Isn't the whole point that we look at FP as about the *function* on the data?
- Ie we think in terms of what we want the function to *do*, and less about what the data is?

I don't believe this to be true. You need to know your data types. Pattern matching is a prime example of that.

OK, I'm happy with that.  Can we have a Dictionary protocol please?

No, due to all the points mentioned in this reply, in the earlier reply to you and the reply to Sasa.

This is already how it seems to work, eg Enum.count for HashDict simply calls HashDict.size:
    https://github.com/elixir-lang/elixir/blob/master/lib/elixir/lib/hash_dict.ex#L233
Maps appears to call :erlang.map_size (although because it's not a protocol this is tricky to discover as it needs to be implemented in enum.ex...)

I was not talking about the implementation. Go back to the "assertive code" blog post. When I call Dict.size/1, I know it will be valid only for dictionaries. When I look at the code, I know it expects a dictionary. Those have different level of permission: 

* HashDict.size/1 - I expect a HashDict
* Dict.size/1 - I expect any dictionary
* Enum.count/1 - I expect any enumerable
 
Yeah, but count/1 will have constant time implementations for situations where it's possible?

If it has, it is an optimization that you shouldn't rely on. If your code can run faster beyond your original assumption/expectation, that is a good thing, isn't it? Always rely on linear time for the functions in Enum unless otherwise is explicitly said.
 
You can ignore the rest of my email, but I really think you should reconsider if we need Dictionary.size, given that Enum.count is implemented optimally for all Dictionary types?

I have read all your e-mail. I don't quote all chunks of it though because I cannot answer every question nor I have something interesting to add to all of them.

It would really help if you streamlined more your thoughts and focused on the important ones. While I wrote this reply, I got three notifications with new e-mails from you and I definitely won't have the time go through all of them.
 
Hmm, most of the world would argue that abstracting things generally helps clean them up?

I have given plenty of examples and anecdotes, including the assertive code blog post, about making the code clearer and avoid abstracting/introducing polymorphism unless there is need. We typically program with abstractions on top of abstractions but we should always pick the lowest abstraction that can elegantly solve the problem at hand.

I won't hold up Perl as an example of why Elixir should do something (!), but to say that the Hash/Dict pattern is universal and extremely widely used seems like an understatement? 

If you ignore the performance characteristics, memory usage, error handling, and so on, then Hash/Dict is universal. I wouldn't like to program like that though.
 

John W Higgins

unread,
Apr 15, 2015, 1:56:25 PM4/15/15
to elixir-l...@googlegroups.com
Why not just write what you want and release it as a library first? 

Why sit here and butt heads over something that has no need of being anywhere near the core initially? 

Jose is a perfectly rational person and I'm certain that if you wrote something that made sense he would look at it again - but I know for a fact it would be easier for him to be convinced with written code in front of him rather then simply trying to argue facts that he doesn't currently subscribe to. We all work better that way.

John

Ed W

unread,
Apr 15, 2015, 2:21:09 PM4/15/15
to elixir-l...@googlegroups.com
I would have written the wrong thing though!?  In the last hour Jose has convinced me to delete my previous proposal and I have sent a new proposal!

Please bear in mind I am "asking a question". I don't know the answer and I don't intend this to come across as butting heads. Apologies if it reads that way

Cheers

Ed W
--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.

Saša Jurić

unread,
Apr 15, 2015, 2:56:04 PM4/15/15
to elixir-l...@googlegroups.com
On 15 Apr 2015, at 19:24, José Valim <jose....@plataformatec.com.br> wrote:

* Large protocols are smell. Why? Because a protocol should be a minimum set of functions that provide some particular characteristics. If you have a large protocol, it means you have failed to provide this specification. If the current Dict implementation is confusing, I would still choose a confusing Dict implementation than opening a precedent or giving the example of a large protocol.

That’s a valid point, though I don’t necessarily agree. But yeah, I see the reasoning.
The question still remains, why do we need generic Dict (and Set)?


* Performance. The Dict module does not work as a protocol. For example, Dict.merge does not do a custom dispatch but has custom logic for merging different types. So we would need a Dict module as front-end, that calls the DIctionary protocol. This brings us to the performance issues that lead us to remove the Access protocol in the first place and we would likely hit those with a Dictionary protocol too.

Wouldn’t perf be resolved with consolidation?

Also, why couldn’t merge work in the same way it works now? The idea of using protocols is to have a generic Dict module, which is what you already have. The only difference is the way it dispatches to the concrete data types.

This allow us to easily adapt even the types we don’t own. For example with the current approach of Dict, I can’t make it work with :gb_trees unless I wrap the module. But then it means I either have to fully copy :gb_trees interface to my own GbTree, or otherwise have partial implementation and use sometimes :gb_trees, sometimes GbTree, and sometimes Dict.

In contrast, with protocols, I only need to implement Enumerable + Dictionary, and :gb_trees magically works with Dict.


* Conceptually, don't treat dictionaries or datatypes as a protocol. They are the definition of a datatype. I agree this is very fuzzy because we don't have a name for this concept in Elixir and things without name are confusing. I recommending reading about OCaml functors to see if maybe it helps.

To make the last point a bit clearer, consider the following case. With dictionaries, sometimes I want to use the functions in Dict and sometimes I want to use the function in the implementation, like HashDict or Map. In fact, I use HashDict and Map directly more than I use the Dict one. If we had protocols, we would need three modules:

* Dict (polymorphic)
* HashDict (the datatype)
* Dictionary.HashDict (the implementation of the protocol)

I don't equal protocols and data types. If I said something of the sort, then it was poorly worded on my behalf. I see protocols as a contract which is enforced by the generic logic.
Thus, the way I’m heading is exactly what you wrote. Dict is generic and it can work with different types. And to me, this is exactly the case for protocols.


So I would call HashDict.new to create a new HashDict and I could call Dict to use the polymorphic implementation. However, if I wanted to use directly the HashDict module, or I would need to:

1. use Dictionary.HashDict directly (which is really, really bad. What was the last time you called Enumerable.List?)
2. duplicate all functions in Dictionary.HashDict in the HashDict module (which is really, really bad)

Or you could make Dictionary.HashDict delegate to HashDict, like you’re doing here: https://github.com/elixir-lang/elixir/blob/v1.0.4/lib/elixir/lib/hash_dict.ex#L229-L234
This is in fact the way I would usually implement a protocol.
Granted, it gets clumsy since the protocol itself would be large. This is indeed the biggest reason I see against having a Dictionary protocol.

José Valim

unread,
Apr 15, 2015, 3:23:42 PM4/15/15
to elixir-l...@googlegroups.com
That’s a valid point, though I don’t necessarily agree. But yeah, I see the reasoning.
The question still remains, why do we need generic Dict (and Set)?

Sorry you asked it before and I missed it. Although maps in 18 will make HashDict obsolete, we still have Map and Keyword, so I would say yes.

Wouldn’t perf be resolved with consolidation?

Unfortunately for operations in Dict, if they are very frequent, we can see severe slow downs in development when protocols are not consolidated because every call is serialized through the code server. Access is just part of the Dict API and it is already going away because of that.
 
Also, why couldn’t merge work in the same way it works now?

Yes, it could. I was just showing the amount of indirection will affect performance. Dict -> Dictionary -> Dictionary.HashDict -> HashDict.
 
In contrast, with protocols, I only need to implement Enumerable + Dictionary, and :gb_trees magically works with Dict.

Unfortunately that's not true because gb_trees is a tuple. So you still need a GbTree module to provide the struct. And then implement things as a protocol implementation or behaviour functions makes no difference (from the implementation side). This is not a pro or a con to me, just clarifying!

Saša Jurić

unread,
Apr 15, 2015, 4:25:53 PM4/15/15
to elixir-l...@googlegroups.com
On 15 Apr 2015, at 21:23, José Valim <jose....@plataformatec.com.br> wrote:

That’s a valid point, though I don’t necessarily agree. But yeah, I see the reasoning.
The question still remains, why do we need generic Dict (and Set)?

Sorry you asked it before and I missed it. Although maps in 18 will make HashDict obsolete, we still have Map and Keyword, so I would say yes.

My question really aims at how many known cases do we have where user code wants to be oblivious of the specific Dict impl? Differences between maps and keyword lists are arguably significant in perf, and also in some semantics (ordering, multiple values per key).

Also, what do we need Set for?


Wouldn’t perf be resolved with consolidation?

Unfortunately for operations in Dict, if they are very frequent, we can see severe slow downs in development when protocols are not consolidated because every call is serialized through the code server. Access is just part of the Dict API and it is already going away because of that.

This kind of raises a tangential question, isn’t Enumerable affected by the same issue? This is IMO by far the most used protocol in Elixir code.

 
In contrast, with protocols, I only need to implement Enumerable + Dictionary, and :gb_trees magically works with Dict.

Unfortunately that's not true because gb_trees is a tuple. So you still need a GbTree module to provide the struct. And then implement things as a protocol implementation or behaviour functions makes no difference (from the implementation side). This is not a pro or a con to me, just clarifying!

Yeah, good point, I forgot about that. It still holds though for any Elixir struct, though hopefully the implementer of a custom dictionary would make it Dict compliant.

José Valim

unread,
Apr 17, 2015, 5:58:21 AM4/17/15
to elixir-l...@googlegroups.com
My question really aims at how many known cases do we have where user code wants to be oblivious of the specific Dict impl? Differences between maps and keyword lists are arguably significant in perf, and also in some semantics (ordering, multiple values per key).

There are some cases. For example, Enum.group_by/3, URI.decode/2 and a couple others in core. The biggest motivation for Dict though was HashDict <-> Map, which will be an upcoming big change. We can evaluate the importance of Dict once HashDict is phased out. 
 
Also, what do we need Set for?

Same history. Today we have HashSet but MapSet is already on master thanks to the larger maps.
 
This kind of raises a tangential question, isn’t Enumerable affected by the same issue? This is IMO by far the most used protocol in Elixir code.

No because we can inline one function in Enum that will make 80% of all functions fast for common data structures like lists and maps. Everything in Enum is implemented on top of reduce so if we inline Enum.reduce, they will be super fast (no protocol dispatch) and almost all functions in Enum will use it.

This is hard to do with Dict because we will would need to inline almost every operation because they are disjoint.
 
Yeah, good point, I forgot about that. It still holds though for any Elixir struct, though hopefully the implementer of a custom dictionary would make it Dict compliant.

For data structures like queue, gb_trees and so on, if mapping them to Elixir, it is honestly easier to reimplement them. Otherwise things like Enumerable become very hard to implement efficiently.

Saša Jurić

unread,
Apr 17, 2015, 7:01:17 AM4/17/15
to elixir-l...@googlegroups.com
Makes sense, thanks for all the explanations!


--
You received this message because you are subscribed to a topic in the Google Groups "elixir-lang-core" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/elixir-lang-core/YZfQutavgS4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/CAGnRm4KfpVn%3DfmL1J8Chyz0wBzCnoJ_hB1hJP-uy48M9mUfbWg%40mail.gmail.com.
Reply all
Reply to author
Forward
0 new messages