[Proposal] Document the Big-O complexity (time and space) of every collection and their functions

40 views
Skip to first unread message

Mário Guimarães

unread,
Jul 18, 2019, 12:36:04 PM7/18/19
to elixir-lang-core
Hi,

I find this can be useful in selecting the collections and their manipulating functions to be used in our code.

The case is that in an immutable language it may not be evident what is the O(...) of the insert, read, update, delete and sort, in our collections.

For example, what is the complexity of sorting an immutable enumerable via `Enum.sort`? I really do not know.

This could be documented with a table at the start of the documentation for the collection.

This effort could be decomposed in as many issues as the number of collections to document.

Experienced people may know many of these Big-O details, but for newcommers to Elixir (and Erlang) it could be very helpful.

Thanks
Mário

José Valim

unread,
Jul 18, 2019, 1:13:47 PM7/18/19
to elixir-l...@googlegroups.com
Perhaps we could use Documentation Metadata for this:

@doc big_o: "O(n)"

or

@doc big_O: "O(n)"

By keeping the information structured, we can find different ways to show and format it in the documentation tools. I am not sure if this would provide enough structure though.


José Valim
Skype: jv.ptec
Founder and Director of R&D


--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/a3f79568-10d3-49b0-a93f-a1bcf7054c59%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andrea Leopardi

unread,
Jul 18, 2019, 4:08:41 PM7/18/19
to elixir-l...@googlegroups.com
Maybe worth considering having a generic 

  @doc performance: "O(n)"

that can also be used as say

  @doc performance: "read-optimized"

or whatever.


For more options, visit https://groups.google.com/d/optout.
--

Andrea Leopardi

Mário Guimarães

unread,
Jul 18, 2019, 5:49:23 PM7/18/19
to elixir-lang-core
I would suggest then to have

@doc time: "O(n)"

and

@doc space: "O(n^2)"


to clearly refer to time and space complexity of the function.

Mário

José Valim

unread,
Jul 18, 2019, 6:01:47 PM7/18/19
to elixir-l...@googlegroups.com
I like that. Maybe time_complexity / space_complexity for clarity.



José Valim
Skype: jv.ptec
Founder and Director of R&D

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.

Michał Muskała

unread,
Jul 18, 2019, 6:37:59 PM7/18/19
to elixir-l...@googlegroups.com
I'd like to actually argue against this addition. 

For one — it adds complexity to the documentation - you have to know what the big-O notation means in the first place  I'd argue it will make the documentation less approachable for people without formal CS education. It only becomes worse with some of the more complex data structures that feature "amortised" or "expected" big-O performance. To give just one example - all operations on maps implemented as HAMT are rather complex to express in this notation and would probably need a paragraph if not more to explain.

Secondly, it might prove problematic in a way that it bounds the standard library not to change those in the future. Changing an implementation where it would affect the complexity of the function could be considered a breaking change. I believe that for that reason, for example, the Rust standard library does not feature such documentation in general.

But there's also a much bigger issue and that's if this is actually useful in real programs beyond most simple divagations. Modern computers with out-of-order CPU execution, expensive and complex caching mechanisms and other such features mean that the classical notion of big-O notation rarely directly reflects on real-world performance. Even worse if the function at hand allocates - should it also consider the potential impact it could have on garbage collection? There are numerous algorithms that theoretically should be worse, but are much better in practice. I fear that adding such information would lead to people deciding which potential implementation is better just by looking at that documentation instead of benchmarking for their own use case, which is the only reliable method of testing for performance.

To summarise - I think this will only add additional, superfluous information to the documentation that is of questionable usefulness but that might prove problematic in the future.

Michał.

Paul Schoenfelder

unread,
Jul 18, 2019, 7:28:37 PM7/18/19
to elixir-l...@googlegroups.com
Michał did a good job of explaining why I'm not much of a fan of this either. I think when it makes sense at all, it is best to have the big-O information in module/function docs, where they can be contextualized with their intended usage, or at least sit alongside a warning that says something like "always benchmark your use case".

Paul

Rich Morin

unread,
Jul 18, 2019, 9:04:37 PM7/18/19
to elixir-lang-core
I would like to see a "getting started" page that covers the time and space
footprint of each major data type. It could, for example, tell the reader
that adding an element to the end of a list is O(n), but that prepending
an element is O(1).

It could also talk about persistent data structures and how that affects how
expensive it is (in time and space) to make a modified copy of a binary, map,
tuple, etc.

I would also be happy to have notes in the documentation for common library
functions, explaining anything unusual about their performance. For example,
although the Enum.sort_by/3 entry contains performance hints of this nature,
the Enum.sort/2 entry does not. So, an incautious reader might well miss it.

-r

Mário Guimarães

unread,
Jul 19, 2019, 5:38:40 AM7/19/19
to elixir-lang-core

I like the proposed names of time_complexity and space_complexity; it can't be more clear.

Some points in clarifying Big-O:
- it captures the time and space performance of algorithms
- time and space are quality factors, thus improving them does not break the API of the algorithm
- it is really useful as an indicator of algorithm performance in the presence of large inputs (asymptotic performance)
- as such, it is really useful in predicting the performance of algorithms used large software systems, like those on the Internet
- it guides the design and choice of algorithms
- it educates people on algorithm design
- and because of all this, it is thought in every serious data structures and algorithms course

No known computer design can automatically improve the asymptotic performance of algorithms; what it can do, is to raise the bar on the size of the input that an algorithm can take before it starts to show its asymptotic behavior

As the previous points stated, Big-O is a guide of asymptotic performance on large inputs; obviously, if the inputs are small, then most competing algorithms will show equivalent performance. It is for large inputs that Big-O is a great aid on choosing among competing algorithms.

Big-O is never an excuse to avoid testing our programs for their real performance; again, Big-O is just an indicator of what this might be for large inputs, and overall, a great tool to base the initial design of our programs before tuning and realease to the world.

Functional programming is penetrating a world in which mutable data structures and their algorithms dominate, and for which Big-O indicators are very well documented and accessible. This is not the case for the Big-Os of immutable data structures and their algorithms, whose implementation details are yet somewhat obscure to the general audience.

In short, I think this will be a great addition to the Elixir documentation.

Thanks
Mário

quinta-feira, 18 de Julho de 2019 às 23:37:59 UTC+1, Michał Muskała escreveu:
I'd like to actually argue against this addition. 

For one — it adds complexity to the documentation - you have to know what the big-O notation means in the first place  I'd argue it will make the documentation less approachable for people without formal CS education. It only becomes worse with some of the more complex data structures that feature "amortised" or "expected" big-O performance. To give just one example - all operations on maps implemented as HAMT are rather complex to express in this notation and would probably need a paragraph if not more to explain.

Secondly, it might prove problematic in a way that it bounds the standard library not to change those in the future. Changing an implementation where it would affect the complexity of the function could be considered a breaking change. I believe that for that reason, for example, the Rust standard library does not feature such documentation in general.

But there's also a much bigger issue and that's if this is actually useful in real programs beyond most simple divagations. Modern computers with out-of-order CPU execution, expensive and complex caching mechanisms and other such features mean that the classical notion of big-O notation rarely directly reflects on real-world performance. Even worse if the function at hand allocates - should it also consider the potential impact it could have on garbage collection? There are numerous algorithms that theoretically should be worse, but are much better in practice. I fear that adding such information would lead to people deciding which potential implementation is better just by looking at that documentation instead of benchmarking for their own use case, which is the only reliable method of testing for performance.

To summarise - I think this will only add additional, superfluous information to the documentation that is of questionable usefulness but that might prove problematic in the future.

Michał.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-l...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-l...@googlegroups.com.

Miguel Palhas

unread,
Jul 19, 2019, 6:21:37 AM7/19/19
to elixir-l...@googlegroups.com
I agree with all the previous points about why time/space complexity is useful info in general
But at the same time, and remembering other proposals that I see going through this mailing list, I often see that the decision goes a lot towards preserving the clarity of the documentation, especially when it comes to less experienced developers.

I think a good compromise would be to add this information to the collection module top-level documentation, or a separate page for example. This has already been suggested a couple of times in this thread.

But documenting complexity for every collection function individually seems like it won't be very helpful.
I'd argue that, if you know what time & space complexity means, then it's likely that you already have at least some understanding of how that works, and you'll know at least the basics, rendering that extra documentation less interesting for you. Or at the very least, you'll know that the concept exists, and that for certain use cases you may have to think about what the most appropriate structure will be. At that point, you can do any additional research outside of the Elixir documentation.

And for those who don't know the meaning, it will just be an additional barrier to their understanding of the documentation.

To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/7442e39a-5926-4b80-8a9d-fd28173bb7ba%40googlegroups.com.


--
Best,
Miguel Palhas

Wiebe-Marten Wijnja

unread,
Jul 19, 2019, 6:28:56 AM7/19/19
to elixir-lang-core
Having read all of the replies, I like the suggestion to add simple readable documentation to the data structures' module docs the best.

Adding new documentation metadata seems to me not worth it. Besides the already mentioned problems, I would like to mention that complexity-documentation is not that structured:

- Besides Big-O, there is Big-Omega and Little-o, which are also very useful and relatively common in literature. Do we also need to add new metadata names for those?
- Time-complexity and Space-complexity are by no means the only complexities that are being measured for data structures. Do we also need to add new metadata names for e.g. communication complexity,
- What about amortized complexities?
- In regards to what is the complexity measured? (i.e. "what is `n` in O(2n+3)"?)

And that, combined with the fact that I think it is important to also accommodate people who are new to this mathematical way of writing algorithmic complexities, makes me conclude that it is a bad idea to create a rigid structured way for it, and that adding some humanly-readable unstructured documentation for common operations on commonly-used datastructures would be the way to go. These unstructured documentation snippets might be accompanied by their formal complexity notations, but should not only consist of these.

~Qqwy/Marten

Fernando Tapia Rico

unread,
Jul 19, 2019, 6:32:22 AM7/19/19
to elixir-lang-core
To give a bit more context: the top level documentation for Enum, Keyword, List, Map, MapSet, and Stream already contain some "hints" about time complexity :)

Mário Guimarães

unread,
Jul 19, 2019, 7:26:02 AM7/19/19
to elixir-lang-core

The idea of this proposal is to document time and space complexity of collections in a more structured / organized way instead of being missing or spread opportunistically here and there throughout the documentation.

I also think that such information could be presented in a proper section at the top part of a collection's module documentation before the function list.

Using `@doc` annotations on functions are not strictly necessary, but provide a way to capture that information that is ameanable for post-processing, for example, to generate a table at the top of the documentation page. Moreover, they do not need to be presented next to the documentation of the functions they annotate.


The Big-O annotation practice is not to write "O(2n+3)" but to reduce this to its pattern of asymptotic growth, which in this case is simply "O(n)".

This means that I totally agree that complex mathematical formulas do not belong to this documentation. 

Among the several existing predictors, Big-O is the one generally used and that people are most aware of. It is also the one that predicts the worst-case asymptotic performance an algorithm can get when systems are put under the stress of large inputs. Adding other indicators introduces unnecessary complexity and confusion, which obviously is not what we want in our documentation.

In general, the "n" relates to the number of elements in the collection, and I suppose that in cases of functions involving two collections, they can be annotated with something like "O(n+m)", where "n" and "m" are the number of elements in each collection.

These are very simple expressions every software developer should know or be interested to learn for the sake of understanding algorithms better and the quality of the software systems they write.

Wiebe-Marten Wijnja

unread,
Jul 19, 2019, 10:15:03 AM7/19/19
to elixir-lang-core


On Friday, July 19, 2019 at 1:26:02 PM UTC+2, Mário Guimarães wrote:


The Big-O annotation practice is not to write "O(2n+3)" but to reduce this to its pattern of asymptotic growth, which in this case is simply "O(n)".

While it is true that O(2n+3) and O(n) are both linear, and that we often do reduce notations to its pattern of asymptotic growth, a common reason to keep the expanded form (like e.g. "O(2n+3)") is to be able to compare algorithms of the same asymptotic complexity with one another, which is rather common.
 

Among the several existing predictors, Big-O is the one generally used and that people are most aware of. It is also the one that predicts the worst-case asymptotic performance an algorithm can get when systems are put under the stress of large inputs. Adding other indicators introduces unnecessary complexity and confusion, which obviously is not what we want in our documentation.

Correct, but there are many cases in which people are more interested in the average running time than the worst-case running time. It is mostly when writing real-time code that limiting the worst-case complexities become important.


In general, the "n" relates to the number of elements in the collection, and I suppose that in cases of functions involving two collections, they can be annotated with something like "O(n+m)", where "n" and "m" are the number of elements in each collection.

This is true for linear collections, but it is not true many other structures.
When working on numbers, for instance, `n` is the size of the number in bits.
And often we have other symbols as well. Like "m" you mentioned. But also we have for instance Patricia trees "O(min(n, W))" where W happens to be the size of a word on the computer that runs it.
 

These are very simple expressions every software developer should know or be interested to learn for the sake of understanding algorithms better and the quality of the software systems they write.

 
 I agree that it would be nice if more people were to learn about these concepts, because being aware that a choice between two implementations matters and how will improve the quality of the software you write. :-)


Chris Keathley

unread,
Jul 19, 2019, 10:23:13 AM7/19/19
to elixir-l...@googlegroups.com
I agree with all of the points made by Michał and Paul. Additionally, most functions will never need to specify this information. In the few cases where you might want to specify this information it seems like it would be very reasonable to use module docs and function docs. This is especially true because asymptotic time _isn't_ the most useful way to measure immutable data structures. Often you'll want to look at both asymptotic and amortized time complexities. But even if we set amortized time to the side for a second, how would you document a data structure that internally uses a list until the list size reaches a certain amount and then switches to a HAMT with a branching factor of 32? Do you say that searching this data structure is O(n) when n < 32 and O(log32(n)) when n >= 32? In my mind putting this information in the module doc is clearer and allows me to write a full explanation.

It seems to me that the core problem is communicating relative performance of different data structures and algorithms. I don't think this proposal is needed to solve that problem and it creates more complexities than benefits.

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.


--
Chris

Mário Guimarães

unread,
Jul 19, 2019, 10:59:59 AM7/19/19
to elixir-lang-core
This proposal is to document the complexity for the collections that come with Elixir and the functions that manipulate them. There aren't many of these collections, they are the simplest collections that can exist from the user perspective, they are very distinct, and so, I think that the most simplified "O(...)" is enough to compare across them. For anyone creating new collections outside the standard library, they choose to document their collections as they wish.

The case of average running times I believe will depend on the application domain, so it is better dealt by benchmarking the application for the typical inputs in its domain.

The case  of "O(min(n, W))"  can be well complemented by explaining what "W" is in the function's documentation, in case such uncommon variables stay after simplification of the "O(...)". Again, I do not think that Elixir collections will need these.

What's the difficulty of annotation like this, which can be complemented if necessary with an additional explanation in the module or function documentation?

@doc time_complexity: "O(n) if n < 32, otherwise O(log32(n))"

Mário

sexta-feira, 19 de Julho de 2019 às 15:23:13 UTC+1, Chris Keathley escreveu:
I agree with all of the points made by Michał and Paul. Additionally, most functions will never need to specify this information. In the few cases where you might want to specify this information it seems like it would be very reasonable to use module docs and function docs. This is especially true because asymptotic time _isn't_ the most useful way to measure immutable data structures. Often you'll want to look at both asymptotic and amortized time complexities. But even if we set amortized time to the side for a second, how would you document a data structure that internally uses a list until the list size reaches a certain amount and then switches to a HAMT with a branching factor of 32? Do you say that searching this data structure is O(n) when n < 32 and O(log32(n)) when n >= 32? In my mind putting this information in the module doc is clearer and allows me to write a full explanation.

It seems to me that the core problem is communicating relative performance of different data structures and algorithms. I don't think this proposal is needed to solve that problem and it creates more complexities than benefits.

To unsubscribe from this group and stop receiving emails from it, send an email to elixir-l...@googlegroups.com.


--
Chris

Chris Keathley

unread,
Jul 19, 2019, 11:19:04 AM7/19/19
to elixir-l...@googlegroups.com
> What's the difficulty of annotation like this, which can be complemented if necessary with an additional explanation in the module or function documentation?

What's the difficulty in just using module docs and function docs? If additional information would go in the module and function documentation why not put everything there? I'm not saying that we shouldn't communicate this information when its useful. But it hasn't been proven to me that we need an addition to the language to accomplish that goal.

To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/e6919fe4-9bb6-4894-93ce-8ac4f4255869%40googlegroups.com.


--
Chris

José Valim

unread,
Jul 19, 2019, 11:34:51 AM7/19/19
to elixir-l...@googlegroups.com
Chris, to clarify, no addition to the language is necessary to support doc metadata. But the discussion here show that the doc metadata is too compact and concise to portray all relevant information.

José Valim
Skype: jv.ptec
Founder and Director of R&D

Mário Guimarães

unread,
Jul 19, 2019, 11:35:15 AM7/19/19
to elixir-l...@googlegroups.com
Oh, that was already explained: to have the O(...) annotation at a well structured place that can be used for post-processing; for example, to generate a table collecting those O(...) annotations, to be placed at a well known point, like the top of the collection's module.

This is not an addition to the language per se, because @doc can be used already in the form of @doc x: term

This is an addition to the documentation of the collections that come with Elixir, and not only on the functions of those collections, because for example "sort" comes from "Enum" only and deserves to have its complexity documented for the different collections; in this case, documenting sort in a complexity section on the top of the collection document is perhaps the best place.


Chris Keathley

unread,
Jul 19, 2019, 12:08:42 PM7/19/19
to elixir-l...@googlegroups.com
> Chris, to clarify, no addition to the language is necessary to support doc metadata

Sorry I misspoke. I still contend that the core problem can be solved with more information and more clarity using all of our existing tools.

> Oh, that was already explained:

I understand what you're proposing. But showing a table isn't a *problem*. It's one solution to a problem. I'm not convinced this proposal adequately solves the core problem.



--
Chris

Mário Guimarães

unread,
Jul 19, 2019, 12:33:36 PM7/19/19
to elixir-l...@googlegroups.com
Chris

this proposal is originally to "Document the Big-O complexity (time and space) of every collection and their functions".
This does not require any `@doc`, it suffices to have a small section on the top of a collection's module documentation, where such information can be accessed easy and clearly.

Mário

Reply all
Reply to author
Forward
0 new messages