Proposal: memoized streams

102 views
Skip to first unread message

Alex Bruns

unread,
Oct 15, 2024, 7:52:52 PM10/15/24
to elixir-lang-core
Elixir streams should have a "pure" version that automatically memoizes the result (aka lazy lists). E.g.

s = PureStream.map([1, 2, 3], fn x -> x + 1 end)
Enum.to_list(s) # lazily computes => [2, 3, 4]
Enum.to_list(s) # returns [2, 3, 4] *but* no new computation is done because we already did the computation and memoized it

They could allow the caller to provide the memoization implementation to accommodate process-dict based caching, node-wide caching, cluster-wide caching, or even system-wide caching via something like a db, and they could be configured to only cache when computations take a set amount, etc.

Imagine the following situation:

@spec get_people() :: Enumerable.t(Person.t())
@spec without_kids(people :: Enumerable.t(Person.t())) :: Enumerable.t(Personal.t())
@spec names(people :: Enumerable.t(Person.t())) :: String.t()

If I implement these functions to return lists, then 
get_people() |> without_kids() |> names() 
iterates over everything 3 times. No good if I have a billion people.

If I implement these functions to return streams then
names = get_people() |> without_kids() |> names() 
first_names = Enum.map(&to_first_name/1)
last_names = Enum.map(&to_last_name/1)
iterates over everything twice.

Okay, but I can do
names = get_people() |> without_kids() |> names() |> Enum.to_list()
first_names = Enum.map(&to_first_name/1)
last_names = Enum.map(&to_last_name/1)

This works, but it's eager. So, this
names = get_people() |> without_kids() |> names() |> Enum.to_list()
maybe_first_names = if :rand.uniform() > 0.5, do: Enum.map(&to_first_name/1)
maybe_last_names = if :rand.uniform() > 0.5, do:Enum.map(&to_last_name/1)
will still compute names even though a quarter of the time neither first_names nor last_names use names.

Technically, we can already do this today by abusing the fact that we know how streams work, but of course, nobody wants people doing that.

Jean Klingler

unread,
Oct 16, 2024, 6:20:23 AM10/16/24
to elixir-l...@googlegroups.com
I wonder what a concrete use case for this would be, although I understand the example it feels a bit contrived.

Streams are typically used to avoid fitting the whole collection at once in memory, but if you're going to store it anyway you might be able to use lists in the first place? You might need to reorganize the code and the conditionals a bit though for the initial lazy eval:

first? = :rand.uniform() > 0.5
last? = :rand.uniform() > 0.5

maybe_names = if first? or last?, do: get_people() |> without_kids() |> names() |> Enum.to_list()

Implementation-wise, due to immutability you can't really have a one-size-fit-all solution, as you mentioned you'd need to pick some backend (or even make it flexible), which makes me think the scope is probably too ambitious and feels more like a candidate for a library than for a general abstraction in core.

Or perhaps using something like Cachex is enough in practice if you need to cache some steps and avoid performing some operations twice? (in this case it doesn't feel to be specifically about enumerables)


--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/68274100-c167-440e-89e3-c039ea5262f0n%40googlegroups.com.

Alex Bruns

unread,
Oct 16, 2024, 5:58:44 PM10/16/24
to elixir-l...@googlegroups.com
It's true that you could always change existing code to support bespoke loading logic. However, this introduces an undesirable form of incidental coupling. There is no fundamental reason that the callsite at which names (or maybe_names) needs to know about how it will be used, and doing so has a real cost. In your example, you've hoisted the conditional logic to the top level, but what if that is not possible because it depends on some local context? Such a hoisting operation could introduce a massive amount of complexity. My opinion is basically that we have the ability to provide better performance characteristics for a large number of programs without introducing additional complexity and we should do that.

As for this being a library, I think that's a really good point. There's nothing stopping me, or someone else, from doing this in user space. That said, one could say the same thing about streams. I think the primary reason I support adding this to core would be to push widespread adoption as, without core approval, the only devs who would use this would be those that already recognize its value, limiting adoption.

Wrt caching individual steps, this is suboptimal in situations where you have a lot of steps. Still, probably a great 90% solution for lots of use cases.

You received this message because you are subscribed to a topic in the Google Groups "elixir-lang-core" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/elixir-lang-core/GHUFTnJL6hQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/CANnyohaNBqHkPEhODW%2BMjg1jCpN7-C51Vgke809rXxbQ_CcHHA%40mail.gmail.com.

Austin Ziegler

unread,
Oct 17, 2024, 11:44:58 AM10/17/24
to elixir-l...@googlegroups.com
I think that this is definitely done better as a library than adding it to the core, because fundamentally memoization is about keeping state (the prior state values of the stream), which would *probably* require the use of Agent by default. The precise mechanism used for keeping state will differ based on the application needs and environment (if working in a memory constrained environment like would be typical for a Nerves project, one should not be using memory caching). There are two related things which Elixir delegates to third-party libraries, providing only a basic (though useful) implementation: timezone database and calendar support. I don't think that Elixir could meaningfully provide a baseline caching mechanism that would satisfy the needs of users.

-a



--

Ben Wilson

unread,
Oct 18, 2024, 4:09:33 PM10/18/24
to elixir-lang-core
> That said, one could say the same thing about streams.

This isn't true, because the Elixir standard library itself wants to surface streams. "We want to use it in the standard library" has always been a good reason to include something in core.

A similar argument could be made regarding memoized streams if there is a scenario where the standard library would want to use them. However given the complexity of the design space here, and the bit where there's no advantage to iterating on this within the standard library vs a 3rd party library, the 3rd party library is 100% the way to go.

José Valim

unread,
Oct 18, 2024, 4:16:44 PM10/18/24
to elixir-l...@googlegroups.com
For what is worth, we outline the development goals here: https://elixir-lang.org/development.html

To quote:

Since v1.0, the language development has become focused to provide a compact and consistent core. The Elixir team focuses on language features that:

  1. are necessary for developing the language itself
  2. bring important concepts/features to the community in a way its effect can only be maximized or leveraged by making it part of the language
And this applies to the Elixir team and myself. For example, we implemented parallel and concurrent streams in Flow, and we didn't add it to the language either. Furthermore, a library should not require "core approval" for adoption. Elixir was designed to be an extensible language. Core approval would go directly against the language goals.

Have a great weekend,



Reply all
Reply to author
Forward
0 new messages