Shouldn't collection zip functions return streams?


Ian Nowland

Nov 25, 2015, 11:46:22 AM
to scala-l...@googlegroups.com
Just a thought: when users call the various collection zip functions, what they're really saying is that they want to perform the next operation on their collection with each element paired against some other value, such as the corresponding element of a second collection, or its index.

From that perspective, it seems like the zip functions should return Streams rather than maintain the base collection type. This makes more sense computationally whenever the next operation is something like a find or an exists, but I think it also makes more sense logically: by converting to a Stream, the zip isn't actually performed at that moment; instead it becomes a just-in-time operation on the elements of the collection from that point on.
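A rough sketch of what I mean, done by hand with today's Stream (not a concrete API proposal):

    val xs = List("a", "b", "c", "d")

    // Today: zipWithIndex eagerly builds the whole List[(String, Int)],
    // even though exists only needs a prefix of it.
    val eagerHit = xs.zipWithIndex.exists { case (s, i) => s == "b" && i == 1 }

    // The idea: a Stream builds the pairs just in time, so exists/find
    // stop as soon as they succeed.
    val lazyHit = xs.toStream.zipWithIndex.exists { case (s, i) => s == "b" && i == 1 }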

--Ian 

Simon Ochsenreither

Nov 25, 2015, 12:17:44 PM
to scala-language
Practically everything in the collection API should return streams. :-)

Julien Vion

Nov 26, 2015, 5:42:42 AM
to scala-language
AFAIK, (c1, c2).zipped performs a lazy zip of c1, c2.
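For example, no intermediate collection of pairs is built here:

    val c1 = List(1, 2, 3)
    val c2 = List(10, 20, 30)

    // Tuple2Zipped applies the function to the element pairs directly,
    // without first materializing a List[(Int, Int)].
    val sums  = (c1, c2).zipped.map(_ + _)          // List(11, 22, 33)
    val found = (c1, c2).zipped.exists(_ + _ > 25)  // true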

On Wed, Nov 25, 2015 at 18:17, Simon Ochsenreither <simon.och...@gmail.com> wrote:
> Practically everything in the collection API should return streams. :-)


Simon Ochsenreither

Nov 26, 2015, 10:45:48 PM
to scala-language

> AFAIK, (c1, c2).zipped performs a lazy zip of c1, c2.

Absolutely correct. The largest issue with the rest of the collections is that they evaluate things eagerly. Every other problem follows from that. We are paying an extremely high price in terms of correctness, performance and reusability for a slight increase in convenience.
Within the current design, none of these problems can be addressed.

Rex Kerr

Nov 28, 2015, 5:54:28 PM
to scala-l...@googlegroups.com
This isn't entirely true.  Laziness brings its own set of issues: either you have a very limited set of operations, or you can be hit by massive performance bugs caused by innocently reusing an intermediate operation that is expensive to materialize.  Also, many things involving mutation are easier to reason about in eagerly evaluated code.  So there can be big pitfalls in correctness, performance, and reusability with lazily evaluated code too.

So merely being lazy isn't enough.  You need really clearly defined boundaries between lazy and eager evaluation with an easy way to jump between the two, to really enable correctness, performance, and reusability.  And, yes, that's beyond the scope of the current collections design.
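To illustrate the kind of border-crossing I mean, with today's views as a rough stand-in:

    def expensive(i: Int): Int = { Thread.sleep(1); i * 2 }

    val xs = (1 to 100).toList
    val pipeline = xs.view.map(expensive) // lazy: nothing computed yet
    val first = pipeline.headOption       // computes only the one element it needs
    val all = pipeline.force              // the explicit crossing back to eager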

  --Rex



Simon Ochsenreither

Nov 28, 2015, 8:21:02 PM
to scala-language
Not evaluating everything eagerly doesn't necessarily mean laziness in the sense you are using it here.

What I think we need is a clearly defined boundary between _no_ evaluation and evaluation, just like many other libraries out there.

Rex Kerr

Nov 28, 2015, 9:41:47 PM
to scala-l...@googlegroups.com
Actually I was using it to mean precisely "not evaluating everything eagerly".

The clarity of the border is really important, though: I completely agree there.  But a border that is clear yet awkward isn't so great either, as it encourages people to avoid border-crossings even when they really make sense.

  --Rex


On Sat, Nov 28, 2015 at 5:21 PM, Simon Ochsenreither <simon.och...@gmail.com> wrote:
> Not evaluating everything eagerly doesn't necessarily mean laziness in the sense you are using it here.
>
> What I think we need is a clearly defined boundary between _no_ evaluation and evaluation, just like many other libraries out there.


Simon Ochsenreither

Nov 28, 2015, 10:35:03 PM
to scala-language
Then it has to be "lazy" in the same sense as a non-executed database query is lazy. :-)

The border doesn't need to be awkward; it comes naturally when we make a distinction between assembling the computation to be run and executing it.
Mixing the two together is the root issue behind having separate sequential vs. parallel vs. iterator vs. sequential-view vs. parallel-view vs. Slick vs. Spark implementations of the same thing.

These days I would consider an API that provides map/flatMap/filter/groupBy/join/... and also talks about how it is implemented/executed to be conceptually wrong. It causes issues without solving any.
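As a sketch of the shape I mean (all names hypothetical, and a toy rather than a design):

    // The API only assembles a description of the computation.
    sealed trait Query[A] {
      def map[B](f: A => B): Query[B]       = MapQ(this, f)
      def filter(p: A => Boolean): Query[A] = FilterQ(this, p)
    }
    final case class SourceQ[A](data: Seq[A])                 extends Query[A]
    final case class MapQ[A, B](q: Query[A], f: A => B)       extends Query[B]
    final case class FilterQ[A](q: Query[A], p: A => Boolean) extends Query[A]

    // Evaluation is a separate concern: this interpreter runs in memory,
    // but another could translate the same description to SQL or a
    // cluster job without touching the API above.
    def run[A](q: Query[A]): Seq[A] = q match {
      case SourceQ(data)   => data
      case MapQ(src, f)    => run(src).map(f)
      case FilterQ(src, p) => run(src).filter(p)
    }

    val described = SourceQ(1 to 10).map(_ * 2).filter(_ > 5) // nothing has run
    val result    = run(described)                            // now it runs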

Rex Kerr

Nov 28, 2015, 10:52:58 PM
to scala-l...@googlegroups.com
The border doesn't need to be awkward, but it does need to be exquisitely carefully designed.

Consider groupBy, for example.  There are zero ways to do this that don't require the entire collection to be rebuilt in the general case.  Likewise with sorted or partition.

If you consider these to be part of "assembling the computation", you have to be really careful when you abstract out part of your assembly, because they'll be super-expensive if you don't cache them, and you won't even save any memory because it's all created anyway.

So, if they're not to be minefields that trap the unwary coder, they need to require execution.  But then you commit yourself to a large set of eager and non-eager methods, which can be tricky to keep straight.

I'm not saying you can't make this clean and correct, but the naive ways of doing these things either leave you with a pretty impoverished (by Scala standards) set of collection methods, or contain major traps/pitfalls.

It's not as simple as "not eager" or "assembly vs execution".  That's part of the puzzle, but it's a tricky puzzle to get right.  So Scala's approach of keeping the efficiency easy to understand by making everything eager is not _such_ a woefully bad idea, even if I agree that with sufficient care something better could (very likely) be crafted.

 --Rex



Simon Ochsenreither

Nov 28, 2015, 11:06:08 PM
to scala-language

> Consider groupBy, for example.  There are zero ways to do this that don't require the entire collection to be rebuilt in the general case.  Likewise with sorted or partition.

I think every major database vendor will disagree with that.
One benefit of being able to look at the whole computation is that optimization steps can be shared, so even if the general case is constrained, there is plenty that can be done for specific computations. Database vendors have done this for decades. Haskell does it in GHC, too. We could do it in a library.

(Additionally, that's what I was talking about regarding "I would consider an API that provides map/flatMap/filter/groupBy/join/... and talks about how it is implemented/executed conceptually wrong".)

The map/flatMap/filter/groupBy/join/... API should not be rebuilding anything; in fact, it should do nothing at all except represent a computation.

> So, if they're not to be minefields that trap the unwary coder, they need to require execution.  But then you commit yourself to a large set of eager and non-eager methods, which can be tricky to keep straight.

No, that doesn't follow. Having these large sets of eager vs. non-eager methods is one of the major pain points of the current state of the art. I want to get away from that. Again, the whole distinction about how operations are executed is tackling the problem at the wrong level.

Rex Kerr

Nov 28, 2015, 11:29:14 PM
to scala-l...@googlegroups.com
If you're going to go all the way to Akka Streams where you build the processing graph in advance, then I agree.

Otherwise, no.  You're favoring a particular kind of application, where you do exactly one thing with your data (a single linear pipeline), and neglecting the applications where you do some processing, then several different things with that result, and then branch off again from those.

Providing natural nodes for actual materialization really helps build efficient processing trees.
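Concretely, with today's views standing in for the unmaterialized computation:

    def expensive(i: Int): Int = { Thread.sleep(1); i * 2 }

    val intermediate = (1 to 1000).view.map(expensive) // never materialized

    // Two branches off the same intermediate: each one re-runs the
    // expensive map over all 1000 elements.
    val evens = intermediate.filter(_ % 4 == 0).toList
    val odds  = intermediate.filter(_ % 4 == 2).toList

    // An explicit materialization node at the branch point pays the cost once:
    val cached = intermediate.force
    val evens2 = cached.filter(_ % 4 == 0)
    val odds2  = cached.filter(_ % 4 == 2)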

  --Rex



Simon Ochsenreither

Nov 29, 2015, 1:32:48 PM
to scala-language

> If you're going to go all the way to Akka Streams where you build the processing graph in advance, then I agree.

Agreed, and it's not such a big deal; other libraries do the same.
It gives people much more control over execution: they receive the computation and can decide on their own evaluation strategy, instead of having to decide upfront (and then having to copy-paste code, because currently every computation is incompatible with all the others).

Rex Kerr

Nov 29, 2015, 3:34:39 PM
to scala-l...@googlegroups.com
If you have followed the various alterations of the Akka Streams syntax, and the existence of two layers of specification, and so on, I am surprised you say "it's not such a big deal".  They're trying, and it's getting pretty good, but it's _hard_.

And they're doing as well as or better than anyone else out there (AFAICT) in giving you the tools to specify the graph, rather than saying, "Well, I can make these tools easily, so your graph had better look like this".

I think it's a great thing to have, but it looks like a big deal to me.  And I think it's an open question whether the cognitive overhead is larger than eager computation with assignment at-will of intermediate results to vals.

  --Rex



Simon Ochsenreither

Nov 29, 2015, 4:20:12 PM
to scala-language

> If you have followed the various alterations of the Akka Streams syntax, and the existence of two layers of specification, and so on, I am surprised you say "it's not such a big deal".  They're trying, and it's getting pretty good, but it's _hard_.

What I referred to with "it's not such a big deal" is the conceptual change from an API that does various things depending on the implementation to one that just describes the computation. I didn't intend to diminish anyone's contribution. Part of why things are hard is that everyone has to reinvent their own wheel, re-solving things that have already been solved elsewhere. Implementations should be able to focus on their actual goals instead of having to roll the nth map/flatMap/filter/groupBy/join/... API.

> I think it's a great thing to have, but it looks like a big deal to me.  And I think it's an open question whether the cognitive overhead is larger than eager computation with assignment at-will of intermediate results to vals.

I think being able to keep one API in your head, instead of half a dozen that are all slightly different, is worth it. I'd say it would tremendously lower the cognitive overhead involved.

Lex Spoon

Nov 30, 2015, 10:14:54 AM
to scala-l...@googlegroups.com

Programmers need to know when an operation is strict or not. It will help if the collection libraries have some simple way to know, rather than deciding it differently for each operation on a case-by-case basis.

This may be surprising, but many good programmers choose strict as a default, and only use lazy operations if there is a specific reason for it. The reason, at least for me, is that I want to make changes to software and be able to predict the performance impact. For people who like that style, it is nicer if everything is strict unless the collection has been explicitly converted into a lazy view of some kind.

I've written a lot of SQL and Datalog code, and I've worked a fair amount on Datalog engines. In my experience, heavy optimization across a whole query, or worse, the whole query library, makes performance very mysterious. You make a small change, and your performance suddenly tanks. To recover it, you either change things around randomly, or you use tools provided by the system to understand why it made the optimization decisions it did. Either way, programmers end up spending a lot of time studying what the optimizer did.

If you decide as a tool provider to use non-local optimizations, or even just laziness, it is a huge help to developers if you also provide a debugging tool for understanding the optimization decisions. In the Google Web Toolkit, we saw a big increase in user happiness when we added a "Story of Your Compile" for debugging bad optimization in two subsystems: the code splitter and the serialization system.

For lazy zip, I'm not sure what such a tool would look like; a dynamic trace of some kind, perhaps. For staged queries, a branched-off subtopic of this thread, it really helps to have a way to print out the query plan.
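For the staged style, even a toy plan printer goes a long way. Reusing the hypothetical Query/SourceQ/MapQ/FilterQ sketch from earlier in the thread:

    // Prints the assembled plan instead of running it.
    def explain[A](q: Query[A], indent: String = ""): String = q match {
      case SourceQ(data)   => s"${indent}Source(${data.size} rows)"
      case MapQ(src, _)    => s"${indent}Map\n" + explain(src, indent + "  ")
      case FilterQ(src, _) => s"${indent}Filter\n" + explain(src, indent + "  ")
    }

    // explain(SourceQ(1 to 10).map(_ * 2).filter(_ > 5)) prints:
    // Filter
    //   Map
    //     Source(10 rows)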

Good luck, those working on it.

Lex Spoon

Ian Nowland

Nov 30, 2015, 10:31:40 AM
to scala-l...@googlegroups.com
Well, that's why I proposed that what the zip returns is a Stream. By changing the type, the laziness becomes much more explicit.


Simon Ochsenreither

Nov 30, 2015, 3:57:28 PM
to scala-language

> Programmers need to know when an operation is strict or not. It will help if the collection libraries have some simple way to know, rather than deciding it differently for each operation on a case-by-case basis.


Exactly! This is one of the main benefits of composing computations: nothing gets executed until it is explicitly requested.
It also solves the current issue Haoyi mentioned, that "we have this huge collection API that creates new values vs. these dozen important mutable operations which change the data structure itself – and no good way to discover the difference".
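E.g., today nothing at the call site distinguishes the two styles:

    import scala.collection.mutable.ArrayBuffer

    val buf = ArrayBuffer(3, 1, 2)
    val sortedCopy = buf.sorted // returns a new, sorted collection; buf untouched
    buf += 4                    // mutates buf in place
    // Same receiver, similar-looking calls; only the docs say which is
    // which. With compose-then-execute, execution (and any mutation)
    // becomes one explicit step.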