Hi Mata,
to be clear, my comment on TASTY vs. collections is not an official position, but just a gut feeling.
Everything that follows is also just my own opinion.
If you look at various "collection-like" APIs such as Slick or Spark, you
will find multiple cases where it would be interesting to work with the
ASTs at compile time instead of providing an implementation at runtime.
In fact, that was more or less what Slick's "direct embedding" tried to
do. It more or less failed, because the technology wasn't ready at that
time, and the restrictions it placed on code were just not practical.
For instance, compilation from source was required, because the
necessary information for macros to work just wasn't there anymore in
class files.
If you look at the lifted embedding, it's basically a
very elaborate way of saying "I see your T, and I'll lift that into a
Rep[T] so that I can reason about what goes on to generate queries
later. You still have to do weird things like replacing == with ===
though, because we can't make the illusion perfect".
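To make the "lift T into Rep[T]" idea concrete, here is a deliberately tiny sketch. All names in it (Column, Const, Equals, Render) are made up for illustration; this is not Slick's actual API, just the general shape of the technique as I understand it.

```scala
// Hypothetical miniature of a "lifted embedding"; made-up names, not Slick's API.
sealed trait Rep[T] {
  // `==` is defined on Any and always returns Boolean, so a comparison that
  // should become part of the query needs its own operator.
  def ===(other: Rep[T]): Rep[Boolean] = Equals(this, other)
}
final case class Column[T](name: String) extends Rep[T]
final case class Const[T](value: T) extends Rep[T]
final case class Equals[T](left: Rep[T], right: Rep[T]) extends Rep[Boolean]

object Render {
  // Turns the captured expression tree into a SQL-like string.
  def sql(rep: Rep[_]): String = rep match {
    case Column(name)     => name
    case Const(s: String) => "'" + s.replace("'", "''") + "'"
    case Const(other)     => other.toString
    case Equals(l, r)     => sql(l) + " = " + sql(r)
  }
}
```

With this, Render.sql(Column[String]("name") === Const("Mata")) produces a query fragment, whereas using == on the same two values would just compare the case classes and hand back a plain Boolean, with nothing left to translate.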
Don't get me wrong, I'm constantly impressed and amazed at how Stefan
manages to deal with this very, very hard problem, but in a sense, he
shouldn't even have to go through all that pain just to be able to emit
queries.
It's kind of the same with Spark.
From my point of view, TASTY will change this situation drastically:
because the AST is "always" available, many of the things that were
impractical before are now reasonable ideas.
The question that arises from all of
that is: Now that the reason for having these elaborate APIs with
lifting, Rep[T]s, etc. is going away, why should we still have
different, incompatible "collection-like" APIs all over the place?
The reason is that the collection API prescribes a very precise way of
doing things, which is straightforward but impossible to optimize, and
doesn't work for practically anything except in-memory collections.
From this point of view, I think that revisiting the collections API after
TASTY makes a lot of sense: it's the first time that we have all the
tools available which are necessary to provide one API which works
reasonably well for more than a single use case at a time.
My interest is primarily not in getting rid of CanBuildFrom or improving
the collection library itself (from a usage point of view, collections
are amazing).
The question which I think is interesting to explore is whether we can give up a
(not so) tiny amount of convenience in the collection use case to be
able to use the API for more than in-memory collections.
The fact that everyone – despite the expressiveness of Scala's type-system –
needs to invent their own "collection-like" API is completely
embarrassing; it just shouldn't be that way.
If people go down
the path I'm imagining, I think we will see a strong reduction of CanBuildFroms in
the end, not because it is a design goal, but because the thing
CanBuildFrom implies just isn't meaningful in anything except in-memory
collections: What would it even mean to say "Your SQL query is required
to create intermediate data-structures after every select, join, where,
group by"? How are we even going to verify that? Will we run every SQL query in steps, where join(...).groupBy(...).filter(...).map(...) means that we will first run a query with just join(...), then a second with join(...).groupBy(...), then a third with join(...).groupBy(...).filter(...) etc.?
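For in-memory collections, the cost of that "intermediate data-structure after every operation" contract is easy to observe. The following is only a small sketch (the object and method names are made up): it counts how often the mapping function runs for a strict List compared to a view when only three results are wanted.

```scala
object IntermediateResults {
  // Each method returns (result, number of times the mapping function ran).

  // Strict List: map builds a full 1000-element list before filter even starts.
  def strict(): (List[Int], Int) = {
    var calls = 0
    val result = (1 to 1000).toList
      .map { x => calls += 1; x * 2 }
      .filter(_ % 3 == 0)
      .take(3)
    (result, calls)
  }

  // View: the operations are only recorded; elements are pulled on demand by toList.
  def viaView(): (List[Int], Int) = {
    var calls = 0
    val result = (1 to 1000).view
      .map { x => calls += 1; x * 2 }
      .filter(_ % 3 == 0)
      .take(3)
      .toList
    (result, calls)
  }
}
```

Both variants return List(6, 12, 18), but the strict one maps all 1000 elements, while the view stops pulling elements once it has its three results.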
If we remove all the things from a hypothetical API
which don't make any sense in general ("the collection API contract
requires you to build an intermediary data structure after every operation,
even if we are only interested in the first three elements in the
end"), I think we will approach a design where we clearly separate
- The kind of input (collection, database table, Spark RDD, ...)
- The composition of operations (map, filter, flatMap, groupBy, ...)
- The style of execution (force, toFuture, toTask, toReactiveStream, run, runAsync, ...)
from each other, and that design would be beneficial to in-memory collection
use cases, too. Things would be slightly more inconvenient (operations
don't run implicitly, execution needs to be requested), but could be
more efficient, easier to optimize, require less awareness of "standard"
vs. views vs. iterator styles, and have fewer surprising outcomes.
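A rough sketch of what such a three-way separation could look like, for the in-memory case only. Everything here (Query, FromIterable, plan, elements) is a hypothetical name I made up for illustration; force and toFuture are taken from the list above. It is one possible shape under those assumptions, not a proposal for the actual API.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical sketch; all names here are made up for illustration.
sealed trait Query[A] {
  // Composition: each call only records the operation, nothing runs yet.
  def map[B](f: A => B): Query[B] = Mapped(this, f)
  def filter(p: A => Boolean): Query[A] = Filtered(this, p)
  def take(n: Int): Query[A] = Taken(this, n)

  // The recorded pipeline, which a different backend could inspect instead of running it.
  def plan: String
  // In-memory interpretation as one fused pass, without intermediate collections.
  def elements: Iterator[A]

  // Execution: has to be requested explicitly, in the style the caller wants.
  def force: List[A] = elements.toList
  def toFuture(implicit ec: ExecutionContext): Future[List[A]] = Future(force)
}

// Kind of input: here just an in-memory Iterable.
final case class FromIterable[A](source: Iterable[A]) extends Query[A] {
  def plan: String = "source"
  def elements: Iterator[A] = source.iterator
}
final case class Mapped[A, B](in: Query[A], f: A => B) extends Query[B] {
  def plan: String = in.plan + ".map"
  def elements: Iterator[B] = in.elements.map(f)
}
final case class Filtered[A](in: Query[A], p: A => Boolean) extends Query[A] {
  def plan: String = in.plan + ".filter"
  def elements: Iterator[A] = in.elements.filter(p)
}
final case class Taken[A](in: Query[A], n: Int) extends Query[A] {
  def plan: String = in.plan + ".take(" + n + ")"
  def elements: Iterator[A] = in.elements.take(n)
}
```

Something like FromIterable(1 to 1000).map(_ * 2).filter(_ % 3 == 0).take(3) does no work until force or toFuture is called. Note that the functions passed to map and filter are still opaque closures in this sketch, so a database backend could not translate them; that is exactly the gap I would expect AST access to close.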
Just my opinion.
Bye,
Simon