Fathoming TASTY and possible implications

Matan

Oct 4, 2015, 12:39:21 PM

to scala-user

Hi,

This is a spin-off from a collections API thread here on the Scala Google group.

TASTY sounds like something that could send a large shock wave across the build and IDE ecosystems (sbt, Scala IDE...). Should we expect TASTY to be part of an imminent Scala release, or will it ship in some optional, incremental fashion?

I'd be happy to learn what impact it might have on the collections redesign, Spark, or other projects, as I haven't had the chance to look into the possible impacts beyond the very obvious ones on version compatibility and future virtual machine runtime containers. The only connection I can see is that TASTY is a big enough change to postpone work on better collections until it is done.

So, I'd be happy to hear comments on both aspects laid out above.

Also, I'm currently working on AST analysis and visualization, and if TASTY is soon out the door, maybe I should do that on some branch of it, rather than working through a compiler plugin that extracts the AST of 2.10 and 2.11 projects during compile time.

Thanks,

Matan

Simon Ochsenreither

Oct 4, 2015, 8:08:01 PM

to scala-user

Hi Matan,

to be clear, my comment on TASTY vs. collections is not an official position, but just a gut feeling.

Everything that follows is also just my own opinion.

If you look at various "collection-like" APIs/things like Slick or Spark, you will find multiple cases where it would be interesting to work with the ASTs at compile-time instead of providing an implementation at runtime. In fact, that was more or less what Slick's "direct embedding" tried to do. It more or less failed, because the technology wasn't ready at that time, and the restrictions it placed on code were just not practical. For instance, compilation from source was required, because the information macros need just wasn't there anymore in class files.

If you look at the lifted embedding, it's basically a very elaborate way of saying "I see your T, and I'll lift that into a Rep[T] so that I can reason about what goes on to generate queries later. You still have to do weird things like replacing == with === though, because we can't make the illusion perfect".
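To make the lifting idea concrete, here is a minimal sketch of why a separate === operator is needed. It is not Slick's actual API: Rep, Column, Const, Eq and render are hypothetical names made up for this illustration, and only the general shape (values lifted into a tree that can be inspected later) follows the description above.

```scala
object LiftedSketch {
  // Hypothetical expression tree standing in for "a query we can inspect later".
  sealed trait Rep[T]
  final case class Column[T](name: String) extends Rep[T]
  final case class Const[T](value: T) extends Rep[T]
  final case class Eq[T](left: Rep[T], right: Rep[T]) extends Rep[Boolean]

  implicit class RepOps[T](private val self: Rep[T]) {
    // `==` is defined on Any and always returns a plain Boolean, so it cannot
    // be made to build a tree node; a separate operator is needed for that.
    def ===(other: Rep[T]): Rep[Boolean] = Eq(self, other)
  }

  // Render the tree as SQL-like text (illustrative only, no escaping).
  def render(rep: Rep[_]): String = rep match {
    case Column(name)     => name
    case Const(s: String) => s"'$s'"
    case Const(value)     => value.toString
    case Eq(left, right)  => s"(${render(left)} = ${render(right)})"
  }

  val age: Rep[Int] = Column("age")
  val limit: Rep[Int] = Const(18)

  // Compares the two tree nodes right now and yields a Boolean.
  val plainEquals: Boolean = age == limit

  // Records the comparison as data, to be turned into a query later.
  val example: Rep[Boolean] = age === limit
}
```

Here `LiftedSketch.plainEquals` is just `false` (the two nodes are different objects), while `LiftedSketch.render(LiftedSketch.example)` produces the text `(age = 18)`. That gap between == and === is the "imperfect illusion" mentioned above.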

Don't get me wrong, I'm constantly impressed and amazed by how Stefan manages to deal with this very, very hard problem, but in a sense, he shouldn't have to go through all that pain just to be able to emit queries.

It's kind of the same with Spark.

From my point of view, TASTY will change this situation drastically: Because the AST is "always" available, many of the things that were impractical before are now reasonable ideas.

The question that arises from all of that is: Now that the reason for having these elaborate APIs with lifting, Rep[T]s, etc. is going away, why should we still have different, incompatible "collection-like" APIs all over the place?

The reason is that the collection API prescribes a very precise way of doing things, which is straight-forward, impossible to optimize, and doesn't work for practically anything except in-memory collections.

From this point of view, I think that revisiting the collections API after TASTY makes a lot of sense: It's the first time that we have all the tools available which are necessary to provide one API which works reasonably well for more than a single use-case at a time.

My interest is primarily not about getting rid of CanBuildFrom or improving the collection library itself (from a usage point of view, collections are amazing):

The quest which I think is interesting to explore is whether we can give up a (not so) tiny amount of convenience in the collection use-case to be able to use the API for more than in-memory collections.
The fact that everyone – despite the expressiveness of Scala's type-system – needs to invent their own "collection-like" API is completely embarrassing; it just shouldn't be that way.

If people go down the path I'm imagining, I think we will see a strong reduction of CanBuildFroms in the end, not because it is a design goal, but because the thing CanBuildFrom implies just isn't meaningful in anything except in-memory collections: What would it even mean to say "Your SQL query is required to create intermediate data-structures after every select, join, where, group by"? How are we even going to verify that? Will we run every SQL query in steps, where join(...).groupBy(...).filter(...).map(...) means that we will first run a query with just join(...), then a second with join(...).groupBy(...), then a third with join(...).groupBy(...).filter(...) etc.?
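For in-memory collections, the cost of building an intermediate structure after every step can be observed directly. The following is a small sketch, assuming Scala 2.13 or later (where views are evaluated through iterators); the object and method names are invented for the example.

```scala
object StrictVsLazy {
  // Returns how often the mapping function ran: (strict style, view style).
  def countEvaluations(): (Int, Int) = {
    val data = (1 to 1000).toList

    var strictCalls = 0
    // Strict: map builds a full 1000-element List before take(3) ever runs.
    val strictResult = data.map { x => strictCalls += 1; x * 2 }.take(3)

    var lazyCalls = 0
    // View: map and take are only recorded; toList is what forces evaluation.
    val lazyResult = data.view.map { x => lazyCalls += 1; x * 2 }.take(3).toList

    // Both styles produce the same answer, they only differ in work done.
    assert(strictResult == lazyResult)
    (strictCalls, lazyCalls)
  }
}
```

On Scala 2.13, `StrictVsLazy.countEvaluations()` should return `(1000, 3)`: the strict version maps every element although only three are needed, which is the kind of requirement that has no sensible counterpart for a SQL query.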

If we remove all the things from a hypothetical API which don't make any sense in general ("the collection API contract requires you to build an intermediate data structure after every operation, even if we are only interested in the first three elements in the end"), I think we will approach a design where we clearly separate

- The kind of input (collection, database table, Spark RDD, ...)
- The composition of operations (map, filter, flatMap, groupBy, ...)
- The style of execution (force, toFuture, toTask, toReactiveStream, run, runAsync, ...)

from each other, and that design would be beneficial to in-memory collection use-cases, too. Things would be slightly more inconvenient (operations don't run implicitly, execution needs to be requested), but could be more efficient, easier to optimize, require less awareness of "standard" vs. views vs. iterator styles, and have less surprising outcomes.
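A rough sketch of what such a three-way separation could look like for the in-memory case follows. Everything in it is hypothetical (Source, Pipeline and the method names are made up here, borrowing force and toFuture from the list above); it is not a proposal for the real collections API, and it assumes Scala 2.13 or later.

```scala
object PipelineSketch {
  import scala.concurrent.{ExecutionContext, Future}

  // 1. The kind of input: anything that can hand out elements on request.
  trait Source[A] { def elements(): Iterator[A] }

  object Source {
    def fromIterable[A](xs: Iterable[A]): Source[A] =
      new Source[A] { def elements(): Iterator[A] = xs.iterator }
  }

  // 2. The composition of operations: only a description, nothing runs here.
  final class Pipeline[A] private (build: () => Iterator[A]) {
    def map[B](f: A => B): Pipeline[B] = new Pipeline(() => build().map(f))
    def filter(p: A => Boolean): Pipeline[A] = new Pipeline(() => build().filter(p))
    def flatMap[B](f: A => Iterable[B]): Pipeline[B] = new Pipeline(() => build().flatMap(f))
    def take(n: Int): Pipeline[A] = new Pipeline(() => build().take(n))

    // 3. The style of execution: has to be requested explicitly.
    def force(): List[A] = build().toList
    def toFuture()(implicit ec: ExecutionContext): Future[List[A]] = Future(force())
  }

  object Pipeline {
    def from[A](source: Source[A]): Pipeline[A] = new Pipeline(() => source.elements())
  }

  // Usage: returns the result and how often the mapping function was called.
  def demo(): (List[Int], Int) = {
    var evaluated = 0
    val described = Pipeline
      .from(Source.fromIterable(1 to 1000000))
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 3 == 0)
      .take(3)
    // Nothing has been computed so far; force() triggers the work.
    val result = described.force()
    (result, evaluated)
  }
}
```

`PipelineSketch.demo()` yields `List(6, 12, 18)` after touching only a handful of the million input elements, because no intermediate collection is built between the steps. A database or Spark backend would presumably interpret the description differently instead of wrapping iterators, which this sketch does not attempt.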

Just my opinion.

Bye,

Simon

Matan Safriel

Oct 4, 2015, 8:43:17 PM

to scala-user

Thanks, and sorry for putting you in the position of having to state that it was a private opinion. Anything that would make things like Slick less involved under the hood and less awkward to use relative to other collections will be a very positive development. So my private opinion is that what you have just suggested/described would be a huge improvement over the current state of affairs. In particular, the notion of better building blocks for out-of-memory collections sounds very interesting.

Eric Richardson

Oct 7, 2015, 6:46:53 PM

to scala-user

Hi Simon,

This is a very good observation and I agree with it, even though I don't have much expertise in the area. I created a simple flatMap/reduce algorithm using a Scala List collection. I tried this out on Spark and it worked there as well, so then I thought: can I share this in both domains? Well, as you pointed out, Spark's RDD has collection-like operations such as flatMap/reduce, but there is no trait or class shared with Scala collections that I could find. There may be some advanced way to share the code, but that would be beyond my knowledge of Scala.

It seems that trait definitions which can be reused across domains like Spark, Slick, and Scala collections are very important, even if they are "thin"; you mention this in your bullet "The composition of operations".
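One way such a "thin" shared definition might look is a small type class. This is only a sketch under assumptions: FlatMapReduce and totalWordLength are hypothetical names, Scala 2.13 or later is assumed, and Vector stands in for a second domain. A Spark RDD instance is not shown; it would presumably need extra evidence such as a ClassTag, so the signatures below would not fit as they are.

```scala
object SharedOps {
  // Hypothetical thin abstraction: just the two operations the algorithm needs.
  trait FlatMapReduce[F[_]] {
    def flatMap[A, B](fa: F[A])(f: A => Iterable[B]): F[B]
    def reduce[A](fa: F[A])(op: (A, A) => A): A
  }

  object FlatMapReduce {
    implicit val forList: FlatMapReduce[List] = new FlatMapReduce[List] {
      def flatMap[A, B](fa: List[A])(f: A => Iterable[B]): List[B] = fa.flatMap(f)
      def reduce[A](fa: List[A])(op: (A, A) => A): A = fa.reduce(op)
    }

    implicit val forVector: FlatMapReduce[Vector] = new FlatMapReduce[Vector] {
      def flatMap[A, B](fa: Vector[A])(f: A => Iterable[B]): Vector[B] = fa.flatMap(f)
      def reduce[A](fa: Vector[A])(op: (A, A) => A): A = fa.reduce(op)
    }
  }

  // The algorithm is written once, against the abstraction only.
  // Like reduce on standard collections, it fails on empty input.
  def totalWordLength[F[_]](lines: F[String])(implicit ops: FlatMapReduce[F]): Int = {
    val lengths = ops.flatMap(lines) { line =>
      line.split(" ").filter(_.nonEmpty).map(_.length).toList
    }
    ops.reduce(lengths)(_ + _)
  }
}
```

With this, `SharedOps.totalWordLength(List("a bc", "def"))` and `SharedOps.totalWordLength(Vector("a bc", "def"))` both run the same code and give 6.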