serialization in sbt


Havoc Pennington

Sep 22, 2014, 3:12:36 PM
to sbt-dev
Hi,

We are trying to navigate how to serialize stuff in sbt. I haven't
discussed this too much with others, it's just my initial problem
statement / ideas.

There are several distinct cases I know of.

1. currently sbt uses sbinary for caching
https://github.com/sbt/sbt/blob/0.13/cache/src/main/scala/sbt/Cache.scala

2. we are using play-json in sbt-remote-control for the client-server protocol

3. we are using play-json in sbt-remote-control for serializing task
results over the wire; the serializers are registered via a
registeredFormats key that is (in essence) a list of
(Manifest[T],Format[T])

4. other future uses (serializing an event log?)


We want to move these sbt-remote-control APIs (or something based on
them) into sbt itself:

https://github.com/sbt/sbt-remote-control/blob/master/commons/ui-interface/src/main/scala/sbt/UI.scala

Note the use of play-json's Writes type to send events:
https://github.com/sbt/sbt-remote-control/blob/master/commons/ui-interface/src/main/scala/sbt/UI.scala#L16


play-json is going to make a mess here; it has a bunch of transitive
dependencies that could interfere with sbt plugins, plus play depends
on sbt, so it will be a circular dependency (and also since sbt has a
frozen ABI, we might end up having play-json X.Y in play X.Y+1 -
awkward!).

scala pickling does not have a frozen ABI yet (right?) so if we pulled
it into sbt we'd be stuck on a prerelease of it for the duration of
0.13.x, which would cause trouble if anyone used the final release
inside the sbt classloader.

dependencies are just kind of a problem for sbt, since they all get
version-locked and keep people from using any other version of that
dependency in an sbt plugin.

Ideally we might have a serialization API like pickling's that
supports both binary and JSON, since for some of the use cases the
binary efficiency would be useful.


Here is a strawman suggestion for what to do:

* via shameless cut-and-paste of existing apache-licensed libs,
create a minimal sbt-specific interface to serialization

* in this interface, have the equivalent of play-json's implicit
Reads and Writes, but instead of reading and writing an AST, use
pickling-style streaming Builder and Reader:
https://github.com/scala/pickling/blob/0.9.x/core/src/main/scala/pickling/PBuilderReader.scala

* in the interface, have a macro like play's Json.reads and
Json.writes that can generate case class serializers

* in the interface, have default support for primitive and collection
types like play-json (but omit joda types and other extras that
play-json has that pull in deps)

* in the interface, allow writing either binary or JSON to a stream,
and reading same

* in the public interface, do NOT have explicit access to
implementation of either JSON or binary parsers/writers, or any AST,
or any kind of validation with human-readable errors

* in the public interface, do NOT have any way for third parties to
plug their own parsers/writers

* in private implementation, sbt could use sbinary and probably a
cut-and-paste of one of the simpler json parsers (to avoid adding a
json dep that might stomp something else in sbt's classloader)

* not sure how to migrate current uses of sbinary; current use of
play-json in sbt-remote-control is not yet ABI-locked so can be
stripped out before we move those APIs to sbt proper.
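
To make the strawman a bit more concrete, here's the rough shape I'm imagining for the interface (every name here is hypothetical; this is just a sketch, not an existing API):

package sbt.serialization

// Streaming builder/reader in the spirit of pickling's PBuilder/PReader;
// no JSON or binary AST is ever exposed to callers.
trait ValueBuilder {
  def beginObject(): Unit
  def putField(name: String): Unit
  def putInt(value: Int): Unit
  def putString(value: String): Unit
  def endObject(): Unit
}

trait ValueReader {
  def field(name: String): ValueReader
  def readInt(): Int
  def readString(): String
}

// play-json-style typeclasses; normally generated by a macro such as
// sbt.serialization.format[Foo] rather than written by hand.
trait SbtWrites[T] { def writes(value: T, builder: ValueBuilder): Unit }
trait SbtReads[T] { def reads(reader: ValueReader): T }
trait SbtFormat[T] extends SbtReads[T] with SbtWrites[T]

The JSON and binary implementations of ValueBuilder/ValueReader would live entirely in the private implementation and never show up in plugin code.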

The intended outcome is:

* the sbt serialization interface is relatively small, since it
exports the bare minimum to stream typed data to/from JSON or binary,
with no AST and no third-party extensibility (third-party extension is
limited to adding new types which can be serialized). AST, Validation,
support for additional formats are all out of scope.

* we might do this pretty quickly through liberal use of cut-and-paste.

* by rolling an sbt-specific solution, sbt has nothing on the
classpath which will stomp on people's attempts to use the libraries
they want to use; we avoid circular deps, surprise ABI breaks, etc.

* we can change the private implementation over time if needed; in
particular if scala or the scala ecosystem really clearly locks down
on one ABI-frozen lib that does what we want we could plug it in
there. woohoo http://en.wikipedia.org/wiki/Indirection

* most API users (sbt plugin authors) would interact with this whole
thing in a very limited way; say the plugin defines type Foo which is
a task result type, then the plugin has to do a thing like:

Keys.registeredFormats += sbt.serialization.format[Foo]

otherwise this should be largely an sbt-internal issue.

Feasibility spike questions:

* can the serialization interface be kept pretty small?

Thoughts/alternatives?

Havoc

Josh Suereth

Sep 22, 2014, 3:23:44 PM
to sbt...@googlegroups.com
Just to add some color.

On Mon, Sep 22, 2014 at 3:12 PM, Havoc Pennington <h...@typesafe.com> wrote:
Hi,

We are trying to navigate how to serialize stuff in sbt.  I haven't
discussed this too much with others, it's just my initial problem
statement / ideas.

There are several distinct cases I know of.

1. currently sbt uses sbinary for caching
https://github.com/sbt/sbt/blob/0.13/cache/src/main/scala/sbt/Cache.scala

2. we are using play-json in sbt-remote-control for the client-server protocol

3. we are using play-json in sbt-remote-control for serializing task
results over the wire; the serializers are registered via a
registeredFormats key that is (in essence) a list of
(Manifest[T],Format[T])

4. other future uses (serializing an event log?)

One thing we'd like to experiment with is REQUIRING serializers for all tasks (as well as some notion of differencing).  This would let us experiment with things like AUTOMATICALLY enabling caching/incremental tasks when we detect a task is taking a long time.

 


We want to move these sbt-remote-control APIs (or something based on
them) into sbt itself:

https://github.com/sbt/sbt-remote-control/blob/master/commons/ui-interface/src/main/scala/sbt/UI.scala

Note the use of play-json's Writes type to send events:
https://github.com/sbt/sbt-remote-control/blob/master/commons/ui-interface/src/main/scala/sbt/UI.scala#L16


play-json is going to make a mess here; it has a bunch of transitive
dependencies that could interfere with sbt plugins, plus play depends
on sbt, so it will be a circular dependency (and also since sbt has a
frozen ABI, we might end up having play-json X.Y in play X.Y+1 -
awkward!).

scala pickling does not have a frozen ABI yet (right?) so if we pulled
it into sbt we'd be stuck on a prerelease of it for the duration of
0.13.x, which would cause trouble if anyone used the final release
inside the sbt classloader.


Essentially, we'd have our own fork of scala pickling with a frozen ABI.  I've talked with Heather about this, and we'd be working with the authors on such a release.

 
dependencies are just kind of a problem for sbt, since they all get
version-locked and keep people from using any other version of that
dependency in an sbt plugin.


Right, I think we need to start thinking of sbt as a "platform" where things like scala-picklers/serializers are mandated libraries from the platform.  If sbt starts making decisions here, questions like "which scalaz should plugins use" go away.

 
Ideally we might have a serialization API like pickling's that
supports both binary and JSON, since for some of the use cases the
binary efficiency would be useful.


Here is a strawman suggestion for what to do:

 * via shameless cut-and-paste of existing apache-licensed libs,
create a minimal sbt-specific interface to serialization

 * in this interface, have the equivalent of play-json's implicit
Reads and Writes, but instead of reading and writing an AST, use
pickling-style streaming Builder and Reader:
https://github.com/scala/pickling/blob/0.9.x/core/src/main/scala/pickling/PBuilderReader.scala


If we don't end up using picklers wholesale, this has my vote (but I'd like the macros that picklers brings for usability).
It sounds like it's mostly a picklers fork.  Why not JUST fork the picklers library, put it in the org.scala-sbt namespace and issue binary compatibility releases of it?  I'm totally down with doing it, and keeping all the other points you make similar.

- Josh

eugene yokota

Sep 22, 2014, 3:25:49 PM
to sbt...@googlegroups.com
My preference would be to separate out the three layers:
- Build user facing API for persisting case classes
- Eventual serialization format/protocol
- Backend engine implementation

Josh and I have been talking about this for quite some time, and I'd go for something like scala/pickling or rapture.io for the user-facing API, and
JSON for the format.
Jon Pretty was recommending jawn (https://github.com/non/jawn) as the JSON backend in his rapture.io talk in Tokyo, and it looks pretty good.

-eugene


Havoc Pennington

Sep 22, 2014, 3:34:10 PM
to sbt-dev
On Mon, Sep 22, 2014 at 3:23 PM, Josh Suereth <joshua....@gmail.com> wrote:
> It sounds like it's mostly a picklers fork. Why not JUST fork the picklers
> library, put it in the org.scala-sbt namespace and issue binary
> compatibility releases of it? I'm totally down with doing it, and keeping
> all the other points you make similar.
>

The only difference between this and what I suggested is that I bet we
could drop a large(?) portion of the public API from picklers (and
maybe some other stuff too - I'm imagining we have much simpler
requirements, but perhaps that's a fantasy).

Havoc

Havoc Pennington

Sep 22, 2014, 3:35:33 PM
to sbt-dev
On Mon, Sep 22, 2014 at 3:25 PM, eugene yokota <eed3...@gmail.com> wrote:
> My preference would be to separate out the three layers:
> - Build user facing API for persisting case classes
> - Eventual serialization format/protocol
> - Backend engine implementation
>
> Josh and I have been talking about this for quite some time, and I'd go for
> something like scala/pickling or rapture.io for the user-facing API, and
> JSON for the format.
> Jon Pretty was recommending jawn (https://github.com/non/jawn) as the JSON
> backend in his rapture.io talk in Tokyo, and it looks pretty good.
>

One trick to the implementation backend is that (barring nutty
classloader hackery) the JVM doesn't let us truly hide a third-party
dependency. We still 'contaminate' plugins even if our public API
doesn't use anything from the dependency.

Havoc

Havoc Pennington

Sep 22, 2014, 5:22:32 PM
to sbt-dev
Jim Powers just reminded me of another issue we've encountered, which
is the need for parallel data type hierarchies right now - see
https://github.com/sbt/sbt-remote-control/blob/master/server/src/main/scala/sbt/server/SbtToProtocolUtils.scala
where we convert. We have "plain old data" case classes in
sbt.protocol available to both client and server, and the original sbt
type available only in the server.

These may not always perfectly correspond; consider the case of adding
fields to a task result type: you might add a new type to the protocol
but just add a new method to the plain sbt type.

In sbt 1.0 when we can break compatibility, we might be able to simply
mandate that sbt tasks return a serializable type - which would
naturally have to be a "plain old data" type without associated logic,
OR possibly a trait that would have both a POD implementation in
sbt.protocol and some other implementation in sbt proper. It isn't
clear.

This is sort of a separate problem from how to do serialization but
it's another issue to be addressed if we want to be able to
(de)serialize any task result.

Havoc




eugene yokota

Sep 22, 2014, 5:34:27 PM
to sbt...@googlegroups.com
I'm personally ok with sbt having a more opinionated stack of library dependencies on the classpath, like specific versions of Scalaz, Dispatch, jawn|json4s|*, etc.
The backend replaceability is just for better engineering while achieving good performance and long-term stability.
I'm also ok with shading json libraries, so we don't interfere with plugin authors including a newer version of jawn|json4s|*.

-eugene


eugene yokota

Sep 23, 2014, 11:54:40 AM
to sbt...@googlegroups.com
I spent some time yesterday kicking around pickling, and I'm left with a longer-term concern, which stems from the difference between serialization and data binding.

- serialization: start with an object => encode it to bytes/xml/json => decode it back to an object (on same platform)
- data binding: start with communication protocol => generate data transfer object (DTO) => marshal DTO into bytes/xml/json => you get exactly what you specified => unmarshal back to DTO (on any platform)

They look similar, but the emphasis of serialization is automation and losslessness, whereas data binding guarantees a stable data format across platforms. Data binding is sometimes called design by contract. Over time, when the protocol (and its corresponding objects) evolves by adding fields, serialization goes belly up unless it's specifically worked around, whereas data binding can evolve the protocol description in a way such that old and new wire representations can interoperate.

Here's a simple demonstration using pickling:

case class Foo(x: Int, y: Option[Int])

object Hello extends App {
  import scala.pickling._
  import json._
  val obj = """{
    "x": 1
  }""".unpickle[Foo]
}

This currently blows up as follows:

java.util.NoSuchElementException: key not found: y
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:58)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:58)
  at scala.pickling.json.JSONPickleReader.readField(JSONPickleFormat.scala:243)
  at scala.pickling.json.JSONPickleReader.readField(JSONPickleFormat.scala:166)
  at hello.Hello$HelloFooUnpickler2$2$.unpickle(hello.scala:10)  

Part of the reason I suspect JSON is so popular is that it allows people to evolve the protocol, due to the lack of hard binding and tooling.
In that way, hand parsing effectively is data binding.
Pickling is unapologetically on the serialization side of the spectrum, especially with its emphasis on compile-time safety, and it's probably fine for reading/writing exactly what you started out with. We might even be able to provide a pickler that's somehow lenient about additional, missing, or explicitly optional fields. But this is a concern for me because sbt relies on Scala's ability to grow a class over time.
To minimize the effort on the developer's part of maintaining a data binding schema, I think there's a school of tools that would generate the schema by analyzing instance documents. This is sometimes called schema inference or schema generation.

-eugene


Havoc Pennington

Sep 23, 2014, 12:51:06 PM
to sbt-dev
Hi,

The example you gave is about *removing* a field, not adding one. I
think that's impossible without planning ahead (by making the field an
Option in the first place as you did in your example). It looks like
pickling has chosen to require an Option field to be present (probably
set to null?), but play-json I'm pretty sure allows it to be simply
absent. In my mind this is kind of an implementation detail of the
JSON unpickler that needs to be changed for our purposes, rather than
anything fundamental; doesn't it seem like we could fix this with a
pretty simple patch, to allow Option fields to be missing?
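
For comparison, here is a minimal play-json sketch of the behavior I mean (as far as I know the Json.reads macro maps a missing field to None for Option members):

import play.api.libs.json._

case class Foo(x: Int, y: Option[Int])

object Foo {
  implicit val fooReads: Reads[Foo] = Json.reads[Foo]
}

// Json.parse("""{ "x": 1 }""").validate[Foo] succeeds with Foo(1, None);
// the missing "y" is simply treated as absent rather than an error.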

I do think protocol evolution is an important issue though, and I'm
not sure what makes sense here. More below -

On Tue, Sep 23, 2014 at 11:54 AM, eugene yokota <eed3...@gmail.com> wrote:
> - serialization: start with an object => encode it to bytes/xml/json =>
> decode it back to an object (on same platform)
> - data binding: start with communication protocol => generate data transfer
> object (DTO) => marshal DTO into bytes/xml/json => you get exactly what you
> specified => unmarshal back to DTO (on any platform)
>

I feel like this is more about how we use the tools; for example, do
we have a separate DTO, or just the original sbt types?

The mail I sent yesterday about
https://github.com/sbt/sbt-remote-control/blob/master/server/src/main/scala/sbt/server/SbtToProtocolUtils.scala
makes this issue very concrete, no?

That is, we could have a set of DTO types (the things we have now in
sbt.protocol._) and we would always explicitly convert to and from the
actual sbt types.

The way you would handle adding a field is something like this:

package protocol {
  case class Foo(x: Int)
  case class Foo2(x: Int, y: Int)
}

package sbt {
  abstract class Foo { def x: Int; def y: Int }
}

When you add field "y" to sbt.Foo you would update to map sbt.Foo to
protocol.Foo2

Older peers expecting just the "x" field simply ignore the "y" field.
It *is* important that the deserializer for Foo ignores unknown
fields; this rules out some binary encoding possibilities, for
example.

It would also be possible to do it this way:

package protocol {
  case class Foo(x: Int)
  case class Foo2(x: Int, y: Option[Int]) // y is optional
}

The Option would enable someone expecting protocol.Foo2 to cope with
older peers that send only the protocol.Foo with no "y" field.

Anyway, right now sbt-remote-control does have this mapping layer from
native type to protocol (DTO) type. We allow registering conversions
from native to protocol:
https://github.com/sbt/sbt-remote-control/blob/master/commons/ui-interface/src/main/scala/sbt/UI.scala#L98

Jim discovered yesterday a case where this gets a bit ugly, which is
if you have:

package protocol {
  case class Foo(x: Int)
  case class Bar(foo: Foo, something: Int)
}

package sbt {
  case class Foo(x: Int)
  case class Bar(x: Foo, something: Int)
}

the issue is that to do a macro-generated serializer for sbt.Bar, it
needs a serializer for sbt.Foo, but we only have a serializer for
protocol.Foo. I think the clean solution for this may be to have an
explicit typeclass for "protocolizable" which, given type S, can give
you type P which has a Writes[P] available. However, I'm not sure that
actually works without blowing scala's mind in some way. We need to
think about this one.
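
Roughly, the kind of typeclass I mean (a hypothetical sketch using play-json's Writes; the Protocolizable name and everything else here is made up):

import play.api.libs.json.{ Json, Writes }

// Given a native type S, know the protocol (DTO) type P, how to convert
// to it, and how to write P.
trait Protocolizable[S] {
  type P
  def toProtocol(value: S): P
  def writes: Writes[P]
}

object Protocolizable {
  // A Writes for any native type with a Protocolizable instance,
  // implemented by converting to the DTO first and writing that.
  implicit def nativeWrites[S](implicit p: Protocolizable[S]): Writes[S] =
    Writes { value => Json.toJson(p.toProtocol(value))(p.writes) }
}

Then the macro-generated serializer for sbt.Bar could, in principle, pick up a Writes[sbt.Foo] through that implicit; whether scalac's implicit search actually cooperates is exactly the open question.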

The kind of obvious question is why don't we just have a
Writes[sbt.Foo] which is defined by converting to protocol.Foo and
then writing protocol.Foo. I think there was some reason we didn't do
this in sbt-remote-control but I am blanking on it right now.

Now, we do not *have* to do these two parallel type trees. We could
explore solutions that keep one or the other.

Option 1, keep only protocol:
* the rule (starting in sbt 1.0) would be that all task results and
data sent in events must be plain old data, no associated behavior
* we put all these plain data types (I think what you are calling a
DTO) in their own jar, shared by both sbt and sbt clients
* sbt tasks use the DTO types directly (we drop the idea of
`registeredProtocolConversions`)
* the problem of course is how to pass around file handles, loggers,
and stuff like that (non-data); the answer may be that these have to
be services or something and not task results

Option 2, keep only the native sbt types:
* go directly from sbt.Foo to JSON/binary on the sbt server side
* clients are "on their own" in that they don't share a DTO jar with
the server (maybe Scala clients have an sbt.protocol which is
autogenerated in some way by serializing the native sbt types and
creating schemas from those examples?)
* we can't *truly* get rid of the sbt.protocol DTO stuff on the
client side though, as far as I can figure out, unless clients just
get AST blobs and use them directly or something

Let's call Option 3 keeping both types, as sbt-remote-control does now.

An advantage of having two separate sets of types (DTO and sbt native)
is that protocol breaks actually show up as compile errors, or at
least they often would.
A disadvantage is complexity / duplication.

Havoc

Havoc Pennington

Sep 23, 2014, 12:54:32 PM
to sbt-dev
Another minor issue is that we do use the JsValue AST in a couple of
places to basically tunnel protocol through the protocol:

https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/Protocol.scala#L188
final case class TaskEvent(taskId: Long, name: String, serialized: JsValue) extends Event

If we do an API with no AST (which could be a great way to cut down
the API surface area), this would probably need to become some kind of
opaque "thing you can unpickle from" object. It could just be a
ByteString type of thing, but the downside can be that you end up with
ugly protocol such as a JSON string which has JSON in it (vs. a nested
JSON value). This may be avoidable if we invent an opaque object that
would essentially have an AST inside (in the case of JSON).
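
Sketching the kind of opaque object I mean (hypothetical names, and a placeholder for whatever reads-typeclass we end up with):

// Placeholder for whatever typeclass the serialization API ends up using.
trait SbtReads[T]

// Opaque handle to an already-serialized value; the only public operation
// is unpickling it into a concrete type. Internally it could hold a JSON
// AST (or raw bytes), which avoids JSON-encoded-as-a-string-inside-JSON.
trait SerializedValue {
  def parse[T: SbtReads]: Option[T]
}

trait Event
final case class TaskEvent(taskId: Long, name: String, serialized: SerializedValue) extends Event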

Havoc

eugene yokota

Sep 23, 2014, 3:37:14 PM
to sbt...@googlegroups.com
On Tue, Sep 23, 2014 at 12:50 PM, Havoc Pennington <havoc.pe...@typesafe.com> wrote:
The example you gave is about *removing* a field, not adding one.

I was trying to mimic a situation in which ver 1.0 has class Foo(x: Int), and ver 1.1 has class Foo(x: Int, y: Option[Int]).

 
In my mind this is kind of an implementation detail of the
JSON unpickler that needs to be changed for our purposes, rather than
anything fundamental; doesn't it seem like we could fix this with a
pretty simple patch, to allow Option fields to be missing?
[snip]
I feel like this is more about how we use the tools; for example, do
we have a separate DTO, or just the original sbt types?

I did mention "We might even be able to provide a pickler that's somehow lenient about additional, missing, or explicitly optional fields" too, but to me serialization is all about the roundtrip of objects, and data binding is about the roundtrip of JSON and/or XML. These are different points of guarantee, and without a schema, I think it's difficult to ensure protocol stability.
With data binding tools, you start out with a schema document like XSD or with instance documents like json files (if we go towards schema generation), and the case classes and/or traits are produced by a code generator, which is mostly automatic. The reason we wanted something like pickling is that when you approach from the case class side, the experience is also automatic.

In some cases you might be able to directly use the generated DTO case classes, but since case classes are harder to evolve in a bincompat way, we'll likely be converting DTOs to some more usable form of entity object.
Yes. This glue layer can be thought of as a form of the Generation Gap pattern (http://martinfowler.com/dslCatalog/generationGap.html) if we use code gen on the protocol._ side.

When you add field "y" to sbt.Foo you would update to map sbt.Foo to
protocol.Foo2

Older peers expecting just the "x" field simply ignore the "y" field.
It *is* important that the deserializer for Foo ignores unknown
fields; this rules out some binary encoding possibilities, for
example.

Pickling persists the original type name into a tpe field.

package hello

case class Foo(x: Int)
case class Foo2(x: Int)

object Hello extends App {
  import scala.pickling._
  import json._
  val pkl = Foo(1).pickle
  val obj = pkl.value.unpickle[Foo2]
}

This results in:

java.lang.ClassCastException: hello.Foo cannot be cast to hello.Foo2
  at hello.Hello$delayedInit$body.apply(hello.scala:10)
  at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
  at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
  at scala.App$$anonfun$main$1.apply(App.scala:71)
  at scala.App$$anonfun$main$1.apply(App.scala:71)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
  at scala.App$class.main(App.scala:71)


Anyway, right now sbt-remote-control does have this mapping layer from
native type to protocol (DTO) type. We allow registering conversions
from native to protocol:
 https://github.com/sbt/sbt-remote-control/blob/master/commons/ui-interface/src/main/scala/sbt/UI.scala#L98

Jim discovered yesterday a case where this gets a bit ugly, which is
if you have:

package protocol {
  case class Foo(x: Int)
  case class Bar(foo: Foo, something: Int)
}

package sbt {
  case class Foo(x: Int)
  case class Bar(x: Foo, something: Int)
}

the issue is that to do a macro-generated serializer for sbt.Bar, it
needs a serializer for sbt.Foo, but we only have a serializer for
protocol.Foo. I think the clean solution for this may be to have an
explicit typeclass for "protocolizable" which, given type S, can give
you type P which has a Writes[P] available. However, I'm not sure that
actually works without blowing scala's mind in some way. We need to
think about this one.

If entity objects (sbt types) and DTOs (protocol._ types) do not match up, we would need to convert them manually anyway.
Case-class-to-case-class conversion is actually not that bad in my opinion.
In a case like the above, I am guessing pickling or rapture.io does what you're describing.

Now, we do not *have* to do these two parallel type trees. We could
explore solutions that keep one or the other.

Option 1, keep only protocol:
[snip] 
Option 2, keep only the native sbt types:
[snip] 
Let's call Option 3 keeping both types, as sbt-remote-control does now.

An advantage of having two separate sets of types (DTO and sbt native)
is that protocol breaks actually show up as compile errors, or at
least they often would.
A disadvantage is complexity / duplication.

After a while we would want to implement methods on entity objects, so we'd need two layers for complex stuff.
For plugins, we should promote option 1.

As useful as case classes are, I generally try to avoid them unless they're private[sbt], because of bincompat.
I've been hoping someone would write a pseudo-case-class generator that's aware of API evolution e.g. generate hashCode and def this based on 1.0 interface, but provide extra fields for apply and unapply on 1.1 interface. An ideal option 1 would be a set of pseudo-case-classes and associated type class instances based on series of json files marked with version number.
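
For illustration, here's roughly the output I'd imagine such a generator producing for a type that had only x in 1.0 and gained y in 1.1 (hand-written sketch; no such tool exists yet):

// Generated pseudo case class: grows from 1.0 (x only) to 1.1 (x, y)
// without breaking binary compatibility.
final class Foo private (val x: Int, val y: Option[Int]) {
  def this(x: Int) = this(x, None) // 1.0-era constructor, kept for bincompat
  override def equals(o: Any): Boolean = o match {
    case that: Foo => this.x == that.x && this.y == that.y
    case _         => false
  }
  override def hashCode: Int = (x, y).hashCode
  override def toString: String = s"Foo($x, $y)"
}

object Foo {
  def apply(x: Int): Foo = new Foo(x, None)              // 1.0 apply
  def apply(x: Int, y: Option[Int]): Foo = new Foo(x, y) // added in 1.1
  def unapply(foo: Foo): Option[(Int, Option[Int])] = Some((foo.x, foo.y))
}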

-eugene

Havoc Pennington

Sep 23, 2014, 5:42:22 PM
to sbt-dev
Hi,

Trying to split up some issues, not sure what we are discussing necessarily ;-)

Pickling's current JSON pickler is too picky
===

It looks like pickling's current JSON pickler does not do what we
want, but I think this is easy to address; play-json DOES do what we
want already, with a very similar API.

- if a field is missing, set optional fields to None instead of throwing
- don't put the class name in the serialization; require a class to
be provided when unpickling, and if that class expects the same fields
as the pickled class, it works

Both of those are already true with play-json. It _is_ a problem that
they aren't true in pickling yet, but I gotta believe that's simple to
fix.

Side note: we do put the class names in the sbt server protocol, but
it's done outside of the serializer. i.e.

{ name: "sbt.protocol.Foo", serialization: <output from Writes[Foo]
goes here> }

This is done because there's no static type known for an Event or
Request in that protocol. However, either party could select the
sbt.protocol.Foo2 deserializer when it sees the name sbt.protocol.Foo.
The serialization framework doesn't know anything about encoding the
type name. The type name here is just an arbitrarily-agreed-upon
discriminator value and in theory it could be an integer or something
as long as both sides agree on the values. Type name is used simply
because it's available and creating a map from types to some kind of
discriminator integer would feel pointless.

In the case of something like a task result cache, you would not write
the type name out (because it's always statically known). In that case
the serializer and the deserializer can completely disagree on the
type name with no consequence.
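
In other words, something like this hypothetical sketch, where the envelope and the registry live entirely outside the serialization framework:

object Dispatch {
  // The envelope written around a serialized payload; "name" is just an
  // agreed-upon discriminator, nothing the serializer itself knows about.
  final case class Envelope(name: String, payload: String)

  // Either side may register a newer deserializer under an old name,
  // e.g. "sbt.protocol.Foo" -> a reader that actually produces a Foo2.
  def dispatch(readers: Map[String, String => Any])(envelope: Envelope): Option[Any] =
    readers.get(envelope.name).map(read => read(envelope.payload))
}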

=> Suggested path forward here: let's agree we have to fix pickling to
work like play-json, if we use pickling.


case classes in the protocol._ (DTO) jar
===

If we use case classes for protocol._ then we have to introduce a new
case class when we want to add an optional field, rather than adding
the field to the existing case class - this is for ABI reasons rather
than serialization reasons. We could solve that by using traits (or
whatever) instead of case classes in protocol._, at the expense of
making protocol._ a little more annoying for us to implement and a
little harder for people to use.

We could quickly sketch out how we'd do it, maybe the contrast is
doing extensions from 1.0 to 1.1 this way:

final case class Foo(x: Int)
final case class Foo2(x: Int, y: Option[Int]) // @since 1.1

vs.

sealed trait Foo {
  def x: Int
  def y: Option[Int] // @since 1.1
}
object Foo { /* manually do apply and unapply and stuff? not really sure what the details are */ }

There's the obvious practical issue here that the existing
serialization macros in pickling and play-json don't support traits,
but I think we could solve that. (I'm not actually sure play-json
would break as long as we have apply and unapply, I think it reflects
those and uses them to get the fields and build instances... don't
know how pickling does it.)

The trait version might avoid having to keep a mapping between
sbt.protocol.Foo2 and sbt.protocol.Foo. With the case classes we could
end up wanting something like:

@serializedTag("Foo") // tag it as a Foo
final case class Foo2(x: Int, y: Option[Int])

(Or this could just be a Map stored somewhere, instead of an
annotation. If there's no entry in the map or no annotation, it's
assumed the tag is derived from the type name since the type has never
had fields added. or the by-convention solution could be: always dump
any digits on the end of the type name.)

With the trait version we don't have to add a new type name so we
avoid this issue, while creating a different issue (i.e. generating
the boilerplate ourselves).

Detail question about the trait version: isn't there only one unapply
allowed? So once we add a field, anyone who wants to use that field
has to use `case foo: Foo => ` instead of `case Foo(x,y) =>` - it's OK
I guess. We could add a new apply(), except then we need a way to tell
the serializer macros which apply to use.

=> Suggested path forward here: I'm open to the
hand-roll-a-non-case-class-thing trait version, but I haven't worked
out what it looks like in detail enough to know that it in fact helps
us. So I'd probably want to do that next (explore details). The
question I'd want to answer by working out the details is whether
keeping the same type name when adding fields is enough of a win vs.
whatever the extra complexity turns out to be.

> I've been hoping someone would write a pseudo-case-class generator that's
> aware of API evolution e.g. generate hashCode and def this based on 1.0
> interface, but provide extra fields for apply and unapply on 1.1 interface.
> An ideal option 1 would be a set of pseudo-case-classes and associated type
> class instances based on series of json files marked with version number.

Is this even a serialization-specific problem? maybe we want to go
ahead and start generating types like this everywhere in sbt that's
tempted to use a case class?

I think we can fork this do-we-use-case-classes issue from the rest of
the serializer API discussion - I don't think this should affect the
actual serializer API, it should only affect what our sbt.protocol
types look like and how we build serializer instances for them.


Is there a schema description language / schema files?
===

If the goal is to have both a file with case classes and a file with a
schema in some schema description language, then you can generate
either from the other, conceptually. The one thing that's clearly bad
is to write them BOTH by hand, because then they can be out of sync.

=> Suggested path forward: figure out what code we want to generate,
then figure out some simple way to do so. For plain case classes I
think we can use them as their own schema, until such time as we care
about something other than Scala, and then we could switch to a
separate schema language. For the non-case-class approach, I don't
know yet since I don't know what that approach looks like in detail.
Maybe we could have a file with just the abstract traits, and codegen
the concrete implementations and the companion objects from the
traits?
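
For example (hypothetical sketch; no such codegen exists today), the hand-written part and the generated part might look like:

// Hand-written "schema": just the abstract trait.
sealed trait Foo {
  def x: Int
  def y: Option[Int] // @since 1.1
}

// Generated: a concrete implementation plus the companion boilerplate
// (in real generated code the impl class could be hidden).
final class FooImpl(val x: Int, val y: Option[Int]) extends Foo

object Foo {
  def apply(x: Int): Foo = new FooImpl(x, None)
  def apply(x: Int, y: Option[Int]): Foo = new FooImpl(x, y)
  def unapply(foo: Foo): Option[(Int, Option[Int])] = Some((foo.x, foo.y))
}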


Whether to have two layers or one
===

Do we want "protocol" types and "sbt native" types separate, or only
one or the other?

> After a while we would want to implement methods on entity objects, so we'd
> need two layers for complex stuff.
> For plugins, we should promote option 1.

Good insight that if we have option 3 (two layers) we will always have
some values that are still single-layer; if nothing else, Int and
String won't have the two layers. ;-) So we can always encourage
people to KISS if they can and use plain old data, even while
providing a way to handle the two-layer case.

=> Suggested path forward: What I'm leaning toward saying here is that
we have a jar with some plain old data types shared between client and
server; sometimes we use these directly; sometimes we have to convert
sbt's native types to them. So we have two layers but only actually
duplicate the layers when we need to.


Havoc

eugene yokota

Sep 23, 2014, 6:39:39 PM
to sbt...@googlegroups.com
On Tue, Sep 23, 2014 at 5:42 PM, Havoc Pennington <havoc.pe...@typesafe.com> wrote:
Trying to split up some issues, not sure what we are discussing necessarily ;-)

Good idea. We can wrap up some topics.

Pickling's current JSON pickler  is too picky
===

It looks like pickling's current JSON pickler does not do what we
want, but I think this is easy to address; play-json DOES do what we
want already, with a very similar API.

Why do we like pickling so much? The chief motivation is that it uses macros to auto-serialize simple case classes.
My argument is that using a serializer for long-term caching and a RESTful API is not a good fit, because JSON needs to be the king, not the case classes. Especially if we are thinking about requiring plugin authors to adopt the protocol, we need to make JSON the king.

case classes in the protocol._ (DTO) jar
===

If we use case classes for protocol._ then we have to introduce a new
case class when we want to add an optional field, rather than adding
the field to the existing case class - this is for ABI reasons rather
than serialization reasons. We could solve that by using traits (or
whatever) instead of case classes in protocol._, at the expense of
making protocol._ a little more annoying for us to implement and a
little harder for people to use.

We could quickly sketch out how we'd do it, maybe the contrast is
doing extensions from 1.0 to 1.1 this way:

 final case class Foo(x: Int)
 final case class Foo2(x: Int, y: Option[Int]) // @since 1.1

vs.

  sealed trait Foo {
      def x: Int
      def y: Option[Int] // @since 1.1
  }
  object Foo { /* manually do apply and unapply and stuff? not really
sure what the details are */ }

What I've been doing in sbt code is using a final class whose constructor is marked private[sbt].

final class EvictionWarningOptions private[sbt] (
    val configurations: Seq[Configuration],
    val warnScalaVersionEviction: Boolean,
    val warnDirectEvictions: Boolean,
    val warnTransitiveEvictions: Boolean,
    val showCallers: Boolean,
    val guessCompatible: Function1[(ModuleID, Option[ModuleID], Option[IvyScala]), Boolean]) {
  private[sbt] def configStrings = configurations map { _.name }

  def withConfigurations(configurations: Seq[Configuration]): EvictionWarningOptions =
    copy(configurations = configurations)
  ....

  private[sbt] def copy(configurations: Seq[Configuration] = configurations,
    warnScalaVersionEviction: Boolean = warnScalaVersionEviction,
    warnDirectEvictions: Boolean = warnDirectEvictions,
    warnTransitiveEvictions: Boolean = warnTransitiveEvictions,
    showCallers: Boolean = showCallers,
    guessCompatible: Function1[(ModuleID, Option[ModuleID], Option[IvyScala]), Boolean] = guessCompatible): EvictionWarningOptions =
    new EvictionWarningOptions(configurations = configurations,
      warnScalaVersionEviction = warnScalaVersionEviction,
      warnDirectEvictions = warnDirectEvictions,
      warnTransitiveEvictions = warnTransitiveEvictions,
      showCallers = showCallers,
      guessCompatible = guessCompatible)
}

The boilerplate is the copy method. Here's the apply and unapply I'd want:

package hello

final class Foo private[hello] (val x: String, val y: Option[Int]) {
  def copy(x: String, y: Option[Int]): Foo =
    new Foo(x = x, y = y)
}

object Foo {
  def apply(x: String): Foo = new Foo(x, None)
  def apply(x: String, y: Option[Int]) = new Foo(x, y)

  def unapply(foo: Foo): Option[(String, Option[Int])] =
    Some((foo.x, foo.y))
}

object Hello extends App {
  import scala.pickling._
  import json._
  val pkl = Foo("hello").pickle
  println(pkl.value)
}

There's the obvious practical issue here that the existing
serialization macros in pickling and play-json don't support traits,
but I think we could solve that.

The above code worked fine with pickling.
 
Detail question about the trait version: isn't there only one unapply
allowed? So once we add a field, anyone who wants to use that field
has to use `case foo: Foo => ` instead of `case Foo(x,y) =>` - it's OK
I guess.

Yes.

=> Suggested path forward here: I'm open to the
hand-roll-a-non-case-class-thing trait version, but I haven't worked
out what it looks like in detail enough to know that it in fact helps
us. So I'd probably want to do that next (explore details). The
question I'd want to answer by working out the details is whether
keeping the same type name when adding fields is enough of a win vs.
whatever the extra complexity turns out to be.

To me, the ability to grow API while remaining bincompat is really important.
We need to keep the same name so we're not constantly mapping between DTO <-> hand-rolled entity classes.
 
> I've been hoping someone would write a pseudo-case-class generator that's
> aware of API evolution e.g. generate hashCode and def this based on 1.0
> interface, but provide extra fields for apply and unapply on 1.1 interface.
> An ideal option 1 would be a set of pseudo-case-classes and associated type
> class instances based on series of json files marked with version number.

Is this even a serialization-specific problem? maybe we want to go
ahead and start generating types like this everywhere in sbt that's
tempted to use a case class?

There's a general wish/need for pseudo case classes, at least in my head.
But if we are making JSON the king, and generating DTOs anyway, it makes sense to piggyback the growable-ABI agenda.
I'm totally for starting on a pseudo-case-class generator.

Is there a schema description language / schema files?
===

If the goal is to have both a file with case classes and a file with a
schema in some schema description language, then you can generate
either from the other, conceptually. The one thing that's clearly bad
is to write them BOTH by hand, because then they can be out of sync.

Sort of. There's the schema, the DTO classes, and the marshaling/unmarshaling code.
If we decide JSON is the king, then we start out by generating/inferring a JSON schema, and from there we generate both the DTOs and the marshaling typeclass instances. For that we would need to write such tools ourselves.
In the interim, I'm ok with manually writing both the JSON parsing code and the pseudo case classes, because at least I would have control over exactly what JSON it's going to emit.
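
As an illustration of "control over exactly what JSON it's going to emit", hand-rolled play-json code for the pseudo case class Foo(x: String, y: Option[Int]) above might look like this (just a sketch; the actual json library is still an open question):

import play.api.libs.json._

// Assumes the pseudo case class Foo(x: String, y: Option[Int]) from above.
object FooJson {
  implicit val fooWrites: Writes[Foo] = Writes { foo =>
    // Emit exactly the shape we want: omit "y" entirely when it's None.
    JsObject(Seq("x" -> JsString(foo.x)) ++ foo.y.map(y => "y" -> JsNumber(y)))
  }

  implicit val fooReads: Reads[Foo] = Reads { json =>
    (json \ "x").validate[String].map { x =>
      Foo(x, (json \ "y").asOpt[Int])
    }
  }
}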

Whether to have two layers or one
===

=> Suggested path forward: What I'm leaning toward saying here is that
we have a jar with some plain old data types shared between client and
server; sometimes we use these directly; sometimes we have to convert
sbt's native types to them. So we have two layers but only actually
duplicate the layers when we need to

Agreed. Once we have DTO and stable protocol these could be shared on both client and server side.
In some cases we could even enrich-my-library the DTO to append methods without copying data.
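E.g., a tiny sketch of that enrichment (the DTO and method names here are made up):

object protocolEnrichments {
  // Hypothetical DTO shared between client and server.
  final case class CompileResult(warnings: Int, errors: Int)

  // Server-side enrichment adds behavior without copying the data.
  implicit class RichCompileResult(val result: CompileResult) extends AnyVal {
    def succeeded: Boolean = result.errors == 0
  }
}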
 
-eugene

Josh Suereth

Sep 24, 2014, 8:39:47 AM
to sbt...@googlegroups.com
---==== Parallel data hierarchy issues ===---

I'm of the opinion that we may be confusing symptoms of separate-from-main-project for a true disease here.   *IF* we could break binary compatibility in sbt 0.13, we could have the same API for things returned in tasks, but concrete implementations that we can share with the client.   I.e. we could define "raw POD types" for the API, which we control in a protocol.jar that is then depended on by the incremental compiler and other libraries, so that all types returned by tasks are concrete types.

We could even, inside the compile task, do the conversion from the incremental compiler type to a POD type, meaning that the sbt "TaskKey[T]" types would all be PODs that are free to serialize in/out of sbt, preventing the need for parallel APIs to begin with.

The only thing that's rough in this scenario is if we try to serialize exceptions on task failure, but I think that's also something we could remedy with a "ProtocolFriendlyException" trait which defines how to get serializers for exceptions.
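
Something along these lines, maybe (purely a sketch; the trait and the exception here are made up):

// Exceptions that opt in to crossing the wire describe their own
// plain-old-data payload; everything else falls back to a generic message.
trait ProtocolFriendlyException { self: Throwable =>
  def protocolPayload: Map[String, String]
}

class CompileFailedException(message: String, val file: String)
    extends RuntimeException(message) with ProtocolFriendlyException {
  def protocolPayload: Map[String, String] =
    Map("message" -> message, "file" -> file)
}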


In any case, I think the current issue is a symptom of trying to retain binary compatibility and can be tackled with implementation restrictions in sbt 1.0.


Josh Suereth

Sep 24, 2014, 9:07:50 AM
to sbt...@googlegroups.com
On Tue, Sep 23, 2014 at 6:39 PM, eugene yokota <eed3...@gmail.com> wrote:

On Tue, Sep 23, 2014 at 5:42 PM, Havoc Pennington <havoc.pe...@typesafe.com> wrote:
Trying to split up some issues, not sure what we are discussing necessarily ;-)

Good idea. We can wrap up some topics.

Pickling's current JSON pickler  is too picky
===

It looks like pickling's current JSON pickler does not do what we
want, but I think this is easy to address; play-json DOES do what we
want already, with a very similar API.

Why do we like pickling so much? The chief motivation is that it uses macros to auto-serialize simple case classes.
My argument is that using a serializer for long-term caching and a RESTful API is not a good fit, because JSON needs to be the king, not the case classes. Especially if we are thinking about requiring plugin authors to adopt the protocol, we need to make JSON the king.

In sbt, types are the king; JSON is merely a convenient serialization/end-of-the-world concept.  The goals/requirements we should have in serialization:

  • End user serialization is dead simple (i.e. users CAN make custom tasks with custom return types, and we can enforce serializers on all of these)
  • Minor protocol adjustments should be convenient.  I should be able to make "intuitive" compatible changes to the case class to evolve an API and not break clients.   This means, if I'm using v1.0 of a plugin in a client, and the build has v1.0.1, my client should *semantically* work, possibly with not as much functionality.  
  • NON-REQUIREMENT:  Major protocol adjustments (these should literally be new types).  Upon failure to load, a cache should be INVALIDATED.  Similarly, when sending messages and the message doesn't meet an expected format, you should error with incompatible.   Instead, we need to work in some notion of "protocol level" where we can move up/down a protocol version inside the server so we adapt to the protocol a client supports.   These should LITERALLY be different types (possibly the same names, but different packages) with adaptors between them.  
  • A protocol escape hatch: If we are unable to deserialize a message into a type, we can give it to the client in raw JSON as an escape hatch, but ideally this doesn't happen often.
  • Java Binary compatibility should be maintained on the *Scala API* for the underlying protocol types/classes/traits/objects.
  • The ability to efficiently serialize over time (cache to disk) a value *OR* serialize over location (send to client) a value.  (Hence JSON vs. binary representation as a goal).
Cool. I do think an abstract trait or non-instantiable class for bincompat is the way we'll have to take the APIs, so we can adapt them on demand.
 
 
Detail question about the trait version: isn't there only one unapply
allowed? So once we add a field, anyone who wants to use that field
has to use `case foo: Foo => ` instead of `case Foo(x,y) =>` - it's OK
I guess.

Yes.

=> Suggested path forward here: I'm open to the
hand-roll-a-non-case-class-thing trait version, but I haven't worked
out what it looks like in detail enough to know that it in fact helps
us. So I'd probably want to do that next (explore details). The
question I'd want to answer by working out the details is whether
keeping the same type name when adding fields is enough of a win vs.
whatever the extra complexity turns out to be.

To me, the ability to grow API while remaining bincompat is really important.
We need to keep the same name so we're not constantly mapping between DTO <-> hand-rolled entity classes.
 

Ideally the fact that there's a DTO can mostly be hidden (or automated), and we only care that they exist when we need to.
 
> I've been hoping someone would write a pseudo-case-class generator that's
> aware of API evolution e.g. generate hashCode and def this based on 1.0
> interface, but provide extra fields for apply and unapply on 1.1 interface.
> An ideal option 1 would be a set of pseudo-case-classes and associated type
> class instances based on series of json files marked with version number.

Is this even a serialization-specific problem? maybe we want to go
ahead and start generating types like this everywhere in sbt that's
tempted to use a case class?

There's a general wish/need for pseudo case classes, at least in my head.
But if we are making JSON the king, and generating DTOs anyway, it makes sense to piggyback the growable-ABI agenda.
I'm totally for starting on a pseudo-case-class generator.


I agree on a growable ABI.  Sounds like we may need our own API which resembles the best of the others? Is this what you're suggesting?

 
Is there a schema description language / schema files?
===

If the goal is to have both a file with case classes and a file with a
schema in some schema description language, then you can generate
either from the other, conceptually. The one thing that's clearly bad
is to write them BOTH by hand, because then they can be out of sync.

Sort of. There's the schema, the DTO classes, and the marshaling/unmarshaling code.
If we decide JSON is the king, then we start out by generating/inferring a JSON schema, and from there we generate both the DTOs and the marshaling typeclass instances. For that we would need to write such tools ourselves.
In the interim, I'm ok with manually writing both the JSON parsing code and the pseudo case classes, because at least I would have control over exactly what JSON it's going to emit.


In the interim this is ok, but I don't think the end user experience will be good enough.

 
Whether to have two layers or one
===

=> Suggested path forward: What I'm leaning toward saying here is that
we have a jar with some plain old data types shared between client and
server; sometimes we use these directly; sometimes we have to convert
sbt's native types to them. So we have two layers but only actually
duplicate the layers when we need to

Agreed. Once we have DTO and stable protocol these could be shared on both client and server side.
In some cases we could even enrich-my-library the DTO to append methods without copying data.
 

+1. I think consolidating on the DTO and appending functions is the right way to go here if needed.  Additionally, when using APIs, forcing task implementors to transfer data out of the raw API into the DTO and only exposing DTOs in *Key[T] types feels like the right way to go.


Havoc Pennington

Sep 24, 2014, 9:10:36 AM
to sbt-dev
On Wed, Sep 24, 2014 at 8:39 AM, Josh Suereth <joshua....@gmail.com> wrote:
> ---==== Parallel data hierarchy issues ===---
>

Right, this is what I called "option 1" earlier I think?

Option 1, keep only protocol:
* the rule (starting in sbt 1.0) would be that all task results and
data sent in events must be plain old data, no associated behavior
* we put all these plain data types (I think what you are calling a
DTO) in their own jar, shared by both sbt and sbt clients
* sbt tasks use the DTO types directly (we drop the idea of
`registeredProtocolConversions`)
* the problem of course is how to pass around file handles, loggers,
and stuff like that (non-data); the answer may be that these have to
be services or something and not task results


Havoc

Havoc Pennington

Sep 24, 2014, 10:05:56 AM
to sbt-dev
Hi,

Re: pickling, assuming we strip it down to only our needed API, some
known differences from play-json are:

* no AST; uses "streaming" instead (see
https://github.com/scala/pickling/blob/0.9.x/core/src/main/scala/pickling/PBuilderReader.scala
)
* abstracts binary vs. json
* may already be able to pickle non-case-classes (Eugene, you just said it could do the private-constructor class, right?)
* json pickler has two problems already noted (embeds typename, does
not allow Option fields to be missing)

On Wed, Sep 24, 2014 at 9:07 AM, Josh Suereth <joshua....@gmail.com> wrote:
> End user serialization is dead simple (i.e. users CAN make custom tasks with
> custom return types, and we can enforce serializers on all of these)

We do need to answer the "what about streams" question here; some
things are not reasonably serializable. I think there are some answers
available such as use "services" for these, or serialize some kind of
dummy object.

> Instead, we need to work in some notion of
> "protocol level" where we can move up/down a protocol version inside the
> server so we adapt to the protocol a client supports.

My intuition about "major versions" in protocols is that once there's
an acceptable extension mechanism and the protocol is widely used, the
major version freezes eternally.
That is, it is so much easier to just introduce a new kind of message
or new field, rather than changing the whole protocol, that people are
always gonna pick that if it's remotely possible to do so.

My suggestion here would be to add a major protocol version in the
handshake here:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/com/typesafe/sbtrc/ipc/IPC.scala#L46

But my prediction is we will never increment that thing. It's just a
big red lever if we ever absolutely need it.

We also have another lever, which is changing how we locate the
server. (Newer clients could be set up to try method A and then B,
older use only B)

> A protocol escape hatch: If we are unable to deserialize a message into a
> type, we can give it to the client in raw JSON as an escape hatch, but
> ideally this doesn't happen often.

Minor detail: with pickling it wouldn't be raw JSON but rather some kind of opaque blob.

Havoc

Heather Miller

Sep 24, 2014, 10:12:53 AM
to sbt...@googlegroups.com

Hey guys,


Let me respond to two specific points in this discussion, and a more general point about the architecture of pickling.


scala pickling does not have a frozen ABI yet (right?) so if we pulled
it into sbt we'd be stuck on a prerelease of it for the duration of
0.13.x, which would cause trouble if anyone used the final release
inside the sbt classloader.

We can work with you here. The ABI that we currently have is for the most part frozen – we've been working with it while integrating pickling into Spark and never needed to change the API, and thus the corresponding ABI *could have been frozen*. In just now revisiting the PickleFormat API, the only thing we'd like to change is to remove any remaining hard-coded references to macros (there are two, and they seem to have snuck in there by accident). We'll open a PR shortly with this fixed. Based on this streamlined API, a frozen ABI will follow.


In the above, you also express concern over depending on a prerelease version. Databricks requires a final release for their integration with Spark by November 3rd. 0.8.0 is the current version of pickling. The changeset that we're sitting on that's targeted to go into 0.9.0 is enormous, and we're in a state where we can and should cut a 0.9.0 release ASAP (e.g., many of our users have moved from 0.8.0 to 0.9.0 SNAPSHOTs again). Thus, what we can do is this: we can cut a 0.9.0 release this week, and any of the larger/breaking changes that sbt needs, including pickler contravariance, we can put into 0.10.0. We could release 0.10.0 as soon as it covers all your immediate needs (like the two issues related to lightweight protocol evolution that Havoc pointed out), within the next few weeks. FYI, we'll also have another release, possibly 0.11.0, around November 3rd for Spark/Databricks.


- if a field is missing, set optional fields to None instead of throwing
- don't put the class name in the serialization; require a class to
be provided when unpickling, and if that class expects the same fields
as the pickled class, it works

Issue (a) should indeed be a simple fix in our JSON pickle format (note: in pickling, "format" denotes JSON/binary/XML/.., unlike play-json). In fact, we could enable this generally (meaning without creating a separate "flexible" JSON pickle format), since code relying on all fields to exist continues to work without changes in behavior; we'd simply allow more JSON to be unpickled successfully. (JSONPickleFormat should be configured to throw an exception when encountering a missing field.)


Issue (b) should not be too difficult to support either. When unpickling, we already require specifying the target type, for example. Here, the questions are more about what's the most desirable solution, and how to support certain binary encoding schemes that elide information such as field names. The required functionality could be something that *some* pickle formats support, but that *not all* pickle formats are required to support. For the regular JSON pickle format, this should be simple. For binary, one could easily create a new pickle format that pickles all field names, and which can have the exact same unpickling logic as the JSON pickle format.


My argument is that using serializer for long-term caching and restful API is not a good fit, because json needs to be the king, not case classes. Especially if we are thinking about requiring plugin authors to adopt the protocol, we need to make json the king.

A point of pickling that's been missing a bit from this discussion is that *you* have control of the format (i.e., the JSON/binary that gets put on the wire). And you can also control how things are serialized. Thus, if you need JSON reading/writing to be made more powerful/flexible, you can fork/provide your own pickle format that is more focused on schema evolution, say (though, below we explain how we can/will address the specific issues Havoc summarized also for the generic formats that we ship).


The great thing about all of this is that you are in complete control of the bits and pieces of serialization that are important to you – look at Scala/pickling as a tool that does two things that *are decoupled* (the whole planet sees these things as being baked together, thanks to Java serialization). There are two parts of serializing any object: (1) there's all the important/complicated analysis similar to what the JVM's serialization does to figure out the "shape" of user-defined types, including all of the tricky ways that objects can be constructed. (2) There's the act of actually packing all of that data into a byte array or other data representation. In Java, Kryo, and every other serialization framework that comes to mind, (1) and (2) are baked together. In Scala/pickling, we separate these two things, and for (2) we give you an API to control how pickles are assembled. That means that you get to piggyback on all of that static analysis that took us now more than a year to get right.


So the TLDR is this: rather than needing to fork the entire framework, why not just do as intended and fork the JSON format? You can get almost everything that you require in doing so, and anything that needs additional support from the framework, we can add for you.


Also of note: we're happy to give you guys the keys to the repo, so you can make changes as you require them – we don't want to be in the position of gatekeepers frustrating you in any way. In general, Philipp and I would be happy to help implement features you require or help with bugfixes as they arise. We have the same agreement with Databricks/Spark.


Reasons why it's good to reuse pickling rather than forking it – two, in order of importance. (1) It's better for the ecosystem. Compared to other communities, important Scala open source projects have set a pattern of not really working well with each other, instead shipping forked/patched versions of each other's libs. Giving pickling a shot and depending on it as usual would give us a chance to try a new way of managing open source projects – openness of project direction (let it develop based on the community's needs) by giving important users like sbt and Spark commit rights. This is inspired by how Spark has developed and is developing (to the best of my knowledge it's the largest Scala project in terms of contributors, with more than 90 per month: https://www.openhub.net/p/apache-spark). (2) It's good for pickling. Your needs are shared by others, so a lot of people would benefit from us supporting sbt better. Further, in many cases, adding support for some of the stuff you guys require isn't even difficult for us.


Will post again when we make a pickling PR which addresses the ABI issue above.


Cheers,

Heather



Heather Miller

Sep 24, 2014, 10:17:11 AM
to sbt...@googlegroups.com


On Wednesday, September 24, 2014 4:05:56 PM UTC+2, Havoc Pennington wrote:
Hi,

Re: pickling, assuming we strip it down to only our needed API, some
known differences from play-json are:

 * no AST; uses "streaming" instead (see
https://github.com/scala/pickling/blob/0.9.x/core/src/main/scala/pickling/PBuilderReader.scala
)
 * abstracts binary vs. json
 * may already be able to pickle non-case-classes (Eugene you just
said it could do the private-constructor class right)

I'm not sure where Eugene got the impression that "the chief motivation (of pickling) is that it uses macros to auto serialize simple case classes". That's quite far from the truth (maybe you're confusing pickling with Salat?) 

Pickling is a general purpose serialization framework, and as such, it can handle almost any situation that Java can – e.g. all sorts of crazy situations with private constructors, or even things like @transient constructor arguments. FYI, we have it running large-scale distributed machine learning as the primary serialization mechanism within Spark (0.9.x branch) – and believe me, Spark serializes some crazy complicated objects.

Heather Miller

Sep 24, 2014, 10:19:03 AM
to sbt...@googlegroups.com


On Wednesday, September 24, 2014 4:05:56 PM UTC+2, Havoc Pennington wrote:
minor detail, with pickling it wouldn't be raw JSON but rather some
kind of opaque blob.

Havoc

Not sure what makes you think so? How is this the case? (I could be confused/missing context)

eugene yokota

Sep 24, 2014, 10:44:46 AM
to sbt...@googlegroups.com
On Tue, Sep 23, 2014 at 12:54 PM, Havoc Pennington <h...@typesafe.com> wrote:
Another minor issue is that we do use the JsValue AST in a couple of
places to basically tunnel protocol through the protocol:

This reminds me of Microsoft's jump from WCF (an all-encompassing communication layer meant to unify support for SOAP, REST, binary on a queue, etc.) to Web API (hard-coded to REST).
Life became so much easier when we dropped the pretension and just assumed it's REST with JSON by default, with occasional binary when needed to provide an API for files.

In that sense rapture I/O[1] is attractive to me:
- it can turn JSON into case classes with macros
- you can choose the backend JSON engine

The serialization vs. data binding problem exists here too, but by hard-coding to JSON, we can provide all sorts of richness in JSON handling, like accessing a JSON field via Dynamic:

scala> import rapture.core._, rapture.json._
import rapture.core._
import rapture.json._

scala> import jsonParsers.scalaJson._
import jsonParsers.scalaJson._

scala> import strategy.throwExceptions
import strategy.throwExceptions

scala> val json = json"""{
     |   "fruits": {
     |     "apples": [{ "name": "Cox" }, { "name": "Braeburn" }],
     |     "pears": [{ "name": "Conference" }]
     |   }
     | }"""
json: rapture.json.Json = 
{
 "fruits": {
  "apples": [
   {
    "name": "Cox"
   },
   {
    "name": "Braeburn"
   }
  ],
  "pears": [
   {
    "name": "Conference"
   }
  ]
 }
}

scala> json.fruits.apples
res0: rapture.json.Json = 
[
 {
  "name": "Cox"
 },
 {
  "name": "Braeburn"
 }
]

scala> case class Apple(name: String)
defined class Apple

scala> json.fruits.apples.as[List[Apple]]
res1: List[Apple] = List(Apple(Cox), Apple(Braeburn))

-eugene


Havoc Pennington

Sep 24, 2014, 10:59:11 AM
to sbt-dev
On Wed, Sep 24, 2014 at 10:19 AM, Heather Miller
<heather....@gmail.com> wrote:
>> minor detail, with pickling it wouldn't be raw JSON but rather some
>> kind of opaque blob.
>
>
> Not sure what makes you think so? How is this the case? (I could be
> confused/missing context)

I'll try to explain better. What I'm talking about here is this
JsValue in TaskEvent, for example:

https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/Protocol.scala#L188

What's happening there is that an sbt plugin can send an event that is
not statically known to this sbt.protocol package. That is, we want to
tunnel com.someplugin.WhateverEvent through the client-server protocol
(and then presumably the client will know how to reconstitute
WhateverEvent, if it cares about WhateverEvent, and will just ignore
it otherwise).

The reason I say it wouldn't be raw JSON in pickling is that I'm
assuming (hoping, really) that a JSON AST is "out of scope" for the
main pickling API. My expectation would be that the JsValue in there
would instead be "opaque thing which can be unpickled" - but I don't
really know what that thing would be yet, just hand-waving that there
must be some kinda thing it could be. My idea would be that if we are
getting our TaskEvent from the JSON pickling backend then it would
contain JSON inside the opaque thing, and if we are getting our
TaskEvent from some binary backend then it would contain some other
format inside the opaque thing.

So:

final case class TaskEvent(taskId: Long, name: String, serialized:
MyInventedOpaqueUnpickleable) extends Event

One thing that seems sort of hard to me about this, is that on the
wire ideally we have:

{ taskId: 42, name: "whatever", serialized: { foo: 1, bar: "hello" } }

rather than the seems-easier-to-implement:

{ taskId: 42, name: "whatever", serialized: "{ foo: 1, bar: \"hello\" }" }

or even worse but maybe even easier hex-encoded binary blob:

{ taskId: 42, name: "whatever", serialized: "32434deadbeef" }
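
(As a rough play-json sketch of the difference, with WhateverEvent standing in for the hypothetical plugin event above:)

    import play.api.libs.json._

    // hypothetical plugin event to be tunneled through TaskEvent
    case class WhateverEvent(foo: Int, bar: String)
    implicit val whateverFormat: Format[WhateverEvent] = Json.format[WhateverEvent]

    val payload = Json.toJson(WhateverEvent(1, "hello"))

    // nested object: what we'd like on the wire
    val nested = Json.obj("taskId" -> 42, "name" -> "whatever", "serialized" -> payload)

    // string-embedded JSON: easier to implement, uglier on the wire
    val embedded = Json.obj(
      "taskId" -> 42, "name" -> "whatever", "serialized" -> Json.stringify(payload))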

Havoc

Havoc Pennington

Sep 24, 2014, 11:47:03 AM
to sbt-dev
Hi,

Thanks for helping us out Heather!

Let me be sure it's clear what (speaking only for myself) the ABI
concern is. The issue in my mind is not that the ABI might break
(because sbt could just not upgrade, and if necessary it could
maintain a stable branch with whatever ABI we were using in sbt). The
issue is that if sbt did that (was stuck on an old ABI of pickling) we
would prevent people from using the new ABI inside of sbt plugins, in
the same way that sbt locks people to old scala. And then that might
transitively keep people from using other libs that depend on
pickling, etc.

This is not really a catastrophe, I guess. But the reason for
considering renamespacing pickling was to allow people to use pickling
1.0 if they want, when it comes out.

If pickling is feeling ready to ABI freeze, then we can just not worry
about this. So that is good news.

inline stuff -

On Wed, Sep 24, 2014 at 10:12 AM, Heather Miller
<heather....@gmail.com> wrote:
> In the above, you also express concern over depending on a prerelease
> version. Databricks requires a final release for their integration with
> Spark by November 3rd. 0.8.0 is the current version of pickling. The
> changeset that we're sitting on that's targeted to go into 0.9.0 is
> enormous, and we're in a state where we can and should cut a 0.9.0 release
> ASAP (e.g., many of our users have moved from 0.8.0 to 0.9.0 SNAPSHOTs
> again). Thus, what we can do is this: we can cut a 0.9.0 release this week,
> and any of the larger/breaking changes that sbt needs, including pickler
> contravariance, we can put into 0.10.0. We could release 0.10.0 as soon as
> it covers all your immediate needs (like the two issues related to
> lightweight protocol evolution that Havoc pointed out), within the next few
> weeks. FYI, we'll also have another release, possibly 0.11.0, around
> November 3rd for Spark/Databricks.

From my perspective it would be really great to coordinate in this way
if you are up for it.

What I think we'd have to do is try porting sbt-remote-control to some
kind of pickling snapshot that we think will work, and then we could
give you the "all clear" that we have what we need.

So maybe the plan would be 1) let us know when the known needs are
sorted out and there's a snapshot we should try porting to 2) we try
to port and see what happens, 3) iterate until it works, then 0.10.0 ?

> Issue (b) should not be too difficult to support either. When unpickling, we
> already require specifying the target type, for example. Here, the questions
> are more about what's the most desirable solution, and how to support
> certain binary encoding schemes that elide information such as field names.
> The required functionality could be something that *some* pickle formats
> support, but that *not all* pickle formats are required to support. For the
> regular JSON pickle format, this should be simple. For binary, one could
> easily create a new pickle format that pickles all field names, and which
> can have the exact same unpickling logic as the JSON pickle format.

Agreed, this would be format-specific ... in fact for its binary cache
sbt may well prefer to have exact type matching instead of allowing
missing fields, because the priority is on speed and we can always
drop the cache if deserialization fails. The ability to have missing
fields is important in the client-server socket protocol use case but
not in the caching use case. Right now the client-server protocol is
JSON (though I suppose we could make it something binary for
performance reasons, if we felt like it).

> The great thing about all of this is that you are in complete control of the
> bits and pieces of serialization that are important to you – look at
> Scala/pickling as a tool that does two things that *are decoupled* (the
> whole planet sees these things as being baked together, thanks to Java
> serialization). There are two parts of serializing any object: (1) there's
> all the important/complicated analysis similar to what the JVM's
> serialization does to figure out the "shape" of user-defined types,
> including all of the tricky ways that objects can be constructed. (2)
> There's the actual act of actually packing all of that data into a byte
> array or other data representation. In Java, Kryo, and every other
> serialization framework that comes to mind, (1) and (2) are baked together.
> In Scala/pickling, we separate these two things, and for (2) we give you an
> API to control how pickles are assembled. That means that you get to
> piggyback on all of that static analysis that took us now more than a year
> to get right.

I didn't realize in the earlier discussion that pickling was so
sophisticated at analyzing user-defined types and in fact could
already interpret the private-constructor classes approach that sbt is
using instead of case classes. Eugene to his credit actually tried it
and found it worked.

My preference (based on marginally-related past experiences such as
CORBA, dbus, gobject-introspection, etc.) is to derive schemas from
code rather than code from schemas, so if pickling can derive schemas
from the kinds of types sbt was going to use anyway, I like that.

We still have an ugly wart that sbt is typing a lot of boilerplate
instead of using case classes because case classes have some
limitations (extra bytecode size, lack of ability to evolve ABI). But
I think we could separate that from the serialization problem and
ultimately try to get it solved in Scala itself...
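
(For reference, the hand-rolled pattern looks roughly like this; the type below is hypothetical, sketch only: a final class with a private constructor and withX methods, so fields can be added later without breaking binary compatibility.)

    final class CompileReport private (val project: String, val warnings: Int) {
      def withWarnings(warnings: Int): CompileReport = new CompileReport(project, warnings)
      override def equals(o: Any): Boolean = o match {
        case that: CompileReport => project == that.project && warnings == that.warnings
        case _                   => false
      }
      override def hashCode: Int = (project, warnings).hashCode
      override def toString: String = s"CompileReport($project, $warnings)"
    }
    object CompileReport {
      def apply(project: String, warnings: Int): CompileReport =
        new CompileReport(project, warnings)
    }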

> Reasons why it's good to reuse pickling rather than forking it.

Agreed completely with these points, hope my clarification above
helped motivate cut-and-pasting. I don't think the intent here would
ever be a permanent fork but more "keep sbt from messing up the
classpath with weird outdated versions of stuff."

Havoc

Havoc Pennington

Sep 24, 2014, 12:13:59 PM
to sbt-dev
On Tue, Sep 23, 2014 at 4:47 PM, eugene yokota <eed3...@gmail.com> wrote:
> The life became so much easier when we dropped the pretension and just
> assumed it's REST with Json by default with occasional binary when needed to
> provide API for files.

I do sympathize here and I think I was arguing your same point to Mark
H in some historical thread ;-)

What I think we should maybe do for sbt is insist that everything is
JSON-representable. What I mean by that is like in Typesafe Config,
where there's the elaborate HOCON format but once that format is
parsed, we end up with something that can be represented in JSON.
Similarly, we may have a binary format but it is required to boil down
to objects with string keys, lists, boolean, string, number, null. We
don't add 64-bit integers or byte arrays or datetime or whatever to
the conceptual serialization; all those things must somehow get
converted to JSON types.

If we do that then in theory we can always pickle with a JSON AST as
target, then mess with the AST.
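
(Concretely, "JSON-representable" means everything must fit a value model roughly like the following sketch; the names are made up for illustration, not a proposed sbt API.)

    sealed trait JsonRepr
    case object JsonNull extends JsonRepr
    final case class JsonBool(value: Boolean) extends JsonRepr
    final case class JsonNumber(value: BigDecimal) extends JsonRepr
    final case class JsonString(value: String) extends JsonRepr
    final case class JsonArray(elements: List[JsonRepr]) extends JsonRepr
    final case class JsonObject(fields: Map[String, JsonRepr]) extends JsonRepr

    // anything outside this set (byte arrays, datetimes, ...) has to be encoded
    // into it somehow, e.g. bytes as a hex string:
    def bytesToJson(bytes: Array[Byte]): JsonRepr =
      JsonString(bytes.map(b => f"$b%02x").mkString)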

> serialization vs data binding problem exists here too, but by hard-coding to
> json, we can provide all sorts of richness in json handling.
> Like accessing a json field via dynamic:

This stuff is what I'd consider the "AST" API which I was hoping
wouldn't be exported from sbt because it is a significant API
footprint beyond purely the ability to (de)serialize. But that does
raise the question, are plugins going to have to hand-code the
serializer sometimes? Are we? If we do, what does it look like in
pickling?

Let me go through where sbt.protocol is hand-coding now to try to
figure out why we hand-code rather than using Json.reads/Json.writes
macros.

I think in some of these cases we may not need to hand-code with
pickling, and in others we may want to take a look at how we'd
hand-code if we needed to.

* java.io.File:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L24
No meaningful way to serialize File by looking at fields in the class,
I guess. play-json only does case classes anyway.

* java.lang.Throwable:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L35
Similarly, the class just isn't auto-analyzable. play-json only does
case classes.

* Attributed: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L61
I have to say WTF here, I don't understand why this serialization has
`"attributed" : true` in it. I think we should be able to autogen this
one other than that.

* xsbti.Severity:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L75
It's a Java enum I believe. Should be possible to autogen in
principle, but play-json probably can't.

* xsbti.Position:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L113
It's a Java interface that returns a bunch of nulls, I think is the issue here.

* ExecutionAnalysis:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L144
A trait with several subtypes where we have to serialize a discriminator

* LogEntry: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L205
Again, trait with discriminator

* TestOutcome: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L229
Trait with discriminator.

* TaskEvent: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L334
This one has a custom apply and unapply, and play-json's macros are
confused by them. The case class itself could autogen the serializer.

* ValueChanged:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L344
I'm not sure what the issue is here, maybe play-json's macro is just
confused by the complexity.

* BackgroundJobEvent:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L354
Same as TaskEvent, custom apply/unapply confuses play-json

* Test events: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L377
Custom unapply confuses play-json

* CompilationFailure:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L403
Not sure - looks to me like the play macros should work.

* sbt.protocol.ByteArray:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L414
We have to hex encode the bytes

* Map[File, T]:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L434
play-json only knows how to serialize Map[String,T]

* Relation: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L440
Not sure why.

* Stamp: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L459
Trait with discriminator.

* xsbti.Problem
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L475
play-json macros can't do Java interfaces.

* Qualifier: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L478
Trait with discriminator

* Access: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L497
Trait with discriminator

* Variance: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L523
Trait with discriminator / Java type

* PathComponent:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L542
Trait with discriminator

* SimpleType: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L560
Trait with discriminator

* Type: https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L591
Trait with discriminator

* CompileFailedException:
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/package.scala#L671
play-json can't figure out how to do non-case-class
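
(The recurring "trait with discriminator" pattern above is essentially this, shown play-json style with simplified stand-in types:)

    import play.api.libs.json._

    sealed trait Outcome
    case object Passed extends Outcome
    final case class Failed(message: String) extends Outcome

    // writer adds a discriminator field alongside the subtype's own fields
    implicit val outcomeWrites: Writes[Outcome] = Writes {
      case Passed          => Json.obj("outcome" -> "passed")
      case Failed(message) => Json.obj("outcome" -> "failed", "message" -> message)
    }

    // reader dispatches on the discriminator to pick the subtype
    implicit val outcomeReads: Reads[Outcome] = Reads { js =>
      (js \ "outcome").validate[String].flatMap {
        case "passed" => JsSuccess(Passed)
        case "failed" => (js \ "message").validate[String].map(Failed(_))
        case other    => JsError(s"unknown outcome: $other")
      }
    }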

Havoc

eugene yokota

Sep 24, 2014, 2:50:48 PM
to sbt...@googlegroups.com
On Wed, Sep 24, 2014 at 9:07 AM, Josh Suereth <joshua....@gmail.com> wrote:


On Tue, Sep 23, 2014 at 6:39 PM, eugene yokota <eed3...@gmail.com> wrote:

On Tue, Sep 23, 2014 at 5:42 PM, Havoc Pennington <havoc.pe...@typesafe.com> wrote:
Trying to split up some issues, not sure what we are discussing necessarily ;-)

Good idea. We can wrap up some topics.

Pickling's current JSON pickler  is too picky
===

It looks like pickling's current JSON pickler does not do what we
want, but I think this is easy to address; play-json DOES do what we
want already, with a very similar API.

Why do we like pickler so much? The chief motivation is that it uses macros to auto serialize simple case classes.
My argument is that using a serializer for long-term caching and a RESTful API is not a good fit, because JSON needs to be the king, not case classes. Especially if we are thinking about requiring plugin authors to adopt the protocol, we need to make JSON the king.

In sbt, types are the king; JSON is merely a convenient serializer / end-of-the-world concept.

In the sbt 1.0.x world, where all the heavy lifting is done in the sbt server, clients on the terminal and in IDEs are simultaneously hitting the server via a RESTful and/or WebSocket protocol, and all logging and long-term persistence (including persistence of dependency resolution graphs) are done via JSON, the long-term stability and maintenance of the JSON becomes the king. Types are just an approximation of the real world so compilers can understand it.
A similar analogy is ORM. When you have a relational database of various tables that evolves over time, wouldn't it make sense to generate a schema document from the database and generate case classes from it?

The Goals/Requirements we should have in serialization:
  • End user serialization is dead simple (i.e. users CAN make custom tasks with custom return types, and we can enforce serializers on all of these)
  • Minor protocol adjustments should be convenient.  I should be able to make "intuitive" compatible changes to the case class to evolve an API and not break clients.   This means, if I'm using v1.0 of a plugin in a client, and the build has v1.0.1, my client should *semantically* work, possibly with not as much functionality.  
  • NON-REQUIREMENT:  Major protocol adjustments (these should literally be new types).  Upon failure to load, a cache should be INVALIDATED.  Similarly, when sending messages and the message doesn't meet an expected format, you should error with incompatible.   Instead, we need to work in some notion of "protocol level" where we can move up/down a protocol version inside the server so we adapt to the protocol a client supports.   These should LITERALLY be different types (possibly the same names, but different packages) with adaptors between them.  
  • A protocol escape hatch: If we are unable to deserialize a message into a type, we can give it to the client in raw JSON as an escape hatch, but ideally this doesn't happen often.
  • Java Binary compatibility should be maintained on the *Scala API* for the underlying protocol types/classes/traits/objects.
  • The ability to efficiently serialize over time (cache to disk) a value *OR* serialize over location (send to client) a value.  (Hence JSON vs. binary representation as a goal).

Here's my list of goals/requirements we should have in data binding:
  • It's JSON (or XML). Binary for streaming files only.
  • There's a document that describes all JSON data formats for a given API.
  • Given a JSON document a.json and data binding, it can convert to some Scala object, and marshal back to a compatible JSON document.
  • At each schema level, there's a notion of version number which adheres to the following semantics.
    • A JSON document a.json produced by Foo API 1.0 is guaranteed to be readable with Foo API 1.1.
    • A JSON document b.json produced by Foo API 1.1 is guaranteed to be readable with Foo API 1.0.
  • Scala types produced by compatible schemas are forward binary compatible.
  • When an end user adds custom JSON data, it should be easy to guarantee JSON compatibility.
    • One way of doing this may be to infer schema from instance documents.
 
=> Suggested path forward here: I'm open to the
hand-roll-a-non-case-class-thing trait version, but I haven't worked
out what it looks like in detail enough to know that it in fact helps
us. So I'd probably want to do that next (explore details). The
question I'd want to answer by working out the details is whether
keeping the same type name when adding fields is enough of a win vs.
whatever the extra complexity turns out to be.

To me, the ability to grow API while remaining bincompat is really important.
We need to keep the same name so we're not constantly mapping between DTO <-> hand-rolled entity classes.
 

Ideally the fact there's a DTO can mostly be hidden (or automated) and we only care when we need to that they exist.

You were "+1" on consolidating on DTO below, so that means you're ok with not hiding DTO, right?
But yea, if we're doing the pseudo-case-class bincompat final-class hidden-ctor thing, we need to generate that stuff.


There's a general wish/need for pseudo case classes, at least in my head.
But if we are making JSON the king, and generating DTOs anyway, it makes sense to piggyback the growable-ABI agenda.
I'm totally for starting to make a pseudo case class generator.


I agree on growable ABI.  Sounds like we may need our own API which resembles the best in others? Is this what you're suggesting?

Not sure what you mean here, but let me clarify what I meant by "piggybacking."
Currently there's the notion of serialization (generate JSON from a case class) and data binding (generate a case class from a schema or JSON).
Understandably there's hesitation about generating code from a schema or anything else, but my point is that writing a binary compatible pseudo case class is a pain in the bleep anyway, so that should alleviate the concern. Code generation also allows us to spread this best practice to plugin authors.
 
Is there a schema description language / schema files?
===

If the goal is to have both a file with case classes and a file with a
schema in some schema description language, then you can generate
either from the other, conceptually. The one thing that's clearly bad
is to write them BOTH by hand, because then they can be out of sync.

Sort of. There's the schema, the DTO classes, and the marshaling/unmarshaling code.
If we decide JSON is the king, then we start out with generating/inferring a JSON schema, and from there we generate both the DTOs and the marshaling typeclass instances. For that we would need to write such tools ourselves.
In the interim, I'm ok with manually writing both the JSON parsing code and the pseudo case classes, because at least I would have control over exactly what JSON it's going to emit.


In the interim this is ok, but I don't think the end user experience will be good enough.

Yes, for them, I'd like to have JSON -> schema -> codegen.

-eugene


eugene yokota

Sep 24, 2014, 3:38:50 PM
to sbt...@googlegroups.com
Hi Heather,

Welcome to the discussion!

On Wed, Sep 24, 2014 at 10:17 AM, Heather Miller <heather....@gmail.com> wrote:
On Wednesday, September 24, 2014 4:05:56 PM UTC+2, Havoc Pennington wrote:
Hi,

Re: pickling, assuming we strip it down to only our needed API, some
known differences from play-json are:

 * no AST; uses "streaming" instead (see
https://github.com/scala/pickling/blob/0.9.x/core/src/main/scala/pickling/PBuilderReader.scala
)
 * abstracts binary vs. json
 * may already be able to pickle non-case-classes (Eugene you just
said it could do the private-constructor class right)

I'm not sure where Eugene got the impression that "the chief motivation (of pickling) is that it uses macros to auto serialize simple case classes". That's quite far from the truth (maybe you're confusing pickling with Salat?) 

Sorry, I didn't mean it as a slander on pickling or to imply that that's pickling's motivation.
What I meant by the statement is why we are interested in pickling as opposed to (current) sbinary and/or hand-rolled JSON parsing in the first place.
The emphasis is placed on "macros" (compile-time guarantee) and "auto serialize" (as opposed to hand-rolling typeclass instances), not on "simple" or "case classes". In fact I did demonstrate to Havoc that it can serialize a final class with a hidden constructor.

-eugene

Philipp Haller

Sep 24, 2014, 4:22:01 PM
to sbt...@googlegroups.com
Hi Eugene,
I think one important point is also that pickling has a flexible and complete API for hand-rolling typeclass instances. So, that's another programmable piece of the pickling toolbox. One nice aspect of this API: it too abstracts from the pickle format (JSON/binary/...).

Cheers,
Philipp

eugene yokota

Sep 24, 2014, 5:30:05 PM
to sbt...@googlegroups.com
On Wed, Sep 24, 2014 at 10:12 AM, Heather Miller <heather....@gmail.com> wrote:

In the above, you also express concern over depending on a prerelease version. Databricks requires a final release for their integration with Spark by November 3rd. 0.8.0 is the current version of pickling. The changeset that we're sitting on that's targeted to go into 0.9.0 is enormous, and we're in a state where we can and should cut a 0.9.0 release ASAP (e.g., many of our users have moved from 0.8.0 to 0.9.0 SNAPSHOTs again). Thus, what we can do is this: we can cut a 0.9.0 release this week, and any of the larger/breaking changes that sbt needs, including pickler contravariance, we can put into 0.10.0. We could release 0.10.0 as soon as it covers all your immediate needs (like the two issues related to lightweight protocol evolution that Havoc pointed out), within the next few weeks. FYI, we'll also have another release, possibly 0.11.0, around November 3rd for Spark/Databricks.


 Nice. I really appreciate you guys working with us.

My argument is that using a serializer for long-term caching and a RESTful API is not a good fit, because JSON needs to be the king, not case classes. Especially if we are thinking about requiring plugin authors to adopt the protocol, we need to make JSON the king.

A point of pickling that's been missing a bit from this discussion is that *you* have control of the format (i.e., the JSON/binary that gets put on the wire). And you can also control how things are serialized. Thus, if you need JSON reading/writing to be made more powerful/flexible, you can fork/provide your own pickle format that is more focused on schema evolution, say (though, below we explain how we can/will address the specific issues Havoc summarized also for the generic formats that we ship).


I see serialization (Scala type -> wire format -> Scala type) and data binding (wire format -> Scala type -> wire format) as two different approaches.

The input dictates the range of expressiveness, like a specification, and the other is ultimately an implementation detail. Given a RESTful API, I would like to document what JSON data it is expecting, not "what class Foo would generate according to these rules."
Furthermore, a schema document can go further than Scala types in terms of specification:
- default value when field is missing
- API version on fields (e.g. "since 1.0" vs "since 1.1")
- cardinality (minOccurs/maxOccurs)
- union of data
- 22 fields on Scala 2.10
- other value-level validations

Would these be encoded as custom attributes on the picklee's type T?
Some of them, like 22 fields and union types, are just annoying encoding issues which current and future Scala versions might resolve,
but the useful ones are the first two.

### default value
Having a default value allows us to evolve the DTO without having to put Option[T] all over the place.
Typically there's an equivalent of a zero/blank/false value that's safe to default to.

### API version on fields
Given a type T, it should probably check all field annotations, figure out the API baseline (max major version + ".0.0"), and check whether any fields added after that are either optional or have a default value.
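
(A rough sketch of what that could look like; the @since annotation below is hypothetical, not an existing pickling or sbt feature:)

    import scala.annotation.StaticAnnotation

    // hypothetical marker recording when a field was added
    final class since(version: String) extends StaticAnnotation

    // a field added in 1.1 gets a zero-ish default instead of Option[T], so
    // documents produced by 1.0 (which lack "column") still unpickle
    final case class Position(
      sourcePath: String,
      line: Int,
      @since("1.1") column: Int = 0)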

So the TLDR is this: rather than needing to fork the entire framework, why not just do as intended and fork the JSON format? You can get almost everything that you require in doing so, and anything that needs additional support from the framework, we can add for you.


If we are starting out with Scala types, I think it makes sense to use stock pickling 0.9.0 and implement a custom pickle format as you suggest.
If we are going to generate pseudo case classes from a JSON schema or JSON instances, I wouldn't call that a fork, and it could even co-exist with pickling.

-eugene

eugene yokota

unread,
Sep 24, 2014, 6:18:31 PM9/24/14
to sbt...@googlegroups.com

### AST vs opaque blob

On Wed, Sep 24, 2014 at 10:58 AM, Havoc Pennington <h...@typesafe.com> wrote:
On Wed, Sep 24, 2014 at 10:19 AM, Heather Miller
<heather....@gmail.com> wrote:
>> minor detail, with pickling it wouldn't be raw JSON but rather some
>> kind of opaque blob.
>
>
> Not sure what makes you think so? How is this the case? (I could be
> confused/missing context)
[snip] 

The reason I say it wouldn't be raw JSON in pickling is that I'm
assuming (hoping, really) that a JSON AST is "out of scope" for the
main pickling API.

If we know that the wire protocol is JSON, we can just write a custom pickler to unpickle a JSON document into any AST we want (jawn is supposedly the fastest parser and is AST-agnostic).

final case class TaskEvent(taskId: Long, name: String, serialized:
MyInventedOpaqueUnpickleable) extends Event

For these purposes, it might be useful to have a general-purpose envelope type like

    case class Attributed[T](header: Map[String, Any], body: T)

If we have no idea what T is we just do Attributed[JsValue].

-eugene

Haoyi Li

Sep 25, 2014, 10:40:13 AM
to sbt...@googlegroups.com
If you're only interested in serializing struct-like objects to traditional JSON representations with a case-class-handling macro and the ability to add optional fields, uPickle does that quite well out of the box. It's also a lot smaller than most other picklers (~1000loc) if you want to source-include it inside SBT and uses Jawn under the covers (which is tiny and should also be inlined).

The very-normal serialization format is even documented with exactly the spec you seem to want! It even (by default) throws away the classname so the pickle-as-one-class-unpickle-as-another-if-fields-match case works perfectly.

It lets you handle data-migration by adding additional arguments to your case class with default values. And has support for sealed-traits-are-discriminated-unions by default. Want some new field to default to null? to None? Just add it as a default value and it'll do the right thing.
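
(Roughly, as a sketch using upickle's ReadWriter/macroRW API; the exact derivation API has varied across uPickle versions:)

    import upickle.default._

    // hypothetical evolving event type; the added field carries a default
    case class TestEvent(name: String, tags: Seq[String] = Nil)
    implicit val rw: ReadWriter[TestEvent] = macroRW

    // a missing field falls back to the declared default rather than failing,
    // and no class name is written into the JSON
    val evt: TestEvent = read[TestEvent]("""{"name": "compile"}""")
    val json: String = write(evt)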

You can also write custom picklers by hand.

It doesn't pickle Anys or Exceptions (You have to know the shape at compile-time) and it doesn't pickle body-vars by default (only uses case class apply/unapply) but from the thread you guys seem to be leaning towards struct-like-objects-only anyway, and in scala that means case classes.

I wrote uPickle out of exactly the same reasoning that is in this thread, so you should take at least a cursory look before going and making your own thing.

Havoc Pennington

Sep 29, 2014, 2:57:30 PM
to sbt-dev
Haoyi,

Just wanted to say thanks for creating and posting about upickle - was
just talking to Josh and Eugene and I think we want some of the extra
bells and whistles in scala pickling, but personally this looks like a
cool little library and makes a lot of sense for many apps, especially
for the scala.js interop case.

Havoc

Havoc Pennington

Sep 29, 2014, 3:14:41 PM
to sbt-dev
Hi,

Had a meeting with Josh and Eugene this morning and wanted to follow
up here to preserve visibility to others who might care.

In brief we want to take a crack at porting sbt-remote-control to pickling.

Some of the discussion threads here are really separable issues so we
will treat them separately:

- case classes vs. classes that can add fields while preserving ABI,
we probably want to make this change but it is a separable task

- the protocol (vs. serialization of values), for now we are just
looking at serialization of values, the protocol will remain the
custom binary protocol found in
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/impl/ipc/IPC.scala
(and this protocol is not exposed outside of the sbt package, or
shouldn't be anyway)

Some of the steps to try out pickling:

- add a test suite which will catch accidental protocol changes and
check back compat, by spitting out example serializations, checking
those into git, and testing vs. both old serializations and the latest

- remove JsValue from the API by introducing a thing like: trait
UnknownAny { def unpickle[T:Reads]: Try[T] } ; this probably overlaps
with / is redundant with / relates somehow to the BuildValue we
already have in
https://github.com/sbt/sbt-remote-control/blob/master/commons/protocol/src/main/scala/sbt/protocol/Values.scala#L10

- then try porting to scala pickling

For the pickling port we would probably create a custom PickleFormat
and try to wire things up so the client and server always use it, that
way we would always be able to introduce any needed compat hacks in
that PickleFormat.
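
(A rough sketch of that UnknownAny idea from the steps above; the Unpickler placeholder and everything else here is hypothetical, nothing is settled:)

    import scala.util.Try

    // placeholder for "a typeclass that knows how to deserialize T from whatever
    // format the server happened to use" (JSON, binary, ...)
    trait Unpickler[T] {
      def fromWire(bytes: Array[Byte]): Try[T]
    }

    // the protocol carries an opaque, format-specific payload; a client that
    // knows the concrete type can try to recover it, others just ignore it
    trait UnknownAny {
      def unpickle[T](implicit u: Unpickler[T]): Try[T]
    }

    final case class TaskEvent(taskId: Long, name: String, serialized: UnknownAny)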

Some random notes from discussion:

- it would be nice to have an annotation for a default value which
would allow adding fields without using Option, i.e. an optional field
could be defaulted instead of None'd

- we should try to automate the sealed-trait-with-discriminator pattern

- we need to use fully-qualified class name in TaskEvent instead of
just the unqualified name, and/or handle the JsValue in the same way
as in ValueChanged / BuildValue

- in 0.13.x we would not mess with the existing sbinary/cache stuff,
the relevance of that for now is just that we want a serialization
solution that will let us unify on it in 1.0 rather than having
sbinary also

Probably something else I don't remember but anyway. We will proceed
with this theory and see what we learn.

Thanks
Havoc

Naftoli Gugenheim

Sep 29, 2014, 8:16:08 PM
to sbt...@googlegroups.com

Just curious, which bells and whistles do you have in mind?
BTW Scala.js interop could actually be interesting; you could make a browser-based sbt client that way.

Philipp Haller

Sep 30, 2014, 4:50:22 AM
to sbt...@googlegroups.com
On Tuesday, September 30, 2014 2:16:08 AM UTC+2, nafg wrote:

Just curious which bells and whistles do you have in mind?
BTW scalajs interop could actually be interesting, you could make a browser-based sbt client that way.

Scala pickling will very soon be useable with Scala.js. Similar to how scala-async works when compiling Scala.js code, the pickling library itself needs to be built with Scala/JVM. However, the generated code will soon be compatible with Scala.js, for the cases where no runtime reflection is required. The switch for disallowing the generation of picklers that use runtime reflection is already in place:

Furthermore, there's now also a runtime pickler registry, which allows looking up picklers without reflection (similar to scala-js-pickling). This is also needed in projects like Spark.

Cheers,
Philipp