Keeping the Java API up to date


Josh Rosen

Jan 5, 2013, 3:24:46 PM
to spark-de...@googlegroups.com
What process should we use to keep the Java API up to date with the Scala API?

Some new features are a bit more involved to implement in Java, like https://spark-project.atlassian.net/browse/SPARK-615, but others are trivial: https://spark-project.atlassian.net/browse/SPARK-606

For pull requests like https://github.com/mesos/spark/pull/353 and https://github.com/mesos/spark/pull/351, we only need to add one or two lines of code to call the corresponding Scala methods and wrap the returned values as JavaRDDs.
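
For illustration, such a wrapper is usually just a one-line delegating method, something like this sketch (simplified; the real JavaRDD class carries extra machinery like ClassManifests, so this isn't the exact code in the tree):

import spark.RDD

// Simplified sketch of a Java API wrapper class: each new method just
// delegates to the underlying Scala RDD and re-wraps the result.
class JavaRDD[T](val rdd: RDD[T])(implicit val classManifest: ClassManifest[T]) {
  def distinct(): JavaRDD[T] = new JavaRDD(rdd.distinct())
}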

I don't like the idea of opening an extra Java API pull request for every minor feature, so it might be nice to add features to both APIs in the original pull request.  At the same time, I don't want to add a barrier that prevents people from contributing.

If I notice features missing from the Java API, should I open individual JIRA issues for them?  I don't want to flood the tracker.

I suppose that we could reconcile the Java and Scala APIs right before we cut a release, so maybe we could keep a single open issue per release that acts as a checklist of features that need to be ported to Java.

Similarly, how do we decide which new methods will be documented in the Scala programming guide?

Thoughts?

- Josh

Matei Zaharia

Jan 5, 2013, 10:19:40 PM
to spark-de...@googlegroups.com
Good question, Josh. I think we should ask pull request authors to update the Java API as well -- in fact, I remembered to do that after committing 353 and 351. Feel free to comment on future requests that add new methods.

For existing features that are missing, you can always open one issue that lists a bunch of them, or just send a pull request with them. I like the "checklist" issue idea too if that's easier. And then we should just do a review before making the release.

For the programming guide, I only wanted to keep the simplest / highest-level methods, so don't add new ones unless there's really something we missed. The ScalaDoc page contains a fuller list of methods.

Luckily, it seems like a decent number of people are using the Java API, so hopefully we'll get comments about any features we miss.

Matei

Josh Rosen

May 1, 2013, 10:17:48 PM
to spark-de...@googlegroups.com
I'm reviving this thread because I think that synchronizing the Java and Scala APIs should be done before the 0.8 release.

I don't think that we have a complete list of methods that are missing from the Java API.  Assembling this list manually might be painful, so maybe it would be worthwhile to write a script to identify methods that were added / changed between Spark releases (this could probably be hacked together using javap, sort, and diff).
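
Something along these lines might work as a first cut (untested sketch; the classpath values are placeholders for wherever each release's compiled classes live):

import scala.sys.process._

// Dump a class's method signatures via javap, sorted so the lists diff cleanly.
def methodSignatures(classpath: String, className: String): Seq[String] =
  Seq("javap", "-classpath", classpath, className).!!.split("\n").toSeq.sorted

val v07    = methodSignatures("spark-0.7/core/target/classes", "spark.RDD")
val master = methodSignatures("core/target/classes", "spark.RDD")

// Signatures present on master but not in 0.7 are candidates for the Java API.
(master.toSet -- v07.toSet).toSeq.sorted.foreach(println)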

Do we have a checklist for performing releases?  Can we add API synchronization to that checklist?

- Josh

Shivaram Venkataraman

May 1, 2013, 11:05:38 PM
to spark-de...@googlegroups.com
I added the zipPartitions method in Scala, which is missing from Java.
I'll send a pull request for the Java side soon.

Thanks
Shivaram

Josh Rosen

May 8, 2013, 12:00:39 PM
to spark-de...@googlegroups.com
As a proof of concept, I used `javap` and `vimdiff` to find the RDD methods added to the master branch (as of a few days ago) since 0.7; I should do the same for SparkContext, too. The javap invocation was:

javap -classpath core/target/scala-*/classes/ spark.RDD spark.PairRDDFunctions spark.OrderedRDDFunctions

It turned up these new methods:

public boolean isTraceEnabled();
public spark.RDD combineByKey(scala.Function1, scala.Function2, scala.Function2, spark.Partitioner, boolean, java.lang.String);
public spark.RDD foldByKey(java.lang.Object, spark.Partitioner, scala.Function2);
public spark.RDD foldByKey(java.lang.Object, int, scala.Function2);
public spark.RDD foldByKey(java.lang.Object, scala.Function2);
public java.lang.String combineByKey$default$6();
public spark.RDD subtractByKey(spark.RDD, scala.reflect.ClassManifest);
public spark.RDD subtractByKey(spark.RDD, int, scala.reflect.ClassManifest);
public spark.RDD subtractByKey(spark.RDD, spark.Partitioner, scala.reflect.ClassManifest);

To find the complete list of methods to add to Java, we need to compare the master branch against 0.6 and filter out the methods that were already added.
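
Building on the methodSignatures sketch from my earlier mail, the filtering step might look roughly like this (untested; I'm assuming spark.api.java.JavaRDD as the wrapper class, and matching on bare method names ignores overloads and would still let compiler-generated names like combineByKey$default$6 slip through):

// Pull bare method names ("foldByKey", "subtractByKey", ...) out of javap output.
def methodNames(sigs: Seq[String]): Set[String] = {
  val method = """(\w+)\(""".r
  sigs.flatMap(sig => method.findFirstMatchIn(sig).map(_.group(1))).toSet
}

val v06      = methodSignatures("spark-0.6/core/target/classes", "spark.RDD")
val master   = methodSignatures("core/target/classes", "spark.RDD")
val scalaNew = methodNames(master) -- methodNames(v06)
val javaHas  = methodNames(methodSignatures("core/target/classes", "spark.api.java.JavaRDD"))

// New Scala methods with no same-named Java wrapper yet: the list to port.
(scalaNew -- javaHas).toSeq.sorted.foreach(println)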

Josh Rosen

Jul 18, 2013, 2:20:43 AM
to spark-de...@googlegroups.com
I just submitted a pull request containing an automated tool for finding methods that are missing from the Java API: https://github.com/mesos/spark/pull/713 

I'd like to first bring the Java API up to date for Spark 0.7.4, then do the same in the master branch. I'm fine with making the necessary additions in Spark core, but I'd like help completing the Spark Streaming Java API.