Hi everyone,
Following my call for additional maintainers, Hyukjin Kwon from Databricks stepped up and asked me if I would be interested in Databricks taking over the maintenance of Py4J.
For those who are not aware, Databricks is the company founded by the creators of Spark. Spark, together with PySpark, is one of the most popular Big Data frameworks and by far the largest source of Py4J users. Hyukjin Kwon and Josh Rosen (also from Databricks) have contributed significant fixes and additions to Py4J, so I am confident that they know their way around the code, and I believe they share my views on backward compatibility.
Databricks can also dedicate some of its resources to Py4J, so this is a net win for the community: Py4J will continue to evolve and will likely get security and performance improvements.
Hyukjin and I have been working on a migration plan and we will keep you updated as we move forward. We can already tell you that (1) the namespace (in both Java and Python) will not change, (2) the license will change from BSD to Apache License 2.0, and (3) the project will still be developed openly on GitHub, but under a new organization (databricks instead of my personal GitHub account). I’ve prepared a small FAQ at the end of this email because I’m sure you will all have questions, given that many of you have businesses, personal projects or research projects that depend on Py4J.
I have been flattered by all the kind words and the offers to hire me or subcontract me to continue working on Py4J. Py4J started at a time when I was building crazy things, and I get excited every time I learn about a new project using Py4J. I’m a builder at heart and I have always seen Py4J as a way to build more things by leveraging both the Java and Python ecosystems. I would not approve of this move if I were not sure that Py4J would continue to contribute towards that vision.
Feel free to ask any questions you may have by replying to this message.
Thanks,
Bart
FAQ
1. Why are you changing the license from BSD to Apache License 2?
As a large organization, Databricks needs the patent protection provided by the Apache License 2.0. The Apache License 2.0 is compatible with major open source licenses, including the Eclipse Public License, the GPL (version 3), and BSD. I considered the Apache License 2.0 when I started working on Py4J, so this was an easy decision for me. Code that was released prior to the migration is still available under the BSD 3-clause license.
2. Will you (Bart) go away?
No. I will initially help with the migration and keep commit access to the Py4J repository in the Databricks organization. I will participate in the release process, but the goal is for me to move away from release engineering as soon as possible so that I am not a bottleneck. I will definitely review pull requests and, because I have a broader understanding of the code and the various use cases, I will make sure that we do not inadvertently break something for someone :-)
3. What will happen with the Eclipse build?
I currently build an Eclipse update site for each release of Py4J and we will continue to do so. Databricks does not have strong Eclipse expertise, so if any of the existing Eclipse users could step in to improve our build process and find a place where we can host the update site (and all the previous versions), that would be great!
I have personally not been using Eclipse (nor Java… until a few days ago, when I had to connect to an old IBM mainframe… shudders…), so any help would be appreciated!
4. Will my code break when the new version of Py4J is released?
No. We won’t change where we deploy the code (Sonatype, PyPI, GitHub, the Py4J web site), we won’t change the API and we won’t change the package names. We may change some of the names when releasing Py4J 1.0 or 2.0, but that has always been the plan.
Remember that PySpark is a large user of Py4J, so breaking backward compatibility would be a mess for them and their users. They would probably rather spend their time building new things than refactoring their entire integration with Py4J.