John Myles White
Jan 20, 2014, 4:57:43 PM
to julia...@googlegroups.com
As I said in another thread recently, I am currently the lead maintainer of more packages than I can keep up with. I think it’s been useful for me to start so many different projects, but I can’t keep maintaining most of my packages given my current work schedule.
Without Simon Kornblith, Kevin Squire, Sean Garborg and several others doing amazing work to keep DataArrays and DataFrames going, much of our basic data infrastructure would have already become completely unusable. But even with the great work that’s been done on those packages recently, there’s still a lot of additional design work required. I’d like to free up some of my time to do that work.
To keep things moving forward, I’d like to propose four radical New Year’s resolutions for the packages I work on.
(1) We need to stop adding functionality and focus entirely on improving the quality and documentation of our existing functionality. We have way too much prototype code in DataFrames that I can’t keep up with. I’m about to make a pull request for DataFrames that will remove everything related to column groupings, database-style indexing and Blocks.jl support. I absolutely want to see us push all of those ideas forward in the future, but they need to happen in unmerged forks or separate packages until we have the resources needed to support them. Right now, they make an overwhelming maintenance challenge even more onerous.
(2) We can’t support anything other than the master branch of most JuliaStats packages except possibly for Distributions. I personally don’t have the time to simultaneously keep stuff working with Julia 0.2 and Julia 0.3. Moreover, many of our basic packages aren’t mature enough to justify supporting older versions. We should do a better job of supporting our master releases and not invest precious time trying to support older releases.
(3) We need to make more of DataArrays and DataFrames reflect the Julian worldview. Lots of our code uses an interface that is incongruous with the interfaces found in Base. Even worse, a large chunk of code has type-stability problems that make it very slow, when comparable code that uses normal Arrays is 100x faster. We need to develop new idioms and new strategies for making code that interacts with type-destabilizing NAs faster. More generally, we need to make DataArrays and DataFrames fit in better with Julia when Julia and R disagree. Following R’s lead has often led us astray because R doesn’t share Julia’s strengths or weaknesses.
(4) Going forward, there should be exactly one way to do most things. The worst part of our current codebase is that there are multiple ways to express the same computation, but (a) some of them are unusably slow and (b) some of them don’t ever get tested or maintained properly. This is closely linked to the excess proliferation of functionality described in Resolution 1 above. We need to start removing stuff from our packages and making the parts we keep both reliable and fast.
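To make the type-stability point in Resolution 3 concrete, here is a minimal sketch. It is not code from DataArrays or DataFrames: the function `total` and both arrays are hypothetical names made up for illustration, and the `Any`-typed array is only a rough analogy for a container whose elements may be either a number or an NA. The same loop runs over both arrays, but only for the concretely typed one can the compiler know the element type ahead of time.

```julia
# Hypothetical illustration; not taken from DataArrays or DataFrames.

# Sum the elements of any iterable collection of numbers.
function total(xs)
    s = 0.0
    for x in xs
        # With a Float64 array, x is known to be a Float64 and this is a
        # plain floating-point add. With an Any array, the type of x is
        # only known at run time, so every `+` is dispatched dynamically.
        s += x
    end
    return s
end

typed   = Float64[1.0, 2.0, 3.0]   # element type known to the compiler
untyped = Any[1.0, 2.0, 3.0]       # element type hidden from the compiler

# Both calls compute the same sum; only the generated code differs.
same_result = total(typed) == total(untyped)
```

The two calls agree on the answer, so the difference is purely one of performance. That is why this kind of problem tends to go unnoticed until someone benchmarks the code.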
I think we can push DataArrays and DataFrames to 1.0 status by the end of this year. But I think we need to adopt a new approach if we’re going to get there. Lots of stuff needs to get deprecated and what remains needs a lot more testing, benchmarking and documentation.
— John