Pandas 2.0 Design Request: A more dplyr-like API

Chris Said

Jun 13, 2017, 6:50:27 PM
to PyData
Hi Pandas developers,

I want to start by thanking all of the pandas developers for the effort they've put into the project. So much of what you do is thankless, and I want you to know it is really appreciated. Pandas is a huge part of my day-to-day coding.

Because I use it so much, I want to submit a request. I want somebody to #MakePandasMoreLikeDplyr. To me and to almost everyone else I've talked to who knows pandas and dplyr, this is more important than performance improvements and arguably more important than most of the goals in the pandas 2.0 design docs.

I'm not an R guy. 95% of my work is done in pandas. But everyone I know who uses pandas is constantly having to google how to do things. In contrast, dplyr feels like coding at the speed of thought. In particular, the combination of groupby->{mutate, summarize} is incredibly natural. It is so easy to create multiple named output columns from multiple input columns. That's because the definition of new columns, with reference to multiple input columns, is all done inside the call to mutate / summarize. With pandas, it's much more complicated and hard to remember.
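
To make that concrete, here's a rough sketch of the difference I mean, with made-up columns (the dplyr version is in the comment):

import pandas as pd

df = pd.DataFrame({"cut": ["Ideal", "Ideal", "Premium", "Premium"],
                   "x":   [3.95, 3.89, 4.05, 4.20],
                   "y":   [3.98, 3.84, 4.07, 4.23]})

# dplyr: df %>% group_by(cut) %>% summarize(mean_x = mean(x), ratio = mean(x / y))
# pandas: the named outputs and their multi-column definitions can't be
# declared in the groupby call itself, so the usual spelling goes through
# apply() and a hand-built Series.
summary = (df.groupby("cut")
             .apply(lambda g: pd.Series({"mean_x": g["x"].mean(),
                                         "ratio": (g["x"] / g["y"]).mean()})))

In dplyr, the output names and their definitions all live inside summarize; in pandas they end up inside a lambda and a dict.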

The new transform method in 0.20 gets us part of the way there. But instead of allowing users to name the output columns, it returns multi-indexed columns, which for me and most other people I've talked to are unwanted.
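
For anyone who hasn't run into this, here's the kind of thing I mean (using agg on the same made-up frame as above): asking for several aggregations per column comes back with two levels of column labels, and the usual fix is to flatten and rename them by hand.

# Columns come back as a MultiIndex like ("x", "mean"), ("x", "max"), ...
result = df.groupby("cut").agg({"x": ["mean", "max"], "y": ["mean", "max"]})

# Typical cleanup to get plain, named columns back:
result.columns = ["_".join(col) for col in result.columns]
result = result.reset_index()   # cut, x_mean, x_max, y_mean, y_max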

Thank you again for all your hard work. Just as a TL;DR: More like dplyr, less injection of multi-indexes. (Could they be eliminated entirely?)

Best,
Chris

Stephan Hoyer

Jun 13, 2017, 8:03:25 PM
to pyd...@googlegroups.com
Hi Chris,

I think most of us agree with you. We've been slowly moving in this direction (e.g., with .assign()) and hope to do more. For example, see our speculative discussion concerning getting rid of indexes for pandas2 and a proposal for allowing indexes to be referenced by name.
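
To make the .assign() point concrete, a small made-up sketch: callables are roughly the closest we can get to dplyr's mutate without non-standard evaluation.

import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 8.0]})

# dplyr's mutate(ratio = x / y) refers to bare column names; in Python the
# closest idiom is a callable that receives the frame, since we can't
# capture unevaluated expressions.
out = df.assign(ratio=lambda d: d["x"] / d["y"],
                big=lambda d: d["x"] > 2)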

There are a few major obstacles here:
1. Coming up with concrete plans for how new APIs should work. This is harder than just copying dplyr, because we don't have access to non-standard evaluation in Python.
2. Figuring out how to deprecate/replace existing behavior in a minimally painful way, to minimize clutter of the pandas API. (Arguably, we already have too many methods.)
3. Actually implementing these changes in a consistent fashion in the complex pandas codebase.

These are all important pieces of work, but only the last item requires actually writing code. Help would be appreciated on all of them.

It's worth noting that some of this may actually be easier to do outside of pandas proper. For example, Wes and Phil have been working on a pandas backend to Ibis.

Best,
Stephan

Paul Hobson

Jun 14, 2017, 11:24:08 AM
to pyd...@googlegroups.com
Just my 2 cents on indexes:

Every time I think I'm done with them and don't need them any more, I get into some weird situation where a complex, nested, categorical index makes my life soooo much easier.

I recognize that if the library and the general community don't need them, they can represent a significant maintenance burden. But they saved my ass a couple of times this week.
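
A made-up sketch of the kind of situation I mean, with a nested index that has a categorical level:

import pandas as pd

# Hypothetical monitoring data keyed by (site, parameter); the parameter
# level is an ordered categorical so results sort in a sensible order.
params = pd.CategoricalIndex(["DO", "pH", "DO", "pH"],
                             categories=["DO", "pH", "TSS"], ordered=True)
df = pd.DataFrame(
    {"value": [8.1, 7.2, 6.9, 7.4]},
    index=pd.MultiIndex.from_arrays(
        [["outfall", "outfall", "upstream", "upstream"], params],
        names=["site", "parameter"]))

df.xs("pH", level="parameter")   # all pH rows across sites
df.loc["outfall"]                # everything measured at the outfall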

-paul

Joris Van den Bossche

Jun 14, 2017, 11:30:30 AM
to PyData, panda...@python.org
2017-06-14 17:24 GMT+02:00 Paul Hobson <pmho...@gmail.com>:
Just my 2 cents on indexes:

Every time I think I'm done with them and don't need them any more, I get into some weird situation where a complex, nested, categorical index makes my life soooo much easier.

I recognize that if the library and the general community don't need them, they can represent a significant maintenance burden. But they saved my ass a couple of times this week.

 
Stephan mentioned some ideas to make the cases where you don't need them easier (e.g. allowing a DataFrame to have no index at all), but there are no plans to ditch indexes altogether. If you look at the linked issue, it talks about "optional indexes"; Stephan's wording in his earlier mail was maybe a bit misleading.
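
(To give an idea of what "optional" would buy you: today the closest you can get is keeping the index trivial and suppressing it on the way out, roughly like this small sketch.)

import pandas as pd

df = pd.DataFrame({"site": ["a", "b"], "value": [1.0, 2.0]})

df = df.reset_index(drop=True)       # keep only the default RangeIndex
df.to_csv("out.csv", index=False)    # and leave it out when writing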

Joris

Chris Bartak

Jun 14, 2017, 12:29:48 PM
to Joris Van den Bossche, PyData, panda...@python.org
Chris,

I'd encourage you to experiment with (and contribute to!) the Ibis expression API that Stephan mentioned. The pandas backend is a work in progress, but functional enough to try out. It is already quite dplyr-like; for example, here's the translation of your first example. The obvious difference is the verbosity of having to fully qualify column names, which is essentially a Python syntax limitation.


In [77]: (diamonds
             .groupby(diamonds.cut)
             .aggregate(mean_x=diamonds.x.mean(), mean_y=diamonds.y.mean()))
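
For comparison, the plain-pandas spelling of the same thing (assuming diamonds is an ordinary DataFrame here) needs a separate rename, since the output names can't be given inline:

result = (diamonds.groupby("cut")
                  .agg({"x": "mean", "y": "mean"})
                  .rename(columns={"x": "mean_x", "y": "mean_y"})
                  .reset_index())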

Paul Hobson

Jun 14, 2017, 12:36:16 PM
to pyd...@googlegroups.com
I see. Thanks for the insight. Very interesting discussion in those links.

Cheers,
-Paul

Phillip Cloud

Jun 17, 2017, 7:19:56 PM
to pyd...@googlegroups.com
I hacked together something that looks a little like dplython at https://github.com/ibis-project/ibis/pull/1039, if anyone's interested in playing around with it. There are some bugs when doing joins with the pandas backend, but all existing SQL backends should work, and most of the single-table operations I've mapped to the dplyr API work with pandas as well.

 