Dataframe API?


Christopher Grainger

Mar 1, 2021, 7:50:04 PM
to Numerical Elixir (Nx)
I'm so excited to see Elixir move in the direction of numerical computing. I use Python every day for deep learning and would love to move in the direction of Elixir (though I'm still wrapping my head around the paradigm vis-a-vis something like PyTorch).

One pain point I often feel is that Pandas, Python's main dataframe abstraction, is just so poorly designed. That's true from a number of perspectives which I'd be happy to discuss further, but mainly: the paradigm lends itself to a very inconsistent API with often surprising behaviour.

The Tidyverse in R is much stronger in this respect. Hadley Wickham et al. have built some immensely powerful data analysis tools that are consistent and unsurprising in what they do. Notably, the Tidyverse libraries leverage a pipe operator that works very similarly to the Elixir pipe operator.

There are also some great affinities between dplyr and Ecto. Generally, a functional approach to data analysis is 'natural' in a lot of ways because you're basically just passing the data through a series of data transformations as functions. I have a nascent hope that we could see a dataframe API that could allow us to write Ecto similarly to dplyr: transparently across remote databases and local dataframes.
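To make the "series of data transformations as functions" idea concrete, here is a minimal sketch in plain Python. The helper names (`pipe`, `filter_rows`, `mutate`, `select`) and the sample `orders` data are all made up for illustration; they only loosely mirror dplyr verbs and are not any library's actual API.

```python
from functools import reduce


def pipe(data, *steps):
    """Thread `data` through each step in order, like `|>` or `%>%`."""
    return reduce(lambda acc, step: step(acc), steps, data)


def filter_rows(predicate):
    """Keep only the rows for which `predicate(row)` is true."""
    return lambda rows: [row for row in rows if predicate(row)]


def mutate(**new_columns):
    """Add columns; each keyword maps a column name to a function of the row."""
    return lambda rows: [
        {**row, **{name: fn(row) for name, fn in new_columns.items()}}
        for row in rows
    ]


def select(*columns):
    """Keep only the named columns, in the given order."""
    return lambda rows: [{c: row[c] for c in columns} for row in rows]


# Hypothetical sample data: a "dataframe" as a list of row dicts.
orders = [
    {"item": "tea", "price": 3, "qty": 4, "region": "eu"},
    {"item": "mug", "price": 8, "qty": 1, "region": "us"},
    {"item": "pot", "price": 20, "qty": 2, "region": "eu"},
]

result = pipe(
    orders,
    filter_rows(lambda r: r["region"] == "eu"),
    mutate(total=lambda r: r["price"] * r["qty"]),
    select("item", "total"),
)
```

Each step is an ordinary function from rows to rows, so in principle the same pipeline could be interpreted against a local table or translated into a query for a remote database, which is the dplyr/Ecto affinity I have in mind.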

The downside to working with this stuff in R is poor non-dataframe data handling (particularly a lack of a good native map structure) and weak integration with deep learning libraries. Outside of the Tidyverse, much of the R ecosystem is... rough. It seems that Nx could lead to Elixir being a language that gets the best of both worlds.

Are there plans to develop a dataframe API on top of Nx? If so, have you considered looking to the Tidyverse as inspiration? How might I get involved in helping this along? I'd be incredibly keen to do my data analysis work in Elixir.

José Valim

Mar 2, 2021, 4:22:17 AM
to elix...@googlegroups.com
Hi Chris!

We will definitely tackle something along this area. Our initial goal is to focus on the foundation and port something like TensorFlow's Feature Columns and xarray's Datasets in a way that is compatible with Nx.Tensor and defn. This will allow us to precompile a large chunk of your data preprocessing plus inference to the CPU/GPU. However, at this point the focus will be on numerical data and how to convert non-numerical data to numerical (buckets, hashing, etc). I have created an issue here: https://github.com/elixir-nx/nx/issues/301
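For readers unfamiliar with those two conversions, a rough sketch in Python of what "buckets" and "hashing" usually mean. The function names and boundaries below are hypothetical and chosen for illustration only; this is not the Nx design, just the general idea behind feature columns.

```python
import bisect
import zlib


def bucketize(value, boundaries):
    """Map a number to the index of the bucket it falls in.

    `boundaries` must be sorted ascending; n boundaries give n + 1 buckets.
    A value equal to a boundary goes into the bucket to its right.
    """
    return bisect.bisect_right(boundaries, value)


def hash_bucket(text, num_buckets):
    """Map a string to a bucket index in the range [0, num_buckets).

    crc32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings. Distinct strings may collide.
    """
    return zlib.crc32(text.encode("utf-8")) % num_buckets


# Hypothetical usage: ages into 4 buckets, city names into 16 hash slots.
age_boundaries = [18, 30, 65]
age_ids = [bucketize(age, age_boundaries) for age in (17, 18, 45, 70)]
city_ids = [hash_bucket(city, 16) for city in ("Lisbon", "Kraków", "Lisbon")]
```

Both functions turn arbitrary input into small integer ids, which can then be fed to an embedding or one-hot step as ordinary numerical tensors.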

Once that is done, we can probably start focusing on higher-level constructs and the data-loader aspects. It is still unclear if that will be part of Nx per se or built on top as a separate library but I don't see any issue with implementing the higher-level constructs we find on the dplyr API. :)

Feel free to track the issue above or reach out to me at any time for feedback!


Thomas Browne

Mar 2, 2021, 6:00:39 AM
to elix...@googlegroups.com

I can vouch for the opinion that Pandas is a kludge (though fast), and that inspiration from R is the way to go (vectorised from the ground up). Indeed, Wes McKinney himself is not happy with the Pandas API, which is why he has rebuilt it from the ground up in the form of Apache Arrow, a cross-language format which is gaining a huge amount of traction, including in the R universe (the first implementation was the Wickham/McKinney feather format joint venture).

That said, I would caution against following the Tidyverse model too closely as the base-level abstraction. It's very opinionated: loved by some, disliked by others. Personally I think an industry-standard in-memory format such as Apache Arrow is the correct (lower) level on which to focus at first, and people can build their own higher-level abstractions once this is in place, especially given that Elixir has such wonderful metaprogramming capabilities.

I'm currently investigating how easy it is to parse the Arrow format natively in Erlang.

Christopher Grainger

Mar 2, 2021, 4:58:42 PM
to Numerical Elixir (Nx)
Thanks José! I think that makes a lot of sense. Following that issue now, and I've joined EEF and jumped into the Slack. I'd really like to help on Nx wherever I can.

I'm a huge fan of Arrow and I agree it's a great thing to focus on early. Zero-copy reads alone are a complete game changer for so much in this space. I think Arrow as the low-level in-memory data frame, with a dplyr-inspired API on top to interact with it, makes tons of sense.

I'd be very happy to contribute on getting Arrow natively in Erlang, though admittedly I have very little Erlang experience. That's something that would change the calculus about our ETL processes today.