dframe: proposal for an Apache Arrow based dataframe for Go


Sebastien Binet

Jan 11, 2019, 12:27:15 PM
to gonum-dev
hi there,

I finally took the time to flesh out the dframe proposal a bit more:


Feel free to comment here or over there.

There are still a few things up for discussion or in flux:

- should dframe.Frame provide immutable semantics?
- how are we supposed to handle bigger-than-RAM data sets? (this should be possible with Arrow, of course, but it hasn't been tested, and the API to deal with it -- although probably loosely based off arrow/array.TableReader -- is yet to be devised)
- I have only tested very simple operations: select/drop/copy columns. GroupBy, Aggregate and other Map/Reduce operations need to be sketched out.
- investigate a few more backend integrations (CSV, HDF5, HDFS, NumPy-IO, Parquet, ROOT, ...)

I am particularly interested in your opinion about the dframe.Tx transaction concept, which I introduced to handle atomic modifications (commit or rollback) to a data frame.
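To make the discussion concrete, here is roughly how I envision using it (just a sketch: the exact names and signatures are still in flux):

    // hypothetical dframe.Tx usage (names/signatures not final): mutations
    // are staged inside the transaction and applied atomically on commit;
    // returning a non-nil error from the closure rolls everything back.
    df, err := dframe.FromTable(tbl) // wrap an existing arrow Table
    if err != nil {
            log.Fatal(err)
    }
    defer df.Release()

    err = df.Exec(func(tx *dframe.Tx) error {
            tx.Drop("col1")         // stage: remove a column
            tx.Copy("col2", "col3") // stage: duplicate a column under a new name
            return nil
    })
    if err != nil {
            log.Fatal(err)
    }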

cheers,
-s

Randall O'Reilly

Jan 13, 2019, 2:21:55 AM
to Sebastien Binet, gonum-dev
I’m just starting the process of converting my neural network simulation software (emergent) over to Go / Gonum / Arrow. We made extensive use of a DataTable object in emergent, which is functionally similar to the arrow.Table.

One of the key features we built into the DataTable was an index indirection array that every access goes through (as is often done in databases). This makes operations such as sorting and filtering much faster — you just sort the indexes instead of all the data. We also used it to optimize memory re-use when things were deleted and then added — the indexes serve as a kind of “free pool”. You could flatten the table to restore a sequential memory layout at any point.
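To illustrate the idea, here is a minimal sketch (illustrative Go, not our actual DataTable code):

    // sort a permutation of row indexes, leaving the backing column untouched.
    package main

    import (
            "fmt"
            "sort"
    )

    func main() {
            col := []float64{3, 1, 2} // backing data, never moved
            idx := []int{0, 1, 2}     // indirection array, one entry per row

            // order idx so that col[idx[0]] <= col[idx[1]] <= ...
            sort.Slice(idx, func(i, j int) bool { return col[idx[i]] < col[idx[j]] })

            for _, i := range idx {
                    fmt.Println(col[i]) // prints 1, 2, 3
            }
    }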

There are obviously other complexity and performance costs to maintaining and using the indexes all the time, so it may be better to keep this optional, though some level of direct support would probably be good (e.g., SortByIndex and SelectByIndex methods, or something to that effect, taking an Array arg of indexes).

As far as I can tell, the Arrow Table does not have any direct support for an extra level of index indirection? Have you had any thoughts about this feature? For large datasets it can be a massive speedup for sorting, and thus for operations such as GroupBy (though perhaps a map would be more efficient there — we just used sort.)

WRT GroupBy, do you have any ideas about how to specify the statistics to apply in grouping the data (Mean, StdDev etc)? We had a whole GroupBy “spec” structure where you could specify all the stats to compute for each column, including multiple levels of grouping — it is very efficient to do it all in one pass instead of separately for each column / stat: you get a full table of results in one pass.
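Roughly along these lines (a hypothetical Go rendering, not a worked-out design):

    // hypothetical GroupBy "spec": group on one or more columns and list every
    // statistic to compute per aggregated column, so the full result table can
    // be produced in a single pass over the data.
    type AggSpec struct {
            Col   string   // column to aggregate
            Stats []string // e.g. "Mean", "StdDev", "Min", "Max"
    }

    type GroupBySpec struct {
            GroupCols []string  // grouping columns, possibly multiple levels
            Aggs      []AggSpec // per-column statistics
    }

    var spec = GroupBySpec{
            GroupCols: []string{"condition", "epoch"},
            Aggs: []AggSpec{
                    {Col: "err", Stats: []string{"Mean", "StdDev"}},
                    {Col: "rt", Stats: []string{"Mean"}},
            },
    }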

I’d be happy to help with any of this in any way that might be useful.

- Randy

Sebastien Binet

Jan 14, 2019, 5:09:34 AM
to Ged Wed, gonum-dev
Hi Ged,

On Fri, Jan 11, 2019 at 6:52 PM Ged Wed <ged...@gmail.com> wrote:
> Hey hey
>
> I had a look yesterday.
>
> Immutable data structures look very useful to me. They would make streaming pipelining based off aggregates that produce mutations much easier.
> But I fear it may require code gen?
what do you mean?
there is no language support for enforcing immutable data structures (there isn't even a "const" keyword -- not that I am asking for one for Go2!) so the only way to implement this is to expose non-ptr receiver methods and/or copy everything under the hood before running the operations that may mutate the internal state.
the latter is what I am doing in Tx.
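roughly, the technique looks like this (a contrived sketch, not the actual dframe code):

    // a value receiver plus an explicit copy of the backing storage gives
    // callers immutable semantics without language support.
    type Frame struct {
            cols []string
    }

    // WithCol returns a new Frame with an extra column name; the Frame it
    // is called on is never modified.
    func (f Frame) WithCol(name string) Frame {
            cols := make([]string, len(f.cols), len(f.cols)+1)
            copy(cols, f.cols)
            return Frame{cols: append(cols, name)}
    }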

> Tx is mandatory. Without it you are forced into doing it manually. But the big question is what Arrow underneath offers to facilitate transactions and, of course, linearizability (sometimes called serializability in the context of databases).
serialization in Arrow is documented but it hasn't yet been implemented for Go Arrow (PRs welcome! :P)
AFAIR, there isn't anything in Arrow that really facilitates transactions.
 
> Which leads to you bringing up the other backing stores like HDFS etc. From my point of view, if a developer has one of these other stores (like HDFS), should they just do ETL into Arrow using Go? Then all transaction questions become much simpler.
my take on this is that not everybody needs these tools because not everybody has problems at that scale.
but it's important to be able to scale out if that need shows up: being able to do so very efficiently and at minimal cost is a key selling point for Arrow (zero-copy and all that.)
 

> I'm raising all this to constrain the scope of work to a best-practice approach, and to avoid technically boiling the ocean for the MVP.
> Maybe certain things become easier outside of the scope, after the MVP, because we have more knowledge about the LOE (level of effort).
sure. I only mentioned HDFS and the other backends because I think they are an important feature: interoperability.
for the MVP I don't think we need to implement interop with the whole universe, but we should at least try things out for a couple of them (CSV for sure, and at least one other) to make sure we don't overlook anything important.
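for CSV specifically, Go Arrow already ships a reader we could build on, along these lines (I am quoting the API from memory, so the exact options may differ):

    // stream a CSV file into Arrow record batches; each batch could then be
    // wrapped into a dframe.Frame.
    package csvutil

    import (
            "os"

            "github.com/apache/arrow/go/arrow"
            "github.com/apache/arrow/go/arrow/csv"
    )

    func readCSV(path string, schema *arrow.Schema) error {
            f, err := os.Open(path)
            if err != nil {
                    return err
            }
            defer f.Close()

            r := csv.NewReader(f, schema, csv.WithChunk(64)) // 64 rows per record
            defer r.Release()

            for r.Next() {
                    rec := r.Record() // the current record batch
                    _ = rec           // hand it over to the frame...
            }
            return r.Err()
    }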

cheers,
-s

Sebastien Binet

Jan 14, 2019, 5:44:12 AM
to Randall O'Reilly, gonum-dev
Hi Randy,

On Sun, Jan 13, 2019 at 8:21 AM Randall O'Reilly <rcore...@gmail.com> wrote:
> I’m just starting the process of converting my neural network simulation software (emergent) over to Go / Gonum / Arrow. We made extensive use of a DataTable object in emergent, which is functionally similar to the arrow.Table.
>
> One of the key features we built into the DataTable was an index indirection array that every access goes through (as is often done in databases). This makes operations such as sorting and filtering much faster — you just sort the indexes instead of all the data. We also used it to optimize memory re-use when things were deleted and then added — the indexes serve as a kind of “free pool”. You could flatten the table to restore a sequential memory layout at any point.
>
> There are obviously other complexity and performance costs to maintaining and using the indexes all the time, so it may be better to keep this optional, though some level of direct support would probably be good (e.g., SortByIndex and SelectByIndex methods, or something to that effect, taking an Array arg of indexes).
>
> As far as I can tell, the Arrow Table does not have any direct support for an extra level of index indirection?
correct.
this is something `dframe` would have to implement.

> Have you had any thoughts about this feature? For large datasets it can be a massive speedup for sorting, and thus for operations such as GroupBy (though perhaps a map would be more efficient there — we just used sort.)
I haven't yet.
I know QFrame does have something like this.
hopefully Tobias (the main author of qframe) has some insights about his qframe/internal/index package.


> WRT GroupBy, do you have any ideas about how to specify the statistics to apply in grouping the data (Mean, StdDev etc)? We had a whole GroupBy “spec” structure where you could specify all the stats to compute for each column, including multiple levels of grouping — it is very efficient to do it all in one pass instead of separately for each column / stat: you get a full table of results in one pass.
I haven't yet spent much time on this, but I liked what qframe did in that respect (even if, IIRC, it didn't allow for the kind of optimization you're talking about.)
thanks for bringing that up.

-s

Wes McKinney

Jan 14, 2019, 12:00:54 PM
to Sebastien Binet, Randall O'Reilly, gonum-dev
Data frames in Go sounds great to me.

One item: I would encourage you all to do as much of the low-level
algorithmic work as possible in the Apache Arrow project itself. This would mean
implementations of common analytic functions against the Arrow memory
format. We've begun doing this in C++, for example, and we also have
an LLVM-based expression compiler (Gandiva) for runtime generation of
function kernels.

The reason for this is that you will be defining opinionated semantics
about how data frames work (e.g. immutability, laziness, etc.) in your
project, so if you can avoid forcing developers to accept your
semantics in order to, say, compute the average of an Arrow array,
that is a good thing. This can permit the development of different
kinds of Arrow-based analytic tools in Go that may have different
semantics.

- Wes

Ian Davis

Jan 14, 2019, 12:37:23 PM
to gonu...@googlegroups.com
On Fri, 11 Jan 2019, at 5:27 PM, Sebastien Binet wrote:
> hi there,
>
> I finally took the time to flesh out the dframe proposal a bit more:
>

Thanks for writing this useful proposal.

I have a port of R's data.table that may be of interest. I doubt it is suitable for dframe, but it may at least inform some of the discussion.


All the best,

Ian

Sebastien Binet

Jan 14, 2019, 1:40:39 PM
to Wes McKinney, Randall O'Reilly, gonum-dev
Wes,



On Mon, Jan 14, 2019, 18:00 Wes McKinney <wesm...@gmail.com> wrote:
> Data frames in Go sounds great to me.
>
> One item: I would encourage you all to do as much of the low-level
> algorithmic work as possible in the Apache Arrow project itself. This would mean
> implementations of common analytic functions against the Arrow memory
> format. We've begun doing this in C++, for example, and we also have
> an LLVM-based expression compiler (Gandiva) for runtime generation of
> function kernels.
>
> The reason for this is that you will be defining opinionated semantics
> about how data frames work (e.g. immutability, laziness, etc.) in your
> project, so if you can avoid forcing developers to accept your
> semantics in order to, say, compute the average of an Arrow array,
> that is a good thing. This can permit the development of different
> kinds of Arrow-based analytic tools in Go that may have different
> semantics.

Yes.
Right now, Go Arrow only implements the 'Sum' "kernel" on float64 and uint64 arrays (IIRC).
Implementing the low-level bits in Go Arrow proper makes sense to me as well.
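e.g., something like this (again from memory, so take the exact package layout with a grain of salt):

    // using the existing Sum kernel from Go Arrow's math package.
    package main

    import (
            "fmt"

            "github.com/apache/arrow/go/arrow/array"
            "github.com/apache/arrow/go/arrow/math"
            "github.com/apache/arrow/go/arrow/memory"
    )

    func main() {
            bld := array.NewFloat64Builder(memory.NewGoAllocator())
            defer bld.Release()
            bld.AppendValues([]float64{1, 2, 3, 4}, nil)

            arr := bld.NewFloat64Array()
            defer arr.Release()

            fmt.Println(math.Float64.Sum(arr)) // 10
    }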

I'll mention this explicitly.

-s

Sebastien Binet

Jan 14, 2019, 1:41:25 PM
to Ian Davis, gonum-dev
Ian,

Thanks, I'll have a look :)

-s




Wes McKinney

Jan 14, 2019, 2:04:07 PM
to Sebastien Binet, Ian Davis, gonum-dev
Given its onerous licensing (MPL 2.0), I would recommend not looking too
closely at the R data.table implementation.

Randall O'Reilly

Jan 29, 2019, 1:23:39 AM
to Sebastien Binet, gonum-dev
Having dug into this topic a bit more, and having reached the point where I urgently need a data-frame-like structure, I think I might need to write my own version (which will be interoperable to the greatest extent possible), but I’d like to make sure I’m not missing anything before I do. I only really have two major requirements:

* tensors as the primary columnar data structure. I’m often dealing with 4D and higher data structures as logical ways of organizing neural network inputs and outputs (and adding the table’s row dimension just adds one more dimension). Having everything use a consistent tensor organization keeps things simple, instead of tensors being a special case.

* Easy, direct mutability. I’m not really sure why everyone loves immutability so much for these things? I get that it greatly simplifies various things, but I’m almost always wanting to write data into these tables in various flexible ways, and, purely from within the Go world, it would be so much easier to have a primary tensor type that is backed by a simple slice and that you can write to all you want. You could send this data off to an arrow view of it if you want to share it that way, or upload it to a GPU, etc., but having a pure-Go mutable version seems like where you want to start? or at least include in the mix?
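Concretely, all I really need is something along these lines (a stripped-down sketch of the idea, not etensor’s actual code):

    // a slice-backed, row-major tensor: the row is the outer-most dimension
    // and indexing is plain stride arithmetic on one flat slice.
    type Float64Tensor struct {
            Shape  []int     // e.g. [rows, 4, 5] for a column of 4x5 cells
            Values []float64 // flat backing store; len == product of Shape
    }

    func (t *Float64Tensor) offset(idx []int) int {
            off := 0
            for i, ix := range idx {
                    off = off*t.Shape[i] + ix
            }
            return off
    }

    func (t *Float64Tensor) At(idx ...int) float64     { return t.Values[t.offset(idx)] }
    func (t *Float64Tensor) Set(v float64, idx ...int) { t.Values[t.offset(idx)] = v }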

As far as I can tell, arrow.Table and arrow.Record do not support tensors? And in general, it seems like the tensor is a Go-specific add-on that is not supported by the other languages? Or is this just a work-in-progress for all languages?

And qframe doesn’t support mutability, nor tensors?

FWIW: my very recent implementation of a simple pure-Go tensor with mutability, based on the arrow API (and code generator), is here: https://github.com/emer/emergent/tree/master/etensor

I also looked at gorgonia/tensor — it didn’t seem like a simple code-generated slice-backed alternative was considered there? Anyway, I liked the idea of keeping things as pure-Go as possible.

Cheers,
- Randy

Wes McKinney

Jan 29, 2019, 10:38:53 AM
to Randall O'Reilly, Sebastien Binet, gonum-dev
hi Randall,

On Tue, Jan 29, 2019 at 12:23 AM Randall O'Reilly <rcore...@gmail.com> wrote:
>
> Having dug into this topic a bit more, and having reached the point where I urgently need a data-frame-like structure, I think I might need to write my own version (which will be interoperable to the greatest extent possible), but I’d like to make sure I’m not missing anything before I do. I only really have two major requirements:
>
> * tensors as the primary columnar data structure. I’m often dealing with 4D and higher data structures as logical ways of organizing neural network inputs and outputs (and adding the table’s row dimension just adds one more dimension). Having everything use a consistent tensor organization keeps things simple, instead of tensors being a special case.
>

This may be stretching the definition of "data frame" at this point.
Do you need something more like http://xarray.pydata.org/en/stable/?

> * Easy, direct mutability. I’m not really sure why everyone loves immutability so much for these things? I get that it greatly simplifies various things, but I’m almost always wanting to write data into these tables in various flexible ways, and, purely from within the Go world, it would be so much easier to have a primary tensor type that is backed by a simple slice and that you can write to all you want. You could send this data off to an arrow view of it if you want to share it that way, or upload it to a GPU, etc., but having a pure-Go mutable version seems like where you want to start? or at least include in the mix?
>

Seems like you aren't working with strings or nested data much. In
Arrow you can mutate numeric data if you want, but the memory layout
for varbinary / utf8 requires rebuilding the structure if you mutate.
This is a trade-off so that you are guaranteed data locality /
cache-efficiency for analytical operations on string columns (each
binary value is next to the previous one in memory).
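To make the layout concrete, a toy sketch using the Go builder (the
buffer contents are written out by hand):

    // an Arrow string column is two contiguous buffers, offsets and values.
    // appending "ab", "c", "def" produces:
    //   offsets: [0, 2, 3, 6]
    //   values:  "abcdef"
    // element i is values[offsets[i]:offsets[i+1]]: random access stays O(1),
    // but growing element i in place means rewriting everything after it.
    bld := array.NewStringBuilder(memory.NewGoAllocator())
    defer bld.Release()
    bld.AppendValues([]string{"ab", "c", "def"}, nil)

    arr := bld.NewStringArray()
    defer arr.Release()
    fmt.Println(arr.Value(1)) // "c"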

Arrow's main use case is SQL-style analytics or other kinds of 1D
columnar operations (e.g. columnar time series databases). So you
could use it to build an analytic database like KDB+ or Vertica.

> As far as I can tell, arrow.Table and arrow.Record do not support tensors? And in general, it seems like the tensor is a Go-specific add-on that is not supported by the other languages? Or is this just a work-in-progress for all languages?
>

We've discussed embedding tensors in Binary or FixedSizeBinary types,
so each cell in a column of a RecordBatch would contain a tensor. Is
that what you would need?

Randall O'Reilly

Jan 29, 2019, 1:48:29 PM
to Wes McKinney, Sebastien Binet, gonum-dev
Wes — thanks for the reply!

- Randy

> On Jan 29, 2019, at 8:38 AM, Wes McKinney <wesm...@gmail.com> wrote:
>
> hi Randall,
>
> On Tue, Jan 29, 2019 at 12:23 AM Randall O'Reilly <rcore...@gmail.com> wrote:
>>
>> Having dug into this topic a bit more, and having reached the point where I urgently need a data-frame-like structure, I think I might need to write my own version (which will be interoperable to the greatest extent possible), but I’d like to make sure I’m not missing anything before I do. I only really have two major requirements:
>>
>> * tensors as the primary columnar data structure. I’m often dealing with 4D and higher data structures as logical ways of organizing neural network inputs and outputs (and adding the table’s row dimension just adds one more dimension). Having everything use a consistent tensor organization keeps things simple, instead of tensors being a special case.
>>
>
> This may be stretching the definition of "data frame" at this point.
> Do you need something more like http://xarray.pydata.org/en/stable/?

Yeah, that looks like it captures some of what I want — in particular their Dataset object. Except that I still want the full heterogeneity of a DataFrame (different data types in the same Frame), and there is a privileged common axis called “row”: all tensors have row-major organization and the same size along that outer-most dimension.

I went ahead and implemented it last night, so this is specifically what I want :)
https://github.com/emer/emergent/blob/master/dtable/dtable.go

>
>> * Easy, direct mutability. I’m not really sure why everyone loves immutability so much for these things? I get that it greatly simplifies various things, but I’m almost always wanting to write data into these tables in various flexible ways, and, purely from within the Go world, it would be so much easier to have a primary tensor type that is backed by a simple slice and that you can write to all you want. You could send this data off to an arrow view of it if you want to share it that way, or upload it to a GPU, etc., but having a pure-Go mutable version seems like where you want to start? or at least include in the mix?
>>
>
> Seems like you aren't working with strings or nested data much. In
> Arrow you can mutate numeric data if you want, but the memory layout
> for varbinary / utf8 requires rebuilding the structure if you mutate.
> This is a trade-off so that you are guaranteed data locality /
> cache-efficiency for analytical operations on string columns (each
> binary value is next to the previous one in memory).
>
> Arrow's main use case is SQL-style analytics or other kinds of 1D
> columnar operations (e.g. columnar time series databases). So you
> could use it to build an analytic database like KDB+ or Vertica.

That all makes sense. I’m just using []string as the backing for my etensor.Strings tensor — lots of issues there, I’m sure, but so much easier for random access / modification, etc.

>
>> As far as I can tell, arrow.Table and arrow.Record do not support tensors? And in general, it seems like the tensor is a Go-specific add-on that is not supported by the other languages? Or is this just a work-in-progress for all languages?
>>
>
> We've discussed embedding tensors in Binary or FixedSizeBinary types,
> so each cell in a column of a RecordBatch would contain a tensor. Is
> that what you would need?

That wouldn’t be as efficient as having the column itself be the tensor: e.g., instead of a 1D array it is just n-D, and the cell size is the inner n-1 dimensions. Would that be possible?

Wes McKinney

Jan 31, 2019, 10:32:50 PM
to Randall O'Reilly, Sebastien Binet, gonum-dev
hi Randall,

On Tue, Jan 29, 2019 at 12:48 PM Randall O'Reilly <rcore...@gmail.com> wrote:
>
> Wes — thanks for the reply!
>
> - Randy
>
> > On Jan 29, 2019, at 8:38 AM, Wes McKinney <wesm...@gmail.com> wrote:
> >
> > hi Randall,
> >
> > On Tue, Jan 29, 2019 at 12:23 AM Randall O'Reilly <rcore...@gmail.com> wrote:
> >>
> >> Having dug into this topic a bit more, and having reached the point where I urgently need a data-frame-like structure, I think I might need to write my own version (which will be interoperable to the greatest extent possible), but I’d like to make sure I’m not missing anything before I do. I only really have two major requirements:
> >>
> >> * tensors as the primary columnar data structure. I’m often dealing with 4D and higher data structures as logical ways of organizing neural network inputs and outputs (and adding the table’s row dimension just adds one more dimension). Having everything use a consistent tensor organization keeps things simple, instead of tensors being a special case.
> >>
> >
> > This may be stretching the definition of "data frame" at this point.
> > Do you need something more like http://xarray.pydata.org/en/stable/?
>
> Yeah, that looks like it captures some of what I want — in particular their Dataset object. Except that I still want the full heterogeneity of a DataFrame (different data types in the same Frame), and there is a privileged common axis called “row”: all tensors have row-major organization and the same size along that outer-most dimension.
>
> I went ahead and implemented it last night, so this is specifically what I want :)
> https://github.com/emer/emergent/blob/master/dtable/dtable.go
>

I can take a closer look, but this isn't supported by the Arrow
columnar memory format at the moment. It would be useful to discuss
how your needs could be met in that context.

> >
> >> * Easy, direct mutability. I’m not really sure why everyone loves immutability so much for these things? I get that it greatly simplifies various things, but I’m almost always wanting to write data into these tables in various flexible ways, and, purely from within the Go world, it would be so much easier to have a primary tensor type that is backed by a simple slice and that you can write to all you want. You could send this data off to an arrow view of it if you want to share it that way, or upload it to a GPU, etc., but having a pure-Go mutable version seems like where you want to start? or at least include in the mix?
> >>
> >
> > Seems like you aren't working with strings or nested data much. In
> > Arrow you can mutate numeric data if you want, but the memory layout
> > for varbinary / utf8 requires rebuilding the structure if you mutate.
> > This is a trade-off so that you are guaranteed data locality /
> > cache-efficiency for analytical operations on string columns (each
> > binary value is next to the previous one in memory).
> >
> > Arrow's main use case is SQL-style analytics or other kinds of 1D
> > columnar operations (e.g. columnar time series databases). So you
> > could use it to build an analytic database like KDB+ or Vertica.
>
> That all makes sense. I’m just using []string as the backing for my etensor.Strings tensor — lots of issues there, I’m sure, but so much easier for random access / modification, etc.
>

In Arrow we have the mantra of zero-copy and straightforward data
movement. When you use []string you're putting all your data in Go's
object heap (I think; I'm not a Go expert), so if you wanted to expose
that data over the wire or through shared memory you'd have to
serialize all that data somewhere and then figure out how to
communicate its structure to the receiver.

In Arrow, strings are represented by a couple of contiguous chunks of
memory, as described in the spec. The benefits for analytical
processing justify the trade-offs (you lose cheap in-place mutability,
but random access is still O(1)).

> >
> >> As far as I can tell, arrow.Table and arrow.Record do not support tensors? And in general, it seems like the tensor is a Go-specific add-on that is not supported by the other languages? Or is this just a work-in-progress for all languages?
> >>
> >
> > We've discussed embedding tensors in Binary or FixedSizeBinary types,
> > so each cell in a column of a RecordBatch would contain a tensor. Is
> > that what you would need?
>
> That wouldn’t be as efficient as having the column itself be the tensor: e.g., instead of a 1D array it is just n-D, and the cell size is the inner n-1 dimensions. Would that be possible?
>

I suggest having a closer look at the Arrow format and seeing if
there's a mapping between the data structure you're describing and the
columnar format. We have a relatively simple algebra of physical
memory layouts:

* Fixed bit width primitive
* Variable size binary
* Nested thereof (variable-size list, struct, and union)

Logical types (integers, floating point, strings, timestamps) give
meaning to the physical memory. You can add custom metadata to a
schema to enable an application to interpret some memory as some other
kind of type.
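For instance (a sketch of one possible mapping, not a committed
design), a column where every row holds a fixed-shape 4x5 float32
tensor could be expressed in this algebra as nested fixed-size lists,
with the tensor shape carried in field metadata:

    column type : FixedSizeList<4>[ FixedSizeList<5>[ float32 ] ]
    metadata    : {"tensor_shape": "4,5", "order": "row-major"}

The values buffer is then one contiguous run of rows * 4 * 5 float32
values, which is essentially the "column itself is the tensor" layout
you describe, while the schema stays within the standard Arrow type
system.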