hi Randall,
On Tue, Jan 29, 2019 at 12:48 PM Randall O'Reilly <rcore...@gmail.com> wrote:
>
> Wes — thanks for the reply!
>
> - Randy
>
> > On Jan 29, 2019, at 8:38 AM, Wes McKinney <wesm...@gmail.com> wrote:
> >
> > hi Randall,
> >
> > On Tue, Jan 29, 2019 at 12:23 AM Randall O'Reilly <rcore...@gmail.com> wrote:
> >>
> >> Having dug into this topic a bit more and reaching the point where I urgently need a data frame-like structure, I think I might need to write my own version (which will be interoperable to the greatest extent possible), but I’d like to just make sure I’m not missing anything before I do. I only really have two major requirements:
> >>
> >> * tensors as the primary columnar data structure. I’m often dealing with 4D and higher data structures as logical ways of organizing neural network inputs and outputs (and when you add the row for the table, just add one more dimension). Having everything use a consistent tensor organization keeps things simple, instead of that being a special case.
> >>
> >
> > This may be stretching the definition of "data frame" at this point.
> > Do you need something more like
> > http://xarray.pydata.org/en/stable/?
>
> Yeah, that looks like it captures some of what I want — in particular their Dataset object, except that I still do want the full heterogeneity of DataFrame (any different data types in same Frame), and there is a privileged common axis called “row” and all tensors have row-major organization and the same size along that outer-most dimension.
>
> I went ahead and implemented it last night, so this is specifically what I want :)
>
> https://github.com/emer/emergent/blob/master/dtable/dtable.go
>
I can take a closer look, but this isn't supported by the Arrow
columnar memory format at the moment. It would be useful to discuss
how your needs could be met in that context.
> >
> >> * Easy, direct mutability. I’m not really sure why everyone loves immutability so much for these things? I get that it greatly simplifies various things but I’m almost always wanting to write the data into these tables in various flexible ways, etc, and purely from within the Go world, it would be so much easier to have a primary tensor type that is backed by a simple slice and you can write to it all you want.. you could send this data off to an arrow view of it, etc, if you want to share in that way or upload to a GPU, etc, but having a pure Go mutable version seems like where you want to start? or at least include in the mix?
> >>
> >
> > Seems like you aren't working with strings or nested data much. In
> > Arrow you can mutate numeric data if you want, but the memory layout
> > for varbinary / utf8 requires rebuilding the structure if you mutate.
> > This is a trade-off so that you are guaranteed data locality /
> > cache-efficiency for analytical operations on string columns (each
> > binary value is next to the previous one in memory)
> >
> > Arrow's main use case is SQL-style analytics or other kinds of 1D
> > columnar operations (e.g. columnar time series databases). So you
> > could use it to build an analytic database like KDB+ or Vertica.
>
> That all makes sense. I’m just using []string as the backing for my etensor.Strings tensor — lots of issues there I’m sure but so much easier for random access / modification, etc.
>
In Arrow we have the mantra of zero copy and straightforward data
movement. When you use []string you're putting all your data in Go's
object heap (I think, I'm not a Go expert), so if you wanted to expose
that data over the wire or through shared memory you'd have to
serialize all that data somewhere and then figure out how to
communicate its structure to the receiver.

In Arrow strings are represented by a couple of contiguous chunks of
memory, as described in the spec. The benefits for analytical
processing justify the trade-offs (you give up easy in-place
mutability, but random access is still O(1)).
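
To make the string layout concrete, here is a rough sketch in plain Go
(not the Arrow Go API). The StringColumn type and its methods are
hypothetical names made up for illustration, and the validity bitmap
that the spec also describes is left out. It shows why reads stay O(1)
while changing one value means rebuilding the buffers:

```go
package main

import "fmt"

// StringColumn is a hypothetical, simplified stand-in for the utf8 layout:
// one contiguous byte buffer plus an offsets slice with Len()+1 entries.
// (The validity bitmap used for nulls is omitted in this sketch.)
type StringColumn struct {
	offsets []int32 // value i occupies data[offsets[i]:offsets[i+1]]
	data    []byte  // all values stored back to back
}

// NewStringColumn packs the given values into the two buffers.
func NewStringColumn(values []string) *StringColumn {
	c := &StringColumn{offsets: make([]int32, 1, len(values)+1)}
	for _, v := range values {
		c.data = append(c.data, v...)
		c.offsets = append(c.offsets, int32(len(c.data)))
	}
	return c
}

// Len returns the number of values in the column.
func (c *StringColumn) Len() int { return len(c.offsets) - 1 }

// Value is O(1): two offset lookups and one slice of the data buffer.
func (c *StringColumn) Value(i int) string {
	return string(c.data[c.offsets[i]:c.offsets[i+1]])
}

// Set returns a new column with value i replaced. Because every later
// value would shift in the data buffer, this sketch simply rebuilds
// both buffers instead of mutating in place.
func (c *StringColumn) Set(i int, v string) *StringColumn {
	values := make([]string, c.Len())
	for j := range values {
		values[j] = c.Value(j)
	}
	values[i] = v
	return NewStringColumn(values)
}

func main() {
	col := NewStringColumn([]string{"a", "bcd", "", "ef"})
	fmt.Println(col.Value(1)) // random access without scanning

	col2 := col.Set(1, "longer") // O(n) rebuild, original is untouched
	fmt.Println(col2.Value(1), col2.Value(3))
}
```

With []string each element is a separately allocated value, so Set
would be a one-line assignment, but the data is no longer a couple of
buffers you can hand to another process as-is.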
> >
> >> As far as I can tell, arrow.Table or arrow.Record does not support tensors? and in general, it seems like tensor is a go-specific add-on that is not supported by the other languages? or is this just a work-in-progress for all languages?
> >>
> >
> > We've discussed embedding tensors in Binary or FixedSizeBinary types,
> > so each cell in a column of a RecordBatch would contain a tensor. Is
> > that what you would need?
>
> that wouldn’t be as efficient as having the column itself be the tensor, e.g., instead of a 1D array it is just n-D and the cell size is the inner n-1 dimensions. Would that be possible?
>
I suggest having a closer look at the Arrow format and seeing if
there's a mapping between the data structure you're describing and the
columnar format. We have a relatively simple algebra of physical
memory layout:
* Fixed bit width primitive
* Variable size binary
* Nested thereof (variable-size list, struct, and union)
Logical types (integers, floating point, strings, timestamps) give
meaning to the physical memory. You can add custom metadata to a
schema to enable an application to interpret some memory as some other
kind of type.
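
As a starting point for that discussion, here is one possible mapping,
sketched in plain Go rather than with the Arrow Go API. It assumes the
tensor column is stored as a single flat, row-major buffer of
fixed-width values, with the per-row cell shape carried as custom
metadata. TensorColumn, its methods, and the "tensor_shape" key are
hypothetical names, not anything defined by Arrow:

```go
package main

import "fmt"

// TensorColumn is a hypothetical column whose outermost dimension is the
// row and whose cells are the inner n-1 dimensions, stored row-major in
// one flat slice (a fixed-width primitive buffer in the terms above).
type TensorColumn struct {
	rows      int
	CellShape []int     // shape of one cell, e.g. [2 2]
	Values    []float64 // rows * product(CellShape) values
}

// cellSize returns the number of values in one cell.
func cellSize(shape []int) int {
	n := 1
	for _, d := range shape {
		n *= d
	}
	return n
}

// NewTensorColumn allocates a zero-filled column.
func NewTensorColumn(rows int, cellShape ...int) *TensorColumn {
	return &TensorColumn{
		rows:      rows,
		CellShape: cellShape,
		Values:    make([]float64, rows*cellSize(cellShape)),
	}
}

// Rows returns the size of the outermost (row) dimension.
func (c *TensorColumn) Rows() int { return c.rows }

// Cell returns a mutable view (no copy) of the values for one row.
func (c *TensorColumn) Cell(row int) []float64 {
	n := cellSize(c.CellShape)
	return c.Values[row*n : (row+1)*n]
}

// Metadata returns the key/value pairs an application could attach to a
// schema field so a reader can reinterpret the flat buffer as a tensor.
// The key name is an assumption made for this sketch.
func (c *TensorColumn) Metadata() map[string]string {
	return map[string]string{"tensor_shape": fmt.Sprint(c.CellShape)}
}

func main() {
	col := NewTensorColumn(3, 2, 2) // 3 rows, each cell is 2x2
	col.Cell(1)[3] = 7.5            // write in place through the view
	fmt.Println(col.Rows(), len(col.Values), col.Metadata())
}
```

Numeric data laid out this way stays directly mutable, which is the
case where in-place writes are already possible; whether the shape
metadata is enough for your use is something to work out against the
spec.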