i/o tools and data-type layout descriptors


Seb Binet

Mar 4, 2016, 3:01:06 PM
to gonu...@googlegroups.com
hi,

I went down a bit of a rabbit hole today implementing r/w access to numpy
data files.
this was actually prompted by my work on the persistence of
mat64.Dense and mat64.Vector, as well as the PR for having mat64.Dense
support TextMarshaler.

https://github.com/gonum/matrix/pull/341
https://groups.google.com/forum/#!searchin/gonum-dev/numcsv/gonum-dev/gzR7UFXRrUI/FkP0pIRC6WEJ

these links show there is a fair amount of interest, and a number of
packages implementing I/O for various formats (CSV, netCDF, HDF5, npy,
to name a few).

it occurred to me that, for good cross-package pollination,
interoperability and re-use, i.e. to be able to efficiently read/write data
into various types ([]mydata, []float64, mat64.Dense, mat64.Vector,
github.com/Kunde21/numgo.Array64, MyMatrixDuJour, ...), one should:

- have a way to describe the n-dim data array elements' type
- have a way to describe the rank of the ndim-array
- have a way to describe the row-major/col-major order of the 1d
linearized array-data

fortunately, the python/numpy folks have already dealt with that issue:
- it's the numpy.dtype data-type descriptor class [0]
- combined with the buffer protocol/interface [1], which allows seamless
access to the (possibly n-dim) data from a python interpreter (and even
from C or other languages).

eons ago, I started to tackle this issue of data-type descriptors [2].
it's nowhere near finished, but it's a start.

I believe gonum should provide at least a data-type descriptor type
(or interface) which could then be used in all the I/O packages
mentioned above, in other data or numeric packages, as well as in
serialization/deserialization to/from databases.
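
to make this a bit more concrete, here is a minimal strawman of what such
a descriptor could look like (the package, names and methods below are
invented for illustration, not a worked-out proposal):

package dtype

import "reflect"

// Order describes how the 1d linearized array-data is laid out.
type Order int

const (
	RowMajor Order = iota // C-like
	ColMajor              // Fortran-like
)

// Descriptor describes an n-dim array: element type, rank and layout.
type Descriptor interface {
	Elem() reflect.Type // type of the array elements
	Rank() int          // number of dimensions
	Shape() []int       // extent of each dimension
	Order() Order       // row-major or col-major linearization
}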

what do you think?

-s

[0] http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
[1] https://docs.python.org/3/c-api/buffer.html
[2] https://github.com/go-hep/dtypes

https://godoc.org/bitbucket.org/ctessum/cdf
https://github.com/btracey/numcsv
https://github.com/sbinet/npyio
https://github.com/sbinet/go-hdf5
https://github.com/go-hep/csvutil

https://github.com/Kunde21/numgo

Seb Binet

Mar 11, 2016, 1:24:44 PM
to Kunde21, gonum-dev
(resending from the correct address. apologies for the noise)

On Fri, Mar 11, 2016 at 7:20 PM, Sebastien Binet <bi...@cern.ch> wrote:
> On Fri, Mar 11, 2016 at 5:01 AM, Kunde21 <kun...@gmail.com> wrote:
>> Data Science and Scipy community has already come together around the numpy
>> formatting decisions. R and Julia have built libraries to work with the npy
>> files, as well. So, it just makes sense to work with what's already in
>> place.
> yes. at least, if we come up with a better solution (for some
> definition of "better"), we should be able to easily and efficiently
> inter-operate with the already existing and thriving ecosystem.
>
>
>> On that point, would it make sense to just make npyio an extended wrapper
>> around the io.Reader object?
>
> it already is, kinda.
> (or maybe I am not completely getting your point?)
>
>> API idea (may or may not be any good):
>>
>> Rank() int64
>> Shape([]int64) []int64 //
>> Write(interface{}) (int, error)
>> Err() error
>>
>> Datatype(type interface{}) bool
>>
>> Float64() []float64
>> Float32() []float32
>> Uint64() []uint64
>
> I am not sure I completely get the intended usage of this API.
>
> 1) Rank() int64.
> ok. not sure if we need an int64. surely int is enough. (famous last words?)
>
> 2) Shape([]int64) []int64
> I get the intended return value usage (but, here again, I would just use []int)
> but what is the input slice argument needed for?
>
> 3) Write(interface{}) (int, error)
> so a reader is also a writer?
> is it to support data file updates?
>
> 4) Err() error
> ok.
>
> 5) Datatype(typ interface{}) bool
> what is this meant to do?
> is it to ask npyio if the given typ type is compatible with on-disk data type?
> if so, having a proper DataType interface as I was hinting in my first
> mail would, IMHO, be a better avenue.
>
> and all the "Foo() []Foo" methods wouldn't be needed.

Kunde21

Mar 11, 2016, 2:45:36 PM
to gonum-dev, kun...@gmail.com
>> On that point, would it make sense to just make npyio an extended wrapper
>> around the io.Reader object?
>
> it already is, kinda.
> (or maybe I am not completely getting your point?)

By this I mean we would interact with the npyio object by reading and writing buffers rather than passing slice/mat64.Dense/numgo.Array64 objects into npyio and requiring object-specific read/write logic within the npyio library.  I haven't tackled slicing within numgo, but that could cause the numgo object to read the same buffer (float64 slice) in a different way.  I'd rather force that logic within numgo than heap that on npyio.
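
A sketch of what I mean, assuming a hypothetical buffer-oriented npyio API (the Shape and Read methods below are made-up names; mat64.NewDense is the existing constructor):

shape := f.Shape() // e.g. []int{3, 4} for a 3x4 array
buf := make([]float64, shape[0]*shape[1])
if _, err := f.Read(buf); err != nil { // hypothetical flat read into buf
	log.Fatal(err)
}
// the consumer (mat64, numgo, ...), not npyio, owns the layout logic:
m := mat64.NewDense(shape[0], shape[1], buf)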
 
>
>> API idea (may or may not be any good):
>>
>> Rank() int64
>> Shape([]int64) []int64  //
>> Write(interface{}) (int, error)
>> Err() error
>>
>> Datatype(type interface{}) bool
>>
>> Float64() []float64
>> Float32() []float32
>> Uint64() []uint64
>
> I am not sure I completely get the intended usage of this API.
>
> 1) Rank() int64.
> ok. not sure if we need an int64. surely int is enough. (famous last words?)
True.  If the rank overflows an int32, I don't know that a machine could even process the data.

>
> 2) Shape([]int64) []int64
> I get the intended return value usage (but, here again, I would just use []int)
> but what is the input slice argument needed for?
My thought was to use `Shape(nil)` as a read and `Shape([]int64{a,b,c})` as the write call with nil return on error.  
 
>
> 3) Write(interface{}) (int, error)
> so a reader is also a writer?
> is it to support data file updates?
 
In my mind, this would replace the data buffer in the file.  There would be a requirement to write the new shape before writing the buffer to the file (thus, returning an error if the received buffer is the wrong size).  This is where I'm not certain about the writer API, ensuring the shape and buffer match before the file is closed.
 
>
> 4) Err() error
> ok.
>
> 5) Datatype(typ interface{}) bool
> what is this meant to do?
> is it to ask npyio if the given typ type is compatible with on-disk data type?
> if so, having a proper DataType interface as I was hinting in my first
> mail would, IMHO, be a better avenue.  
> and all the "Foo() []Foo" methods wouldn't be needed. 
Yeah, I was trying to solve the same problem twice.  As long as the reader is just reading into the buffer, there's no reason to add that complexity. i.e. []float64 for mat64.Dense/[]float64/numgo.Array64 rather than create or read into the whole object.  I got too hung up thinking about complex buffers from the numpy side, but they should just be struct slices.  
 
>
> what do you think?
>
> -s

With your suggestions, a better API idea might look more like:

Rank() int
Len
Shape([]int) []int            // Should this force []int64 or []int32 type?  The npy file spec isn't very clear, using the hardware int type to read and write the shape buffer.
Read [ReadBuffer?] (buf interface{})  int            // Will this read in chunks (read pointer and Seek functionality) or is it an all-or-nothing read?
Write [WriteBuffer?] (buf interface{})  int            // Will this write in chunks (read pointer + Seek) a la Read call question? 
Append [AppendBuffer?] (buf interface{}) int 

Open(fname string) npyio, error
Close() error
Flush() error
Err() error

Kunde21

Mar 11, 2016, 2:50:32 PM
to gonum-dev, kun...@gmail.com
Also, has there been any communication with the tabula library?  There are probably some ideas and/or use cases there, too.

Seb Binet

Mar 14, 2016, 10:44:28 AM
to Kunde21, gonum-dev
On Fri, Mar 11, 2016 at 8:45 PM, Kunde21 <kun...@gmail.com> wrote:
>> >> On that point, would it make sense to just make npyio an extended
>> >> wrapper
>> >> around the io.Reader object?
>> >
>> > it already is, kinda.
>> > (or maybe I am not completely getting your point?)
>
>
> By this I mean we would interact with the npyio object by reading and
> writing buffers rather than passing slice/mat64.Dense/numgo.Array64 objects
> into npyio and requiring object-specific read/write logic within the npyio
> library. I haven't tackled slicing within numgo, but that could cause the
> numgo object to read the same buffer (float64 slice) in a different way.
> I'd rather force that logic within numgo than heap that on npyio.

one could indeed imagine having something like that:

func (r *Reader) Reader() io.Reader { /* ... */ }

which would be very similar to the archive/zip.Reader API.

or, departing a bit from zip.Reader:

func (r *Reader) Bytes() []byte { ... }

but I believe having people implement an interface would be better and
more inter-operable.

>>
>> >
>> >> API idea (may or may not be any good):
>> >>
>> >> Rank() int64
>> >> Shape([]int64) []int64 //
>> >> Write(interface{}) (int, error)
>> >> Err() error
>> >>
>> >> Datatype(type interface{}) bool
>> >>
>> >> Float64() []float64
>> >> Float32() []float32
>> >> Uint64() []uint64
>> >
>> > I am not sure I completely get the intended usage of this API.
>> >
>> > 1) Rank() int64.
>> > ok. not sure if we need an int64. surely int is enough. (famous last
>> > words?)
>
> True. If the rank overflows an int32, I don't know that a machine could
> even process the data.
>
>> >
>> > 2) Shape([]int64) []int64
>> > I get the intended return value usage (but, here again, I would just use
>> > []int)
>> > but what is the input slice argument needed for?
>
> My thought was to use `Shape(nil)` as a read and `Shape([]int64{a,b,c})` as
> the write call with nil return on error.
that kind of multi-modal API is very error-prone and hard to debug.
I'd rather stay away from it.
just have:
type Shaper interface {
	Shape() []int
}

and if setting is also needed:

type ReShaper interface {
	ReShape([]int) // or SetShape([]int) ?
}
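
for illustration, a hypothetical dense type would satisfy Shaper with
something like:

type Dense struct {
	rows, cols int
	data       []float64
}

func (d *Dense) Shape() []int { return []int{d.rows, d.cols} }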

>
>>
>> >
>> > 3) Write(interface{}) (int, error)
>> > so a reader is also a writer?
>> > is it to support data file updates?
>
>
> In my mind, this would replace the data buffer in the file. There would be
> a requirement to write the new shape before writing the buffer to the file
> (thus, returning an error if the received buffer is the wrong size). This
> is where I'm not certain about the writer API, ensuring the shape and buffer
> match before the file is closed.

that's why npyio.Write is done in one go.
alternatively, if (e.g.) a dtype-based interface is devised, it could be used.

I suppose it depends on the use case, but I personally don't modify
.npy files; they are mostly write-once, read-many.
if updates were felt to be that important, I would probably err towards
providing a dedicated npyio.Update(w, ptr) function.
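
(for reference, the current one-shot usage is along these lines, error
handling elided:)

f, err := os.Create("out.npy")
// ...
err = npyio.Write(f, []float64{0, 1, 2, 3, 4, 5}) // write once

g, err := os.Open("out.npy")
// ...
var data []float64
err = npyio.Read(g, &data) // read many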

>
>>
>> >
>> > 4) Err() error
>> > ok.
>> >
>> > 5) Datatype(typ interface{}) bool
>> > what is this meant to do?
>> > is it to ask npyio if the given typ type is compatible with on-disk data
>> > type?
>> > if so, having a proper DataType interface as I was hinting in my first
>> > mail would, IMHO, be a better avenue.
>>
>> > and all the "Foo() []Foo" methods wouldn't be needed.
>
> Yeah, I was trying to solve the same problem twice. As long as the reader
> is just reading into the buffer, there's no reason to add that complexity.
> i.e. []float64 for mat64.Dense/[]float64/numgo.Array64 rather than create or
> read into the whole object. I got too hung up thinking about complex
> buffers from the numpy side, but they should just be struct slices.

exactly, they should be []UserStruct.
at least, that's my mental model.

>>
>> >
>> > what do you think?
>> >
>> > -s
>
>
> With your suggestions, a better API idea might look more like:
>
> Rank() int
> Len
> Shape([]int) []int // Should this force []int64 or []int32 type?
> The npy file spec isn't very clear, using the hardware int type to read and
> write the shape buffer.
> Read [ReadBuffer?] (buf interface{}) int // Will this read in
> chunks (read pointer and Seek functionality) or is it an all-or-nothing
> read?
> Write [WriteBuffer?] (buf interface{}) int // Will this write in
> chunks (read pointer + Seek) a la Read call question?
> Append [AppendBuffer?] (buf interface{}) int
>
> Open(fname string) npyio, error
> Close() error
> Flush() error
> Err() error

I don't know.
this seems to conflate an API for npyio and one for describing n-dim data.

I'd like to try to disentangle the two.

- have, say, a numpy-like dtype type (interface or otherwise) to
describe (possibly n-dim) data
- have a way, leveraging this dtype, to seamlessly handle user types
(matrix, n-dim arrays, ... user types) inside npyio (but also other
gonum- and non-gonum-related formats).

dtype could look like:

package dtype

type Type interface {
	Kind() Kind   // reflect.Kind + ndim-array-kind
	Rank() int    // 0: scalar, 1: slice/array, ... (perhaps not needed, as the same info can be obtained from len(Shape()))
	Shape() []int

	// more stuff ?
	// basically reflect.Type without the Func/Method/Interface/Chan support ?
}

then npyio could expose a Marshaler/Unmarshaler pair of interfaces:
https://github.com/sbinet/npyio/issues/1
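
(a strawman of what that pair could look like; the method names and
signatures below are invented, see the issue for the actual discussion:)

type Marshaler interface {
	MarshalNumpy() (dt dtype.Type, data []byte, err error)
}

type Unmarshaler interface {
	UnmarshalNumpy(dt dtype.Type, data []byte) error
}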

> Also, has there been any communication with tabula library? There's probably some ideas and/or use cases there, too.
not from/with me.

(to be honest, I am not completely sure it is possible to have a
convenient *and* optimized/efficient pandas-like general library in
Go-1.
perhaps with Go-2, if Go-2 gets an n-dim array builtin or generics.)

-s

Kunde21

Mar 21, 2016, 6:53:17 PM
to gonum-dev


On Monday, March 14, 2016 at 7:44:28 AM UTC-7, Sebastien Binet wrote:
> On Fri, Mar 11, 2016 at 8:45 PM, Kunde21 <kun...@gmail.com> wrote:
> >> >> On that point, would it make sense to just make npyio an extended
> >> >> wrapper
> >> >> around the io.Reader object?
> >> >
> >> > it already is, kinda.
> >> > (or maybe I am not completely getting your point?)
> >
> >
> > By this I mean we would interact with the npyio object by reading and
> > writing buffers rather than passing slice/mat64.Dense/numgo.Array64 objects
> > into npyio and requiring object-specific read/write logic within the npyio
> > library.  I haven't tackled slicing within numgo, but that could cause the
> > numgo object to read the same buffer (float64 slice) in a different way.
> > I'd rather force that logic within numgo than heap that on npyio.
>
> one could indeed imagine having something like that:
>
> func (r *Reader) Reader() io.Reader { /* ... */ }
>
> which would be very similar to the archive/zip.Reader API.
>
> or, departing a bit from zip.Reader:
>
> func (r *Reader) Bytes() []byte { ... }
>
> but I believe having people implement an interface would be better and
> more inter-operable.
>
> >> > 3) Write(interface{}) (int, error)
> >> > so a reader is also a writer?
> >> > is it to support data file updates?
> >
> >
> > In my mind, this would replace the data buffer in the file.  There would be
> > a requirement to write the new shape before writing the buffer to the file
> > (thus, returning an error if the received buffer is the wrong size).  This
> > is where I'm not certain about the writer API, ensuring the shape and buffer
> > match before the file is closed.
>
> that's why npyio.Write is done in one go.
> alternatively, if (e.g.) a dtype-based interface is devised, it could be used.
>
> I suppose it depends on the use case, but I personally don't modify
> .npy files; they are mostly write-once, read-many.
> if updates were felt to be that important, I would probably err towards
> providing a dedicated npyio.Update(w, ptr) function.
>
> >
> > With your suggestions, a better API idea might look more like:
> >
> > Rank() int
> > Len
> > Shape([]int) []int            // Should this force []int64 or []int32 type?
> > The npy file spec isn't very clear, using the hardware int type to read and
> > write the shape buffer.
> > Read [ReadBuffer?] (buf interface{})  int            // Will this read in
> > chunks (read pointer and Seek functionality) or is it an all-or-nothing
> > read?
> > Write [WriteBuffer?] (buf interface{})  int            // Will this write in
> > chunks (read pointer + Seek) a la Read call question?
> > Append [AppendBuffer?] (buf interface{}) int
> >
> > Open(fname string) npyio, error
> > Close() error
> > Flush() error
> > Err() error
>
> I don't know.
> this seems to conflate an API for npyio and one for describing n-dim data.
>
> I'd like to try to disentangle the two.

I'm a bit confused here.  npy files are just a binary form of ndim data.  So, I see the API simply describing and giving access to that ndim data.  Separating the two would seem to make interacting with the file more difficult and/or confusing.  


> - have, say, a numpy-like dtype type (interface or otherwise) to describe (possibly n-dim) data
> - have a way, leveraging this dtype, to seamlessly handle user types (matrix, n-dim arrays, ... user types) inside npyio (but also other gonum- and non-gonum-related formats).
>
> dtype could look like:
>
> package dtype
>
> type Type interface {
> 	Kind() Kind   // reflect.Kind + ndim-array-kind
> 	Rank() int    // 0: scalar, 1: slice/array, ... (perhaps not needed, as the same info can be obtained from len(Shape()))
> 	Shape() []int
>
> 	// more stuff ?
> 	// basically reflect.Type without the Func/Method/Interface/Chan support ?
> }

The dtype package would be useful for converting the numpy-style text to and from a reader/writer.  The Rank(), Shape(), and anything else don't really fit, though, because those are specific to npy files or ndim objects rather than to the element type itself.  It would also be ambiguous, since a dtype can have a shape of its own, which would make a call to Shape() confusing.

I would see dtype useful as a separate package with an API like

package dtype

type DTypeElement struct {
    reflect.Type
    shape []int
}

type DType struct {
    binary.ByteOrder
    *DTypeElement
}

func GetType(dtype string) DType
func DTypeOf(interface{}) DType
func (d *DType) String() string

Then, separately, the npyio package would offload the type conversion (string-to-dtype and dtype-to-string) and have an API focused on presenting the npy file as simply ndim data:

type npyFile struct {

Rank() int
Shape() []int

Kunde21

Mar 21, 2016, 7:47:44 PM
to gonum-dev
Ugh, hit post a bit early.  Let's try that again.
 
The dtype package would be useful in converting the numpy-style text to and from a readable/writeable interface.

I would see dtype useful as a separate package with an API like:

package dtype

type DTypeElement struct {
    Name     string
    Type     reflect.Type
    Shape    []int
    Elements []DTypeElement
}

type DType struct {
    Order    binary.ByteOrder
    Elements []DTypeElement
}

func GetType(dtype string) DType
func DTypeOf(interface{}) DType
func (d *DType) String() string

Then, separately, the npyio package would offload the type conversion (string-to-dtype and dtype-to-string) and have an API focused on presenting the npy file as simply ndim data:

type NpyFile struct {
    Dtype   *DType
    Fortran bool
    shape   []int
    file    io.ReadWriteCloser
}

func Open(fname string) (*NpyFile, error) // loads the NpyFile object if the file exists
func NewNpyFile(fname string) (*NpyFile, error)
func (n *NpyFile) Flush() error // writes header info to the file
func (n *NpyFile) Close() error // writes the header and closes the file

func (n *NpyFile) Rank() int
func (n *NpyFile) Shape() []int
func (n *NpyFile) SetShape([]int)
func (n *NpyFile) Read(buffer interface{}) (int, error)
func (n *NpyFile) Write(buffer interface{}) (int, error)
func (n *NpyFile) Append(buffer interface{}) (int, error)

Within the Write, it would just be a call to binary.Write(n.file, n.Dtype.Order, buffer) along with the header write.  Read would be similar.
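
As a rough sketch of that write path (writeHeader below is a placeholder
for the header bookkeeping, not a real function):

func (n *NpyFile) Write(buffer interface{}) (int, error) {
	// placeholder: re-emit the header with the current shape/dtype.
	if err := n.writeHeader(); err != nil {
		return 0, err
	}
	// encoding/binary serializes the flat, fixed-size buffer.
	if err := binary.Write(n.file, n.Dtype.Order, buffer); err != nil {
		return 0, err
	}
	return binary.Size(buffer), nil
}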

Seb Binet

Mar 23, 2016, 4:37:59 AM
to Kunde21, gonum-dev
On Tue, Mar 22, 2016 at 12:47 AM, Kunde21 <kun...@gmail.com> wrote:
> Ugh, hit post a bit early. Let's try that again.
>
> dtype package would be useful in converting from the numpy-style text to and
> from a readable/writeable interface.
>
> I would see dtype useful as a separate package with an API like:
>
> package dtype
>
> type DTypeElement {
> Name string
> Type reflect.Type
> Shape []int
> Elements []DTypeElement
> }
>
> type DType struct {
> Order *binary.byteOrder
> Elements []DTypeElement
> }
>
> GetType(dtype string) DType
> DTypeOf(interface{}) DType //
> (d *DType) String() string

something like that, yes.
but, dtype.GetType(name string) dtype.Type is probably too pythonic.

also, dtype shouldn't be too tied to numpy's dtype.

I'd prefer it to deal only with types, w/o any disk-based
representation considerations (so, no binary.ByteOrder) and in effect
be a superset of reflect.Type.
IMHO, for dtype.Type to be useful and widely used, it should be
"reflect.Type with support for describing n-dim data."


I think having this:

///
package dtype

type Type interface {
	Shape() []int
	Kind() Kind
	Elem() Type // pointers, arrays, slices, etc...
	NumField() int
	Field(i int) StructField
	// ... like reflect.Type ...
}
///

is workable.

there are of course interesting implementation details I am glossing over, e.g.:
- what dtype.Type should [2][3][4]int be? especially its Shape() and
Elem(): []int{2,3,4} and TypeOf(int(0))? or []int{2} and
TypeOf([3][4]int{})?
- how should "ragged arrays" be represented? should they be represented
at all? e.g. [][][]int, with a Shape() of []int{-1,-1,-1} and an Elem()
of int?

> Then, separately, the npyio package would offload the type conversion
> (string-to-dtype and dtype-to-string) and have an API focused on presenting
> the npy file as simply ndim data:
>
> type NpyFile struct {
> Dtype *DType
> Fortran bool
> shape []int
> file io.ReadWriteCloser
> }
>
> Open(fname string) *NpyFile, error //This would load the NpyFile object if
> the file exists
> NewNpyFile(fname string) *NpyFile, error
> Flush() error //Writes Header info to file
> Close() error //Writes Header and closes file
>
> Rank() int
> Shape() []int
> SetShape([]int)
> (n *NpyFile) Read(buffer interface{}) int, error
> (n *NpyFile) Write(buffer interface{}) int, error
> (n *NpyFile) Append(buffer interface{}) int, error
>
> Within the Write, it would just be a call to binary.Write(n.file,
> n.Dtype.Order, buffer) along with the header write. Read would be similar.

I don't see how that would work.
is "buffer interface{}" meant to be a flat slice []T ? or directly a T?
ie: is buffer a value of numgo.Array64 or mat64.Dense? or the
corresponding flat []float64?

also: I am not clear on what (*NpyFile).Append(buffer interface{}) (int, error) is supposed to do.
is it a method for a .npz file instead of an .npy one? (.npz files are
zip-like, with multiple key/value ndim-data entries; .npy files hold only
one ndim-data blob.)
or is it really:

 data := []float64{0, 1, 2, 3, 4}
 f, err := npyio.Create("foo.npy")
 _, err = f.Write(data)
 f.SetShape([]int{len(data)})
 data = append(data, 42)
 _, err = f.Append(data[5:])
 f.SetShape([]int{len(data)})

?
if the latter, then I think this is a different concept that warrants
its own set of interfaces. (and I am not sure the .npy file format
really supports that use case.)

AFAICT, the .npy file format is meant to be used as a pure one-shot
save/load facility.
That's why I "designed" sbinet/npyio the way it is, with top-level
npyio.Write and npyio.Read functions.

thanks for taking the time to discuss these interesting topics.
-s

Kunde21

Mar 24, 2016, 7:05:19 AM
to gonum-dev, kun...@gmail.com


On Wednesday, March 23, 2016 at 1:37:59 AM UTC-7, Sebastien Binet wrote:
> On Tue, Mar 22, 2016 at 12:47 AM, Kunde21 <kun...@gmail.com> wrote:
> > Ugh, hit post a bit early.  Let's try that again.
> >
> > The dtype package would be useful in converting the numpy-style text to
> > and from a readable/writeable interface.
> >
> > I would see dtype useful as a separate package with an API like:
> >
> > package dtype
> >
> > type DTypeElement struct {
> >     Name     string
> >     Type     reflect.Type
> >     Shape    []int
> >     Elements []DTypeElement
> > }
> >
> > type DType struct {
> >     Order    binary.ByteOrder
> >     Elements []DTypeElement
> > }
> >
> > func GetType(dtype string) DType
> > func DTypeOf(interface{}) DType
> > func (d *DType) String() string
>
> something like that, yes.
> but, dtype.GetType(name string) dtype.Type is probably too pythonic.
>
> also, dtype shouldn't be too tied to numpy's dtype.

Yes, it's a bit pythonic, but that's a bit unavoidable when we're dealing with python-dictated type descriptors.  Getting "<f4" to {binary.LittleEndian, float32} is hard to accomplish any other way.  That's just a minimal example, with numpy allowing for complex dtypes like [('name', np.str_), ('data', np.float64, (1,2,))].  If we're aiming for interop, the API will end up pythonic whether we like it or not.
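
The kind of mapping I have in mind, as a rough sketch (illustrative only: it handles just a few simple type codes, and a real parser would also need structured dtypes, strings, shapes, etc.; assumes the encoding/binary, fmt and reflect packages):

// parseSimpleDtype maps a simple numpy dtype string such as "<f4"
// to a byte order and a Go element type.
func parseSimpleDtype(s string) (binary.ByteOrder, reflect.Type, error) {
	if len(s) < 2 {
		return nil, nil, fmt.Errorf("dtype: invalid descriptor %q", s)
	}
	var order binary.ByteOrder
	switch s[0] {
	case '<':
		order = binary.LittleEndian
	case '>':
		order = binary.BigEndian
	default:
		return nil, nil, fmt.Errorf("dtype: unsupported byte order in %q", s)
	}
	switch s[1:] {
	case "f4":
		return order, reflect.TypeOf(float32(0)), nil
	case "f8":
		return order, reflect.TypeOf(float64(0)), nil
	case "i4":
		return order, reflect.TypeOf(int32(0)), nil
	case "i8":
		return order, reflect.TypeOf(int64(0)), nil
	}
	return nil, nil, fmt.Errorf("dtype: unsupported type code in %q", s)
}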

> I'd prefer it to deal only with types, w/o any disk-based
> representation considerations (so, no binary.ByteOrder) and in effect
> be a superset of reflect.Type.
> IMHO, for dtype.Type to be useful and widely used, it should be
> "reflect.Type with support for describing n-dim data."
 
While I would like for this to work, numpy allows for different endieness in the written file.  The dtype package wouldn't be as useful if it could only read one of the two.  


> I think having this:
>
> ///
> package dtype
>
> type Type interface {
> 	Shape() []int
> 	Kind() Kind
> 	Elem() Type // pointers, arrays, slices, etc...
> 	NumField() int
> 	Field(i int) StructField
> 	// ... like reflect.Type ...
> }
> ///
>
> is workable.
>
> there are of course interesting implementation details I am glossing over, e.g.:
> - what dtype.Type should [2][3][4]int be? especially its Shape() and
> Elem(): []int{2,3,4} and TypeOf(int(0))? or []int{2} and
> TypeOf([3][4]int{})?
> - how should "ragged arrays" be represented? should they be represented
> at all? e.g. [][][]int, with a Shape() of []int{-1,-1,-1} and an Elem()
> of int?

Those ndim numpy dtype elements are the real crux in getting a truly universal go dtype library.  After racking my brain, I kind of settled on the nested {shape []int, type reflect.Type} construct.  That's really the only way I could think of to decouple the ndim elements from a specific go implementation, which would allow the library to convert it to a mat64.Dense/mat64.Sparse/numgo.Array64/etc element from the component parts (shape, buffer).  That would leave it to the libraries to convert from components to concrete types.
Somewhat connected to the dtype comment, the `buffer interface{}` would be the flat []float64 for mat64.Dense/numgo.Array64/arbitrary []float64 wrapper.  Libraries would need to implement their own conversion using the npyio and dtype API/objects.
 

> also: I am not clear on what (*NpyFile).Append(buffer interface{}) (int, error) is supposed to do.
> is it a method for a .npz file instead of an .npy one? (.npz files are
> zip-like, with multiple key/value ndim-data entries; .npy files hold only
> one ndim-data blob.)
> or is it really:
>
>  data := []float64{0, 1, 2, 3, 4}
>  f, err := npyio.Create("foo.npy")
>  _, err = f.Write(data)
>  f.SetShape([]int{len(data)})
>  data = append(data, 42)
>  _, err = f.Append(data[5:])
>  f.SetShape([]int{len(data)})
>
> ?
> if the latter, then I think this is a different concept that warrants
> its own set of interfaces. (and I am not sure the .npy file format
> really supports that use case.)
>
> AFAICT, the .npy file format is meant to be used as a pure one-shot
> save/load facility.
> That's why I "designed" sbinet/npyio the way it is, with top-level
> npyio.Write and npyio.Read functions.

While I don't have a set of methods that would use this right now, I'm thinking that it would be useful for any object that has slicing capabilities.  For numgo.Array64 objects, the npy file would be "written" by iterating over the sliced elements and appending them, rather than building a new slice in memory just to write it to the file and throw it away.  Since slicing could be designed in many different ways, it might not be applicable to every similar object.  My idea for writing numgo slices (when they're implemented) is:

file, err := npyio.Create("file.npy")
for i := index; i < len; i += OuterStride {
    file.Append(data[i : i+InnerStride])
}
file.SetShape([]int{<shape vector>})
file.Flush()

Otherwise, a new slice would need to be created and built in the same way just to make a single call to `file.Write(buffer)`.  From an efficiency standpoint, letting the file writer handle the buffering would save memory and time.


> thanks for taking the time to discuss these interesting topics.
> -s

Glad to discuss this with you.  Serializing numgo objects to npy files was on my todo list, but I put it on the back-burner when I ran into the true complexity of the npy file design.

Kunde21

Mar 24, 2016, 7:16:52 AM
to gonum-dev, kun...@gmail.com
On a separate note, npz files could be created as simply as:

f, err := npyio.CreateNpz("file.npz")
f.AddNpy(name string, npy npyio.File)
...
f.Close()

f, err := npyio.CreateNpz("file2.npz")
f.Write(map[string]npyio.File{"a": fileA, "b": fileB})
f.Close()

Or read like:

NpyMap, err := npyio.OpenNpz("file.npz")

NpyMap being a simple map[string]npyio.File object.

Seb Binet

Mar 30, 2016, 12:30:36 PM
to Kunde21, gonum-dev
(resending from the correct address. apologies for the noise)

hi,

shifting gears a bit, I stumbled on this comment:
https://github.com/tensorflow/tensorflow/pull/1237#issuecomment-188003895

(part of an effort to provide Go bindings to TensorFlow...)

I believe gonum has a role to play in this* :)

-s

*: where "this" is: "defining interfaces, types and funcs to
interoperate with n-dim data"

Kunde21

Mar 31, 2016, 3:16:31 AM
to gonum-dev, kun...@gmail.com
I shot an email out to start identifying our overlaps.  With the maturity of the SciPy stack, we can only benefit from a quick spin-up to interop on the data side.  It seems there are a lot of different groups interacting at different points of the SciPy ecosystem, though.  

webus...@gmail.com

Sep 20, 2016, 4:22:24 PM
to gonum-dev, kun...@gmail.com
Any chance to add the read and write npz examples to the following page:


Regards,

WU

Dan Kortschak

Sep 20, 2016, 6:03:43 PM
to webus...@gmail.com, gonum-dev, kun...@gmail.com

From a brief Google search (I rarely use NumPy), npz is just a ZIP archive containing NumPy data files, so it's merely a matter of composing from the examples in Seb's docs and in pkg/archive/zip: https://golang.org/pkg/archive/zip/#pkg-examples
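
Something along these lines (an untested sketch assuming each archive member is a plain .npy file; it reads each one into a []float64 for illustration, using archive/zip and github.com/sbinet/npyio):

// read every array stored in an .npz (a zip archive of .npy files):
zr, err := zip.OpenReader("file.npz")
if err != nil {
	log.Fatal(err)
}
defer zr.Close()
for _, zf := range zr.File {
	r, err := zf.Open()
	if err != nil {
		log.Fatal(err)
	}
	var data []float64 // the element type must match the stored dtype
	err = npyio.Read(r, &data)
	r.Close()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s: %v\n", zf.Name, data)
}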


webus...@gmail.com

Sep 20, 2016, 11:42:26 PM
to gonum-dev, webus...@gmail.com, kun...@gmail.com
This was mentioned in this thread earlier:

f, err := npyio.CreateNpz("file.npz")
f.AddNpy(name string, npy npyio.File)
...
f.Close()

f, err := npyio.CreateNpz("file2.npz")
f.Write(map[string]npyio.File{"a": fileA, "b": fileB})
f.Close()

Or read like:

NpyMap, err := npyio.OpenNpz("file.npz")

NpyMap being a simple map[string]npyio.File object.

It would be great if the example were extended to go from an npyio.File to a gonum matrix and vice versa, and also added to the npyio documentation at https://github.com/sbinet/npyio
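
For the npy-to-matrix direction, npyio can already (if I read its docs right) decode straight into a mat64.Dense, roughly like this (error handling elided):

f, err := os.Open("data.npy")
// ...
defer f.Close()

var m mat64.Dense
err = npyio.Read(f, &m)
// ...
fmt.Printf("data = %v\n", mat64.Formatted(&m))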

webus...@gmail.com

Sep 20, 2016, 11:45:21 PM
to gonum-dev, webus...@gmail.com, kun...@gmail.com
Also npz allows 1d/nd arrays with dtype to move between python <-> java <-> julia

I'm hoping npyio will now allow 1d/nd arrays to be exchanged between go <-> python <-> java <-> julia

Much more lightweight than using hdf5.