Feather (re: output format)

Allen B. Riddell

Apr 5, 2016, 9:26:01 AM

to stan...@googlegroups.com

re: output format

The developers' names might be familiar...

https://github.com/wesm/feather

> Feather: fast, interoperable data frame storage
>
> Feather is binary columnar serialization for data frames. It is designed
> to read and write data frames very efficiently, and to make it easy to
> share data across multiple data analysis languages. The initial version
> of Feather comes with bindings for python (written by Wes McKinney) and
> R (written by Hadley Wickham).

Bob Carpenter

Apr 5, 2016, 11:09:33 AM

to stan...@googlegroups.com

Are you thinking of using this for storing draws
and maybe mass matrices as a simple table?

If we do go with a binary format, it'd be nice to be
able to convert it to something human readable with
a simple script.

The Apache license is totally OK by me, which is the
first hurdle! I think they're overselling the 70-80%
of CPU going to serialization/deserialization. Certainly
not for Stan!

- Bob


Allen B. Riddell

Apr 5, 2016, 12:01:42 PM

to stan...@googlegroups.com

Yes, that was the idea. I suppose we'll have to see how much support the
underlying representation (Apache Arrow) gets. It certainly has an
impressive list of committers: https://arrow.apache.org/

Krzysztof Sakrejda

Apr 6, 2016, 2:50:10 PM

to stan development mailing list, a...@ariddell.org

On a cursory look, it seems they read and write one column at a time,
which might be hard to stream, but there's certainly room for fudging it
(so that it's one array, in their terminology, per our CSV "row").

K
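
To make the "fudging" idea concrete, here is a minimal sketch of buffering per-iteration rows and handing them off as column chunks. It uses only the Python standard library and does not touch the Feather or Arrow APIs; the class name `ColumnBuffer`, the `chunk_rows` parameter, and the parameter names are all hypothetical.

```python
from array import array


class ColumnBuffer:
    """Accumulate row-wise draws and emit them as column-wise chunks.

    Hypothetical helper for illustration; not part of Stan or Feather.
    """

    def __init__(self, names, chunk_rows):
        if chunk_rows < 1:
            raise ValueError("chunk_rows must be at least 1")
        self.names = list(names)
        self.chunk_rows = chunk_rows
        self._cols = [array("d") for _ in self.names]

    def append(self, row):
        """Add one iteration; return a full chunk when one is ready, else None."""
        if len(row) != len(self.names):
            raise ValueError("row length does not match number of columns")
        for col, value in zip(self._cols, row):
            col.append(value)
        if len(self._cols[0]) >= self.chunk_rows:
            return self.flush()
        return None

    def flush(self):
        """Return whatever is buffered as {name: column}, or None if empty."""
        if len(self._cols[0]) == 0:
            return None
        chunk = dict(zip(self.names, self._cols))
        self._cols = [array("d") for _ in self.names]
        return chunk


# Usage: three iterations streamed in, chunks of two rows written out.
buf = ColumnBuffer(["lp__", "mu", "sigma"], chunk_rows=2)
chunks = []
for row in [(-7.1, 0.3, 1.2), (-6.9, 0.4, 1.1), (-7.0, 0.2, 1.3)]:
    chunk = buf.append(row)
    if chunk is not None:
        chunks.append(chunk)
tail = buf.flush()  # the leftover partial chunk
```

The trade-off is that a reader only sees draws once a chunk is complete, so `chunk_rows` sets how far the file lags behind the sampler.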

Bob Carpenter

Apr 6, 2016, 2:55:41 PM

to stan...@googlegroups.com, a...@ariddell.org

That has been the ongoing struggle. One structured line of
JSON per iteration sounds good, but it's crazy high overhead.

- Bob
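
A rough way to see the size side of that overhead is to compare one JSON line against the same draw packed as raw doubles. This is only a sketch: the parameter names and values are made up, and it measures bytes only, not the parsing cost.

```python
import json
import struct

# Hypothetical parameter names and one draw, for illustration only.
names = ["lp__", "mu", "sigma", "theta.1", "theta.2"]
draw = [-7.123456789012345, 0.3141592653589793, 1.2345678901234567, -0.5, 2.25]

# One structured JSON line per iteration: every name is repeated on every line.
json_line = json.dumps(dict(zip(names, draw))) + "\n"
json_bytes = len(json_line.encode("utf-8"))

# The same draw as little-endian 8-byte doubles: names live in a header instead.
binary_row = struct.pack("<%dd" % len(draw), *draw)
binary_bytes = len(binary_row)

# Packing is lossless for doubles, so the draw can be recovered exactly.
recovered = list(struct.unpack("<%dd" % len(draw), binary_row))

print(json_bytes, binary_bytes)
```

The binary row is always 8 bytes per value, while the JSON line pays for the names, quotes, and punctuation on every iteration.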

Krzysztof Sakrejda

Apr 6, 2016, 3:09:23 PM

to stan development mailing list

The JSON issues should not be relevant here; we would not need to add structured text filler the way JSON does.

Bob Carpenter

Apr 6, 2016, 3:17:34 PM

to stan...@googlegroups.com

I'd hope not in a binary format!

- Bob

> On Apr 6, 2016, at 3:09 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
>
> The json issues should not be relevant here, we would not need to add the structured text filler like json does.
>

Allen B. Riddell

Apr 6, 2016, 4:12:17 PM

to Krzysztof Sakrejda, stan development mailing list

I can't imagine the Hadoop/Spark/etc people don't have a way of dealing
with this problem (streaming data). It would be interesting to find out
what it is.

The attraction of having a standard format for enormous chains (where
CSV is unwieldy) is that we could have Python/R/Julia/Stata helper
functions which look the same.

(On the other hand, we haven't had too many people complaining on the
list about serializing/deserializing so perhaps this is all premature.)

Bob Carpenter

Apr 6, 2016, 4:42:44 PM

to stan...@googlegroups.com, Krzysztof Sakrejda

> On Apr 6, 2016, at 4:12 PM, Allen B. Riddell <a...@ariddell.org> wrote:
>
> I can't imagine the Hadoop/Spark/etc people don't have a way of dealing
> with this problem (streaming data). It would be interesting to find out
> what it is.

The underlying C++ interfaces we're building do stream. What
we're really talking about are:

* a data serialization scheme

* a persistence mechanism

These are often tied together, as with a database scheme
and database transactional storage.

This all has to live on top of a

* transport layer

That can just be file streams on a desktop or SSL over the
network. As soon as we talk network, there's also a

* security layer

Tools like Hadoop/Spark are mainly about a networked transport layer
(and job control), though often interface with network file systems
or databases for persistence. They're very flexible because they
don't build many assumptions in, but just get out of the way and let
you stream bytes.

There are probably good practices for organizing said stream
of bytes, but I don't know much about it other than what I
can work out from first principles (like buffering over networks).

I think this is all worth thinking about. We've talked about
it before when someone wanted to do JSON output, which is where
the by-row vs. by-column discussion came up.

Protocol buffers seem to make a lot of sense if we can overcome
the size limitations (which don't seem that limiting) and they
scale well as overall size increases.

- Bob

Krzysztof Sakrejda

Apr 6, 2016, 5:11:23 PM

to stan development mailing list, krzysztof...@gmail.com

All that's really needed to get around the size limitation is a chunking scheme
for larger objects (vectors/arrays/matrices) so I see our lack of a
schema as the main barrier to having good protobuf based serialization.

Krzysztof
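
As a sketch of what such a chunking scheme might look like, independent of any protobuf API: each chunk carries the object name, its offset, and the total length, so a reader can reassemble the object even if chunks arrive out of order. The record layout and the function names are assumptions for illustration, not an agreed schema.

```python
def chunk_vector(name, values, max_len):
    """Split a flat vector into records of at most max_len values each."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    total = len(values)
    for offset in range(0, total, max_len):
        yield {
            "name": name,
            "offset": offset,
            "total": total,
            "values": list(values[offset:offset + max_len]),
        }


def reassemble(chunks):
    """Rebuild the original vector from its chunks, in any order."""
    chunks = list(chunks)
    if not chunks:
        return []
    out = [None] * chunks[0]["total"]
    for c in chunks:
        out[c["offset"]:c["offset"] + len(c["values"])] = c["values"]
    if any(v is None for v in out):
        raise ValueError("missing chunk")
    return out


# Usage: a 10-element vector split into chunks of at most 4 values.
theta = [float(i) for i in range(10)]
pieces = list(chunk_vector("theta", theta, 4))
rebuilt = reassemble(reversed(pieces))
```

Matrices and arrays would additionally need their dimensions recorded somewhere, which is where the missing schema comes in.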

>
> - Bob

Avraham Adler

Apr 10, 2016, 10:17:07 AM

to stan development mailing list, krzysztof...@gmail.com

Two notes about feather: 1) it does not support compression and 2) there are a few hoops that need to be jumped through to get it to compile on Windows. The former issue is intended to be addressed soon-ish, the latter issue may get fixed automatically once it becomes a CRAN package. As regards speed of reading and writing, it is significantly faster than RDS, at least in my limited testing.

Bob Carpenter

Apr 10, 2016, 6:58:52 PM

to stan...@googlegroups.com

> On Apr 10, 2016, at 10:17 AM, Avraham Adler <avraha...@gmail.com> wrote:
>
> Two notes about feather: 1) it does not support compression and 2) there are a few hoops that need to be jumped through to get it to compile on Windows. The former issue is intended to be addressed soon-ish, the latter issue may get fixed automatically once it becomes a CRAN package.

Stan isn't just an R package, though, so we need
this to work in Python, command-line, etc.

> As regards speed of reading and writing, it is significantly faster than RDS, at least in my limited testing.

:-) Isn't everything significantly faster than a relational
database?

- Bob

Krzysztof Sakrejda

Apr 10, 2016, 7:19:32 PM

to stan development mailing list

On Sunday, April 10, 2016 at 10:17:07 AM UTC-4, Avraham Adler wrote:
> Two notes about feather: 1) it does not support compression and 2) there are a few hoops that need to be jumped through to get it to compile on Windows. The former issue is intended to be addressed soon-ish, the latter issue may get fixed automatically once it becomes a CRAN package. As regards speed of reading and writing, it is significantly faster than RDS, at least in my limited testing.

Are there good instructions for installing on Windows? I'm working with a student who's run into some trouble with trying feather. It would be worth understanding if it's just an issue with needing cookbook instructions or if it's something more touchy.

Krzysztof

Daniel Lee

Apr 11, 2016, 9:31:39 AM

to stan-dev mailing list

I was at Wes McKinney's talk at NY R Conference on Apache Arrow, the back-end of feather. I should have paid a little more attention, but I didn't connect the dots until the very end.

He did mention not to expect stability in Apache Arrow for a little while. And I think he said there are difficulties on Windows. (Either that, or a handful of people said it in their talks and now I'm falsely remembering Wes saying that.)

Daniel

Avraham Adler

Apr 12, 2016, 12:57:31 PM

to stan development mailing list

On Sunday, April 10, 2016 at 7:19:32 PM UTC-4, Krzysztof Sakrejda wrote:
> Are there good instructions for installing on Windows? I'm working with a student who's run into some trouble with trying feather. It would be worth understanding if it's just an issue with needing cookbook instructions or if it's something more touchy.

Not really. See Wes's comments here: https://github.com/wesm/feather/issues/58. Basically, now that KK has patched mman.h, if you are working with MinGW-64 (for example, the Rtools v3.3.0.1959), what worked for me (in a slightly older configuration) was the following:

1) Clone the GitHub repository.
2) Manually follow the instructions at https://github.com/wesm/feather/blob/master/R/configure
2a) Make the new subdirectories under /R/src.
2b) Copy the feather and flatbuffer files from src to R/src/feather etc.
2c) It seems that the tests are no longer in src (see https://github.com/wesm/feather/commit/fa0d80caf3ea3866586972da26035e3c44d2529e ). If they are there, delete them.
3) Build the package from source.