Feather - a fast on-disk format for data frames


Douglas Bates

Mar 29, 2016, 5:22:52 PM
to julia-stats
Wes McKinney and Hadley Wickham jointly developed an on-disk format for storing and retrieving data frames from pandas in Python and from R.  See http://blog.rstudio.org/

Sounds like something we should consider soon.  I haven't looked at the code yet but plan to do so soon.

Shashi Gowda

Mar 30, 2016, 1:36:10 AM
to julia-stats

Tanmay is working on wrapping a similar format called Parquet (https://github.com/tanmaykm/Parquet.jl). It's a bit more sophisticated than Feather.

I wonder why they don't time the writes in the blog post.



jock....@gmail.com

Mar 30, 2016, 2:06:59 AM
to julia-stats
I'm unclear what this provides that, say, SQLite doesn't. Thoughts?

Douglas Bates

Mar 30, 2016, 12:12:21 PM
to julia-stats
The Arrow format, and hence the Feather format, is columnar: each column's values are stored contiguously, unlike a row-oriented store such as SQLite.

Also, a big selling point for this format is that it can be used from Python/pandas and from R.  A Julia package to read and write this format would be very useful for data exchange.
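
A toy illustration of the row-versus-column distinction, written in Python with only the standard library. The table values and variable names are made up for illustration; this is a sketch of the layout idea, not of how Feather or SQLite is actually implemented.

```python
from array import array

# Hypothetical three-row table with two Float64 columns (Time, demand).
rows = [(1.0, 8.3), (2.0, 10.3), (3.0, 19.0)]

# Row-oriented layout: the values of one record sit next to each other.
row_major = array("d", [value for row in rows for value in row])

# Column-oriented layout: the values of one column sit next to each other.
time_col = array("d", [row[0] for row in rows])
demand_col = array("d", [row[1] for row in rows])
col_major = time_col + demand_col

# Reading the whole "demand" column from each layout:
demand_from_rows = row_major[1::2]        # strided: skips over the Time values
demand_from_cols = col_major[len(rows):]  # contiguous: one block of memory
```

Both reads return the same values, but the columnar read touches one contiguous block, which is what makes whole-column access and memory mapping cheap.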

Douglas Bates

Apr 1, 2016, 2:12:15 PM
to julia-stats
I have been playing with the feather format in Julia for a few days without great success.  At present the implementation is a C++ library which is somewhat beyond my ability to grok.  I could offer some choice comments on C++ here but I think I will just go back to programming in Julia.  My rudimentary efforts are in https://github.com/JuliaStats/Feather.jl.

To go any further I think I would need to decide either to go full-bore C++, generating Julia objects within my compiled code, which is feasible but doesn't sound like a whole lot of fun, or to figure out how to parse the metadata without going through the Flatbuffers-generated C++ code.

Cedric St-Jean

Apr 1, 2016, 8:28:25 PM
to julia-stats
> figure out how to parse the metadata without going through the Flatbuffers-generated C++ code.

By that, you mean a pure-Julia solution for reading the files?

Douglas Bates

Apr 6, 2016, 3:49:09 PM
to julia-stats
Wes McKinney added a C API to libfeather and I was able to use that in the Feather.jl package that is under development.

Douglas Bates

Apr 15, 2016, 6:34:31 PM
to julia-stats
Eventually I said "to hell with it" and wrote a Julia package for reading binary files created according to a flatbuffers IDL file, which is how the metadata in a feather file is stored.  The current Feather.Reader is working, more or less.  It will need to be polished and documented but I am very happy with having a native Julia implementation all the way down.
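
A minimal sketch of the first step such a reader has to take: finding the metadata block before handing it to a FlatBuffers parser. It assumes the Feather v1 layout as publicly described (a 4-byte "FEA1" magic at both ends of the file, with a little-endian uint32 metadata length just before the trailing magic). The function names are hypothetical, the toy builder ignores the alignment padding a real file has, and the metadata here is opaque placeholder bytes rather than a real FlatBuffers table. It is written in Python with only the standard library and is not taken from Feather.jl.

```python
import struct

MAGIC = b"FEA1"  # assumed Feather v1 magic number

def build_toy_file(column_bytes: bytes, metadata: bytes) -> bytes:
    """Assemble a simplified Feather-style byte string (no padding)."""
    return MAGIC + column_bytes + metadata + struct.pack("<I", len(metadata)) + MAGIC

def extract_metadata(buf: bytes) -> bytes:
    """Return the raw metadata bytes, located from the end of the buffer."""
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not a Feather-style file")
    # The metadata length is stored in the 4 bytes before the trailing magic.
    (meta_len,) = struct.unpack("<I", buf[-8:-4])
    end = len(buf) - 8
    return buf[end - meta_len:end]

toy = build_toy_file(struct.pack("<3d", 1.0, 2.0, 3.0), b"placeholder metadata")
meta = extract_metadata(toy)
```

In a real file the extracted bytes would then be decoded according to the metadata.fbs schema, which is the part the FlatBuffers package handles.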

Stefan Karpinski

Apr 19, 2016, 12:33:30 PM
to julia-stats
I'll have to try this out this week. Impressive work (as usual).


Rob J. Goedman

Apr 20, 2016, 10:12:55 AM
to julia...@googlegroups.com
Hi Doug,

A somewhat related discussion is taking place on the stan-dev list so I have been following your work on Flatbuffers.jl and Feather.jl a bit. 

The previous version (where I had to provide libfeather.dylib) worked fine; the FlatBuffers version currently seems to read only the metadata. Is that correct, or do I need more steps?

Regards,
Rob


julia> using Feather

julia> rr = Reader(Pkg.dir("Feather", "test", "data", "iris.feather"))
[150 × 5] @ /Users/rob/.julia/v0.4/Feather/test/data/iris.feather


julia> rr
[150 × 5] @ /Users/rob/.julia/v0.4/Feather/test/data/iris.feather



Douglas Bates

Apr 24, 2016, 6:11:30 PM
to julia-stats
Andreas Noack and I have been plugging away on a version that uses Keno's Cxx package, which means it can only be used with Julia 0.5-

See the dmb/cxx branch of the repository, which was forked from the anj/cxx branch today.

One remarkable aspect of this is that it doesn't use the feather C++ library; it uses only the header files flatbuffers.h and metadata_generated.h, the latter of which is generated by the flatc compiler from metadata.fbs.

I think this version handles the missing data correctly but that hasn't been extensively tested.  I can detect when a category is stored but have not yet parsed the category metadata itself.

Rob J. Goedman

Apr 25, 2016, 12:06:55 PM
to julia...@googlegroups.com
Thanks Doug,

Sounds promising. I’ll hold off for a while to see if the cxx stuff settles.

Until now I have been trying to avoid going the C++ route to interface with Stan, and I was kind of hoping that FlatBuffers (for in-memory communication) and Feather (slower but, if needed, permanent disk storage) would help out here.

Regards,
Rob

Douglas Bates

Apr 28, 2016, 1:00:39 PM
to julia-stats
Okay, so we have a feather file reader for Julia 0.5- using the Cxx package.

$ julia5
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-dev+3782 (2016-04-28 12:43 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit e8601e8 (0 days old master)
|__/                   |  x86_64-linux-gnu

INFO: Cloning Feather from https://github.com/JuliaStats/Feather.jl
INFO: Computing changes...
INFO: Installing FlatBuffers v0.0.1

julia> Pkg.checkout("Feather", "dmb/cxx")
INFO: Checking out Feather dmb/cxx...
INFO: Pulling Feather latest dmb/cxx...
INFO: Removing FlatBuffers v0.0.1

julia> using Feather
WARNING: cfunction: process_cxx_exception does not return
WARNING: New definition 
    size(DataFrames.ModelMatrix, Any...) at /home/bates/.julia/v0.5/DataFrames/src/statsmodels/formula.jl:48
is ambiguous with: 
    size(Any, Integer, Integer, Integer...) at abstractarray.jl:23.
To fix, define 
    size(DataFrames.ModelMatrix, Integer, Integer, Integer...)
before the new definition.

julia> rr = Feather.Reader(Pkg.dir("Feather", "test", "data", "BOD.feather"))
[6 × 2] @ /home/bates/.julia/v0.5/Feather/test/data/BOD.feather
 Time    : Float64
 demand  : Float64


julia> BOD = DataFrame(rr)
6x2 DataFrames.DataFrame
│ Row │ Time │ demand │
┝━━━━━┿━━━━━━┿━━━━━━━━┥
│ 1   │ 1.0  │ 8.3    │
│ 2   │ 2.0  │ 10.3   │
│ 3   │ 3.0  │ 19.0   │
│ 4   │ 4.0  │ 16.0   │
│ 5   │ 5.0  │ 15.6   │
│ 6   │ 7.0  │ 19.8   │

As mentioned in the README, in the interests of speed the column contents are memory-mapped arrays pointing to the contents of the file on disk.  This means you can read big arrays very quickly.  However, if the Feather.Reader object is garbage collected you lose the contents of the columns.  Use deepcopy if you want to be safe rather than fast.
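
The same fast-versus-safe trade-off can be sketched with any memory-mapped file. The example below is Python with only the standard library, not Feather.jl code; the file name and variable names are made up, and the file is a bare run of Float64 values rather than a real Feather file. It shows a zero-copy view that is only valid while the mapping is alive, and an independent copy that outlives it.

```python
import mmap
import os
import tempfile
from array import array

# Write a hypothetical column of Float64 values to a scratch file.
values = [8.3, 10.3, 19.0, 16.0, 15.6, 19.8]
path = os.path.join(tempfile.mkdtemp(), "demand.bin")
with open(path, "wb") as f:
    f.write(array("d", values).tobytes())

# Map the file read-only; the mapping keeps its own handle to the file.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Fast: a zero-copy view that reinterprets the mapped bytes as doubles.
column = memoryview(mapped).cast("d")
fast_total = sum(column)

# Safe: an independent copy that no longer depends on the mapping.
safe_copy = list(column)

# Tear down the mapping; the view must be released before closing.
column.release()
mapped.close()

# The copy is still usable after the mapping is gone.
detached_total = sum(safe_copy)
```

The view costs nothing to create but cannot be used once the mapping is closed, which is the role the Feather.Reader object plays for the mapped columns; the copy costs time and memory up front, which is the role deepcopy plays.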