I would really like this list's help with designing some kind of minimal dataframe-like structure in Haskell. I have done a lot of work with R and some with IPython/Pandas, and I would love to bring this style of work to Haskell. It's also fun to play with data, and Haskell has a lot of attributes that would make it ideal for this (as people have already noted), if we could make it more interactive and cut down on boilerplate. I would love to take some popular IPython showcases and experiment with "translating" them to Haskell, looking at what kinds of libraries, higher-level APIs, sugar, etc. are needed to make them look just as concise.
Most of my work is analyzing mixed data, often questionnaires (with numerical fields, enums such as Yes/No, Male/Female or Never/Sometimes/Always, and string fields), sometimes web logs, forum contents, etc. It's not necessarily huge data, although I did one project with 20 million rows of MOOC clicklog data. I wrote up a bit of my workflow in R here
http://reganmian.net/blog/2014/10/14/starting-data-analysiswrangling-with-r-things-i-wish-id-been-told/, including a brief video about using RStudio, whose interface I actually prefer to IPython.
I have been looking everywhere for something in Haskell that comes close to the flexibility and ease of use of R dataframes (or Pandas, or Julia's DataFrames). I realize the challenges with the type system, but I also keep hearing people on IRC or mailing lists mention that this shouldn't be too hard, with HLists or HRecords etc. (The question of a data.frame in Haskell seems to have come up regularly over the last few years.)
I spent some time looking at these libraries, but I really struggled to understand them. I am a beginning Haskeller, and still struggle with type-level programming etc. It's made worse by the lack of documentation: I spent quite a lot of time looking at the records package with Data.Records, and even after reading the accompanying paper, I still could not come up with a minimal example that works in ghci. (What is KindStar? How do you construct a name?)... I even searched GitHub for projects using records, to try to understand how it works.
I was very excited when I came upon Stephen Diehl's frame library (https://github.com/sdiehl/frame). Not only did it have a minimal example in the README, but it looked like a very nice API:
λ: Right frame <- fromCsvHeaders "examples/titanic.csv"
λ: frame & cols ["sex", "name", "survived", "age", "boat"] . rows [1..20]
-- pretty printed table
λ: let Success ages = frame ^. get "age" :: Result [Maybe Double]
λ: take 5 ages
[Just 29.0,Just 0.916700006,Just 2.0,Just 30.0,Just 25.0]
λ: avg $ catMaybes ages
I spent quite a bit of time figuring out why it wouldn't install, and fixing it with some of my first pull-requests for a Haskell library :) And I began planning to write an IHaskell.Display instance for the library, so that we could get nice HTML tables for free. I wanted to create a Criterion suite to test with large CSV files, experiment with connecting it with the Statistics library, look at making it easy to graph using Chart, etc.
Basically, we want to be able to use strongly-typed functions on frames. For example, say I have a frame of type Frame [Int, String, Float, String] (never mind the actual underlying implementation, whether it is a record with a map of vectors like now, or a vector of records, or something else).
The simplest case would be applying a function of type, for example, Int -> Int to one column, so that
Frame [.. Int] -> (Int -> Int) -> Frame [.. Int]
(Here I use .. to represent all the other columns, whose types we don't worry about, since we leave them alone (id).)
But I should also be able to do
Frame [.. String] -> (String -> Int) -> Frame [.. String, Int]
i.e. run a transformation and add the result as a new column.
The function could also depend on two or more columns:
Frame [.. Int, Int, Int] -> (Int -> Int -> Int -> Int) -> Frame [.. Int, Int, Int, Int]
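To make these signatures concrete without any type-level machinery, here is a naive sketch in which the row type is fixed up front as an ordinary record. Frame, mapCol and addCol are hypothetical names I made up for illustration, not part of frame or any other library, and the Person type and data are invented too.

```haskell
-- A deliberately naive sketch: a frame is just a list of rows of some type r.
-- Frame, mapCol and addCol are hypothetical names, not an existing API.
newtype Frame r = Frame [r]
  deriving (Show, Eq)

-- Frame [.. Int] -> (Int -> Int) -> Frame [.. Int]
-- The column is identified by a getter/setter pair; other fields are left alone.
mapCol :: (r -> a) -> (a -> r -> r) -> (a -> a) -> Frame r -> Frame r
mapCol get set f (Frame rows) = Frame [set (f (get row)) row | row <- rows]

-- Frame [.. String] -> (String -> Int) -> Frame [.. String, Int]
-- The new column is paired onto each row, so the type of the frame changes.
addCol :: (r -> b) -> Frame r -> Frame (r, b)
addCol f (Frame rows) = Frame [(row, f row) | row <- rows]

-- Example row type and data (invented for illustration).
data Person = Person { name :: String, age :: Int }
  deriving (Show, Eq)

people :: Frame Person
people = Frame [Person "Ann" 29, Person "Bob" 41]

-- Double every age, leaving the names untouched.
doubled :: Frame Person
doubled = mapCol age (\a p -> p { age = a }) (* 2) people

-- Add a derived Int column: the length of each name.
withNameLength :: Frame (Person, Int)
withNameLength = addCol (length . name) people
```

A function of several columns fits the same shape, e.g. addCol (\r -> f (colA r) (colB r) (colC r)), where colA etc. stand for whatever field accessors the row type has. The obvious weakness is that the columns are fixed at compile time as record fields rather than looked up by name, which is exactly the part I don't know how to do.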
The other things I mentioned in my message to sdiehl:
---
What would be very useful for me are examples (if this is already possible given the lens api) of
- selecting rows based on a single column predicate (the equivalent of db[db$age > 30,] in R)
- selecting rows based on multiple column predicates (the equivalent of db[db$age > 30 & db$weight > 50,])
- creating a new hframe where a column has been modified by a function (the equivalent of db$age = db$age * 2, but functionally) - or something like
newframe = fmap (* 2) (frame ! "age")
- creating a new hframe with an added column calculated based on one or more existing columns (the equivalent of db$derived = db$age / db$weight, but functionally)
- an example of groupBy
---
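For comparison, on a plain list of records (no frame library involved) these five operations are all short. What I am after is the same thing on a frame whose columns are looked up by name. A rough sketch using only base, where the Person type, its fields and the sample data are all invented for illustration:

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Hypothetical row type; the field names mirror the R examples above.
data Person = Person
  { sex    :: String
  , age    :: Double
  , weight :: Double
  } deriving (Show, Eq)

db :: [Person]
db =
  [ Person "female" 29 60
  , Person "male"   41 80
  , Person "female" 35 45
  ]

-- db[db$age > 30,]
olderThan30 :: [Person]
olderThan30 = filter ((> 30) . age) db

-- db[db$age > 30 & db$weight > 50,]
olderAndHeavier :: [Person]
olderAndHeavier = filter (\p -> age p > 30 && weight p > 50) db

-- db$age = db$age * 2, but returning a new list instead of mutating
doubledAge :: [Person]
doubledAge = map (\p -> p { age = age p * 2 }) db

-- db$derived = db$age / db$weight; the derived value is paired with each row
withDerived :: [(Person, Double)]
withDerived = map (\p -> (p, age p / weight p)) db

-- groupBy: mean age per value of sex.
-- Data.List.groupBy only merges adjacent rows, hence the sortOn first.
meanAgeBySex :: [(String, Double)]
meanAgeBySex =
  [ (sex p, sum (map age grp) / fromIntegral (length grp))
  | grp@(p : _) <- groupBy ((==) `on` sex) (sortOn sex db)
  ]
```

With the sample data above, meanAgeBySex evaluates to [("female",32.0),("male",41.0)]. None of this handles missing values or columns chosen at runtime, which is where a real frame type would have to do better.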
If anyone could help me change frame into, or come up with, a new structure that lets me do these things with minimal overhead, I would be incredibly grateful. Even if the implementation is a bit kludgey and slow underneath, just giving us a chance to experiment with APIs and programming patterns, interfacing with other libraries etc., would be very useful!
Thank you
Stian