[Haskell-cafe] data analysis question


Tobias Pflug

Nov 12, 2014, 4:45:40 AM
to haskel...@haskell.org
Hi,

just the other day I talked to a friend of mine who works for an online
radio service. He told me he is currently looking into how best to work
with assorted usage data: currently 250 million entries, about 12GB of
CSV, comprising information such as which channel was tuned in, for how
long, with which user agent, and so on.

He happened upon the K and Q programming languages [1][2], which
apparently work nicely for this, unfamiliar as they might seem.

This certainly is not my area of expertise at all. I was just wondering
how some of you would suggest approaching this with Haskell. How would
you most efficiently parse such data and evaluate custom queries over it?

Thanks for your time,
Tobi

[1] http://en.wikipedia.org/wiki/K_(programming_language)
[2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Michael Snoyman

Nov 12, 2014, 5:01:30 AM
to Tobias Pflug, haskel...@haskell.org
It's hard to answer without knowing what kinds of queries he's doing, but in the past, I've used csv-conduit to parse the raw data, convert the data to some Haskell ADT, and then perform analyses with standard conduit processing in a streaming manner.
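
A minimal sketch of that pipeline (untested; assumes csv-conduit's intoCSV and a hypothetical two-column file of channel and duration):

import           Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.ByteString.Char8 as B8
import           Data.CSV.Conduit (Row, defCSVSettings, intoCSV)
import           Control.Monad.Trans.Resource (runResourceT)

-- Hypothetical ADT for one log entry; the real column layout will differ.
data Entry = Entry { channel :: B8.ByteString, duration :: Int }

toEntry :: Row B8.ByteString -> Maybe Entry
toEntry [ch, dur] = case B8.readInt dur of
                      Just (d, _) -> Just (Entry ch d)
                      Nothing     -> Nothing
toEntry _         = Nothing

-- Total listening time, computed in constant memory.
main :: IO ()
main = do
  total <- runResourceT $
       CB.sourceFile "usage.csv"
    $$ intoCSV defCSVSettings
    =$= CL.mapMaybe toEntry
    =$= CL.fold (\acc e -> acc + duration e) (0 :: Int)
  print total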

Peter Simons

Nov 12, 2014, 5:21:49 AM
to haskel...@haskell.org
Hi Tobias,

> A friend [is] currently looking into how best to work with assorted
> usage data: currently 250 million entries, about 12GB of CSV,
> comprising information such as which channel was tuned in, for how
> long, with which user agent, and so on.

as much as I love Haskell, the tool of choice for data analysis is GNU R,
not so much because of the language, but simply because of the vast array
of high-quality libraries that cover topics like statistics, machine
learning, visualization, etc. You'll find it at <http://www.r-project.org/>.

If you wanted to analyze 12 GB of data in Haskell, you'd have to jump
through all kinds of hoops just to load that CSV file into memory. It's
possible, no doubt, but pulling it off efficiently requires a lot of
expertise in Haskell that statistics guys don't necessarily have (and
arguably they shouldn't have to).

The package Rlang-QQ integrates R into Haskell, which might be a nice way to
deal with this task, but I have no personal experience with that library, so
I'm not sure whether this adds much value.

Just my 2 cents,
Peter

Roman Cheplyaka

Nov 12, 2014, 6:57:12 AM
to Peter Simons, haskel...@haskell.org
On 12/11/14 05:21, Peter Simons wrote:
> If you wanted to analyze 12 GB of data in Haskell, you'd have to jump
> through all kinds of hoops just to load that CSV file into memory. It's
> possible, no doubt, but pulling it off efficiently requires a lot of
> expertise in Haskell that statistics guys don't necessarily have (and
> arguably they shouldn't have to).

Well, with Haskell you don't have to load the whole data set into
memory, as Michael shows. With R, on the other hand, you do.

Besides, if you're not an R expert, and if the analysis you want to do
is not readily available, it may be quite a pain to implement in R.

As a simple example, I still don't know an acceptable way to write
something like zipWith f (tail vec) vec in R.
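
(For reference, that line in Haskell with Data.Vector, e.g. computing the differences of consecutive elements:)

import qualified Data.Vector as V

-- Combine each element with its successor.
pairwise :: (a -> a -> b) -> V.Vector a -> V.Vector b
pairwise f vec = V.zipWith f (V.tail vec) vec

-- ghci> pairwise (-) (V.fromList [1,3,6,10])
-- fromList [2,3,4]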

Roman

Tobias Pflug

Nov 12, 2014, 7:06:49 AM
to haskel...@haskell.org
On 12.11.2014 12:56, Roman Cheplyaka wrote:
> On 12/11/14 05:21, Peter Simons wrote:
>> If you wanted to analyze 12 GB of data in Haskell, you'd have to jump
>> through all kinds of hoops just to load that CSV file into memory. It's
>> possible, no doubt, but pulling it off efficiently requires a lot of
>> expertise in Haskell that statistics guys don't necessarily have (and
>> arguably they shouldn't have to).
> Well, with Haskell you don't have to load the whole data set into
> memory, as Michael shows. With R, on the other hand, you do.
>
>
That is exactly what came to my mind when thinking about R. I haven't
actually used R myself, but based on what I know and what some googling
revealed, all analysis would have to happen in memory.

PS: I could be wrong of course ;)

Peter Simons

Nov 12, 2014, 1:06:24 PM
to haskel...@haskell.org
Hi Roman,

> With Haskell you don't have to load the whole data set into memory,
> as Michael shows. With R, on the other hand, you do.

Can you please point me to a reference to back that claim up?

I'll offer [1] and [2] as pretty good indications that you may not be
entirely right about this.


> Besides, if you're not an R expert, and if the analysis you want to do
> is not readily available, it may be quite a pain to implement in R.

Actually, implementing sophisticated queries in R is quite easy, because
the language was specifically designed for that kind of thing. If you
have experience in neither R nor Haskell, then learning R is *far*
easier than learning Haskell, because R doesn't aim to be a powerful
general-purpose programming language. It aims to be a powerful language
for data analysis.

Now, one *could* of course write a DSL in Haskell that matches R's
features and accomplishes data analysis tasks with similarly convenient
syntax. But unfortunately no such library exists, and writing one is no
trivial task.


> I still don't know an acceptable way to write something like zipWith
> f (tail vec) vec in R.

Why would that be any trouble? What kind of solutions did you find and
in what way were they unacceptable?

Best regards,
Peter


[1] http://cran.r-project.org/web/packages/ff/index.html
[2] http://cran.r-project.org/web/packages/bigmemory/index.html

Markus Läll

Nov 12, 2014, 5:17:34 PM
to Tobias Pflug, haskell-cafe

Hi Tobias,

What he could do is encode the column values as appropriately sized Words to reduce the size, so the data fits in RAM. E.g. listening times as seconds, browsers as categorical variables (in statistics terms), etc. If some of the columns are arbitrary-length strings, then it seems possible to shrink the 12GB by more than half.
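
For instance, a rough sketch of interning strings as small integers (all names and sizes here are made up):

import qualified Data.Map.Strict as M
import           Data.Word (Word16)

-- Growable lookup table mapping e.g. browser strings to compact codes.
data Interned = Interned
  { internTable :: M.Map String Word16
  , internNext  :: Word16
  }

intern :: String -> Interned -> (Word16, Interned)
intern s st@(Interned m n) =
  case M.lookup s m of
    Just code -> (code, st)
    Nothing   -> (n, Interned (M.insert s n m) (n + 1))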

If he doesn't know Haskell, then I'd suggest using another language. (Years ago I tried to do a bigger uni project in Haskell, being a noob, and failed miserably.)

Christopher Allen

Nov 12, 2014, 8:23:17 PM
to Markus Läll, haskell-cafe
I'm working on a Haskell article for https://howistart.org/ which is actually about the rudiments of processing CSV data in Haskell.

To that end, take a look at my rather messy workspace here: https://github.com/bitemyapp/csvtest

And my in-progress article here: https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md (please don't post this anywhere, incomplete!)

And here I'll link my notes on profiling memory use with different streaming abstractions: https://twitter.com/bitemyapp/status/531617919181258752

csv-conduit isn't in the test results because I couldn't figure out how to use it. pipes-csv is proper streaming, but uses cassava's parsing machinery and data types. Possibly this is a problem if you have really wide rows but I've never seen anything that would be problematic in that realm even when I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not columns. With csv-conduit you might be able to incrementally process the columns too based on my guess from glancing at the rather scary code.
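
For reference, row-by-row streaming with pipes-csv looks roughly like this (a sketch; the file name and column types are placeholders):

import           Pipes
import qualified Pipes.ByteString as PB
import           Pipes.Csv (decode)
import           Data.Csv (HasHeader(NoHeader))
import           System.IO (IOMode(ReadMode), withFile)

-- Stream rows one at a time; parse errors show up inline as Lefts.
main :: IO ()
main = withFile "usage.csv" ReadMode $ \h ->
  runEffect $
    for (decode NoHeader (PB.fromHandle h)) $ \row ->
      case (row :: Either String (String, String, Int)) of
        Left err -> lift (putStrLn ("bad row: " ++ err))
        Right _  -> return ()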

Let me know if you have any further questions.

Cheers all.

--- Chris Allen




Roman Cheplyaka

Nov 12, 2014, 9:25:38 PM
to Peter Simons, haskel...@haskell.org
On 12/11/14 09:21, Peter Simons wrote:
> Hi Roman,
>
> > With Haskell you don't have to load the whole data set into memory,
> > as Michael shows. With R, on the other hand, you do.
>
> Can you please point me to a reference to back that claim up?
>
> I'll offer [1] and [2] as pretty good indications that you may not be
> entirely right about this.

Ah, great then.

My impression was formed after listening to this FLOSS weekly episode:
http://twit.tv/show/floss-weekly/306 (starting from 33:55).

> > Besides, if you're not an R expert, and if the analysis you want to do
> > is not readily available, it may be quite a pain to implement in R.
>
> Actually, implementing sophisticated queries in R is quite easy, because
> the language was specifically designed for that kind of thing. If you
> have experience in neither R nor Haskell, then learning R is *far*
> easier than learning Haskell, because R doesn't aim to be a powerful
> general-purpose programming language. It aims to be a powerful language
> for data analysis.

That doesn't match my experience. Maybe it's just me and my
unwillingness to write C-like code that traverses arrays by indexes (I
know most scientists don't have a problem with that), but I found it
hard to express data transformations and queries functionally in R.

> > I still don't know an acceptable way to write something like zipWith
> > f (tail vec) vec in R.
>
> Why would that be any trouble? What kind of solutions did you find and
> in what way were they unacceptable?

This was a while ago, and I don't remember which solution I eventually
picked. Of course I could just write a for-loop to populate an array,
but I hadn't found anything that matched the simplicity and clarity of
the line above. How would you write it in R?

Roman

Jeffrey Brown

Nov 12, 2014, 9:44:50 PM
to Roman Cheplyaka, haskell-cafe, Peter Simons
My experience with R is that, while it is worlds more powerful than the dominant commercial alternatives (Stata, SAS), it was unintuitive relative to other general-purpose languages like Python. I wonder whether it was distorted by the pull of its statistical applications away from what would be more natural.

Brandon Allbery

Nov 12, 2014, 9:54:09 PM
to Jeffrey Brown, haskell-cafe, Peter Simons
On Wed, Nov 12, 2014 at 9:42 PM, Jeffrey Brown <jeffbr...@gmail.com> wrote:
My experience with R is that, while it is worlds more powerful than the dominant commercial alternatives (Stata, SAS), it was unintuitive relative to other general-purpose languages like Python. I wonder whether it was distorted by the pull of its statistical applications away from what would be more natural.

It is an open source implementation of S ( http://en.wikipedia.org/wiki/S_(programming_language) ) which was developed specifically for statistical applications. I would wonder how much of *that* was shaped by Fortran statistical packages....

--
brandon s allbery kf8nh                               sine nomine associates
allb...@gmail.com                                  ball...@sinenomine.net
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net

Dominic Steinitz

Nov 13, 2014, 12:45:49 AM
to haskel...@haskell.org
Tobias Pflug <tobias.pflug <at> gmx.net> writes:

>
> Hi,
>
> just the other day I talked to a friend of mine who works for an online
> radio service. He told me he is currently looking into how best to work
> with assorted usage data: currently 250 million entries, about 12GB of
> CSV, comprising information such as which channel was tuned in, for how
> long, with which user agent, and so on.
>
> He happened upon the K and Q programming languages [1][2], which
> apparently work nicely for this, unfamiliar as they might seem.
>
> This certainly is not my area of expertise at all. I was just wondering
> how some of you would suggest approaching this with Haskell. How would
> you most efficiently parse such data and evaluate custom queries over it?
>
> Thanks for your time,
> Tobi
>
> [1] http://en.wikipedia.org/wiki/K_(programming_language)
> [2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)
>

Hi Tobias,

I use Haskell and R (and Matlab) at work. You can certainly do data
analysis in Haskell; here is a fairly long example:

http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/

IIRC the dataset was about 2G so not dissimilar to the one you are
thinking of analysing. I didn't seem to need pipes or conduits but
just used cassava. The data were plotted on a map of London (yes you
can draw maps in Haskell) with diagrams and shapefile
(http://hackage.haskell.org/package/shapefile).

But R (and pandas in Python) make this sort of analysis easier. As a
small example, my data contained numbers like -.1.2 as well as dates and
times. R will happily parse these, but in Haskell you have to roll your
own parser (not that this is difficult, and "someone" ought to write a
library like pandas so that the wheel is not continually re-invented).
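
("Rolling your own" with cassava usually means a FromField instance; a sketch for a date column, where the newtype, the format string, and a reasonably recent time library are all assumptions:)

import           Data.Csv (FromField(..))
import           Data.Time (Day, defaultTimeLocale, parseTimeM)
import qualified Data.ByteString.Char8 as B8

newtype CsvDay = CsvDay Day

instance FromField CsvDay where
  -- Accept exactly YYYY-MM-DD; anything else is a parse failure.
  parseField bs =
    case parseTimeM True defaultTimeLocale "%Y-%m-%d" (B8.unpack bs) of
      Just day -> pure (CsvDay day)
      Nothing  -> fail "expected a YYYY-MM-DD date"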

Also R (and python) have extensive data analysis libraries so if
e.g. you want to apply Nelder Mead then a very well documented R
package exists; I searched in vain for this in Haskell. Similarly, if
you want to construct a GARCH model, then there is not only a package
but an active community upon whom you can call for help.

I have the benefit of being able to use this at work

http://ifl2014.github.io/submissions/ifl2014_submission_16.pdf

and I am hoping that it will be open-sourced "real soon now" but it
will probably not be available in time for your analysis.

I should also add that my workflow (for data analysis) in Haskell is
similar to that in R. I do a small amount of analysis either in a file
or at the command line and usually chart the results again using the
command line:

http://hackage.haskell.org/package/Chart
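
(For example, a minimal Chart snippet, assuming the Chart-cairo backend and placeholder data:)

import Graphics.Rendering.Chart.Easy
import Graphics.Rendering.Chart.Backend.Cairo (toFile)

-- Writes durations.png; the data points are made up for illustration.
main :: IO ()
main = toFile def "durations.png" $ do
    layout_title .= "Average listening duration per day"
    plot (line "avg seconds" [points])
  where
    points = [ (d, 100 + 10 * sin d) | d <- [1 .. 30 :: Double] ]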

I haven't had time to try iHaskell but I think the next time I have
some data analysis to do I will try it out.

http://gibiansky.github.io/IHaskell/demo.html
http://andrew.gibiansky.com/blog/haskell/finger-trees/

Finally, doing data analysis is quite different from writing quality
production code. I would imagine turning Haskell data analysis into
production code would be a lot easier than doing this in R.

Dominic.

Richard A. O'Keefe

Nov 13, 2014, 12:47:53 AM
to haskell

On 13/11/2014, at 3:21 am, Peter Simons <sim...@cryp.to> wrote:

> Hi Roman,
>
>> With Haskell you don't have to load the whole data set into memory,
>> as Michael shows. With R, on the other hand, you do.
>
> Can you please point me to a reference to back that claim up?
>
> I'll offer [1] and [2] as pretty good indications that you may not be
> entirely right about this.

It is *possible* to handle large data sets with R,
but it is *usual* to deal with things in memory.

>
>> Besides, if you're not an R expert, and if the analysis you want to do
>> is not readily available, it may be quite a pain to implement in R.

A heck of a lot of R code has been developed by people who think of
themselves as statisticians/financial analysts/whatever rather than
programmers or “R experts”. There is much to dislike about R (C-like
syntax, the ‘interactive if’ trap, the clash of naming styles), but it
has to be said that R is a very good fit for the data analysis problems
S was designed for, and I personally would find it *far* easier to
develop such a solution in R than in Haskell. (For other problems, of
course, it would be the other way around.)

Not only does R already have a stupefying number of packages offering
all sorts of analyses, so that it’s quite hard to find something that
you *have* to implement, there is an extremely active mailing list
with searchable archives and full of wizards keen to help. If you
*did* have to implement something, you wouldn’t be on your own.

The specific case of ‘zipWith f (tail vec) vec’ is easy:
(1) vec[-1] is vec without its first element;
    vec[-length(vec)] is vec without its last element.
(2) cbind(vec[-1], vec[-length(vec)]) is an array with 2 columns.
(3) apply(cbind(vec[-1], vec[-length(vec)]), 1, f) applies f to the
    rows of that matrix. If f returns one number, the answer is a
    vector; if f returns a row, the answer is a matrix.
Example:
> vec <- c(1,2,3,4,5)
> mat <- cbind(vec[-1], vec[-length(vec)])
> apply(mat, 1, sum)
[1] 3 5 7 9
In this case, you could just do
> vec[-1] + vec[-length(vec)]
and get the same answer.

Oddly enough, one of the tricks for success in R is, like Haskell,
to learn your way around the higher-order functions in the library.

Richard A. O'Keefe

Nov 13, 2014, 1:10:32 AM
to Brandon Allbery, Peter Simons, haskell-cafe

On 13/11/2014, at 3:52 pm, Brandon Allbery <allb...@gmail.com> wrote:
>
> It is an open source implementation of S ( http://en.wikipedia.org/wiki/S_(programming_language) ) which was developed specifically for statistical applications. I would wonder how much of *that* was shaped by Fortran statistical packages….

The prehistoric version of S *was* a Fortran statistical package.
While the inventors of S were familiar with GLIM, GENSTAT, SPSS, SAS, BMDP, MINITAB, &c.
they _were_ at Bell Labs, and so the language looks a lot like C.
Indeed, several aspects of S were shaped by UNIX, in particular the way S (but not R)
treats the current directory as an “outer block”.
Many (even new) R packages are wrappers around Fortran code.

However, that has had almost no influence on the language itself.
In particular:

- arrays are immutable
> (v <- 1:5)
> w <- v
> w[3] <- 33
> w
[1] 1 2 33 4 5
> v
[1] 1 2 3 4 5

- functions are first class values and higher
order functions are commonplace

- function arguments are evaluated lazily

- good style does *NOT* “traverse arrays by indexes”
but operates on whole arrays in APL/Fortran 90 style.
For example, you do not do
for (i in 1:m) for (j in 1:n) r[i,j] <- f(v[i], w[j])
but
r <- outer(v, w, f)
If you _do_ “express data transformations and queries
functionally in R” — which I repeat is native good style —
it will perform well; if you “traverse arrays by indexes”
you will wish you hadn’t. This is not something that
Fortran 66 or Fortran 77 would have taught anyone.

Let me put it this way: R is about as close to a functional
language as you can get without actually being one.
(The implementors of R consciously adopted implementation
techniques from Scheme.)

Christopher Reichert

Nov 13, 2014, 1:27:30 AM
to Christopher Allen, haskell-cafe

On Wed, Nov 12 2014, Christopher Allen <c...@bitemyapp.com> wrote:
> [Snip]
> csv-conduit isn't in the test results because I couldn't figure out how to
> use it. pipes-csv is proper streaming, but uses cassava's parsing machinery
> and data types. Possibly this is a problem if you have really wide rows but
> I've never seen anything that would be problematic in that realm even when
> I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're
> streaming rows, but not columns. With csv-conduit you might be able to
> incrementally process the columns too based on my guess from glancing at
> the rather scary code.
>

Any problems in particular? I've had pretty good luck with
csv-conduit. However, I have noticed that it's rather picky about type
signatures, and integrating custom data types isn't straightforward at
first.

csv-conduit also seems to have drawn inspiration from cassava:
http://hackage.haskell.org/package/csv-conduit-0.6.3/docs/Data-CSV-Conduit-Conversion.html

> [Snip]
> To that end, take a look at my rather messy workspace here:
> https://github.com/bitemyapp/csvtest

I've made a PR for the conduit version:
https://github.com/bitemyapp/csvtest/pull/1


It could certainly be made more performant, but it seems to hold up well
in comparison. I would be interested in reading the How I Start article
and hearing more about your conclusions. Is this focused primarily on
the memory profile, or also on speed?


Regards,
-Christopher

Christopher Allen

Nov 13, 2014, 2:24:43 AM
to Christopher Reichert, haskell-cafe
Memory profiling only, to test how stream-y the streaming was. I didn't think perf would be that different between them. The way I had to transform my fold for Pipes was a touch awkward; otherwise I'm happy with it.

If people are that interested in the perf side of things I can setup a criterion harness and publish those numbers as well.

Mostly I was impressed with:

1. How easy it was to start using the streaming module in Cassava, because it's just a Foldable instance (a sketch follows after this list).

2. How Pipes used <600kb of memory.
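
Point 1 in a nutshell (a sketch; file name and row type are placeholders):

import qualified Data.ByteString.Lazy as BL
import           Data.Csv (HasHeader(NoHeader))
import           Data.Csv.Streaming (Records, decode)
import qualified Data.Foldable as F

-- Records is Foldable, so a strict fold consumes the file incrementally
-- (the Foldable instance simply skips records that fail to parse).
main :: IO ()
main = do
  csv <- BL.readFile "usage.csv"
  let rows = decode NoHeader csv :: Records (String, String, Int)
  print (F.foldl' (\n _ -> n + 1 :: Int) 0 rows)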

Your pull request for csv-conduit looks really clean and nice. I've merged it, thanks for sending it my way!

--- Chris Allen

Michael Snoyman

Nov 13, 2014, 2:30:13 AM
to Christopher Allen, Christopher Reichert, haskell-cafe
Somewhat off topic, but: I said csv-conduit because I have some experience with it. When we were doing some analytic work at FP Complete, a few of us analyzed both csv-conduit and cassava, and didn't really have a good feel for which was the better library. We went with csv-conduit[1], but I'd be really interested in hearing a comparison of the two libraries from someone who knows about them.

[1] Don't ask what tipped us in that direction, I honestly don't remember what it was.


Tobias Pflug

Nov 13, 2014, 8:49:01 AM
to Christopher Allen, Markus Läll, haskell-cafe
On 13.11.2014 02:22, Christopher Allen wrote:
> I'm working on a Haskell article for https://howistart.org/ which is
> actually about the rudiments of processing CSV data in Haskell.
>
> To that end, take a look at my rather messy workspace here:
> https://github.com/bitemyapp/csvtest
>
> And my in-progress article here:
> https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md
> (please don't post this anywhere, incomplete!)
>
> And here I'll link my notes on profiling memory use with different
> streaming abstractions:
> https://twitter.com/bitemyapp/status/531617919181258752
>
> csv-conduit isn't in the test results because I couldn't figure out
> how to use it. pipes-csv is proper streaming, but uses cassava's
> parsing machinery and data types. Possibly this is a problem if you
> have really wide rows but I've never seen anything that would be
> problematic in that realm even when I did a lot of HDFS/Hadoop
> ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not
> columns. With csv-conduit you might be able to incrementally process
> the columns too based on my guess from glancing at the rather scary code.
>
> Let me know if you have any further questions.
>
> Cheers all.
>
> --- Chris Allen
>
>
Thank you, this looks rather useful. I will definitely have a closer
look at it. I'm surprised that csv-conduit was so troublesome; I was in
fact expecting/hoping for the opposite. I will just give it a try.

Thanks also to everyone else who replied. Let me add some tidbits to
refine the problem space a bit. As I said before, the data is around
12GB of CSV files, one file per month, with each line representing a
user tuning in to a stream:

[date-time-stamp], [radio-stream-name], [duration], [mobile|desktop],
[country], [areaCode]

which could be represented as:

data RadioStat = RadioStat
  { rStart    :: Integer -- POSIX time stamp
  , rStation  :: Integer -- index into station map
  , rDuration :: Integer -- duration in seconds
  , rAgent    :: Integer -- index into agent map ("mobile", "desktop", ..)
  , rCountry  :: Integer -- index into country map ("DE", "CH", ..)
  , rArea     :: Integer -- German geo location info
  }

I guess parsing a CSV into a [RadioStat] list, plus respective entries
in a HashMap for the station names, should work just fine (thanks again
for your linked material, Chris).

While this is straightforward, the type of queries I was given as
examples might indicate that I should not try to reinvent a query
language but look for something else (?). Examples would be (a sketch of
the first one follows below):

- summarize per day: total listening duration, average listening
  duration, number of listening actions
- summarize per day per agent: total listening duration, average
  listening duration, number of listening actions
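
As a sketch of the first query (assuming the RadioStat type above, and bucketing days by dividing the POSIX stamp, which is an assumption):

import qualified Data.HashMap.Strict as HM
import           Data.List (foldl')

type DayIndex = Integer

-- (total duration, number of listening actions) per day; the average
-- duration is then total divided by count.
summarizePerDay :: [RadioStat] -> HM.HashMap DayIndex (Integer, Integer)
summarizePerDay = foldl' step HM.empty
  where
    step m r = HM.insertWith plus (rStart r `div` 86400) (rDuration r, 1) m
    plus (d1, c1) (d2, c2) = (d1 + d2, c1 + c2)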

I don't think MySQL would perform all that well operating on a table
with 125 million entries ;] What approach would you guys take?

Thanks for your input and sorry for the broad scope of these questions.
best wishes,
Tobi

Christopher Allen

Nov 13, 2014, 11:49:52 AM
to Tobias Pflug, haskell-cafe
I wouldn't hold it against csv-conduit too much; conduit and Pipes both take some getting used to, and I hadn't used either in anger before I started kicking around the CSV parsing stuff. I was a bit spoiled by how easy Cassava was to use as well.

Thanks to Christopher Reichert's PR, there is an example for csv-conduit as well, so you've now got four ways to try processing CSV, *three* of which are streaming :)

I'd say just try each in turn and see what you're happy with, if you're not married to a particular streaming operation.

> I don't think MySQL would perform all that well operating on a table
> with 125 million entries ;] What approach would you guys take?

Big enough machine with enough memory and it's fine. I used to keep a job queue with a billion rows on MySQL at a gig long ago. Could do it with PostgreSQL pretty easily too. On your personal work machine? I dunno.

Not trying to steer you away from using Haskell here by any means, but if you can process your data in a SQL database efficiently, that's often pretty optimal in terms of speed and ease of use, until you start doing more sophisticated analysis. I don't have a lot of experience in data analysis, but I've known people to do some preliminary slicing and dicing in SQL before moving on to building a custom model for understanding the data.

Cheers,
Chris Allen

Tobias Pflug

Nov 13, 2014, 1:46:49 PM
to Christopher Allen, haskell-cafe


> Big enough machine with enough memory and it's fine. I used to keep a
> job queue with a billion rows on MySQL at a gig long ago. Could do it
> with PostgreSQL pretty easily too. On your personal work machine? I
> dunno.
>
> Not trying to steer you away from using Haskell here by any means, but
> if you can process your data in a SQL database efficiently, that's
> often pretty optimal in terms of speed and ease of use, until you start
> doing more sophisticated analysis. I don't have a lot of experience in
> data analysis, but I've known people to do some preliminary slicing and
> dicing in SQL before moving on to building a custom model for
> understanding the data.

I guess I was just curious what a sensible approach using Haskell would look like, and I'll play
around with what I know now. If this were for my workplace, I'd just put it in a database with
enough horsepower, but it's just my curiosity in my spare time, alas..

thank you for your input.

Mark Fredrickson

Nov 13, 2014, 10:41:02 PM
to haskell-cafe
Is there a mailing list for statistics/analytics/simulation/numerical
analysis/etc. using Haskell? If not, I propose we start one. (Not to
take away from the general discussion, but to provide a forum to hash
out these issues among the primary user base.)

-M

kuj...@gmail.com

Nov 14, 2014, 2:54:08 AM
to haskel...@googlegroups.com, haskel...@haskell.org, mark.m.fr...@gmail.com

Is there a mailing list for statistics/analytics/simulation/numerical
analysis/etc. using Haskell? If not, I propose we start one.

+1

Chris Allen

Nov 14, 2014, 1:13:33 PM
to Mark Fredrickson, haskell-cafe
There is #numerical-haskell on Freenode, and an NLP mailing list, I believe.

Sent from my iPhone

Wojtek Narczyński

Nov 14, 2014, 5:49:36 PM
to haskel...@haskell.org

On 13.11.2014 10:37, Tobias Pflug wrote:
>
> data RadioStat = RadioStat
>   { rStart    :: Integer -- POSIX time stamp
>   , rStation  :: Integer -- index into station map
>   , rDuration :: Integer -- duration in seconds
>   , rAgent    :: Integer -- index into agent map ("mobile", "desktop", ..)
>   , rCountry  :: Integer -- index into country map ("DE", "CH", ..)
>   , rArea     :: Integer -- German geo location info
>   }
Could you show a sample record or two? It would be an interesting
exercise to calculate how many bits of information there are vs. how
many bits Haskell will need.

--
Wojtek

Dominic Steinitz

Nov 15, 2014, 1:55:08 AM
to haskel...@haskell.org
Mark Fredrickson <mark.m.fredrickson <at> gmail.com> writes:

>
> Is there a mailing list for statistics/analytics/simulation/numerical
> analysis/etc. using Haskell? If not, I propose we start one. (Not to
> take away from the general discussion, but to provide a forum to hash
> out these issues among the primary user base.)

Sadly not, but I think there are sufficient numbers of people
interested in this subject that it is probably worth setting one up. I
really don't like the Google Groups experience, but maybe that is the
best place to start?

Dominic Steinitz

Nov 16, 2014, 9:07:14 AM
to Ben Gamari, haskel...@haskell.org
I’d much prefer that. I really dislike the Google Groups experience. I’ll drop Austin a note.

Dominic Steinitz
dom...@steinitz.org
http://idontgetoutmuch.wordpress.com

On 15 Nov 2014, at 15:42, Ben Gamari <b...@smart-cactus.org> wrote:

> On November 15, 2014 1:54:38 AM EST, Dominic Steinitz <dom...@steinitz.org> wrote:
>> Mark Fredrickson <mark.m.fredrickson <at> gmail.com> writes:
>>
>>>
>>> Is there a mailing list for statistics/analytics/simulation/numerical
>>> analysis/etc. using Haskell? If not, I propose we start one. (Not to
>>> take away from the general discussion, but to provide a forum to hash
>>> out these issues among the primary user base.)
>>
>> Sadly not, but I think there are sufficient numbers of people
>> interested in this subject that it is probably worth setting one up. I
>> really don't like the Google Groups experience, but maybe that is the
>> best place to start?
>>
> I agree that this would be a worthwhile forum to have. Why not just stay with Haskell.org infrastructure? I'm sure Austin would set up a mailing list for the cause.
>
> Cheers,
>
> - Ben

Alp Mestanogullari

Nov 16, 2014, 9:52:06 PM
to Dominic Steinitz, The Haskell Cafe
+1 for the mailing list suggestion. In addition to the obvious reasons why this would be a good idea, this would also let us coordinate efforts in the numerical computing / AI space to get a somewhat compatible/consistent ecosystem.
--
Alp Mestanogullari

Carter Schonwald

Nov 16, 2014, 11:50:13 PM
to Alp Mestanogullari, Dominic Steinitz, The Haskell Cafe

Stian Håklev

Nov 21, 2014, 3:30:09 PM
to haskel...@googlegroups.com, alpm...@gmail.com, dom...@steinitz.org, haskel...@haskell.org, carter.s...@gmail.com, sdiehl/frame
I would really like this list's help with designing some kind of minimal dataframe-like Haskell structure. I have done a lot of work with R and some with IPython/Pandas, and I would love to bring this into Haskell. It's also fun to play with data, and Haskell has a lot of attributes that would make it ideal for this (as people have already noted) if we could make it more interactive and cut down the boilerplate. I would love to take some popular IPython showcases and experiment with "translating" them to Haskell, looking at what kinds of libraries, higher-level APIs, sugar, etc. are needed to make them look just as concise.

Most of my work is analyzing mixed data, often questionnaires (with numerical, enums (Yes/No, Male/Female, Never/Sometimes/Always) and string fields), sometimes web logs, forum contents etc. It's not necessarily huge data, but I did a project with 20 million rows of MOOC clicklog data. I wrote up a bit of my workflow in R here http://reganmian.net/blog/2014/10/14/starting-data-analysiswrangling-with-r-things-i-wish-id-been-told/, including a brief video about using RStudio, whose interface I actually prefer to IPython.

I have been looking everywhere for something that comes close to the flexibility and ease of use of R data frames (or Pandas, or Julia's data.frame) in Haskell. I realize the challenges with the type system, but I also keep hearing people on IRC or mailing lists mention that this shouldn't be too hard, with HLists or HRecords etc. (The question of a data.frame for Haskell seems to have come up regularly over the last few years.)

I spent some time looking at these libraries, but I really struggled to understand them. I am a beginning Haskeller, and still struggle with type-level programming etc. It's made worse by the lack of documentation: I spent quite a lot of time looking at the records package with Data.Records, and even after reading the accompanying paper, I still could not come up with a minimal example that works in ghci. (What is KindStar? How do you construct a name?)... I even searched GitHub for projects using records, to try to understand how it works.

I was very excited when I came upon Stephen Diehl's frame library (https://github.com/sdiehl/frame). Not only did it have a minimal example in the README, but it looked like a very nice API:

λ: Right frame <- fromCsvHeaders "examples/titanic.csv"
λ: frame & cols ["sex", "name", "survived", "age", "boat"] . rows [1..20]
-- pretty printed table
λ: let Success ages = frame ^. get "age" :: Result [Maybe Double]
λ: take 5 ages
[Just 29.0,Just 0.916700006,Just 2.0,Just 30.0,Just 25.0]
λ: avg $ catMaybes ages

I spent quite a bit of time figuring out why it wouldn't install, and fixing it with some of my first pull requests for a Haskell library :) And I began planning to write an IHaskell.Display instance for the library, so that we could get nice HTML tables for free. I wanted to create a Criterion suite to test with large CSV files, experiment with connecting it to the statistics library, look at making it easy to graph using Chart, etc.

But the discussion in this pull request poured cold water on those ideas: https://github.com/sdiehl/frame/commit/f7c3ef88036f039d931044e613ac53966533b0fc#commitcomment-8648380

Basically we want to be able to use strongly typed functions on frames. For example, I have a frame of type Frame [Int, String, Float, String] (never mind the actual underlying implementation, whether it is a record with a map of vectors like now, or a vector of records, or something else).

The easiest would be to apply a function of, for example, type Int -> Int, so that
Frame [.. Int] -> (Int -> Int) -> Frame [.. Int]

(in this case I use .. to represent all the other columns, whose types we don't worry about, since we leave them alone (id))

But I should also be able to do 
Frame [.. String] -> (String -> Int) -> Frame [.. String, Int]

i.e. run the transformation and add the new column

The function could also rely on two or several columns:
Frame [.. Int, Int, Int] -> (Int -> Int -> Int -> Int) -> Frame [.. Int, Int, Int, Int]

The other things I mentioned in my message to sdiehl:

---

What would be very useful for me are examples (if this is already possible given the lens API) of

  • selecting rows based on a single column predicate (the equivalent of db[db$age > 30,] in R; a toy sketch follows below)
  • selecting rows based on multiple column predicates (the equivalent of db[db$age > 30 && db$weight >50,])
  • creating a new hframe where a row has been modified by a function (equivalent of db$age = db$age * 2, but functionally) - or something like newframe = fmap (* 2) (frame ! "age")
  • creating a new hframe with an added column calculated based on one or more existing columns (the equivalent of db$derived = db$age / db$weight, but functionally)
  • an example of groupBy
---
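
To make the first item concrete, here is a toy sketch (emphatically not the frame library's actual API) of predicate-based row selection over an all-Double, column-oriented frame:

import qualified Data.Map.Strict as M
import qualified Data.Vector as V

-- Toy frame: named columns, all Double, assumed to have equal length.
type ToyFrame = M.Map String (V.Vector Double)

-- Keep the rows where p holds for the named column, roughly
-- db[db$age > 30,] in R, e.g. filterRows "age" (> 30) frame.
filterRows :: String -> (Double -> Bool) -> ToyFrame -> ToyFrame
filterRows name p frame =
  case M.lookup name frame of
    Nothing  -> frame                    -- unknown column: leave frame as-is
    Just col ->
      let keep = V.findIndices p col     -- indices of surviving rows
      in M.map (`V.backpermute` keep) frame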
If anyone could help me change frame into, or come up with, a new structure that lets me do these things with minimal overhead, I would be incredibly grateful. Even if the implementation is a bit kludgey and slow underneath, just giving us a chance to experiment with APIs, programming patterns, interfacing with other libraries etc. would be very useful!

Thank you
Stian
