[racket] DSL for multi-dimensional datasets?

Simon Haines

Nov 5, 2012, 11:22:49 PM
to us...@racket-lang.org
As part of my work, I frequently have to 'shape' multi-dimensional datasets. This is reasonably easy to do in Racket and I'm thinking about pulling together some of the functions I use into a library. Before I do this though, I was wondering if there is any similar work I can build upon, or perhaps use to guide me.

As an example of what I mean, I'll receive from a colleague a file like this:

Date, Site, Total Alkalinity as CaCO3 (mg/L), Carbonate as CaCO3 (mg/L),
1-Nov-12, BH1, 120, <5
1-Nov-12, BH2, 180, <5
1-Nov-12, BH3, 160, <5
26-Oct-12, BH1, 150, <1
26-Oct-12, BH2, 165, 0
26-Oct-12, BH3, 180, <5

(This is a laboratory analysis of water sampled from bore holes.)

This file is composed of two datasets (a set each of total alkalinity and carbonate), with shared dimensions of 'date' and 'site'. I'll often deal with files containing up to 80 datasets.
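
A minimal sketch of how such a file could be read into rows keyed by column name, assuming a naive comma split (no quoted fields) and using only the first few sample rows. The names `split-line`, `parse-table` and `rows` are hypothetical, not part of any existing library:

```racket
#lang racket

;; Sample text copied from the file shown above (header plus three data rows).
(define sample
  (string-join
   '("Date, Site, Total Alkalinity as CaCO3 (mg/L), Carbonate as CaCO3 (mg/L)"
     "1-Nov-12, BH1, 120, <5"
     "1-Nov-12, BH2, 180, <5"
     "26-Oct-12, BH1, 150, <1")
   "\n"))

;; Split one line on commas and trim the whitespace around each field.
;; Assumption: fields never contain commas or quotes.
(define (split-line line)
  (map string-trim (string-split line ",")))

;; parse-table : string -> (listof (listof (cons string string)))
;; Each data row becomes an association list keyed by the header names.
(define (parse-table text)
  (define lines
    (filter non-empty-string? (map string-trim (string-split text "\n"))))
  (define header (split-line (first lines)))
  (for/list ([line (in-list (rest lines))])
    (map cons header (split-line line))))

(define rows (parse-table sample))

;; Look up the 'Site' dimension of the first row.
(cdr (assoc "Site" (first rows)))
```

In this representation 'Date' and 'Site' are the shared dimensions and the remaining keys are the individual datasets.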

More often than not, all I'll need to do is 'shape' these datasets into a format that can be pulled into a spreadsheet for further analysis/graphing. One example is:

"", Total Alkalinity as CaCO3 (mg/L), Carbonate as CaCO3 (mg/L)
BH1
1-Nov-12, 120, <5
26-Oct-12, 150, <1
BH2
1-Nov-12, 180, <5
26-Oct-12, 165, 0
BH3
1-Nov-12, 160, <5
26-Oct-12, 180, <5

Another example:

"", BH1, BH2, BH3
Total Alkalinity as CaCO3 (mg/L)
1-Nov-12, 120, 180, 160
26-Oct-12, 150, 165, 180
Carbonate as CaCO3 (mg/L)
1-Nov-12, <5, <5, <5
26-Oct-12, <1, 0, <5

As you can see, the recursive nature of these reports makes them ideal for processing with Racket, and although it takes me a little while to get the format of a report right, I usually can add the report to my toolbox for whenever it's needed later.
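
As a rough illustration of the first report shape (sites as section headings, one line per date), here is a sketch that works on rows stored as association lists. The helpers `make-row`, `ref` and `report-by-site` are hypothetical names chosen for this example only:

```racket
#lang racket

(define alk "Total Alkalinity as CaCO3 (mg/L)")
(define carb "Carbonate as CaCO3 (mg/L)")

;; Build one row as an association list keyed by column name.
(define (make-row date site a c)
  (list (cons "Date" date) (cons "Site" site) (cons alk a) (cons carb c)))

;; Rows transcribed from the sample file.
(define rows
  (list (make-row "1-Nov-12" "BH1" "120" "<5")
        (make-row "1-Nov-12" "BH2" "180" "<5")
        (make-row "1-Nov-12" "BH3" "160" "<5")
        (make-row "26-Oct-12" "BH1" "150" "<1")
        (make-row "26-Oct-12" "BH2" "165" "0")
        (make-row "26-Oct-12" "BH3" "180" "<5")))

;; Value stored under key in a row (assumes the key is present).
(define (ref row key) (cdr (assoc key row)))

;; report-by-site : rows (listof string) -> (listof string)
;; Emits a header line, then for each site a heading line followed by
;; one "date, value, ..." line per row belonging to that site.
(define (report-by-site rows variables)
  (define sites
    (remove-duplicates (map (lambda (r) (ref r "Site")) rows)))
  (cons
   (string-join (cons "\"\"" variables) ", ")
   (append*
    (for/list ([site (in-list sites)])
      (cons site
            (for/list ([r (in-list rows)]
                       #:when (equal? (ref r "Site") site))
              (string-join
               (cons (ref r "Date")
                     (map (lambda (v) (ref r v)) variables))
               ", ")))))))

(define report (report-by-site rows (list alk carb)))
(for-each displayln report)
```

Swapping the grouping dimension or the list of variables gives other report shapes without changing the row representation.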

So I've started drafting what I think a good DSL for doing this type of task might be, something like:
(define-dataset
  (date (date "dd-MM-yyyy"))
  (site (text))
  (parameter (text)) ...)

(define-report example1
  (columns (parameter ...))
  (rows ((site) date)))

I haven't worked out the details yet, and I'm not sure the above will work the way I want it to. I've had a quick look at Microsoft's Scientific DataSet (http://sds.codeplex.com/), but it lacks the composability I'm used to with Racket. Is anyone aware of any similar work that does this, or that I could use as a guide?

Thanks,
Simon.

Jay McCarthy

Nov 5, 2012, 11:29:56 PM
to simon....@con-amalgamate.net, users
I would suggest looking into PADS as well:

http://www.padsproj.org/doc.html

--
Jay McCarthy <j...@cs.byu.edu>
Assistant Professor / Brigham Young University
http://faculty.cs.byu.edu/~jay

"The glory of God is Intelligence" - D&C 93
____________________
Racket Users list:
http://lists.racket-lang.org/users

Asumu Takikawa

Nov 5, 2012, 11:32:35 PM
to Simon Haines, us...@racket-lang.org
On 2012-11-06 15:22:49 +1100, Simon Haines wrote:
> As part of my work, I frequently have to 'shape' multi-dimensional
> datasets. This is reasonably easy to do in Racket and I'm thinking
> about pulling together some of the functions I use into a library.
> Before I do this though, I was wondering if there is any similar work I
> can build upon, or perhaps use to guide me.
>
> [...]
>
> I haven't worked out the details yet, and I'm not sure the above will
> work the way I want it to. But I've had a quick look at Microsoft's
> Scientific DataSet (http://sds.codeplex.com/), but it lacks the
> composability I'm used to with Racket. Is anyone aware of any similar
> work that does this, or that I could use as a guide?

I don't know about Racket, but have you seen the 'reshape' library in R?
It's very flexible and is probably one of the state-of-the-art designs
in this space.

Here's a journal article describing its design:
http://www.jstatsoft.org/v21/i12/paper

and its website:
http://had.co.nz/reshape/

Cheers,
Asumu

Matthias Felleisen

Nov 6, 2012, 1:38:30 PM
to Asumu Takikawa, us...@racket-lang.org, Simon Haines

Perhaps the right approach is to migrate/adapt/port the R library to Racket?

That way you get what you need, plus experience in building a DSL, plus the power and speed of Racket.

Simon Haines

Nov 6, 2012, 6:21:35 PM
to Asumu Takikawa, us...@racket-lang.org
Thanks Asumu for these links. Although the code in the paper is confusing because I'm not familiar with R, it has given me a good insight: datasets need to be described as dimensions and variables. I think the library presented in the paper conflates the structure of the data as read (in a CSV file, say) with the logical structure of the dataset as a whole. (I may be wrong on this point, but that is my reading.) I think these concepts should be separated, and similarly the structure of a report is separate again.
However, having given it a little thought after reading the paper, I think there's a good way forward by describing datasets as dimensions and variables, and then incorporating relational algebra primitives, particularly σ (selection), π (projection) and G (group by). I'll brush up on the Codd model and see if that gives me any further insights.
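
To make the three primitives concrete, here is a small sketch over rows stored as association lists. It is only one possible reading of the idea; `rel-select`, `rel-project`, `rel-group` and `mean-alkalinity` are hypothetical names, and the numeric alkalinity values are taken from the sample file:

```racket
#lang racket

(define rows
  '(((date . "1-Nov-12")  (site . "BH1") (alkalinity . 120))
    ((date . "1-Nov-12")  (site . "BH2") (alkalinity . 180))
    ((date . "26-Oct-12") (site . "BH1") (alkalinity . 150))
    ((date . "26-Oct-12") (site . "BH2") (alkalinity . 165))))

;; Value stored under key in a row (assumes the key is present).
(define (ref row key) (cdr (assoc key row)))

;; Selection: keep the rows that satisfy a predicate.
(define (rel-select pred rows) (filter pred rows))

;; Projection: keep only the named attributes, dropping duplicate rows
;; so the result behaves like a set.
(define (rel-project keys rows)
  (remove-duplicates
   (for/list ([row (in-list rows)])
     (for/list ([k (in-list keys)]) (assoc k row)))))

;; Group by: partition rows on the values of keys, then aggregate
;; each group. Returns a list of (key-values . aggregate) pairs.
(define (rel-group keys aggregate rows)
  (define (key-of row) (map (lambda (k) (ref row k)) keys))
  (for/list ([kv (in-list (remove-duplicates (map key-of rows)))])
    (cons kv
          (aggregate
           (filter (lambda (r) (equal? (key-of r) kv)) rows)))))

;; Example aggregate: exact mean of the alkalinity values in a group.
(define (mean-alkalinity group)
  (/ (apply + (map (lambda (r) (ref r 'alkalinity)) group))
     (length group)))

(define bh1 (rel-select (lambda (r) (equal? (ref r 'site) "BH1")) rows))
(define sites (rel-project '(site) rows))
(define means (rel-group '(site) mean-alkalinity rows))
```

Because each primitive takes rows and returns rows (or grouped rows), they can be nested freely, which is the composability being asked for.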
Thanks again,
Simon.

Simon Haines

Nov 6, 2012, 6:26:42 PM
to Jay McCarthy, users
Thanks Jay for this link. This is a most comprehensive system that seems to cover what I need, and a fair bit more as well. It will take me a while to get through the manual but I hope there are nuggets of insight contained within.
One thing surprises me though: there seem to be as many different approaches to this problem as there are implementations. There doesn't even seem to be a canonical language for describing multi-dimensional sets (that I've discovered, at any rate). I'll keep at it and post progress to this list. Thanks again,
Simon.

Simon Haines

Nov 6, 2012, 6:30:59 PM
to Matthias Felleisen, us...@racket-lang.org
The power of Racket, particularly its ability to compose map, fold, and filter, is exactly what I need. I'm not sure the R library is particularly transferable to Racket, but I'll need to study it more to be sure. I'll start down the path of implementing a few relational algebra primitives over a generalised dataset structure (well, an assoc list) and see how far that gets me. I'll keep the list informed.
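
For illustration, a sketch of how map, fold and filter might combine over an assoc-list dataset to produce a date-by-site table for one variable. The function names `levels`, `index-by-date-site` and `pivot` are hypothetical, and the data is the alkalinity column from the sample file:

```racket
#lang racket

(define rows
  '(((date . "1-Nov-12")  (site . "BH1") (alkalinity . "120"))
    ((date . "1-Nov-12")  (site . "BH2") (alkalinity . "180"))
    ((date . "1-Nov-12")  (site . "BH3") (alkalinity . "160"))
    ((date . "26-Oct-12") (site . "BH1") (alkalinity . "150"))
    ((date . "26-Oct-12") (site . "BH2") (alkalinity . "165"))
    ((date . "26-Oct-12") (site . "BH3") (alkalinity . "180"))))

;; Value stored under key in a row (assumes the key is present).
(define (ref row key) (cdr (assoc key row)))

;; Distinct values of a dimension, in order of first appearance.
(define (levels key rows)
  (remove-duplicates (map (lambda (r) (ref r key)) rows)))

;; Fold the rows into an immutable hash keyed by (date . site).
(define (index-by-date-site variable rows)
  (foldl (lambda (r acc)
           (hash-set acc (cons (ref r 'date) (ref r 'site)) (ref r variable)))
         (hash)
         rows))

;; pivot : one output row per date, one column per site.
;; Missing date/site combinations are filled with "".
(define (pivot variable rows)
  (define cells (index-by-date-site variable rows))
  (define sites (levels 'site rows))
  (for/list ([d (in-list (levels 'date rows))])
    (cons d (map (lambda (s) (hash-ref cells (cons d s) "")) sites))))

(define table (pivot 'alkalinity rows))

;; Filtering first restricts the table to a subset of sites.
(define without-bh3
  (pivot 'alkalinity
         (filter (lambda (r) (not (equal? (ref r 'site) "BH3"))) rows)))
```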
Thanks,
Simon.
