Hi everyone,
I just released a new version of the F# Data library. One new thing is that the CSV provider also generates a couple of methods for working with the data once it is loaded (e.g. the Filter and Truncate methods; see "Transforming CSV Files" in [1]).
While these are nice, I feel that this is a bit of an ad hoc extension (and that it does not logically belong to the CSV type provider). After discussing this with Don and Howard Mansell, I realized that what we actually need is a library representing a "data frame" in F#. So, I would like to start the discussion about this.
In dynamic languages, a data frame is quite an easy thing (see Python [2] or R [3]). Assuming I have three sequences of values (or vectors) with dates (dates) and prices (opens and closes), I can write something like this in R:
# Create a data frame with 3 columns named Date, Open and Close
df = data.frame(Date=dates, Open=opens, Close=closes)
# Drop the Close column from the data frame
df_do = df[, c("Date", "Open")]
# Take the first 100 rows from the data frame
df_dosm = df_do[1:100, ]
# Add column with averages over 5 day floating window
# (rollmean comes from the zoo package; fill = NA keeps the column length)
df$FloatingClose = rollmean(df$Close, 5, fill = NA)
# Calculate the mean of all columns in the data frame
mean(df)
The question is, how can we get something like this in a type-safe (as much as possible) way in F#? Our current CSV data type lets you do a couple of things, but:
· It does not implement more mathematical features (like mean, variance, …) because the goal was not to implement a data frame (but it would be useful: https://github.com/fsharp/FSharp.Data/issues/62)
· It does not support adding/removing columns. We could generate DropXyz method for every column Xyz, but adding them is a bit harder (perhaps we could do this if we required the user to give a list of all columns they want to use in the whole file).
· We do not really provide any decent API for constructing data frames at the moment (again, CSV provider is about reading CSV data).
So, I think it would be really nice to have something like FSharp.Data.DataFrame inside F# Data (or elsewhere), integrate this with the CSV provider, but add all other functionality that is needed by a proper data frame library (but not necessarily required by a CSV provider).
I would like to hear your feedback on this. Do people think that a type-safe data frame using F# type providers is a good idea? Would you be happy to help? Any suggestions about the design of this? What things do you like/dislike about data frames in R, Pandas, Matlab or elsewhere?
(I imagine it would look something like the code below.)
Thanks!
Tomas
[1] http://fsharp.github.io/FSharp.Data/library/CsvProvider.html
[3] http://www.r-tutor.com/r-introduction/data-frame
PS: I have not thought about this significantly, but I think this might be doable:
type DF = DataFrame<"Date (date), Open (decimal), Close (decimal), FloatingClose (float)">
// Create a data frame with 3 columns named Date, Open and Close
// Perhaps we can use some overloading to just say:
// DF.Create(Date=dates, Open=opens, Close=closes)
// but I’m not entirely sure if this would work. We want
// df.Open to be defined, but df.FloatingClose not to be!
let df = DF.Create().WithDate(dates).WithOpen(opens).WithClose(closes)
// Drop the Close column from the data frame
let df_do = df.DropClose()
// Take the first 100 rows from the data frame
let df_dosm = df_do[1 .. 100]
// Add column with averages over 5 day floating window
// (This can work on the data-frame as a whole, so we do not
// need to talk about individual columns – but here I just project
// Close at the end.)
let winCloses = df.Windowed(5).Map(fun win -> win.Mean()).Close
let df = df.WithFloatingClose(winCloses)
// Calculate the mean of all columns in the data frame
let dfMeans = df.Mean()
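The floating-window step in the sketch above does not depend on any new library and can be tried with plain F# core functions. The following is only a minimal illustration on a list of floats; the sample values in `closes` are made up, and nothing here is the proposed DataFrame API.

```fsharp
// Hypothetical sample data standing in for the `closes` vector.
let closes = [ 10.0; 11.0; 12.0; 13.0; 14.0; 15.0; 16.0 ]

// Seq.windowed yields every run of 5 consecutive values as an array;
// Array.average then reduces each window to its mean.
let floatingClose =
    closes
    |> Seq.windowed 5
    |> Seq.map Array.average
    |> List.ofSeq

// The result is 4 items shorter than the input, so a data frame would
// have to decide how to pad or align it before adding it as a column.
printfn "%A" floatingClose
```

The length mismatch is the interesting design point: a column-adding method such as the hypothetical WithFloatingClose needs some rule for rows that have no complete window.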
I find in R that my dataframes tend to evolve as I go. Typically I'll be creating columns and then renaming them progressively. They become living data structures. Very useful.
As long as they're easy to manipulate interactively, including adding and deleting something that looks like a column.
Thanks very much for the detailed reply about BlueMountain’s data frame!
As I had a chance to play with the library, I found it very powerful and easy to use. I think the number of features that you described shows why I think this should be separated from the CSV provider itself – there are so many useful things that the data-frame should do (but that ordinary users of CSV provider might not need).
I think time-series alignment, heterogeneous values, and missing data handling are certainly things that should be supported.
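To make the alignment and missing-data point concrete, here is a minimal sketch of an outer join of two date-keyed series, with `option` marking missing observations. The `Series` alias, the `align` function and the sample values are all hypothetical names for illustration; they are not taken from BlueMountain's library or from F# Data.

```fsharp
open System

// Hypothetical series type: observations keyed by date.
type Series = Map<DateTime, float>

// Outer-join two series on their keys. A date missing from one side
// yields None there instead of being dropped or filled with a sentinel.
let align (left: Series) (right: Series) : Map<DateTime, float option * float option> =
    let keys =
        Seq.append (Map.toSeq left |> Seq.map fst) (Map.toSeq right |> Seq.map fst)
        |> Seq.distinct
    keys
    |> Seq.map (fun k -> k, (Map.tryFind k left, Map.tryFind k right))
    |> Map.ofSeq

let day d = DateTime(2013, 6, d)
let opens : Series = Map.ofList [ (day 3, 10.0); (day 4, 11.0) ]
let closes : Series = Map.ofList [ (day 4, 11.5); (day 5, 12.5) ]

let aligned = align opens closes
for KeyValue(date, (o, c)) in aligned do
    printfn "%s open=%A close=%A" (date.ToString("yyyy-MM-dd")) o c
```

Using `option` pushes the missing-data decision to the caller, which is type-safe but more verbose than the NA handling R users are used to.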
As for type safety, I think we could follow the same design that other F# Data type providers follow: there is a dynamic underlying implementation (e.g. with the ? and ?<- operators), and on top of that we could build a type-safe wrapper (perhaps using type providers) that you may or may not use. (Perhaps in your scenario you would start with the dynamic API and then add these additional types to guarantee more safety?)
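The layering described above could be sketched roughly as follows. This is an assumption-laden illustration, not an existing API: `DynamicFrame` and `TypedView` are hypothetical names, and the typed view is written by hand here, whereas the idea in the thread is that a type provider might generate it.

```fsharp
open System.Collections.Generic

// Hypothetical untyped frame: column names map to boxed arrays.
type DynamicFrame() =
    let columns = Dictionary<string, obj>()
    member this.Get(name: string) : obj = columns.[name]
    member this.Set(name: string, values: obj) = columns.[name] <- values
    member this.Remove(name: string) = columns.Remove(name) |> ignore
    member this.ColumnNames = List.ofSeq columns.Keys

// Dynamic lookup: df?Open reads a column and unboxes it to the type
// expected at the call site (this fails at run time if the type is wrong).
let (?) (frame: DynamicFrame) (name: string) : 'T = unbox<'T> (frame.Get(name))

// Dynamic assignment: df?Open <- values adds or replaces a column.
let (?<-) (frame: DynamicFrame) (name: string) (values: 'T) = frame.Set(name, box values)

let df = DynamicFrame()
df?Open <- [| 10.0; 11.0; 12.0 |]
df?Close <- [| 10.5; 11.5; 12.5 |]

// Columns can be added and dropped freely, much as in R.
let opens : float[] = df?Open
let closes : float[] = df?Close
df?Spread <- Array.map2 (fun (o: float) (c: float) -> c - o) opens closes
df.Remove("Close")

// Optional typed view over the same data: a mistyped member name here
// is a compile-time error rather than a run-time one.
type TypedView(frame: DynamicFrame) =
    member this.Open : float[] = frame?Open
    member this.Spread : float[] = frame?Spread

let typed = TypedView(df)
printfn "%A %A" typed.Open typed.Spread
```

The typed view only checks names and types at the point where it is declared; if a column is later removed dynamically, access through the view still fails at run time, which is part of the type-safe vs. easy-to-use trade-off.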
Thanks!
Tomas
--
--
You received this message because you are subscribed to the Google
Groups "FSharp Open Source Community" group.
To post to this group, send email to fsharp-o...@googlegroups.com
To unsubscribe from this group, send email to
fsharp-opensou...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/fsharp-opensource?hl=en
---
You received this message because you are subscribed to a topic in the Google Groups "FSharp Open Source Community" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/fsharp-opensource/3mb3HO3AvzA/unsubscribe?hl=en-US.
To unsubscribe from this group and all its topics, send an email to fsharp-opensou...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
I am glad that this particular structure is getting attention, as it highlights a few points that make dynamic languages such versatile and privileged tools. They do have strengths and people use them for a reason, but where exactly? Taking R as a vantage point is a fruitful place to start.
****
One trait to consider is the whole chain in which the data gets used, as it is quite a different beast from your standard types. For instance, the same data often gets replicated to different stores of differing reliability. Just as we have a type system to assert our assumptions about types, data in a workflow goes through a similar refinement of shape and quality. As a result, different repositories holding the same data can be given authority for different purposes (official accounting, trading dataset, third party) or optimised for different access patterns.
But going back even before the access patterns, what is really driving the use of the data frame structure at the application level in the first place?
This is the real question to be asked.
*****
One can find a strong hint of an answer in Hadley Wickham's work. His operations, shape and melt, are all about exploiting the kind of flexibility allowed by the data frame.
In one of his papers, 'Tidy Data' (http://vita.had.co.nz/papers/tidy-data.pdf), he provides a down-to-earth summary of what his libraries are for and frames part of the (vast) problem space.
That should provide a good foundation as to where one should aim, I think.
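As a rough illustration of what a melt-style operation does, the sketch below turns wide rows (one column per variable) into long rows (one row per variable and value). The record types, the `melt` function and the sample values are hypothetical and hand-written for two fixed columns; a real implementation would have to work over arbitrary columns.

```fsharp
// Hypothetical wide-format rows: one column per measured variable.
type WideRow = { Date: string; Open: float; Close: float }

// Long ("molten") format: one row per (key, variable, value).
type LongRow = { Key: string; Variable: string; Value: float }

// Each wide row produces one long row per measured variable.
let melt (rows: WideRow list) : LongRow list =
    rows
    |> List.collect (fun row ->
        [ { Key = row.Date; Variable = "Open"; Value = row.Open }
          { Key = row.Date; Variable = "Close"; Value = row.Close } ])

let wide =
    [ { Date = "2013-06-03"; Open = 10.0; Close = 10.5 }
      { Date = "2013-06-04"; Open = 11.0; Close = 11.5 } ]

let molten = melt wide

// Once molten, per-variable aggregation is a plain group-by.
let means =
    molten
    |> List.groupBy (fun r -> r.Variable)
    |> List.map (fun (name, rs) -> name, rs |> List.averageBy (fun r -> r.Value))

printfn "%A" means
```

Note that the long form loses the static column names (Open and Close become string values), which is exactly where the flexibility of R's data frame and the type safety of F# pull in different directions.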
****
Regarding the functionality mentioned by Howard-san, one might want to separate the operations for shaping data from those using the data further down the pipeline, and, among the latter, the time-series-oriented operations from the rest.
I do not think so.
At the moment, the CSV provider in F# Data has some functionality that, I think, should instead be in the data frame. I would be keen to restructure this and add more features to DataFrame, but I probably won’t be able to do much until August. (Another problem is that we probably need to do some experiments first to figure out what the best option is, especially with respect to type-safe vs. easy to use.)
T.
> aa = c(1,10,3)
> bb = c (2,6,79)
> df <- data.frame(attributeA = aa, attributeB = bb)
> df
  attributeA attributeB
1          1          2
2         10          6
3          3         79
http://www.doodle.com/emuq8ccdncpf5e5g
Thanks
Howard
Hi everyone,
Here is an overview of the time series library I mentioned above, which doesn’t have the scope of a full data frame library but can provide potentially interesting design ideas and usage patterns to people interested in the project. I use it in production (in analytic web services) and for ad-hoc research.
Nabil.
Hi Nabil,
This is an excellent write-up! Thanks very much for sharing this – I’ll have a detailed look in a few days, but the lazy loading sounds like an interesting feature (we have this in mind for the design, but did not actually try to implement it yet… so learning from your experience will certainly help!)
Also, thanks for sharing the sample analyses. I’ll certainly try to re-write them using the new library to see if we’re missing something.
Tomas
PS: We were hoping to share something by the end of this week, but I did not quite finish some design changes that I started on Friday – so I’ll try to send the prototype to the group on Monday.
--
Hi David,
There is a prototype, and it can easily be found on BlueMountain Capital's public GitHub (there is a fork on my profile too). We are not actively publicizing the project at the moment, because there is still lots of work to do and we want to coordinate the release with other activities of the F# Data Science working group.
Of course, everyone is welcome to try it & submit issues and pull requests :-). If you want to keep in touch with current discussions, the best way is to join the F# Data Science WG (http://fsharp.org/technical-groups). I plan to send some update there in a couple of days.
T.
From: fsharp-o...@googlegroups.com [mailto:fsharp-o...@googlegroups.com] On Behalf Of David Terk
Sent: Friday, September 27, 2013 7:53 PM
To: fsharp-o...@googlegroups.com; fsharp-o...@googlegroups.com
Subject: Re: F# data frame library
Did a prototype ever make it out in the wild?