Dataset API proposal


Aleksandr Sorokoumov

Jul 5, 2014, 1:07:09 PM
to numerica...@googlegroups.com, inca...@googlegroups.com
Hi all,

As part of the Incanter and core.matrix integration process, the idea is to evolve the existing core.matrix dataset type and use it in Incanter.

In order to do that, Incanter's dataset functions need to be implemented in core.matrix.
Could you please look at my user API proposal and share your opinion on it? Since it is meant to replace the existing Incanter dataset API, most of the functions are copied from it.

Thanks.

Mike Anderson

Jul 5, 2014, 1:37:06 PM
to numerica...@googlegroups.com, inca...@googlegroups.com
Hi Aleksandr,

I think it would be good to define exactly what we mean by a dataset as an abstraction. The definition would help us be more precise about what should and shouldn't be in the API.

For example, we might define a dataset as something that has the following properties:
a) It can be treated as a 2D array / matrix
b) It has (uniquely?) named columns
c) It supports heterogeneous data types as columns
d) It is seqable? (producing a sequence of rows - core.matrix can't guarantee that in general, but we could make it a requirement for datasets...)

That may or may not be the right definition....
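
To make the four properties above concrete, here is a minimal sketch in plain Clojure. Everything in it (make-dataset, dataset-shape, row-seq, the map layout) is a hypothetical illustration, not the core.matrix or Incanter API, and it assumes a dataset has at least one column.

```clojure
;; Hypothetical sketch: a dataset as a plain map of ordered, uniquely named columns.
(defn make-dataset
  "Builds a dataset from column names and equally long column vectors.
  Assumes at least one column."
  [column-names columns]
  (assert (= (count column-names) (count columns)) "one name per column")
  (assert (apply distinct? column-names) "column names must be unique") ; (b)
  (assert (apply = (map count columns)) "columns must have equal length")
  {:column-names (vec column-names)
   :columns      (mapv vec columns)})

(defn dataset-shape
  "(a) Treat the dataset as a 2D array: [row-count column-count]."
  [ds]
  [(count (first (:columns ds))) (count (:columns ds))])

(defn row-seq
  "(d) A sequence of rows, each row a vector of possibly mixed values."
  [ds]
  (apply map vector (:columns ds)))

;; (c) A string column next to a numeric column.
(def people
  (make-dataset [:name :age]
                [["Ann" "Bob"] [31 45]]))
```

With this layout, (dataset-shape people) gives the 2D shape and (row-seq people) walks the rows, so property (d) falls out of the column storage without any extra requirement on the underlying arrays.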

Mike Anderson

Jul 5, 2014, 2:08:45 PM
to numerica...@googlegroups.com, inca...@googlegroups.com
Some specific comments on the different functions:

column-count: any reason why we need to duplicate column-count in the dataset API? Won't the c.c.m/column-count function work equally well? The same goes for row-count.

column-names: does this guarantee column order? If so, we should probably say so in the API.

get-column: again, do we need to duplicate c.c.m/get-column?

conj-columns: seems misnamed. Shouldn't this be "merge-columns" if the arguments are whole datasets?

select-columns / except-columns: how are cols specified here? By name?

add-derived-column: how does this work? Why is this useful as a separate function rather than just letting people use (add-column ds col-name (f column1 column2))?
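
As a rough illustration of that point, the sketch below shows a derived column built with an ordinary expression and a plain add-column. The map-based dataset layout and the helper names get-col and add-column are assumptions made for this example only, not the proposed API.

```clojure
;; Hypothetical representation: ordered names plus a map from name to column vector.
(def ds {:column-names [:price :qty]
         :columns      {:price [2.5 4.0] :qty [2 3]}})

(defn get-col
  "Returns the vector of values stored under col-name."
  [ds col-name]
  (get-in ds [:columns col-name]))

(defn add-column
  "Appends a new column; throws if the name is already taken."
  [ds col-name col-values]
  (when (contains? (:columns ds) col-name)
    (throw (ex-info "Column already exists" {:column col-name})))
  (-> ds
      (update :column-names conj col-name)
      (assoc-in [:columns col-name] (vec col-values))))

;; The derived column is just an expression over existing columns.
(def with-total
  (add-column ds :total (map * (get-col ds :price) (get-col ds :qty))))
```

Here (map * ...) plays the role of (f column1 column2), so no dedicated add-derived-column function is needed.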

transform-column: should this be called update-column for consistency with other Clojure functions that do something similar?
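
For comparison, a hypothetical update-column could copy the argument order of clojure.core/update (target, key, function, extra arguments). The sketch assumes a map-based dataset layout and assumes the function is applied to each value of the column; neither is confirmed by the proposal.

```clojure
(defn update-column
  "Applies f (plus any extra args) to every value of the named column,
  using the same argument order as clojure.core/update."
  [ds col-name f & args]
  (when-not (contains? (:columns ds) col-name)
    (throw (ex-info "No such column" {:column col-name})))
  (update-in ds [:columns col-name]
             (fn [col] (mapv #(apply f % args) col))))

;; Hypothetical dataset with a single column.
(def temps {:column-names [:celsius]
            :columns      {:celsius [0 100]}})

;; Reads like (update m k f x): double every value in :celsius.
(def doubled (update-column temps :celsius * 2))
```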

from-matrix: should this just be a constructor function (dataset m)?

query-dataset: I'm dubious about putting this sort of functionality into the core API. I'd normally expect queries to be done using other tools, with the resulting dataset analysed afterwards. In general I'm suspicious of embedded mini-languages in a public API; it would be much better if we could figure out ways to integrate with other tools.



Aleksandr Sorokoumov

Jul 5, 2014, 2:10:17 PM
to numerica...@googlegroups.com, inca...@googlegroups.com
Hi Mike,

Thank you for the quick feedback.
I think that:
a) It is a good idea. However, matrix/array operations should only produce matrices/arrays. Otherwise we will have complicated cases: for example, what should the result of matrix multiplication of two datasets be?
b) Columns should have unique names. The question is how to treat conflicts: automatically rename the duplicate column or raise an exception?
c) It is a must-have feature.
d) I don't see the advantage of it if we treat a dataset as a matrix and provide a comprehensive enough API.

Mike Anderson

Jul 5, 2014, 2:19:28 PM
to numerica...@googlegroups.com
On 5 July 2014 19:10, Aleksandr Sorokoumov <aleksandr....@gmail.com> wrote:
> Hi Mike,
>
> Thank you for the quick feedback.
> I think that:
> a) It is a good idea. However, matrix/array operations should only produce matrices/arrays. Otherwise we will have complicated cases: for example, what should the result of matrix multiplication of two datasets be?

If you see datasets as array implementations, then the result is implementation-defined, just like other core.matrix API functions. As long as the result satisfies the matrix multiply API, it should be OK.

> b) Columns should have unique names. The question is how to treat conflicts: automatically rename the duplicate column or raise an exception?

If we are enforcing uniqueness, then we should probably throw exceptions in cases where it appears that the user has made an error. Obviously there may be some functions like "merge-columns" where the replacement behaviour would seem more natural.

> c) It is a must-have feature.
> d) I don't see the advantage of it if we treat a dataset as a matrix and provide a comprehensive enough API.

OK. I guess users can just use c.c.m/rows if they want a row sequence.

A thought occurs to me that we may want two different dataset implementations:
a) Dataset that stores a column vector for each column
b) Dataset that wraps an arbitrary array / matrix and just adds column names

b) would allow much more efficient implementations for getting individual dataset rows, for example....
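
A minimal sketch of how the two implementations could sit behind one abstraction, assuming a small protocol. The protocol DatasetSketch, its method names and both record types are hypothetical and only illustrate the trade-off; they are not existing core.matrix types.

```clojure
(defprotocol DatasetSketch
  (col-names [ds] "Ordered column names.")
  (get-col   [ds col-name] "All values of one column.")
  (get-row   [ds i] "All values of row i, in column order."))

;; (a) Stores one vector per column: cheap columns, rows are assembled on demand.
(defrecord ColumnDataset [names columns]
  DatasetSketch
  (col-names [_] names)
  (get-col [_ col-name] (get columns col-name))
  (get-row [_ i] (mapv #(nth (get columns %) i) names)))

;; (b) Wraps a row-major matrix and only adds names: cheap rows,
;; columns are assembled on demand.
(defrecord WrappedDataset [names rows]
  DatasetSketch
  (col-names [_] names)
  (get-col [_ col-name]
    (let [j (.indexOf names col-name)]   ; position of the column in each row
      (mapv #(nth % j) rows)))
  (get-row [_ i] (nth rows i)))

(def by-column (->ColumnDataset  [:x :y] {:x [1 2] :y ["a" "b"]}))
(def by-row    (->WrappedDataset [:x :y] [[1 "a"] [2 "b"]]))
```

Both values answer get-row and get-col identically; only the cost of each operation differs, which is the point of having two implementations.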
 


--
You received this message because you are subscribed to the Google Groups "Numerical Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to numerical-cloj...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Aleksandr Sorokoumov

Jul 5, 2014, 3:04:52 PM
to numerica...@googlegroups.com, inca...@googlegroups.com
I removed the duplicates column-count, row-count and get-column.
conj-columns is renamed to merge-columns.
transform-column is renamed to update-column.

In select-columns / except-columns, columns are specified by name. I added this to the doc.
You are right, there is no advantage in having an add-derived-column function.

I defined from-matrix because there is also a from-map function, which creates a dataset from a map. However, we could instead have a single dataset function that dispatches on argument type and remove both functions. What do you think?
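
One way such a dispatching constructor might look, sketched with a multimethod. The dispatch keys, the map-based result and the default numeric column names are all assumptions for illustration, not the proposed implementation.

```clojure
(defmulti dataset
  "Hypothetical constructor that dispatches on the kind of its first argument."
  (fn [data & _]
    (cond
      (map? data)        :map
      (sequential? data) :matrix
      :else              (type data))))

;; Replaces from-map: keys become column names.
;; Note: key order of a general Clojure map is not guaranteed.
(defmethod dataset :map [m & _]
  {:column-names (vec (keys m))
   :columns      (into {} (map (fn [[k v]] [k (vec v)])) m)})

;; Replaces from-matrix: rows in, columns out.
;; Without explicit names the columns are numbered 0..n-1.
(defmethod dataset :matrix
  ([rows] (dataset rows (range (count (first rows)))))
  ([rows column-names]
   {:column-names (vec column-names)
    :columns      (zipmap column-names (apply map vector rows))}))

(def from-rows (dataset [[1 2] [3 4]] [:a :b]))
```

Support for another input type can be added later with one more defmethod, without touching the existing ones.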

I see your point about query-dataset and I think you are right. I removed it from the API.

Alex Ott

Jul 7, 2014, 3:47:25 AM
to numerica...@googlegroups.com, inca...@googlegroups.com
Hi

A dispatching dataset function is a good idea; it's easy to extend to new types later.

Regarding the column names: do we continue to support Incanter's approach and allow both keywords and strings, or limit them to one type?





--
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

Mike Anderson

Jul 7, 2014, 5:38:37 AM
to numerica...@googlegroups.com, inca...@googlegroups.com
I think it is fine to allow arbitrary values for column names. This would also be consistent with core.matrix principles more generally, which allow arbitrary types as array element values. The only constraint I think makes sense is that column names are unique (which we should enforce).

Just checking - have there been any issues with allowing both keywords and Strings in Incanter?

Even numeric column names might be appropriate in some circumstances. A regular matrix could be treated as a dataset with Long column names - in which case I think it can fulfil the dataset abstraction. 

Hmmmm... now this is making me think about multi-dimensional cubes (OLAP style). We should think about how it might be possible in the future to allow multi-dimensional datasets with names on arbitrary dimensions, not just columns..... 



Alex Ott

Jul 7, 2014, 5:47:17 AM
to numerica...@googlegroups.com, inca...@googlegroups.com
I don't remember any open issues with mixing keywords and strings. Sometimes there was confusion when people expected a string and there was a keyword, but Aleksandr's patches allow the lookup to work in such cases.



Mike Anderson

Jul 7, 2014, 5:56:09 AM
to numerica...@googlegroups.com, inca...@googlegroups.com
Hmmmm, does that mean we auto-convert between strings and keywords on lookup? I think that's a risky idea: it can allow bugs to propagate (it's effectively "weak typing"). That way lies the land of WAT.

Better IMHO would be to throw an informative exception in the case that a user asks for a non-existent column, e.g. "Column ':foo' does not exist but there is a column named with the string 'Foo' - did you mean to use that instead?".
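
A rough sketch of that behaviour: the lookup stays strict, and the exception only carries a hint. The function checked-column, the map-based dataset and the case-insensitive similarity rule are assumptions for illustration, not existing Incanter or core.matrix behaviour.

```clojure
(require '[clojure.string :as str])

(defn- normalize
  "Case-insensitive textual form of a column name (keyword, string or other)."
  [col-name]
  (str/lower-case
    (if (instance? clojure.lang.Named col-name)
      (name col-name)
      (str col-name))))

(defn checked-column
  "Strict lookup with no automatic keyword/string conversion.
  Throws with a suggestion when a similarly spelled column exists."
  [ds col-name]
  (if (contains? (:columns ds) col-name)
    (get-in ds [:columns col-name])
    (let [similar (filter #(= (normalize %) (normalize col-name))
                          (:column-names ds))]
      (throw (ex-info
               (str "Column " (pr-str col-name) " does not exist."
                    (when (seq similar)
                      (str " Did you mean " (pr-str (first similar)) "?")))
               {:requested col-name :similar (vec similar)})))))

;; Hypothetical dataset whose only column is named with a string.
(def sales {:column-names ["Foo"]
            :columns      {"Foo" [1 2]}})
```

Asking for "Foo" returns the column, while asking for :foo throws and lists "Foo" under :similar in the exception data, so the mistake surfaces immediately instead of being silently converted.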

Alex Ott

Jul 7, 2014, 6:09:47 AM
to numerica...@googlegroups.com, inca...@googlegroups.com
Maybe this will be better; we need to experiment and understand how it will affect the current behavior.



Aleksandr Sorokoumov

Jul 8, 2014, 2:40:00 PM
to numerica...@googlegroups.com, inca...@googlegroups.com

> A thought occurs to me that we may want two different dataset implementations:
> a) Dataset that stores a column vector for each column
> b) Dataset that wraps an arbitrary array / matrix and just adds column names

How about specifying the implementation in the dataset function when creating a new dataset?


Alex Ott

Jul 13, 2014, 6:04:08 AM
to numerica...@googlegroups.com, inca...@googlegroups.com
It would be nice to describe the behavior of merge-columns (and similar functions) when a column with the same name exists in several datasets: does the last one win, do we keep the original, or something else?

Aleksandr Sorokoumov

Jul 13, 2014, 8:01:28 AM
to numerica...@googlegroups.com, inca...@googlegroups.com
I think that an exception should be raised on any conflict. I added this to the description.

Mike Anderson

Jul 13, 2014, 3:54:31 PM
to numerica...@googlegroups.com, inca...@googlegroups.com
clojure.core/merge has last-one-wins semantics. I think if we are going to call anything "merge", it would be good to keep similar behaviour in order to avoid confusion.
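
To compare the two options side by side, here is a small sketch. The functions merge-columns and merge-columns-strict and the map-based dataset layout are hypothetical; only clojure.core/merge itself is real.

```clojure
;; clojure.core/merge: on a key conflict, the value from the last map wins.
(def merged-maps (merge {:x 1 :y 2} {:x 10}))

(defn merge-columns
  "Last-one-wins, like clojure.core/merge: a later dataset replaces the
  values of a same-named column but keeps the column's original position."
  [& datasets]
  (reduce (fn [acc ds]
            {:column-names (into (:column-names acc)
                                 (remove #(contains? (:columns acc) %)
                                         (:column-names ds)))
             :columns      (merge (:columns acc) (:columns ds))})
          {:column-names [] :columns {}}
          datasets))

(defn merge-columns-strict
  "Alternative: throw on any name conflict instead of replacing."
  [& datasets]
  (let [dups (for [[col-name n] (frequencies (mapcat :column-names datasets))
                   :when (> n 1)]
               col-name)]
    (when (seq dups)
      (throw (ex-info "Duplicate column names" {:duplicates (vec dups)})))
    (apply merge-columns datasets)))

(def a {:column-names [:id :x] :columns {:id [1 2] :x [10 20]}})
(def b {:column-names [:x :y]  :columns {:x [11 21] :y [5 6]}})
```

With these definitions (merge-columns a b) keeps the :x values from b, whereas (merge-columns-strict a b) throws and names :x as the duplicate. A strict function would probably read better under a name other than "merge".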