datasets equality?

Alex Ott

Dec 23, 2014, 5:39:37 AM
to numerica...@googlegroups.com
Hi

I'm preparing the first beta releases of Incanter 2.0 and ran into the following problem with dataset equality:

FAIL in (melt-test) (dataset_tests.clj:101)
expected: (= (melt dset :id) expected)
  actual: (not (=
| :id | :value | :variable |
|-----+--------+-----------|
|   1 |      1 |     :time |
|   2 |      2 |     :time |
|   1 |      5 |       :x1 |
|   2 |      7 |       :x1 |
|   1 |      6 |       :x2 |
|   2 |      8 |       :x2 |

| :id | :variable | :value |
|-----+-----------+--------|
|   1 |     :time |      1 |
|   2 |     :time |      2 |
|   1 |       :x1 |      5 |
|   2 |       :x1 |      7 |
|   1 |       :x2 |      6 |
|   2 |       :x2 |      8 |
))

As you can see, the values in the columns are equal, but the order of the columns is different.

Is everyone OK with implementing a specialized equality operator for datasets that ignores the order of the columns and compares only the values?

Would it be better to add a separate DatasetEquality protocol, or simply implement PMatrixEquality for the dataset implementation?

--
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

Mike Anderson

Dec 23, 2014, 5:46:33 AM
to numerica...@googlegroups.com
Great to hear the Incanter beta is coming along! I am doing a bit more work on core.matrix and vectorz-clj over the Christmas period (especially around sparse matrix support) so we may be able to get some co-ordinated releases :-)

I'm not sure I would want these two datasets to be considered equal: I think column order often matters and should be considered part of the comparison (consider graphs with sorted axes, for example).

If we do go down the route of treating these datasets as equal, then I think we would need a separate dataset equality function; otherwise we would be breaking the general contract of c.c.m/equals, which requires element-wise comparison according to the array shape.
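
To make the element-wise contract concrete, here is a minimal sketch in plain Clojure. It is not the core.matrix implementation: `elementwise-equal?` is a hypothetical helper, and the data is modelled as a vector of row vectors using the first two rows of each melt result from the failing test.

```clojure
;; Hypothetical helper, not part of core.matrix: arrays are modelled as
;; vectors of row vectors and compared position by position.
(defn elementwise-equal?
  [a b]
  (and (= (count a) (count b))                             ; same number of rows
       (every? true?
               (map (fn [row-a row-b]
                      (and (= (count row-a) (count row-b)) ; same row length
                           (every? true? (map = row-a row-b))))
                    a b))))

;; First two rows of the two melt results from the failing test.
(def actual-rows   [[1 1 :time] [2 2 :time]])   ; :id, :value, :variable
(def expected-rows [[1 :time 1] [2 :time 2]])   ; :id, :variable, :value

(elementwise-equal? actual-rows actual-rows)    ; true
(elementwise-equal? actual-rows expected-rows)  ; false
```

Under this reading, the two datasets have the same shape but differ in the second and third positions of every row, so a shape-based comparison has to report them as unequal.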

Shriphani Palakodety

Dec 23, 2014, 6:13:18 AM
to numerica...@googlegroups.com
The typical sparse matrix format I use in Python has rows of the form <row, col, value>. You can't flip the columns here.

One trick you can use is to leverage the headers. If headers are available, you can switch the columns and things work out as-is. Otherwise, the columns can't be switched.
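
The header trick could look roughly like this in plain Clojure. This is only a sketch: the `{:column-names ... :rows ...}` map is a hypothetical stand-in for a dataset, `reorder-columns` is not an Incanter or core.matrix function, and headers are assumed to be unique.

```clojure
;; Hypothetical dataset model: {:column-names [...] :rows [[...] ...]}.
;; Assumes column names are unique.
(defn reorder-columns
  "Returns ds with its columns rearranged into target-order, or nil when
  ds has no headers or its headers do not match target-order."
  [ds target-order]
  (let [names (:column-names ds)]
    (when (and (seq names)
               (= (count names) (count target-order))
               (= (set names) (set target-order)))
      (let [position (zipmap names (range))          ; header -> old index
            indices  (mapv position target-order)]   ; old index for each new slot
        {:column-names (vec target-order)
         :rows (mapv (fn [row] (mapv #(nth row %) indices))
                     (:rows ds))}))))

(def ds {:column-names [:id :value :variable]
         :rows [[1 1 :time] [2 2 :time]]})

(reorder-columns ds [:id :variable :value])
;; => {:column-names [:id :variable :value], :rows [[1 :time 1] [2 :time 2]]}

(reorder-columns {:rows [[1 1 :time]]} [:id :variable :value])
;; => nil (no headers, so the columns cannot be matched up)
```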

Alex Ott

Dec 23, 2014, 6:41:39 AM
to numerica...@googlegroups.com
I'm thinking about leaving standard equality dependent on the order of the columns, but adding an additional function that checks equality without regard to column order...
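
A rough sketch of that two-function split, in plain Clojure. Everything here is hypothetical: the `{:column-names ... :rows ...}` map stands in for a dataset, and neither function name comes from Incanter or core.matrix. The relaxed version assumes unique column names and still treats row order as significant.

```clojure
;; Hypothetical dataset model: {:column-names [...] :rows [[...] ...]}.
(defn dataset-equal?
  "Strict comparison: column order matters."
  [a b]
  (and (= (:column-names a) (:column-names b))
       (= (:rows a) (:rows b))))

(defn row-maps
  "Turns each row vector into a map keyed by column name."
  [ds]
  (mapv #(zipmap (:column-names ds) %) (:rows ds)))

(defn dataset-equal-ignoring-column-order?
  "Relaxed comparison: same column names and same value per name in every
  row, in any column order. Row order still matters."
  [a b]
  (and (= (count (:column-names a)) (count (:column-names b)))
       (= (set (:column-names a)) (set (:column-names b)))
       (= (row-maps a) (row-maps b))))

;; The two datasets from the failing melt test.
(def actual
  {:column-names [:id :value :variable]
   :rows [[1 1 :time] [2 2 :time] [1 5 :x1] [2 7 :x1] [1 6 :x2] [2 8 :x2]]})

(def expected
  {:column-names [:id :variable :value]
   :rows [[1 :time 1] [2 :time 2] [1 :x1 5] [2 :x1 7] [1 :x2 6] [2 :x2 8]]})

(dataset-equal? actual expected)                        ; false
(dataset-equal-ignoring-column-order? actual expected)  ; true
```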

--
You received this message because you are subscribed to the Google Groups "Numerical Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to numerical-cloj...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shriphani Palakodety

Dec 23, 2014, 7:46:55 AM
to numerica...@googlegroups.com
+1 to this idea.

I am quite excited about Incanter 2.0.

Mars0i

Dec 23, 2014, 12:12:24 PM
to numerica...@googlegroups.com
I haven't used Incanter much yet, but I'd been assuming that datasets are supposed to play a role roughly like R dataframes.  I just checked to see what R does when the == operator is applied to dataframes.

== performs an elementwise equality comparison, but it ignores column labels and only considers column order:
> df1
  a  b
1 1 10
2 2 20
3 3 30
> df2
   b a
1 10 1
2 20 2
3 30 3
> df3
  b  a
1 1 10
2 2 20
3 3 30
> df1 == df2
         a     b
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
> df1 == df3
        a    b
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE

The compare package includes a function compare() that will return a single true or false value when comparing dataframes.  However, it also takes into account column order, partially.

> compare(df1,df1)
TRUE
> compare(df1,df2)
FALSE [FALSE, FALSE]
> compare(df1,df3)
FALSE [TRUE, TRUE]

(I'm not really familiar with compare; I just now found it.)

On the theory that it's a benefit to be able to map intuitions developed in R onto intuitions about how Incanter behaves, the preceding provides some small support for the proposal that basic equality for datasets should reflect column order as opposed to column labels. To me it seems like a good thing to have an additional operator that ignores column order as well.

Thanks Alex!  I don't use Incanter much, but I do use it a little bit, and in the long run I'm likely to use it more.

Alex Ott

Dec 23, 2014, 12:27:14 PM
to numerica...@googlegroups.com
Thank you!

I rarely use R, but I'll look at the compare package. Thank you for the suggestion...


Mike Anderson

Dec 23, 2014, 8:56:09 PM
to numerica...@googlegroups.com
I think a separate comparison function is a good idea.

Just be careful with how you define the semantics of the new comparison function; there are a few complexities to consider, e.g.:
- What happens if you compare a labelled dataset with an unlabelled matrix?
- Keyword vs String labels (I believe these should be considered non-equal, but worth considering...)
- Data type equality, or numerical equality?
- What if more than one dimension is labelled? (core.matrix has support for this now in 0.32.0)
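
Two of these points can be seen directly in plain Clojure. The sketch below uses only `clojure.core`; `value-equal?` and `rows-equal-by` are hypothetical helpers for illustration, not core.matrix or Incanter functions.

```clojure
;; Keyword and string labels are different values in Clojure.
(= :id "id")          ; false
(= (name :id) "id")   ; true, but only after an explicit conversion

;; = is type-sensitive across integers and floating-point numbers,
;; while == compares numeric value only.
(= 1 1.0)             ; false
(== 1 1.0)            ; true

;; Hypothetical helper: compare numerically when both values are numbers,
;; and fall back to = otherwise (e.g. for keywords or strings).
(defn value-equal?
  [x y]
  (if (and (number? x) (number? y))
    (== x y)
    (= x y)))

(defn rows-equal-by
  "Compares two collections of rows element by element using eq?."
  [eq? rows-a rows-b]
  (and (= (count rows-a) (count rows-b))
       (every? true?
               (map (fn [ra rb]
                      (and (= (count ra) (count rb))
                           (every? true? (map eq? ra rb))))
                    rows-a rows-b))))

(rows-equal-by = [[1 :x1 5]] [[1 :x1 5.0]])             ; false
(rows-equal-by value-equal? [[1 :x1 5]] [[1 :x1 5.0]])  ; true
```

Passing the element comparison in as an argument is one way to leave the data-type versus numerical question to the caller rather than fixing it in the comparison function itself.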


Alex Ott

Dec 24, 2014, 7:12:49 AM
to numerica...@googlegroups.com
Thanks for the suggestion, Mike.

I've put this item on my TODO list and will send a PR once I implement it...

I need to look at the latest changes in core.matrix first :-)

Happy holidays!
