sax distance

46 views

Skip to first unread message

dlrts

unread,

Dec 8, 2013, 9:05:32 AM12/8/13

to jmotif-...@googlegroups.com

Hi there,

I am trying to use saxified timeseries for clustering and classification.
Now I am in need of a distance measure; the paper by Lin introduces a distance measure for the SAX that is based on a lookup table for each symbol-combination.

In Jmotif the only spot I found this is implemented is the function saxMinDis in SaxFactory. However, it accepts only char-array. Those can be aquired by transforming the SaxFrequencyData into its String representation (with empty separator).

But, and here comes the quirk, this mechanism can only work, if the window-size used for building the sax was the whole original timeseries. since this leads to a single sax word, doing this for different serieses leads to comparable string. Now, if the window-size is smaller, we end up with a number of sax words. The bad thing is, this number is not only determined by the length of the series but also by its shape (because words that are repeated are removed). This leads for two serieses with the same length to String-representations with (possibly) different length, that are thus not comparable anymore.

When I look into the paper, it seems, that this case is not covered. So what would be the correct way for the distance calculation? Would I compute a String that still has all the removed (repeated) words in it and use that? Does jmotiv has a method to do this?

hints appreciated,
daniel

Pavel Senin

unread,

Dec 8, 2013, 11:27:00 AM12/8/13

to jmotif-discuss

Hi Daniel:

Thank you for the interest.

There are few ways to solve the problem with clustering and classification of time-series when, as you pointing, the sequences (ordered sax words collections) or dictionaries (unordered word frequency tables) are of the different size - i.e. when obtained with the use of a sliding window and a numerosity reduction.

SaxMinDist is the function you would typically use for the numerosity reduction. I am not sure about its performance for raw classification/clustering.

One way would be to use SAX words frequency vectors - basically each of time series is converted into the numerical frequency vectors. You will have something like follows (there first column is all the words, right columns represent words occurrences in the series 1,2,...). Then you can use whatever distance you like to these vectors: Euclidean, Humming etc.

words | series1 | series 2| ...

aac | 0 | 5 |

bba | 2 | 3 |

....

ccc |5 | 0 |

Another way is suggested by Lin in this paper <http://www.cs.gmu.edu/~jessica/Lin_BOP_SSDBM.pdf> - you make bags of words and then have few ways of working with them.

Yet, another way that I have implemented in JMotif, is in this paper <http://www2.hawaii.edu/~senin/assets/papers/sax-vsm-icdm13-short.FINAL_DRAFT.pdf> - you make bags of words representing classes of sequences and then by leveraging tf*idf ranking you classify/cluster unlabeled series.

Well, there is yet another way too.

The implementation of all these techniques is not very difficult, you could reuse a lot of my code for these three techniques, but it will require some time to get through it. Once you'll checkout the code and will be able to build it in eclipse, I can help with directions/issues.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.