Re: SAXification for Motif-mining

Pavel Senin

unread,

Apr 4, 2014, 12:21:12 PM4/4/14

to jmotif-discuss

Hi:

There is optimized PAA implementation in the code, it doesn't "explode" the timeseries into the matrix anymore. I cant look into the code right now, sorry. You should be able to get it there.

I believe that short anomalies called outliers and you can find those simply by searching the points distribution - i.e. by statistical means. Discords are defined as subsequences and the emphasis is given to "structural anomaly" - this is why typically we are using significant compression, - so we do not react on the noise (in this case noise are short anomalies).

Public implementation uses PAA=Alphabet size. But you may try to change that, but also you may have better luck without the trie but with the HashTable.

Thank you!

On Fri, Apr 4, 2014 at 5:43 PM, vaske maskinsen <evilt...@gmx.net> wrote:

Hi Pavel,

some time ago I asked you about the computational complexity of PAA (SAX conversion (ts2string) for big time-series uses too much heap) because I found PAA is making SAX conversion for long series slow, if PAA-size is not identical to the series size. (Do you know of an optimized PAA implementation?)

Now I noticed, that in function char[] getSaxVals(double[] vals, int windowSize, double[] cuts), the series is always aggregated to the alphabetSize by PAA. This means that we are always forced to very low "temporal resolution" when the alphabet size is small. However this may not always be desired. In my case, I often have many samples (ca. 10s - 1000s) in my window and interesting events (discords) may sometimes be just 3 samples long (like 1,100,1). They will not be found, if PAA-size is chosen too small (like an alphabet size of 4).
Did you have a reason for choosing the PAA-size equal to alphabet size (any reference to HOT SAX or other paper...)?
Are PAA-size and alphabet size somehow "connected"? Should I just use a window that is equal to my alphabet size?

Thank you very much in advance for your answer,
Greets, Vaske.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
Mahalo, Pavel.

vaske maskinsen

unread,

Apr 7, 2014, 5:02:24 AM4/7/14

to jmotif-...@googlegroups.com

Hi Pavel,

thanks for your answer, that really helped me. Is the optimizedPaa() now (in current version) used throughout the code? If not, is it safe to just exchange paa() by optimizedPaa() (I use an older, customized jMotif)?

The hashtable seems interesting. Have you used it anywhere? It should be much faster, but is it also as accurate? Which kinds of hash functions should be used? Do you know of any example implementation (or is it in jMotif)?

Thanks again,
Vaske.

Am Freitag, 4. April 2014 18:21:12 UTC+2 schrieb seninp:

Hi:

There is optimized PAA implementation in the code, it doesn't "explode" the timeseries into the matrix anymore. I cant look into the code right now, sorry. You should be able to get it there.

I believe that short anomalies called outliers and you can find those simply by searching the points distribution - i.e. by statistical means. Discords are defined as subsequences and the emphasis is given to "structural anomaly" - this is why typically we are using significant compression, - so we do not react on the noise (in this case noise are short anomalies).

Public implementation uses PAA=Alphabet size. But you may try to change that, but also you may have better luck without the trie but with the HashTable.

Thank you!

On Fri, Apr 4, 2014 at 5:43 PM, vaske maskinsen <evilt...@gmx.net> wrote:

Hi Pavel,

some time ago I asked you about the computational complexity of PAA (SAX conversion (ts2string) for big time-series uses too much heap - privat) because I found PAA is making SAX conversion for long series slow, if PAA-size is not identical to the series size. (Do you know of an optimized PAA implementation?)

Now I noticed, that in function char[] getSaxVals(double[] vals, int windowSize, double[] cuts), the series is always aggregated to the alphabetSize by PAA. This means that we are always forced to very low "temporal resolution" when the alphabet size is small. However this may not always be desired. In my case, I often have many samples (ca. 10s - 1000s) in my window and interesting events (discords) may sometimes be just 3 samples long (like 1,100,1). They will not be found, if PAA-size is chosen too small (like an alphabet size of 4).
Did you have a reason for choosing the PAA-size equal to alphabet size (any reference to HOT SAX or other paper...)?
Are PAA-size and alphabet size somehow "connected"? Should I just use a window that is equal to my alphabet size?

Thank you very much in advance for your answer,
Greets, Vaske.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout - privat.

--
Mahalo, Pavel.

Pavel Senin

unread,

Apr 7, 2014, 5:22:07 AM4/7/14

to jmotif-discuss

Hi there:

Yes, the old paa and the optimized one are fully compatible, you can simply copy-paste sources into your local branch.

Jessica suggested that hash is faster than trie; there should be no loss in precision, the hashtable is just another way to deal with heuristics so the algorithm just may converge faster.

You can see their work mentioning hash here <http://alumni.cs.ucr.edu/~wli/publications/ICDM06_Discords.pdf>. Or you can simply start with using the whatever HashTable or HashMap implementation Java has, or to re-write the trie code.

I personally think that the way I coded the Trie in jmotif is not optimal (it pre-builds the full trie and thus consumes too much memory). I wasn't really understanding the algorithm flow at that point, so it need to be rewritten. But it works as it is so far and since I was not working with discords, I never got to that code.

You'd be surprised, but trie doesn't "eat runtime", it is the registry of random visits and the distance as in attached profiling screenshot. This was done on a timeseries of ~0.6M points, single thread.

Thank you!

For more options, visit https://groups.google.com/d/optout.

--
Mahalo, Pavel.

profiling-HOTSAX.png

vaske maskinsen

unread,

Apr 7, 2014, 6:13:35 AM4/7/14

to jmotif-...@googlegroups.com

Thank you Pavel for the hints!

I'll try to implement it with a hashTable.

I have another question about the discords/motifs (sorry :)):
If I have a "MotifRecord" instance, I get its positions easily by motif.getPositions(). What I also need (e.g. to mark the motif visually in the time series) is its length in terms of the original time series (I have index from, I need index until). Using the parameters I used for searching (alphabet size, windowLength), how can I compute the size of the motif subseries?

Thank you very much!
Greets,
Vaske

Am Freitag, 4. April 2014 18:21:12 UTC+2 schrieb seninp:

Hi:

There is optimized PAA implementation in the code, it doesn't "explode" the timeseries into the matrix anymore. I cant look into the code right now, sorry. You should be able to get it there.

I believe that short anomalies called outliers and you can find those simply by searching the points distribution - i.e. by statistical means. Discords are defined as subsequences and the emphasis is given to "structural anomaly" - this is why typically we are using significant compression, - so we do not react on the noise (in this case noise are short anomalies).

Public implementation uses PAA=Alphabet size. But you may try to change that, but also you may have better luck without the trie but with the HashTable.

Thank you!

On Fri, Apr 4, 2014 at 5:43 PM, vaske maskinsen <evilt...@gmx.net> wrote:

Hi Pavel,

some time ago I asked you about the computational complexity of PAA (SAX conversion (ts2string) for big time-series uses too much heap - privat) because I found PAA is making SAX conversion for long series slow, if PAA-size is not identical to the series size. (Do you know of an optimized PAA implementation?)

Now I noticed, that in function char[] getSaxVals(double[] vals, int windowSize, double[] cuts), the series is always aggregated to the alphabetSize by PAA. This means that we are always forced to very low "temporal resolution" when the alphabet size is small. However this may not always be desired. In my case, I often have many samples (ca. 10s - 1000s) in my window and interesting events (discords) may sometimes be just 3 samples long (like 1,100,1). They will not be found, if PAA-size is chosen too small (like an alphabet size of 4).
Did you have a reason for choosing the PAA-size equal to alphabet size (any reference to HOT SAX or other paper...)?
Are PAA-size and alphabet size somehow "connected"? Should I just use a window that is equal to my alphabet size?

Thank you very much in advance for your answer,
Greets, Vaske.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout - privat.

--
Mahalo, Pavel.

Pavel Senin

unread,

Apr 9, 2014, 4:26:11 AM4/9/14

to jmotif-discuss

Hi Vaske:

The length of the discord or a motif is fixed in terms of HOTSAX and SAX - it is always the length of the sliding window.

Soon we will release an implementation that finds anomalies/motifs of a variable length.

For more options, visit https://groups.google.com/d/optout.

--
Mahalo, Pavel.

Message has been deleted

vaske maskinsen

unread,

Apr 9, 2014, 6:08:07 AM4/9/14

to jmotif-...@googlegroups.com

Hi Pavel,

I figured it out, position + window length seems pretty obvious now...
I'm looking forward to the variable length algorithm. Can you say when it will be approximately released? Will you use the SAX-trie method or a new, faster method like the hash-trick (for my huge time series I am waiting for minutes until the fixed length algorithm has finished, how long will it take until the one with variable length does?...)? I thought about how we could make it faster for long series in the last days... Maybe I can contribute something useful to the new implementation? Some discussion (maybe not here)?

Cheers,
Vaske

Am Mittwoch, 9. April 2014 10:26:11 UTC+2 schrieb seninp:

Hi Vaske:

The length of the discord or a motif is fixed in terms of HOTSAX and SAX - it is always the length of the sliding window.

Soon we will release an implementation that finds anomalies/motifs of a variable length.

On Mon, Apr 7, 2014 at 12:13 PM, vaske maskinsen <evilt...@gmx.net> wrote:

Thank you Pavel for the hints!

I'll try to implement it with a hashTable.

I have another question about the discords/motifs (sorry :)):
If I have a "MotifRecord" instance, I get its positions easily by motif.getPositions(). What I also need (e.g. to mark the motif visually in the time series) is its length in terms of the original time series (I have index from, I need index until). Using the parameters I used for searching (alphabet size, windowLength), how can I compute the size of the motif subseries?

Thank you very much!
Greets,
Vaske

Am Freitag, 4. April 2014 18:21:12 UTC+2 schrieb seninp:

Hi:

There is optimized PAA implementation in the code, it doesn't "explode" the timeseries into the matrix anymore. I cant look into the code right now, sorry. You should be able to get it there.

I believe that short anomalies called outliers and you can find those simply by searching the points distribution - i.e. by statistical means. Discords are defined as subsequences and the emphasis is given to "structural anomaly" - this is why typically we are using significant compression, - so we do not react on the noise (in this case noise are short anomalies).

Public implementation uses PAA=Alphabet size. But you may try to change that, but also you may have better luck without the trie but with the HashTable.

Thank you!

On Fri, Apr 4, 2014 at 5:43 PM, vaske maskinsen <evilt...@gmx.net> wrote:

Hi Pavel,

some time ago I asked you about the computational complexity of PAA (SAX conversion (ts2string) for big time-series uses too much heap - privat - privat - privat) because I found PAA is making SAX conversion for long series slow, if PAA-size is not identical to the series size. (Do you know of an optimized PAA implementation?)

Now I noticed, that in function char[] getSaxVals(double[] vals, int windowSize, double[] cuts), the series is always aggregated to the alphabetSize by PAA. This means that we are always forced to very low "temporal resolution" when the alphabet size is small. However this may not always be desired. In my case, I often have many samples (ca. 10s - 1000s) in my window and interesting events (discords) may sometimes be just 3 samples long (like 1,100,1). They will not be found, if PAA-size is chosen too small (like an alphabet size of 4).
Did you have a reason for choosing the PAA-size equal to alphabet size (any reference to HOT SAX or other paper...)?
Are PAA-size and alphabet size somehow "connected"? Should I just use a window that is equal to my alphabet size?

Thank you very much in advance for your answer,
Greets, Vaske.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.

To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout - privat - privat - privat.

--
Mahalo, Pavel.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout - privat.

--
Mahalo, Pavel.

Pavel Senin

unread,

Apr 9, 2014, 6:48:43 AM4/9/14

to jmotif-discuss

It depends on the algorithm-proposing paper acceptance. Once it is accepted code will be public.

If there is an interesting dataset you have the access to -- where the discords/motifs co-exist and are of variable length -- please let me know.

For more options, visit https://groups.google.com/d/optout.

--
Mahalo, Pavel.

vaske maskinsen

unread,

Apr 9, 2014, 7:42:16 AM4/9/14

to jmotif-...@googlegroups.com

Hi,

I guess I have these kinds of datasets, but that's what I am trying to find out (motifs, discords at variable lengths). Also the data are sort of confidential, but I could prepare them properly... The datasets are the result of measurements at very hight frequency (12.8 kHz), each over one day... and features derived from these. However, I have to ask people if they may be used...

I'm really looking forward to reading your paper and hope it will be accepted soon :)

Cheers,
Vaske

Hi Pavel,

some time ago I asked you about the computational complexity of PAA (SAX conversion (ts2string) for big time-series uses too much heap - privat - privat - privat - privat - privat - privat - privat) because I found PAA is making SAX conversion for long series slow, if PAA-size is not identical to the series size. (Do you know of an optimized PAA implementation?)

Now I noticed, that in function char[] getSaxVals(double[] vals, int windowSize, double[] cuts), the series is always aggregated to the alphabetSize by PAA. This means that we are always forced to very low "temporal resolution" when the alphabet size is small. However this may not always be desired. In my case, I often have many samples (ca. 10s - 1000s) in my window and interesting events (discords) may sometimes be just 3 samples long (like 1,100,1). They will not be found, if PAA-size is chosen too small (like an alphabet size of 4).
Did you have a reason for choosing the PAA-size equal to alphabet size (any reference to HOT SAX or other paper...)?
Are PAA-size and alphabet size somehow "connected"? Should I just use a window that is equal to my alphabet size?

Thank you very much in advance for your answer,
Greets, Vaske.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.

To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout - privat - privat - privat - privat - privat - privat - privat.

--
Mahalo, Pavel.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout - privat - privat - privat.

--
Mahalo, Pavel.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout - privat.

--
Mahalo, Pavel.

vaske maskinsen

unread,

Jul 17, 2014, 10:01:39 AM7/17/14

to jmotif-...@googlegroups.com

Hi Pavel,

any news about that variable length discord mining paper?
I'd really like to read it. If it is published, could you provide some link?

Cheers,
Vaske

Hi Pavel,

some time ago I asked you about the computational complexity of PAA (SAX conversion (ts2string) for big time-series uses too much heap - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat) because I found PAA is making SAX conversion for long series slow, if PAA-size is not identical to the series size. (Do you know of an optimized PAA implementation?)

Now I noticed, that in function char[] getSaxVals(double[] vals, int windowSize, double[] cuts), the series is always aggregated to the alphabetSize by PAA. This means that we are always forced to very low "temporal resolution" when the alphabet size is small. However this may not always be desired. In my case, I often have many samples (ca. 10s - 1000s) in my window and interesting events (discords) may sometimes be just 3 samples long (like 1,100,1). They will not be found, if PAA-size is chosen too small (like an alphabet size of 4).
Did you have a reason for choosing the PAA-size equal to alphabet size (any reference to HOT SAX or other paper...)?
Are PAA-size and alphabet size somehow "connected"? Should I just use a window that is equal to my alphabet size?

Thank you very much in advance for your answer,
Greets, Vaske.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.

To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat - privat.

--
Mahalo, Pavel.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout - privat - privat - privat - privat - privat - privat - privat.

--
Mahalo, Pavel.

--
You received this message because you are subscribed to the Google Groups "jmotif-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jmotif-discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout - privat - privat - privat.

--
Mahalo, Pavel.

Reply all

Reply to author

Forward