Some time ago I asked you about the computational complexity of PAA (SAX conversion (ts2string) for big time-series uses too much heap - privat),
because I found that PAA makes the SAX conversion of long series slow if
the PAA size is not identical to the series size. (Do you know of an
optimized PAA implementation?)
Now I have noticed that in the function
char[] getSaxVals(double[] vals, int windowSize, double[] cuts)
the series is always aggregated by PAA down to alphabetSize points. This
means that we are always forced to a very low "temporal resolution" when
the alphabet size is small. However, this may not always be desired. In
my case, I often have many samples (ca. 10s - 1000s) in my window, and
interesting events (discords) may sometimes be just 3 samples long (like
1,100,1). They will not be found if the PAA size is chosen too small
(for example, equal to an alphabet size of 4).

Is this due to the SAX-trie approach? Or did you have any other reason for choosing the PAA size equal to the alphabet size (any reference to HOT SAX or another paper...)? Is there any other approach that allows the PAA size to differ from the alphabet size? Are PAA size and alphabet size somehow "connected"? Should I just use a window that is equal to my alphabet size?
I guess I have not completely understood the SAX-trie...
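To make the question concrete, here is a minimal sketch of what I have in mind. It is not the library's code: the class PaaSpikeDemo and all of its methods are names I made up for illustration. It assumes the window length is divisible by the PAA size, and that the cuts array holds only the interior breakpoints (-0.67, 0, 0.67 for an alphabet of 4), which may differ from the layout getSaxVals expects. The PAA size is passed separately from the alphabet size, and PAA is done in a single pass over the window.

```java
class PaaSpikeDemo {

    // Interior Gaussian breakpoints for an alphabet of size 4 (assumed layout).
    static final double[] CUTS_4 = {-0.67, 0.0, 0.67};

    /** Z-normalize a window: zero mean, unit (population) standard deviation. */
    static double[] znorm(double[] x) {
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0.0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / x.length);
        double[] out = new double[x.length];
        if (sd == 0.0) return out; // constant window: all zeros
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / sd;
        return out;
    }

    /** Single-pass PAA; requires x.length to be a multiple of paaSize. */
    static double[] paa(double[] x, int paaSize) {
        if (paaSize <= 0 || x.length % paaSize != 0) {
            throw new IllegalArgumentException("length must be a multiple of paaSize");
        }
        int seg = x.length / paaSize;          // samples per PAA segment
        double[] out = new double[paaSize];
        for (int i = 0; i < x.length; i++) out[i / seg] += x[i];
        for (int j = 0; j < paaSize; j++) out[j] /= seg;
        return out;
    }

    /** Map each value to a letter; a value equal to a cut falls into the lower bin. */
    static char[] toLetters(double[] v, double[] cuts) {
        char[] out = new char[v.length];
        for (int i = 0; i < v.length; i++) {
            int k = 0;
            while (k < cuts.length && v[i] > cuts[k]) k++;
            out[i] = (char) ('a' + k);
        }
        return out;
    }

    /** SAX word with the PAA size chosen independently of the alphabet size. */
    static String sax(double[] window, int paaSize, double[] cuts) {
        return new String(toLetters(paa(znorm(window), paaSize), cuts));
    }

    /** Toy window: 48 samples of value 1 with a single spike of 100 at index 25. */
    static double[] spikeWindow() {
        double[] w = new double[48];
        java.util.Arrays.fill(w, 1.0);
        w[25] = 100.0;
        return w;
    }

    public static void main(String[] args) {
        double[] w = spikeWindow();
        System.out.println(sax(w, 4, CUTS_4));   // PAA size == alphabet size
        System.out.println(sax(w, 16, CUTS_4));  // finer PAA, same alphabet
    }
}
```

On this toy window, the segment containing the spike only reaches the top letter 'd' with a PAA size of 16. With a PAA size of 4 the spike is averaged over 12 samples and stays in 'c'. That is the loss of "temporal resolution" I mean.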
Thank you very much in advance for your answer. Greetings, Vaske.