SAX conversion (ts2string) for big time-series uses too much heap


vaske maskinsen

Feb 4, 2014, 11:32:09 AM
to jmotif-...@googlegroups.com
Hi Pavel,

I need to convert huge time series to SAX (~90,000 points and more).
The function SAXFactory.ts2string() seems to use a lot of heap space (see the exception below).
What do you think could be the reason for this?

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
    at edu.hawaii.jmotif.timeseries.TSUtils.paa(TSUtils.java:506)
    at SAXFactory_noHackyStatLogger.ts2string(SAXFactory_noHackyStatLogger.java:340)

TSUtils.java:506  => double[][] tmp = new double[paaSize][len];
The problem only arises if paaSize != len. Is this a bug? Do we really need such a big array for PAA?

And another question: what if the series to convert is not normally distributed? Should I use a different type of normalization than z-normalization?
One way I thought of was a data-adaptive (histogram-based) normalization, but then I can't compare two SAX sequences derived from two series with different distributions, can I?

Thanks and Greets,
Vaske

Pavel Senin

Feb 4, 2014, 11:51:38 AM
to jmotif-discuss
Hi there:

You could try the optimizedPaa function instead of paa; it should be faster and possibly less space-consuming.
In the code, I followed a reshape-transform algorithm borrowed from MATLAB code. These days I think it could be done more simply, without allocating huge matrices, but I cannot find the free time to work on this. Maybe later.
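
For illustration only, here is a minimal sketch of how PAA might be computed without the paaSize x len matrix. It is not the jmotif implementation; the class and method names (PaaSketch, paa) are hypothetical. It assumes the reshape approach is equivalent to repeating every point paaSize times and averaging consecutive chunks of length len, and it accumulates those contributions directly into an array of size paaSize.

```java
final class PaaSketch {

    // Hypothetical helper, not part of jmotif.
    // Conceptually each point is repeated paaSize times, giving a stretched
    // series of length len * paaSize that is cut into paaSize segments of
    // length len. Instead of materializing it, we add each point's share
    // to the segment(s) it overlaps. Extra memory: O(paaSize).
    static double[] paa(double[] ts, int paaSize) {
        int len = ts.length;
        if (paaSize <= 0 || paaSize > len) {
            throw new IllegalArgumentException("paaSize must be in [1, ts.length]");
        }
        double[] out = new double[paaSize];
        int seg = 0; // index of the segment currently being filled
        for (int k = 0; k < len; k++) {
            long start = (long) k * paaSize; // stretched range covered by point k
            long end = start + paaSize;
            while (start < end) {
                long segEnd = (long) (seg + 1) * len;
                long take = Math.min(end, segEnd) - start; // overlap with segment
                out[seg] += ts[k] * take;
                start += take;
                if (start == segEnd) {
                    seg++; // segment is full, move to the next one
                }
            }
        }
        for (int j = 0; j < paaSize; j++) {
            out[j] /= len; // each segment holds exactly len stretched points
        }
        return out;
    }
}
```

When len is a multiple of paaSize this reduces to plain segment means; otherwise points on a segment boundary are split proportionally between the two neighbouring segments.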

Also, in all the work I do, we rarely transform the whole series into a single string, since this causes a huge loss of information; typically we use a sliding window. A sliding window captures local features very well, while the "holistic" approach loses them.
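
As a rough sketch of the sliding-window idea (again not the jmotif API; SlidingWindowSaxSketch and its methods are hypothetical names), each window is z-normalized on its own, reduced by PAA, and mapped to letters. The sketch assumes a fixed 4-letter alphabet with the standard normal breakpoints, a window length that is an exact multiple of paaSize, and the population standard deviation for z-normalization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SlidingWindowSaxSketch {

    // Breakpoints splitting N(0,1) into 4 equiprobable regions (letters a..d).
    private static final double[] CUTS = {-0.6745, 0.0, 0.6745};

    // Z-normalize one window; a flat window becomes all zeros.
    static double[] zNorm(double[] x) {
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0.0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / x.length); // assumption: population SD
        double[] out = new double[x.length];
        if (sd < 1e-12) return out;
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / sd;
        return out;
    }

    // Convert a single window into a SAX word of length paaSize.
    static String toWord(double[] window, int paaSize) {
        double[] z = zNorm(window);
        int segLen = window.length / paaSize; // assumes exact divisibility
        StringBuilder sb = new StringBuilder(paaSize);
        for (int s = 0; s < paaSize; s++) {
            double sum = 0.0;
            for (int i = s * segLen; i < (s + 1) * segLen; i++) sum += z[i];
            double segMean = sum / segLen;
            int letter = 0;
            while (letter < CUTS.length && segMean > CUTS[letter]) letter++;
            sb.append((char) ('a' + letter));
        }
        return sb.toString();
    }

    // One SAX word per window position; only one window is held at a time.
    static List<String> slidingSax(double[] ts, int windowSize, int paaSize) {
        if (windowSize > ts.length || windowSize % paaSize != 0) {
            throw new IllegalArgumentException("bad windowSize/paaSize");
        }
        List<String> words = new ArrayList<>();
        for (int start = 0; start + windowSize <= ts.length; start++) {
            double[] window = Arrays.copyOfRange(ts, start, start + windowSize);
            words.add(toWord(window, paaSize));
        }
        return words;
    }
}
```

Because every window is processed independently, memory use depends on the window size rather than on the length of the whole series.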

If the series is not normal, then, for better performance, you'd need to build your own alphabet (that is, your own set of cut points) based on the observed distribution. The new alphabet should guarantee equal probability for each letter.
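
One possible way to get such an alphabet, sketched under assumptions: take the cut points from the empirical quantiles of the values you are going to discretize (for example the PAA values), so that each letter receives roughly the same number of observations. The names below (EmpiricalAlphabetSketch, equiprobableCuts, letterFor) are hypothetical and not part of jmotif.

```java
import java.util.Arrays;

final class EmpiricalAlphabetSketch {

    // Returns alphabetSize - 1 cut points taken at the i/alphabetSize
    // empirical quantiles of the sample, so each letter covers roughly
    // the same share of the observed values.
    static double[] equiprobableCuts(double[] sample, int alphabetSize) {
        if (alphabetSize < 2 || sample.length < alphabetSize) {
            throw new IllegalArgumentException("need alphabetSize >= 2 and enough data");
        }
        double[] sorted = sample.clone(); // keep the caller's array untouched
        Arrays.sort(sorted);
        double[] cuts = new double[alphabetSize - 1];
        for (int i = 1; i < alphabetSize; i++) {
            int idx = (int) ((long) i * sorted.length / alphabetSize);
            cuts[i - 1] = sorted[idx];
        }
        return cuts;
    }

    // Maps a value to a letter: 'a' below the first cut, 'b' from the first
    // cut up to the second, and so on.
    static char letterFor(double value, double[] cuts) {
        int letter = 0;
        while (letter < cuts.length && value >= cuts[letter]) letter++;
        return (char) ('a' + letter);
    }
}
```

Note that words built with different cut points are not directly comparable, so two series would need to share the same cuts if their SAX strings are to be compared.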

In fact, if you use a sliding window, then you can compare series of different lengths and point distributions, because short subsequences (from the sliding window) are likely to be normally distributed.



--
Mahalo, Pavel.