about jmotif library

jesin

Aug 13, 2012, 4:15:17 PM

to jmotif-...@googlegroups.com

Hi,

I am using the jmotif library to saxify a time series. I found that for saxification I need to create an object of the Timeseries class and then call SAXFactory.ts2string().

The Timeseries class constructor takes two arrays, one for time stamps (long) and one for values (double).

I have tested it on the example included on your site, where seriesAValues = {0.22, 0.23, 0.24, 0.5, 0.83} and seriesATstamps = {22L, 23L, 24L, 50L, 83L}. For an alphabet size of 3 and a paaSize of 3 the output is 'abc', as your site suggests.

Now, my question to you is: what is the significance of the time stamp array in generating the SAX code? If I change seriesATstamps to {1, 2, 3, 4, 5} or to {1, 2, 10, 30, 1000}, I get the same SAX code. Please let me know. I am struggling with this issue a lot.

To the best of my knowledge, SAX code is defined for uniformly sampled time series. I am not sure how the SAX code is generated when the time series is not uniformly sampled. However, from the jmotif library it seems we can also generate SAX code for non-uniformly sampled time series. Please let me know how you actually do it.

Thanks and regards,
Jesin

jesin

Aug 13, 2012, 4:53:02 PM

to jmotif-...@googlegroups.com

I have tried to send mail directly to the owner of the code, however I could not retrieve his email address.

Pavel Senin

Aug 14, 2012, 2:00:33 AM

to jmotif-...@googlegroups.com, jzak...@ucr.edu

Hi Jesin:

Thank you for your interest in the code and for your idea. I never thought of it that way: using the timestamp information to check whether a timeseries is equidistant and/or to convert it into such a representation.

As you already figured out, the code has no routines, checks, or adjustments implemented for timestamps. This is mostly due to the nature of my telemetry data (I implemented the jmotif library to work with Hackystat data): it comes out equidistant, with NaN values at places where no measurement is available. I just did not foresee the issue you are dealing with. The easiest way for you, I think, would be to preprocess the data on your own and use jmotif for the SAX or PAA conversion. Or we can work together on some code and embed it into jmotif to make it better. I think two solutions are possible: the first is to fill gaps in the series with NaNs, and the second is to approximate the gaps with some values. It shouldn't take much coding.
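The first option (filling gaps with NaNs) could look roughly like the sketch below. It is only an illustration: GapFiller and toGridWithNaN are hypothetical names, not part of jmotif, and the sketch assumes the timestamps are sorted ascending and that a sensible grid step is known in advance.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of jmotif. */
final class GapFiller {

    /**
     * Places each measurement on a regular grid that starts at the first
     * timestamp and advances by step. Slots without a measurement stay NaN.
     * Assumes timestamps are sorted ascending and step > 0.
     */
    static double[] toGridWithNaN(long[] tstamps, double[] values, long step) {
        if (tstamps.length == 0 || tstamps.length != values.length || step <= 0) {
            throw new IllegalArgumentException("need equal-length, non-empty arrays and step > 0");
        }
        long start = tstamps[0];
        long last = tstamps[tstamps.length - 1];
        int slots = (int) Math.round((last - start) / (double) step) + 1;
        double[] grid = new double[slots];
        Arrays.fill(grid, Double.NaN);
        for (int i = 0; i < tstamps.length; i++) {
            if (i > 0 && tstamps[i] < tstamps[i - 1]) {
                throw new IllegalArgumentException("timestamps must be sorted ascending");
            }
            // Snap to the nearest slot; a later sample overwrites an earlier one in the same slot.
            int slot = (int) Math.round((tstamps[i] - start) / (double) step);
            grid[slot] = values[i];
        }
        return grid;
    }

    public static void main(String[] args) {
        long[] t = {1L, 2L, 3L, 6L, 7L};
        double[] v = {0.22, 0.23, 0.24, 0.5, 0.83};
        // Slots for timestamps 4 and 5 have no measurement and remain NaN.
        System.out.println(Arrays.toString(toGridWithNaN(t, v, 1L)));
    }
}
```

The resulting double[] is equidistant by construction, so it can be handed to whatever PAA or SAX routine accepts plain arrays.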

I guess you already know, but the TSUtils class implements methods for Z-normalization, PAA approximation, and SAX conversion; it can also handle data with NaN values (in a way that was acceptable for me). There are implementations for the Timeseries class (timestamped series) and for simple double[] arrays, assuming the series is equidistant.

Thank you!

--
Mahalo, Pavel.

vaske maskinsen

Aug 14, 2012, 8:22:46 AM

to jmotif-...@googlegroups.com, jzak...@ucr.edu

Hi all!

Interesting discussion about non-equidistant time series. One way to deal with this issue would be to simply compute interpolated values for all needed points in the "gap" between two available values. I will need that functionality too and will see if I can implement something useful...
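A minimal sketch of that interpolation idea, assuming plain linear interpolation between the two neighbouring measurements is good enough for the data at hand. LinearResampler and resample are hypothetical names, not jmotif classes, and the sketch requires strictly increasing timestamps.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of jmotif. */
final class LinearResampler {

    /**
     * Resamples an irregular series onto a regular grid that starts at the
     * first timestamp and advances by step, using linear interpolation
     * between the two measurements surrounding each grid point.
     * Assumes at least two points, strictly increasing timestamps, step > 0.
     */
    static double[] resample(long[] tstamps, double[] values, long step) {
        if (tstamps.length < 2 || tstamps.length != values.length || step <= 0) {
            throw new IllegalArgumentException("need >= 2 points, equal lengths, step > 0");
        }
        long start = tstamps[0];
        long end = tstamps[tstamps.length - 1];
        int n = (int) ((end - start) / step) + 1; // grid points that fit inside [start, end]
        double[] out = new double[n];
        int j = 0; // index of the measurement at or before the current grid point
        for (int i = 0; i < n; i++) {
            long t = start + i * step;
            // Advance until tstamps[j] <= t <= tstamps[j + 1].
            while (j < tstamps.length - 2 && tstamps[j + 1] < t) {
                j++;
            }
            double frac = (t - tstamps[j]) / (double) (tstamps[j + 1] - tstamps[j]);
            out[i] = values[j] + frac * (values[j + 1] - values[j]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] t = {0L, 2L, 6L};
        double[] v = {1.0, 3.0, 11.0};
        // Grid points 0, 2, 4, 6; the value at 4 is interpolated between 3.0 and 11.0.
        System.out.println(Arrays.toString(resample(t, v, 2L)));
    }
}
```

Whether interpolation is appropriate depends on the data: it invents values inside long gaps, which NaN filling does not.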

bye, vaske.

Jesin Zakaria

Aug 14, 2012, 1:08:56 PM

to Pavel Senin, jmotif-...@googlegroups.com

Hi All.

As a matter of fact, I am from Professor Keogh's lab. If you take a look at the datasets available on the UCR site, you will find that no time stamps are included. The reason is that all the data are uniformly sampled, so we can easily saxify those time series.

However, if you collect data from industry, it is often the case that the data are non-uniformly sampled. Currently I am doing an internship at Teradata, and I have come across such a dataset, for which I need to generate SAX code.

As you said, I can fill the gaps with NaNs, and the TSUtils class can handle data with NaN values. I am wondering how the TSUtils class actually handles the NaN values. Does it ignore the NaNs while computing the SAX code? For example, for values = {0.22, 0.23, 0.24, NaN, 0.5, 0.83, 100} and timestamp = {22, 23, 24, 26, 50, 83, NaN}, does it generate the SAX code "abc"? In that case I can try to uniformly sample the raw time series based on the nearest neighbours and then fill the gaps with NaNs.

Hope to hear from you soon.

Thanks,
Jesin

Pavel Senin

Aug 14, 2012, 4:21:52 PM

to jmotif-...@googlegroups.com, jzak...@ucr.edu

Nice to meet you. Small world indeed :). I was developing some stuff with Hackystat and timeseries for Teradata too. Funny.
And because there was no reference SAX implementation provided at UCR, I started this project. Thank you guys for the idea, papers, data, and all the stuff you provided to the community.

Yes, so about lost values and non-uniform timeseries. I have hit that problem many times before, and I think it really must be handled case by case. Maybe there is a universal way, I don't know; it doesn't make sense to me yet.

For example, with Hackystat data, when we look at the development activity data stream (measurements of how much time one typed on the keyboard while in Eclipse within a day window), if there was no measurement, does it mean that 1) the subject did not work at all, and there will eventually be a zero, or 2) something went wrong with the sensor or transmission system, and the data were simply lost, or 3) she is offline and I will get the real data sometime later? Now, should I wait for the data to eventually be transmitted, or should I substitute the missing value with some average from three years of history for that day of the week, or with 0, or with NaN? IMO, noise, missing values, and data misplaced in time will always be here; nevertheless, every data type must be treated in a way that doesn't break its consistency. So, the question is how best to treat your data. What is the nature of the data you have in hand (if you want to discuss this here)?

(In our case, with Hackystat, we have a DPD service which makes the measurements equidistant.)

The code will not handle the sample from your email. It is designed only to handle missing values within the timeseries itself. There are two places where it checks whether missing values are present. The first is within the PAA routine, when computing the mean value for a PAA interval: if some values are missing but at least one is real, the mean will be computed; if all are missing, NaN will make it into the PAA approximation. The second place is where the SAX transform occurs: if the PAA representation has a NaN, it will put the "_" symbol (an underscore) at that position.
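The two checks described above can be sketched as follows. This is not jmotif's actual code: NanAwareSax, paa and toSax are hypothetical names, the segment mean is assumed to be taken over the non-NaN values only, the input is assumed to be z-normalized already, the series length is assumed to be divisible by the PAA size, and the alphabet is fixed at 3 with the breakpoints -0.43 and 0.43 from the SAX breakpoint table.

```java
/** Hypothetical sketch for illustration only; not jmotif's implementation. */
final class NanAwareSax {

    // Breakpoints for alphabet size 3 (assumes a z-normalized input series).
    private static final double[] CUTS_3 = {-0.43, 0.43};

    /** PAA where each segment mean uses only the non-NaN values; all-NaN segments yield NaN. */
    static double[] paa(double[] series, int paaSize) {
        if (paaSize <= 0 || series.length == 0 || series.length % paaSize != 0) {
            throw new IllegalArgumentException("series length must be a positive multiple of paaSize");
        }
        int width = series.length / paaSize;
        double[] out = new double[paaSize];
        for (int s = 0; s < paaSize; s++) {
            double sum = 0.0;
            int count = 0;
            for (int i = s * width; i < (s + 1) * width; i++) {
                if (!Double.isNaN(series[i])) {
                    sum += series[i];
                    count++;
                }
            }
            out[s] = count > 0 ? sum / count : Double.NaN;
        }
        return out;
    }

    /** Maps each PAA value to 'a', 'b' or 'c'; a NaN segment becomes an underscore. */
    static String toSax(double[] paaValues) {
        StringBuilder sb = new StringBuilder();
        for (double v : paaValues) {
            if (Double.isNaN(v)) {
                sb.append('_');
                continue;
            }
            int idx = 0;
            while (idx < CUTS_3.length && v >= CUTS_3[idx]) {
                idx++;
            }
            sb.append((char) ('a' + idx));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        double[] series = {-1.2, nan, nan, nan, 0.1, nan, 1.0, 1.4};
        // Second segment is all NaN, so its symbol is an underscore.
        System.out.println(toSax(paa(series, 4)));
    }
}
```

With the sample series in main, the four segment means are -1.2, NaN, 0.1 and 1.2, which gives the string a_bc.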

I last used that part of the code years ago, so it might not work as intended right now; I am sorry if that is the case, as there were some changes lately. However, the overall performance of the analysis with underscores in place of NaNs was pretty acceptable.