Nice to meet you. Small world indeed :). I was developing some stuff with Hackystat and timeseries for Teradata too. Funny.
Because there was no reference SAX implementation provided at UCR, I started this project. Thank you guys for the idea, papers, data, and all the stuff you provided to the community.
Yes, so about lost values and non-uniform timeseries. I have hit that problem many times before and it really must be handled case-by-case, I think. Maybe there is a universal way, I don't know, but it doesn't make sense to me yet.
For example, with Hackystat data, when we are looking at the development activity data stream (measurements of how much time one typed on the keyboard while in Eclipse within a day window), if there was no measurement, does it mean that 1) the subject did not work at all, and there will be a zero eventually, or 2) something went wrong with the sensor or transmission system, and the data is just lost, or 3) she is offline and I will get the real data sometime later? Now, should I wait for the eventual data to be transmitted, or should I substitute the missing value with some average from three years of history for that day of the week, or with 0, or with NaN? IMO, noise, missing values, and data misplaced in time will always be here; nevertheless, every data type must be treated in a way which doesn't break its consistency. So, the question is how best to treat your data. What is the nature of the data you have in hand (if you want to discuss this here)?
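To make those substitution options concrete, here is a minimal Python sketch. It is only an illustration, not code from the project: the function name `fill_missing`, the policy names, and the `history` layout (weekday index mapped to past real measurements) are all my assumptions.

```python
import math
from statistics import mean

def fill_missing(values, weekdays, policy="nan", history=None):
    """Return a copy of `values` with each None replaced according to `policy`.

    values   -- daily measurements, None where nothing arrived (assumed layout)
    weekdays -- day-of-week index (0..6) for each value
    history  -- dict: weekday index -> list of past real measurements,
                needed only for the "weekday_mean" policy (hypothetical)
    """
    filled = []
    for value, day in zip(values, weekdays):
        if value is not None:
            filled.append(value)               # real measurement, keep it
        elif policy == "zero":
            filled.append(0.0)                 # "the subject did not work"
        elif policy == "weekday_mean":
            filled.append(mean(history[day]))  # historical average for that weekday
        elif policy == "nan":
            filled.append(math.nan)            # mark as unknown, decide later
        else:
            raise ValueError("unknown policy: " + policy)
    return filled
```

Which policy is right depends on which of the three cases above you believe you are in, so the choice is left to the caller.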
(In our Hackystat case we have a DPD service, which makes the measurements equidistant.)
The code will not handle the sample from your email. It is designed only for handling missing values within the timeseries itself. There are two places where it checks whether missing values are present. The first is within the PAA routine, when computing the mean value for a PAA interval: if some values are missing but at least one is real, the mean is computed; if all are missing, NaN makes it into the PAA approximation. The second place is where the SAX transform occurs: if the PAA representation has a NaN, it puts a "_" symbol (an underscore) in that position.
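The two checks can be sketched roughly as follows in Python. This is not the project's code, only a sketch under stated assumptions: the function names are hypothetical, the mean is taken over the real values only, the input is assumed to be already z-normalized, the number of segments is assumed not to exceed the series length, and the cut points are the usual equiprobable ones under a standard normal distribution.

```python
import math
from statistics import NormalDist

def paa(series, segments):
    """Piecewise aggregate approximation that tolerates NaN values."""
    n = len(series)
    approx = []
    for i in range(segments):
        chunk = series[i * n // segments:(i + 1) * n // segments]
        real = [v for v in chunk if not math.isnan(v)]
        # At least one real value: average the real ones (assumption).
        # Everything missing: NaN goes into the approximation.
        approx.append(sum(real) / len(real) if real else math.nan)
    return approx

def sax(paa_values, alphabet="abcd"):
    """Map PAA values to letters; NaN becomes an underscore."""
    k = len(alphabet)
    # Equiprobable cut points under N(0, 1)
    cuts = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    word = []
    for v in paa_values:
        if math.isnan(v):
            word.append("_")
        else:
            word.append(alphabet[sum(v > c for c in cuts)])
    return "".join(word)
```

For instance, `sax(paa([-1.0, math.nan, math.nan, math.nan, 1.0, 1.2], 3))` gives `"a_d"`: the first interval still has one real value, the second is entirely missing, and the third is complete.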
The last time I used that part of the code was years ago, so it might not work as intended right now; I am sorry if that is the case, there were some changes lately. However, the overall performance of the analysis with underscores in place of NaN was pretty acceptable.
Thank you.
--
Mahalo, Pavel.