I took the time series dataset that I used in Problem 6 of Homework 8 and just calculated the Euclidean distances over the raw 60 points per series. I wanted to see how much difference the feature reduction made compared to using the raw data.
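For anyone who wants to try the same comparison, here is a minimal sketch of the distance calculation. It is not the attached code; the function name `euclidean_distance` is just something I made up for illustration, and it assumes each series is a plain list of numbers of the same length (60 for the raw data).

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length series (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    # Sum the squared point-by-point differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

The same function works unchanged on the reduced feature vectors, since it only cares that the two inputs have the same length.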
A few things that I observed:
1. Processing obviously took longer with 60 vs. 8 features, roughly a 7 times increase in processing time.
2. The resulting total distance is greater than with the reduced feature set, which is to be expected when accumulating 60 per-point differences vs. 8. What is interesting is trying to decide what a threshold should be. You really need to run the data, produce a chart like the attached, draw a line on the page that fits the data, and then use that as the threshold. The authors note at the end of the paper that finding better ways to establish thresholds is desirable follow-on work. To me, comparing against the minimum distance makes more sense, unless you want to know how close other series are. If the min is just .001 less than several other series, you might want to factor that into the comparison.
3. The accuracy was still pretty good at 89%. I think the data is very structured, as the series are all calculated within defined parameters (they have the basic pseudocode for generating each series). So comparing distances in this data set will result in series finding similar series in the same set. You can see in the chart how the distances still "cluster" around each other; you just need to find the nearest neighbor.
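The minimum-distance idea from points 2 and 3 can be sketched roughly like this. Again, this is an illustration under my own assumptions rather than the attached code: `nearest_with_margin` is a hypothetical name, the reference set is assumed to be a list of `(label, series)` pairs with at least two entries, and the "margin" is simply the gap between the closest and second-closest series.

```python
import math


def nearest_with_margin(query, references):
    """Return (label, distance, margin) for the nearest reference series.

    references: list of (label, series) pairs, at least two entries.
    margin: second-smallest distance minus the smallest distance.
    """
    if len(references) < 2:
        raise ValueError("need at least two reference series")
    # math.dist (Python 3.8+) computes the Euclidean distance between two points.
    scored = sorted(
        ((math.dist(query, series), label) for label, series in references),
        key=lambda pair: pair[0],
    )
    best_distance, best_label = scored[0]
    second_distance, _ = scored[1]
    return best_label, best_distance, second_distance - best_distance
```

A very small margin (like the .001 case above) would be a signal that the nearest neighbor is not clearly better than the runners-up, which a fixed threshold alone would not tell you.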
The authors also suggest using "real" data, such as stock market prices or electrocardiograms (ECG). Stock data is easier to get, so I'll try profiling some stocks and get back to you...
Great class, it was a pleasure meeting all of you. Have a great holiday, and we'll see you next year. By the way, while this class was great for the mind, my body is really suffering after a couple of hours of skiing. The snow was a little chopped up and it didn't take long to burn my quads... Maybe we can incorporate Peter's dancing suggestion into the class.
Stephen...
Attached code (with some comments..)
and a PDF report.