iSAX and indexing

Josh Patterson

unread,

Sep 1, 2010, 11:02:25 AM9/1/10

to jmotif-discuss

As I understand it, iSAX involves both:

1. a way to decompose timeseries into characters yet perserving some
information with a subscript
2. a mechanic for indexing that is similar in nature to a "trie",
along the lines of a b-tree (roughly) to store SAX data and perform
wildcard lookup

Is that about accurate?

I'm interested in seeing how an index like this would work being
stored in something like HBase so you could interactively search the
index with queries.

Josh

seninp

unread,

Sep 1, 2010, 11:19:11 AM9/1/10

to jmotif-discuss

Okay, I will re-read the paper over the weekend trying to get into the
mechanics of iSAX, for now I just understand that it does allow some
degree of non-match in the comparison of two pieces representing time-
series. This comes from my previous work and understanding of Dynamic
Time Warping, which allows series to be a bit different in time but
requiring them to be similar in the features and their magnitude,
which is claimed to be the same with iSAX. So I think the trick is in
the distance function implementation which quantifies the distance
between two pieces. Let's see how it works and maybe Alice will give
us later another point of view. If what I think is true - then all we
need is to code this distance routine.

The indexing is the second task. I have implemented a trie for SAX in
the Java code already. Unfortunately I don't know how HBase works and
how to store Trie within it. What I was thinking can be of great value
is to see whether the Trie actually performs faster than
HashTable<String> or Databse (MySQL, Derby) indexing and querying, and
maybe HBase? What do you think?

Josh Patterson

unread,

Sep 1, 2010, 11:31:52 AM9/1/10

to jmotif-...@googlegroups.com

Pavel,

On Wed, Sep 1, 2010 at 11:19 AM, seninp <sen...@gmail.com> wrote:

Okay, I will re-read the paper over the weekend trying to get into the
mechanics of iSAX, for now I just understand that it does allow some
degree of non-match in the comparison of two pieces representing time-
series. This comes from my previous work and understanding of Dynamic
Time Warping, which allows series to be a bit different in time but
requiring them to be similar in the features and their magnitude,
which is claimed to be the same with iSAX. So I think the trick is in
the distance function implementation which quantifies the distance
between two pieces. Let's see how it works and maybe Alice will give
us later another point of view. If what I think is true - then all we
need is to code this distance routine.

I havent touched the DTW stuff (only skimmed over it), but from what I understood of the iSAX paper, they were also holding information (cardinality) about how the sequence was decomposed so as to be able to compare sequences which had been decomposed with different params (this might be where you are talking about another dist function). I'm going to re-read the paper over lunch as I need a refresher.

The indexing is the second task. I have implemented a trie for SAX in
the Java code already. Unfortunately I don't know how HBase works and
how to store Trie within it. What I was thinking can be of great value
is to see whether the Trie actually performs faster than
HashTable<String> or Databse (MySQL, Derby) indexing and querying, and
maybe HBase? What do you think?

Yeah, the Trie would be the base for the indexing. Don't worry about HBase, thats more of a "storage mechanic", it doesnt really change how you'd work with iSAX (I work in Hadoop for a living, so I tend to look for places we can scale things out on Hadoop/HBase/Hive).

I'm not sure about performance wrt those options, but what I'm interested in at the present moment is indexing and querying Terabytes and up of timeseries data from sensors (and other sources). How practical is it to make a Terabyte of timeseries in an iSAX index query-able in an interactive / online fashion? We can do this with hadoop fairly easily but its a batch processing setup.

I also wonder how much more accurate an iSAX representation is over a regular SAX representation.

Josh

Josh Patterson

unread,

Sep 6, 2010, 3:13:38 PM9/6/10

to jmotif-...@googlegroups.com

Studied iSAX some more over the weekend, as far as I can tell the
major difference in the SAX / iSAX representations is the additional
cardinality information per symbol in the SAX word (non-exponential
superscripts), which are used in two ways:

1. when comparing 2 different timeseries we promote one timeseries to
the higher cardinality of the two to do the comparison; in the
situation (as seen in indexing) of mixed cardinalities inside a single
"word", we do promotions per symbol. The MINDIST function uses some
different mechanics but seems to use the breakpoint lookup table as
implemented already in the base jmotif lib.

2. Indexing, as best as I can tell, uses a sort of extensible hashing
(spatial hashing? Is locality sensitive hashing relevant here?) to
spatially group similar timeseries into near clusters. As the
hash-tree grows, nodes sub-divide and the tree "fans out". Afaik
lookups here should be really fast, and I'm curious to see how much
data we could index when using something like a DHT or HBase.

HBase is really interesting here since its a "multidimensional sorted
map" (big distributed hash table, in some regards) and could store the
hash based nodes very easily and in a fashion that would scale up into
10s of Terabytes.

I also think a smaller in-memory version of the index hash tree would
make an interesting 1NN classifier, similar to how I used SAX to build
a 1NN classifier for Map Reduce with a Ball Tree.

I'm currently implementing a test version of the iSAX words that
incorporate the more sophisticated cardinality rules and the new
distance function. Once I get that done, I want to run some tests on
building a 1NN classifier with the hash tree to see how well it
performs.

Josh

On Wed, Sep 1, 2010 at 11:19 AM, seninp <sen...@gmail.com> wrote:

--
Twitter: @jpatanooga
Hadoop World 2010, October 12, New York City -- register at http://bit.ly/a0cqwe

Reply all

Reply to author

Forward