a problem on chapter 3 Dictionaries and tolerant retrieval

10 views
Skip to first unread message

Hongfei Yan

unread,
Apr 15, 2014, 7:54:53 AM4/15/14
to cs41...@googlegroups.com
有同学问,该章ppt的39页
Given query, first enumerate all character sequences within a preset (possibly weighted) edit distance?

答案在书本上:
The simplest such heuristic is to restrict the search to dictionary terms beginning
with the same letter as the query string; the hope would be that
spelling errors do not occur in the first character of the query. A more sophisticated
variant of this heuristic is to use a version of the permuterm index,
in which we omit the end-of-word symbol $. Consider the set of all rotations
of the query string q. For each rotation r from this set, we traverse the
B-tree into the permuterm index, thereby retrieving all dictionary terms that
have a rotation beginning with r. For instance, if q is mase and we consider
the rotation r = sema, we would retrieve dictionary terms such as semantic
and semaphore that do not have a small edit distance to q. Unfortunately, we
would miss more pertinent dictionary terms such as mare and mane. To address
this, we refine this rotation scheme: for each rotation, we omit a suffix
of ℓ characters before performing the B-tree traversal. This ensures that each
term in the set R of terms retrieved from the dictionary includes a “long”
substring in common with q. The value of ℓ could depend on the length of q.
Alternatively, we may set it to a fixed constant such as 2.

Reply all
Reply to author
Forward
0 new messages