a problem on chapter 3 Dictionaries and tolerant retrieval

10 views

Skip to first unread message

unread,

Apr 15, 2014, 7:54:53 AM4/15/14

to cs41...@googlegroups.com

有同学问，该章ppt的39页

Given query, first enumerate all character sequences within a preset (possibly weighted) edit distance？

答案在书本上：

The simplest such heuristic is to restrict the search to dictionary terms beginning

with the same letter as the query string; the hope would be that

spelling errors do not occur in the first character of the query. A more sophisticated

variant of this heuristic is to use a version of the permuterm index,

in which we omit the end-of-word symbol $. Consider the set of all rotations

of the query string q. For each rotation r from this set, we traverse the

B-tree into the permuterm index, thereby retrieving all dictionary terms that

have a rotation beginning with r. For instance, if q is mase and we consider

the rotation r = sema, we would retrieve dictionary terms such as semantic

and semaphore that do not have a small edit distance to q. Unfortunately, we

would miss more pertinent dictionary terms such as mare and mane. To address

this, we refine this rotation scheme: for each rotation, we omit a suffix

of ℓ characters before performing the B-tree traversal. This ensures that each

term in the set R of terms retrieved from the dictionary includes a “long”

substring in common with q. The value of ℓ could depend on the length of q.

Alternatively, we may set it to a fixed constant such as 2.

Reply all

Reply to author

Forward

0 new messages