Approximate String Searching

30 views
Skip to first unread message

Thomas Passin

unread,
Oct 3, 2022, 12:57:29 AM10/3/22
to leo-editor
Here is a fairly easy way to search in a body of text for a string when there might be single-character errors in a pattern, like searching for "butterfly" but the text has "budterfly".  The paper, which I found some years ago, dates back to I think the early 1990s.  It has a basic method with a series of potential refinements.  The basic method is well suited for programming in Python.  I'm not sure that the refinements would work well, because they involve tinkering with the hash table design, and keeping the hash tables small enough that they can stay in cache memory - not what interpreted language are especially good for.

Actually, the paper covers searching for multiple patterns at one time, i.e., in a single pass.  But you can easily do it for just one pattern if that's what you want.

I thought this might be of interest to some people on the list.  Here's a link to the paper:

jkn

unread,
Oct 3, 2022, 5:25:54 AM10/3/22
to leo-editor
Thanks for this - as it happens I was slightly musing on the current search capabilities with Leo just recently...

J^n

Edward K. Ream

unread,
Oct 3, 2022, 6:18:43 AM10/3/22
to leo-e...@googlegroups.com
On Sun, Oct 2, 2022 at 11:57 PM Thomas Passin <tbp1...@gmail.com> wrote:
Here is a fairly easy way to search in a body of text for a string when there might be single-character errors in a pattern, like searching for "butterfly" but the text has "budterfly". 

or, I hope, doSomething vs. do_something.

Thanks for the link.  All contributions gratefully accepted!

Edward

Thomas Passin

unread,
Oct 3, 2022, 8:27:14 AM10/3/22
to leo-editor
On Monday, October 3, 2022 at 6:18:43 AM UTC-4 Edward K. Ream wrote:
On Sun, Oct 2, 2022 at 11:57 PM Thomas Passin <tbp1...@gmail.com> wrote:
Here is a fairly easy way to search in a body of text for a string when there might be single-character errors in a pattern, like searching for "butterfly" but the text has "budterfly". 

or, I hope, doSomething vs. do_something.

Yes, it would catch that.  Now if both legitimately appear in a text, I'm not sure how that's handled.  There is a verification step, but I forget the details.

Jacob MacDonald

unread,
Oct 3, 2022, 5:05:09 PM10/3/22
to leo-e...@googlegroups.com
Thomas Passin wrote:
> I thought this might be of interest to some people on the list. Here's a link to the paper:
>
> APPROXIMATE MULTIPLE STRING SEARCH (Muth and Manber)

The read was very interesting, but it looks like the linked paper is
by Baeza-Yates and Navarro, and I see Manber mentioned but not Muth.
In any case, there's other interesting work around the same problem,
including another paper by the same authors using finite automata.
Reply all
Reply to author
Forward
0 new messages