Fast whole word matching in text

82 views
Skip to first unread message

Nino Walker

unread,
Feb 26, 2014, 6:46:38 PM2/26/14
to concurrent-t...@googlegroups.com
Hi Niall,

I'm am also a fan of this library, and have a query in a similar vein to the spellchecking problem and wildcards.

I'm using ConcurrentInvertedRadixTree to scan thousands of tweets/second for matching terms via getKeyValuePairsForKeysContainedIn. I need to match whole words/tags ("foot" does not match "footage"), and so I manage this with by padding keys " foot ", and that works fine..., except " fine " does not match, " fine...". A regular expression "foot\b" is really what is needed.

In absence of wildcards, I'm normalizing:
 >>> re.sub(r'\s\s+', ' ', re.sub(r'\s(\W)(\w)', r' \1 \2', re.sub(r'(\w)\b', r'\1 ', "jip is recommended to run within virtualenv, which is a best practice for python/jython developers to created a standalone, portable environment.")))
'jip is recommended to run within virtualenv , which is a best practice for python / jython developers to created a standalone , portable environment .'

An optimized form of this takes 80 microseconds avg on the string above, where the scan, getKeyValuePairsForKeysContainedIn with thousands of keys, only takes ~ 40 micros.

I'm happy with this as is, but I'm curious if you have any thoughts on this.

Thanks,

dNino

Niall

unread,
Feb 27, 2014, 6:35:26 PM2/27/14
to concurrent-t...@googlegroups.com
Hi Nino,

I don't follow exactly how your example works (without spending time to examine it more closely), but in general I'd say your approach is probably one of the fastest ways to scan for keywords anyway.

Usually data structures don't fit all queries optimally, so the best approach/only option is use the data structure to reduce your search to a candidate set, and then apply (perhaps brute force) filtering on that candidate set.

It has been on my TODO list for a long time to add a Permuterm tree to the library as well. I think I have a ticket open for that. I actually implemented most of it, but there are a few edge cases outstanding. Anyway permuterm trees might be able to accelerate some kinds of regex expressions where you have wildcards in the middle, better than the inverted radix tree.

Maybe for your problem it might be useful to get some feedback on the location (start and end offsets) within the tweets that the matches are found. That way you could check if the characters preceding or succeeding the match are whitespace more quickly.

The following is a bit hacky, but it allows you to find the offsets of keyword matches within your input document.
This scans example tweets for keywords using the inverted radix tree, and it also reports the start and end offsets within the tweets of the keywords found.

It prints:
Found 'tweet' between indexes 0 and 4
Found 'tweet' between indexes 18 and 22


public class KeywordIndexes {

   
public static void main(String[] args) {
       
String myDocument = "tweets, I like to tweet.";

       
InvertedRadixTree<VoidValue> tree = new ConcurrentInvertedRadixTree<VoidValue>(new SmartArrayBasedNodeFactory());
        tree
.put("tweet", VoidValue.SINGLETON);

       
InputWrapper inputWrapper = new InputWrapper(myDocument);
       
for (CharSequence keyword : tree.getKeysContainedIn(inputWrapper)) {
           
int keywordEndIndex = inputWrapper.getLastIndexAccessed();
           
int keywordStartIndex = keywordEndIndex + 1 - keyword.length();
           
System.out.println("Found '" + keyword + "' between indexes " + keywordStartIndex + " and " + keywordEndIndex);
       
}
   
}

   
public static class InputWrapper implements CharSequence {
       
final CharSequence delegate;
       
final int offset;
       
final int length;
       
final AtomicInteger lastIndexAccessed;

       
InputWrapper(CharSequence delegate) {
           
this.delegate = delegate;
           
this.offset = 0;
           
this.length = delegate.length();
           
this.lastIndexAccessed = new AtomicInteger(-1);
       
}

       
InputWrapper(CharSequence delegate, int offset, int length, AtomicInteger lastIndexAccessed) {
           
this.delegate = delegate;
           
this.offset = offset;
           
this.length = length;
           
this.lastIndexAccessed = lastIndexAccessed;
       
}

       
@Override
       
public int length() {
           
return this.length - this.offset;
       
}

       
@Override
       
public char charAt(int index) {
            lastIndexAccessed
.set(this.offset + index);
           
return delegate.charAt(this.offset + index);
       
}

       
@Override
       
public CharSequence subSequence(int offset, int length) {
           
return new InputWrapper(delegate, this.offset + offset, this.offset + length, this.lastIndexAccessed);
       
}

       
public int getLastIndexAccessed() {
           
if (lastIndexAccessed.get() < 0) {
               
throw new IllegalStateException();
           
}
           
return lastIndexAccessed.get();
       
}
   
}
}
Reply all
Reply to author
Forward
0 new messages