Hi Nino,
I don't follow exactly how your example works (without spending time to examine it more closely), but in general I'd say your approach is probably one of the fastest ways to scan for keywords anyway.
Usually data structures don't fit all queries optimally, so the best approach/only option is use the data structure to reduce your search to a candidate set, and then apply (perhaps brute force) filtering on that candidate set.
It has been on my TODO list for a long time to add a Permuterm tree to the library as well. I think I have a ticket open for that. I actually implemented most of it, but there are a few edge cases outstanding. Anyway permuterm trees might be able to accelerate some kinds of regex expressions where you have wildcards in the middle, better than the inverted radix tree.
Maybe for your problem it might be useful to get some feedback on the location (start and end offsets) within the tweets that the matches are found. That way you could check if the characters preceding or succeeding the match are whitespace more quickly.
The following is a bit hacky, but it allows you to find the offsets of keyword matches within your input document.
This scans example tweets for keywords using the inverted radix tree, and it also reports the start and end offsets within the tweets of the keywords found.
It prints:
Found 'tweet' between indexes 0 and 4
Found 'tweet' between indexes 18 and 22
public class KeywordIndexes {
public static void main(String[] args) {
String myDocument = "tweets, I like to tweet.";
InvertedRadixTree<VoidValue> tree = new ConcurrentInvertedRadixTree<VoidValue>(new SmartArrayBasedNodeFactory());
tree.put("tweet", VoidValue.SINGLETON);
InputWrapper inputWrapper = new InputWrapper(myDocument);
for (CharSequence keyword : tree.getKeysContainedIn(inputWrapper)) {
int keywordEndIndex = inputWrapper.getLastIndexAccessed();
int keywordStartIndex = keywordEndIndex + 1 - keyword.length();
System.out.println("Found '" + keyword + "' between indexes " + keywordStartIndex + " and " + keywordEndIndex);
}
}
public static class InputWrapper implements CharSequence {
final CharSequence delegate;
final int offset;
final int length;
final AtomicInteger lastIndexAccessed;
InputWrapper(CharSequence delegate) {
this.delegate = delegate;
this.offset = 0;
this.length = delegate.length();
this.lastIndexAccessed = new AtomicInteger(-1);
}
InputWrapper(CharSequence delegate, int offset, int length, AtomicInteger lastIndexAccessed) {
this.delegate = delegate;
this.offset = offset;
this.length = length;
this.lastIndexAccessed = lastIndexAccessed;
}
@Override
public int length() {
return this.length - this.offset;
}
@Override
public char charAt(int index) {
lastIndexAccessed.set(this.offset + index);
return delegate.charAt(this.offset + index);
}
@Override
public CharSequence subSequence(int offset, int length) {
return new InputWrapper(delegate, this.offset + offset, this.offset + length, this.lastIndexAccessed);
}
public int getLastIndexAccessed() {
if (lastIndexAccessed.get() < 0) {
throw new IllegalStateException();
}
return lastIndexAccessed.get();
}
}
}