Fast user-defined pattern matching

196 views
Skip to first unread message

Shevek

unread,
Nov 14, 2018, 8:08:21 PM11/14/18
to mechanical-sympathy
Dear wizards, please advise.

I need to offer a user configuration feature for pattern matching, to
exclude objects from my billion object sort-merge (which is now working
fairly well, thank you all).

What we're mostly trying to do is exclude any record which contains any
one of a number of substrings. The computer science textbooks give
various fast-string-searching algorithms with pre-computed tables, any
of which would suit our use case, but I don't see a practical
implementation of any of them floating around...

Current practical options:
* java.util.regex, precompiled patterns
- Reputedly slow at matching, but our patterns are simple.
- We are using a single regex containing a large alternation.
- perf says this regex matcher is 50% of our runtime.
- WHY isn't Matcher.usePattern() allocation-free? It totally could
be. This means that the int[] array allocation is a major drain on the GC.
- If we use ThreadLocal<Matcher> instead, the ThreadLocal can't be
static, and hits the previously discussed issue with blowing out the
ThreadLocalMap.
* rej2
- Not tried yet - has anyone tried this?
* brics
- Trying and failing on brics may make sense before falling back to
java regex.
* Groovy Closure
- Faster than regex/pattern assuming you have a better strategy for
matching. But now you're down to repeated contains() calls.

What are the suggestions?

Thank you.

S.

Ben Evans

unread,
Dec 21, 2018, 6:31:29 AM12/21/18
to mechanica...@googlegroups.com
[Resurrecting a semi-dead thread]
Hi Shevek,

What about using a bytecode-emitting solution (something like JOni, from JRuby)?

Some work along these lines was also done in Nashorn:

http://mail.openjdk.java.net/pipermail/nashorn-dev/2013-May/001063.html

http://cr.openjdk.java.net/~hannesw/8012269/

Thanks to Charlie Nutter for helping to jog my memory about what prior
art exists for this type of approach. If there's someone here who
worked on the relevant bit of Nashorn they might be able to speak to
this point much better than I can from my very basic experience here.

Cheers,

Ben
> --
> You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Michael Barker

unread,
Dec 21, 2018, 1:02:24 PM12/21/18
to mechanica...@googlegroups.com
Boyer-Moore seems to be one that is commonly used for fast sub-string matching (GNU grep uses it internally).  Works best when the same substring is reapplied multiple times.  I've used the simplified version in the past.  The Wikipedia page (https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm) has an implementation in Java.  It would need to be reworked slightly to be useful.  I.e. the charTable and offsetTable should be precalculated from the search substring.  I'd make it into a class, pass the search term (needle) as a constructor parameter and store the tables as fields.  The input (haystack) is the passed into a non-static indexOf function (without the needle).

Mike.

Ben Evans

unread,
Dec 21, 2018, 2:48:55 PM12/21/18
to mechanica...@googlegroups.com
It has been ~12 years since I looked at this, but when benchmarking
the Perl 5.8 regexp implementation (which is a modified Boyer-Moore,
if memory serves) it was significantly faster than Java 6's inbuilt
java.util.regex classes.

However, the Perl implementation has pathological special cases that
are rare, but by no means purely theoretical. At this distance, all I
can recall is that it was something to do with multiple | conditions
and backtracking, but the extreme slowdown could be reliably provoked
- and the conclusion we came to was that it was a side effect of the
algorithm's design.

I'm not sure that I have a point here, beyond the usual Caveat Emptor,
but there we are.

Ben
Reply all
Reply to author
Forward
0 new messages