Minimizing overhead for UTF-8 data

139 views
Skip to first unread message

Steve Newman

unread,
Mar 28, 2016, 7:46:40 PM3/28/16
to re2j-discuss
I'm looking at using re2j. My specific use case involves searching for a single pattern in a very large number of strings. Each string is typically 1 to 500 characters long, and is stored in UTF-8 form in a byte[]. There may be up to ~10,000,000 strings to search at once, totalling a few gigabytes, all held in RAM.

Is it possible to invoke re2j directly on UTF-8 data? I see some hints of this in the code, e.g. the UTF8Input class, but there doesn't seem to be any way to access this from the outside.

My goal is to minimize runtime overhead, and memory allocation in particular. As a temporary hack, I've created an implementation of CharSequence that works directly from a byte[], but so far I've only implemented the easy case where there are no multi-byte characters. If re2j supports UTF-8 input directly, that would seem to be the best route.

Steve

Schlussel, Rebecca T

unread,
Mar 29, 2016, 11:51:45 AM3/29/16
to Steve Newman, re2j-discuss
Hi Steve,

RE2J does not operate directly on utf8 bytes.  However, Teradata has a fork of re2j that we changed to operate on utf-8 slices.  We did this to remove the overhead of conversion as well as to allow other optimizations (for example, we implemented the more efficient DFA search algorithm).  You can find it here:  https://github.com/Teradata/re2jhttps://github.com/Teradata/re2j

Rebecca

From: re2j-d...@googlegroups.com [re2j-d...@googlegroups.com] on behalf of Steve Newman [st...@snewman.net]
Sent: Monday, March 28, 2016 7:46 PM
To: re2j-discuss
Subject: Minimizing overhead for UTF-8 data

--
You received this message because you are subscribed to the Google Groups "re2j-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to re2j-discuss...@googlegroups.com.
To post to this group, send email to re2j-d...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/re2j-discuss/9159d533-8166-4aad-81ea-b0c64ca1e263%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Steve Newman

unread,
Mar 30, 2016, 10:16:29 AM3/30/16
to Schlussel, Rebecca T, re2j-discuss
Thanks!

Is there any documentation on this fork? The API appears straightforward enough, but the readme appears to be unchanged and I wonder if there are any specific issues to be aware of when working with this version of the library.

Steve

Schlussel, Rebecca T

unread,
Mar 30, 2016, 11:11:15 AM3/30/16
to Steve Newman, re2j-discuss
Sorry, we didn't update the documentations.   There are some tuning options that I cant tell you about along with a bit of an overview of how it works.

  • algorithm
    • This version of re2j has two different algorithms for doing regular expression matching (Both scale linearly in time and are based on finite state machines).  There's the DFA algorithm, which is much faster, but can end up using lots of memory, and the NFA algorithm, which is slower, but whose memory usage is more contained.  By default, it will try the DFA algorithm, and fall back to the NFA algorithm when a state limit is reached. You can also set it to try the DFA and fail if it hits the states limit or only use the NFA.  These options are mostly just for testing.  I recommend sticking with the default.
  • maximumNumberofDFAStates
    • This is used as a proxy for memory tracking, since the number of states is what eats up the DFA memory.  When the maximumNumberOfDFAStates is reached for a given input, we exit the DFA algorithm and fall back to the NFA algorithm.  This should be tuned appropriately for the size and workload of your machine.  Defaults to MAX_INT.
  • numberOfDFARetries
    • After an NFA fallback has happened for a given input, we still continue to first try the DFA algorithm for future inputs and fallback as necessary. When the numberOfDFARetries is hit, we no longer try the DFA for future inputs, and instead automatically start with the NFA.   The reasoning here is that often inputs will be similar, so if you fallback several times, it's likely that it will continue to happen.  In that case it's very inefficient to keep trying the first algorithm only to fall back every time.  However, you don't want to let one outlier prevent you from using the faster algorithm on the rest of your inputs. The right value for this probably depends on the variety and number of your inputs.  The default value is 5. 
  • eventsListener
    • called on fallback if for example you want to add logging every time you fall back to the NFA.
Let me know if you have questions or if something wasn't clear.

Thanks,
Rebecca




From: sne...@gmail.com [sne...@gmail.com] on behalf of Steve Newman [st...@snewman.net]
Sent: Wednesday, March 30, 2016 10:16 AM
To: Schlussel, Rebecca T
Cc: re2j-discuss
Subject: Re: Minimizing overhead for UTF-8 data

Reply all
Reply to author
Forward
0 new messages