Performance test feedback

409 views
Skip to first unread message

Schlussel, Rebecca T

unread,
Nov 3, 2015, 5:33:36 PM11/3/15
to re2j-d...@googlegroups.com
Hi,

Here at Teradata, we've been trying out using re2j to replace the regular expression library in Presto (a distributed sql query engine).  Currently Presto uses a regular expression library called Joni, which performance-wise is on par with java.util.regex. Some users had been running into worst case situations with the current backtracking algorithm, so we were looking for a regular expression library that would be more reliably performant.

We did some benchmarking where we ran queries that used Presto's regular expression functions against about 1.5 GB of data in hdfs, and found that with some changes (more on that below), re2j was usually about 3-5x slower than Joni.  However, it didn't have the exponential slow down of a backtracking algorithm, so there were cases that took more than 30 hours with Joni that took a few minutes using re2j.  You can see https://github.com/facebook/presto/issues/3859  for more info on those results.  

We made a couple of changes to re2j that improved performance for us considerably.  First was in the machine cache, where there was a tremendous amount of thread contention (we had ~120 threads running at a time).  Fixing this issue resulted in several times speedup for re2j.  Another that got about a 10% speed up was replacing an array list with an array. We're still waiting for our corporate approval to go through to contribute to the project, but once we have it we'll submit pull requests with these changes. 

We are also looking at porting over the other regular expression matching engines from the C++ version of RE2. The C++ version has four linear time regular expression engines, each of which is useful for its own set of cases.  The NFA engine, which is implemented in RE2J is the slowest, but most versatile of those.  If we have other engines in place, we should see considerable speed up for RE2J for most regular expressions.  

Alan- if you're serious about wanting to give up the project, our group at Teradata would be happy to take it over.  

Rebecca

Alan Donovan

unread,
Nov 3, 2015, 5:52:44 PM11/3/15
to Schlussel, Rebecca T, re2j-d...@googlegroups.com
Thanks Rebecca, this is music to my ears.  Sounds like your application has exactly the kind of asymptotic performance needs for which RE2/J can really shine relative to other implementations.

We have noticed some problems recently of poor RE2/J performance relative to the same algorithm in Go (see https://github.com/google/re2j/issues/12).  I started looking into them but put them aside for other priorities.  It sounds like you know what you're doing with Java profiling, so you might find some low-hanging fruit for optimization there.  Perhaps your pending changes already fix this problem (although I doubt it since the bug report isn't about concurrency).

Regarding taking it over, I suggest you talk to fel...@google.com since he was also offering to maintain it.  The more the merrier.

cheers
alan

Reply all
Reply to author
Forward
0 new messages