5x speedup in inflections

64 views
Skip to first unread message

Xavier Noria

unread,
Feb 11, 2012, 7:10:27 AM2/11/12
to rubyonrails-core
Hi gents,

I am playing around with an idea to improve the performance of singularize and pluralize for Rails 4.0. In my proof of concept I see some 5x boost, but it relies an assumption that I'd like to consult with you all. Let me explain.

As you know, inflection rules have a lhs which is a string or regexp, and a replacement string as rhs.

The current implementation collects the rules in an array, and to apply them to a particular word the array is iterated. The first pattern that matches is the one applied. In particular, the most common rule (eg append "s" to form a plural), is the *last* one because most specific rules come first. By default we have +30 rules for singulars and +30 for plurals.

My idea is to build a single regexp with an alternation, detect which segment matches, and apply its replacement. That is, let the regexp engine itself do the loop. Much faster.

I have this working with a quick hack that has shown there's a potential speedup here. To be able to know which regexp is the one that matched I use named captures. For example (?<_25>(ax)is) would be the alternation corresponding to the regexp /(ax)is/, if that's the 26th pattern. It won't win me an elegance award, but it is a hack that works (I could workaround name clashes easily if the user happens to use _25 himself, that's not important).

Named captures are the only way I've seen to be able to build the alternation and at the same time know which part matches. Because existing inflection regexps have captures.

This is the proof-of-concept: https://gist.github.com/1798985.

I believe this is correct *as long as* the user regexps have no backreferences, because if you have a = /(..)\1/ and b = /(.)\1/ then "xx" matches b, but does not match the union a|b because the backreference \1 in b now refers to the group in a.

OK, so this is the question: do you guys use backreferences in custom inflections? If you didn't we could consider ruling them out for 4.0 to be able to implement this.


Aaron Patterson

unread,
Feb 11, 2012, 8:10:46 PM2/11/12
to rubyonra...@googlegroups.com

It's possible to count the number of captures in a given regexp:

def count_captures regexp
Regexp.union([regexp, //]).match('').captures.length
end

p count_captures /(ax)is/ # => 1
p count_captures /((ax)is)/ # => 2
p count_captures /(?:(ax)is)/ # => 1

With that information, you can calculate offsets. Calculating offsets
would still result in a linear scan (since you'd have to keep a list of
the offsets, then find the one that matched), and since we can use named
captures on 1.9, I'm not sure an offset based solution would be great.

> This is the proof-of-concept: https://gist.github.com/1798985.

Seems good, but you don't need a *args to Regexp.union. :-)

> I believe this is correct *as long as* the user regexps have no
> backreferences, because if you have a = /(..)\1/ and b = /(.)\1/ then "xx"
> matches b, but does not match the union a|b because the backreference \1 in
> b now refers to the group in a.
>
> OK, so this is the question: do you guys use backreferences in custom
> inflections? If you didn't we could consider ruling them out for 4.0 to be
> able to implement this.

If you can rule out backreferences, it seems possible you could build a
state machine that would perform in linear time to the string length.
It seems though, with your algorithm, as the string gets longer, the
original implementation is faster:

https://gist.github.com/1805359

Do we know what kind of data we're dealing with? Somewhere between
strings of length 10 and strings of length 100, the original code
outperforms this version. If we're only dealing with shorter strings,
maybe this newer code would be better.

One last comment, the line:

name = md.names.detect {|n| md[n]}

The length of `names` increases as the number of rules is added to the
system. Which means in the case of a match, this algorithm will still
be O(n), where n is the number of rules defined. We can see this by
testing strings that match first and strings that match last:

https://gist.github.com/1805536

I'm using AS 3.2.1, so "zombies" will be best case, and a bunch of x's
will be worst.

If we're willing to limit the type of expressions provided to the
inflector, it's possible to construct something that matches the string
regardless of the number of rules (entries in "inflections.rb"), and
proportional to the string length.

MEOW MEOW MEOW MEOW

<3 <3 <3 <3 <3 <3

--
Aaron Patterson
http://tenderlovemaking.com/

Xavier Noria

unread,
Feb 11, 2012, 9:14:34 PM2/11/12
to rubyonra...@googlegroups.com
On Sun, Feb 12, 2012 at 2:10 AM, Aaron Patterson <tende...@ruby-lang.org> wrote:

Hi there!

It's possible to count the number of captures in a given regexp:

   def count_captures regexp
     Regexp.union([regexp, //]).match('').captures.length
   end

Yeah, Active Support had that some time ago. I think it was used in routing, but it got removed sometime in the middle of writing the AS guide.


With that information, you can calculate offsets.

Which offsets do you have in mind here?

Counting groups seems a good idea, we could get rid of the ugly _25 named captures I think.

Since individual inflection rules have well-formed regexps (or strings), we can count their groups. Therefore, if we add one extra group enclosing the whole regexp we can keep track of which group index corresponds to which rule. That'd be much cleaner!

 
> This is the proof-of-concept: https://gist.github.com/1798985.

Seems good, but you don't need a *args to Regexp.union. :-)

Oh, yes :).


If you can rule out backreferences, it seems possible you could build a
state machine that would perform in linear time to the string length.

The situation now is:

* Inflections can have arbitrary regexps (modulus backreferences if we disallow them). A priori, they can use any regexp feature, and are typically case-insensitive.

* Inflections have to be matched in order.

Given that and the fast regexp machine in 1.9, alternation seems to be a better and cost-free approach because what we are talking here is implemented quickly (cost == development cost).

Wouldn't such a state machine be a rewrite of the very regexp engine?
 

It seems though, with your algorithm, as the string gets longer, the
original implementation is faster:

 https://gist.github.com/1805359

Do we know what kind of data we're dealing with?  Somewhere between
strings of length 10 and strings of length 100, the original code
outperforms this version.  If we're only dealing with shorter strings,
maybe this newer code would be better.

Yes, I think this approach is worthwhile because in practice I believe Rails and Rails applications singularize and pluralize English words.


 name = md.names.detect {|n| md[n]}

The length of `names` increases as the number of rules is added to the
system.  Which means in the case of a match, this algorithm will still
be O(n)

Yes, yes, but seems unavoidable an O(n) that still outperforms the original algorithm with the sizes I've tested (I don't mean it does not with lager sets of inflections, only I measured the typical size which is what most people use).

 
where n is the number of rules defined.  We can see this by
testing strings that match first and strings that match last:

 https://gist.github.com/1805536

I'm using AS 3.2.1, so "zombies" will be best case, and a bunch of x's
will be worst.

Yes.

But we are not really interested in arbitrary strings. If we give a 5x boost to the vast majority of users that apply inflection to English words of ordinary length, and some minority using long strings get some penalty, that's fine I think.

Xavier Noria

unread,
Feb 11, 2012, 9:28:52 PM2/11/12
to rubyonra...@googlegroups.com
Ah, of course, another assumption here is that regexps are *in practice* simple.

All the time I have in mind the actual sets of regexps and words to apply them we find in typical Rails applications.

In particular in practice I don't expect any backtracking explosion due to quantification + alternation. Most regexps are simple and anchored.

Xavier Noria

unread,
Feb 11, 2012, 9:35:47 PM2/11/12
to rubyonra...@googlegroups.com
Nah, forget that last mail about backtracking. If there's excessive backtracking in some regexp it will be present in both approaches, the way we build the alternation does not add to it.

Aaron Patterson

unread,
Feb 11, 2012, 9:43:51 PM2/11/12
to rubyonra...@googlegroups.com

Ya, but if we're going to put arbitrary restrictions on the type of
matches people can do (i.e. no backreferences), you may as well use an
engine that can execute in O(n) time (n = string length). Otherwise,
what's the point?

> It seems though, with your algorithm, as the string gets longer, the
> > original implementation is faster:
> >
> > https://gist.github.com/1805359
> >
> > Do we know what kind of data we're dealing with? Somewhere between
> > strings of length 10 and strings of length 100, the original code
> > outperforms this version. If we're only dealing with shorter strings,
> > maybe this newer code would be better.
> >
>
> Yes, I think this approach is worthwhile because in practice I believe
> Rails and Rails applications singularize and pluralize English words.
>
>
> name = md.names.detect {|n| md[n]}
> >
> > The length of `names` increases as the number of rules is added to the
> > system. Which means in the case of a match, this algorithm will still
> > be O(n)
>
>
> Yes, yes, but seems unavoidable an O(n) that still outperforms the original
> algorithm with the sizes I've tested (I don't mean it does not with lager
> sets of inflections, only I measured the typical size which is what most
> people use).

How did you determine the length of the strings that most people use? I
think seeing this data would be useful in speeding up this code.

Xavier Noria

unread,
Feb 11, 2012, 10:12:43 PM2/11/12
to rubyonra...@googlegroups.com
On Sun, Feb 12, 2012 at 3:43 AM, Aaron Patterson <tende...@ruby-lang.org> wrote:
Ya, but if we're going to put arbitrary restrictions on the type of
matches people can do (i.e. no backreferences), you may as well use an
engine that can execute in O(n) time (n = string length).  Otherwise,
what's the point?

I don't claim this approach is better in complexity theoretically. It isn't. The point is that for what I think are typical scenarios the regexp engine loops and matches so much faster than in practice this is a potential important speedup.

Also I am aiming at max backwards compat. Disallowing backreferences is not going to be a big deal I think because I doubt they are used much in inflection rules (though I asked to the ML in case there's some use case that says we have to support them).

Of course if we restrict the regexp features to a very small set we can write a scanner ourselves. But you'd still be doing basically what the regexp engine does, and without the optimizations oniguruma may have implemented after analyzing the regexp.

Because you need to go in order and per each fragment either match the whole pattern or fail fast and try the next fragment right? That's already what happens with alternation no?

Can you beat that and still support regexps without backreferences?


How did you determine the length of the strings that most people use?  I
think seeing this data would be useful in speeding up this code.

I am guessing of course, based on my experience. Also, you do not pluralize sentences or text files no? You pluralize words, that's the expected input statistically speaking and the one that has to guide us in implementing this I believe.

Nowadays long strings get a performance boost. That does not make sense statistically speaking, English words should be the fast ones.

Xavier Noria

unread,
Feb 11, 2012, 10:31:49 PM2/11/12
to rubyonra...@googlegroups.com
On Sun, Feb 12, 2012 at 4:12 AM, Xavier Noria <f...@hashref.com> wrote:
Nowadays long strings get a performance boost. That does not make sense statistically speaking, English words should be the fast ones.

Indeed, running the benchmark against /usr/share/dict/words gives an overall speedup of more than 7x:


The bigger the sample the greater the speedup because inflections are the exception. The majority of words are inflected using the last rule, so the difference in the technique to loop is bigger.

This should happen also in real life, the exceptions are rare, in general most words will be applying the last rule.

Xavier Noria

unread,
Feb 11, 2012, 10:55:54 PM2/11/12
to rubyonra...@googlegroups.com
Ugh, I started the RuleSet class as a replacement of what the inflector does today and prepending rules was actually reversing them in the script :D.

Appending does slow it down a bit, but we are still around 6x.

Aaron Patterson

unread,
Feb 12, 2012, 12:45:51 AM2/12/12
to rubyonra...@googlegroups.com

Interesting. Have you investigated expanding the regular expressions
and doing hash based replacement via gsub!? Since we can know the
replacements in advance, it's possible to compile a hash and use it for
the replacement. If the hash misses, we can fall back to a linear scan.

Here's a quick implementation as an example. We could probably optimize
more of the expressions:

https://gist.github.com/1806575

Xavier Noria

unread,
Feb 12, 2012, 5:15:47 AM2/12/12
to rubyonra...@googlegroups.com
On Sun, Feb 12, 2012 at 6:45 AM, Aaron Patterson <tende...@ruby-lang.org> wrote:
Interesting.  Have you investigated expanding the regular expressions
and doing hash based replacement via gsub!?  Since we can know the
replacements in advance, it's possible to compile a hash and use it for
the replacement.  If the hash misses, we can fall back to a linear scan.

Here's a quick implementation as an example.  We could probably optimize
more of the expressions:

 https://gist.github.com/1806575

Interesting!

I think that approach does not guarantee the order though. Let's imagine two inflection rules r1, r2. Let's take a word that matches both, and let's suppose only r2 is precompilable. If I understand that alternative correctly the algorithm would yield the replacement for r2, rather than r1. True?

Xavier Noria

unread,
Feb 12, 2012, 8:37:56 AM2/12/12
to rubyonra...@googlegroups.com
I have implemented the variant that uses regular captures rather than named captures, based on the group counter hack.

We have a similar improvement of 6x, and the code reads better I believe:


Also, I've added a sanity check to the script that ensures the current inflector in AS and the one in the script give the same plurals for the entire /usr/share/dict/words.

Yeah, I think we should do this.

Xavier Noria

unread,
Feb 12, 2012, 5:25:08 PM2/12/12
to rubyonra...@googlegroups.com
OK, the alternation is a red herring.

I have a patch with the new approach implemented and have compared
runtimes: no difference, there's no speedup.

The 6x speedup in the proof of concept happens because the script is
not doing this apparently innocent check over the default 9
uncountable words:

inflections.uncountables.any? { |inflection| result =~
/\b#{inflection}\Z/i }

which is present in ActiveSupport::Inflections#apply_inflections.

In that method, if you return the inflection right way instead of
doing this check, you see the 6x.

Yehuda Katz

unread,
Feb 12, 2012, 6:39:39 PM2/12/12
to rubyonra...@googlegroups.com
Back in the day I implemented a patch for the inflector that gave (iirc) something like 10x improvement. It was based on the Merb inflector, and was mostly accomplished by more aggressive caching. My argument at the time was that inflections tend to be used over and over again in a given app (mostly by the router).

It required some back-compat breakage, and since the inflector isn't a large cost even in tiny API-like requests, we dropped it.

Yehuda Katz
(ph) 718.877.1325



--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To post to this group, send email to rubyonra...@googlegroups.com.
To unsubscribe from this group, send email to rubyonrails-co...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en.


Xavier Noria

unread,
Feb 12, 2012, 7:01:45 PM2/12/12
to rubyonra...@googlegroups.com
So, knowing the key place to improve in the current implementation,
I've rewritten that test this way:

https://github.com/rails/rails/commit/d3071db1200e90c0533f75b967c4afb519656d00

which exploits the fact that uncountables are not regexps, but words.
It is not entirely backwards compatible because the words are not
required today to be lowercase (though in practice they normally are).

Jeffrey L. Taylor

unread,
Feb 13, 2012, 10:51:32 PM2/13/12
to rubyonrails-core, pad...@googlegroups.com
Quoting Xavier Noria <f...@hashref.com>:
> Hi gents,
>
> I am playing around with an idea to improve the performance of singularize
> and pluralize for Rails 4.0. In my proof of concept I see some 5x boost,
> but it relies an assumption that I'd like to consult with you all. Let me
> explain.
>
> As you know, inflection rules have a lhs which is a string or regexp, and a
> replacement string as rhs.
>
> The current implementation collects the rules in an array, and to apply
> them to a particular word the array is iterated. The first pattern that
> matches is the one applied. In particular, the most common rule (eg append
> "s" to form a plural), is the *last* one because most specific rules come
> first. By default we have +30 rules for singulars and +30 for plurals.
>
> My idea is to build a single regexp with an alternation, detect which
> segment matches, and apply its replacement. That is, let the regexp engine
> itself do the loop. Much faster.
>

This is interesting. I tried a similar optimization parsing RSS feeds,
replacing the list of strings in a when clause with a regex. It was 35%
faster. Then I combined all the regex with a group with a unique character in
it to determine which sub-pattern matched. No faster than the list of string
approach. Interesting. There is something I don't understand about regex
performance.


Jeffrey

Reply all
Reply to author
Forward
0 new messages